Understanding Data Loading into GERMINATE: Molecular Marker Insertion Workflow

Loading Data into GERMINATE

How data are loaded into the GERMINATE tables.

Loading Molecular Marker Data

The arrows in the figure show the flow of information as it is inserted into the database. Black arrows indicate that data are being held temporarily, green arrows indicate insertion into the database, and blue arrows indicate that data already inserted are being used to insert information into another table. In the last case, IDs assigned by the database are used to trace back to the original data. The colours in the tables follow the dataset and metadatasets through the process of being inserted into the database: peach denotes the Accession metadataset, green denotes the Marker metadataset, and purple denotes the allele data.

The DATA table represents a sample of how molecular marker data are typically submitted: a set of markers analyzed in a set of accessions.

Box A represents the Accession data and metadata inserted into GERMINATE. On entry, each Accession is assigned an accession_id that is unique in the database, and this ID is used to reference the appropriate accession in the Accession metadataset. The order or number of accession_ids has no influence on the order of accessions in the metadataset; the ReferenceData table uses a data index to track the correct order of the accession_ids.

Box B indicates where the marker information is inserted into the database, again retaining the order of the original dataset through the data index value.

Box C demonstrates how the allelic state of each accession-by-marker combination is translated into an integer ID (enum_index). This ID is stored in the appropriate order in the IntegerData table. The enum_index can then be used to translate back to the actual allele value, or to an allele index if a query requires only the relative allele states between accessions. The AlleleIndex table was created to speed up queries where the technology is unimportant and the relative allele values suffice to answer the question.

Box D displays the metadata recorded in the database that is required to recreate the dataset. This includes the number of dimensions of the dataset and the relationship between the metadatasets and the dataset.

[Figure labels: Accession Order; Data Order; Unique Allelic States. Datasets shown: dataset 1; Accession metadataset, dataset 2 [dataset 1, dimension 0]; Marker metadataset, dataset 3 [dataset 1, dimension 1]. Tables shown: Accessions, ReferenceData, StringData, EnumUnits-ArraysText, IntegerData, AlleleIndex, Metadatasets, Datasets. Column labels: accession_id, instcode_id, accenumb, reference_id, dataset_id, index_id, table_id, unit id, enum index, string data, text[ ], integer_data (enum_index), allele index array, metadataset_id, dimension, size, dimension count, experiment_id, data_type_id, method_id, dataset_discription. Annotations: table_id 5 -> Accessions table; reference_id = accession_id.]
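The flow through Boxes A and C can be sketched in a few lines of Python. This is only an illustration using in-memory lists and dicts, not GERMINATE's actual loader: the sample accessions, markers, allele strings, the starting ID of 101, the row-major storage order, and the helper name `encode` are all assumptions made for the example.

```python
from itertools import count

# Hypothetical submitted DATA table: rows are accessions, columns are markers.
accessions = ["ACC-B", "ACC-A", "ACC-C"]
markers = ["M1", "M2"]
alleles = [
    ["A", "T"],  # ACC-B
    ["G", "T"],  # ACC-A
    ["A", "C"],  # ACC-C
]

# Box A: each accession receives a unique accession_id. The ids are handed out
# in sorted-name order here to stress that they do not encode the data order.
next_accession_id = count(101)
accession_ids = {name: next(next_accession_id) for name in sorted(accessions)}

# ReferenceData: index_id keeps the submitted order; reference_id = accession_id.
reference_data = [
    {"index_id": i, "reference_id": accession_ids[name]}
    for i, name in enumerate(accessions)
]

# Box C: every unique allelic state is mapped to an integer enum_index.
enum_units = {}  # allele string -> enum_index


def encode(state):
    """Return the enum_index for an allelic state, creating one if needed."""
    if state not in enum_units:
        enum_units[state] = len(enum_units)
    return enum_units[state]


# IntegerData: enum_index values stored in row-major order (an assumption).
integer_data = [encode(state) for row in alleles for state in row]

# Translate the integers back to the original allele strings.
decode = {idx: state for state, idx in enum_units.items()}
restored = [
    [decode[integer_data[r * len(markers) + c]] for c in range(len(markers))]
    for r in range(len(accessions))
]
```

In this sketch the accession order is recovered from `reference_data` alone, so the values of the assigned ids never need to follow the submitted order.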

Genetic Map Data

• 3 sets of data
  • Population data
    • Stored in the Pedigree table, with references to individuals in the reference table, which links the population to the dataset.
  • Data used to create the linkage map
    • Stored similarly to genetic data
  • Genetic linkage map data

Genetic Map Data

The positions of the loci in cM (indicated by the method, not shown here) are the primary dataset. The Linkage Groups and Loci are added as metadatasets for this dataset. Any additional information users may wish to store can be added as additional dimensions of the dataset. The primary dataset is then linked, through the linking table, to the populations and genetic data used to create the maps. The grey boxes are database-assigned IDs.

[Figure labels: Original Data with columns Loci, Linkage Groups, Positions. Tables shown: String Data, Real Data, String Data, Datasets, Metadatasets. Column labels: string_data, string_id, real_data, dataset_id, index_id, metadataset_id, dimension, size, dimension count, experiment_id, data_type_id, method_id, dataset_discription.]
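As a rough illustration of a primary dataset with two metadatasets, the sketch below stores hypothetical linkage-map rows in plain Python lists and rebuilds them through a shared index_id. The sample loci, the dataset_id constants, the use of dimension 0 for both metadatasets, and the helper names `lookup` and `rebuild` are assumptions for the example, not GERMINATE's actual schema or code.

```python
# Hypothetical linkage-map rows as submitted: (locus, linkage group, position in cM).
original = [
    ("locA", "LG1", 0.0),
    ("locB", "LG1", 12.5),
    ("locC", "LG2", 3.2),
]

POSITIONS, LOCI, GROUPS = 1, 2, 3  # made-up dataset_id values

real_data = []    # primary dataset: positions
string_data = []  # metadatasets: loci and linkage groups

for index_id, (locus, group, position) in enumerate(original):
    real_data.append(
        {"dataset_id": POSITIONS, "index_id": index_id, "real_data": position}
    )
    string_data.append(
        {"dataset_id": LOCI, "index_id": index_id, "string_data": locus}
    )
    string_data.append(
        {"dataset_id": GROUPS, "index_id": index_id, "string_data": group}
    )

# Metadatasets rows tie each metadataset to the primary dataset
# (both are assumed to label the single dimension of the positions).
metadatasets = [
    {"metadataset_id": LOCI, "dataset_id": POSITIONS, "dimension": 0, "size": len(original)},
    {"metadataset_id": GROUPS, "dataset_id": POSITIONS, "dimension": 0, "size": len(original)},
]


def lookup(rows, dataset_id, value_key):
    """Map index_id -> value for all rows belonging to one dataset."""
    return {r["index_id"]: r[value_key] for r in rows if r["dataset_id"] == dataset_id}


def rebuild():
    """Recreate the submitted rows by joining the three datasets on index_id."""
    loci = lookup(string_data, LOCI, "string_data")
    groups = lookup(string_data, GROUPS, "string_data")
    positions = lookup(real_data, POSITIONS, "real_data")
    return [(loci[i], groups[i], positions[i]) for i in sorted(positions)]
```

Calling `rebuild()` returns the rows in their submitted order, because every value carries the index_id it was loaded with.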

Trait Data

These trait data all use the same method, but three different experiments were done. Each experiment then has two datasets: the data value and the accession. The colours follow the loading of each experiment into the database. The IDs (method_id, dataset_id, etc.) are assigned by the database.

[Figure labels: Original Data; Units. Tables shown: Methods, Experiments, Datasets, Metadatasets.]
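The idea of one method shared by three experiments, each contributing two datasets with database-assigned IDs, can be sketched with Python's built-in sqlite3 module. The table layout below is a simplified assumption (only the ID columns and a description, spelled dataset_discription as in the figure), the method and experiment names are invented, and linking both datasets of an experiment to the same method is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE methods     (method_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE experiments (experiment_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE datasets    (dataset_id INTEGER PRIMARY KEY,
                              experiment_id INTEGER,
                              method_id INTEGER,
                              dataset_discription TEXT);
    """
)

# One method, whose method_id is assigned by the database on insert.
method_id = conn.execute(
    "INSERT INTO methods (name) VALUES (?)", ("plant height",)
).lastrowid

# Three experiments, each with a data-value dataset and an accession dataset.
for exp_name in ("trial 1", "trial 2", "trial 3"):
    experiment_id = conn.execute(
        "INSERT INTO experiments (name) VALUES (?)", (exp_name,)
    ).lastrowid
    for role in ("data value", "accession"):
        conn.execute(
            "INSERT INTO datasets (experiment_id, method_id, dataset_discription) "
            "VALUES (?, ?, ?)",
            (experiment_id, method_id, f"{exp_name}: {role}"),
        )

rows = conn.execute(
    "SELECT dataset_id, experiment_id, method_id FROM datasets ORDER BY dataset_id"
).fetchall()
```

Here `rows` holds six datasets, two per experiment, all pointing at the single method; none of the IDs were supplied by the loader.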

Trait Data

The data values are translated to integers using the EnumUnits table, and the integers are loaded into the database. This is done because, for large datasets, searching an integer table will be faster than searching a string table. The reference_id here corresponds to the ID of the accession in the original data entry.

[Tables shown: IntegerData, ReferenceData, EnumUnits, AlleleIndex.]
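A minimal sketch of the query side of this encoding follows, using in-memory Python structures rather than real tables. The trait scores, accession IDs, and the function name `accessions_with` are hypothetical. The search string is translated to its integer once, and the scan then compares integers only.

```python
# Hypothetical trait scores as submitted: (accession id, value).
submitted = [
    (11, "resistant"),
    (12, "susceptible"),
    (13, "resistant"),
    (14, "intermediate"),
]

enum_units = {}      # value string -> enum_index
integer_data = []    # enum_index stored per index_id (list position)
reference_data = []  # reference_id (accession id) stored per index_id

for accession_id, value in submitted:
    # Reuse the existing enum_index, or assign the next free integer.
    enum_index = enum_units.setdefault(value, len(enum_units))
    integer_data.append(enum_index)
    reference_data.append(accession_id)


def accessions_with(value):
    """Return the accession ids whose stored trait value equals `value`."""
    target = enum_units.get(value)  # one string lookup
    if target is None:
        return []
    # The scan itself compares integers, not strings.
    return [reference_data[i] for i, v in enumerate(integer_data) if v == target]
```

For example, `accessions_with("resistant")` returns the IDs 11 and 13, recovered through the reference_id stored at the matching positions.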
