Loading Tabular Data
Introduction
This tutorial was prepared using Genome Workbench version 3.7.0
All the illustrations in this tutorial are fully applicable for the MS Windows 10 users. Unix/Linux users or users of the other MS Windows versions might experience minor variations in default settings, column order, window size, and other insignificant differences.
This tutorial will take you through the steps necessary to import tables into Genome Workbench. Table can be any file that has a fixed number of fields in each row and has well-defined separators, such as tabs, whitespaces or commas, between those fields. This can be very useful for user-created files, such as data exported from Microsoft Excel (tm). When a file is in a format already supported by Genome Workbench, such as a BED file, the file can be opened directly by selecting File on the open dialog with the File Format set to "Bed Files" or "Autodetect format".
- Select and Preview the Table
- Select Field Delimiters
- Identify Field Types
- Transform and Save Results
- View Transformed Tables
The following file will be used as the sample dataset for this tutorial:
Step 1: Select and Preview the Table
To start, Select Open command from the File drop-down menu. Select File Import on the left side of the pop-up dialog.
This opens the file-import wizard. On the first page under File Format: select "Table files". You may also select "Autodetect format"but that may not work consistently for all table files.
Under Filenames click the folder icon and use the file browser to select sample_table.txt.
Click the Next> button (you might see warning message asking to confirm that you want to upload your data as a Table file, click Yes). You will see the contents of the file under Table Preview section on the next page of the Open dialog (you might need to increase the size of the dialog to see the content of the Table Preview section).
Notice the option Table Data Format on this screen. It is telling you that the system recognizes this is a delimited table. This option lets you decide how to import the table, as a Delimited table or a Fixed Width table. Most of the tables we deal with should be imported as delimited.
The next option, Import data starting with row lets you skip the first rows of the table because, for example, they are comment rows. Just set the number of the first row you want to import here. The following field, Comment Character can be set to a single character that precedes all of the header rows. It may be filled in automatically after you select the file, but you can change it if you wish.
The last option Parse Column Names from Selected Row lets you click on a row in the table from which the importer will then extract column header names. Column names will be parsed from that row with the same rules used for parsing other rows. Header names can also be entered manually.
Step 2: Select Field Delimiters
After selecting Next > on the preview screen you will be able to select the delimiters for separating the table rows into fields. When the file was read in the table loader made a "best guess" on the delimiters and on what will be shown on this page. To change them, click on the combination of delimiters and watch the results under Table Data to see which combination separates the fields correctly.
In addition to choosing the delimiters you can choose whether to Treat adjacent delimiters as a single delimiter. If your delimiter is a tab, for example, and you have two adjacent tabs with no text in-between, that can create a blank field, or it can be treated as a single break between fields. You can also choose whether there is a quote-character for defining strings. If you choose this option and your delimiter is a comma (for example) than any commas that appear between quotes will not be treated as field separators.
If on the first screen you indicated that the Table Data Format was a Fixed Width format, then you will be taken to a different screen for dividing rows into fields. On this screen you split existing columns in two by clicking on Insert Column and then clicking on the column to be split. To remove a column, click on Remove Column and then click on the column to be removed. After you add and remove columns you can move the column dividers around until the fields are properly divided.
Step 3: Identify Field Types
After leaving the delimiter or fixed width pages that divided the rows into fields, the table importer takes its best guess as to the underlying data and semantic types for each column. On this page you can review and update the column names, data type and properties.
To update the name and properties for a column, click on the column name in the table at the bottom of the page. When you then pick a Data Type such as Sequence ID or Integer the Properties list will change to show options for that type. If you select an Integer type you need to indicate whether the value is One-Based or not. The standard convention in biological data is for data to be one-based, e.g. the first nucleotide in a sequence is position one. If a field is 0-based, the more common programming convention, then the first element of the set would be position 0. Lastly under types, if you do not want to import a column, select Skipped.
If one or more ID columns in the table represent a chromosome number or letter, then select an assembly for the column so that an accession can be found for the ID in the next import step. To choose an assembly for a column, click on the column header, then enter a search term to find the correct genome and then click Find Assembly.
Cick on Find Assembly brings up the Select Assembly search dialog.
Search and select the correct assembly here for the current column. Note that if you have chromosome identifiers in more than one column, you can assign a different assembly to each of these ID columns.
After the column names, types and, if needed, assemblies have been specified for the columns - hit Next > to go to the transformations page.
Step 4: Transform and Save Results
The transformation page allows you to save your table and, if the proper fields are available, to specify locations or features in order to link your table entries to other data. The three transformation and save options are:
- Convert to Feature Table
- Create Locations
- Create SNP Variation Features
- Keep Table Unchanged
The last transformation option, Keep Table Unchanged saves your table directly as an ASN file with the column names and types you specified on the previous Identify Field Types page. This is the only option that you will be able to use if your table does not include the ID, start position and stop position columns needed to create features or locations.
The Convert to Feature Table transformation option adds and updates fields as needed so that the rows can be viewed as region features in Genome Workbench. If none of the columns were identified as a region field, a region field is added automatically. An ID, start, and stop column are also required. If there is no strand column, one is added based on whether stop is greater than or equal to start (positive) or start is greater than stop (negative). Lastly, if more than one ID column has been identified, you will not be able to create region features since these features can only refer to one region per row. To fix this, go back to the previous page and set the type of the column(s) you do not want to transform to be Text or Integer rather than Sequence ID.
Summary of rules for transforming rows into region features:
- Strand not provided: Add strand field, setting value to be positive if start>=stop, and negative otherwise. Swap start and stop values for negative strands.
- Strand field provided as string: Add numeric strand field using standard NCBI mapping
- Start is one-based: subtract 1 from start
- Stop is one-based: subtract 1 from stop
- Region field not provided: Add region field with generated region names (region + row#)
- Stop field not provided but a length field is: Add a stop field where stop == start + length
Error conditions while converting region features: If any rows have errors, an error column will be appended to the table. Each row will have its own error column value.
- Start or stop are negative (this is an error whether nor not strand is provided)
- Start or stop fields are 0 when they are specified as being 1-based by the user
- Length is provided instead of stop and value is less than 0
- A strand is provided and stop<start
- A strand is provided but its value is not recognized, causing strand to be saved as "unknown".
- Seq-id not recognized
- Required field missing - Seq-id, start, and stop (or length) are required
The Create Locations option does not change any of the existing columns but it does add a new location column to the table for each set of ID, start and stop fields in the table. When you open the resulting ASN file in Genome Workbench you will be able to click on the locations and jump to the corresponding positions in the Graphical View.
Summary of rules for adding locations (multiple locations per row are allowed if data supports it)
- Strand not provided: Specify + strand in seq-loc if stop>=start, and – otherwise
- Strand field provided as string: Add numeric strand field to seq-loc using ENa_strand enum values
- Strand not provided: Set value of strand in seq-loc based on whether start > stop
- Strand not provided and stop < start: Swap start and stop fields in seq-loc, but do not update values in table
- Start is one-based: Subtract 1 from start and from start position in the generated seq-loc
- Stop is one-based: Subtract 1 from stop and from stop position in the generated seq-loc
- Stop field not provided but length is: Set stop value in seq-loc to start + length
Possible errors when adding locations
- The start or stop field is negative
- Start or stop fields are 0 when they are specified as being 1-based by the user
- Length field provided instead of start and its value < 0
- A strand is provided and stop<start
- Strand is provided but its value is not recognized, so an error is logged and value in seq-loc is set based on whether stop>start
- Seq-id not provided or provided but not recognized
- Missing or non-numeric number field for start, stop, or, if stop not provided, length
Summary of rules for transforming rows into SNP features
- RSID and Genotype columns are mandatory
This window also gives you the option to save your transformation parameters for the table by selecting Save Import Parameters. This option is provided for debugging and for future use.
Once you have chosen a Transformation Type click Next > to select project.
Click Finish to save the resulting ASN file which you will then be able to open as a table in Genome Workbench.
Step 5: View Transformed Tables
Once you have saved the transformed table you will see it in your project tree and you will be able to open it as a Generic Table View or as a Text View.
If the table was saved without changes then the fields should appear just as they did in the original file, minus any fields that you chose not to import.
If you added locations to your table, you will see the location field(s) in the Generic Table View on the right hand side of the table. Double-clicking on the locations will cause the corresponding sequence to open at the specified location. If you have not previously selected a default view you will be presented with the Open View intermediate dialog.
If you specified a feature table in your transformation step, then you can also open the corresponding sequence in the same folder of the project tree and see the features on the sequence. Note that to see region features in the Graphical View, you also need to select content Regions as shown in the image (or find Region track in the Sequence category in the Gear icon/Configure Tracks dialog).
Current Version is 3.8.2 (released December 12, 2022)
General
Help
Tutorials
- Basic Operation
- Using Active Objects Inspector
- Configure tracks and track display settings
- Working with Non-Public Data
- Viewing Multiple Alignments and Trees
- Broadcasting
- Genes and Variation
- Generating and Viewing Sequence Overlap Alignment
- Working with BAM Files
- Loading Tabular Data
- Working with VCF Files
- Sequence View Markers
- Opening Projects in Genome Workbench
- Publication quality graphics (PDF/SVG image export)
- Editing in Genome Workbench
- Create Protein Alignments using ProSplign
- GFF-CIGAR export for alignments
- Exporting Tree Nodes to CSV
- Generic Table View
- Running BLAST search against custom BLAST databases
- Using Phylogenetic Tree
- Coloring methods in Multiple Alignment View
- Displaying translation discrepancies
- Searching in Genome Workbench
- Graphical View Navigation and Manipulation
- Using the Text View to Review and Edit a Submission
- BAM haplotype filtering
- Displaying new non-NCBI molecules with annotations
- Creating phylogenetic tree from precalculated multiple alignment
- Creating phylogenetic tree starting from search
- Video Tutorials
General use Manuals
- Tree Viewer Formatting
- Tree Viewer Broadcasting
- Genome Workbench Macro
- Query Syntax in Genome Workbench and Tree Viewer
- Multiple Sequence Aligners
- Running Genome Workbench over X Window System
NCBI GenBank Submissions Manuals
- Table of Contents
- Introduction
- Genome Submission Wizard
- Save Submission File
- Reports
- Import
- Sequences
- Add Features
- Add Publication
- Comments
- Editing Tools