Working with VCF Files

Step 1: Introduction

This tutorial was created using Genome Workbench v.3.6.0. Genome Workbench can now visualize genome-size VCF collections. VCF files can be opened from local or remote locations and viewed in several views, including the Graphical Sequence View, the VCF Table View, and the Active Object Inspector View.

For this tutorial we selected VCF data for the Drosophila melanogaster assembly from EBI (as of January 5, 2021). The example file can be downloaded from our FTP site (about 16 MB zipped, about 60 MB unzipped) or used directly from the remote location:

https://ftp.ncbi.nlm.nih.gov/toolbox/gbench/samples/vcf/GCA_000001215.4_current_ids.vcf.gz
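If you prefer to fetch the example file with a script rather than a browser, the following is a minimal sketch using only the Python standard library. The function name `download` and the local file name in the usage comment are illustrative assumptions, not part of Genome Workbench.

```python
import shutil
import urllib.request

# URL of the example file given in this tutorial.
SAMPLE_URL = (
    "https://ftp.ncbi.nlm.nih.gov/toolbox/gbench/samples/vcf/"
    "GCA_000001215.4_current_ids.vcf.gz"
)


def download(url, dest):
    """Stream the resource at `url` into the local file `dest` and return `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (hypothetical local file name):
# download(SAMPLE_URL, "GCA_000001215.4_current_ids.vcf.gz")
```

The downloaded .vcf.gz file can then be selected with the Folder icon in the Open dialog described in Step 2.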

Some publicly available VCF data can also be found at the following locations (we cannot guarantee the stability of this data):

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502

http://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_1/by_species/

http://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_1/by_assembly/

Step 2: Uploading a VCF file

From the File menu choose Open and select File Import on the left side of the dialog. For the File Format, select VCF (Variant Call Format) files from the dropdown list. Use the Folder icon to browse to the local VCF file, or paste the URL of the remote file:

https://ftp.ncbi.nlm.nih.gov/toolbox/gbench/samples/vcf/GCA_000001215.4_current_ids.vcf.gz

Click Next.

open dialog VCF

You can designate the assembly on the next page of the dialog. The assembly identifier specified in the header of the VCF file (if any) is shown in the VCF File Assembly Identifier box. Click the Find Assembly button (it should be active) to search for and select the matching assembly.

open dialog mapping

Click OK in the Select Assembly popup dialog and confirm that the assembly information is now populated in the Open VCF dialog.

open dialog assembly populated

Click Next and observe that the next page of the Open dialog shows the INFO field tags used in the VCF file. The INFO field tags are described in the header of the file. In each record they are provided in column 8 as key=value pairs (the key is the tag ID) separated by semicolons; flag tags appear as a key with no value. More information about the VCF file format can be found in the Variant Call Format article on Wikipedia. The data in the INFO fields is parsed into columns in the VCF Table View (see step 6 of this tutorial).

open dialog fields info
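To make the layout of column 8 concrete, here is a minimal sketch of how an INFO string can be split into tags and values. It only illustrates the file format; it is not how Genome Workbench parses the file, and the sample string (including the tag values) is made up for the example.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict.

    Flags (keys without '=') map to True; comma-separated values
    become lists; single values stay as strings.
    """
    fields = {}
    if info in ("", "."):  # "." marks an empty INFO column
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            fields[key] = True  # flag tag, present without a value
        else:
            values = value.split(",")
            fields[key] = values if len(values) > 1 else value
    return fields


# Made-up INFO string: one single-valued tag, one multi-valued tag, one flag.
print(parse_info("VC=SNV;SID=study1,study2;LOE"))
```

Each key in the resulting dictionary corresponds to one candidate column in the VCF Table View.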

Notes:

  • If you deselect all INFO fields, the VCF Table View will show only the default columns: variant position, ID, Type, and Alleles.
  • If an INFO field is not selected by default (no checkmark), it has multiple values (SID in our example, see the image above). You can select such fields manually, but doing so might increase the upload time.
  • If the VCF file has Sample columns, you will see an additional page of the Open dialog with a list of samples. Samples selected on this page can be viewed in the VCF Table View.
  • Some INFO fields might not have any values. Their columns will not show up in the table view even if you selected them during uploading (ALMM and ASMM in our example).
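The notes above can be checked against a file before loading it. The sketch below lists each INFO tag declared in the header together with its declared `Number` and how many records actually carry it. A tag with a count of zero would be expected to produce no column, and a `Number` other than 0 or 1 suggests a multi-valued tag. Treating `Number` as the criterion for multi-valued tags is an assumption here; the rule Genome Workbench applies internally is not documented in this tutorial. The sample lines are invented.

```python
import re
from collections import Counter

# Matches header lines such as ##INFO=<ID=VC,Number=1,Type=String,...>
INFO_HEADER = re.compile(r"^##INFO=<ID=([^,>]+),Number=([^,>]+)")


def summarize_info_tags(lines):
    """Return {tag: (declared Number, count of records using the tag)}."""
    declared = {}
    used = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = INFO_HEADER.match(line)
            if match:
                declared[match.group(1)] = match.group(2)
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) >= 8 and cols[7] != ".":
                for item in cols[7].split(";"):
                    if item:
                        used[item.split("=", 1)[0]] += 1
    return {tag: (number, used[tag]) for tag, number in declared.items()}


# Invented header and records, for illustration only.
sample = [
    '##INFO=<ID=VC,Number=1,Type=String,Description="Variant class">',
    '##INFO=<ID=SID,Number=.,Type=String,Description="Study IDs">',
    '##INFO=<ID=ALMM,Number=0,Type=Flag,Description="Alleles mismatch">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "seq1\t100\trs1\tA\tG\t.\t.\tVC=SNV;SID=s1,s2",
    "seq1\t200\trs2\tC\tT\t.\t.\tVC=SNV",
]
for tag, (number, count) in summarize_info_tags(sample).items():
    print(tag, number, count)
```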

Click Next. Since our example VCF file does not have information about samples, the next page of the Open dialog asks you to select the sequences for which to load variation data, offering two options:

  • Load all sequences referenced in the file
  • Load selected sequences only

The first option is selected by default. It always works, but uploading might take a while if, for example, your VCF file contains data for many scaffolds.

open dialog chromosomes

Let us select the second option, Load selected sequences only. Observe that all available parts of the current assembly become active. This page always shows all assembly parts (chromosomes, mitochondrion, plastid genomes) that are currently known for the assembly you selected to load VCF data for.

open dialog all checked

If a VCF file has data for parts that are not listed (some scaffolds, for example), you can use the search option to bring a scaffold into the list (you need to know the scaffold ID). To see your search result included in the list of selected sequences, clear the search box (click the “x” icon inside the box) and then click Next. If you click Next without clearing the search query, the search result is still added to the list, but the updated full list of selected molecules is not shown.

Let’s leave all parts in the list selected, then click Next and Finish to add the data to a new project. You can see progress in the Task View and Event View windows.

Loading progress

If the VCF file does not have data for all chromosomes/parts you selected, at the end of the uploading process you will see an error message informing you that some chromosomes are absent from your VCF file. Our example file does not have data for NC_024512.1 (chrY).

Chromosome not in file

Click Close to close the window.

At this point all uploaded data can be seen in the project tree with information about the total number of variants for every molecule.

Data in project
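The per-molecule totals shown in the project tree can be cross-checked outside Genome Workbench by counting records per sequence in the VCF file. The sketch below is an independent illustration, not the tool's own logic. It assumes that the first column of each record holds the sequence identifier; the identifiers written in a given VCF file may differ from the accessions shown in the dialog, and the helper names are hypothetical.

```python
import gzip
from collections import Counter


def count_variants(lines):
    """Count data records per sequence (first VCF column)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        counts[line.split("\t", 1)[0]] += 1
    return counts


def count_variants_in_file(path):
    """Same as count_variants, reading a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variants(handle)


def missing_sequences(counts, requested):
    """Return requested sequence IDs that have no records in the file."""
    return sorted(set(requested) - set(counts))
```

A sequence returned by `missing_sequences` corresponds to the situation described above, where a selected chromosome has no data in the file.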

In the Event View you can also see how much time it took to upload all VCF data. Note: It is possible to open Graphical Sequence View and manipulate data that has already been uploaded into the project while some other data is still in the process of uploading.

Step 3: Viewing VCF data: Graphical Sequence View

Open any sequence available in the project in the Graphical Sequence View (GSV) and find the VCF track (by default it is added below all other tracks configured in the view). Here we opened accession NT_033779.5 (chromosome 2L):

GSV heatmap

Step 4: Viewing VCF data: Tooltips and Active Object Inspector

Zoom in to see individual variation data. Hovering over a variation opens a tooltip with some extra information. Similar information can be seen in the Active Object Inspector View, including position, alternate allele, and allele length. You can select several SNPs of interest in the GSV and see the description for all of them in the Active Object Inspector View.

GSV AOI zoomed

Step 5: Viewing VCF data: Search for variation (in GSV and Search View)

If you are interested in a particular variant, you can use the search functionality to find and zoom to it quickly. Let’s zoom out and perform a search for rs202372428. Clicking the binoculars icon brings the view to the region with the selected SNP at the sequence zoom level.

GSV search result

A VCF search can also be performed from the common Search View. Let’s open chromosome NC_004354.4 in the Graphical Sequence View, then open the Search View (if it is not already open). From the Search Tool dropdown list select VCF search, and for the Search Context select NC_004354.4 (this dropdown shows all molecules opened in the GSV; currently you should see two molecules in the list, NT_033779.5 and NC_004354.4). Paste rs881222392 in the Search Expression box and click the Start button. The search result will appear in the main part of the Search View.

Search view result
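For comparison, the same kind of lookup can be done directly on the file by scanning the ID column (column 3). This is a minimal sketch of a linear search, assuming a tab-delimited VCF whose ID column may hold several semicolon-separated identifiers. It is not the search implementation used by Genome Workbench, and the records in the example are invented.

```python
def find_variant(lines, variant_id):
    """Return (sequence ID, position) of the first record whose ID column
    contains `variant_id`, or None if no record matches."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and variant_id in cols[2].split(";"):
            return cols[0], int(cols[1])
    return None


# Invented records for illustration; real IDs and positions come from the file.
records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "seqA\t150\trsEXAMPLE1\tA\tG",
    "seqB\t320\trsEXAMPLE2;rsEXAMPLE3\tC\tT",
]
print(find_variant(records, "rsEXAMPLE3"))
```

The returned sequence ID and position are the same pieces of information the Search View uses to point the GSV at a variant.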

Now let’s demonstrate communication between the Search View and the Graphical Sequence View (GSV). Place the corresponding GSV panel next to the Search View panel and click the search result row in the Search View. Observe that the GSV display zooms to the variant selected in the Search View, rs881222392.

Search view broadcast GSV

Step 6: Viewing VCF data in Table View (for range)

Adjust the GSV window for NC_004354.4 so that it matches the image below, and search for the gene nocte. Select the region that covers this gene.

Region selected for table

Open the context menu (right-click) and select Open New View. In the Open View dialog select VCF Table View.

open VCF table

Click Next. The VCF Table View opens with variation data for the selected region. The table has eight sortable columns. The first four columns are the default ones: position, ID, Type, and Alleles.

The other four columns correspond to the INFO fields that were checked in the INFO fields dialog during uploading (see step 2 of this tutorial). They might give you clues about the validity of variants: LOE (lack of evidence flag, present if no submitted variant includes genotype or frequency information), SS_Validated (number of submitted variants clustered in an RS that were validated by any method as indicated by the dbSNP validation status), VC (variant class according to the Sequence Ontology), and RS_Validated (flag present when the RS was validated by any method as indicated by the dbSNP validation status). Two INFO fields that were selected during uploading (ALMM, ASMM) do not show up because they have no data in the VCF INFO column.

VCF region table view

If you do not see all the columns you selected during uploading, right-click the header of the table to open the context menu and choose additional columns from the list.

VCF table context menu

Now, place the Graphical Sequence View window for NC_004354.4 next to the VCF Table View. Select any variation in the table view and observe that the GSV display automatically zooms to show that variant.

VCF table GSV broadcasting

Note: the maximum number of rows visible in the VCF Table View is 100000. If you try to open a region (or a file from the project view) that includes more variant rows, only the first 100000 rows will be opened, with a warning message on the last row of the table: “Warning: Only the first 100000 rows can be shown”.
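The region selection and the row limit can be mimicked on the raw file as follows. This is a minimal sketch under stated assumptions: the function name, the returned tuple layout, and the `truncated` flag are illustrative, and only the limit of 100000 rows is taken from this tutorial. It collects position, ID, reference allele, and alternate alleles for records on one sequence within an inclusive coordinate range and stops once the cap is reached.

```python
def rows_in_region(lines, seq_id, start, end, max_rows=100000):
    """Collect (pos, id, ref, alt) for records on `seq_id` with
    start <= pos <= end, keeping at most `max_rows` rows.

    Returns (rows, truncated); truncated is True when more matching
    records existed than were kept.
    """
    rows = []
    truncated = False
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or cols[0] != seq_id:
            continue
        pos = int(cols[1])
        if start <= pos <= end:
            if len(rows) == max_rows:
                truncated = True  # one more match than we are allowed to keep
                break
            rows.append((pos, cols[2], cols[3], cols[4]))
    return rows, truncated
```

When `truncated` is True, the situation matches the warning row described in the note above.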

Step 7: Export table as CSV file

The VCF table for the region can be exported as a CSV file. Point the mouse at the data in the table and right-click. In the context menu select Export to CSV.

VCF table CSV context menu

In the Export To CSV dialog select the columns to export and provide a location and name for the CSV file, then click OK. Here we removed the checkmarks for the extra columns in order to export only the first four (default) columns.

CSV export dialog

The exported file, opened in Excel, should have all four columns with headers.

CSV in excel
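The exported file can also be read programmatically. The sketch below uses Python's standard `csv` module and takes the column names from the first row of the file, so it makes no assumption about the exact header text Genome Workbench writes. The function name and the file name in the usage comment are hypothetical.

```python
import csv


def load_exported_table(path):
    """Read a CSV file with a header row.

    Returns (column names, list of row dicts keyed by column name).
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return reader.fieldnames, rows


# Example call (hypothetical file name chosen in the Export To CSV dialog):
# columns, rows = load_exported_table("nocte_variants.csv")
# print(columns, len(rows))
```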

Step 8: Save Project

To save a VCF project for future use, choose Save Project As from the File menu, select the project by adding a checkmark, click the Save Selected button, provide a project name, and click Save.

Current Version is 3.7.1 (released October 13, 2021)

Last updated: 2021-03-08T22:04:12Z