Working with Non-Public Data

Step 1: Introduction

This tutorial demonstrates two different ways to manage private data in Genome Workbench.

  • You have created your own sequence and want to work with it in Genome Workbench
  • You want to view your own data/annotation on a publicly available sequence

We will demonstrate some of the Genome Workbench tools on data not found in the NCBI databases.

It is recommended that you complete the Basic Operation tutorial first.

Sample data needed to complete this tutorial: BX530088_BX572102.

Step 2: Getting Started

For the first exercise we are going to do the following:

  • Load a user-generated AGP file (download sample)
  • SPLIGN some mRNAs on that AGP sequence
  • Create a FASTA file from the AGP
  • BLAST that FASTA sequence to see what is related to it
  • WindowMask that FASTA sequence (or part of it) to look for repetitive regions

Genome Workbench starts up and displays the main screen. Choose File=>Open from the main menu, select File Import on the left side of the dialog, and click the folder icon on the right to point to the file location. Genome Workbench understands many different file formats, including FASTA files with local IDs. For this step choose BX530088_BX572102.comp.agp from the downloaded data files. Click Next to see the dialog for FASTA file uploading. (Note: if your component sequences are local, you will need to upload a FASTA file. In our example the sequences have been submitted to GenBank and have accession numbers, so the FASTA will be retrieved automatically.) Click Next again to accept the defaults, then click Finish to add the data file to a new project.
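Before importing, it can help to check what an AGP file describes. The sketch below is a minimal reader, assuming the standard tab-delimited AGP layout (nine columns, with N or U in the component-type column marking a gap line). The parse_agp helper and the sample lines are made up for illustration and are not the contents of BX530088_BX572102.comp.agp.

```python
from typing import List, NamedTuple, Tuple


class AgpRow(NamedTuple):
    obj: str                  # name of the object being assembled
    obj_beg: int              # 1-based start on the object
    obj_end: int              # 1-based end on the object
    part: int                 # part number within the object
    comp_type: str            # e.g. F, W, N, U
    is_gap: bool              # True for N/U lines
    detail: Tuple[str, ...]   # columns 6-9 (component or gap fields)


def parse_agp(text: str) -> List[AgpRow]:
    """Parse AGP text into rows, skipping blank and comment lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        is_gap = fields[4] in ("N", "U")
        rows.append(AgpRow(fields[0], int(fields[1]), int(fields[2]),
                           int(fields[3]), fields[4], is_gap,
                           tuple(fields[5:9])))
    return rows


# Invented sample: two components separated by a 100 bp gap.
SAMPLE = (
    "##agp-version\t2.0\n"
    "demo_object\t1\t1000\t1\tF\tCOMP_A.1\t1\t1000\t+\n"
    "demo_object\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends\n"
    "demo_object\t1101\t1600\t3\tF\tCOMP_B.1\t1\t500\t-\n"
)

rows = parse_agp(SAMPLE)
for row in rows:
    kind = "gap" if row.is_gap else "component " + row.detail[0]
    print(row.obj, row.obj_beg, row.obj_end, kind)
```

For a real file, the same function could be called on the text read from disk. Component lines carry the accession in the first detail column, which is the identifier used to retrieve sequences that have already been submitted to GenBank.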

Now that your data is loaded, you can view it by selecting the data in the project tree, right clicking and choosing Open New View. Then choose Graphical View. While this is not very interesting you can zoom in to see the sequence.

AGP file opened in GSV

Step 3: Apply the tool (SPLIGN) to private data

Now let us align an mRNA to our sequence. We will use the SPLIGN tool. SPLIGN (or SPLiced Aligner) is a global alignment tool used in NCBI's annotation pipeline. Open the accession NM_020137.3 from the GenBank database (File=>Open) and add it to the project.

Open accession from GenBank dialog

Click Next and Finish. Both entries are now shown in the data folder.

Select both entries (SHIFT+left click in both MS Windows and Mac OS). With both entries selected, click Tools=>Run Tool to open the Tools dialog, choose SPLIGN, and click Next (if you click exactly on the SPLIGN text, you will be taken to the next screen without having to click Next).

Select BX530088... for the Genomic Sequence and NM_020137.3 for the Transcript Sequence. If you do not see both sections of the dialog you need to drag down the lower border of the dialog box.

Run SPLIGN tool dialog

Click Finish. Once the run has finished, the results will be added to the existing project. The SPLIGN alignment will be displayed in the Graphical View as an Alignment track.

SPLIGN result in GSV

Step 4: Export a FASTA file

Select the data file in the Project Tree View we loaded previously. Right click (control click in the Mac OS) on the selected data and choose Export. Select FASTA as the format, select a location, and give the file a name.

Export FASTA dialog

Click Finish.

Now open the FASTA file you have just created. Choose File=>Open. Select the file and click Next. Accept the default settings and click Next again. Choose to create a new project and click Finish.

Select the FASTA data in the Project Tree View and double click it. From the Open View menu choose Graphical View.

Exported FASTA opened in GSV
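As a quick sanity check on an exported file, the record IDs and sequence lengths can be listed outside Genome Workbench. The following is a minimal sketch using only the Python standard library; the fasta_lengths helper and the in-memory demo record are hypothetical and do not reflect the actual exported sequence.

```python
import io
from typing import Dict, TextIO


def fasta_lengths(handle: TextIO) -> Dict[str, int]:
    """Return {record ID: sequence length} for a FASTA stream."""
    lengths: Dict[str, int] = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the defline.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Invented record standing in for the exported file.
demo = io.StringIO(">demo_seq made-up example\nACGTACGTAC\nGGTTAA\n")
lengths = fasta_lengths(demo)
print(lengths)
```

To run it on a real export, the io.StringIO object would be replaced by an open file handle for whatever location and name were chosen in the Export dialog.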

Step 5: Alignment (BLAST and Clean Up)

To perform a BLAST alignment for the entire sequence, choose Run Tool (Tools=>Run Tool from the main menu, or right click (control-click on the Mac OS)). From the Run Tool dialog choose BLAST Search. (Note: you can also run BLAST on a particular region. In that case, select the region of interest by clicking on the top ruler and dragging in either direction.)

Run Tool dialog BLAST selected

Click Next.

In the BLAST Search dialog ensure you have selected the Nucleotide option, Nucleotide-Nucleotide (MegaBLAST) from the Program menu, and nr(Nucleotide collection (nt)) from the Database menu. Enter the search string biomol mrna[prop] in the Entrez Query field.

BLAST dialog query subject

Click Next.

In the next dialog, accept the general parameters, check Filter low complexity regions, and select Human from the Species specific repeats for dropdown list.

Run BLAST dialog options

Then click Finish. When BLAST has finished, the results will be added to the corresponding project (New Project (1)). It can take some time for the analysis to return and present the results.

BLAST result in GSV

To see individual hits more clearly, we will apply the Clean Up Alignments tool to our BLAST alignment. This tool filters hits and places all hits to the same accession in a separate row. Select the BLAST result in the project view, open the Run Tool dialog, choose Clean Up Alignments, and click Next.

Run Tool dialog CleanUpAlignments selected

Accept the defaults in the next dialog and click Finish.

Run CleanUpAlignments dialog options

The cleaned-up BLAST result should appear in the Tools Result folder in the corresponding project (New Project (1)).

CleanUpAlignments result in GSV tooltip shown

Zoom in to see individual hits, and open tooltips for more information about hits and alignments.
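The effect of placing all hits to the same accession in one row can be pictured with a small sketch. This only illustrates the grouping idea and is not the Clean Up Alignments algorithm; the group_by_accession helper, the accessions, and the coordinates are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented hits: (subject accession, query start, query end).
hits = [
    ("ACC_1.1", 100, 400),
    ("ACC_2.1", 150, 300),
    ("ACC_1.1", 900, 1200),
    ("ACC_1.1", 450, 800),
]


def group_by_accession(
    hits: List[Tuple[str, int, int]]
) -> Dict[str, List[Tuple[int, int]]]:
    """Collect hit spans per accession, sorted by start position."""
    rows = defaultdict(list)
    for accession, start, end in hits:
        rows[accession].append((start, end))
    return {accession: sorted(spans) for accession, spans in rows.items()}


rows = group_by_accession(hits)
for accession, spans in rows.items():
    print(accession, spans)
```

Each key corresponds to one row in the cleaned-up view, with that accession's hits laid out in order along the query.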

Step 6: WindowMasker

In this step we will use WindowMasker on the FASTA sequence to look for repetitive regions. First let us load the mask data. Select Tools=>WindowMasker Data. (Note: the WindowMasker path is not available to users outside NCBI.) In the dialog that appears, select human.tar.gz as the mask and click the OK button. A WindowMasker folder will be created automatically in the “GenomeWorkbench2” folder and the data will be downloaded.

WindowMasker download dialog

In the Graphical Sequence View collapse the Cleaned alignment track and select the region by clicking on the ruler and dragging a selection around a region.

Sequence region selected in GSV

Choose Tools=>Run Tool from the main menu. Select the Search/Find Repetitive Sequences with WindowMasker row and click Next (if you click exactly on the tool’s text, you will be taken to the next screen without having to click Next).

Run Tool dialog WindowMasker selected

Ensure that our sequence region is selected, then select 9606 Homo sapiens from the Mask using parameters for dropdown list.

Run WindowMasker dialog options

Note: If they were not downloaded previously, the WindowMasker files can be downloaded via the Configure option of this dialog:

Run WindowMasker configure download

Click Next, choose a project to add the results to and click Finish. It can take some time for the job to complete.

The result is a histogram showing regions of repeats. If the histogram does not appear automatically, select the content menu at the bottom of the graphical view and choose Repeat Regions.

WindowMasker result histogram

You can scroll and zoom just like you would any other view.

WindowMasker result zoomed
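To make the idea of a repeat histogram more concrete, here is a rough sketch of how masked intervals could be summarized as per-bin coverage. It assumes the masked regions are available as 0-based, half-open, non-overlapping (start, end) intervals; that representation, the bin size, and the interval values are assumptions for illustration, and this is not how Genome Workbench computes its track.

```python
from typing import List, Tuple


def masked_fraction_per_bin(
    intervals: List[Tuple[int, int]], seq_len: int, bin_size: int
) -> List[float]:
    """Fraction of each fixed-width bin covered by masked intervals.

    Intervals are assumed not to overlap; overlapping input would be
    counted twice.
    """
    n_bins = (seq_len + bin_size - 1) // bin_size
    covered = [0] * n_bins
    for start, end in intervals:
        pos = max(0, start)
        end = min(seq_len, end)
        while pos < end:
            b = pos // bin_size
            # Stop at the bin boundary or the interval end, whichever is first.
            chunk_end = min((b + 1) * bin_size, end)
            covered[b] += chunk_end - pos
            pos = chunk_end
    fractions = []
    for b in range(n_bins):
        width = min(bin_size, seq_len - b * bin_size)  # last bin may be short
        fractions.append(covered[b] / width)
    return fractions


# Invented intervals on a 250 bp sequence, summarized in 100 bp bins.
fractions = masked_fraction_per_bin([(10, 30), (95, 120)], seq_len=250, bin_size=100)
print(fractions)
```

Taller bars in such a summary correspond to bins where a larger share of bases fall inside masked regions.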

Step 7: Conclusion

There are multiple ways to use Genome Workbench, and this tutorial shows only some very simple examples. It gives you enough background to start exploring your data in new and interesting ways, with the privacy you need along with access to public data. For more information on working with BAM and GFF3 files, refer to the Displaying new non-NCBI molecules with annotations tutorial.

Current Version is 3.6.0 (released March 04, 2021)

Last updated: 2020-12-09T23:08:13Z