The NCBI Genome Remapping Service (Remap) will be retired in November 2023.
What is NCBI Remap?
NCBI Remap is a tool that allows users to project annotation data from one coordinate system to another. This remapping (sometimes called 'liftover') uses genomic alignments to project features from one sequence to the other. We perform a base-by-base analysis of each feature on the source sequence in order to project the feature through the alignment to the new sequence.
We support three variations of Remap. Assembly-Assembly allows the remapping of features from one assembly to another. Clinical allows for the remapping of features from assembly sequences to RefSeqGene sequences (including transcript and protein sequences annotated on the RefSeqGene) or from RefSeqGene sequences to an assembly. Alt loci remap allows for the mapping of features between the Primary assembly unit and the Alternate Loci and Patches assembly units available for GRC assemblies.
You can view a short video describing how to use remap here: http://www.youtube.com/watch?v=0lhcMGGReVQ
- What's New
- Specifying the data
- Remapping options
- Configuring Remapping Parameters via URLs
- Providing Data
- Output files
- Remapping Variation Data
What's new
May 2018 Update
- Added features:
- Improved performance for highly fragmented assemblies
- Bgzipped VCF files are now recognized by the format guesser
- Improved error handling:
- Now reporting "NOMAP ERROR" when the VCF REF and ALT bases are switched and no remapped rows are produced.
- Unrecognized sequence identifiers are skipped and reported as "NOMAP NOTINSET".
- Bug fixes:
- Fixed bug causing "?" in the output of VCF remapping, where the REF allele differs between the source and target assemblies (REF_EDIT rows), and the REF and ALT alleles are the same in the output VCF. These cases are now shown as "[base] ."
- Fixed bug in the handling of GFF with non-Sequence Ontology terms in column 3
- Fixed bug causing modification of column 3 and column 9 in GFF3 to GFF3 remapping
- Fixed mishandling of LRG sequences in HGVS format
February 2017 Update
- Added features:
- New INFO tag "REF_EDIT" has been added to output VCF files to indicate when the REF allele in the output VCF differs from the REF allele in the input VCF. This may occur as a result of left-shifting alleles from the input VCF during remapping, or may reflect a difference in the REF alleles found in the source and target assemblies. This tag replaces the "REF_UPDATE" and "REF_LEFT_SHIFT" INFO tags.
- Updated backend associated with remapping of variants from VCF files
- Bug fixes:
- Fixed known issue associated with proper reverse complementing of left-shifted or multi-base alleles in VCF output
- Fixed known issue with output VCF only reporting one allele when multiple alleles were specified in a single row of the input VCF
- Fixed bug affecting recognition of LRG sequences in Clinical Remapping option
- Fixed bug affecting remapped coordinates in columns 7,8 (thickStart/End) and 11,12 (blockSizes/Starts) in output BED
August 2016 Update
- Added features:
- Genome Workbench output now made optional for web interface and API
- Improved message reporting for errors or other user issues
- Back-end updates to support use of HTTPS protocol
- Bug fixes:
- Corrected issue with alignment coverage table that failed to update percent coverage column data correctly when source and target assemblies were swapped.
- Fixed broken and updated links to file format documentation
- Known issues:
- See previous releases
February 2016 Update
- Added features:
- Extended retrieval time for Remap job results. The URL on the page displaying Remap results will remain active for 3 days.
- e.g., https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/genome/tools/remap/JSID_01_133856_130.14.18.6_9000_remap__1454025062
- Gbench file production is now optional. In the web interface, users can deselect the default option for Gbench file production. API users can turn off file production by use of the gbench=false parameter. This reduces memory usage.
- Announcements banner on Gbench home page.
- Bug fixes:
- Alt-loci remap: Fixed a bug that resulted in the prefixing of "C" to alleles remapped to alternate loci.
- Fixed a bug that prevented selection of Source Organism when using Firefox 42.0.
- Fixed a bug affecting clearing of fields when using the "reset" button.
- Fixed a bug preventing remapping of features in some .ASN1 locations.
- Back end updates to reduce likelihood of intermittent failures when running large jobs.
- Known Issues:
- When using VCF as input, left-shifted or multi-base alleles on the minus strand are not reverse-complemented in the output. This will be fixed in an upcoming release. We apologize for the inconvenience.
- See March 2015 Update below
March 2015 Update
- Added features:
- New web interface for Assembly-Assembly alignment provides greater assembly detail, making it easier to distinguish similarly named assemblies
- Improved support for remapping of variations in VCF files. See the Remapping Variation Data section on this page for details.
- Improved reporting for features that do not remap. See the Mapping Report section on this page for details.
- Bug fixes:
- VCF formatting: All entries for a given seq-id now form a continuous block in output VCF, consistent with VCF specifications
- Known Issues:
- If multiple ALT alleles are specified in a single row of an input VCF, only 1 allele is reported in output VCF. To avoid this situation, only specify a single ALT allele per VCF row. This will be fixed in an upcoming release.
- Remap may crash if a feature meets all three of the following conditions: Input format=VCF, feature requires left-shifting, feature remaps to multiple locations on the same sequence-id with multiple second-pass alignments. This bug, which affects only a small minority of features, will be fixed in an upcoming release. In the interim, if you encounter this issue, we suggest use of a different input file type or ensuring that variants are already left-shifted in the input VCF file. We apologize for the inconvenience.
April 2014 Update
- Added features:
- Assembly accessions now provided as tool-tips in the target and source assembly drop-down menus. These provide users with unambiguous identifiers for the assemblies used in their remapping effort.
- Assemblies in target and source menus are sorted by assembly and release version to facilitate identification of assembly of choice.
- Construction of pre-configured URLs: users can now construct URLs with specified remapping parameters that can be bookmarked or used as links to NCBI Remap.
- Bug fixes:
- Remapped locations missing from VCF, GVF or HGVS file output (but present in report files).
- Data remapped from alt-loci or patches to chromosomes reported on scaffolds rather than chromosomes.
- Further correction to alt-loci remap
August 2013 Update
- Added features:
- Limited support for LRG sequences in Clinical Remap. We currently only support the current versions of LRG and RefSeqGene sequences. Support for older sequences will be added in a future update.
- Bug fixes:
- Inappropriate duplication of variant lines when using a VCF with multiple alternate alleles.
- Dropping of scores from GTF files.
- Alt locus remap was fixed.
November 2012 Update
- Added features:
- Alt locus remap: remap features between the primary assembly and the alternate loci/patches in GRC assemblies.
- Clinical Remap: When you run Clinical Remap, we now call the variation reporter and insert its results into Clinical Remap.
- Added support for upload of compressed files. Currently GZip (.gz) and BZip2 (.bz) files are supported.
- Improved HGVS nomenclature.
Specifying the data
Assembly-Assembly
In order to use the NCBI Remap service, you must select the organism of interest, the assembly your features are on (Source Assembly) and the assembly on which you wish to project these features (Target Assembly). If you would like to request additional organisms or assemblies to be added to the list, please use the Support Center to make this request.
List of supported assembly-assembly alignments in remap:
Organism | Source Assembly | Target Assembly | Software version | Last Updated |
---|---|---|---|---|
Clinical Remap
Only human is supported for the RefSeqGene tab, so you only need to select the sequence upon which your features are annotated (either an assembly or RefSeqGenes) and the sequences to which you want the features mapped (either RefSeqGenes or an assembly).
Alt loci remap
Alt loci remap allows you to map data between the Primary Assembly and the Alternate Loci/Patches that may be available for an assembly. Only assemblies produced by the Genome Reference Consortium are supported on this page. All you need to select on this page are the organism and the assembly; the software will figure out the direction in which you want to map. Within a given input file, however, all features to be remapped should map in the same direction (e.g. primary to alt OR alt to primary).
NOTE: For both Clinical Remap and Alt loci remap if you map FROM an assembly to either the RefSeqGenes or the Alternate Loci/Patches, you may have a lot of failed features as both of these sequences only cover a fraction of the genome. Features on source sequences that are not part of an alignment set will be marked as "NOMAP/NOTINSET" in the alignment report. To see genome coverage for Alternate Loci/Patches see the GRC pages for human and mouse.
Remapping Options
Some configuration options are available that allow you to configure the stringency of remapping. These options are only configurable in the Assembly-Assembly tab.
- Minimum ratio of bases that must be remapped (default: 0.5): This option specifies the fraction of the interval that must be remapped. Raising this value increases the stringency of the remapping process.
- Maximum ratio for difference between the source length and the target length (default 2.0): This feature allows the remapping algorithm to tolerate insertions and deletions in the alignment. This is calculated by taking the interval length on the target assembly (stop-start+1) and dividing it by the interval length on the source assembly (stop-start+1). An insertion or deletion in the target assembly will affect this ratio. Lowering this value will increase the stringency of the remapping process.
- Allow multiple locations to be returned (default: on): We perform alignments in two phases (see 'About our alignments'). Selecting this option will allow the 'Second Pass' alignments to be used and improve coordinate projection in regions of duplication. This can also lead to multiple features being remapped to the same location.
- Merge Fragments (default: on): An insertion in the target assembly will split a feature on the source assembly into two locations. Selecting this option merges these two locations into a single location in the annotation file. Turning this option off increases the stringency of the remapping process, specifically in cases where there is an insertion in the target sequence, because each remapped interval is then compared to the original interval.
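The two numeric options above can be illustrated with a minimal Python sketch. It is not the Remap implementation: the function names, the `remapped_bases` input (the number of source bases the alignment could project) and the choice of inclusive comparisons are all assumptions made for illustration. Only the defaults (0.5 and 2.0) and the `stop-start+1` length formula come from the text.

```python
def interval_length(start: int, stop: int) -> int:
    """Length of a closed interval, as in the text: stop - start + 1."""
    return stop - start + 1


def passes_thresholds(
    source: tuple[int, int],
    target: tuple[int, int],
    remapped_bases: int,
    min_ratio: float = 0.5,
    max_length_ratio: float = 2.0,
) -> bool:
    """Apply the two numeric stringency options to one remapped interval.

    remapped_bases is a hypothetical input: the number of source bases
    that the alignment was able to project onto the target.
    """
    source_len = interval_length(*source)
    target_len = interval_length(*target)
    # "Minimum ratio of bases that must be remapped"
    remapped_ratio = remapped_bases / source_len
    # "Maximum ratio for difference between the source and target length"
    length_ratio = target_len / source_len
    # Inclusive comparisons are an assumption; the text does not say
    # whether a value exactly at the threshold passes.
    return remapped_ratio >= min_ratio and length_ratio <= max_length_ratio
```

For example, a 100-base source interval that lands on a 250-base target interval has a length ratio of 2.5 and would be rejected under the default maximum of 2.0.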
The merge function can help you remap features that cross an assembly gap, or have a large insertion that causes a gap in the alignment.
Figure 1: A region with a feature that crosses an assembly gap. This feature was successfully remapped because the merge function was on.
However, in regions with messy alignments, the merge function can cause a feature to be remapped to the same or overlapping positions. This only happens when using the Second Pass alignments for remapping, as these alignments are not guaranteed to be unique.
Figure 2: A region with nice First Pass alignments and many Second Pass alignments.
Using the merge function, this feature remaps to six locations in GRCh37, one using the First Pass alignments and five using the Second Pass. These are easily distinguished using the remap report as the 'recip' column specifies whether the first pass or second pass alignments were used.
Figure 3: Remap report for feature with multiple locations returned due to complicated second pass alignments.
These features are relatively easy to identify in a post-processing step, or you can turn the merge function off. This will, however, negatively affect features that cross a gap. You may need to review the alignments (which you can do using the Genome Workbench project files) to determine the best course of action.
Note: Alignments are processed in a strand-specific manner. If a feature aligns to a region for which there are alignments on both strands, you may get a placement returned for the plus and the minus strand. Using the merge feature may increase the chances of this as merge helps to span alignment gaps. Turning merge off will cause a decrease in remapped features as gaps will not be crossed on either strand.
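As a rough illustration of merging, the sketch below collapses the remapped fragments of one feature into a single span per sequence and strand. This is a deliberate simplification, not the Remap algorithm: the `Fragment` type and `merge_fragments` function are hypothetical, and the sketch merges every fragment that shares a sequence and strand regardless of distance, which the real tool may not do. Keeping the strands separate mirrors the note above that alignments are processed in a strand-specific manner.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """One remapped piece of a feature (hypothetical structure)."""
    seq_id: str
    start: int
    stop: int
    strand: str


def merge_fragments(fragments: list[Fragment]) -> list[Fragment]:
    """Collapse fragments sharing a sequence and strand into one span."""
    groups: dict[tuple[str, str], list[Fragment]] = defaultdict(list)
    for frag in fragments:
        groups[(frag.seq_id, frag.strand)].append(frag)

    merged = []
    for (seq_id, strand), frags in groups.items():
        # The merged location runs from the leftmost start to the
        # rightmost stop, spanning any gap between the fragments.
        merged.append(
            Fragment(
                seq_id,
                min(f.start for f in frags),
                max(f.stop for f in frags),
                strand,
            )
        )
    return sorted(merged)
```

With two plus-strand fragments at 100-199 and 260-400 and one minus-strand fragment at 900-950 on the same sequence, the sketch returns one plus-strand location (100-400) and one minus-strand location (900-950).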
Configuring Remapping Parameters via URLs
There are several parameters that can be added to the NCBI Remap URL to pre-configure the mapping parameters that will be used.
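A pre-configured URL is an ordinary query string appended to the Remap address, so it can be assembled with Python's standard library. The sketch below is hedged in two ways: the base URL is an assumption inferred from the results URL shown elsewhere on this page, and `gbench=false` is used only as a placeholder because it is the one parameter this page spells out (for API users). The parameter names actually accepted by the web page are not listed in this section.

```python
from urllib.parse import urlencode

# Assumed base URL for the Remap web page; not stated in this section.
REMAP_BASE_URL = "https://www.ncbi.nlm.nih.gov/genome/tools/remap"


def build_remap_url(params: dict[str, str], base_url: str = REMAP_BASE_URL) -> str:
    """Append URL-encoded query parameters to the base URL."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


# Placeholder parameter taken from the February 2016 release notes.
example_url = build_remap_url({"gbench": "false"})
```

A URL built this way can be bookmarked or used as a link, as described in the April 2014 update.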
Providing Data
We accept file formats that are commonly used in the bioinformatics community. We currently accept:
The default behavior is to provide the remapped annotation file in the same format as the input file, but you can specify a different format for the output. If you have a small amount of data, you can just copy and paste the data in the large text box labeled 'Paste data here'. For example:
chr1:10349-25000
Otherwise, you can upload the data file. Please note: the larger your file is, the longer it will take to perform the remapping process. If you find that the process is taking a very long time or failing, you may want to split your files into smaller ones, perhaps based on chromosome assignment. There is also an absolute limit on the amount of RAM available to the system. If this is exceeded, Remap will fail. If this happens, try again with a smaller file.
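One way to split a large input by chromosome before uploading is sketched below in Python. It assumes a tab-delimited format whose first column is the sequence ID and whose header lines start with `#` (as in VCF or GFF); the function name and the output file naming scheme are illustrative choices, not part of Remap. The sketch holds the whole file in memory, so it suits files that are large for Remap but still fit in local RAM.

```python
from collections import defaultdict
from pathlib import Path


def split_by_sequence(input_path: str, output_dir: str) -> dict[str, Path]:
    """Write one file per sequence ID found in column 1 of a tab-delimited file."""
    source = Path(input_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    header: list[str] = []
    lines_by_seq: dict[str, list[str]] = defaultdict(list)
    with source.open() as handle:
        for line in handle:
            if line.startswith("#"):
                header.append(line)  # repeated at the top of every output file
            elif line.strip():
                lines_by_seq[line.split("\t", 1)[0]].append(line)

    written: dict[str, Path] = {}
    for seq_id, lines in lines_by_seq.items():
        # Sequence IDs containing path separators would need sanitizing.
        target = out_dir / f"{source.stem}.{seq_id}{source.suffix}"
        target.write_text("".join(header + lines))
        written[seq_id] = target
    return written
```

Each resulting file can then be submitted as a separate Remap job.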
Data options specific to the Clinical Remap tab
Mapping from a RefSeqGene(s) to an assembly: In this case, an additional option is provided and checked by default. This allows the remap service to return features on genomic sequences as well as any transcripts (NMs) or proteins (NPs) available at that locus.
Mapping from an assembly to RefSeqGenes: In this case, you have the option to map to any available RefSeqGene (default) or you can specify a list of RefSeqGenes as targets. If you choose to map to any available RefSeqGene, there are two additional options for providing locations on transcripts (NMs) or proteins. One is to provide the transcript (NM) and protein (NP) locations for features that map to RefSeqGenes and the other is to provide transcript (NM) and protein (NP) locations even if there isn't a RefSeqGene where your feature maps. Not all genes in the human genome have a RefSeqGene. There is a link on this page that allows you to request the construction of a RefSeqGene if one is not available for your gene of interest.
Output files
Summary Data: This is a global report that provides an overview of remapping results. The format of the report is (by column):
- ID: The sequence ID in the source assembly (often something like 'chr1' or NC_000001.9).
- Source Features: The number of features on the ID in the source file.
- Remapped Features: The number of features that could be projected onto the Target assembly.
- Source Intervals: The number of intervals on the ID in the source file. This can exceed the number of features because some features have more than one sequence interval; for example, mRNA features will often have multiple intervals (corresponding to exons).
- Remapped Intervals: The number of intervals that could be projected onto the Target assembly.
The summary data appears on the web page and is available for download.
Mapping Report: This is a report that provides a feature-by-feature breakdown of the remapping status. The format of this report on the web page is (by column):
- Feature: The name or ID of the feature (the source of this will depend on the format submitted, but it should be possible to robustly associate the information in this column with the data in the input file).
- Src. Intervals: Number of intervals the feature has in the source file.
- Remap Intervals: Number of intervals that were projected to the target assembly.
- Src location: The feature location in the input file.
- Src length: The length of the feature in the input file.
- Map Location: Projected location (or reason that the remap failed) on the target assembly.
- Map length: Length of the feature on the target assembly.
- Coverage: Coverage of feature on the target assembly.
Only a few lines of this report are displayed on the web page, but the entire report is available for download as a tab-separated (tsv) file that can be easily parsed or loaded into a spreadsheet program. The downloaded report has 18 columns as follows. Note: If the merge option is selected, all of these fields contain post-merge values:
- #feat_name: user-supplied feature name. If no feature name is supplied, a name is calculated using the line number in the file or the location. For features with multiple intervals (e.g. transcripts), this field will be common to each interval.
- source_int: The number of intervals in the source feature (useful for tracking features with multiple intervals, like transcripts). For single-interval features, the value is always 1.
- mapped_int: the number of mapped intervals in the remapped file from the source interval. Values >1 indicate a fragmented mapping.
- source_id: sequence identifier the feature maps to in the source file.
- mapped_id: sequence identifier the feature maps to on the target assembly.
- source_length: length of the interval on the source assembly.
- mapped_length: length of the interval on the target assembly.
- source_start: first base of the interval on the source assembly.
- source_stop: last base of the interval on the source assembly.
- source_strand: strand the interval is annotated on in the source assembly.
- source_sub_start: first base of source sub-interval that was mapped (used only if entire source interval does not remap and the front edge of the source interval does not map).
- source_sub_stop: last base of source sub-interval that was mapped (used only if entire source interval does not remap and the back edge of the source interval does not map).
- mapped_start: first base of remapped interval.
- mapped_stop: last base of remapped interval.
- mapped_strand: strand of the remapped interval.
- coverage: This is calculated by taking the ratio of the mapped_length to the source_length. If coverage =1 the remapped and source interval are identical. A coverage score of less than 1 indicates a deletion in the target assembly and a score of greater than 1 indicates an insertion in the target assembly.
- recip: Two possible values are in this column. 'First Pass' means the remapping is based on the 'First Pass' or reciprocal-best-hit alignments. 'Second Pass' means the remapping is based on the non-reciprocal-best-hit alignments.
- asm_unit: The assembly unit to which the mapped_id belongs. For more information on assembly units, see: https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/assembly/model/
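Because the downloaded report is plain tab-separated text, it can be read with Python's `csv` module. The sketch below keys each row by the 18 column names listed above; the function names are illustrative, and the assumption that the file starts with a header row beginning with `#feat_name` is inferred from the column list rather than stated. `length_ratio` recomputes the coverage value as described for the coverage column and only makes sense for rows that actually remapped.

```python
import csv
from typing import Iterable, Iterator

REPORT_COLUMNS = [
    "#feat_name", "source_int", "mapped_int", "source_id", "mapped_id",
    "source_length", "mapped_length", "source_start", "source_stop",
    "source_strand", "source_sub_start", "source_sub_stop", "mapped_start",
    "mapped_stop", "mapped_strand", "coverage", "recip", "asm_unit",
]


def read_report(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield one dict per data row, keyed by the 18 column names."""
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields or fields[0] == REPORT_COLUMNS[0]:
            continue  # skip blank lines and the (assumed) header row
        yield dict(zip(REPORT_COLUMNS, fields))


def length_ratio(row: dict[str, str]) -> float:
    """Recompute coverage as mapped_length / source_length."""
    return int(row["mapped_length"]) / int(row["source_length"])
```

In practice `lines` would be an open file handle, for example `read_report(open("report.tsv", newline=""))`, where the file name is a placeholder.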
Features that don't remap will have the word 'NOMAP' in column 15 and the reason for not mapping in column 16. The reasons are:
- NOALIGN: There was no alignment for this region.
- LOWCOV: The percent of the interval covered in the alignment was below the coverage threshold specified in the 'Remapping Options' (Minimum ratio of bases that must be remapped).
- EXPANDED: The ratio of the length on the target sequence versus the length on the source sequence is greater than specified in the remap options (default is 2).
- ALIGNGAP: The source interval falls entirely within an indel in an alignment between the source and target sequence.
- NOTINSET: The source interval is not part of the alignment set.
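A quick tally of failure reasons can help decide which remapping option to relax. The Python sketch below takes the text above at its word that 'NOMAP' appears in column 15 and the reason in column 16 (counting from 1) of the downloaded tab-separated report. The function name is illustrative, and any reason string found is counted, including ones not listed here such as the "ERROR" value mentioned in the May 2018 update.

```python
from collections import Counter
from typing import Iterable


def count_nomap_reasons(lines: Iterable[str]) -> Counter:
    """Count the reasons given for features that did not remap."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Column 15 (index 14) holds 'NOMAP'; column 16 (index 15) the reason.
        if len(fields) >= 16 and fields[14] == "NOMAP":
            counts[fields[15]] += 1
    return counts
```

Many LOWCOV entries, for instance, suggest that lowering the minimum ratio of bases that must be remapped would rescue more features, at the cost of stringency.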
Annotation Data: This file contains only the remapped features, in the format specified on the input page. No sample data is shown on the web page, but the file is available for download and display in your favorite viewer.
Genome Workbench Files: These are files that can be loaded directly into our client side viewer called Genome Workbench. They contain the sequence information for both the source and target assemblies, the assembly-assembly alignments used in the remapping and feature annotations (both the source features and the remap features). These files are available for download and are very useful for understanding how the alignments influenced the feature remapping (see Figure 4).
Figure 4: View of remapping in Genome Workbench. The sequence being shown in this view is the Target assembly. The tracks are (in order from the top):
- Ruler: showing basepair coordinates.
- Sequence: for some organisms this will be colored and for others it will be grey. This track will show you the actual base pairs if you zoom in enough.
- Tiling Path: Shows the INSDC sequences used to construct the sequence.
- Genes Track: Gene annotation from NCBI annotation process.
- Alignments: Alignment to the Source assembly. This will have the 'First Pass' alignments and the 'Second Pass' alignments if the 'Allow duplications' option was checked. The alignments are zoomed to the base pair level. Mismatches are colored in red. Insertions are shown using a blue triangle (none in this view).
- SNP features: Variation features defined by dbSNP.
- Only the remapped features are shown here. In this example features from dbVar were mapped from NCBI36->GRCh37.p9. Only remapped features are shown on the target assembly. If you open a sequence that is part of the Source assembly you can see the original features.
Remapping Variation Data
Edits to VCF Files
If you use a Variant Call Format (VCF) file as input, the REF and ALT bases in the remapped output may be edited under two circumstances. The first is a sequence difference between the source and target assemblies. REF bases in the input file refer to the source assembly, whereas REF bases in the output annotation files refer to the target assembly. If a REF base differs between the source and target assemblies, the output VCF therefore reports the target assembly base in the REF field, and the corresponding ALT field is updated: the source assembly REF base replaces, or is appended to, the ALT base provided in the input VCF, as appropriate. The second circumstance is an error in the input VCF. If the base specified in the REF column of an input VCF is incorrect, the correct base is reported in the output VCF and the input base is added to the ALT column.
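The REF/ALT update for a single-base variant can be sketched as follows. This is one reading of "replacing or being appended to ... as appropriate", not the Remap code: the source REF base replaces an ALT that equals the target base, and is appended otherwise. The function name is hypothetical, and strand changes, multi-base alleles and left-shifting are deliberately ignored.

```python
def update_ref_alt(
    source_ref: str, alts: list[str], target_ref: str
) -> tuple[str, list[str]]:
    """Re-express a single-base variant against the target assembly base."""
    if source_ref == target_ref:
        return source_ref, list(alts)  # assemblies agree; nothing to edit

    # An ALT equal to the target base is now the reference allele, so the
    # source REF base takes its place in the ALT list.
    new_alts = [source_ref if alt == target_ref else alt for alt in alts]
    # Otherwise the source REF base is appended as an additional ALT.
    if source_ref not in new_alts:
        new_alts.append(source_ref)
    return target_ref, new_alts
```

For an input row with REF `A` and ALT `G` where the target assembly has `G`, the sketch yields REF `G` and ALT `A`; with ALT `T` instead, it yields REF `G` and ALT `T,A`.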
Additionally, if you are using a Variant Call Format (VCF) file as your input, NCBI Remap will left-shift variants prior to remapping them. Upon remapping, it will left-shift again with respect to the target assembly. Therefore, when using VCF as your input, all output files will contain left-shifted coordinates. This ensures output VCF meets file specifications. If you provide VCF as your input and specify HGVS as your output, please note that the HGVS will also contain left-shifted coordinates. At this time, NCBI does not provide an equivalent right-shifting function for input HGVS files. This is planned for a future release.
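Left-shifting moves an insertion or deletion to the leftmost equivalent position in a repeat. The Python sketch below shows the generic left-alignment procedure used by common VCF normalizers; it is offered as an illustration of the idea and is not claimed to be the routine Remap runs. The function name and the in-memory `sequence` string are assumptions, and positions are 1-based as in VCF.

```python
def left_shift(sequence: str, pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Left-align one REF/ALT pair against a reference sequence (1-based pos)."""
    if sequence[pos - 1 : pos - 1 + len(ref)] != ref:
        raise ValueError("REF does not match the sequence at pos")

    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base, unless that would empty an allele
        # at position 1, where no base exists to the left to anchor it.
        if (
            ref
            and alt
            and ref[-1] == alt[-1]
            and (pos > 1 or min(len(ref), len(alt)) > 1)
        ):
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele needs an anchor base: extend one base to the left.
        if not ref or not alt:
            pos -= 1
            base = sequence[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

On the sequence `GGGCACACACAGGG`, a `CA` deletion written as position 8, REF `CAC`, ALT `C` shifts to position 3, REF `GCA`, ALT `G`, the leftmost placement within the `CA` repeat. Single-base substitutions are returned unchanged.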
If you have selected VCF as your output file type, all NCBI Remap edits to the REF and ALT fields are reported using INFO tags.
NCBI Remap VCF Meta-Information: NCBI Remap appends the following remapping-related information to the meta-information lines in the output VCF.
Meta-information | Description |
---|---|
NCBI_remap_source_assm | Assembly acc.ver of source assembly |
NCBI_remap_target_assm | Assembly acc.ver of target assembly |
NCBI_remap_align_date | Date on which alignments used for NCBI Remap were generated |
NCBI_remap_run_date | Date on which NCBI Remap was run by user |
NCBI_remap_batch_id | Alignment batch id (NCBI identifier) |
NCBI_remap_align_parameters | NCBI Remap alignment parameters |
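These entries can be pulled out of an output VCF header with a few lines of Python. The sketch assumes they are written as standard `##key=value` meta-information lines, which this page does not show explicitly; the function name and the placeholder values in the usage note are hypothetical.

```python
from typing import Iterable


def remap_meta(lines: Iterable[str]) -> dict[str, str]:
    """Collect the NCBI_remap_* meta-information from a VCF header."""
    meta: dict[str, str] = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM header
        key, sep, value = line[2:].rstrip("\n").partition("=")
        if sep and key.startswith("NCBI_remap_"):
            meta[key] = value
    return meta
```

Given a header containing `##NCBI_remap_source_assm=SOURCE_ACC.VER` (a placeholder value), `remap_meta` returns that value under the key `NCBI_remap_source_assm`, which is useful for recording which alignments produced a given output file.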
NCBI Remap VCF INFO fields: NCBI Remap also uses several INFO fields in the output VCF to describe feature updates that may have occurred during the remapping process. In addition, there is an INFO field indicating whether the feature was remapped with first- (reciprocal-best-hit) or second-pass (non-reciprocal-best-hit) alignments.
ID | Number | Type | Description |
---|---|---|---|
REMAP_ALIGN | 1 | String | Alignment type used for remapping (FP=first pass, SP=second pass) |
REF_ERROR | 0 | Flag | REF base does not match source assembly |
REF_EDIT | 0 | Flag | REF and ALT bases modified due to difference in REF base in source and target assemblies or left-shifting of input REF base |
DEPRECATED: REF_UPDATE | 0 | Flag | REF and ALT bases modified due to difference in REF base in source and target assemblies |
DEPRECATED: REF_LEFT_SHIFT | 0 | Flag | Position of REF base left-shifted |
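The INFO fields in the table above can be read from the eighth column of each output VCF row. The Python sketch below parses a semicolon-separated INFO string and summarizes the remapping-related tags; the function names and the shape of the returned summary are illustrative choices. The deprecated tags are treated as equivalent to REF_EDIT so that older output files are handled as well.

```python
def parse_info(info: str) -> dict[str, object]:
    """Split a VCF INFO string into key/value pairs; flags map to True."""
    tags: dict[str, object] = {}
    if info in ("", "."):
        return tags
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        tags[key] = value if sep else True
    return tags


def describe_remap(info: str) -> dict[str, object]:
    """Summarize the NCBI Remap INFO tags found on one VCF row."""
    tags = parse_info(info)
    alignment_names = {"FP": "first pass", "SP": "second pass"}
    return {
        "alignment": alignment_names.get(tags.get("REMAP_ALIGN")),
        "ref_edited": any(
            tag in tags for tag in ("REF_EDIT", "REF_UPDATE", "REF_LEFT_SHIFT")
        ),
        "ref_error": "REF_ERROR" in tags,
    }
```

A row whose INFO column reads `REMAP_ALIGN=SP;REF_EDIT` is therefore reported as remapped through second-pass alignments with an edited REF allele, which flags it for closer review.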