About our alignments

Assembly-Assembly Alignments

Alignments are generated in two phases. The first phase produces the 'First Pass' alignments, which are reciprocal best alignments: any locus on the query assembly has 0 or 1 alignment to the target assembly. These alignments are generated as follows. First, the two assemblies are aligned using BLAST. BLAST parameters can be adjusted depending on the quality of the alignments, but the most commonly used parameters are:

blastn -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -evalue 0.0001 -soft_masking true -task megablast -word_size 28

Additionally, we use in-database masking through precomputed WindowMasker masked regions. The parameter we change most often is the word size, which can be lowered for assemblies of lower quality.
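As an illustration of how the word size might be varied while the other parameters stay fixed, the following minimal sketch assembles the blastn argument list quoted above. The function name, the file names, and the database name are hypothetical. The in-database WindowMasker masking is not modeled here. The sketch only builds the command and does not run BLAST.

```python
import shlex


def build_blastn_command(query_fasta, target_db, out_path, word_size=28):
    """Return a blastn argument list using the parameters quoted in the text.

    query_fasta, target_db and out_path are placeholders supplied by the
    caller; only word_size is exposed as a knob, since the text names it as
    the parameter changed most often.
    """
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,
        "-db", target_db,
        "-out", out_path,
        "-best_hit_overhang", "0.1",
        "-best_hit_score_edge", "0.1",
        "-evalue", "0.0001",
        "-soft_masking", "true",
        "-word_size", str(word_size),
    ]


# Hypothetical file and database names, for illustration only.
cmd = build_blastn_command("query_assembly.fa", "target_assembly_db", "hits.out")
print(shlex.join(cmd))
```

For a lower-quality assembly, the same call with a smaller word_size value changes only the final argument; how small a value is appropriate is not stated in the text.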

BLAST produces a large set of alignments that must be trimmed to remove low-quality and spurious alignments. The BLAST results are processed in the following way:

  • Identify genomic regions containing spans covered by components that are common to both assemblies. If the same GenBank/EMBL/DDBJ sequence is used to construct both assemblies, the alignment based on that sequence is favored. Because the assembled sequence is derived from the same underlying sequence in both assemblies, we can be confident these are the alignments we want.
    • For the human assembly, some components may be used once in the Primary assembly unit and again in alternate loci representations (see the Genome Reference Consortium definition page for more details), so we favor a chromosome-to-chromosome alignment over a chromosome-to-scaffold or scaffold-to-scaffold alignment.
  • Alignments based on common components are then merged into the longest consistent stretches possible; alignments are not merged across gaps greater than 49 bases. This results in a set of alignments called the 'common component set'.
  • We then eliminate remaining BLAST alignments that are redundant with the common component alignments.
  • The remaining alignments are then merged independently of the common component alignments; again, alignments are not merged across gaps greater than 49 bases.
  • Redundant alignments are then removed from the merged set. This results in a set of alignments called the 'BLAST set'.
  • The two alignment sets (common component and BLAST) are then combined into a single set and sorted to select the 'First Pass set'. The sort ranks alignments to favor, in order:
    • common component alignments 
    • chromosome-to-chromosome alignments
    • alternate loci-to-alternate loci alignments
    • chromosome-to-alternate loci alignments
    • count of identities
  • The sorted alignments are then assessed and only alignments with non-conflicting query/subject ranges are kept for the First Pass set. Conflicting alignments are reserved for evaluation in the 'Second Pass'. 
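The merging and selection steps above can be sketched as follows. This is a simplified illustration, not NCBI's implementation: the Aln record, the half-open coordinate convention, the single-strand assumption, and the rule that both the query gap and the subject gap must be at most 49 bases are all assumptions made for the example. Only the 49-base threshold and the ranking order come from the text.

```python
from dataclasses import dataclass

MAX_MERGE_GAP = 49  # from the text: no merging across gaps of more than 49 bases


@dataclass(frozen=True)
class Aln:
    """Hypothetical alignment record; coordinates are half-open, plus strand."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    identities: int = 0
    common_component: bool = False
    kind: str = "chr-chr"  # "chr-chr", "alt-alt", "chr-alt", or anything else


def merge_chain(alns):
    """Merge collinear neighbours whose query and subject gaps are both 0..49."""
    merged = []
    for a in sorted(alns, key=lambda x: (x.q_start, x.s_start)):
        if merged:
            p = merged[-1]
            q_gap = a.q_start - p.q_end
            s_gap = a.s_start - p.s_end
            same_class = (a.kind, a.common_component) == (p.kind, p.common_component)
            if same_class and 0 <= q_gap <= MAX_MERGE_GAP and 0 <= s_gap <= MAX_MERGE_GAP:
                merged[-1] = Aln(p.q_start, a.q_end, p.s_start, a.s_end,
                                 p.identities + a.identities,
                                 p.common_component, p.kind)
                continue
        merged.append(a)
    return merged


# Ranking order from the text: common component first, then alignment kind,
# then count of identities (higher is better).
KIND_RANK = {"chr-chr": 0, "alt-alt": 1, "chr-alt": 2}


def rank_key(a):
    return (not a.common_component, KIND_RANK.get(a.kind, 3), -a.identities)


def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end


def select_first_pass(alns):
    """Keep ranked alignments whose query and subject ranges conflict with none kept so far."""
    kept, conflicting = [], []
    for a in sorted(alns, key=rank_key):
        clash = any(
            overlaps(a.q_start, a.q_end, k.q_start, k.q_end)
            or overlaps(a.s_start, a.s_end, k.s_start, k.s_end)
            for k in kept
        )
        (conflicting if clash else kept).append(a)
    return kept, conflicting
```

In this sketch the second list returned by select_first_pass plays the role of the conflicting alignments that are reserved for the 'Second Pass'. Redundancy removal between the common component set and the BLAST set is not modeled.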

To capture duplicated sequences, we do a 'Second Pass' that targets large regions (>1Kb) within an assembly that have no alignment or a conflicting alignment in the First Pass. In the 'Second Pass' alignments, a given region in the query assembly can align to >1 region in the target assembly. See Figure 1 for more details.

Figure 1. Cartoon showing First Pass and Second Pass alignments. Regions of duplication are hard to assemble and align. Redundant sequences in an assembly can lead to artificial duplication and in other cases repeated sequences can be collapsed into a single copy. Regardless, understanding the relationship of these sequences between two assemblies is important. In the figure to the left, the sequence in assembly 1 has two copies of a repeat (labeled A and A'); there is only 1 copy of this repeat in assembly 2 (labeled A). During the First Pass only A would have an alignment to a sequence in assembly 2; A' would be unaligned. During the Second Pass phase, the alignment of A' -> A would be picked up.

When using the First Pass alignments, any locus can align 0 or 1 times to the other assembly. The Second Pass alignments allow any locus to align 0, 1 or >1 times. The alignments are computed once and stored in a database for easy retrieval.
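One part of choosing Second Pass candidates is finding regions longer than 1 Kb that have no First Pass alignment. A minimal sketch of that step is shown here, assuming half-open coordinates and treating 1 Kb as 1000 bases; the function name and input layout are hypothetical. Regions with conflicting First Pass alignments, which are also Second Pass candidates, are not handled in this sketch.

```python
MIN_REGION = 1000  # ">1Kb" in the text; treating 1 Kb as 1000 bases is an assumption


def uncovered_regions(seq_length, covered, min_len=MIN_REGION):
    """Return half-open (start, end) spans longer than min_len that no
    First Pass alignment covers.

    covered is an iterable of (start, end) intervals on one sequence;
    intervals may overlap and may be given in any order.
    """
    regions = []
    cursor = 0  # end of the furthest covered position seen so far
    for start, end in sorted(covered):
        if start - cursor > min_len:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if seq_length - cursor > min_len:
        regions.append((cursor, seq_length))
    return regions


# Example: a 10,000-base sequence with three First Pass alignments.
print(uncovered_regions(10000, [(0, 2000), (2500, 4000), (6000, 9500)]))
```

In the example, only the span between positions 4000 and 6000 is reported, because the other two uncovered stretches are 500 bases long and fall below the size threshold.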

RefSeqGene to Assembly Alignments (for Clinical Remap)

RefSeqGene (RSG) sequences are genomic sequences that represent a single gene, typically with some flanking 5' and 3' sequence included in the sequence record. While the RSG sequences are typically in good agreement with the chromosome sequences defined in the reference assembly (see Genome Reference Consortium for more details), there are sometimes differences. These differences arise either because the clinical community has historically used a different standard or because there is a problem with the reference assembly. To align these sequences to the reference assembly, we use a tool called the 'NG Aligner'. This tool starts by using BLAST to align the RSG to the assembly. The BLAST output is then filtered and merged so that nearby, fragmented BLAST hits can be combined into a larger alignment. Results from multiple alignment passes using different filtering options are used to obtain a final, consistent alignment.