SeattleSeq Annotation 150

Indel Annotation
Indels are annotated by collecting annotation for all "involved" reference-genome locations, then merging those annotations into a single annotation. For a simple deletion, the involved locations are those of the deleted bases. For example, if three bases are deleted, there will be three involved locations. For a simple insertion, there are always two involved locations: the one before the insertion and the one after.

For VCF input, a complex indel will be crudely annotated. This is still useful to identify indels in or close to coding regions. If the number of bases in the reference column is one, and the number of bases in the alternate column is greater than one (and there is only one alternate), the variation is a simple insertion. If the opposite is true, but still with only one alternate, it is a simple deletion. All others (more than one alternate or multiple bases in both columns) are considered complex. In the complex case, only one "involved" reference-genome location is used: that one base downstream from the location in the second column of the VCF file. That is, the complex location for annotation would be the first base of a simple deletion or the base following a simple insertion.
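The classification rules above can be sketched in a few lines of Python. This is only an illustration, not SeattleSeq's own code: the function names are hypothetical, the input is assumed to already be an indel record (not an SNV), and the VCF position is assumed to be the base immediately preceding the indel.

```python
def classify_vcf_indel(ref, alt_field):
    """Return 'insertion', 'deletion', or 'complex' for one VCF indel record.

    ref is the REF column; alt_field is the raw ALT column, which may hold
    several comma-separated alternates.
    """
    alts = alt_field.split(",")
    if len(alts) == 1:
        alt = alts[0]
        if len(ref) == 1 and len(alt) > 1:
            return "insertion"
        if len(ref) > 1 and len(alt) == 1:
            return "deletion"
    # More than one alternate, or multiple bases in both columns.
    return "complex"


def involved_locations(pos, ref, alt_field):
    """List the involved reference-genome locations for one VCF indel record.

    pos is the VCF position, assumed to be the base just before the indel.
    """
    kind = classify_vcf_indel(ref, alt_field)
    if kind == "insertion":
        # The location before the insertion and the location after it.
        return [pos, pos + 1]
    if kind == "deletion":
        # One location per deleted base; REF also carries the preceding base.
        return list(range(pos + 1, pos + len(ref)))
    # Complex: only the location one base downstream of the VCF position.
    return [pos + 1]
```

For instance, under these assumptions a record at position 100 with REF "ACGT" and ALT "A" is a simple deletion of three bases, with involved locations 101, 102, and 103.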
The chromosome and position are those of the submitted file. (The position should be the one immediately preceding the indel.)
There are two modes for the contents of the referenceBase and sampleAlleles columns.

The first mode applies if the file is in VCF format. Here the REF and ALT columns of the VCF input file are echoed. The strings in both columns thus start with the reference allele of the position before the insertion or deletion.

The second mode applies if the indel input file is in GATK bed format. In the referenceBase column, for a simple deletion, the deleted alleles of the reference sequence are recorded. For a simple insertion, the column gives the allele before the insertion, followed by a "-", then the allele following the insertion. In the complex case, the reference allele one base up from the VCF position is given. Note that for deletions, the reference allele just before the indel is not displayed. For insertions, the allele following the insertion is displayed; this allele does not appear in the case of a VCF file.

The sampleAlleles column contains the contents of the fourth column of the submitted file, which includes the alleles of the insertion or deletion. The sampleGenotype column contains our own designation: "D" for deletion or "I" for insertion, followed by the number of bases inserted or deleted, or "CX" if complex.
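As a rough illustration of the second mode, the Python sketch below builds the referenceBase string for simple indels and the sampleGenotype designation. The function names are hypothetical, and the exact string layout (no separator between the letter and the count, no spaces around the "-") is an assumption rather than something stated here.

```python
def reference_base_simple(kind, deleted="", before="", after=""):
    """referenceBase content for a simple indel in the GATK-bed mode.

    Assumed layout: the deleted bases for a deletion; for an insertion,
    the base before, a "-", then the base after, with no spaces.
    """
    if kind == "deletion":
        return deleted
    if kind == "insertion":
        return before + "-" + after
    raise ValueError("only simple insertions and deletions are handled here")


def sample_genotype(kind, n_bases=0):
    """sampleGenotype designation: D or I plus a base count, or CX.

    Joining the letter and the count directly (for example "D3") is an
    assumption about the output format.
    """
    if kind == "complex":
        return "CX"
    if kind == "deletion":
        return "D" + str(n_bases)
    if kind == "insertion":
        return "I" + str(n_bases)
    raise ValueError("unknown indel kind: " + kind)
```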
For the functionGVS column, the functions for each involved location are calculated, and the most-likely-damaging one is chosen. The hierarchy we use is


If the most-likely-damaging function is "intron", and any of the involved locations are within 6 bp of a splice site, "-near-splice" is added.

If one of the involved locations is in an exon, but the accession ID is that of a non-coding gene, the function becomes "non-coding-exon" or "non-coding-exon-near-splice".

An exception is as follows: if the indel is an insertion, and the insertion site is between a canonical splice site and the adjacent intron, the function will be called intron-near-splice and not splice-5 or splice-3.

For insertions (not complex) where one of the involved locations is a coding location, and the accession ID is that of a coding gene, and the number of inserted bases is not a multiple of 3, the functionGVS column is reported as frameshift or frameshift-near-splice. If the number is a multiple of 3, it is reported as coding or coding-near-splice.

For deletions (not complex) where at least one of the involved locations is a coding location, the number of deleted coding bases will be counted, and the function will be coding or frameshift depending on whether this number of deleted coding bases is a multiple of 3.

If the indel is coding and complex, the functionGVS column is reported as codingComplex or codingComplex-near-splice; no attempt is made to classify frameshifts.
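The coding rules above reduce to a multiple-of-3 test. The Python sketch below is a minimal illustration with a hypothetical function name. It takes the near-splice status as a ready-made flag, since how that flag is computed for coding indels is not spelled out here, and it assumes the "-near-splice" suffix may also apply to deletions, which the text does not state explicitly.

```python
def coding_indel_function(kind, n_bases=0, near_splice=False):
    """functionGVS label for an indel already known to touch coding sequence.

    kind is 'insertion', 'deletion', or 'complex'.
    n_bases is the number of inserted bases for an insertion, or the number
    of deleted coding bases for a deletion; it is ignored for complex indels.
    """
    if kind == "complex":
        # No attempt is made to classify frameshifts for complex indels.
        label = "codingComplex"
    elif n_bases % 3 == 0:
        label = "coding"
    else:
        label = "frameshift"
    return (label + "-near-splice") if near_splice else label
```

Note that for deletions the count passed in is of deleted coding bases only, so a deletion spanning an exon boundary is judged on the exonic part alone.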
If any of the involved locations are within a gene, the accession ID is given in the accession column. (As for SNVs, there will sometimes be multiple lines for an indel, one for each accession.) Similarly, any genes overlapping any involved location are listed in the geneList column. The proteinAccession column will contain the protein accession ID corresponding to the accession if available. However, protein accessions are only quoted for coding and splice variants.
The conservation scores are the highest values among those of the involved locations.
The column clinicalAssociation is handled the same way as for SNVs: if there is an rs ID associated with the indel, and dbSNP reports a clinical association, it is recorded.
No annotation is available for these columns: aminoAcids, proteinPosition, polyPhen, chimpAllele, AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq, scoreCADD, genomesESP.
Identifying a submitted indel with one in dbSNP is problematic. Matches are made from a database recording at most two indels at a given location, with preference for indels with genotypes available, then with preference for simple insertions or deletions as indicated by the HGVS notation, then considering dbSNP validations, finally with preference for smaller rs IDs.
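One way to picture this order of preference is as a sort key. The Python sketch below is hypothetical: the record fields (has_genotypes, is_simple, validated, rs) are invented for illustration, and treating "considering dbSNP validations" as a simple validated-first preference is an assumption.

```python
def rank_key(candidate):
    """Sort key: smaller tuples are preferred.

    False sorts before True, so each flag is negated to put the
    preferred candidates first; the rs number breaks remaining ties.
    """
    return (
        not candidate["has_genotypes"],
        not candidate["is_simple"],
        not candidate["validated"],
        candidate["rs"],
    )


def pick_stored_indels(candidates, keep=2):
    """Keep at most `keep` dbSNP indels for one location, best first."""
    return sorted(candidates, key=rank_key)[:keep]
```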

If the indel is a simple insertion or deletion (a number of consecutive alleles inserted or deleted, and pure insertion or pure deletion), a perfect match is returned if found: insertion vs. deletion match, alleles matched, and position match. Matches for a large number of inserted or deleted bases where the dbSNP HGVS notation indicates a number of bases (e.g. "del19") are also accepted if the number of bases matches. Otherwise the location one base upstream is queried for matching insertion/deletion and alleles, then if no match is found the location one base downstream is queried. If still no match, no dbSNP rs ID is reported with the following exception: if the indel location is at the beginning of a homopolymer track (e.g. AAAAA...) or a dinucleotide repeat (e.g. GTGTGTGT...), and the indel is an insertion or deletion of one or more of the same alleles as those of the repeat track (e.g. repeats of A or GT in the submitted file), a search is made for a dbSNP indel located as far as possible downstream within the track (up to 30 bp for homopolymer or 60 bp for dinucleotide), and the result is reported if found (still matching the alleles and insertion vs. deletion). This search compensates for VCF files that put the indel upstream, and dbSNP HGVS notation that sometimes puts the indel downstream. We do not attempt this downstream-shifting for repeats of more than two alleles (e.g. GTCGTCGTC...).
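The downstream search within a repeat track might look like the following Python sketch. It is an illustration under several assumptions: the reference is given as a 0-based string, the track is measured from the submitted indel location, dbsnp_positions is a hypothetical set of positions that already match on alleles and on insertion vs. deletion, and every base inside the track is treated as a candidate position.

```python
# Maximum track length searched, keyed by repeat-unit length (from the text).
MAX_SHIFT = {1: 30, 2: 60}


def repeat_track_length(seq, start, unit):
    """Bases covered by consecutive copies of `unit` from `start`, capped.

    Returns 0 for repeat units longer than two bases, which are not shifted.
    """
    if len(unit) not in MAX_SHIFT:
        return 0
    cap = MAX_SHIFT[len(unit)]
    n = 0
    while n + len(unit) <= cap and seq[start + n:start + n + len(unit)] == unit:
        n += len(unit)
    return n


def find_downstream_match(seq, start, unit, dbsnp_positions):
    """Return the furthest-downstream matching position in the track, or None."""
    n = repeat_track_length(seq, start, unit)
    in_track = [p for p in dbsnp_positions if start < p < start + n]
    return max(in_track) if in_track else None
```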

For complex indels (not simple), the dbSNP indels within 10 bases of the involved location are collected, and the closest location is selected, breaking ties by choosing the smallest rs ID.
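The selection rule for complex indels can be sketched as below. The function name and the (position, rs number) tuple representation are hypothetical; the 10-base window and the tie-break on the smallest rs ID come from the description above.

```python
def closest_dbsnp_indel(location, candidates, window=10):
    """Pick the rs number of the nearest dbSNP indel within `window` bases.

    candidates is an iterable of (position, rs_number) tuples.
    Ties on distance are broken by the smallest rs number.
    Returns None when no candidate lies within the window.
    """
    nearby = [
        (abs(pos - location), rs)
        for pos, rs in candidates
        if abs(pos - location) <= window
    ]
    if not nearby:
        return None
    # Tuples compare by distance first, then by rs number.
    return min(nearby)[1]
```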

These assignments should be regarded as suggestions, as no exhaustive attempt is made to find perfect matches. The dbSNP alleles of the match are reported in the allelesDBSNP column. If they are not the same alleles as those submitted, the assignment is suspicious. Once an rs ID is assigned, the columns inDBSNPOrNot, hasGenotypes, and dbSNPValidation are filled in the same way as for SNVs. The column functionDBSNP is a list for any overlapping accessions, not just the accession in the line.