SeattleSeq Annotation 137
  

Indel Annotation
Indels are annotated by collecting the annotations for all "involved" reference-genome locations, then merging them into a single annotation. For a simple deletion, the involved locations are those of the deleted bases. For example, if three bases are deleted, there will be three involved locations. For a simple insertion, there are always two involved locations: the one before the insertion and the one after.

For VCF input, a complex indel will be crudely annotated. This is still useful to identify indels in or close to coding regions. If the number of bases in the reference column is one, and the number of bases in the alternate column is greater than one (and there is only one alternate), the variation is a simple insertion. If the opposite is true, but still with only one alternate, it is a simple deletion. All others (more than one alternate or multiple bases in both columns) are considered complex. In the complex case, only one "involved" reference-genome location is used: that one base up from the location in the second column of the VCF file. That is, the complex location for annotation would be the first base of a simple deletion or the base following a simple insertion.
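The classification rule and the choice of involved locations described above can be sketched as follows. This is a minimal illustration, not SeattleSeq's own code: the function names are hypothetical, and it assumes the VCF position is the base immediately preceding the indel.

```python
def classify_indel(ref, alts):
    """Classify a VCF record as 'insertion', 'deletion', or 'complex'.

    ref  : string from the VCF REF column
    alts : list of strings from the VCF ALT column
    """
    if len(alts) != 1:
        return "complex"  # more than one alternate is always complex
    alt = alts[0]
    if len(ref) == 1 and len(alt) > 1:
        return "insertion"
    if len(alt) == 1 and len(ref) > 1:
        return "deletion"
    return "complex"  # all other cases


def involved_locations(pos, ref, alts):
    """Return the involved reference-genome locations for a VCF indel.

    pos is the VCF position (assumed: the base just before the indel).
    """
    kind = classify_indel(ref, alts)
    if kind == "deletion":
        n_deleted = len(ref) - 1  # REF includes the preceding base
        return list(range(pos + 1, pos + 1 + n_deleted))
    if kind == "insertion":
        return [pos, pos + 1]  # the bases before and after the insertion
    return [pos + 1]  # complex: one base up from the VCF position


print(involved_locations(100, "ACGT", ["A"]))  # three deleted bases
```

Under these assumptions, a three-base deletion at VCF position 100 involves locations 101 through 103, and an insertion at the same position involves 100 and 101.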
The chromosome and position are those of the submitted file. (The position should be that immediately preceding the indel.)
There are two modes for the contents of the referenceBase and sampleAlleles columns.

The first mode applies if the indel input file is in GATK bed format, or if it is in VCF format and the output choice is "SeattleSeq Annotation original allele columns". In the referenceBase column, for a simple deletion, the deleted alleles of the reference sequence are recorded. In the case of a simple insertion, the allele before the insertion, followed by a "-", then the allele following the insertion is given. In the complex case, the reference allele one base up from the VCF position will be given. Note that for deletions, there is no display of the reference allele just before the indel. For insertions, the allele following the insertion is displayed, an allele that does not appear in the case of a VCF file. For a bed input file, the sampleAlleles column contains the contents of the fourth column of the submitted file, which includes the alleles of the insertion or deletion. For a VCF input file, the sampleAlleles column contains the genotype calls for all individuals. The sampleGenotype column contains our own designation: "D" for deletion or "I" for insertion, followed by the number of bases inserted or deleted, or CX if complex.
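As an illustration of the first mode, the sketch below builds the referenceBase and sampleGenotype values for each indel type. It is only a sketch under stated assumptions: the function name is hypothetical, the reference is a plain string whose first character is position 1, and the exact string format of the designation (for example "D3") is inferred from the description above rather than confirmed.

```python
def original_allele_columns(kind, ref_seq, pos, n_bases=0):
    """Return (referenceBase, sampleGenotype) for the first output mode.

    kind    : 'deletion', 'insertion', or 'complex'
    ref_seq : reference sequence as a string (assumed: index 0 is position 1)
    pos     : position immediately preceding the indel
    n_bases : number of bases inserted or deleted (unused for complex)
    """
    if kind == "deletion":
        # The deleted reference alleles, at positions pos+1 .. pos+n_bases
        return ref_seq[pos:pos + n_bases], f"D{n_bases}"
    if kind == "insertion":
        # Allele before the insertion, a "-", then the allele after it
        return ref_seq[pos - 1] + "-" + ref_seq[pos], f"I{n_bases}"
    # Complex: the reference allele one base up from the VCF position
    return ref_seq[pos], "CX"


reference = "ACGTACGT"
print(original_allele_columns("deletion", reference, 2, 3))
print(original_allele_columns("insertion", reference, 2, 2))
print(original_allele_columns("complex", reference, 2))
```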

The second mode applies if the indel file is in VCF format and the output choice is "VCF-like allele columns". Here the REF and ALT columns of the VCF input file are echoed. The strings in both columns thus start with the reference allele of the position before the insertion or deletion.
For the functionGVS column, the functions for each involved location are calculated, and the most-likely-damaging one is chosen. The hierarchy we use for the NCBI gene model is

frameshift-near-splice
frameshift
codingComplex-near-splice
codingComplex
coding-near-splice
coding
splice-5
splice-3
utr-5
utr-3
intron
near-gene-5
near-gene-3
intergenic

and that for the CCDS gene model is

frameshift-near-splice
frameshift
codingComplex-near-splice
codingComplex
coding-near-splice
coding
splice-5
splice-3
intron
outsideCoding
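Choosing the most-likely-damaging function among the involved locations amounts to picking the entry that appears earliest in the relevant hierarchy. A minimal sketch, with hypothetical names and the two lists copied from above:

```python
NCBI_HIERARCHY = [
    "frameshift-near-splice", "frameshift",
    "codingComplex-near-splice", "codingComplex",
    "coding-near-splice", "coding",
    "splice-5", "splice-3", "utr-5", "utr-3",
    "intron", "near-gene-5", "near-gene-3", "intergenic",
]

CCDS_HIERARCHY = [
    "frameshift-near-splice", "frameshift",
    "codingComplex-near-splice", "codingComplex",
    "coding-near-splice", "coding",
    "splice-5", "splice-3", "intron", "outsideCoding",
]


def most_damaging(functions, gene_model="NCBI"):
    """Return the function that ranks highest in the chosen hierarchy.

    functions  : function names, one per involved location
    gene_model : 'NCBI' or 'CCDS'
    """
    hierarchy = NCBI_HIERARCHY if gene_model == "NCBI" else CCDS_HIERARCHY
    # A lower index means more likely damaging
    return min(functions, key=hierarchy.index)


print(most_damaging(["intron", "coding", "utr-3"]))
print(most_damaging(["outsideCoding", "splice-3"], gene_model="CCDS"))
```

This sketch applies the ranking only; it does not implement any special cases.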

An exception is as follows: if the indel is an insertion, and the insertion site is between a canonical splice site and the adjacent intron, the function will be called intron and not splice-5 or splice-3. Proximity to a splice site may be recovered by looking at the distanceToSplice column.

For insertions (not complex) where one of the involved locations is a coding location, and the number of inserted bases is not a multiple of 3, the functionGVS column is set to frameshift or frameshift-near-splice. If the number is a multiple of 3, it is set to coding or coding-near-splice.

For deletions (not complex) where at least one of the involved locations is a coding location, the number of deleted coding bases will be counted, and the function will be coding or frameshift depending on whether this number of deleted coding bases is a multiple of 3.

If the indel is coding and complex, the functionGVS column is noted codingComplex or codingComplex-near-splice; there is no attempt to classify frameshifts.
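The rules for coding indels reduce to a multiple-of-3 test on the relevant base count. The sketch below is illustrative only; the function name and the near_splice flag are hypothetical, and applying the -near-splice suffix to deletions is an assumption based on the hierarchy rather than on the deletion rule itself.

```python
def coding_indel_function(kind, n_bases, near_splice=False):
    """Return the functionGVS value for an indel that touches coding sequence.

    kind        : 'insertion', 'deletion', or 'complex'
    n_bases     : inserted bases (insertion) or deleted *coding* bases (deletion)
    near_splice : hypothetical flag for the -near-splice variants
    """
    if kind == "complex":
        base = "codingComplex"  # no attempt to classify frameshifts
    elif n_bases % 3 == 0:
        base = "coding"  # reading frame is preserved
    else:
        base = "frameshift"
    return base + "-near-splice" if near_splice else base


print(coding_indel_function("insertion", 2))
print(coding_indel_function("deletion", 6))
```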
If any of the involved locations are within a gene, the accession ID (or CCDS ID) is given in the accession column. (As for SNVs, there will sometimes be multiple lines for an indel, one for each accession.) Similarly, any genes overlapping any involved location are listed in the geneList column. The proteinSequence column will contain the sequence corresponding to the accession if available.
The conservation scores are the highest values among those of the involved locations.
The column clinicalAssociation is handled the same way as SNVs: if there is an rs ID associated with the indel, and dbSNP reports a clinical association, it is recorded.
No annotation is available for these columns: aminoAcids, proteinPosition, polyPhen, chimpAllele, AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq.
Identifying a submitted indel with one in dbSNP is problematic. The dbSNP variations with insertion or deletion genotypes that are within 10 bases of any involved location are collected. If there is only one, its rs ID is reported in the rsID column.

If there is more than one, a choice must be made.

There are two types of dbSNP indel annotations. (A) If genotypes are available, the alleles will be e.g. -/AT, but there will be no information about insertion vs. deletion. (B) If no genotypes are available, the dbSNP call will be e.g. delAG or insCGT.

We first try to match the number of bases inserted or deleted and, in case B, whether the dbSNP indel type (insertion vs. deletion) matches that of the submitted indel. If there is no match, the indel with the nearest location is chosen. If there is exactly one match, it is reported. If there are multiple matches, we try to match the actual alleles (unless the submitted indel is complex, in which case this step is skipped). If no alleles match, the indel with the nearest location among the size matches is chosen. If multiple allele matches remain, the nearest of them is chosen.

If there are still ties, the dbSNP indel with the smallest rs ID is reported.
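The selection procedure can be summarized in code. The following is a rough sketch of the steps as described, not the actual SeattleSeq implementation. The DbsnpIndel record, its fields, and the function names are hypothetical; how alleles are compared, and how the size of a complex indel is handled, are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DbsnpIndel:
    rs_id: int                  # numeric part of the rs ID
    position: int               # reference-genome location
    size: int                   # number of bases inserted or deleted
    alleles: str                # inserted or deleted bases (assumed form)
    kind: Optional[str] = None  # 'insertion'/'deletion' in case B, None in case A


def distance(cand, locations):
    """Smallest distance from the candidate to any involved location."""
    return min(abs(cand.position - loc) for loc in locations)


def nearest(cands, locations):
    """Nearest candidate; ties are broken by the smallest rs ID."""
    return min(cands, key=lambda c: (distance(c, locations), c.rs_id))


def choose_rs_id(cands, locations, kind, size, alleles):
    """Pick an rs ID for a submitted indel, or None if nothing is nearby."""
    nearby = [c for c in cands if distance(c, locations) <= 10]
    if not nearby:
        return None
    if len(nearby) == 1:
        return nearby[0].rs_id
    # Match on size and, in case B only, on insertion vs. deletion
    size_matches = [c for c in nearby
                    if c.size == size and (c.kind is None or c.kind == kind)]
    if not size_matches:
        return nearest(nearby, locations).rs_id
    if len(size_matches) == 1:
        return size_matches[0].rs_id
    if kind != "complex":  # the allele step is skipped for complex indels
        allele_matches = [c for c in size_matches if c.alleles == alleles]
        if allele_matches:
            return nearest(allele_matches, locations).rs_id
    return nearest(size_matches, locations).rs_id


candidates = [
    DbsnpIndel(200, 105, 2, "AT"),
    DbsnpIndel(100, 103, 2, "AT", "insertion"),
    DbsnpIndel(50, 101, 3, "CGT", "insertion"),
]
print(choose_rs_id(candidates, [100, 101], "insertion", 2, "AT"))
```

In this example two candidates match on size and alleles, so the one nearer to the involved locations is chosen.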

These assignments should be regarded as suggestions, as there is not an exhaustive attempt to make perfect matches. The dbSNP alleles of the match are reported in the allelesDBSNP column. If they are not the same alleles as those submitted, the assignment is suspicious. Note that dbSNP will report the "-" allele, even for an insertion. Once an rs ID is assigned, the columns inDBSNPOrNot, functionDBSNP, hasGenotypes, and dbSNPValidation are filled in the same as for SNVs.
 