|Indels are annotated by collecting the annotations for all "involved" reference-genome locations, then merging them into a single annotation. For a simple deletion,
the involved locations are those of the deleted bases. For example, if three bases are deleted, there are three involved locations. For a simple insertion,
there are always two involved locations: the one before the insertion and the one after.|
For VCF input, a complex indel will be crudely annotated. This is still useful to identify indels in or close to coding regions.
If the number of bases in the reference column is one, and the number of bases in the alternate column
is greater than one (and there is only one alternate), the variation is a simple insertion. If the opposite is true, but still with only one alternate, it is a simple deletion.
All others (more than one alternate, or multiple bases in both columns) are considered complex. In the complex case, only one "involved" reference-genome location is used:
the base one position downstream of the location given in the second column (POS) of the VCF file. That is, the complex location for annotation is where the first base of a simple deletion or
the base following a simple insertion would be.
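The classification rules and involved locations described above can be sketched as follows. This is only an illustration, not the SeattleSeq implementation: the function names `classify_vcf_indel` and `involved_locations` are hypothetical, and the sketch assumes the record is already known to be an indel, with `pos`, `ref`, and `alts` taken from the POS, REF, and ALT columns.

```python
def classify_vcf_indel(ref, alts):
    """Return 'insertion', 'deletion', or 'complex' for one VCF indel record.

    ref  -- string from the REF column
    alts -- list of strings from the ALT column (split on commas)
    """
    if len(alts) != 1:
        return "complex"              # more than one alternate
    alt = alts[0]
    if len(ref) == 1 and len(alt) > 1:
        return "insertion"            # one reference base, several alternate bases
    if len(ref) > 1 and len(alt) == 1:
        return "deletion"             # the opposite case
    return "complex"                  # everything else


def involved_locations(pos, ref, alts):
    """List the reference positions used for annotation (hypothetical helper)."""
    kind = classify_vcf_indel(ref, alts)
    if kind == "insertion":
        # the base before the insertion and the base after it
        return [pos, pos + 1]
    if kind == "deletion":
        # the deleted bases follow the anchor base at pos
        return list(range(pos + 1, pos + len(ref)))
    # complex: a single location, one base downstream of POS
    return [pos + 1]
```

For example, a record with POS 100, REF `ACGT`, and ALT `A` is a simple deletion of three bases, so the sketch returns the three positions 101, 102, and 103.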
|The chromosome and position are those of the submitted file. (The position should be the one immediately preceding the indel.)
There are two modes for the contents of the referenceBase and sampleAlleles columns.|
The first mode applies if the indel input file is in GATK bed format, or if it is in VCF format and the output choice is "SeattleSeq Annotation original allele columns".
In the referenceBase column, for a simple deletion, the deleted alleles of the reference sequence are recorded. In the case of a simple insertion, the allele before the insertion, followed
by a "-", then the allele following the insertion is given. In the complex case, the reference allele one base downstream of the VCF position is given. Note that for deletions,
there is no display of the reference allele just before the indel. For insertions, the allele following the insertion is displayed, an allele that does not appear in the case of a VCF file.
For a bed input file, the sampleAlleles column contains the contents of the fourth column of the submitted file, which includes the alleles of the insertion or deletion.
For a VCF input file, the sampleAlleles column contains the genotype calls for all individuals.
The sampleGenotype column contains our own designation: "D" for deletion or "I" for insertion, followed by the number of bases inserted or deleted, or "CX" if complex.
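As a small illustration of the sampleGenotype designation, the sketch below builds the code from the indel type and size. The function name `sample_genotype_code` is hypothetical, and the exact layout (the letter immediately followed by the count, with no separator) is an assumption; the text above only states that the letter is followed by the number of bases.

```python
def sample_genotype_code(kind, n_bases=0):
    """Build a sampleGenotype-style designation (hypothetical helper).

    kind    -- 'insertion', 'deletion', or 'complex'
    n_bases -- number of bases inserted or deleted (ignored for complex)
    """
    if kind == "complex":
        return "CX"
    prefix = {"insertion": "I", "deletion": "D"}[kind]
    # Assumption: the count follows the letter directly, e.g. "D3".
    return f"{prefix}{n_bases}"
```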
The second mode applies if the indel file is in VCF format and the output style choice is "VCF-like allele columns" or if the output file format
"VCF SNVs and Indels" is chosen. Here the REF and ALT columns of the VCF input file are
echoed. The strings in both columns thus start with the reference allele of the position before the insertion or deletion.
|For the functionGVS column, the functions for each involved location are calculated, and the most-likely-damaging one is chosen. The hierarchy
we use is|
If the most-likely-damaging function is "intron", and any of the involved locations are within 6 bp of a splice site, "-near-splice" is added.
If one of the involved locations is an exon, but the accession ID is that of a non-coding gene, the function becomes "non-coding-exon" or "non-coding-exon-near-splice".
An exception is as follows: if the indel is an insertion, and the insertion site is between a canonical splice site and the adjacent intron,
the function will be called intron-near-splice and not splice-5 or splice-3.
For insertions (not complex) where one of the involved locations is a coding location, the accession ID is that of a coding gene,
and the number of inserted bases is not a multiple of 3,
the functionGVS column is reported as frameshift or frameshift-near-splice. If the number is a multiple of 3, it is reported as coding or coding-near-splice.
For deletions (not complex) where at least one of the involved locations is a coding location, the deleted coding bases are counted,
and the function is coding if that count is a multiple of 3 and frameshift otherwise.
If the indel is coding and complex, the functionGVS column is reported as codingComplex or codingComplex-near-splice; no attempt is made to classify frameshifts.
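The multiple-of-3 rule for coding indels can be sketched as below. This is a simplified illustration under stated assumptions: the function `coding_indel_function` and its parameters are hypothetical, the caller is assumed to have already established that the indel touches a coding location of a coding gene, and applying the "-near-splice" suffix to deletions is an assumption (the text above names the suffixed forms only for insertions and complex indels).

```python
def coding_indel_function(kind, n_inserted=0, n_deleted_coding=0, near_splice=False):
    """Return a functionGVS-style label for a coding indel (hypothetical helper).

    kind             -- 'insertion', 'deletion', or 'complex'
    n_inserted       -- number of inserted bases (insertions only)
    n_deleted_coding -- number of deleted bases that are coding (deletions only)
    near_splice      -- True if any involved location is near a splice site
    """
    suffix = "-near-splice" if near_splice else ""
    if kind == "complex":
        # no attempt to classify frameshifts for complex indels
        return "codingComplex" + suffix
    n = n_inserted if kind == "insertion" else n_deleted_coding
    label = "coding" if n % 3 == 0 else "frameshift"
    return label + suffix
```

Note that for deletions only the deleted coding bases are counted, so a six-base deletion that removes four coding bases and two intronic bases would be labeled frameshift in this sketch.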
|If any of the involved locations are within a gene, the accession ID is given in the accession column.
(As for SNVs, there will sometimes be multiple lines for an indel, one for each accession.)
Similarly, any genes overlapping any involved location
are listed in the geneList column. The proteinSequence column contains the protein accession ID corresponding to the accession, if available.
However, protein accessions are reported only for coding and splice variants.
|The conservation scores are the highest values among those of the involved locations.
|The clinicalAssociation column is handled the same way as for SNVs: if there is an rs ID associated with the indel, and dbSNP reports a clinical association, it is recorded.
|No annotation is available for these columns: aminoAcids, proteinPosition, polyPhen, chimpAllele, AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq, scoreCADD.
|Identifying a submitted indel with one in dbSNP is problematic. The dbSNP variations with insertion or deletion genotypes that are within 10 bases of
any involved location are collected. If there is only one, its rs ID is reported in the rsID column.|
If there is more than one, a choice must be made.
If there are duplicate locations in the list, the number of indels at a location is limited to two, with preference first for indels with genotypes available,
then for simple insertions or
deletions as indicated by the HGVS notation, and finally for smaller rs IDs. In the remaining list, we first try to match the number of bases inserted or deleted
and the indel type, deletion vs. insertion (again based on the dbSNP HGVS notation). If there is
no match, the indel with the nearest location is chosen. If there is one match, it is reported. If there are multiple matches, we try to match the actual
alleles (unless the submitted indel is complex, in which case this step is skipped). If there is no match,
the indel with the nearest location among the size matches is chosen. If there are still multiple matches, the indel with the nearest location is chosen.
If there are still ties, the dbSNP indel with the smallest rs ID is reported.
These assignments should be regarded as suggestions, because no exhaustive attempt is made to find perfect matches. The dbSNP alleles of the match are reported in the allelesDBSNP column.
If they are not the same alleles as those submitted, the assignment is suspicious. Once an rs ID is assigned, the
columns inDBSNPOrNot, functionDBSNP, hasGenotypes, and dbSNPValidation are filled in the same way as for SNVs.
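The rs ID selection steps described above can be sketched roughly as follows. This is an illustrative reading of the procedure, not the actual SeattleSeq code: the `DbsnpIndel` record, its field names, and the function `choose_rs_id` are hypothetical, the 10-base window is assumed to be inclusive, and alleles are compared as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbsnpIndel:
    """Hypothetical record for one dbSNP indel candidate."""
    rs_id: int
    position: int
    kind: str                   # 'insertion' or 'deletion', from the HGVS notation
    size: int                   # number of bases inserted or deleted
    alleles: str
    has_genotypes: bool = False
    is_simple: bool = True      # simple insertion/deletion per the HGVS notation


def _distance(cand, involved):
    """Distance from a candidate to the closest involved location."""
    return min(abs(cand.position - loc) for loc in involved)


def _nearest(cands, involved):
    """Nearest candidate; ties are broken by the smallest rs ID."""
    return min(cands, key=lambda c: (_distance(c, involved), c.rs_id))


def choose_rs_id(cands, involved, kind, size, alleles,
                 submitted_is_complex=False, window=10):
    """Pick one rs ID for a submitted indel, or None if nothing is nearby."""
    nearby = [c for c in cands if _distance(c, involved) <= window]
    if not nearby:
        return None
    if len(nearby) == 1:
        return nearby[0].rs_id

    # Keep at most two indels per location: genotypes first,
    # then simple indels, then smaller rs IDs.
    by_position = {}
    for c in nearby:
        by_position.setdefault(c.position, []).append(c)
    kept = []
    for group in by_position.values():
        group.sort(key=lambda c: (not c.has_genotypes, not c.is_simple, c.rs_id))
        kept.extend(group[:2])

    # Match on size and on insertion vs. deletion.
    matches = [c for c in kept if c.kind == kind and c.size == size]
    if not matches:
        return _nearest(kept, involved).rs_id

    # With several size matches, try the actual alleles (skipped if complex).
    if len(matches) > 1 and not submitted_is_complex:
        allele_matches = [c for c in matches if c.alleles == alleles]
        if allele_matches:
            matches = allele_matches
    return _nearest(matches, involved).rs_id
```

As in the text, the result of such a procedure is only a suggestion; comparing the reported dbSNP alleles with the submitted ones remains the way to judge whether the assignment is plausible.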