SeattleSeq Annotation 138

Input Data Sources
The following sources were used to populate the databases supporting SeattleSeqAnnotation.
1. dbSNP genotypes and annotations
columns inDBSNPOrNot, dbSNPGenotype, allelesDBSNP, functionDBSNP, rsID, AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq, hasGenotypes, dbSNPValidation, clinicalAssociation (August 2013) (July-August 2013, mapping, gene function information, population definitions, submitter information) (August 2013, clinical association)
clinical association note: A list of rs IDs was generated from the dbSNP files clinvar_20130808.vcf, SNPClinSig.bcp, and OmimVarLocusIdSNP.bcp. For each of the variations, a screen-scraper program extracted the clinical association links from the dbSNP web page for the variation, and wrote the information to our database. This was done in September of 2013.
2. NCBI gene files
columns accession, functionGVS, aminoAcids, proteinPosition, cDNAPosition, geneList, proteinSequence, distanceToSplice (Oct. 1, 2013, genes and coding regions, protein accession IDs) (August 20, 2013, exons) (Aug. 2013, protein sequences used for coding calculations when the number of coding bases is not a multiple of 3)
3. UCSC conservation scores (phastCons)
column scorePhastCons (November, 2009)
4. UCSC sequence files
column referenceBase (if not in submitted file) (February 2009)
5. UCSC chimp alleles
column chimpAllele (Jan. 2013)
6. copy number variation
column CNV, file GRCh37_hg19_variants_2013-07-23.txt (July, 2013)
7. PolyPhen predictions
column polyPhen
The PolyPhen-2 prediction impacts were placed in the GVS database using bulk-download files (version 2.2.2, downloaded Aug. 2012) from the PolyPhen-2 site.
8. repeats
columns repeatMasker, tandemRepeat (March 20, 2009) RepeatMasker Tandem Repeats Finder
9. GERP conservation scores
column consScoreGERP
The manuscript describing the program Genomic Evolutionary Rate Profiling (GERP) can be found at
The GERP website is at
The hg19 GERP rejected-substitution scores were downloaded from this site in September of 2011.
10. CADD C scores
column scoreCADD
The Kircher et al. manuscript describing Combined Annotation Dependent Depletion scores can be found at Nat Genet. 2014 Mar;46(3):310-5 (PubMed 24487276).
The CADD scores were downloaded from an internal UW server in September of 2013 (version 1.0).
CADD scores are Copyright 2013 University of Washington and Hudson-Alpha Institute for Biotechnology (all rights reserved) but are freely available for all academic, non-commercial applications. For commercial licensing information contact Jennifer McCullar. CADD is currently developed by Martin Kircher, Daniela M. Witten, Gregory M. Cooper, and Jay Shendure.
11. microRNA IDs
column microRNAs (file dated June 2013, Release 20)
12. Grantham scores
column granthamScore
R. Grantham (1974) Science, vol. 185, pp. 862-864, Table 2.
13. GWAS hits
added to column clinicalAssociation
NHGRI GWAS catalog
Unlike other data in our database, these PubMed links are updated by us weekly.
14. KEGG Pathways
column keggPathway (files dated June/July 2013)
15. CpG Islands
column cpgIslands (downloaded Feb. 2012)
16. Transcription Factor Binding Sites
column tfbs (downloaded Feb. 2012)
17. NHLBI ESP Allele Counts
column genomesESP
allele counts are read from a local database (SNVs only), and are those of the EVS web site as of Oct. 2013 (6503 individuals)
18. ExAC Allele Counts
column genomesExAC
Exome Aggregation Consortium, release 0.2, file ExAC.r0.2.sites.vep.vcf, Nov. 2014 (61,486 individuals)
19. Protein-Protein Interactions
column PPI (v9.05, downloaded Nov. 2013)
This section primarily documents SNV columns. For indels, see also How Indels Are Annotated.
This column indicates whether the variation has been observed before. The dbSNP indicator shows that the variation is in our GVS database, whether genotypes are available or not. This database currently holds variations from dbSNP build 138. The string "dbSNP_" is followed by the dbSNP build number for which the variation was first recorded (for the chosen rs ID, see below). If the variation is not found in dbSNP, the column contains "none".
This column echoes the chromosome in the submitted file, which is assumed to be hg19/NCBI37.
The NCBI37 chromosome accession numbers used are (useful for forming HGVS nomenclature):
This column echoes the position in the submitted file, which is assumed to be hg19/NCBI37.
If the reference allele is included in the submitted file, this column echoes it. Otherwise the reference allele is generated by the server (see 4 above). (In the case of indels submitted in a VCF file, this column may echo the REF column in the VCF file, depending on the output-format choice.)
This column echoes the genotypes in the submitted file, expressed as single-character ambiguity codes. If there are multiple individuals, a comma-separated list is given. For a VCF input file with no genotypes, the ambiguity code will be that for the reference allele plus any alternative alleles.
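As an illustration of the encoding described above, here is a minimal sketch that assumes the single-character codes are the standard IUPAC nucleotide ambiguity codes. The function names are hypothetical and are not part of SeattleSeq.

```python
# Sketch only: assumes standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguity_code(alleles):
    """Return the one-letter code for a collection of nucleotide alleles."""
    return IUPAC[frozenset(a.upper() for a in alleles)]


def sample_genotype_column(genotypes):
    """Join the codes for several individuals into a comma-separated list."""
    return ",".join(ambiguity_code(g) for g in genotypes)


# Three individuals with genotypes A/A, A/G and C/T.
print(sample_genotype_column(["AA", "AG", "CT"]))
# VCF line with no genotypes: reference allele plus alternative alleles.
print(ambiguity_code(["C"] + ["T"]))
```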
This column lists all the alleles observed (those in the sampleGenotype column). If the submitted file covers only one individual, there will be two alleles if the genotype is heterozygous. For multiple individuals, this is a list of all unique alleles (excluding N). For a VCF input file with no genotypes, the list will include the reference allele (REF) plus any alternative alleles (ALT). (In the case of indels submitted in a VCF file, this column may echo the ALT column in the VCF file, depending on the output-format choice.)
This column is present only if genotypes are submitted for an individual with a known dbSNP individual ID. In this case, the column gives the dbSNP genotype for the individual, if known.
This column lists alleles for all populations and individuals in dbSNP, derived from the HGVS notation. The string "NA" is used if unknown.
This is the transcript identifier: the NM, XM, NR, or XR ID for the NCBI gene model. (There is one line in the output file for each transcript, as the function and protein annotations can be different for different alternative splices.)
The functionGVS entries are calculated by us. For a given SNV, we look in the gene model table in our database and find genes that overlap the position. If there are multiple transcripts for the gene, or more than one gene, the function is calculated for each, and there is one line in the output file for each.

Next we check whether the SNV is in a coding region of the transcript. If so, the coding bases of the entire transcript are divided into codons, and the codon at the SNV position is extracted, using the human reference sequence. The information thus far is the same regardless of the alleles in the submitted file.

Next the reference base allele and the allele(s) of the data in the submitted file are used. There are 3 possible codons: from the reference base, from one allele, and from the other allele (though one of the alleles is normally the same as the reference base, and then there are 2 different codons). The amino acids are extracted for each codon. If the amino acids are all the same, the SNV is synonymous; otherwise it is missense, stop-gained, or stop-lost. If the SNV is not synonymous and one of the codons is a stop codon, the function is stop-gained or stop-lost, where the gain or loss is relative to the reference codon.

Note that if you are submitting data for one individual, and the genotype is homozygous for the reference allele, functionGVS can only be synonymous, as there is only one amino acid in the set. Missense and stop classifications arise only for an individual with at least one allele different from the reference allele. When submitting a file with multiple individuals, all different alleles are considered. There still may be a difference from functionDBSNP if some dbSNP individual has alleles not in the submitted data set.

If the variation is in a coding region, and is in the first two or last two locations of an exon, the "-near-splice" string is added. If the variation is intronic, and the location is within 6 bases of a splice site, the "-near-splice" string is added.
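The codon comparison described above can be sketched as follows. This is an illustrative sketch, not the SeattleSeq code: it assumes the standard genetic code, a reference codon already oriented along the transcript, and a hypothetical function name `classify_snv`. It covers only the synonymous/missense/stop decision and ignores the "-near-splice" suffix.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snv(ref_codon, offset, observed_alleles):
    """Classify a coding SNV.

    ref_codon: reference codon (transcript orientation).
    offset: 0-based position of the SNV within the codon.
    observed_alleles: alleles seen in the submitted data.
    Returns (function, amino acids), reference amino acid listed first.
    """
    ref_aa = CODON_TABLE[ref_codon]
    amino_acids = [ref_aa]
    for allele in observed_alleles:
        codon = ref_codon[:offset] + allele + ref_codon[offset + 1:]
        aa = CODON_TABLE[codon]
        if aa not in amino_acids:
            amino_acids.append(aa)
    if len(amino_acids) == 1:
        return "synonymous", amino_acids
    if ref_aa == "*":
        # A stop in the reference codon is lost in at least one allele.
        return "stop-lost", amino_acids
    if "*" in amino_acids:
        return "stop-gained", amino_acids
    return "missense", amino_acids


print(classify_snv("GAA", 0, ["G", "T"]))  # heterozygous G/T at first base
print(classify_snv("GAA", 2, ["A", "A"]))  # homozygous reference
```

A genotype that is homozygous for the reference allele yields a single amino acid and therefore "synonymous", which matches the note above.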

If, as is the case for a few genes in the NCBI gene model, the number of coding bases is not a multiple of 3, we match the protein sequence to the coding nucleotide sequence, first in the forward direction along the nucleotide sequence until the SNV is found or a mismatch is observed, then if necessary in the reverse direction. If the SNV is not found in a region at the beginning or end of a gene where the protein sequence matches the translated nucleotide sequence, the function is assigned coding-unknown.

Note that for functionGVS and the protein analyses below, we are assuming that if the number of coding bases is a multiple of 3, their translation gives the correct functionGVS etc. For multiples of 3, there is no attempt to match the NCBI protein sequence. This could occasionally give errors, should the human genome contain two or more indels in the nucleotide sequence.

Splice sites are two bases at the 5' end of an intron (splice-donor), and two bases at the 3' end of an intron (splice-acceptor). In the rare cases where there is an intron in the middle of an untranslated region, splice sites will be called there.

The classifications upstream-gene and downstream-gene label SNVs that are outside transcribed regions, but within 5000 bp of a transcription region.

For the NCBI gene model, the complete list of SNV functions is


For any line in the output with a NCBI accession ID, functionDBSNP will be a list of dbSNP functions for that accession (normally only one).
If the variation is in our GVS database of dbSNP entries, the rs ID is given. Otherwise the entry is zero. Inclusion in this GVS database does not require that genotypes be available (hence some low-quality variations may be included), but it does require that the dbSNP group mapped the variation to a unique chromosomal location. For SNVs, there is no matching of alleles to those of dbSNP; only locations are used. If there is more than one rs ID at the submitted SNV location, an rs ID with genotypes is chosen if available; ties are broken by choosing the lowest rs ID. For indels, the matching is more complex. See How Indels Are Annotated.
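The rs ID selection rule for SNVs can be sketched as below. The function name and the input shape (a list of `(rs_number, has_genotypes)` pairs for one location) are assumptions made for illustration only.

```python
def choose_rs_id(candidates):
    """Pick one rs ID from the dbSNP entries mapped to a single location.

    candidates: list of (rs_number, has_genotypes) pairs (hypothetical shape).
    Returns 0 when no dbSNP entry exists at the location.
    """
    if not candidates:
        return 0
    with_genotypes = [rs for rs, has_gt in candidates if has_gt]
    # Prefer entries with genotypes; fall back to all entries otherwise.
    pool = with_genotypes if with_genotypes else [rs for rs, _ in candidates]
    # Ties are broken by the lowest rs ID.
    return min(pool)


print(choose_rs_id([(100, False), (200, True), (300, True)]))
```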
The bases in the coding region are divided into codons. If the number of coding bases is an even multiple of 3 or if the protein sequence can be matched to the nucleotide sequence in the SNV region, the codon of the SNV is identified. A list of alleles at the SNV site is made: the reference allele, then the alleles observed. For each allele on this list, an amino acid is assigned. A list of unique amino acids is displayed in the column. Note that the reference allele is always in the nucleotide list, even if none of the alleles observed is that of the reference sequence. The amino acid corresponding to the reference allele is listed first. There are normally two amino acids in the list, but there can be more if the observed alleles don't include the reference allele or if there are multiple individuals and the site is triallelic.

If the SNV is synonymous, the single amino acid corresponding to the reference allele is reported in this column.
If the variation causes an amino acid change as per the aminoAcids column (above), a corresponding Grantham score is supplied for each change found. Grantham scores (R. Grantham (1974) Science) estimate the strength of a change from one amino acid to another. The first amino acid in the aminoAcids column is that of the reference sequence. In rare cases when two or more alleles different from the reference are represented in the submitted genotypes, and there are two or more amino acids different from that of the reference sequence, the Grantham score is a comma-separated list for each alternative amino acid against that of the reference amino acid.
This is the position of the amino acid in the protein, beginning at the N-terminal with the first amino acid at position 1, followed by the total number of amino acids in the protein. The total includes a count for the stop codon. These positions are calculated after the bases in the coding region are divided into codons. The value "NA" is reported if the number of coding bases is not an even multiple of 3 and the protein sequence cannot be matched to the nucleotide sequence in the SNV region.
This is the position of a nucleotide in the coding sequence of a gene, beginning at the 5' end of the gene, with the first base position 1 (for SNVs only so far). (The coding region includes the three bases of the stop codon.) The value will be reported even if the number of coding bases is not an even multiple of 3. The value "NA" is reported if the position is not in the coding region of a gene.
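The relationship between the coding-sequence position and the protein position can be sketched as follows. This is a simplified illustration with assumed names: it handles only the case where the number of coding bases is a multiple of 3, reports "NA" otherwise (the protein-matching fallback is not modeled), and the "/" separator between position and total is an assumption about the output format.

```python
def protein_position(cdna_position, n_coding_bases):
    """Derive 'position/total' for the amino acid containing a coding base.

    cdna_position: 1-based position in the coding sequence, or None if the
    variation is not in a coding region.
    n_coding_bases: coding length, including the three stop-codon bases.
    """
    if cdna_position is None or n_coding_bases % 3 != 0:
        return "NA"
    amino_acid_index = (cdna_position - 1) // 3 + 1
    total = n_coding_bases // 3  # includes a count for the stop codon
    return f"{amino_acid_index}/{total}"


print(protein_position(4, 300))
```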
PolyPhen prediction impacts were loaded into our GVS database from bulk download files (see 7 above), and impacts should be available for most missense SNVs. If PolyPhen was able to analyze the SNV (and the particular allele change), the polyPhen column entry is "probably-damaging", "possibly-damaging", or "benign" if a "class" radio button was set on the home page form, or a number if a "score" radio button was chosen. The score is a number between 0 and 1, where 1 is the most damaging. There is a choice of predictions from the HumDiv training set or the HumVar training set. If the location/alleles combination is not in the PolyPhen file (usually because the variation is not missense), the value will be "unknown". Multiple PolyPhen-2 calls, presumably from multiple transcripts, are condensed to the most deleterious call, with no attempt to match our transcripts to theirs.
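Condensing several class calls to the most deleterious one can be sketched as below. The severity ordering (benign < possibly-damaging < probably-damaging) and the function name are assumptions for illustration.

```python
# Assumed severity ranking of the three PolyPhen-2 classes.
SEVERITY = {"benign": 0, "possibly-damaging": 1, "probably-damaging": 2}


def condense_polyphen(calls):
    """Return the most deleterious class, or 'unknown' if there are no calls."""
    if not calls:
        return "unknown"
    return max(calls, key=SEVERITY.__getitem__)


print(condense_polyphen(["benign", "probably-damaging", "possibly-damaging"]))
```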
The entries in this column are UCSC conservation scores, 46 placental mammalian species, range of 0 to 1, with 1 being the most conserved (see 3 above). Note that these scores are averaged over several adjacent genome locations.
The entries in this column are rejected-substitution scores from the program GERP, Stanford University, range of -12.3 to 6.17, with 6.17 being the most conserved (see 9 above). Note that these scores have single genome-location resolution.
The entries in this column are Combined Annotation Dependent Depletion C scores, range of 0 to 99, with 99 being the most likely deleterious (see 10 above). These scores have single genome-location resolution and are dependent on the alternate allele (8.6 billion variations). They are phred-like (logarithmic) scores. Variations in the top 10% of the 8.6 billion will receive a CADD score of 10 or greater; top 1% a score of 20 or greater, etc. If there is more than one alternate allele, the highest score is quoted.
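The phred-like scaling can be illustrated with a short sketch. It assumes the usual phred convention, score = -10 * log10(rank / total), which is consistent with the thresholds quoted above (top 10% gives 10, top 1% gives 20); the function names are hypothetical.

```python
import math


def phred_scaled(rank, total):
    """Phred-like score for a variant ranked `rank` (1 = most deleterious)
    among `total` variants. Assumes score = -10 * log10(rank / total)."""
    return -10.0 * math.log10(rank / total)


def reported_score(scores_per_alt_allele):
    """With more than one alternate allele, the highest score is quoted."""
    return max(scores_per_alt_allele)


print(phred_scaled(10, 100))   # variant at the top-10% boundary
print(phred_scaled(1, 100))    # variant at the top-1% boundary
print(reported_score([3.2, 17.5]))
```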
This is the UCSC panTro4 (Pan troglodytes) reference allele (see 5 above).
If the variation is in one or more regions of copy number variation (from the Centre for Applied Genomics in Toronto, see 6 above), these are listed.
This is a comma-separated list of HUGO names, any for which the transcription region overlaps the variation. For the case of overlapping genes, there can be more than one name in the list, and only one of them will correspond to the transcript in the accession column (though the other transcripts will have separate lines).
AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq
If the SNV has been genotyped in the HapMap project, these are the allele frequencies (in percent) for three populations. If the "minor" radio button was selected, these are the minor-allele frequencies (the frequencies of the least-frequent alleles). If the "reference" radio button was selected, these are the frequencies of the hg19 reference allele. The minor-allele frequency can usually be derived from the reference-allele frequency. As long as the SNV is diallelic, and the reference allele is represented in the HapMap genotypes, the minor-allele frequency in percent is the smaller of two values: (1) the reference frequency in percent, and (2) 100 - (reference frequency in percent).
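The derivation of the minor-allele frequency from the reference-allele frequency is a one-line calculation. The sketch below uses a hypothetical function name and is valid only under the conditions stated above (diallelic SNV, reference allele represented in the HapMap genotypes).

```python
def minor_allele_freq_percent(ref_freq_percent):
    """Minor-allele frequency (percent) for a diallelic SNV, given the
    reference-allele frequency in percent."""
    return min(ref_freq_percent, 100.0 - ref_freq_percent)


print(minor_allele_freq_percent(80.0))  # the reference allele is the major allele
print(minor_allele_freq_percent(35.0))  # the reference allele is the minor allele
```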
This column records whether the dbSNP database has genotypes available for one or more individuals, as opposed to just alleles and frequencies.
If the variation is in dbSNP, this is their list of categories of evidence supporting the variation.
repeatMasker, tandemRepeat
If the variation is in one or more repeat regions, these are listed (see 8 above).
This is a list of links (see 1 and 13 above).
The nearest splice site is a function only of the variation location, and is not constrained to genes in the column geneList. (For example, the geneList column could contain one gene that's a single exon -- no splice sites -- and the distance to a splice site in a gene some distance away would be reported.) In some transcripts there are introns in the middle of an untranslated region. Splice sites for such introns are included.

If the distance is greater than 25,000,000 bp (as can be the case on the Y chromosome), the distance is reported as 99999999.

Splice sites are two bases at the 5' end of an intron, and two bases at the 3' end of an intron. The first and last base of an exon would have a distance of 1; a canonical splice site would have a distance of 0. The intron base next to the innermost canonical site would be 1.
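The distance convention can be sketched as follows. This is an illustration with assumed names and input shape (introns as 1-based inclusive `(start, end)` pairs on one chromosome); returning the 99999999 sentinel when no splice site exists at all is an assumption of the sketch.

```python
import bisect

CAP = 25_000_000
SENTINEL = 99999999


def splice_site_positions(introns):
    """Collect the two canonical bases at each end of every intron."""
    sites = set()
    for start, end in introns:
        sites.update((start, start + 1, end - 1, end))
    return sorted(sites)


def distance_to_splice(position, sorted_sites):
    """Distance from a position to the nearest canonical splice-site base."""
    if not sorted_sites:
        return SENTINEL  # assumption: no splice sites on this chromosome
    i = bisect.bisect_left(sorted_sites, position)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - position)
    if i > 0:
        candidates.append(position - sorted_sites[i - 1])
    nearest = min(candidates)
    return SENTINEL if nearest > CAP else nearest


sites = splice_site_positions([(101, 200)])
print([distance_to_splice(p, sites) for p in (100, 101, 102, 103, 201)])
```

With one intron spanning 101 to 200, the last exon base (100) and the first base of the next exon (201) are at distance 1, the canonical bases are at distance 0, and the intron base next to the innermost canonical site (103) is at distance 1.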
If the variation falls in microRNA genes as listed in miRBase (see 11 above), they are provided in a comma-separated list.
Pathways associated with the gene are extracted from UCSC KEGG files. The pathway list is associated with the locus ID of the gene. (To keep whitespace out of the column, spaces in the descriptions have been removed, and the following characters capitalized.)
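The whitespace removal can be sketched as below. The function name is hypothetical, and the pathway descriptions are used only as sample inputs.

```python
def compact_pathway_name(description):
    """Remove spaces and capitalize the character that followed each space."""
    words = description.split()
    if not words:
        return ""
    # The first word is kept as-is; later words get an upper-case first letter.
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])


print(compact_pathway_name("Purine metabolism"))
print(compact_pathway_name("MAPK signaling pathway"))
```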
A CpG is a C base followed by a G base in the genome sequence. Regions where CpGs are present at higher than normal levels may be associated with promoter regions and thus regulation of gene expression. Regions are extracted from UCSC CpG files (see the UCSC description of the cpgIslandExt table). If the variation location is inside one of the regions, the value in the column is the range of the island.
Transcription factor binding sites (TFBS) are short DNA sequence regions where regulatory proteins (transcription factors) bind. Regions are extracted from UCSC TFBS files (see the UCSC description of the tfbsConsSites table). UCSC data is derived from the Transfac Matrix Database. The data are computational, and not all are expected to be biologically functional. If the variation location is inside one or more of the regions, the value in the column is a comma-separated list of Transfac IDs.
The allele counts are reported for 6503 exomes sequenced in the National Heart, Lung, and Blood Institute Exome Sequencing Project (see 17 above). If the box "split ESP Afr/Eur" was checked, allele counts are reported separately for individuals of African (AA:) and European ancestry (EA:). There is no matching to the input alleles.
The allele counts are reported for SNVs and indels of 61,486 exomes analyzed by the Exome Aggregation Consortium (release 0.2, see 18 above). Only variants with a PASS filter and with the AC_Adj parameter greater than 0 are reported. If the box "split ExAC" was checked, allele counts are reported separately for individuals in 7 populations:

AFR African & African American
AMR American
EAS East Asian
FIN Finnish
NFE Non-Finnish European
SAS South Asian
OTH Other

If the input variant is a SNV, only a SNV ExAC match is reported; if it is an indel, only an indel match. There is no other matching to the input alleles; allele counts for any alleles observed in the ExAC data are listed. Position matches are exact, even for indels; it is assumed that ExAC indels and input indels are quoted as far upstream as possible (as is the usual case for VCF files, though this is not the case for HGVS notation).
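The PASS and AC_Adj filter mentioned above can be sketched as a check on the FILTER and INFO fields of one VCF record. The function name is hypothetical, and treating a multi-allelic record as reportable when any of its comma-separated AC_Adj values exceeds 0 is an assumption of this sketch.

```python
def exac_record_reported(filter_field, info_field):
    """Return True if a VCF record passes the filter described above.

    filter_field: the VCF FILTER column, e.g. "PASS".
    info_field: the VCF INFO column, e.g. "AC=3;AC_Adj=2;AN=100".
    """
    if filter_field != "PASS":
        return False
    # Parse key=value pairs; flag-style INFO entries without "=" are skipped.
    info = dict(kv.split("=", 1) for kv in info_field.split(";") if "=" in kv)
    counts = [int(x) for x in info.get("AC_Adj", "0").split(",")]
    return any(count > 0 for count in counts)


print(exac_record_reported("PASS", "AC=3;AC_Adj=2;AN=100"))
print(exac_record_reported("PASS", "AC=3;AC_Adj=0;AN=100"))
```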
Experimental confidence scores (range 0 to 1000) from the Search Tool for the Retrieval of Interacting Genes (STRING), version 9.05 (see 19 above) were placed in our local GVS database (for NCBI gene-model HUGO names). For the NCBI locus ID associated with the accession, interactions with experimental scores of 700 or greater were sorted by score. Up to 10 interacting proteins with the highest scores were selected. If the eleventh and higher proteins had a score equal to that of the tenth, the list was continued until a lower score was found.
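The selection of interacting proteins can be sketched as follows. The function name and the input shape (a list of `(protein, experimental_score)` pairs for one locus) are assumptions made for illustration.

```python
def top_interactors(scored, min_score=700, limit=10):
    """Keep interactions scoring at least min_score, sorted by score, and
    return the top `limit`, extended through ties with the last kept score."""
    kept = sorted((s for s in scored if s[1] >= min_score), key=lambda s: -s[1])
    if len(kept) <= limit:
        return kept
    cutoff = kept[limit - 1][1]
    # Continue the list while later entries tie with the tenth score.
    return [s for s in kept if s[1] >= cutoff]


# Hypothetical data: nine distinct high scores, three tied at 800, two lower.
example = [(f"P{i}", 900 - i) for i in range(9)]
example += [("T1", 800), ("T2", 800), ("T3", 800), ("L", 750), ("X", 600)]
print([name for name, _ in top_interactors(example)])
```

In this example the tenth protein scores 800, so the two further proteins tied at 800 are also listed, while the 750 and 600 entries are dropped.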
Entries in the proteinSequence column record the NCBI protein accession ID in the protein files listed above. There was no attempt to check that the translation of the codons in the corresponding coding regions (hg19) matched the protein sequence of the corresponding protein accession ID (except when the number of coding bases is not a multiple of 3). Protein accessions are only quoted for coding and splice variants.