SeattleSeq Annotation 150
  

Input Data Sources
The following sources were used to populate the databases supporting SeattleSeqAnnotation.
1. dbSNP genotypes and annotations
columns inDBSNPOrNot, dbSNPGenotype, allelesDBSNP, functionDBSNP, rsID, AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq, hasGenotypes, dbSNPValidation, clinicalAssociation
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/genotype (March 2017)
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/XML (March 2017, mapping, gene function information, population definitions, submitter information)
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20170530.vcf (May 2017, clinical association)
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/database/data/organism_data/OmimVarLocusIdSNP.bcp (March 2017, clinical association)
2. NCBI gene files
columns accession, functionGVS, aminoAcids, proteinAccession, cDNAPosition, geneList, distanceToSplice
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq (April 2017, genes and coding regions, protein accession IDs)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GRCh38.p10_interim_annotation/interim_GRCh38.p10_top_level_2017-01-13.gff3 (Jan. 2017, exons)
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/protein/protein.fa (June 2016, protein sequences used for coding calculations when the number of coding bases is not a multiple of 3)
3. UCSC sequence files
column referenceBase (if not in submitted file)
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ (Dec. 2013)
4. UCSC chimp alleles
column chimpAllele
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/vsPanTro5/hg38.panTro5.net.axt (August 2016)
5. PolyPhen predictions
column polyPhen
The PolyPhen-2 prediction impacts were placed in the GVS database using bulk-download files (version 2.2.2, downloaded Aug. 2012) from the PolyPhen-2 site. Genomic locations were lifted from hg19 to hg38, keeping the highest score if two hg19 locations were mapped to one hg38 location.
6. repeats
columns repeatMasker, tandemRepeat
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ (January, 2014)
   hg38.fa.out: RepeatMasker
   hg38.trf.bed: Tandem Repeats Finder
7. GERP conservation scores
column consScoreGERP
The manuscript describing the program Genomic Evolutionary Rate Profiling (GERP) can be found at http://genome.cshlp.org/content/15/7/901.full.
The GERP website is at http://mendel.stanford.edu/SidowLab/downloads/gerp/index.html.
The hg19 GERP rejected-substitution scores were downloaded from this site in September of 2011, and were lifted to hg38.
8. CADD C scores
column scoreCADD
The Kircher et al. manuscript describing Combined Annotation Dependent Depletion scores can be found at Nat Genet. 2014 Mar;46(3):310-5 (PubMed 24487276).
The CADD scores were downloaded from an internal UW server in September of 2013 (version 1.0), and were lifted to hg38.
CADD scores are Copyright 2013 University of Washington and Hudson-Alpha Institute for Biotechnology (all rights reserved) but are freely available for all academic, non-commercial applications. For commercial licensing information contact Jennifer McCullar (mccullaj@uw.edu). CADD is currently developed by Martin Kircher, Daniela M. Witten, Gregory M. Cooper, and Jay Shendure (http://cadd.gs.washington.edu).
9. microRNA IDs
column microRNAs
ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3 (file dated June 2014, Release 21)
10. Grantham scores
column granthamScore
R. Grantham (1974) Science, vol. 185, pp. 862-864, Table 2.
11. GWAS hits
added to column clinicalAssociation
NHGRI/EMBL-EBI GWAS catalog
Unlike other data in our database, these PubMed links are updated by us weekly.
12. KEGG Pathways
column keggPathway
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ (files dated July 2016)
   keggPathway.txt
   keggMapDesc.txt
13. CpG Islands
column cpgIslands
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ (file dated March 2014)
   cpgIslandExt.txt
14. NHLBI ESP Allele Counts
column genomesESP
Allele counts are read from a local database (SNVs only) and are those of the EVS web site http://evs.gs.washington.edu/EVS/ as of Oct. 2013 (6503 individuals). Locations were lifted over from hg19 to hg38, keeping the variation with the highest frequency if two hg19 locations mapped to one hg38 location.
15. ExAC Allele Counts
column genomesExAC
Exome Aggregation Consortium http://exac.broadinstitute.org, release 0.3, file ExAC.r0.3.sites.vep.vcf, Jan. 2015 (60,706 individuals)
Only variants with a PASS filter and with the AC_Adj parameter greater than 0 are retained.
The locations were lifted over from hg19 to hg38, keeping the variation with the highest African frequency if two hg19 locations mapped to one hg38 location.
For SNVs, the variations were retained only if the hg38 reference allele was the same as the hg19 reference allele.
For insertions, the variations were retained only if the hg38 reference alleles on each side of the insertion were the same as those of hg19.
For deletions, the variations were retained only if the hg38 reference alleles for each deleted base were the same as those of hg19.
16. Protein-Protein Interactions
column PPI
http://string-db.org/ (v10, downloaded June 2015)
Calculations
This section primarily documents SNV columns. For indels, see also How Indels Are Annotated.
inDBSNPOrNot
This column indicates whether the variation has been observed before. A dbSNP indicator shows that the variation is in our GVS database, whether genotypes are available or not. This database currently holds variations from dbSNP build 150 (see 1 above). The string "dbSNP_" is followed by the dbSNP build number in which the variation was first recorded (for the chosen rs ID, see below). If the variation is not found in dbSNP, the column contains "none".
chromosome
This column echoes the chromosome in the submitted file, which is assumed to be hg38/NCBI38.
The NCBI38 chromosome accession numbers used are (useful for forming HGVS nomenclature):
chr1    NC_000001.11
chr2    NC_000002.12
chr3    NC_000003.12
chr4    NC_000004.12
chr5    NC_000005.10
chr6    NC_000006.12
chr7    NC_000007.14
chr8    NC_000008.11
chr9    NC_000009.12
chr10   NC_000010.11
chr11   NC_000011.10
chr12   NC_000012.12
chr13   NC_000013.11
chr14   NC_000014.9
chr15   NC_000015.10
chr16   NC_000016.10
chr17   NC_000017.11
chr18   NC_000018.10
chr19   NC_000019.10
chr20   NC_000020.11
chr21   NC_000021.9
chr22   NC_000022.11
chrX    NC_000023.11
chrY    NC_000024.10
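
As an illustration of how these accessions can be used to form HGVS nomenclature, here is a minimal Python sketch. The function name hgvs_genomic_snv and the example coordinates are hypothetical and are not part of SeattleSeq; only the accession numbers come from the table above. It covers simple substitutions only.

```python
# Accession numbers copied from the table above (hg38/NCBI38).
CHROM_ACCESSION = {
    "chr1": "NC_000001.11", "chr2": "NC_000002.12", "chr3": "NC_000003.12",
    "chr4": "NC_000004.12", "chr5": "NC_000005.10", "chr6": "NC_000006.12",
    "chr7": "NC_000007.14", "chr8": "NC_000008.11", "chr9": "NC_000009.12",
    "chr10": "NC_000010.11", "chr11": "NC_000011.10", "chr12": "NC_000012.12",
    "chr13": "NC_000013.11", "chr14": "NC_000014.9", "chr15": "NC_000015.10",
    "chr16": "NC_000016.10", "chr17": "NC_000017.11", "chr18": "NC_000018.10",
    "chr19": "NC_000019.10", "chr20": "NC_000020.11", "chr21": "NC_000021.9",
    "chr22": "NC_000022.11", "chrX": "NC_000023.11", "chrY": "NC_000024.10",
}


def hgvs_genomic_snv(chrom, position, ref, alt):
    """Build an HGVS genomic substitution string for a single-base change.

    Hypothetical helper: position is 1-based, ref and alt are single bases.
    """
    accession = CHROM_ACCESSION[chrom]
    return f"{accession}:g.{position}{ref}>{alt}"


# Illustrative coordinates only.
print(hgvs_genomic_snv("chr7", 117559590, "G", "A"))
```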
position
This column echoes the position in the submitted file, which is assumed to be hg38/NCBI38.
referenceBase
If the reference allele is included in the submitted file, this column echoes it. Otherwise the reference allele is generated by the server (see 3 above). (In the case of indels submitted in a VCF file, this column may echo the REF column in the VCF file, depending on the output-format choice.)
sampleGenotype
This column echoes the genotypes in the submitted file, expressed as single-character ambiguity codes. If there are multiple individuals, a comma-separated list is given. For a VCF input file with no genotypes, the ambiguity code will be that for the reference allele plus any alternative alleles.
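
The single-character ambiguity codes are the standard IUPAC nucleotide codes. The following Python sketch shows one way a set of alleles can be collapsed into such a code; the function names are hypothetical and this is not necessarily how the server does it.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def ambiguity_code(alleles):
    """Return the IUPAC code for an iterable of single-base alleles."""
    key = "".join(sorted({a.upper() for a in alleles}))
    return _IUPAC[key]


def sample_genotype_column(genotypes):
    """Join per-individual codes with commas (hypothetical column formatter)."""
    return ",".join(ambiguity_code(g) for g in genotypes)


print(ambiguity_code("AG"))                        # heterozygous A/G
print(sample_genotype_column(["AG", "CC", "GT"]))  # three individuals
```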
sampleAlleles
This column lists all the alleles observed (those in the sampleGenotype column). If the submitted file covers only one individual, there will be two alleles if the genotype is heterozygous. For multiple individuals, this is a list of all unique alleles (excluding N). For a VCF input file with no genotypes, the list will include the reference allele (REF) plus any alternative alleles (ALT). (In the case of indels submitted in a VCF file, this column may echo the ALT column in the VCF file, depending on the output-format choice.)
dbSNPGenotype
This column is present only if genotypes are submitted for an individual with a known dbSNP individual ID. In this case, the column gives the dbSNP genotype for the individual, if known.
allelesDBSNP
This column lists alleles for all populations and individuals in dbSNP, derived from the HGVS notation. The string "NA" is used if unknown.
accession
This is the transcript identifier: the NM, XM, or NR ID for the NCBI gene model. (For original tabular output, there is one line in the output file for each transcript, as the function and protein annotations can be different for different alternative splices.)
functionGVS
The functionGVS entries are calculated by us. For a given SNV, we look in the gene model table in our database and find genes that overlap the position. If there are multiple transcripts for the gene, or more than one gene, the function is calculated for each, and there is one line in the output file for each. Next we check whether the SNV is in a coding region of the transcript. If so, the coding bases of the entire transcript are divided into codons, and the codon at the SNV position is extracted, using the human reference sequence. The information thus far is the same regardless of the alleles in the submitted file.

Next the reference base allele and the allele(s) of the data in the submitted file are used. There are 3 possible codons: from the reference base, from one allele, and from the other allele (though one of the alleles is normally the same as the reference base, and then there are 2 different codons). The amino acids are extracted for each codon. If the amino acids are all the same, the SNV is synonymous; otherwise it is missense, stop-gained, or stop-lost. If the SNV is not synonymous and one of the codons is a stop codon, the function is stop-gained or stop-lost, where the gain or loss is relative to the reference codon.

Note that if you are submitting data for one individual, and the genotype is homozygous for the reference allele, functionGVS can only be synonymous, as there is only one amino acid in the set. It is only for an individual with at least one allele different from the reference allele that missense and stop classifications arise. When submitting a file with multiple individuals, all different alleles are considered. There still may be a difference from functionDBSNP if some dbSNP individual has alleles not in the submitted data set.

If the variation is in a coding region, and is in the first two or last two locations of an exon, the "-near-splice" string is added. If the variation is intronic, and the location is within 6 bases of a splice site, the "-near-splice" string is added.
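
The codon comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the codon is given on the coding strand, the standard genetic code applies, and the function name classify_coding_snv is hypothetical. The "-near-splice" suffix and the coding-unknown case are not handled.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snv(ref_codon, offset, sample_alleles):
    """Classify a coding SNV relative to the reference codon.

    ref_codon: reference codon on the coding strand (assumption).
    offset: 0-based position of the SNV within the codon.
    sample_alleles: alleles observed in the submitted data.
    """
    ref_aa = CODON_TABLE[ref_codon]
    sample_aas = [
        CODON_TABLE[ref_codon[:offset] + allele + ref_codon[offset + 1:]]
        for allele in sample_alleles
    ]
    if all(aa == ref_aa for aa in sample_aas):
        return "synonymous"
    if ref_aa == "*":
        return "stop-lost"      # the reference stop codon is changed
    if "*" in sample_aas:
        return "stop-gained"    # a sample allele creates a stop codon
    return "missense"


print(classify_coding_snv("TGG", 2, ["G", "A"]))  # TGG (Trp) vs TGA (stop)
```

A genotype that is homozygous for the reference allele yields only the reference codon, so this sketch returns "synonymous" for it, in line with the note above.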

If, as is the case for a few genes in the NCBI gene model, the number of coding bases is not a multiple of 3, we match the protein sequence to the coding nucleotide sequence, first in the forward direction along the nucleotide sequence until the SNV is found or a mismatch is observed, then if necessary in the reverse direction. If the SNV is not found in a region at the beginning or end of a gene where the protein sequence matches the translated nucleotide sequence, the function is assigned coding-unknown.

Note that for functionGVS and the protein analyses below, we are assuming that if the number of coding bases is a multiple of 3, their translation gives the correct functionGVS etc. For multiples of 3, there is no attempt to match the NCBI protein sequence. This could occasionally give errors, should the human genome contain two or more indels in the nucleotide sequence.

Splice sites are two bases at the 5' end of an intron (splice-donor), and two bases at the 3' end of an intron (splice-acceptor). In the rare cases where there is an intron in the middle of an untranslated region, splice sites will be called there.

The classifications upstream-gene and downstream-gene label SNVs that are outside transcribed regions, but within 5000 bp of a transcribed region.

For the NCBI gene model, the complete list of SNV functions is

intergenic
intron
intron-near-splice
downstream-gene
upstream-gene
3-prime-UTR
5-prime-UTR
non-coding-exon
non-coding-exon-near-splice
coding-unknown
coding-unknown-near-splice
synonymous
synonymous-near-splice
splice-acceptor
splice-donor
missense
missense-near-splice
stop-lost
stop-lost-near-splice
stop-gained
stop-gained-near-splice

functionDBSNP
For SNVs, if the line in the output has a NCBI accession ID, functionDBSNP will be a list of dbSNP functions for that accession (normally only one). For indels, functionDBSNP is a list for any overlapping accessions.
rsID
If the variation is in our GVS database of dbSNP entries, the rs ID is given. Otherwise the entry is zero. Presence of variations in this GVS database does not require that there be genotypes available (hence some low-quality variations may be included), but it does require that the dbSNP group mapped the variation to a unique chromosomal location. For SNVs, there is no matching of alleles to those of dbSNP. Only locations are used. If there is more than one rs ID at the submitted SNV location, an rs ID with genotypes is chosen if available; ties are broken by choosing the lowest rs ID. For indels, the matching is more complex. See How Indels Are Annotated.
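
The rs ID selection rule for SNVs can be illustrated with a short Python sketch. The input shape (a list of (rs number, has-genotypes) pairs) and the function name are assumptions made for the example.

```python
def choose_rs_id(candidates):
    """Pick one rs ID among those mapped to a SNV location.

    candidates: list of (rs_number, has_genotypes) tuples (hypothetical shape).
    Returns 0 when no rs ID is known at the location.
    """
    if not candidates:
        return 0
    with_genotypes = [rs for rs, has_genotypes in candidates if has_genotypes]
    # Prefer rs IDs that have genotypes; otherwise consider all of them.
    pool = with_genotypes if with_genotypes else [rs for rs, _ in candidates]
    # Ties are broken by taking the lowest rs ID.
    return min(pool)


print(choose_rs_id([(100, False), (200, True), (300, True)]))
```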
aminoAcids
The bases in the coding region are divided into codons. If the number of coding bases is an exact multiple of 3 or if the protein sequence can be matched to the nucleotide sequence in the SNV region, the codon of the SNV is identified. A list of alleles at the SNV site is made: the reference allele, then the alleles observed. For each allele on this list, an amino acid is assigned. A list of unique amino acids is displayed in the column. Note that the reference allele is always in the nucleotide list, even if none of the alleles observed is that of the reference sequence. The amino acid corresponding to the reference allele is listed first. There are normally two amino acids in the list, but there can be more if more than two alleles are in the list.

If the SNV is synonymous, the single amino acid corresponding to the reference allele is reported in this column.
granthamScore
If the variation causes an amino acid change, as indicated in the aminoAcids column (above), a corresponding Grantham score is supplied for each change found. Grantham scores (R. Grantham (1974) Science) estimate the strength of a change from one amino acid to another. The first amino acid in the aminoAcids column is that of the reference sequence. In rare cases when two or more alleles different from the reference are represented in the submitted genotypes, and there are two or more amino acids different from that of the reference sequence, the Grantham score is a comma-separated list for each alternative amino acid against that of the reference amino acid.
proteinPosition
This is the position of the amino acid in the protein, beginning at the N-terminus with the first amino acid at position 1, followed by the total number of amino acids in the protein. The total includes a count for the stop codon. These positions are calculated after the bases in the coding region are divided into codons. The value "NA" is reported if the number of coding bases is not an exact multiple of 3 and the protein sequence cannot be matched to the nucleotide sequence in the SNV region.
cDNAPosition
This is the position of a nucleotide in the coding sequence of a gene, beginning at the 5' end of the gene, with the first base at position 1 (for SNVs only so far). (The coding region includes the three bases of the stop codon.) The value will be reported even if the number of coding bases is not an exact multiple of 3. The value "NA" is reported if the position is not in the coding region of a gene.
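
The relationship between cDNAPosition and proteinPosition can be sketched as follows, for the simple case where the number of coding bases is a multiple of 3. The function name is hypothetical, and returning None stands in for the "NA" case; the protein-matching fallback is not shown.

```python
def protein_position(cdna_position, coding_length):
    """Map a 1-based coding position to (residue number, total residues).

    coding_length counts all coding bases including the stop codon, so the
    total includes one count for the stop codon.
    """
    if coding_length % 3 != 0:
        return None  # would need protein matching; treated as unknown here
    residue = (cdna_position - 1) // 3 + 1
    total = coding_length // 3
    return residue, total


print(protein_position(4, 300))  # fourth coding base is in the second codon
```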
polyPhen
PolyPhen prediction impacts were loaded into our GVS database from bulk download files (see 5 above), and impacts should be available for most missense SNVs. If PolyPhen was able to analyze the SNV (and the particular allele change), the polyPhen column entry is "probably-damaging", "possibly-damaging", or "benign" if the "class" menu item was selected on the home page form, or a number if the "score" menu item was chosen. The score is a number between 0 and 1, where 1 is the most damaging. There is a choice of predictions from the HumDiv training set or the HumVar training set (see genetics.bwh.harvard.edu/pph2/dokuwiki/overview). If the location/alleles combination is not in the PolyPhen file (usually because the variation is not missense), the value will be "unknown". Multiple PolyPhen-2 calls, presumably from multiple transcripts, are condensed to the most deleterious call, with no attempt to match our transcripts to theirs.
consScoreGERP
The entries in this column are rejected-substitution scores from the program GERP, Stanford University, range of -12.3 to 6.17, with 6.17 being the most conserved (see 7 above). Note that these scores have single genome-location resolution. The hg19 scores were lifted to hg38, keeping the highest score if two hg19 locations mapped to one hg38 location.
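
The "keep the highest score" rule used after lifting hg19 locations to hg38 can be sketched in Python. The input shape (tuples of hg38 chromosome, hg38 position, score) and the function name are assumptions for illustration; the liftover step itself is not shown.

```python
def collapse_lifted_scores(lifted):
    """Keep the highest score when several hg19 sites land on one hg38 site.

    lifted: iterable of (hg38_chrom, hg38_pos, score) tuples produced by a
    liftover step that is outside the scope of this sketch.
    """
    best = {}
    for chrom, pos, score in lifted:
        key = (chrom, pos)
        if key not in best or score > best[key]:
            best[key] = score
    return best


records = [("chr1", 1000, 2.5), ("chr1", 1000, 4.1), ("chr1", 1001, -3.0)]
print(collapse_lifted_scores(records))
```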
scoreCADD
The entries in this column are Combined Annotation Dependent Depletion C scores, range of 0 to 99, with 99 being the most likely deleterious (see 8 above). These scores have single genome-location resolution and are dependent on the alternate allele (8.6 billion variations). They are phred-like (logarithmic) scores. Variations in the top 10% of the 8.6 billion will receive a CADD score of 10 or greater; top 1% a score of 20 or greater, etc. If there is more than one alternate allele, the highest score is quoted. The hg19 scores were lifted to hg38, keeping the highest score if two hg19 locations mapped to one hg38 location.
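
The phred-like scaling can be illustrated with a short Python sketch. It assumes the score is -10 times the base-10 logarithm of a variant's rank fraction among all scored variations, which reproduces the thresholds quoted above (top 10% gives 10, top 1% gives 20); the function names are hypothetical.

```python
import math


def phred_like_score(rank, total):
    """Scaled score from a variant's rank (1 = most deleterious) among total."""
    return -10.0 * math.log10(rank / total)


def quoted_score(scores_per_alt_allele):
    """With several alternate alleles, the highest score is the one quoted."""
    return max(scores_per_alt_allele)


total = 8_600_000_000
print(round(phred_like_score(total // 10, total), 6))   # top 10%
print(round(phred_like_score(total // 100, total), 6))  # top 1%
```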
chimpAllele
This is the UCSC panTro5 (Pan troglodytes) reference allele (see 4 above).
geneList
This is a comma-separated list of HUGO names, one for each gene whose transcribed region overlaps the variation. For the case of overlapping genes, there can be more than one name in the list, and only one of them will correspond to the transcript in the accession column (though the other transcripts will have separate lines).
AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq
If the SNV has been genotyped in the HapMap project, these are the allele frequencies (in percent) for three populations. If the "minor" radio button was selected, these are the minor-allele frequencies (the frequencies of the least-frequent alleles). If the "reference" radio button was selected, these are the frequencies of the hg38 reference allele. The minor-allele frequency can usually be derived from the reference-allele frequency. As long as the SNV is diallelic, and the reference allele is represented in the HapMap genotypes, the minor-allele frequency in percent is the smaller of two values: (1) the reference frequency in percent, and (2) 100 - (reference frequency in percent).
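
The derivation of the minor-allele frequency from the reference-allele frequency is simple enough to show directly. This sketch assumes a diallelic SNV whose reference allele appears in the HapMap genotypes; the function name is hypothetical.

```python
def minor_allele_freq_percent(ref_freq_percent):
    """Minor-allele frequency (percent) for a diallelic SNV."""
    return min(ref_freq_percent, 100.0 - ref_freq_percent)


print(minor_allele_freq_percent(80.0))  # reference allele is the major allele
print(minor_allele_freq_percent(30.0))  # reference allele is the minor allele
```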
hasGenotypes
This column records whether the dbSNP database has genotypes available for one or more individuals, as opposed to just alleles and frequencies.
dbSNPValidation
If the variation is in dbSNP, this is their list of categories of evidence supporting the variation.
repeatMasker, tandemRepeat
If the variation is in one or more repeat regions, these are listed (see 6 above).
clinicalAssociation
This is a list of links (see 1 and 11 above).
distanceToSplice
The nearest splice site is a function only of the variation location, and is not constrained to genes in the column geneList. (For example, the geneList column could contain one gene that's a single exon -- no splice sites -- and the distance to a splice site in a gene some distance away would be reported.) In some transcripts there are introns in the middle of an untranslated region. Splice sites for such introns are included.

If the distance is greater than 25,000,000 bp (as can be the case on the Y chromosome), the distance is reported as 99999999.

Splice sites are two bases at the 5' end of an intron, and two bases at the 3' end of an intron. The first and last base of an exon would have a distance of 1; a canonical splice site would have a distance of 0. The intron base next to the innermost canonical site would be 1.
	XXGTII...IIAGXX
	210012...210012
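
The numbering in the diagram above can be reproduced with a small Python sketch. It assumes introns are given as 1-based inclusive (start, end) coordinates on one chromosome and are at least 4 bases long; the function names and the handling of an empty intron list are assumptions, not a description of the server code.

```python
FAR_SENTINEL = 99999999     # value reported for very distant splice sites
MAX_DISTANCE = 25_000_000   # threshold in bp


def distance_to_intron(pos, intron_start, intron_end):
    """Distance from pos to the nearest canonical splice site of one intron."""
    if pos < intron_start:              # exon side, before the donor/acceptor
        return intron_start - pos
    if pos > intron_end:                # exon side, after the intron
        return pos - intron_end
    inward = min(pos - intron_start, intron_end - pos)
    return max(0, inward - 1)           # the two bases at each end score 0


def distance_to_splice(pos, introns):
    """Nearest splice-site distance over all introns, with the sentinel cap."""
    if not introns:
        return FAR_SENTINEL             # assumption: nothing to measure against
    nearest = min(distance_to_intron(pos, s, e) for s, e in introns)
    return FAR_SENTINEL if nearest > MAX_DISTANCE else nearest


# Positions 1..14 laid out as XXGTIIIIIIAGXX, with the intron at 3..12.
print([distance_to_splice(p, [(3, 12)]) for p in range(1, 15)])
```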
	
microRNAs
If the variation falls in microRNA genes as listed in miRBase (see 9 above), they are provided in a comma-separated list.
keggPathway
Pathways associated with the gene are extracted from UCSC KEGG files. The pathway list is associated with the locus ID of the gene. (To keep whitespace out of the column, spaces in the descriptions have been removed, and the following characters capitalized.)
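
The whitespace removal can be sketched in Python as below. The function name and the example descriptions are illustrative assumptions; the exact handling of punctuation on the server is not documented here.

```python
def squash_description(description):
    """Remove spaces and capitalize the character that followed each space."""
    words = description.split()
    if not words:
        return ""
    # The first word is kept as-is; each later word gets an upper-case initial.
    return words[0] + "".join(w[:1].upper() + w[1:] for w in words[1:])


print(squash_description("Purine metabolism"))
```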
cpgIslands
A CpG is a C base followed by a G base in the genome sequence. Regions where CpGs are present at higher than normal levels may be associated with promoter regions and thus regulation of gene expression. Regions are extracted from UCSC CpG files (see the description of the cpgIslandExt table at http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=cpgIslandExt). If the variation location is inside one of the regions, the value in the column is the range of the island.
genomesESP
The allele counts are reported for 6503 exomes sequenced in the National Heart, Lung, and Blood Institute Exome Sequencing Project (see 14 above). If the box "split ESP Afr/Eur" was checked, allele counts are reported separately for individuals of African (AA:) and European (EA:) ancestry.
genomesExAC
The allele counts are reported for SNVs and indels of 60,706 exomes analyzed by the Exome Aggregation Consortium (release 0.3, see 15 above). If the box "split ExAC" was checked, allele counts are reported separately for individuals in 7 populations:

AFR African & African American
AMR American
EAS East Asian
FIN Finnish
NFE Non-Finnish European
SAS South Asian
OTH Other

If the input variant is a SNV, only a SNV ExAC match is reported; if it is an indel, only an indel match is reported. There is no other matching to the input alleles; allele counts for any alleles observed in the ExAC data are listed. Position matches are exact, even for indels; it's assumed that ExAC indels and input indels are quoted as far upstream as possible (as is the usual case for VCF files, though this is not the case for HGVS notation).
PPI
Experimental confidence scores (range 0 to 1000) from the Search Tool for the Retrieval of Interacting Genes (STRING), version 10 (see 16 above) were placed in our local GVS database (for NCBI gene-model HUGO names). For the NCBI locus ID associated with the accession, interactions with experimental scores of 700 or greater were sorted by score. Up to 10 interacting proteins with the highest scores were selected. If the eleventh and higher proteins had a score equal to that of the tenth, the list was continued until a lower score was found.
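
The selection rule (threshold of 700, top 10, ties with the tenth included) can be sketched in Python. The input shape (a list of (protein name, experimental score) pairs) and the function name are assumptions for illustration.

```python
def select_interactors(scored, threshold=700, limit=10):
    """Pick the highest-scoring interactors, extending the list through ties.

    scored: list of (protein_name, experimental_score) tuples.
    """
    kept = sorted(
        (item for item in scored if item[1] >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    if len(kept) <= limit:
        return kept
    cutoff = kept[limit - 1][1]  # score of the tenth protein
    # Continue past the limit while the score still equals the cutoff.
    return [item for item in kept if item[1] >= cutoff]


scores = [999, 990, 980, 970, 960, 950, 940, 930, 920, 910, 910, 900, 650]
partners = [(f"P{i}", s) for i, s in enumerate(scores)]
print(len(select_interactors(partners)))  # eleventh entry ties with the tenth
```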
proteinAccession
Entries in the proteinAccession column record the NCBI protein accession ID in the protein files listed above. There was no attempt to check that the translation of the codons in the corresponding coding regions (hg38) matched the protein sequence of the corresponding protein accession ID (except when the number of coding bases is not a multiple of 3). Protein accessions are only quoted for coding and splice variants.
 