SeattleSeq Annotation 137
  

Input Data Sources
The following sources were used to populate the databases supporting SeattleSeq Annotation.
1. dbSNP genotypes and annotations
columns inDBSNPOrNot, dbSNPGenotype, allelesDBSNP, functionDBSNP, rsID, AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq, hasGenotypes, dbSNPValidation, clinicalAssociation
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/genotype (June 2012)
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML (June 2012, mapping, gene function information, population definitions, submitter information)
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_20120616.vcf (June 2012, clinical association)
clinical association note: A list of rs IDs was generated from the dbSNP file clinvar_20120616. For each of the variations, a screen-scraper program extracted the clinical association links from the dbSNP web page for the variation, and wrote the information to our database. This was done in October of 2012.
2. NCBI gene files and synonyms
columns accession, functionGVS, aminoAcids, proteinPosition, cDNAPosition, geneList, proteinSequence, distanceToSplice
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq (Sept. 2012, genes and coding regions)
an NCBI file posted for us in August of 2012: ref_GRCh37.p9_top_level.gff3 (exons; the exon file seq_gene.md we usually use had not been updated for a year)
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa (Sept. 2012, protein sequences)
3. CCDS gene files
columns accession, functionGVS, aminoAcids, proteinPosition, cDNAPosition, geneList, proteinSequence, distanceToSplice
NCBI CCDS FTP archive/Hs37.3/CCDS.current.txt (genes and coding regions, downloaded Sept. 2012)
NCBI CCDS FTP archive/Hs37.3/CCDS_protein.current.faa (protein sequences, downloaded Sept. 2012)
4. UCSC conservation scores (phastCons)
column scorePhastCons
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals (November, 2009)
5. UCSC sequence files
column referenceBase (if not in submitted file)
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ (February 2009)
6. UCSC chimp alleles
column chimpAllele
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/vsPanTro3/axtNet/ (February, 2011)
7. copy number variation
column CNV
http://projects.tcag.ca/variation/tableview.asp?table=DGV_Content_Summary.txt, files variation.hg19.v10.nov.2010.txt and indel.hg19.v10.nov.2010.txt (November, 2010)
8. PolyPhen predictions
column polyPhen
The PolyPhen-2 prediction impacts were placed in the GVS database using bulk-download files (version 2.2.2, downloaded Aug. 2012) from the PolyPhen-2 site.
9. repeats
columns repeatMasker, tandemRepeat
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ (March 20, 2009)
   chromOut.zip: RepeatMasker
   chromTrf.zip: Tandem Repeats Finder
10. GERP conservation scores
column consScoreGERP
The manuscript describing the program Genomic Evolutionary Rate Profiling (GERP) can be found at http://genome.cshlp.org/content/15/7/901.full.
The GERP website is at http://mendel.stanford.edu/SidowLab/downloads/gerp/index.html.
The hg19 GERP rejected-substitution scores were downloaded from this site in September of 2011.
11. microRNA IDs
column microRNAs
ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3 (file dated Aug. 23, 2012)
12. Grantham scores
column granthamScore
R. Grantham (1974) Science, vol. 185, pp. 862-864, Table 2.
13. GWAS hits
added to column clinicalAssociation
NHGRI GWAS catalog
Unlike other data in our database, these PubMed links are updated by us weekly.
14. KEGG Pathways
column keggPathway
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ (downloaded Oct. 2012)
   keggPathway.txt
   keggMapDesc.txt
   ccdsKgMap.txt
15. CpG Islands
column cpgIslands
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ (downloaded Feb. 2012)
   cpgIslandExt.txt
16. Transcription Factor Binding Sites
column tfbs
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ (downloaded Feb. 2012)
   tfbsConsSites.txt
17. NHLBI ESP Allele Counts
column genomesESP
allele counts are read from a local database, and are those of the EVS web site http://evs.gs.washington.edu/EVS/ as of Oct. 2012 (6503 individuals)
18. Protein-Protein Interactions
column PPI
http://string-db.org/newstring_cgi/show_download_page.pl?UserId=VunYC2sl1cxw&sessionId=xnmyDSF37IaSn (downloaded Sept. 2012)
Calculations
This section primarily documents SNV columns. For indels, see also How Indels Are Annotated.
inDBSNPOrNot
This column indicates whether the variation has been observed before. The dbSNP indicator shows that the variation is in our GVS database, whether genotypes are available or not. This database currently holds variations from dbSNP build 137. The string "dbSNP_" is followed by the dbSNP build number for which the variation was first recorded (for the chosen rs ID, see below). If the variation is not found in dbSNP, the column contains "none".
chromosome
This column echoes the chromosome in the submitted file, which is assumed to be hg19/NCBI37.
The NCBI37 chromosome accession numbers used are (useful for forming HGVS nomenclature):
chr1    NC_000001.10
chr2    NC_000002.11
chr3    NC_000003.11
chr4    NC_000004.11
chr5    NC_000005.9
chr6    NC_000006.11
chr7    NC_000007.13
chr8    NC_000008.10
chr9    NC_000009.11
chr10   NC_000010.10
chr11   NC_000011.9
chr12   NC_000012.11
chr13   NC_000013.10
chr14   NC_000014.8
chr15   NC_000015.9
chr16   NC_000016.9
chr17   NC_000017.10
chr18   NC_000018.9
chr19   NC_000019.9
chr20   NC_000020.10
chr21   NC_000021.8
chr22   NC_000022.10
chrX    NC_000023.10
chrY    NC_000024.9
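As an illustration of how these accessions might be used to form an HGVS-style genomic name for a single-base substitution, here is a minimal Python sketch. The function name hgvs_genomic and the abbreviated dictionary are assumptions made for this example; they are not part of SeattleSeq Annotation.

```python
# Hypothetical helper: build an HGVS-style genomic name for a substitution.
# Only a few of the accessions listed above are included, for brevity.
NCBI37_ACCESSIONS = {
    "chr1": "NC_000001.10",
    "chr2": "NC_000002.11",
    "chrX": "NC_000023.10",
    "chrY": "NC_000024.9",
}

def hgvs_genomic(chromosome, position, ref, alt):
    # Accept either "1" or "chr1" as the chromosome name.
    name = chromosome if chromosome.startswith("chr") else "chr" + chromosome
    accession = NCBI37_ACCESSIONS[name]
    return f"{accession}:g.{position}{ref}>{alt}"

print(hgvs_genomic("1", 12345, "A", "G"))  # NC_000001.10:g.12345A>G
```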
position
This column echoes the position in the submitted file, which is assumed to be hg19/NCBI37.
referenceBase
If the reference allele is included in the submitted file, this column echoes it. Otherwise the reference allele is generated by the server (see 5 above). (In the case of indels submitted in a VCF file, this column may echo the REF column in the VCF file, depending on the output-format choice.)
sampleGenotype
This column echoes the genotypes in the submitted file, expressed as single-character ambiguity codes. If there are multiple individuals, a comma-separated list is given. For a VCF input file with no genotypes, the ambiguity code will be that for the reference allele plus any alternative alleles.
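The mapping from a pair (or set) of alleles to a single-character code can be sketched as follows, assuming the standard IUPAC nucleotide ambiguity codes are the ones meant here. The function name genotype_code is hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes (assumed to be the codes used).
_CODE_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
_BASES_TO_CODE = {frozenset(b): c for c, b in _CODE_TO_BASES.items()}

def genotype_code(alleles):
    """Return the ambiguity code for an iterable of single-base alleles."""
    return _BASES_TO_CODE[frozenset(a.upper() for a in alleles)]

print(genotype_code(["A", "G"]))  # heterozygous A/G
print(genotype_code(["C", "C"]))  # homozygous C
```

For a VCF line with no genotypes, the same function could be called with the REF allele plus all ALT alleles.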
sampleAlleles
This column lists all the alleles observed (those in the sampleGenotype column). If the submitted file covers only one individual, there will be two alleles if the genotype is heterozygous. For multiple individuals, this is a list of all unique alleles (excluding N). For a VCF input file with no genotypes, the list will include the reference allele (REF) plus any alternative alleles (ALT). (In the case of indels submitted in a VCF file, this column may echo the ALT column in the VCF file, depending on the output-format choice.)
dbSNPGenotype
This column is present only if genotypes are submitted for an individual with a known dbSNP individual ID. In this case, the column gives the dbSNP genotype for the individual, if known.
allelesDBSNP
This column lists any alleles for genotypes recorded in dbSNP for any individual in any population. The string "NA" is used if unknown.
accession
This is the transcript identifier: the NM or XM ID for the NCBI gene model or the CCDS ID for the CCDS gene model. (There is one line in the output file for each transcript, as the SNV function and protein annotations can be different for different alternative splices.)
functionGVS
The functionGVS entries are calculated on the server when you submit a file. For a given SNV, we look in the gene model table in our database and find genes that overlap the position. If there are multiple transcripts for the gene, or more than one gene, the function is calculated for each, and there is one line in the output file for each. Next we check if the SNV is in a coding region of the transcript. If so, the coding bases of the entire transcript are divided into codons, and the codon at the SNV position is extracted, using the human reference sequence. The information thus far is the same regardless of the alleles in the submitted file. Next the reference base allele and the allele(s) of the data in the submitted file are used. There are 3 possible codons: from the reference base, from one allele, and from the other allele (though one of the alleles is normally the same as the reference base, and then there are 2 different codons). The amino acids are extracted for each codon. If the amino acids are all the same, the SNV is synonymous, otherwise missense or nonsense. If not synonymous and if there is a stop codon, the function is stop-gained or stop-lost, where the gain or loss is relative to the reference codon. Note that if you are submitting data for one individual, and if the genotype is homozygous in the reference allele, functionGVS can only be synonymous, as there is only one amino acid in the set. It is only for an individual with at least one allele different from the reference allele that missense and nonsense classifications arise. When submitting a file with multiple individuals, all different alleles are considered. There still may be a difference from functionDBSNP if some dbSNP individual has alleles not in the submitted data set. If the variation is in a coding region, and is in the first two or last two locations of an exon, the "-near-splice" string is added.
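The codon comparison described above can be sketched in Python as follows. This is a simplified illustration, not the server's code: it assumes the reference codon and the observed alleles are already expressed on the transcript strand, uses the standard genetic code, and takes the near-splice status as a precomputed flag. The function name classify_coding_snv is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}

def classify_coding_snv(ref_codon, offset, observed_alleles, near_splice=False):
    """Classify a coding SNV at position `offset` (0..2) of `ref_codon`.

    Returns (function, amino_acids), with the reference amino acid first.
    """
    ref_aa = CODON_TABLE[ref_codon]
    amino_acids = [ref_aa]
    for allele in observed_alleles:
        codon = ref_codon[:offset] + allele + ref_codon[offset + 1:]
        aa = CODON_TABLE[codon]
        if aa not in amino_acids:
            amino_acids.append(aa)

    if len(amino_acids) == 1:
        function = "coding-synonymous"
    elif ref_aa == "*":
        function = "stop-lost"      # reference codon is a stop
    elif "*" in amino_acids:
        function = "stop-gained"    # an observed allele creates a stop
    else:
        function = "missense"
    if near_splice:
        function += "-near-splice"
    return function, amino_acids

print(classify_coding_snv("TGG", 2, ["G", "A"]))  # heterozygous G/A at the third base
```

A genotype homozygous for the reference allele yields a single amino acid and therefore coding-synonymous, which matches the note above.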

If, as is the case for a few genes in the NCBI gene model, the number of coding bases is not a multiple of 3, we match the protein sequence to the coding nucleotide sequence, first in the forward direction along the nucleotide sequence until the SNV is found or a mismatch is observed, then if necessary in the reverse direction. If the SNV is not found in a region at the beginning or end of a gene where the protein sequence matches the translated nucleotide sequence, the function is assigned coding-notMod3.

Note that for functionGVS and the protein analyses below, we are assuming that if the number of coding bases is a multiple of 3, their translation gives the correct functionGVS etc. For multiples of 3, there is no attempt to match the NCBI protein sequence. This could occasionally give errors, should the human genome contain two or more indels in the nucleotide sequence.

Splice sites are two bases at the 5' end of an intron (splice-5), and two bases at the 3' end of an intron (splice-3). Note that for the CCDS gene model, splice sites are only called if they are between coding exons, as there are no untranslated regions available. In the rare cases where there is an intron in the middle of an untranslated region, splice sites would be called for NCBI, but not CCDS.

The classifications near-gene-5 and near-gene-3, used only for the NCBI gene model, label SNVs that are outside transcribed regions, but within 2000 bp of a transcription region.

For the NCBI gene model, the complete list of SNV functions is

intergenic
intron
near-gene-3
near-gene-5
utr-3
utr-5
coding-notMod3
coding-notMod3-near-splice
coding-synonymous
coding-synonymous-near-splice
splice-3
splice-5
missense
missense-near-splice
stop-lost
stop-lost-near-splice
stop-gained
stop-gained-near-splice

For the CCDS gene model, the SNV list is

outsideCoding
intron
coding-synonymous
coding-synonymous-near-splice
splice-3
splice-5
missense
missense-near-splice
stop-lost
stop-lost-near-splice
stop-gained
stop-gained-near-splice

For the CCDS gene model, all genes have a multiple of 3 coding bases with the exception of CCDS56117.1 (LOC388946). However, the protein sequence tracks up to the very end, so we ignore its not-mod-3 status.
functionDBSNP
The dbSNP functions are only available for the NCBI gene model. For any line in the output with a NCBI accession ID, functionDBSNP will be a list of dbSNP functions for that accession (normally only one). For any line in the output with a CCDS ID, functionDBSNP will be a list of dbSNP functions for every NCBI accession ID.
rsID
If the variation is in our GVS database of dbSNP entries, the rs ID is given. Otherwise the entry is zero. Inclusion in this GVS database does not require that genotypes be available (hence some low-quality variations may be included), but it does require that the dbSNP group mapped the variation to a unique chromosomal location. For SNVs, there is no matching of alleles to those of dbSNP. Only locations are used. If there is more than one rs ID at the submitted SNV location, an rs ID with genotypes is chosen if available; ties are broken by choosing the lowest rs ID. For indels, the matching is more complex. See How Indels Are Annotated.
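For SNVs, the selection rule for multiple rs IDs at one location might look like the following Python sketch. The input format (a list of rs number and has-genotypes pairs) and the function name choose_rs_id are assumptions made for illustration.

```python
def choose_rs_id(candidates):
    """Pick one rs ID from (rs_number, has_genotypes) pairs at a location.

    Returns 0 when there are no candidates, mirroring the column's
    "not in the database" value.
    """
    if not candidates:
        return 0
    with_genotypes = [rs for rs, has_gt in candidates if has_gt]
    # Prefer rs IDs with genotypes; fall back to all candidates otherwise.
    pool = with_genotypes or [rs for rs, _ in candidates]
    return min(pool)  # ties are broken by the lowest rs ID

print(choose_rs_id([(300, False), (200, True), (250, True)]))  # 200
```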
aminoAcids
The bases in the coding region are divided into codons. If the number of coding bases is an even multiple of 3 or if the protein sequence can be matched to the nucleotide sequence in the SNV region, the codon of the SNV is identified. A list of alleles at the SNV site is made: the reference allele, then the alleles observed. For each allele on this list, an amino acid is assigned. A list of unique amino acids is displayed in the column. Note that the reference allele is always in the nucleotide list, even if none of the alleles observed is that of the reference sequence. The amino acid corresponding to the reference allele is listed first. There are normally two amino acids in the list, but there can be more if the observed alleles don't include the reference allele or if there are multiple individuals and the site is triallelic.
granthamScore
If the variation causes an amino acid change as per the aminoAcids column (above), a corresponding Grantham score is supplied for each change found. Grantham scores (R. Grantham (1974) Science) estimate the strength of a change from one amino acid to another. The first amino acid in the aminoAcids column is that of the reference sequence. In rare cases when two or more alleles different from the reference are represented in the submitted genotypes, and there are two or more amino acids different from that of the reference sequence, the Grantham score is a comma-separated list for each alternative amino acid against that of the reference amino acid.
proteinPosition
This is the position of the amino acid in the protein, beginning at the N-terminal with the first amino acid at position 1, followed by the total number of amino acids in the protein. The total includes a count for the stop codon. These positions are calculated after the bases in the coding region are divided into codons. The value "NA" is reported if the number of coding bases is not an even multiple of 3 and the protein sequence cannot be matched to the nucleotide sequence in the SNV region.
cDNAPosition
This is the position of a nucleotide in the coding sequence of a gene, beginning at the 5' end of the gene, with the first base position 1 (for SNVs only so far). (The coding region includes the three bases of the stop codon.) The value will be reported even if the number of coding bases is not an even multiple of 3. The value "NA" is reported if the position is not in the coding region of a gene.
polyPhen
PolyPhen prediction impacts were loaded into our GVS database from bulk download files (see 8 above), and impacts should be available for most missense SNVs. If PolyPhen was able to analyze the SNV (and the particular allele change), the polyPhen column entry depends on the radio button set on the home page form. With "class", the entry is "probably-damaging", "possibly-damaging", or "benign". With "score", the entry is a number between 0 and 1, where 1 is the most damaging; this score echoes the PolyPhen-2 "hdiv_prob" column. If the location/alleles combination is not in the PolyPhen file (usually because the variation is not missense), the value will be "unknown". Multiple PolyPhen-2 calls, presumably from multiple transcripts, are condensed to the most deleterious call, with no attempt to match our transcripts to theirs.
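Condensing several calls to the most deleterious one can be sketched as below. The severity ordering (probably-damaging over possibly-damaging over benign) is an assumption inferred from the class names, and condense_calls is a hypothetical function name.

```python
# Assumed ordering of the three classes, least to most deleterious.
SEVERITY = {"benign": 0, "possibly-damaging": 1, "probably-damaging": 2}

def condense_calls(calls):
    """Reduce a list of PolyPhen-2 class calls to the most deleterious one."""
    if not calls:
        return "unknown"  # location/alleles combination not in the file
    return max(calls, key=SEVERITY.__getitem__)

print(condense_calls(["benign", "possibly-damaging"]))  # possibly-damaging
```

In score mode, the analogous step would simply take the maximum of the numeric scores.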
scorePhastCons
The entries in this column are UCSC conservation scores, 46 placental mammalian species, range of 0 to 1, with 1 being the most conserved (see 4 above). Note that these scores are averaged over several adjacent genome locations.
consScoreGERP
The entries in this column are rejected-substitution scores from the program GERP, Stanford University, range of -12.3 to 6.17, with 6.17 being the most conserved (see 10 above). Note that these scores have single genome-location resolution.
chimpAllele
This is the UCSC panTro3 (Pan troglodytes) reference allele (see 6 above).
CNV
If the variation is in one or more regions of copy number variation (from the Centre for Applied Genomics in Toronto, see 7 above), these are listed.
geneList
This is a comma-separated list of HUGO names, any for which the transcription region overlaps the variation. NCBI lines will list only genes in the NCBI gene model, and CCDS will list only those in the CCDS gene model. Note that since untranslated regions are not available for the CCDS gene model, a variation there will not have the gene name listed in the CCDS case. For the case of overlapping genes, there can be more than one name in the list, and only one of them will correspond to the transcript in the accession column (though the other transcripts will have separate lines).
AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq
If the SNV has been genotyped in the HapMap project, these are the allele frequencies (in percent) for three populations. If the "minor" radio button was selected, these are the minor-allele frequencies (the frequencies of the least-frequent alleles). If the "reference" radio button was selected, these are the frequencies of the hg19 reference allele. The minor-allele frequency can usually be derived from the reference-allele frequency. As long as the SNV is diallelic, and the reference allele is represented in the HapMap genotypes, the minor-allele frequency in percent is the smaller of two values: (1) the reference frequency in percent, and (2) 100 - (reference frequency in percent).
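The derivation of the minor-allele frequency from the reference-allele frequency can be written as a one-line Python function. It holds only under the conditions stated above (diallelic SNV, reference allele represented in the HapMap genotypes); the function name is hypothetical.

```python
def minor_allele_freq(reference_freq_percent):
    """Minor-allele frequency (percent) for a diallelic SNV."""
    return min(reference_freq_percent, 100 - reference_freq_percent)

print(minor_allele_freq(85.0))  # 15.0
print(minor_allele_freq(30.0))  # 30.0
```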
hasGenotypes
This column records whether the dbSNP database has genotypes available for one or more individuals, as opposed to just alleles and frequencies.
dbSNPValidation
If the variation is in dbSNP, this is their list of categories of evidence supporting the variation.
repeatMasker, tandemRepeat
If the variation is in one or more repeat regions, these are listed (see 9 above).
clinicalAssociation
This is a list of links (see 1 and 13 above).
distanceToSplice
Distance to the nearest splice site is straightforward for the NCBI gene model. For the CCDS case, where only coding regions are in the gene model (no untranslated regions), only splice sites between coding exons are considered. In some transcripts there are introns in the middle of an untranslated region. Splice sites for such introns are included only in the NCBI gene model. The nearest splice site is a function only of the variation location, and is not constrained to genes in the column geneList. (For example, the geneList column could contain one gene that's a single exon -- no splice sites -- and the distance to a splice site in a gene some distance away would be reported.) For files with both NCBI and CCDS gene models, the NCBI lines have distances to nearest NCBI sites, and the CCDS lines consider only CCDS genes.

If the distance is greater than 25,000,000 bp (as can be the case on the Y chromosome), the distance is reported as 99999999.

Splice sites are two bases at the 5' end of an intron, and two bases at the 3' end of an intron. The first and last base of an exon would have a distance of 1; a canonical splice site would have a distance of 0. The intron base next to the innermost canonical site would be 1.
	XXGTII...IIAGXX
	210012...210012
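The numbering in the diagram can be reproduced with a short Python sketch. It assumes introns are given as 1-based inclusive (start, end) coordinates, and that a position with no splice site to measure against is reported with the capped value; both are assumptions of this example, as is the function name.

```python
CAP_THRESHOLD = 25_000_000   # distances above this are capped
CAPPED_VALUE = 99_999_999

def distance_to_splice(position, introns):
    """Distance from `position` to the nearest two-base splice site."""
    best = None
    for start, end in introns:
        # Two-base sites at each end of the intron.
        for site_start, site_end in ((start, start + 1), (end - 1, end)):
            if position < site_start:
                d = site_start - position
            elif position > site_end:
                d = position - site_end
            else:
                d = 0  # inside the canonical site
            if best is None or d < best:
                best = d
    if best is None or best > CAP_THRESHOLD:
        return CAPPED_VALUE
    return best

# Intron at 101..200: exon bases 99,100, then intron bases 101..104.
print([distance_to_splice(p, [(101, 200)]) for p in range(99, 105)])
```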
	
microRNAs
If the variation falls in microRNA genes as listed in miRBase (see 11 above), they are provided in a comma-separated list.
keggPathway
Pathways associated with the gene are extracted from UCSC KEGG files. If the gene model is NCBI, the pathway list is associated with the locus ID of the gene. If the gene model is CCDS, the pathway list is associated with the CCDS ID. (To keep whitespace out of the column, spaces in the descriptions have been removed, and the following characters capitalized.)
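The whitespace handling in the parenthetical note might be implemented along these lines. This is a guess at the behavior from the description alone; the function name and the sample descriptions are illustrative.

```python
def compact_description(description):
    """Remove spaces and capitalize the character that followed each one."""
    words = description.split()
    if not words:
        return ""
    first, rest = words[0], words[1:]
    return first + "".join(w[:1].upper() + w[1:] for w in rest)

print(compact_description("Purine metabolism"))       # PurineMetabolism
print(compact_description("MAPK signaling pathway"))  # MAPKSignalingPathway
```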
cpgIslands
A CpG is a C base followed by a G base in the genome sequence. Regions where CpGs are present at higher than normal levels may be associated with promoter regions and thus regulation of gene expression. Regions are extracted from UCSC CpG files (see the description of the cpgIslandExt table at http://genome.ucsc.edu/cgi-bin/hgTables). If the variation location is inside one of the regions, the value in the column is the range of the island.
tfbs
Transcription factor binding sites (TFBS) are short DNA sequence regions where regulatory elements bind. Regions are extracted from UCSC TFBS files (see the description of the tfbsConsSites table at http://genome.ucsc.edu/cgi-bin/hgTables). UCSC data is derived from the Transfac Matrix Database. The data are computational, and not all are expected to be biologically functional. If the variation location is inside one or more of the regions, the value in the column is a comma-separated list of Transfac IDs.
genomesESP
The allele counts are reported for 6503 exomes sequenced in the National Heart, Lung, and Blood Institute Exome Sequencing Project (see 17 above). If the box "split ESP Afr/Eur" was checked, allele counts are reported separately for individuals of African (AA:) and European ancestry (EA:).
PPI
Experimental confidence scores (range 0 to 1000) from the Search Tool for the Retrieval of Interacting Genes (STRING), version 9.0 (see 18 above) were placed in our local GVS database (for NCBI gene-model HUGO names). For the NCBI locus ID associated with the accession, interactions with experimental scores of 700 or greater were sorted by score. Up to 10 interacting proteins with the highest scores were selected. If the eleventh and higher proteins had a score equal to that of the tenth, the list was continued until a lower score was found. CCDS locus IDs are handled the same way if the gene HUGO name corresponding to the CCDS ID is in the PPI table of the GVS database.
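The selection rule (score of 700 or greater, top 10, extended through ties with the tenth) can be sketched in Python as follows. The input is assumed to be a dictionary from interactor name to experimental score; the function name and the alphabetical ordering among equal scores are choices made for this example.

```python
def top_interactors(scores, min_score=700, limit=10):
    """Return (name, score) pairs: the top `limit` by score, plus ties."""
    ranked = sorted(
        ((name, s) for name, s in scores.items() if s >= min_score),
        key=lambda item: (-item[1], item[0]),  # highest score first
    )
    if len(ranked) <= limit:
        return ranked
    cutoff = ranked[limit - 1][1]  # score of the tenth protein
    # Keep everything scoring at least as high as the tenth.
    return [item for item in ranked if item[1] >= cutoff]

# Hypothetical scores: nine high scorers, a tie at 800, then lower ones.
example = {f"P{i}": 1000 - i for i in range(1, 10)}
example.update({"P10": 800, "P11": 800, "P12": 750, "P13": 600})
print(len(top_interactors(example)))  # 11
```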
proteinSequence
Entries in the proteinSequence column come directly from the protein files listed above. There was no attempt to check that the translation of the codons in the corresponding coding regions (hg19) matched these protein sequences.
 