SeattleSeq Annotation 154

SeattleSeq Annotation Build Notes

The current SeattleSeq Annotation version is 16.01, May 9, 2022

For multiallelic indels and for VCF files with individual genotypes, the column "sampleGenotype" now shows a list of genotypes represented.

Build notes for 16.00, September 30, 2020

The dbSNP build is now 154. The genome positions are still those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38).

Build notes for 15.00, November 12, 2019

The dbSNP build is now 153. The genome positions are still those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38).

The dbSNPValidation column is still present, but always contains the text "NA".

Build notes for 14.00, August 17, 2018

The dbSNP build is now 151. The genome positions are still those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38).

We are using the newer ClinVar format: variants keyed by location rather than dbSNP rs ID.

We are no longer serving out HapMap data or dbSNP individual genotypes.

Build notes for 13.00, July 27, 2017

The dbSNP build is now 150. The genome positions are still those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38).

Build notes for 12.01, July 18, 2016

Return of dbSNP rs IDs has changed for VCF-out format. Previously our own chosen rs ID replaced the rs ID in the third column of the submitted VCF file only if the submitted value was unknown ("."). If a known rs ID was submitted in the third column, our rs ID was not returned. We now leave the third column as it is in the submitted file, and add a RSID field to the INFO column that is populated with our own interpretation of the matching rs ID.
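For downstream scripts, the change means the submitted ID and the SeattleSeq-chosen ID now live in different places on the same line. The following is a minimal sketch of reading both from a VCF-out data line. The helper name `get_info_field` and the example line are illustrative assumptions; only the RSID tag name comes from the note above, and the exact form of its value is assumed here.

```python
def get_info_field(vcf_line, key):
    """Return the value of `key` in the INFO column (8th column) of a VCF data line, or None."""
    columns = vcf_line.rstrip("\n").split("\t")
    if len(columns) < 8:
        return None
    for entry in columns[7].split(";"):
        name, separator, value = entry.partition("=")
        if name == key and separator:
            return value
    return None


# Made-up data line: the third column holds the ID as submitted,
# and INFO carries the RSID field added by the annotation.
example_line = "1\t12345\trs111\tA\tG\t50\tPASS\tDP=10;RSID=rs222"
submitted_id = example_line.split("\t")[2]
annotated_id = get_info_field(example_line, "RSID")
```
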

Build notes for 12.00, June 27, 2016

The dbSNP build is now 147. The genome positions are still those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38).

Build notes for 11.00, October 4, 2015

The dbSNP build is now 144. The genome positions are still those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38).

ExAC allele counts are now those of release 0.3 (Jan. 2015).

Matching of dbSNP indels to those of the submitted indels has been modified slightly to include use of dbSNP validations when there is more than one dbSNP indel at the same location. Also for dbSNP matching, downstream-shifting of submitted indels in dinucleotide-repeat regions has been added to the homopolymer-track shifting of SeattleSeqAnnotation141.

For automated jobs, return of a compressed file is simplified. It is no longer necessary to add a "# compress" parameter line in the submitted file if the submitting code sets the "compressAuto" parameter to "true" (and reads the returned file as binary, as before).

Newer versions of GATK may document spanning deletions when there are multiple individuals in the VCF file. This puts strings like "*" or "<*:DEL>" into a comma-separated list in the ALT column. Such alternative-allele strings are ignored, and any genotypes referencing the ignored alternate are treated as "./.". Existence of such lines is indicated by a note on the submission-received page.
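The handling of spanning deletions described above can be sketched as follows. This is an illustration of the stated rule, not the site's own code. The function name is hypothetical, and it assumes that GT is the first colon-separated subfield of the sample column, as in standard VCF.

```python
import re

# ALT strings that the note above says are ignored.
IGNORED_ALTS = {"*", "<*:DEL>"}


def mask_spanning_deletions(alt_column, sample_field):
    """Return the GT string, or "./." if it references an ignored ALT allele."""
    alts = alt_column.split(",")
    # Allele index 0 is REF; ALT alleles are numbered from 1.
    ignored_indexes = {str(i + 1) for i, alt in enumerate(alts) if alt in IGNORED_ALTS}
    genotype = sample_field.split(":")[0]
    called_alleles = re.split(r"[/|]", genotype)
    if any(allele in ignored_indexes for allele in called_alleles):
        return "./."
    return genotype
```

With ALT `G,*`, a genotype of `0/2` refers to the ignored `*` allele and would be reported as `./.`, while `0/1` is left unchanged.
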

The current server load is available at

Build notes for 10.02, April 12, 2015

ExAC Allele Counts have been added.

Build notes for 10.01, January 4, 2015

The function for an insertion between coding and splice sites is now called coding or frameshift (near splice) instead of splice.

Matching an indel to one in dbSNP has been augmented to handle insertion or deletion of repeated alleles when the indel location is at the beginning of a homopolymer track. For example, insertion or deletion of AA at the beginning of an A homopolymer track can be matched with dbSNP indels downstream, but still within the homopolymer track.
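To illustrate why such matches are legitimate, the sketch below lists every insertion point within a homopolymer track that gives the same resulting sequence. It is a simplified assumption about the idea, not the matching code used by the site. The function name, the 0-based coordinates, and the example sequence are all hypothetical.

```python
def equivalent_insertion_points(reference, position, allele):
    """List 0-based insertion points equivalent to inserting `allele` before reference[position].

    Only handles an allele made of one repeated base, inserted at the start of
    (or inside) a homopolymer track of that base; otherwise returns [position].
    """
    if not allele or position >= len(reference):
        return [position]
    base = allele[0]
    if set(allele) != {base} or reference[position] != base:
        return [position]
    end = position
    while end < len(reference) and reference[end] == base:
        end += 1
    # Inserting anywhere up to the end of the track yields the same sequence.
    return list(range(position, end + 1))


points = equivalent_insertion_points("CGAAAAT", 2, "AA")
```

A dbSNP insertion of AA recorded at any of these points describes the same variant, so it can be matched to a submitted insertion at the start of the track.
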

Build notes for 10.00, November 1, 2014

The genome positions are now those of the human reference sequence of December 2013 (UCSC hg38, NCBI build 38). Submitted data must be referenced to that sequence. The dbSNP build is now 141. Variant locations will be tested to be in the ranges of the hg38 chromosomes.

These columns have been removed: Conservation Score phastCons, Copy Number Variations, and Transcription Factor Binding Sites. They may be added later, depending on demand and availability for the hg38 sequence.

There is only one VCF-input choice, which can handle SNVs and/or indels. The "SeattleSeq Annotation original allele columns" option has been discontinued for VCF in, tabular out, leaving only "VCF-like allele columns" for indels in this case.

Matching of simple insertions and deletions to those in dbSNP is much more constrained than in previous versions. See this description.

Files output in VCF format now have the extension ".vcf". For these VCF-out files, the PolyPhen type has been changed from String to Float.

For tabular output, the header "proteinSequence" has been changed to "proteinAccession".

Build notes for 9.02, April 20, 2014

CADD C scores have been added (freely available for academic, non-commercial applications only, or see About SeattleSeq Annotation).

Build notes for 9.01, January 31, 2014

Old function classifications coding-notMod3 and coding-synonymous were still appearing for genes where the number of coding bases is not a multiple of 3. This has been fixed.

Build notes for 9.00, January 17, 2014

The dbSNP build is now 138. The CCDS gene model has been discontinued, leaving only the NCBI gene model. NCBI non-coding genes are now included (accession IDs beginning with NR_ and XR_ added to the previous NM_ and XM_). Also, there are many more XM_ transcripts in the NCBI gene model now.

Chimp alleles, copy number variations, microRNAs, KEGG pathways, and protein-protein interactions have all been updated (and GWAS hits are updated weekly).

There are numerous changes in our function classifications, and downstream software using these values will need to be updated. The old near-gene-3 becomes downstream-gene, and near-gene-5 becomes upstream-gene, both with the interval increased from 2000 bp to 5000 bp. utr-3 becomes 3-prime-UTR and utr-5 becomes 5-prime-UTR. coding-notMod3 changes to coding-unknown. coding-synonymous is shortened to synonymous. splice-3 becomes splice-acceptor and splice-5 becomes splice-donor. A new classification is added: intron-near-splice for variations within 6 bp of a splice donor or acceptor site. In-frame indels are still labeled coding. non-coding-exon (and non-coding-exon-near-splice) is a new label for variations in exons of non-coding genes.
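For downstream software that compares old and new result files, the renamed labels listed above can be collected in a lookup table. This is a minimal sketch; the names `RENAMED_FUNCTIONS` and `update_function_label` are illustrative. It covers only the one-to-one renames stated in this note. It does not handle labels carrying a "-near-splice" suffix, and it cannot reproduce the wider 5000 bp interval or the new intron-near-splice and non-coding-exon classes, which require re-annotation.

```python
# Old label -> new label, as listed in the 9.00 build note.
RENAMED_FUNCTIONS = {
    "near-gene-3": "downstream-gene",
    "near-gene-5": "upstream-gene",
    "utr-3": "3-prime-UTR",
    "utr-5": "5-prime-UTR",
    "coding-notMod3": "coding-unknown",
    "coding-synonymous": "synonymous",
    "splice-3": "splice-acceptor",
    "splice-5": "splice-donor",
}


def update_function_label(label):
    """Return the 9.00 name for an old label; labels that were not renamed pass through."""
    return RENAMED_FUNCTIONS.get(label, label)
```
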

Matching indels to those in dbSNP has been modified slightly. If there are more than two indels at a given dbSNP location, two are selected, and the others are dropped. The attempt to match alleles is now based entirely on the alleles of dbSNP's HGVS notation. Annotation of indels will be much faster for this build, as cached SNV annotations are now used to form the indel annotation.

Full protein sequences are no longer available, only the protein accession ID. In the case of the original tabular output format, the column is still labeled proteinSequence. In the case of VCF output format, the tag has been changed from PS to PAC.

For synonymous SNVs, the single amino acid is reported in the aminoAcids column.

The GFF format has been upgraded to GFF3, requiring tab-separated columns and a genotype tagged by the entry on the home page.

Build notes for 8.08, October 13, 2013

For input format "VCF SNVs and Indels", VCF files with individual columns, but no genotypes in those columns, are better handled.

Build notes for 8.07, July 3, 2013

VCF files with no genotypes may now be submitted with the "VCF SNVs and Indels (both)" input format. The "VCF (indels only)" input format will still not accept files with no genotypes, but pure indel files with no genotypes can be submitted with "VCF SNVs and Indels (both)".

The functions for indels near coding or splice boundaries have been tuned slightly. If the indel is an insertion, and the insertion site is between a canonical splice site and the adjacent intron, the function will be called intron and not splice-5 or splice-3. Proximity to a splice site may be recovered by looking at the distanceToSplice column. If the indel is a deletion, and the deletion includes both coding and non-coding bases, the number of deleted coding bases will be counted, and the function will be coding or frameshift depending on the number of coding bases deleted (rather than depending only on whether the total number of deleted bases is an exact multiple of 3).
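The deletion rule can be sketched as below. This is an illustration under stated assumptions rather than the site's code: the function name and the representation of positions as Python ranges and sets are hypothetical, and the rule that a multiple of 3 gives coding while anything else gives frameshift is inferred from the text. Deletions that remove no coding bases are outside the rule, so the sketch returns None for them.

```python
def classify_boundary_deletion(deleted_positions, coding_positions):
    """Classify a deletion by the number of coding bases it removes."""
    coding_deleted = sum(1 for position in deleted_positions if position in coding_positions)
    if coding_deleted == 0:
        return None  # not covered by the rule described above
    return "coding" if coding_deleted % 3 == 0 else "frameshift"


# Hypothetical exon whose coding bases occupy positions 100 through 199.
coding = set(range(100, 200))
six_base_deletion = classify_boundary_deletion(range(98, 104), coding)
other_six_base_deletion = classify_boundary_deletion(range(97, 103), coding)
```

Both deletions remove six bases in total, but the first removes four coding bases and the second removes three, so only the second is called coding.
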

Build notes for 8.06, June 13, 2013

An option to split the ESP allele counts by African and European ancestry is provided.

On May 19, ESP allele counts for the Y chromosome were added to our database.

For automated submissions, progress and download URLs now contain the string ABORT if the submitted file has too many lines, and the Java examples have been modified to stop on the ABORT string.

Build notes for 8.05, May 5, 2013

A new section at the bottom of the home page allows entry of a single SNV for table display of annotation.

Build notes for 8.04, April 8, 2013

SNV-only VCF files with individual columns, but no genotypes in those columns, are better handled.

Build notes for 8.03, April 5, 2013

A few of the most recent GWAS links were missing (clinical association column). This has been corrected.

Files larger than 1 GB (uncompressed) will no longer be accepted.

Build notes for 8.02, March 20, 2013

The site will no longer attempt to annotate structural variants or insertions/deletions of size greater than 1000 bp. Pindel VCF files will be accepted, but for original output format long indels will be skipped; for VCF out there will only be an echo of the input line.

Internally there was an upgrade from JBoss 5 to JBoss 7.

Build notes for 8.01, January 23, 2013

Most SNV data is cached for the NCBI gene model, so the response will be much faster for SNVs when selecting "NCBI full genes".

For the VCF-out format, the amino-acid tag has been changed from "AC" to "AAC" to avoid conflict with another commonly used tag.

For input file format "VCF SNVs and Indels", the VCF output format is now available.

Protein-protein interactions (STRING 9.0 experimental) have been added as an annotation column.

Inclusion of variations in repeat regions is now available for all locations, not just those of variations in dbSNP.

Matching indels to those in dbSNP has been tuned slightly by using the dbSNP HGVS notation to identify a better allele match when there are multiple choices.

Build notes for 8.00, November 14, 2012

The dbSNP build is now 137. The gene models (NCBI and CCDS), PolyPhen-2 assignments, clinical associations, KEGG pathways, ESP allele counts (now for the final 6503 individuals), and microRNAs have been updated.

We are now reporting PolyPhen-2 impacts using bulk download files, so results will be there for most missense SNVs. The "score" radio button is now the default.

For indel-only VCF file submissions, "VCF-like allele columns" is now the default.

The "NHLBI ESP Allele Counts" checkbox defaults to on.

Build notes for 7.05, June 19, 2012

Automated submissions may now be made without an autoFile line in the submitted file. (See the Automated Submission section in the how-to-use page.)

Build notes for 7.04, May 17, 2012

CpG Islands, Transcription Factor Binding Sites, and NHLBI ESP Allele Counts have been added.

Build notes for 7.03, April 15, 2012

KEGG pathways have been added.

There are small changes to the notation in the footer lines, including use of "stop" rather than "nonsense".

Build notes for 7.02, March 25, 2012

It is now possible to submit VCF files with SNVs and indels in the same file. The "VCF SNVs and Indels (both)" option will eventually replace the separate SNV and indel formats, but for now the "both" option has these restrictions: (1) the VCF version must be 4.0 or higher, (2) there must be genotypes for at least one individual (10 or more columns), and (3) the output is only available in the original format, with "VCF-like allele columns" for indels.
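Restrictions (1) and (2) can be checked locally before submission. The sketch below is a hypothetical pre-check, not part of the site. It assumes the standard VCF "##fileformat=VCFv..." header line and a tab-separated "#CHROM" column-header line, and it reflects the restrictions as they stood for build 7.02.

```python
def check_both_format(lines):
    """Return a list of problems against the 7.02 restrictions (empty if none found)."""
    version = None
    column_count = 0
    for line in lines:
        if line.startswith("##fileformat=VCFv"):
            version = line.strip()[len("##fileformat=VCFv"):]
        elif line.startswith("#CHROM"):
            column_count = len(line.rstrip("\n").split("\t"))
            break
    problems = []
    try:
        major_minor = tuple(int(part) for part in version.split("."))
    except (AttributeError, ValueError):
        major_minor = None  # header missing or not of the form N.N
    if major_minor is None or major_minor < (4, 0):
        problems.append("VCF version must be 4.0 or higher")
    if column_count < 10:
        problems.append("need genotypes for at least one individual (10 or more columns)")
    return problems
```
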

For the PolyPhen column, there is a choice of "class", as before, or the new "score", a number between 0 and 1.

PubMed links from the NHGRI GWAS catalog have been added to the clinicalAssociation column.

Build notes for 7.01, December 2, 2011

Issues in 7.00 with reverse-complementing dbSNP alleles and with classifying dbSNP variations as SNPs or indels have been resolved.

Build notes for 7.00, November 17, 2011

The dbSNP build is now 134. The gene models (NCBI and CCDS), GERP scores, chimp alleles, PolyPhen-2 assignments, copy number variations, clinical associations, and microRNAs have been updated. Version numbers are now reported for NCBI accessions and CCDS IDs. For dbSNP, we've now tracked whether a variation is a SNP or an indel (whether there are dbSNP genotypes or not) so the comparison of submitted indels to dbSNP indels is improved. In addition, for variations without genotypes, we store the dbSNP alleles, so that the allelesDBSNP column can be filled in more frequently.

The list of GVS functions has been augmented to add the string "-near-splice" if the variation is in the first two or last two positions in an exon, and is in a coding region. To pick out missense SNPs, for example, it will be necessary to use "contains" instead of "equals" for the strings. The "nonsense" classification has been replaced by "stop-gained" and "stop-lost".
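The difference between "contains" and "equals" can be shown with a short filter. This is a minimal sketch; the list of labels is an illustrative sample, and "missense-near-splice" is assumed to be the form the suffixed label takes.

```python
def is_missense(function_gvs):
    """True for "missense" and for suffixed forms such as "missense-near-splice"."""
    return "missense" in function_gvs


labels = ["missense", "missense-near-splice", "coding-synonymous", "intron"]
exact_matches = [label for label in labels if label == "missense"]
contains_matches = [label for label in labels if is_missense(label)]
```

An exact comparison would silently drop the near-splice missense SNPs, whereas the substring test keeps them.
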

There is now no separate treatment of 1000 Genomes. In the first column of the result file, the string "dbSNP_" is followed by the create build for dbSNP (larger numbers indicate more recent variation submissions to dbSNP).
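A script reading the first column can recover the create build as a number. The sketch below assumes the value is exactly "dbSNP_" followed by digits, for example "dbSNP_129", and treats anything else as having no build. The function name is illustrative.

```python
def dbsnp_create_build(first_column):
    """Return the dbSNP create build as an int, or None if the value has no build number."""
    prefix = "dbSNP_"
    if first_column.startswith(prefix):
        suffix = first_column[len(prefix):]
        if suffix.isdigit():
            return int(suffix)
    return None
```
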

Extraction of PolyPhen-2 calls for our database is now different. We submit genomic locations and alleles, rather than protein sequences with amino-acid changes. The set submitted is a list of missense SNPs for any SNP in dbSNP that was called missense by dbSNP or by us, plus any SNPs in 5500 exomes sequenced at the University of Washington or at the Broad Institute (NHLBI GO Exome Sequencing Project) that are called missense by us.

In the NCBI gene model, there are a few transcripts with two genomic locations. We've retained both of these and added the string ".dup" to the transcript with the smaller number of coding bases.

There are several changes to the defaults for checkboxes and radio buttons on the home page. In addition, for indels submitted in VCF files, there is now a choice to echo VCF REF and ALT alleles in the referenceBase and sampleAlleles columns (radio button "VCF-like allele columns").

Multiple clinical association links are now separated by a bar ("|") rather than a comma.

For more details, see the How-to-Use page and the links at the bottom of that page.

Build notes for 6.21, August 11, 2011

Complex indels may be submitted in VCF files (see "Indel Annotation" for details).

For automated submissions, return of a compressed file is possible (and recommended for large jobs).

VCF files submitted with quotes around the individual genotype fields will be handled by removing the quotes (accompanied by a warning message).

Build notes for 6.20, July 29, 2011

For automated submissions, there is additional information returned to enable job cancellation when wanted (see the "AUTOMATED SUBMISSION" section on the How-to-Use page).

Build notes for 6.19, July 18, 2011

OMIM clinical-association links and annotation for microRNAs have been updated.

For automated submissions, there is now a limit of 2 million lines in progress, total for all submissions. If you exceed that, you will get a short file and an email with an abort message. This limit may be lowered as necessary to control the server load.

Build notes for 6.18, July 7, 2011

Presence in 1000 Genomes is now recorded for indels.

Build notes for 6.17, June 24, 2011

For the one-genotype-per-line input format, the site will now accept genotypes of the form A/G.

Build notes for 6.16, June 15, 2011

The maximum number of lines for a submitted file is now 1 million.

Build notes for 6.15, June 14, 2011

Plain-text files that have been compressed by the gzip utility will be accepted for most browser/operating-system combinations.

Build notes for 6.14, May 15, 2011

For the polyPhen column, we now match the submitted alleles to those we used when submitting SNPs to PolyPhen 2. In rare cases where the alleles don't match (for example three alleles are involved), the polyPhen value is set to unknown.

Internally, we have increased the cached annotations in our database, so speed should be improved.

For indels, there was a fix for a bug that affected the CNV and microRNAs annotations: the previous annotation was only for the location just before the insertion or deletion. Now the annotation is for all bases of a deletion and for the locations just before and just after an insertion.

Build notes for 6.13, May 2, 2011

Most of the "coding-notMod3" functions are now resolved by matching the protein sequence to the coding nucleotide sequence, first in the forward direction along the nucleotide sequence until the SNP is found or a mismatch is observed, then if necessary in the reverse direction.

Build notes for 6.12, April 2, 2011

There is more documentation and verification for the one-genotype-per-line input format; plus a progress-bar fix.

For VCF in, lower-case column headers are accepted.

Build notes for 6.11, March 23, 2011

If you ask for VCF-out and don't select cDNA position annotation, the job will no longer fail.

Added further validation of submitted files. For input format one-genotype-per-line, commas in the genome position are handled.

For VCF files, the code can now handle genotype fields that have multiple characters in the set "/", "|", "\" (for example a "|" for the GT component but a "/" for another component).
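One way to read such fields is to isolate the GT component first and only then split on the separator characters. This is a sketch of that idea, not the site's parser. The function name is hypothetical, and it assumes GT is the first colon-separated component of the genotype field.

```python
import re

# Any of the three separator characters named above: "/", "|" or "\".
GENOTYPE_SEPARATORS = re.compile(r"[/|\\]")


def genotype_alleles(sample_field):
    """Split the GT component of a genotype field into its allele entries."""
    genotype = sample_field.split(":")[0]
    return GENOTYPE_SEPARATORS.split(genotype)
```

Because the other components are discarded before splitting, a "/" inside a later component does not interfere with a "|" in GT.
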

Build notes for 6.10, March 11, 2011

A new column is available: cDNA position, position in the coding region of a gene.

Build notes for 6.09, February 16, 2011

A new column is available: Grantham scores (applicable to missense SNPs).

Very simple indels can be submitted in VCF 4.0 format.

The SeattleSeqAnnotation131 version is printed in the lines at the end of the returned file (for the original output format, not VCF out).

Build notes for 6.08, January 27, 2011

There is a new file format: one genotype per line.

Build notes for 6.07, January 10, 2011

In the lines at the end of the returned file, SNP-based function counts (or indel-based counts) are available as well as accession-based counts.

For single-individual submissions, it is now possible to submit non-unique locations (more than one genotype per location), and thus investigate how the SNP function changes for the set of alleles different from the reference allele.

Internally, we have cached data for a large number of SNPs observed locally. This should speed up processing. (However, the caching will occasionally be turned off for maintenance, in which case the site will temporarily revert to the current speeds.)

Build notes for 6.06, December 15, 2010

PolyPhen calls (for known SNPs) are now available for NCBI gene model lines. The labels have changed slightly.

Build notes for 6.05, December 9, 2010

The Text option under "Input Annotation File for Table Display" has been cleaned up to facilitate Excel imports.

Build notes for 6.04, November 22, 2010

A new column is available: microRNAs, the EMBL IDs of any microRNAs in which the variation is found.

Build notes for 6.03, November 2, 2010

More than twice as many PolyPhen results are available. SNP locations for 80 individuals sequenced at the University of Washington (mixed ancestry, Environmental Genome Project) were submitted to the PolyPhen 2 site in late October 2010. Many of these SNPs were in dbSNP as well, but we found that about 10% of the calls were different from those extracted in July 2010. In these cases, the newer PolyPhen result was written to our database. During the next week we will be updating the previous dbSNP calls, so 10% of the calls for SNPs in dbSNP but not in the 80 UW exomes are expected to change.

Build notes for 6.02, October 13, 2010

A new column is available: distance to nearest splice site (column distanceToSplice). Overlap of copy number variations with submitted SNPs or indels is now reported for novel SNPs as well as SNPs in dbSNP (column CNV). There is an optional parameter "compress" available, which when set to false, triggers the result file to be returned as plain text. Any version of a VCF file is now accepted without complaint, though code was written for version 3.3.

Build notes for 6.01, September 27, 2010

Limited annotation is now available for indels when submitted in the GATK bed format. A clinical-association filter has been added to the "Input Annotation File for Table Display" section.

Build notes for 6.00, July 29, 2010

The genome positions are now those of the human reference sequence of February 2009 (UCSC hg19, NCBI build 37). Submitted data must be referenced to that sequence. The dbSNP build is now 131.

This dbSNP build does not label splice sites (dbSNP functions will be "intron" for splice-site SNPs).

The UCSC phastCons conservation scores are now those for 46 placental mammals. A liftover from hg18 to hg19 has been applied to the GERP conservation scores.

The nickLab column has been discontinued.

Only the CCDS build of May 2010 is supported.

The 1000-genomes presence is now derived from the union of all 3 pilots released in March 2010.

The column name "allelesMaq" has been changed to "sampleAlleles".

The PolyPhen prediction impacts are those of PolyPhen-2. Thus far only (missense) SNPs in dbSNP build 131 with genotypes have impact values (no novel SNPs of our own, and no 1000-genomes SNPs).

The site now accepts VCF files.

Build notes for 5.00, February 5, 2010

There is a choice of expressing HapMap frequencies as minor-allele or hg18-reference-allele frequencies. Internally, there has been a major upgrade to the application server software (to JBoss 5).

Previous Build Notes

December 21, 2009: Presence of a SNP in the April 2009 1000 Genomes release is now noted in the first column. The new descriptions in column 1 are now dbSNP, 1000Genomes, dbSNP_1000Genomes, and none.

December 14, 2009: The unused "polyPhen" in the column-header line is now omitted when no CCDS gene models are selected. (This does not affect any column entries.)

November 6, 2009: When reading annotation files back into the site and observing the table, clicking on the GERP conservation score brings up a plot of nearby conservation scores. (This could aid in identifying regulatory SNPs, as the GERP scores have single-base resolution.) The clinical associations from dbSNP have been updated in our database, and there are a few more PolyPhen calls. There is now a way to automate the exchange of submitted file and result file with a screen-scraper program (see the AUTOMATED SUBMISSION section on the How-to-Use page).

October 28, 2009: Changes were in support of reading annotation files back into the site to generate a table ("Input Annotation File for Table Display" section). If a large file with more than 5,000 lines is resubmitted, the table is divided into multiple pages. For short jobs, there is a shortcut to get to the table. On the progress page, a link "show table" appears once the job is complete and the page refreshed. This displays the table with input from the file on our server.

September 23, 2009: Conservation scores from the program GERP are now available, in addition to the previous phastCons scores. The column heading conservationScore (phastCons) has been changed to scorePhastCons, and the new GERP column heading is consScoreGERP. In addition, the function classes near-gene-5 and near-gene-3 (within 2000 bases of the transcribed region) have been added for the NCBI gene model.

August 11, 2009: The functionGVS classes have been changed to match those of functionDBSNP: missense, nonsense, utr-5, utr-3, splice-5, splice-3. Intron and coding-synonymous are the same. The lines at the end of the returned file have been correspondingly changed, and 3 lines have been added. The "reference" class has been omitted from the functionDBSNP list.

July 29, 2009: Protein sequence was added for the NCBI gene model.

July 27, 2009: Protein position was added for the NCBI gene model. For CCDS, (irrelevant) protein positions are no longer quoted for intronic SNPs.

July 13, 2009: A bug was fixed that affected the functionGVS and aminoAcids columns for the case of SNPs for which the codon extends between two exons. Only SNPs in the first nucleotide of an exon, the last, or one base in from the first or last were affected, and then only a small fraction of those. The corrected code was posted at 6 pm PDT.

April 28, 2009: Illumina CASAVA files can now be submitted.

April 10, 2009: Files with multiple individuals may be submitted, and the genotypes from all individuals will be listed on a single SNP line. If an individual classification is given, the genotypes will be divided into those classifications.

April 2, 2009: Links from dbSNP to clinical association data are available.

March 24, 2009: The name of your input file is now included in the name of the returned file if you don't specify a fileName string in your submitted file. If an annotation file is being read back in for display, and if the original submission specified a dbSNP individual ID, there is now the option to display only those SNPs with discrepant genotypes. Discrepant genotypes, whether submitted/dbSNP or submitted/nickLab, are marked with red backgrounds.

March 17, 2009: In the nickLab column, the actual genotype is displayed (if available) instead of "seen" or "notSeen", if an individual ID is entered on the home page. Emails are now being sent by user snpserve. Please contact us if you have trouble getting an email with a download link.

March 10, 2009: For some non-synonymous SNPs of the CCDS gene models, PolyPhen predictions are available.

March 9, 2009: There is now a choice of two CCDS gene models: the previous 2007, and the newer 2008 (September). The site name has been changed to SeattleSeq Annotation. The limit to the number of lines in a submitted file is 210,000.

March 4, 2009: Speed is improved for the chimp allele annotation.

March 3, 2009: The progress bar was fixed. Speed is improved.

February 24, 2009: There were only internal changes for better server memory usage.

February 10, 2009: For nonsynonymous SNPs, when a result file is read back in, the list of amino acids is a link that will bring up a window with several physicochemical properties of the amino acids.

February 5, 2009: For nonsynonymous SNPs (CCDS gene model only so far), when a result file is read back in, there is now a link to the ProPhylER website in the protein sequence column.

February 4, 2009: The conservation-score calculation is now fast.

February 3, 2009: There is now a choice of input file formats.

January 21, 2009: A column proteinSequence (CCDS gene model only so far) has been added.

January 20, 2009: A column proteinPosition (CCDS gene model only so far) has been added. This is the position of the amino acid in the protein, beginning at the N-terminal (first amino acid is position 1), followed by the total number of amino acids in the protein.

January 14, 2009: The CCDS database was corrected to treat pseudoautosomal genes (X and Y chromosomes) correctly. For CCDS annotation and all chromosomes, the gene region was one base off, and this has been corrected. The functions were ok, but if the SNP was one base below the start of the gene, the gene would be listed, and if the SNP was the last base of the gene, the gene would not be listed.

December 31, 2008: When a dbSNP individual is entered, and the result file read back in, genotype disagreements are indicated with a red background. There is more header information on the read-back page.

December 30, 2008: A dbSNP individual ID may be entered if the data is from a known individual, and the dbSNP genotype can then be added to the annotation.

December 2, 2008: A choice of NCBI or CCDS genes is available.

September 18, 2008: The GVS database has been upgraded to dbSNP build 129 (June 2008). Maq and dbSNP alleles columns were added.

Overlap with repeats has been added to the August 14, 2008 build.

Presence or absence of genotypes, and dbSNP validation codes, have been added to the August 11, 2008 build.

HapMap minor-allele frequencies have been added to the August 7, 2008 build.

A list of genes has been added to the July 31, 2008 build.

As of the July 10, 2008 build, there is output for SNPs that are not within genes.

For the July 8, 2008 build, it is now possible to submit files without the initial dbSNP-presence column (that is, files directly from Maq may be submitted). The web site will automatically detect whether this column is there or not in the input file, and it will report whether the SNP is in the GVS database or not. The chimp-allele annotation is now reported for SNPs not in the GVS database. Downloaded files can be returned for table display.

For the May 27, 2008 build, additional annotation (conservation score, chimp allele, and copy number variations) is available. In the list of amino acids, the reference amino acid is now always first.
