SeattleSeq Annotation 137

SeattleSeq Annotation Build Notes

The current SeattleSeq Annotation version is 8.08, October 13, 2013

For input format "VCF SNVs and Indels", VCF files with individual columns, but no genotypes in those columns, are better handled.

Build notes for 8.07, July 3, 2013

VCF files with no genotypes may now be submitted with the "VCF SNVs and Indels (both)" input format. The "VCF (indels only)" input format will still not accept files with no genotypes, but pure indel files with no genotypes can be submitted with "VCF SNVs and Indels (both)".

The functions for indels near coding or splice boundaries have been tuned slightly. If the indel is an insertion, and the insertion site is between a canonical splice site and the adjacent intron, the function will be called intron and not splice-5 or splice-3. Proximity to a splice site may be recovered by looking at the distanceToSplice column. If the indel is a deletion, and the deletion includes both coding and non-coding bases, the number of deleted coding bases will be counted, and the function will be coding or frameshift depending on the number of coding bases deleted (rather than depending only on whether the total number of deleted bases is an exact multiple of 3).
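
The deletion rule can be illustrated with a minimal Python sketch. It is not the site's code: the function names, the 1-based inclusive coordinates, and the mapping "multiple of 3 gives coding, otherwise frameshift" are assumptions made for illustration.

```python
def count_coding_bases_deleted(del_start, del_end, coding_intervals):
    """Count deleted bases that fall inside coding intervals.

    Coordinates are assumed 1-based and inclusive; coding_intervals is a
    list of (start, end) tuples assumed not to overlap one another.
    """
    total = 0
    for cds_start, cds_end in coding_intervals:
        overlap_start = max(del_start, cds_start)
        overlap_end = min(del_end, cds_end)
        if overlap_start <= overlap_end:
            total += overlap_end - overlap_start + 1
    return total


def classify_deletion(del_start, del_end, coding_intervals):
    """Classify from the coding-base count alone (hypothetical helper)."""
    n_coding = count_coding_bases_deleted(del_start, del_end, coding_intervals)
    if n_coding == 0:
        return None  # no coding bases deleted; outside the scope of this sketch
    return "coding" if n_coding % 3 == 0 else "frameshift"
```

With a coding interval of 100-200, a deletion of bases 199-204 removes 6 bases in total but only 2 coding bases, so this sketch returns "frameshift" even though the total length is a multiple of 3.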

Build notes for 8.06, June 13, 2013

An option to split the ESP allele counts by African and European ancestry is provided.

On May 19, ESP allele counts for the Y chromosome were added to our database.

For automated submissions, progress and download URLs now contain the string ABORT if the submitted file has too many lines, and the Java examples have been modified to stop on the ABORT string.
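
A client script might check for the ABORT string along the following lines. This is only a sketch in Python (the site's own examples are in Java); the exception class, the function name, and the example.org URLs are hypothetical and do not reflect the actual URL layout.

```python
class SubmissionAborted(Exception):
    """Raised when a returned URL carries the ABORT marker."""


def check_returned_urls(progress_url, download_url):
    """Stop early if either returned URL contains the string ABORT."""
    for url in (progress_url, download_url):
        if "ABORT" in url:
            raise SubmissionAborted("submission rejected: " + url)
    return progress_url, download_url


# Hypothetical URLs, for illustration only.
try:
    check_returned_urls("http://example.org/progress/ABORT",
                        "http://example.org/download/ABORT")
except SubmissionAborted as err:
    print(err)
```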

Build notes for 8.05, May 5, 2013

A new section at the bottom of the home page allows entry of a single SNV for table display of annotation.

Build notes for 8.04, April 8, 2013

SNV-only VCF files with individual columns, but no genotypes in those columns, are better handled.

Build notes for 8.03, April 5, 2013

A few of the most recent GWAS links were missing (clinical association column). This has been corrected.

Files larger than 1 GB (uncompressed) will no longer be accepted.

Build notes for 8.02, March 20, 2013

The site will no longer attempt to annotate structural variants or insertions/deletions of size greater than 1000 bp. Pindel VCF files will be accepted, but for original output format long indels will be skipped; for VCF out there will only be an echo of the input line.

Internally there was an upgrade from JBoss 5 to JBoss 7.

Build notes for 8.01, January 23, 2013

Most SNV data is cached for the NCBI gene model, so the response will be much faster for SNVs when selecting "NCBI full genes".

For the VCF-out format, the amino-acid tag has been changed from "AC" to "AAC" to avoid conflict with another commonly used tag.

For input file format "VCF SNVs and Indels", the VCF output format is now available.

Protein-protein interactions (STRING 9.0 experimental) have been added as an annotation column.

Inclusion of variations in repeat regions is now available for all locations, not just those of variations in dbSNP.

Matching indels to those in dbSNP has been tuned slightly by using the dbSNP HGVS notation to identify a better allele match when there are multiple choices.

Build notes for 8.00, November 14, 2012

The dbSNP build is now 137. The gene models (NCBI and CCDS), PolyPhen-2 assignments, clinical associations, KEGG pathways, ESP allele counts (now for the final 6503 individuals), and microRNAs have been updated.

We are now reporting PolyPhen-2 impacts using bulk download files, so results will be there for most missense SNVs. The "score" radio button is now the default.

For indel-only VCF file submissions, "VCF-like allele columns" is now the default.

The "NHLBI ESP Allele Counts" checkbox defaults to on.

Build notes for 7.05, June 19, 2012

Automated submissions may now be made without an autoFile line in the submitted file. (See the Automated Submission section in the how-to-use page.)

Build notes for 7.04, May 17, 2012

CpG Islands, Transcription Factor Binding Sites, and NHLBI ESP Allele Counts have been added.

Build notes for 7.03, April 15, 2012

KEGG pathways have been added.

There are small changes to the notation in the footer lines, including use of "stop" rather than "nonsense".

Build notes for 7.02, March 25, 2012

It is now possible to submit VCF files with SNVs and indels in the same file. The "VCF SNVs and Indels (both)" option will eventually replace the separate SNV and indel formats, but for now the "both" option has these restrictions: (1) the VCF version must be 4.0 or higher, (2) there must be genotypes for at least one individual (10 or more columns), and (3) the output is only available in the original format, with "VCF-like allele columns" for indels.
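
Restrictions (1) and (2) can be checked before submission with a short script. The sketch below is an illustration in Python, not the site's validation code; it assumes the standard "##fileformat=VCFv..." header line and a tab-delimited "#CHROM" column-header line, and the function name is hypothetical.

```python
def meets_both_format_restrictions(lines):
    """Check VCF version >= 4.0 and at least 10 columns (one individual)."""
    version = None
    n_columns = 0
    for line in lines:
        if line.startswith("##fileformat=VCFv"):
            try:
                version = float(line.strip().split("VCFv", 1)[1])
            except ValueError:
                return False  # unrecognized version string
        elif line.startswith("#CHROM"):
            # 9 fixed columns (CHROM..FORMAT) plus one column per individual.
            n_columns = len(line.rstrip("\n").split("\t"))
            break
    return version is not None and version >= 4.0 and n_columns >= 10
```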

For the PolyPhen column, there is a choice of "class", as before, or the new "score", a number between 0 and 1.

PubMed links from the NHGRI GWAS catalog have been added to the clinicalAssociation column.

Build notes for 7.01, December 2, 2011

Issues in 7.00 with reverse-complementing dbSNP alleles, and with classifying dbSNP variations as SNPs or indels, have been resolved.

Build notes for 7.00, November 17, 2011

The dbSNP build is now 134. The gene models (NCBI and CCDS), GERP scores, chimp alleles, PolyPhen-2 assignments, copy number variations, clinical associations, and microRNAs have been updated. Version numbers are now reported for NCBI accessions and CCDS IDs. For dbSNP, we've now tracked whether a variation is a SNP or an indel (whether there are dbSNP genotypes or not) so the comparison of submitted indels to dbSNP indels is improved. In addition, for variations without genotypes, we store the dbSNP alleles, so that the allelesDBSNP column can be filled in more frequently.

The list of GVS functions has been augmented to add the string "-near-splice" if the variation is in the first two or last two positions in an exon, and is in a coding region. To pick out missense SNPs, for example, it will be necessary to use "contains" instead of "equals" for the strings. The "nonsense" classification has been replaced by "stop-gained" and "stop-lost".
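
The difference between "equals" and "contains" can be seen in a small Python sketch. The list of function strings is illustrative only; the exact set of values produced by the site is not reproduced here.

```python
# Illustrative function strings, including the "-near-splice" suffix.
functions = [
    "missense",
    "missense-near-splice",
    "coding-synonymous",
    "stop-gained",
    "intron",
]

# An exact match misses the near-splice variant.
missense_equals = [f for f in functions if f == "missense"]

# A substring match picks up both forms.
missense_contains = [f for f in functions if "missense" in f]

print(missense_equals)    # ['missense']
print(missense_contains)  # ['missense', 'missense-near-splice']
```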

There is now no separate treatment of 1000 Genomes. In the first column of the result file, the string "dbSNP_" is followed by the create build for dbSNP (larger numbers indicate more recent variation submissions to dbSNP).
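
To filter on the create build, the first column can be parsed as in the following Python sketch. It assumes the build appears as a plain integer directly after "dbSNP_" (for example "dbSNP_129") and that any other value means no build is available; the function name is hypothetical.

```python
import re


def dbsnp_create_build(first_column):
    """Return the dbSNP create build as an int, or None if not present."""
    match = re.fullmatch(r"dbSNP_(\d+)", first_column.strip())
    return int(match.group(1)) if match else None


# Keep only variations first submitted in build 130 or later.
values = ["dbSNP_129", "dbSNP_134", "none"]
recent = [v for v in values if (dbsnp_create_build(v) or 0) >= 130]
```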

Extraction of PolyPhen-2 calls for our database is now different. We submit genomic locations and alleles, rather than protein sequences with amino-acid changes. The set submitted is a list of missense SNPs for any SNP in dbSNP that was called missense by dbSNP or by us, plus any SNPs in 5500 exomes sequenced at the University of Washington or at the Broad Institute (NHLBI GO Exome Sequencing Project) that are called missense by us.

In the NCBI gene model, there are a few transcripts with two genomic locations. We've retained both of these and added the string ".dup" to the transcript with the smaller number of coding bases.

There are several changes to the defaults for checkboxes and radio buttons on the home page. In addition, for indels submitted in VCF files, there is now a choice to echo VCF REF and ALT alleles in the referenceBase and sampleAlleles columns (radio button "VCF-like allele columns").

Multiple clinical association links are now separated by a bar ("|") rather than a comma.

For more details, see the How-to-Use page and the links at the bottom of that page.

Build notes for 6.21, August 11, 2011

Complex indels may be submitted in VCF files (see "Indel Annotation" for details).

For automated submissions, return of a compressed file is possible (and recommended for large jobs).

VCF files submitted with quotes around the individual genotype fields will be handled by removing the quotes (accompanied by a warning message).

Build notes for 6.20, July 29, 2011

For automated submissions, there is additional information returned to enable job cancellation when wanted (see the "AUTOMATED SUBMISSION" section on the How-to-Use page).

Build notes for 6.19, July 18, 2011

OMIM clinical-association links and annotation for microRNAs have been updated.

For automated submissions, there is now a limit of 2 million lines in progress, total for all submissions. If you exceed that, you will get a short file and an email with an abort message. This limit may be lowered as necessary to control the server load.

Build notes for 6.18, July 7, 2011

Presence in 1000 Genomes is now recorded for indels.

Build notes for 6.17, June 24, 2011

For the one-genotype-per-line input format, the site will now accept genotypes of the form A/G.

Build notes for 6.16, June 15, 2011

The maximum number of lines for a submitted file is now 1 million.

Build notes for 6.15, June 14, 2011

Plain-text files that have been compressed by the gzip utility will be accepted for most browser/operating-system combinations.

Build notes for 6.14, May 15, 2011

For the polyPhen column, we now match the submitted alleles to those we used when submitting SNPs to PolyPhen 2. In rare cases where the alleles don't match (for example three alleles are involved), the polyPhen value is set to unknown.

Internally, we have increased the cached annotations in our database, so speed should be improved.

For indels, there was a fix for a bug that affected the CNV and microRNAs annotations: the previous annotation was only for the location just before the insertion or deletion. Now the annotation is for all bases of a deletion and for the locations just before and just after an insertion.
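
The corrected behavior can be summarized in a small Python sketch. The coordinate conventions are assumptions: for a deletion, start is taken as the first deleted base; for an insertion, start is taken as the base just before the inserted sequence. The function name is hypothetical.

```python
def positions_to_annotate(kind, start, length=0):
    """List the reference positions checked for CNV and microRNA overlap."""
    if kind == "deletion":
        # Every deleted base is annotated.
        return list(range(start, start + length))
    if kind == "insertion":
        # The bases just before and just after the insertion site.
        return [start, start + 1]
    raise ValueError("kind must be 'deletion' or 'insertion'")
```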

Build notes for 6.13, May 2, 2011

Most of the "coding-notMod3" functions are now resolved by matching the protein sequence to the coding nucleotide sequence, first in the forward direction along the nucleotide sequence until the SNP is found or a mismatch is observed, then if necessary in the reverse direction.

Build notes for 6.12, April 2, 2011

There is more documentation and verification for the one-genotype-per-line input format; plus a progress-bar fix.

For VCF in, lower-case column headers are accepted.

Build notes for 6.11, March 23, 2011

If you ask for VCF-out and don't select cDNA position annotation, the job will no longer fail.

Added further validation of submitted files. For input format one-genotype-per-line, commas in the genome position are handled.

For VCF files, the code can now handle genotype fields that contain more than one of the separator characters "/", "|", and "\" (for example, a "|" for the GT component but a "/" for another component).
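
One way to parse such fields is sketched below in Python. It is an illustration rather than the site's parser, and it assumes that the components of a genotype field are separated by ":" with GT first, as in the VCF specification; the function name is hypothetical.

```python
import re


def gt_allele_indices(genotype_field):
    """Split the GT component into allele indices on "/", "|", or "\\"."""
    gt = genotype_field.split(":", 1)[0]  # GT is the first component
    return re.split(r"[/|\\]", gt)


# The GT component uses "|" while a later component uses "/".
print(gt_allele_indices("0|1:12/15"))  # ['0', '1']
```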

Build notes for 6.10, March 11, 2011

A new column is available: cDNA position, position in the coding region of a gene.

Build notes for 6.09, February 16, 2011

A new column is available: Grantham scores (applicable to missense SNPs).

Very simple indels can be submitted in VCF 4.0 format.

The SeattleSeqAnnotation131 version is printed in the lines at the end of the returned file (for the original output format, not VCF out).

Build notes for 6.08, January 27, 2011

There is a new file format: one genotype per line.

Build notes for 6.07, January 10, 2011

In the lines at the end of the returned file, SNP-based function counts (or indel-based counts) are available as well as accession-based counts.

For single-individual submissions, it is now possible to submit non-unique locations (more than one genotype per location), and thus investigate how the SNP function changes for the set of alleles different from the reference allele.

Internally, we have cached data for a large number of SNPs observed locally. This should speed up processing. (However, the caching will occasionally be turned off for maintenance, in which case the site will temporarily revert to the current speeds.)

Build notes for 6.06, December 15, 2010

PolyPhen calls (for known SNPs) are now available for NCBI gene model lines. The labels have changed slightly.

Build notes for 6.05, December 9, 2010

The Text option under "Input Annotation File for Table Display" has been cleaned up to facilitate Excel imports.

Build notes for 6.04, November 22, 2010

A new column is available: microRNAs, the EMBL IDs of any microRNAs in which the variation is found.

Build notes for 6.03, November 2, 2010

More than twice as many PolyPhen results are available. SNP locations for 80 individuals sequenced at the University of Washington (mixed ancestry, Environmental Genome Project) were submitted to the PolyPhen 2 site in late October 2010. Many of these SNPs were in dbSNP as well, but we found that about 10% of the calls were different from those extracted in July 2010. In these cases, the newer PolyPhen result was written to our database. During the next week we will be updating the previous dbSNP calls, so 10% of the calls for SNPs in dbSNP but not in the 80 UW exomes are expected to change.

Build notes for 6.02, October 13, 2010

A new column is available: distance to nearest splice site (column distanceToSplice). Overlap of copy number variations with submitted SNPs or indels is now reported for novel SNPs as well as SNPs in dbSNP (column CNV). There is an optional parameter "compress" available, which, when set to false, causes the result file to be returned as plain text. Any version of a VCF file is now accepted without complaint, though the code was written for version 3.3.

Build notes for 6.01, September 27, 2010

Limited annotation is now available for indels when submitted in the GATK bed format. A clinical-association filter has been added to the "Input Annotation File for Table Display" section.

Build notes for 6.00, July 29, 2010

The genome positions are now those of the human reference sequence of February 2009 (UCSC hg19, NCBI build 37). Submitted data must be referenced to that sequence. The dbSNP build is now 131.

This dbSNP build does not label splice sites (dbSNP functions will be "intron" for splice-site SNPs).

The UCSC phastCons conservation scores are now those for 46 placental mammals. A liftover from hg18 to hg19 has been applied to the GERP conservation scores.

The nickLab column has been discontinued.

Only the CCDS build of May 2010 is supported.

The 1000-genomes presence is now derived from the union of all 3 pilots released in March 2010.

The column name "allelesMaq" has been changed to "sampleAlleles".

The PolyPhen prediction impacts are those of PolyPhen-2. Thus far only (missense) SNPs in dbSNP build 131 with genotypes have impact values (no novel SNPs of our own, and no 1000-genomes SNPs).

The site now accepts VCF files.

Build notes for 5.00, February 5, 2010

There is a choice of expressing HapMap frequencies as minor-allele or hg18-reference-allele frequencies. Internally, there has been a major upgrade to the application server software (to JBoss 5).

Previous Build Notes

December 21, 2009: Presence of a SNP in the April 2009 1000 Genomes release is now noted in the first column. The new descriptions in column 1 are now dbSNP, 1000Genomes, dbSNP_1000Genomes, and none.

December 14, 2009: The unused "polyPhen" in the column-header line is now omitted when no CCDS gene models are selected. (This does not affect any column entries.)

November 6, 2009: When reading annotation files back into the site and observing the table, clicking on the GERP conservation score brings up a plot of nearby conservation scores. (This could aid in identifying regulatory SNPs, as the GERP scores have single-base resolution.) The clinical associations from dbSNP have been updated in our database, and there are a few more PolyPhen calls. There is now a way to automate the exchange of submitted file and result file with a screen-scraper program (see the AUTOMATED SUBMISSION section on the How-to-Use page).

October 28, 2009: Changes were in support of reading annotation files back into the site to generate a table ("Input Annotation File for Table Display" section). If a large file with more than 5,000 lines is resubmitted, the table is divided into multiple pages. For short jobs, there is a shortcut to get to the table. On the progress page, a link "show table" appears once the job is complete and the page refreshed. This displays the table with input from the file on our server.

September 23, 2009: Conservation scores from the program GERP are now available, in addition to the previous phastCons scores. The column heading conservationScore (phastCons) has been changed to scorePhastCons, and the new GERP column heading is consScoreGERP. In addition, the function classes near-gene-5 and near-gene-3 (within 2000 bases of the transcribed region) have been added for the NCBI gene model.
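
The near-gene classes can be pictured with the following Python sketch. It is not the site's code: it assumes inclusive 1-based transcript coordinates, that "within 2000 bases" includes a distance of exactly 2000, and that the 5' side is upstream relative to the gene's strand. The function name is hypothetical.

```python
def near_gene_class(pos, tx_start, tx_end, strand, window=2000):
    """Return 'near-gene-5', 'near-gene-3', or None for a position."""
    if tx_start <= pos <= tx_end:
        return None  # inside the transcribed region; other classes apply
    if tx_start - window <= pos < tx_start:
        # Lower coordinates are 5' on the plus strand, 3' on the minus strand.
        return "near-gene-5" if strand == "+" else "near-gene-3"
    if tx_end < pos <= tx_end + window:
        return "near-gene-3" if strand == "+" else "near-gene-5"
    return None
```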

August 11, 2009: The functionGVS classes have been changed to match those of functionDBSNP: missense, nonsense, utr-5, utr-3, splice-5, splice-3. Intron and coding-synonymous are the same. The lines at the end of the returned file have been correspondingly changed, and 3 lines have been added. The "reference" class has been omitted from the functionDBSNP list.

July 29, 2009: Protein sequence was added for the NCBI gene model.

July 27, 2009: Protein position was added for the NCBI gene model. For CCDS, (irrelevant) protein positions are no longer quoted for intronic SNPs.

July 13, 2009: A bug was fixed that affected the functionGVS and aminoAcids columns for the case of SNPs for which the codon extends between two exons. Only SNPs in the first nucleotide of an exon, the last, or one base in from the first or last were affected, and then only a small fraction of those. The corrected code was posted at 6 pm PDT.

April 28, 2009: Illumina CASAVA files can now be submitted.

April 10, 2009: Files with multiple individuals may be submitted, and the genotypes from all individuals will be listed on a single SNP line. If an individual classification is given, the genotypes will be divided into those classifications.

April 2, 2009: Links from dbSNP to clinical association data are available.

March 24, 2009: The name of your input file is now included in the name of the returned file if you don't specify a fileName string in your submitted file. If an annotation file is being read back in for display, and if the original submission specified a dbSNP individual ID, there is now the option to display only those SNPs with discrepant genotypes. Discrepant genotypes, whether submitted/dbSNP or submitted/nickLab, are marked with red backgrounds.

March 17, 2009: In the nickLab column, the actual genotype is displayed (if available) instead of "seen" or "notSeen", if an individual ID is entered on the home page. Emails are now being sent by user snpserve. Please contact us if you have trouble getting an email with a download link.

March 10, 2009: For some non-synonymous SNPs of the CCDS gene models, PolyPhen predictions are available.

March 9, 2009: There is now a choice of two CCDS gene models: the previous 2007, and the newer 2008 (September). The site name has been changed to SeattleSeq Annotation. The limit to the number of lines in a submitted file is 210,000.

March 4, 2009: Speed is improved for the chimp allele annotation.

March 3, 2009: The progress bar was fixed. Speed is improved.

February 24, 2009: There were only internal changes for better server memory usage.

February 10, 2009: For nonsynonymous SNPs, when a result file is read back in, the list of amino acids is a link that will bring up a window with several physicochemical properties of the amino acids.

February 5, 2009: For nonsynonymous SNPs (CCDS gene model only so far), when a result file is read back in, there is now a link to the ProPhylER website in the protein sequence column.

February 4, 2009: The conservation-score calculation is now fast.

February 3, 2009: There is now a choice of input file formats.

January 21, 2009: A column proteinSequence (CCDS gene model only so far) has been added.

January 20, 2009: A column proteinPosition (CCDS gene model only so far) has been added. This is the position of the amino acid in the protein, beginning at the N-terminal (first amino acid is position 1), followed by the total number of amino acids in the protein.

January 14, 2009: The CCDS database was corrected to treat pseudoautosomal genes (X and Y chromosomes) correctly. For CCDS annotation and all chromosomes, the gene region was one base off, and this has been corrected. The functions were ok, but if the SNP was one base below the start of the gene, the gene would be listed, and if the SNP was the last base of the gene, the gene would not be listed.

December 31, 2008: When a dbSNP individual is entered, and the result file read back in, genotype disagreements are indicated with a red background. There is more header information on the read-back page.

December 30, 2008: A dbSNP individual ID may be entered if the data is from a known individual, and the dbSNP genotype can then be added to the annotation.

December 2, 2008: A choice of NCBI or CCDS genes is available.

September 18, 2008: The GVS database has been upgraded to dbSNP build 129 (June 2008). Maq and dbSNP alleles columns were added.

Overlap with repeats has been added to the August 14, 2008 build.

Presence or absence of genotypes, and dbSNP validation codes, have been added to the August 11, 2008 build.

HapMap minor-allele frequencies have been added to the August 7, 2008 build.

A list of genes has been added to the July 31, 2008 build.

As of the July 10, 2008 build, there is output for SNPs that are not within genes.

For the July 8, 2008 build, it is now possible to submit files without the initial dbSNP-presence column (that is, files directly from Maq may be submitted). The web site will automatically detect whether this column is there or not in the input file, and it will report whether the SNP is in the GVS database or not. The chimp-allele annotation is now reported for SNPs not in the GVS database. Downloaded files can be returned for table display.

For the May 27, 2008 build, additional annotation (conservation score, chimp allele, and copy number variations) is available. In the list of amino acids, the reference amino acid is now always first.
