SeattleSeq Annotation 137
  

How to Use SeattleSeq Annotation
Input Variation List File for Annotation
Submit a plain-text file with a list of human SNVs (single nucleotide variations; also sometimes referred to as SNPs) or indels (small insertions or deletions) with NCBI-37/hg19 positions (1-based: first position on the chromosome is 1), and enter your e-mail address. Once the annotation calculations are complete, you'll receive an e-mail with a link to download the result file. Additional documentation is available via the links at the bottom of this page.

INPUT FILE REQUIREMENTS

Only chromosomes 1-22, X, and Y will be annotated. For maximum speed, the variations should be grouped by chromosome in the file. The maximum number of lines in the file is 1 million, and the maximum file size is 1 GB. Long indels (> 1000 bp) and structural variants (including the VCF <> alternate-allele notation) will be read in, but they will not be annotated. For the original output format, long indels and structural variants will be skipped; for VCF output, the input line will simply be echoed.
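As a rough pre-submission check of the requirements above, a short Python sketch follows. It is not part of SeattleSeq; the function name check_input_lines and the chrom_col parameter are hypothetical, and it assumes the chromosome is in a known white-space-separated column and that lines starting with "#" are header lines.

```python
ANNOTATED = {str(n) for n in range(1, 23)} | {"X", "Y"}
MAX_LINES = 1_000_000


def check_input_lines(lines, chrom_col=0):
    """Return a list of warnings for an iterable of input-file lines."""
    warnings = []
    finished = set()   # chromosomes whose block of lines has already ended
    current = None
    total = 0
    for total, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue   # blank lines and header lines carry no variation
        chrom = line.split()[chrom_col]
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        chrom = chrom.upper()
        if chrom not in ANNOTATED:
            warnings.append(f"line {total}: chromosome {chrom!r} will not be annotated")
        if chrom != current:
            if chrom in finished:
                warnings.append(f"line {total}: chromosome {chrom} is not grouped")
            if current is not None:
                finished.add(current)
            current = chrom
    if total > MAX_LINES:
        warnings.append(f"{total} lines exceeds the limit of {MAX_LINES}")
    return warnings
```

An empty list means none of these particular problems were found; it does not guarantee that the file will be accepted.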

There are 9 choices of input-file formats, all plain-text. (Plain-text files may be gzipped before submission, but their acceptance may depend on your browser and operating system.)

1. For a Maq file, there must be at least 4 white-space-separated columns in the input file, in the order chromosome (chr1 or chr2 or ... or chr22 or chrX or chrY), genomic coordinate (human NCBI 37, 1-based), reference base, Maq base (using ambiguity codes). The "chr" in the first column is required; if it's not there, use the custom-format option. There can be any number of columns following these four, and they are ignored. There can also be an additional column at the beginning, the contents of which will be copied to the output file. If the additional column at the beginning is not present, the website will fill in an initial column with "dbSNP_" and the dbSNP create build if the variation is in our GVS (Genome Variation Server) database, based on dbSNP build 137.

2. For a gff file, there must be at least 9 white-space-separated columns in the input file. An example is
chr1    SoapSNP SNP     4793    4793    25      +       .       ID=YHSNP0128643; status=novel; ref=A; allele=A/G; support1=48; support2=26;
	
The first column is the chromosome (initial "chr" is optional). The fourth column is treated as the SNV location (human NCBI 37, 1-based). The strand column is not used, and the strand is assumed to be "+". The string of annotations is assumed to be separated by whitespace, and columns 9 through the end are searched for "allele" and "ref" tags to extract the genotype and reference base. The genotype can be two alleles separated by "/", or it can be an ambiguity code. If no reference base is found, its value will be filled in from the NCBI 37 human reference sequence.
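To illustrate how the "ref" and "allele" tags might be pulled out of columns 9 onward, here is a minimal Python sketch. It only mimics the description above and is not the site's parser; the function name parse_gff_line is hypothetical, and the tag values are assumed to end at a semicolon or whitespace.

```python
import re

# A tag is "ref=" or "allele=" followed by a value that ends at ";" or whitespace.
TAG = re.compile(r"\b(ref|allele)=([^;\s]+)")


def parse_gff_line(line):
    """Return (chromosome, position, reference base or None, genotype or None)."""
    cols = line.split()
    if len(cols) < 9:
        raise ValueError("a gff line needs at least 9 white-space-separated columns")
    chrom = cols[0][3:] if cols[0].lower().startswith("chr") else cols[0]
    position = int(cols[3])                      # fourth column: SNV location
    tags = dict(TAG.findall(" ".join(cols[8:])))  # search columns 9 through the end
    return chrom, position, tags.get("ref"), tags.get("allele")
```

A reference base of None corresponds to the case in which the site fills in the base from the NCBI 37 human reference sequence.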

3. For a CASAVA file, there can be only one chromosome per file, and there must be at least 10 white-space-separated columns. The first (location), sixth (genotype), and tenth (reference allele) columns are used. The genotypes are assumed to be homozygous if there is one base in the sixth column, and heterozygous for content such as "AG". The column-header line beginning with "#" is ignored. In this case, there is no chromosome number inside the file, so the chromosome is extracted from the file name. Somewhere in the file name, separated from the rest of the name by "." characters, there must be a string "c" plus the number of the chromosome (c is case-sensitive, X and Y are not). Examples of valid names are: c6, c19.snp.txt, individualA.cX.txt, individualA.cy. If you wish to submit all chromosomes at once, here is a Perl script that will concatenate the information into a single file that can be submitted under custom-format, as described in item 5 below. Note that the 1-million-line limit will still apply, so submitting all chromosomes in one file may not work for whole-genome studies (but the Perl script could be modified to group a limited number of chromosomes).
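The file-name rule can be checked before submission with a sketch like the following. This is an illustration in Python, not the site's code; the function name chromosome_from_casava_name is hypothetical, and the input is assumed to be a bare file name without directories.

```python
import re


def chromosome_from_casava_name(file_name):
    """Return the chromosome encoded in a CASAVA file name, or None if absent."""
    for part in file_name.split("."):
        # Lowercase "c" is required; the X or Y that follows may be either case.
        match = re.fullmatch(r"c(\d{1,2}|[XxYy])", part)
        if match:
            chrom = match.group(1).upper()
            if chrom in ("X", "Y") or 1 <= int(chrom) <= 22:
                return chrom
    return None
```

With the names listed above, c6 gives "6", c19.snp.txt gives "19", individualA.cX.txt gives "X", and individualA.cy gives "Y"; a name such as C6.txt gives None because the "c" is case-sensitive.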

4. For a VCF file (SNVs only), the VCF-required first line with the version number must be present, plus the column header line. (Any number of VCF metainformation lines may be included between the version and column-header lines.) If you use file header lines (e.g. "# fileName", see below), they must be placed at the beginning, before the VCF version line. VCF Versions 3.3 and 4.0 are supported, though there is minimal version-verification testing. A description of the VCF format is here. The VCF file may contain genotypes for one or for multiple individuals, or it can contain no genotypes.

5. For a custom-format file, the column numbers for the chromosome, location, reference base, and the two alleles of the genotype must be filled in. For the chromosome, the initial "chr" is optional. The genotype can be two alleles in separate columns, two alleles separated by "/", or an ambiguity code. In the latter two cases, the two allele columns should be set to the same value. If no reference base is in the file, its column should be set to 0, and the base will be filled in from the NCBI 37 human reference sequence. If both the allele columns are set to zero, the two alleles of the genotype will be set to the reference allele (this is a way to get limited annotation, such as the gene list, for a chromosome location).

6. The One Genotype Per Line format is a white-space separated ASCII text format for handling one to many individuals in an ungrouped fashion. Indels are not supported for this format. Each line represents a single genotype for a single individual, and will result in at least one line of output. If multiple individuals have the same genotype, there must still be a separate line for each individual/genotype combination. Five (5) columns are required, conforming to the following arrangement:
  • Column 1: Individual label (no whitespace, freeform otherwise)
  • Column 2: Chromosome (chr leader text optional)
  • Column 3: Position on chromosome
  • Column 4: Reference base
  • Column 5: Genotype (either a single-character ambiguity code or single bases (A, C, G, T, N) separated by a slash e.g. A/G)

Any subsequent columns will be ignored and will not be repeated in the output. Input entry lines can be presented in any order, and lines for a given individual do not need to be grouped or sorted.

The individual column will be retained in output and listed as the first column. Output will be returned in chromosome-position-reference order, and cannot be used with the Input Annotation File for Table Display option of SeattleSeqAnnotation137.

A given annotation will be the same for the same combination of chromosome and position, regardless of individual. Predictions such as PolyPhen2 will report the most deleterious outcome for overlapping and/or inclusive results.

Sample input line:

IndA1-5	10	75871735   	C	Y
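A sketch of how one such line could be read is given below, in Python. It is an illustration only; the function name parse_genotype_line is hypothetical, the ambiguity table is the standard IUPAC two-base set, and treating a single plain base as a homozygous genotype is an assumption rather than documented behavior.

```python
# Standard IUPAC codes for one or two bases (three-base codes are omitted).
IUPAC = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
    "N": ("N", "N"),
}


def parse_genotype_line(line):
    """Split a One Genotype Per Line record into its five required fields."""
    cols = line.split()
    if len(cols) < 5:
        raise ValueError("five white-space-separated columns are required")
    individual, chrom, position, ref, genotype = cols[:5]  # later columns ignored
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    if "/" in genotype:
        first, second = genotype.upper().split("/")
        alleles = (first, second)
    else:
        alleles = IUPAC[genotype.upper()]
    return {
        "individual": individual,
        "chromosome": chrom,
        "position": int(position),
        "reference": ref.upper(),
        "alleles": alleles,
    }
```

For the sample input line above, the ambiguity code Y resolves to the alleles C and T at position 75871735 on chromosome 10.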

7. The GATK bed format is only for simple indels, for which limited annotation is available. This choice is for files created by older versions of the GATK Indel Genotyper. If you have indels in other formats, you can create a valid file with these white-space-separated columns: (1) the chromosome, with or without a leading "chr" string, (2) the location before the beginning of the indel (in the case of a deletion, one position before the first base deleted), (3) any characters (this column is not used but at least one non-whitespace character must be there), and (4) the alleles, optionally followed by a colon (characters following the colon are not used but will be echoed in the returned file). Here are examples of valid alleles: -A, +T, -CG, +TGTT. The + (insertion) or - (deletion) is required as the first character, and there must be one or more bases following, indicating those inserted or deleted.
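If you are creating such a file from indels in another format, a quick check of each line may help. The Python sketch below follows the four-column description above; the function name parse_indel_line is hypothetical, and restricting the bases to A, C, G, T, and N is an assumption.

```python
import re

# "+" or "-", then one or more bases, then an optional colon and trailing text.
ALLELE = re.compile(r"^([+-])([ACGTNacgtn]+)(?::.*)?$")


def parse_indel_line(line):
    """Return (chromosome, position before the indel, kind, bases)."""
    cols = line.split()
    if len(cols) < 4:
        raise ValueError("expected at least 4 white-space-separated columns")
    chrom = cols[0][3:] if cols[0].lower().startswith("chr") else cols[0]
    match = ALLELE.match(cols[3])
    if match is None:
        raise ValueError(f"not a valid indel allele: {cols[3]!r}")
    kind = "insertion" if match.group(1) == "+" else "deletion"
    return chrom, int(cols[1]), kind, match.group(2).upper()
```

The third column is read but not interpreted, matching the note that it is unused but must be present.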

8. Indels can be submitted in VCF 4.0 format. As for SNVs, the VCF-required first line with the version number must be present, plus the column header line. If you use file header lines (e.g. "# fileName", see below), they must be placed at the beginning, before the VCF version line. Only the indel format for VCF Version 4.0 is supported, though higher versions will work if that format doesn't change. Files made for earlier versions will be rejected. A description of the 4.0 VCF format is here. The VCF file may contain genotypes for one or for multiple individuals (files with no genotypes are not yet accepted). The annotation is crude for complex indels: more than one alternative allele or indels for which the number of bases is greater than one in both the reference and alternate columns (see How Indels Are Annotated).

9. Both SNVs and indels can be submitted in VCF format. The "VCF SNVs and Indels (both)" option currently has the following restrictions: (1) the VCF version must be 4.0 or higher and (2) the output is only available for "VCF-like allele columns" for indels. The VCF file may contain genotypes for one or for multiple individuals, or it can contain no genotypes.

Often a file will contain genotypes for a single individual (exceptions being multi-individual VCF files). However, there is a second mode for a file, that will automatically be detected. If genotypes for multiple individuals are grouped following a separator line identifying the individual, the output file will list each SNV once with a list of genotypes for all the individuals. The input file for this mode must have separator lines beginning with the # character, followed by optional whitespace, followed by the key "individual", then required whitespace, then an individual ID (any string with no spaces). Optionally, there can also be whitespace and a classification string. (This mode is not available for VCF files or for GATK bed indel files.) The file would then look like this (after the optional header lines, documented below):
	# individual abc population1
	data lines
	# individual def population2
	data lines
	etc.
	
In this multiple-individual case, the individuals will be listed at the end of the result file.
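The separator-line rule can be expressed as a regular expression. The following Python sketch groups data lines by individual; it is an illustration, not the site's detection code. The function name split_by_individual is hypothetical, and the key "individual" is assumed to be matched exactly as written.

```python
import re

# "#", optional whitespace, the key "individual", required whitespace, an ID
# with no spaces, then optionally whitespace and a classification string.
SEPARATOR = re.compile(r"^#\s*individual\s+(\S+)(?:\s+(\S.*?))?\s*$")


def split_by_individual(lines):
    """Map each individual ID to its classification and its data lines."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        match = SEPARATOR.match(line)
        if match:
            current = match.group(1)
            groups[current] = {"classification": match.group(2), "lines": []}
        elif line and not line.startswith("#") and current is not None:
            groups[current]["lines"].append(line)
    return groups
```

Other header lines that begin with "#" do not match the pattern and are skipped rather than treated as data.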

Example input files are available via the "Download Example Input Files" link in the left panel of this page.

ANNOTATION OPTIONS

Most of the annotations are derived from the Genome Variation Server (GVS) database, which acquires information from dbSNP, UCSC, and several other sources. Details for the indel annotations are here.

The following annotations are always included in the output:

--- inDBSNPOrNot (whether the variation is in the dbSNP database)
--- chromosome (input from the user)
--- position (input from user, location on the chromosome, hg19, 1-based)
--- referenceBase (input from the user or calculated if not provided)
--- sampleGenotype (input from the user)
--- accession (NCBI or CCDS transcript identifier)
--- functionGVS (GVS class of variation function, using only hg19 and your submitted alleles; see description)
--- functionDBSNP (dbSNP class of variation function)
--- rsID (dbSNP identifier for the variation, 0 if not in dbSNP)
--- aminoAcids (list of amino acids for the codon, starting with that of the reference base, coding SNVs only)
--- proteinPosition (the position of the amino acid in the protein, beginning at the N-terminal with the first amino acid at position 1, followed by the total number of amino acids in the protein; the total includes a count for the stop codon)

There are several optional annotations:

--- Alleles Submitted (column sampleAlleles: same information as sampleGenotype, with the ambiguity code resolved into alleles; if there are multiple individuals, this is a list of all alleles present)
--- Genotype in dbSNP (column dbSNPGenotype: the genotype in dbSNP for a particular individual; N/N if not available, X/X if conflicting measurements; this column is available only if a numerical dbSNP individual ID is entered on the query form)
--- Alleles in dbSNP (column allelesDBSNP: list of alleles for all populations and individuals in the GVS database; if there are genotypes available, the list contains all alleles of those genotypes; if no genotypes are available, the alleles are those in the downloaded dbSNP files)
--- Conservation Score phastCons (column scorePhastCons: UCSC, 46 placental mammalian species, range of 0 to 1, with 1 being the most conserved)
--- Conservation Score GERP (column consScoreGERP: rejected-substitution score from the program GERP, Stanford University, range of -12.3 to 6.17, with 6.17 being the most conserved)
--- Chimp Allele (column chimpAllele: UCSC alignments)
--- Copy Number Variations (column CNV: Toronto database)
--- Genes (column geneList: HUGO names, any for which the transcription region overlaps the variation)
--- HapMap Frequencies (3 columns AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq: African, European, and Asian, in percent)
--- Has Genotypes (column hasGenotypes: whether GVS, and thus usually dbSNP, has genotypes available for the variation)
--- dbSNP Validation (column dbSNPValidation: dbSNP validation status codes, dealing with e.g. whether the variation has been seen at least twice)
--- Repeats (2 columns repeatMasker and tandemRepeat)
--- Grantham Score (column granthamScore: the Grantham score of any amino acid changes, as per Grantham (1974) Science, Table 2)
--- microRNAs (column microRNAs: EMBL IDs of any microRNAs in which the variation is found)
--- Protein Sequence (column proteinSequence: sequence of amino acids for the transcript if applicable)
--- PolyPhen Prediction (column polyPhen: amino acid substitution impacts)
--- Clinical Association (column clinicalAssociation: links to NCBI pages and PubMed)
--- Distance to Nearest Splice Site (column distanceToSplice: how close the variation is to a splice site)
--- cDNA Position (column cDNAPosition: the location within the sequence of coding bases; NA if not coding; so far only SNVs)
--- KEGG Pathways (column keggPathway: the Kyoto Encyclopedia of Genes and Genomes biological pathways for the gene or CCDS ID)
--- CpG Islands (column cpgIslands: whether in a region where CpGs are present at a high level, from the UCSC genome annotation database)
--- Transcription Factor Binding Sites (column tfbs: whether in a region of conserved sites, from the UCSC genome annotation database)
--- NHLBI ESP Allele Counts (column genomesESP: the allele counts observed in the Exome Sequencing Project, optionally split by two ancestries)
--- Protein-Protein Interactions (column PPI: experimental confidence scores from the STRING 9.0 database)

Note that sampleAlleles and allelesDBSNP are not expected to agree if there is one individual and that individual is homozygous, though the submitted allele will usually be contained in the allelesDBSNP list.

In the case of HapMap frequencies, there is a choice of the minor-allele frequency or the hg19 reference-allele frequency. The minor-allele frequency can usually be derived from the reference-allele frequency. As long as the SNV is diallelic, and the reference allele is represented in the HapMap genotypes, the minor-allele frequency in percent is the smaller of two values: (1) the reference frequency in percent, and (2) 100 - (reference frequency in percent).
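The conversion described above is a one-line calculation. A small Python sketch follows; the function name minor_allele_frequency is hypothetical, and the result is valid only under the stated conditions (a diallelic SNV with the reference allele represented in the HapMap genotypes).

```python
def minor_allele_frequency(reference_percent):
    """Minor-allele frequency in percent from the reference-allele frequency."""
    if not 0.0 <= reference_percent <= 100.0:
        raise ValueError("a frequency in percent must lie between 0 and 100")
    # The smaller of the reference frequency and its complement.
    return min(reference_percent, 100.0 - reference_percent)
```

For example, a reference-allele frequency of 80 percent corresponds to a minor-allele frequency of 20 percent.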

In the case of PolyPhen predictions, there is a choice of class (e.g. benign, possibly-damaging, probably-damaging) or score. The score is a number between 0 and 1, where 1 is the most damaging. This score echoes the PolyPhen-2 "pph2_prob" column.

In rare cases where two or more genes overlap the variation, the gene column will be a comma-separated list.

There are 3 choices for generating gene loci: (1) the full NCBI gene definitions (accessions beginning with NM and XM), (2) the Consensus CDS (CCDS) coding regions of 2012, and (3) NCBI and CCDS 2012. If "NCBI and CCDS 2012" is chosen, there will be lines in the result file for each NCBI accession number and each CCDS ID that overlaps the location. For any line in the output with an NCBI accession, functionDBSNP will be a list of dbSNP functions for that accession. For any line in the output with a CCDS ID, the accession column will be the CCDS ID, the functionGVS column will be the function calculated from the CCDS exon ranges (no untranslated classification), and functionDBSNP will be a list of dbSNP functions for every NCBI accession number. The aminoAcids column will reflect the CCDS codons, and the geneList column will contain only CCDS gene names.

Distance to the nearest splice site is straightforward for the NCBI gene model. For the CCDS case, where only coding regions are in the gene model (no untranslated regions), only splice sites between coding exons are considered. In some transcripts there are introns in the middle of an untranslated region. Splice sites for such introns are included only in the NCBI gene model. The nearest splice site is a function only of the variation location, and is not constrained to genes in the column geneList. (For example, the geneList column could contain one gene that's a single exon -- no splice sites -- and the distance to a splice site in a gene some distance away would be reported.) For files with both NCBI and CCDS gene models, the NCBI lines have distances to nearest NCBI sites and the CCDS lines consider only CCDS genes.

FILE HEADER LINES

For bookkeeping purposes, two parameters can be optionally specified: "headerLine" and/or "fileName". The added lines must start with # and a space. The first adds a line at the beginning of the returned file (preceded by # in the returned file), and the second adds the string to the file name of the returned file (spaces not recommended, punctuation ok if valid for a file name). If the fileName parameter is not present, the name of the returned file will include the name of your submitted file. Another optional parameter is "compress": if this is set to false, the returned file will be plain text. The default is true. (This no-compress option is not recommended for large jobs.) If your file is VCF formatted, these header lines must be at the beginning of the file.

Examples are
	# headerLine NAXXXXX data of 03/10
	# fileName NAXXXXX.highQualityCutoff
	# compress false
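If you generate input files with a script, the header lines can be written programmatically. The Python sketch below builds them in the form shown in the examples; the function name header_lines and its parameters are hypothetical and not part of any SeattleSeq tool.

```python
def header_lines(header_line=None, file_name=None, compress=None):
    """Build the optional file header lines; each starts with '#' and a space."""
    lines = []
    if header_line is not None:
        lines.append(f"# headerLine {header_line}")
    if file_name is not None:
        lines.append(f"# fileName {file_name}")
    if compress is not None:
        lines.append(f"# compress {'true' if compress else 'false'}")
    return lines
```

The returned lines would be written at the very beginning of the file, which for VCF input means before the VCF version line.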
	
OUTPUT FILE FORMATS

The default output file format is a header line (starting with "#") followed by tab-separated annotations. There is one line for each alternate transcript when more than one overlaps the variation location. At the end there are a number of lines (again starting with "#") with some input parameters, version, and various counts.

For two of the VCF-in choices there is the option of VCF-out format. In this case there is only one line for each variation. Annotations are appended to the INFO column. When there are multiple transcripts, the accession field is a comma-separated list of overlapping accessions. For the accession-dependent fields functionGVS, functionDBSNP, aminoAcids, proteinPosition, cDNAPosition, polyPhen, granthamScore, and proteinSequence, these are also lists in the same order as the accessions. However, if all values are the same, only one is given. For keggPathway and PPI (dependent on locus ID but not on multiple transcripts for that locus ID), all unique values are listed. Note that the geneList field is not accession-dependent; it is a list of all genes overlapping the variation location.
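The list-collapsing rule for accession-dependent fields can be sketched as follows, in Python. This illustrates the rule as described and is not the site's code; the function names collapse and expand are hypothetical, and individual values are assumed not to contain commas themselves.

```python
def collapse(values):
    """Join per-accession values with commas, or give one value if all agree."""
    values = [str(v) for v in values]
    if values and all(v == values[0] for v in values):
        return values[0]
    return ",".join(values)


def expand(field, n_accessions):
    """Recover one value per accession from a possibly collapsed field."""
    parts = field.split(",")
    if len(parts) == 1:
        return parts * n_accessions   # a single value applies to every accession
    if len(parts) != n_accessions:
        raise ValueError("field length does not match the number of accessions")
    return parts
```

When reading a VCF-out file, expand would be applied to each accession-dependent field using the number of entries in the accession field.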

PROCESSING TIME

This site is backed by a database of cached annotations for all locations in the human genome, which is used when the "NCBI full genes" model is chosen, and when SNV annotation is requested. For NCBI-only, about a million SNVs can be processed in an hour. (If the cache database should be unavailable, the site will revert to annotating without it; results will be the same, but will arrive more slowly.) Protein-protein interactions are not yet in the cache, so choosing them will slow the annotation moderately. For mixed SNVs and indels, the cache will be used for SNVs, but not indels.

MONITORING AND CANCELING, TABLE SHORTCUT

Once your file is submitted, there are links on the acknowledgment page for monitoring the job progress and canceling the job. On the progress page, once the job is completed and the page is refreshed, a "show table" link appears. This link is useful for short jobs. Clicking it brings up a table of annotation from the file that resides on our server. (You will still get an email.) This is a shortcut for downloading the file from the server, then resubmitting it in the "Input Annotation File for Table Display" section of the home page (described below). CAUTION: When using the shortcut, don't alternate between the shortcut mode and the "Input Annotation File for Table Display" for a second data set, without closing and restarting your browser; our code uses session attributes that are identified with a browser session, and keeping two data sets active at once does not work.

AUTOMATED SUBMISSION (optional, advanced usage)

There are two ways to automate the batch procedure.

(1) The submitted file must have a line at the beginning like
	# autoFile testAuto
	
where autoFile is the parameter, and the value is only used for file identification within our server (it does not need to be unique, as a time stamp is added). It's then necessary to write a screen-scraper program to submit your file, and to designate a local file for writing. An example using the Java httpunit library is given here.

(2) No autoFile line is required in the submitted file. The form of the home page has a hidden parameter "autoFile" with the default value of "none". The screen-scraper program will be similar to that of the first option, except that it must be able to set the hidden parameter to something besides "none" (not always easy). An example using the Java httpunit library is given here.

The default compression for automated jobs is false, though by adding a line "# compressAuto true" to the submitted file, and requesting compression in the screen-scraper code, gzipped files may be returned. This is recommended for large jobs.

Please don't submit more than one job at a time. If too many variations at a time are submitted, jobs may be aborted by our server.

If you wish to be notified of SeattleSeqAnnotation new builds or scheduled outages, visit this site:
https://mailman1.u.washington.edu/mailman/listinfo/gvsnotify
This is a moderated site, so approval of subscriptions will only be made on weekdays. The mailing list is set up one-way: no posting by subscribers.

If you need to cancel an automated job, look at the progress URL returned. The end of the URL will look like e.g. "&jobID=211281031438". Open a browser and enter a URL like "http://snp.gs.washington.edu/SeattleSeqAnnotation137/BatchCancelServlet?cancelFile=AnnotationCancel.211281031438.txt", replacing the jobID with your own. You should receive a page indicating that your job has been cancelled. You may need to stop your own submitting program if it does not detect the end of the job.
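For automated clients, the cancel URL can be derived from the progress URL. The Python sketch below follows the pattern shown above; the function name cancel_url is hypothetical, and the progress URL is assumed to carry the job as an ordinary "jobID" query parameter.

```python
from urllib.parse import parse_qs, urlparse

CANCEL_BASE = (
    "http://snp.gs.washington.edu/SeattleSeqAnnotation137/BatchCancelServlet"
)


def cancel_url(progress_url):
    """Build the cancel URL from the jobID found in a progress URL."""
    query = parse_qs(urlparse(progress_url).query)
    job_ids = query.get("jobID")
    if not job_ids:
        raise ValueError("no jobID parameter found in the progress URL")
    return f"{CANCEL_BASE}?cancelFile=AnnotationCancel.{job_ids[0]}.txt"
```

Opening the resulting URL in a browser corresponds to the manual step described above.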

Input Annotation File for Table Display
Submit a text annotation file ("original" SeattleSeqAnnotation format only) previously downloaded from this site (must be SeattleSeqAnnotation137, hg19/NCBI37). The columns will be displayed in a table. Some column choice and sorting is available.

If you have files with several hundred thousand lines, they may need to be split up before reading back in.

If a dbSNP individual was entered in the original query, genotype disagreements are indicated with a red background. If the "Show only SNPs with discrepant genotypes" box is checked, only those variations with discrepant genotypes will be displayed (the alleles submitted disagree with dbSNP, if available).

For missense SNVs, there is a "query ProPhylER" button in the protein sequence column. This appears when either the functionGVS or functionDBSNP string contains the text "missense". The protein sequence is sent to the ProPhylER website. Not all queries will provide useful protein structure/function information, as many protein alignments are still being manually curated. If you get a display, it will be necessary to move the shaded slider at the top to home in on the SNP position, and then to use the amino acids involved to study the predicted effect of the base change.

For missense or nonsense SNPs, the list of amino acids is a link that will bring up a window with several physicochemical properties of the amino acids (see Stone and Sidow, Genome Research 15, 978 (2005)).

When "Text" is chosen from the "Table/Text" menu, the resulting text can be copied and pasted to a text file, and imported to Excel as space-delimited. (Only variations and columns passing the various filters are displayed.)

table example

Input One SNV
Enter the NCBI-37/hg19 chromosome and location and two alleles for a SNV (no indels). All possible annotations will be returned in a table.

The alleles will be treated as in the custom input format (above) with no reference allele entered. If the reference allele is not the first or second allele, it will be added from the NCBI 37 human reference sequence, for function and protein calculations.

List of All Documentation Pages
About SeattleSeq Annotation

How To Use SeattleSeq Annotation (this page)

Build Notes

Calculations and Sources of Data

Download Example Files

How Indels Are Annotated

Terms of Service

 