SeattleSeq Annotation 141
  

How to Use SeattleSeq Annotation
Input Variation List File for Annotation
Submit a plain-text file with a list of human SNVs (single nucleotide variations; also sometimes referred to as SNPs) or indels (small insertions or deletions) with NCBI-38/hg38 positions (1-based: first position on the chromosome is 1), and enter your e-mail address. Once the annotation calculations are complete, you'll receive an e-mail with a link to download the result file. Additional documentation is available via the links at the bottom of this page.

INPUT FILE REQUIREMENTS

Only chromosomes 1-22 and X and Y will be annotated. The variations should be grouped by chromosome in the file, for maximum speed. The maximum number of lines in the file is 1 million, and the maximum file size is 1 GB. Long indels (> 1000 bp) and structural variants (including the VCF <> alternate-allele notation) will be read in, but they will not be annotated. In the original (tabular) output format, long indels and structural variants are skipped; in VCF output, the input line is only echoed.
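The grouping and line-count requirements above can be checked before submission. The following is a minimal sketch, not part of the site; the function names are hypothetical, and it assumes you have already extracted the chromosome label of each data line.

```python
from itertools import groupby

MAX_LINES = 1_000_000  # line limit stated in the requirements above


def is_grouped_by_chromosome(chromosomes):
    """Return True if every chromosome label forms one contiguous run."""
    # groupby collapses consecutive repeats, so a label that shows up
    # in two separate runs appears twice in `runs`.
    runs = [label for label, _ in groupby(chromosomes)]
    return len(runs) == len(set(runs))


def within_line_limit(line_count):
    """Return True if the file does not exceed the documented line limit."""
    return line_count <= MAX_LINES
```

A file whose labels read 1, 1, 2, X passes the grouping check; one that reads 1, 2, 1 does not.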

There are 9 choices of input-file formats, all plain-text. (Plain-text files may be gzipped before submission, but their acceptance may depend on your browser and operating system.)

1. Both SNVs and indels can be submitted in VCF format. The "VCF SNVs and Indels (SNVs and/or indels)" option requires the VCF version to be 4.0 or higher. The VCF file may contain genotypes for one or for multiple individuals, or it can contain no genotypes.

2. For a Maq file (SNVs only), there must be at least 4 white-space-separated columns in the input file, in the order chromosome (chr1 or chr2 or ... or chr22 or chrX or chrY), genomic coordinate (human NCBI 38, 1-based), reference base, Maq base (using ambiguity codes). The "chr" in the first column is required; if it's not there, use the custom-format option. There can be any number of columns following these four, and they are ignored. There can also be an additional column at the beginning, the contents of which will be copied to the output file. If the additional column at the beginning is not present, the website will fill in an initial column with "dbSNP_" and the dbSNP create build if the variation is in our GVS (Genome Variation Server) database, based on dbSNP build 141.

3. For a GFF3 file (SNVs only), there must be at least 9 tab-separated columns in the input file. An example is
    chr1	il	SNP	14907	14907	0.0008	.	.	reference=A;genotype=A/G

	
The first column is the chromosome (initial "chr" is optional). The fourth column is treated as the SNV location (human NCBI 38, 1-based). The strand column is not used, and the strand is assumed to be "+". The string of attributes is assumed to be separated by semi-colons, and these attributes are searched for genotype and reference tags, as specified on the home page. The genotype can be two alleles separated by "/", or it can be an ambiguity code. If no reference tag is found, its value will be filled in from the NCBI 38 human reference sequence.
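To make the column usage concrete, here is a minimal sketch of how a GFF3 line like the example could be read. It is an illustration only, not the site's parser; the function name is hypothetical, and the tag names "reference" and "genotype" are taken from the example line (the actual tags are whatever is specified on the home page).

```python
def parse_gff3_snv(line, reference_tag="reference", genotype_tag="genotype"):
    """Extract chromosome, position, reference, and genotype from one GFF3 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 tab-separated columns")

    # Column 9 holds semicolon-separated key=value attributes.
    attributes = {}
    for item in fields[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()

    chromosome = fields[0]
    if chromosome.startswith("chr"):  # the initial "chr" is optional
        chromosome = chromosome[3:]

    return {
        "chromosome": chromosome,
        "position": int(fields[3]),  # fourth column, 1-based
        "reference": attributes.get(reference_tag),  # None if tag is absent
        "genotype": attributes.get(genotype_tag),
    }
```

A missing reference tag yields None here; on the site, the value would instead be filled in from the reference sequence.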

4. For a CASAVA file (SNVs only), there can be only one chromosome per file, and there must be at least 10 white-space-separated columns. The first (location), sixth (genotype), and tenth (reference allele) columns are used. The genotypes are assumed to be homozygous if there is one base in the sixth column, and heterozygous for content such as "AG". The column-header line beginning with "#" is ignored. In this case, there is no chromosome number inside the file, so the chromosome is extracted from the file name. Somewhere in the file name, separated from the rest of the name by "." characters, there must be a string "c" plus the number of the chromosome (c is case-sensitive, X and Y are not). Examples of valid names are: c6, c19.snp.txt, individualA.cX.txt, individualA.cy. If you wish to submit all chromosomes at once, here is a Perl script that will concatenate the information into a single file that can be submitted under custom-format, as described in the next paragraph. Note that the 1-million-line limit will still apply, so submitting all chromosomes in one file may not work for whole-genome studies (but the Perl script could be modified to group a limited number of chromosomes).
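The file-name rule for CASAVA files can be sketched as follows. This is an illustrative check under the rule as described above (a "."-delimited token made of lowercase "c" plus the chromosome); the function name is hypothetical, and it expects a bare file name without directory components.

```python
import re

# Lowercase "c" followed by 1-22 (range checked below) or X/Y in either case.
_TOKEN = re.compile(r"^c([1-9]\d?|[XxYy])$")


def chromosome_from_file_name(file_name):
    """Return the chromosome encoded in a CASAVA file name, or None."""
    for token in file_name.split("."):
        match = _TOKEN.match(token)
        if not match:
            continue
        label = match.group(1)
        if label.isdigit():
            if 1 <= int(label) <= 22:
                return label
        else:
            return label.upper()
    return None
```

With this sketch, the four valid names listed above resolve to 6, 19, X, and Y, while a name such as C6.txt yields None because the "c" is case-sensitive.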

5. For a custom-format file (SNVs only), the column numbers for the chromosome, location, reference base, and the two alleles of the genotype must be filled in. For the chromosome, the initial "chr" is optional. The genotype can be two alleles in separate columns, two alleles separated by "/", or an ambiguity code. In the latter two cases, the two allele columns should be set to the same value. If no reference base is in the file, its column should be set to 0, and the base will be filled in from the NCBI 38 human reference sequence. If both the allele columns are set to zero, the two alleles of the genotype will be set to the reference allele (this is a way to get limited annotation, such as the gene list, for a chromosome location).

6. The One Genotype Per Line format (SNVs only) is a white-space separated ASCII text format for handling one to many individuals in an ungrouped fashion. Indels are not supported for this format. Each line represents a single genotype for a single individual, and will result in at least one line of output. If multiple individuals have the same genotype, there must still be a separate line for each individual/genotype combination. Five (5) columns are required, conforming to the following arrangement:
  • Column 1: Individual label (no whitespace, free-form otherwise)
  • Column 2: Chromosome (chr leader text optional)
  • Column 3: Position on chromosome
  • Column 4: Reference base
  • Column 5: Genotype (either a single-character ambiguity code or single bases (A, C, G, T, N) separated by a slash e.g. A/G)

Any subsequent columns will be ignored and will not be repeated in the output. Input entry lines can be presented in any order, and lines for a given individual do not need to be grouped or sorted.

The individual column will be retained in output and listed as the first column. Output will be returned in chromosome-position-reference order, and cannot be used with the Input Annotation File for Table Display option of SeattleSeqAnnotation141.

A given annotation will be the same for the same combination of chromosome and position, regardless of individual. Predictions such as PolyPhen2 will report the most deleterious outcome for overlapping and/or inclusive results.

Sample input line:

IndA1-5	10	75871735   	C	Y
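As an illustration of the five-column layout, the sketch below parses one line and resolves the genotype into two alleles. It is not the site's code; the function name is hypothetical, and the ambiguity table assumes the standard two-base IUPAC codes (three-base codes are deliberately left out).

```python
BASES = {"A", "C", "G", "T", "N"}

# Assumed mapping: single bases are homozygous, two-base IUPAC codes heterozygous.
AMBIGUITY = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "N": ("N", "N"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def parse_genotype_line(line):
    """Parse one One-Genotype-Per-Line record; extra columns are ignored."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 white-space-separated columns")
    individual, chromosome, position, reference, genotype = fields[:5]

    if chromosome.startswith("chr"):  # chr leader text is optional
        chromosome = chromosome[3:]

    if "/" in genotype:
        alleles = tuple(genotype.upper().split("/"))
        if len(alleles) != 2 or any(a not in BASES for a in alleles):
            raise ValueError(f"unrecognized genotype: {genotype}")
    else:
        try:
            alleles = AMBIGUITY[genotype.upper()]
        except KeyError:
            raise ValueError(f"unrecognized ambiguity code: {genotype}") from None

    return {
        "individual": individual,
        "chromosome": chromosome,
        "position": int(position),
        "reference": reference,
        "alleles": alleles,
    }
```

For the sample input line above, the code Y resolves to the alleles C and T.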

7. The GATK bed format is only for simple indels, for which limited annotation is available. This choice is for files created by older versions of the GATK Indel Genotyper. If you have indels in other formats, you can create a valid file with these white-space-separated columns: (1) the chromosome, with or without a leading "chr" string, (2) the location before the beginning of the indel (in the case of a deletion, one position before the first base deleted), (3) any characters (this column is not used but at least one non-whitespace character must be there), and (4) the alleles, optionally followed by a colon (characters following the colon are not used but will be echoed in the returned file). Here are examples of valid alleles: -A, +T, -CG, +TGTT. The + (insertion) or - (deletion) is required as the first character, and there must be one or more bases following, indicating those inserted or deleted.
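The four-column layout described above can be validated with a short sketch like the one below. It is illustrative only; the function name is hypothetical, and restricting the bases to A, C, G, T (either case) is an assumption, since the text only says "one or more bases".

```python
import re

# Sign, one or more bases, then an optional colon and trailing text.
_ALLELE = re.compile(r"^([+-])([ACGTacgt]+)(?::(.*))?$")


def parse_indel_line(line):
    """Parse chromosome, location, unused column, and allele column."""
    fields = line.strip().split(None, 3)
    if len(fields) < 4:
        raise ValueError("expected 4 white-space-separated columns")
    chromosome, position, _unused, allele_field = fields

    match = _ALLELE.match(allele_field)
    if match is None:
        raise ValueError(f"invalid allele column: {allele_field}")
    sign, bases, trailing = match.groups()

    return {
        "chromosome": chromosome,
        "position": int(position),  # location before the start of the indel
        "kind": "insertion" if sign == "+" else "deletion",
        "bases": bases,
        "trailing": trailing,  # text after the colon, or None
    }
```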

Often a file will contain genotypes for a single individual (exceptions being multi-individual VCF files). However, there is a second mode, which is detected automatically. If genotypes for multiple individuals are grouped following a separator line identifying the individual, the output file will list each SNV once with a list of genotypes for all the individuals. The input file for this mode must have separator lines beginning with the # character, followed by optional whitespace, followed by the key "individual", then required whitespace, then an individual ID (any string with no spaces). Optionally, there can also be whitespace and a classification string. (This mode is not available for VCF files or for GATK bed indel files.) The file would then look like this (after the optional header lines, documented below):
	# individual abc population1
	data lines
	# individual def population2
	data lines
	etc.
	
In this multiple-individual case, the individuals will be listed at the end of the result file.
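The separator-line syntax can be expressed as a regular expression. The sketch below groups data lines by individual; it is an illustration rather than the site's detection logic, the function name is hypothetical, and it assumes each individual ID appears in only one separator line.

```python
import re

# "#", optional whitespace, the key "individual", required whitespace,
# an ID without spaces, then an optional classification string.
_SEPARATOR = re.compile(r"^#\s*individual\s+(\S+)(?:\s+(\S.*?))?\s*$")


def split_by_individual(lines):
    """Group data lines under the most recent '# individual' separator."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = _SEPARATOR.match(line)
        if match:
            current = match.group(1)
            groups[current] = {"classification": match.group(2), "data": []}
        elif current is not None and line.strip() and not line.startswith("#"):
            # Other "#" lines (for example, optional header lines) are skipped.
            groups[current]["data"].append(line)
    return groups
```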

Example input files are available via the "Download Example Input Files" link in the left panel of this page.

ANNOTATION OPTIONS

Many of the annotations are derived from the Genome Variation Server (GVS) database, which acquires information from dbSNP, UCSC, and several other sources. Details for the indel annotations are here.

The following annotations are always included in the output:

--- inDBSNPOrNot (whether the variation is in the dbSNP database)
--- chromosome (input from the user)
--- position (input from user, location on the chromosome, hg38, 1-based)
--- referenceBase (input from the user or calculated if not provided)
--- sampleGenotype (input from the user)
--- accession (NCBI transcript identifier)
--- functionGVS (GVS class of variation function, using only the hg38 reference allele and your submitted alleles; see description)
--- functionDBSNP (dbSNP class of variation function)
--- rsID (dbSNP identifier for the variation, 0 if not in dbSNP)
--- aminoAcids (list of amino acids for the codon, starting with that of the reference base, coding SNVs only)
--- proteinPosition (the position of the amino acid in the protein, beginning at the N-terminal with the first amino acid at position 1, followed by the total number of amino acids in the protein; the total includes a count for the stop codon)

There are several optional annotations:

--- Alleles Submitted (column sampleAlleles: same information as sampleGenotype, with the ambiguity code resolved into alleles; if there are multiple individuals, this is a list of all alleles present)
--- Genotype in dbSNP (column dbSNPGenotype: the genotype in dbSNP for a particular individual; N/N if not available, X/X if conflicting measurements; this column is available only if a numerical dbSNP individual ID is entered on the query form)
--- Alleles in dbSNP (column allelesDBSNP: list of alleles for all populations and individuals in dbSNP, derived from the HGVS notation)
--- Conservation Score GERP (column consScoreGERP: rejected-substitution score from the program GERP, Stanford University, range of -12.3 to 6.17, with 6.17 being the most conserved)
--- CADD C Score (column scoreCADD: phred-like Combined Annotation Dependent Depletion scores from Kircher et al., University of Washington, range 0 through 99)
--- Chimp Allele (column chimpAllele: UCSC alignments)
--- Genes (column geneList: HUGO names, any for which the transcription region overlaps the variation)
--- HapMap Frequencies (3 columns AfricanHapMapFreq, EuropeanHapMapFreq, AsianHapMapFreq: African, European, and Asian, in percent)
--- Has Genotypes (column hasGenotypes: whether dbSNP has genotypes available for the variation)
--- dbSNP Validation (column dbSNPValidation: dbSNP validation status codes, dealing with e.g. whether the variation has been seen at least twice)
--- Repeats (2 columns repeatMasker and tandemRepeat)
--- Protein Accession (column proteinAccession: NCBI protein accession ID)
--- cDNA Position (column cDNAPosition: the location within the sequence of coding bases; NA if not coding; so far only SNVs)
--- PolyPhen (column polyPhen: amino acid substitution impacts, PolyPhen-2)
--- Clinical Association (column clinicalAssociation: links to NCBI pages and PubMed)
--- Distance to Nearest Splice Site (column distanceToSplice: how close the variation is to a donor or acceptor splice site)
--- microRNAs (column microRNAs: EMBL IDs of any microRNAs in which the variation is found)
--- Grantham Score (column granthamScore: the Grantham score of any amino acid changes, as per Grantham (1974) Science, Table 2)
--- KEGG Pathways (column keggPathway: the Kyoto Encyclopedia of Genes and Genomes biological pathways for the gene)
--- CpG Islands (column cpgIslands: whether in a region where CpGs are present at a high level, from the UCSC genome annotation database)
--- NHLBI ESP Allele Counts (column genomesESP: the allele counts observed in the Exome Sequencing Project, optionally split by two ancestries)
--- ExAC Allele Counts (column genomesExAC: the allele counts of the Exome Aggregation Consortium, optionally split by 7 populations)
--- Protein-Protein Interactions (column PPI: experimental confidence scores from the STRING 9.1 database)

Note that sampleAlleles and allelesDBSNP are not expected to agree if there is one individual and that individual is homozygous, though the submitted allele will usually be contained in the allelesDBSNP list.

In the case of HapMap frequencies, there is a choice of the minor-allele frequency or the hg38 reference-allele frequency. The minor-allele frequency can usually be derived from the reference-allele frequency. As long as the SNV is diallelic, and the reference allele is represented in the HapMap genotypes, the minor-allele frequency in percent is the smaller of two values: (1) the reference frequency in percent, and (2) 100 - (reference frequency in percent).
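The derivation above amounts to a one-line calculation. A minimal sketch, assuming a diallelic SNV whose reference allele is represented in the HapMap genotypes (the function name is hypothetical):

```python
def minor_allele_frequency_percent(reference_frequency_percent):
    """Return the minor-allele frequency, in percent, for a diallelic SNV."""
    if not 0 <= reference_frequency_percent <= 100:
        raise ValueError("frequency must be between 0 and 100 percent")
    # The minor allele is whichever of the two alleles is less frequent.
    return min(reference_frequency_percent, 100 - reference_frequency_percent)
```

For example, a reference-allele frequency of 80 percent gives a minor-allele frequency of 20 percent, while a reference-allele frequency of 30 percent is itself the minor-allele frequency.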

In the case of PolyPhen predictions, there is a choice of class (e.g. benign, possibly-damaging, probably-damaging) or score. The score is a number between 0 and 1, where 1 is the most damaging. This score echoes the PolyPhen-2 "pph2_prob" column.

In rare cases where two or more genes overlap the variation, the gene column will be a comma-separated list.

For tabular output, there will be lines in the result file for each NCBI accession number that overlaps the location. For any line in the output with an NCBI accession, functionDBSNP will be a list of dbSNP functions for that accession.

The nearest splice site is a function only of the variation location, and is not constrained to genes in the column geneList. (For example, the geneList column could contain one gene that's a single exon -- no splice sites -- and the distance to a splice site in a gene some distance away would be reported.) In some transcripts there are introns in the middle of an untranslated region, and these are included.

FILE HEADER LINES

For bookkeeping purposes, two parameters can be optionally specified: "headerLine" and/or "fileName". The added lines must start with # and a space. The first adds a line at the beginning of the returned file (preceded by # in the returned file), and the second adds the string to the file name of the returned file (spaces not recommended, punctuation ok if valid for a file name). If the fileName parameter is not present, the name of the returned file will include the name of your submitted file. Another optional parameter is "compress": if this is set to false, the returned file will be plain text. The default is true. (This no-compress option is not recommended for large jobs.) If your file is VCF formatted, these header lines must be at the beginning of the file.

Examples are
	# headerLine NAXXXXX data of 03/10
	# fileName NAXXXXX.highQualityCutoff
	# compress false
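If input files are generated by a script, the header lines can be prepended programmatically. The helper below is a hypothetical sketch that follows the "# key value" layout of the examples above; it only builds text and does not contact the site.

```python
def add_header_lines(body, header_line=None, file_name=None, compress=None):
    """Prepend optional '# headerLine', '# fileName', and '# compress' lines."""
    lines = []
    if header_line is not None:
        lines.append(f"# headerLine {header_line}")
    if file_name is not None:
        lines.append(f"# fileName {file_name}")
    if compress is not None:
        # The site expects the literal words true or false.
        lines.append(f"# compress {'true' if compress else 'false'}")
    return "".join(line + "\n" for line in lines) + body
```

Passing only the parameters you need leaves the others at the site's defaults, since no line is written for them.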
	
OUTPUT FILE FORMATS

The default output file format is a header line (starting with "#") followed by tab-separated annotations. There is one line for each alternate transcript when more than one overlaps the variation location. At the end there are a number of lines (again starting with "#") with some input parameters, version, and various counts.

For the VCF-in choice there is the option of VCF-out format. In this case there is only one line for each variation. Annotations are appended to the INFO column. When there are multiple transcripts, the accession field is a comma-separated list of overlapping accessions. For the accession-dependent fields functionGVS, functionDBSNP, aminoAcids, proteinPosition, cDNAPosition, polyPhen, granthamScore, and proteinAccession, these are also lists in the same order as the accessions. However, if all values are the same, only one is given. For keggPathway and PPI (dependent on locus ID but not on multiple transcripts for that locus ID), all unique values are listed. Note that the geneList field is not accession-dependent; it is a list of all genes overlapping the variation location.

PROCESSING TIME

This site is backed by a database with cached SNV annotations for all locations in the human genome. About a million SNVs can be processed in an hour. (If the cache database should be unavailable, the site will revert to annotating without it; results will be the same, but will arrive more slowly.) Indel annotation is also derived from this cache, but the processing time will be several times longer.

MONITORING AND CANCELING, TABLE SHORTCUT

Once your file is submitted, there are links on the acknowledgment page for monitoring the job progress and canceling the job. On the progress page, once the job is completed and the page is refreshed, a "show table" link appears. This link is useful for short jobs. Clicking it brings up a table of annotation from the file that resides on our server. (You will still get an email.) This is a shortcut for downloading the file from the server, then resubmitting it in the "Input Annotation File for Table Display" section of the home page (described below). CAUTION: When using the shortcut, don't alternate between the shortcut mode and the "Input Annotation File for Table Display" for a second data set, without closing and restarting your browser; our code uses session attributes that are identified with a browser session, and keeping two data sets active at once does not work.

AUTOMATED SUBMISSION (optional, advanced usage)

A screen-scraper program can be written to submit your file, and to designate a local file for writing. An example using the Java httpunit library is given here.

The form on the home page has a hidden parameter "autoFile" with the default value of "none". The screen-scraper program must be able to set the hidden parameter to something besides "none", and it must list the additional columns desired.

The default compression for automated jobs is false, though by setting another hidden parameter "compressAuto" to "true", and reading the returned file as binary in the screen-scraper code, a gzipped file may be acquired. This is recommended for large jobs. (A "# compress" line is not required in the submitted file.)

Please don't submit more than one job at a time. If too many variations at a time are submitted, or if the submitted file is too large, the job may be aborted by our server. If the download or progress URLs returned by the screen-scraper code contain the string "ABORT", there is a problem.

If you wish to be notified of SeattleSeqAnnotation new builds or scheduled outages, visit this site:
https://mailman1.u.washington.edu/mailman/listinfo/gvsnotify
This is a moderated site, so approval of subscriptions will only be made on weekdays. The mailing list is set up one-way: no posting by subscribers.

If you need to cancel an automated job, look at the progress URL returned. The end of the URL will look like e.g. "&jobID=211281031438". Open a browser and enter a URL like "http://snp.gs.washington.edu/SeattleSeqAnnotation141/BatchCancelServlet?cancelFile=AnnotationCancel.211281031438.txt", replacing the jobID with your own. You should receive a page indicating that your job has been cancelled. You may need to stop your own submitting program if it does not detect the end of the job.
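The cancel URL can also be derived in code from the progress URL. This is a minimal sketch: the function name is hypothetical, and it assumes only what is stated above, namely that the progress URL carries a numeric jobID parameter and that the cancel URL follows the pattern shown.

```python
import re

CANCEL_BASE = (
    "http://snp.gs.washington.edu/SeattleSeqAnnotation141/BatchCancelServlet"
)


def cancel_url_from_progress_url(progress_url):
    """Build the cancel URL for the job named in a progress URL."""
    match = re.search(r"[?&]jobID=(\d+)", progress_url)
    if match is None:
        raise ValueError("no jobID parameter found in the progress URL")
    job_id = match.group(1)
    return f"{CANCEL_BASE}?cancelFile=AnnotationCancel.{job_id}.txt"
```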

SUBMITTING FILES LIFTED OVER FROM A PREVIOUS GENOME BUILD

If you lift your variant locations from an hg19 or earlier file, you should remove any duplicate locations, as results for duplicates are unpredictable. Updating the reference allele is also required.

Input Annotation File for Table Display
Submit a text annotation file ("original" SeattleSeqAnnotation format only) previously downloaded from this site (must be SeattleSeqAnnotation141, hg38/NCBI38). The columns will be displayed in a table. Some column choice and sorting is available.

If you have files with several hundred thousand lines, they may need to be split up before reading back in.

If a dbSNP individual was entered in the original query, genotype disagreements are indicated with a red background. If the "Show only SNPs with discrepant genotypes" box is checked, only those variations with discrepant genotypes will be displayed (that is, those where the alleles submitted disagree with dbSNP, if available).

For missense or stop SNVs, the list of amino acids is a link that will bring up a window with several physicochemical properties of the amino acids (see Stone and Sidow, Genome Research 15, 978 (2005)).

When "Text" is chosen from the "Table/Text" menu, the resulting text can be copied and pasted to a text file, and imported to Excel as space-delimited. (Only variations and columns passing the various filters are displayed.)

table example

Input One SNV
Enter the NCBI-38/hg38 chromosome and location and two alleles for a SNV (no indels). Annotations will be returned in a table.

The alleles will be treated as in the custom input format (above) with no reference allele entered. If the reference allele is not the first or second allele, it will be added from the NCBI 38 human reference sequence, for function and protein calculations. Annotation options are those of the defaults on the home page (e.g. score for PolyPhen).

Input One Indel
Entries will be treated as in the VCF format. Enter the NCBI-38/hg38 chromosome, the location before the insertion or deletion, and two allele strings for an indel (no SNVs). Annotations will be returned in a table.

For both allele strings, the first allele must be that of the reference sequence at the location one base before. For a simple deletion, the reference-allele string must have length greater than 1 (the reference allele followed by the deleted alleles), and the alternate-allele string must have length 1 (and be the reference allele). For a simple insertion, the reference-allele string must have length 1 (and be the reference allele), and the alternate-allele string must have length greater than 1 (the reference allele followed by the inserted alleles). If both allele strings have length greater than 1, the indel will be treated as complex (only crude annotation available). If the reference allele one base before the indel is not at the beginning of both allele strings, the request will fail. All alleles must be in the set ACGTacgt, and each string must have a length of 40 or less. Annotation options are those of the defaults on the home page (e.g. no ExAC population split).
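The rules for the two allele strings can be summarized in a short pre-check. The sketch below is illustrative and the function name is hypothetical; it cannot confirm that the shared first base matches the reference sequence, and comparing the first bases case-insensitively is an assumption.

```python
VALID_BASES = set("ACGTacgt")
MAX_LENGTH = 40


def classify_indel(reference_allele, alternate_allele):
    """Return 'deletion', 'insertion', or 'complex' for a pair of allele strings."""
    for allele in (reference_allele, alternate_allele):
        if not allele or len(allele) > MAX_LENGTH:
            raise ValueError("each allele string must have 1 to 40 characters")
        if set(allele) - VALID_BASES:
            raise ValueError("alleles must be in the set ACGTacgt")

    # Both strings must start with the base one position before the indel.
    if reference_allele[0].upper() != alternate_allele[0].upper():
        raise ValueError("both allele strings must begin with the same base")

    if len(reference_allele) > 1 and len(alternate_allele) == 1:
        return "deletion"
    if len(reference_allele) == 1 and len(alternate_allele) > 1:
        return "insertion"
    if len(reference_allele) > 1 and len(alternate_allele) > 1:
        return "complex"
    raise ValueError("both strings have length 1, so this is not an indel")
```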

List of All Documentation Pages
About SeattleSeq Annotation

How To Use SeattleSeq Annotation (this page)

Build Notes

Calculations and Sources of Data

Download Example Files

How Indels Are Annotated

Terms of Service

 