I. NCBI Resources for Genetic Analysis

The LocusLink Project: http://www.ncbi.nlm.nih.gov/LocusLink/

LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. It presents information on official nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, MIM numbers, UniGene clusters, map information, and relevant web sites. The sequence data presented includes a new type, NCBI Reference Sequence (RefSeq) records, as well as a subset of GenBank accession numbers. The current scope is human genes.

A. RefSeq

1. About RefSeq: NCBI Reference Sequences

The NCBI Reference Sequence project (RefSeq) will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins. These standards provide a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery. Furthermore, the RefSeq-to-LocusLink associations anchor UniGene clusters and support annotation of genomic contig sequence data generated by the Human Genome Project.

RefSeq records are created via a process consisting of:

1. Establishing the correct gene name-to-accession number association

2. Identifying the full extent of available sequence data

3. Creating a new sequence record with a PROVISIONAL status

The provisional RefSeq records are then reviewed by a biologist who confirms the initial name-to-sequence association, adds information including a summary of gene function, and, more importantly, corrects, re-annotates, or extends the sequence data using data available in other GenBank records. Both provisional and reviewed RefSeq records are made publicly available via the NCBI Entrez retrieval system, BLAST databases, FTP, and the LocusLink web site.

2. What is a Reference Sequence?

The RefSeq database will be a non-redundant set of reference sequences, including constructed genomic contigs, mRNAs, proteins, and, in the future, entire chromosomes. RefSeq records are made available at two 'status' levels: provisional and reviewed. Reviewed records represent a compilation of our current knowledge of a gene and its transcripts. During the review process additional information is integrated, when available, such as sequence data, publications, nomenclature, and feature annotations from multiple GenBank records, the Human Gene Nomenclature Committee, and Online Mendelian Inheritance in Man.

3. How do I access RefSeq records?

RefSeq records can be accessed through several NCBI resources including:

BLAST

NM_###### records are in the nucleotide non-redundant database (nr)

NP_###### records are in the protein non-redundant database

Entrez

NM_###### records are in Entrez nucleotides

NP_###### records are in Entrez proteins.

Entrez Genomes Division

NC_###### records representing the complete genome or chromosomes of several organisms are presented on the Genomes pages.

FTP

Currently limited to NM_* and NP_* records; NT_* and NC_* records to be added in the future.

Human Genome Sequencing

NT_###### records for human contigs can be viewed graphically, downloaded, or accessed with BLAST queries only from the Human Genome Sequencing pages.

LocusLink

LocusLink records provide links to NM_###### and NP_###### records. LocusLink can be queried with the RefSeq accession number in addition to text terms.

4. Retrieving NM_ and NP_ RefSeq records with Entrez queries:

RefSeq records can be retrieved via different Entrez queries:

Sample Query (Q)

Result (R)

Q. NM_003988

R. One RefSeq record for PAX2, isoform c, is returned.

Q. PAX2[Gene Name]

R. This returns 17 records including 5 PAX2 RefSeq records.

Q. PAX2[Gene Name] AND srcdb_refseq[Properties]

R. This query retrieves only the set of 5 alternatively spliced PAX2 RefSeq records.

Q. srcdb_refseq[properties] AND provisional[all]

R. This query returns the set of all PROVISIONAL RefSeq records.

Q. srcdb_refseq[properties] NOT provisional[all]

R. This query returns the set of all REVIEWED RefSeq records.
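The sample queries above can also be composed programmatically. The following is a minimal Python sketch, not an NCBI tool: the helper names refseq_term and esearch_url are hypothetical, and the use of the NCBI E-utilities ESearch endpoint is an assumption that goes beyond this handout, which describes only typing queries into the Entrez web form. The sketch builds the request URL but does not send it.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint. The handout itself only
# describes entering these queries in the Entrez web interface.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_term(gene=None, status=None):
    """Compose an Entrez query limited to RefSeq records.

    gene   -- optional gene symbol, searched in the Gene Name field
    status -- None, "provisional", or "reviewed"
    """
    parts = []
    if gene:
        parts.append(f"{gene}[Gene Name]")
    parts.append("srcdb_refseq[Properties]")
    term = " AND ".join(parts)
    if status == "provisional":
        term += " AND provisional[all]"
    elif status == "reviewed":
        # The handout retrieves reviewed records by excluding provisional ones.
        term += " NOT provisional[all]"
    elif status is not None:
        raise ValueError("status must be None, 'provisional', or 'reviewed'")
    return term


def esearch_url(term, db="nucleotide"):
    """Build (but do not send) an ESearch request URL for the given term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    print(refseq_term("PAX2"))
    print(refseq_term(status="reviewed"))
    print(esearch_url(refseq_term("PAX2")))
```

Keeping the query composition separate from the URL construction means the same term strings can be pasted directly into the Entrez search box.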

5. Identifying NM_ and NP_ RefSeq records in BLAST results:

The distinct format of RefSeq accession numbers (they include an underscore) provides a quick indication that a BLAST result includes a RefSeq record.

                                                      Score     E
Sequences producing significant alignments:           (bits)  Value

ref|NM_000014.1|A2M| Homo sapiens alpha-2-ma...        9073    0.0
 ^      ^
 |      |
 |      RefSeq accession numbers have a distinct format
 |
"ref" indicates RefSeq database
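The underscore convention makes RefSeq hits easy to pick out of BLAST output with a script. The sketch below is illustrative only: the function name find_refseq and the regular expression are assumptions, and the prefix table covers only the four record types named in this handout (NM_, NP_, NT_, NC_).

```python
import re

# Prefix-to-molecule mapping taken from the record types listed in this handout.
REFSEQ_TYPES = {
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig",
    "NC": "complete genome or chromosome",
}

# Two uppercase letters, an underscore, digits, and an optional ".version".
REFSEQ_ACCESSION = re.compile(r"\b([A-Z]{2})_(\d+)(?:\.(\d+))?\b")


def find_refseq(line):
    """Return (accession, version, molecule type) for the first RefSeq
    accession found on a BLAST summary line, or None if there is none."""
    for match in REFSEQ_ACCESSION.finditer(line):
        prefix, digits, version = match.groups()
        if prefix in REFSEQ_TYPES:
            accession = f"{prefix}_{digits}"
            return (accession, int(version) if version else None, REFSEQ_TYPES[prefix])
    return None


if __name__ == "__main__":
    hit = "ref|NM_000014.1|A2M| Homo sapiens alpha-2-ma... 9073 0.0"
    print(find_refseq(hit))
```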

6. What is the difference between PROVISIONAL and REVIEWED RefSeq records?

a. PROVISIONAL Records:

"Provisional" RefSeq records have not been reviewed yet. They are generated by an automated process with some initial quality checking done to double check the validity of the 'name'-to-'sequence data' association we are presenting.

RefSeq records are only made when we have source sequence records annotated with complete coding regions. If multiple sequences from the same transcript are identified by local alignments, the longest is selected automatically for the provisional record.
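The selection rule in the paragraph above can be sketched in a few lines of Python. This is not NCBI's pipeline: the SourceRecord type, its field names, and the function pick_provisional_source are hypothetical, and the sketch assumes the candidate records have already been grouped as belonging to the same transcript.

```python
from dataclasses import dataclass


@dataclass
class SourceRecord:
    accession: str       # GenBank accession of the candidate record
    sequence: str        # nucleotide sequence
    complete_cds: bool   # True if annotated with a complete coding region


def pick_provisional_source(records):
    """Return the longest record with a complete coding region, or None
    if no candidate qualifies."""
    eligible = [r for r in records if r.complete_cds]
    if not eligible:
        return None
    return max(eligible, key=lambda r: len(r.sequence))


if __name__ == "__main__":
    # Placeholder accessions and sequences, for illustration only.
    candidates = [
        SourceRecord("EXAMPLE1", "ATGAAATAG", True),
        SourceRecord("EXAMPLE2", "ATGAAACCCGGGTAG", False),
        SourceRecord("EXAMPLE3", "ATGAAACCCTAG", True),
    ]
    print(pick_provisional_source(candidates).accession)
```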

A provisional record presents, for the most part, the annotation that was present on the source GenBank record used to create it. The main differences between the source GenBank record and the provisional RefSeq record include the addition of the following in the RefSeq entry: nomenclature (gene name and aliases), a stable LocusID number, the MIM number for the gene, and a statement in the Comment field that the entry is provisional.

b. REVIEWED Records:

Reviewed records have been manually processed by NCBI staff or collaborating groups to create a sequence record that is analogous to a 'review article.'

Some changes/enhancements in the reviewed record might include the addition or removal of DNA sequence data and feature annotations, the addition of summary information and publications, and the addition of other information, as appropriate.

When a record is reviewed, sequence data from more than one record may be merged together, as deemed appropriate, to construct a more complete mRNA record. Sequence data available in both genomic and mRNA records is used; we do not use EST sequence data. The review process frequently includes reading the primary literature to cross-check accuracy and determine if additional data concerning the extent of the UTR is available. Transcript variant records are only made after reviewing the literature or in collaboration with experts.

All sequences used to generate the sequence "assembly" are reported in the RefSeq record and in LocusLink. We also attempt to curate a list of other GenBank records that represent this gene. However, this list is not intended to be fully comprehensive; additional 'related' sequence information will always be available in the Entrez 'related sequences' (or 'neighbors') reports, BLAST search results, etc.

For examples of reviewed RefSeq records, see the following entries:

Gene Symbol LocusID Comments
AGL 178 Example of splice variant treatment. We make RefSeq records for splice variants for which the full length nature of the transcript is well documented and supported by experimental evidence. There is a greater emphasis on providing RefSeq records for cases where some of the transcript variation results in altered coding regions.
PAX2 5076 Example of splice variant treatment
MICA 4276 Note several references included; the record is analogous to a 'review article'. A single article is annotated in the Reference field of the source GenBank record.
GCKR 2646 Note the last line of the Comment field on the RefSeq record provides a 'completeness' indicator. If we determine during the review process that the 5' and/or 3' end of the mRNA is complete, then this information is provided on the RefSeq record.

7. How is the GenBank source sequence initially selected?

There are several factors used in selecting the source sequence first used to generate the PROVISIONAL mRNA RefSeq record, but quite often the source GenBank record used is selected primarily because it includes more complete UTR sequence data. We do strive to make reference sequences that maintain consistency with standards already in use by the mutation community.

Reference sequence records are not intended to represent the historical 'first sequenced' record (although for genes with very limited available sequence data they may at times do so). While the PROVISIONAL RefSeq records do temporarily represent a single GenBank source sequence, the REVIEWED RefSeq records are intended to represent the current state of knowledge as provided by the whole research community rather than by any one laboratory.

8. RefSeq NM_xxxxxx and GenBank AFxxxxxx appear to be duplicates. Will one be removed?

No, both records will continue to be available. RefSeq and GenBank are separate databases, and both databases are available in the Entrez nucleotides data set.

Provisional RefSeq records are usually quite similar to the source GenBank records from which they were drawn. However, when RefSeq records are reviewed by experts, additional sequence data, biological annotations, and references are often added. At that time, the original source GenBank record(s) and the corresponding RefSeq entry can be quite different -- the RefSeq entry can represent a combination of information from various labs, which are credited in the Comments and/or References field of the record.

The RefSeq database is designed to reduce duplication by selecting one representative sequence for each human locus, whereas GenBank is a repository of sequences that might contain numerous records for any given gene. The only duplicates in the RefSeq database will be naturally occurring splice variants. Entrez search results can be limited to RefSeq entries by searching for 'srcdb_refseq' in the Properties Field.

B. LocusLink

The LocusLink query searches for any word (or word stem) in a LocusLink report. The query results are returned as a summary alphabetized by Symbol (browse list). Any detailed LocusLink report page is then accessed by clicking on the LocusID number at the left. Color-coded icons support rapid jumps to related records in PubMed, OMIM, RefSeq, GenBank, UniGene, and Variation data (dbSNP).

1. Summary of Query options

Wild card: *
Field restriction: [chr] chromosome
Field restriction: [mim] MIM number
Booleans: explicit Boolean operators are not allowed, but AND is implied when multiple words are entered

2. Examples of LocusLink Queries

Query Result
A2M Browse list including A2M and IGHA2. Click on the LocusID value (2, 3494 respectively) to display each LocusLink Report.
A2* Browse list of loci with words beginning with "A2" in any data field
Macroglobulin Browse list of loci with the word "macroglobulin" in any data field.
2[chr] Browse list of loci on chromosome 2.
protein kinase 3[chr] Browse list of loci with records containing the word protein and the word kinase and a location on chromosome 3
12p1* Browse list of loci recorded as having the upper range somewhere in 12p13, 12p12, or 12p11. Warning: This is not a range search. Also note: Wild cards cannot be used in combination with field-restricted searches (like [mim], [chr])
4.1.2.13 Browse list of loci encoding enzymes with this EC number (ALDOA, ALDOB, and ALDOC).
103950[mim] Browse list containing only A2M, because 103950 is the MIM number for A2M.
12305* Browse list of loci with numbers beginning 12305 (in this case based on MIM numbers and a GDB id) in any data field. Note: Stemming does not work with field-restricted searches ([mim], [chr])
Hs.74561 Browse list containing only A2M, because Hs.74561 is the UniGene cluster number for A2M.
AF053356 Browse list of loci with sequences in this GenBank record. Note: accession data are still being integrated; some accession numbers may not yield a result.
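The query rules illustrated in the table can be checked before a query is submitted. The following is a hedged sketch, not part of LocusLink: the function check_locuslink_query is hypothetical, it recognizes only the two field restrictions listed in this handout ([chr] and [mim]), and it treats only uppercase AND, OR, and NOT as Boolean operators.

```python
import re

# Only the field restrictions named in this handout; others may exist.
FIELDS = {"chr": "chromosome", "mim": "MIM number"}
FIELD_TAG = re.compile(r"\[(\w+)\]")


def check_locuslink_query(query):
    """Return a list of problems with a LocusLink query (empty if none found)."""
    problems = []
    tags = [t.lower() for t in FIELD_TAG.findall(query)]
    for tag in tags:
        if tag not in FIELDS:
            problems.append(f"unknown field restriction [{tag}]")
    # Wild cards and field restrictions cannot be combined.
    if tags and "*" in query:
        problems.append("wild cards cannot be combined with field restrictions")
    # Multiple words are ANDed implicitly; explicit operators are not supported.
    words = FIELD_TAG.sub(" ", query).split()
    if any(w in ("AND", "OR", "NOT") for w in words):
        problems.append("explicit Boolean operators are not supported; AND is implied")
    return problems


if __name__ == "__main__":
    for q in ("protein kinase 3[chr]", "12305*[mim]", "protein AND kinase"):
        print(q, "->", check_locuslink_query(q) or "ok")
```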

3. LocusLink Report Page

Data Element Definition
Official Nomenclature The official symbol and name of a gene.
Interim Nomenclature If the official symbol and name have NOT been established, an interim symbol and name are provided.
LocusID A unique NCBI LocusID is assigned to each locus. LocusIDs are stable identifiers of a locus independent of symbols or other identifiers.
Alternate Symbols Alias symbols. These are compiled from previous nomenclature, the published literature, or sequence records.
Product The preferred gene product name. These names may be revised to reflect current usage.
Alias Alternative product names.
EC number Enzyme commission number(s).
Chromosome The chromosome(s) to which this locus is mapped.
Position The cytogenetic location.
OMIM The MIM number assigned to this gene product.
UniGene The UniGene cluster that represents this locus.
Phenotype The name of the disease that may result from variants at this locus; this is linked to the OMIM record for the disease
Links Links to other related WWW sites
Reference Sequences All RefSeq records created for a given locus. Multiple records are distinguished from each other by a unique locus abbreviation (the gene symbol appended by incremental lowercase alphabetic characters), and a brief descriptor of the transcript variant. This section provides data on (and links to): the GenBank source sequence, the nucleotide RefSeq record (nucleotide accession numbers have an 'NM_' prefix), and the protein RefSeq record (the 'NP_' prefix).
GenBank Sequences
Accession A subset of representative GenBank accession numbers linked to nucleotide and protein data. These accession numbers are initially derived from a variety of collaborations. These data are reviewed frequently and the accessions listed for a given locus may change over time. EST accession numbers are provided only if no other sequence data is available to represent the locus.
Type Molecule type for the nucleotide record.
m mRNA
g genomic DNA
e EST
u undetermined
Protein data Protein gi numbers link to the GenBank protein record. The presence of the blue button link indicates that structure data for related protein sequences is available.

4. How are UniGene links handled? Sometimes I don't see a link and sometimes I see more than one.

This is a temporary situation. The goal is to improve the correspondence between UniGene and LocusLink such that for any gene anchored with sequence data, there is a single UniGene cluster.

5. I know there are more accession numbers for a gene than what I see listed. Why?

The answer depends in part on whether the gene has a provisional or a reviewed RefSeq record, or no RefSeq record at all.

If there is no RefSeq record, each accession number is one suggested to represent the gene, but no tools have been used to identify other related sequences.

If the RefSeq record is in the provisional category, only records annotated with complete coding regions, which do not exceed a cutoff nucleotide mismatch level, are listed.

If reviewed, the sequences used to generate the sequence "assembly" are reported explicitly. Other GenBank records that represent this gene are also listed; however, this list is not intended to be comprehensive. Additional 'related' sequence information will always be available as Entrez 'related sequences' (or 'neighbors') reports, BLAST search results, etc.

C. UniGene

http://www.ncbi.nlm.nih.gov/UniGene/index.html

What is UniGene?

How are clusters created?

Searching UniGene

Things to be aware of when using UniGene

D. OMIM (Online Mendelian Inheritance in Man): http://www.ncbi.nlm.nih.gov/Omim/

An allelic variant is designated by the MIM number of its parent entry, followed by a decimal point and a unique 4-digit variant number. For example, allelic variants (mutations) at the factor IX (hemophilia B) locus are numbered 306900.0001 to 306900.0101. The beta-globin locus (HBB) is numbered 141900; sickle hemoglobin is numbered 141900.0243.
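The numbering scheme can be made concrete with a short parser. This is an illustrative sketch only: the function name parse_allelic_variant is hypothetical, and the assumption that the parent MIM number always has six digits is drawn from the examples in this handout.

```python
import re

# Parent MIM number (six digits, as in the examples above), a decimal
# point, then the four-digit variant number.
ALLELIC_VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")


def parse_allelic_variant(identifier):
    """Split an OMIM allelic variant number into (parent MIM number, variant).

    The MIM number is returned as a string and the variant as an integer.
    """
    match = ALLELIC_VARIANT.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an allelic variant number: {identifier!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    # Sickle hemoglobin at the beta-globin (HBB) locus.
    print(parse_allelic_variant("141900.0243"))
```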

An asterisk (*) before an entry number means that the phenotype determined by the gene at the given locus is separate from those represented by other asterisked entries and that the mode of inheritance of the phenotype has been proved (in the judgment of the authors and editors). In general, an attempt has been made to create only one asterisked entry per gene locus.

No asterisk before an entry number means that the mode of inheritance has not been proved, although suspected, or that the separateness of this locus from that of another entry is unclear.

A number symbol (#) before an entry number means that the phenotype can be caused by mutation in any of 2 or more genes. The #-labeled entries are considered useful for avoiding repetition of the same phenotypic information in several entries and necessary because it is often unknown which genetic type is referred to in a particular report.

E. ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

A graphical analysis tool which finds all open reading frames of a selectable minimum size

You may paste in your own sequence or an accession number

You may restrict the size of the sequence in which to search for ORFs

Identifies all open reading frames using the standard or alternative genetic codes.

Once the Orfind results have been returned you can:

Save the deduced amino acid sequence in various formats

Search the amino acid sequence against the sequence database using BLAST

Change the minimum cutoff size for ORFs and redraw

Display alternative initiation codons
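To show what an ORF search with a minimum size involves, here is a minimal Python sketch. It is not the algorithm used by ORF Finder: it assumes the standard genetic code only, treats ATG as the sole initiation codon, reports the first ATG before each in-frame stop, and measures the minimum size in nucleotides. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_len):
    """Yield (frame, start, end) for ATG-to-stop ORFs on one strand.

    start and end are 0-based offsets into seq; end is exclusive and
    includes the stop codon. min_len is measured in nucleotides.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    yield frame, start, i + 3
                start = None


def find_orfs(seq, min_len=100):
    """Search all six reading frames; return (strand, frame, start, end).

    For the "-" strand, offsets refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(s, min_len):
            results.append((strand, frame, start, end))
    return results


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG", min_len=9))
```

Raising min_len plays the same role as changing the minimum cutoff size and redrawing: shorter ORFs drop out of the result list.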

II. Other Genetic Analysis Tools

Prediction Tools

1. Promoter Prediction by Neural Network http://www.fruitfly.org/seq_tools/promoter.html

Your sequence should consist of one-letter nucleotides (A, C, G, T). The sequence should be in plain or FASTA format.

Select whether you want to use the NNPP version for prokaryotes or for eukaryotes (it defaults to eukaryotes).

You can choose whether to show predictions for the reverse strand as well as the forward strand.

You may also set the score cutoff (it defaults to 0.8). Potential promoters are assigned scores; only those that exceed the score cutoff are shown. The lower the score cutoff, the more potential promoters will be shown (see "Estimated Accuracy of Prediction" table for details). The score cutoff should be between 0 and 1.

The output of NNPP is a list of the 51-base (eukaryotes) or 46-base (prokaryotes) regions that the network judges most likely to be promoters. Because promoter elements may appear at different relative positions, the positional accuracy of promoter prediction is +/- 3bp including the transcription start.

The transcription start site (TSS) is shown as a larger capital letter.
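The effect of the score cutoff can be sketched as a simple filter. This is an illustration, not NNPP code: the function filter_predictions and the (position, score) tuple layout are assumptions, and the scores in the usage example are made up.

```python
def filter_predictions(predictions, cutoff=0.8):
    """Keep only (position, score) pairs whose score exceeds the cutoff.

    The cutoff must lie between 0 and 1; 0.8 mirrors the default
    described in this handout.
    """
    if not 0 <= cutoff <= 1:
        raise ValueError("score cutoff should be between 0 and 1")
    return [(pos, score) for pos, score in predictions if score > cutoff]


if __name__ == "__main__":
    # Hypothetical positions and scores, for illustration only.
    hits = [(120, 0.95), (480, 0.85), (910, 0.50)]
    print(filter_predictions(hits))        # default cutoff
    print(filter_predictions(hits, 0.4))   # lower cutoff keeps more hits
```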

2. Splice Site Prediction http://www.fruitfly.org/seq_tools/splice.html

Your sequence should consist of one-letter nucleotides (A, C, G, T).

The sequence should be in plain or FASTA format. (ex. V00574)

Select whether you want to use the neural network version for Human or for Drosophila melanogaster sequences. You can choose whether to show predictions for the reverse strand as well as the forward strand.

The output of the neural networks is a list of the 15-base and 41-base regions that the network judges most likely to be 5' and 3' splice sites, respectively. The junctions are indicated by a larger font size.

Genome Analysis

1. National Center for Genome Resources http://www.ncgr.org/

GeneX - a repository of gene expression data with an integrated toolset that will let researchers analyse mRNA expression data and facilitate comparison with other such data

HomologyDB

TreeBLAST - integrates sequence similarity searching, sequence alignment and tree analysis

R-Genes - investigatory subset focussing on plant disease resistance genes and their homologs in other organisms. Over 500 family members known.

HomologyDB - database to store homology information that will be the foundation of the previous two projects.

GSDB - Genome Sequence Database

Updated nightly

Data retrieval methods

BLAST (http://seqsim.ncgr.org/newBlast.html)

Java Sequence Viewer

GSDB Maestro - searches database based on any combination of 18 fields

SANBI Maestro - allows access to sequences associated with STACK

GSDB Ad Hoc Query Tool - Web-based form that will let users query using SQL statements. A free account and an understanding of the database schema are required

GSDB Excerpt - Allows users to extract exact portions of a GSDB sequence

FlatFile retrieval - individual file retrieval

MAR-Finder - searches for probable Matrix Attachment Regions (MAR)

PathDB - a general metabolic pathway database that will represent the current knowledge.

http://www.ncgr.org/software/pathdb/

The main data types represented by PathDB are compounds, reactions, enzymes and other metabolic proteins and pathways.

PathDB attempts to include very rich descriptions of the kinetic, thermodynamic and physico-chemical properties of pathway components.

All the data are categorized by taxonomy.

Live data updates

Simple queries can be made directly through the Web with a normal browser.

A Java software application connects directly to the database and provides a user-friendly means to make complex queries. This interface makes it possible to navigate the relations between the various entities stored. It also makes it possible to carry out set operations on multiple query results.

RNA Analysis

1. The RNA World at IMB—Jena http://www.imb-jena.de/RNA.html?

3D Structure database links

Sequences and secondary structure database links

Protocols

Software

Books

Meetings

2. Algorithms, thermodynamics and database for RNA secondary structure http://bioinfo.math.rpi.edu/~zukerm/rna/

Knowledge Base

1. GeneCards: Human Genes, Proteins and Disease http://bioinformatics.weizmann.ac.il/cards/index.html

GeneCards is a database of human genes, their products and their involvement in diseases.

Provides concise information about the functions of all human genes that have an approved symbol, as well as selected others.

The information presented here has been automatically extracted from various resources by scripts developed at the Weizmann Institute.

2. KEGG: Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/

Goals of the KEGG project

To computerize all aspects of cellular functions in terms of the pathway of interacting molecules or genes

To maintain gene catalogs for all organisms and link each gene product to a pathway component

To organize a database of all chemical compounds in the cell and link each compound to a pathway component

To develop computational technologies for pathway comparison, reconstruction and analysis

Useful Links

1. Smith-Waterman

EBI (European Bioinformatics Institute)

BLITZ - http://www2.ebi.ac.uk/bic_sw/ Compugen's Bic2's Smith & Waterman algorithm implementation for protein database searches

Scanps - http://www2.ebi.ac.uk/scanps/ Very fast implementation of the true Smith & Waterman algorithm for protein database searches

Ssearch3 - http://www2.ebi.ac.uk/ssearch3/ Generic implementation of the Smith & Waterman algorithm for protein database searches

2. BCM Search Launcher http://dot.imgen.bcm.tmc.edu:9331/

A central site to do many types of searches. A batch client is available.

General protein sequence/pattern searches

Species-Specific protein sequence searches

Nucleic acid sequence searches

Multiple sequence alignments

Pairwise sequence alignments

Gene feature searches

Sequence utilities

Protein secondary structure prediction

3. Alert Servers

Swiss Shop Alert Service http://www.expasy.ch/swiss-shop

The available services will automatically perform daily comparisons of your sequence of interest to the latest database updates. If there is a new sequence that has a high degree of similarity to your sequence of interest, you will automatically be sent an email informing you of this.