LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. It presents information on official nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, MIM numbers, UniGene clusters, map information, and relevant web sites. The sequence data presented includes a new type, NCBI Reference Sequence (RefSeq) records, as well as a subset of GenBank accession numbers. The current scope is human genes.
The NCBI Reference Sequence project (RefSeq) will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins. These reference sequences provide a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery. Furthermore, the RefSeq-to-LocusLink associations anchor UniGene clusters and support annotation of genomic contig sequence data generated by the Human Genome Project.
RefSeq records are created via a process consisting of:
1. Establishing the correct gene name-to-accession number association
2. Identifying the full extent of available sequence data
3. Creating a new sequence record with a PROVISIONAL status
The provisional RefSeq records are then reviewed by a biologist who confirms the initial name-to-sequence association, adds information including a summary of gene function, and, more importantly, corrects, re-annotates, or extends the sequence data using data available in other GenBank records. Both provisional and reviewed RefSeq records are made publicly available via the NCBI Entrez retrieval system, BLAST databases, FTP, and the LocusLink web site.
The RefSeq database will be a non-redundant set of reference sequences, including constructed genomic contigs, mRNAs, proteins, and, in the future, entire chromosomes. RefSeq records are made available at two 'status' levels: provisional and reviewed. Reviewed records represent a compilation of our current knowledge of a gene and its transcripts. During the review process, additional information is integrated when available, such as sequence data, publications, nomenclature, and feature annotations from multiple GenBank records, the Human Gene Nomenclature Committee, and Online Mendelian Inheritance in Man.
RefSeq records can be accessed through several NCBI resources including:
BLAST
NM_###### records are in the nucleotide non-redundant database (nr)
NP_###### records are in the protein non-redundant database
Entrez
NM_###### records are in Entrez nucleotides
NP_###### records are in Entrez proteins.
Entrez Genomes Division
NC_###### records representing the complete genome or chromosomes of several organisms are presented on the Genomes pages.
FTP
Currently limited to NM_* and NP_* records; NT_* and NC_* records to be added in the future.
Human Genome Sequencing
NT_###### records for human contigs can be viewed graphically, downloaded, or accessed with BLAST queries only from the Human Genome Sequencing pages.
LocusLink
LocusLink records provide links to NM_###### and NP_###### records. LocusLink can be queried with the RefSeq accession number in addition to text terms.
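The accession prefixes listed above can be told apart mechanically. The following is a minimal Python sketch, not an NCBI tool: the function name classify_refseq is hypothetical, and the pattern assumes the six-digit NM_###### form used in this text with an optional ".version" suffix.

```python
import re

# Prefix meanings as described in the text above.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig",
    "NC": "complete genome or chromosome",
}

# Assumption: prefix, underscore, six digits (the NM_###### pattern used in
# this text), optionally followed by ".version".
ACCESSION_RE = re.compile(r"^(NM|NP|NT|NC)_(\d{6})(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None  # not in the RefSeq format assumed here
    return REFSEQ_PREFIXES[match.group(1)]


if __name__ == "__main__":
    for acc in ("NM_003988", "NM_000014.1", "AF053356"):
        print(acc, "->", classify_refseq(acc))
```

An accession without the underscore, such as the GenBank accession AF053356, is reported as None, which mirrors the observation that the underscore is what marks a RefSeq record.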
RefSeq records can be retrieved via different Entrez queries:
Sample Query (Q)
Result (R)
Q. NM_003988
R. One RefSeq record for PAX2, isoform c, is returned.
Q. PAX2[Gene Name]
R. This returns 17 records including 5 PAX2 RefSeq records.
Q. PAX2[Gene Name] AND srcdb_refseq[Properties]
R. This query retrieves only the set of 5 alternately spliced PAX2 RefSeq records.
Q. srcdb_refseq[properties] AND provisional[all]
R. This query returns the set of all PROVISIONAL RefSeq records.
Q. srcdb_refseq[properties] NOT provisional[all]
R. This query returns the set of all REVIEWED RefSeq records.
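The sample queries above follow a regular pattern, so they can be assembled programmatically. The sketch below only builds the query strings; it does not contact Entrez. The function name refseq_query and its parameters are hypothetical, the field tag is written as [Properties] throughout, and combining a gene name with a status filter is an assumption that goes beyond the samples shown.

```python
def refseq_query(gene_name=None, status=None):
    """Build an Entrez query string limited to RefSeq records.

    status may be None, "provisional", or "reviewed".
    """
    terms = []
    if gene_name:
        terms.append(f"{gene_name}[Gene Name]")
    terms.append("srcdb_refseq[Properties]")
    query = " AND ".join(terms)

    if status == "provisional":
        query += " AND provisional[all]"
    elif status == "reviewed":
        # Reviewed records are selected by excluding provisional ones.
        query += " NOT provisional[all]"
    elif status is not None:
        raise ValueError("status must be None, 'provisional', or 'reviewed'")
    return query


if __name__ == "__main__":
    print(refseq_query("PAX2"))
    print(refseq_query(status="provisional"))
    print(refseq_query(status="reviewed"))
```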
The distinct format of RefSeq accession numbers (they include an underscore) provides a quick indication that a BLAST result includes a RefSeq record.
                                                        Score     E
Sequences producing significant alignments:             (bits)  Value
ref|NM_000014.1|A2M| Homo sapiens alpha-2-ma...          9073    0.0
 ^       ^
 |       |
 |       RefSeq accession numbers have a distinct format
 |
"ref" indicates RefSeq database
a. PROVISIONAL Records:
"Provisional" RefSeq records have not been reviewed yet. They are generated by an automated process with some initial quality checking done to double check the validity of the 'name'-to-'sequence data' association we are presenting.
RefSeq records are only made when we have source sequence records annotated with complete coding regions. If multiple sequences from the same transcript are identified by local alignments, the longest is selected automatically for the provisional record.
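The selection rule just described (only sources with a complete coding region qualify, and the longest one wins) can be illustrated with a short Python sketch. This is not the NCBI pipeline: the Candidate class, the function name, and the accessions in the usage example are hypothetical, and the alignment step that groups sequences by transcript is assumed to have already happened.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    accession: str       # hypothetical identifier of a source record
    length: int          # sequence length in nucleotides
    complete_cds: bool   # True if annotated with a complete coding region


def pick_provisional_source(candidates):
    """Return the longest candidate with a complete CDS, or None."""
    eligible = [c for c in candidates if c.complete_cds]
    if not eligible:
        return None  # no record is made without a complete coding region
    return max(eligible, key=lambda c: c.length)


if __name__ == "__main__":
    group = [
        Candidate("X00001", 2100, True),
        Candidate("X00002", 3400, False),  # longest, but CDS incomplete
        Candidate("X00003", 2950, True),
    ]
    print(pick_provisional_source(group).accession)
```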
A provisional record presents, for the most part, the annotation that was present on the source GenBank record used to create it. The main differences between the source GenBank record and the provisional RefSeq record include the addition of the following in the RefSeq entry: nomenclature (gene name and aliases), a stable LocusID number, the MIM number for the gene, and a statement in the Comment field that the entry is provisional.
b. REVIEWED Records:
Reviewed records have been manually processed by NCBI staff or collaborating groups to create a sequence record that is analogous to a 'review article.'
Some changes/enhancements in the reviewed record might include the addition or removal of DNA sequence data and feature annotations, the addition of summary information and publications, and the addition of other information, as appropriate.
When a record is reviewed, sequence data from more than one record may be merged together, as deemed appropriate, to construct a more complete mRNA record. Sequence data available in both genomic and mRNA records is used; we do not use EST sequence data. The review process frequently includes reading the primary literature to cross-check accuracy and determine if additional data concerning the extent of the UTR is available. Transcript variant records are only made after reviewing the literature or in collaboration with experts.
All sequences used to generate the sequence "assembly" are reported in the RefSeq record and in LocusLink. We also attempt to curate a list of other GenBank records that represent this gene. However, this list is not intended to be fully comprehensive; additional 'related' sequence information will always be available in the Entrez 'related sequences' (or 'neighbors') reports, BLAST search results, etc.
For examples of reviewed RefSeq records, see the following entries:
Gene Symbol | LocusID | Comments
AGL | 178 | Example of splice variant treatment. We make RefSeq records for splice variants for which the full length nature of the transcript is well documented and supported by experimental evidence. There is a greater emphasis on providing RefSeq records for cases where some of the transcript variation results in altered coding regions.
PAX2 | 5076 | Example of splice variant treatment
MICA | 4276 | Note several references included; the record is analogous to a 'review article'. A single article is annotated in the Reference field of the source GenBank record.
GCKR | 2646 | Note the last line of the Comment field on the RefSeq record provides a 'completeness' indicator. If we determine during the review process that the 5' and/or 3' end of the mRNA is complete, then this information is provided on the RefSeq record.
There are several factors used in selecting the source sequence first used to generate the PROVISIONAL mRNA RefSeq record, but quite often the source GenBank record used is selected primarily because it includes more complete UTR sequence data. We do strive to make reference sequences that maintain consistency with standards already in use by the mutation community.
Reference sequence records are not intended to represent the historical 'first sequenced' record (although for genes with very limited available sequence data they may at times do so). While the PROVISIONAL RefSeq records do temporarily represent a single GenBank source sequence, the REVIEWED RefSeq records are intended to represent the current state of knowledge as provided by the whole research community rather than by any one laboratory.
No, both records will continue to be available. RefSeq and GenBank are separate databases, and both databases are available in the Entrez nucleotides data set.
Provisional RefSeq records are usually quite similar to the source GenBank records from which they were drawn. However, when RefSeq records are reviewed by experts, additional sequence data, biological annotations, and references are often added. At that time, the original source GenBank record(s) and the corresponding RefSeq entry can be quite different -- the RefSeq entry can represent a combination of information from various labs, which are credited in the Comments and/or References field of the record.
The RefSeq database is designed to reduce duplication by selecting one representative sequence for each human locus, whereas GenBank is a repository of sequences that might contain numerous records for any given gene. The only duplicates in the RefSeq database will be naturally occurring splice variants. Entrez search results can be limited to RefSeq entries by searching for 'srcdb_refseq' in the Properties Field.
The LocusLink query searches for any word (or word stem) in a LocusLink report. The query results are returned as a summary alphabetized by Symbol (browse list). Any detailed LocusLink report page is then accessed by clicking on the LocusID number at the left. Color-coded icons support rapid jumps to related records in PubMed, OMIM, RefSeq, GenBank, UniGene, and Variation data (dbSNP).
Wild card: *
Field restriction: [chr] (chromosome), [mim] (MIM number)
Booleans: explicit Boolean operators are not supported; AND is implied between multiple words.
Query
Result
A2M
Browse list including A2M and IGHA2. Click on the LocusID value (2, 3494 respectively) to display each LocusLink Report.
A2*
Browse list of loci with words beginning with "A2" in any data field
Macroglobulin
Browse list of loci with the word "macroglobulin" in any data field.
2[chr]
Browse list of loci on chromosome 2.
protein kinase 3[chr]
Browse list of loci with records containing the word protein and the word kinase and a location on chromosome 3
12p1*
Browse list of loci recorded as having the upper range somewhere in 12p13, 12p12, or 12p11. Warning: This is not a range search. Also note: Wild cards cannot be used in combination with field-restricted searches (such as [mim] or [chr]).
4.1.2.13
Browse list of loci encoding enzymes with this EC number (ALDOA, ALDOB, and ALDOC).
103950[mim]
Browse list containing only A2M, because 103950 is the MIM number for A2M.
12305*
Browse list of loci with numbers beginning 12305 (in this case based on MIM numbers and a GDB id) in any data field. Note: Stemming does not work with field-restricted searches ([mim], [chr])
Hs.74561
Browse list containing only A2M, because Hs.74561 is the UniGene cluster number for A2M.
AF053356
Browse list of loci with sequences in this GenBank record. Note: accession data are still being integrated; some accession numbers may not yield a result.
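The query rules illustrated above (implied AND between words, a trailing * for word stems, [chr] and [mim] field restriction, and no wild cards inside field-restricted terms) can be mimicked on a small in-memory data set. The Python sketch below is an illustration of those rules only, not the LocusLink implementation: the function names are hypothetical and the two records are toy data, with the word A2M placed in the IGHA2 record because the examples above say the A2M query returns both loci.

```python
import re

# Toy records for illustration; not a copy of LocusLink data.
RECORDS = [
    {"symbol": "A2M", "locus_id": 2, "chr": "12", "mim": "103950",
     "text": "A2M alpha-2-macroglobulin Hs.74561"},
    {"symbol": "IGHA2", "locus_id": 3494, "chr": "14",
     "text": "IGHA2 A2M immunoglobulin heavy constant alpha 2"},
]

# A field-restricted term looks like 103950[mim] or 2[chr].
FIELD_RE = re.compile(r"^([^\[\]]+)\[(chr|mim)\]$", re.IGNORECASE)


def term_matches(term, record):
    """Test a single query term against one record."""
    field_term = FIELD_RE.match(term)
    if field_term:
        value, field = field_term.group(1), field_term.group(2).lower()
        if "*" in value:
            raise ValueError("wild cards cannot be combined with field restriction")
        return record.get(field) == value
    # Split the record text into lowercase words; hyphens separate words.
    words = re.findall(r"[a-z0-9.]+", record["text"].lower())
    term = term.lower()
    if term.endswith("*"):
        return any(word.startswith(term[:-1]) for word in words)
    return term in words


def search(query, records=RECORDS):
    """Return symbols of records matching every term, sorted by symbol."""
    terms = query.split()
    hits = [r for r in records if all(term_matches(t, r) for t in terms)]
    return sorted(r["symbol"] for r in hits)


if __name__ == "__main__":
    for q in ("A2M", "A2*", "Macroglobulin", "103950[mim]", "Hs.74561"):
        print(q, "->", search(q))
```

Sorting the hits by symbol follows the description of the browse list as alphabetized by Symbol.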
Data Element
Definition
Official Nomenclature
The official symbol and name of a gene.
Interim Nomenclature
If the official symbol and name have NOT been established, an interim symbol and name are provided.
LocusID
A unique NCBI LocusID is assigned to each locus. LocusIDs are stable identifiers of a locus independent of symbols or other identifiers.
Alternate Symbols
Alias symbols. These are compiled from previous nomenclature, the published literature, or sequence records.
Product
The preferred gene product name. These names may be revised to reflect current usage.
Alias
Alternative product names.
EC number
Enzyme commission number(s).
Chromosome
The chromosome(s) to which this locus is mapped.
Position
The cytogenetic location.
OMIM
The MIM number assigned to this gene product.
UniGene
The UniGene cluster that represents this locus.
Phenotype
The name of the disease that may result from variants at this locus; this is linked to the OMIM record for the disease.
Links
Links to other related WWW sites
Reference Sequences
All RefSeq records created for a given locus. Multiple records are distinguished from each other by a unique locus abbreviation (the gene symbol with incremental lowercase alphabetic characters appended) and a brief descriptor of the transcript variant. This section provides data on (and links to) the GenBank source sequence, the nucleotide RefSeq record (nucleotide accession numbers have an 'NM_' prefix), and the protein RefSeq record (the 'NP_' prefix).
GenBank Sequences
Accession
A subset of representative GenBank accession numbers linked to nucleotide and protein data. These accession numbers are initially derived from a variety of collaborations. These data are reviewed frequently and the accessions listed for a given locus may change over time. EST accession numbers are provided only if no other sequence data is available to represent the locus.
Type
Molecule type for the nucleotide record.
m = mRNA
g = genomic DNA
e = EST
u = undetermined
Protein data
Protein gi numbers link to the GenBank protein record. The presence of the blue button link indicates that structure data for related protein sequences is available.
This is a temporary situation. The goal is to improve the correspondence between UniGene and LocusLink such that for any gene anchored with sequence data, there is a single UniGene cluster.
The answer depends in part on whether the gene has a provisional or a reviewed RefSeq record, or no RefSeq record at all.
If there is no RefSeq record, each accession number is one suggested to represent the gene, but no tools have been used to identify other related sequences.
If the RefSeq record is in the provisional category, only records annotated with complete coding regions, which do not exceed a cutoff nucleotide mismatch level, are listed.
If reviewed, the sequences used to generate the sequence "assembly" are reported explicitly. Other GenBank records that represent this gene are also listed; however, this list is not intended to be comprehensive. Additional 'related' sequence information will always be available as Entrez 'related sequences' (or 'neighbors') reports, BLAST search results, etc.
C. UniGene
http://www.ncbi.nlm.nih.gov/UniGene/index.html
What is UniGene?
How are clusters created?
Searching UniGene
Things to be aware of when using UniGene
An allelic variant is designated by the MIM number of its parent entry, followed by a decimal point and a unique 4-digit variant number. For example, allelic variants (mutations) at the factor IX (hemophilia B) locus are numbered 306900.0001 to 306900.0101. The beta-globin locus (HBB) is numbered 141900; sickle hemoglobin is numbered 141900.0243.
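The numbering scheme for allelic variants can be written down as a pair of small helpers. This Python sketch is illustrative only: the function names are hypothetical, and it assumes six-digit MIM numbers, as in the examples given here.

```python
import re

# Assumption: a six-digit MIM number, a decimal point, and a four-digit
# variant number, as in 141900.0243.
VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def format_allelic_variant(mim_number, variant_number):
    """Compose an allelic variant designation such as '141900.0243'."""
    if not 1 <= variant_number <= 9999:
        raise ValueError("variant number must fit in four digits")
    return f"{mim_number:06d}.{variant_number:04d}"


def parse_allelic_variant(text):
    """Split a designation into (parent MIM number, variant number)."""
    match = VARIANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an allelic variant designation: {text!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(format_allelic_variant(141900, 243))
    print(parse_allelic_variant("306900.0101"))
```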
An asterisk (*) before an entry number means that the phenotype determined by the gene at the given locus is separate from those represented by other asterisked entries and that the mode of inheritance of the phenotype has been proved (in the judgment of the authors and editors). In general, an attempt has been made to create only one asterisked entry per gene locus.
No asterisk before an entry number means that the mode of inheritance has not been proved, although suspected, or that the separateness of this locus from that of another entry is unclear.
A number symbol (#) before an entry number means that the phenotype can be caused by mutation in any of 2 or more genes. The #-labeled entries are considered useful for avoiding repetition of the same phenotypic information in several entries and necessary because it is often unknown which genetic type is referred to in a particular report.
A graphical analysis tool which finds all open reading frames of a selectable minimum size
You may paste in your own sequence or an accession number
You may restrict the size of the sequence in which to search for ORFs
Identifies all open reading frames using the standard or alternative genetic codes.
Once the Orfind results have been returned you can:
Save the deduced amino acid sequence in various formats
Search the amino acid sequence against the sequence database using BLAST
Change the minimum cutoff size for ORFs and redraw
Display alternative initiation codons
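To make the idea of an ORF search with a selectable minimum size concrete, here is a simplified Python sketch. It is not the algorithm used by the NCBI tool: it assumes the standard genetic code's three stop codons, treats only ATG as an initiation codon (alternative initiation codons and alternative genetic codes are not handled), and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Find ATG..stop open reading frames in all six frames.

    Returns tuples (strand, frame, start, end). Coordinates are 0-based on
    the scanned strand, end is exclusive, and the stop codon is included
    in the reported length.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("ATGAAATAG", min_len=9))
```

Raising min_len plays the role of changing the minimum cutoff size and redrawing: fewer, longer ORFs are reported.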
Your sequence should consist of one-letter nucleotides (A, C, G, T). The sequence should be in plain or FASTA format.
Select whether you want to use the NNPP version for prokaryotes or for eukaryotes (it defaults to eukaryotes).
You can choose whether to show predictions for the reverse strand as well as the forward strand.
You may also set the score cutoff (it defaults to 0.8). Potential promoters are assigned scores; only those that exceed the score cutoff are shown. The lower the score cutoff, the more potential promoters will be shown (see "Estimated Accuracy of Prediction" table for details). The score cutoff should be between 0 and 1.
The output of NNPP is a list of the 51-base (eukaryotes) or 46-base (prokaryotes) regions that the network judges most likely to be promoters. Because promoter elements may appear at different relative positions, the positional accuracy of promoter prediction is +/- 3bp including the transcription start.
The transcription start site (TSS) is indicated by a large capital letter.
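The effect of the score cutoff can be shown with a few lines of Python. This sketch only filters scores that have already been computed; it does not reproduce the neural network, and the function name and the example positions and scores are hypothetical.

```python
def filter_predictions(predictions, cutoff=0.8):
    """Keep (position, score) pairs whose score exceeds the cutoff."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("score cutoff should be between 0 and 1")
    return [(pos, score) for pos, score in predictions if score > cutoff]


if __name__ == "__main__":
    # Hypothetical candidate promoters: (start position, score).
    candidates = [(120, 0.95), (340, 0.81), (610, 0.55)]
    print(filter_predictions(candidates))              # default cutoff 0.8
    print(filter_predictions(candidates, cutoff=0.5))  # lower cutoff, more hits
```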
Your sequence should consist of one-letter nucleotides (A, C, G, T).
The sequence should be in plain or FASTA format. (ex. V00574)
Select whether you want to use the neural network version for Human or for Drosophila melanogaster sequences. You can choose whether to show predictions for the reverse strand as well as the forward strand.
The output of the neural networks is a list of the 15-base and 41-base regions that the network judges most likely to be 5' and 3' splice sites, respectively. The junctions are indicated by a larger font size.
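The input requirement stated for these prediction tools (one-letter nucleotides A, C, G, T in plain or FASTA format) can be checked before a sequence is submitted. The Python sketch below is a minimal check under those stated rules; the function name is hypothetical, and it handles a single sequence only.

```python
def read_nucleotides(text):
    """Parse plain or single-record FASTA text into (header, sequence).

    header is None for plain input. Raises ValueError if the sequence is
    empty or contains anything other than A, C, G, T.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    header = None
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()  # FASTA description line
        lines = lines[1:]
    seq = "".join(lines).upper()
    bad = sorted(set(seq) - set("ACGT"))
    if not seq:
        raise ValueError("no sequence data found")
    if bad:
        raise ValueError(f"unexpected characters in sequence: {bad}")
    return header, seq


if __name__ == "__main__":
    print(read_nucleotides(">demo\nacgt\nACGT\n"))
    print(read_nucleotides("ACGT"))
```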
GeneX - a repository of gene expression data with an integrated toolset that will let researchers analyse mRNA expression data and facilitate comparison with other such data
HomologyDB
TreeBLAST - integrates sequence similarity searching, sequence alignment and tree analysis
R-Genes - investigatory subset focussing on plant disease resistance genes and their homologs in other organisms. Over 500 family members known.
HomologyDB - database to store homology information that will be the foundation of the previous two projects.
GSDB - Genome Sequence Database
Updated nightly
Data retrieval methods
BLAST (http://seqsim.ncgr.org/newBlast.html )
Java Sequence Viewer
GSDB Maestro - searches database based on any combination of 18 fields
SANBI Maestro - allows access to sequences associated with STACK
GSDB Ad Hoc Query Tool - Web based form that will let users query using SQL statements. A free account and an understanding of the db schema is required
GSDB Excerpt - Allows users to extract exact portions of a GSDB sequence
FlatFile retrieval - individual file retrieval
MAR-Finder - searches for probable Matrix Attachment Regions (MAR)
PathDB - a general metabolic pathway database that will represent the current knowledge.
http://www.ncgr.org/software/pathdb/
The main data types represented by PathDB are compounds, reactions, enzymes and other metabolic proteins and pathways.
PathDB attempts to include very rich descriptions of the kinetic, thermodynamic and physico-chemical properties of pathway components.
All the data are categorized by taxonomy.
Live data updates
Simple queries can be made directly through the Web with a normal browser.
A Java software application connects directly to the database and provides a user-friendly means to make complex queries. This interface makes it possible to navigate the relations between the various entities stored. It also makes it possible to carry out set operations on multiple query results.
3D Structure database links
Sequences and secondary structure database links
Protocols
Software
Books
Meetings
GeneCards is a database of human genes, their products and their involvement in diseases.
Provides concise information about the functions of all human genes that have an approved symbol, as well as selected others.
The information presented here has been automatically extracted from various resources by scripts developed at the Weizmann Institute.
Goals of the KEGG project
To computerize all aspects of cellular functions in terms of the pathway of interacting molecules or genes
To maintain gene catalogs for all organisms and link each gene product to a pathway component
To organize a database of all chemical compounds in the cell and link each compound to a pathway component
To develop computational technologies for pathway comparison, reconstruction and analysis
BLITZ - http://www2.ebi.ac.uk/bic_sw/ Compugen's Bic2's Smith & Waterman algorithm implementation for protein database searches
Scanps - http://www2.ebi.ac.uk/scanps/ Very fast implementation of the true Smith & Waterman algorithm for protein database searches
Ssearch3 - http://www2.ebi.ac.uk/ssearch3/ Generic implementation of the Smith & Waterman algorithm for protein database searches
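All three services implement the Smith & Waterman local alignment algorithm. For readers unfamiliar with it, the core recurrence can be sketched in Python as follows. This is a teaching sketch, not the code behind any of the services listed: it uses a simple match/mismatch score and a linear gap penalty (values chosen for illustration), whereas protein database searches normally use substitution matrices and more elaborate gap penalties, and it returns only the best score rather than the alignment.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "ACG"))
    print(smith_waterman_score("ACGT", "ACT"))
```

Keeping only two rows of the matrix is enough when the score alone is needed; recovering the aligned residues would require the full matrix and a traceback.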
A central site to do many types of searches. A batch client is available.
General protein sequence/pattern searches
Species-Specific protein sequence searches
Nucleic acid sequence searches
Multiple sequence alignments
Pairwise sequence alignments
Gene feature searches
Sequence utilities
Protein secondary structure prediction
Swiss Shop Alert Service http://www.expasy.ch/swiss-shop
The available services will automatically perform daily comparisons of your sequence of interest to the latest database updates. If there is a new sequence that has a high degree of similarity to your sequence of interest, you will automatically be sent an email informing you of this.