I. What is Bioinformatics?
I define that term loosely to be the use of computer programs to assist in the study of biological data. Since 1993, many software tools have been made available on the Internet. This allows any researcher to use any platform to access the tools they need to find answers to their questions.
This class is intended to be a general survey of the kinds of tools currently available. It will be by no means complete, as new tools are appearing every day. I will attempt to teach you how to use the primary tool used by most molecular biologists today, which is BLAST.
Once you are comfortable with manipulating the BLAST user interface, we will continue with a survey of other programs that address specific areas of research. The goal is to introduce you to the possibilities that exist for using the Internet to assist you in your research. We will be following the hypothetical development of a gene discovery, from first identifying the unknown DNA sequence to identifying details of its genetic structure.
Introduction to DNA/Protein Searching
The primary resource center for researchers wishing to perform genetic analysis over the Internet is the National Center for Biotechnology Information, known as NCBI (http://ncbi.nlm.nih.gov). NCBI is the home of the most popular tools currently in use and they are constantly expanding and refining them. They are also responsible for maintaining GenBank, the largest sequence database in the world. We will be spending a lot of time at NCBI, investigating the different resources it has to offer.
What are the main databases used?
The three primary sequence repositories in the world are GenBank (USA), EMBL (Europe) and DDBJ (Japan). All three of these databases mirror each other and constantly update each other. Sequences submitted to one are present in all within days. Other major databases are also very valuable. They include Unigene, LocusLink, SwissProt and others.
What format does the data have?
1. FASTA Format
>Accession number and identifier (one line max)
Actual sequence
2. Complete Database Record
Includes authors' names, literature citations, gene attributes, sequence features, gene name, accession number, actual sequence.
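The FASTA layout described above (one ">" header line followed by the sequence) can be read with a few lines of code. This is a minimal sketch, not a full parser; the function name and the example record are hypothetical and chosen only for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence may be wrapped over several lines; join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, made up for this example.
example = """>X00001 hypothetical example record
ACGTACGTAC
GTACGT
"""
print(parse_fasta(example))
```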
What tools are available for searching the databases?
There are many tools that access the public databases as well as many specialized databases. Many of these tools cross-reference each other. We will be looking at several of these tools. Each tool searches the database using a mathematical algorithm. The most commonly used algorithms are:
- BLAST (Basic Local Alignment Search Tool), based on the statistical methods of Karlin and Altschul (1990, 1993). Available through the Internet or as a downloadable application. Quick and simple.
- Smith-Waterman: very sensitive and highly accurate, but very processor intensive, and therefore not widely available to the public. Usually requires hardware accelerators. See website list for publicly accessible sites.
- FASTA: available through the Internet. Primarily used in Europe since the advent of BLAST.
Which algorithm is best?
It depends on what you want to do. Smith-Waterman is the most accurate and can detect very weak similarities. However, it needs massive computing power and is very slow without it. For most similarity searches, BLAST or BLAST 2.0 is best: it is fast and very accurate. However, it is not designed for motif searching.
II. Basic Gene Discovery
In this section we will examine the primary tools used by researchers today to identify nucleotide sequences. By far the most commonly used application is NCBI's BLAST program.
BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST Examples
BLAST is actually a family of programs maintained at NCBI. The basic programs that are available are listed below:
BLAST Programs
| Program | Query Sequence | Database Sequence Target | Comments |
|---|---|---|---|
| BLASTN | Nucleotide (both strands) | Nucleotide database | Optimized for speed, not sensitivity. Not intended for finding distant homologies. Has DUST option (low-complexity filter). |
| BLASTX | Nucleotide translated into 6 frames | Protein database | Less sensitive to sequence errors and mismatches than BLASTN. Useful for preliminary data and EST searching. Has low-complexity filter option. |
| TBLASTX | Nucleotide translated into 6 frames | Nucleotide database translated in 6 frames | Very good for ESTs and single-pass sequences. Very slow. |
| BLASTP | Protein | Protein database | |
| TBLASTN | Protein | Nucleotide database translated in 6 frames | For searching proteins against nucleotide (including EST) sequences. |
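The "translated into 6 frames" entries in the table mean that a nucleotide sequence is read in three reading frames on each strand. The sketch below illustrates that idea only. It assumes the standard genetic code (BLAST lets you choose others), and the function names are hypothetical rather than part of any BLAST program.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3, keyed by frame name."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```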
All of the BLAST programs allow you to choose which database or version of a database you wish to search. Not all databases are available for each program. A list and description of the available databases is shown below:
GenBank Databases
| Database | Sequence Type | Usable Programs | Comments |
|---|---|---|---|
| NR | Nucleotide, Protein | BLASTN, BLASTP, BLASTX | Contains "non-redundant" nucleic acid and protein sequences from all known data sources. |
| MONTH | Nucleotide, Protein | BLASTN, BLASTP, BLASTX | Contains new records added to NR within the last month. |
| SWISSPROT | Protein | BLASTP, BLASTX | The SWISS-PROT annotated subset of NR |
| DBEST | Nucleotide from cDNA source | BLASTN, TBLASTN, TBLASTX | Database of Expressed Sequence Tags |
| DBSTS | Nucleotide from genomic source | BLASTN, TBLASTN, TBLASTX | Database of Sequence Tagged Sites |
| PDB | Protein | BLASTP, BLASTX | The Brookhaven Protein Data Bank subset of NR |
| VECTOR | Nucleotide | BLASTN | A database of vector sequences. Hasn't been updated in a long time. Useful to check sequences against before submitting them to GenBank |
| KABAT | Protein | BLASTP, BLASTX | Proteins of immunological interest. A subset of NR |
| MITO | Nucleotide | BLASTN | A database of mitochondrial DNA sequences. Also useful to check before submitting sequence to GenBank |
| ALU | Nucleotide | BLASTN | Database of repetitive DNA sequences. Check before submitting sequence to GenBank. |
| EPD | Nucleotide | BLASTN | Eukaryotic Promoter Database |
| YEAST | Protein | BLASTP, BLASTX | S. cerevisiae protein database |
| E.COLI | Protein | BLASTP, BLASTX | E. coli genomic coding region translations |
| HTGS | Nucleotide | BLASTN | High-throughput genomic sequence. Usually single-pass data |
- How does BLAST work?
- Identify HSPs (High-scoring Segment Pairs)
- default word size is 11 bp or 3 aa
- perfect match
- Slide query and target sequence across each other until the maximum number of HSPs for that target is found
- Score the alignment
- a scoring matrix is used (such as BLOSUM62)
- gaps introduced between HSPs during sliding get a negative score
- a match gets a positive score
- the total alignment score is subjected to statistical analysis (Karlin-Altschul, or K-A, statistics) to assess the significance of the score versus chance
- Repeat for every sequence in the target database
- Return total results
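The first steps of the outline above (find exact word matches, then grow each match into a scored segment) can be sketched roughly as follows. This is a deliberately simplified illustration, not the BLAST implementation: it uses exact nucleotide words only, extends without gaps, and the match/mismatch scores and drop-off limit are arbitrary assumptions. The function names are hypothetical.

```python
def find_seeds(query, target, w=11):
    """Return (query_pos, target_pos) pairs where a length-w word matches exactly."""
    index = {}
    for j in range(len(target) - w + 1):
        index.setdefault(target[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, i, j, w, match=1, mismatch=-3, drop=5):
    """Extend a seed without gaps in both directions.

    Extension in a direction stops once the running score has fallen more
    than `drop` below the best score seen. Returns (best_score, q_start, q_end),
    where query[q_start:q_end] is the best-scoring segment.
    """
    score = best = w * match
    # Extend to the right of the seed.
    qi, tj = i + w, j + w
    best_end = qi
    while qi < len(query) and tj < len(target) and best - score <= drop:
        score += match if query[qi] == target[tj] else mismatch
        qi += 1
        tj += 1
        if score > best:
            best, best_end = score, qi
    # Extend to the left of the seed, starting again from the best score.
    score = best
    qi, tj = i - 1, j - 1
    best_start = i
    while qi >= 0 and tj >= 0 and best - score <= drop:
        score += match if query[qi] == target[tj] else mismatch
        if score > best:
            best, best_start = score, qi
        qi -= 1
        tj -= 1
    return best, best_start, best_end


query, target = "GATTACAGG", "CCGATTACATT"
seeds = find_seeds(query, target, w=4)  # small word size for a toy example
print(seeds)
print(extend_seed(query, target, *seeds[0], w=4))
```

A real search would also merge overlapping seeds, allow gaps, and use a substitution matrix for proteins; none of that is attempted here.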
- Interpreting BLAST results
BLAST returns the alignment information as a graphic, a list of accession numbers and as aligned sequences
- Graphic results
- Color indicates the alignment score
- Red >= 200
- Purple 80-200
- Green 50-80
- Blue 40-50
- Black <40
- Mousing over a line in the graphic shows you the name and accession of the target
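The color key above amounts to a simple lookup on the alignment score. In the sketch below the listed ranges share their endpoints (80 and 50 appear in two ranges), so it assumes that each lower bound is inclusive; the function name is hypothetical.

```python
def score_color(score):
    """Map an alignment score to the color used in the BLAST graphic overview.

    Assumption: a boundary value belongs to the higher range.
    """
    if score >= 200:
        return "red"
    if score >= 80:
        return "purple"
    if score >= 50:
        return "green"
    if score >= 40:
        return "blue"
    return "black"


for s in (250, 120, 60, 45, 10):
    print(s, score_color(s))
```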
- Accession lists
- First column contains Accession number, database source and locus name hyperlinked to the Genbank record
- Second column contains the defline from the FASTA formatted Genbank record
- Third column is the alignment score (corresponds to the color in the graphic)
- Fourth column is the E (Expected) value. This number describes the number of hits you would expect to see by random chance when searching a database of the given size
- default E value is 10, therefore only results with a value <10 are reported
- Truly significant results have an E value approaching zero
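To make the E value more concrete: in Karlin-Altschul statistics the expected number of chance hits with raw score at least S is E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The sketch below evaluates that relation. The values used for `lam` and `k` are illustrative placeholders only (the real ones depend on the scoring matrix and gap penalties), and the function names are hypothetical.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul E value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters, chosen only to show how E changes with score.
LAM, K = 0.3, 0.1
for raw in (30, 60, 120):
    e = expected_hits(raw, query_len=250, db_len=1e8, lam=LAM, k=K)
    print(f"raw score {raw:4d}  bits {bit_score(raw, LAM, K):7.1f}  E = {e:.3g}")
```

With the database size fixed, E falls exponentially as the score rises, which is why strong hits have E values approaching zero.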
- Aligned sequences
- Display accession number, defline and length of target sequence. The accession number is hyperlinked to the Genbank record
- Score and E value
- Number of identities and gaps for the alignment
- Which strands were matched
- The alignment
- Advanced BLAST parameters
- Choose an organism to restrict your query to
- This is a good way to identify homologs in other species
- Change the E value
- Making E smaller returns fewer sequences, but they are of better quality
- Making E larger returns more sequences, but also includes more junk
- Turn filter On/Off (DUST / SEG)
- Many DNA and protein sequences have regions of low complexity
- 3' UTR are often 'A' rich, e.g. AAATAAAAAAAAAATAAAAAAT
- Proteins can have low complexity regions such as PPCDPPPPPKDKKKKKDDGPP
- BLAST scores can be artificially inflated by these regions
- The default option is 'ON'. To avoid spurious results, you should leave it on.
- If a query has been filtered, masked regions are replaced with 'NNNNN' for DNA and 'XXXX' for protein
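The masking idea can be illustrated with a toy filter. The sketch below is NOT the DUST or SEG algorithm: it simply slides a window along the sequence and masks any window whose Shannon entropy falls below a threshold. The window size, threshold, and function names are arbitrary assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the symbol frequencies in a string."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window with entropy below `threshold` by `mask_char`.

    Use mask_char="N" for DNA and mask_char="X" for protein.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACGTACGTACGT" + "A" * 12))
print(mask_low_complexity("PPCDPPPPPKDKKKKKDDGPP", window=6, mask_char="X"))
```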
- Set the genetic code for your query sequence (BLASTX only)
- Change the scoring matrix
- Recommend you don't change
- If you want to change, read the papers first
- Other advanced options
- Most likely to use
- -v The number of descriptions to show in the accession list
- -b The number of alignments to show
- -W Change the word size
- Can also change the penalties for gaps
- A few things to be aware of when using BLAST
- Alu repeat sequences
- The most common repetitive DNA element is called an Alu repeat
- 282 bp, occurs on average every 3300 bp in human genomic DNA
- several dozen subfamilies, but they still retain sequence homology
- Many genomic sequences and non-translated portions of mRNAs have repetitive DNA elements
- How to identify them
- Query your sequence with BLASTN vs the Alu db
- Query your sequence with BLASTX against SWISS-PROT (contains dummy translations of Alu). Hits are then designated as Alu
- Vector sequences
- A surprising proportion of the sequences in the public databases are contaminated with vector sequences
- If you search a query sequence that contains vector-derived sequence against GenBank, you will find a lot of "hits"
- Make sure you have removed all of the vector sequence from the ends of your query before you run it
- To double-check, you can run your query against the VECTOR database using BLASTN
- EST data
- They are generated from single-pass DNA sequencing; hence they have a lot of errors, up to 5% by some estimates
- There is a great deal of redundancy in ESTs, i.e. different ESTs that tag the same gene
- The EST database in GenBank (largely generated by the IMAGE consortium) is best searched by BLASTX and TBLASTX
- The dbEST database is not broken down by species
- How do I perform BLAST with small query sequences?
- Increase E to at least 1000
- A small sequence is more likely to occur by chance
- Increasing E lets you look farther down the list and see matches that would normally be discarded
- Decrease word size (W)
- BLASTN will not work if W < 7
- Good rule is that the query length must be at least 2W
- The smaller W is, the slower the search will be.
- Turn Filter option 'OFF'
- Change the matrix to optimize for searching with short protein sequences
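The word-size rules of thumb above are easy to check before submitting a short query. The helper below is a hypothetical convenience function, not part of BLAST; it only encodes the two rules stated in the list (BLASTN needs W of at least 7, and the query should be at least 2W long).

```python
def check_short_query(query_len, word_size, program="blastn"):
    """Return a list of warnings for a short-query search; empty if none apply."""
    problems = []
    if program.lower() == "blastn" and word_size < 7:
        problems.append("BLASTN will not work with a word size below 7")
    if query_len < 2 * word_size:
        problems.append(
            f"query length {query_len} is less than 2 x word size ({2 * word_size})"
        )
    return problems


print(check_short_query(query_len=20, word_size=7))   # no warnings
print(check_short_query(query_len=10, word_size=7))   # query too short for W=7
```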
- The Q-BLAST Interface
- Q-BLAST is a new user interface for Basic and Advanced BLAST
- Instead of waiting for results to be returned, a Request ID is immediately assigned.
- An intermediate formatting page will let you format or reformat your results without having to re-run the query.
- Results are stored on the server under the Request ID for 24 hours, unless the results set is very large, in which case it will be deleted after 30 minutes.
- You can go back and retrieve your results anytime within 24 hours with your Request ID
- The Request IDs are partially random to protect privacy.
- New results format options
- Default - same format as before Q-BLAST. Contains graphical overview, list of matching accession numbers, pairwise alignment
- The NCBI-gi option makes every entry in the accession list begin with the gi number. Normally the accession list does not contain the gi number
- Alternate alignment views
- Master-slave with identities - shows a multiple sequence alignment (MSA) over the entire length of the query with all of the hits. The sequence of the query is shown, while sequence identity of the other sequences is represented by dots. Dashes represent gaps. Insertions are shown as "tails"
- Master-slave without identities - same as above, except that the full sequence of each hit is shown rather than using a dot representation.
- Flat master-slave with identities
- Flat master-slave without identities
- Get ASN.1 for SeqAnnot - ASN.1 code for SeqAnnot program
- Get ASN.1 for BLAST Object - ASN.1 code for CORBA BLAST Object
- What is VecScreen?
- VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin using optimized parameters for blastn.
- VecScreen searches a query for segments that match any sequence in a specialized non-redundant vector database (UniVec).
- The UniVec database contains only one copy of every unique sequence segment from a large number of vectors.
- In addition to vector sequences, UniVec also contains sequences for those adapters, linkers and primers commonly used in the process of cloning cDNA or genomic DNA.
- A copy of the first 49 bases of the sequence for a circular vector is appended to the end of the sequence before it is processed for addition to UniVec. This "pseudo-circularization" allows matches that span the circular junction to be identified correctly.
- The sequences used to build the current version of UniVec are listed in the UniVec Representation List.
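The "pseudo-circularization" step described above is simple to picture in code. The sketch below appends the first 49 bases of a circular sequence to its end, as the text describes, so that a match spanning the junction becomes findable in the linear string. The function name and the toy sequence are hypothetical, and the toy example uses a shorter overlap so the effect is visible.

```python
def pseudo_circularize(vector_seq, overlap=49):
    """Append the first `overlap` bases to the end of a circular sequence."""
    return vector_seq + vector_seq[:overlap]


# Toy circular "vector": the junction joins the final T's to the leading A's.
vector = "AAAACCCCGGGGTTTT"
linear = pseudo_circularize(vector, overlap=4)

junction_fragment = "TTTTAAAA"        # spans the circular junction
print(junction_fragment in vector)    # False: not found in the plain linear form
print(junction_fragment in linear)    # True: found after pseudo-circularization
```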
- Why use VecScreen?
- The most common sources of contamination are accessory DNAs deliberately attached to the DNA/RNA under investigation
- Vectors -- Sequencing of vector constructs frequently produces raw sequences that include segments derived from vector. Failure to identify and remove all the vector sequence results in a finished sequence that is contaminated.
- Adapters, linkers, and PCR primers -- Various oligonucleotides can be attached to the DNA/RNA under investigation as part of the cloning or amplification process. The sequences of these oligonucleotides are therefore often present in raw sequences and will contaminate the finished sequence unless they are identified and removed.
- Another source of contamination is "unintended events"
- Transposons and Insertion Sequences -- A transposable element from the cloning host (generally Escherichia coli or yeast) occasionally will insert itself into the cloned DNA/RNA while the clone is being propagated. The chance of a transposon or insertion sequence inserting into the clone increases with the size of the DNA insert.
- Impurities in the DNA/RNA under investigation -- Often derived from impure reagents, incomplete isolations or heterogeneous organism (fungal, bacterial) contaminants.
- Limitations of VecScreen
- The UniVec database was constructed such that no sequence element is longer than 50 bp.
- Search results will not indicate the identity of the vector with the strongest match to the query, since the database is made of fragments and most vectors contain many common regions
- Should not be used in a case where you are looking for a match of longer than 50 bp.
III. Specialized Resources at NCBI for Genetic Analysis
The Locus Link Project
http://www.ncbi.nlm.nih.gov/LocusLink/
Example of data format
LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. It presents information on official nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, MIM numbers, UniGene clusters, map information, and relevant web sites. The sequence data presented includes a new type, NCBI Reference Sequence (RefSeq) records, as well as a subset of GenBank accession numbers. The current scope is human genes.
- RefSeq
- About RefSeq: NCBI Reference Sequences
The NCBI Reference Sequence project (RefSeq) will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins. They provide a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery. Furthermore, the RefSeq-to-LocusLink associations anchor UniGene clusters and support annotation of genomic contig sequence data generated by the Human Genome Project.
RefSeq records are created via a process consisting of:
- Establishing the correct gene name-to-accession number association
- Identifying the full extent of available sequence data
- Creating a new sequence record with a PROVISIONAL status
The provisional RefSeq records are then reviewed by a biologist who confirms the initial name-to-sequence association, adds information including a summary of gene function, and, more importantly, corrects, re-annotates, or extends the sequence data using data available in other GenBank records. Both provisional and reviewed RefSeq records are made publicly available via the NCBI Entrez retrieval system, BLAST databases, FTP, and the LocusLink web site.
- What is a Reference Sequence?
The RefSeq database will be a non-redundant set of reference sequences, including constructed genomic contigs, mRNAs, proteins, and, in the future, entire chromosomes. RefSeq records are made available at two 'status' levels: provisional and reviewed. Reviewed records represent a compilation of our current knowledge of a gene and its transcripts. During the review process additional information is integrated, when available, such as sequence data, publications, nomenclature, and feature annotations from multiple GenBank records, the Human Gene Nomenclature Committee, and Online Mendelian Inheritance in Man.
- How do I access RefSeq records?
RefSeq records can be accessed through several NCBI resources including:
- BLAST
NM_###### records are in the nucleotide non-redundant database (nr)
NP_###### records are in the protein non-redundant database
- Entrez
NM_###### records are in Entrez nucleotides
NP_###### records are in Entrez proteins.
- Entrez Genomes Division
NC_###### records representing the complete genome or chromosomes of several organisms are presented on the Genomes pages.
- FTP
Currently limited to NM_* and NP_* records; NT_* and NC_* records to be added in the future.
- Human Genome Sequencing
NT_###### records for human contigs can be viewed graphically, downloaded, or accessed with BLAST queries only from the Human Genome Sequencing pages.
- LocusLink
LocusLink records provide links to NM_###### and NP_###### records. LocusLink can be queried with the RefSeq accession number in addition to text terms.
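Because RefSeq accessions follow a fixed pattern (a two-letter prefix, an underscore, then digits), they are easy to recognize programmatically. The sketch below classifies an accession by the prefixes mentioned above and also splits a BLAST hit identifier of the form `ref|NM_000014.1|A2M|`. The function names and the returned dictionary layout are hypothetical, and the optional ".version" suffix is handled as an assumption based on that example.

```python
import re

# Prefix meanings as described in this section.
REFSEQ_KINDS = {
    "NM": "mRNA",
    "NP": "protein",
    "NC": "complete genome or chromosome",
    "NT": "genomic contig",
}
REFSEQ_PATTERN = re.compile(r"^(NM|NP|NC|NT)_\d+(\.\d+)?$")


def refseq_kind(accession):
    """Return the record type for a RefSeq accession, or None if it is not one."""
    m = REFSEQ_PATTERN.match(accession)
    return REFSEQ_KINDS[m.group(1)] if m else None


def parse_hit_id(hit_id):
    """Split a pipe-delimited BLAST hit identifier into its database tag and accession."""
    fields = hit_id.strip("|").split("|")
    database = fields[0]
    accession = fields[1] if len(fields) > 1 else ""
    kind = refseq_kind(accession)
    return {
        "database": database,
        "accession": accession,
        "is_refseq": database == "ref" and kind is not None,
        "kind": kind,
    }


print(parse_hit_id("ref|NM_000014.1|A2M|"))
print(parse_hit_id("gb|AF053356|"))
```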
- Retrieving NM_ and NP_ RefSeq records with Entrez queries:
RefSeq records can be retrieved via different Entrez queries:
Sample Query (Q)
Result (R)
Q. NM_003988
R. One RefSeq record for PAX2, isoform c, is returned.
Q. PAX2[Gene Name]
R. This returns 17 records including 5 PAX2 RefSeq records.
Q. PAX2[Gene Name] AND srcdb_refseq[Properties]
R. This query retrieves only the set of 5 alternately spliced PAX2 RefSeq records.
Q. srcdb_refseq[properties] AND provisional[all]
R. This query returns the set of all PROVISIONAL RefSeq records.
Q. srcdb_refseq[properties] NOT provisional[all]
R. This query returns the set of all REVIEWED RefSeq records.
- Identifying NM_ and NP_ RefSeq records in BLAST results:
The distinct format of RefSeq accession numbers (they include an underscore) provides a quick indication that a BLAST result includes a RefSeq record.
                                                      Score     E
Sequences producing significant alignments:           (bits)  Value
ref|NM_000014.1|A2M| Homo sapiens alpha-2-ma...        9073    0.0
 ^      ^
 |      |
 |      RefSeq accession numbers have a distinct format
 |
 "ref" indicates RefSeq database
- What is the difference between PROVISIONAL and REVIEWED RefSeq records?
- PROVISIONAL Records:
"Provisional" RefSeq records have not been reviewed yet. They are generated by an automated process with some initial quality checking done to double check the validity of the 'name'-to-'sequence data' association we are presenting.
RefSeq records are only made when we have source sequence records annotated with complete coding regions. If multiple sequences from the same transcript are identified by local alignments, the longest is selected automatically for the provisional record.
A provisional record presents, for the most part, the annotation that was present on the source GenBank record used to create it. The main differences between the source GenBank record and the provisional RefSeq record include the addition of the following in the RefSeq entry: nomenclature (gene name and aliases), a stable LocusID number, the MIM number for the gene, and a statement in the Comment field that the entry is provisional.
- REVIEWED Records:
Reviewed records have been manually processed by NCBI staff or collaborating groups to create a sequence record that is analogous to a 'review article.'
Some changes/enhancements in the reviewed record might include the addition or removal of DNA sequence data and feature annotations, the addition of summary information and publications, and the addition of other information, as appropriate.
When a record is reviewed, sequence data from more than one record may be merged together, as deemed appropriate, to construct a more complete mRNA record. Sequence data available in both genomic and mRNA records is used; we do not use EST sequence data. The review process frequently includes reading the primary literature to cross-check accuracy and determine if additional data concerning the extent of the UTR is available. Transcript variant records are only made after reviewing the literature or in collaboration with experts.
All sequences used to generate the sequence "assembly" are reported in the RefSeq record and in LocusLink. We also attempt to curate a list of other GenBank records that represent this gene. However, this list is not intended to be fully comprehensive; additional 'related' sequence information will always be available in the Entrez 'related sequences' (or 'neighbors') reports, BLAST search results, etc.
For examples of reviewed RefSeq records, see the following entries:
| Gene Symbol | LocusID | Comments |
|---|---|---|
| AGL | 178 | Example of splice variant treatment. We make RefSeq records for splice variants for which the full length nature of the transcript is well documented and supported by experimental evidence. There is a greater emphasis on providing RefSeq records for cases where some of the transcript variation results in altered coding regions. |
| PAX2 | 5076 | Example of splice variant treatment |
| MICA | 4276 | Note several references included; the record is analogous to a 'review article'. A single article is annotated in the Reference field of the source GenBank record. |
| GCKR | 2646 | Note the last line of the Comment field on the RefSeq record provides a 'completeness' indicator. If we determine during the review process that the 5' and/or 3' end of the mRNA is complete, then this information is provided on the RefSeq record. |
- How is the GenBank source sequence initially selected?
There are several factors used in selecting the source sequence first used to generate the PROVISIONAL mRNA RefSeq record, but quite often the source GenBank record used is selected primarily because it includes more complete UTR sequence data. We do strive to make reference sequences that maintain consistency with standards already in use by the mutation community.
Reference sequence records are not intended to represent the historical 'first sequenced' record (although for genes with very limited available sequence data they may at times do so). While the PROVISIONAL RefSeq records do temporarily represent a single GenBank source sequence, the REVIEWED RefSeq records are intended to represent the current state of knowledge as provided by the whole research community rather than by any one laboratory.
- RefSeq NM_xxxxxx and GenBank AFxxxxxx appear to be duplicates. Will one be removed?
No, both records will continue to be available. RefSeq and GenBank are separate databases, and both databases are available in the Entrez nucleotides data set.
Provisional RefSeq records are usually quite similar to the source GenBank records from which they were drawn. However, when RefSeq records are reviewed by experts, additional sequence data, biological annotations, and references are often added. At that time, the original source GenBank record(s) and the corresponding RefSeq entry can be quite different -- the RefSeq entry can represent a combination of information from various labs, which are credited in the Comments and/or References field of the record.
The RefSeq database is designed to reduce duplication by selecting one representative sequence for each human locus, whereas GenBank is a repository of sequences that might contain numerous records for any given gene. The only duplicates in the RefSeq database will be naturally occurring splice variants. Entrez search results can be limited to RefSeq entries by searching for 'srcdb_refseq' in the Properties Field.
- Locus Link
The LocusLink query searches for any word (or word stem) in a LocusLink report. The query results are returned as a summary alphabetized by Symbol (browse list). Any detailed LocusLink report page is then accessed by clicking on the LocusID number at the left. Color-coded icons support rapid jumps to related records in PubMed, OMIM, RefSeq, GenBank, UniGene, and Variation data (dbSNP).
- Summary of Query options
Wild card: *
Field restriction: [chr] chromosome; [mim] MIM number
Booleans: explicit Boolean operators are not allowed, but AND is implied when multiple words are entered
- Examples of LocusLink Queries
| Query | Result |
|---|---|
| A2M | Browse list including A2M and IGHA2. Click on the LocusID value (2 and 3494, respectively) to display each LocusLink Report. |
| A2* | Browse list of loci with words beginning with "A2" in any data field. |
| Macroglobulin | Browse list of loci with the word "macroglobulin" in any data field. |
| 2[chr] | Browse list of loci on chromosome 2. |
| protein kinase 3[chr] | Browse list of loci with records containing the word protein and the word kinase and a location on chromosome 3. |
| 12p1* | Browse list of loci recorded as having the upper range somewhere in 12p13, 12p12, or 12p11. Warning: This is not a range search. Note: Wild cards cannot be used in combination with field-restricted searches (like [mim], [chr]). |
| 4.1.2.13 | Browse list of loci encoding enzymes with this EC number (ALDOA, ALDOB, and ALDOC). |
| 103950[mim] | Browse list containing only A2M, because 103950 is the MIM number for A2M. |
| 12305* | Browse list of loci with numbers beginning 12305 (in this case based on MIM numbers and a GDB id) in any data field. Note: Stemming does not work with field-restricted searches ([mim], [chr]). |
| Hs.74561 | Browse list containing only A2M, because Hs.74561 is the UniGene cluster number for A2M. |
| AF053356 | Browse list of loci with sequences in this GenBank record. Note: accession data are still being integrated; some accession numbers may not yield a result. |
- LocusLink Report Page
| Data Element | Definition |
|---|---|
| Official Nomenclature | The official symbol and name of a gene. |
| Interim Nomenclature | If the official symbol and name have NOT been established, an interim symbol and name are provided. |
| LocusID | A unique NCBI LocusID is assigned to each locus. LocusIDs are stable identifiers of a locus independent of symbols or other identifiers. |
| Alternate Symbols | Alias symbols. These are compiled from previous nomenclature, the published literature, or sequence records. |
| Product | The preferred gene product name. These names may be revised to reflect current usage. |
| Alias | Alternative product names. |
| EC number | Enzyme Commission number(s). |
| Chromosome | The chromosome(s) to which this locus is mapped. |
| Position | The cytogenetic location. |
| OMIM | The MIM number assigned to this gene product. |
| UniGene | The UniGene cluster that represents this locus. |
| Phenotype | The name of the disease that may result from variants at this locus; this is linked to the OMIM record for the disease. |
| Links | Links to other related WWW sites. |
| Reference Sequences | All RefSeq records created for a given locus. Multiple records are distinguished from each other by a unique locus abbreviation (the gene symbol followed by incremental lowercase alphabetic characters) and a brief descriptor of the transcript variant. This section provides data on (and links to): the GenBank source sequence, the nucleotide RefSeq record (nucleotide accession numbers have an 'NM_' prefix), and the protein RefSeq record (the 'NP_' prefix). |
| GenBank Sequences | Accession: A subset of representative GenBank accession numbers linked to nucleotide and protein data. These accession numbers are initially derived from a variety of collaborations. These data are reviewed frequently and the accessions listed for a given locus may change over time. EST accession numbers are provided only if no other sequence data is available to represent the locus. Type: Molecule type for the nucleotide record (m = mRNA, g = genomic DNA, e = EST, u = undetermined). Protein data: Protein gi numbers link to the GenBank protein record. The presence of the blue button link indicates that structure data for related protein sequences is available. |
- How are UniGene links handled? Sometimes I don't see a link and sometimes I see more than one.
This is a temporary situation. The goal is to improve the correspondence between UniGene and LocusLink such that for any gene anchored with sequence data, there is a single UniGene cluster.
- I know there are more accession numbers for a gene than what I see listed. Why?
The answer depends in part on whether the gene has a provisional or a reviewed RefSeq record, or no RefSeq record at all.
If there is no RefSeq record, each accession number is one suggested to represent the gene, but no tools have been used to identify other related sequences.
If the RefSeq record is in the provisional category, only records annotated with complete coding regions, which do not exceed a cutoff nucleotide mismatch level, are listed.
If reviewed, the sequences used to generate the sequence "assembly" are reported explicitly. Other GenBank records that represent this gene are also listed; however, this list is not intended to be comprehensive. Additional 'related' sequence information will always be available as Entrez 'related sequences' (or 'neighbors') reports, BLAST search results, etc.
- Unigene
http://www.ncbi.nlm.nih.gov/UniGene/index.html
Examples
- What is Unigene?
- An automatically created database containing a non-redundant set of gene-oriented clusters
- Each cluster represents a unique gene (in an ideal world)
- Each cluster also contains information on tissue type where expression has been found and map location
- Source databases are the mRNA and genomic sequences from GenBank and the dbEST
- EST - Expressed Sequence Tag
- Unigene exists for Human, Mouse and Rat currently
- Clusters do not contain contigs or consensus sequences
- How are clusters created?
- The process proceeds in stages, with each stage adding less reliable data to the results of the preceding stage
- Screen for contaminants, repeats, and low complexity sequences using DUST
- Includes mitochondrial, ribosomal and vector contaminants
- Gene links identified
- mRNA and genomic data compared with itself to identify initial clusters
- EST added to previous clusters using WHALE
- ESTs are compared with gene clusters
- Any EST that would join two clusters from the previous stage are discarded
- Any resulting cluster that does not contain a sequence with a Poly a signal or two 3' ESTs is discarded
- Resulting clusters are called "anchored clusters" because their 3' end is presumably known
- Clone based edges are added
- ESTs not belonging to an anchored cluster are rechecked at a lower level of stringency and added to the cluster they match
- Clusters of size 1 (infrequently expressed genes) are compared to other clusters at lower stringency and merged
- Resulting clusters are compared with previous build and renumbered
- Searching Unigene
- You may search using keywords, GenBank identifiers, map location, chromosome, or library source
- You may also use modifiers such as @chr(number) and @lib(number)
- Things to be aware of when using Unigene
- Cluster assignments are not stable
- Because new data are constantly being added, two previously separate clusters may be merged, and one of their cluster IDs is then retired
- The clustering algorithms are greedy
- This allows splice variants, point mutations and poor-quality EST sequence to be introduced into a cluster
- This is why consensus sequences are not calculated and contigs are not built
- The quality of the dataset is only as good as the source
- The error rate for EST sequencing is 3-5 %
- The dbEST contains contaminants and errors, which can translate to faulty clustering
- Multiple clusters can contain different parts of the same gene
- The 5' and 3' ESTs for a gene may both exist but not yet overlap. In this case they are placed in separate clusters. When bridging sequence becomes available, one of the cluster IDs is retired
- The same accession number can be present in multiple clusters with the same or different gene products
- This often occurs when genomic sequence is used, as the large DNA fragments contain more than one gene
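The instability of cluster IDs described above can be illustrated with a toy sketch. This is not the UniGene build procedure: the sequence names, the numeric cluster IDs and the rule that the smaller ID survives a merge are all assumptions made for illustration.

```python
# Toy illustration of cluster merging; NOT the actual UniGene build code.
# Sequence names and the "keep the smaller ID" rule are assumptions.

class Clusters:
    def __init__(self):
        self.parent = {}      # sequence -> representative sequence
        self.cluster_id = {}  # representative sequence -> cluster ID
        self.retired = []     # IDs retired by merges, in order
        self.next_id = 1

    def add(self, seq):
        """Place a new sequence in its own cluster."""
        if seq not in self.parent:
            self.parent[seq] = seq
            self.cluster_id[seq] = self.next_id
            self.next_id += 1

    def find(self, seq):
        """Return the representative of the cluster holding seq."""
        while self.parent[seq] != seq:
            seq = self.parent[seq]
        return seq

    def link(self, a, b):
        """Record that a and b overlap, merging their clusters if needed."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Keep the smaller cluster ID and retire the other (assumed rule).
        if self.cluster_id[ra] > self.cluster_id[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.retired.append(self.cluster_id.pop(rb))

    def id_of(self, seq):
        """Return the current cluster ID for seq."""
        return self.cluster_id[self.find(seq)]


clusters = Clusters()
for name in ("est_5prime", "est_3prime", "bridging_mrna"):
    clusters.add(name)

# The 5' and 3' ESTs do not overlap, so they start in separate clusters.
separate_before = clusters.id_of("est_5prime") != clusters.id_of("est_3prime")

# A bridging sequence overlaps both, so the two clusters become one.
clusters.link("est_5prime", "bridging_mrna")
clusters.link("bridging_mrna", "est_3prime")
```

After the two links, all three sequences share one cluster ID and the other IDs appear in `clusters.retired`. A cluster ID saved from an earlier build may therefore no longer exist.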
- OMIM (Online Mendelian Inheritance in Man)
http://www.ncbi.nlm.nih.gov/Omim/
- This database is a catalog of human genes and genetic disorders
- Contains textual information, pictures, and reference information
- Many links to NCBI's Entrez database of MEDLINE articles and sequence information.
- This database can be searched by gene name or disease description
- The OMIM numbering system
- Each OMIM entry is given a unique six-digit number whose first digit indicates the mode of inheritance of the gene involved:
- 1----- (100000- ) Autosomal dominant (entries created before May 15, 1994)
- 2----- (200000- ) Autosomal recessive (entries created before May 15, 1994)
- 3----- (300000- ) X-linked loci or phenotypes
- 4----- (400000- ) Y-linked loci or phenotypes
- 5----- (500000- ) Mitochondrial loci or phenotypes
- 6----- (600000- ) Autosomal loci or phenotypes (entries created after May 15, 1994)
- An allelic variant is designated by the MIM number of its parent entry, followed by a decimal point and a unique 4-digit variant number. For example, allelic variants (mutations) at the factor IX (hemophilia B) locus are numbered 306900.0001 to 306900.0101. The beta-globin locus (HBB) is numbered 141900; sickle hemoglobin is numbered 141900.0243.
- An asterisk (*) before an entry number means that the phenotype determined by the gene at the given locus is separate from those represented by other asterisked entries and that the mode of inheritance of the phenotype has been proved (in the judgment of the authors and editors). In general, an attempt has been made to create only one asterisked entry per gene locus.
- No asterisk before an entry number means that the mode of inheritance has not been proved, although suspected, or that the separateness of this locus from that of another entry is unclear.
- A number symbol (#) before an entry number means that the phenotype can be caused by mutation in any of 2 or more genes. The #-labeled entries are considered useful for avoiding repetition of the same phenotypic information in several entries and necessary because it is often unknown which genetic type is referred to in a particular report.
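The numbering rules above can be sketched in a few lines of Python. The function and dictionary names are made up for illustration and are not part of any OMIM interface. The sketch only splits a number into its parts and looks up the first digit.

```python
# Sketch of the OMIM numbering rules listed above.
# Names here are illustrative; this is not an OMIM API.

MIM_CATEGORIES = {
    "1": "Autosomal dominant (entries created before May 15, 1994)",
    "2": "Autosomal recessive (entries created before May 15, 1994)",
    "3": "X-linked loci or phenotypes",
    "4": "Y-linked loci or phenotypes",
    "5": "Mitochondrial loci or phenotypes",
    "6": "Autosomal loci or phenotypes (entries created after May 15, 1994)",
}


def describe_mim(number):
    """Split a MIM number such as '141900.0243' into its parts."""
    entry, _, variant = number.partition(".")
    if len(entry) != 6 or not entry.isdigit() or entry[0] not in MIM_CATEGORIES:
        raise ValueError(f"not a valid six-digit MIM number: {number!r}")
    if variant and (len(variant) != 4 or not variant.isdigit()):
        raise ValueError(f"not a valid 4-digit variant number: {number!r}")
    return {
        "entry": entry,
        "category": MIM_CATEGORIES[entry[0]],
        "variant": variant or None,
    }


sickle = describe_mim("141900.0243")   # sickle hemoglobin, variant of HBB
factor_ix = describe_mim("306900")     # factor IX (hemophilia B) locus
```

Here `sickle["variant"]` is `"0243"` under parent entry `141900`, and `factor_ix["category"]` is the X-linked category because the number begins with 3.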
- ORF Finder
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Examples
- A graphical analysis tool which finds all open reading frames of a selectable minimum size
- You may paste in your own sequence or an accession number
- You may restrict the size of the sequence in which to search for ORFs
- Identifies all open reading frames using the standard or alternative genetic codes.
- Once the ORF Finder results have been returned you can:
- Save the deduced amino acid sequence in various formats
- Search the amino acid sequence against the sequence database using BLAST
- Change the minimum cutoff size for ORFs and redraw
- Display alternative initiation codons
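To make the idea of an open reading frame search concrete, here is a minimal Python sketch. It is not the algorithm used by ORF Finder. It assumes the standard genetic code only, treats ATG as the sole start codon, and reports an ORF as the stretch from the first ATG to the next in-frame stop. The function names and the example sequence are invented for illustration.

```python
# Minimal six-frame ORF scan; a sketch, not NCBI's ORF Finder algorithm.
# Assumes the standard genetic code: ATG start; TAA, TAG, TGA stops.

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Return (strand, frame, start, end) for ORFs of at least min_len bases.

    start and end are 0-based offsets on the scanned strand; end is
    exclusive and includes the stop codon. frame is 1, 2 or 3.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i               # first start codon in this ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None            # look for the next ORF
    return orfs


example = "CCATGAAATTTGGGTAACC"   # made-up sequence
orfs = find_orfs(example, min_len=9)
```

Raising `min_len` plays the same role as the minimum cutoff size on the web form: with `min_len=30` the short ORF in this example is no longer reported.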
IV. Other Genetic Analysis Tools not at NCBI
- EST Tools
- UniBLAST
http://gcg.tigem.it/UNIBLAST/uniblast.html
The UniBlast server will perform a local Blast search against the UniGene database or against UniNewGene, a locally generated version of UniGene devoid of all the clusters containing an mRNA or a CDS. The researcher will be able to identify in one single step the UniGene cluster containing the EST(s) which match with the query sequence. Searching against UniNewGene will report only the UniGene cluster which should not contain any known CDS or transcript. Available search programs are NCBI Blast ver. 1.4.8 (non-gapped alignments) or WU-Blast ver. 2.0 (gapped alignments). This service is not interactive: search results will be sent back to the user by email.
- The EST Extractor
http://gcg.tigem.it/BLASTEXTRACT/estextract.html
The EST Extractor is a tool for building clusters (corresponding to "virtual transcripts") from dbEST, starting from a sequence Accession Number or a plain DNA/Protein sequence. It will perform a remote interactive Blast search (either a BLASTN or a TBLASTN analysis) versus dbEST at NCBI using the 1.4 or 2.0 version of the Blast program and will reformat the search output, adding the following possibilities:
- select the interesting matches from the output;
- retrieve the EST sequences from the locally maintained databases*;
- display the links with (remote and local) UniGene entries*;
- assemble or align the selected sequences interactively*;
- repeat the whole EST Extractor procedure with a selected contig from the assembly, or
- perform a remote Blast search at NCBI with a selected contig from the assembly.
* These operations can be done with both end sequences for each clone.
- Prediction Tools
- Promoter Prediction by Neural Network
http://www.fruitfly.org/seq_tools/promoter.html
Example
- Your sequence should consist of one-letter nucleotides (A, C, G, T). The sequence should be in plain or FASTA format.
- Select whether you want to use the NNPP version for prokaryotes or for eukaryotes (it defaults to eukaryotes).
- You can choose whether to show predictions for the reverse strand as well as the forward strand.
- You may also set the score cutoff (it defaults to 0.8). Potential promoters are assigned scores; only those that exceed the score cutoff are shown. The lower the score cutoff, the more potential promoters will be shown (see "Estimated Accuracy of Prediction" table for details). The score cutoff should be between 0 and 1.
- The output of NNPP is a list of the 51-base (eukaryotes) or 46-base (prokaryotes) regions that the network judges most likely to be promoters. Because promoter elements may appear at different relative positions, the positional accuracy of promoter prediction is +/- 3bp including the transcription start.
- The TSS (transcription start site) is indicated by a large capital letter
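The effect of the score cutoff can be shown with a small Python sketch. The positions and scores below are made up and are not real NNPP output. The function name is also hypothetical.

```python
# Hypothetical predictions as (start position, score) pairs.
# These values are invented for illustration, not real NNPP output.
predictions = [(120, 0.95), (480, 0.82), (910, 0.61), (1500, 0.40)]


def filter_by_cutoff(preds, cutoff=0.8):
    """Keep predictions whose score exceeds the cutoff (between 0 and 1)."""
    if not 0 <= cutoff <= 1:
        raise ValueError("score cutoff should be between 0 and 1")
    return [p for p in preds if p[1] > cutoff]


default_hits = filter_by_cutoff(predictions)        # default cutoff of 0.8
relaxed_hits = filter_by_cutoff(predictions, 0.5)   # lower cutoff, more hits
```

With the default cutoff two of the four candidates are kept. Lowering the cutoff to 0.5 keeps three, at the cost of more likely false positives.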
- Splice Site Prediction
http://www.fruitfly.org/seq_tools/splice.html
Example
- Your sequence should consist of one-letter nucleotides (A, C, G, T).
- The sequence should be in plain or FASTA format. (ex. V00574)
- Select whether you want to use the neural network version for Human or for Drosophila melanogaster sequences. You can choose whether to show predictions for the reverse strand as well as the forward strand.
- The output of the neural networks is a list of the 15-base and 41-base regions that the networks judge most likely to be 5' and 3' splice sites, respectively. The junctions are indicated by a larger font size.
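Both prediction servers above ask for one-letter nucleotides in plain or FASTA format. Before pasting a sequence into a form, it can be checked locally with a sketch like the following. It assumes a single sequence, accepts lower-case input, and rejects anything other than A, C, G and T. How strictly the servers themselves check input is not stated here, and the function name is hypothetical.

```python
def read_nucleotides(text):
    """Return the sequence from plain or single-record FASTA text, upper-cased.

    Raises ValueError if anything other than A, C, G or T is present.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines and lines[0].startswith(">"):
        lines = lines[1:]            # drop the FASTA header line
    seq = "".join(lines).upper()
    bad = sorted(set(seq) - set("ACGT"))
    if bad:
        raise ValueError(f"unexpected characters: {''.join(bad)}")
    return seq


fasta = ">example header (made up)\nacgtac\nGGTA\n"
plain = "ACGTACGGTA"
from_fasta = read_nucleotides(fasta)
from_plain = read_nucleotides(plain)
```

Both inputs yield the same ten-base sequence. A sequence containing ambiguity codes such as N would raise an error under this sketch.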
- Identification of Target Genes for DNA Binding Proteins
http://gcg.tigem.it/TargetFinder.html
- Allows users to search for candidate target genes of DNA-binding proteins in a database.
- It looks for binding sites located in context with other important transcription regulatory signals and regions, like the TATA element, the transcription start site, the promoter and so on, thereby greatly reducing the background usually associated with this kind of search.
- Results are e-mailed back to the user.
- Interpreting the results - You should be very cautious when interpreting the outcome of a TargetFinder search. Although TargetFinder tries to restrict its search to meaningful regions of a gene, the binding sites of DNA-binding proteins usually have a low information content, so searching for them in a whole database is bound to give a rather high noise level.
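A rough calculation shows why short binding sites produce so much noise. The sketch below estimates how many times one fixed site would match purely by chance. It assumes equal base frequencies, independent positions, exact matching and a single strand, none of which holds exactly for real sequence. The database size is hypothetical and the function name is made up.

```python
def expected_chance_matches(site_length, database_bases):
    """Expected number of exact chance matches to one fixed site.

    Assumes equal base frequencies and independent positions, and scans
    one strand only; real sequence violates these to some degree.
    """
    positions = max(database_bases - site_length + 1, 0)
    return positions / 4 ** site_length


# Hypothetical database of 1,000,000 bases.
short_site = expected_chance_matches(6, 1_000_000)    # 6-base site
long_site = expected_chance_matches(12, 1_000_000)    # 12-base site
```

Under these assumptions a 6-base site is expected to match by chance a few hundred times in a million bases, while a 12-base site is expected to match less than once. Degenerate positions in a real binding site lower its information content further and increase the number of chance matches.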
- Genome Analysis
- National Center for Genome Resources
http://www.ncgr.org/
Not Yet Available
- GeneX - a repository of gene expression data with an integrated toolset that will let researchers analyse mRNA expression data and facilitate comparison with other such data
Status: Beta release in 2 months
- HomologyDB
- TreeBLAST - integrates sequence similarity searching, sequence alignment and tree analysis
Status: functional prototype, not yet available
- R-Genes - investigatory subset focussing on plant disease resistance genes and their homologs in other organisms. Over 500 family members known.
Status: Currently available
- HomologyDB - database to store homology information that will be the foundation of the previous two projects.
Status: Prototype second half 1999
Available
- GSDB - Genome Sequence Database
- Updated nightly
- Data retrieval methods
- BLAST (http://seqsim.ncgr.org/newBlast.html )
- Java Sequence Viewer
- GSDB Maestro - searches database based on any combination of 18 fields
- SANBI Maestro - allows access to sequences associated with STACK
- GSDB Ad Hoc Query Tool - Web-based form that lets users query the database using SQL statements. A free account and an understanding of the database schema are required
- GSDB Excerpt - Allows users to extract exact portions of a GSDB sequence
- FlatFile retrieval - individual file retrieval
- MAR-Finder - searches for probable Matrix Attachment Regions (MAR)
- PathDB - a general metabolic pathway database that will represent the current knowledge.
http://www.ncgr.org/software/pathdb/
- The main data types represented by PathDB are compounds, reactions, enzymes and other metabolic proteins and pathways.
- PathDB attempts to include very rich descriptions of the kinetic, thermodynamic and physico-chemical properties of pathway components.
- All the data are categorized by taxonomy.
- Live data updates
- Simple queries can be made directly through the Web with a normal browser.
- A Java software application connects directly to the database and provides a user-friendly means to make complex queries. This interface makes it possible to navigate the relations between the various entities stored. It also makes it possible to carry out set operations on multiple query results.
- RNA Analysis
- Resources for RNA Analysis
- The RNA World at IMB-Jena http://www.imb-jena.de/RNA.html
- 3D Structure database links
- Sequences and secondary structure database links
- Protocols
- Software
- Books
- Meetings
- Algorithms, thermodynamics and database for RNA secondary structure http://www.ibc.wustl.edu/~zuker/rna/
- Knowledge Base
- GeneCards: Human Genes, Proteins and Disease
http://bioinformatics.weizmann.ac.il/cards/index.html
- GeneCards is a database of human genes, their products and their involvement in diseases.
- Provides concise information about the functions of all human genes that have an approved symbol, as well as selected others.
- The information presented here has been automatically extracted from various resources by scripts developed at the Weizmann Institute.
- KEGG: Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
- Goals of the KEGG project
- To computerize all aspects of cellular functions in terms of the pathway of interacting molecules or genes
- To maintain gene catalogs for all organisms and link each gene product to a pathway component
- To organize a database of all chemical compounds in the cell and link each compound to a pathway component
- To develop computational technologies for pathway comparison, reconstruction and analysis
- Useful Links
- Smith-Waterman
- GenWeb Server
http://croma.ebi.ac.uk:80/cgi-bin/genweb/admin/login.cgi
One of the very few publicly accessible locations that will let you do Smith-Waterman searches.
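For readers who have not met the Smith-Waterman algorithm, the following Python sketch shows the core of a local alignment between two sequences. It is a minimal version with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the example sequences are arbitrary choices for illustration and are not the settings used by any particular server.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for a local alignment.

    Uses a linear gap penalty; the scoring values are illustrative defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # scoring matrix, first row/col 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,       # gap in b
                H[i][j - 1] + gap,       # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTACGGG", "CCACGTACAA")
```

In this example the two sequences share the stretch ACGTAC, so the best local alignment is that stretch aligned to itself with a score of 12. The algorithm fills the whole matrix, which guarantees the optimal local alignment but is much slower than heuristic tools such as BLAST on large databases.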
- EBI (European Bioinformatics Institute)
- BCM Search Launcher
http://dot.imgen.bcm.tmc.edu:9331/
A central site to do many types of searches. A batch client is available.
- General protein sequence/pattern searches
- Species-Specific protein sequence searches
- Nucleic acid sequence searches
- Multiple sequence alignments
- Pairwise sequence alignments
- Gene feature searches
- Sequence utilities
- Protein secondary structure prediction
- Alert Servers
- Sequence Alerting Service http://www.bork.embl-heidelberg.de/Alerting/
- MIPS Alert http://vms.mips.biochem.mpg.de/mips/programs/alert.html
- Swiss Shop Alert Service http://www.expasy.ch/swiss-shop/
These services will automatically perform daily comparisons of your sequence of interest to the latest database updates. If there is a new sequence that has a high degree of similarity to your sequence of interest, you will automatically be sent an email informing you of this.