Contents
- Introduction to Protein Analysis
- Obtain a sequence of interest.
- Identify ORFs and translate into protein
- Identify Similar Proteins from the Databases
- Align your sequence vs similar sequences and look for Gene Families
- Determine the putative function of your protein
- Determine the putative structure of your protein
- Protein Structure Visualization Tools
- Other Interesting Things You Can Do With Proteins
V. Align your sequence vs similar sequences and look for Gene Families
- Multiple Alignments - Multiple alignments are the most useful tool for determining which regions of your sequence are significant. They are also the starting point for identifying distant members of gene families through the construction of profiles and blocks.
- ClustalW
http://www.ebi.ac.uk/clustalw/
http://www.ebi.ac.uk/clustalw/help.html Help File
Example
- Generates Multiple Sequence Alignments (MSAs)
- Output options
- Export MSF file for viewing with GCG or other programs
- Java Applet - supports viewing and editing
- What is the advantage of an MSA over BLASTP?
- Residue coloring
- Easier to visualize patterns
- Can specify files to be aligned and their order
- BCM Search Launcher
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
Again, this is a good place to start, but the server is usually very busy. MSAs take a long time to compute, and Baylor restricts the number of sequences that can be aligned per program in one pass.
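To make the idea of "significant regions" concrete, here is a minimal Python sketch that scores each column of an already-aligned set of sequences by how often its most common residue occurs. The function names, the gap handling, and the toy alignment are assumptions made for illustration; this is not how ClustalW itself reports conservation.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, return the fraction of sequences sharing the most common residue.

    Gaps ('-') never count as matches, so gappy columns score low (an assumption).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores

def conserved_columns(alignment, threshold=1.0):
    """Return 0-based indices of columns whose conservation is at least `threshold`."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s >= threshold]

# Toy alignment, invented for illustration only.
toy = [
    "MKT-LLVA",
    "MKSALLVA",
    "MRT-LIVA",
]
print(conserved_columns(toy))  # columns identical in every sequence
```

Runs of fully conserved columns are the kind of pattern that is easy to see in an MSA but hard to pick out of a list of pairwise BLASTP hits.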
- Domain analysis - Searching against a pre-compiled collection of domains can give useful hints about the function of the query sequence.
- Pfam
http://www.sanger.ac.uk/Pfam/
Example
- Pfam A - curated families, each with an HMM that can be used for searching and alignment
- Pfam B - sequences not included in Pfam A are clustered automatically
- Each Pfam A family has 4 elements
- Annotation - source used to make the family, how it was aligned, thresholds for the HMMs, etc.
- Seed alignment - a curated alignment, containing representative family members that are judged to be well aligned
- Profile HMM - constructed with HMMER 2.0
- Full alignment - the pHMM is used to search the database, and those sequences scoring above the family-specific threshold are aligned. It should contain only members of the family.
- New sequence, nothing known
- Search the sequence against the collection of pHMMs to locate regions of the sequence that belong to known domain families
- Already have SWISSPROT or SWISSPROT-TrEMBL id for the sequence
- Access precalculated matches through Swisspfam. Both Pfam A and Pfam B results will be displayed.
- Browse information by family or text search
- Download the dataset and search locally
- Belvu - UNIX only
- Java alignment viewer - standalone
- Jalview - java applet
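The family-specific threshold described above can be illustrated with a short Python sketch that keeps only the domain hits scoring at or above the cutoff for their own family. The `Hit` record, the family names, the scores, and the cutoffs are all hypothetical; this does not parse real Pfam or HMMER output, and treating a tie with the threshold as a match is an assumption.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One hypothetical domain match on the query sequence."""
    family: str
    start: int
    end: int
    score: float

def significant_hits(hits, thresholds):
    """Keep hits that reach their own family's threshold, ordered by start position.

    Hits for families with no known threshold are dropped.
    """
    kept = [
        h for h in hits
        if h.family in thresholds and h.score >= thresholds[h.family]
    ]
    return sorted(kept, key=lambda h: h.start)

# Invented scores and cutoffs, for illustration only.
toy_hits = [
    Hit("famB", 120, 180, 12.5),
    Hit("famA", 10, 95, 48.0),
    Hit("famA", 200, 260, 19.0),
]
toy_thresholds = {"famA": 25.0, "famB": 10.0}

for hit in significant_hits(toy_hits, toy_thresholds):
    print(hit.family, hit.start, hit.end, hit.score)
```

Because each family carries its own cutoff, a score that is significant for one family (12.5 for `famB` here) can be below threshold for another (19.0 for `famA`).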
- SCOP - Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/scop/
http://scop.stanford.edu/scop/ USA West Coast Mirror
http://scop.mrc-lmb.cam.ac.uk/scop/help.html Help File
- What is SCOP?
- A database containing all of the entries from PDB and proteins for which there are published descriptions but whose coordinates are not yet available.
- Classification is done by visual inspection and comparison
- Unit of classification is the protein domain
- Small proteins and most medium size proteins have one domain and are therefore treated as a whole
- The domains in larger proteins are usually classed individually
- Classification is hierarchical
- The first two levels, family and superfamily, describe near and far evolutionary relationships
- The third level, fold, describes geometrical relationships
- The distinction between evolutionary relationships and those arising from physics and chemistry is unique to this database
- Classification Structure
- Family
- Must have common evolutionary origin
- Sequence identity must be > 30%, or
- Identity < 30% but with similar structure and function
- e.g. globins with sequence identity of 15%
- Superfamily
- Families with low sequence identities but with structure or function that suggest common origin are placed in superfamilies
- e.g. variable and constant domains of immunoglobulins
- Common Fold
- Families and Superfamilies have a common fold if their proteins have the same major secondary structure in the same arrangement and with the same topological connections
- The structural similarities probably arise from the physics and chemistry of proteins favoring certain packing arrangements
- Class
- All α
- All β
- α/β - contains both structures in mixed configuration
- α+β - contains both structures in segregated regions
- Multidomain - different structures or no homologs
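The 30% rule for family membership can be sketched in Python as follows. This is only an illustration of the criterion as stated above: the function names are hypothetical, the two sequences are assumed to be already aligned, identity is computed over columns where both sequences have a residue (other denominators are in common use), and the structure/function judgement is passed in as a flag because SCOP makes that call by visual inspection, not by formula.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

def family_candidate(a, b, similar_structure_and_function=False, cutoff=30.0):
    """Apply the family rule: identity above the cutoff, OR lower identity
    backed by similar structure and function (supplied by a human curator)."""
    return percent_identity(a, b) > cutoff or similar_structure_and_function

# Toy aligned pair, invented for illustration only.
print(percent_identity("ACDE", "ACDF"))
print(family_candidate("AAAA", "CCCC"))
print(family_candidate("AAAA", "CCCC", similar_structure_and_function=True))
```

The flag is what lets a pair like the 15% identity globins still qualify as one family.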
- Accessing SCOP
- Browse the hierarchy
- Organized in a tree structure as HTML pages
- Root -> Class -> Fold -> Superfamily -> Family -> Domain -> PDB Accession
- Similarity search
- Match protein sequence to the database via BLASTP, FASTA or Ssearch
- By keyword
- By PDB identifier
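The tree layout and the keyword lookup can be pictured with a small Python sketch. The nested dictionary, the placeholder names (`class-A`, `XXX1`, and so on), and both functions are hypothetical; they mirror the Class -> Fold -> Superfamily -> Family -> Domain -> PDB Accession ordering only and do not reflect real SCOP entries or any SCOP file format.

```python
# Level names below the root, in the order given in the outline.
LEVELS = ["Class", "Fold", "Superfamily", "Family", "Domain", "PDB Accession"]

def walk(tree, path=()):
    """Yield every root-to-leaf path; inner nodes are dicts, leaves are lists of accessions."""
    if isinstance(tree, dict):
        for name, child in tree.items():
            yield from walk(child, path + (name,))
    else:
        for accession in tree:
            yield path + (accession,)

def find_by_keyword(tree, keyword):
    """Return every path in which some level's name contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [p for p in walk(tree) if any(keyword in part.lower() for part in p)]

# Placeholder hierarchy, invented for illustration only.
toy_tree = {
    "class-A": {
        "fold-A1": {
            "superfamily-A1a": {
                "family-A1a-i": {"domain-x": ["XXX1", "XXX2"]},
            },
        },
    },
    "class-B": {
        "fold-B1": {
            "superfamily-B1a": {
                "family-B1a-i": {"domain-y": ["YYY1"]},
            },
        },
    },
}

for path in find_by_keyword(toy_tree, "fold-b1"):
    print(" -> ".join(path))
```

Looking up a PDB identifier is the same search with the accession as the keyword, since the accession is simply the last element of each path.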