Contents

  1. Introduction to Protein Analysis
  2. Obtain a sequence of interest
  3. Identify ORFs and translate into protein
  4. Identify Similar Proteins from the Databases
  5. Align your sequence vs similar sequences and look for Gene Families
  6. Determine the putative function of your protein
  7. Determine the putative structure of your protein
  8. Protein Structure Visualization Tools
  9. Other Interesting Things You Can Do With Proteins

VI. Determine the putative function of your protein

  1. PROSITE
    http://www.expasy.ch/prosite/
    • A database of regular expression-like patterns (motifs)
    • Each entry is thoroughly documented
    • Each entry is periodically reviewed to keep it correct
    • Current release contains 1034 documentation entries describing 1374 different patterns, rules and profiles
    • Accessing Prosite
      • FTP - the database exists as an ASCII flat file
      • E-mail server
      • Web server
      • May be included as part of various other search engines
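Because PROSITE patterns are regular expression-like, they can be translated into ordinary regular expressions. The following is a minimal sketch, not the PROSITE scanning software itself. It assumes the usual pattern conventions (`-` separates positions, `x` is any residue, `[..]` lists allowed residues, `{..}` lists forbidden residues, `(n)` or `(n,m)` gives a repeat count, `<` and `>` anchor the termini). The function names `prosite_to_regex` and `find_motif` are hypothetical, and rarer syntax is not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns are often written with a final '.'
    parts = []
    for element in pattern.split("-"):
        # Optional anchors: '<' = N-terminus, '>' = C-terminus.
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, repeat = m.group(1), m.group(2)
        if residue == "x":
            core = "."                          # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"   # forbidden residues
        else:
            core = residue                      # single letter or [ABC]
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

def find_motif(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]

# N-glycosylation site pattern applied to a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MNGSAANPTQ"))
```

The second asparagine in the made-up sequence is followed by a proline, so only the first site is reported.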
  2. PRINTS - a diagnostic collection of protein fingerprints
    http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
    • What is a protein fingerprint?
      • A collection of aligned, unweighted sequence motifs
      • Within an ungapped MSA one can often find one or more distinct motifs. A fingerprint is the set of motifs, together with their order, that characterizes a protein family
      • In a database search, this gives a greater chance of finding a distant relative since a sequence that matches 4 of 7 motifs may still be valid if those 4 are in the correct order and properly spaced
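The idea that a partial match still counts when the matched motifs appear in the correct order can be sketched as follows. This is an illustration only, not the PRINTS scoring method. The function name `ordered_motif_matches` and the input format (one best-hit position per motif) are assumptions, and the spacing between motifs is not checked here.

```python
def ordered_motif_matches(hit_positions):
    """Return indices of the largest set of motifs hit in fingerprint order.

    hit_positions[i] is the start of the best hit for motif i in the query,
    or None if motif i was not found. Motif indices follow the order of the
    motifs in the fingerprint.
    """
    hits = [(i, p) for i, p in enumerate(hit_positions) if p is not None]
    best = []  # best[k] = longest correctly ordered chain ending at hits[k]
    for k, (motif, pos) in enumerate(hits):
        chain = [motif]
        for j in range(k):
            # Extend an earlier chain only if that motif lies upstream.
            if hits[j][1] < pos and len(best[j]) + 1 > len(chain):
                chain = best[j] + [motif]
        best.append(chain)
    return max(best, key=len, default=[])

# Hypothetical 7-motif fingerprint: motifs 1 and 5 are missing and
# motif 3 is found upstream of motif 2, i.e. out of order.
positions = [10, None, 55, 40, 90, None, 130]
matched = ordered_motif_matches(positions)
print(f"{len(matched)} of {len(positions)} motifs in order: {matched}")
```

With these made-up positions, 4 of the 7 motifs form a consistently ordered chain, which is the kind of partial match the text describes.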
    • Database source
      • OWL - a non-redundant compilation of SWISSPROT, PIR, GenBank translations and NRL-3D
    • Contents of current release
      • 1100 entries covering approximately 6510 individual motifs
      • Major updates quarterly
    • Accessing PRINTS
      • Keyword search
        • Can retrieve the full entry
        • Can retrieve the MSA the entry was built from
        • Can visualize the motifs in 3D where the coordinates are available
        • Can search with either protein or DNA sequence
        • The BLAST results include a list of the top 10 most frequently occurring fingerprint matches
        • The accession list of hits also includes the PRINTS name and ID, the number of motifs contained in the query and the total number of motifs in the fingerprint
        • Also contains links to Graphscan image
          • Horizontal axis represents sequence
          • Vertical axis represents the percentage score (identity) of each fingerprint element
          • Yellow blocks indicate position of matches above a 15 % threshold
          • Blue ticks mark position of matches below 15 % threshold
          • The height of the yellow block indicates percent identity
          • The width of the yellow block indicates the size of the motif
          • The number of graphs indicates the number of individual scans (1 per motif)
        • Can do single searches against the full database
        • Can do bulk searches using MULScan against the full database
          • Bulk submission results are returned by email
        • Can do single searches against a named fingerprint
        • Results shown as a series of HTML tables
          • Simple - program's best intelligent guess
          • Detailed - a sorted list of motifs
          • Complex - raw result data
          • Color in the table reflects confidence levels
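The Graphscan plot described above (sequence position on the horizontal axis, percent identity on the vertical axis, a 15 % threshold) can be approximated with a sliding window. This is a rough sketch under a strong simplifying assumption: each motif is reduced to a single consensus string, whereas a real fingerprint motif is a set of aligned sequences. The function names `identity_profile` and `blocks_above` are hypothetical.

```python
def identity_profile(sequence, motif):
    """Percent identity of the motif against every window of the sequence."""
    width = len(motif)
    profile = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        same = sum(a == b for a, b in zip(window, motif))
        profile.append(100.0 * same / width)
    return profile

def blocks_above(profile, threshold=15.0):
    """Return (start, score) for windows scoring above the threshold.

    These correspond to the blocks in the plot; windows at or below the
    threshold correspond to the ticks.
    """
    return [(start, score) for start, score in enumerate(profile)
            if score > threshold]

# One scan per motif, as in the Graphscan image (made-up data).
sequence = "AAGWCDAA"
for motif in ["GWCD", "CDAA"]:
    print(motif, blocks_above(identity_profile(sequence, motif)))
```

The width of each reported block is the motif length, and its height is the percent identity of that window.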
    • Application - Example of Rhodopsin-like fingerprint
      • OPSD_SHEEP is a clear family member, matching all 7 TM regions
      • NY5R_HUMAN is not found by PROSITE because it contains changes to the third TM
      • YMJC_CAEEL is a partial match (5/7), fails PROSITE
      • UL78_HCMVA is a poor twilight zone match
  3. BLOCKS
    • What is BLOCKS
      • Blocks are ungapped MSAs representing conserved protein regions
      • The BLOCKS database consists of blocks from documented protein families
        • The primary list is obtained from PROSITE
        • The list is subjected to an automated motif finding process that does not use PROSITE patterns
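To make "ungapped" concrete, the sketch below pulls the gap-free column ranges out of a gapped multiple alignment. It is only an illustration of what an ungapped block looks like. It is not the automated motif finding process used to build the BLOCKS database, and it does not score conservation. The function name `ungapped_blocks` and the `min_width` parameter are hypothetical.

```python
def ungapped_blocks(alignment, min_width=3):
    """Return (start, end) column ranges that contain no gap characters.

    alignment is a list of equal-length aligned sequences using '-' for
    gaps; end is exclusive, so alignment[i][start:end] is the block row.
    """
    ncols = len(alignment[0])
    blocks, start = [], None
    # Iterate one column past the end so a trailing run is closed too.
    for col in range(ncols + 1):
        gap_free = col < ncols and all(seq[col] != "-" for seq in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks

# Made-up alignment of three sequences.
msa = ["MKT-AILGG",
       "MKTQAI-GG",
       "MRT-AILGG"]
for start, end in ungapped_blocks(msa, min_width=2):
    print(start, end, [seq[start:end] for seq in msa])
```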
    • The current release
      • Contains 4034 blocks representing 994 families (BLOCKS)
      • Database searches have access to an enhanced version (BLOCKS+) containing
        • 2277 blocks from PRINTS that are not in PROSITE
        • 1247 blocks from Pfam that are not in PRINTS or PROSITE
        • 1628 blocks from ProDom that are not in Pfam, PRINTS or PROSITE
        • 312 blocks from Domo that are not in ProDom, Pfam, PRINTS or PROSITE
    • Accessing BLOCKS
      • BLOCKS Searcher
        • Can use protein or DNA as query
        • Search BLOCKS, BLOCKS+ or PRINTS
      • IMPALA
        • Developed by NCBI BLAST people
        • Slightly different method of computation from BLOCKS Searcher
        • Hits reported by both BLOCKS Searcher and IMPALA can be considered reliable
        • Hits reported by only one of the two programs should be treated with suspicion
      • LAMA (Local Alignment of Multiple Alignments)
        • Compares protein MSAs with each other
        • Can search database of such alignments
        • Search is for conserved regions between families
        • Sensitive, can detect weak similarities
    • Other goodies
      • Blockmaker
        • If you want to search with a protein or block that is not in the database, it must first be converted to BLOCKS format (i.e. blocks extracted)
        • Input is 2 or more related sequences. They don't have to be aligned
        • Finds the conserved blocks among the sequences
        • Returns the result in BLOCKS database format
        • Results can be pasted into the Multiple Alignment Processor to generate logos or trees, to search sequence databases using COBBLER or MAST, and to design PCR primers using CODEHOP
      • Multiple Alignment Processor
        • Creates input files for blocks based searches (MAST, LAMA)
        • Input can take up to 400 sequences in BLOCKS file format
        • Input (MSA) can also be in FASTA, CLUSTAL or MSF format
      • CODEHOP
        • Designs primers for protein MSAs
        • Intended to be used in cases where the protein sequences are distant and degenerate primers are needed
        • Input is BLOCKS format MSA of amino acids
        • Output is list of suggested degenerate primers. You must choose which sets of primers to use
        • A CODEHOP primer has a non-degenerate 5' consensus clamp and a 3' degenerate core
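The structure of such a primer (a non-degenerate 5' clamp followed by a degenerate 3' core) can be sketched as follows. This is not the CODEHOP algorithm. It assumes the standard genetic code and IUPAC ambiguity codes, designs a forward primer only, and picks the alphabetically first codon for each clamp residue where a real tool would more likely consult a codon usage table. The function names and the `core_codons` parameter are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid ('*' = stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC ambiguity codes, keyed by the sorted set of bases they stand for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
         "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N"}

def degenerate_codon(residues):
    """Degenerate codon covering every codon of the given amino acids."""
    codons = [c for c, aa in CODON_TABLE.items() if aa in residues]
    # Take the union of bases at each codon position.
    return "".join(IUPAC["".join(sorted({c[i] for c in codons}))]
                   for i in range(3))

def codehop_like_primer(block, core_codons=4):
    """Return (clamp, core) for a block of aligned, ungapped protein rows."""
    columns = list(zip(*block))
    clamp_cols = columns[:-core_codons]
    core_cols = columns[-core_codons:]
    clamp = ""
    for col in clamp_cols:
        consensus = Counter(col).most_common(1)[0][0]
        # Assumption: arbitrary (alphabetically first) codon for the clamp.
        clamp += min(c for c, aa in CODON_TABLE.items() if aa == consensus)
    core = "".join(degenerate_codon(set(col)) for col in core_cols)
    return clamp, core

# Made-up block of three aligned sequences.
clamp, core = codehop_like_primer(["GAMWK", "GAMWK", "GSMWR"], core_codons=3)
print("5'-" + clamp + core + "-3'")
```

Taking the union of bases per position means the degenerate codon can also cover codons for residues not seen in the block. That trade-off is why the degenerate core is kept short and the rest of the primer is a fixed clamp.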