Biochemistry and Biophysics D-course practical in Bioinformatics
Today you are going to use bioinformatics tools to discover
functional and structural information about some proteins. It's
possible that you may find something nobody has seen before, so
this is not necessarily another "pointless practical exercise".
Johannes is there to help you - so ask him lots of questions!
Choose a protein to work with
Here are the sequences:
- Q9K6A4
- Q9X256
- P23616
- Q9YFP5
- Q46828
- Q9X261
Divide up the sequences between you so that no more than two people
work on the same sequence. You should work individually, but
you may also discuss your protein and share information with your "partner".
Start your report
You are asked to prepare a 2-3 page report on the function and
structure of your protein. Please include answers to all
the bulleted questions on this page, and a summary of further information you discover.
It would be a good idea to create a rough version of your report now,
and email it to yourself to finish later.
What to do...
What is already known about the protein?
Each of the links above takes you to a UniProt entry.
- Briefly, what is the purpose/goal of the UniProt database and how is it produced?
Now look in detail at the UniProt entry and answer the following questions:
- What alternative names and accession numbers does this protein have?
- What organism does it come from?
- If there is a "Comments" section, read it carefully and paste this
into your report. Think about how these comments were created,
and what information they are based on.
What do we want to know?
Different research projects raise different questions about protein structure and function.
The two main scenarios are:
- You have a hypothetical protein and want to know
anything, however vague or approximate, about its possible
function.
- You are working on a protein involved in a certain process/pathway, but
want to know more details about its mode of action.
Question:
- Considering the functional information given by UniProt, which of the two
scenarios above best describes your protein?
Does the protein contain any known domain families?
In the "Cross-references" section of the UniProt page, read the
descriptions of each cross-referenced database (click on the left
column). Find out which of the databases are concerned with
"protein/domain families".
- For each of the cross-referenced domain databases, list the domains that are found
in your protein and a short summary of their function. An example
might be:
Pfam domains: PF00234 - residues 1-95 - DNA topoisomerase II gamma;
PF008631 - residues 110-183 - ATP-binding
- Do these domain databases provide any functional information that you did not
already know from the name of the protein and the UniProt "Comments"?
If so, summarise the functional information.
- Which (if any) of the domain families have members with known 3D structure?
(Hint: the domain family web page will probably contain a picture of the 3D structure,
if one is known.)
- Which regions (if any) of your protein are not assigned to domain families?
If your protein has Pfam domain(s), go to the Pfam page for each
domain, and look at the "Domain organisation". For example: for the REJ
domain you would click on the "View architectures for 16 proteins"
button and then click "View Graphic".
You will now see some nice pictures showing the domain organisation
of all the proteins which contain "your domain".
- Roughly how many different domain architectures are there?
- If there are proteins with domain architectures that are
different from "your protein", briefly look at the function of these
proteins. (Click on the links to get a UniProt or Swiss-Prot summary
page.) Could these other functions help you propose a function for your protein?
Note: the Pfam domain architecture viewer is very interesting
to look at, so follow the REJ link above, even if your protein has no
Pfam domains.
PSI-BLAST
Domain family assignments (such as those in Pfam) are made using
powerful sequence database search tools. However, the assignments may
be a number of months old since they cannot be updated every day or
week. Because new sequences enter the databases all the time, it is
possible that you might find some useful information by performing a
powerful sequence database search against the most up-to-date sequence
databases. Here we will use PSI-BLAST.
Submit the whole sequence to the PSI-BLAST service at
NCBI. Search against the default "NR" database.
Note that you have to watch two browser windows at
the same time when running PSI-BLAST iterations. Run as many
iterations as you can, but at least 3 or 4.
Note that the BLAST server does a "CD-Search" for conserved domains
(CDs). If you get CD hits,
have a quick look at them and answer the following question:
- Do the CDs agree with the domain assignments cross-referenced in UniProt,
or are they different?
Now complete some iterations of PSI-BLAST, and answer as many of the
following questions as possible/suitable.
- What is the default "inclusion threshold" used by NCBI's PSI-BLAST web service?
Make sure you understand what this means in practical terms.
- How many iterations did you perform? When did you stop seeing "new hits"?
- Did you check/uncheck any boxes for "inclusion in the next PSI-BLAST iteration"?
What are the advantages/disadvantages of checking extra boxes?
- How many homologues of your protein exist in the database?
- Which species do they come from (taxonomy report)?
- If you don't already know if your protein is multi-domain - does
the graphical "Distribution of XX Blast Hits on the Query Sequence" suggest
possible domain boundaries?
- Look at the short descriptions of the database hits.
Do they tell you anything new about the function of your protein?
- Are any of the homologous sequences from the PDB (protein
structure) database? These are marked with a red square containing an "S".
Make a note of these for the next section of this practical.
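If you want to keep track of when you stop seeing "new hits", the bookkeeping can be sketched in a few lines of Python. This is only an illustration: the hit names and E-values are made up, the threshold is passed in as a parameter (it is not the NCBI default, which you still need to look up), and the function names are our own rather than part of any BLAST software.

```python
def included_hits(hits, inclusion_threshold):
    """Return the IDs of hits whose E-value is at or below the threshold."""
    return {hit_id for hit_id, evalue in hits.items()
            if evalue <= inclusion_threshold}


def first_converged_iteration(iterations, inclusion_threshold):
    """Return the 1-based iteration in which no new hits passed the
    threshold, or None if every iteration added something new."""
    seen = set()
    for number, hits in enumerate(iterations, start=1):
        current = included_hits(hits, inclusion_threshold)
        new_hits = current - seen
        if number > 1 and not new_hits:
            return number
        seen |= current
    return None


if __name__ == "__main__":
    # Made-up E-values for three iterations (hit ID -> E-value).
    example_iterations = [
        {"hitA": 1e-30, "hitB": 0.5},
        {"hitA": 1e-40, "hitB": 1e-4, "hitC": 2.0},
        {"hitA": 1e-45, "hitB": 1e-8, "hitC": 3.0},
    ]
    # 0.01 is an arbitrary illustrative threshold.
    print(first_converged_iteration(example_iterations, 0.01))
```

In this toy data, "hitB" only passes the threshold in the second iteration, which is the kind of hit that a single BLAST search would have missed.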
Structure prediction
If you have already found a homologous sequence with a known structure
using the tools above, you could build a homology
model for the aligned parts of your query protein. We will not do
that in this practical though. Instead we will have a quick look at fold
recognition methods.
Extract the regions/domains of unknown structure
First you need to find out which parts of your sequence cannot
be aligned to a known structure. That is, which parts of the sequence
are not assigned to domains of known structure, or were not aligned to
a protein of known structure by PSI-BLAST. If there is no
"structurally unknown" region in your sequence or if the region is
shorter than about 50 residues, then there's no real point in doing
fold recognition on your sequence. If this happens, you may use the
following sequence instead: Y237_MYCPN.
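Finding the "structurally unknown" regions is easy to do by hand, but if you prefer, the following Python sketch shows the idea. It assumes you have written down the residue ranges (1-based, inclusive) that are covered by a domain of known structure or by a PSI-BLAST alignment to a PDB sequence. The function names, the sequence length of 300 and the ranges used are only for illustration; the ranges are borrowed from the made-up Pfam example earlier on this page.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) ranges."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_regions(sequence_length, covered, min_length=50):
    """Return the (start, end) ranges not covered by any interval,
    keeping only regions of at least min_length residues."""
    regions = []
    position = 1  # first residue not yet accounted for
    for start, end in merge_intervals(covered):
        if start - position >= min_length:
            regions.append((position, start - 1))
        position = max(position, end + 1)
    if sequence_length - position + 1 >= min_length:
        regions.append((position, sequence_length))
    return regions


if __name__ == "__main__":
    # Hypothetical 300-residue protein with two covered ranges.
    covered = [(1, 95), (110, 183)]
    print(uncovered_regions(300, covered))
    # To cut a region out of a sequence string `seq`, use seq[start - 1:end].
```

With these numbers the script prints [(184, 300)]: the 14-residue linker between the two domains is dropped because it is shorter than 50 residues, and only the C-terminal region is worth submitting.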
Submit sequences to fold recognition servers
You may submit your sequence regions to any fold recognition server on
the web, but we recommend that you first try the 3D-PSSM server.
If you have spare time, you could also try the "server of servers", Meta-Server, which combines the
results from many servers, including 3D-PSSM. However, the Meta-Server
can be a little confusing.
The fold recognition process may take 30 minutes or more! If it is
late, you may want to go home and check the results later.
If you are still here, you may want to read the 3D-PSSM help pages.
When you have some results to look at, please answer the following
questions:
- Did 3D-PSSM or any of the servers give very confident hits to
a known structure?
- If you have a confident hit, the alignment between your protein
and the known structure can be used to build an approximate 3D model.
(Don't build one, just imagine...) Assuming that your protein has
some catalytic activity, how could you use this model to suggest
a possible molecular mechanism for catalysis?
Handing in the report
Please send an electronic version of your report to both Johannes and
Bob. Please use the subject line: "D-course practical from YourName".
It would be much appreciated if you sent it in a non-Microsoft format
such as PDF, HTML or plain text. MS Word format will be accepted, but
we will use OpenOffice to read it, so anything could happen to the
formatting...
Tutorial question
Please think about the following question, write a few sentences in
answer, and email them to Bob before the deadline. Use the subject
line: "D-course tutorial from YourName".
In the area of problem solving and algorithms, a "heuristic" is a
knowledge-based trick or short-cut to get to the solution (or close to
it) quicker. For example, imagine you are asked to find the quickest
walking route from "place A" to "place B" in a city. You are given 2
hours to time yourself walking as many different routes as possible.
You are not allowed to measure distances on a map. In this example, a
natural (and obvious) heuristic is to avoid walking down any dead-end
streets.
In this web tutorial on sequence database searching, two "pathological
examples" are given for situations where the FASTA heuristics would
fail. Can you think of another pathological example where
two clearly related DNA sequences of roughly the same length would also
fail to give a high-scoring alignment? (Hint: the number 3.)
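To help you think about this, here is a rough Python sketch of the first stage of a FASTA-style search: finding short identical words ("k-tuples") shared by two sequences and counting them along each diagonal. It is a simplified illustration only, not the real FASTA code; the later rescoring and alignment stages are left out, and the function name and the toy sequences are our own.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(query, subject, ktup):
    """Count identical words of length ktup shared by two sequences,
    grouped by diagonal (query position minus subject position)."""
    # Lookup table: word -> all start positions of that word in the query.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        word = subject[j:j + ktup]
        for i in positions.get(word, ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    # Toy sequences; a real DNA search would typically use a larger ktup.
    counts = ktuple_diagonal_counts("ACGTACGT", "ACGTACGT", ktup=4)
    print(counts.most_common(3))
```

Diagonals with many shared words are the ones the heuristic goes on to examine. If two sequences share no identical word of length ktup at all, this stage finds nothing to follow up, however similar the sequences may look by eye.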
Any questions?
Please contact Bob or Johannes by email if you need to discuss anything
in more detail. Good luck with the practical report and tutorial question!