Biochemistry and Biophysics D-course practical in Bioinformatics

Today you are going to use bioinformatics tools to discover functional and structural information about some proteins. It's possible that you may find something nobody has seen before, so this is not necessarily another "pointless practical exercise".

Johannes is there to help you - so ask him lots of questions!

Choose a protein to work with

Here are the sequences:
  1. Q9K6A4
  2. Q9X256
  3. P23616
  4. Q9YFP5
  5. Q46828
  6. Q9X261
Divide up the sequences between you so that no more than two people work on the same sequence. You should work individually, but you may also discuss your protein and share information with your "partner".

Start your report

You are asked to prepare a 2-3 page report on the function and structure of your protein. Please include answers to all the bulleted questions on this page, and a summary of further information you discover.

It would be a good idea to create a rough version of your report now, and email it to yourself to finish later.

What to do...

What is already known about the protein?

The links above take you to a UniProt entry. Now look in detail at the UniProt entry and answer the following questions:

What do we want to know?

Different research projects raise different questions about protein structure and function. The two main scenarios are:
  1. You have a hypothetical protein and want to know anything, however vague or approximate, about its possible function.
  2. You are working on a protein involved in a certain process/pathway, but want to know more details about its mode of action.
Question:

Does the protein contain any known domain families?

In the "Cross-references" section of the UniProt page, read the descriptions of each cross-referenced database (click on the left column). Find out which of the databases are concerned with "protein/domain families". If your protein has Pfam domain(s), go to the Pfam page for each domain, and look at the "Domain organisation". For example: for the REJ domain you would click on the "View architectures for 16 proteins" button and then click "View Graphic".

You will now see some nice pictures showing the domain organisation of all the proteins which contain "your domain".

Note: the Pfam domain architecture viewer is very interesting to look at, so follow the REJ link above, even if your protein has no Pfam domains.

PSI-BLAST

Domain family assignments (such as those in Pfam) are made using powerful sequence database search tools. However, the assignments may be a number of months old since they cannot be updated every day or week. Because new sequences enter the databases all the time, it is possible that you might find some useful information by performing a powerful sequence database search against the most up-to-date sequence databases. Here we will use PSI-BLAST.

Submit the whole sequence to the PSI-BLAST service at NCBI. Search against the default "NR" database. Note that you have to watch two browser windows at the same time when running PSI-BLAST iterations. Run as many iterations as you can, but at least 3 or 4.

Note that the BLAST server does a "CD-Search" for conserved domain sequences. If you get CD hits (explanation), have a quick look at them and answer the following question:

Now complete some iterations of PSI-BLAST, and answer as many of the following questions as possible/suitable.
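
When comparing the results of successive PSI-BLAST iterations, it can help to list which hits are new in each round. The following is only a minimal sketch, assuming you have copied the hit identifiers from each iteration into a list by hand; the function names and the identifiers "hitA", "hitB" and "hitC" are hypothetical placeholders and not real database entries. A search is commonly regarded as converged when an iteration brings in no new sequences.

```python
def new_hits_per_iteration(iterations):
    """For each iteration, return the hit identifiers not seen in any earlier one.

    `iterations` is a list of lists of identifiers, one list per PSI-BLAST
    iteration, in the order the iterations were run.
    """
    seen = set()
    report = []
    for hits in iterations:
        fresh = [h for h in hits if h not in seen]
        report.append(fresh)
        seen.update(hits)
    return report


def has_converged(iterations):
    """True if at least two iterations were run and the last one added nothing new."""
    report = new_hits_per_iteration(iterations)
    return len(report) > 1 and not report[-1]


# Hypothetical placeholder identifiers, for illustration only.
example = [
    ["hitA", "hitB"],
    ["hitA", "hitB", "hitC"],
    ["hitB", "hitA", "hitC"],
]
```

With the placeholder data, the second iteration adds one new hit and the third adds none, so `has_converged(example)` is true.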

Structure prediction

If you have already found a homologous sequence with a known structure using the tools above, you could build a homology model for the aligned parts of your query protein. We will not do that in this practical though. Instead we will have a quick look at fold recognition methods.

Extract the regions/domains of unknown structure

First you need to find out which parts of your sequence cannot be aligned to a known structure. That is, which parts of the sequence are not assigned to domains of known structure, or were not aligned to a protein of known structure by PSI-BLAST. If there is no "structurally unknown" region in your sequence, or if the region is shorter than about 50 residues, then there's no real point in doing fold recognition on your sequence. If this happens, you may use the following sequence instead: Y237_MYCPN.
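
The bookkeeping described above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the practical: the function name `unaligned_regions` is hypothetical, and it assumes you have noted by hand the start and end residue numbers (1-based, inclusive) of every region that was matched to a domain of known structure or aligned to a protein of known structure. The 50-residue cut-off is the approximate threshold mentioned above.

```python
def unaligned_regions(seq_length, aligned, min_length=50):
    """Return (start, end) regions not covered by any aligned interval.

    Coordinates are 1-based and inclusive. `aligned` is a list of
    (start, end) pairs, which may overlap and be in any order. Regions
    shorter than `min_length` residues are dropped.
    """
    regions = []
    next_free = 1  # first residue not yet known to be covered
    for start, end in sorted(aligned):
        if start > next_free:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= seq_length:
        regions.append((next_free, seq_length))
    return [(s, e) for s, e in regions if e - s + 1 >= min_length]


def extract(sequence, region):
    """Cut a 1-based inclusive (start, end) region out of a sequence string."""
    start, end = region
    return sequence[start - 1:end]
```

For example, for a 300-residue protein with aligned regions 40-120 and 100-180, `unaligned_regions(300, [(40, 120), (100, 180)])` returns only residues 181-300, because the leading stretch 1-39 is below the cut-off.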

Submit sequences to fold recognition servers

You may submit your sequence regions to any fold recognition server on the web, but we recommend that you first try the 3D-PSSM server. If you have spare time, you could also try the "server of servers", Meta-Server, which combines the results from many servers, including 3D-PSSM. However, the meta-server can be a little confusing.

The fold recognition process may take 30 minutes or more! If it is late, you may want to go home and check the results later.

If you are still here, you may want to read the 3D-PSSM help pages. When you have some results to look at, please answer the following questions:

Handing in the report

Please send an electronic version of your report to both Johannes and Bob. Please use the subject line: "D-course practical from YourName". It would be much appreciated if you send it in a non-Microsoft format like PDF, HTML or plain text. MS Word format will be accepted, but we will use OpenOffice to read it so anything could happen to the formatting...

Tutorial question

Please think about the following, write a few sentences on each, and email them to Bob before the deadline. Use the subject line: "D-course tutorial from YourName".

In the area of problem solving and algorithms, a "heuristic" is a knowledge-based trick or short-cut to get to the solution (or close to it) quicker. For example, imagine you are asked to find the quickest walking route from "place A" to "place B" in a city. You are given 2 hours to time yourself walking as many different routes as possible. You are not allowed to measure distances on a map. In this example, a natural (and obvious) heuristic is to avoid walking down any dead-end streets.

In this web tutorial on sequence database searching, two "pathological examples" are given for situations where the FASTA heuristics would fail. Can you think of another pathological example where two clearly related DNA sequences of roughly the same length would also fail to give a high-scoring alignment? (Hint: the number 3.)
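
If you want to experiment with your own candidate sequences, the word-matching stage that FASTA-style heuristics rely on can be sketched as below. This is a simplified illustration and not the FASTA program itself: it only counts identical words of length ktup shared by two sequences and records which diagonal (offset between the two positions) each match falls on. The function name and the default ktup value of 6 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_hits(seq_a, seq_b, ktup=6):
    """Count identical ktup-length words shared by two sequences, per diagonal.

    The diagonal of a match is i - j, where i and j are the 0-based start
    positions of the word in seq_a and seq_b. Many hits on one diagonal
    suggest a region worth aligning; no hits means this stage finds nothing.
    """
    # Index every word in seq_a by its start positions.
    positions = defaultdict(list)
    for i in range(len(seq_a) - ktup + 1):
        positions[seq_a[i:i + ktup]].append(i)

    # Look up every word of seq_b and tally the diagonal of each match.
    hits = Counter()
    for j in range(len(seq_b) - ktup + 1):
        for i in positions.get(seq_b[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


# A made-up DNA string, for illustration only.
demo = "ACGTTGCAAGCTTAGGCT"
```

Comparing `demo` with itself puts one hit per word on diagonal 0, and prefixing the second copy with two extra bases moves those hits to diagonal -2. Try your own pathological pair and see how many hits survive.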

Any questions?

Please contact Bob or Johannes by email if you need to discuss anything in more detail. Good luck with the practical report and tutorial question!