How To Do Contig Analysis
11.1 years ago
User 3822 ▴ 60

Hi,

I have a couple of contigs, and I would like to answer questions of the following sort

a.) Are they unique?

b.) Which organism do they come from?

c.) Do they come from a protein encoding gene?

d.) What is that protein's function etc...

I am very new to Bioinformatics (even Bio for that matter) so any help would be appreciated. Even links to further readings. Thanks.

contigs dna analysis • 8.7k views

How can you have a contig with no idea of what organism it's from? Did you find it in the street, or something? :-) And what do you mean by unique?


What do you mean by "contig"? Just a sequence of DNA? Which organism do they come from?


Yes, just a sequence of DNA. Which organism they come from is another question that has to be answered.


I was told that the contig sequencing technology is going to be discontinued soon. I don't understand how you can have a contig sequence without knowing where it comes from and how it was produced. In any case, keep in mind that contigs do not necessarily correspond to coding sequences: a contig can contain more than one gene, or none.

11.1 years ago

As suggested by Pierre, BLAST will pretty much answer all of your questions. You can use the LinkOut Links (usually on the right, use CTRL+F to find them) on the NCBI webpage to get additional information.

You should note one potential pitfall though: to search properly for proteins, your contigs should come from mRNA sequencing, because eukaryotic genomic DNA is likely to contain introns (which you can't make sense of that easily). If you're searching with genomic sequences, finding the right BLAST parameters can be a bit tricky.

In most cases you will know a bit about your sequence, which will help you choose the right searches. If you're going in totally blind, a BLAST against NCBI's nr database (via blastx, since nr holds protein sequences) is a good place to start, because most likely your sequence is not entirely new or unknown.

What you can try as well:

  1. Try to find an ORF: a protein-coding mRNA will have a possible translation running from nearly the beginning to nearly the end. Polyadenylation signals (in eukaryotes) might provide additional help. (There are 3 stop codons; think about how likely it is for a given DNA sequence of length N to contain none by pure chance.)
  2. Use NCBI's Conserved Domain Search to classify the protein
  3. There are many specialized databases to find information as soon as you know what type your sequence is; for instance, there is BRENDA for enzymes or RDP for rRNA.
  4. If it is a protein and you are interested in structure, you might want to take a look at the PDB.
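
To make the ORF idea in point 1 concrete, here is a rough Python sketch. It is only an illustration, not a replacement for a real ORF finder: the function names are made up for this example, it does not require a start codon, and the probability estimate assumes all four bases are equally likely and independent, which real genomes are not.

```python
# Minimal sketch: longest stop-free stretch over the six reading frames.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_open_stretch(seq):
    """Return (length_in_codons, strand, frame) for the longest run of
    consecutive codons without a stop codon, over all six frames."""
    best = (0, "+", 0)
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run = 0
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    run = 0  # a stop codon ends the current stretch
                else:
                    run += 1
                    if run > best[0]:
                        best = (run, strand, frame)
    return best


def prob_no_stop(n_codons):
    """Chance that n random codons contain no stop codon, assuming
    uniform, independent bases (3 of the 64 codons are stops)."""
    return (61 / 64) ** n_codons


print(longest_open_stretch("ATGAAATTTGGGTAA"))
print(prob_no_stop(100))
```

Under that simplified model, 100 codons in a row without a stop happen by chance less than 1% of the time, which is why a long open stretch in one frame is a hint (not proof) of a coding sequence.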

Sorry if this was too basic, but you mentioned you were new to biology ;-)


Thank you very much. Yes, that's quite a lot in fact :).


No problem :-) Please don't forget to vote up answers that are helpful for you and accept an answer by clicking on the check mark.

11.1 years ago

What about starting with a simple NCBI BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) using the blastx algorithm ("useful for finding similar proteins to those encoded by a nucleotide query")? You can later use the "Taxonomy reports" to get a list of the organisms found among your hits.


Thanks that helps. Anything other than BLAST you would suggest? Is there some sort of a resource that tells you what all tools you could use for a particular type of query?


What about describing your "particular type of query" here ? :-)


There are loads of them actually, e.g. Does the gene encode a protein, what's the protein's function. Basically have to find out all the information I can about the contig(s).


A contig usually corresponds to a region of the genome of a sequenced organism. It may contain many genes or none at all, so a blastx query may return weird results.

11.1 years ago

A contig is a DNA sequence corresponding to a region of a genome, and it is an intermediate state in the sequencing and assembly of a genome.

a.) Are they unique?

This question is not clear. Unique with respect to what, to other known contigs? Consider that contigs are an intermediate product of the assembly of a genome, and not all the contigs are deposited into public databases.

The best thing you can do is a nucleotide blast of your sequence against the nr/nt database from NCBI. Use this link and select it from the 'Databases' menu.

b.) Which organism do they come from?

This is a really difficult question. It is very strange that you have a contig and don't know which species it comes from. You can do a BLAST against nr/nt or RefSeq/genomic and see in which species you get the best score and E-value. Consider that since we only have a few genomes sequenced for each species, we don't know enough about intraspecific variability, and moreover your contig could belong to a species that has not been sequenced. You will only be able to make a guess, and you will need to look carefully at the alignments and at the E-values.
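
If you run BLAST+ locally and save tabular output (`-outfmt 6`), picking the best hit per contig could look roughly like the sketch below. The column order is the default for that format; the function name, the E-value cutoff, and the example rows are made up for illustration, and the subject IDs still have to be mapped to species names separately.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: row} with the highest-bitscore hit per query,
    skipping hits whose E-value is above max_evalue (arbitrary cutoff)."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank and comment lines
        row = dict(zip(COLUMNS, fields))
        if float(row["evalue"]) > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Invented example rows, only to show the expected input shape.
example = io.StringIO(
    "contig1\tsubjA\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-90\t350\n"
    "contig1\tsubjB\t99.0\t200\t2\t0\t1\t200\t9\t208\t1e-95\t365\n"
    "contig2\tsubjC\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n"
)
for query, hit in best_hits(example).items():
    print(query, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

A best hit is only a guess at the source organism, so it is still worth looking at the alignments themselves and at how close the second-best hits are.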

c.) Do they come from a protein encoding gene?

Contigs do not come from the sequencing of a gene. They correspond to genomic regions and are variable in size and contents. A contig could contain many genes or none. You can do a blast against RefSeq transcripts, but it won't be an optimal solution because mRNAs don't align well on genomic regions.

d.) What is that protein's function etc...

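The same as above: once you know whether the contig contains a gene, the hits for that gene are where the functional information comes from.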

10.7 years ago
Ketil 4.1k

The question is a bit vague, so I'll have to make a load of assumptions.

If your contigs are from a genome sequencing project, perhaps you have raw reads available as well? If so, mapping these reads might help you determine whether your contigs are from a unique region (coverage near the expected level), from a repeated region (very high coverage), or from some random contamination (low coverage). This is not absolute, of course, and you have to be careful with things like duplicate clones, which can artificially inflate coverage, especially if you are sloppy in the lab. If you have multiple independent libraries, something that only exists in a few of them is likely contamination.
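
As a rough sketch of that coverage check: assuming you already have per-base read depth for each contig (for example, exported from your read mapper), something like the following could label the contigs. The function name and the thresholds are arbitrary choices for illustration, and taking the median contig as the "expected" depth is an assumption that only holds if most contigs are from unique regions.

```python
from statistics import mean, median


def classify_contigs(depths, low=0.25, high=2.0):
    """Label each contig by its mean depth relative to the median contig.

    depths: dict mapping contig name -> list of per-base read depths.
    low/high: ratio cutoffs (illustrative values, tune for your data).
    """
    means = {name: mean(d) for name, d in depths.items()}
    expected = median(means.values())  # assumed "normal" single-copy depth
    labels = {}
    for name, m in means.items():
        ratio = m / expected
        if ratio < low:
            labels[name] = "low coverage (possible contamination)"
        elif ratio > high:
            labels[name] = "high coverage (possible repeat)"
        else:
            labels[name] = "near expected (likely unique)"
    return labels


# Invented depths, only to show the input shape.
example = {
    "c1": [30, 32, 28],
    "c2": [29, 31, 30],
    "c3": [300, 310, 290],
    "c4": [2, 3, 1],
    "c5": [31, 30, 29],
}
for name, label in classify_contigs(example).items():
    print(name, label)
```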

Another thing I've looked for in our project is contigs that appear to be circular (having reads that map partially to both ends). The jury is still out on the usefulness of this (my one glaring example appears to be a misassembled contig).

AT/GC content might also be useful for identifying contamination, depending on the length of your contigs and the ratio in your organism.
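
Computing GC content per contig is simple enough to do by hand. A minimal sketch (the function name is just for this example; ambiguous bases such as N are ignored):

```python
def gc_content(seq):
    """Return the fraction of G and C among the A/C/G/T bases in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / counted


print(gc_content("ATGCGCNNAT"))
```

Contigs whose GC fraction sits far from the bulk of the assembly are candidates for a closer look, though short contigs will vary a lot by chance.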

16S and 18S rRNA sequences can classify contigs as prokaryotic or eukaryotic.
