Question

How is this blastresult possible?

0

Entering edit mode

9.6 years ago

ronaldnieuwenhuis ▴ 30

Hello guys,

I am a bioinformatics student and I am having metagenomic data (contigs) of the content of a fermentation plant. This means there is possible plant, animal and bacterial DNA material in it. I am only interested in some enzyme produced by methanogenic bacteria. So that is why I try to filter my metagenomic data with a self- constructed db with the genome of z. mays, s. tuberosum, b. subtilis and r.norvegicus. For this purpose I wanted to use a nucleotide megablast. To figure out which settings I needed to use to get the cleanest filtering I made a query that consisted of: some random plant genes, some random mammalian animal genes and some subunits gene coding for a methanogenesis pathway enzyme.

This is my command:

rnieuwenhuis@bin211:~$ megablast \
  -i /homes/rnieuwenhuis/Documents/Thema09/Project9/Tofindrightsettings.fa \
  -d /media/usb1/Data2/Thema09_fliterdb \
  -o test5 \
  -m 9 \
  -a 4 \
  -W 24 \
  -e 1

Then I get this output:

# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c753499-752768 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|428205073:c5244719-5241819 Chroococcidiopsis thermalis PCC 7203 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|45357563:1512964-1514625 Methanococcus maripaludis S2 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|240256243:1440146-1441863 Arabidopsis thaliana chromosome 4, complete sequence
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|240256243:6100743-6101708 Arabidopsis thaliana chromosome 4, complete sequence
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gi|240256243:6100743-6101708    gi|564111623|ref|NW_006239484.1|    100.00    25    0    0    835    859    288651    288627    0.005    50.0
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c753499-752768 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c752768-751938 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c751588-750875 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gi|312136230:c751588-750875    gi|109471397|ref|NW_001084817.1|    96.67    30    1    0    502    531    12407878    12407907    8e-04    52.0
gi|312136230:c751588-750875    gi|109506243|ref|NW_001084730.1|    100.00    25    0    0    502    526    17031349    17031325    0.003    50.0

There are 4 Methanothermus fervidus methyltransferase subunit loci in it. And 2 plant genes I found by searching kegg for a common enzyme of plants.

My result show that, whatever setting I use, the subunit gene always matches way better than any of the other genes. It says that the complete subunit is in the genome of the r.norvegicus. How is this possible? Does a rat have some enzyme which could use exact the same subunit?

(p.s. I know a gene is not the same as an enzyme and that I probrably used some wrong terms, feel free to correct me, I want to learn :) )

result megablast • 2.7k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by ronaldnieuwenhuis ▴ 30

0

Entering edit mode

I would start solving this problem by searching enzyme sequence with online ncbi blast. You'll get approximate results what you should be looking in your db.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by PoGibas 5.1k

0

Entering edit mode

It is impossible to answer your question because of the lack of detail and obvious errors in your description. It is very likely that your pipeline is flawed and all wrong. In fact I do not think you need this subtractive filtering at all. What is your motivation to include the organisms you included? Are sequences from these organisms potentially contained in the sample? If not, why would you filter for them on the DNA level? what happens on the DNA level is mostly irrelevant and maybe an artifact for any comparison of distant sequences.

Methanogens are not bacteria but archaea. Furthermore, to answer the question we need exact input and output of your query, in particular:

blast program name (megablast on protein or dna data?? megablast implies nucleotide query and database)
ALL blast parameters
exact query sequence, you indicate a single "subunit" gene, what do you mean: look at the Methanogenesis pathway in KEGG: there are many such genes: MtaA-MtaC, MfbA-C, MtrA, and possibly more
blast database
exact output, how do your alignments look like, are they short or long? Which genome do they hit?

In fact I guess that you are comparing distant gene sequences on the DNA level which is irrelevant and gives rise to artifacts.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Michael 54k

0

Entering edit mode

Thank you sir for your comment.

The reason the question is a bit briefly because I expected it to be enough to answer, so my apologies.

I cannot provide all the information you asked for. (I will modify my question tomorrow with the requested information, I cannot reach it right now)

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by ronaldnieuwenhuis ▴ 30

1

Entering edit mode

It would be best to cast a wide net including all plant, animal and bacterial genes. The hits of a rat protein against a bacterial protein are probably a cause of this, it isn't impossible for a rat to have a sequence similar to a bacterial one. The reason you're getting hits against bacteria is because you've only included bacteria (and a very small number at that) in your database. BLAST is finding the best possible hit for your sequence, which happens to place a rat gene against a bacterial gene.

This is going to very strongly bias your results, just because you're interested in a specific set of species, doesn't mean that you can ignore everything else. A good scientist finds the best possible explanation given all possible explanations, not the best one given what he or she is interested in. This approach is massively inappropriate, as you can see, limiting your species is drastically biasing your results. Contigs from massively distant species could be seen as hits because BLAST is only designed to find the best possible hit given the search space provided. Yes there are metrics to filter hits, but in the rare occasions of highly conserved genes, you might incorrectly map things.

Since you have reason to believe that there are species from such a very diverse set, you need to include them. You also need to remember that there are multiple species of Rat, Mouse, as well as numerous other rodent genera that could end up in the fermenter either at the point of harvest, transport, or within the plant. What about plant, animal and bacterial viruses/phages? What about fungi? Do birds defecate on grain piles or on anything that ends up in the fermentation? What type of animals graze on the grains? How is the grain collected, by machinery or manual labor? Is there a chance for human urine or feces to contaminate the grain at any point? What about insects?

Unless you have a reason to block species (e.g. the plant adds fungicidal compounds to the grain), you must include them if you want to do good science. Is there previous 16S sequencing that allows you to narrow bacterial species present? Do you have access to records from the pest management at the plant that show you what species of rodent are most common? Even then, this sort of information should be used to include additional species to the mix, if they're trying to kill species/etc, it means there's a real chance of that species/etc being present.

The best way to start with this, and with any large dataset, is to filter out the crap you're not interested in, not try and map everything to the things you're interested in. A basic solution would be to BLAST against the nr/nt databases to assign all of your contigs a taxonomical placement. Then you can reasonably throw things out to focus on the data you're interested in, although since you used the term "metagenome", you need to analyze everything.

I never get why people try to narrow so much to start with, metagenomic datasets are the bioinformatics equivalent of Indiana Jones, you can dig and hunt and look for all of the novel/oddball stuff no one has ever found before! Besides, finding that kind of stuff can mean big impact factors, so why not get as much as you can and only filter as you need to?

Who knows, maybe the gut bacteria from the mice chopped up in the combines are critical to that smooth oaky finish to my bourbon! :)

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by pld 5.1k

0

Entering edit mode

Thank you for your reply sir,

I was clearly having a wrong idea about 'filtering' my data and I will certainly think more and better about my future plan to not oversimplify them.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.5 years ago by ronaldnieuwenhuis ▴ 30

Ram · Accepted Answer · 2014-10-08

So, you are asking, why am I getting these alignments at the nucleotide level, using blastn (megablast) to compare an archeal DNA sequence (query) to the rat DNA sequence (database), and why do I not get a similar hit for other DNA.

# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gi|312136230:c751588-750875    gi|109471397|ref|NW_001084817.1|    96.67    30    1    0    502    531    12407878    12407907    8e-04    52.0
gi|312136230:c751588-750875    gi|109506243|ref|NW_001084730.1|    100.00    25    0    0    502    526    17031349    17031325    0.003    50.0

The answer is that these are random alignments. Have a look at the highlighted values, alignment lengths are very short, 30 and 25, e-values very high. It is possible for these short stretches of DNA to occur just by chance in a database of given size. Because this result is due to random sequence composition, you cannot conclude anything from the absence of hits for other query sequences.

Comparisons of distant sequences should be carried out on the AA-level using eg. BlastX or TBlastX, because DNA sequences mutate too quickly. If you are interested in only archeal sequences it would also be possible to use a different kind of nucleotide blast database to allocate your contigs on the DNA level. You might be interested in this article and related articles, where we have done something similar to what you are trying.