Question

Unexpected blastx results

0

3.4 years ago

langziv ▴ 70

Hi.

I ran blastx with 10 results for each sequence. I noticed 2 issues with the results:

1) For most input sequences, all 10 results were for the same gene/protein, unlike blastn, where every input sequence got several different results. For instance, one input sequence was roughly 48.5k base pairs long, yet all 10 results were for the same gene, which was only 377 base pairs long. It's possible for such a long sequence to contain only one short gene, but this happens with every input sequence. Would it be unreasonable to expect more genes in at least some of these sequences?

2) The other issue is that the results for the sequences show that the input sequences are much shorter than they are (also unlike in blastn results). For instance, one input sequence was roughly 21.5k base pairs. The results indicated that it was only 4655 base pairs.

When considering these two issues together, it seems that blastx aligns only a small part of each sequence. I tried increasing the number of results but got a similar output.

Any insights from your experience will be welcomed.

Here's a sample of the output: [image: a sample of the output]

blast blastx • 1.4k views

ADD COMMENT • link 3.0 years ago by langziv ▴ 70

score 1 · Answer 1 · 2021-06-09

1

3.4 years ago

lieven.sterck 15k

It would be helpful to post an extract of your blast result.

Keep in mind that blastx aligns at the protein level: the DNA query is translated in all six reading frames and compared against protein sequences. As a result, in most cases you will see several smaller aligned segments, since the protein is likely split into 'exons' on the DNA sequence, and those exons can also fall in different reading frames.

Barring exceptions, it is thus unlikely that you will find a single contiguous stretch of alignment.

Your comparison to blastn is somewhat skewed: there it will indeed report one long stretch, as the DNA sequence is not split up into 'exons/introns' and there is no notion of reading frames.

ADD COMMENT • link 3.4 years ago by lieven.sterck 15k

0

So you think that expecting to get multiple results per sequence is reasonable?

ADD REPLY • link 3.4 years ago by langziv ▴ 70

0

The hits are to bacterial sequence data, so introns are not likely a contributor. The snippet above could be a reasonable result, but you will need to examine the alignments by eye to see if that is so. Since you have a bunch of similar sequences in your reference database (all appear to be Citrobacter), it is no wonder they are showing up in your result.

ADD REPLY • link 3.4 years ago by GenoMax 146k

0

OK, yeah, for bacterial sequences the situation is indeed slightly different (all single-exon genes).

However, if you take as input a stretch of DNA that has more than one gene on it, you will still get more than a single hit back (genes in different frames).

ADD REPLY • link 3.4 years ago by lieven.sterck 15k

0

Depends on the settings (cf. below).

Some crucial info I forgot to ask for: what is the input sequence (species, for instance)? What do you have in the blastDB?

ADD REPLY • link 3.4 years ago by lieven.sterck 15k

0

I'm studying a Myxozoa species, and I ran its sequences against the blastn and blastx databases on the NCBI website.

ADD REPLY • link 3.4 years ago by langziv ▴ 70

0

Hi, I am working with RNA-seq data of viruses mixed with host genome. I used blastx to search de novo assembled contigs against a viral BLAST database (downloaded from NCBI and built with makeblastdb). In the blastx results, each contig has several hits to different viruses, and the percent identity varies from 30% to 90%. How can I separate viral contigs from host contigs, and how can I select the best hit for each contig? Thanks!

ADD REPLY • link 3.0 years ago by Yi • 0
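
One possible way to pick a best hit per contig, sketched under assumptions: the results were saved in BLAST's default tabular format (`-outfmt 6`, whose 12 default columns are listed in the code), "best" means highest bit score, and hits above an E-value cutoff are ignored. The function name `best_hits`, the `1e-5` cutoff, and the example rows are all hypothetical and only illustrate the parsing. Contigs with no hit passing the cutoff simply do not appear in the output, which by itself does not prove they come from the host.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Keep, for each query, the hit with the highest bit score.

    max_evalue is an illustrative cutoff, not a recommended value.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # too weak to consider
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up rows, for illustration only.
example = (
    "contig1\tvirusA\t90.0\t200\t20\t0\t1\t600\t1\t200\t1e-50\t350\n"
    "contig1\tvirusB\t35.0\t150\t97\t1\t10\t459\t5\t154\t1e-8\t60.5\n"
    "contig2\tvirusC\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25.0\n"
)
hits = best_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

With a real results file, the same function could be called as `best_hits(open("results.tsv"))`, where the file name is a placeholder.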

0

I used blastn before using blastx. I set an identity percentage threshold of 80% for filtering contamination: if the identity of the contamination was 80% or higher, I removed it from the FASTA file and created new sub-scaffolds from the remaining sequences. Then I ran blastn again, which yielded new results now that the previous contamination was removed. I filtered with the same threshold again and continued this cycle until no contamination was found. The program was written in Python. Good luck.
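
A rough sketch of one round of that cycle, not the actual program described above: given a contig and its hits as `(qstart, qend, pident)` tuples with 1-based inclusive coordinates (as in BLAST tabular output), it cuts out every region at or above the identity threshold and returns the remaining pieces as sub-scaffolds. The function name `split_around_hits`, the `min_length` parameter, and the toy data are assumptions for illustration.

```python
def split_around_hits(seq, hits, min_identity=80.0, min_length=1):
    """Remove hit regions with pident >= min_identity; return what is left.

    hits: iterable of (qstart, qend, pident), 1-based inclusive coordinates.
    """
    # Normalise orientation (qstart may exceed qend) and sort by start.
    intervals = sorted(
        (min(start, end), max(start, end))
        for start, end, pident in hits
        if pident >= min_identity
    )
    pieces = []
    cursor = 0  # 0-based index of the first base not yet handled
    for start, end in intervals:
        if start - 1 > cursor:
            pieces.append(seq[cursor:start - 1])  # clean stretch before the hit
        cursor = max(cursor, end)  # skip the contaminated region
    if cursor < len(seq):
        pieces.append(seq[cursor:])  # clean tail after the last hit
    return [piece for piece in pieces if len(piece) >= min_length]


# Toy contig: only the first hit reaches the 80% threshold.
contig = "AAAAACCCCCGGGGGTTTTT"
sub_scaffolds = split_around_hits(contig, [(6, 10, 95.0), (16, 18, 70.0)])
print(sub_scaffolds)
```

In the iterative scheme, the returned pieces would be written to a new FASTA file and searched again until no hit passes the threshold.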

ADD REPLY • link 3.0 years ago by langziv ▴ 70