Hi.
I ran blastx with 10 results for each sequence. I noticed 2 issues with the results:
1) For most input sequences, all the results were for the same gene / protein, unlike in blastn, where every input sequence got multiple results. For instance, one input sequence was roughly 48.5k base pairs. All 10 results were for the same gene, which was only 377 base pairs. It's possible for such a long sequence to contain only one short gene, but this happens with every input sequence. Would it be unreasonable to expect more genes in such sequences, at least in some of them?
2) The other issue is that the results report the input sequences as much shorter than they actually are (also unlike the blastn results). For instance, one input sequence was roughly 21.5k base pairs. The results indicated that it was only 4655 base pairs.
When considering these two issues together, it seems that blastx aligns only a small part of each sequence. I tried increasing the number of results but got a similar output.
Any insights from your experience will be welcomed.
Here's a sample of the output:
So you think that expecting to get multiple results per sequence is reasonable?
The hits are to bacterial sequence data, so introns are not likely a contributor. The snippet above could be a reasonable result, but you will need to examine the alignments by eye to see if that is so. Since you have a bunch of similar sequences in your reference database (all appear to be Citrobacter), it is no wonder they are showing up in your result.
OK, yeah, for bacterial sequences the situation is indeed slightly different (all single-exon genes).
However, if you take a stretch of DNA as input that has more than one gene on it, you should still get more than a single hit back (genes in different frames).
Depends on the settings (see below).
Some crucial info I thus forgot to ask for: what is the input sequence? (species for instance?) What do you have in the blastDB ?
I'm studying a Myxozoa species, and I ran its sequences against the blastn and blastx databases from the NCBI website.
Hi, I am working with RNA-seq data of viruses mixed with host genome. I use blastx to map de novo assembly contigs against a viral BLAST database (downloaded from NCBI and built with makeblastdb). I got my blastx results and found that each contig has several hits to different viruses, with identity varying from 30% to 90%. I want to ask how to separate viral contigs from host contigs, and how to pick the best hit for each contig. Thanks!
I used blastn before using blastx. I set an identity threshold of 80% for filtering contamination: if the identity of a contaminating hit was 80% or higher, I removed that region from the fasta file and created new sub-scaffolds from the remaining sequence. Then I ran blastn again, which yielded new results now that the previous contamination was removed. I filtered with the same threshold again and continued this cycle until no contamination was found. The program was written in Python. Good luck.
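A minimal sketch of one pass of such a filtering cycle, in case it helps. It is not the original program. It assumes BLAST was run with tabular output (-outfmt 6, default 12 columns), so that percent identity is in column 3 and the query start/end are in columns 7 and 8. The function names and the min_len parameter are made up for illustration.

```python
from collections import defaultdict


def parse_hits(lines, min_identity=80.0):
    """Collect query intervals of hits at or above min_identity.

    Assumes the default 12-column tabular layout:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, pident = fields[0], float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        if pident >= min_identity:
            # Coordinates are 1-based and inclusive; order them defensively.
            intervals[qseqid].append((min(qstart, qend), max(qstart, qend)))
    return intervals


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_scaffold(seq, contaminated, min_len=1):
    """Return the pieces of seq that remain after cutting out the intervals."""
    pieces = []
    pos = 0  # 0-based index of the next base that has not been cut
    for start, end in merge_intervals(contaminated):
        if start - 1 > pos:
            pieces.append(seq[pos:start - 1])
        pos = max(pos, end)
    if pos < len(seq):
        pieces.append(seq[pos:])
    return [p for p in pieces if len(p) >= min_len]


# Toy example: the 95% hit is cut out, the 70% hit is below the threshold.
blast_rows = [
    "scaf1\tcontamA\t95.0\t4\t0\t0\t5\t8\t1\t4\t1e-10\t50",
    "scaf1\tcontamB\t70.0\t4\t1\t0\t9\t12\t1\t4\t1e-3\t20",
]
hits = parse_hits(blast_rows)
sub_scaffolds = split_scaffold("AAAACCCCGGGGTTTT", hits["scaf1"])
print(sub_scaffolds)
```

To repeat the cycle, the sub-scaffolds would be written to a new fasta file and blasted again, stopping once parse_hits returns nothing. A larger min_len can be used to drop very short leftover fragments.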