4.2 years ago
ando.kelli
Hi All,
I'm annotating a transcriptome against NCBI's nt database, and was wondering if I could get some advice regarding customisation.
A lot of the hits are to genomes: e.g. 'Salmo trutta genome assembly, chromosome: 13' which doesn't tell me anything about what the transcript might be. (I'm doing several types of annotation, and I use this one to fill some of the gaps that are left after using other methods).
An example of my code is:
ls trinity_out_dir.Trinity.*.fasta | parallel --eta -j 14 --load 80% --noswap 'blastn -db /volume/BlastDBs/nt -query {} -out blastn_outfiles/{.}.tabular -evalue 1e-5 -outfmt "6 std stitle staxids sscinames sskingdom" -max_target_seqs 1 -max_hsps 1 -num_threads 2'
Any ideas on how I can get blastn to ignore key words? Like 'genome' and 'predicted'?
Cheers, Kelli
You may need to post-filter your results to ignore things you are not interested in.
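A minimal sketch of such a post-filter, under stated assumptions. It assumes the search was rerun with a larger `-max_target_seqs` (so there are several hits per query to choose from), that hits for each query are listed best first, and that the columns follow the `-outfmt "6 std stitle ..."` layout from the question, which puts `stitle` in the 13th column. The keyword list, the function name and the demo rows are made up for illustration.

```python
import io

STITLE_COL = 12                      # 0-based index: 12 "std" columns come first
KEYWORDS = ("genome", "predicted")   # hypothetical list of uninformative words


def first_informative_hits(handle, keywords=KEYWORDS):
    """Yield, per query, the first hit whose title contains none of the keywords.

    Assumes tab-separated BLAST output with hits for each query sorted
    best first, so the first hit kept is the best informative one.
    """
    seen = set()
    for line in handle:
        row = line.rstrip("\n").split("\t")
        if len(row) <= STITLE_COL:
            continue                 # skip blank or malformed lines
        query = row[0]
        title = row[STITLE_COL].lower()
        if query in seen:
            continue                 # already kept a hit for this query
        if any(word in title for word in keywords):
            continue                 # uninformative title, try the next hit
        seen.add(query)
        yield row


# Made-up demo rows (identifiers and scores are not real BLAST output).
demo = (
    "q1\tsubjA\t99.0\t400\t4\t0\t1\t400\t1\t400\t0.0\t720\t"
    "Salmo trutta genome assembly, chromosome: 13\t8032\tSalmo trutta\tEukaryota\n"
    "q1\tsubjB\t97.5\t400\t10\t0\t1\t400\t1\t400\t0.0\t680\t"
    "Salmo salar example mRNA, complete cds\t8030\tSalmo salar\tEukaryota\n"
    "q2\tsubjC\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-130\t470\t"
    "PREDICTED: Salmo salar uncharacterized LOC\t8030\tSalmo salar\tEukaryota\n"
)

kept = list(first_informative_hits(io.StringIO(demo)))
for row in kept:
    print(row[0], row[1], row[STITLE_COL], sep="\t")
```

In the demo, q1 falls back to its second hit, while q2 has no informative hit and is left unannotated by this method rather than deleted from the dataset. Whether 'predicted' hits should be dropped at all is a judgment call, since genuine hits can carry that word.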
I don't think one can get blast to ignore keywords: it is a sequence search tool rather than a keyword parser. Separately, I don't think it would be a good idea to do what you want even if it were possible, as I know from experience that even genuine hits sometimes have words like 'genome' or 'predicted' in their descriptions. As @genomax suggested, you can filter out the unwanted hits after the search is complete.
Thanks for your input genomax and Mansur Dlakic.
Filtering the offending hits out of the database isn't what I want to do, because that's the same as deleting them from my dataset. I'd rather annotate them if possible because many of them are differentially expressed.
We are suggesting that you filter the hits you can't use out of your blast results, not the sequences out of your database.
Thanks Genomax. If I can't annotate them I can't use them, so for me filtering them is the equivalent of deleting them.
I hope you realize that not every transcript put together by Trinity is real. No single experiment recovers 100% of transcripts.
Yep, for sure Genomax. I'm talking about transcripts that have been annotated using blastn with quite stringent parameters, but where the annotation is not informative. I want to improve the existing annotation if possible.
I'm happy to filter out transcripts that don't have high quality hits.