blasting a fasta file with many, many contigs via command line
4.1 years ago

I'm sure this is a beginner 101 question, but I have a genome assembly (a FASTA file with many, many, many contigs) and I would like to BLAST it against the NCBI database in order to remove contigs that may be there due to some sort of contamination. What is the easiest and quickest way to do this via the command line?

Assembly genome • 1.5k views

This is not such a trivial question; it depends on your data. I am not sure that blastn alone would give you good results. You could start with a blastn run (using outfmt 6 so the output is easy to parse), and if you are able to identify the contaminant, you can download its genome and map your contigs or your reads to that genome. Then you either remove the contigs that mapped, or reassemble your genome without the reads that mapped to the contaminant.
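To make the outfmt 6 parsing step a bit more concrete, here is a minimal sketch. It assumes the table was produced with something like `blastn -query contigs.fasta -db nt -outfmt 6 -out hits.tsv` (the file names and the database are placeholders) and that the default 12-column layout was used. The function name `best_hit_per_contig` and the identity/length thresholds are made up for illustration and would need tuning for real data; the example rows are invented as well.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hit_per_contig(handle, min_pident=90.0, min_length=200):
    """Return {contig_id: row} with the highest-bitscore hit per contig.

    min_pident and min_length are illustrative cut-offs, not recommendations.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=OUTFMT6_FIELDS, delimiter="\t")
    for row in reader:
        # Skip weak or very short alignments.
        if float(row["pident"]) < min_pident or int(row["length"]) < min_length:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Invented example rows standing in for a real hits.tsv.
example_tsv = "\n".join([
    "contig_1\tsubjA\t99.10\t1500\t13\t0\t1\t1500\t200\t1699\t0.0\t2700",
    "contig_1\tsubjB\t95.00\t800\t40\t0\t1\t800\t1\t800\t0.0\t1200",
    "contig_2\tsubjC\t80.00\t300\t60\t0\t1\t300\t1\t300\t1e-50\t200",
])

hits = best_hit_per_contig(io.StringIO(example_tsv))
for contig_id, row in sorted(hits.items()):
    print(contig_id, row["sseqid"], row["pident"], row["bitscore"])
```

With a real file you would pass `open("hits.tsv")` instead of the `io.StringIO` object. The subject IDs of the best hits are what you would then look up to decide which organism, if any, looks like a contaminant.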

Are you working on a bacterial genome? If so, you should try CheckM to assess the completeness and contamination of your genome.


If you have a closely related genome available in GenBank, then you may actually want to blast against that genome to identify the contigs that should be there and separate those first. Later on you could take a look at what is left over to see if there is anything you should recover.
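The "separate first, inspect the leftovers later" step could look roughly like the sketch below. It assumes you already have a set of contig IDs that hit the related genome (for example the query IDs from a tabular BLAST report) and that the contig ID is the first whitespace-delimited token of each FASTA header, which is typically how BLAST reports the query ID. The function names and the example sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_contigs(handle, matched_ids):
    """Partition contigs into (matched, leftover) lists by contig ID."""
    matched, leftover = [], []
    for header, seq in read_fasta(handle):
        tokens = header.split()
        contig_id = tokens[0] if tokens else ""
        target = matched if contig_id in matched_ids else leftover
        target.append((header, seq))
    return matched, leftover


# Invented example assembly; contig_1 and contig_3 are assumed to have hit
# the related reference genome.
example_fasta = (
    ">contig_1 length=8\nACGTACGT\n"
    ">contig_2\nGGGG\nCCCC\n"
    ">contig_3\nTTTT\n"
)
matched, leftover = split_contigs(
    io.StringIO(example_fasta), {"contig_1", "contig_3"}
)
print("matched:", [header for header, _ in matched])
print("leftover:", [header for header, _ in leftover])
```

The leftover list is the part you would then screen more carefully, since it holds both possible contaminants and genuine sequence that is simply missing from the reference.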

4.1 years ago
YocelynGG ▴ 70

If you are looking for possible contamination in your contigs, you could try DeconSeq. I think it is a better option, and you can create specific databases: http://deconseq.sourceforge.net/

4.1 years ago

Rather than the whole NCBI database, maybe first try UniVec to get rid of adapters and other technical rubbish.

You can find good BLAST tutorials very, very, very easily on the web or on this site without having to ask people here specifically.

I believe the NCBI offers an adapter/contaminant screening process these days as well. Perhaps have a look at VecScreen too: https://www.ncbi.nlm.nih.gov/tools/vecscreen/contam/

We run a simple Nextflow BLAST pipeline on our SLURM cluster against the entire NCBI database (to detect weird and/or technical noise in big metagenomics datasets), but I haven't made it widely available yet.
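This is not that pipeline, but one common way to spread a BLAST run over a cluster is to split the multi-FASTA into chunks of contigs and submit each chunk as its own job. A minimal sketch of the splitting step follows; the function name `chunk_fasta_lines` and the chunk size are arbitrary choices for illustration, and the example records are invented.

```python
def chunk_fasta_lines(lines, contigs_per_chunk=1000):
    """Yield lists of FASTA lines, each holding at most contigs_per_chunk records."""
    chunk, n_records = [], 0
    for line in lines:
        if line.startswith(">"):
            # Start a new chunk once the current one is full.
            if n_records == contigs_per_chunk:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk


# Invented example: five tiny contigs, two per chunk.
example_lines = [
    ">a\n", "ACGT\n",
    ">b\n", "GG\n", "CC\n",
    ">c\n", "TT\n",
    ">d\n", "AA\n",
    ">e\n", "CG\n",
]
chunks = list(chunk_fasta_lines(example_lines, contigs_per_chunk=2))
for i, chunk in enumerate(chunks):
    headers = [line.strip() for line in chunk if line.startswith(">")]
    print(f"chunk {i}: {headers}")
```

With a real assembly you would iterate over an open file handle and write each chunk to its own file, then run one BLAST job per file and concatenate the tabular outputs at the end.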
