Question

Remove contaminant reads using Bowtie2

1

Entering edit mode

9.2 years ago

treitlis ▴ 40

Hello to everybody,

I am new to bioinformatics so please excuse me if the question is quite stupid.

So I am working on assembling an eukaryotic genome. For this I have three MiSeq paired-end libraries with 500, 5000 and 10000 bp insert sizes. I assembled the genome, but since the organism does not live alone in the culture, but with a mixture of bacteria, I need to decontaminate the data. I do the decontamination with a script written by me (a combination of blastn followed by protein prediction and blastp), and finally I get two files called "clean" and "contaminants".

At this point what I want is to map the reads to the contaminants file and get just the reads which don't map to this file to reassemble the genome. So for this I use bowtie2 to create an index of "contaminants" file and then map the reads and use the --un-conc to get the unmapped reads. But there is a problem.

Generally as I understood, the unmapped reads file contains reads which did not map, and also the reads which did map but disconcordantly. Because of this, when I reassemble the genome I still get some bacterial contamination. Is there any way to force bowtie2 to don't include these sequences also. Or any other software which can do this.

What I want is to get the paired-end reads which don't map at all to the index file (I want to remove also those which map disconcordantly).

I don't want to do it the other way around (mapping the reads to the "clean" file and use -al-conc) because I don't want to lose any reads which were not used at all in the assembly.

Any suggestion is appreciated:) Thank you.

genome sequencing bowtie2 • 7.4k views

ADD COMMENT • link 9.2 years ago by treitlis ▴ 40

1

Entering edit mode

Hello,

Well, the bacteria and also the eukaryote on which I am working are not studied organisms. This is why I use also protein prediction and blastp to decontaminate the data. I don't have the genomes of the contaminant organisms, or if they are, usually the hits are from metagenomic data, and these are most of the time incomplete.

However, I figured out the solution and it works.

If anyone will need this type of solution you can do it like this. First map the reads with bowtie2. Then use samtools view -bS -f 12 samfile > unmapped_reads.bam. This will select only the reads where none of the pairs map to the index file.

Then use from picard tools (http://broadinstitute.github.io/picard/ ) the SamToFastq to extract the reads in two different fastq files (forward and reverse). I have tested these files with botwie2 and none of the sequences mapped, so the method works well:)

Thank you for all your answers.

ADD REPLY • link 9.2 years ago by treitlis ▴ 40

0

Entering edit mode

Hi,

Thank you for your answer, but as I said I am beginner with bioinformatics and so far I can handle just shell scripting. By the way I am not using kneaddata but I will check the tool:). I am using an inhouse script written in shell based on blastn, awk filtering (based on some thresholds) then protein prediction using prodigal (since contaminants are bacterial) and finally blastp. I get some coverage and identity values for each scaffold based on which I sort them as "clean" or "contaminants"

Returning to the problem, I was thinking to filter the sam file with samtools, and then use samtools fastq to convert the sam/bam file to fastq file. Would this do the trick?

However I am confused about the flags. For me the sequences which I need should have both 0x4 and 0x8 flags set if I understood correctly (first is for read unmapped, and second is for pair unmapped). Is this correct? Can I specify in -f option multiple flags? This part confuses me a lot:( .

Thank you,

ADD REPLY • link 9.2 years ago by treitlis ▴ 40

1

Entering edit mode

The use of bbsplit.sh is even easier that using shell scripting. ANd much more easier that using the many programs and files you are trying to use. ANd I will bet you that bbsplit.sh will give you better results

bbsplit.sh is a binary program and you only need to invoke it in the same way that you can see in the link I provided you

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq

ADD REPLY • link 9.2 years ago by Antonio R. Franco ★ 5.2k

score 4 · Answer 1 · 2016-04-23

If you know and can download the genome of the contaminant organisms, it is likely that the best thing you can do is to use bbsplit.sh from the terrific set of BBMap tools

The program is extremely fast and precise. You can find details about how it works in this threat http://seqanswers.com/forums/showthread.php?t=41288

score 1 · Answer 2 · 2016-04-22

1

Entering edit mode

9.2 years ago

fanli.gcb ▴ 730

We've run into this exact problem (perhaps you are using kneaddata as well?), and my workaround was to manually filter the fastq files against the SAM output from bowtie2. Any simple Perl/Python script should do just fine.

ADD COMMENT • link 9.2 years ago by fanli.gcb ▴ 730