Question

Removing contaminating reads from human genome

0

Entering edit mode

3.4 years ago

C4 ▴ 30

Hi, below is my "bbduk" command to remove bacterial reads from human DNA reads. The problem is that mapping percentage to the human genome is decreasing as it also removes bacterial reads. Is there a way around to make it more stringent without reducing mapping % to human genome, currently I think it's removing some of the human reads as well. Or any other tools that are more effective in removing contaminating genome from human reads? Thank you!

bbduk.sh -Xmx80g -Xms50g ref=$contaminants ordered=t k=35 hdist=0 ow=t in=fastq1 in1=fastq1 out=clean_fastq1 out2= clean_fastq2 outm=bad.fastq1 outm2=bad.fastq2

alignment sequencing next-gen • 1.7k views

ADD COMMENT • link updated 3.4 years ago by Ram 43k • written 3.4 years ago by C4 ▴ 30

score 1 · Answer 1 · 2020-12-07

1

Entering edit mode

3.4 years ago

GenoMax 141k

I don't think you should be using bbduk for this application. I suggest using bbsplit.sh (or even simple bbmap.sh) instead. Use the human genome and keep the reads that align to it.

How did you determine that there are bacterial contaminating reads in your data?

Edit: I had a look at one of your past threads (you have used kraken to decide that there is pseudomonas contamination). When you are dealing with short reads there is no way to know for sure if the read came from a particular genome if it matches equally well to human genome and something else. If reads aligning to pesudomonas are a small fraction (say < 10% of total reads) then you can simply discard them during analysis. If you are comparing > 1 sample and this issue is not uniformly applicable to all then you need to be concerned since there will be a bias in your data because of these manipulations.

ADD COMMENT • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

Hi Genomax, only a few samples have that issue. I determined it using kraken. There is same amout of contamination in all of them (~30%) However in my experience, removing bacterial reads may give better downstream results. I have used bbsplit in the past, as you suggested, but it gave me even less number of reads aligning to human genome than that were mapping using bwa. My strategy is to remove these contaminating reads and map them to human genome later. Do you suggest I use bbmap to align these reads to human genome and keep those that are aligning to further map using bwa? Could I align to bacterial genome and remove those reads instead? Thanks as always! :)

ADD REPLY • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

only a few samples have that issue.

To me that indicates a slip-up somewhere on the experimental side (either sample collection or some subsequent processing). Not sure how that is going to affect your end result.

As I wrote above:

When you are dealing with short reads there is no way to know for sure if the read came from a particular genome if it matches equally well to human genome and something else.

When using bbsplit.sh If a read matches both human genome and bacterial genome then you can either discard it completely (being stringent) or keep it in both pools (ambiguous2= option). You can use human genome + pseudomonas genome and decide on one of the strategies with bbsplit.sh. Only reads you could safely discard are ones that bin to the Pseudomonas genome alone.

BTW: Results you get from bbmap are going to be as good as bwa so there is no need to do the mapping twice.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

Entering edit mode

Thanks GenoMax, would you suggest using the adaptor-trimmed reads (in my case nextera adaptors) or the raw reads (with adaptors) for bbsplit? Also, is it a good idea to use ambiguous=all and ambiguous2=all ? , since ambiguous are the reads with multiple top-scoring mapping locations.

ADD REPLY • link 3.4 years ago by C4 ▴ 30

0

Entering edit mode

Adapter trimmed reads are fine. ambiguous2= decides what happens to reads that multi-map across genomes. As I said above you need to decide if you want to keep them in both or toss them totally or use one best site. ambiguous= does a similar thing for reads that are multi-mapping within one genome.

ADD REPLY • link 3.4 years ago by GenoMax 141k