Question

How to Remove Reads Found in Negative Controls from Experimental Samples

7.6 years ago

kdevakan ▴ 20

Hi

I have paired-end reads from human DNA samples, and I am trying to determine the metagenomic viral profile of each sample. I also have negative controls that were run through the same protocol as the human DNA samples (processing, library prep, sequencing, and adapter/barcode trimming). The next step would be to remove any reads found in my negative controls from my human samples. Does anyone know what the best approach/tool for this would be?

Thanks! Any help would be greatly appreciated!

sequencing sequence • 2.4k views

ADD COMMENT • link updated 7.2 years ago by Brian Bushnell 20k • written 7.6 years ago by kdevakan ▴ 20

Hi, I'm at this step too. There are mixed views about the utility of removing negative control reads. So I'd like to do both, i.e. perform my analysis once without removing negative control reads and once with removing negative control reads, and then compare and discuss results.

Logically, in mothur, one should be able to generate an accnos file with all the sequences to be removed and then use the remove.seqs command to accomplish just that. However, apparently this doesn't work after preclustering... the exact reason why was never given on the mothur forum. A solution was suggested using the sed command in bash, which is what I'm trying to learn how to do now.

Does anybody out there have a sed command solution for this problem? Any pointers or example code would be greatly appreciated.

ADD REPLY • link 7.4 years ago by careymsuehs • 0
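
For the comment above, here is a minimal sketch of removing named sequences from a FASTA file outside of mothur. It uses awk rather than sed, since awk handles multi-line FASTA records more easily. The file names (seqs.fasta, remove.accnos, kept.fasta) and the toy data are hypothetical, and the sketch assumes the accnos file holds one sequence name per line, matching the first word of each FASTA header. It has not been checked against mothur's preclustered output formats.

```shell
#!/bin/sh
# Toy input files in a scratch directory (hypothetical names and data).
workdir=$(mktemp -d)

cat > "$workdir/seqs.fasta" <<'EOF'
>seq1
ACGT
>seq2
GGCC
TTAA
>seq3
TTTT
EOF

# One sequence name per line, as in an accnos-style list.
cat > "$workdir/remove.accnos" <<'EOF'
seq2
EOF

# First file: remember every name to drop.
# Second file: at each header, decide whether to keep the record,
# then print header and sequence lines only while keep is true.
# Note: this assumes the list file is not empty.
awk 'NR == FNR { drop[$1] = 1; next }
     /^>/      { id = substr($1, 2); keep = !(id in drop) }
     keep' "$workdir/remove.accnos" "$workdir/seqs.fasta" > "$workdir/kept.fasta"

cat "$workdir/kept.fasta"
```

Running the analysis once on seqs.fasta and once on kept.fasta would give the with/without comparison described in the comment.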

score 0 · Answer 1 · 2017-02-24

7.2 years ago

surendra ▴ 30

You can try vsearch (https://github.com/torognes/vsearch).

ADD COMMENT • link 7.2 years ago by surendra ▴ 30

score 0 · Answer 2 · 2017-02-24

I would suggest first removing all human reads - map to the human genome, and keep only what does not map. Then assemble the remainder from the negative controls. What would that be? Maybe centromeres and telomeres; I'm not sure.

Then, map the positive (experimental) samples to a combined reference made of the human genome plus the negative-control assembly, and again keep what does not map. The output will be material present in your positive samples that is not normally present in humans and also not present in your negative controls. Since you are asking for specific tools, I suggest BBMap for this approach, as it can map human reads that have low identity to the human genome, such as reads with long deletions. For example:

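# Map negative-control reads to human; outm= gets mapped reads, outu= gets unmapped reads
bbmap.sh ref=human.fa in=negative.fq outm=n_mapped.fq outu=n_nonhuman.fq
# Assemble n_nonhuman.fq into assembly.fa here (I don't know the best tool for this, as it depends on the clade)
# Combine the negative-control assembly with the human genome into one reference
cat assembly.fa human.fa > negative.fa
# Map positive samples to the combined reference and keep what does not map
bbmap.sh ref=negative.fa in=positive.fq outm=p_mapped.fq outu=p_nonnegative.fq
# Assemble the remaining reads
tadpole.sh in=p_nonnegative.fq out=virus.fa k=62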

...and presto! There's the virus. Hopefully.
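
As a sanity check on a workflow like the one above, it can help to count how many reads survive each filtering step. The sketch below is only an illustration: the helper name count_fastq_reads and the toy file are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record.

```shell
#!/bin/sh
# Hypothetical helper: count records in an uncompressed FASTQ file,
# assuming exactly four lines per record.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Toy FASTQ with two records, standing in for a file such as p_nonnegative.fq.
demo=$(mktemp)
cat > "$demo" <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
IIII
EOF

count_fastq_reads "$demo"
```

Comparing the count for the input file with the count for the unmapped output shows how many reads each mapping step removed.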