How to remove contamination from WGS fastq files
4.1 years ago

Hi,

I recently got back some fastq files. The samples I sent off for sequencing appear to have some human DNA in them, and I was wondering if there is a way to remove the human reads. The fastq files are 150 bp paired-end reads. I saw some posts suggesting bbsplit and others suggesting bbduk. I am just not sure which to use, or whether there is newer/better software available. I saw that both of those tools are 6 years old (not that that makes them bad).

Thanks in advance!

next-gen sequence • 2.1k views

I guess bbsplit is still a valid option. Alternatively, you could map the samples with any aligner against a combined genome consisting of both your target genome and the human genome, and then remove the reads that map to human. I would probably require end-to-end mapping in this case to avoid soft-clipped matches.


Huh, that is an interesting idea. How would you remove the reads that map against the human genome?


I would do the following:

  1. Append the species to the chromosome names in each fasta, e.g. chr1_human, chr2_human, etc. Do the same for your actual organism.

  2. cat both together and make an index, e.g. with bowtie2.

  3. Align reads. Probably end-to-end is good.

  4. Keep everything non-human. You could use samtools view to extract only the alignments to your organism, plus maybe the unmapped reads if you want to do any kind of assembly. I'd use a high MAPQ threshold here since you want to remove only the obvious human contamination. Convert this back to fastq and done (I guess, never tried that). I think, though, that bbsplit does pretty much that under the hood.
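Steps 1 and 4 can be sketched in plain Python, as an illustration only (untested on real data). The `_human` suffix follows the chr1_human example above; the function names, the MAPQ cutoff of 30, and the choice to drop both mates of a pair when either one hits human confidently are my assumptions, not something prescribed in this thread. Input is FASTA lines and uncompressed SAM lines.

```python
HUMAN_TAG = "_human"  # assumed suffix, as in the chr1_human example


def tag_fasta_headers(lines, tag):
    """Append `tag` to the sequence name on every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            # Only the name (text before the first space) gets the tag;
            # any description after it is kept unchanged.
            name, sep, desc = header.partition(" ")
            yield f">{name}{tag}{sep}{desc}\n"
        else:
            yield line


def human_read_names(sam_lines, min_mapq=30):
    """Collect names of reads with a confident alignment to a human contig."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        qname, rname, mapq = fields[0], fields[2], int(fields[4])
        if rname.endswith(HUMAN_TAG) and mapq >= min_mapq:
            names.add(qname)
    return names


def drop_human_pairs(sam_lines, min_mapq=30):
    """Return SAM records whose read name has no confident human hit.

    Both mates share a read name, so a pair is dropped as a whole.
    Unmapped reads (RNAME '*') are kept.
    """
    sam_lines = list(sam_lines)
    contaminated = human_read_names(sam_lines, min_mapq)
    return [
        line for line in sam_lines
        if line.startswith("@") or line.split("\t")[0] not in contaminated
    ]
```

The surviving records would still need converting back to fastq, and low-MAPQ human hits are deliberately kept, in line with removing only the obvious contamination.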


I think I'll try both methods and see how it goes. Thank you for your suggestions.


genomax knows bbtools very well, so I would try that suggestion first. Mine is more naive thinking aloud.

4.1 years ago
GenoMax 147k

The BBTools suite is not old; it is continuously updated. If you have a reference genome available for your species of interest, then bbsplit is the best option. It maps the data to both genomes at the same time and lets you handle multi-mapping reads in intelligent ways (look at the ambiguous2= option in the in-line help).

That said, if the contamination happened at the sequence provider, then you should ask them to re-sequence your samples. If contamination is high, then you paid for data you can't use.
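For the bbsplit route, a paired-end run might be assembled roughly as below. This is a sketch only: every file name is a placeholder, `ambiguous2=toss` is just one of the documented choices, and the flags should be checked against the in-line help of your installed version before use. The snippet builds and prints the command rather than running it.

```python
import shlex


def bbsplit_command(r1, r2, human_fa, target_fa, ambiguous2="toss"):
    """Build a bbsplit.sh argument list for one paired-end sample.

    All paths are hypothetical placeholders supplied by the caller.
    """
    return [
        "bbsplit.sh",
        f"in1={r1}",
        f"in2={r2}",
        # Both references are indexed and mapped against together.
        f"ref={human_fa},{target_fa}",
        # '%' is replaced by the reference name, '#' by the mate number.
        "basename=binned_%_#.fq.gz",
        # Reads that map to neither reference.
        "outu1=unbinned_1.fq.gz",
        "outu2=unbinned_2.fq.gz",
        # How to treat reads that map equally well to both references.
        f"ambiguous2={ambiguous2}",
    ]


cmd = bbsplit_command("sample_R1.fq.gz", "sample_R2.fq.gz",
                      "human.fa", "target.fa")
print(shlex.join(cmd))
# To launch it: subprocess.run(cmd, check=True)
```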


Thank you for your response! We think it happened at the sequencing provider; we just aren't sure how we can prove that. Do you know of any way to prove it?


Proving the error is theirs can be tricky. At a minimum, you can let them know that you are seeing human contamination, which is unexpected. If they care about their customers, they should respond and be willing to work with you. Facilities can re-make at least one library (if more than one sample is involved) and re-sequence it. If the new libraries look fine, then the customer does not pay (the problem was at the facility). But if the new libraries are also contaminated, then the customer pays for that test.


Also, does bbsplit give you one file that contains the clean reads and another file that contains the contaminated reads? If it doesn't, I guess I could just run it twice, once with our species and once with the human genome.


bbsplit bins the reads into species-specific files and can also give you separate files for the reads that can't be binned.
