Question

Mate-Pair/Paired-End Contamination

0

10.4 years ago

Adrian Pelin ★ 2.6k

Hello,

So I got 10% of a lane of mate-pairs back. When I map them to the genome, I see there are a lot of paired-end reads, which map in FR orientation, while the true mate pairs map in RF orientation.

This is a big problem, because this contamination ranges from 40% to 70%, so I am not even sure what the actual percentage is. I was wondering whether, by mapping back to contigs, there is a way to extract all paired-end reads in FR orientation and subtract them from the initial file that contains the entire library.

Adrian

assembly scaffolding • 3.9k views

ADD COMMENT • link updated 10.4 years ago by richardc.gsc ▴ 160 • written 10.4 years ago by Adrian Pelin ★ 2.6k

score 0 · Answer 1 · 2013-12-05

0

10.4 years ago

richardc.gsc ▴ 160

Hi Akoik063, you can use Bowtie2 and tell it to mark only the mate-pair alignments as proper pairs. However, this will only work if you have enough fragments with both ends aligning to the same contig, which probably doesn't help you too much.
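To make the FR versus RF distinction concrete, here is a minimal Python sketch of how the orientation of a pair could be read from the standard SAM fields (FLAG, RNAME, POS, RNEXT, PNEXT). The function names `pair_orientation` and `fr_read_names` are hypothetical, and comparing leftmost positions is a simplification that can misclassify pairs whose reads overlap. As noted above, it can only say something about pairs with both ends on the same contig.

```python
# SAM FLAG bits used below (from the SAM specification).
PAIRED, UNMAPPED, MATE_UNMAPPED, REVERSE, MATE_REVERSE = 0x1, 0x4, 0x8, 0x10, 0x20


def pair_orientation(flag, pos, pnext, same_ref=True):
    """Classify one SAM record of a pair as 'FR', 'RF', 'tandem', or None.

    None means the pair cannot be classified: not paired, one end unmapped,
    or the two ends are on different references.
    """
    if not flag & PAIRED or flag & (UNMAPPED | MATE_UNMAPPED) or not same_ref:
        return None
    rev = bool(flag & REVERSE)
    mate_rev = bool(flag & MATE_REVERSE)
    if rev == mate_rev:
        return "tandem"  # both reads on the same strand
    # Leftmost position of the forward-strand read and of the reverse-strand read.
    fwd_pos, rev_pos = (pnext, pos) if rev else (pos, pnext)
    # Forward read upstream of reverse read: reads point towards each other (FR).
    return "FR" if fwd_pos <= rev_pos else "RF"


def fr_read_names(sam_lines):
    """Collect the names of pairs that map in FR orientation."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:  # not a complete alignment record
            continue
        same_ref = f[6] == "=" or f[6] == f[2]
        if pair_orientation(int(f[1]), int(f[3]), int(f[7]), same_ref) == "FR":
            names.add(f[0])
    return names
```

The resulting set of names is the "PE-like" fraction among the pairs that could be classified; pairs returning None stay unclassified.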

One thing that might help is if your mate-pair protocol has a stuffer sequence around the biotin used during the pulldown in library construction. If you find the stuffer sequence you can have more confidence that your reads are crossing the mate-pair junction.

ADD COMMENT • link 10.4 years ago by richardc.gsc ▴ 160

0

I have no idea how to find the stuffer sequence.

Telling Bowtie to mark mate pairs is an idea, but the problem is that it would mark only the ones that align to contigs. What I want is to remove the paired-end contamination from my file, so that all the mate pairs remain in it while some of the paired-end reads are removed.

ADD REPLY • link 10.4 years ago by Adrian Pelin ★ 2.6k

0

Have a read through cts's answer to a similar question from a week ago: A: How to cluster mate pair, paie end and single end reads from single file??

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

I remember that post. The problem with it is that it takes the reads that have aligned and splits them into mp.sam and pe.sam. However, mp.sam will not contain ALL mate pairs, only the ones that mapped. I was wondering if there is a way to do:

original.fastq - pe.sam = mp.fastq?

ADD REPLY • link 10.4 years ago by Adrian Pelin ★ 2.6k

0

You could feasibly do it (create a lookup table of PE-based seqids and screen the FASTQ for those in the PE file), but you won't get rid of all PE data, only filter them down to a lower frequency. Unfortunately the ones you may have to worry about the most will still be there, e.g. the discordant ones which possibly have reads mapping to different contigs, possibly confounding scaffolding.
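A rough Python sketch of that lookup-and-screen step, assuming plain four-line FASTQ records and a set of read names already collected from the PE alignments. The helper names `normalize_id` and `subtract_reads` are made up for illustration, and the `/1` and `/2` suffix handling is an assumption about how the reads are named.

```python
import io


def normalize_id(header):
    """Reduce a FASTQ header to the bare read name used for the lookup."""
    name = header[1:] if header.startswith("@") else header
    name = name.split()[0]  # drop any comment after the first whitespace
    if name.endswith(("/1", "/2")):  # assumed old-style pair suffixes
        name = name[:-2]
    return name


def subtract_reads(fastq_in, fastq_out, drop_ids):
    """Copy every record whose name is NOT in drop_ids; return (kept, dropped)."""
    kept = dropped = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0].strip():  # end of file or trailing blank line
            break
        if normalize_id(record[0].rstrip("\n")) in drop_ids:
            dropped += 1
        else:
            fastq_out.writelines(record)
            kept += 1
    return kept, dropped


if __name__ == "__main__":
    # Tiny in-memory demonstration with invented reads.
    fq = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n")
    out = io.StringIO()
    print(subtract_reads(fq, out, {"r1"}))
    print(out.getvalue(), end="")
```

With real files the two handles would come from `open(...)`, and the same set of names would be applied to both the read 1 and read 2 files so the pairs stay in sync. As said above, this only removes the PE pairs that could be identified by mapping.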

My guess is you are using sequence derived from the original Illumina mate-pair protocol, which generally had no stuffer sequences. We have recently switched over to using their Nextera protocol, which is tons better (much higher frequency of true MP, even up to 15 kb or so) and has the biotin-labeled stuffer sequence alluded to by @richardc.gsc, so one could efficiently filter for only those that should be true mate-pairs.
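For a library that does carry a stuffer, a minimal Python sketch of such a filter could look like the following. The function names are hypothetical and the stuffer sequence is passed in as a parameter, since the actual sequence depends on the protocol and has to be taken from its documentation. This version only finds exact, full-length matches on either strand; real reads would likely also need tolerance for mismatches and for partial matches at read ends.

```python
# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_stuffer(read_seq, stuffer):
    """True if the read contains the stuffer exactly, on either strand."""
    read_seq = read_seq.upper()
    stuffer = stuffer.upper()
    return stuffer in read_seq or revcomp(stuffer) in read_seq


def pair_crosses_junction(seq1, seq2, stuffer):
    """True if at least one read of the pair contains the stuffer."""
    return has_stuffer(seq1, stuffer) or has_stuffer(seq2, stuffer)
```

Pairs for which `pair_crosses_junction` is true have direct evidence of the mate-pair junction; the read would still need to be trimmed at the stuffer before mapping or scaffolding.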

ADD REPLY • link 10.4 years ago by Chris Fields ★ 2.2k