Question: Strategy to get rid of fake mate pairs
Asked 3.5 years ago by Adrian Pelin (Canada)

Hello,

I have been thinking about the challenges of dealing with Illumina mate pairs. It is known that after the sequencing run, the final reads are a mixture of true mate pairs, paired-end reads, and seemingly single-end reads.

Let's assume a mate-pair library with a ~3 kb average sequenced insert. If we were to de novo assemble the genome, we could select all contigs above, say, 20 kb. We could then trim 3 kb from the left and right ends of each contig and map our mate-pair reads to those trimmed contigs. This would allow us to identify which read pairs are actually paired-end (based on the relative orientation of the two reads and the distance between them) and exclude them from the sequencing file used in the next de novo assembly.
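To make the idea concrete, here is a minimal sketch of the two steps in plain Python. It is only an illustration under stated assumptions: the function names, the read length, and the distance cutoffs (`pe_max`, `mp_min`, `mp_max`) are hypothetical, and it assumes the usual convention for raw Illumina mate-pair libraries, where true mate pairs map facing outward (reverse/forward) with a large span and paired-end contaminants map facing inward (forward/reverse) with a short span. If your trimming tool flips the reads, the orientation test would need to be swapped.

```python
def trim_contigs(contigs, min_len=20_000, trim=3_000):
    """Keep contigs of at least min_len bases and clip `trim` bases from each end.

    contigs: dict mapping contig name -> sequence string.
    """
    return {
        name: seq[trim:len(seq) - trim]
        for name, seq in contigs.items()
        if len(seq) >= min_len
    }


def classify_pair(pos1, rev1, pos2, rev2, read_len=100,
                  pe_max=1_000, mp_min=1_500, mp_max=6_000):
    """Classify a read pair whose two reads map to the same contig.

    pos1/pos2: leftmost 0-based mapped coordinate of each read.
    rev1/rev2: True if the read maps to the reverse strand.
    All cutoffs are illustrative assumptions, not recommended values.
    """
    if rev1 == rev2:
        return "other"  # both reads on the same strand: neither pattern

    # Order the two reads by mapped position (left read first).
    (lpos, lrev), (rpos, rrev) = sorted([(pos1, rev1), (pos2, rev2)])
    span = rpos + read_len - lpos  # outer distance covered by the pair

    if not lrev and rrev:
        # Left read forward, right read reverse: inward-facing (FR).
        return "paired_end" if span <= pe_max else "other"
    # Left read reverse, right read forward: outward-facing (RF).
    return "mate_pair" if mp_min <= span <= mp_max else "other"
```

In practice the positions and strands would come from the alignment records of whichever mapper is used; pairs labelled "paired_end" are the ones to drop before the next assembly.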

I can already think of one way this can backfire, namely repetitive regions, where we will see high coverage after mapping. Maybe these regions can somehow be excluded or masked, so that the focus stays on regions of average coverage.
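One possible way to flag such regions is to compare windowed depth against the typical depth of the contig. The sketch below is an assumption-laden illustration, not a tested pipeline: the function name, the 1 kb window, and the 2x cutoff are all hypothetical, and `depth` is assumed to be a per-base coverage list for one contig obtained from whatever depth tool you prefer.

```python
from statistics import median


def high_coverage_windows(depth, window=1_000, max_ratio=2.0):
    """Return (start, end) windows whose mean depth exceeds
    max_ratio times the median of all window means.

    depth: per-base coverage values for a single contig.
    window and max_ratio are illustrative defaults only.
    """
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))

    typical = median(m for _, _, m in means)
    return [(s, e) for s, e, m in means if m > max_ratio * typical]
```

Read pairs falling inside the returned windows could then be left out of the orientation/distance classification, so that only regions of roughly average coverage are used to decide which pairs to discard.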

Any thoughts?

Adrian



How did you trim the biotinylated adapters? I have used NxTrim, and it gives four classes of reads: mate pairs, regular paired-end reads, single-end reads and "unknowns". I've gotten 20-50% mate pairs and 30-60% single-end plus paired-end reads. So this would reduce your problem to somewhere between 30% and 80% of your original mate-pair sequencing, depending on the size of the "unknowns" class.

The "unknowns" could then be mapped and sorted out. Repetitive regions could be dealt with either by coverage filtering, as you said, or by filtering out reads that map to multiple locations. Or you could just ignore the "unknowns" (or treat them as single-end reads) if you retrieved a good proportion of mate pairs.
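As a rough illustration of the multiple-mapping filter, a pair can be set aside whenever either read has a low mapping quality, since many aligners give low (often zero) MAPQ to reads that fit equally well in several places. The record layout, function name, and the cutoff of 20 below are assumptions made for the sketch, not settings taken from any particular tool.

```python
def split_by_mapq(pairs, min_mapq=20):
    """Split read pairs into uniquely and ambiguously mapped sets.

    pairs: iterable of (pair_name, mapq_read1, mapq_read2) tuples
           (a hypothetical layout; fill it from your alignment file).
    min_mapq: illustrative cutoff; a pair is kept only if BOTH reads
              reach it.
    """
    unique, ambiguous = [], []
    for name, q1, q2 in pairs:
        if min(q1, q2) >= min_mapq:
            unique.append(name)
        else:
            ambiguous.append(name)
    return unique, ambiguous
```

Only the pairs in the first list would then be classified by orientation and distance; the ambiguous ones could be ignored or treated as single-end reads.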

What is the reasoning for trimming 3kb left and right?

 

Reply written 3.5 years ago by h.mon

The sequencing provider did that analysis for us and gave us both the raw file and the "mate pairs". However, when I map their mate pairs, I still see non-mate pairs, suggesting the set is not 100% clean, which could trick the assembler. The reason for trimming 3 kb left and right is that you do not want to filter out mate pairs that can later be used for scaffolding. If the mates map to different contigs, they will look like single-end reads, so it would be hard not to exclude them. If you trim left and right, you will likely avoid touching mate pairs that can later help you scaffold. I chose 3 kb based on the average distance between my mate pairs.

Reply written 3.5 years ago by Adrian Pelin