Question: About Paired-End Sequencing
16
10.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:

Hi all, here are some questions about paired-end sequencing for NGS:

  • What are the main differences between mate-pair sequencing and paired-end sequencing? Should I care when I use tools like 'samtools', maq, etc.? Should one, and only one, short read be paired with another one (1-1)?
  • What is removing duplicates? Does it mean that a pair of short reads has been mapped at two distinct positions on the genome, or does it mean that a pair matched too many times at one position?
  • Knowing that bwa sampe "Generates alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly", is there any need to "remove the duplicates"?
  • How does picard MarkDuplicates work? How can I find the reads that have been 'tagged'? Will it remove the reads from the BAM file?

Thanks

Pierre

ADD COMMENTlink modified 9.1 years ago by Ketil4.0k • written 10.1 years ago by Pierre Lindenbaum128k
1

This wiki page discusses various aspects of MarkDuplicates: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page Adding it for future reference.

ADD REPLYlink written 9.1 years ago by Khader Shameer18k
16
10.1 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

Paired ends and mate pairs are different protocols. The distance between mate pairs is much longer (2-5kb), while paired-end fragments are rarely more than 500bp apart and can even have negative distance (overlapping pairs).
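
To make the "negative distance" point concrete, here is a minimal sketch of the arithmetic. It assumes both reads have the same length and that "distance" means the gap between the inner ends of the two reads; the function name and the example numbers are made up for illustration.

```python
def inner_distance(fragment_len: int, read_len: int) -> int:
    """Gap between the inner ends of the two reads of a pair.

    Assumes both reads have the same length. A negative value means the
    two reads overlap in the middle of the fragment.
    """
    return fragment_len - 2 * read_len


# Hypothetical libraries, all sequenced as 2 x 100bp:
print(inner_distance(500, 100))   # typical paired-end fragment
print(inner_distance(150, 100))   # short fragment: reads overlap
print(inner_distance(3000, 100))  # mate-pair sized insert
```

With these numbers the 150bp fragment gives -50, i.e. the two reads share 50 bases.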

ADD COMMENTlink written 10.1 years ago by Jeremy Leipzig19k
6

yeah it is yet another illumina-naming convention problem. I think they should call them Illumina Long-Ass Mate Pairs or something.

ADD REPLYlink written 10.1 years ago by Jeremy Leipzig19k

Nice one. I didn't realize there was a distinction and would have said paired end referred to any case, independent of insert size. Now my Australia/mate joke seems even more lame.

ADD REPLYlink written 10.1 years ago by Brad Chapman9.5k
8
9.0 years ago by
Ketil4.0k
Germany
Ketil4.0k wrote:

Hm. Old questions, so nobody will read this, but I'm not entirely happy. Here are my answers:

  1. Paired ends are supported by some technologies (Illumina and Sanger), where it is possible to sequence from both ends of a clone. Mate pairs involve making circular fragments using a linker sequence, fragmenting them around the linker, and then sequencing the result. Illumina will read from each end of the fragment, 454 (and I believe Solid?) will read through it all.

Now this terminology isn't fixed, and lots of people will talk about mate pairs when doing paired end, or vice versa. Caution is advised! Also, I think Solid has some funky variations on this, but I haven't looked too closely. And, the mate pair protocol is rather unreliable, IME. Expect lots of non-mate-pair reads, and a wide range of insert sizes.

Oh, and I believe you can get mate pairs from 1.5K to 20K. We just got Illumina PE reads at 500bp inserts, this was considered experimental by the company doing it.

  2. Removing duplicates can refer to several different things, but with all the second gen technologies, it is common to get a proportion of duplicate clones. This can skew things, e.g. for de novo assembly, it will give an artificially high coverage of a region, and it might be incorrectly identified as a repeat.

  3. The random placement of sampe is probably to make sure that you get the right coverage for repetitive regions. So it's orthogonal to removing duplicates that are sequencing artifacts. And if you have a dinucleotide repeat region of (AT)*, a read of (AT)* and a read of (TA)* would not be considered duplicates, but be placed randomly.

  4. Can't help you there, but Pierre seems to have it covered. :-)

ADD COMMENTlink written 9.0 years ago by Ketil4.0k
14
10.1 years ago by
Brad Chapman9.5k
Boston, MA
Brad Chapman9.5k wrote:

Pierre -- I'll tackle your questions in order:

  • See Jeremy's answer on the mate-pair/paired end distinction.

  • Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations. What you are trying to do in duplicate removal is identify reads that are PCR duplicates, which is especially useful in SNP calling since you want to avoid double counting evidence from the same underlying biological sequence.

  • This refers to one read mapping in multiple places, so is separate from duplicate detection.

  • MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:

http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

When they are marked, a bit (0x0400) is set in the flag field. See the table in section 2.2.2 of the SAM format for all the gory details:

http://samtools.sourceforge.net/SAM1.pdf
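
To find the tagged reads yourself, you can test that bit on the FLAG column (the second tab-separated field of a SAM record). A minimal sketch in plain Python, with no BAM library assumed; the two SAM lines are invented for illustration:

```python
DUPLICATE_FLAG = 0x400  # decimal 1024, "PCR or optical duplicate" in the SAM spec


def is_duplicate(sam_line: str) -> bool:
    """Return True if the duplicate bit is set in the FLAG field of a SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second column
    return bool(flag & DUPLICATE_FLAG)


# Two made-up records: FLAG 1123 is 99 with the duplicate bit added, 99 is unmarked.
marked = "read1\t1123\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII"
unmarked = "read2\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII"

for line in (marked, unmarked):
    print(line.split("\t")[0], is_duplicate(line))
```

On the command line, `samtools view -f 0x400 in.bam` should list only the marked reads and `samtools view -F 0x400 in.bam` should skip them (`in.bam` is a placeholder file name).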

ADD COMMENTlink modified 10.1 years ago • written 10.1 years ago by Brad Chapman9.5k

Thanks Brad, you're my NGS guru again ;-)

ADD REPLYlink written 10.1 years ago by Pierre Lindenbaum128k

Brad, does Picard compare the actual read sequences when marking/removing the duplicates?

ADD REPLYlink written 10.1 years ago by Mikael Huss4.7k

Mikael, I believe it goes by alignment location not read sequence.

ADD REPLYlink written 10.1 years ago by Brad Chapman9.5k

Consider correcting the mate-pair/paired-end distinction? They are different.

ADD REPLYlink written 10.1 years ago by Jonathan Manning640

Good idea; didn't even realize I could edit later. All fixed, thanks.

ADD REPLYlink written 10.1 years ago by Brad Chapman9.5k

Regarding the duplicates, I am interested in finding short reads that have been mapped to more than one distinct position on the genome, and also which locations they have been mapped to. Is there a tool that already does this?

ADD REPLYlink written 6.7 years ago by roll310

Ask this as a new question please.

ADD REPLYlink written 6.7 years ago by Pierre Lindenbaum128k
6
10.1 years ago by
Nate60
Nate60 wrote:

Also remember that the read orientations of paired-end and mate-pair (using Illumina terminology) are different. It's RF for mate pair and FR for paired end. Some software needs the reads for mate-pairs to be reversed as it expects FR orientation - for example BWA.
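
"Reversing" a read here means reverse-complementing the bases and reversing the quality string so each quality still lines up with its base. A minimal sketch; the helper names are made up, and whether you need to flip one or both reads depends on your library prep and aligner, so check before applying it:

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_read(seq: str, qual: str) -> tuple[str, str]:
    """Flip a read's strand: reverse-complement the bases, reverse the qualities."""
    return reverse_complement(seq), qual[::-1]


# Made-up read: bases and FASTQ quality string of the same length.
seq, qual = flip_read("AACG", "IIH#")
print(seq, qual)
```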

ADD COMMENTlink written 10.1 years ago by Nate60
5
10.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:

About MarkDuplicates: as seen today on the samtools mailing list:

Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taken place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities, counting only bases with Q >= 15.
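
A rough Python sketch of the logic described above, to make it easier to follow. This is not Picard's code: the record layout (dicts with `chrom`, `pos`, `cigar`, `reverse`, `quals`), the function names and the example pairs are all invented for illustration, and things like optical duplicates, unpaired reads and library information are ignored.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_OPS = set("MDN=X")  # CIGAR operations that consume reference bases
CLIP_OPS = set("SH")    # soft and hard clips


def unclipped_five_prime(pos, cigar, reverse):
    """Where the 5'-most base would sit if clipped bases had been aligned too.

    pos is the 1-based leftmost aligned position, as in the SAM POS column.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward read: 5' end is on the left, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIP_OPS:
                break
            lead += n
        return pos - lead
    # Reverse read: 5' end is on the right, so take the alignment end
    # (deletions and skips included) and add the trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIP_OPS:
            break
        trail += n
    return pos + ref_len - 1 + trail


def quality_score(quals, min_q=15):
    """Sum of base qualities, counting only bases with Q >= min_q."""
    return sum(q for q in quals if q >= min_q)


def mark_duplicates(pairs):
    """Return the names of pairs that would be marked as duplicates."""
    groups = defaultdict(list)
    for pair in pairs:
        key = tuple(sorted(
            (r["chrom"], unclipped_five_prime(r["pos"], r["cigar"], r["reverse"]), r["reverse"])
            for r in pair["reads"]
        ))
        groups[key].append(pair)
    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda p: sum(quality_score(r["quals"]) for r in p["reads"]))
        duplicates.update(p["name"] for p in members if p is not best)
    return duplicates


def make_pair(name, fwd_pos, fwd_cigar, rev_pos, rev_cigar, q):
    """Build a made-up pair on chr1 with uniform base quality q (50 bases per read)."""
    return {"name": name, "reads": [
        {"chrom": "chr1", "pos": fwd_pos, "cigar": fwd_cigar, "reverse": False, "quals": [q] * 50},
        {"chrom": "chr1", "pos": rev_pos, "cigar": rev_cigar, "reverse": True, "quals": [q] * 50},
    ]}


example_pairs = [
    make_pair("A", 100, "50M", 300, "50M", 30),
    make_pair("B", 105, "5S45M", 300, "45M5S", 20),  # same 5' ends as A once clips are undone
    make_pair("C", 101, "50M", 300, "50M", 30),      # different 5' end, so not a duplicate
]
print(mark_duplicates(example_pairs))
```

Pairs A and B share the same unclipped 5' coordinates and orientations even though their POS values differ, so the lower-quality one (B) is the one reported. This also matches the point made in the other answer that the comparison goes by alignment location, not by read sequence.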

ADD COMMENTlink written 10.1 years ago by Pierre Lindenbaum128k