About Paired-End Sequencing
14.0 years ago

Hi all,

Here are some questions about paired-end sequencing for NGS:

  • What are the main differences between mate-pair sequencing and paired-end sequencing? Should I care when I use tools like samtools, maq, etc.? Must each short read be paired with one, and only one, other read (1-1)?
  • What is "removing duplicates"? Does it mean that a pair of short reads has been mapped at two distinct positions on the genome, or does it mean that a pair matched too many times at one position?
  • Knowing that bwa sampe "Generates alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly", is there any need to "remove the duplicates"?
  • How does picard MarkDuplicates work? How can I find the reads that have been 'tagged'? Will it remove the reads from the BAM file?

Thanks
Pierre

next-gen-sequencing duplicates • 36k views
ADD COMMENT

This wiki page discusses various aspects of MarkDuplicates: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page Adding it for future reference.

ADD REPLY
14.0 years ago

Paired ends and mate pairs are different protocols. The distance between mate pairs is much longer (2-5kb), while paired-end fragments are rarely more than 500bp apart and can even have a negative distance (overlapping pairs).
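
To make the "negative distance" case concrete, here is a minimal sketch. It assumes "distance" means the inner gap between the two reads (fragment length minus both read lengths); the function name and the example lengths are made up for illustration.

```python
def inner_distance(fragment_length: int, read_length: int) -> int:
    """Gap between the two reads of a pair; negative when the reads overlap.

    Assumes both reads have the same length and are read inward
    from opposite ends of the fragment.
    """
    return fragment_length - 2 * read_length


# A 500 bp fragment sequenced with 2 x 100 bp reads leaves a gap.
print(inner_distance(500, 100))
# A 150 bp fragment sequenced with 2 x 100 bp reads gives overlapping reads.
print(inner_distance(150, 100))
```

A negative result simply means the two reads cover some of the same bases in the middle of the fragment.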

ADD COMMENT

yeah it is yet another illumina-naming convention problem. I think they should call them Illumina Long-Ass Mate Pairs or something.

ADD REPLY

Nice one. I didn't realize there was a distinction and would have said paired end referred to any case, independent of insert size. Now my Australia/mate joke seems even more lame.

ADD REPLY
13.0 years ago
Ketil 4.1k

Hm. Old question, so nobody will read this, but I'm not entirely happy. Here are my answers:


1. Paired ends are supported by some technologies (Illumina and Sanger), where it is possible to sequence from both ends of a clone. Mate pairs involve making circular fragments using a linker sequence, fragmenting them around the linker, and then sequencing the result. Illumina will read from each end of the fragment; 454 (and I believe SOLiD?) will read through it all.

Now, this terminology isn't fixed, and lots of people will talk about mate pairs when doing paired end, or vice versa. Caution is advised! Also, I think SOLiD has some funky variations on this, but I haven't looked too closely. And the mate-pair protocol is rather unreliable, IME. Expect lots of non-mate-pair reads and a wide range of insert sizes.

Oh, and I believe you can get mate pairs from 1.5K to 20K. We just got Illumina PE reads at 500bp inserts, this was considered experimental by the company doing it.


2. Removing duplicates can refer to several different things, but with all the second gen technologies, it is common to get a proportion of duplicate clones. This can skew things, e.g. for de novo assembly, it will give an artificially high coverage of a region, and it might be incorrectly identified as a repeat.


3. The random placement by sampe is probably there to make sure that you get the right coverage for repetitive regions, so it's orthogonal to removing duplicates that are sequencing artifacts. And if you have a dinucleotide repeat region of (AT)*, a read of (AT)* and a read of (TA)* would not be considered duplicates, but would be placed randomly.


4. Can't help you there, but Pierre seems to have it covered. :-)

ADD COMMENT
14.0 years ago

Pierre -- I'll tackle your questions in order:

  • See Jeremy's answer on the mate-pair/paired end distinction.
  • Removing duplicates refers to removing multiple reads that map to the same position in the genome. This is different from one read (or read pair) mapping to multiple genome locations. What you are trying to do in duplicate removal is identify reads that are PCR duplicates, which is especially useful in SNP calling since you want to avoid double counting evidence from the same underlying biological sequence.
  • This refers to one read mapping in multiple places, so is separate from duplicate detection.
  • MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:

http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

When they are marked, a bit (0x0400) is set in the flag field. See the table in section 2.2.2 of the SAM format specification for all the gory details.
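
To find the tagged reads yourself, you can test that bit in the FLAG column. This is only a sketch working on SAM text lines; the two example records are invented, and `is_duplicate` is a name I made up, not part of Picard or samtools.

```python
DUPLICATE_FLAG = 0x400  # decimal 1024, the "PCR or optical duplicate" bit


def is_duplicate(sam_line: str) -> bool:
    """Return True if the duplicate bit is set in the FLAG field (column 2)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return bool(flag & DUPLICATE_FLAG)


# Hypothetical SAM records: same alignment, second one marked as duplicate.
kept = "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"
marked = "read2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"

print(is_duplicate(kept))
print(is_duplicate(marked))
```

If you only need the records and not a script, `samtools view -f 1024 in.bam` should list the marked reads, and `-F 1024` should skip them.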

ADD COMMENT

Thanks Brad, you're my NGS guru again ;-)

ADD REPLY

Brad, does Picard compare the actual read sequences when marking/removing the duplicates?

ADD REPLY

Mikael, I believe it goes by alignment location not read sequence.

ADD REPLY

Consider correcting the mate-pair/paired-end distinction? They are different.

ADD REPLY

Good idea; didn't even realize I could edit later. All fixed, thanks.

ADD REPLY

Regarding the duplicates, I am interested in finding short reads that have been mapped to more than one distinct position on the genome, and also which locations they were mapped to. Is there a tool that already does this?

ADD REPLY

ask this as a new question please.

ADD REPLY
14.0 years ago
Nate ▴ 60

Also remember that the read orientations of paired-end and mate-pair (using Illumina terminology) are different. It's RF for mate pair and FR for paired end. Some software needs the reads for mate-pairs to be reversed as it expects FR orientation - for example BWA.
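
A rough sketch of what "reversing" a read means here: reverse-complement the bases and reverse the quality string so each quality stays with its base. The function name is hypothetical, and I am assuming you apply it to both reads of each mate pair before alignment; check what your aligner actually expects.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_read(seq: str, qual: str) -> tuple[str, str]:
    """Reverse-complement a read and reverse its per-base quality string."""
    return seq.translate(_COMPLEMENT)[::-1], qual[::-1]


seq, qual = reverse_complement_read("AACGT", "IIHG#")
print(seq, qual)
```

Applying this to both mates turns an RF pair into an FR pair, which is the orientation the answer above says BWA expects.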

ADD COMMENT
14.0 years ago

About MarkDuplicates : As seen today on the samtools mailing list

Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taken place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities, counting only bases with Q >= 15.
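
The description above can be sketched roughly as follows. This is not Picard's code, just a toy version of the idea: the data layout (a pair as a name, two `(chrom, pos, cigar, is_reverse)` tuples and a list of base qualities), the function names and the example pairs are all my own assumptions.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def unclipped_five_prime(pos: int, cigar: str, is_reverse: bool) -> int:
    """Where the 5'-most base would sit if clipped bases had been aligned."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not is_reverse:
        # Forward read: 5' end is the leftmost base, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            lead += n
        return pos - lead
    # Reverse read: 5' end is the rightmost base, so add trailing clips
    # to the last aligned reference position.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return pos + ref_len - 1 + trail


def quality_score(quals: list[int], min_q: int = 15) -> int:
    """Sum of base qualities, counting only bases with Q >= min_q."""
    return sum(q for q in quals if q >= min_q)


def mark_duplicates(pairs: list[tuple]) -> set[str]:
    """Return names of pairs that are duplicates of a better-scoring pair."""
    groups = defaultdict(list)
    for name, read1, read2, quals in pairs:
        key = tuple(
            (chrom, unclipped_five_prime(pos, cigar, rev), rev)
            for chrom, pos, cigar, rev in (read1, read2)
        )
        groups[key].append((name, quality_score(quals)))
    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda m: m[1])
        duplicates.update(name for name, _ in members if name != best[0])
    return duplicates


# Hypothetical pairs: B is a soft-clipped copy of A; C starts one base later.
pairs = [
    ("A", ("chr1", 100, "50M", False), ("chr1", 300, "50M", True), [30] * 100),
    ("B", ("chr1", 105, "5S45M", False), ("chr1", 300, "45M5S", True), [20] * 100),
    ("C", ("chr1", 101, "50M", False), ("chr1", 300, "50M", True), [30] * 100),
]
print(mark_duplicates(pairs))
```

Note how pair B has different POS and CIGAR values from pair A but still lands in the same group once clipping is undone, which is the point of using the unclipped 5' coordinate rather than the reported alignment start.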

ADD COMMENT
