Question: Questions about LB field in SAM specification for PCR duplicate removal
James Ashmore (UK/Edinburgh/MRC Centre for Regenerative Medicine) wrote, 2.4 years ago:

Should the LB field in the SAM specification refer to the library preparation for the sample, or to the library preparation carried out by the sequencing centre? Say I have a sample sequenced on multiple lanes of a single flowcell/machine: should the reads from each lane have the same library name? And what if a sample was sequenced on one lane/flowcell/machine on a certain date, and then sequenced again on a different lane/flowcell/machine? Would the reads from these two runs have the same library name?

My question arises because, normally, when I want to remove duplicates from multiplexed samples (all sequenced on the same machine/date), I just align the FASTQ files separately, then merge the BAM files belonging to the same sample and run MarkDuplicates on the merged BAM. However, I recently contacted GATK to ask whether read group information was necessary in this context, and the answer was yes (http://gatkforums.broadinstitute.org/gatk/discussion/9310/read-group-information-required-for-markduplicates).

This confused me: if the sample was produced from a single library, shouldn't merging followed by duplicate removal based on the 5' position alone remove all duplicates (optical and library)?


I have faced a similar problem in the past. From what I know, MarkDuplicates looks for duplicates among reads that belong to the same library, which it reads from the LB field of each read's read group (RG). All data from the same library should therefore carry the same LB value in their RG. However, when you analyse your data in pieces, you may find at the end that the RG field does not reflect the correct information. There are different ways to solve this. For example, if you are aligning with bwa mem, you can pass it a pre-specified read group line with its -R option.

(comment written 2.4 years ago by abascalfederico)
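
To illustrate the bwa suggestion in the comment above, here is a minimal sketch that builds a read group line and a `bwa mem -R` argument list without running anything. The helper names, sample name, library name, flowcell label, and file names are all hypothetical; only the `-R` option and the standard `@RG` tags (ID, SM, LB, PL, PU) come from bwa and the SAM specification.

```python
import shlex


def make_read_group(rg_id, sample, library, platform="ILLUMINA", unit=None):
    """Return an @RG line using a literal backslash-t between fields.

    bwa mem -R accepts the two-character sequence '\\t' and converts it
    to a real TAB in the output SAM header.
    """
    fields = [f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}", f"PL:{platform}"]
    if unit is not None:
        fields.append(f"PU:{unit}")
    return "@RG\\t" + "\\t".join(fields)


def bwa_mem_command(reference, fastq1, fastq2, read_group):
    """Build (but do not run) a bwa mem argument list with the read group."""
    return ["bwa", "mem", "-R", read_group, reference, fastq1, fastq2]


# Hypothetical sample sequenced on two lanes from ONE library preparation:
# the read group ID differs per lane, the LB value is shared.
rg_lane1 = make_read_group("sampleA.lane1", "sampleA", "sampleA_lib1",
                           unit="FLOWCELL1.1")
rg_lane2 = make_read_group("sampleA.lane2", "sampleA", "sampleA_lib1",
                           unit="FLOWCELL1.2")

cmd = bwa_mem_command("ref.fa", "lane1_R1.fastq.gz", "lane1_R2.fastq.gz",
                      rg_lane1)
print(shlex.join(cmd))  # shell-quoted form, for copy and paste
```

Setting the read group at alignment time means each per-lane BAM already carries the right LB value before the files are merged.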
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 2.4 years ago:

"My question arises because normally when I want to remove duplicates from multiplexed samples (all sequenced on the same machine/date) I just align the FASTQ files separately, then merge BAM files belonging to the same sample and run MarkDuplicates"

Each different sample/lane/library should be given a different read group ID in the SAM header.


What do you define as a library?

(comment written 2.4 years ago by James Ashmore)

The library is the DNA library, i.e. the DNA preparation in which PCR duplicates arise. It doesn't matter if you run it on different lanes: you should treat all the reads from a library together when marking duplicates.

(comment written 2.4 years ago by abascalfederico)
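
The point about treating all reads from a library together can be sketched with a toy model. This is not how Picard MarkDuplicates is implemented (real tools work on read pairs and keep the best-quality read). It only shows reads being grouped by library rather than by lane or read group ID. The read groups, library names, and reads below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical header content: read group ID -> library (LB).
# Two lanes share one library; a second library prep has its own LB.
RG_TO_LIBRARY = {
    "sampleA.lib1.lane1": "sampleA_lib1",
    "sampleA.lib1.lane2": "sampleA_lib1",
    "sampleA.lib2.lane1": "sampleA_lib2",
}

# Toy reads: (name, read group ID, chromosome, 5' position, strand).
READS = [
    ("r1", "sampleA.lib1.lane1", "chr1", 100, "+"),
    ("r2", "sampleA.lib1.lane2", "chr1", 100, "+"),  # same library, other lane
    ("r3", "sampleA.lib2.lane1", "chr1", 100, "+"),  # same position, other library
    ("r4", "sampleA.lib1.lane1", "chr1", 250, "-"),
]


def mark_duplicates(reads, rg_to_library):
    """Return the names of reads flagged as duplicates.

    Simplification: the first read seen for each
    (library, chromosome, 5' position, strand) key is kept and every
    later read with the same key is flagged.
    """
    seen = defaultdict(list)
    for name, rg, chrom, pos, strand in reads:
        key = (rg_to_library[rg], chrom, pos, strand)
        seen[key].append(name)
    duplicates = set()
    for names in seen.values():
        duplicates.update(names[1:])
    return duplicates


print(sorted(mark_duplicates(READS, RG_TO_LIBRARY)))  # ['r2']
```

Here r2 is flagged even though it came from a different lane than r1, because both lanes share one library. r3 sits at the same position but is kept, because it comes from a separate library preparation.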