Question: How does samblaster use library tag information if at all?
5.0 years ago by
Carlos Borroto1.8k
Washington Metropolitan Area
Carlos Borroto1.8k wrote:

After seeing a couple of mentions of samblaster by none other than @lh3, I decided to try it out. I'm in the middle of processing data for a large cohort, and Picard's MarkDuplicates is taking a good chunk of the processing time.

My main question is: how does samblaster use the library (LB) read group tag? The author mentions that the input SAM needs to be sorted by read group ID, which makes me think duplicate marking is limited to reads coming from the same '@RG ID'. In our case we resequence the same sample library a few times. It is my understanding that you need to mark duplicates across all the data coming from the same library, not just within one read group ID.

Imagine this situation.

sample: S; library: S; sequence runs: 1, 2

To use samblaster I would map with something like:

bwa mem -R '@RG\tID:S.1\tSM:S\tPL:ILLUMINA\tPU:1\tLB:S' index S.1.r1.fq S.1.r2.fq | samblaster | samtools view -Sb - > S.1.out.bam
bwa mem -R '@RG\tID:S.2\tSM:S\tPL:ILLUMINA\tPU:2\tLB:S' index S.2.r1.fq S.2.r2.fq | samblaster | samtools view -Sb - > S.2.out.bam

In this case I would not be marking duplicates across all the data coming from the same library, even if samblaster correctly uses the LB tag. Do you see a way to use piping (data streaming) but still mark duplicates correctly in this situation?

 

Thanks, Carlos.

samblaster markduplicates • 2.0k views

Another question: does MarkDuplicates use both ID and LB when deciding whether reads are duplicates, or just LB?

 

written 5.0 years ago by brentp

That's a good question. I assumed Picard uses LB only, but I have no evidence for that.

written 5.0 years ago by Carlos Borroto
4.6 years ago by
gf4ea30
gf4ea30 wrote:

samblaster currently ignores both the LB and RG tags.  The input file must be grouped by QNAME (often also called "read id").  That is, the file need not be sorted by QNAME so long as all the alignments for a given QNAME are in contiguous lines in the input file. This is the natural order for the output of essentially all aligners.
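The grouping requirement can be checked on a SAM stream before it reaches samblaster. Below is a minimal sketch, assuming plain SAM text on stdin; `check_qname_grouped` is a hypothetical helper name, not part of samblaster or samtools. It only tests that all lines sharing a QNAME are contiguous, and it keeps every QNAME in memory, so it is better suited to spot checks than to very large files.

```shell
# Hypothetical helper (not part of samblaster): reads SAM text on stdin and
# exits 0 if every QNAME occupies one contiguous run of lines, 1 otherwise.
check_qname_grouped() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    /^@/ { next }                        # skip SAM header lines
    $1 != prev {                         # a new run of QNAMEs starts here
      if ($1 in seen) { bad = 1; exit }  # this QNAME already had an earlier run
      seen[$1] = 1
      prev = $1
    }
    END { exit bad }
  '
}
```

For example, `samtools view -h S.1.out.bam | check_qname_grouped && echo "grouped by QNAME"` should print the message for unsorted aligner output, while a coordinate-sorted file will usually fail the check because mates end up far apart.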

I hope this answers your questions.

Greg

 

 

 
