How does samblaster use library tag information if at all?
6.9 years ago
Carlos Borroto ★ 2.0k

After seeing a couple of mentions of samblaster by none other than @lh3, I decided to try it out. I'm in the middle of processing data for a large cohort, and Picard's MarkDuplicates is taking a good chunk of the processing time.

My main question is: how does samblaster use the library (LB) read group tag? The author mentions that the input SAM needs to be sorted by read group ID, which makes me think duplicate marking is limited to reads coming from the same '@RG ID'. In our case we resequence the same sample library a few times. It is my understanding that you need to mark duplicates within all the data coming from the same library, not just within one read group ID.

Imagine this situation.

sample: S; library: S; sequence runs: 1, 2

In order to use samblaster I would map with something like:

bwa mem -R '@RG\tID:S.1\tSM:S\tPL:ILLUMINA\tPU:1\tLB:S' index S.1.r1.fq S.1.r2.fq | samblaster | samtools view -Sb - > S.1.out.bam
bwa mem -R '@RG\tID:S.2\tSM:S\tPL:ILLUMINA\tPU:2\tLB:S' index S.2.r1.fq S.2.r2.fq | samblaster | samtools view -Sb - > S.2.out.bam

In this case I would not be marking duplicates across all the data coming from the same library, even if samblaster correctly uses the LB tag. Do you see a way of using piping (data streaming) while still marking duplicates correctly in this situation?

 

Thanks, Carlos.


Another question: does MarkDuplicates use both ID and LB when deciding which reads to mark as duplicates, or just LB?

 


That's a good question. I assumed Picard uses LB only, but I have no evidence for that.

6.5 years ago
gf4ea ▴ 30

samblaster currently ignores both the LB and RG tags. The input file must be grouped by QNAME (often also called "read id"). That is, the file need not be sorted by QNAME, so long as all the alignments for a given QNAME are on contiguous lines in the input file. This is the natural order for the output of essentially all aligners.
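
The grouping requirement can be checked on a SAM stream before it is piped into samblaster. The following is only a minimal sketch, assuming a POSIX shell with awk available; check_qname_grouped is a hypothetical helper name and is not part of samblaster. It exits with status 0 when all alignments for each QNAME sit on contiguous lines, and with status 1 otherwise.

```shell
# Hypothetical helper (not part of samblaster): reads SAM text on stdin and
# exits 0 if every QNAME's alignments are contiguous, 1 otherwise.
check_qname_grouped() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    /^@/ { next }                  # skip SAM header lines
    $1 != prev {                   # QNAME changed relative to the previous record
      if ($1 in seen) {            # this QNAME already appeared in an earlier block
        bad = 1
        exit                       # jump to END, which sets the exit status
      }
      seen[$1] = 1
      prev = $1
    }
    END { exit bad }
  '
}
```

For example, with a hypothetical file name, `samtools view -h in.bam | check_qname_grouped && echo grouped` prints "grouped" only when the check passes. Note that the helper keeps every QNAME in memory, so it is meant for spot checks on a subset rather than for very large inputs.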

I hope this answers your questions.

Greg

 

 

 
