After seeing a couple of mentions by none other than @lh3 about samblaster, I decided to try it out. I'm in the middle of a massive data processing for a large cohort and picard's markduplicate is taking a good chunk of the processing time.
My main question is, how does samblaster use the library(LB) read group tag? The author mentions the input SAM needs to be sorted by read group id, which makes me think marking duplicates is limited to only reads coming from the same '@RG ID'. In our case we resequence the same sample library a few times. It is my understanding you need to mark duplicate within all the data coming from the same library, not just read group id.
Imagine this situation.
sample: S; library: S; sequence runs: 1, 2
In order to use samblaster I would map with with something like:
bwa mem -r '@RG\tID:S.1\tSM:S\tPL:ILLUMINA\tPU:1\tLB:S' index S.1.r1.fq S.1.r2.fq | samblaster | samtools view -Sb - > S.1.out.bam bwa mem -r '@RG\tID:S.2\tSM:S\tPL:ILLUMINA\tPU:2\tLB:S' index S.2.r1.fq S.2.r2.fq | samblaster | samtools view -Sb - > S.2.out.bam
In this case I would not be marking duplicates within all the data coming from the same library, even if samblaster correctly uses the LB tag. Do you see a way of using piping(data streaming) but still marking duplicates correctly in this situation?
Another question: Does MarkDuplicates use both ID and LB to match to mark reads as duplicate? Or just LB?
That's a good question. I assumed picard uses LB only, but I have no evidence for that.