Question: Dealing with multiple read groups, from BWA to sorted .bam
21 months ago, jamiedm wrote:

Hi all,

My (paired-end) sequencing data comprises 5 samples, each run on 3 different lanes (6, 7, 8):

 Sample1_L6_R1.fastq.gz,     Sample1_L6_R2.fastq.gz
 Sample1_L7_R1.fastq.gz,     Sample1_L7_R2.fastq.gz
 Sample1_L8_R1.fastq.gz,     Sample1_L8_R2.fastq.gz
 ...
 Sample5_L6_R1.fastq.gz,     Sample5_L6_R2.fastq.gz
 Sample5_L7_R1.fastq.gz,     Sample5_L7_R2.fastq.gz
 Sample5_L8_R1.fastq.gz,     Sample5_L8_R2.fastq.gz

My understanding is that since each sample was run on multiple lanes, it is important to specify read groups so that downstream applications such as GATK can distinguish the lanes. For this reason, I align each lane's files separately, adding read group information like this (for Sample1_L6):

bwa mem -t <threads> -R "@RG\tID:S1L6\tSM:S1\tPL:ILLUMINA\tLB:FC-140-1086" /path/to/hg38ref.fa /path/to/Sample1_L6_R1.fastq.gz /path/to/Sample1_L6_R2.fastq.gz > Sample1_L6_aligned.sam
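Only the ID field changes between lanes of the same sample, so the -R string could be generated rather than typed by hand. Here is a minimal sketch, assuming the ID/SM naming scheme from the command above. `make_rg` is a hypothetical helper name of my own, not part of bwa:

```shell
#!/usr/bin/env bash
# Build an @RG header string for one sample/lane pair.
# The "\t" sequences are kept as literal backslash-t, which is what bwa mem -R expects.
# Naming scheme (ID:S<sample>L<lane>, SM:S<sample>) is an assumption copied from the post.
make_rg() {
    local sample_no="$1" lane="$2" library="$3"
    printf '@RG\\tID:S%sL%s\\tSM:S%s\\tPL:ILLUMINA\\tLB:%s' \
        "$sample_no" "$lane" "$sample_no" "$library"
}

# Print the read group string for each lane of sample 1.
for lane in 6 7 8; do
    printf 'lane %s: %s\n' "$lane" "$(make_rg 1 "$lane" FC-140-1086)"
done
```

Each printed string could then be passed to bwa mem as -R "$(make_rg 1 "$lane" FC-140-1086)" inside the same loop.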

Naturally, I want to go from here to sorted .bam files (one for each sample like: Sample1_sorted.bam ... Sample5_sorted.bam), so I can then RemoveDuplicates and proceed with downstream analysis.

My question is: what would be the 'best' way to go from three unsorted .sam files to one sorted .bam file with read groups intact (preferably with samtools)? By 'intact' I mean that each SampleX_sorted.bam would contain three different read groups corresponding to the three lanes.

I presume that samtools view -b, samtools sort, and samtools merge/cat would be the tools I need, but in which order?

I originally tried merging and converting in one step like this:

samtools merge Sample1_unsorted.bam Sample1_L6_aligned.sam Sample1_L7_aligned.sam Sample1_L8_aligned.sam

I'm unsure if this is a valid use of the tools, and I think I read somewhere that samtools merge should be run on sorted files anyway.
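If merge does need sorted input, one order that would fit is to sort each lane's file first and merge afterwards. This is only a sketch of that idea, assuming a reasonably recent samtools 1.x and the file names used in the merge command above. `run` and `sort_and_merge` are hypothetical helper names, and the script only prints the commands unless DRY_RUN=0 is set:

```shell
#!/usr/bin/env bash
# Sketch: sort each lane's SAM, then merge the sorted BAMs into one per-sample BAM.
# Prints the commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: sort_and_merge <sample> <lane> [<lane> ...]
sort_and_merge() {
    local sample="$1"; shift
    local sorted=()
    local lane
    for lane in "$@"; do
        # samtools sort accepts SAM input and writes a coordinate-sorted BAM
        run samtools sort -o "${sample}_L${lane}_sorted.bam" "${sample}_L${lane}_aligned.sam"
        sorted+=("${sample}_L${lane}_sorted.bam")
    done
    # merge combines the sorted inputs; each input's @RG header line is carried over
    run samtools merge "${sample}_sorted.bam" "${sorted[@]}"
    run samtools index "${sample}_sorted.bam"
}

sort_and_merge Sample1 6 7 8
```

After a real run, samtools view -H Sample1_sorted.bam | grep '^@RG' should list one @RG line per lane if the read groups survived the merge.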

Any help or advice would be hugely appreciated!

Tags: sequence, dna-seq

This article from GATK might provide some helpful information:

How should I pre-process data from multiplexed sequencing and multi-library designs?

Comment written 21 months ago by Russ