Question

bam file readgroup to include several individuals

0

9.4 years ago

P.NJ ▴ 50

I have paired-end reads (gene regions/exons) from 200 individuals, and I want to include all the sample names in the read group. I have a text file listing the sample names in a single column, and I tried to add the read group when aligning:

file=($(cat samples.txt))
bwa mem -M -R "@RG\tID:Library1\tSM:$file\tPL:Illumina\tLB:lib_2x250\tDS:hg19" hg19.ref R1.fastq.gz R2.fastq.gz > file.sam

But it doesn't work: the generated SAM file includes only the last sample from samples.txt.

Also, once the bam files have been generated, is it possible to split the files based on individuals?

readgroup bam • 2.9k views

score 0 · Answer 1 · 2014-11-18

0

9.4 years ago

Devon Ryan 104k

You're not iterating over anything, so $file only ever expands to a single entry from samples.txt (the last one, in your case). You want something more like this (untested!):

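# Read samples.txt one line at a time; -r keeps backslashes literal
while IFS= read -r file; do
    # Note: every iteration aligns the same FASTQ files and writes to file.sam
    bwa mem -M -R "@RG\tID:Library1\tSM:$file\tPL:Illumina\tLB:lib_2x250\tDS:hg19" hg19.ref R1.fastq.gz R2.fastq.gz > file.sam
done < samples.txt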

Of course, that'll overwrite the output file and perform the same alignment again and again, but I leave fixing that as an exercise for you (presumably you have path information in the samples.txt file).
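
A minimal sketch of one way to avoid both the overwritten output and the repeated alignment, assuming each sample has its own FASTQ pair named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz. That naming scheme and the build_commands helper are hypothetical, not something from the original post. The function only prints the bwa mem commands (a dry run), so they can be checked before anything is aligned:

```shell
#!/usr/bin/env bash
# Dry run: print one bwa mem command per sample instead of executing it.
# Assumption (hypothetical): FASTQs are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
build_commands() {
    local samples_file=$1 sample
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue   # skip blank lines
        # \\t prints a literal \t, which bwa mem -R itself converts to a tab
        printf 'bwa mem -M -R "@RG\\tID:%s\\tSM:%s\\tPL:Illumina\\tLB:lib_2x250\\tDS:hg19" hg19.ref %s_R1.fastq.gz %s_R2.fastq.gz > %s.sam\n' \
            "$sample" "$sample" "$sample" "$sample" "$sample"
    done < "$samples_file"
}
```

Running build_commands samples.txt prints one command per line; once they look right, the output could be piped to sh. The sample name is used as both the read group ID and the output file name, so each sample gets a distinct read group and a distinct SAM file.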

For splitting a merged file by read group, there's probably something premade that can do it (either in Picard or GATK). But if not, you can always use samtools view with the -r option and iterate over the read groups (it'd probably be faster to just write a custom tool to do this with pysam).
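
The samtools view -r approach could be sketched roughly as follows. The function names list_rg_ids and split_by_rg and the output naming are made up for illustration; the sketch assumes samtools is on the PATH and that the input name ends in .bam:

```shell
#!/usr/bin/env bash
# Print the ID of every @RG line found in SAM header text on stdin.
list_rg_ids() {
    awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4) }'
}

# Write one BAM per read group, named <input>.<RGID>.bam (hypothetical naming).
split_by_rg() {
    local bam=$1 rg
    samtools view -H "$bam" | list_rg_ids | while IFS= read -r rg; do
        # -r keeps only the reads tagged with this read group
        samtools view -b -r "$rg" -o "${bam%.bam}.$rg.bam" "$bam"
    done
}
```

For example, split_by_rg merged.bam would write merged.<RGID>.bam for each read group in the header. This reads the input once per read group, which is why a single-pass tool is likely faster for many groups; recent samtools releases also ship a split subcommand intended for this job.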

0

Thank you... I understand the iteration part of it but I did not get the point: "(presumably you have path information in the samples.txt file)."

So, what I have is only one library of pooled samples that contains both cases and controls (200 in total). The sample list is just the ID names in one column. I wanted to add all 200 names to the BAM header and then separate the BAMs into cases and controls.

1

Ah, I had presumed that you had multiple individual samples, not a bunch whose names you wanted to concatenate. Then either make an $RG variable to which you just append each line in the while loop (and then put bwa outside of the loop) or, better yet, just linearize samples.txt:

RG=`cat samples.txt | tr -d "\n"`
bwa mem -M -R "@RG\tID:Library1\tSM:$RG\tPL:Illumina\tLB:lib_2x250\tDS:hg19" hg19.ref R1.fastq.gz R2.fastq.gz > file.sam

You could also comma-separate things, if you prefer, with RG=`cat samples.txt | tr "\n" ","`.
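
One caveat with the tr variant: the final newline in samples.txt is also turned into a comma, so the value ends with a trailing comma. A small sketch that avoids this by using paste instead (join_samples is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Join the lines of a one-column file with a separator (default: comma),
# without leaving a separator at the end.
join_samples() {
    local samples_file=$1 sep=${2:-,}
    paste -s -d "$sep" "$samples_file"
}
```

With this, RG=$(join_samples samples.txt) can be used in the -R string exactly like the $RG variable above.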

0

Thanks, but now it puts all the names on a single line, and that line is repeated several times:

@RG     ID:Library1 SM:NA19NA20NA21NA22... PL:Illumina  LB:lib_2x250 DS:hg19
@RG     ID:Library1 SM:NA19NA20NA21NA22... PL:Illumina  LB:lib_2x250 DS:hg19

..

What I intended was:

@RG     ID:Library1 SM:NA19 PL:Illumina  LB:lib_2x250 DS:hg19
@RG     ID:Library1 SM:NA20 PL:Illumina  LB:lib_2x250 DS:hg19
@RG     ID:Library1 SM:NA21 PL:Illumina  LB:lib_2x250 DS:hg19
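
A hedged sketch of how header lines in this shape could be generated from samples.txt. Two caveats apply. First, the SAM specification requires every @RG line to have a unique ID, so the sketch uses the sample name as the ID rather than repeating Library1. Second, bwa mem's -R takes a single read group line and attaches it to every read, so extra lines like these would have to be added to the header separately. The make_rg_lines name is hypothetical:

```shell
#!/usr/bin/env bash
# Print one tab-separated @RG header line per sample,
# using the sample name as the (unique) read group ID.
make_rg_lines() {
    local samples_file=$1 sample
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue   # skip blank lines
        printf '@RG\tID:%s\tSM:%s\tPL:Illumina\tLB:lib_2x250\tDS:hg19\n' "$sample" "$sample"
    done < "$samples_file"
}
```

Header lines on their own do not assign reads to individuals: each read carries at most one RG tag, so splitting a pooled library by individual still needs per-read information (for example, barcodes) that identifies which sample a read came from.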