bam file readgroup to include several individuals
1
0
Entering edit mode
9.4 years ago
P.NJ ▴ 50

I have paired-end reads (gene-regions-exons) that contains 200 individuals and I want to include all the sample names in the read group. I have a text file in a single column and tried to include the readgroup when aligning

file=($(cat samples.txt))
bwa mem -M -R "@RG\tID:Library1\tSM:$file\tPL:Illumina\tLB:lib_2x250\tDS:hg19" hg19.ref R1.fastq.gz R2.fastq.gz > file.sam

But it doesn't work, the sam file generated includes only the last sample from the file samples.txt.

Also, once the bam files have been generated, is it possible to split the files based on individuals?

readgroup bam • 2.9k views
ADD COMMENT
0
Entering edit mode
9.4 years ago

You're not iterating over anything, so file will only ever be the last line in samples.txt. You want something more like (untested!):

while read file; do
    bwa mem -M -R "@RG\tID:Library1\tSM:$file\tPL:Illumina\tLB:lib_2x250\tDS:hg19" hg19.ref R1.fastq.gz R2.fastq.gz > file.sam
done < samples.txt

Of course, that'll overwrite the output file and perform the same alignment again and again, but I leave fixing that as an exercise to you (presumably you have path information in the samples.txt file).

For splitting a merged file by read group, there's probably something premade that can do it (either in picard tools or GATK). But if not, you can always use samtools with the -r option and then iterate over the read groups (it'd probably be faster to just write a custom tool to do this with pysam).

ADD COMMENT
0
Entering edit mode

Thank you... I understand the iteration part of it but I did not get the point: "(presumably you have path information in the samples.txt file)."

So, what I have is only one library of pooled samples that contain both cases and controls (200 in total). The sample list is only the ID names in one column. I wanted to add all the 200 names into the bam header and then separate the bams based on cases and controls.

ADD REPLY
1
Entering edit mode

Ah, I had presumed that you had multiple individual samples, not a bunch whose names you wanted to concatenate. Then either make an $RG variable to which you just append each line in the while loop (and then put bwa outside of the loop) or, better yet, just linearize samples.txt:

RG=`cat samples.txt | tr -d "\n"`
bwa mem -M -R "@RG\tID:Library1\tSM:$RG\tPL:Illumina\tLB:lib_2x250\tDS:hg19" hg19.ref R1.fastq.gz R2.fastq.gz > file.sam

You could also comma separate things, if you prefer with RG=`cat samples.txt | tr "\n" ","`.

ADD REPLY
0
Entering edit mode

Thanks but now it ends up adding everything in just one line several times:

@RG     ID:Library1 SM:NA19NA20NA21NA22... PL:Illumina  LB:lib_2x250 DS:hg19
@RG     ID:Library1 SM:NA19NA20NA21NA22... PL:Illumina  LB:lib_2x250 DS:hg19

..

What I intended was

@RG     ID:Library1 SM:NA19 PL:Illumina  LB:lib_2x250 DS:hg19
@RG     ID:Library1 SM:NA20 PL:Illumina  LB:lib_2x250 DS:hg19
@RG     ID:Library1 SM:NA21 PL:Illumina  LB:lib_2x250 DS:hg19
ADD REPLY
0
Entering edit mode

this is useful... many thanks Devon!

ADD REPLY

Login before adding your answer.

Traffic: 3131 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6