Advice On Adding Readgroups
2
3
Entering edit mode
12.2 years ago
Jamie ▴ 30

Hi there,

Ok so I have some Illumina whole genome bams. They have been aligned by illumina using Casava but we wanted to re-run the alignments ourselves using bwa so I have used bam2fastq and extracted the paired end sequences. These have then been sucessfully aligned using bwa mem to produce a sam file that is then converted to a new, sorted and indexed bam file.

So far so good.

I want to use tools from the GATK and will need to insert readgroup data in order to do so. Each bam I have represents a single sample from a single library prep but they were run on multiple lanes (typically 3) as indicated from the fastq files and from the QNAME variable in the sam file.

For example: HS2000-1259_127:3:1210:15640:52255

With, I believe, 3 being the flowcell lane.

As the sample was the same and it's from the same library prep can I ignore the fact they were run on different lanes or is it necessary to individually tag each read with a read group according to it's flowcell lane?

If it is necessary to tag each read separately, am I correct in thinking that Picard's AddOrReplaceGroups is not capable of doing this (or at least not without splitting the bam up first, running picard then remerging)? And if it isn't, has someone already written something to carry out this task? (I'm sure I could probably whip something up, but there's no point reinventing the wheel!).

Thanks in advance.

ngs • 11k views
ADD COMMENT
0
Entering edit mode

the simplest method is to set the read group with BWA sampe -r '.....'

ADD REPLY
0
Entering edit mode

Hi Pierre,

I think I need to explain a bit more, the files we received from Illumina were already aligned by Illumina (not us) using Casava. We want to re-run the alignments using bwa. So we've extracted fastq files from the bams we received from Illumina. While I can use the -r method (or the equivalent -R method in bwa mem that I'm using), that still doesn't account for the fact the fastq files contain separate lanes (unless I'm missing something).

Doing it this way would it in fact be necessary to split out the lanes from the fastq files (i.e. create 3 fastq files, one for each lane) and align separately, while adding appropriate readgroups, the merge the resultant sam files?

ADD REPLY
2
Entering edit mode

but you could split your fastq per lane isn't it ? how to split reads for different flowcell lanes in fastQ files? , align each pair of fastq and merge later with picard/MergeSam

ADD REPLY
0
Entering edit mode

Thanks Pierre, that's what I thought. Assuming that I haven't actually done this though (which i haven't in this case, but I can alter the pipeline for the remaining samples), should I then use another tool (or make something myself) to add the read group info to the sam file on a per lane basis?

ADD REPLY
4
Entering edit mode
12.2 years ago

I just wrote a tool to do the job.

see:

https://github.com/lindenb/jvarkit#biostar78400

and

https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar78400.java

$ cat groups.xml
<read-groups>
<flowcell name="HS2000-1259_127">
 <lane index="1">
   <group ID="X1">
     <library>L1</library>
     <platform>P1</platform>
     <sample>S1</sample>
     <platformunit>PU1</platformunit>
     <center>C1</center>
     <description>blabla</description>
   </group>
 </lane>
</flowcell>
<flowcell name="HS2000-1259_128">
 <lane index="2">
   <group ID="x2">
     <library>L2</library>
     <platform>P2</platform>
     <sample>S2</sample>
     <platformunit>PU1</platformunit>
     <center>C1</center>
     <description>blabla</description>
   </group>
 </lane>
</flowcell>
</read-groups>



$ cat input.sam 
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  
TTAGATAAAGAGGATACTG *   XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   
0   AAAAGATAAGGGATAAA   *

$java -jar dist/biostar78400.jar \
    XML=groups.xml \
    I=input.sam \
    VALIDATION_STRINGENCY=LENIENT

@HD VN:1.4  SO:unsorted
@SQ SN:ref  LN:45
@SQ SN:ref2 LN:40
@RG ID:X1   PL:P1   PU:P1   LB:L1   DS:blabla   SM:S1   CN:C1
@RG ID:x2   PL:P2   PU:P2   LB:L2   DS:blabla   SM:S2   CN:C1
@PG ID:Biostar78400 PN:Biostar78400 PP:Biostar78400 VN:1.0  (...)
HS2000-1259_127:1:1210:15640:52255  163 ref 7   30  8M4I4M1D3M  =   37  39  TTAGATAAAGAGGATACTG *   RG:Z:X1 XX:B:S,12561,2,20,112
HS2000-1259_128:2:1210:15640:52255  0   ref 9   30  1S2I6M1P1I1P1I4M2I  *   0   0AAAAGATAAGGGATAAA  *   RG:Z:x2
ADD COMMENT
0
Entering edit mode

Wow! That's much more comprehensive than mine, I've just hacked it together with awk and perl on the sam file :D

Thanks I'll give this a shot on the bams, haven't played with java in years...

ADD REPLY
2
Entering edit mode
12.2 years ago

I am not sure if I got you. But here is what I understood. I assume you had a single sample for which a single library was built and ran on multiple lanes. Now you can either create a single bam file for the sample or you can have multiple bam files (one for each lane) adn it should work until if you have the same LB (Library) tag and same SM (Sample) tag for each of them. Of course they will have separate LN (Lane) and RG (Read group) tag. I think most of the variant callers can take multiple bam files and are smart enough to decide if they belong to one sample or not based on the SM tag.

As Pierre, suggested its better to assign a read group id while aligning if the aligner lets you do so. But if you have missed it you can always use Picard. Now depending on if you have already merged all the bam files or if they are still there at the lane level. The way you add read group id will differ. Just let me know what exactly you have right now and I may help you.

ADD COMMENT
0
Entering edit mode

Hi there,

Yeah, pretty much. I was toying with doing it at the aligner level but it seemed like I would have to split up the fastq's to do that. So I haven't actually merged any bam files, Illumina did so when they provided the data - I have just extracted the paired end fastq files from the merged bam.

But effectively yes I have ended up with lane merged bam files after our bwa run. Just wondering if it's REALLY necessary to tag each lane of the bam separately or if it can be treated as one? From what I've read before I posted opinion seemed a little mixed.

ADD REPLY
1
Entering edit mode

Right now the only way I think of is using HS2000-1259_127:3:1210:15640:52255 info to split your big bam into three smaller bams based on the lane number. Then add the read group id and merge everything together. Another possible way: It seems to me that your read names are unique and the same read name can't exist in two different lanes. Now if you are not looking for some lane specific modification including base recalibration at lane level, you can trick aligner to think that all these reads belong to the same lane (depending on if a read name in one lane doesnt exist in other). I think just add a single read group id (RG ID) to all the reads irrespective of which lane it belongs to.

ADD REPLY
0
Entering edit mode

I am sorry if I confusing you rather than helping you.

ADD REPLY
0
Entering edit mode

Thanks, but down the line I will need to do some base recalibration so it'd be wise if I label the lanes with the appropriate read groups it would seem.

I guess rather than splitting the bam I can just write a script to edit the files (insert RG data) at the sam stage, while it's essentially still text.

No, not confusing, I follow what you're saying :)

ADD REPLY
0
Entering edit mode

Yup I think writing a script that adds a RG ID to reads based on their lane number will be an easy solution.

ADD REPLY

Login before adding your answer.

Traffic: 2779 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6