Question

Putting Run/Lane Info Back Into Readgroup For GATK Pipeline

1

13.0 years ago

Kevin ▴ 640

I am looking at the read names in the archival BAM from the sequencing provider. They carry machine/run/lane information, but no read group (RG) information is written in the file.

What are my options if I want to extract the FASTQ from the BAM and realign with BWA, annotating the RG info so that it is used in downstream GATK calling? (Kind of a reversal of the process, to simulate per-lane FASTQ output for alignment, then per-lane dedup, and so on.)

I have actually already mapped this sample (with reads from different lanes and possibly different runs) to a reference, but I have 37 other samples, so it would be less painful if I got it 'right' from the start, e.g. with a Perl script that separates the FASTQ reads by run/lane so that each lane's BAM can be handled separately.

Cheers!

gatk bam • 5.3k views

ADD COMMENT • link updated 13.0 years ago by Sean Davis 27k • written 13.0 years ago by Kevin ▴ 640

0

Could you provide a bit more information about what "machine/run/lane" information you have? In my experience, if you only have a BAM that does not have a @RG tag in the header and corresponding RG fields for each read, then you will not have enough information to assign the reads in your BAM to their proper groups.

ADD REPLY • link 13.0 years ago by Matt Shirley 10k

0

As I understand it, the RG field information can be found in the read name?

e.g.

Is the template name (the read header in the BAM file) in the format sequencer:lane:tile:coord-x:coord-y?

<Machine>_<Run number>:<Lane>:<Tile>:<X coordinate of cluster>:<Y coordinate of cluster>

e.g. from http://en.wikipedia.org/wiki/FASTQ_format

@HWUSI-EAS100R:6:73:941:1973#0/1

The archival BAM that I have only contains reads belonging to one sample (I am not too sure about the library, but I guess it should be only one library as well).
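
For illustration, here is a minimal Python sketch of how a read name in the layout above could be split into its parts. The function name `parse_read_name` and the `ReadOrigin` tuple are hypothetical, and the sketch assumes the older colon-separated Illumina naming shown in the Wikipedia example, with an optional `_<run number>` suffix on the machine field. Names from other providers or newer instruments may not fit this pattern.

```python
import re
from typing import NamedTuple, Optional


class ReadOrigin(NamedTuple):
    """Hypothetical container for the fields encoded in a read name."""
    instrument: str
    run: Optional[str]  # None when the name carries no "_<run number>" suffix
    lane: int
    tile: int
    x: int
    y: int


# Assumed layout: [@]<machine>[_<run>]:<lane>:<tile>:<x>:<y>[#index][/pair]
_NAME_RE = re.compile(
    r"^@?(?P<inst>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
)


def parse_read_name(name: str) -> ReadOrigin:
    """Split a read name into machine, run, lane, tile and cluster coordinates."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"read name does not match the assumed layout: {name!r}")

    inst = match.group("inst")
    # Treat a trailing "_<digits>" on the machine field as the run number.
    instrument, sep, run = inst.rpartition("_")
    if not (sep and run.isdigit()):
        instrument, run = inst, None

    return ReadOrigin(
        instrument=instrument,
        run=run,
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
    )
```

Grouping reads by `(instrument, run, lane)` would then give one bucket per lane, which is the per-lane split I am after.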

ADD REPLY • link 13.0 years ago by Kevin ▴ 640

0

Your information about FASTQ template names is correct, but this information has to carry over into your BAM file. My concern is that, since you don't have the original FASTQ files, you don't have any information about which sequencing lane your read came from.

ADD REPLY • link 13.0 years ago by Matt Shirley 10k

score 1 · Answer 1 · 2012-07-30

The read group information (sample, library, and ID) is external to the FASTQ file. The connection between lane/run/machine and read group therefore needs to be provided by some other source, typically a spreadsheet or database that describes the flow of information from the original material through the sequencing process. Once you have the information about the relationship between FASTQ or BAM file and read group, you can either add that to the BAM file using something like the Picard tools, or you can supply it to an appropriate aligner up front.
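
As a rough sketch of the "supply it up front" route, the Python snippet below assembles an `@RG` header line from values you would look up in your own sample sheet. The function name `read_group_line` and the ID/PU naming scheme are assumptions made for illustration, not a required convention. The sample and library names have to come from that external record, since they cannot be recovered from the reads themselves.

```python
from typing import Optional


def read_group_line(
    sample: str,
    library: str,
    instrument: str,
    lane: int,
    run: Optional[str] = None,
    platform: str = "ILLUMINA",
) -> str:
    """Build an @RG line with literal "\\t" separators (backslash + t).

    The ID/PU scheme (<instrument>[.<run>].<lane>) is an assumed convention;
    any scheme works as long as each lane gets a unique ID.
    """
    unit = ".".join(part for part in (instrument, run, str(lane)) if part)
    fields = [
        ("ID", f"{sample}.{unit}"),  # unique per sample and lane
        ("SM", sample),              # from the external sample sheet
        ("LB", library),             # from the external sample sheet
        ("PL", platform),
        ("PU", unit),                # platform unit: machine/run/lane
    ]
    # Emit backslash-t rather than real tab characters, so the string can be
    # handed to an aligner option that expands "\t" itself.
    return "@RG" + "".join(f"\\t{key}:{value}" for key, value in fields)


if __name__ == "__main__":
    print(read_group_line("sample1", "lib1", "HWUSI-EAS100R", 6))
```

The resulting string is the kind of value that `bwa mem` accepts through its `-R` option. For a BAM that is already aligned, the same five values correspond to the `RGID`, `RGSM`, `RGLB`, `RGPL` and `RGPU` arguments of Picard's `AddOrReplaceReadGroups`.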