Question: Putting Run/Lane Info Back Into Read Group For GATK Pipeline
7.2 years ago, Kevin (630) wrote:

I am looking at the read names in the archival BAM from the sequencing provider. They provide machine/run/lane info, but read group info isn't written in the file.

What are my options if I want to extract the FASTQ from the BAM, realign with BWA, and annotate the RG info so that it is used in downstream GATK calling? (Kind of a reversal of the process, to simulate the output of per-lane FASTQ for alignment, then per-lane dedup, and so on.)

I have actually already mapped the sample (with reads from different lanes and possibly different runs) to a reference, but I have 37 other samples, so it would be less painful if I got it 'right' at the start, i.e. maybe a Perl script to separate the FASTQ reads by run/lane and then dealing with each per-lane BAM.
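The separation step described above could be sketched roughly as follows, in Python rather than Perl. This is only an illustration under stated assumptions: the input is plain four-line FASTQ (no wrapped sequence lines), the read names follow the colon-separated machine:lane:tile:x:y layout, and everything before the second colon identifies the run/lane. The names `lane_key` and `group_fastq_by_lane` are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One FASTQ record: (header, sequence, separator, qualities)
FastqRecord = Tuple[str, str, str, str]


def lane_key(header: str) -> str:
    """Return 'machine[_run]:lane', i.e. the text before the second colon.

    Assumes a name such as '@HWUSI-EAS100R:6:73:941:1973#0/1'.
    """
    fields = header.lstrip("@").split(":")
    if len(fields) < 5:
        raise ValueError(f"unexpected read name: {header!r}")
    return f"{fields[0]}:{fields[1]}"


def group_fastq_by_lane(lines: Iterable[str]) -> Dict[str, List[FastqRecord]]:
    """Bucket four-line FASTQ records by the run/lane part of the read name."""
    groups = defaultdict(list)
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        try:
            record = (header, next(it), next(it), next(it))
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        groups[lane_key(header)].append(record)
    return dict(groups)
```

For real data you would probably stream each record to a per-lane output file instead of holding everything in memory; the grouping key is the part that matters here.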

Cheers!

gatk bam • 3.9k views

Could you provide a bit more information about what "machine/run/lane" information you have? In my experience, if you only have a BAM with no @RG line in the header and no corresponding RG tag on each read, then you will not have enough information to assign the reads in your BAM to their proper groups.

Reply written 7.2 years ago by Matt Shirley (9.1k)

As I understand it, the RG field info can be found in the read name?

e.g.

Is the template name (the read header in the BAM file) in the format sequencer:lane:tile:coord-x:coord-y?

<Machine>_<Run number>:<Lane>:<Tile>:<X coordinate of cluster>:<Y coordinate of cluster>

e.g. from http://en.wikipedia.org/wiki/FASTQ_format

@HWUSI-EAS100R:6:73:941:1973#0/1
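To make the naming scheme concrete, here is a minimal Python sketch that pulls the fields out of a read name like the one above. It assumes the older colon-separated Illumina layout, treats an optional `_run` suffix on the machine name as the run number (so a machine name that itself contains an underscore would be split incorrectly), and ignores anything after the Y coordinate such as `#0/1`. `ReadName` and `parse_read_name` are hypothetical names used only for this illustration.

```python
import re
from typing import NamedTuple, Optional


class ReadName(NamedTuple):
    machine: str
    run: Optional[str]  # None when the name carries no "_<run>" part
    lane: int
    tile: int
    x: int
    y: int


# machine[_run]:lane:tile:x:y, with an optional leading '@' (FASTQ header)
_NAME_RE = re.compile(
    r"^@?(?P<machine>[^:_]+)(?:_(?P<run>[^:]+))?"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
)


def parse_read_name(name: str) -> ReadName:
    """Split a read name into machine, run, lane, tile and cluster coordinates."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"read name does not match the assumed format: {name!r}")
    return ReadName(
        machine=m["machine"],
        run=m["run"],
        lane=int(m["lane"]),
        tile=int(m["tile"]),
        x=int(m["x"]),
        y=int(m["y"]),
    )


parsed = parse_read_name("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(parsed.machine, parsed.lane)  # HWUSI-EAS100R 6
```

Newer instruments use a longer name with more fields, so the pattern would need adjusting for those.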

The archival BAM that I have only has reads belonging to one sample (I am not too sure about the library, but I guess it should be only one library as well).

Reply modified 7.2 years ago • written 7.2 years ago by Kevin (630)

Your information about FASTQ template names is correct, but this information has to carry over into your BAM file. My concern is that, since you don't have the original FASTQ files, you don't have any information about which sequencing lane your read came from.

Reply written 7.2 years ago by Matt Shirley (9.1k)
7.2 years ago, Sean Davis (25k), National Institutes of Health, Bethesda, MD, wrote:

The read group information (sample, library, and ID) is external to the FASTQ file. The connection between lane/run/machine and read group therefore needs to be provided by some other source, typically a spreadsheet or database that describes the flow of information from the original material through the sequencing process. Once you know the relationship between each FASTQ or BAM file and its read group, you can either add that to the BAM file using something like the Picard tools, or supply it to an appropriate aligner up front.
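As a rough illustration of that lookup, the Python sketch below turns a (machine/run, lane) pair plus an external sample sheet into an `@RG` header line with the standard SAM fields ID, SM, LB, PL and PU. The sample sheet contents, the `ID`/`PU` naming scheme and the function name `read_group_line` are all assumptions made up for this example; the real values have to come from the sequencing provider's records.

```python
from typing import Dict, Tuple

# Hypothetical sample sheet: (machine or machine_run, lane) -> sample/library.
# In practice this would be loaded from a spreadsheet or database export.
SAMPLE_SHEET: Dict[Tuple[str, int], Dict[str, str]] = {
    ("HWUSI-EAS100R", 6): {"sample": "sampleA", "library": "libA"},
    ("HWUSI-EAS100R", 7): {"sample": "sampleA", "library": "libA"},
}


def read_group_line(
    machine_run: str,
    lane: int,
    sheet: Dict[Tuple[str, int], Dict[str, str]],
    platform: str = "ILLUMINA",
) -> str:
    """Build a tab-separated @RG header line for one run/lane."""
    meta = sheet[(machine_run, lane)]  # KeyError if the lane is not described
    rg_id = f"{machine_run}.{lane}"    # assumed convention: one read group per lane
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"SM:{meta['sample']}",
        f"LB:{meta['library']}",
        f"PL:{platform}",
        f"PU:{rg_id}",
    ]
    return "\t".join(fields)


print(read_group_line("HWUSI-EAS100R", 6, SAMPLE_SHEET))
```

The same fields can then be supplied to the aligner (for example through the `-R` option of `bwa mem`) or to Picard's AddOrReplaceReadGroups; check each tool's documentation for how it expects the tab characters to be written on the command line.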
