Might @PG headers affect downstream analyses, and how can I safely remove them from a BAM file?
2.2 years ago
olikidrod • 0

I've been generating BAM files by mapping reads with bwa, and had bwa add read groups during the mapping. As a consequence, the exact bwa command I ran, including the read groups I specified, is recorded in the @PG headers of all the BAM files.

Since then, I've used Picard to replace all of the read groups with new values. As a result, the read group information recorded in the @PG headers is now incorrect and could mislead other researchers if I publish the BAM files.

1) How can I safely remove these @PG headers from the BAM files? I figure I might as well just strip them all out if they contain incorrect data.

2) Is this necessary, assuming I don't publish the BAM files and I'm the only one with potential to be confused? Could @PG headers affect downstream analyses when it comes to variant calling etc.? I don't think GATK uses them at all, but I don't know if other pipelines or programs might incorporate that data.

Thank you!

bam @PG read groups • 1.1k views

Don't all Picard tools add @PG lines? I ran AddOrReplaceReadGroups and it added no @PG line, but MarkDuplicates did add one.

2.2 years ago
h.mon 32k

I don't think @PG is used by any downstream program; it is metadata that keeps track of how the file was created and modified. Multiple @PG lines are allowed, and it is possible that Picard added one for the operation you performed. Did you check?
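
If you do decide to strip them, here is a minimal sketch of one way to do it, not a tested pipeline. It assumes samtools is installed and on your PATH, and it uses `samtools view -H` to dump the header and `samtools reheader` to write it back. The function names `strip_pg_lines` and `reheader_without_pg` are hypothetical helpers made up for this example. Two caveats: if any reads carry a `PG:Z:` tag, removing the matching @PG records leaves those tags pointing at nothing, and newer samtools releases may append a @PG line of their own (check `samtools reheader` help for a `--no-PG` option).

```python
import subprocess
import tempfile


def strip_pg_lines(header_text: str) -> str:
    """Return SAM header text with every @PG record removed."""
    kept = [
        line
        for line in header_text.splitlines()
        # Header records are tab-separated; the record type is the first field.
        if not (line == "@PG" or line.startswith("@PG\t"))
    ]
    return "\n".join(kept) + "\n" if kept else ""


def reheader_without_pg(in_bam: str, out_bam: str) -> None:
    """Write a copy of in_bam whose header has no @PG records.

    Assumption: `samtools` is on PATH. The alignment records themselves
    are not modified, only the header is replaced.
    """
    # Dump the current header as text.
    header = subprocess.run(
        ["samtools", "view", "-H", in_bam],
        check=True, capture_output=True, text=True,
    ).stdout

    # samtools reheader reads the new header from a file.
    with tempfile.NamedTemporaryFile("w", suffix=".sam") as tmp:
        tmp.write(strip_pg_lines(header))
        tmp.flush()
        # reheader writes the re-headered BAM to stdout.
        with open(out_bam, "wb") as out:
            subprocess.run(
                ["samtools", "reheader", tmp.name, in_bam],
                check=True, stdout=out,
            )
```

Calling `reheader_without_pg("in.bam", "out.bam")` (placeholder file names) writes a new file and leaves the original untouched, so you can compare the two headers with `samtools view -H` before deleting anything.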


Many thanks for your quick reply. Yes, I checked, and Picard did not add one. I'll strip out the @PG lines to avoid confusing anyone later.