Malformed BAM file produced by BWA
1
0
Entering edit mode
7.4 years ago
joreamayarom ▴ 140

I'm aligning reads to a genome to later do some variant calling. I have mapped some reads to a genome using bwa and created the corresponding dictionary

bwa index -p ref.fa ref.fa
bwa mem ref.fa forward.fastq.gz reverse.fastq.gz -t 16 > mapped_file.sam
samtools view -bS mapped_file.sam -o mapped_file.bam
java -jar picard.jar CreateSequenceDictionary \
R=ref.fa \
O=ref.dict

samtools faidx ref.fa

And then proceeded to sort and remove duplicates

java -jar picard.jar SortSam \
I=mapped_file.bam \
O=sorted_file.bam \
SORT_ORDER=coordinate

java -jar picard.jar MarkDuplicates \
REMOVE_DUPLICATES=true \
ASSUME_SORT_ORDER=coordinate \
I=sorted_file.bam \
O=clean_file.bam \
M=metric_file.txt

And finally tried to realign the sequences using GATK, but the programmed complained that my input bam file is malformed.

java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R ref.fa \
    -I clean_file.bam \
    -o clean_file.intervals

I ran ValidateSamFile, as GATK recommends, and got this error

## HISTOGRAM    java.lang.String 
Error Type  Count 
ERROR:MISSING_READ_GROUP    1
WARNING:RECORD_MISSING_READ_GROUP   324404

To solve this problem I tried running AddOrReplaceReadGroups and FixMateInformation on my clean_file.bam, as the GATK webpage recommends (http://gatkforums.broadinstitute.org/gatk/discussion/7571/errors-in-sam-bam-files-can-be-diagnosed-with-validatesamfile), to no avail. I also ran ValidateSamFile directly in my Bra output and got a similar error message. Is there an explanation for this problem? Has my BWA output been corrupted in any stage?

GATK Samtools Picard • 3.4k views
ADD COMMENT
0
Entering edit mode

The RG tag has to be added manually, so it's not really a "corrupt" BAM... you just haven't added RG metadata and GATK needs that. AddOrReplaceReadGroups is the answer. Why didn't it, er, "avail"? :)

ADD REPLY
2
Entering edit mode
7.4 years ago

1) one step (don't save as SAM but bam, add your read groups and sort)

bwa mem -R '@RG\tID:S1\tSM:SAMPLE1'  ref.fa forward.fastq.gz reverse.fastq.gz -t 16 |  samtools sort -O BAM -o out.bam -T tmp -

2) when using the picard tools use the option

VALIDATION_STRINGENCY=LENIENT
ADD COMMENT
1
Entering edit mode

I'm checking GATK documentation right now. It is mentioned that the -M flag is essential for Picard compatibility. Is that missing from your answer or you have an specific reason to leave it out?

ADD REPLY
0
Entering edit mode

GATK/-M option ? what are you talking about ? VALIDATION_STRINGENCY is a general option for picard when handling bam.

ADD REPLY
0
Entering edit mode

I'm talking about recommendation 1. In GATK webpage (http://gatkforums.broadinstitute.org/gatk/discussion/2799) they recommend the following command with an -M before -R.

bwa mem -M -R ’<read group info>’ -p reference.fa raw_reads.fq > aligned_reads.sam

ADD REPLY
0
Entering edit mode

1) the option -M has nothing to do with your problem. Check the documentation

2) the order of the options doesn't matter.

ADD REPLY

Login before adding your answer.

Traffic: 2427 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6