Hi All,
First time posting to biostars, so please forgive any formatting errors. Please let me know so I can format correctly in the future.
I'm self taught at bioinf, and I've been given approximately 600 BAM files created by a old pipeline that seem to be riddled with errors. I no longer have access to the originating software and am attempting to perform a reanalysis of these files for a study.
Results from picard ValidateSamFile:
ERROR:CIGAR_MAPS_OFF_REFERENCE 8
ERROR:INVALID_FLAG_MATE_UNMAPPED 397806
ERROR:MISMATCH_MATE_ALIGNMENT_START 44
ERROR:MISMATCH_MATE_REF_INDEX 3
WARNING:MISSING_TAG_NM 7810030
Of the above errors, I'm not concerned with the TAG_NM, and if needed I can throw out the 55 reads in the smaller categories, but the ~400k reads with the invalid mate error would need to be fixed. Example of read with the 'Invalid Flag mate Unmapped' error
m01179:152:000000000-af0bb:1:2109:26166:15662_1:n:0:3 8 1 10954 60 150M * 0 0 ATG...
From this i can see that the sam flag value of 8 is the problem. I'm assuming the alignment software incorrectly handled paired reads where 1 read was removed due to poor quality. Results in "single" read with flag of mate unmapped. picard's FixMateInformation doesn't seem to be able to fix this error.
Since these are BAM files my next thought was to convert to SAM then use awk, sed or other tool to substitute '0' for the flag, but this led me to another error:
Results from samtools view:
[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375
Above read:
m01179:152:000000000-af0bb:1:1114:17516:21375 145 1 121485081 60 149M = 121485283 356 AGAAAGAAGAATTCTCAGTATCTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACACAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTAGAATAG * MD:Z:3C3T12A125A2 RG:Z:group1
I'm having trouble seeing the reason for this error. Cigar of 149M, sequence length is 149 and the MD:z equals 149, so what is these error referring to? Samtools won't finish converting the file while this error is present, picard can't seem to ignore the unmapped mate error, so I'm at a loss for how either convert or fix these files.
Thanks for any advice you can offer
If possible (towards computational resources) transform them back to fastq using
samtools fastq
and start from scratch. This is eventually more reliable that messing around with flawed files. Debugging a potentially flawed pipeline (or here its products) can be frustrating and is actually not worth the effort imho (given you can start again from scratch). The only thing you would need to make sure is that no reads were filtered out by this old pipeline. When transforming back to fastq (if this is paired-end data) make sure to randomize your BAM before transforming back to fastq as aligners such asbwa
assume random sort order.Your formatting is fine, still I suggest to also highlight data examples with the code option rather than quoting it. That improves readability. Did made the change now. Thanks for trying to make your post visually appealing :)
I did give it a try. Samtools errors out at the same point:
[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375 [bam2fq_mainloop] Failed to read bam record.
Yeah I was fearing it would do that. The thing is that every samtools command runs some basic validation actions. Essentially, you only need a way of printing the files in SAM format (so human readable version of BAM) to
stdout
and pipe this intoawk 'OFS="\n" {print $1, $10, "+", $11}'
as this is all you need to make a fastq file. The sequence and quality scores are stored in the BAM/SAM. Still, I am right now not sure which tool to use to print a flawed BAM, I will check and come back to you.I think you can try SamToFastq from Picard tools like:
picard SamToFastq I=foo.bam FASTQ=foo.fastq.gz VALIDATION_STRINGENCY=LENIENT
The VALIDATION_STRINGENCY argument will force the tool to ignore the CIGAR issue. This will give you a compressed fastq file.
That seems to have successfully made the fastq files. They appear much smaller than expected, and the 2 files are of significantly unequal sizes. I was hoping to retain the original alignment information, but I'll take whatever victory I can at this point. Now I just need to do that 599 more times!
Interesting though, I didn't see VALIDATION_STRINGENCY listed as part of the
SamToFastq
help section. I've used that option when usingCollectHsMetrics
to suppress errors, but it's listed as an option for that tool. Is this a standard picard option? Can I use that with other picard tools?It is possible that your files don't contain equal number of reads (and more importantly in the same order) in two files. You will want to investigate that before you go forward. You can use
repair.sh
from BBMap suite (How to use BBtools repair.sh on multiple files ) to repair the reorder and collect any singletons.Is it possible that you have two reads with the same name (since it is paired) and the other raises the error?
Also see if
bam2bed
followed by abed2bam
would work (bedtools
)It's possible. At first since I couldn't see the discrepancy I thought the read listed was just last in memory, but it's consistently that read which throws the error. Since I can't get past that error I can't see if the read is listed again later on.
Won't
bam2bed
remove sequence information? Ultimately I want to perform variant detection on these bams so I'll need the sequences intactThanks for the input