First time posting to Biostars, so please forgive any formatting errors. Please let me know so I can format correctly in the future.
I'm self-taught in bioinformatics, and I've been given approximately 600 BAM files created by an old pipeline that seem to be riddled with errors. I no longer have access to the originating software and am attempting to reanalyze these files for a study.
Results from picard ValidateSamFile:
Of the above errors, I'm not concerned with the TAG_NM, and if needed I can throw out the 55 reads in the smaller categories, but the ~400k reads with the invalid mate error would need to be fixed. Example of a read with the 'Invalid Flag mate Unmapped' error:
m01179:152:000000000-af0bb:1:2109:26166:15662_1:n:0:3 8 1 10954 60 150M * 0 0 ATG...
From this I can see that the SAM flag value of 8 is the problem: the mate-unmapped bit (0x8) is set without the paired bit (0x1). I'm assuming the alignment software incorrectly handled paired reads where one read was removed due to poor quality, which results in a "single" read with the mate-unmapped flag set. Picard's FixMateInformation doesn't seem to be able to fix this error.
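To make the flag problem concrete, here is a minimal Python sketch that decodes a SAM flag into its bits and reports whether mate-related bits are set on a read that is not marked as paired. The bit values come from the SAM specification; the function names `decode_flag` and `mate_bits_without_pairing` are made up for illustration.

```python
# SAM flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

# Bits that only make sense when the paired bit (0x1) is also set.
PAIR_ONLY = 0x2 | 0x8 | 0x20 | 0x40 | 0x80


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mate_bits_without_pairing(flag):
    """True if any pair-only bit is set while the paired bit is not."""
    return not (flag & 0x1) and bool(flag & PAIR_ONLY)


print(decode_flag(8), mate_bits_without_pairing(8))
print(decode_flag(145), mate_bits_without_pairing(145))
```

Flag 8 decodes to the mate-unmapped bit alone, which is the inconsistent combination, whereas a flag such as 145 carries the paired bit and is internally consistent.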
Since these are BAM files, my next thought was to convert them to SAM and then use awk, sed, or another tool to substitute '0' for the flag, but this led me to another error:
Results from samtools view:
[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375
m01179:152:000000000-af0bb:1:1114:17516:21375 145 1 121485081 60 149M = 121485283 356 AGAAAGAAGAATTCTCAGTATCTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACACAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTAGAATAG * MD:Z:3C3T12A125A2 RG:Z:group1
I'm having trouble seeing the reason for this error. The CIGAR is 149M, the sequence length is 149, and the MD:Z tag sums to 149, so what is this error referring to? Samtools won't finish converting the file while this error is present, and Picard can't seem to ignore the unmapped-mate error, so I'm at a loss for how to either convert or fix these files.
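The length arithmetic above can be checked mechanically. This is a minimal Python sketch, assuming tab-delimited SAM text (the record as pasted here is space-separated); the helper names `cigar_query_length`, `md_reference_length`, and `check_record` are hypothetical and not part of samtools or Picard.

```python
import re

# CIGAR operations that consume bases of the query (read) sequence.
QUERY_OPS = set("MIS=X")


def cigar_query_length(cigar):
    """Sum the lengths of CIGAR operations that consume the read sequence."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in QUERY_OPS)


def md_reference_length(md):
    """Reference bases covered by an MD:Z value (matches, mismatches, deletions)."""
    total = 0
    for num, deletion, mismatch in re.findall(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])", md):
        if num:
            total += int(num)        # run of matching bases
        elif deletion:
            total += len(deletion)   # deleted reference bases after '^'
        else:
            total += 1               # single mismatched base
    return total


def check_record(sam_line):
    """Return (CIGAR query length, SEQ length) for one tab-delimited SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar, seq = fields[5], fields[9]
    seq_len = None if seq == "*" else len(seq)
    return cigar_query_length(cigar), seq_len


print(cigar_query_length("149M"))
print(md_reference_length("3C3T12A125A2"))
```

For a record that is consistent in the text output, the two values returned by `check_record` should be equal.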
Thanks for any advice you can offer.