Question

how to fix INVALID_FLAG_MATE_UNMAPPED sam flag

0

Entering edit mode

4.3 years ago

CarnifexRex • 0

Hi All,

First time posting to biostars, so please forgive any formatting errors. Please let me know so I can format correctly in the future.

I'm self taught at bioinf, and I've been given approximately 600 BAM files created by a old pipeline that seem to be riddled with errors. I no longer have access to the originating software and am attempting to perform a reanalysis of these files for a study.

Results from picard ValidateSamFile:

ERROR:CIGAR_MAPS_OFF_REFERENCE 8

ERROR:INVALID_FLAG_MATE_UNMAPPED 397806

ERROR:MISMATCH_MATE_ALIGNMENT_START 44

ERROR:MISMATCH_MATE_REF_INDEX 3

WARNING:MISSING_TAG_NM 7810030

Of the above errors, I'm not concerned with the TAG_NM, and if needed I can throw out the 55 reads in the smaller categories, but the ~400k reads with the invalid mate error would need to be fixed. Example of read with the 'Invalid Flag mate Unmapped' error

m01179:152:000000000-af0bb:1:2109:26166:15662_1:n:0:3 8 1 10954 60 150M * 0 0 ATG...

From this i can see that the sam flag value of 8 is the problem. I'm assuming the alignment software incorrectly handled paired reads where 1 read was removed due to poor quality. Results in "single" read with flag of mate unmapped. picard's FixMateInformation doesn't seem to be able to fix this error.

Since these are BAM files my next thought was to convert to SAM then use awk, sed or other tool to substitute '0' for the flag, but this led me to another error:

Results from samtools view:

[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375

Above read:

m01179:152:000000000-af0bb:1:1114:17516:21375   145 1   121485081   60  149M    =   121485283   356 AGAAAGAAGAATTCTCAGTATCTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACACAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTAGAATAG   *   MD:Z:3C3T12A125A2   RG:Z:group1

I'm having trouble seeing the reason for this error. Cigar of 149M, sequence length is 149 and the MD:z equals 149, so what is these error referring to? Samtools won't finish converting the file while this error is present, picard can't seem to ignore the unmapped mate error, so I'm at a loss for how either convert or fix these files.

Thanks for any advice you can offer

software error samtools sam/bam • 2.4k views

ADD COMMENT • link updated 4.3 years ago by Istvan Albert 100k • written 4.3 years ago by CarnifexRex • 0

0

Entering edit mode

If possible (towards computational resources) transform them back to fastq using samtools fastq and start from scratch. This is eventually more reliable that messing around with flawed files. Debugging a potentially flawed pipeline (or here its products) can be frustrating and is actually not worth the effort imho (given you can start again from scratch). The only thing you would need to make sure is that no reads were filtered out by this old pipeline. When transforming back to fastq (if this is paired-end data) make sure to randomize your BAM before transforming back to fastq as aligners such as bwa assume random sort order.

Your formatting is fine, still I suggest to also highlight data examples with the code option rather than quoting it. That improves readability. Did made the change now. Thanks for trying to make your post visually appealing :)

ADD REPLY • link 4.3 years ago by ATpoint 82k

0

Entering edit mode

I did give it a try. Samtools errors out at the same point:

[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375 [bam2fq_mainloop] Failed to read bam record.

ADD REPLY • link 4.3 years ago by CarnifexRex • 0

0

Entering edit mode

Yeah I was fearing it would do that. The thing is that every samtools command runs some basic validation actions. Essentially, you only need a way of printing the files in SAM format (so human readable version of BAM) to stdout and pipe this into awk 'OFS="\n" {print $1, $10, "+", $11}' as this is all you need to make a fastq file. The sequence and quality scores are stored in the BAM/SAM. Still, I am right now not sure which tool to use to print a flawed BAM, I will check and come back to you.

ADD REPLY • link 4.3 years ago by ATpoint 82k

1

Entering edit mode

I think you can try SamToFastq from Picard tools like:

picard SamToFastq I=foo.bam FASTQ=foo.fastq.gz VALIDATION_STRINGENCY=LENIENT

The VALIDATION_STRINGENCY argument will force the tool to ignore the CIGAR issue. This will give you a compressed fastq file.

ADD REPLY • link 4.3 years ago by ATpoint 82k

0

Entering edit mode

That seems to have successfully made the fastq files. They appear much smaller than expected, and the 2 files are of significantly unequal sizes. I was hoping to retain the original alignment information, but I'll take whatever victory I can at this point. Now I just need to do that 599 more times!

Interesting though, I didn't see VALIDATION_STRINGENCY listed as part of the SamToFastq help section. I've used that option when using CollectHsMetrics to suppress errors, but it's listed as an option for that tool. Is this a standard picard option? Can I use that with other picard tools?

ADD REPLY • link 4.3 years ago by CarnifexRex • 0

0

Entering edit mode

They appear much smaller than expected, and the 2 files are of significantly unequal sizes.

It is possible that your files don't contain equal number of reads (and more importantly in the same order) in two files. You will want to investigate that before you go forward. You can use repair.sh from BBMap suite (How to use BBtools repair.sh on multiple files ) to repair the reorder and collect any singletons.

ADD REPLY • link 4.3 years ago by GenoMax 141k

0

Entering edit mode

Is it possible that you have two reads with the same name (since it is paired) and the other raises the error?

Also see if bam2bed followed by a bed2bam would work (bedtools)

ADD REPLY • link 4.3 years ago by Istvan Albert 100k

0

Entering edit mode

It's possible. At first since I couldn't see the discrepancy I thought the read listed was just last in memory, but it's consistently that read which throws the error. Since I can't get past that error I can't see if the read is listed again later on.

Won't bam2bed remove sequence information? Ultimately I want to perform variant detection on these bams so I'll need the sequences intact

Thanks for the input

ADD REPLY • link 4.3 years ago by CarnifexRex • 0

score 1 · Answer 1 · 2020-01-13

1

Entering edit mode

4.3 years ago

Istvan Albert 100k

I think the right solution is probably to read your files with a programming interface like Pysam and work out the issues that way.

https://pysam.readthedocs.io/en/latest/api.html

ADD COMMENT • link 4.3 years ago by Istvan Albert 100k

0

Entering edit mode

This is great. I've struggled with this for 2 days and within 30 minutes of posting here I have 3 workable ideas. Thanks to all for great and expedient suggestions.

I've never worked with pysam before, but I'm always looking for new tools. Guess I'll get to reading...

ADD REPLY • link 4.3 years ago by CarnifexRex • 0