Question: how to fix INVALID_FLAG_MATE_UNMAPPED sam flag
0
gravatar for CarnifexRex
8 months ago by
Dallas, TX
CarnifexRex0 wrote:

Hi All,

First time posting to biostars, so please forgive any formatting errors. Please let me know so I can format correctly in the future.

I'm self taught at bioinf, and I've been given approximately 600 BAM files created by a old pipeline that seem to be riddled with errors. I no longer have access to the originating software and am attempting to perform a reanalysis of these files for a study.

Results from picard ValidateSamFile:

ERROR:CIGAR_MAPS_OFF_REFERENCE 8

ERROR:INVALID_FLAG_MATE_UNMAPPED 397806

ERROR:MISMATCH_MATE_ALIGNMENT_START 44

ERROR:MISMATCH_MATE_REF_INDEX 3

WARNING:MISSING_TAG_NM 7810030

Of the above errors, I'm not concerned with the TAG_NM, and if needed I can throw out the 55 reads in the smaller categories, but the ~400k reads with the invalid mate error would need to be fixed. Example of read with the 'Invalid Flag mate Unmapped' error

m01179:152:000000000-af0bb:1:2109:26166:15662_1:n:0:3 8 1 10954 60 150M * 0 0 ATG...

From this i can see that the sam flag value of 8 is the problem. I'm assuming the alignment software incorrectly handled paired reads where 1 read was removed due to poor quality. Results in "single" read with flag of mate unmapped. picard's FixMateInformation doesn't seem to be able to fix this error.

Since these are BAM files my next thought was to convert to SAM then use awk, sed or other tool to substitute '0' for the flag, but this led me to another error:

Results from samtools view:

[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375

Above read:

m01179:152:000000000-af0bb:1:1114:17516:21375   145 1   121485081   60  149M    =   121485283   356 AGAAAGAAGAATTCTCAGTATCTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACACAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTAGAATAG   *   MD:Z:3C3T12A125A2   RG:Z:group1

I'm having trouble seeing the reason for this error. Cigar of 149M, sequence length is 149 and the MD:z equals 149, so what is these error referring to? Samtools won't finish converting the file while this error is present, picard can't seem to ignore the unmapped mate error, so I'm at a loss for how either convert or fix these files.

Thanks for any advice you can offer

ADD COMMENTlink modified 8 months ago by Istvan Albert ♦♦ 84k • written 8 months ago by CarnifexRex0

If possible (towards computational resources) transform them back to fastq using samtools fastq and start from scratch. This is eventually more reliable that messing around with flawed files. Debugging a potentially flawed pipeline (or here its products) can be frustrating and is actually not worth the effort imho (given you can start again from scratch). The only thing you would need to make sure is that no reads were filtered out by this old pipeline. When transforming back to fastq (if this is paired-end data) make sure to randomize your BAM before transforming back to fastq as aligners such as bwa assume random sort order.

Your formatting is fine, still I suggest to also highlight data examples with the code option rather than quoting it. That improves readability. Did made the change now. Thanks for trying to make your post visually appealing :)

ADD REPLYlink modified 8 months ago • written 8 months ago by ATpoint38k

I did give it a try. Samtools errors out at the same point:

[E::bam_read1] CIGAR and query sequence lengths differ for m01179:152:000000000-af0bb:1:1114:17516:21375 [bam2fq_mainloop] Failed to read bam record.

ADD REPLYlink written 8 months ago by CarnifexRex0

Yeah I was fearing it would do that. The thing is that every samtools command runs some basic validation actions. Essentially, you only need a way of printing the files in SAM format (so human readable version of BAM) to stdout and pipe this into awk 'OFS="\n" {print $1, $10, "+", $11}' as this is all you need to make a fastq file. The sequence and quality scores are stored in the BAM/SAM. Still, I am right now not sure which tool to use to print a flawed BAM, I will check and come back to you.

ADD REPLYlink written 8 months ago by ATpoint38k
1

I think you can try SamToFastq from Picard tools like:

picard SamToFastq I=foo.bam FASTQ=foo.fastq.gz VALIDATION_STRINGENCY=LENIENT

The VALIDATION_STRINGENCY argument will force the tool to ignore the CIGAR issue. This will give you a compressed fastq file.

ADD REPLYlink modified 8 months ago • written 8 months ago by ATpoint38k

That seems to have successfully made the fastq files. They appear much smaller than expected, and the 2 files are of significantly unequal sizes. I was hoping to retain the original alignment information, but I'll take whatever victory I can at this point. Now I just need to do that 599 more times!

Interesting though, I didn't see VALIDATION_STRINGENCY listed as part of the SamToFastq help section. I've used that option when using CollectHsMetrics to suppress errors, but it's listed as an option for that tool. Is this a standard picard option? Can I use that with other picard tools?

ADD REPLYlink written 8 months ago by CarnifexRex0

They appear much smaller than expected, and the 2 files are of significantly unequal sizes.

It is possible that your files don't contain equal number of reads (and more importantly in the same order) in two files. You will want to investigate that before you go forward. You can use repair.sh from BBMap suite (How to use BBtools repair.sh on multiple files ) to repair the reorder and collect any singletons.

ADD REPLYlink written 8 months ago by genomax89k

Is it possible that you have two reads with the same name (since it is paired) and the other raises the error?

Also see if bam2bed followed by a bed2bam would work (bedtools)

ADD REPLYlink written 8 months ago by Istvan Albert ♦♦ 84k

It's possible. At first since I couldn't see the discrepancy I thought the read listed was just last in memory, but it's consistently that read which throws the error. Since I can't get past that error I can't see if the read is listed again later on.

Won't bam2bed remove sequence information? Ultimately I want to perform variant detection on these bams so I'll need the sequences intact

Thanks for the input

ADD REPLYlink written 8 months ago by CarnifexRex0
1
gravatar for Istvan Albert
8 months ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

I think the right solution is probably to read your files with a programming interface like Pysam and work out the issues that way.

https://pysam.readthedocs.io/en/latest/api.html

ADD COMMENTlink written 8 months ago by Istvan Albert ♦♦ 84k

This is great. I've struggled with this for 2 days and within 30 minutes of posting here I have 3 workable ideas. Thanks to all for great and expedient suggestions.

I've never worked with pysam before, but I'm always looking for new tools. Guess I'll get to reading...

ADD REPLYlink written 8 months ago by CarnifexRex0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2022 users visited in the last hour