Question: CIGAR and query sequence are of different length when trying to convert from sam to bam?
0
gravatar for joreamayarom
3.2 years ago by
joreamayarom120
USA/Cambridge
joreamayarom120 wrote:

I am trying to convert a sam file to bam using the line of code described here, but samtools is throwing the following error at me:

samtools view -bT reference.fasta sequences.sam > sequences.bam

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file.

The offending line looks like this:

SRR808297.2571281       99      gi|309056|gb|L20934.1|MSQMTCG   747     80      101M    =       790     142     TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT      @@CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC@=?CACC>CCC      NM:i:2  MD:Z:98A1A

I have seen this error reported elsewhere, it seems to arise from a bug in samtools and solutions don't seem to be universal. In my case, I am not sure which is my source of error, but I have a likely culprit: This sequence seems to be two bases shorter than the rest of the sequences with the same CIGAR. Could this be the error? Is that is so, how can I solve it? Bear in mind that my file contains paired end reads. Furthermore, I have several more files producing the same error and have to automatize a solution in a fast an efficient script.

sam samtools bam • 3.4k views
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by joreamayarom120
1

the tool/software that produced your sam file "sequences.sam" is bugged.

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum124k

The read says '101M' which is cigar-slang for 101bp can be mapped to the genome. The read's DNA however is only 98bp long. This means the read is telling samtools two different things, so it raised an error. I feel bad that you look at all the other reads to figure out why this one was erroneous :(

ADD REPLYlink written 3.2 years ago by John12k
1
gravatar for harold.smith.tarheel
3.2 years ago by
United States
harold.smith.tarheel4.5k wrote:

Picard (available here) command ValidateSamFile can be used to identify problematic SAM files. A discussion of possible errors and repairs can be found here.

ADD COMMENTlink written 3.2 years ago by harold.smith.tarheel4.5k
1

ValidateSamFile detects the erros, but there is little info in your link on how to solve this particular issue. John is right, the Cigar string is of different length than some of my reads. My guess now is that I will have to parse the file using Perl or Python and eliminate the lines in which the Cigar doesn't match the length of the sequence...

ADD REPLYlink written 3.2 years ago by joreamayarom120
2

You could certainly go down that path, but i'd recommend you find the culprit that caused the error in the first place, because the others might have the CIGAR/read length correct by chance, but be wrong for other reasons.

FYI this is why bioinformatics sometimes sucks - but dont worry when you've figured it out you'll be better for it :)

ADD REPLYlink written 3.2 years ago by John12k
1

Fixing the problem depends upon the types of errors observed and why they occurred. For example, if the only reported issue is short CIGAR strings due to alignment off the end of the reference, then you can use Picard's CleanSam to fix. If there is a variety of error types but the number of affected reads is small, you may decide to remove those reads (ValidateSamFile MODE=VERBOSE will return the error-containing read IDs, which you can use to filter).

Personally, I would try to figure out why the SAM file is corrupted and/or repeat the alignment (if you inherited/downloaded the SAM file, convert to FASTQ using Picard SamToFastq then map with your favorite aligner).

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by harold.smith.tarheel4.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1299 users visited in the last hour