What Does This Cigar Mean?
4
2
Entering edit mode
10.5 years ago
Gangcai ▴ 230

Dear All, Following are part of the output of tophat(v1.3.0):

DBV2SVN1_64_7_2105_14622_191104_0_2     256     I       4589    3       50M     *       0       0   AGTGTTAGTGTTGTTGTCTGTTGCTTGGTCAACATCATCTTGTCGTTTTT       HFHFHFHFHFHHHHHFHHFHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHE   NM:i:0  NH:i:2  CC:Z:II CP:i:4527909    HI:i:0
DBV2SVN1_64_7_2105_9755_2866_0_2        272     I       4600    1       3421396b        *       0   0GBGTGM=GGMGMRYMD==      dqj�Oc!ijd!�9I��i�


Does anybody known the meaning of CIGAR "3421396b"? Thanks.

samtools tophat cigar • 3.8k views
0
Entering edit mode

Looks to me like corrupted or truncated file. Are you sure it's correct? I.e. does it continue with more reads with sensible content?

5
Entering edit mode
10.5 years ago

It's a representation of positional alignment pattern of one sequence against another: the match/mismatch [M], deletion [D] and insertion [I] patterns of an alignment, as well as upstream and downstream padding, an other possibilities. See a table here:

[?]

A few simple examples: [?] Reference AAAAAAAAAA Query AAAAAAAAAA CIGAR MMMMMMMMMM or 10M

Reference AAAAAAAAAAT Query AAAAA-AAAAT CIGAR MMMMMDMMMMM or 5M1D5M

Reference AAAAA-AAAAT Query AAAAAAAAAAT CIGAR MMMMMIMMMMM or 5M1I5M [?]

(not sure what you G and E are in the example)

1
Entering edit mode

As far as I know your cigar operations should have a 1 in front of the D and I. It makes parsing much easier.

0
Entering edit mode

...and it's what samtools does, which also makes it a good idea. And the spec says: *|([0-9]+[MIDNSHPX=])+, so the number appears to be mandatory.

3
Entering edit mode
10.5 years ago

From the SAM v1.4 spec:

Bit 0x4 is the only reliable place to tell whether the fragment is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10 and 0x100 and the bit 0x20 of the next fragment in the template.

In other words, the don't even look at the CIGAR unless bit 0x4 is not set. If I read your alignments above correctly (a little tough with a mixture of white space and tabs), both of the reads you report are unmapped so the CIGAR strings are irrelevant.

0
Entering edit mode

Thanks for fixing the output in the original question. Brentp is right that there is a problem with that SAM line.

0
Entering edit mode

yeah, I wrapped the output in code tags so it would render more nicely.

0
Entering edit mode

Thanks again for making the original more easy to read.

0
Entering edit mode

Spec, schmec. BWA apparently reports mapq for unmapped reads (not sure about cigar) and this breaks Picard, which apparently does make assumptions for MAPQ.

0
Entering edit mode

Not for all mapped reads. bwa concatenates the reference multi-fasta before aligning, and if the case of a read that hangs off the edge of one reference contig onto another, well, it would be a little misleading for the aligner to say that it has no idea where that read goes, so it gives a map position, and a mapq, and sets the unmapped flag too, so you can see clearly that something isn't quite right. It's a feature, not a bug (meant mostly seriously)

2
Entering edit mode
10.5 years ago
brentp 23k

That is bad output. e.g.: what does "0GBGTGM=GGMGMRYMD==" in the sequence column mean?

Plus that cigar string in nonsensical.

I would check that read in your sequence files and make sure it looks normal. Then try rerunning, if you can reproduce this with just a few reads--including that one, I would contact the tophat developers.

0
Entering edit mode

I have checked the original reads, which has much shorter length than normal ones(18nt). @DBV2SVN1_64_7_2105_9755_2866_0_2 ACCGCGCAACAGACATCA + ggggggggggfgggeggg

Could this be the reason that tophat output such strange output?

0
Entering edit mode

probably, doesn't tophat try to split at 25bp?

0
Entering edit mode

If tophat produces that, regardless of input, I'd say it's a bug in tophat.

0
Entering edit mode
10.3 years ago
Marvin ▴ 850

Are you sure this is output from tophat? That second SAM line is obviously broken; I guess it's what samtools produces from a mangled BAM file.

Here's a wild guess: you ran samtools to make a BAM file, but then your hard disk became full. So you deleted a few files and everything seemed fine. Unfortunately, samtools has no error handling to speak of, so your BAM file is now missing data in the middle, and decoding it produces very weird effects.