CIGAR and query sequence lengths differ
1
0
Entering edit mode
4 months ago
ManuelDB ▴ 80

I am developing a program that softclips reads. When I run samtools view the_new_bam_created_with_my_softclipped_read.bam

I get this error message

 MN01972:51:000H5KYKL:1:11101:10749:1220 0       chr2    208248363       60      24S77M25S       *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC       AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF[E::bam_read1] CIGAR and query sequence lengths differ for MN01972:51:000H5KYKL:1:11101:10753:13456
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFF       MD:Z:126        RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:0  UQ:i:0  AS:i:126
MN01972:51:000H5KYKL:1:11101:10753:13456        0       chr2    208248339       60      111M2D13M       *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATCTTCTCTGAAGAC AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAF    MD:Z:111^GG13   RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:2  UQ:i:0       AS:i:116
samtools view: error reading file "softclipped.bam"

OK. When I check the read MN01972:51:000H5KYKL:1:11101:10753:13456 in the BAM file before done the softclipped I get an identical information

 MN01972:51:000H5KYKL:1:11101:10753:13456        0       chr2    208248339       60      111M2D13M       *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATCTTCTCTGAAGAC AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAF    MD:Z:111^GG13   RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:2  UQ:i:0       AS:i:116

I put both again but together to facilitate comparison

 Before
MN01972:51:000H5KYKL:1:11101:10753:13456        0       chr2    208248339       60      111M2D13M       *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATCTTCTCTGAAGAC AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAF    MD:Z:111^GG13   RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:2  UQ:i:0       AS:i:116
After
MN01972:51:000H5KYKL:1:11101:10753:13456        0       chr2    208248339       60      111M2D13M       *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATCTTCTCTGAAGAC AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAF    MD:Z:111^GG13   RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:2  UQ:i:0       AS:i:116

Why does Samtools report an error in the first but not in the second read if they are identical and that particular read is ignored by my program?

In case the error message is confusing and the read in which the CIGAR and query sequence length differ is actually in the read where the error message appears (this one MN01972:51:000H5KYKL:1:11101:10749:1220 which was softclipped!)

I have also compared these two reads too.

Before
MN01972:51:000H5KYKL:1:11101:10749:1220 0       chr2    208248339       60      126M    *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC       AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFF  MD:Z:126        RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:0  UQ:i:0  AS:i:126
 After
MN01972:51:000H5KYKL:1:11101:10749:1220 0       chr2    208248363       60      24S77M25S       *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC       AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFF       MD:Z:126        RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    NM:i:0  UQ:i:0  AS:i:126

To the best of my knowledge, the read soft clipped looks correct. What am I missing?

There is something I wonder...

When looking at reads softclipped with ampliconclip I have seen that the MD tag is removed.

 MN01972:49:000H5KYMY:1:11101:21047:5531 0       chr19   33302153        60      151M    *   0                               0       GCTGCCGGCTGTGCTGGAACAGGCCGGCCAGGAACTCGTCGTTGAAGGCGGCCGGGTCGATGTAGGCGCTGATGTCGATGGACGTCTCGTGCTCGCAGATGCCGCCCAGCGGCTCCGGGGCGGCAGGTGGGGCGGGAGGCTGCGCGGGGCC FFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFF/FAFF/AFFFFFFFF/FFF/FFFFFFFFFFFFFFFFFFFFAFAFF/F/FFFAFFFFFFFFFFFFFFFFF=   MD:Z:23T127                          RG:Z:230817_MN01972_0049_A000H5KYMY_RACP1-4poolv7anddirect   NM:i:1  UQ:i:14 AS:i:146


MN01972:49:000H5KYMY:1:11101:21047:5531 0       chr19   33302174        60      21S130M *       0       0       GCTGCCGGCTGTGCTGGAACAGGCCGGCCAGGAACTCGTCGTTGAAGGCGGCCGGGTCGATGTAGGCGCTGATGTCGATGGACGTCTCGTGCTCGCAGATGCCGCCCAGCGGCTCCGGGGCGGCAGGTGGGGCGGGAGGCTGCGCGGGGCC       FFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFF/FAFF/AFFFFFFFF/FFF/FFFFFFFFFFFFFFFFFFFFAFAFF/F/FFFAFFFFFFFFFFFFFFFFF=                                                           RG:Z:230817_MN01972_0049_A000H5KYMY_RACP1-4poolv7anddirect UQ:i:14 AS:i:146

Is this info involved in how the CIGAR is measured?? I doubt it becaouse this is optional.

pysam samtools • 488 views
ADD COMMENT
1
Entering edit mode
4 months ago

See my answer here

How to properly mock a (Pysam) read

make sure to compute the lengths correctly. It can be tricky.

ADD COMMENT
0
Entering edit mode

Hi Istvan Albert. I have seen your answer. Many thanks

Regarding the mock question. I have found a solution to that problem. I will put the answer.

Regarding this question, I don't see the relationship. In this one, it seems that the length of the CIGAR and the actual length of the reads (the sequence and the quality one) are not the same but all are 126 and comparing the one after soft-clipping and before, it seems identical apart from the CIGAR and the tag. I am work on this all day.

ADD REPLY
0
Entering edit mode

ok I have added a function that checks CIGAR and query sequence lengths and if disagreement is detected, stop the analysis. In that read something is done wrong

N01972:51:000H5KYKL:1:11101:10753:13456        0       #1      208248363       60      24S64M2D10S     *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATCTTCTCTGAAGAC     array('B', [32, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 32, 37])      [('MD', '111^GG13'), ('RG', '230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A'), ('NM', 2), ('UQ', 0), ('AS', 116)]
Length mismatch for read MN01972:51:000H5KYKL:1:11101:10753:13456: CIGAR length 98, actual length 124

I am still working on this (maybe this weekend or Monday). I will update you.

ADD REPLY

Login before adding your answer.

Traffic: 1774 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6