Two identical reads in my bam files but with a different in CIGAR are mapped in a different location in IGV
1
0
Entering edit mode
3 months ago
ManuelDB ▴ 80

I have extracted from a bam file one read and unclipped the CIGAR string. I have done this with this code

 import pysam


# Read IDs from file into a set
ids_to_extract = set()
with open("reads_names.txt", "r") as file:
    for line in file:
        ids_to_extract.add(line.strip())

# Open the original BAM file and a new file for output
infile = pysam.AlignmentFile("RACP1_4poolv7anddirect_A_FINAL_SORTED.bam", "rb")  # Use "rb" for reading a BAM file
outfile = pysam.AlignmentFile("unclipped_simulated_reads.bam", "wb", template=infile)

for read in infile:

    if read.query_name in ids_to_extract:
        # Get the original CIGAR tuples (operation, length)
        original_cigar_tuples = read.cigartuples

        if original_cigar_tuples:
            # Sum the lengths of match and soft-clip segments
            total_match_length = sum(length for operation, length in original_cigar_tuples if operation in [0, 4])  # 0 for 'M', 4 for 'S'

            # Other operations (insertions, deletions, etc.) are ignored in this sum
            new_cigar_tuples = [(0, total_match_length)]  # Replace with a single 'M' operation

            # Set the modified CIGAR tuples to the read
            read.cigartuples = new_cigar_tuples

        # Write the modified read to the output file
        outfile.write(read)

infile.close()
outfile.close()

These are the reads. First one is the original second is the extracted and unclipped read:

samtools view  RACP1_4poolv7anddirect_A_FINAL_SORTED.bam | grep "MN01972:51:000H5KYKL:1:11104:19018:10292"
MN01972:51:000H5KYKL:1:11104:19018:10292        0 chr2  208248363       60      24S102M *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC  AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFF    RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    UQ:i:0  AS:i:126
(RACP) [mdb1c20@cyan52 simulation_data]$ samtools view  unclipped_simulated_reads.bam | grep "MN01972:51:000H5KYKL:1:11104:19018:10292"
MN01972:51:000H5KYKL:1:11104:19018:10292        0 chr2  208248363       60      126M    *       0       0       CAAAATCACATTATTGCCAACATGACTTACTTGATCCCCATAAGCATGACGACCTATGATGATAGGTTTTACCCATCCACTCACAAGCCGGGGGATATTTTTGCAGATAATGGCTTCTCTGAAGAC  AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFF    RG:Z:230824_MN01972_0051_A000H5KYKL_RACP1_4poolv7anddirect_A    UQ:i:0  AS:i:126

What I was expecting is to extract the read as seen in the original bam but when I open both bam in IGV, they differ.

enter image description here

Why??

pysam • 234 views
ADD COMMENT
1
Entering edit mode
3 months ago
ManuelDB ▴ 80

OK, I have learned something today. This only happens in forward reads. That soft-clipped is not due to bases that don't match with the ref. genome. This soft-clipped is done by (Samtools) Softclip and in the forward ones, this modifies the start coordinate of the reads. I have seen this by looking at the bam before the soft-clip is done. and it has 126M and 20 bases upstream. Never thought that.

ADD COMMENT

Login before adding your answer.

Traffic: 2888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6