Issue softclipping reads when they belong and don't belong to a common amplicon
Entering edit mode
12 weeks ago
ManuelDB ▴ 80

I need to soft-clip the primers of my amplicon reads and I have the following problem

In scenario 1 I have forward and reverse reads in the same coordinates and I only want to soft-clip the first bases of each read (as shown in the example below) becaouse the reads belong to different amplicons.

enter image description here

In scenario 2 I have forward and reverse reads in the same coordinates and I want to soft-clip the first and the last bases of each read (as shown in the example below) becaouse the reads belong to the same amplicon.

enter image description here

In the second scenario, I am not soft-clipping correctly the reads.

I am using samtools ampliconclip and this is not versatile enough. If I use --both-end this will successfully soft-clip reads in scenario 2 BUT it will also soft-clip the regions of the reads in scenario 1 I don't want to soft-clip.

On the contrary, If I use --strand this will soft clip my reads in scenario 1 but in scenario 2 this will not do the job as shown in the second pic above.

I have tried to develop a Python script with Pysam to handle this (See code below but not finished yet) but Pysam as far as I know is not as optimisate as Ampliconclip. Consequently, reads with noise or real variants in which the CIGAR string is not simple (e.g. this is a simple CIGAR 120M this is not a simple CIGAR 3I117M) are not properly sotf-clipped. This is not for a single analysis, this is to put in production so I am looking for a robust solution and I am afraid that my Pysam approach may give me a lot of problems.

 import pysam

# Function to adjust CIGAR string for soft-clipping
def softclip_positive_read(read, primer_F_end, primer_R_start):
    Modify the CIGAR string of a forward read to soft-clip it at specific positions.
    Soft-clips the start of the read up to primer_F_end and the end of the read from primer_R_start.
    read: The read to be modified.
    primer_F_end: The position at which to start soft-clipping at the beginning of the read.
    primer_R_start: The start position of the second primer where the end clipping begins.
    start_clip_length = primer_F_end - read.reference_start
    end_clip_length = read.reference_end - primer_R_start

    # Check if start clipping is needed
    if start_clip_length > 0:
        read.reference_start += start_clip_length
        start_clip_length = 0

    new_cigar = []
    total_length = 0

    print(f"Before: {read.query_name} {read.reference_start}-{read.reference_end} {read.cigarstring}")

    for operation, length in read.cigartuples:
        total_length += length
        if operation == 0:  # 'M' operation
            if start_clip_length > 0:
                # Clip at the start
                clip_len = min(length, start_clip_length)
                new_cigar.append((4, clip_len))  # 'S' for soft-clipping
                length -= clip_len
                start_clip_length -= clip_len

            if length > 0 and end_clip_length > 0 and total_length > (read.reference_length - end_clip_length):
                # Clip at the end
                clip_len = min(length, end_clip_length)
                new_cigar.append((operation, length - clip_len))
                new_cigar.append((4, clip_len))
                end_clip_length -= clip_len
            elif length > 0:
                new_cigar.append((operation, length))

    read.cigartuples = new_cigar

    print(f"After: {read.query_name} {read.reference_start}-{read.reference_end} {read.cigarstring}")

# Load amplicon coordinates from BED file
amplicons = []
with open("baby1.bed", "r") as bed_file:
    for line in bed_file:
        chrom, start, end, name, score, strand, primer_F_end, primer_R_start = line.strip().split()
        amplicons.append((chrom, int(start), int(end), int(primer_F_end), int(primer_R_start), strand))

# Open BAM file for reading and writing
infile = pysam.AlignmentFile("./simulation_data/unclipped_simulated_reads.bam", "rb")
outfile = pysam.AlignmentFile("./softclipped.bam", "wb", template=infile)

# Process each read in the BAM file
for read in infile:
    # The flag 16 indicates that the read is aligned to the reverse strand of the reference.
    if read.is_reverse:
        continue  # Skip reverse reads for now

    # Else is forward
    for amplicon in amplicons:
        chrom, start, end, primer_F_end, primer_R_start, strand = amplicon
        # Start with + strand reads
        if read.reference_name == chrom and strand == "+":
            # Check if the read is within ±5 bases of the amplicon start
            if abs(read.reference_start - start) <= 5:
                # Soft-clip the read
                softclip_positive_read(read, primer_F_end, primer_R_start) 
                break  # Only process for the first matching amplicon


# Close input and output files
Samtools softclip • 368 views
Entering edit mode
12 weeks ago
aw7 ▴ 270

Well you already know the answer to the --both-ends vs --strand options.

The clipping code is not currently in htslib and so is not available to Pysam (I hope to make it so in future).

And, yes, the clipping code is annoyingly fiddly and hard to get right.


Login before adding your answer.

Traffic: 1599 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6