How to mark reads with specific CIGARs as QC fail
8 months ago
ManuelDB ▴ 80

Before describing the problem, I will give some context.

I am developing an amplicon NGS bioinformatics pipeline. I keep the primers on the reads for the alignment, then I use samtools ampliconclip to mask the primers, and finally I use Pisces to call the variants.

The problem is that, occasionally, reads come with a large soft-clip reaching into the middle of the read, and when both ends of a read are soft-clipped like this the whole read ends up soft-clipped. The CIGAR string for these reads is 150S and the read length is 150. Pisces is not happy with this and returns an error:

 System.Exception: RACP2-6poolv4_P5-A_FINAL_SORTED.bam: Error processing chr 'chr7': Failed to process variants for MN01972:49:000H5KYMY:1:11102:26356:2968 ... 150S ---> System.Exception: Failed to process variants for MN01972:49:000H5KYMY:1:11102:26356:2968 ... 150S
   at Pisces.Domain.Logic.CandidateVariantFinder.ProcessCigarOps(Read alignment, String refChromosome, Int32 readStartPosition, String chromosomeName)
   at Pisces.Logic.SomaticVariantCaller.Execute()
   at Pisces.Processing.Logic.BaseGenomeProcessor.ProcessByBam(BamWorkRequest workRequest, String chrName)
   --- End of inner exception stack trace ---.



8/19/23 7:21 PM 1  Exception reported:
System.Exception: RACP2-6poolv4_P5-A_FINAL_SORTED.bam: Error processing chr 'chrX': Failed to process variants for MN01972:49:000H5KYMY:1:22101:7666:3361 ... 151S ---> System.Exception: Failed to process variants for MN01972:49:000H5KYMY:1:22101:7666:3361 ... 151S
   at Pisces.Domain.Logic.CandidateVariantFinder.ProcessCigarOps(Read alignment, String refChromosome, Int32 readStartPosition, String chromosomeName)
   at Pisces.Logic.SomaticVariantCaller.Execute()
   at Pisces.Processing.Logic.BaseGenomeProcessor.ProcessByBam(BamWorkRequest workRequest, String chrName)
   --- End of inner exception stack trace ---.
8/19/23 7:21 PM 1  Exception reported:
System.Exception: Failed to process variants for MN01972:49:000H5KYMY:1:11102:26356:2968 ... 150S
   at Pisces.Processing.Logic.BaseProcessor.Execute(Int32 maxThreads)
   at Pisces.Program.ProgramExecution()
   at CommandLine.Application.BaseApplication`1.Execute().

Investigating this error, one of the two problematic reads is this:

  MN01972:49:000H5KYMY:1:22104:2296:19556  16  chr7  148817421  60  151S  *  0  0  TTCAAATTTACTGCCAATGAAGAACCAAAAACATGTCGGCTCATCACCACTTCACAACACAATCCCAAGCCGTTGAAAAATTGGCCTCAACATTTATTGTGTTTGACAACGTTTAATTGGATCTACAAAACCAAATGTAAGCACTGGTCAA  A//6///A/F///=FF/////F//////FF///F///F//F////F/F=////////FF///A///F6F//F/F/FF///FF/F//F/////FA//F/F///FF///A/F////F/F////F///FFFFFFFFFFF6FFF/FFFF/AAFFF  RG:Z:230817_MN01972_0049_A000H5KYMY_RACP2-6poolv4_P5-B  AS:i:30  XS:i:19

So I am wondering how to correctly manipulate the BAM file to mark reads with a soft-clip longer than a given length, for example, as QC fail. Approaches I have thought of are to have Pisces discard low mapping-quality reads, but reads like this have a high mapping quality. I have also looked at the samtools ampliconclip documentation, which has parameters to unmap reads or to filter out reads with no matching bases, but this doesn't help.

I could use something similar to this:

# remove reads with hard/soft clips in the CIGAR string (keep header lines)
samtools view -h sample.bam | awk '$1 ~ /^@/ || $6 !~ /[HS]/' | samtools view -bS - > sample.noclips.bam
# From https://www.biostars.org/p/137461/

I was wondering if there is a tool that does what I need.

ChatGPT says that there is no tool able to do this and recommended doing this:

samtools view -h sample.bam | awk '($1 ~ /^@/) || ($6 !~ /^[0-9]{3,}S/)' | samtools view -bS - > sample.filtered.bam
BAM PISCES

I would write a tool using pysam in Python and have it parse out the length of the clip. Grepping may work, but then you would lose short clips.
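
For example, something along these lines with pysam would do it. This is only a minimal sketch: the 100 bp cutoff and the file names are placeholders, and it sets the QC-fail flag (0x200) rather than dropping the reads.

import pysam

MAX_SOFTCLIP = 100      # assumed cutoff; tune it to your amplicon design
BAM_CSOFT_CLIP = 4      # CIGAR operation code for 'S' in the BAM spec

with pysam.AlignmentFile("sample.bam", "rb") as bam_in, \
     pysam.AlignmentFile("sample.qcfail_marked.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        cigar = read.cigartuples  # list of (operation, length); None if there is no CIGAR
        if cigar is not None:
            softclipped = sum(length for op, length in cigar if op == BAM_CSOFT_CLIP)
            if softclipped > MAX_SOFTCLIP:
                read.is_qcfail = True  # sets the 0x200 QC-fail bit
        bam_out.write(read)

The output keeps the original sort order, so it only needs re-indexing (samtools index sample.qcfail_marked.bam). Whether Pisces then skips QC-fail reads is an assumption on my part; if it does not, you can drop them before calling with samtools view -b -F 0x200.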
