I mapped RNA-seq reads with STAR (human, hg19). Checking the output BAM file, I found many reads whose CIGAR strings contain soft clips, such as “13S90M47S”, “88M6S” and “7S86M”. Many of these reads nevertheless have very good flags such as “99”, “147”, “83” or “163”, which indicate unique mapping.
My questions are:
Why are so many reads marked as soft-clipped? Is it related to the relatively low quality?
How do I get rid of these reads? I tried trimming the 3' ends of the reads by base quality using cutadapt with a quality cutoff of 20. However, it doesn't help.
Is it reasonable to use these uniquely mapped, soft-clipped reads to summarize gene counts?
The soft-clipped sequences could be either poor quality (so the wrong bases were called) or contaminating sequences (e.g. adapters or barcodes). It is also possible that the alignment to the reference just isn't good (i.e. for a given site, the actual sequence of your sample differs from the reference sequence; this is especially true for repetitive sequences and structural variation). Did you check the quality on a per-base level? And what does the FastQC output look like? Why don't you look at what the soft-clipped bases are: do they represent a particular sequence?
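To look at what the soft-clipped bases are, something along these lines might help. It is only a rough sketch in plain Python that works on SAM text lines (e.g. from `samtools view`); the function names, the `min_len` parameter and the example record are made up for illustration.

```python
import re
from collections import Counter

def soft_clipped_ends(cigar, seq):
    """Return the (left, right) soft-clipped sequences of one alignment.

    'left' and 'right' refer to SEQ as stored in the SAM record, which is
    reverse-complemented for reverse-strand alignments.
    """
    cigar = re.sub(r"\d+H", "", cigar)       # hard-clipped bases are not in SEQ
    left = re.match(r"(\d+)S", cigar)        # soft clip at the start of the CIGAR
    right = re.search(r"(\d+)S$", cigar)     # soft clip at the end of the CIGAR
    n_left = int(left.group(1)) if left else 0
    n_right = int(right.group(1)) if right else 0
    left_seq = seq[:n_left]
    right_seq = seq[len(seq) - n_right:] if n_right else ""
    return left_seq, right_seq

def tally_clips(sam_lines, min_len=8):
    """Count the min_len clipped bases closest to the aligned part, per side."""
    left_counts, right_counts = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):             # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        left, right = soft_clipped_ends(fields[5], fields[9])
        if len(left) >= min_len:
            left_counts[left[-min_len:]] += 1
        if len(right) >= min_len:
            right_counts[right[:min_len]] += 1
    return left_counts, right_counts

# Made-up record, only to show the call; real input would come from your BAM.
example = "r1\t99\tchr1\t100\t255\t4M8S\t=\t300\t250\tACGTAGATCGGA\tIIIIIIIIIIII"
left_counts, right_counts = tally_clips([example])
print(right_counts.most_common(5))
```

If one or two sequences dominate the counts, that would point towards adapters or barcodes rather than random low-quality bases.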
It seems that there's soft clipping appearing on both the 5' and 3' ends of reads.
If you want to get rid of all soft-clipped alignments, you could just go through the .bam and filter out the alignments whose CIGAR strings contain the soft-clip operation (S). But it's probably best to first figure out what those sequences actually are (manually inspect them and see if it's reasonable to conclude that they are misassigned). Only then can you answer whether it's reasonable to use those sequences for summarizing gene counts.
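The filtering itself could be sketched like this, again on SAM text lines in plain Python. The function names are hypothetical, and note one assumption: this drops individual records, so the mate of a removed read may be left behind without its partner.

```python
import re

SOFT_CLIP = re.compile(r"\d+S")

def has_soft_clip(cigar):
    """True if the CIGAR string contains at least one soft-clip (S) operation."""
    return bool(SOFT_CLIP.search(cigar))

def drop_soft_clipped(sam_lines):
    """Yield header lines and alignments whose CIGAR has no soft clip."""
    for line in sam_lines:
        if line.startswith("@"):             # keep the header untouched
            yield line
            continue
        cigar = line.split("\t")[5]          # CIGAR is the 6th SAM column
        if not has_soft_clip(cigar):
            yield line
```

In practice the input would be the text output of `samtools view -h`, and the result could be piped back into `samtools view -b` to get a BAM again.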
Actually, almost half of the reads are marked as soft-clipped. I would lose half of my data if I did not use these reads to summarize gene counts, which I can hardly afford. The quality is not very good, especially at the 3' ends of the reads. However, trimming the ends with a quality cutoff of 20 using cutadapt didn't change much. I am also curious about the soft clipping appearing at both the 5' and 3' ends of reads.
Do you have any suggestions for getting rid of these soft clips with some kind of trimming method?
With default settings, the STAR aligner prefers to soft-clip reads rather than assign mismatches at the ends. There is a setting to enforce end-to-end mapping (--alignEndsType EndToEnd).
I also often see soft-clipped reads and use them as reported, since the soft-clipped part does not participate in the alignment.
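To make that concrete, here is a small sketch that splits a CIGAR string into bases matched to the reference and bases that were soft-clipped. The function name is just for illustration; the CIGARs are the ones from the question.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_and_clipped(cigar):
    """Return (bases matched to the reference, soft-clipped bases)."""
    aligned = clipped = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in "M=X":                      # bases placed against the reference
            aligned += int(length)
        elif op == "S":                      # bases present in SEQ but not aligned
            clipped += int(length)
    return aligned, clipped

for cigar in ("13S90M47S", "88M6S", "7S86M"):
    print(cigar, aligned_and_clipped(cigar))
```

For “13S90M47S” this gives 90 aligned and 60 clipped bases, so only the 90 aligned bases determine where the read sits on the genome.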
I have tried mapping with
--alignEndsType EndToEnd
and it ended up dramatically decreasing the mapping percentage.