I mapped human RNA-seq reads to hg19 using STAR. Checking the output BAM file, I found many reads whose CIGAR strings contain soft-clipping, such as “13S90M47S”, “88M6S” and “7S86M”. Many of these reads also have good flags such as “99”, “147”, “83” or “163”, which I take to indicate unique mapping.
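To make the clipping in these CIGAR strings concrete, here is a minimal Python sketch that counts the soft-clipped bases at each end of a read. The helper name `soft_clip_summary` is only illustrative (it is not part of STAR or any library), and it assumes the CIGAR string follows the SAM specification's operation codes.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_summary(cigar):
    """Return (left_clip, right_clip, read_length) for a CIGAR string.

    Hypothetical helper for illustration only.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips, so ignore them here.
    core = [(n, op) for n, op in ops if op != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    # M, I, S, = and X consume bases of the read sequence.
    read_len = sum(n for n, op in core if op in "MIS=X")
    return left, right, read_len

for cigar in ("13S90M47S", "88M6S", "7S86M"):
    print(cigar, soft_clip_summary(cigar))
```

For “13S90M47S” this reports 13 clipped bases on the left and 47 on the right out of 150, so 60 of the 150 bases are not part of the alignment.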
My questions are:
Why are so many reads marked as soft-clipped? Is it related to relatively low base quality?
How do I get rid of these reads? I tried trimming the 3′ ends of the reads by base quality with cutadapt, using a quality cutoff of 20, but it did not help.
Is it reasonable to use these uniquely mapped, soft-clipped reads when summarizing gene counts?