Dear all, I would need some help to solve a problem with CIRI2 running on ribo- total RNA paired-end data that I am analysing.
I have worked with CIRI2 pipeline in the past without any problem, but my new RNA-seq data have some intrinsic problematics: the library that was used for sequencing was intended to perform 75 bp paired-end seq, but the sequencing that was actually performed is paired-end 150 bp long. This led to getting fragments in my library which were shorter than sequencing length. Indeed, length distribution of my reads varies from 75 bp to a maximum of 150 bp, and most of my R1 and R2 reads are overlapping.
I solved this problem when performing linear RNA alignment with STAR by allowing overlapping of reads, and quantification gave me no problems.
I then analyzed circRNAs with CIRI standard pipeline, starting from BWA-MEM alignment (with T -19 option) and then running CIRI2 quantification. In the end of my analysis, I compared counts of each replicate for each condition (I have 3 biological replicates per condition) and realized that my circ reads count was vary variable between replicates, with highly expressed circRNAs having many reads in some replicates and 0 counts in others of the same condition. This "extreme" case does not happen with a specific replicate (ex. n1, 2 or 3) but seems to be stochastic depending on the circRNA under analysis.
I checked whether circRNAs having 0 counts were actually present in the sequenced samples by looking for the BSJ reads in the fastq file and in the SAM alignment. These reads are present (at a level comparable with the replicates where I get counts for that specific circRNA) and have high quality. Indeed, if I run CIRI2 with no stringency option (-0) or with low mapping quality option (-U 5 instead of 10 as set by default) the situation does not change. I do not understand why such reads are not counted by CIRI2.
I figured that the variable read length of my reads might be the problem, so I tried trimming the reads to different lengths (75 bp, 100 bp and 120 bp), but this did not solve the problem since I get some circRNAs "back" but not others (also among the highly expressed ones). I also tried running CIRI2 with single core, thinking that with multicore mode there could be some problems in the initial SAM file splitting, with R1 and R2 reads getting in different SAM tmp files. Unfortunately, this also did not change the situation.
I know that the problem is related to BWA-MEM or CIRI2 working on the paired-end data, since if I try analysing R1 and R2 separately - as if they were single-ended sequencing - I have counts for all my circRNAs.
Could you help me figuring out why CIRI2 is not able to count some BSJ reads in my seq? Do you know whether I can change some options in BWA-MEM or in the CIRI2 script in order so solve my problem with overlapping reads?
Any help or suggestion would be much appreciated!
Thanks, Dario