Question: GC content biases in RNA-seq data of A. thaliana
I'm currently working on RNA-seq data from A. thaliana, and I have questions about quality and GC content. I guess it's normal to have a higher GC% in RNA-seq data than in the genome itself, since coding sequences usually show a bias toward GC. However, the A. thaliana genome has a GC content of 36% and my samples go up to 51-53%. Isn't that a bit too much?
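For anyone who wants to double-check the per-sample GC% outside FastQC, here is a minimal Python sketch. It assumes plain 4-line FASTQ records and pools G/C counts over all reads, ignoring N and other ambiguous bases. The function name `fastq_gc_percent` and the two toy reads are made up for illustration.

```python
import io


def fastq_gc_percent(handle):
    """Pooled GC% over all reads in a 4-line-per-record FASTQ stream.

    Only A, C, G and T are counted, so N and other ambiguity codes
    do not affect the result.
    """
    gc = 0
    total = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            seq = line.strip().upper()
            gc += seq.count("G") + seq.count("C")
            total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else 0.0


# Two made-up reads, only to show how the function is called.
toy_fastq = io.StringIO("@r1\nGGCCAT\n+\nIIIIII\n@r2\nATATGC\n+\nIIIIII\n")
print(fastq_gc_percent(toy_fastq))
```

For a real file, something like `gzip.open(path, "rt")` can be passed in as the handle.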

I'm wondering because, although the sequencing quality looked OK in the FastQC reports, I have a very low mapping rate: only 10-20% of reads map. Only one sample maps at over 60%, and that one has a GC content of 44%.

I tried mapping with bowtie2 and subread-align, both with default parameters (meaning 0 mismatches allowed in the seed for bowtie2 and up to 3 mismatches per alignment for subread-align).

I'm a bit confused here. Does anyone have an idea?


I tried aligning to the TAIR10 assembly instead of Araport11, and now I get a mapping rate of >90% for every sample! I'm still confused, but at least it works...

Is it paired-end data? If yes, you could try aligning the two mates separately as single-end data. If the alignment rate then looks reasonable, try increasing the maximum fragment size for the paired-end run (see the advanced parameters of the aligners). Furthermore, I would recommend the STAR aligner. I don't know much about subread-align, but bowtie2 was designed for DNA data and is not splice-aware.
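As a rough sketch of the two runs suggested above, the Python helpers below only build bowtie2 command lines. They use the standard bowtie2 options `-x` (index prefix), `-U` (unpaired reads), `-1`/`-2` (mates), `-X` (maximum fragment length, 500 by default), `-p` (threads) and `-S` (SAM output). The helper names, the index prefix, the file names and the value 1000 are placeholders, not values from this thread.

```python
import shlex


def bowtie2_single_end_cmd(index_prefix, fastq, out_sam, threads=4):
    """Command to align one mate file on its own, as unpaired reads."""
    return ["bowtie2", "-p", str(threads), "-x", index_prefix,
            "-U", fastq, "-S", out_sam]


def bowtie2_paired_cmd(index_prefix, fastq1, fastq2, out_sam,
                       max_fragment=1000, threads=4):
    """Paired-end command with a larger maximum fragment length (-X)."""
    return ["bowtie2", "-p", str(threads), "-X", str(max_fragment),
            "-x", index_prefix, "-1", fastq1, "-2", fastq2, "-S", out_sam]


# Hypothetical file names, for illustration only.
for mate in ("sample_R1.fastq.gz", "sample_R2.fastq.gz"):
    cmd = bowtie2_single_end_cmd("genome_index", mate, mate + ".sam")
    print(shlex.join(cmd))

print(shlex.join(bowtie2_paired_cmd(
    "genome_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample.sam")))
```

With bowtie2 installed, each list can be passed to `subprocess.run(cmd, check=True)`. If both mates align well on their own but the pair does not, the fragment-size limit or the mate orientation is a likely suspect.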

Assuming the numbers were computed correctly, keep in mind that 36% and 52% are ratios taken over different sets of sequence. Your transcriptome is certainly smaller than the genome, and as you said, it has a higher GC content, so the two percentages don't have to agree. It's quite plausible that nothing is wrong.
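To make the ratio argument concrete, here is a small Python sketch of the length-weighted average. The compartment sizes and GC values are invented so that the total lands at 36%. They are not measured A. thaliana numbers, and `mixture_gc` is just an illustrative helper.

```python
def mixture_gc(parts):
    """Length-weighted GC% of a sequence made of several compartments.

    `parts` is a list of (relative_length, gc_percent) pairs.
    """
    total_length = sum(length for length, _ in parts)
    return sum(length * gc for length, gc in parts) / total_length


# Invented numbers: a GC-richer transcribed quarter, a GC-poorer remainder.
transcribed = (0.25, 45.0)
remainder = (0.75, 33.0)

genome_gc = mixture_gc([transcribed, remainder])
print(f"whole genome: {genome_gc:.1f}% GC, transcribed part alone: {transcribed[1]:.1f}% GC")
```

The point is only that a small, GC-rich subset can sit well above the genome-wide average without any contradiction. Whether 51-53% is plausible for a given library is a separate question that this arithmetic does not answer.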

