Hi all,
I am analysing a large pig RNA-seq dataset of over 200 samples across 5 tissues. According to NCBI, the GC content of the pig genome is 42%. I understand RNA-seq data should return higher GC content than whole-genome data.
During testing of the pipeline on a few random samples, the FastQC graphs showed a normal distribution peaking around 50%, with 1 sample shifted slightly left at around 47%, and the only over-represented sequences returned were adapters.
So I have 2 questions:
1) What should I expect the RNA-seq GC content to be? My guess is ~47% (genome GC + 5%). Should I be concerned if all samples return a GC content of 50%?
2) If a small number of samples show a GC content a few percent lower than the rest, should they be removed from the analysis? How should that be handled? Or is a few % nothing to be concerned about?
Thanks in advance,
Kenneth
My recommendation is to ignore meaningless metrics such as GC content and focus on relevant QC. That is, mapping rate and how samples look downstream, e.g. in a PCA to assess group separation.
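To make the mapping-rate check concrete, here is a minimal sketch in Python. It assumes you have already collected one mapping rate per sample from your aligner's logs. The sample names, the rates, and the cutoff of 3 median absolute deviations are all made up for illustration, not a recommended threshold.

```python
from statistics import median


def flag_low_mapping(rates, n_mads=3.0):
    """Return names of samples whose mapping rate sits more than
    n_mads median absolute deviations (MADs) below the median.

    rates: dict of sample name -> mapping rate (fraction between 0 and 1).
    n_mads: illustrative cutoff, not an established standard.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All samples (or most of them) are identical; nothing to flag.
        return []
    return sorted(name for name, v in rates.items() if (med - v) / mad > n_mads)


# Hypothetical per-sample mapping rates (invented values).
rates = {
    "liver_01": 0.93,
    "liver_02": 0.94,
    "muscle_01": 0.92,
    "muscle_02": 0.95,
    "brain_01": 0.61,
}

flagged = flag_low_mapping(rates)
print(flagged)
```

Anything flagged this way is a candidate for a closer look, not an automatic exclusion. Whether it also sits apart from its tissue group in the PCA is the more informative check.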
Appreciate it, thanks. This did feel rather pedantic.