Question

Small secondary peak for per sequence GC content - FASTQC results (bulk RNA-seq)

9 weeks ago

Fossil ▴ 20

Hi, I am new to bioinformatics and would love some help, please. We did bulk paired-end RNA-seq on Rattus norvegicus muscle tissue (48 files, N=24). An omics centre did the library prep and the sequencing for us on an MGI Tech platform. I ran FastQC and BLASTed the overrepresented sequences. I have read a lot on forums and in the FastQC resources and know not to take these results too seriously, but I would still like some guidance to make sure I understand them correctly before I proceed to STAR alignment and DE analysis. Most of the results seem good, but a few things caught my eye:

1) About 10 samples (N=10/24; 20/48 files) have overrepresented sequences that match R. norvegicus mitoRNA or mRNA. In these files there is always a small secondary peak on the per sequence GC content graph, plus a warning (Fig. 1). Qs: 1a) Is this of any concern? 1b) I assume these indicate highly expressed genes and I can ignore the warnings?

-Per seq GC content graph - **Fig. 1,** mitoRNA or mRNA
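For readers unsure what the per sequence GC content graph is plotting, here is a minimal sketch of the idea: compute the GC percentage of every read and count how many reads land at each value. A subpopulation of reads with a different base composition (for example a few very abundant transcripts) then shows up as a second bump. The function names and the toy reads are made up for illustration, and this does not reproduce FastQC's own implementation or its fitted theoretical curve.

```python
from collections import Counter
from io import StringIO


def read_fastq_sequences(handle):
    """Yield the sequence line (2nd line) of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def gc_percent(seq):
    """Return the GC percentage of one read, rounded to an integer."""
    seq = seq.upper()
    if not seq:
        return 0
    gc = seq.count("G") + seq.count("C")
    return round(100 * gc / len(seq))


def gc_histogram(handle):
    """Count how many reads fall at each integer GC percentage."""
    return Counter(gc_percent(s) for s in read_fastq_sequences(handle))


# Toy data (invented): three reads near 40-50% GC and two GC-rich reads.
toy_fastq = StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r3\nAAGTACGTAC\n+\nIIIIIIIIII\n"
    "@r4\nGCGCGCGCAT\n+\nIIIIIIIIII\n"
    "@r5\nGCGCGCGCAT\n+\nIIIIIIIIII\n"
)

hist = gc_histogram(toy_fastq)
for gc in sorted(hist):
    print(f"{gc:3d}% GC: {hist[gc]} reads")
```

On real data the handle would be an opened (possibly gzip-decompressed) FASTQ file; the two GC-rich toy reads play the role of the secondary peak.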

2) The two files for one sample (N=1/24) show a huge secondary peak (Fig. 2), and the ~25 overrepresented sequences all match R. norvegicus rRNA (each file contains ~32,722,582 sequences in total, and the overrepresented sequences make up ~5% of them). I am really not sure what to do here, as this happened in only one sample. Qs: 2a) Why is this occurring in only one sample? 2b) How should I proceed with this sample?

-Per seq GC content graph - **Fig. 2.** rRNA
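To put the numbers above in perspective, ~5% of ~32,722,582 sequences is roughly 1.6 million reads. The sketch below does that arithmetic and also shows, on invented toy reads, how one could tally which exact sequences exceed a chosen fraction of a file. The function name and the `min_fraction` parameter are hypothetical; the 0.1% default mirrors FastQC's documented warning threshold, but FastQC applies its own sampling and truncation rules, which this simplification does not reproduce.

```python
from collections import Counter


def overrepresented(seqs, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences at or above min_fraction."""
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Arithmetic from the post: about 5% of about 32,722,582 sequences.
total_reads = 32_722_582
estimated_rrna_reads = round(total_reads * 0.05)
print(f"~{estimated_rrna_reads:,} reads")

# Toy reads (invented) with a deliberately high cutoff so the output is short.
toy_reads = ["AAAA"] * 6 + ["CCCC"] * 3 + ["GGGG"]
hits = overrepresented(toy_reads, min_fraction=0.25)
for seq, n, frac in hits:
    print(seq, n, f"{frac:.0%}")
```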

3) A few files have GC content graphs similar to those above, but the overrepresented sequence(s) match nothing ("No significant similarity found"), or a random plant that is not part of the rats' diet, or mould, or a random rodent/mammal (e.g., Abelia forrestii, Paradiachea cylindrica, elephant, etc.; Fig. 3). Qs: 3a) Are the sequence(s) that have no match novel transcripts? 3b) I assume that aligning without trimming should be fine, as these non-Rattus seqs won't be mapped?

-Per seq GC content graph - **Fig. 3.** five overrep seqs with random matches.

4) The two files for one sample (N=1/24) have overrepresented sequences belonging to E. coli and R. norvegicus mitoRNA or mRNA (Fig. 4.). Qs: 4a) As above, I assume that aligning without trimming should be fine as these bacterial seqs won't be mapped? 4b) As above, I assume that the non-E. coli overrepresented sequences are just highly expressed genes?

-Per seq GC content graph - Fig. 4.

I have pseudoaligned with salmon and run DESeq2 without trimming to quickly assess the data. Just for info, the PCA plot looks really good, with distinct clusters for my groups.

Conclusion: I should be fine to not perform trimming prior to STAR alignment?

*I have attached some screenshots, I hope they show up once posted.

Thanks a lot in advance!

RNAseq • 309 views


score 0 · Answer 1 · 2024-05-22

9 weeks ago

GenoMax 144k

Have you checked for the presence of rRNA in the affected samples? That is one possible cause of these peaks.

In theory, if you are aligning to your genome of choice, then the contaminant sequences (if real) should not align.
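One way to see how many reads failed to align, and in which category, is to read the summary STAR writes to `Log.final.out`. The sketch below parses the `key | value` lines of that file into a dictionary. The field names used are ones STAR reports, but the numbers in the sample text are invented for illustration, and the helper names are hypothetical.

```python
def parse_star_log(text):
    """Parse 'key | value' lines from a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '85.00%' to the float 85.0."""
    return float(stats[key].rstrip("%"))


# Invented example values; a real file has more fields than shown here.
sample_log = """
                          Number of input reads |\t1000000
                      UNIQUE READS:
                        Uniquely mapped reads % |\t85.00%
             % of reads mapped to multiple loci |\t5.00%
                 % of reads unmapped: too short |\t9.00%
                     % of reads unmapped: other |\t1.00%
"""

stats = parse_star_log(sample_log)
unmapped = percent(stats, "% of reads unmapped: too short") + percent(
    stats, "% of reads unmapped: other"
)
print(f"uniquely mapped: {stats['Uniquely mapped reads %']}, unmapped: {unmapped:.2f}%")
```

For a real run, pass the contents of the sample's `Log.final.out` instead of `sample_log`.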

ADD COMMENT • link 9 weeks ago by GenoMax 144k

Thanks for the reply and the help! I ran SortMeRNA then re-ran FastQC. Here's what both GC content graphs look like now. So I guess rRNA was the culprit.

I aligned non_rRNA_reads_R1.fq and non_rRNA_reads_R2.fq with STAR and the alignment rate is super low: 1.88%, whereas before SortMeRNA it was 70% for that sample (all other samples are >85%). I am confused. Would you know why this happens? I made sure to align the non_rRNA_reads files and not the aligned_reads files from SortMeRNA.

postSortMeRNA_R1.fq postSortMeRNA_R2.fq
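One thing that may be worth ruling out here, purely as an assumption since the thread does not confirm the cause, is that the R1 and R2 files are no longer in the same read order after filtering. Paired-end aligners expect record N in R1 to be the mate of record N in R2. The sketch below compares read names record by record and reports the first position where they disagree. The function names and toy records are made up, and a trailing `/1` or `/2` mate suffix is stripped before comparing.

```python
from io import StringIO
from itertools import zip_longest


def read_ids(handle):
    """Yield the read name from each FASTQ header, without a /1 or /2 suffix."""
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue
        fields = line.strip()[1:].split()
        name = fields[0] if fields else ""
        if name.endswith("/1") or name.endswith("/2"):
            name = name[:-2]
        yield name


def first_mismatch(r1_handle, r2_handle):
    """Return (record_index, r1_name, r2_name) at the first disagreement, else None."""
    pairs = zip_longest(read_ids(r1_handle), read_ids(r2_handle))
    for idx, (a, b) in enumerate(pairs):
        if a != b:
            return idx, a, b
    return None


# Toy files (invented): read 'b' has lost its mate in R2.
r1 = "@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n@c/1\nACGT\n+\nIIII\n"
r2 = "@a/2\nTGCA\n+\nIIII\n@c/2\nTGCA\n+\nIIII\n"

out_of_sync = first_mismatch(StringIO(r1), StringIO(r2))
in_sync = first_mismatch(StringIO(r1), StringIO(r1.replace("/1", "/2")))
print(out_of_sync)
print(in_sync)
```

If the real files do report a mismatch, the mates would need to be re-paired (or the filtering rerun so that both mates are kept or discarded together) before aligning again.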

ADD REPLY • link 7 weeks ago by Fossil ▴ 20