Question

Questions about per base sequence quality and GC content.

0

8 months ago

bioinfo ▴ 150

Hello,

I have some RNA-seq data. I ran FastQC and MultiQC on it and have some questions about the output and how to proceed. The mean per-base quality score for my samples is above 35, which seems good. However, in the individual FastQC reports, even though the per base sequence quality module passes, the error bars are very large. For example, the mean could be 38 while the error bar reaches down to 24. Should I be concerned about that?

Also, I am getting two peaks in the per sequence GC content plot. From what I have read, this could indicate some kind of contamination. I plan to align the data with STAR and then check the GC content of the mapped and unmapped reads. I also plan to BLAST some of the overrepresented sequences. Is there anything else I can do to identify the source of the contamination and check whether it interferes with the mapping?

Thank you

fastqc fastq seq RNA • 952 views

ADD COMMENT • link 8 months ago by bioinfo ▴ 150

score 1 · Answer 1 · 2023-08-16

1

8 months ago

GenoMax 142k

If you are aligning to a good reference, it should be fine to proceed. You have not posted a screenshot, but it is possible that the Q scores are not uniformly high across the reads in your samples, which would explain the variation you see.
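To put the numbers from the question in perspective, a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below is only an illustration (the function name is made up for this example) and converts the two values mentioned, Q38 and Q24. As far as I recall from the FastQC documentation, the whiskers in the per base sequence quality plot mark the 10th and 90th percentiles rather than error bars around the mean.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


# Values taken from the question: a mean of 38 with a whisker reaching 24.
for q in (38, 24):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.5f} ({100 * p:.3f}%)")
```

Even Q24 is an error rate of roughly 0.4% per base, which aligners generally tolerate well.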

As far as the GC content is concerned, one possible "contaminating" entity is rRNA. Align your data and see whether you notice any issues with the alignment. If you do not see similar alignment percentages across your samples, backtrack and investigate further.

This should not be a roadblock to moving forward with the rest of the analysis.
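If you want to look at the GC distribution yourself, for example separately for mapped and unmapped reads, a minimal sketch could look like the following. The function names and the toy reads are made up for illustration; in practice you would feed in read sequences parsed from your FASTQ files.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    counts = Counter()
    for read in reads:
        bin_start = int(gc_percent(read) // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Toy reads, not real data: two AT-rich and two GC-rich sequences.
reads = ["ATGCATTAGCAT", "GCGCGGCCGCGA", "ATATATGCATAT", "GGCCGCGGGCCC"]
print(gc_histogram(reads))
```

A second cluster of reads at high GC that sits mostly in the unmapped or rRNA-derived fraction would be consistent with the rRNA explanation.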

ADD COMMENT • link 8 months ago by GenoMax 142k

0

Thank you for replying. I added the images to the initial post. I BLASTed the overrepresented sequences and they seem to be lncRNA and a specific enhancer. Only one sequence seemed to be rRNA. I get around 50% alignment with kallisto. I am planning to align with STAR now so I can check the GC content of the mapped and unmapped reads.

ADD REPLY • link 8 months ago by bioinfo ▴ 150

0

I noticed that one rRNA sequence appears among the overrepresented sequences in all my samples. Its GC content is around 69%. Could that be what is causing the issue? I aligned my data with STAR, and when I run FastQC on the aligned BAM files the peak still seems to be there. However, I just realized that this rRNA is not in the GTF file from Ensembl, and in general it does not seem to have an Ensembl ID. Does that mean that this rRNA is not what is causing the issue?

ADD REPLY • link 8 months ago by bioinfo ▴ 150

0

when I do fastqc on the aligned BAM files the peak still seems to be there

If your BAM files contain unaligned reads then that is expected. It is likely that rRNA is causing that peak distribution. Unless you are working with rRNA you are not going to use those counts/reads.
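Whether a BAM file contains unaligned reads can be checked from the FLAG field: bit 0x4 marks a record as unmapped. The sketch below is only an illustration that works on SAM-formatted text lines (for example the output of `samtools view -h`); the function name and the toy records are made up. With samtools itself, `samtools view -c -f 4 your.bam` should give the same unmapped count directly.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def count_mapped_unmapped(sam_lines):
    """Count mapped and unmapped alignment records in SAM-formatted text lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & UNMAPPED_FLAG:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Toy SAM records, not real data: one aligned read and one unaligned read.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]
print(count_mapped_unmapped(example))
```

Note that this counts records, so secondary or supplementary alignments of the same read are counted more than once.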

ADD REPLY • link 8 months ago by GenoMax 142k

0

I think my BAM file does not contain unaligned reads because I used --outReadsUnmapped in the STAR command.

ADD REPLY • link 8 months ago by bioinfo ▴ 150