RNA-seq FastQC report explanation
9.1 years ago
Lerong ▴ 130

Hi all,

I am new to RNA-seq data analysis and just starting some QC analysis.

I ran FastQC with default settings and got the report, and I would like some comments and suggestions for further QC steps. Please bear with me if the questions are stupid.

The base quality is pretty good. The problem is that the data has very high duplication levels. I read through the documentation and found that possible reasons are PCR amplification, adapter contamination, etc. Any suggestions?

The report also flags many overrepresented sequences and Kmer Content. Any comments?

Another question is about the adapter content. I was told the adapter they used for trimming is CTGTCTCTTATACACATCT, but a certain number of reads contain universal adapter sequence such as AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA.

Thanks in advance

9.1 years ago

High levels of apparent duplication are normal in RNAseq and aren't due to actual PCR or optical duplicates. Any sort of rRNA or any other highly expressed transcript will cause the apparent duplication rate to jump.
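To illustrate why a highly expressed transcript inflates the apparent duplication level, here is a minimal sketch. It assumes the reads are already loaded as a list of strings and counts only exact duplicates; `duplication_fraction` is a hypothetical helper, not how FastQC itself computes its estimate.

```python
from collections import Counter


def duplication_fraction(reads):
    """Return the fraction of reads that are exact copies of another read.

    This is a simplified illustration, not FastQC's own estimate.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence counts once as "unique"; the rest are duplicates.
    return 1.0 - len(counts) / total


# Toy data: one abundant transcript (e.g. rRNA) contributes 8 identical reads,
# two other transcripts contribute one read each.
toy_reads = ["ACGTACGT"] * 8 + ["TTGACCAA", "GGCATCGA"]
print(f"duplication fraction: {duplication_fraction(toy_reads):.2f}")
```

None of these toy reads is a PCR or optical duplicate by construction, yet most of them look duplicated simply because one sequence is abundant.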

Something that does deserve a bit of looking into is the GC distribution. It's not normally so peaky like that. Similarly, there's a big jump in the prevalence of TGCCGACTA and CAAGTCGTC in the middle of reads. A spike in the prevalence of a hexamer (well, octamer in this case) at the start of reads is common, but I don't typically see that in the middle of reads for a standard mRNAseq dataset. Was this a standard RNAseq experiment or were you doing something special (e.g., RIPseq)?
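If you want to look at these two things outside of FastQC, a rough sketch could look like the following. It assumes the reads are a list of strings; `gc_histogram` and `kmer_positions` are hypothetical helper names, and the toy reads are made up for illustration only.

```python
from collections import Counter


def gc_percent(read):
    """Percentage of G/C bases in a single read."""
    if not read:
        return 0.0
    gc = sum(1 for base in read.upper() if base in "GC")
    return 100.0 * gc / len(read)


def gc_histogram(reads, bin_width=10):
    """Count reads per GC% bin; keys are the lower edge of each bin."""
    hist = Counter()
    for read in reads:
        lower = int(gc_percent(read) // bin_width) * bin_width
        # Put reads with exactly 100% GC into the last bin.
        hist[min(lower, 100 - bin_width)] += 1
    return dict(sorted(hist.items()))


def kmer_positions(reads, kmer):
    """Count how often `kmer` starts at each 0-based position in the reads."""
    positions = Counter()
    for read in reads:
        start = read.find(kmer)
        while start != -1:
            positions[start] += 1
            start = read.find(kmer, start + 1)
    return dict(sorted(positions.items()))


# Made-up reads: two carry TGCCGACTA in the middle, one is pure GC.
toy_reads = ["AAAATGCCGACTAAAAA", "TTTTTGCCGACTATTTT", "GGGGCCCCGGGGCCCC"]
print(gc_histogram(toy_reads))
print(kmer_positions(toy_reads, "TGCCGACTA"))
```

A sharp secondary peak in the GC histogram, or a k-mer piling up at one internal position, is the kind of pattern described above.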

For trimming, I actually trim off AGATCGG... for most datasets. Have a look here for what Scythe looks for.


Thanks for your answer, @Devon. I am not sure. What I am doing is part of a big project in a co-clinical trial.

Does peaky GC content indicate a contaminated library, for example adapter (AGATCGG...) contamination?


Is this human RNA? Are you sure you don't have bacterial contamination?


@apelin20 Sorry for forgetting to mention that. Yes! It is a sample with bacterial contamination. I would like to use the sample to identify the bacterial integration site from the RNA-seq data.


The middle peaks aren't from adapters, but they're from something else. Perhaps apelin20 is correct in suggesting bacterial contamination (that'd also explain the weird GC distribution). Blasting a few of the top sequences should help in determining this.
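One way to get those top sequences into a form you can paste into BLAST is sketched below. It assumes a plain four-line-per-record FASTQ; `read_sequences` and `top_sequences_fasta` are hypothetical helper names, and the in-memory FASTQ is toy data standing in for a real file handle.

```python
import io
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each record in a 4-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def top_sequences_fasta(reads, n=5):
    """Return the n most frequent read sequences as FASTA text."""
    counts = Counter(reads)
    lines = []
    for rank, (seq, count) in enumerate(counts.most_common(n), start=1):
        lines.append(f">seq{rank}_count{count}")
        lines.append(seq)
    return "\n".join(lines)


# Toy FASTQ held in memory; with real data, pass an open file handle instead.
toy_fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
print(top_sequences_fasta(read_sequences(toy_fastq), n=2))
```

The FASTA text can then be submitted to BLAST to see which organism the most abundant sequences come from.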


I just want to say how knowledgeable you are! @apelin20 @Devon Ryan Thanks.


Based on that, do I need to do adapter clipping and trim the low-quality bases before the analysis if I would like to identify the virus integration site? Thanks.


A bit of trimming won't hurt. Whether you can determine where the virus (I assume the earlier mention of bacterial integration was mistaken) integrated will depend on whether (1) the virus is getting included in a host transcript or (2) an integrated viral transcript includes host DNA. In either case, just trim adapter contamination and don't do any more than very light quality trimming.
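To make "just trim adapter contamination" concrete, here is a minimal sketch of 3' adapter clipping using the adapter prefix mentioned earlier in the thread. It only handles exact matches and is an illustration of the idea; `trim_adapter` and its `min_overlap` parameter are hypothetical, and a dedicated trimmer (Scythe, for instance) tolerates mismatches and should be used on real data.

```python
def trim_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read using exact matching only.

    Handles a full adapter anywhere in the read, or a partial adapter
    (at least `min_overlap` bases of its start) at the very end of the read.
    """
    # Case 1: the complete adapter occurs inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends with the beginning of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    # No adapter found: leave the read untouched.
    return read


adapter = "AGATCGGAAGAGC"  # start of the universal adapter quoted in the question
for read in ["ACGTACGTAGATCGGAAGAGCGTCG", "ACGTACGTAGATCGG", "ACGTACGT"]:
    print(read, "->", trim_adapter(read, adapter))
```

Reads without any adapter are returned unchanged, which fits the advice of not trimming more than necessary when the goal is to find reads spanning a host/virus junction.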


Why don't you do a preliminary assembly with Velvet/Oases (or another assembler), megablast the top 100 or 1000 transcripts, and see what's in your sample, as suggested by Devon Ryan? It doesn't seem like typical adapters are present in your reads, and it doesn't look like you have a quality problem. You can always do the trimming later (or now); first you need to figure out what is in your sample and where it is.
