Question: RNA-seq FastQC report explanation
4.1 years ago by
Lerong80
United States
Lerong80 wrote:

Hi all,

I am new to RNA-seq data analysis and am just starting on QC.

I ran FastQC with the default settings and got the report. I would like some comments and suggestions for further QC steps. Please bear with me if the questions are basic.

The base quality is pretty good. The problem is that the data has very high duplication levels. I read through the documentation and found that possible reasons are PCR amplification, adapter contamination, etc. Any suggestions?

The report also flags many overrepresented sequences and the Kmer Content module. Any comments?

Another question is about the adapter content. I was told the adapter they used for trimming is CTGTCTCTTATACACATCT, but a certain number of reads contain the universal adapter, e.g. AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA.

 

Thanks in advance.

rna-seq • 6.3k views
modified 4.1 years ago by Devon Ryan89k • written 4.1 years ago by Lerong80
4.1 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

High levels of apparent duplication are normal in RNAseq and aren't due to actual PCR or optical duplicates. Any sort of rRNA or any other highly expressed transcript will cause the apparent duplication rate to jump.
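To illustrate why a highly expressed transcript inflates the apparent duplication rate, here is a minimal Python sketch. It is not FastQC's actual calculation (FastQC estimates duplication from a subset of reads); it simply counts the fraction of reads that repeat an already-seen sequence. The function name and the toy reads are made up for illustration.

```python
from collections import Counter

def duplication_fraction(sequences):
    """Fraction of reads that are repeats of an already-seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence counts once as "unique"; the rest are duplicates.
    return 1.0 - len(counts) / total

# Toy library: one highly expressed transcript contributes 8 identical reads,
# while 4 low-expression transcripts contribute one read each.
reads = ["ACGTACGTAC"] * 8 + ["TTGACCA", "GGCATTA", "CCATGGA", "AATTCCG"]
print(round(duplication_fraction(reads), 2))  # 7 of 12 reads are repeats
```

No PCR or optical duplicates are involved in the toy data, yet more than half of the reads count as duplicates purely because of expression level.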

Something that does deserve a bit of looking into is the GC distribution. It's not normally so peaky. Similarly, there's a big jump in the prevalence of TGCCGACTA and CAAGTCGTC in the middle of reads. A spike in the prevalence of a hexamer (well, octamer in this case) at the start of reads is common, but I don't typically see that in the middle of reads for a standard mRNAseq dataset. Was this a standard RNAseq experiment or were you doing something special (e.g., RIPseq)?

For trimming, I actually trim off AGATCGG... for most datasets. Have a look here for what Scythe looks for.
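If you want to check how often each adapter actually shows up before deciding what to trim, a quick count is enough. The following is a rough sketch in plain Python, assuming an uncompressed 4-line-per-record FASTQ and exact matches only; the dictionary keys, function names, and the choice to search only the first 13 bases of the universal adapter are illustrative assumptions, not anything FastQC or Scythe does.

```python
# Adapter sequences quoted in the question. Only a prefix of the long
# universal adapter is searched, since read-through is often partial.
ADAPTERS = {
    "reported_trim_adapter": "CTGTCTCTTATACACATCT",
    "universal_adapter_prefix": "AGATCGGAAGAGC",
}

def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()

def count_adapter_reads(fastq_lines, adapters=ADAPTERS):
    """Return (total reads, {adapter name: reads with an exact match})."""
    counts = {name: 0 for name in adapters}
    total = 0
    for seq in read_sequences(fastq_lines):
        total += 1
        for name, adapter in adapters.items():
            if adapter in seq:
                counts[name] += 1
    return total, counts

# Usage with a real file (hypothetical file name):
# with open("sample.fastq") as handle:
#     total, counts = count_adapter_reads(handle)
```

Exact matching will undercount adapters that contain sequencing errors or that are cut off near the 3' end of a read, so treat the numbers as a lower bound.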

written 4.1 years ago by Devon Ryan89k

Thanks for your answer, @Devon Ryan. I am not sure. What I am doing is part of a big project in a co-clinical trial.

Does a peaky GC content indicate a contaminated library, for example adapter (AGATCGG...) contamination?

written 4.1 years ago by Lerong80

Is this human RNA? Are you sure you don't have bacterial contamination?

written 4.1 years ago by apelin20470

@apelin20 Sorry for forgetting to mention that. Yes, it is a sample with bacterial contamination! I would like to use the sample to identify the bacterial integration site from the RNA-seq data.

written 4.1 years ago by Lerong80

The middle peaks aren't from adapters, but they're from something else. Perhaps apelin20 is correct suggesting bacterial contamination (that'd also explain the weird GC distribution). Blasting a few of the top sequences should help in determining this.
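One way to look at the GC distribution outside of FastQC is to bin the per-read GC percentage yourself; a second peak away from the host's typical GC content would be consistent with a second organism in the sample. This is only a sketch with made-up function names and a default 5% bin width chosen for illustration.

```python
from collections import Counter

def gc_percent(seq):
    """GC percentage of one read, ignoring N and other ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return 100.0 * gc / called if called else 0.0

def gc_histogram(sequences, bin_width=5):
    """Count reads per GC-percentage bin (key = lower edge of the bin)."""
    hist = Counter()
    for seq in sequences:
        pct = gc_percent(seq)
        # Clamp 100% GC into the last bin so keys stay below 100.
        lower = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))

# Toy example: three reads with low, medium, and high GC content.
print(gc_histogram(["ATATATAT", "ACGTACGT", "GGCCGGCC"]))
```

Splitting the reads by GC bin and blasting a few from each peak, as suggested above, would then show which organism each peak comes from.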

written 4.1 years ago by Devon Ryan89k

I just want to say how knowledgeable you both are, @apelin20 and @Devon Ryan. Thanks.

written 4.1 years ago by Lerong80

Based on that, do I need to clip adapters and trim low-quality bases before the analysis if I would like to identify the virus integration site? Thanks.

written 4.1 years ago by Lerong80

A bit of trimming won't hurt. Whether you can determine where the virus (I assume the earlier mention of bacterial integration was mistaken) integrated will depend on whether (1) the virus is getting included in a host transcript or (2) an integrated viral transcript includes host DNA. In either case, just trim adapter contamination and don't do any more than very light quality trimming.
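For a sense of what adapter-only trimming does to a read, here is a minimal Python sketch that removes an exact adapter match, or a partial adapter prefix hanging off the 3' end, and applies no quality trimming at all. The function name and the `min_overlap` default are assumptions for illustration; dedicated trimmers additionally tolerate mismatches, so this is not a substitute for one.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=5):
    """Trim an exact adapter match, or a partial match at the read's 3' end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Look for the adapter's prefix running off the 3' end of the read,
        # trying the longest possible overlap first.
        longest = min(len(adapter) - 1, len(seq))
        for overlap in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                idx = len(seq) - overlap
                break
    if idx == -1:
        return seq, qual  # no adapter found, read left untouched
    return seq[:idx], qual[:idx]

# Toy read: 8 bases of insert followed by the start of the universal adapter.
read = "ACGTACGTAGATCGGAAGAGCTT"
print(trim_3prime_adapter(read, "I" * len(read), "AGATCGGAAGAGC"))
```

Keeping the rest of the read intact matters here because reads spanning a host/virus junction are the ones of interest, and aggressive quality trimming could shorten them past the junction.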

written 4.1 years ago by Devon Ryan89k

Why don't you do a preliminary assembly with Velvet/Oases (or another assembler), megablast the top 100 or 1000 transcripts, and see what's in your sample, as suggested by Devon Ryan? It doesn't seem like typical adapters are present in your reads, and it doesn't look like you have a quality problem. You can always do the trimming later (or now), but first you need to figure out what is in your sample and where it is.

modified 4.1 years ago • written 4.1 years ago by apelin20470