Question: Fastqc for RNA-Seq data (Illumina 1.9)
2
3.7 years ago by
debitboro140
Belgium
debitboro140 wrote:

Hi,

I'm new to analyzing RNA-Seq data from the Illumina platform. As a first step, I ran FastQC on a single file and obtained this result. As you can see, 6 modules failed. I think the Sequence Duplication Levels failure is acceptable for RNA-Seq data. My question is whether I should trim the first 13 bp from the reads (the Per base sequence content module fails there) or not. I would also appreciate advice about the Kmer Content failure.

Thanks

fastqc rna-seq kmer content • 6.4k views
ADD COMMENTlink modified 3.7 years ago by Dr. Mabuse47k • written 3.7 years ago by debitboro140
1

This looks like an example case for "What is the reason for most software errors in Bioinformatics according to you?"

ADD REPLYlink written 3.7 years ago by Dr. Mabuse47k
7
3.7 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Your data have very high levels of duplication, with sequences duplicated more than 10,000 times. That is quite worrisome. In RNA-Seq, duplication levels in the tens and hundreds are ok; the 10K+ region is not. Your most duplicated sequence:

CTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGCGTT

is present 200K times and appears to match ribosomal DNA.
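A count like this can be reproduced outside FastQC. Below is a minimal sketch, assuming a plain 4-line-per-record FASTQ; `top_sequences` is an illustrative helper, not part of FastQC, and the three reads are made up.

```python
from collections import Counter
from io import StringIO


def top_sequences(fastq_handle, n=5):
    """Return the n most frequent read sequences as (sequence, count) pairs."""
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        # In a 4-line FASTQ record the sequence is the second line.
        if line_number % 4 == 1:
            counts[line.strip()] += 1
    return counts.most_common(n)


# Tiny made-up example; for real data pass an open file handle instead.
example = StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
print(top_sequences(example))
```

This keeps every distinct sequence in memory, so for a large library it may be more practical to run it on a subsample of the reads.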

Hence, to me it looks like your data have not been rRNA depleted, and most of the reads will map to rRNA. You may have to make do with just 7% of the data, meaning about 2 million reads from the original 39 million. That may or may not be sufficient.

At this point, worrying about quality filtering or kmer content is not all that relevant; it won't really make much difference.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Istvan Albert ♦♦ 81k

hi Istvan Albert,

Thank you for your suggestions. I mapped the first 1M reads to the human rRNA reference sequences (5S, 5.8S, 12S, 16S, 18S, and 28S), and 83% of them mapped. I don't know what to do next, since the data I'm handling come from an NGS experiment performed by another person.
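For a rough sense of scale, the result from the sample can be extrapolated to the whole library. The sketch below assumes the first 1M reads are representative of all 39 million, which may not hold because reads at the start of a file can differ from the rest. `projected_non_rrna` is only an illustrative helper name. Reads that do not map to rRNA are not necessarily usable, so the result is best treated as an upper bound.

```python
def projected_non_rrna(total_reads, sampled_reads, sampled_rrna_hits):
    """Scale the non-rRNA share seen in a sample up to the whole library."""
    if sampled_reads <= 0 or not 0 <= sampled_rrna_hits <= sampled_reads:
        raise ValueError("sampled_rrna_hits must lie between 0 and sampled_reads")
    non_rrna_in_sample = sampled_reads - sampled_rrna_hits
    # Integer arithmetic avoids floating-point rounding on large read counts.
    return total_reads * non_rrna_in_sample // sampled_reads


# Numbers from this thread: 39 million reads, 1M sampled, 83% mapped to rRNA.
print(projected_non_rrna(39_000_000, 1_000_000, 830_000))
```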

thanks

ADD REPLYlink written 3.7 years ago by debitboro140

Ouch! This confirms @Istvan's observation (though you have ~4M usable reads).

ADD REPLYlink written 3.7 years ago by genomax75k
3
3.7 years ago by
genomax75k
United States
genomax75k wrote:

Please do NOT trim the first bases. You may be throwing away perfectly good data.
See this blog post from Dr. Simon Andrews for more background on this. @decosterwouter: You may want to see the post as well.
This must be a stranded RNAseq library (which is indicated by the GC predominance).

Having a red "X" appear on a FastQC module does not indicate an automatic failure of the data. Simon had to decide on some reasonable thresholds for judging the output of the various modules, and this "observation" is a side effect of those choices (I think those limits can be changed through a settings file, if I recall). You should consider what kind of data you are dealing with before deciding which failures actually matter (other posts here are useful as well).
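One way to act on this is to read the `summary.txt` file that FastQC writes (tab-separated: status, module name, file name) and separate the flags you expect for your library type from the ones worth a closer look. This is a sketch only: `triage_summary` and the set of "expected" modules are assumptions made for illustration, taken from what is said in this thread, not a FastQC feature.

```python
def triage_summary(summary_text, expected_for_library=frozenset()):
    """Split non-PASS FastQC modules into 'investigate' and 'expected' lists."""
    investigate, expected = [], []
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t", 2)
        if status == "PASS":
            continue
        target = expected if module in expected_for_library else investigate
        target.append((module, status))
    return investigate, expected


# Hypothetical choice: modules this thread describes as commonly flagged in RNA-seq.
RNASEQ_EXPECTED = frozenset({"Per base sequence content", "Sequence Duplication Levels"})

summary = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "FAIL\tPer base sequence content\tsample.fastq\n"
    "FAIL\tOverrepresented sequences\tsample.fastq\n"
)
investigate, expected = triage_summary(summary, RNASEQ_EXPECTED)
print("look closer:", investigate)
print("expected for this library type:", expected)
```

An "expected" flag can still hide a real problem, as the duplication levels in this dataset show, so the split is a starting point rather than a verdict.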

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by genomax75k
2

The plot in that blog post and the explanation for it are confusing.

Note how the plot is binned after position 10. The first 9 values are individual measurements, but the rest are averages binned over some window that is not explained properly. The labels don't seem right either. What the heck is 14-15? Of course the line is smoother after it gets binned. That plot (like many others in FastQC) is unscientific IMO.
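To illustrate why binning smooths the line, here is a toy sketch. It assumes a made-up scheme (first 9 positions kept individually, then fixed windows of 2) and is not FastQC's actual grouping code; `bin_positions` is a hypothetical helper.

```python
def bin_positions(values, single_until=9, window=2):
    """Keep the first `single_until` values as-is, then average fixed-width windows."""
    binned = [(str(i + 1), v) for i, v in enumerate(values[:single_until])]
    for start in range(single_until, len(values), window):
        chunk = values[start:start + window]
        first, last = start + 1, start + len(chunk)
        label = str(first) if first == last else f"{first}-{last}"
        binned.append((label, sum(chunk) / len(chunk)))
    return binned


# A series that alternates between 0 and 20 after position 9 looks flat once binned.
series = [10] * 9 + [0, 20, 0, 20]
for label, value in bin_positions(series):
    print(label, value)
```

Under this scheme a label such as "14-15" would simply mean the average over cycles 14 and 15.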

[Image: Random priming]

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Istvan Albert ♦♦ 81k

This looks like the standard FastQC plot. I never noticed that it is binned beyond position 10, but that does seem to add to the confusion.

ADD REPLYlink written 3.7 years ago by Dr. Mabuse47k

FastQC does all sorts of weird smoothing. It does it on the GC% plots too. I really hate FastQC for this.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by John12k

The binning can be turned off on the command line; FastQC will then plot individual cycles. Plots get unwieldy for long runs, so binning is the default.
Can the plots be done better? I am sure they can, but they serve their basic qualitative purpose for now. The FastQC code is open source, so perhaps someone here can improve that part.
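As a sketch of what that looks like in practice, the helper below builds a FastQC command line with grouping disabled. It relies on FastQC's `--nogroup`, `--outdir` and `--limits` options (worth double-checking against `fastqc --help` for your installed version); `fastqc_command` and the file names are placeholders, not anything from this thread.

```python
import subprocess  # only needed if you uncomment the run line below


def fastqc_command(fastq_path, outdir, nogroup=True, limits_file=None):
    """Build a FastQC argument list; nothing is executed here."""
    cmd = ["fastqc", "--outdir", outdir]
    if nogroup:
        # Plot every cycle individually instead of grouped positions.
        cmd.append("--nogroup")
    if limits_file is not None:
        # Optional custom warn/fail thresholds.
        cmd += ["--limits", limits_file]
    cmd.append(fastq_path)
    return cmd


cmd = fastqc_command("sample.fastq.gz", "qc_out")
print(" ".join(cmd))
# To run it (requires FastQC on PATH and an existing output directory):
# subprocess.run(cmd, check=True)
```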

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by genomax75k

Thanks for the link, will definitely read that. (And change my own trimming strategy!)

ADD REPLYlink written 3.7 years ago by WouterDeCoster42k
2
3.7 years ago by
Dr. Mabuse47k
Bergen, Norway
Dr. Mabuse47k wrote:

You should not trim the first 12-13 bp of your data; these bases are of good quality. I have seen a similar pattern in almost every RNA-seq sample. Per base sequence content seems to fail because of a slight bias in the reagents or protocol toward a particular sequence composition in the first bases, so that reads with a certain composition at the start are slightly enriched. Most of the tests FastQC runs are irrelevant for RNA-seq and mostly cause confusion. As you suggest, the sequence duplication level is generally higher in RNA-seq, but it is quite high in your sample. This could be due to a few very highly expressed genes or a high content of ribosomal RNA; that is something you should check.
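To see the composition bias for yourself, the per-position base fractions can be tallied directly from the read sequences. This is a minimal sketch of the idea behind the "Per base sequence content" plot, not FastQC's implementation; `base_composition` is an illustrative name and the four reads are made up.

```python
from collections import Counter


def base_composition(reads, max_positions=15):
    """Return, per position, the fraction of each base among reads covering it."""
    per_position = [Counter() for _ in range(max_positions)]
    for read in reads:
        for i, base in enumerate(read[:max_positions]):
            per_position[i][base] += 1
    fractions = []
    for counts in per_position:
        total = sum(counts.values())
        # Positions that no read reaches get an empty dict.
        fractions.append({b: counts[b] / total for b in "ACGT"} if total else {})
    return fractions


reads = ["ACGT", "ACGA", "CCGT", "ACTT"]
for position, fracs in enumerate(base_composition(reads, max_positions=4), start=1):
    print(position, fracs)
```

A skew that is confined to the first dozen or so positions while quality scores stay high points to a composition bias at the start of the reads rather than to bad base calls.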

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Dr. Mabuse47k
0
3.7 years ago by
Belgium
WouterDeCoster42k wrote:

I would suggest using Trimmomatic, which will also take care of reads running into adapters. I would recommend trimming part of the beginning (~12 bp), performing adapter clipping, and then running QC again. Perhaps removing low-quality bases at the end will improve your result. The Trimmomatic manual is very helpful. Good luck with the analysis! http://www.usadellab.org/cms/?page=trimmomatic
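A single-end Trimmomatic call along these lines could be assembled as follows. This is a sketch under assumptions: the jar path, file names, adapter file and thresholds are placeholders to adapt to your own setup, and `trimmomatic_se_command` is a hypothetical helper. `ILLUMINACLIP`, `HEADCROP`, `TRAILING` and `MINLEN` are Trimmomatic steps and are applied in the order given.

```python
def trimmomatic_se_command(jar, fastq_in, fastq_out, adapters,
                           head_crop=0, trailing_quality=3, min_length=36):
    """Build a single-end Trimmomatic argument list; nothing is executed here."""
    cmd = [
        "java", "-jar", jar, "SE", "-phred33", fastq_in, fastq_out,
        # seed mismatches : palindrome clip threshold : simple clip threshold
        f"ILLUMINACLIP:{adapters}:2:30:10",
    ]
    if head_crop > 0:
        # Optional: remove a fixed number of bases from the start of each read.
        cmd.append(f"HEADCROP:{head_crop}")
    cmd += [f"TRAILING:{trailing_quality}", f"MINLEN:{min_length}"]
    return cmd


cmd = trimmomatic_se_command(
    "trimmomatic.jar", "sample.fastq.gz", "sample.trimmed.fastq.gz",
    "TruSeq3-SE.fa", head_crop=12,
)
print(" ".join(cmd))
```

Leaving `head_crop` at 0 skips the start trimming, which makes it easy to compare the FastQC reports with and without it.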

ADD COMMENTlink written 3.7 years ago by WouterDeCoster42k