Question

FastQC for RNA-Seq data (Illumina 1.9)

4

8.1 years ago

debitboro ▴ 260

Hi,

I'm new to analyzing RNA-Seq data from the Illumina platform. As a first step, I ran FastQC on a single file and obtained this result. As you can see, six modules failed. I think the Sequence Duplication Levels failure is acceptable for RNA-Seq data. My question is whether I should trim the first 13 bp from the reads (this is where the Per base sequence content module fails). I would also appreciate some advice about the Kmer Content failure.

Thanks

RNA-Seq Fastqc Kmer content • 11k views

ADD COMMENT • link updated 8.1 years ago by Michael 54k • written 8.1 years ago by debitboro ▴ 260

1

This looks like an example case for "What is the reason for most software errors in Bioinformatics according to you?"

ADD REPLY • link 8.1 years ago by Michael 54k

score 8 · Answer 1 · 2016-03-31

8.1 years ago

Istvan Albert 100k

Your data have very high levels of duplication, in the 10,000+ region. That is quite worrisome. In RNA-Seq, duplication levels in the tens and hundreds are OK; the 10K+ region is not. Your most duplicated sequence:

CTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGCGTT

is present 200K times and appears to match ribosomal DNA.

Hence, to me it looks like your data have not been rRNA-depleted, and most of the reads will map to rRNA. You may have to make do with just 7% of the data, meaning about 2 million reads from the original 39 million. That may or may not be sufficient.

At this point, worrying about quality filtering or k-mer content is not all that relevant; it won't really make much difference.
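If you want to check the duplication yourself rather than rely on the FastQC table, a minimal sketch along these lines could list the most frequent read sequences. It assumes an unwrapped FASTQ with exactly four lines per record; the function name `top_duplicated` is made up for illustration and is not part of FastQC.

```python
from collections import Counter
from itertools import islice


def top_duplicated(fastq_lines, n=5):
    """Return the n most frequent read sequences and their counts."""
    # In a 4-line FASTQ record the sequence is the 2nd line (index 1, 5, 9, ...).
    sequences = (line.strip() for line in islice(fastq_lines, 1, None, 4))
    return Counter(sequences).most_common(n)


# Tiny in-memory example; with real data pass an open file handle instead.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTGA", "+", "IIII",
]
print(top_duplicated(example, n=2))
```

The top sequences can then be searched against a database to see what they are, which is how a match to ribosomal sequence would show up.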

ADD COMMENT • link 8.1 years ago by Istvan Albert 100k

0

Hi Istvan Albert,

Thank you for your suggestions. I mapped the first 1M reads to the human rRNA reference sequences (5S, 5.8S, 12S, 16S, 18S, and 28S), and 83% of them mapped. I don't know what to do, since the data I'm handling come from an NGS experiment performed by another person.

Thanks
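For anyone who wants to repeat this check, taking the first N reads of a FASTQ file can be sketched as below. This assumes four lines per record; the helper name `first_n_reads` is hypothetical, and the mapping itself still has to be done with an aligner of your choice.

```python
from itertools import islice


def first_n_reads(fastq_lines, n):
    """Return an iterator over the lines of the first n FASTQ records."""
    # Each record spans exactly 4 lines: header, sequence, '+', qualities.
    return islice(fastq_lines, 4 * n)


# Tiny in-memory example; with real data pass an open file handle instead.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "GGCA", "+", "IIII",
    "@r3", "TTGA", "+", "IIII",
]
subset = list(first_n_reads(example, 2))
print(subset)
```

Note that the first reads in a file are not a random sample, so the mapped fraction from such a subset is only a rough estimate for the whole run.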

ADD REPLY • link 8.1 years ago by debitboro ▴ 260

0

Ouch! This confirms @Istvan's observation (though you have ~4M usable reads).

ADD REPLY • link 8.1 years ago by GenoMax 141k

score 3 · Answer 2 · 2016-03-31

8.1 years ago

GenoMax 141k

Please do NOT trim the first bases. You may be throwing away perfectly good data.
See this blog post from Dr. Simon Andrews for more background on this. @decosterwouter: You may want to see the post as well.
This must be a stranded RNAseq library (which is indicated by the GC predominance).

A red "X" on a FastQC module does not automatically mean the data have failed. Simon had to decide on some reasonable thresholds for judging the output of the various modules, and this "observation" is a side effect of those choices (I think those limits can be changed in a settings file, if I recall). You would want to consider what kind of data you are dealing with before deciding what counts as an actual failure (other posts here are useful as well).

ADD COMMENT • link 8.1 years ago by GenoMax 141k

2

The plot in that blog post and the explanation for it are confusing.

Note how the plot is binned after position 10. The first 9 values are individual measurements, but the rest are averages binned over some window, which is not explained properly. The labels don't seem right either. What the heck is 14-15? Of course the line is smoother after it gets binned. That plot (like many others in FastQC) is unscientific IMO.

[Image: Random priming]

ADD REPLY • link 8.1 years ago by Istvan Albert 100k

0

This looks like the standard FastQC plot. I never noticed that it was binned beyond position 10, but that does seem to add to the confusion.

ADD REPLY • link 8.1 years ago by Michael 54k

0

FastQC does all sorts of weird smoothing. It does it on the GC% plots too. I really hate FastQC for this.

ADD REPLY • link 8.1 years ago by John 13k

0

The binning can be turned off on the command line; FastQC will then plot individual cycles. Plots get unwieldy for long runs, so binning is the default.
Can the plots be done better? I am sure they can, but they serve their basic qualitative purpose for now. The FastQC code is open source, so perhaps someone here can improve that part.
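To illustrate why a binned plot looks smoother, here is a small sketch that keeps the first few positions as they are and averages the rest in fixed windows. The window sizes (`single`, `width`) and the function name `bin_positions` are assumptions for illustration only; this is not FastQC's actual grouping scheme. As far as I know, the FastQC option that disables grouping is `--nogroup`.

```python
from statistics import mean


def bin_positions(values, single=9, width=5):
    """Report the first `single` positions individually, then average
    the remaining positions in consecutive windows of `width`."""
    binned = [(str(i + 1), v) for i, v in enumerate(values[:single])]
    for start in range(single, len(values), width):
        window = values[start:start + width]
        first, last = start + 1, start + len(window)
        label = str(first) if first == last else f"{first}-{last}"
        binned.append((label, mean(window)))
    return binned


# Alternating per-cycle values look flat once they are averaged in pairs.
per_cycle = [10, 50, 10, 50, 10, 50]
print(bin_positions(per_cycle, single=2, width=2))
```

In the toy input the per-cycle values swing between 10 and 50, but every binned window reports the same average, so any cycle-to-cycle pattern inside a bin is hidden.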

ADD REPLY • link 8.1 years ago by GenoMax 141k

0

Thanks for the link, will definitely read that. (And change my own trimming strategy!)

ADD REPLY • link 8.1 years ago by WouterDeCoster 47k

score 2 · Answer 3 · 2016-03-31

You should not trim the first 12-13 bp of your data; these bases are of good quality. I have seen a similar pattern in almost every RNA-seq sample. Per base sequence content seems to fail because of a slight bias in the reagents or protocol towards a particular sequence composition in the first bases, so that reads with a certain composition at their start are slightly enriched. Most of the tests FastQC runs are irrelevant for RNA-seq and mostly cause confusion. As you suggest, the sequence duplication level is generally higher in RNA-seq, but it is quite high in your sample. This could be due to a few very highly expressed genes or a high content of ribosomal RNA, which is something you should check.
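To look at the composition bias directly, the per-position base fractions over the first bases of the reads can be computed with a sketch like the one below. The function name `base_composition` and the default of 13 positions are illustrative assumptions, and this is not how FastQC itself is implemented.

```python
from collections import Counter


def base_composition(sequences, n_positions=13):
    """Return, for each of the first n_positions, the fraction of A/C/G/T."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        for pos, base in enumerate(seq[:n_positions]):
            counts[pos][base] += 1
    composition = []
    for counter in counts:
        # Other symbols such as N count towards the total, so the four
        # fractions may sum to less than 1 at positions that contain them.
        total = sum(counter.values())
        composition.append(
            {b: counter[b] / total for b in "ACGT"} if total else {}
        )
    return composition


reads = ["ACGT", "ACGA", "TCGA", "ACTA"]
for pos, fractions in enumerate(base_composition(reads, n_positions=2), start=1):
    print(pos, fractions)
```

A biased start shows up as fractions that differ clearly between the first positions and the rest of the read, even when the base qualities there are fine.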

score 0 · Answer 4 · 2016-03-31

I would suggest using Trimmomatic, which will also take care of reads running into adapters. I would recommend trimming part of the beginning (~12 bases), performing adapter clipping, and then running QC again. Perhaps removing low-quality bases at the end will improve your result. The Trimmomatic manual is very helpful. Good luck with the analysis! http://www.usadellab.org/cms/?page=trimmomatic
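For reference, cropping the start of a read amounts to dropping the first bases together with their quality values. The sketch below shows that idea in plain Python; it is not Trimmomatic code (Trimmomatic has its own `HEADCROP` step for this), and the name `head_crop` and the tuple layout of a record are assumptions for illustration.

```python
def head_crop(record, n=12):
    """Drop the first n bases and their qualities from one FASTQ record.

    `record` is a (header, sequence, plus, qualities) tuple.
    """
    header, seq, plus, qual = record
    # Sequence and quality strings must stay the same length after cropping.
    return (header, seq[n:], plus, qual[n:])


read = ("@r1", "ACGTACGT", "+", "IIIIHHHH")
print(head_crop(read, n=4))
```

Whatever tool is used, it is worth rerunning QC before and after cropping to see whether it changes anything that matters for the analysis.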