Question

trimming fastq files with Trimmomatic

2

Entering edit mode

10.1 years ago

rodd ▴ 250

Dear all,

"TO TRIM OR NOT TO TRIM?"

My PE RNAseq library prep of human brain tissue was made with TruSeq Illumina kit A using index 5, and I've got a few yellow warnings that I'd like to know what you'd do.

I found a yellow warning for overrepresented sequences - none are Illumina adapters/index. When I align the multiple overrepresented sequences, they mostly overlap, and when I blast that sequence this is the result:

Homo sapiens uncharacterized LOC105378179 (LOC105378179), transcript variant X2, ncRNA

Should I REMOVE this sequence using Trimmomatic, since it's overly reprersented?

For last, I have a warning on per sequence GC content.

I thought of doing a "mild" trimming of the reads using trimmomatic (LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36) to remove low-quality bases, and that's all. What would you recommend?

QC RNA-Seq • 23k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by rodd ▴ 250

0

Entering edit mode

humm, I would trim your question a bit, it is too long...

Looking at GC content will not help decide for or against quality trimming. I would trim adapters even though FastQC did not complain. And RNAseq in general raises FastQC "Sequence duplication" flag, but to be certain you have to look at read mapping - highly expressed transcripts will falsely raise the duplication levels.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by h.mon 35k

Ram · Accepted Answer · 2015-06-02

9

Entering edit mode

10.1 years ago

Brice Sarver ★ 3.8k

For what it's worth, I always clean my raw data, at the very least for poor-quality base calls or poor-quality reads in general. So do my colleagues. The real (and somewhat longer) answer to your question, however, is rooted in what type of data you're generating and what you want to ultimately do with it.

If you have shotgun libraries and are looking to assemble a whole genome, keeping in low-quality reads and bases increases complexity and can dramatically increase run time. It's pretty striking; I've seen assembler get 'hung up' trying to sort out the k-mer graph, and the problem can disappear once poor reads are removed. This also applies to duplicates.

If you're just mapping to a reference and calling variants, it's less of a deal nowadays than it was a few years ago. BWA's MEM algorithm can soft-clip reads to improve mapping quality, and this is useful if you have residual adapters or low-quality spans at the beginning. I see this as a secondary bonus of sorts, but I would still trim my reads.

Also, you might have a non-random distribution of k-mers or subsequences represented just based on your library prep. Imagine that you PCR a single locus and then make a library out of it. You will definitely have an overrepresentation. Scaling up, say you targeted and sequenced the exome. Again, your distribution of k-mers might be non-random because you might expect to see certain motifs overrepresented (start/stop codons, for example).

So, I would clean my raw data using a series of best practices (remove low-quality bases/reads/adapters, identify overlaps, dedup - but the dedup doesn't apply to your expression data). I would also be a bit leery to just chop bases off for no reason other than a summary report suggests overrepresentation. The question to ask is, "Will this affect the biological interpretation of my data systematically?"

I would love to hear others' thoughts.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

Thanks for the comprehensive explanation, Brice. Specially because I am new, it honestly helped me better visualise the issues of RNAseq QC and its importance to finally translate to biological understanding. So, what parameters would you recommend for standard trimming? Should I just use what's in Trimmomatic manual?

java -jar trimmomatic-0.30.jar PE --phred33 input_forward.fq.gz input_reverse.fq.gz \
            output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
            output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
            ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
            SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PE1_rc
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
>PE2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE2_rc
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Remove leading low quality or N bases (below quality 3)
Remove trailing low quality or N bases (below quality 3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15
Drop reads below the 36 bases long

I have interest in gene expression regarding a specific chromosome locus. So I have RNA sequenced only from one brain sample in order to identify possibly novel transcripts of those specific genes, expressed in a particular brain region. Then I plan to use these results to guide the design of some RT-PCRs in several samples to prove my findings. (Without actually trimming my input data, I simply aligned it to the reference human genome and did not find expression of a RefSeq gene that consists of 2 merged genes in my sample - I saw in junctions.bed file that there are transcripts from gene 1 and from gene 2, but not gene1+2, and this is very relevant for my PhD studies). Anyway, this is still ongoing work. Sorry for the once again long text and thanks in advance!

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by rodd ▴ 250

0

Entering edit mode

Do you have any reason to believe that you have residual adapter, or are you asking whether to trim in general when you see overrepresentation? If I'm understanding correctly, you are suggesting trimming what FastQC indicates is overrepresented.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

I have no reason to believe my data is contaminated with residual adapter. I am just looking for the best practices in data QC.

ADD REPLY • link 10.1 years ago by rodd ▴ 250

2

Entering edit mode

If the overrepresented sequences are not adapters, it might be the case that your data just has some overrepresented sequences or k-mer enrichment - it is RNA-seq, after all, and thus a non-random sampling of the genome. Tools like FastQC are great for exploring your data, but people get too hung up on making sure their data passes everything. If something looks really wrong, it will give you an opportunity to investigate that.

To get a sense of what other labs do, I would pick a couple of (good) RNA-seq papers and see what they do, then emulate that. If you want to do something differently, make sure you can justify it to yourself and others. There's really no 'this is the 100% right way to do clean your data" approach.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

Thank you so much! I will research more on this! But you definitely helped a lot already!

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by rodd ▴ 250