Question: Does trimming RNA-Seq reads affect the differentially expressed genes?
1
gravatar for lakshmi9c
6 weeks ago by
lakshmi9c10
lakshmi9c10 wrote:

I am curious whether trimming of reads of RNA-Seq dataset affects the final outcome of gene expression or not? If so, then is there a reason behind it? Say, for example, low quality read trimming results in fewer number of genes. How is this justified?

sequencing rna-seq • 251 views
ADD COMMENTlink modified 6 weeks ago by genomax92k • written 6 weeks ago by lakshmi9c10

this is indeed quite a debated topic. It will pay off to look into recent literature about this.

here is one to start already: Alignment and mapping methodology influence transcript abundance estimation

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by lieven.sterck9.0k

Interesting article. I know different alignment and mapping methods vary in terms of accuracy, but my question here is more about the trimming of reads, and how it affects the genes. Can I get your insight on that?

ADD REPLYlink written 6 weeks ago by lakshmi9c10

yes, you are right. I meant to include a different paper : Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level or bit older one: Trimming of sequence reads alters RNA-Seq gene expression estimates

that does not mean I agree with those publications though (they're just some pointers for further reading) and I personally would also live by the advise of ATpoint

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by lieven.sterck9.0k

I personally do not feel well giving data to tools that are contaminated with adapters. Easy to check with fastqc. I always trim if the adapter contamination curve it produces is not basically < 1%. Call it paranoid, but that is how I do it. Better save than sorry.

ADD REPLYlink written 6 weeks ago by ATpoint42k

I totally agree! Feeding raw data into tools for downstream analysis without trimming is not advisable, especially when the reads contain adapter sequences in it. I agree with you, and I too trim my fastq data based on per base sequence quality and adapter contamination, and then check the quality using FASTQC. But I was wondering how does trimming the low-quality reads affect the final gene list? What if trimming removes some of the genes of my interest?

ADD REPLYlink written 6 weeks ago by lakshmi9c10

What if trimming removes some of the genes of my interest?

I do not see how this would be possible. Trimming is done on reads. I cannot think of way how trimming would remove counts f a specific gene. I fear you are overthinking this entire trimming issue, and frankly, if you start worrying about these kinds of things like a certain standard procedure might introduce some very specific biases then you will have a hard time and constant headache. I would simply apply standard procedures and then later, and only if you have strong evidence that something is going completely wrong, question these procedures and start digging. If a gene that you expect is not showing up then there are probably many potential reasons, but adapter trimming is most likely none of it.

ADD REPLYlink written 6 weeks ago by ATpoint42k

@ATpoint

Thanks for clarifying that! However, I have another question. I have performed Trimmomatic on my paired-end reads (SLIDING WINDOW 4:20) and I am confused with the output I have got. The number of transcripts/ transcripts IDs post-trimming are exactly the same as the number of transcripts pre-trimming. Trimming makes reads shorter, right? Correct me if I'm wrong, but doesn't trimming of reads result in less number of transcripts?

Also, I have done Trimmomatic with different parameters (SLIDING WINDOW 2:20, 10:20). But the results post-trimming are exactly the same for all! It's almost as if trimming didn't have any affect on the reads.

Can you please help me understand the reason behind this?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by lakshmi9c10

If there is nothing tro trim, then input=output.

ADD REPLYlink written 4 weeks ago by ATpoint42k

Would that be the case even if I use different trimming options- Sliding window, MINLEN? I am trimming in the first place because I have some low quality reads in my library. That would mean trimmomatic removes those reads, and I interpreted that with FASTQC report post-trimming. Shouldn't it be reflected on the number of transcripts as well?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by lakshmi9c10

When you say transcripts... what do you mean exactly? After aligning the newly trimmed reads the number of alignments to transcripts is the same.

Yes I would say if any trimming occurred it would be very suspicious to see EXACTLY the same number of alignments. If you mean the total number of genes detected (e.g. genes with > 1 reads) is the same... then that doesn't seem too unusual. Your trimming should just remove low-quality reads... but any genes that are considered expressed should have more than just a handful of reads.

Why don't you check the logs and see how many reads were removed by the trimming? If it wasn't very many then this makes sense. Either way to make sure that you aren't making a mistake (e.g. using the old files) somewhere you should just check the actual COUNTS (read alignments mapped to features) because in that case they should differ (assuming at least 1 read was tossed).

ADD REPLYlink written 4 weeks ago by benformatics2.0k

Yes, the number of alignments of newly trimmed reads to the transcripts is the same. It gives the same number of alignments even after read trimming.

The total number of genes, however, is the same, which you say is not unusual. Trimmomatic hasn't produced a log file, so I cannot say how many number of reads are removed by trimming. But the number of read alignments are the same.

ADD REPLYlink written 28 days ago by lakshmi9c10

It is also good to keep in mind the overall alignment parameters you are using.

A common artifact that I notice in RNA-seq experiments are coverage spikes. Overtrimming or undertrimming could both contribute to the formation of these artifacts. However, I would agree with ATpoint that it is better to be safe than sorry thus I would always recommend being conservative and removing adapter sequences that you know are present.

And in fact if you read the paper someone linked above you would note that the alignment rates only change by less than 10% in their case study with simulated reads. I am a big believer in that if your results change drastically after a making small change to your analysis (e.g. trimming vs not) then the signal probably isn't as strong as you think it is.

I would be extremely skeptically of genes that were present in a DGE analysis but disappeared after you trimmed your known adapters. Unless something in the logs suggests that there was a problem...

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by benformatics2.0k

My trimviz tool can trace the 'fate' of trimmed reads in your alignments, if that helps:

https://github.com/MonashBioinformaticsPlatform/trimviz

You can see the alignment fate of trimmed reads in trimviz FQ, by adding in the bam file and reference genome (in addition to the pre-trimmed and post-trimmed fastq files) and using 'diff' mode - it will then highlight differences between the trimmed part of the read and the genome flanking the trim point. If there are not many differences, then trimming to that extent was not necessary.

ADD REPLYlink written 4 weeks ago by stuart archer70
0
gravatar for genomax
6 weeks ago by
genomax92k
United States
genomax92k wrote:

low quality read trimming results in fewer number of genes.

That is a specific kind of trimming. How are you deciding on the Q score trim threshold? While it is true that generally stretches of adapter sequences tend to get low base qualities, they should be handled easily by a read trimming program.

If you have data that is not of good quality (e.g. overloading of flowcell, bad libraries) then trimming it using quality as a criteria can indeed take out sample sequences and can affect the DE result.

ADD COMMENTlink written 6 weeks ago by genomax92k

Hi genomax, Sorry for the delayed response, I decided on Q score (Phred score) threshold as Q= 20. So any of the reads having quality below 20 will be trimmed. I chose this based the average Phred score of my reads for all my samples, which is about 30.

Also, I think my data is of good quality, i.e. the FASTQC report shows pass for all QC parameters except "Per Sequence GC content", "Adapter content". After trimming adapters using Illuminaclip in trimmomatic, only "Per Sequence GC content" shows failure.

I don't think trimming with Q score threshold actually affects my sample sequences, because I trimmed using different Q score threshold, and it all gave the same number of genes in the end. Number of reads after trimming are different, but number of genes are same. What is your thought on this?

ADD REPLYlink modified 12 days ago • written 12 days ago by lakshmi9c10

If you got a result that makes sense then all is well. I assume your samples are of good quality and thus trimming did not make a difference in final result.

ADD REPLYlink written 12 days ago by genomax92k

Okay that makes sense. Could you also tell me what trimming options would be the best? Like, I chose Q score threshold as 20, but even Q=30 or Q=15 gives the same result. So how do I know which option to stick with?

ADD REPLYlink written 11 days ago by lakshmi9c10

If you are aligning to a good reference (and overall the QC looks fine) generally quality score based trimming is not needed. You could trim at Q15 to be conservative.

ADD REPLYlink written 11 days ago by genomax92k

Thanks! I will try that.

ADD REPLYlink written 11 days ago by lakshmi9c10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1933 users visited in the last hour