Question: Removing overrepresented sequences from RNA-seq
4 months ago by
anna10 wrote:

I have a lot of paired-end RNA-seq data of very good quality, but some of the files have many overrepresented sequences that are not adapters. I ran BLAST on these sequences: some of them did not match anything, and others seem to be rRNA. I understand that opinions are divided: some people say it is better to remove overrepresented sequences, and others say there is no need to. This time I decided to remove them with cutadapt, because the overrepresented sequences vary from one file to another. But after removing them, the FastQC basic statistics of these files changed (sequence length 1-150) and NEW overrepresented sequences appeared (I wasn't expecting to get more than the initial ones). I think I may have made a mistake with cutadapt and want to try Trimmomatic, but I can't find in the manual an option to specify the sequence that I want to remove from a specific file (my impression is that Trimmomatic can only remove adapters that the software recognizes). Can anyone give me advice about what to do in order to proceed with the (de novo) assembly?

modified 4 months ago by h.mon30k • written 4 months ago by anna10

As long as you clean adapters (even that is not strictly necessary) you should be able to align your data and move forward. If you do have rRNA contamination (see if it is severe and/or variable among samples) then you would need to check on that to be sure that it is worth going forward with the analysis.

Can anyone give me advice about what to do in order to proceed with the assembly?

If you are going to de novo assemble the data, then just make sure it does not contain any extraneous sequence that should not be there in the first place (e.g. adapters).

written 4 months ago by genomax85k

Personally, I would not remove these overrepresented sequences for the reasons @h.mon explained below. And the fact that you observed "new" overrepresented sequences after removing the original ones means that some sequences will always be overrepresented with respect to others, again because of the reasons explained below.
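
To illustrate why "new" overrepresented sequences show up after the first ones are removed, here is a minimal sketch with toy data. It assumes a sequence counts as overrepresented when its share of all reads exceeds a threshold (FastQC uses a similar share-based rule; the 0.2 threshold, the sequences and the function name below are made up for the demonstration and are not FastQC's code).

```python
from collections import Counter
from itertools import islice, product


def overrepresented(reads, threshold):
    """Return the sequences whose share of all reads exceeds `threshold`."""
    counts = Counter(reads)
    total = len(reads)
    return {seq for seq, n in counts.items() if n / total > threshold}


# Toy library: one highly expressed transcript, one moderately expressed
# transcript, and 35 unique background reads (all sequences are hypothetical).
HIGH = "ACGTACGTAC"
MID = "GGCCGGCCGG"
background = ["TTTT" + "".join(p) for p in islice(product("ACGT", repeat=3), 35)]
reads = [HIGH] * 50 + [MID] * 15 + background

# First pass: only HIGH is above the (toy) threshold of 20%.
first_pass = overrepresented(reads, threshold=0.2)

# Remove the flagged reads, then check again: the total shrank, so MID
# (15 of the remaining 50 reads) now crosses the same threshold.
filtered = [r for r in reads if r not in first_pass]
second_pass = overrepresented(filtered, threshold=0.2)

print(first_pass)
print(second_pass)
```

Because the measure is relative to the total number of reads, removing the top sequences simply promotes the next most abundant ones.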

Having said that, if you really need to remove known/custom sequences from your FASTQ files and would like to use Trimmomatic for this, you would need to create a multi-FASTA file and refer to this file when calling Trimmomatic with the ILLUMINACLIP step:

java -jar trimmomatic-0.35.jar PE -phred33 ... ILLUMINACLIP:custom-fasta-file.fa:X:X:X ...

The example above assumes that your file, custom-fasta-file.fa, is placed under the adapters directory, which itself is within the original Trimmomatic-X.XX directory. Please remember that this is a crude workaround and would only work for sequences at the beginning (5') of your reads.
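
As a rough sketch of how such a multi-FASTA file could be put together, the snippet below turns a list of sequences into FASTA text. The function name, the "overrep" record prefix and the two example sequences are hypothetical placeholders; substitute the sequences reported for your own files.

```python
def to_multifasta(sequences, prefix="overrep"):
    """Format sequences as multi-FASTA text, one numbered record per sequence."""
    records = []
    for i, seq in enumerate(sequences, start=1):
        # Header line starts with '>', followed by the sequence on its own line.
        records.append(f">{prefix}_{i}\n{seq.strip().upper()}\n")
    return "".join(records)


# Hypothetical overrepresented sequences copied from a FastQC report.
custom = ["acgtacgtacgtacgtacgt", "GGCCGGCCGGCCGGCCGGCC"]
fasta_text = to_multifasta(custom)
print(fasta_text, end="")
```

The resulting text can be saved as custom-fasta-file.fa and passed to ILLUMINACLIP as in the command above.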

written 4 months ago by Haci370

Thank you so much for your help!

written 4 months ago by anna10
4 months ago by
h.mon30k wrote:

RNA-seq data will always contain over-represented sequences, because certain genes are highly expressed and thus produce over-represented reads. If you remove these sequences, you will be removing genes, and your assembly will be less complete and/or more fragmented. Except for adapters, one should not remove any sequences before assembly. You may (and this is the Trinity default, for example) perform digital normalization prior to assembly, to reduce memory usage and run time.
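
For readers unfamiliar with digital normalization, here is a minimal sketch of the general idea: a read is kept only while the median abundance of its k-mers, among reads kept so far, is below a cutoff. This is a simplified illustration with exact in-memory counts and toy parameters (k=5, cutoff=3, made-up sequences); it is not Trinity's implementation, and the function names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=5, cutoff=3):
    """Keep a read only if the median count of its k-mers is below `cutoff`."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to evaluate
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy input: 20 copies of a highly expressed read, 2 of a rare one.
HIGH = "ACGTTGCAAGGCT"
LOW = "TTGACCGATACGG"
kept = digital_normalize([HIGH] * 20 + [LOW] * 2)
print(kept.count(HIGH), kept.count(LOW))
```

The abundant read is down-sampled while the rare one is retained in full, so no transcript is removed outright; only redundant coverage is reduced.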

written 4 months ago by h.mon30k

Thanks a lot! Now I can move forward.

written 4 months ago by anna10