Quality and length trimming for de novo assembly of RNAseq
2
0
Entering edit mode
4.3 years ago
User 4014 ▴ 40

HI folks

Sorry for posting a very basic question.

I have data for 2x150 bp from NovoSeq that I would like to do de novo transcriptome assembly. My understanding is that we should avoid hard trimming and implement length filtering. Some studies even say Q5 with 36 bp length cut-off.

I however look around new publications published in 2019 and they are still using Q20. May I have your suggestions what is best? Also the triming length what is the cut-off length I should use since I have longer reads when the previous studies are mostly 100 bp?

Thanks in advance!

RNA-Seq rna-seq • 2.0k views
ADD COMMENT
0
Entering edit mode

My best suggestion is first trim the adapters then use FastQC to look it after excluding the adapters, If the main things like sequence length and and Sequence quality score and N content were fine, you can proceed your analyze. Dont try to Trim too much your data because you will have the false and pointless results.

ADD REPLY
0
Entering edit mode
4.3 years ago
GenoMax 141k

My understanding is that we should avoid hard trimming and implement length filtering. Some studies even say Q5 with 36 bp length cut-off.

Please don't do that. You should make sure there is no extraneous sequence (adapters etc) are present. You don't want them in your data. In general most well made new libraries should have average Q scores well above Q25 on a NovaSeq run. if you have data that is less than Q20 it is more than likely adapter contamination because of short (or no) inserts. That will generally be taken care of when you scan/trim the data. You can use bbduk.sh from BBMap suite. A guide is available here.

Also the triming length what is the cut-off length I should use since I have longer reads when the previous studies are mostly 100 bp?

Having longer reads should be good for assemblies. Don't do any trimming since you will throw away good data.

ADD COMMENT
0
Entering edit mode

Thank you very much! I will redo the trimming again. I did adapter removal, but went with quality trimming at q10.

Just for my curiosity, I understand that short reads (<20 bp) are likely to map to multiple genes and we cant use them for differential gene expression. Isn't it easier if we remove them out at this step (if there is any)? Or we can count them in for all genes they may map to?

ADD REPLY
0
Entering edit mode

When you are creating the assembly you want to keep a comprehensive representation of the data. If you do end up with reads that are <20 bp after trimming then you could eliminate them. That said, having data with majority of short reads would make me question the quality of the library and thus the assembly.

ADD REPLY
0
Entering edit mode

Sorry for getting back to you constantly. May I also have your opinion about "n" bases, please? I have some reads with n bases in my data, which I am surprised that they ain't removed by quality trimming (q20 with bbduk). Is it better to discard those reads? If I remove them, I will lose around 7M from a total of 150M. Otherwise, permitting a maximum of 3 n bases would reduce the loss to 5M. Att the moment, my command looks like this:

bbduk.sh -Xmx16g in1=unfixrm_102640_R1.cor.fq.gz in2=unfixrm_102640_R2.cor.fq.gz out1=bbduk_unfixrm_102640_R1.cor.fq.gz out2=bbduk_unfixrm_102640_R2.cor.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo trimq=20 qtrim=rl maxns=3 minlen=36 trimpolya=10 trimpolyg=10

Thank you very much!

ADD REPLY
0
Entering edit mode

Reads with N's? That is another red flag. Are these N's in certain cycles and are contiguous or distributed? There should be no reads with N's in a normal run. It is plausible that there was a problem with this run and the sequencing provider still chose to pass the flawed data along to you or your libraries are not good quality. You should ideally remove reads with any N's since they are going to add problems for assembler. With 140+M clean reads you should still have plenty of data to assemble your transcriptome.

ADD REPLY
0
Entering edit mode

I glimpsed at some of them and they are mostly on 2 cycles, 2 and 34. The 2nd cycle can be found on both reads, but the 34th cycle ones are on only R2. The remaining bases still have very high quality, though. The facility I used they used NEBNext Ultra II for a while, but they just got NovaSeq. Perhaps they need some time to learn more about the machine.

At the moment, I am thinking about to use BBDuk with a flag 'maxns=0' and 'outm=n_reads.fq.gz' to get reads with n bases. And if I trim from 5' to base 35, I guess I can still reuse them?

ADD REPLY
0
Entering edit mode

Sounds like a plan. Let us know what happens.

ADD REPLY
0
Entering edit mode

Thank you for all your suggestions! I will keep you posted! :)

ADD REPLY
0
Entering edit mode
4.3 years ago
Shyam ▴ 150

First of all, use FastQC to look at the quality of Data. FastQC website has examples of good and bad data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ . Only trim the 3' side of the read (right hand side) if the data is bad. Check if there is adapter contamination. You can use tools like Trimmomatic or as @genomax mentioned, bbduk to trim the adapters and the low quality ends. Remove any read with average quality less than cutoff score (Q25 is a good threshold, I personally don't go below Q25, especially for a de novo assembly). If you don't trim the bad quality bases, they might create issues in assembly. The read coverage is not uniform in transcriptome assembly, and the quality errors might create shorter contigs because of bubbles in the graph. You can implement a length filter after all the trimming is done to remove any reads that lost a major part of the sequence. You can choose based on the data.

ADD COMMENT
0
Entering edit mode

Thank you very much! I will redo trimming again! :)

ADD REPLY

Login before adding your answer.

Traffic: 1378 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6