Question: read trimming for small RNA NGS data
ron12810 wrote:

Hello all. This seems to be a routinely discussed question with many answers around here; however, I could not use the answers provided in other questions to solve my problem. I have some miRNA-seq data from an Illumina HiSeq platform. That's about all the information I have. I have not been able to identify the vendor who did the sequencing, so approaching them is out of the question.

My problem is as follows: I have single-end sequencing reads of 54 bases in length, and I am trying to identify a good way to trim them. I have no idea what adapter to use for read trimming, so I have been stupidly looking at other posts on here trying to make sense of it. Long story short, as suggested in some posts, my FastQC overrepresented sequences output gives me these two sequences as the adapter sequences in one sample:

Sequence    Count    Percentage    Possible Source
AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG    189653    0.410031497    Illumina Small RNA Adapter 2 (100% over 21bp)
CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG    184505    0.398901475    Illumina Small RNA Adapter 2 (100% over 21bp)

and these three sequences as the adapter in a different sample:

Sequence    Count    Percentage    Possible Source
AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG    189653    0.410031497    Illumina Small RNA Adapter 2 (100% over 21bp)
CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG    184505    0.398901475    Illumina Small RNA Adapter 2 (100% over 21bp)
TTGCTGTGATGACTATCTTAGGACACCTTTGGAATTCTCGGGTGCCAAGG    50032    0.108169635    Illumina Small RNA Adapter 2 (100% over 21bp)

Now, these are two different samples run in different lanes. I do not know if the sequencing was pooled with an indexing adapter (although that is very likely, given that the total number of reads is small). After matching across these sequences, I have deduced that TGGAATTCTCGGGTGCCAAGG is my Illumina adapter sequence. The problem is that I cannot find any mention of this being an adapter sequence in any of Illumina's official documents on their FTP, other than this document: http://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/basespace/small-rna-v1-0-release-notes-15061994-a.pdf

Is this the correct sequence?
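The deduction described here (finding the stretch shared by the 3' ends of the overrepresented sequences) can be checked with a few lines of Python. This is only a minimal sketch: the function name `common_suffix` is made up for illustration, and the input is simply the three distinct sequences quoted in the question. A shared suffix could in principle include a few insert bases that agree by chance, so the result is a candidate, not proof.

```python
def common_suffix(seqs):
    """Return the longest suffix shared by every sequence in seqs."""
    if not seqs:
        return ""
    shortest = min(len(s) for s in seqs)
    n = 0
    # Walk backwards from the 3' end while all sequences agree.
    while n < shortest and len({s[-(n + 1)] for s in seqs}) == 1:
        n += 1
    return seqs[0][len(seqs[0]) - n:]


# The three distinct overrepresented sequences reported by FastQC above.
overrepresented = [
    "AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG",
    "CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG",
    "TTGCTGTGATGACTATCTTAGGACACCTTTGGAATTCTCGGGTGCCAAGG",
]

shared = common_suffix(overrepresented)
print(shared, len(shared))
```

For these three sequences the shared 3' end is the 21-base sequence TGGAATTCTCGGGTGCCAAGG, which matches the "100% over 21bp" hit that FastQC reports.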

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by ron12810

If they are vastly overrepresented, I just remove them, because what else could they be? I have found Trimmomatic to be a fast tool because it is multi-core enabled; you may have your own solution, however.

ADD REPLYlink written 2.9 years ago by chris86290

Thanks a ton, chris! I tried doing the same using Trimmomatic, with the following parameters: java -jar trimmomatic-0.33.jar SE -phred33 21A.fastq 21A_clipped.fastq ILLUMINACLIP:final_adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:18

Doing this gives me an output which still has a length distribution peak at 50 bases, which implies that the majority of my reads have not really been trimmed. Am I doing something really, really stupid here? Should I just not have been born in this world? Many thanks for taking the time out and helping a fellow distressed soul!

ADD REPLYlink written 2.9 years ago by ron12810

It depends on the quality and how you set the sliding window. You will have to read the manual and play around until you get it to work. I remember it taking a while for me, but once I did it was nice and fast.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by chris86290

TGGAATTCTCGG is the start of the standard Illumina small RNA-seq adapter. If you just give that to Trimmomatic (or run "Trim Galore!" with the --small_rna option) then you should be fine.
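To make concrete what clipping with that adapter start means, here is a minimal Python sketch. It is not how Trimmomatic, Trim Galore or cutadapt are implemented: the function `trim_3prime_adapter` and its `min_overlap` parameter are hypothetical, and it uses exact string matching only, whereas the real tools tolerate mismatches and use quality information.

```python
def trim_3prime_adapter(read, adapter="TGGAATTCTCGG", min_overlap=3):
    """Clip an exact-match 3' adapter (and everything after it) from read."""
    # Case 1: the whole adapter seed occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only a
    # prefix of the adapter (at least min_overlap bases) is present.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adapter found: return the read unchanged.
    return read


# One of the overrepresented sequences from the question.
read = "AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG"
insert = trim_3prime_adapter(read)
print(insert, len(insert))
```

Applied to the overrepresented sequences in the question, this leaves a 29-base insert, so a correctly trimmed file should show far fewer reads near the original read length.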

ADD REPLYlink written 2.9 years ago by Devon Ryan90k

Hey Devon, thanks a lot for chiming in. I did try using Trim Galore and the sequences mentioned in its manual (which include the one that you suggested!). The problem is that it hardly trims any of my data, with peaks still at 50 bases! This is the same result that I get after using Trimmomatic. Am I looking at primer dimers here?

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by ron12810

What does the FastQC adapter contamination plot look like? There should be a huge jump up in the percentage ~50% of the way through if this is really smallRNAseq.
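A rough way to reproduce the idea behind that plot outside FastQC is to record where the adapter starts in each read and accumulate the percentages by position. The sketch below assumes the reads are already loaded as plain strings; the function names are made up for illustration, the two toy reads are not real data, and exact matching is a simplification of what FastQC does.

```python
from collections import Counter


def adapter_start_histogram(reads, adapter="TGGAATTCTCGG"):
    """Count, per 0-based position, how many reads have the adapter starting there."""
    counts = Counter()
    for read in reads:
        pos = read.find(adapter)
        if pos != -1:
            counts[pos] += 1
    return counts


def cumulative_percent(counts, n_reads, read_len):
    """Percentage of reads whose adapter starts at or before each position."""
    out, seen = [], 0
    for i in range(read_len):
        seen += counts.get(i, 0)
        out.append(100.0 * seen / n_reads)
    return out


# Toy input: one read with the adapter starting at position 20, one without.
toy_reads = ["ACGT" * 5 + "TGGAATTCTCGG", "ACGT" * 8]
hist = adapter_start_histogram(toy_reads)
curve = cumulative_percent(hist, len(toy_reads), 32)
print(hist)
print(curve[19], curve[20])
```

On real small RNA data, the jump in this curve should sit roughly where the inserts end and the adapter begins.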

ADD REPLYlink written 2.9 years ago by Devon Ryan90k

Here's the QC for Trimmomatic read trimming using the adapter:

https://s32.postimg.org/3zo4uuz8l/sequence_length_distribution.png

I tried something new. Using the default settings of CAP-miRSeq, which uses cutadapt for trimming, I gave it the same adapter. This is the post-trimming image for the sequence length distribution:

https://s31.postimg.org/l77gt8t5n/sequence_length_distribution.png

Now I am wondering what the peak at 33 signifies :/

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by ron12810

If you want Trimmomatic to trim more, just reduce the size of the sliding window and increase the quality required.

I remember spending a lot of time on QC; then I actually did the alignment and found it wasn't such a big deal. Usually the alignment algorithm takes quality scores into account, so unless you have something really weird going on with your data, you should be OK.
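For anyone unsure what the sliding-window setting does, the following Python sketch shows the general idea behind a window/threshold pair such as SLIDINGWINDOW:4:15. It is a simplified illustration, not Trimmomatic's actual code: the function `sliding_window_trim` is hypothetical, and the qualities are assumed to be already decoded to integer Phred scores.

```python
def sliding_window_trim(quals, window=4, min_avg=15):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window
    whose mean Phred quality falls below min_avg.
    """
    n = len(quals)
    if n < window:
        return n
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return n


# Toy read: ten good bases followed by four poor ones.
quals = [30] * 10 + [2] * 4
keep_default = sliding_window_trim(quals)              # window 4, mean >= 15
keep_strict = sliding_window_trim(quals, min_avg=20)   # stricter threshold
print(keep_default, keep_strict)
```

Raising the required quality makes the cut happen earlier in the read. Note that quality-based trimming of this kind would not, by itself, remove adapter bases that were sequenced with high quality.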

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by chris86290