MicroRNA Adapter Trimming
10.4 years ago
xiaojuhu13 ▴ 150

I have several microRNA datasets (Illumina sequencing) that need adapter trimming before further analysis. After running

cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCA -e 0.1 -O 5 -m 15 -o sheep_48_1_trim.fastq sheep_48_1_extract.fastq

many reads are still much longer than the expected 20-23 nt. Should I go through the Overrepresented sequences in the FastQC report, compare them against miRBase to exclude genuine microRNA sequences, and then remove what is left as the real adapter? That would be a huge job. Or is this the wrong way to trim microRNA data? These are the Overrepresented sequences from the FastQC report:

>>Overrepresented sequences    fail
#Sequence    Count    Percentage    Possible Source
TACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA    1415391    4.200865697255984    RNA PCR Primer, Index 1 (
TTCAAGTAATCCAGGATAGGCTTGGAATTCTCGGGTGCCAAGGAACTCCA    1052074    3.1225446378950354    RNA PCR Primer, Index 1 (
GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCTTGGAATTCTCGGGTGCC    827120    2.4548835166497236    No Hit
AACATTCAACGCTGTCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC    796866    2.365089960802059    RNA PCR Primer, Index 1 (
TGCCTATGCTGAAACCCAGAGGCTGTTTCTGAGCTGGAATTCTCGGGTGC    499804    1.4834130490806638    No Hit
AACATTCAACGCTGTCGGTGAGTGGAATTCTCGGGTGCCAAGGAACTCCA    460851    1.3678009521369836    RNA PCR Primer, Index 1 (
TGAGATGAAGCACTGTAGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGT    423765    1.2577300916832748    RNA PCR Primer, Index 1 (
TATTGCACTTGTCCCGGCCTGTTGGAATTCTCGGGTGCCAAGGAACTCCA    419685    1.2456206943190098    RNA PCR Primer, Index 1 (
TGAGGTAGTAGGTTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA    414839    1.2312378169593952    RNA PCR Primer, Index 1 (
TACCCTGTAGAACCGAATTTGTGTGGAATTCTCGGGTGCCAAGGAACTCC    378938    1.1246840241225131    RNA PCR Primer, Index 1 (
TGAGATGAAGCACTGTAGCTCTGGAATTCTCGGGTGCCAAGGAACTCCAG    341312    1.013010449311769    RNA PCR Primer, Index 1 (
TGTCTGAGCGTCGCTTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTT    298504    0.8859567526525888    RNA PCR Primer, Index 1 (
TGAGGTAGTAGATTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA    294079    0.8728233988935513    RNA PCR Primer, Index 1 (
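
One way to see what these overrepresented sequences are made of is to look for the adapter inside each one and measure what comes before it. The following is only a rough sketch: it uses an exact match on the first 12 nt of the adapter (the seed length is an arbitrary choice for illustration, not a cutadapt setting), and the three sequences are copied from the table above.

```python
# Adapter passed to cutadapt in the question.
ADAPTER = "TGGAATTCTCGGGTGCCAAGGAACTCCA"
SEED_LEN = 12  # assumption: arbitrary seed length chosen for this illustration


def insert_length(read, adapter=ADAPTER, seed_len=SEED_LEN):
    """Return the number of bases before the first exact seed match, or None."""
    pos = read.find(adapter[:seed_len])
    return pos if pos >= 0 else None


# Three rows from the FastQC "Overrepresented sequences" table.
overrepresented = [
    "TACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAACTCCA",
    "TGAGGTAGTAGGTTGTATAGTTTGGAATTCTCGGGTGCCAAGGAACTCCA",
    "GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCTTGGAATTCTCGGGTGCC",
]

lengths = [insert_length(seq) for seq in overrepresented]
for seq, n in zip(overrepresented, lengths):
    print(n, seq[:n] if n is not None else seq)
```

If the part in front of the adapter is about 20-23 nt, the entry is likely a small RNA with the adapter still attached rather than a new adapter. A longer prefix may simply be a longer small RNA fragment.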

If I understand you correctly, you just want to get rid of the adapter sequences and keep the miRNAs. AFAIK miRNAs are approximately 19-24 nt long. Assuming you did 100 bp single-end Illumina sequencing, roughly 76-81 bp of every read should be adapter or nonsense sequence. What you can do is eyeball some of your reads and look for the position where the actual miRNA starts (nucleotide diversity should increase). Trim all the reads at that position. That way you should get rid of most of your adapters.
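
The "eyeball the column where diversity changes" idea can be approximated with a per-column Shannon entropy. This is a minimal sketch, not an established tool: the four reads are made up for illustration (a 4-nt variable part followed by the first bases of the adapter), and a sharp boundary like this only appears when the variable parts all have the same length, which is an assumption here.

```python
import math
from collections import Counter


def column_entropy(reads):
    """Shannon entropy (bits) of the base composition at each read position.

    Positions beyond the shortest read are ignored, because zip() stops there.
    """
    entropies = []
    for column in zip(*reads):
        counts = Counter(column)
        total = len(column)
        entropies.append(
            sum(-(c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


# Hypothetical toy reads: 4 variable bases, then a shared adapter-like suffix.
toy_reads = [
    "ACGTTGGAATTC",
    "CATGTGGAATTC",
    "GTCATGGAATTC",
    "TGACTGGAATTC",
]

entropy = column_entropy(toy_reads)
for position, value in enumerate(entropy):
    print(position, round(value, 2))
```

High values mark positions where the reads differ from each other; values near zero mark positions where every read carries the same base, as expected inside an adapter.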

Thanks. I checked the overrepresented sequences and the Illumina adapter (TGGAATTCTCGGGTGCCAAGGAACTCCA). After removing these sequences, too many reads are still longer than 19-24 nt. I also searched the miRBase database, and the adapter sequences above show up in some mature miRNAs such as bfl-miR-182b-3p. What should I do next?

Have you checked both ends?

What does "both ends" mean?

I am removing the adapters I find in the Overrepresented sequences, but too many reads are still long, so I run FastQC again to find more adapters. To check whether they are real adapters, I copy the sequences I find into the miRBase database search. To get all the adapters, I have already done five rounds of FastQC.

The problem with FastQC is that it only checks the first 200k sequences, so you will end up with I don't know how many iterations of FastQC. What I mean is: actually look at your sequences in a text editor (on a *NIX system, e.g. use 'less myseq.fastq') and look for adapters manually. There will be a point in the sequences (if you write them line by line, this is one specific column) where nucleotide diversity goes up. That is where your actual miRNA starts.

I did the analysis your way and looked at the sequences. They look like the following:

GGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGCTGGAATTCTCGGGGTCCAAGGAACGCCAGTCACTTAGGCATATCGTATGCCGT
CATTGCACTTGTCTCGGTCTGATGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA
ACAGTAGTCTGCACATTGGTTAATGGAATTCTCGGGTGCCAAGGCACTCCAGTCGCTTAGGCATTTCGTATGCCGTCTTCTGCTTGAAAAAAA
CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTTTAGAAATCTCGGGTGACAAGGAACTCCAGTCACTTAGGCATCTC
CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCTTGGAATCCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCAACTCG
CACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTAAGGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAA
ACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAAGTCCAGTCACGTAGGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA
TTCAAGTAATCCAGGATAGGCTTGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA
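
In these reads the adapter does not start at one fixed column, and some copies carry sequencing errors, so a scan that tolerates a few mismatches is one way to find where it begins in each read. The sketch below is a simplified stand-in, not how cutadapt works internally: it only counts substitutions (no insertions or deletions), only considers full-length adapter matches, and the limit of 2 mismatches is an assumption for illustration. The three reads are copied from the list above.

```python
ADAPTER = "TGGAATTCTCGGGTGCCAAGGAACTCCA"


def find_adapter(read, adapter=ADAPTER, max_mismatches=2):
    """Return the first offset where the whole adapter matches the read
    with at most max_mismatches substitutions, or None if there is none."""
    for start in range(len(read) - len(adapter) + 1):
        window = read[start:start + len(adapter)]
        mismatches = sum(a != b for a, b in zip(adapter, window))
        if mismatches <= max_mismatches:
            return start
    return None


reads = [
    "CATTGCACTTGTCTCGGTCTGATGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA",
    "CACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTAAGGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAA",
    "ACCCTGTAGAACCGAATTTGTTGGAATTCTCGGGTGCCAAGGAAGTCCAGTCACGTAGGCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAA",
]

offsets = [find_adapter(read) for read in reads]
for read, offset in zip(reads, offsets):
    # Everything before the offset is the candidate small RNA insert.
    print(offset, read[:offset] if offset is not None else read)
```

The offset is also the insert length, so the spread of offsets across reads shows directly why trimming every read at a single column would cut some inserts short and leave adapter on others.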

The part that the reads have in common is GGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA, which is very long. If I use cutadapt, should I set a comparatively high mismatch tolerance (the -e value)?
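
For a feel of what a given -e value means: cutadapt documents the number of allowed errors as the error rate multiplied by the length of the matching part of the adapter, rounded down. The small sketch below just does that arithmetic. The function name is made up for illustration, and the overlap lengths in the loop are arbitrary examples.

```python
import math

SHORT_ADAPTER = "TGGAATTCTCGGGTGCCAAGGAACTCCA"
LONG_ADAPTER = (
    "GGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA"
)


def allowed_errors(match_length, error_rate):
    """Errors tolerated for a match of this length: floor(length * rate)."""
    return math.floor(match_length * error_rate)


short_full = allowed_errors(len(SHORT_ADAPTER), 0.1)
long_full = allowed_errors(len(LONG_ADAPTER), 0.2)
print("28-nt adapter, -e 0.1:", short_full)
print("long adapter, -e 0.2:", long_full)

# Partial matches at the 3' end of a read are shorter, so fewer errors are allowed.
for overlap in (5, 10, 20):
    print(overlap, allowed_errors(overlap, 0.1), allowed_errors(overlap, 0.2))
```

With -O 5, raising -e from 0.1 to 0.2 already permits one error in a 5-nt overlap, so short chance matches at read ends become more likely to be trimmed.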

yep, give it a try

Yeah, I used the command

cutadapt -a TGGAATTCTCGGGTGCCAAGGAGCTCCATTCAGTTAGGCATCGCGTATGCCGTCTTCTGCTTGAAAAAAAA -e 0.2 -O 5 -m 15 -o sheep_48_1_re1.fastq sheep_48_1.fastq

and then looked at the trimmed reads:

[huxj@LoginNode raw]$ grep ^[ACTGN] sheep_48_1_re1.fastq | head -n 100
GGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGC
CATTGCACTTGTCTCGGTCTGA
CCCFFFFFHHHDHIJJJIIJJB
ACAGTAGTCTGCACATTGGTTAA
CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCT
CACCACGTTCCCGTGG
CCCFFFFFHHHHHIJJ
ACCCTGTAGAACCGAATTTGT
CCCFDFFFGHHHHJGGHIJGI
TTCAAGTAATCCAGGATAGGCT
CCCFFFFFHHHHGJJJJFIIJI
AACATTCAACGCTGTCGGTGAGTTT
ATCCCGGACGAGCCCCCA
GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT
GCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC
CCCFFFFFHHHHHJJJIIIJJJJJJJJIJJJJG
TACCCTGTAGAACCGAATTTGT
CACGCGCACCAACCTCACGGGGCTCATTCTCAGCACGGCTG

Yeah, some reads are still long, concentrated around 32-34 nt.
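
A pattern like ^[ACTGN] also matches quality lines that happen to start with C, which is why strings such as CCCFFFFF... show up among the reads. To get the length distribution it may be safer to take the second line of every four-line FASTQ record. This is a minimal sketch that assumes a plain, unwrapped four-line-per-record FASTQ; the record names @r1 and @r2 in the demo are hypothetical, while the sequence and quality strings are taken from the output above.

```python
from collections import Counter


def read_length_histogram(lines):
    """Count read lengths, using only the sequence line of each 4-line record."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 1:  # lines 2, 6, 10, ... hold the sequence
            counts[len(line.strip())] += 1
    return counts


# Tiny in-memory demo; record names are made up for illustration.
demo_fastq = [
    "@r1", "CATTGCACTTGTCTCGGTCTGA", "+", "CCCFFFFFHHHDHIJJJIIJJB",
    "@r2", "CACCACGTTCCCGTGG", "+", "CCCFFFFFHHHHHIJJ",
]

histogram = read_length_histogram(demo_fastq)
for length in sorted(histogram):
    print(length, histogram[length])

# On a real file the same function can be fed an open file handle, e.g.:
#   with open("sheep_48_1_re1.fastq") as handle:
#       histogram = read_length_histogram(handle)
```

Sorting the histogram by length makes a peak such as the one at 32-34 nt easy to spot and to quantify against the 20-23 nt peak.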

[huxj@LoginNode raw]$ grep ^[ACTGN] sheep_48_1_re1.fastq | head -n 100
GGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCACGC
CATTGCACTTGTCTCGGTCTGA
ACAGTAGTCTGCACATTGGTTAA
CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT
CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCT
CACCACGTTCCCGTGG
ACCCTGTAGAACCGAATTTGT
TTCAAGTAATCCAGGATAGGCT
AACATTCAACGCTGTCGGTGAGTTT
ATCCCGGACGAGCCCCCA
GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT
GCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC
TACCCTGTAGAACCGAATTTGT
CACGCGCACCAACCTCACGGGGCTCATTCTCAGCACGGCTG

Yeah, some reads are still long, concentrated around 32-34 nt.

On those reads (which are now much shorter), try using Jellyfish to identify overrepresented k-mers. After that you should end up with your desired sequences.
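
To illustrate the k-mer idea without installing anything, here is a small pure-Python counter. It is only a stand-in for the concept and does not use or imitate Jellyfish itself; the value k=8 and the choice of the three long trimmed reads from this thread are assumptions for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Three of the longer reads that remained after trimming.
long_reads = [
    "CTTGCGGCCCCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTCGCTT",
    "CTTGCGGCACCGGGTTCCTCCCGGGGCTACGCCTGTCTGAGCGTGGCT",
    "GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT",
]

kmer_counts = count_kmers(long_reads, k=8)
for kmer, count in kmer_counts.most_common(5):
    print(kmer, count)
```

K-mers shared by many unrelated reads would hint at leftover adapter or primer sequence, whereas k-mers that only recur within near-identical reads more likely come from one abundant RNA.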

I think that in some miRNA experiments you might REALLY have a huge peak around 32 nt, i.e. it's biology, not an artifact. However, I am not 100% sure. Have a look at papers published with miRNA NGS data and see whether peaks around 32 nt were observed...

Yeah, there is a peak at 32-33 nt. Should I remove those reads as well, or keep them?
