Finding Adapter Information in Small RNA-Seq Data
4
2
11.3 years ago
Houkto ▴ 220

Hi there,

Recently I received three miRNA-seq datasets (human, rat and mouse; Solexa data) without any adapter information. We are trying to contact the people who sequenced the data to get this information, but a reply will take some time. As I want to proceed with the analysis, is there a way to find the 3' adapter (essential) and the 5' adapter (optional) in the small RNA reads so that I can start trimming the data?

Thanks

adaptor trimming • 12k views
ADD COMMENT
0

Hi, I have a question.

>TR4|c0_g1_i1 len=243 path=[221:0-242] [-1, 221, -2]
AGAAACAAACACAAAAGTAGGATCAAGCCTTGACCATCAGAGACAGGAATTAGGACTCCC
AGAACAGCTGAGTTCTACGGGTAAACAATTGATCTGCTCCCCACCTCCAGAACCCAAAGC
TCATCTTTCCCTGGCCGAGGCTCCACCTGCCCTGTTACTGGACTCTCCACAGGGGCCTGA
GCAAAATGATGTCCGGGATCAGACAGATGGACCTTGCAAATCAATATCAGAAAGTGAAAC
CAT
>TR5|c1_g1_i1 len=248 path=[487:0-146 488:147-179 489:180-247] [-1, 487, 488, 489, -2]
ATATCCCAGTCGCTGTTTTTTTTCTTTTCTTTTTTTTACATTTAGTCACTTATGTACAAT
TAAAGTTATGTTACTATCTTTTCATTGTCCTTTATTTAAATAAATCTTCTGGTCCTGTAA
TAGAAATAATGTACAGTCTAAGTACTTATGAACTATCTTTATCACAGTATTATTTATTGC
TTTCATTTCAATAAAATACTGAAACTTATTTTCCACTGCCAATAAAAATGTCTTTAAGAA
CAAAAAAA...

This is a huge FASTA file, and I want to extract all the sequences whose ID contains c0_g1_i1. Can you tell me how? It may be simple, but I am new to this. I do not

ADD REPLY
0

This question is unrelated to the original post so you should start a new thread/post and then come back and delete this post.

Asking unrelated or new questions through the Submit Answers option on an existing thread is not going to get you the answers you need.

ADD REPLY
3
11.3 years ago
Ryan Dale 5.0k

Have you tried FastQC? Adapter dimers might show up in the overrepresented sequences section of the report, and the overrepresented k-mer section may give you some additional clues.

In the FastQC source, FastQC/Contaminants/contaminant_list.txt has a convenient list of adapters that you can use as a starting point for further analysis.
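
As a rough illustration of how such a list could be used, here is a minimal Python sketch that parses lines in a "name, tab(s), sequence" layout and counts how many reads contain the start of each listed sequence. It assumes the contaminant file is laid out that way with '#' comment lines. The function names, the entry names in the demo, and the 12-base seed length are made up for illustration and are not part of FastQC.

```python
from collections import Counter


def parse_contaminants(lines):
    """Parse 'name<TAB(s)>sequence' lines; blank lines and '#' comments are skipped."""
    contaminants = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last tab so that runs of tabs between columns are tolerated.
        name, _, seq = line.rpartition("\t")
        name = name.strip()
        if name and seq:
            contaminants[name] = seq.upper()
    return contaminants


def count_adapter_hits(reads, contaminants, seed_len=12):
    """Count reads that contain the first seed_len bases of each contaminant."""
    hits = Counter()
    for read in reads:
        for name, seq in contaminants.items():
            if seq[:seed_len] in read:
                hits[name] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical two-entry list; a real run would read the lines from the file.
    demo_list = [
        "# name<TAB>sequence",
        "Example adapter A\t\tATCTCGTATGCCGTCTTCTGCTTG",
        "Example adapter B\tTGGAATTCTCGGGTGCCAAGG",
    ]
    demo_reads = ["TGTAAACATCCTCGACTGGAAGCATCTCGTATGCCG"]
    for name, n in count_adapter_hits(demo_reads, parse_contaminants(demo_list)).items():
        print(name, n)
```

A short seed is used because a 36 nt read that starts with a ~22 nt small RNA only has room for the first dozen or so adapter bases.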

ADD COMMENT
0

My pipeline does not produce such files, sorry.

ADD REPLY
0

What do you mean by your pipeline doesn't produce such files? You can download FastQC and check for the known adapters as @Daler mentioned. Or you can just provide it as input to FastQC and it gives you back a report. You can check out this report (html format) as to whether there are over-represented sequences. And if they match known Illumina adapters, FastQC also marks them as such.

ADD REPLY
0

This is really much better than my answer.

ADD REPLY
0

For microRNA-seq data the FastQC approach works quite well, but I would like to add two notes: 1) Make sure that the k-mer you found is not a microRNA (some microRNAs, e.g. let-7, are so over-represented in these experiments that the k-mer search picks them up), and 2) use the original (complete) adapter for clipping, not only the k-mer you found. To find the adapter that was originally used, you can search at http://www.ecseq.com/IlluminaPrimer.html or download the 'Illumina adapter sequences letter' (https://shell.cgrb.oregonstate.edu/sites/default/files/Files/Docs/Illumina/misc/2011-10-11-Illumina-Customer-Sequence-Letter.pdf).

ADD REPLY
3
10.7 years ago
Mike Axtell ▴ 250

I have a cheap little Perl script called find_3p_adapter.pl that I wrote for precisely this purpose. It takes the sequence of a microRNA that you know must be present at fairly high levels in your sample, along with the raw, untrimmed FASTQ file. For instance, in human brain, miR-124 would be a good bet. The script searches for all occurrences of the miRNA query sequence and tracks the 'suffix' (or suffixes) that follows it. That will tell you the adapter sequence.

Here's an example with some plant data (so I've used miR156 as the query, b/c it's pretty abundant in most plant tissues):

Algonquin:raw michaelaxtell$ gzip -d -c Apr1A_R1.fastq.gz | ./find_3p_adapter.pl -m ugacagaagagagugagcac
./find_3p_adapter.pl version 0.2
Thu Aug  8 16:12:12 EDT 2013
directory: /Users/michaelaxtell/data/sRNAseq_data/Physcomitrella_patens/HiSeq2500_Apr22_2013/raw
Query sequence: TGACAGAAGAGAGTGAGCAC

    Searching...Done

20754 out of 23706986 reads matched query TGACAGAAGAGAGTGAGCAC (875 reads per million)

Here are the top four adapters found:
Sequence    Frequency
TGGAATTCTCGGGTGCCAAGGAACTCCAGT    17771    85.627 %
CTGGAATTCTCGGGTGCCAAGGAACTCCAG    937    4.515 %
ATGGAATTCTCGGGTGCCAAGGAACTCCAG    911    4.390 %
TTGGAATTCTCGGGTGCCAAGGAACTCCAG    358    1.725 %

So in the example above, it's a good bet that the adapter starts with "TGGAATTCTCG".
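
The idea of tallying what follows a known, abundant miRNA can be sketched in a few lines of Python. This is not the find_3p_adapter.pl script itself, only a minimal re-implementation of the approach described above; the function names and the `suffix_len` parameter are hypothetical, and only exact matches to the query are considered.

```python
from collections import Counter
from itertools import islice


def read_fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for line in islice(lines, 1, None, 4):
        yield line.strip().upper()


def tally_suffixes(reads, query, suffix_len=30):
    """Count the sequences that follow the first exact match of query in each read."""
    query = query.upper().replace("U", "T")  # accept RNA or DNA spelling
    matched = 0
    suffixes = Counter()
    for read in reads:
        pos = read.find(query)
        if pos == -1:
            continue
        matched += 1
        end = pos + len(query)
        suffix = read[end:end + suffix_len]
        if suffix:
            suffixes[suffix] += 1
    return matched, suffixes


if __name__ == "__main__":
    # Toy reads built from the miR156 query and the adapter shown above.
    demo_reads = [
        "TGACAGAAGAGAGTGAGCACTGGAATTCTCGGGTGC",
        "TGACAGAAGAGAGTGAGCACTGGAATTCTCGGGTGC",
        "TGACAGAAGAGAGTGAGCACCTGGAATTCTCGGGTG",
    ]
    matched, suffixes = tally_suffixes(demo_reads, "ugacagaagagagugagcac", suffix_len=11)
    print(matched, "reads matched")
    for suffix, count in suffixes.most_common(4):
        print(suffix, count)
```

For a real gzipped file, the lines passed to `read_fastq_sequences` could come from `gzip.open(path, "rt")`.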

ADD COMMENT
1
11.3 years ago

You can just take a look at the SAM or FASTQ and see if you can identify a common sequence occurring in all the reads. The human brain is pretty great at pattern finding.

Print out reads for SAM:

samtools view -S file.sam | grep -o '[GATCN]\{30,300\}'

Print out reads for FASTQ:

grep -o '[GATCN]\{30,300\}' file.fastq
ADD COMMENT
0

Here is a section of the result. Any suggestions?


ATCACATTGCCAGGGATTAATCTCGTATGCCGTCTT
TACCCTGTAGATCCGAATTTGTATCTCGTATGCCGT
GTAAACATCCTTGACTGGAAGCTATCTCGTATGGCG
CTTTCAGTCGGATGTTTGCAGCATCTCGTATGCCGT
CGACTCTTAGCGGTGGATCACTCGGCTCGTGCGATC
AAGGGCTTGGCATCTCGTATGCCGTCTTCTGCTTGA
TACCCTGTAGATCCGAATTTGTGATCTCGTATGCCG
TGTAAACATCCTCGACTGGAACCATCTCGTATGCCG
TACCCTGTAGAACCGAATTTGTATCTCGTATGCCGT
TGAGATGAAGCACTGTAGCTATCTCGTATGCCGTCT
TGTAAACATCCTCGACTGGAAGCATCTCGTATGCCG
TGTAAACATCCTCGACTGGAAGCATCTCGTATGCCG
ACTCCATGATGAACACAATCTCGTATGCCGTCTTCT
CTGAGATGAAGCACTGTAGCTATCTCCTATGCCGCC
TACCCTGTAGAACCGAATTTGTATCTCGTATGCCGT
TACCCTGTAGATCCGAATTTGATCTCGTATGCCGTC
TGTAAACATCCTCGACTGGAAAATCTCGTATGCCGT
ADD REPLY
0

It doesn't look like you have any identifiable adapters, but you do have some over-represented sequences in this small sample. It would still be useful to run FastQC on these files.

ADD REPLY
0

Could you please break down this grep expression? How does it find common sequences?

ADD REPLY
1

The regular expression does not find common sequences. grep -o outputs only the matching strings. [GATCN] matches any one of G, A, T, C, or N. {30,300} repeats this match a minimum of 30 and a maximum of 300 times. These values correspond to the common range of read lengths you would observe with Solexa or Illumina data. The backslashes are just there to escape the curly braces.

This prints the entire reads from the SAM output. It's up to the human to determine whether the 3' ends contain any sequence representing an adapter.
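
For anyone who would rather let a script help with the pattern spotting, here is a minimal Python sketch. The first function applies the same regular expression as the grep command, and the second counts in how many reads each k-mer occurs. The k-mer counting step is an addition that is not part of the grep approach, and the function names and the default k of 12 are arbitrary choices for illustration.

```python
import re
from collections import Counter

# Same pattern as the grep call; Python needs no backslashes before the braces.
READ_PATTERN = re.compile(r"[GATCN]{30,300}")


def extract_reads(text):
    """Return every run of 30 to 300 G/A/T/C/N characters, like grep -o."""
    return READ_PATTERN.findall(text)


def kmer_read_counts(reads, k=12):
    """For each k-mer, count the number of reads that contain it at least once."""
    counts = Counter()
    for read in reads:
        # A set ensures a k-mer is counted once per read, even if it repeats.
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        counts.update(kmers)
    return counts


if __name__ == "__main__":
    demo_text = (
        "@read1\nTACCCTGTAGATCCGAATTTGTATCTCGTATGCCGT\n+\nIIII\n"
        "@read2\nCTTTCAGTCGGATGTTTGCAGCATCTCGTATGCCGT\n+\nIIII\n"
    )
    reads = extract_reads(demo_text)
    for kmer, n in kmer_read_counts(reads).most_common(5):
        print(kmer, n)
```

A k-mer that shows up in most reads is only a candidate: it can just as well come from a highly abundant miRNA, so it still needs to be compared against known adapter sequences.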

ADD REPLY
0

Ah yes, you're right, I misunderstood that this expression gets you the common sequences. Thanks a lot for the clarification.

ADD REPLY
0
11.3 years ago
Houkto ▴ 220

Thanks Daler, Matt Shirley and Arun for the help. I misunderstood the bit about FastQC (I thought my pipeline should produce QC output that I could check). I downloaded the FastQC tool and ran it on one of my FASTQ files; a screenshot of the result is here:

IMAGE

My sample was sequenced with 36 bp reads. I want to trim the 3' end, but I do not know which of the overrepresented sequences is the true adapter. Any suggestions?

Thanks again.

ADD COMMENT
0

It seems like you used the Small RNA v1.5 Sample Preparation kit. Try clipping the adapter sequence 'ATCTCGTATGCCGTCTTCTGCTTG'. To verify that it worked, check the length distribution of your reads after clipping; most reads should be around 24 nt long. To clip the adapter sequence, I would recommend cutadapt (cutadapt -e 0.15 -O 7 -m 15 -a ATCTCGTATGCCGTCTTCTGCTTG input.fastq -o input.clipped.fastq).
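
The length check can be done with a few lines of Python. This is a minimal sketch that assumes a plain four-line-per-record FASTQ file; the function names are made up for illustration.

```python
from collections import Counter
from itertools import islice


def read_length_distribution(fastq_lines):
    """Return a Counter mapping read length to number of reads."""
    lengths = Counter()
    # The sequence is the 2nd line of every 4-line FASTQ record.
    for seq in islice(fastq_lines, 1, None, 4):
        lengths[len(seq.strip())] += 1
    return lengths


def format_histogram(lengths, width=40):
    """Return one text row per read length, with a bar scaled to the largest count."""
    if not lengths:
        return []
    largest = max(lengths.values())
    rows = []
    for length in sorted(lengths):
        count = lengths[length]
        bar = "#" * round(width * count / largest)
        rows.append(f"{length:>3} {count:>10} {bar}")
    return rows


# Example usage on the clipped file from the cutadapt command above:
# with open("input.clipped.fastq") as handle:
#     print("\n".join(format_histogram(read_length_distribution(handle))))
```

If clipping worked, the histogram should show a clear peak in the low 20s rather than a single spike at the original read length.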

ADD REPLY
0

Thanks David. I suspected that this was the adapter and had already removed it; most of the reads are now around 22 to 23 nt long. However, I was not sure, so thanks for confirming it. I used a tool called Reaper to remove the adapter, but I would like to know what the cutadapt settings -e and -o do. Thanks again.

ADD REPLY
0

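-e is the maximum allowed error rate for the adapter-read alignment, and the lowercase -o is just the output file. The uppercase -O in the command above is the minimum overlap required between the adapter and the read, and -m discards reads that are shorter than the given length after clipping.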
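
To make the roles of the error rate and the minimum overlap more concrete, here is a simplified Python sketch of 3' adapter clipping. It is not how cutadapt is implemented: cutadapt aligns the adapter and also allows insertions and deletions, whereas this sketch only counts mismatches at each candidate position. The function names are hypothetical, and the defaults mirror the values in the command above.

```python
def clip_3p_adapter(read, adapter, max_error_rate=0.15, min_overlap=7):
    """Return read with the 3' adapter removed, or read unchanged if none is found."""
    for start in range(len(read) - min_overlap + 1):
        # The adapter may run off the end of the read, so compare only the overlap.
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break
        allowed = int(max_error_rate * overlap)  # mismatches tolerated at this overlap
        window = read[start:start + overlap]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= allowed:
            return read[:start]
    return read


def clip_reads(reads, adapter, min_length=15, **kwargs):
    """Clip every read and keep only those of at least min_length afterwards."""
    clipped = (clip_3p_adapter(read, adapter, **kwargs) for read in reads)
    return [read for read in clipped if len(read) >= min_length]


if __name__ == "__main__":
    adapter = "ATCTCGTATGCCGTCTTCTGCTTG"
    read = "TGTAAACATCCTCGACTGGAAGCATCTCGTATGCCG"  # one of the reads posted earlier
    print(clip_3p_adapter(read, adapter))
```

A higher error rate or a smaller minimum overlap finds more adapters but also clips more reads by chance, which is why short partial matches at the very end of a read are the usual source of false positives.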

ADD REPLY
0

Thanks a lot, I will give it a go.

ADD REPLY
