Removing Illumina Adapters From RNA-Seq Data
12.3 years ago
Agatha ▴ 350

Hi,

I would like to remove the adapters from raw RNA-seq libraries. I have tried cutadapt (http://code.google.com/p/cutadapt/), which apparently should allow mismatches. However, when I specify the adapter to be cut as P-UCGUAUGCCGUCUUCUGCUUGUidT, as it was used by the sequencing machine, no sequence is trimmed. When I tried the default FASTX Galaxy dummy adapter, TGTAGGCC, more than 70,000 sequences were trimmed.

I have also tried the trimLRPattern function from Biostrings/Bioconductor, but I have the same issue as with cutadapt and I imagine I am not specifying the correct string to be clipped.

Also, I cannot do any data manipulation in Galaxy because the file (approx. 4.5 GB) has been loading for two days, so I need to find another solution.

What adapter substrings should be used when dealing with RNA-seq data (as opposed to the entire default Illumina adapters)?

Which is the best tool for this step in the quality control process?

Sample sequences from the unprocessed FASTQ file:

GTCTGTGATGAATTGCNTTGACTTCTGNNNNNNNNN

CGGACAGGATTGACAGNTTGATAGCTCNNNNNNNNN

AGTCTGTGATGAATTGNTTTGACTTCTNNNNNNNNN

CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN

Edit, for those reading this post:

I have used FAR successfully. It is easy to specify particular subsequences of the adapter, and it uses a pwa algorithm to score the best match in the read.

illumina rna adaptor next-gen sequencing fastq • 22k views

Please provide some example sequences from the data you are interested in that contain the adapter sequences you wish to remove.

@malachig: I have updated my question with the requested information.

12.3 years ago

There are probably many tools in addition to those that you list. How about the Flexible Adapter Remover, 'FAR'?

From the SourceForge description:

  • FAR is an ideal tool for preprocessing sequencing data
  • FAR removes adapter sequences from sequencing runs via global alignment (exact)
  • FAR can be used to demultiplex barcoded sequencing runs (Illumina sequencing runs)
  • FAR supports basic trimming of reads before/after global alignment
  • FAR supports colorspace and basepairspace sequencing data
  • FAR supports phred quality trimming
  • FAR runs in parallel on multiple CPUs and supports Linux/Windows 32 and 64 bit
  • FAR gives detailed reports (e.g. length distribution of the reads trimmed) in the output
  • FAR significantly improves mapping rates and genome/transcriptome assemblies

FAR allows mismatches (see --cut-off parameter) and identification of partial adapter sequences where you are just reading some amount of bases into the adapter at the end of your reads (see --min-overlap parameter).
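To make the two ideas above concrete (tolerating mismatches, and accepting a partial adapter at the 3' end of a read), here is a minimal Python sketch. It is not FAR's algorithm: FAR is described as using global alignment, whereas this sketch only counts substitutions at each candidate position. The function name and the `max_error_rate` and `min_overlap` parameters are hypothetical and only loosely mirror the --cut-off and --min-overlap options. The read is the fourth sample from the question, and the adapter is the question's oligo written as DNA (U replaced by T, modification labels removed).

```python
def find_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=5):
    """Return the index in `read` where the adapter appears to start, or None.

    Assumptions of this sketch: only substitutions are counted (no indels),
    and trailing N calls are ignored rather than treated as matches.
    """
    read = read.rstrip("N")  # uncalled trailing bases carry no information
    for start in range(len(read) - min_overlap + 1):
        # The adapter may run off the end of the read, so compare only
        # the part of it that fits.
        overlap = min(len(adapter), len(read) - start)
        mismatches = sum(
            1 for r, a in zip(read[start:start + overlap], adapter) if r != a
        )
        if mismatches <= int(max_error_rate * overlap):
            return start
    return None


adapter = "TCGTATGCCGTCTTCTGCTTGT"
read = "CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN"

start = find_3prime_adapter(read, adapter)
print(start)         # 19
print(read[:start])  # CAGGAACGGTGCACCANTC
```

Only the first 8 bases of the adapter are present in this read, which is why a minimum-overlap setting matters: set it too high and partial adapters are missed, too low and random matches get trimmed.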

Remember that it is also possible that the majority of your reads do not contain any adapter sequence at all. In many Illumina libraries, sequencing starts at the end of the adapter sequence and the first base of reported sequence is actual genome/transcriptome sequence. A variety of library types do not follow this pattern, however, and may contain adapter sequence that interferes with aligners that do not perform substring alignments (most next-gen sequence aligners).

Have you considered flipping this problem? Instead of searching for an unknown adapter in your reads, try aligning your reads to the genome/transcriptome without trimming anything. What proportion align? Take a small subset of reads and align them with a substring-capable aligner such as BLAST or BLAT. Do you see a pattern in the alignment? Does the entire read align, or do you get X bases at the beginning or end of the read that fail to align? If so, is there a pattern to the sequence that does not align? Does it look like a known Illumina adapter?

@malachig: I will try this tool as well. Do you have any suggestions regarding the adapter substring that should be used?

The idea with a trimmer like FAR is that you specify the complete adapter sequence. If it finds the whole thing, it will identify that; if only part of it is present, it should identify that as well. So you don't need to know the substring in advance: the tool will tell you what portion of the adapter is present at the end of each read.

12.3 years ago

It's not clear from your question, but perhaps you are not having any luck with adapter clipping software because you are mis-specifying the adapter sequence itself?

For example, you say that you are specifying the adapter as P-UCGUAUGCCGUCUUCUGCUUGUidT, but are you really including the P- prefix and the idT suffix? That would be your first problem; you should remove them.

Second: are there really Us in your adapter? Maybe you should be substituting Ts for them?

Third, you ask:

What adaptor substrings should be used when dealing with RNA seq data? (not the entire default Illumina adapters)

No one can really answer that question for you without knowing the details of the library prep. You'll have to talk to the people preparing the library to know for sure.
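As an illustration of the first two points, here is a small Python sketch that strips the P- and idT labels, replaces U with T, and assembles a cutadapt command line. The helper name and the file names `reads.fastq` and `trimmed.fastq` are hypothetical, and the error rate and minimum overlap are arbitrary starting values. The options used (-a for a 3' adapter, -e for the error rate, -O for the minimum overlap, -o for the output file) are, to my knowledge, standard cutadapt options, but check `cutadapt --help` for your installed version. Whether this is the right adapter for a given library still depends on the library prep.

```python
import re
import shlex


def rna_oligo_to_dna(oligo):
    """Strip the 'P-' and 'idT' modification labels and write U as T."""
    core = re.sub(r"^P-", "", oligo)   # 5' label
    core = re.sub(r"idT$", "", core)   # 3' label
    return core.upper().replace("U", "T")


adapter = rna_oligo_to_dna("P-UCGUAUGCCGUCUUCUGCUUGUidT")
print(adapter)  # TCGTATGCCGTCTTCTGCTTGT

# Hypothetical file names; the command is only printed here, not executed.
cmd = ["cutadapt", "-a", adapter, "-e", "0.1", "-O", "5",
       "-o", "trimmed.fastq", "reads.fastq"]
print(shlex.join(cmd))
```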

Other tools you can look at for adapter trimming are:

I'm sure others will offer more. I've only ever really used cutadapt and the fastx-toolkit, more the latter than the former, but have had good success with them.

@Steve Lianoglou: I am not including the suffixes, but it is still not working; no sequence is trimmed. I will try to reverse complement the sequence and then remove the adapter; you might have a point. Thank you. Regarding the third point, I am just using some libraries from NCBI, and in the associated paper I could find brief details regarding the library preparation. I do not have any experience with sequencing data, so at this point I am not sure how to work out from that which substrings I could use. So you are saying that the only way to do this is to contact the sequencing people? Thanks.

@agatha: if this is from an already published paper, you can give us the reference and we can help you identify the appropriate adapter sequence.

@Steve Lianoglou: [1] C. E. Joyce et al., “Deep sequencing of small RNAs from human skin reveals major alterations in the psoriasis miRNAome,” Human Molecular Genetics, vol. 20, no. 20, pp. 4025-4040, Aug. 2011.

@Steve Lianoglou: This is the paper. Any help would be greatly appreciated.

@agatha: I can't seem to download the small RNA prep kit documentation from Illumina; there is a download link, but it just redirects to their "order me" web front. Do you have the PDF?

@agatha: One thing you can do is to run FastQC on your input fastq file. It will detect enriched sequences in the library and try to match it against a list of "known" contaminants. It will list any adapters found "by name", you can then get the adapter's full length sequence from the contaminant_list.txt file that comes with FastQC itself.
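Once you have a contaminant list, you can also check your own reads against it. The Python sketch below is not FastQC's matching procedure, just a simple shared k-mer test. It assumes each non-comment line of the list is a name followed by one or more tabs and then the sequence, which is how I remember contaminant_list.txt being laid out; verify that against your copy. The entry name `thread_adapter` is made up, and its sequence is the adapter from the question written as DNA.

```python
def parse_contaminants(text):
    """Parse 'name<TAB(s)>sequence' lines; '#' lines and blanks are skipped."""
    contaminants = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, seq = line.rpartition("\t")
        if name.strip():
            contaminants[name.strip()] = seq.strip().upper()
    return contaminants


def contaminants_in_read(read, contaminants, k=8):
    """Names of contaminants sharing at least one exact k-mer with the read."""
    hits = []
    for name, seq in contaminants.items():
        if any(seq[i:i + k] in read for i in range(len(seq) - k + 1)):
            hits.append(name)
    return hits


# Illustrative entry only; a real list would be read from contaminant_list.txt.
example_list = "# name<TAB>sequence\nthread_adapter\t\tTCGTATGCCGTCTTCTGCTTGT\n"
read = "CAGGAACGGTGCACCANTCTCGTATGCNNNNNNNNN"

print(contaminants_in_read(read, parse_contaminants(example_list)))
```

With short reads like these, a small k is needed to catch partial adapters, at the cost of more chance matches.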

@Steve Lianoglou: Yes, I do have the PDF. How can I send it to you?

@Steve Lianoglou: I am not sure what you mean by the known contaminants, but I will try to find out. PS: I have run the program with the complement of the adapter sequence and cutadapt crashed; probably the adapter sequence is too long.

@agatha: Did you get this sorted out?

@Steve Lianoglou: Yes, I think so. I will see how correct it is by whether I find any isomiRs after mapping :-) Thank you for your help anyway!
