CGHub/TCGA RNA-Seq data trimming adapters
6.3 years ago
umn_bist ▴ 390

I have multiple unaligned paired-end RNA-seq FASTQ files that I would like to trim, both for known adapters and by base quality score.

Because I do not know which sequencer was used (most likely Illumina), I first run the files through SolexaQA++ to determine the format. If the data came from an Illumina instrument, then depending on the version I pick the appropriate adapter list to pass to cutadapt.

I have found the PDF on Illumina's website that lists all of their adapters, but I am wondering how I can convert that list into a FASTA file. Is there a publicly available adapter_list.fasta for RNA-seq samples?
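
As a minimal sketch of that conversion, assuming the adapter names and sequences have already been copied out of the PDF by hand: the snippet below writes them as FASTA records. The names `ADAPTERS`, `to_fasta` and `write_fasta` are hypothetical, and the two TruSeq sequences are included only as an example that should be checked against Illumina's own document before use.

```python
# Adapter names and sequences copied by hand from the vendor document.
# ASSUMPTION: these two TruSeq entries are illustrative; verify them
# against Illumina's adapter PDF before trimming real data.
ADAPTERS = {
    "TruSeq_Adapter_Read1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "TruSeq_Adapter_Read2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}


def to_fasta(adapters):
    """Return FASTA text: a '>name' header line followed by the sequence."""
    lines = []
    for name, seq in adapters.items():
        seq = seq.strip().upper()
        # Reject empty entries and anything that is not a nucleotide code.
        if not seq or set(seq) - set("ACGTN"):
            raise ValueError(f"unexpected characters in adapter {name!r}")
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


def write_fasta(adapters, path):
    """Write the adapters to a FASTA file at the given path."""
    with open(path, "w") as handle:
        handle.write(to_fasta(adapters))


if __name__ == "__main__":
    print(to_fasta(ADAPTERS), end="")
```

Calling `write_fasta(ADAPTERS, "adapter_list.fasta")` would produce a file that trimming tools can read as an adapter list.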

Thank you very much for the help.


EDIT: I found that Trimmomatic ships a set of Illumina adapters for PE and SE reads in FASTA format. Today I learned that .fa files are human readable, and that the list is nothing more than each adapter sequence with a ">name" header line above it. What I don't understand is how these clippers (specifically cutadapt) know which adapters are 5' and which are 3' and trim accordingly. Also, is there a conventional way of identifying the sequencer that created a FASTQ file, other than how I am doing it now with SolexaQA++?
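
On the 5'/3' question: cutadapt does not infer the orientation from the FASTA file. The option used to pass the adapter decides it: `-a` is a 3' adapter, `-g` a 5' adapter and `-b` either, with `-A`, `-G` and `-B` as the counterparts for the second read of a pair. A `file:` prefix makes it read the sequences from a FASTA file. The sketch below only builds such a command line and does not run it. The helper name `cutadapt_command`, the file names and the quality cutoff of 20 are assumptions for illustration.

```python
import shlex


def cutadapt_command(r1, r2, out1, out2, adapters_fasta, min_quality=20):
    """Build a paired-end cutadapt command line as an argument list."""
    return [
        "cutadapt",
        "-a", f"file:{adapters_fasta}",  # 3' adapters to look for in read 1
        "-A", f"file:{adapters_fasta}",  # 3' adapters to look for in read 2
        "-q", str(min_quality),          # quality cutoff for trimming read ends
        "-o", out1,                      # trimmed read 1 output
        "-p", out2,                      # trimmed read 2 output
        r1, r2,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = cutadapt_command("sample_1.fastq", "sample_2.fastq",
                           "trimmed_1.fastq", "trimmed_2.fastq",
                           "adapter_list.fasta")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the options have been checked against the installed cutadapt version.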

EDIT: I also found that FastQC reports the quality encoding of a FASTQ file. Now that I know my RNA-seq reads use the Sanger / Illumina 1.9 encoding, what would be a relevant list of all adapters?

EDIT: SolexaQA++ uses the code below to determine the quality format of an unknown.fastq.


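use strict;
use warnings;

my $format = "";

# Regular expressions for the quality characters that identify each encoding.
# '$' and '@' are escaped so Perl does not treat them as variables.
my $sanger_regexp = qr/[!"#\$%&'()*+,\-.\/0123456789:]/;
my $solexa_regexp = qr/[;<=>?]/;
my $solill_regexp = qr/[JKLMNOPQRSTUVWXYZ\[\]\^_`abcdefgh]/;
my $all_regexp    = qr/[\@ABCDEFGHI]/;    # valid in every encoding, so not informative

# Flags recording which character classes have been seen
my $sanger_counter = 0;
my $solexa_counter = 0;
my $solill_counter = 0;
my $i = 0;

# The excerpt omitted the read loop; this one reads the FASTQ file
# named on the command line (or standard input).
while (<>) {
    $i++;

    # retrieve qualities: every fourth line of a FASTQ record
    next unless $i % 4 == 0;
    chomp;

    # check qualities
    if ( m/$sanger_regexp/ ) {
        $sanger_counter = 1;
    }
    if ( m/$solexa_regexp/ ) {
        $solexa_counter = 1;
    }
    if ( m/$solill_regexp/ ) {
        $solill_counter = 1;
    }
}

# determine format: Sanger-only characters take priority, then Solexa-only
if ( $sanger_counter ) {
    $format = "sanger";
}
elsif ( $solexa_counter ) {
    $format = "solexa";
}
elsif ( $solill_counter ) {
    $format = "illumina";
}

print "$format\n";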
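
For comparison, roughly the same heuristic can be sketched in Python using ASCII code ranges instead of regular expressions. This is not SolexaQA++ code. The function names are hypothetical and the ranges are my reading of the character classes above (characters '@' to 'I' occur in every encoding and are ignored).

```python
def quality_lines(fastq_lines):
    """Yield every fourth line (the quality string) of FASTQ text."""
    for number, line in enumerate(fastq_lines, start=1):
        if number % 4 == 0:
            yield line


def guess_quality_format(qualities):
    """Guess the quality encoding from an iterable of quality strings."""
    saw_solexa = False
    saw_illumina = False
    for line in qualities:
        for ch in line.strip():
            code = ord(ch)
            if code <= 58:
                # '!' to ':' only occur with a Phred+33 (Sanger) offset.
                return "sanger"
            if 59 <= code <= 63:
                # ';' to '?' are Solexa scores below zero.
                saw_solexa = True
            elif 74 <= code <= 104:
                # 'J' to 'h' are high scores in the +64 encodings.
                saw_illumina = True
    if saw_solexa:
        return "solexa"
    if saw_illumina:
        return "illumina"
    return ""  # only ambiguous characters were seen


if __name__ == "__main__":
    # Tiny made-up FASTQ record, for illustration only.
    record = ["@read1", "ACGT", "+", "#5AI"]
    print(guess_quality_format(quality_lines(record)))
```

With a real file, `guess_quality_format(quality_lines(open(path)))` would scan until the first Sanger-only character and stop there.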
RNA-Seq cutadapt SolexaQA++ Illumina FastQC • 2.4k views
6.3 years ago
GenoMax 115k

Brian Bushnell (author of BBMap) includes a full list of commonly used adapter sequences in a file in the "resources" directory of the BBMap software. BBMap is a complete suite of tools for working with NGS data.

See this link for information about TCGA RNAseq data and how it was processed:

Thank you very much for the reply. The description.txt will be extremely helpful. I should have known that the pipeline used for RNA-seq would be thoroughly documented by TCGA.

I have seen BBMap mentioned here and there but was hesitant to use it, because I am new to bioinformatics and cannot yet judge the algorithms behind each tool. Regardless, I will definitely look into it; the tool seems robust and well documented. The FASTA file was exactly what I was looking for.

Your input has been very helpful. Any additional feedback or recommendations for an RNA-seq pipeline or analysis will be greatly appreciated.

Upvoted, bookmarked, accepted.

