Question: CGHub/TCGA RNA-Seq data trimming adapters
3.3 years ago, umn_bist wrote:

I have multiple unaligned paired-end RNA-seq FASTQ files that I would like to trim for known adapters and by base quality score.

Because I do not know which sequencer was used (most likely Illumina), I run each file through SolexaQA++ to determine the format first. If it is Illumina, I pass the adapter list appropriate to that version to cutadapt.

I have visited Illumina's website, which has a PDF of all adapters, but I am wondering how I can convert that list into a FASTA file. Is there a publicly available adapter_list.fasta for RNA-seq samples?
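Going from a list of adapter names and sequences to FASTA needs very little code. A minimal sketch in Python, assuming the sequences have been copied by hand from Illumina's document; `adapters_to_fasta` is a hypothetical helper, and the two TruSeq entries and their record names are only illustrative:

```python
def adapters_to_fasta(adapters):
    """Return FASTA text: a '>name' header line followed by the sequence."""
    lines = []
    for name, seq in adapters.items():
        lines.append(f">{name}")
        lines.append(seq.strip().upper())
    return "\n".join(lines) + "\n"


# Illustrative entries only; check them against Illumina's adapter document.
adapters = {
    "TruSeq_Adapter_Read1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "TruSeq_Adapter_Read2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}

fasta_text = adapters_to_fasta(adapters)
print(fasta_text, end="")
```

Writing `fasta_text` to a file gives the same layout as the adapter files that trimming tools ship with: one header line per adapter, then its sequence.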

Thank you very much for the help.


EDIT: I found that Trimmomatic supplies a set of Illumina adapters for PE and SE in FASTA format. Today I learned that .fa is human readable and that the list is nothing more than each adapter sequence with a ">adapter name" header line above it. What I don't understand is how these clippers (specifically cutadapt) know which adapters are 5' and which are 3', and trim accordingly. Also, is there a conventional way of identifying the sequencer that created a FASTQ file, other than how I am doing it now via SolexaQA++?
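On the 5'/3' question: as far as cutadapt goes, the tool does not work this out from the FASTA file; the option used to pass the adapter tells it. `-a` (read 1) and `-A` (read 2) mean a 3' adapter, while `-g` and `-G` mean a 5' adapter. A minimal sketch in Python that only builds such a command; the file names and the helper `cutadapt_command` are hypothetical, the two TruSeq sequences are illustrative, and the quality cutoff of 20 is an arbitrary choice:

```python
import shlex


def cutadapt_command(r1, r2, out1, out2, adapter_r1, adapter_r2, quality=20):
    """Build a paired-end cutadapt call that trims 3' adapters and low-quality ends."""
    return [
        "cutadapt",
        "-a", adapter_r1,    # -a: 3' adapter on read 1 (-g would mean a 5' adapter)
        "-A", adapter_r2,    # -A: 3' adapter on read 2 (-G would mean a 5' adapter)
        "-q", str(quality),  # trim low-quality bases from the 3' end of each read
        "-o", out1,          # trimmed read 1 output
        "-p", out2,          # trimmed read 2 output
        r1,
        r2,
    ]


# Hypothetical file names, for illustration only.
cmd = cutadapt_command(
    "sample_1.fastq", "sample_2.fastq",
    "trimmed_1.fastq", "trimmed_2.fastq",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
)
print(shlex.join(cmd))
```

A FASTA file of adapters can be given in place of a literal sequence, for example `-a file:adapter_list.fasta`; every adapter in the file is then treated as the type implied by the option.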

EDIT: I also found that FastQC reports the quality encoding and version. Now that I know my RNA-seq data is reported as Sanger / Illumina 1.9, what would be a relevant list of all adapters?

EDIT: SolexaQA++ uses the code below to determine the format of an unknown.fastq.


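use strict;
use warnings;

my $format = "";

# set regular expressions: each class holds the quality characters that
# point to one encoding ('$' and '-' are escaped so they are taken literally)
my $sanger_regexp = qr/[!"#\$%&'()*+,\-.\/0123456789:]/;
my $solexa_regexp = qr/[\;<=>\?]/;
my $solill_regexp = qr/[JKLMNOPQRSTUVWXYZ\[\]\^\_\`abcdefgh]/;
my $all_regexp = qr/[\@ABCDEFGHI]/;   # occurs in every encoding, so it is not tested

# set counters
my $sanger_counter = 0;
my $solexa_counter = 0;
my $solill_counter = 0;
my $i = 0;

# read the FASTQ file named on the command line, or STDIN (assumed input handling)
while( <> ){

    # retrieve qualities: the quality string is every fourth line
    $i++;
    next unless $i % 4 == 0;
    chomp;

    # check qualities
    if( m/$sanger_regexp/ ){
        $sanger_counter = 1;
    }
    if( m/$solexa_regexp/ ){
        $solexa_counter = 1;
    }
    if( m/$solill_regexp/ ){
        $solill_counter = 1;
    }
}

# determine format: Sanger characters take precedence, then Solexa, then Illumina
if( $sanger_counter ){
    $format = "sanger";
}
elsif( !$sanger_counter && $solexa_counter ){
    $format = "solexa";
}
elsif( !$sanger_counter && !$solexa_counter && $solill_counter ){
    $format = "illumina";
}

print "$format\n";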
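The same idea is often written as a range check on character codes. A minimal Python sketch, not SolexaQA++'s or FastQC's actual code; `guess_encoding` is a hypothetical name, and the thresholds simply follow the character classes in the snippet above:

```python
def guess_encoding(fastq_lines):
    """Guess the quality encoding from the lowest and highest quality characters."""
    lowest = None
    highest = None
    for i, line in enumerate(fastq_lines, start=1):
        if i % 4 != 0:               # the quality string is every fourth line
            continue
        for ch in line.rstrip("\n"):
            code = ord(ch)
            if lowest is None or code < lowest:
                lowest = code
            if highest is None or code > highest:
                highest = code
    if lowest is None:
        return "unknown"             # no quality lines seen
    if lowest < 59:                  # '!' .. ':' only occur with a +33 offset
        return "sanger"
    if lowest < 64:                  # ';' .. '?' point to Solexa
        return "solexa"
    if highest > 73:                 # 'J' or above with nothing below '@'
        return "illumina"
    return "unknown"                 # only '@' .. 'I' seen: ambiguous


record = ["@read1\n", "ACGT\n", "+\n", "!5:I\n"]
print(guess_encoding(record))
```

Like the Perl version, this leaves the answer open when only the characters shared by all encodings have been seen.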
3.3 years ago, genomax (United States) wrote:

Brian Bushnell (author of BBMap) includes a full list of commonly used adapter sequences in a file in the "resources" directory of the BBMap software. BBMap is a complete suite of tools for working with NGS data.
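An adapter FASTA file of this kind is easy to inspect before handing it to a trimmer. A minimal sketch in Python, assuming a plain FASTA layout; `read_fasta` is a hypothetical helper, and the two records below are made-up stand-ins for the real file's contents:

```python
import io


def read_fasta(handle):
    """Read FASTA records into a dict mapping each name to its sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()   # sequences may span several lines
    return records


# Stand-in for an opened adapter file; the real file would be opened with open().
example = io.StringIO(">adapter_1\nAGATCGGAAGAGC\n>adapter_2\nCTGTCTCTTATA\nCACATCT\n")
adapters = read_fasta(example)
for name, seq in adapters.items():
    print(name, len(seq), seq)
```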

See this link for information about TCGA RNAseq data and how it was processed:


Thank you very much for the reply. The description.txt will be extremely, extremely helpful. I should have known that the pipeline used for RNA-seq would be thoroughly documented in TCGA.

I have seen BBMap mentioned here and there but was hesitant to use it because I am new to bioinformatics and cannot yet weigh the algorithms behind different tools. Regardless, I will definitely look into it; the tool seems robust, well documented, and well supported. The FASTA file was exactly what I was looking for.

Your input has been very helpful. Any additional feedback, recommendations for rna-seq pipeline/analysis will be greatly appreciated.

Upvoted, bookmarked, accepted.

Reply written 3.3 years ago by umn_bist