Question

CGHub/TCGA RNA-Seq data trimming adapters

0

8.2 years ago

umn_bist ▴ 390

I have multiple unaligned paired-end RNA-seq FASTQ files that I would like to trim for known adapters and by base quality score.

Because I do not know which sequencer was used (most likely Illumina), I first run the files through SolexaQA++ to determine the format. If the data came from an Illumina instrument, then depending on the version I pick the appropriate adapter list to pass to cutadapt.

I have visited Illumina's website, which has a PDF of all adapters, but I am wondering how I can turn that list into a FASTA file. Is there a publicly available adapter_list.fasta for RNA-seq samples?

Thank you very much for the help.

EDIT: I found that Trimmomatic supplies sets of Illumina adapters for paired-end (PE) and single-end (SE) data in FASTA format. Today I learned that .fa is human readable, and that the list is nothing more than each adapter sequence with a ">insert sequencer here" header line above it. What I don't understand is how these clippers (specifically cutadapt) know which adapters are 5' and which are 3', and trim accordingly. Also, is there a conventional way to identify the sequencer that created a FASTQ file other than how I am doing it now via SolexaQA++?
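
To illustrate the FASTA layout described above (a ">name" header line followed by the sequence), here is a minimal Python sketch that turns a hand-copied adapter list into FASTA text and reads it back. The function names and record names are hypothetical labels chosen for this example. The two sequences are the TruSeq read 1 and read 2 adapters as commonly quoted in trimming tool documentation, and should be checked against Illumina's own adapter document before real use.

```python
# Illustrative entries only: the names are hypothetical labels, and the
# sequences should be verified against Illumina's adapter document.
ADAPTERS = {
    "TruSeq_Adapter_Read1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "TruSeq_Adapter_Read2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}


def to_fasta(adapters):
    """Return FASTA text: one '>name' header line followed by its sequence."""
    lines = []
    for name, seq in adapters.items():
        lines.append(f">{name}")
        lines.append(seq.upper())
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Read FASTA text back into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequences may wrap over several lines, so append each piece.
            records[name] += line
    return records


print(to_fasta(ADAPTERS), end="")
```

Writing the returned text to a file such as adapter_list.fasta would give something a trimming tool can read, assuming the tool accepts plain FASTA adapter files.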

EDIT: SolexaQA++ uses the Perl code below (after the next edit) to determine the format (sanger, solexa, or illumina) of an unknown.fastq.

EDIT: I also found that FastQC reports the sequencer type/version (in its "Encoding" field). Now that I know my RNA-seq data is Sanger / Illumina 1.9, what would be a relevant list of all adapters?

#!/usr/bin/perl

use strict;
use warnings;

my $format = "";

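# set regular expressions (character classes of FASTQ quality symbols)
# ASCII 33-58: can only occur with Sanger / Phred+33 encoding
my $sanger_regexp = qr/[!"#\$%&'()*+,\-.\/0123456789:]/;
# ASCII 59-63: below the Illumina 1.3+ minimum, so Solexa
my $solexa_regexp = qr/[\;<=>\?]/;
# ASCII 74-104: typical of Phred+64 encodings (Solexa or Illumina 1.3+)
my $solill_regexp = qr/[JKLMNOPQRSTUVWXYZ\[\]\^\_\`abcdefgh]/;
# ASCII 64-73: shared by all encodings, so uninformative (unused below)
my $all_regexp = qr/[\@ABCDEFGHI]/;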

# set counters
my $sanger_counter = 0;
my $solexa_counter = 0;
my $solill_counter = 0;

my $i = 0;

while(<>){
    $i++;

    # retrieve qualities (every 4th line of a FASTQ record)
    next unless $i % 4 == 0;

    #print;
    chomp;

    # check qualities
    if( m/$sanger_regexp/ ){
        $sanger_counter = 1;
        last;
    }
    if( m/$solexa_regexp/ ){
        $solexa_counter = 1;
    }
    if( m/$solill_regexp/ ){
        $solill_counter = 1;
    }
}

# determine format
if( $sanger_counter ){
    $format = "sanger";
}
elsif( !$sanger_counter && $solexa_counter ){
    $format = "solexa";
}
elsif( !$sanger_counter && !$solexa_counter && $solill_counter ){
    $format = "illumina";
}

print "$format\n";
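
For readers more comfortable with Python, the heuristic in the Perl script above can be sketched as follows. This is a rough re-expression under the same assumptions (four-line FASTQ records, quality string on every fourth line), not SolexaQA++'s actual implementation; the function name `detect_quality_format` is made up for this example.

```python
import io


def detect_quality_format(handle):
    """Guess the FASTQ quality encoding from the quality lines of a file.

    Returns "sanger", "solexa", "illumina", or "" when nothing is decisive.
    """
    saw_solexa = False  # any symbol in ';'..'?' (ASCII 59-63)
    saw_solill = False  # any symbol in 'J'..'h' (ASCII 74-104)
    for line_number, line in enumerate(handle, start=1):
        if line_number % 4 != 0:  # the quality string is every 4th line
            continue
        codes = [ord(ch) for ch in line.rstrip("\n")]
        if any(33 <= c <= 58 for c in codes):
            return "sanger"  # '!'..':' can only occur with Phred+33
        if any(59 <= c <= 63 for c in codes):
            saw_solexa = True
        if any(74 <= c <= 104 for c in codes):
            saw_solill = True
    if saw_solexa:
        return "solexa"
    if saw_solill:
        return "illumina"
    return ""


# Tiny in-memory example; a real call would pass an open FASTQ file handle.
sample = io.StringIO("@read1\nACGT\n+\n##II\n")
print(detect_quality_format(sample))  # prints "sanger"
```

As in the Perl version, symbols in the shared range '@' to 'I' are ignored because they are valid in every encoding, so a file containing only those symbols yields an empty result.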

FastQC Illumina cutadapt SolexaQA-plus-plus RNA-Seq • 2.9k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by umn_bist ▴ 390

Ram · Accepted Answer · 2016-02-02

3

8.2 years ago

GenoMax 141k

Brian Bushnell (author of BBMap) includes a full list of commonly used adapter sequences in a file in the "resources" directory of the BBMap software. BBMap is a complete suite of tools for working with NGS data.
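
As a rough sketch of how such an adapter FASTA might be handed to cutadapt for paired-end data: cutadapt takes the adapter's position from the option it is given with (`-a` for 3' adapters and `-g` for 5' adapters on read 1, `-A` and `-G` for read 2), not from the FASTA file itself, and a `file:` prefix tells it to read sequences from a FASTA file. The Python helper below only builds the command line and does not run it. The helper name, the file names, and the quality cutoff of 20 are assumptions made for illustration, and treating every adapter as a 3' adapter is a simplification.

```python
import shlex


def cutadapt_command(adapter_fasta, r1_in, r2_in, r1_out, r2_out, min_quality=20):
    """Build (but do not run) a paired-end cutadapt command line."""
    return [
        "cutadapt",
        "-a", f"file:{adapter_fasta}",  # 3' adapters to remove from read 1
        "-A", f"file:{adapter_fasta}",  # 3' adapters to remove from read 2
        "-q", str(min_quality),         # quality-trim 3' ends below this score
        "-o", r1_out,                   # trimmed read 1 output
        "-p", r2_out,                   # trimmed read 2 output
        r1_in,
        r2_in,
    ]


# Hypothetical file names, for illustration only.
cmd = cutadapt_command(
    "adapters.fa",
    "sample_1.fastq", "sample_2.fastq",
    "trimmed_1.fastq", "trimmed_2.fastq",
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files and cutadapt is installed.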

See this link for information about TCGA RNAseq data and how it was processed: https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/read/cgcc/unc.edu/illuminahiseq_rnaseqv2/rnaseqv2/unc.edu_READ.IlluminaHiSeq_RNASeqV2.mage-tab.1.9.0/DESCRIPTION.txt

0

Thank you very much for the reply. The DESCRIPTION.txt will be extremely helpful. I should've known that the pipeline used for RNA-seq would be thoroughly documented in TCGA.

I have seen BBMap mentioned here and there but was hesitant to use it because I am new to bioinformatics and cannot yet weigh the algorithms behind different tools. Regardless, I will definitely look into it; the tool seems robust, well documented, and well guided. The FASTA file was exactly what I was looking for.

Your input has been very helpful. Any additional feedback or recommendations for an RNA-seq pipeline/analysis will be greatly appreciated.

Upvoted, bookmarked, accepted.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by umn_bist ▴ 390