So I have multiple (unaligned) paired-end RNA-seq FASTQ files that I would like to trim, both for known adapters and by base quality score.
Because I do not know which sequencer was used (most likely Illumina), I first run the files through SolexaQA++ to determine the quality-score format. If the data came from Illumina, I then pick the adapter list that matches the version and pass it to cutadapt.
I have visited Illumina's website, which has a PDF file of all adapters, but I am wondering how I can convert that list into a FASTA file. Is there a publicly available adapter_list.fasta for RNA-seq samples?
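For anyone in the same spot, here is a minimal sketch of turning a hand-copied adapter table into a FASTA file. The record names and the `to_fasta`/`write_fasta` helpers are made up for illustration. The two sequences are the TruSeq read 1 and read 2 adapters as given in the cutadapt documentation, so check them against Illumina's own document for your library kit before relying on them.

```python
# Hypothetical adapter table. Names are free-form labels chosen for this sketch;
# the sequences are the TruSeq read 1 / read 2 adapters as listed in the
# cutadapt documentation (verify against Illumina's document for your kit).
ADAPTERS = {
    "TruSeq_Adapter_Read1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "TruSeq_Adapter_Read2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}


def to_fasta(adapters):
    """Return FASTA text: one '>name' header line followed by the sequence."""
    lines = []
    for name, seq in adapters.items():
        seq = seq.strip().upper()
        # Reject empty entries and anything that is not a plain nucleotide code.
        if not seq or set(seq) - set("ACGTN"):
            raise ValueError(f"unexpected characters in adapter {name!r}")
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


def write_fasta(adapters, path="adapter_list.fasta"):
    """Write the adapter table to a FASTA file at the given path."""
    with open(path, "w") as handle:
        handle.write(to_fasta(adapters))


if __name__ == "__main__":
    print(to_fasta(ADAPTERS), end="")
```

Calling `write_fasta(ADAPTERS)` would produce a file that adapter trimmers accepting FASTA input can read.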
Thank you very much for the help.
EDIT: I found that Trimmomatic supplies sets of Illumina adapters for PE and SE in FASTA format. Today I learned that .fa files are human-readable, and that the list is nothing more than the adapter sequences, each with a ">insert sequencer here" header line above it. What I don't understand is how these clippers (specifically cutadapt) know which adapters are 5' and which are 3', and trim accordingly. Also, is there a conventional way of identifying the sequencer that created a FASTQ file, other than how I am doing it now via SolexaQA++?
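On the 5' versus 3' question: as far as I can tell, cutadapt does not work this out from the FASTA file. The command-line option says which end each adapter belongs to: `-a`/`-A` mark 3' adapters on read 1/read 2, and `-g`/`-G` mark 5' adapters. The sketch below only builds and prints such a command. The file names and the `cutadapt_command` helper are placeholders. Using the same FASTA file for both reads is a simplification, since read 1 and read 2 adapters usually differ.

```python
import shlex


def cutadapt_command(r1, r2, adapters_fasta, out1, out2,
                     end="3prime", quality=20, min_len=20):
    """Build a paired-end cutadapt command line as a list of arguments."""
    # The option, not the FASTA file, tells cutadapt which end an adapter is on:
    #   -a / -A : 3' adapter on read 1 / read 2
    #   -g / -G : 5' adapter on read 1 / read 2
    flags = {"3prime": ("-a", "-A"), "5prime": ("-g", "-G")}
    if end not in flags:
        raise ValueError("end must be '3prime' or '5prime'")
    flag_r1, flag_r2 = flags[end]
    source = f"file:{adapters_fasta}"  # 'file:' makes cutadapt read adapters from FASTA
    return [
        "cutadapt",
        flag_r1, source,
        flag_r2, source,
        "-q", str(quality),   # quality-trimming cutoff
        "-m", str(min_len),   # discard reads shorter than this after trimming
        "-o", out1,
        "-p", out2,
        r1, r2,
    ]


if __name__ == "__main__":
    cmd = cutadapt_command("sample_R1.fastq", "sample_R2.fastq",
                           "adapter_list.fasta",
                           "trimmed_R1.fastq", "trimmed_R2.fastq")
    print(shlex.join(cmd))  # inspect the command before running it
```

The quality and length cutoffs of 20 are arbitrary defaults for the sketch, not recommendations.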
EDIT: SolexaQA++ uses the code below to determine the sequencer for an unknown.fastq.
EDIT: I also found that FastQC can determine the sequencer type/version. Now that I know my RNA-seq data was reported as Sanger/Illumina 1.9, what would be a relevant list of all adapters?
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $format = "";

    # set regular expressions
    # ("\$" is escaped so that "$%" is not parsed as a Perl special variable)
    my $sanger_regexp = qr/[!"#\$%&'()*+,-.\/0123456789:]/;
    my $solexa_regexp = qr/[\;<=>\?]/;
    my $solill_regexp = qr/[JKLMNOPQRSTUVWXYZ\[\]\^\_\`abcdefgh]/;
    my $all_regexp    = qr/[\@ABCDEFGHI]/;   # shared by all formats; not used below

    # set counters
    my $sanger_counter = 0;
    my $solexa_counter = 0;
    my $solill_counter = 0;

    my $i = 0;
    while(<>){
        $i++;

        # retrieve qualities (every fourth line of a FASTQ record)
        next unless $i % 4 == 0;
        #print;
        chomp;

        # check qualities
        if( m/$sanger_regexp/ ){
            $sanger_counter = 1;
            last;
        }
        if( m/$solexa_regexp/ ){
            $solexa_counter = 1;
        }
        if( m/$solill_regexp/ ){
            $solill_counter = 1;
        }
    }

    # determine format
    if( $sanger_counter ){
        $format = "sanger";
    }
    elsif( !$sanger_counter && $solexa_counter ){
        $format = "solexa";
    }
    elsif( !$sanger_counter && !$solexa_counter && $solill_counter ){
        $format = "illumina";
    }

    print "$format\n";
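For readers who prefer Python, here is a rough port of the same idea. It is a sketch, not SolexaQA++'s actual implementation, and the function name `detect_quality_format` is made up. It uses contiguous ASCII ranges, so unlike the Perl character class it also counts the backslash as an Illumina-range character. Note that this detects the quality-score encoding rather than the instrument; the "Sanger / Illumina 1.9" label from FastQC corresponds to the "sanger" (Phred+33) case.

```python
import sys


def detect_quality_format(lines):
    """Guess the FASTQ quality encoding from an iterable of FASTQ lines."""
    saw_solexa = False
    saw_solill = False
    for i, line in enumerate(lines, start=1):
        if i % 4 != 0:            # quality strings are every fourth line
            continue
        for ch in line.rstrip("\r\n"):
            code = ord(ch)
            if 33 <= code <= 58:      # '!' .. ':' only occur with Phred+33
                return "sanger"
            if 59 <= code <= 63:      # ';' .. '?' only occur with Solexa scores
                saw_solexa = True
            elif 74 <= code <= 104:   # 'J' .. 'h' (includes '\', unlike the Perl class)
                saw_solill = True
            # '@' .. 'I' (64-73) are shared by all formats and decide nothing
    if saw_solexa:
        return "solexa"
    if saw_solill:
        return "illumina"
    return ""                     # undetermined, as in the Perl script


if __name__ == "__main__":
    # Usage: python detect_format.py < unknown.fastq
    print(detect_quality_format(sys.stdin))
```

Like the original, it returns an empty string when every quality character falls in the shared range.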
Thank you very much for the reply. The description.txt will be extremely, extremely helpful. I should've known that the pipeline used for RNA-seq would have thorough documentation in TCGA.
I have seen BBMap referred to here and there but was hesitant to use it because I am new to bioinformatics and cannot yet judge the algorithms behind different tools. Regardless, I will definitely look into it; the tool seems robust, well documented, and easy to follow. The FASTA file was exactly what I was looking for.
Your input has been very helpful. Any additional feedback or recommendations for an RNA-seq pipeline/analysis will be greatly appreciated.
Upvoted, bookmarked, accepted.