I would like to analyze some small RNA data from NCBI (e.g. https://www.ncbi.nlm.nih.gov/sra/SRR5593145), but I am not sure where I can find the adapter sequences for trimming. Can anyone suggest where to look?
How Can I Tell What Is The Adapter Used In A Sequence Read Archive (Sra) Sample?
Identify adapter sequences for trimming from Illumina paired end fastq files
Thank you! Not sure if I can use BBMap if the data are single-end, though.
The TruSeq Small RNA kit sequences (the kit named on the SRA page) should be in Illumina's adapter sequences document.
Hi - apologies if I missed the answer somewhere on Biostars... so I take it that the --clip option in fastq-dump isn't trimming the adapters? Or cannot be completely trusted?
I have never heard of that option, nor of anyone using it for adapter trimming. By default NCBI does not store information on the adapter sequence, so the tool would not know what to look for, and I would not put any trust in this option. If you do not know the sequence, run FastQC to check for adapters and then remove them with specialized software such as Trimmomatic, cutadapt or bbduk.sh. Which adapter was used depends on the library prep kit.
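Before running a full trimming tool, a quick sanity check can show whether a suspected adapter is actually present in the reads. The sketch below is only an illustration and not a replacement for FastQC: `adapter_fraction` is a hypothetical helper name, it only looks for exact matches, and it assumes an uncompressed FASTQ with four lines per record. The sequence in the usage line is the start of the TruSeq Small RNA 3' adapter (TGGAATTCTCGGGTGCCAAGG), which is an assumption based on the kit named on the SRA page.

```shell
# Report how many reads contain a given adapter prefix (exact match only).
# Assumes a plain, uncompressed FASTQ with exactly four lines per record.
adapter_fraction() {
  # $1 = FASTQ file, $2 = adapter sequence (or a prefix of it)
  awk -v a="$2" '
    NR % 4 == 2 { n++; if (index($0, a) > 0) hit++ }  # line 2 of each record is the sequence
    END { printf "%d/%d\n", hit, n }                  # reads with adapter / total reads
  ' "$1"
}
```

For example, `adapter_fraction SRR5593145.fastq TGGAATTCTCGG` (the file name is assumed here) prints the number of matching reads over the total. In small RNA data a high fraction is expected, because most inserts are shorter than the read length.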
Thank you, ATpoint, for the recommendations. Submitters to SRA do generally give information about the library construction protocol, such as the RNA prep kit they used. The SRA toolkit is honestly quite confusing, and I also wonder whether ENA is removing these adapters. See below in the list of options about --clip:
$ fastq-dump -h
fastq-dump [options] <path> [<path>...]
fastq-dump [options] <accession>
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
--table <table-name> Table name within cSRA object, default is
Read Splitting Sequence data may be used in raw form or
split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters Applied to the full spot independently
-N|--minSpotId <rowid> Minimum spot id
-X|--maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W|--clip Remove adapter sequences from reads
Common Filters Applied to spots when --split-spot is not
set, otherwise - to individual reads
-M|--minReadLen <len> Filter by sequence length >= <len>
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
-E|--qual-filter Filter used in early 1000 Genomes data: no
sequences starting or ending with >= 10N
--qual-filter-1 Filter used in current 1000 Genomes data
Filters based on alignments Filters are active when alignment
data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can
either be accession.version (ex:
NC_000001.10) or file specific name (ex:
"chr1" or "1"). "from" and "to" are 1-based
--matepair-distance <from-to|unknown> Filter by distance between matepairs.
Use "unknown" to find matepairs split
between the references. Use from-to to limit
matepair distance on the same reference
Filters for individual reads Applied only with --split-spot set
--skip-technical Dump only biological reads
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip: deprecated, not
--bzip2 Compress output using bzip2: deprecated,
... more options sections ...
ENA mirrors NCBI; they don't change the data. You will always have to trim adapters yourself, using any of the tools I suggested (or others). It is true that the methods text may contain information on the library prep, but this is just free text; there is no field for entering an adapter sequence. NCBI will always (at least I have never seen anything else) store the raw sequencing data as they came from the sequencer (at least this should be what submitters upload), because everyone should be free to use whatever adapter-removal strategy (or general data-manipulation pipeline) they want.
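As a rough sketch of what trimming yourself could look like for single-end small RNA reads, here is a thin wrapper around cutadapt. The adapter is an assumption (the TruSeq Small RNA 3' adapter, going by the kit named on the SRA page), the 18 nt minimum length is an arbitrary illustrative cutoff, and `trim_small_rna` is a hypothetical helper name. Check the adapter against the kit documentation before relying on it.

```shell
# Assumption: TruSeq Small RNA 3' adapter; verify against the library prep kit.
ADAPTER="TGGAATTCTCGGGTGCCAAGG"

trim_small_rna() {
  # $1 = input FASTQ (single-end), $2 = output FASTQ
  if ! command -v cutadapt >/dev/null 2>&1; then
    echo "cutadapt not found on PATH" >&2
    return 127
  fi
  # -a: 3' adapter to remove; -m 18: discard reads shorter than 18 nt after trimming
  cutadapt -a "$ADAPTER" -m 18 -o "$2" "$1"
}
```

Usage would be something like `trim_small_rna SRR5593145.fastq SRR5593145.trimmed.fastq`, with the file names assumed. Trimmomatic or bbduk.sh would do the same job with their own options.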