Question: How do I find out the adaptor sequences for SRA data?
18 months ago, MAPK (1.7k) wrote:

I would like to analyze some small RNA data from NCBI, but I am not sure where I can find the adaptor sequences for trimming. Can anyone please suggest where to look?

adaptor sra • 1.2k views
modified 14 months ago by hermidalc (0) • written 18 months ago by MAPK (1.7k)

How Can I Tell What Is The Adapter Used In A Sequence Read Archive (Sra) Sample?
Identify adapter sequences for trimming from Illumina paired end fastq files

written 18 months ago by genomax (92k)

Thank you! I am not sure whether I can use BBMap if the data are single-end, though.

written 18 months ago by MAPK (1.7k)

The TruSeq small RNA kit sequences (based on the SRA link) should be in Illumina's adapter sequences document.

modified 18 months ago • written 18 months ago by genomax (92k)

Hi - apologies if I missed the answer somewhere on Biostars. So I take it that the --clip option in fastq-dump isn't trimming the adapters? Or that it cannot be completely trusted?

modified 14 months ago • written 14 months ago by hermidalc (0)

I have never heard of that option, and never heard of anyone using it for adapter trimming. By default, NCBI does not store information on the adapter sequence, so the tool cannot know what to look for, and I would not put any trust in this option. If you do not know the sequence, run FastQC to check for adapters and then remove them with specialized software such as Trimmomatic or cutadapt. Which adapter was used depends on the library prep kit.
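
As a rough illustration of the "check for adapters" step (a sketch only, not part of the original reply and not a substitute for FastQC), the snippet below counts how many reads contain the start of a candidate adapter. The function names, the 12-base seed length, and the choice of the TruSeq Small RNA 3' adapter as the candidate are all assumptions made for this example; check the sequence against the kit documentation for your dataset.

```python
import gzip

# Assumed candidate: 3' adapter of the Illumina TruSeq Small RNA kit.
# Confirm against the kit documentation before relying on it.
CANDIDATE_ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def adapter_fraction(fastq_lines, adapter, seed_len=12, max_reads=100000):
    """Return the fraction of reads containing the first seed_len adapter bases.

    fastq_lines is any iterable of FASTQ lines (an open file or a list).
    Only exact matches are counted, so this underestimates contamination
    when reads contain sequencing errors.
    """
    seed = adapter[:seed_len]
    total = 0
    hits = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        total += 1
        if seed in line:
            hits += 1
        if total >= max_reads:  # a sample of reads is enough for an estimate
            break
    return hits / total if total else 0.0
```

For a file on disk this could be called as `adapter_fraction(open_fastq("reads.fastq.gz"), CANDIDATE_ADAPTER)`, where the file name is a placeholder. A high fraction suggests the candidate adapter is present and needs trimming; a fraction near zero suggests a different adapter or already-trimmed data.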

modified 14 months ago • written 14 months ago by ATpoint (42k)

Thank you, ATpoint, for the recommendations. Submitters to SRA do generally give information about the library construction protocol, such as the RNA prep kit they used. The SRA Toolkit is honestly quite confusing, and I also wonder whether ENA is removing these adapters. See --clip in the list of options below:

$ fastq-dump -h

  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

  -A|--accession <accession>       Replaces accession derived from <path> in 
                                   filename(s) and deflines (only for single 
                                   table dump) 
  --table <table-name>             Table name within cSRA object, default is 


Read Splitting                     Sequence data may be used in raw form or
                                     split into individual reads
  --split-spot                     Split spots into individual reads 

Full Spot Filters                  Applied to the full spot independently
                                     of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id 
  -X|--maxSpotId <rowid>           Maximum spot id 
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...] 
  -W|--clip                        Remove adapter sequences from reads

Common Filters                     Applied to spots when --split-spot is not
                                     set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len> 
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value 
                                   optionally filter by value: 
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no 
                                   sequences starting or ending with >= 10N 
  --qual-filter-1                  Filter used in current 1000 Genomes data 

Filters based on alignments        Filters are active when alignment
                                     data are present
  --aligned                        Dump only aligned sequences 
  --unaligned                      Dump only unaligned sequences 
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can 
                                   either be accession.version (ex: 
                                   NC_000001.10) or file specific name (ex: 
                                   "chr1" or "1"). "from" and "to" are 1-based 
  --matepair-distance <from-to|unknown>  Filter by distance between matepairs. 
                                   Use "unknown" to find matepairs split 
                                   between the references. Use from-to to limit 
                                   matepair distance on the same reference 

Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads 

  -O|--outdir <path>               Output directory, default is working 
                                   directory '.' ) 
  -Z|--stdout                      Output to stdout, all split data become 
                                   joined into single stream 
  --gzip                           Compress output using gzip: deprecated, not 
  --bzip2                          Compress output using bzip2: deprecated, 
                                   not recommended 

... more options sections ...
written 14 months ago by hermidalc (0)

ENA mirrors NCBI; they don't change the data. You will always have to trim adapters yourself, using one of the tools I suggested or any comparable one. It is true that the methods text may contain information on the library prep, but this is just free text; there is no dedicated field for entering an adapter sequence. NCBI will always (at least I have never seen anything else) provide the raw sequencing data as they came from the sequencer (at least this should be what submitters upload), because everyone should be free to use whatever adapter-removal strategy (or general data manipulation pipeline) they want.
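
To make concrete what "trimming yourself" involves for a 3' adapter, here is a minimal sketch (not part of the original reply). It handles exact matches only, whereas dedicated tools such as cutadapt and Trimmomatic also tolerate mismatches and take base quality into account, so use those in practice. The function name, the `min_overlap` default, and the example adapter (assumed to be the TruSeq Small RNA 3' adapter) are illustrative assumptions.

```python
# Assumed adapter for illustration: Illumina TruSeq Small RNA 3' adapter.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_3prime_adapter(read, adapter, min_overlap=5):
    """Return read with an exactly matching 3' adapter removed.

    Two cases are handled:
      1. the full adapter occurs somewhere in the read;
      2. the read ends with a prefix of the adapter that is at least
         min_overlap bases long (adapter cut off by the read end).
    Reads without a match are returned unchanged.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Try the longest possible partial adapter first, then shorter ones.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read
```

When applying this to FASTQ records, the quality string has to be cut to the same length as the trimmed sequence. The `min_overlap` threshold is a trade-off: lower values remove more partial adapters but also clip more genuine sequence by chance.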

modified 14 months ago • written 14 months ago by ATpoint (42k)