Question

Connect SRA Bio Sample to Run

2.4 years ago

Ivan ▴ 60

I have a list of BioSample accession numbers for raw reads stored in the Sequence Read Archive (SRA). The list looks like this:

SAMN03421314  
SAMN03421315 
SAMN03421316 
SAMN03421317 
SAMN03421318 
SAMN03421319 
SAMN03421320 
SAMN03421321 
...

This list is stored in a text file. I downloaded every SRA file using the SRA Toolkit's prefetch command. What I got is a set of folders, each containing an .sra file, but the folders are named by Run accession (e.g. SRR1927228) rather than by BioSample accession (e.g. SAMN03421321). I want to connect each BioSample to its Run (e.g. SAMN03421321 : SRR1927228) without doing it manually, as I have a lot of folders.

Is there a fast tool that does just that: looks up the two IDs for each sample without re-downloading the genomes?

SRA sratoolkit • 666 views

updated 2.4 years ago by vkkodali_ncbi ★ 3.7k • written 2.4 years ago by Ivan ▴ 60

score 2 · Accepted Answer · 2021-11-23

You can use Entrez Direct for this as follows:

$ cat samples.txt 
SAMN03421314  
SAMN03421315 
SAMN03421316 
SAMN03421317 
SAMN03421318 
SAMN03421319 
SAMN03421320 
SAMN03421321 
$ epost -db biosample -input samples.txt | elink -target sra | efetch -format runinfo > runinfo.csv

The output CSV file has 47 fields, where field 1 is the SRA Run accession and field 26 is the BioSample accession. You can parse the CSV using awk:

$ awk 'BEGIN{FS=",";OFS="\t"}{print $1,$26}' runinfo.csv
Run         BioSample
SRR1927184  SAMN03421314
SRR1927214  SAMN03421316
SRR1927218  SAMN03421318
SRR1927224  SAMN03421320
SRR1927212  SAMN03421315
SRR1927215  SAMN03421317
SRR1927221  SAMN03421319
SRR1927228  SAMN03421321
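
If you would rather not rely on the field positions, a variant of the same awk idea can look the two columns up by their header names. This is only a sketch: the function name biosample_to_run is made up for illustration, and it assumes, like the one-liner above, that no field contains an embedded comma. It also skips blank lines and repeated header lines in case the file contains any.

```shell
# Print "BioSample<TAB>Run" for every data row of a runinfo CSV.
# The Run and BioSample columns are found by header name, not by position.
# Assumption: fields contain no embedded commas (plain comma splitting).
biosample_to_run() {
    awk -F',' '
        NR == 1 {
            # Locate the two columns in the header row.
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == "BioSample") bs = i
            }
            if (!run || !bs) {
                print "Run or BioSample column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip blank lines and any repeated header lines.
        $run != "" && $run != "Run" { print $bs "\t" $run }
    ' "$1"
}
```

Calling it as `biosample_to_run runinfo.csv` should print one tab-separated BioSample/Run pair per line, with the BioSample first. That order matches the "SAMN03421321 : SRR1927228" layout asked for in the question.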