How to pull metadata from SRA based on BioSample ID
3.8 years ago
millere • 0

I have a list of over 8,000 BioSample IDs in a text file and I want to use this list to pull specific information on associated sequencing runs from the SRA database. The code I'm using does work, but it's very, very slow. Is there a faster way to do this using entrez-direct? Am I just missing something very obvious? Any help would be very much appreciated!

# an example biosample list
echo "SAMN14390563
SAMN14390566
SAMN14390576
SAMN14390578
SAMN14453547
SAMN14453553" > biosamples.txt

# pull SRA info based on example biosample list
cat biosamples.txt | xargs -n 1 sh -c 'esearch -db sra -query "$0 [BSPL]" | \
join-into-groups-of 20 | \
efetch -db sra -format runinfo -mode xml | \
xtract -pattern Row -def "NA" -element Run spots bases spots_with_mates avgLength \
size_MB download_path Experiment LibraryStrategy LibrarySelection LibrarySource \
LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject ProjectID \
Sample BioSample SampleType TaxID ScientificName SampleName CenterName \
Submission Consent >> metadata_sra.txt'
ncbi entrez-direct • 2.9k views
3.8 years ago
vkkodali_ncbi ★ 3.7k

You may want to try epost as follows:

$ cat samples.txt 
SAMN14390563
SAMN14390566
SAMN14390576
SAMN14390578
SAMN14453547
SAMN14453553
$ epost -db biosample -input samples.txt -format acc | \
elink -target sra | \
efetch -db sra -format runinfo -mode xml | \
xtract -pattern Row -def "NA" -element Run spots bases spots_with_mates avgLength \
size_MB download_path Experiment LibraryStrategy LibrarySelection LibrarySource \
LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject ProjectID \
Sample BioSample SampleType TaxID ScientificName SampleName CenterName \
Submission Consent > metadata_sra.txt

This avoids running esearch once for every single accession. Note that epost limits the number of accessions you can provide at a time. To work around this, Entrez Direct comes with a built-in function called join-into-groups-of that can be used here. You can read more about it in the Entrez Direct documentation; look for it under the group "Processing in Groups".
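
To make the grouping step concrete, here is a minimal sketch of what it does to a list of accessions. It is a stand-in written with awk, not the Entrez Direct implementation: group_ids is a hypothetical name, and the sketch assumes join-into-groups-of simply turns one ID per line into comma-separated groups of at most N IDs, one group per line.

```shell
# Stand-in for join-into-groups-of (group_ids is a hypothetical name,
# not part of Entrez Direct): read one ID per line on stdin and print
# comma-separated groups of at most N IDs, one group per line.
group_ids() {
  awk -v n="$1" '
    NF == 0 { next }                                  # skip blank lines
    { printf "%s%s", (count % n ? "," : ""), $1; count++ }
    count % n == 0 { printf "\n" }                    # group is full
    END { if (count % n) printf "\n" }                # close a partial group
  '
}

# The six example accessions from the question, three per group.
printf '%s\n' SAMN14390563 SAMN14390566 SAMN14390576 \
              SAMN14390578 SAMN14453547 SAMN14453553 | group_ids 3
# prints:
# SAMN14390563,SAMN14390566,SAMN14390576
# SAMN14390578,SAMN14453547,SAMN14453553
```

Each output line is then small enough to hand to a single epost call, so the number of requests drops from one per accession to one per group.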

This is SO much faster! Thank you! Making use of the join-into-groups-of function, I ended up with the following command:

cat biosamples.txt | \
join-into-groups-of 500 | \
xargs -n 1 sh -c 'epost -db biosample -id "$0" -format acc | \
elink -target sra |  \
efetch -db sra -format runinfo -mode xml | \
xtract -pattern Row -def "NA" -element Run spots bases spots_with_mates avgLength \
size_MB download_path Experiment LibraryStrategy LibrarySelection LibrarySource \
LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject ProjectID \
Sample BioSample SampleType TaxID ScientificName SampleName CenterName \
Submission Consent >> metadata_sra.txt'

Whereas my original command was taking 4+ hours to run, this does the same thing in just a few minutes!
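
Since the batches are appended one after another, it may be worth checking afterwards that every BioSample in the list actually shows up in the output. Below is a rough sketch of such a check. missing_biosamples is a hypothetical helper name, and it assumes the output is tab-delimited with BioSample as the 21st column, which is where it falls in the -element list above.

```shell
# List BioSample IDs from the input list that never appear in the
# BioSample column of the xtract output.
# missing_biosamples is a hypothetical helper; column 21 is an assumption
# based on the order of the -element arguments used in this thread.
missing_biosamples() {
  ids_file=$1
  metadata_file=$2
  awk -F '\t' '
    FILENAME == ARGV[1] { seen[$21] = 1; next }   # metadata table: record BioSample
    NF && !($1 in seen) { print $1 }              # ID list: report unseen IDs
  ' "$metadata_file" "$ids_file"
}

# Usage (file names as in this thread):
#   missing_biosamples biosamples.txt metadata_sra.txt
```

An ID reported here is not necessarily an error; a BioSample with no linked SRA runs would also be listed.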
