Get SRA IDs From GEO

7

11.2 years ago

tomislav.ilicic ▴ 120

Hi,

I am trying to get all sample ids (SRS ids) from a GEO ID. For example, I am trying to fetch all SRS ids belonging to GSE44183.

Is there any way to get these programmatically? I tried the NCBI E-utilities but couldn't construct the right query.

Help would be very much appreciated. Best, Tomi

geo ncbi • 12k views

ADD COMMENT • link updated 5.9 years ago by j.aryaman25 ▴ 20 • written 11.2 years ago by tomislav.ilicic ▴ 120

0

You need to clarify this question. First, the title refers to "SRA ids". However, the question then uses "SRS ids", twice. Which is it? I suspect SRA.

Second, you need to define and give an example of the "sample ids" you want to retrieve. For this type of GEO record, one could retrieve GEO sample IDs (starting with GSM), or SRA read IDs (starting with SRR), or even SRX IDs. So please, define clearly what you want to do.

ADD REPLY • link 11.2 years ago by Neilfws 49k

0

Hi,

I apologize, I got a bit confused with the number of different ids in this case.

I have around 40 GSE ids, and for each one (e.g. GSE44183) I want to download all the sequencing data belonging to it. To do this, I planned to use fastq-dump, which needs SRA ids as input. Hence, I am trying to fetch all SRA IDs belonging to a GSE. Maybe this is not the right approach, but I couldn't think of an easier way to download all the data.

Hope this is clear now.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 11.2 years ago by tomislav.ilicic ▴ 120

0

So you want sequencing run accessions, i.e. SRR?

ADD REPLY • link 11.2 years ago by Neilfws 49k

0

Yes. From GSE ids.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 11.2 years ago by tomislav.ilicic ▴ 120

5

11.2 years ago

Neilfws 49k

Not sure that you can get from GSE to SRR in one step, but EUtils is definitely the way to go.

You can get from GSE to SRX using EDirect like this (using head to show the first 5 results):

esearch -db gds -query "GSE44183[ACCN] AND GSM[ETYP]" | efetch -format docsum | \
xtract -pattern ExtRelation -element RelationType,TargetObject | head -5

SRA    SRX300901
SRA    SRX300900
SRA    SRX300899
SRA    SRX300898
SRA    SRX300897

Then you could write the SRX accessions to a file, parse it, and use each accession in a new esearch query:

esearch -db sra -query "SRX300901[ACCN]" | efetch -format docsum | xtract -element Runs

<Run acc="SRR893074" total_spots="22020236" total_bases="3963642480" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>

That does not quite get you there, since the SRR is contained in an attribute. You may want to use the XML parser of your choice, rather than EDirect xtract, to process the XML returned by efetch.
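As a rough illustration of pulling the accession out of that attribute without a full XML parser, the sketch below uses grep and sed. The function name extract_run_acc is hypothetical (it is not part of EDirect), and the approach assumes each Run element keeps its acc attribute inside a single tag, as in the output above. A real XML parser is more robust.

```shell
#!/usr/bin/env bash
# Sketch only: extract_run_acc is a hypothetical helper, not an EDirect command.
# Reads XML text on stdin and prints the acc attribute of every <Run .../> element.
extract_run_acc() {
    # grep -o emits one '<Run ... acc="..."' fragment per Run element;
    # sed then strips everything except the attribute value.
    grep -o '<Run[^>]* acc="[^"]*"' | sed 's/.*acc="\([^"]*\)"/\1/'
}
```

It could be appended to the pipeline above, for example `... | xtract -element Runs | extract_run_acc`.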

Another approach that I have not yet explored: it may be possible to parse a GEO SOFT or MINiML file, which should be obtainable from the FTP site using the original GSE accession.
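A minimal sketch of the SOFT-parsing idea, under stated assumptions: the family SOFT file has already been downloaded and decompressed, and each sample carries a relation line of the form `!Sample_relation = SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX300901`. The helper name soft_to_srx is hypothetical. Note that this yields SRX accessions, so a second step is still needed to reach SRR.

```shell
#!/usr/bin/env bash
# Sketch only: soft_to_srx is a hypothetical helper name.
# Reads GEO SOFT text on stdin and prints the unique SRX accessions it references.
# Assumes SRA links appear on lines starting with '!Sample_relation = SRA:'.
soft_to_srx() {
    grep '^!Sample_relation = SRA:' |   # keep only the SRA relation lines
        grep -o 'SRX[0-9][0-9]*' |      # pull out the experiment accession
        sort -u                         # drop duplicates
}
```

Usage might look like `gunzip -c GSE44183_family.soft.gz | soft_to_srx`, where the file name is an assumption about what was downloaded.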

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 11.2 years ago by Neilfws 49k

3

Hi,

Thanks for the help.

I solved it by doing this:

esearch -db sra -query "GSE52529" | efetch -format docsum | xtract -pattern DocumentSummary -element Run@acc

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 11.2 years ago by tomislav.ilicic ▴ 120

3

9.9 years ago

Kamil ★ 2.3k

Thanks to Neil and Tomislav for the helpful comments! I use this script to download all SRA files for a given SRA id:

	#!/usr/bin/env bash
	# sra2srr.sh
	#
	# Example
	# -------
	# To download all of the read archive files for SRP012001:
	# sra2srr.sh SRP012001 | while read -r srr; do prefetch "$srr"; done
	#
	# For 'esearch', 'efetch', 'xtract', you must install Entrez Direct:
	# http://www.ncbi.nlm.nih.gov/books/NBK179288/
	#
	# For 'prefetch', you must install SRA Tools:
	# https://github.com/ncbi/sra-tools

	SRA=$1
	# Query SRA, fetch the document summaries, and print one run accession per line.
	esearch -db sra -query "$SRA" | \
	efetch -format docsum | \
	xtract -pattern DocumentSummary -element Run@acc | \
	tr '\t' '\n'

ADD COMMENT • link 9.9 years ago by Kamil ★ 2.3k

1

Hi Kamil, could you make a small modification to capture the sample name at the same time? For example SRS, 'Sperm' ....

ADD REPLY • link 7.0 years ago by Shicheng Guo ★ 9.6k

2

5.9 years ago

j.aryaman25 ▴ 20

This code will get all SRR identifiers from a GSE:

#!/usr/bin/env bash

# gse2srr.sh
# Requires entrez-direct
# conda install -c bioconda entrez-direct

# To use,
# bash gse2srr.sh GSE52529
# This will create a text file GSE52529_SRR.txt

GSE=$1
echo "Finding all SRX associated with ${GSE}..."

# Collect the SRX accession linked to each GSM sample of the series.
mapfile -t SRX_ARRAY < <(esearch -db gds -query "${GSE}[ACCN] AND GSM[ETYP]" | \
efetch -format docsum | xtract -pattern ExtRelation -element TargetObject)

echo "Finding all SRR associated with ${GSE}..."

rm -f "${GSE}_SRR.txt"

for i in "${SRX_ARRAY[@]}"
do
   echo "$i"
   # One query per SRX; tr puts each run accession on its own line.
   esearch -db sra -query "$i" | efetch -format docsum | \
   xtract -pattern DocumentSummary -element Run@acc | tr '\t' '\n' >> "${GSE}_SRR.txt"
done

It is a bit slow because it does a database query for every SRX. I would be stunned if there isn't a faster way to do this, but it at least answers the question.
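One possible way to cut down the number of queries is to combine several SRX accessions into a single OR query and search in batches. This is an untested sketch, not a benchmarked solution: build_or_query and srx_to_srr are hypothetical helper names, the batch size of 50 is arbitrary, and it assumes the sra database accepts OR-joined `[ACCN]` terms in the same way it accepts a single one.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers.

# Join accessions into one Entrez query: "A[ACCN] OR B[ACCN] OR ...".
build_or_query() {
    local query="" acc
    for acc in "$@"; do
        if [ -n "$query" ]; then
            query="$query OR "
        fi
        query="${query}${acc}[ACCN]"
    done
    printf '%s\n' "$query"
}

# Run one esearch per batch of SRX accessions instead of one per accession.
# Requires Entrez Direct (esearch, efetch, xtract) on the PATH.
srx_to_srr() {
    local chunk=50            # arbitrary batch size (assumption)
    local -a srx=("$@")
    local i
    for ((i = 0; i < ${#srx[@]}; i += chunk)); do
        esearch -db sra -query "$(build_or_query "${srx[@]:i:chunk}")" |
            efetch -format docsum |
            xtract -pattern DocumentSummary -element Run@acc |
            tr '\t' '\n'
    done
}
```

With the array from the script above, the loop could then be replaced by something like `srx_to_srr "${SRX_ARRAY[@]}" > "${GSE}_SRR.txt"`.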

ADD COMMENT • link 5.9 years ago by j.aryaman25 ▴ 20

0

If you have the study accession, e.g. PRJNA288801, you can simply look it up at the ENA and then do a fast download as described in this tutorial:

Fast download of FASTQ files from the European Nucleotide Archive (ENA)
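As a hedged sketch of what the ENA lookup could look like from the command line, the helpers below build an ENA Portal API file report URL and list the FASTQ paths from its TSV output. The function names are hypothetical, and the sketch assumes the report has a header row containing a fastq_ftp column whose values are separated by semicolons for paired-end runs. It is not taken from the tutorial.

```shell
#!/usr/bin/env bash
# Sketch only: ena_report_url and ena_fastq_paths are hypothetical helper names.

# Print the ENA Portal API file report URL for a study or run accession.
ena_report_url() {
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp&format=tsv\n' "$1"
}

# Read the TSV report on stdin and print one FASTQ path per line.
# The fastq_ftp column is located by its header name rather than by position.
ena_fastq_paths() {
    awk -F '\t' '
        NR == 1 { for (c = 1; c <= NF; c++) if ($c == "fastq_ftp") col = c; next }
        col && $col != "" {
            n = split($col, parts, ";")
            for (k = 1; k <= n; k++) print parts[k]
        }
    '
}
```

Usage might be `curl -s "$(ena_report_url PRJNA288801)" | ena_fastq_paths`, with the resulting paths passed to the download tool of your choice.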

ADD REPLY • link 5.9 years ago by ATpoint 88k