Using Entrez to download Supplementary files in GEO entry via command line?
14 months ago
ccc ▴ 30

Suppose I'm looking at a GEO entry like so: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM946533

Notice at the bottom is a table of 'Supplementary files', containing broadpeak and bigwig files. I'm wondering how to download these using Entrez.

I've tried a variety of approaches, the "closest" I've gotten is the following:

esearch -db gds -query GSM946533 | efetch -format docsum > docsum.xml

This gives me several tags for "suppFile", and they contain the correct file types:

<suppFile>BIGWIG, BROADPEAK</suppFile>

(This is the last suppFile in the docsum.xml result.) However, that's about as close as I can get to touching these files. Obviously I can just go to the webpage and download them through the browser, but I'm wondering whether there is a command-line method.

edit: By "closest to touching these files", I mean that so far I haven't been able to get Entrez to fetch even the filenames, just their types. I do get something like the output below, but I don't think these are the files I'm looking for. The files I want are named like GSM946533_mm9_wgEncodePsuHistoneG1eH3k04me3ME0S129InputPk.broadPeak.gz (notice the difference in how the histone marks are written), and even if these were the files, it's not clear to me how to download them:

<suppFile>BIGWIG, BROADPEAK, TXT</suppFile>
<Samples>
    <Sample>
        <Accession>GSM946525</Accession>
        <Title>PSU_ChipSeq_Megakaryo_H3K4me1</Title>
    </Sample>
    <Sample>
        <Accession>GSM946545</Accession>
        <Title>PSU_ChipSeq_G1E-ER4_H3K9me3</Title>
    </Sample>
    <Sample>
        <Accession>GSM946548</Accession>
        <Title>PSU_ChipSeq_CH12_H3K9me3</Title>
bash command-line entrez • 625 views
14 months ago
GenoMax 141k

With the following you can get the FTP directory for the sample:

$ esearch -db gds -query GSM946533 | efetch -format docsum | xtract -pattern DocumentSummary -element FTPLink | grep samples
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/

The files you want are under the suppl directory at that link: https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/suppl/

So while you can't get the filenames directly from Entrez, this should get you close. Different samples seem to follow the same pattern of links.
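
If the link pattern really is that regular, the suppl URL can probably be built from the accession alone, without calling Entrez. This is only a sketch: `geo_suppl_url` is a made-up helper name, and it assumes every sample follows the layout seen above (last three digits of the accession replaced by "nnn"), which I have only checked against this one example.

```shell
#!/bin/sh
# Hypothetical helper: build the HTTPS suppl URL for a GSM accession.
# Assumption: the directory layout matches the one observed for GSM946533,
# i.e. the parent directory is the accession with its last three
# characters replaced by "nnn".
geo_suppl_url() {
    acc="$1"
    stub="${acc%???}nnn"   # GSM946533 -> GSM946nnn
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/samples/%s/%s/suppl/\n' "$stub" "$acc"
}

geo_suppl_url GSM946533
```

For GSM946533 this prints the same suppl link as above. Accessions with very few digits may not fit this simple rule, so it is worth checking the result against the FTPLink that Entrez returns.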


Whoa! Thank you! Exactly what I was looking for!!

I appended sed 's/^ftp/https/; s/$/suppl\//' to your pipeline, which turns the FTP link into the HTTPS suppl URL directly :)
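
For anyone who wants the whole thing in one place, here is a rough sketch of the rewrite step plus a download step. The function names `ftp_to_suppl_https` and `download_suppl` are my own, and the wget flags are just the usual recursive-download options; I am assuming the HTTPS directory listing can be crawled this way, so the options may need tuning.

```shell
#!/bin/sh
# Hypothetical helper: rewrite an FTP sample link (read from stdin) into
# the HTTPS URL of its suppl directory.
ftp_to_suppl_https() {
    sed -e 's|^ftp:|https:|' -e 's|$|suppl/|'
}

# Hypothetical helper: fetch everything in a suppl directory.
# -r recurse, -np do not ascend to the parent, -nd do not recreate the
# directory tree locally, -R skip the generated index pages.
download_suppl() {
    wget -r -np -nd -R 'index.html*' "$1"
}

# Example of the rewrite step on the link from the answer above:
printf 'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/\n' | ftp_to_suppl_https
```

With EDirect installed, the esearch | efetch | xtract | grep pipeline from the answer can be piped into `ftp_to_suppl_https`, and the resulting URL passed to `download_suppl`.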
