Downloading raw nucleotide sequence reads programmatically
9.1 years ago
moranr ▴ 290

Hi,

I am trying to download the sequence reads behind the reference genome for more than 100 animal species. Using EBI's REST URLs I can get FTP links by taxon name, but that returns all reads for the taxon. Is it possible to query by assembly accession instead, so that I get only the raw reads used in that specific assembly?

e.g. Gorilla gorilla

http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22tax_name(%22Gorilla%20gorilla%22)%20&%20library_source!=%22TRANSCRIPTOMIC%22%22&result=read_study&display=report&fields=fastq_ftp,instrument_platform,read_count,base_count

This returns many FTP links. I only want the raw reads for a specific assembly, e.g. GCA_000167515.2

Edit: The example assembly identifier I gave is from NCBI. I retrieve the raw FASTQ files from the EBI database, which does not recognise this identifier directly.

Thanks for your help,

R

genome python sequencing fastq • 1.9k views
9.1 years ago
Ram 43k

curl is your friend.

curl -O "$(curl -s '<URL>' | grep "ASSEMBLY_IDENTIFIER" | cut -f2 | sed -E 's/;$//')"

Logic: Query the URL, get the results, narrow by identifier, get the FTP URL and curl the FTP URL to get the file.
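If Python is preferred, the same logic can be sketched roughly as below. This is only a sketch under assumptions: the report is taken to be tab-separated with a header row containing a `fastq_ftp` column, multiple files in one field are taken to be separated by `;`, and links may come back without the `ftp://` scheme. The function names `extract_ftp_urls` and `download_all` are hypothetical helpers, not part of any EBI client.

```python
import csv
import io
import os
import urllib.request


def extract_ftp_urls(report_text, identifier, column="fastq_ftp"):
    """Return FTP URLs from rows of a tab-separated report that mention `identifier`."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    urls = []
    for row in reader:
        # Keep only rows where some field contains the identifier.
        if not any(identifier in (value or "") for value in row.values()):
            continue
        # Assumption: several files in one field are separated by ';'.
        for part in (row.get(column) or "").split(";"):
            part = part.strip()
            if not part:
                continue
            # Assumption: links may be listed without a scheme.
            if "://" not in part:
                part = "ftp://" + part
            urls.append(part)
    return urls


def download_all(urls, dest_dir="."):
    """Download each URL into dest_dir, named after the last path component."""
    for url in urls:
        target = os.path.join(dest_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, target)
```

To use it, fetch the report text from the query URL (for example with `urllib.request.urlopen(url).read().decode()`), pass it to `extract_ftp_urls` with the identifier to narrow by, and hand the result to `download_all`. Note that this only works if the identifier actually appears somewhere in the report rows.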

I don't see why the python tag is relevant, BTW.


The python tag was to indicate that a solution in python is fine.

The example assembly identifier I gave was from NCBI. I retrieve the raw fastq files from the EBI database, which does not recognise this.


You have an ID mapping problem on your hands then. Hmmm.
