Question

Downloading raw nucleotide sequence reads programmatically

9.1 years ago

moranr ▴ 290

Hi,

I am trying to download the sequence reads for the reference genomes of more than 100 animal species. Using EBI's REST URLs, I can get FTP links by taxon name. However, all reads for the taxon are returned. Is it possible to query by assembly accession instead, so that I get only the raw reads used in that specific assembly?

e.g. Gorilla gorilla

http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22tax_name(%22Gorilla%20gorilla%22)%20&%20library_source!=%22TRANSCRIPTOMIC%22%22&result=read_study&display=report&fields=fastq_ftp,instrument_platform,read_count,base_count

This returns many FTP links, but I only want the raw reads for a specific assembly, e.g. GCA_000167515.2
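For more than 100 species it may be easier to build the search URL in code than to edit it by hand. The following is a minimal Python sketch, assuming the endpoint and the query expression are exactly those in the URL above; the function name `build_search_url` is hypothetical. Unlike the hand-written URL, it percent-encodes the `&` inside the query as `%26` so that it is not read as a parameter separator. Whether the service accepts this form has not been checked here.

```python
from urllib.parse import urlencode, quote

# Endpoint copied from the question; it may have changed since the post was written.
BASE_URL = "http://www.ebi.ac.uk/ena/data/warehouse/search"


def build_search_url(taxon, fields):
    """Return the search URL for one taxon with every parameter percent-encoded.

    Hypothetical helper: the query expression is copied from the question,
    only the encoding is done by urllib instead of by hand.
    """
    query = f'"tax_name("{taxon}") & library_source!="TRANSCRIPTOMIC""'
    params = {
        "query": query,
        "result": "read_study",
        "display": "report",
        "fields": ",".join(fields),
    }
    # quote_via=quote encodes spaces as %20 (not '+') and '&' as %26.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    wanted = ["fastq_ftp", "instrument_platform", "read_count", "base_count"]
    for species in ["Gorilla gorilla"]:
        print(build_search_url(species, wanted))
```

The species list in the usage part can be extended to all taxa of interest; each URL can then be fetched with any HTTP client.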

Edit: The example assembly identifier I gave is from NCBI. I retrieve the raw FASTQ files from the EBI database, which does not recognise this identifier directly.

Thanks for your help,

R

genome python sequencing fastq • 1.9k views

updated 23 months ago by Ram 43k • written 9.1 years ago by moranr ▴ 290

Ram · Answer 1 · 2015-03-24

9.1 years ago

Ram 43k

curl is your friend.

curl -O "$(curl -s '<URL>' | grep 'ASSEMBLY_IDENTIFIER' | cut -f2 | sed -E 's/;$//')"

Logic: Query the URL, get the results, narrow by identifier, get the FTP URL and curl the FTP URL to get the file.
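Since a Python solution was also acceptable to the asker, the same logic can be sketched in Python. This is only an illustration under stated assumptions: the report is taken to be tab-separated with a header row and a `fastq_ftp` column (as requested via `fields=` in the question), paired-end runs are assumed to list several files separated by `;`, and the FTP paths are assumed to lack the `ftp://` scheme. The function names `fastq_urls_for` and `download` are hypothetical. As with the `grep`, the filter only helps if the identifier actually appears somewhere in the report rows.

```python
import csv
import io
import urllib.request


def fastq_urls_for(report_text, identifier, ftp_column="fastq_ftp"):
    """Return FASTQ FTP URLs from report rows that mention `identifier`.

    Assumes a tab-separated report with a header row (hypothetical helper).
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    urls = []
    for row in reader:
        # Same idea as the grep: keep rows where any field contains the identifier.
        if not any(identifier in (value or "") for value in row.values()):
            continue
        # Assumption: several files in one field are separated by ';'.
        for part in (row.get(ftp_column) or "").split(";"):
            part = part.strip()
            if part:
                # Assumption: the report omits the scheme, so add ftp:// if missing.
                urls.append(part if "://" in part else "ftp://" + part)
    return urls


def download(url, destination):
    """Fetch one file; urllib.request handles ftp:// URLs as well as http://."""
    urllib.request.urlretrieve(url, destination)


if __name__ == "__main__":
    # Made-up report text for illustration only; not real accessions or hosts.
    sample = (
        "study_accession\tfastq_ftp\n"
        "STUDY_A\tftp.example.org/a_1.fastq.gz;ftp.example.org/a_2.fastq.gz\n"
        "STUDY_B\tftp.example.org/b.fastq.gz\n"
    )
    for url in fastq_urls_for(sample, "STUDY_A"):
        print(url)
```

In real use, `report_text` would be the body returned by the search URL, and each resulting URL would be passed to `download` together with a local file name.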

I don't see why the python tag is relevant, BTW.

23 months ago by Ram 43k

The python tag was to indicate that a solution in python is fine.

The example assembly identifier I gave was from NCBI. I retrieve the raw fastq files from the EBI database, which does not recognise this.

updated 23 months ago by Ram 43k • written 9.1 years ago by moranr ▴ 290

You have an ID mapping problem on your hands then. Hmmm.

9.1 years ago by Ram 43k