Reference Genome from NCBI BioProject ID or SRA
10 months ago
nickarsimet ▴ 30

Given an SRR/ERR number or a BioProject ID, is there a way (through the Internet, by command line, by Python program, etc.) to fetch the reference genome corresponding to that accession number?

Basically, I am working through a sequencing tutorial, and they are giving an example using the reads associated with ERR1036032. A search on EBI shows that this run is from E. faecium, so I went to NCBI's repo for this bacterium and downloaded the FASTA for the genome. However, the tutorial includes its own reference genome, and the included file doesn't match what I downloaded from NCBI. Can I get the reference genome from the ERR number alone?

genome sequencing alignment software error • 355 views
10 months ago
GenoMax 111k

Many organisms, especially commonly studied ones, have multiple genomes in the genome database. For example, the NCBI genome database has 23,186 genomes available for Escherichia coli as of today. (LINK)

When working with a particular organism, a genome from the RefSeq genomes database is likely your best option, since those are manually curated, stable genomes. Here is a representative genome for this organism:

$ esearch -db taxonomy -query "1352  [taxID]" | elink -target assembly | efetch -format docsum | xtract -pattern DocumentSummary -element RefSeq,Organism,RefSeq_category,FtpPath_RefSeq | grep -v "na"
GCF_010120755.1 Enterococcus faecium (firmicutes)   representative genome   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/010/120/755/GCF_010120755.1_ASM1012075v1
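The last column of that output is a directory, not a file. As a rough sketch, and assuming the usual NCBI layout in which the genomic FASTA sits inside that directory under the directory's own name plus `_genomic.fna.gz`, the download URL can be built as follows. The function name `refseq_fasta_url` is made up for this example, and it switches the scheme to https on the assumption that the same host serves the files that way.

```shell
# Hypothetical helper: build the genomic FASTA URL from an FtpPath_RefSeq value.
# Assumes the file is named <directory name>_genomic.fna.gz inside that directory.
refseq_fasta_url() {
  path="${1%/}"         # drop a trailing slash, if any
  path="${path#*://}"   # drop the scheme (ftp:// or https://)
  base="${path##*/}"    # last path component, e.g. GCF_010120755.1_ASM1012075v1
  printf 'https://%s/%s_genomic.fna.gz\n' "$path" "$base"
}

# Example use (needs network access, so it is left commented out):
# curl -O "$(refseq_fasta_url ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/010/120/755/GCF_010120755.1_ASM1012075v1)"
```

If the download fails, list the directory in a browser and check the actual file names before relying on this pattern.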

An SRA record gives you the organism name, but not a specific genome associated with the dataset. The runinfo output from the query below includes the TaxID, which can be used to find the RefSeq genome as demonstrated above.

$ esearch -db sra -query "ERR1036032" | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR1036032,2015-09-29 05:07:35,2016-03-04 14:45:45,2128045,425609000,2128045,200,183,ASM25094v1,https://sra-downloadb.st-va.ncbi.nlm.nih.gov/sos1/sra-pub-run-8/ERR1036032/ERR1036032.1,ERX1114766,13764563,WGS,RANDOM,GENOMIC,PAIRED,500,0,ILLUMINA,Illumina HiSeq 2000,ERP009805,PRJEB8769,,293294,ERS683353,SAMEA3304528,simple,1352,Enterococcus faecium,SAMEA3304528,,,,,,,no,,,,,THE WELLCOME TRUST SANGER INSTITUTE,ERA490686,,public,16BA4EEAD61804CF2C7B75E487BEFC64,159B6E74B5596FD71E84956CF20C333B
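Picking the TaxID out of that long line by eye is error-prone, so here is a small sketch that pulls a column out by its header name. The function name `runinfo_field` is invented for this example, and it assumes no field contains an embedded comma, which holds for the record above but may not hold for every run.

```shell
# Hypothetical helper: print one named column from runinfo CSV read on stdin.
# Assumes plain comma-separated fields with no quoted or embedded commas.
runinfo_field() {
  awk -F',' -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }  # locate the column in the header
    idx && NF { print $idx }                                            # print it for each non-empty data row
  '
}

# Example use (needs EDirect and network access, so it is left commented out):
# esearch -db sra -query "ERR1036032" | efetch -format runinfo | runinfo_field TaxID
```

Looking the column up by name rather than by position keeps the command working if the order of the runinfo columns ever changes. If the column name is not found, nothing is printed.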

You can then use NCBI's new tool called datasets (LINK) to download the genome sequence:

$ datasets download genome taxon 1352 --exclude-protein