Question: How do I get a list of RefSeq from Assembly in Entrez?
2
gravatar for bfeeny
4.4 years ago by
bfeeny20
United States
bfeeny20 wrote:

I am trying to get a list of RefSeq from an Assembly in Entrez, but its not working.  

First, I am working with GRCh37.1.  I can do a manual query on Entrez like so:

GCA_000001405.14[Assembly Accession]

This returns a webpage, that clearly has the RefSeq's in it.  I can get a text copy of the full report by clicking the hyperlink "Download full sequence report" which takes me to ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.25.assembly.txt

But when I try to do this query via BioPython, I only get the summary, and I can't seem to get the "full report" which shows the RefSeq's:
 

from Bio import Entrez

Entrez.email = "brianfeeny@fas.harvard.edu"     # Always tell NCBI who you are

# Do a search
handle = Entrez.esearch(db="assembly", term="Homo sapiens[Orgn] AND GCA_000001405.2[ASAC] AND Chromosome[ASLV]")
record = Entrez.read(handle)
for id in record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)
    print esummary_record

The above correctly shows ID 6328, which is the same UID as the web page query.  It returns this JSON data:

 

{u'DocumentSummarySet': DictElement({u'DbBuild': 'Build150509-1000.1', u'DocumentSummary': [DictElement({u'ChainId': '1405', u'AsmUpdateDate': '2013/08/08 00:00', u'GbUid': '6328', u'PropertyList': ['altloci', 'full-genome-representation', 'has-chromosome', 'replaced', 'replaced_genbank'], u'SubmissionDate': '2010/02/22 00:00', u'GB_BioProjects': [{u'BioprojectAccn': 'PRJNA31257', u'BioprojectId': '31257'}], u'AssemblyAccession': 'GCA_000001405.2', u'LastMajorReleaseAccession': 'GCA_000001405.1', u'Synonym': {u'RefSeq': '', u'Genbank': 'GCA_000001405.2', u'Similarity': ''}, u'AssemblyType': 'haploid-with-alt-loci', u'AssemblyDescription': 'Genome Reference Consortium Human Build 37 patch release 1 (GRCh37.p1)', u'RS_Projects': [], u'RefSeq_category': 'na', u'PartialGenomeRepresentation': 'false', u'Coverage': '0', u'AssemblyClass': 'haploid-with-alt-loci', u'SeqReleaseDate': '2010/02/22 00:00', u'AnomalousList': [], u'AssemblyStatus': 'Chromosome', u'AsmReleaseDate': '2010/02/22 00:00', u'ReleaseLevel': 'Patch', u'Taxid': '9606', u'LastUpdateDate': '2013/08/08 00:00', u'RsUid': '', u'SortOrder': '5C30000014059799', u'WGS': '', u'GB_Projects': ['31257'], u'BioSampleId': '', u'AssemblyName': 'GRCh37.p1', u'EnsemblName': '', u'Organism': 'Homo sapiens', u'Biosource': {u'Isolate': '', u'InfraspeciesList': [], u'Sex': ''}, u'SpeciesName': 'Homo sapiens', u'BioSampleAccn': '', u'SpeciesTaxid': '9606', u'UCSCName': '', u'Primary': '358', u'Meta': ' <Stats> <Stat category="alt_loci_count" sequence_tag="all">9</Stat> <Stat category="chromosome_count" sequence_tag="all">24</Stat> <Stat category="contig_count" sequence_tag="all">462</Stat> <Stat category="contig_l50" sequence_tag="all">24</Stat> <Stat category="contig_n50" sequence_tag="all">38508932</Stat> <Stat category="non_chromosome_replicon_count" sequence_tag="all">0</Stat> <Stat category="replicon_count" sequence_tag="all">24</Stat> <Stat category="scaffold_count" sequence_tag="all">260</Stat> <Stat category="scaffold_count" sequence_tag="placed">190</Stat> <Stat category="scaffold_count" sequence_tag="unlocalized">20</Stat> <Stat category="scaffold_count" sequence_tag="unplaced">39</Stat> <Stat category="scaffold_l50" sequence_tag="all">21</Stat> <Stat category="scaffold_n50" sequence_tag="all">46395641</Stat> <Stat category="total_length" sequence_tag="all">3139095181</Stat> <Stat category="ungapped_length" sequence_tag="all">2899144443</Stat> </Stats> <FtpSites>   <FtpPath type="GenBank">ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p1/</FtpPath> </FtpSites> <assembly-level>3</assembly-level> <assembly-status>Chromosome</assembly-status> <representative-status>na</representative-status> <submitter-organization>Genome Reference Consortium</submitter-organization>    ', u'NCBIReleaseDate': '2010/02/22 00:00', u'SubmitterOrganization': 'Genome Reference Consortium', u'RS_BioProjects': []}, attributes={u'uid': u'6328'})]}, attributes={u'status': u'OK'})}


So I then try to fetch it:
for id in record['IdList']:
    efetch_handle = Entrez.efetch(db="assembly", id=id, retmode="text")
    print efetch_handle.read()

But this just returns "6328".  No actual file data, RefSeq's or otherwise.

Can someone identify what I am doing wrong?

 

 

 

 

biopython entrez • 4.2k views
ADD COMMENTlink modified 4.2 years ago by erik.kastman10 • written 4.4 years ago by bfeeny20
0
gravatar for erik.kastman
4.2 years ago by
erik.kastman10
United States
erik.kastman10 wrote:

I'm not very familiar with EFetch, but it seems to me that the assembly database may not be fully implemented. If you look at the EFetch help, there aren't any options for db=assembly:   Additionally, there are no direct links to download an assembly from the ncbi site, anyway. For an example, check out this example assembly page:   You can download the sequence report, and there's a link to a nucleotide sequence url, but no direct download.

I think you may need to write something that will grab related sequences instead of doing it directly from fetching the assembly. The logic is really tortured and the code could be cleaned, but I think this will do the job:

 

Search Assembly -> Get Assembly Summary -> Parse Nucleotide Accession Name from Summary -> Search Nucleotide DB to get ID from Name -> Fetch Sequence by ID

 

# Search Assembly
handle = Entrez.esearch(db="assembly", term="Homo sapiens[Orgn] AND GCA_000001405.2[ASAC] AND Chromosome[ASLV]")
record = Entrez.read(handle)
for id in record['IdList']:
    # Get Assembly Summary
    esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)

    # Parse Accession Name from Summary
    accession_id = ['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession']

    # Search Name for ID
    refseq_id = Entrez.read(Entrez.esearch(db="nucleotide", term=accession_id))['IdList'][0]  # Likely only one sequence match?
    
    # Fetch Record with ID from nucleotide db
    seq_record = Entrez.efetch(db="nucleotide", id=refseq_id, retmode='text')
ADD COMMENTlink written 4.2 years ago by erik.kastman10

Should be like that:

accession_id = esummary_record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession']

And also instead of that doing that you can get the ftp path and download the file using wget:

link = esummary_record['DocumentSummarySet']['DocumentSummary'][0]['FtpPath_RefSeq']

ADD REPLYlink written 8 months ago by daytan0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1117 users visited in the last hour