Question

How do I get a list of RefSeq from Assembly in Entrez?

5

Entering edit mode

10.2 years ago

bfeeny ▴ 50

I am trying to get a list of RefSeq from an Assembly in Entrez, but its not working.

First, I am working with GRCh37.1. I can do a manual query on Entrez like so:

GCA_000001405.14[Assembly Accession]

This returns a webpage, that clearly has the RefSeq's in it. I can get a text copy of the full report by clicking the hyperlink "Download full sequence report" which takes me to ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.25.assembly.txt

But when I try to do this query via BioPython, I only get the summary, and I can't seem to get the "full report" which shows the RefSeq's:

from Bio import Entrez

Entrez.email = "brianfeeny@fas.harvard.edu"     # Always tell NCBI who you are

# Do a search
handle = Entrez.esearch(db="assembly", term="Homo sapiens[Orgn] AND GCA_000001405.2[ASAC] AND Chromosome[ASLV]")
record = Entrez.read(handle)
for id in record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)
    print esummary_record

The above correctly shows ID 6328, which is the same UID as the web page query. It returns this JSON data:

{
    u'DocumentSummarySet':DictElement({
        u'DbBuild':'Build150509-1000.1',
        u'DocumentSummary':[
            DictElement({
                u'ChainId':'1405',
                u'AsmUpdateDate':'2013/08/08 00:00',
                u'GbUid':'6328',
                u'PropertyList':[
                    'altloci',
                    'full-genome-representation',
                    'has-chromosome',
                    'replaced',
                    'replaced_genbank'
                ],
                u'SubmissionDate':'2010/02/22 00:00',
                u'GB_BioProjects':[
                    {
                        u'BioprojectAccn':'PRJNA31257',
                        u'BioprojectId':'31257'
                    }
                ],
                u'AssemblyAccession':'GCA_000001405.2',
                u'LastMajorReleaseAccession':'GCA_000001405.1',
                u'Synonym':{
                    u'RefSeq':'',
                    u'Genbank':'GCA_000001405.2',
                    u'Similarity':''
                },
                u'AssemblyType':'haploid-with-alt-loci',
                u'AssemblyDescription':'Genome Reference Consortium Human Build 37 patch release 1 (GRCh37.p1)',
                u'RS_Projects':[

                ],
                u'RefSeq_category':'na',
                u'PartialGenomeRepresentation':'false',
                u'Coverage':'0',
                u'AssemblyClass':'haploid-with-alt-loci',
                u'SeqReleaseDate':'2010/02/22 00:00',
                u'AnomalousList':[

                ],
                u'AssemblyStatus':'Chromosome',
                u'AsmReleaseDate':'2010/02/22 00:00',
                u'ReleaseLevel':'Patch',
                u'Taxid':'9606',
                u'LastUpdateDate':'2013/08/08 00:00',
                u'RsUid':'',
                u'SortOrder':'5C30000014059799',
                u'WGS':'',
                u'GB_Projects':[
                    '31257'
                ],
                u'BioSampleId':'',
                u'AssemblyName':'GRCh37.p1',
                u'EnsemblName':'',
                u'Organism':'Homo sapiens',
                u'Biosource':{
                    u'Isolate':'',
                    u'InfraspeciesList':[

                    ],
                    u'Sex':''
                },
                u'SpeciesName':'Homo sapiens',
                u'BioSampleAccn':'',
                u'SpeciesTaxid':'9606',
                u'UCSCName':'',
                u'Primary':'358',
                u'Meta':' <Stats> <Stat category="alt_loci_count" sequence_tag="all">9</Stat> <Stat category="chromosome_count" sequence_tag="all">24</Stat> <Stat category="contig_count" sequence_tag="all">462</Stat> <Stat category="contig_l50" sequence_tag="all">24</Stat> <Stat category="contig_n50" sequence_tag="all">38508932</Stat> <Stat category="non_chromosome_replicon_count" sequence_tag="all">0</Stat> <Stat category="replicon_count" sequence_tag="all">24</Stat> <Stat category="scaffold_count" sequence_tag="all">260</Stat> <Stat category="scaffold_count" sequence_tag="placed">190</Stat> <Stat category="scaffold_count" sequence_tag="unlocalized">20</Stat> <Stat category="scaffold_count" sequence_tag="unplaced">39</Stat> <Stat category="scaffold_l50" sequence_tag="all">21</Stat> <Stat category="scaffold_n50" sequence_tag="all">46395641</Stat> <Stat category="total_length" sequence_tag="all">3139095181</Stat> <Stat category="ungapped_length" sequence_tag="all">2899144443</Stat> </Stats> <FtpSites>   <FtpPath type="GenBank">ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p1/</FtpPath> </FtpSites> <assembly-level>3</assembly-level> <assembly-status>Chromosome</assembly-status> <representative-status>na</representative-status> <submitter-organization>Genome Reference Consortium</submitter-organization>    ',
                u'NCBIReleaseDate':'2010/02/22 00:00',
                u'SubmitterOrganization':'Genome Reference Consortium',
                u'RS_BioProjects':[

                ]
            },
            attributes={
                u'uid':u'6328'
            })
        ]
    },
    attributes={
        u'status':u'OK'
    })
}

So I then try to fetch it:

for id in record['IdList']:
    efetch_handle = Entrez.efetch(db="assembly", id=id, retmode="text")
    print efetch_handle.read()

But this just returns "6328". No actual file data, RefSeq's or otherwise.

Can someone identify what I am doing wrong?

Entrez BioPython • 8.4k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.2 years ago by bfeeny ▴ 50

Ram · Answer 1 · 2015-07-30

I'm not very familiar with EFetch, but it seems to me that the assembly database may not be fully implemented. If you look at the EFetch help, there aren't any options for db=assembly: Additionally, there are no direct links to download an assembly from the ncbi site, anyway. For an example, check out this example assembly page: You can download the sequence report, and there's a link to a nucleotide sequence url, but no direct download.

I think you may need to write something that will grab related sequences instead of doing it directly from fetching the assembly. The logic is really tortured and the code could be cleaned, but I think this will do the job:

Search Assembly -> Get Assembly Summary -> Parse Nucleotide Accession Name from Summary -> Search Nucleotide DB to get ID from Name -> Fetch Sequence by ID

# Search Assembly
handle = Entrez.esearch(db="assembly", term="Homo sapiens[Orgn] AND GCA_000001405.2[ASAC] AND Chromosome[ASLV]")
record = Entrez.read(handle)
for id in record['IdList']:
    # Get Assembly Summary
    esummary_handle = Entrez.esummary(db="assembly", id=id, report="full")
    esummary_record = Entrez.read(esummary_handle)

    # Parse Accession Name from Summary
    accession_id = ['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession']

    # Search Name for ID
    refseq_id = Entrez.read(Entrez.esearch(db="nucleotide", term=accession_id))['IdList'][0]  # Likely only one sequence match?

    # Fetch Record with ID from nucleotide db
    seq_record = Entrez.efetch(db="nucleotide", id=refseq_id, retmode='text')