Hi all,

I've been using entrez_fetch from the R package rentrez (v1.2.2) to extract nucleotide sequences in FASTA format for a large number of GIDs. For a small minority of them, entrez_fetch simply returns a string containing only a newline character, as in the example below.

> entrez_fetch(db = "nuccore", id = "108597802", rettype="fasta_cds_na")
[1] "\n"

I get the same result using the accession rather than the GID.

> entrez_fetch(db = "nuccore", id = "DQ640652.1", rettype="fasta_cds_na")
[1] "\n"

The same call works for most other GIDs/accessions I feed it, and it also works for this record if I request a different rettype, e.g.

> entrez_fetch(db = "nuccore", id = "108597802", rettype="gb")
[1] "LOCUS       DQ640652               29746 bp    RNA     linear   VRL 12-JUN-2006\nDEFINITION  SARS coronavirus GDH-BJH01, complete genome.\nACCESSION   DQ640652\nVERSION     DQ640652.1\nKEYWORDS    .\nSOURCE      SARS coronavirus GDH-BJH01\n  ORGANISM  SARS coronavirus GDH-BJH01\n            Viruses; Riboviria; Nidovirales; Cornidovirineae; Coronaviridae;\n            Orthocoronavirinae; Betacoronavirus; Sarbecovirus.\nREFERENCE   1  (bases 1 to 29746)\n  AUTHORS   Cai,J.-P., Hei,A.-L., Hu,J.-H., Wang,S.-K., Zhang,C.-B., Dai,D.-P.,\n            Shen,Z.-Y., Guo,J., Li,M., Wu,Y.-S., Cheng,G., He,Y.-S. and Hou,M.\n  TITLE     Direct Submission\n  JOURNAL   Submitted (14-MAY-2006) National Center for Clinical Laboratory,\n            Beijing Hospital, 1 Da Hua Road, Dong Dan, Beijing 100730, China\nFEATURES             Location/Qualifiers\n     source          1..29746\n                     /organism=\"SARS coronavirus GDH-BJH01\"\n                     /mol_type=\"genomic RNA\"\n                     /strain=\"GDH-BJH01\"\n                     /isolation_source=\"Homo sapiens lung\"\n                     /host=\"Homo sapiens\"\n                     /db_xref=\"taxon:388737\"\n                     /country=\"China\"\nORIGIN      \n        1 ggcttccagg aaaagccaac

Curiously, though, calling the API directly through a browser also returns a blank file: example.

If anyone is able to shed some light on why these sequences aren't being returned in FASTA format properly, I'd be very grateful!

DQ640652 is the genome of a SARS virus. It does not look like the GenBank record includes any CDS annotations (the feature table shown above contains only a source entry). That is probably why nothing comes back when you ask for CDS sequences.
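If you are fetching in bulk, a small guard along these lines might help flag such records instead of silently collecting blank strings. This is only a sketch: is_empty_response and fetch_cds_fasta are hypothetical helper names, and it assumes that a whitespace-only reply always means "no CDS features", which I have not checked beyond this one record.

```r
# TRUE when a response holds no sequence data, e.g. the bare "\n" seen above.
# (Hypothetical helper, not part of rentrez.)
is_empty_response <- function(x) {
  !nzchar(trimws(x))
}

# Hypothetical wrapper: request CDS FASTA for each ID one at a time and
# keep only the non-empty replies. Names of the result are the input IDs.
fetch_cds_fasta <- function(ids) {
  out <- vapply(ids, function(id) {
    rentrez::entrez_fetch(db = "nuccore", id = id, rettype = "fasta_cds_na")
  }, character(1))
  out[!is_empty_response(out)]
}
```

Comparing names(fetch_cds_fasta(ids)) against ids would then show which accessions were dropped.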

Ah, you're right! I hadn't noticed that, many thanks! Is there any metadata field that could help me identify and filter out the accessions that have no CDS annotations?
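I can't point to a dedicated metadata field for this, but one possible workaround is to look for a CDS feature line in the GenBank flat file itself before asking for fasta_cds_na. The sketch below assumes the standard flat-file layout, where feature keys are indented by five spaces; has_cds_feature and record_has_cds are hypothetical names, not rentrez functions. Note that rettype = "gb" downloads the whole record, so this roughly doubles the traffic per accession.

```r
# TRUE if a GenBank flat-file string contains at least one CDS feature line.
# Assumes feature keys start after exactly five spaces of indentation.
has_cds_feature <- function(gb_text) {
  grepl("(^|\n) {5}CDS\\s", gb_text, perl = TRUE)
}

# Hypothetical convenience wrapper: fetch the flat file and test it.
record_has_cds <- function(id) {
  gb <- rentrez::entrez_fetch(db = "nuccore", id = id,
                              rettype = "gb", retmode = "text")
  has_cds_feature(gb)
}
```

For example, Filter(record_has_cds, ids) would keep only the IDs whose records carry at least one CDS feature.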


