Question

Collecting CDS from transcripts using Entrez

0

Entering edit mode

2.8 years ago

Taylor • 0

Hello,

I am trying to collect the protein coding region for a specific gene for a list of birds from the NCBI nucleotide database using Entrez. I am able to get 1 CDS for most birds, but one of them returns a ton of unrelated genes, whereas others that have the gene listed in the NCBI nucleotide database are not showing up. I was wondering if anyone could look at my code and see where the problem lies?

from Bio import Entrez
from Bio import SeqIO
import os

Entrez.email = "email@blahblah.com"
species_list = ["Gallus gallus", "Numida meleagris","Phasianus colchicus",
"Meleagris gallopavo", "Coturnix japonica", "Anas platyrhynchos", "Numida meleagris",
"Columba livia", "Anser cygnoides domesticus", "Rhea americana", "Apteryx rowi",
"Nothoprocta perdicaria", "Struthio camelus australis", "Dromaius novaehollandiae",
"Calidris pugnax", "Aquila chrysaetos chrysaetos", "Athene cunicularia", "Strix occidentalis caurina",
"Melopsittacus undulatus", "Strigops habroptila", "Lepidothrix coronata", "Corvus moneduloides",
"Hirundo rustica", "Taeniopygia guttata", "Lonchura striata domestica", "Motacilla alba alba",
"Serinus canaria", "Camarhynchus parvulus", "Parus major", "Cyanistes caeruleus", "Ficedula albicollis",
"Catharus ustulatus", "Sterna hirundo"]

length = len(species_list)
f = open("DMC1.txt", "x")

for i in range(length):
  species = species_list[i]
  handle1 = Entrez.esearch(db="gene", term= species + "[Orgn] AND dmc1[Gene]", idtype="acc", retmax = 3)
  record = Entrez.read(handle1)

  idlist = record["IdList"]
  handle2 = Entrez.efetch(db="nuccore", id=idlist, rettype="fasta_cds_na", retmax = 3, retmode = "text")

  print(species, file =open("DMC1.txt", "a"))
  print(handle2.read(),file =open("DMC1.txt", "a"))

entrez biopython • 1.3k views

ADD COMMENT • link updated 2.8 years ago by GenoMax 142k • written 2.8 years ago by Taylor • 0

1

Entering edit mode

Is gene called dmc1 in all species? If not that can lead it odd results. Genomes for some may not be as complete (as chicken) so that could be the other reason.

ADD REPLY • link 2.8 years ago by GenoMax 142k

0

Entering edit mode

Hi, thanks for your response! It is called DMC1 in all of the birds, which is why I am confused about all the junk it is outputting. I am hoping to streamline this so I can look at more genes in the same species list.

ADD REPLY • link 2.8 years ago by Taylor • 0

0

Entering edit mode

Using EntrezDirect I see the following:

$ esearch -db gene -query "Gallus gallus [orgn] AND DMC1 [GENE]" | efetch -format ft

1. DMC1
Interim Symbol: DMC1 and Name: DNA meiotic recombinase 1 [Gallus gallus (chicken)]
Other Designations: meiotic recombination protein DMC1/LIM15 homolog; DMC1 dosage suppressor of mck1 homolog, meiosis-specific homologous recombination
Chromosome: 1
Annotation: Chromosome 1 NC_052532.1 (50900155..50916096)
ID: 427903

2. HID1
Official Symbol: HID1 and Name: HID1 domain containing [Gallus gallus (chicken)]
Other Aliases: C17ORF28, C18H17orf28, DMC1, HID-1
Other Designations: protein HID1; HID1 domain-containing protein; UPF0663 transmembrane protein C17orf28; down-regulated in multiple cancers 1; downregulated in multiple cancer 1; protein hid-1 homolog
Chromosome: 18
Annotation: Chromosome 18 NC_052549.1 (11202948..11227150, complement)
ID: 422113

$ esearch -db gene -query "Taeniopygia guttata [orgn] AND DMC1 [GENE]" | efetch -format ft

1. DMC1
DNA meiotic recombinase 1 [Taeniopygia guttata (zebra finch)]
Other Designations: meiotic recombination protein DMC1/LIM15 homolog; DMC1 dosage suppressor of mck1 homolog, meiosis-specific homologous recombination
Chromosome: 1A
Annotation: Chromosome 1A NC_044212.2 (49423863..49438082)
ID: 100224391

$ esearch -db gene -query "Sterna hirundo [orgn] AND DMC1 [GENE]" | efetch -format ft

Nothing here.

ADD REPLY • link 2.8 years ago by GenoMax 142k

score 1 · Accepted Answer · 2021-06-30

I ended up just parsing out strings that contain ">" and the gene name. It's ugly but it works. Code below if anyone else ever runs into this problem:

for i in range(length):
  species = species_list[i]
  handle = Entrez.esearch(db="nucleotide", term= species + "[Orgn] AND DNA meiotic recombinase 1", idtype="acc")
  record = Entrez.read(handle)

  idlist = record["IdList"]
  handle = Entrez.efetch(db="nuccore", id=idlist, rettype="fasta_cds_na", retmode = "text", retmax = 10)
  h = handle.read()
  great = h.count(">")

  if great > 1:
    s = h.split(">")
    for gene in s:
      if "DMC1" in gene:
        h = gene
        break

  if "DMC1" in h:
    print(species, file =open(filename, "a"))
    print(h,file =open(filename, "a"))

  else:
    print(species, file =open("other.txt", "a"))
    print(h,file =open("other.txt", "a"))