Collecting CDS from transcripts using Entrez
1
0
Entering edit mode
2.8 years ago
Taylor • 0

Hello,

I am trying to collect the protein coding region for a specific gene for a list of birds from the NCBI nucleotide database using Entrez. I am able to get 1 CDS for most birds, but one of them returns a ton of unrelated genes, whereas others that have the gene listed in the NCBI nucleotide database are not showing up. I was wondering if anyone could look at my code and see where the problem lies?

from Bio import Entrez
from Bio import SeqIO
import os

Entrez.email = "email@blahblah.com"
species_list = ["Gallus gallus", "Numida meleagris","Phasianus colchicus",
"Meleagris gallopavo", "Coturnix japonica", "Anas platyrhynchos", "Numida meleagris",
"Columba livia", "Anser cygnoides domesticus", "Rhea americana", "Apteryx rowi",
"Nothoprocta perdicaria", "Struthio camelus australis", "Dromaius novaehollandiae",
"Calidris pugnax", "Aquila chrysaetos chrysaetos", "Athene cunicularia", "Strix occidentalis caurina",
"Melopsittacus undulatus", "Strigops habroptila", "Lepidothrix coronata", "Corvus moneduloides",
"Hirundo rustica", "Taeniopygia guttata", "Lonchura striata domestica", "Motacilla alba alba",
"Serinus canaria", "Camarhynchus parvulus", "Parus major", "Cyanistes caeruleus", "Ficedula albicollis",
"Catharus ustulatus", "Sterna hirundo"]

length = len(species_list)
f = open("DMC1.txt", "x")

for i in range(length):
  species = species_list[i]
  handle1 = Entrez.esearch(db="gene", term= species + "[Orgn] AND dmc1[Gene]", idtype="acc", retmax = 3)
  record = Entrez.read(handle1)

  idlist = record["IdList"]
  handle2 = Entrez.efetch(db="nuccore", id=idlist, rettype="fasta_cds_na", retmax = 3, retmode = "text")

  print(species, file =open("DMC1.txt", "a"))
  print(handle2.read(),file =open("DMC1.txt", "a"))
entrez biopython • 1.3k views
ADD COMMENT
1
Entering edit mode

Is gene called dmc1 in all species? If not that can lead it odd results. Genomes for some may not be as complete (as chicken) so that could be the other reason.

ADD REPLY
0
Entering edit mode

Hi, thanks for your response! It is called DMC1 in all of the birds, which is why I am confused about all the junk it is outputting. I am hoping to streamline this so I can look at more genes in the same species list.

ADD REPLY
0
Entering edit mode

Using EntrezDirect I see the following:

$ esearch -db gene -query "Gallus gallus [orgn] AND DMC1 [GENE]" | efetch -format ft

1. DMC1
Interim Symbol: DMC1 and Name: DNA meiotic recombinase 1 [Gallus gallus (chicken)]
Other Designations: meiotic recombination protein DMC1/LIM15 homolog; DMC1 dosage suppressor of mck1 homolog, meiosis-specific homologous recombination
Chromosome: 1
Annotation: Chromosome 1 NC_052532.1 (50900155..50916096)
ID: 427903

2. HID1
Official Symbol: HID1 and Name: HID1 domain containing [Gallus gallus (chicken)]
Other Aliases: C17ORF28, C18H17orf28, DMC1, HID-1
Other Designations: protein HID1; HID1 domain-containing protein; UPF0663 transmembrane protein C17orf28; down-regulated in multiple cancers 1; downregulated in multiple cancer 1; protein hid-1 homolog
Chromosome: 18
Annotation: Chromosome 18 NC_052549.1 (11202948..11227150, complement)
ID: 422113

$ esearch -db gene -query "Taeniopygia guttata [orgn] AND DMC1 [GENE]" | efetch -format ft

1. DMC1
DNA meiotic recombinase 1 [Taeniopygia guttata (zebra finch)]
Other Designations: meiotic recombination protein DMC1/LIM15 homolog; DMC1 dosage suppressor of mck1 homolog, meiosis-specific homologous recombination
Chromosome: 1A
Annotation: Chromosome 1A NC_044212.2 (49423863..49438082)
ID: 100224391

$ esearch -db gene -query "Sterna hirundo [orgn] AND DMC1 [GENE]" | efetch -format ft

Nothing here. 
ADD REPLY
1
Entering edit mode
2.8 years ago
Taylor • 0

I ended up just parsing out strings that contain ">" and the gene name. It's ugly but it works. Code below if anyone else ever runs into this problem:

for i in range(length):
  species = species_list[i]
  handle = Entrez.esearch(db="nucleotide", term= species + "[Orgn] AND DNA meiotic recombinase 1", idtype="acc")
  record = Entrez.read(handle)

  idlist = record["IdList"]
  handle = Entrez.efetch(db="nuccore", id=idlist, rettype="fasta_cds_na", retmode = "text", retmax = 10)
  h = handle.read()
  great = h.count(">")

  if great > 1:
    s = h.split(">")
    for gene in s:
      if "DMC1" in gene:
        h = gene
        break

  if "DMC1" in h:
    print(species, file =open(filename, "a"))
    print(h,file =open(filename, "a"))

  else:
    print(species, file =open("other.txt", "a"))
    print(h,file =open("other.txt", "a"))
ADD COMMENT
1
Entering edit mode

Since this solution works in your case I moved this comment to an answer. You can accept it (green checkmark) to provide closure for this thread.

ADD REPLY

Login before adding your answer.

Traffic: 1360 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6