Hi everyone, I have been working with tomato (s.lycopersicum) with its wild relative(s.pennellii). So below is a pair of ortholog protein I got previously using some alignment method.
XP_004243382.1--> protein LOC107025503 [Solanum pennellii]
XP_015081778.1-->protein LOC101253299 [Solanum lycopersicum]
you can easily search them on ncbi, so what I have to do is to analyse th promoter sequences of the proteins of all the orthologous proteins. I used the biopython script below to extract the promoter sequnce of all the gene in every chromosome genbank file for both species. genbank Chromosome files can be downloaded through the ftp site of ncbi
below is the biopython script
def prom_extract_multi(in_gbk = "in.gbk", prom_len = 1000, file_out = "prom_out.fna"):
from Bio import Seq from Bio.Seq import Seq from Bio import SeqIO from Bio.SeqRecord import SeqRecord prom_out = "" GBrecord = next(SeqIO.parse(in_gbk, "genbank")) for feature in GBrecord.features: if feature.type =="gene": my_start = feature.location._start.position # Identifies the start position of the gene on the sense strand (5' to 3' irrespective of actual coding strand). my_end = feature.location._end.position # Identifies the end position of the gene on the sense strand (5' to 3' irrespective of actual coding strand). start_1000 = my_start - prom_len end_1000 = my_end + prom_len if feature.strand == -1: feat_loc = str(feature.location) my_prom = GBrecord[my_end:end_1000].reverse_complement() prom_out += "> Promoter rev_comp" + "___" + feature.qualifiers['db_xref'] + "___" + feat_loc + "\n" prom_out += my_prom.seq.__str__() + "\n\n" elif feature.strand == 1: feat_loc = str(feature.location) my_prom = GBrecord[start_1000:my_start] prom_out += "> Promoter" + "___" + feature.qualifiers['db_xref'] + "___" + feat_loc + "\n" prom_out += my_prom.seq.__str__()+"\n\n" file=open(file_out, 'w') file.write(prom_out) file.close()
so the output file of the promoter fasta file looks like something like this
So i tried to get my promoter sequence for the example pair shown above first, I relate the geneID from the output to the protein id(XP_ 004243382.1) of my example pair to get the promoter sequence for both species, however I was only able to get the one s.pennellii, did I miss something here?? I searched the chromosome genbank file and found something could be causing this,
the link shows “See also 69 discontinued or replaced items.” could this missed/replaced items are the part of my promoter sequence??
Also, can anyone please suggest what I can use to analyse promoter sequences, I tried promoterwise by ebi by failed to install on my laptop, I am thinking just to use blastn , I hope I have explained my question clear enough
thank you very much