Finding a single open reading frame with ribosomal binding site, using Biopython
Entering edit mode
3.5 years ago

I'm given a Fasta file, containing a large DNA(over 115,000 long) sequence, and I am tasked with finding a single large open reading frame contained within the DNA sequence using Biopython.

Now from other sources and the Biopython cookbook I've translated my sequence and found six open reading frames (three for each strand) and their positions within the sequence;

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append((start, end, strand,
                aa_start = aa_end+1
    return answer

orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)

for start, end, strand, pro in orf_list:
print("%s...%s - length %i, strand %i, %i:%i" \
      % (pro[:30], pro[-3:], len(pro), strand, start, end))


TNRQVYGGTLQSLRTGTGIYSRLASSPTNR...CEA - length 160, strand 1, 7950:8433
GIPPGRTEGLGRYVHGESIFLEATLVPEPQ...VQS - length 171, strand -1, 19275:19791
ISVHLRRYFSVLLRAPVALNADSRVAGPLD...ITP - length 190, strand 1, 34079:34652
GCVNGNFPDYRVAVDDPGALVVGGELGQAL...VNT - length 771, strand -1, 39335:41651
LLKLLASRLTWLLVLSAALGFLTSVSYRLG...SPG - length 235, strand -1, 39358:40066
PGDVIVSKVPADNLRPRMSEINDFTNSIII...VLV - length 826, strand 1, 39362:41843
GPGRQSSSAYERDQRLHQQHHHQRERHQVR...QRS - length 764, strand 1, 39385:41680
LLLSSTSSASCWLKKFFFRLLLLSRSVALL...LAL - length 158, strand -1, 40459:40936
TRLIIRPPSHTTLYYDPYPQPLKVSSSVPC...RIT - length 159, strand -1, 54663:55143
IGAFSRGYNQGWVLIEAQEVRTTSPLFPRS...RDQ - length 201, strand 1, 56104:56710
KARGSSLGAIYCRRLKRRSLPSWQIESFNC...TNS - length 150, strand -1, 62497:62950
RRPARAGAELNKDDIRDTCLLSCSEVDNKV...VTL - length 168, strand 1, 72339:72846
SGPTTVRTSAPQRCWRTSLKPKLSLSDSTE...PVA - length 177, strand -1, 114494:115028
ERELWLETGPPTPLWGTGPDCSRAALNRQR...PRR - length 157, strand 1, 114953:115427

However I'm unsure if these open reading frames contain the ribosome binding site(consensus sequence of AAGGAGGTG and occurring between 6 to 9 nucleotides upstream from the positive 5' strand). Is there anyway to add to my code to include this? or would it already be included with Biopython?

I'm left with the output however I've no idea how to link this with possibly an online source like NCBI to actually find the source of my ORF (i.e. which protein it comes from). Is there anyway to use Biopython for this as well?

dna-sequence python biopython • 1.1k views
Entering edit mode

When you say a single one, are you looking for a specific ORF, or any single ORF?

Entering edit mode

I'm just told to find THE single ORF, there's no hinting it's a specific one aside from calling it large. I was under the impression the ones that I have found only 1 is possible with a correct RBS but thinking now that might not be so

Entering edit mode

Well you already stated that your ORF has to be the one with that specific RBS (if we assume for now that there is only 1 exact match). You also stated correctly that the RBS starts approximately -10 bp from the ORF, so none of your ORFs contain the RBS.

You will need to extract the 5’ UTR (untranslated region) for each ORF you found and look for the sequence within.

Once you have a candidate sequence of interest, Biopython has a BLAST module which you can call upon to search for the origin of the sequence.

Also just to note, I’m not sure what your organism is, but most genes begin with an ATG codon encoding Methionine, which I notice your translation table does not. Might be worth double checking.

Entering edit mode

Sorry for the add on question, but is there any way to find the 5' UTR and take it out with biopython (or even normal python)? As in can since I know the locations for each of the proposed ORFs can I excise that from the entire sequence?

Thanks for the heads up about the BLAST module, i've tinkered about with it and understand how it works, just obviously none of my ORFs would work currently.

As for the translation table, i'm not actually told where the sequence is from in terms of organism, i'm assuming due to the simplicity of the ORF it can only be a bacterium, yet i'm getting the same results for both translation tables 1 and 11.

I appreciate the help, pointing me in the right direction and all :)

Entering edit mode

The simplest way, to my mind, would simply be to do some arithmetic with the CDS coordinates you’ve already found. Something to the effect of, loop over each of your ORFs, check which strand its on, if 1 then extract, say, 50 bp of upstream sequence (i.e. from start-50 to start, and if the strand is -1, just add 50 to the end instead, You can then do a simple string search on each of the 5 UTRs. Biopython should be able to do a lot of the heavy lifting on dealing with the strand and so on, and has methods for reverse complimenting etc, so you probably won’t have to deal with all of this from scratch.

I’m not sure why you think your ORFs wouldn’t work at present, you have a protein sequence and the untranslated nucleotide sequence (presumably), you could simply provide the sequence string to BLAST(N/P). This wont give you the RBS if that’s what you meant, but it will still give you a pretty solid species identification. You’ve done the hardest part, which is finding all candidate ORFs already.

Sorry, I wasn’t clear when I said translation table, I didn’t mean translation table in the usual bioinformatic sense, I meant the literal table of sequence ORF results you posted in your answer. To be on the safe side, I would blast the untranslated nucleotide sequences before you translate them, since you don’t know whether its bacterial or human etc, it’ll corrupt the data if you blast an incorrectly translated sequence. I would guess its bacterial, else you’d have to worry about introns and splicing, which seems unduly harsh to me as an exercise ;)

Entering edit mode

Ah that makes sense now. I think the main problem i'm having is the abstract thinking of it all. I'll give the nBLAST a go and see what i get back in terms of species and then have a go at trying to find out the upstream sequences. Cheers again!


Login before adding your answer.

Traffic: 2410 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6