Question: Getting in frame stop codons from fasta
0
gravatar for User000
2.4 years ago by
User000260
User000260 wrote:

Hello, I need to find a start codon and all in frame stop codons related to that primary start codon, i.e startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCTG_stopCodon(2)_ATCAAGA. This is my code, which gives me the positions of all ORF. How can I modify it to achive positions of start codon and all in frame stop codons related to that primary start codon.

from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
def getORF(sequence, treshold, start_codons, stop_codons):
    orfs = []
    for j in range(0, 3):
        start_codon_index = 0
        end_codon_index = 0
        start_codon_found = False
        for indx in range(j, len(sequence), 3):
            current_codon = sequence[indx:indx+3]
            if current_codon in start_codons and not start_codon_found:
                start_codon_found = True
                start_codon_index = indx
                s=start_codon_index+1
            if current_codon in stop_codons and start_codon_found:
                end_codon_index = indx
                e=end_codon_index+3
                length = end_codon_index - start_codon_index + 3
                if length >= treshold * 3:
                    orfs.append(s)
                    orfs.append(e)
                start_codon_found = False
    return orfs
f = open("myfile.fa","r")
o = open("out.txt","w")
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
for record in SeqIO.parse(f,'fasta'):
    seq=record.seq
    #compl_seq =record.seq.reverse_complement()
    name=record.id
    orfs = getORF(seq, 30, start, stop)
    #complement_orfs = getComplementORF(compl_seq, 30, start, stop)
    print >> o, name, '+', orfs, '-', complement_orfs
python • 1.9k views
ADD COMMENTlink modified 2.4 years ago by WouterDeCoster37k • written 2.4 years ago by User000260
1

Hi User000,

I'll try to give you some assistance in this, but I'm a bit confused by your example: startCodon_AGGAAG_stopCodon(1)_GAA**GGT**AAC**AGC**TCTG_stopCodon(2)_ATCAAGA Why is the part between stopCodon(1) and stopCodon(2) not in-frame? Why is there still a stretch of nucleotides after your final stopCodon?

Furthermore, how large is your input? That influences if we can do comfortably everything in memory or whether we have to think of writing memory-efficient scripts ;)

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

Hi, thanks a lot. These are my stop codons "TAA","TAG","TGA". I have a lot of artificial cds, (50K seq-s) and I would like to find a 1st start codon and all in frame stop codons related to that primary start codon..

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by User000260

Yes I see, but in your example, the second stop codon is not in frame...

So actually, you just have to search for the first start codon in the sequence. As soon as you found that, you can make in frame codons following that start codon. Then you just need to translate these codons and identify stop codons. Or am I oversimplifying things?

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

I explain better. I have CDS(those do not necessarily have a stop codon),so I add several bp artificially to my cds seq-s and want find all stop codons that are in frame with the 1st start codon. And i need positions of this start and stop codons..In the code above, it gives me start codon and stop codon (ORF), then it looks for the next start codon and stop codon. I dont know if it is explained better...

ADD REPLYlink written 2.4 years ago by User000260

Might be caffeine deprivation of my brain, but it's not yet completely clear to me. Why do you add artificial nucleotides to the cds?

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

not artificial nucleotides :) I lengthen them for ie 100 bp obviously using coordinates from genome, in such a way creating kind of artificial CDS, so I am interested to see how far away are my stop codons, so I can evaluate whether to lenghen my cds o not...

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by User000260

So let me try to summarize: 0) input is fasta 1) first job: find the first start codon 2) then: find all stop codons in frame with this first start codon 3) report as: startCodon_AGGAAG_stopCodon(1)_GAAGGTAACAGCTCT_stopCodon(2)_blablabla_finalStopCodon

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

resport as name_id pos_start_fr1 pos_stop1 pos_stop2, pos_start_fr2 pos_stop1 post_stop2

ADD REPLYlink written 2.4 years ago by User000260

So if I explain roughly the algorithm is like that: take a sequence and look for start codon in 3 frames (actually 6 cos I need also complementary),find 1st start codon write it,search for in frame stop codon,write it,search for the next in frame stop codon write it..smth like this, does it make sense?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by User000260

Do you need per sequence just one start codon, or the first start codon for each of the 6 reading frames?

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

start codon for each of the 6 frames if there is. so I prefer the output to be like this (+1 pos_start codons pos_stop_codon1 pos_stop_codon2, +2 pos_start codons pos_stop_codon1 pos_stop_codon2 ),otherwise I dont need frames written, but if divided by a comma I can understand that these are different frames..

ADD REPLYlink written 2.4 years ago by User000260

Hahha, this thread xD

ADD REPLYlink written 2.4 years ago by John12k
2

I'll probably write some real code with some pseudocode tomorrow to get him started. First thing I thought yesterday was "not another ORF script"...

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

What happens if there is more than one ATG in the fasta entry? Is only the first considered the start?

In which case, could the output look something like:

ID13123
    ORF 1: None
    ORF 2:
        Start - 342
        Ends - 453,499,589
    ORF 3:
        Start - 32
        Ends - 94,399,909,4302
ID13124
    ....
ADD REPLYlink written 2.4 years ago by John12k

I would consider only the first ATG as start and would like results as in the code (start1 end1, start1 end2) or (start1, end1, end2).

ADD REPLYlink written 2.4 years ago by User000260
3
gravatar for WouterDeCoster
2.4 years ago by
Belgium
WouterDeCoster37k wrote:

After the lovely discussion above it's time to start thinking about answers :) I wanted to write a few pointers to help you write the script, but got carried away while writing and produced it almost completely. Left a few pieces for you to complete ;)

If anything is unclear, let me know.

As you can see I splitted the function(s) in multiple very simple functions which only do one job. Because that makes writing, thinking and debugging so much easier. Code is far more readable. You start by writing just the frame work. First you need readingframes. Then you need startcodons, then you need stopcodons. Every function does his part of the chain. The function makeframes() is for you to write. If you have question about how to solve it or change the code to suit your needs, let me know!

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by WouterDeCoster37k

thank you so much! Looks super difficult for me, as I don't use a lot of things you used. About frames, may be I can create 3 frames and at the end use compl_seq =record.seq.reverse_complement() as in my code above? Also, while thinking how to do it, I discovered that there is a orf predictor that does almost what I want, I wish I could have it's script, but I guess it is written in perl.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by User000260
1

Take your time to go through it and I'd be happy to explain parts and functions you don't understand. But with perl stuff I can't help you ;)

Yes, you can use the .reverse_complement() method BUT I already converted each sequence to a string in line 48 in the list comprehension (my favorite construction in python). So just using .reverse_complement() no longer works, because a string is not a seq object and doesn't have a .reverse_complement() method. But of course you can change line 48 so the object remains a seq object by removing the str() function. But you need to convert it to a string then in the makeframes() function, because the code needs strings downstream. This is because a seq object doesn't have a .index() method. Is that clear?

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

Yes, more or less! I might come back to you later with questions!THNAKS

ADD REPLYlink written 2.4 years ago by User000260

Dear @Wouter, the "lovely discussion" you have mentioned was cool!

And explaining different parts of the valuable script would be a class for me, too

Thank you

ADD REPLYlink written 2.4 years ago by Farbod3.2k
1

For "educational purposes" I added comments to the code above explaining almost every line, let me know if something isn't clear.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by WouterDeCoster37k

Hi! So,I added makeframes function in the code above, smth like this:

for n in range(0,len(sequence),3):
    frames = sequence[n:n+3]
return([sequence])

It is giving me the result as seq-s, while I want ----- > a header, position_start_codon, pos_stop,pos_next_stop How to modify?thanks

ADD REPLYlink written 2.4 years ago by User000260

That makeframes() function doesn't make sense. You are just returning sequence in a list, and your frames = sequence[n:n+3] is not doing what you think it does.

For getting the position you'll need to use the index, for example using enumerate.

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

for frame in range(0,3): return([sequence])

?

ADD REPLYlink written 2.4 years ago by User000260

You only need 3 reading frames? Then why not just

def makeframes(sequence):
    return([sequence, sequence[1:], sequence[2:]])
ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

I need 6, but I was thinking to use reserve_compl later on..

ADD REPLYlink written 2.4 years ago by User000260

Possible, but I think it would make more sense to do it in this step and return 6 reading frames. However, you already decided that you need positions rather than sequences and you are free to write this script as you feel like :)

ADD REPLYlink written 2.4 years ago by WouterDeCoster37k

Ok, I try.. thanks you

ADD REPLYlink written 2.4 years ago by User000260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1547 users visited in the last hour