Question: ORF search gives negative start value
1
gravatar for emilia.skirmuntt
4.6 years ago by
emilia.skirmuntt10 wrote:

Hi,

I'm new to bioinformatics so please bear with me! I have a python script which is running through a series of contigs (in one fasta file) giving back all the ORFs found in each one of them. I used mostly the code from biopython cookbook (slightly modyfing it to give me an id of a sequence and run through several sequences). My problem is that the first ORF it is finding is starting with a negative value. I'm not sure what is causing this and how to fix it. Maybe someone can take a look at the code and output and help me find a problem here (if there is any...). I use Python 3 and Biopython 1.66.

from Bio import SeqIO


record = SeqIO.parse("test.fasta", 'fasta')
table = 1
min_pro_len = 100
ORFs = open("ORF.fasta", "w")

#finding orfs and translating into proteins
def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                    else:
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append(( start, end, strand, trans[aa_start:aa_end]))
                aa_start = aa_end +1
    answer.sort()
    return answer

#print results in a console and into a file
for each_record in record:
    print each_record.id
    orf_list = find_orfs_with_trans(each_record.seq, table, min_pro_len)
    for start, end, strand, pro in orf_list:
        print "%s...%s - length %i, strand %i, %i:%i" \
            % (pro[:30], pro[-3:], len(pro), strand, start, end)

        print >> ORFs,each_record.id, pro[:30], pro[-3:], len(pro), strand, start, end)

And the part of the output I'm getting (with a marked negative start and end of a sequence). Only the first contig get negative value. :

('NW_005871048.1', 'NNLQMNLSVLLPMQELHPWPHPRLMHFSPV', 'KRT', 223, -1, **-3, 669)**
('NW_005871048.1', 'GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG', 'VGR', 103, 1, 135, 447)
('NW_005871048.1', 'ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS', 'ESC', 124, 1, 836, 1211)
('NW_005871048.1', 'LSDLPALGWLHAGSAESPASSPHHPHNFLL', 'LRK', 113, 1, 877, 1219)
('NW_005871048.1', 'YRQIYMPRPRRTANTMNNHGNEDDQKENEK', 'HSL', 156, 1, 2204, 2675)
('NW_005871048.1', 'IFESKEQTEYLVTSLQRGPFQVPLVLKLSK', 'PQL', 136, 1, 3020, 3431)
('NW_005871048.1', 'QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE', 'IVH', 116, 1, 3923, 4274)
('NW_005871048.1', 'LANGRCSEGMVNYHVCLLLDRIDDPLYICD', 'LVL', 111, -1, 5507, 5843)
('NW_005871048.1', 'VPSNGVKPAGPGQEQRVKAGEAVESKEGGW', 'ISF', 171, 1, 6367, 6883)
('NW_005871048.1', 'QWSEAGWPGTGAASEGRRGSGEQRGGMGGF', 'FWL', 149, 1, 6375, 6825)
('NW_005871048.1', 'NTVIGITKCDTDGSGHVKGEGQKGTHGLPF', 'AVC', 169, 1, 6931, 7441)
('NW_005871048.1', 'IFHPWNIPFPRESALIPQPGPGGAAQAVGT', 'KKF', 103, -1, 7100, 7412)
('NW_005871048.1', 'ASAPWWPVRIIAIGHSAVQSMCILAFYYYY', 'EPL', 110, -1, 8472, 8805)
('NW_005871048.1', 'ATAAGFPLLLPGAPAGFPLPLPGAVAGFPS', 'ISP', 362, -1, 9244, 10333)
orfs python • 1.2k views
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by emilia.skirmuntt10

I guess that you're getting -3 value due to the following line:

start = seq_len-frame-aa_end*3-3

So, it seems that seq_len-frame-aa_end*3 operation is giving a result of 0, and then with the -3, you get a final start position of -3. Maybe you can add a print before this line, and check the values of seq_len, frame, aa_end for this given sequence.

ADD REPLYlink written 4.6 years ago by iraun3.8k
0
gravatar for emilia.skirmuntt
4.6 years ago by
emilia.skirmuntt10 wrote:

Thanks for answering!

You mean print before start line...? If I get rid of this -3 I'm getting output like that:

   ('NW_005871048.1', 'NNLQMNLSVLLPMQELHPWPHPRLMHFSPV', 'KRT', 223, -1, 0, 669)
('NW_005871048.1', 'GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG', 'VGR', 103, 1, 135, 447)
('NW_005871048.1', 'ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS', 'ESC', 124, 1, 836, 1211)
('NW_005871048.1', 'LSDLPALGWLHAGSAESPASSPHHPHNFLL', 'LRK', 113, 1, 877, 1219)
('NW_005871048.1', 'YRQIYMPRPRRTANTMNNHGNEDDQKENEK', 'HSL', 156, 1, 2204, 2675)
('NW_005871048.1', 'IFESKEQTEYLVTSLQRGPFQVPLVLKLSK', 'PQL', 136, 1, 3020, 3431)
('NW_005871048.1', 'QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE', 'IVH', 116, 1, 3923, 4274)

so I think that should solve the problem (apart from me being a bit confused why there was a -3 there in a first place)

ADD COMMENTlink written 4.6 years ago by emilia.skirmuntt10

That's because you want to find the start; typically, the length of aa is 3 (e.g. Cysteine: TGT,TGC), thus the -3.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by H.Hasani970

Removing -3 from the code is not the solution, because it will affect also to the other sequences (which are in principle OK). So, the first step should be to understand the meaning of this -3, and then think what to do. In this specific case, I guess that 0 is the correct start position, because the end position of the transcript is 669, and the total length of the protein is 223. So, the total length of the transcript will be 223*3 = 669 = end position.

ADD REPLYlink written 4.6 years ago by iraun3.8k

Well, that's why I'm confused because putting -3 there made sense but output make more sense without it and sequences (apart from the first one) looks exactly the same.

Can it be caused by length of the contig which is only a partial codon and not a multiple of three?

ADD REPLYlink written 4.6 years ago by emilia.skirmuntt10

If you look at the other sequences in the -1 strand, you'll see that if you remove -3, the start position will also change (this -3 is only affecting to the -1 strand sequences ).

ADD REPLYlink written 4.6 years ago by iraun3.8k

You're right of course. As I said I'm still a bit lost about all this scripting business.

How to fix it without taking out -3 then? with "if" exceptions for -1 strands?

ADD REPLYlink written 4.6 years ago by emilia.skirmuntt10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1394 users visited in the last hour