Size of Proteins (Acetohalobium arabaticum - species)
15 months ago

I need to find the size (length) of each protein sequence, but I couldn't do it even with the code below. This first piece of code was meant to count how many proteins there are in total and to find the size of each sequence.

The attached image just shows what I want the code to look for. I don't know what is missing from the code.

arq = open("genoma9.faa")
conteudo = arq.read()
print(conteudo)
fh = open("genoma9.faa")
n = 0
for line in fh:
    if line.startswith(">"):
        n += 1
        print(line)
        proteins = line.count(">")
        print("Total of Proteins: " + str(proteins))

[Attached image]

I am trying to get the size of the sequence lines that sit between the >WP header lines:

Example:

>WP_013277001.1 DNA polymerase III subunit beta [Acetohalobium arabaticum]
MQIKIDRKNFYDGIQTVRKAISSKSTLPILSGILIETQEKKLKLVGTDLELGIECRVDANIIKDGAIVLPANHLANIVRE
LPNKELELELKKDNKIEISCGLSQFKIHGSPADEYPLLPEVGSGIEYTLSQEKFQAMINRIKFATSDDESRPFLTGGLLS
protein python • 1.1k views

you said. FAA File Sequence

I'm going to post the code now

Please do so now.


Answer from the other post:

 openFile = open('genoma9.faa', 'r')
    writeFile = open('updatedFile.txt', 'w')
    for txtLine in openFile .readlines():
        if not (txtLine.startswith('>WP')):
            print(txtLine)
            writeFile.write(txtLine)
    writeFile.close()

    openFile.close()

Have you tried running this piece of code? It looks like it has an indentation error.


This post is the same as your previous one, "Print the size of a protein". Stop asking new questions and update your original post.


I deleted the other post because it didn't include the code, so I reposted.


The edit button is for edits, no need to delete.

15 months ago
Mensur Dlakic ★ 27k

I think you are reinventing the wheel. There is no need to write separate code for handling biological sequences when it all exists in Biopython and can be accessed in a few lines of code. What I show below could be optimized, so it is only for illustration. I suggest you save it into a file named fasta_len_and_number.py or something like that.

import sys
from Bio import SeqIO

# open the file specified after script name
FastaFile = open(sys.argv[1], 'r')

counter = 0 # initialize sequence counter
for rec in SeqIO.parse(FastaFile, 'fasta'):
    counter = counter + 1 # increase sequence counter
    name = rec.id # sequence header
    seq = rec.seq # protein/DNA sequence
    seqLen = len(rec) # determine sequence length
    print(seqLen, name) # print the length + header

print('\n A total of %d sequences' % counter)
FastaFile.close()

Running this command:

python fasta_len_and_number.py genoma9.faa

should hopefully produce the desired output. When I tested it on one of my files, the last 10 lines looked like this:

130 2HY5_A
114 2Q68_A
141 4CDL_A
47 6O3S_A
12 4AKT_C
37 4Z80_C
145 1Y23_A
215 6S5A_L

 A total of 112217 sequences
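
If installing Biopython is not an option, the same idea can be sketched in plain Python. This is only a minimal sketch, not a replacement for a real FASTA parser: it assumes every record starts with a ">" header line and that the sequence may be wrapped over several lines, so the size of a protein is the sum of the lengths of all lines up to the next header. The function name fasta_lengths and the sample records are made up for illustration.

```python
def fasta_lengths(lines):
    """Yield (header, length) for each record in an iterable of FASTA lines.

    Hypothetical helper for illustration; assumes headers start with '>'.
    """
    header = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, length
            header = line[1:]
            length = 0
        elif header is not None:
            # sequence may be wrapped, so add up every line of the record
            length += len(line)
    # emit the last record, which has no following header
    if header is not None:
        yield header, length


# made-up sample records; a real file would be passed as open("genoma9.faa")
sample = [
    ">seq1 hypothetical example record",
    "MQIKIDRK",
    "NFYD",
    ">seq2 another made-up record",
    "MKV",
]

records = list(fasta_lengths(sample))
for header, length in records:
    print(length, header.split()[0])  # length + sequence ID
print("A total of %d sequences" % len(records))
```

Because a file object iterates line by line, passing an open file handle instead of the sample list should work the same way. The key difference from the code in the question is that the count and the lengths are accumulated across lines and reported per record, instead of being recomputed on each header line.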