Question: Convert sequence file to fasta format using python
MAPK wrote, 3.5 years ago:

Hi, I am new to Python and want to see how this can be done in Python (I can do this in R). I have a text file, myfile.txt, with one column and thousands of rows, as shown below. I want to convert it to FASTA format (result.fasta), as shown below. How can I do this in Python?

myfile.txt

ATGTGTGGTTTTCCCCC
ATTGGCGGGGTTTTTCAGGGG
ATGGGGGGGCCCCCCCCAAAAAA
TTGGTGGGGGGGGGGGGAA

result.fasta

>1
ATGTGTGGTTTTCCCCC
>2
ATTGGCGGGGTTTTTCAGGGG
>3
ATGGGGGGGCCCCCCCCAAAAAA
>4
TTGGTGGGGGGGGGGGGAA
python • 9.2k views
modified 3.5 years ago by Matt Shirley (9.5k) • written 3.5 years ago by MAPK (1.7k)
Kevin Blighe (Republic of Ireland) wrote, 3.5 years ago:

Create a new script called ConvertFASTA.py:

import sys

#File input
fileInput = open(sys.argv[1], "r")

#File output
fileOutput = open(sys.argv[2], "w")

#Seq count
count = 1

#Loop through each line in the input file
print("Converting to FASTA...")
for strLine in fileInput:

    #Strip the endline character from each input line
    strLine = strLine.rstrip("\n")

    #Output the header
    fileOutput.write(">" + str(count) + "\n")
    fileOutput.write(strLine + "\n")

    count = count + 1
print ("Done.")

#Close the input and output file
fileInput.close()
fileOutput.close()

Then, run it with:

python ConvertFASTA.py myfile.txt result.fasta
modified 3.5 years ago • written 3.5 years ago by Kevin Blighe (70k)
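
A slightly more defensive variant of the script above can be sketched as follows. It is only an illustration, not Kevin's code: the helper names lines_to_fasta and convert_file are hypothetical, and skipping blank lines is an assumption about what the OP would want.

```python
def lines_to_fasta(lines, start=1):
    """Yield one FASTA record (header plus sequence) per non-blank line.

    Skipping blank lines is an assumption; the thread does not discuss them.
    """
    number = start
    for line in lines:
        seq = line.strip()  # removes "\n" as well as Windows "\r\n" endings
        if not seq:
            continue  # ignore empty lines instead of writing empty records
        yield ">%d\n%s\n" % (number, seq)
        number += 1


def convert_file(in_path, out_path):
    """Read one sequence per line from in_path, write numbered FASTA to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(lines_to_fasta(src))
```

It could be called as convert_file("myfile.txt", "result.fasta"). Keeping the record generation in its own function makes it easy to check the output on a small list of strings before touching any files.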

For the OP: Kevin did a good job of showing the code and each step. This will run in the same fashion.

#!/usr/bin/env python
import sys
n = 0
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            n += 1
            out.write('>' + str(n) + '\n' + line.strip() + '\n')
written 3.5 years ago by st.ph.n (2.6k)

I have a similar query, but I want to use the sequence names after the '>' symbol. For example:

Zebrafish ESLLRFGLRSDLDFR
Fugu ETVLSVGLSAETEIS
Chicken RALLAWGYSSDT

and I want:

result.fasta

>Zebrafish
ESLLRFGLRSDLDFR
>Fugu
ETVLSVGLSAETEIS
>Chicken
RALLAWGYSSDT

Can I get some guidelines?

written 2.6 years ago by mdsiddra (30)
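
For the name-plus-sequence input described above, one possible approach is to split each line at the first run of whitespace. This is a minimal sketch under assumptions: the function name named_lines_to_fasta is hypothetical, each line is assumed to hold a name followed by whitespace and then the sequence, and lines that do not fit that pattern are silently skipped.

```python
def named_lines_to_fasta(lines):
    """Yield FASTA records from lines of the form '<name> <sequence>'."""
    for line in lines:
        parts = line.split(None, 1)  # split once, at the first whitespace run
        if len(parts) != 2:
            continue  # blank line or a line without a sequence: skip it
        name, seq = parts
        yield ">%s\n%s\n" % (name, seq.strip())
```

Writing the result is then a matter of passing an open input file to the function and sending the records to an output file, for example with out.writelines(named_lines_to_fasta(f)).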

If you search for this type of information first, you often will not need to wait for an answer. Here is one solution.

written 2.6 years ago by GenoMax (96k)

Thank you for your support. However, I get this error:

    fileInput = open(sys.argv[1], "r")
IndexError: list index out of range

when I try this solution (code) on Windows; I do not use Linux. As far as I know, argv is a built-in array on Linux, so I guess it does not work when I run the script in Python IDLE (3.6.4).

modified 2.6 years ago • written 2.6 years ago by mdsiddra (30)
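
On the IndexError: sys.argv is an ordinary Python list and behaves the same on Windows and Linux. The error most likely appears because a script started from IDLE receives no command-line arguments, so sys.argv holds only one entry and sys.argv[1] does not exist. A minimal sketch of one workaround follows; the helper resolve_paths is hypothetical, and the default file names are simply borrowed from the original question.

```python
import sys


def resolve_paths(argv, default_in="myfile.txt", default_out="result.fasta"):
    """Return (input_path, output_path) taken from an argv-style list.

    argv[0] is the script name; any missing path falls back to its default.
    The default names are assumptions borrowed from the original question.
    """
    args = list(argv[1:])
    in_path = args[0] if len(args) > 0 else default_in
    out_path = args[1] if len(args) > 1 else default_out
    return in_path, out_path
```

Calling resolve_paths(sys.argv) in place of the direct sys.argv[1] and sys.argv[2] lookups would let the same script run both from IDLE and from the command line.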
Alex Reynolds (Seattle, WA USA) wrote, 3.5 years ago:

Something a bit simpler that should work in Python 2 and 3:

#!/usr/bin/env python

import sys

c = 1
for l in sys.stdin:
    sys.stdout.write(">%d\n%s\n" % (c, l.rstrip()))
    c += 1

Usage:

$ python convert.py < in.txt > out.fa
modified 2.6 years ago • written 3.5 years ago by Alex Reynolds (31k)

I think you want sys.stdout.write

written 3.5 years ago by Matt Shirley (9.5k)

You're write, thanks. Fixed!

written 3.5 years ago by Alex Reynolds (31k)

Or even simpler:

import sys

for c, l in enumerate(sys.stdin, start=1):
    sys.stdout.write(">%d\n%s\n" % (c, l.rstrip()))
modified 3.5 years ago • written 3.5 years ago by a.zielezinski (9.6k)
st.ph.n (Philadelphia, PA) wrote, 3.5 years ago:

#!/usr/bin/env python

n = 0
with open('myfile.txt', 'r') as f:
    for line in f:
        n += 1
        print('>' + str(n) + '\n' + line.strip())
modified 3.5 years ago • written 3.5 years ago by st.ph.n (2.6k)

Thanks, but it does not increment the FASTA identifier (>1, >2, >3, ...). All sequences are named >1.

modified 3.5 years ago • written 3.5 years ago by MAPK (1.7k)

See edit, wrote it too quickly :)

written 3.5 years ago by st.ph.n (2.6k)

To be fair to all posters you should accept all answers that work.

written 3.5 years ago by GenoMax (96k)

I agree, is that feature not available? It would help users to see various ways of doing it. Apologies, I was composing my solution whilst the other guy had posted!

modified 3.5 years ago • written 3.5 years ago by Kevin Blighe (70k)

Thanks to everyone, all answers accepted! I wasn't aware of this feature. I thought it was similar to Stack Overflow, where you can accept only one answer.

modified 3.5 years ago • written 3.5 years ago by MAPK (1.7k)

One of the friendly features of Biostars. More than one way of doing things, and all instructive for those new to Python.

modified 3.5 years ago • written 3.5 years ago by GenoMax (96k)
Matt Shirley (Cambridge, MA) wrote, 3.5 years ago:

FOUR! 🏌⛳️

$ python -c "import sys; [sys.stdout.write('>'+str(i)+'\n'+seq) for i, seq in enumerate(sys.stdin, 1)]"
written 3.5 years ago by Matt Shirley (9.5k)