single line fasta to mult line fasta
8
1
Entering edit mode
5.3 years ago
User 6777 ▴ 20

I know for the reverse procedure, there are lots of methods available. But how can I change my fasta file with sequences like:

>gi|189094002
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRVKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD


to like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
VKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD


I know its just about inserting line breaks in each sequence, but is there any way that I can check whether the fasta file is single line fasta or not, and if is, add line breaks in each sequence..

Ps. I am using windows..

fasta • 15k views
4
Entering edit mode

If you plan to do more bioinformatics, do yourself (and us) a favor a switch to linux. Things will get so much easier then.

0
Entering edit mode

fasta_formatter from fastx tookit can do this (http://hannonlab.cshl.edu/fastx_toolkit/) -w specifies the width

0
Entering edit mode

Venu, I want exactly the equivalent of fold command for windows.. may be in perl or python script

3
Entering edit mode

Is this a homework question (since you keep asking for perl or python solution)? If all you need is a viable solution then you have plenty to choose from already.

0
Entering edit mode

Will this work?

0
Entering edit mode

Maybe, but not guaranteed: it might also 'fold' the description/fasta identifier and that's definitely not what you want.

See this reaction in this same thread: C: single line fasta to mult line fasta

9
Entering edit mode
5.3 years ago
GenoMax 110k

Since the OP has asked for a solution that will work on windows, reformat.sh from BBMap can be used like so (NN = number of characters you want on one line)

reformat.sh in=your_file.fa out=new_file.fa fastawrap=NN


On windows you would use it like so (adjust windows file path for BBMap install)

java -ea -Xmx200m cp=C:\bbmap\current\ jgi.ReformatReads in=x.fa out=y.fa fastawrap=NN

5
Entering edit mode
5.3 years ago
Medhat 9.0k

from this

https://github.com/jimhester/fasta_utilities

you can use wrap.pl "limits FASTA lines to 80 characters"

Or you can use from here

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage

fasta_formatter

fasta_formatter -i yourfile.fasta -w 80 -o yourout.fasta

[-w N] = max. sequence line width for output FASTA file. When ZERO (the default), sequence lines will NOT be wrapped - all nucleotides of each sequences will appear on a single line (good for scripting).

### advice: If you will work in bioinformatics try any Linux flavor

0
Entering edit mode

but both fasta_utilities and fastx_toolkit are large packages that need to be installed for this. I will prefer a small perl/ python code..

5
Entering edit mode
5.3 years ago
st.ph.n ★ 2.6k

Here's a quick one in python:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

print '\nConverting ' + inputfile + ' to ' + str(length) + ' chars per line...'

with open(inputfile, 'r') as f:

s = []                                             #sequence list
for line in f:
h.append(line.strip().split(">")[1])
else:
s.append(line.strip())                   #grab sequences

myseqs = dict(zip(h,s))                                 #map to dictionary, header:sequence

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

print >> outfile, '>' + header
while len(sequence) > 0:
print >> outfile, sequence[:length]
sequence = sequence[length:]

for i in myseqs:
print_seqs(i, myseqs[i])
print 'Converted ' + str(len(myseqs)) + ' sequences.'
print 'Done.\n'

outfile.close()


Run as 'python your_input_fasta Nchars'

EDIT

A cleaner, more memory efficient solution, per WouterDeCoster comment:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

with open(inputfile, 'r') as f:
for line in f:
if line.startswith(">"):
print >> outfile, line.strip()
else:
sequence = line.strip()
while len(sequence) > 0:
print >>outfile, sequence[:length]
sequence = sequence[length:]

1
Entering edit mode

Your code is not really bad, but there is plenty of opportunity for improvement. To start, there is no need to keep everything in memory. That's a bad practice, imagine doing this on a smaller computer with a huge fasta file... Streaming is better. Definitely more memory efficient, very likely quicker than your implementation. You could just read a line, if it's a header write to output file. If it's a sequence, write in chunks of n characters.

1
Entering edit mode

@WouterDeCoster, I wrote it the way it is so that the OP can understand what's going on (maybe learn something). I doubt he's using genome size fa files, or if he was, needn't be asking the question in the first place. Otherwise, everything would be contained in the first 'with open' statement of the input file, and not hold everything in memory in both lists, and a dict.

1
Entering edit mode

I definitely like your second script. I would say this is as easy to follow as the first (without using a dict etc) for an inexperienced user, but I'm obviously biased.

Just two more comments, if you don't mind:

• os.path.splitext(inputfile)[0] would be a cleaner way to get the name of the outputfile, or alternatively inputfile.replace('.fasta', '_multiline.fasta'). os.path.splitext(inputfile)[0] has the additional advantage that if the file is .fa instead of .fasta things will still be alright.
• The print >> synthax is not portable to python 3. You should write outfile.write(line). While python 2 is still around, it will not be supported forever and if you plan to write more scripts, try to write in python3. The >> synthax is even no longer described in the official python documentation of both python2.7 and python3.
4
Entering edit mode
5.3 years ago

if your data can fit in memory, here is a simple HTML form that should do the job:

0
Entering edit mode

This is pretty amazing, would never have thought about using html 0.o

1
Entering edit mode

Yes. Pierre always could provide some amazing solution. The most smart guy in the world.

0
Entering edit mode

Hi Pierre Lindenbaum,

This is amazing script. Thanks for the sharing.

Gopal

2
Entering edit mode
5.3 years ago
Guangchuang Yu ★ 2.5k
library(seqmagick)
fa_read(INFILE) %>% fa_write(OUTFILE, type = "interleaved")

0
Entering edit mode

Is there any perl/python equivalent code?

11
Entering edit mode
5.3 years ago
venu 7.0k

I would use fold command

fold -w 60 file.fa > out.fa


Check man fold

4
Entering edit mode

fold will wrongly fold the header line when it's too long, for example:

$fold -w 60 dataset_A.fa | cat -A | more >ADMP01000001 Bacteroides xylanisolvens SD CC 2a contig00137$
, whole genome shotgun sequence. [Bacteroides xylanisolvens $SD CC 2a]$
GCATGGACCCTTAAACCTACTATATAGTCTATACTAATAATACTCTCTGTTTAACCCATA$ACAATGCGACAGACTTAAATCCCATATAAATAAAAAAAACGAGCTAATTCTTTTTATAGA$
ACCACTCGCCACCGCATCAGAAAAGCCTAAAGGCTTTTCCATAGTACCTCAAATTTCCTG$ ADD REPLY 0 Entering edit mode He is using Windows™ from Microsoft Corp... ADD REPLY 0 Entering edit mode This answer is perfect and simplest as long as the header is not too long. ADD REPLY 5 Entering edit mode 5.3 years ago Since you are using Windows, I highly recommend seqkit, which is cross-platform and ultrafast! It's light weight (~8M) and out-of-the-box, no dependencies, no compilation, no configuration. Just download compressed executable file of your operating system, and uncompress it and then use in command line terminal. For your case, just: seqkit seq -w 60 seqs.fa > seqs2.fa  ADD COMMENT 0 Entering edit mode Thank you so much, I've tried all the other tools in this page (i needed a linux solution) and this is the easiest and the only one that gave me what i needed. Thanks! ADD REPLY 0 Entering edit mode 3.8 years ago conchoecia • 0 A solution using bioawk and fold. This avoids fold's incorrect behavior to fold long fasta headers. This should work with fasta files that contain multiple sequences. Awk people - I had to use echo twice since printf didn't place the newline in the correct position after the seqname... bioawk -cfastx '{system("printf \">%s\""$name " >> multiline.fasta"); system("echo '' >> multiline.fasta"); system("echo " \$seq " | fold -w 60 - >> multiline.fasta") }' singleline.fasta