Question: single line fasta to mult line fasta
0
gravatar for User 6777
2.3 years ago by
User 677710
United States
User 677710 wrote:

I know for the reverse procedure, there are lots of methods available. But how can I change my fasta file with sequences like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRQQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRVKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

to like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
QQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
VKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

I know its just about inserting line breaks in each sequence, but is there any way that I can check whether the fasta file is single line fasta or not, and if is, add line breaks in each sequence..

Ps. I am using windows..

Thanks for your consideration

fasta • 6.5k views
ADD COMMENTlink modified 10 months ago by conchoecia0 • written 2.3 years ago by User 677710
4

If you plan to do more bioinformatics, do yourself (and us) a favor a switch to linux. Things will get so much easier then.

ADD REPLYlink written 2.3 years ago by WouterDeCoster35k

fasta_formatter from fastx tookit can do this (http://hannonlab.cshl.edu/fastx_toolkit/) -w specifies the width

ADD REPLYlink written 2.3 years ago by microfuge950

Venu, I want exactly the equivalent of fold command for windows.. may be in perl or python script

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by User 677710
3

Is this a homework question (since you keep asking for perl or python solution)? If all you need is a viable solution then you have plenty to choose from already.

ADD REPLYlink written 2.3 years ago by genomax59k
6
gravatar for genomax
2.3 years ago by
genomax59k
United States
genomax59k wrote:

Since the OP has asked for a solution that will work on windows, reformat.sh from BBMap can be used like so (NN = number of characters you want on one line)

reformat.sh in=your_file.fa out=new_file.fa fastawrap=NN

On windows you would use it like so (adjust windows file path for BBMap install)

java -ea -Xmx200m cp=C:\bbmap\current\ jgi.ReformatReads in=x.fa out=y.fa fastawrap=NN
ADD COMMENTlink modified 2.3 years ago by Brian Bushnell16k • written 2.3 years ago by genomax59k
5
gravatar for Medhat
2.3 years ago by
Medhat8.0k
Poland
Medhat8.0k wrote:

from this

https://github.com/jimhester/fasta_utilities

you can use wrap.pl "limits FASTA lines to 80 characters"

Or you can use from here

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage

fasta_formatter

fasta_formatter -i yourfile.fasta -w 80 -o yourout.fasta

[-w N] = max. sequence line width for output FASTA file. When ZERO (the default), sequence lines will NOT be wrapped - all nucleotides of each sequences will appear on a single line (good for scripting).

advice: If you will work in bioinformatics try any Linux flavor

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Medhat8.0k

but both fasta_utilities and fastx_toolkit are large packages that need to be installed for this. I will prefer a small perl/ python code..

ADD REPLYlink written 2.3 years ago by User 677710
4
gravatar for Pierre Lindenbaum
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

if your data can fit in memory, here is a simple HTML form that should do the job:

ADD COMMENTlink written 2.3 years ago by Pierre Lindenbaum115k

This is pretty amazing, would never have thought about using html 0.o

ADD REPLYlink written 2.3 years ago by WouterDeCoster35k
1

Yes. Pierre always could provide some amazing solution. The most smart guy in the world.

ADD REPLYlink written 2.3 years ago by Shicheng Guo7.3k

Hi Pierre Lindenbaum,

This is amazing script. Thanks for the sharing.

Gopal

ADD REPLYlink written 7 months ago by Gopal Joshi0
3
gravatar for st.ph.n
2.3 years ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

Here's a quick one in python:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

print '\nConverting ' + inputfile + ' to ' + str(length) + ' chars per line...'

with open(inputfile, 'r') as f:

     h = []                                             #headers list
     s = []                                             #sequence list
     for line in f:
           if line.startswith(">"):                      #grab headers
               h.append(line.strip().split(">")[1])
          else:
               s.append(line.strip())                   #grab sequences

myseqs = dict(zip(h,s))                                 #map to dictionary, header:sequence

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

def print_seqs(header, sequence):
        print >> outfile, '>' + header
        while len(sequence) > 0:
                print >> outfile, sequence[:length]
                sequence = sequence[length:]

for i in myseqs:
        print_seqs(i, myseqs[i])
print 'Converted ' + str(len(myseqs)) + ' sequences.'
  print 'Done.\n'

outfile.close()

Run as 'python your_input_fasta Nchars'

EDIT

A cleaner, more memory efficient solution, per WouterDeCoster comment:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

with open(inputfile, 'r') as f:
     for line in f:
        if line.startswith(">"):
                print >> outfile, line.strip()
        else:
                sequence = line.strip()
                while len(sequence) > 0:
                        print >>outfile, sequence[:length]
                        sequence = sequence[length:]
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by st.ph.n2.4k
1

Your code is not really bad, but there is plenty of opportunity for improvement. To start, there is no need to keep everything in memory. That's a bad practice, imagine doing this on a smaller computer with a huge fasta file... Streaming is better. Definitely more memory efficient, very likely quicker than your implementation. You could just read a line, if it's a header write to output file. If it's a sequence, write in chunks of n characters.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by WouterDeCoster35k
1

@WouterDeCoster, I wrote it the way it is so that the OP can understand what's going on (maybe learn something). I doubt he's using genome size fa files, or if he was, needn't be asking the question in the first place. Otherwise, everything would be contained in the first 'with open' statement of the input file, and not hold everything in memory in both lists, and a dict.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by st.ph.n2.4k
1

I definitely like your second script. I would say this is as easy to follow as the first (without using a dict etc) for an inexperienced user, but I'm obviously biased.

Just two more comments, if you don't mind:

  • os.path.splitext(inputfile)[0] would be a cleaner way to get the name of the outputfile, or alternatively inputfile.replace('.fasta', '_multiline.fasta'). os.path.splitext(inputfile)[0] has the additional advantage that if the file is .fa instead of .fasta things will still be alright.
  • The print >> synthax is not portable to python 3. You should write outfile.write(line). While python 2 is still around, it will not be supported forever and if you plan to write more scripts, try to write in python3. The >> synthax is even no longer described in the official python documentation of both python2.7 and python3.
ADD REPLYlink written 2.3 years ago by WouterDeCoster35k
2
gravatar for Guangchuang Yu
2.3 years ago by
Guangchuang Yu2.0k
China/Hong Kong/The University of Hong Kong
Guangchuang Yu2.0k wrote:

seqmagick:

library(seqmagick)
fa_read(INFILE) %>% fa_write(OUTFILE, type = "interleaved")
ADD COMMENTlink written 2.3 years ago by Guangchuang Yu2.0k

Is there any perl/python equivalent code?

ADD REPLYlink written 2.3 years ago by User 677710
7
gravatar for venu
2.3 years ago by
venu5.7k
Germany
venu5.7k wrote:

I would use fold command

fold -w 60 file.fa > out.fa

Check man fold

ADD COMMENTlink written 2.3 years ago by venu5.7k
3

fold will wrongly fold the header line when it's too long, for example:

$ fold -w 60 dataset_A.fa | cat -A | more
>ADMP01000001 Bacteroides xylanisolvens SD CC 2a contig00137$
, whole genome shotgun sequence. [Bacteroides xylanisolvens $
SD CC 2a]$
GCATGGACCCTTAAACCTACTATATAGTCTATACTAATAATACTCTCTGTTTAACCCATA$
ACAATGCGACAGACTTAAATCCCATATAAATAAAAAAAACGAGCTAATTCTTTTTATAGA$
ACCACTCGCCACCGCATCAGAAAAGCCTAAAGGCTTTTCCATAGTACCTCAAATTTCCTG$
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by shenwei3564.3k

He is using Windows™ from Microsoft Corp...

ADD REPLYlink written 2.3 years ago by Gabriel R.2.5k
3
gravatar for shenwei356
2.3 years ago by
shenwei3564.3k
China
shenwei3564.3k wrote:

Since you are using Windows, I highly recommend seqkit, which is cross-platform and ultrafast!

It's light weight (~8M) and out-of-the-box, no dependencies, no compilation, no configuration. Just download compressed executable file of your operating system, and uncompress it and then use in command line terminal.

For your case, just:

seqkit seq -w 60 seqs.fa > seqs2.fa
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by shenwei3564.3k

Thank you so much, I've tried all the other tools in this page (i needed a linux solution) and this is the easiest and the only one that gave me what i needed. Thanks!

ADD REPLYlink written 4 weeks ago by deboraguerrapeixe0
0
gravatar for Akliao
14 months ago by
Akliao0
San Diego
Akliao0 wrote:

Will this work?

https://en.wikipedia.org/wiki/Fold_(Unix)

ADD COMMENTlink written 14 months ago by Akliao0

Maybe, but not guaranteed: it might also 'fold' the description/fasta identifier and that's definitely not what you want.

See this reaction in this same thread: C: single line fasta to mult line fasta

ADD REPLYlink written 14 months ago by WouterDeCoster35k
0
gravatar for conchoecia
10 months ago by
conchoecia0
Santa Cruz, CA
conchoecia0 wrote:

A solution using bioawk and fold. This avoids fold's incorrect behavior to fold long fasta headers. This should work with fasta files that contain multiple sequences.

Awk people - I had to use echo twice since printf didn't place the newline in the correct position after the seqname...

bioawk -cfastx '{system("printf \">%s\"" $name " >> multiline.fasta"); system("echo '' >> multiline.fasta"); system("echo " $seq " | fold -w 60 - >> multiline.fasta") }' singleline.fasta
ADD COMMENTlink written 10 months ago by conchoecia0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1092 users visited in the last hour