Question: single line fasta to mult line fasta
0
gravatar for User 6777
15 months ago by
User 677710
United States
User 677710 wrote:

I know for the reverse procedure, there are lots of methods available. But how can I change my fasta file with sequences like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRQQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSRVKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

to like:

>gi|189094002
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
QQTKITFNRKNIAADDDEDDDDFFD
>gi|68485955
MDDAELNAIRQARLAELQRNAAGGGSSTNPSSSSSGGAQDSAQENMTITILNRVLTNEARERLSR
VKIVRRDSQQKQQTKITFNRKNIAGDDEDDDDFFD

I know its just about inserting line breaks in each sequence, but is there any way that I can check whether the fasta file is single line fasta or not, and if is, add line breaks in each sequence..

Ps. I am using windows..

Thanks for your consideration

fasta • 3.0k views
ADD COMMENTlink modified 6 weeks ago by Akliao0 • written 15 months ago by User 677710
4

If you plan to do more bioinformatics, do yourself (and us) a favor a switch to linux. Things will get so much easier then.

ADD REPLYlink written 15 months ago by WouterDeCoster23k

fasta_formatter from fastx tookit can do this (http://hannonlab.cshl.edu/fastx_toolkit/) -w specifies the width

ADD REPLYlink written 15 months ago by microfuge720

Venu, I want exactly the equivalent of fold command for windows.. may be in perl or python script

ADD REPLYlink modified 15 months ago • written 15 months ago by User 677710
3

Is this a homework question (since you keep asking for perl or python solution)? If all you need is a viable solution then you have plenty to choose from already.

ADD REPLYlink written 15 months ago by genomax37k
5
gravatar for Medhat
15 months ago by
Medhat7.1k
Poland
Medhat7.1k wrote:

from this

https://github.com/jimhester/fasta_utilities

you can use wrap.pl "limits FASTA lines to 80 characters"

Or you can use from here

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_clipper_usage

fasta_formatter

fasta_formatter -i yourfile.fasta -w 80 -o yourout.fasta

[-w N] = max. sequence line width for output FASTA file. When ZERO (the default), sequence lines will NOT be wrapped - all nucleotides of each sequences will appear on a single line (good for scripting).

advice: If you will work in bioinformatics try any Linux flavor

ADD COMMENTlink modified 15 months ago • written 15 months ago by Medhat7.1k

but both fasta_utilities and fastx_toolkit are large packages that need to be installed for this. I will prefer a small perl/ python code..

ADD REPLYlink written 15 months ago by User 677710
4
gravatar for Pierre Lindenbaum
15 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

if your data can fit in memory, here is a simple HTML form that should do the job:

ADD COMMENTlink written 15 months ago by Pierre Lindenbaum101k

This is pretty amazing, would never have thought about using html 0.o

ADD REPLYlink written 15 months ago by WouterDeCoster23k
1

Yes. Pierre always could provide some amazing solution. The most smart guy in the world.

ADD REPLYlink written 15 months ago by Shicheng Guo4.5k
3
gravatar for genomax
15 months ago by
genomax37k
United States
genomax37k wrote:

Since the OP has asked for a solution that will work on windows, reformat.sh from BBMap can be used like so (NN = number of characters you want on one line)

reformat.sh in=your_file.fa out=new_file.fa fastawrap=NN

On windows you would use it like so (adjust windows file path for BBMap install)

java -ea -Xmx200m cp=C:\bbmap\current\ jgi.ReformatReads in=x.fa out=y.fa fastawrap=NN
ADD COMMENTlink modified 15 months ago by Brian Bushnell14k • written 15 months ago by genomax37k
3
gravatar for st.ph.n
15 months ago by
st.ph.n1.9k
Philadelphia, PA
st.ph.n1.9k wrote:

Here's a quick one in python:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

print '\nConverting ' + inputfile + ' to ' + str(length) + ' chars per line...'

with open(inputfile, 'r') as f:

     h = []                                             #headers list
     s = []                                             #sequence list
     for line in f:
           if line.startswith(">"):                      #grab headers
               h.append(line.strip().split(">")[1])
          else:
               s.append(line.strip())                   #grab sequences

myseqs = dict(zip(h,s))                                 #map to dictionary, header:sequence

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

def print_seqs(header, sequence):
        print >> outfile, '>' + header
        while len(sequence) > 0:
                print >> outfile, sequence[:length]
                sequence = sequence[length:]

for i in myseqs:
        print_seqs(i, myseqs[i])
print 'Converted ' + str(len(myseqs)) + ' sequences.'
  print 'Done.\n'

outfile.close()

Run as 'python your_input_fasta Nchars'

EDIT

A cleaner, more memory efficient solution, per WouterDeCoster comment:

import sys

inputfile = sys.argv[1]
length = int(sys.argv[2])

outfile = open(inputfile.split(".fasta")[0] + '_multi-line.fasta', 'w') #open outfile for writing

with open(inputfile, 'r') as f:
     for line in f:
        if line.startswith(">"):
                print >> outfile, line.strip()
        else:
                sequence = line.strip()
                while len(sequence) > 0:
                        print >>outfile, sequence[:length]
                        sequence = sequence[length:]
ADD COMMENTlink modified 15 months ago • written 15 months ago by st.ph.n1.9k
1

Your code is not really bad, but there is plenty of opportunity for improvement. To start, there is no need to keep everything in memory. That's a bad practice, imagine doing this on a smaller computer with a huge fasta file... Streaming is better. Definitely more memory efficient, very likely quicker than your implementation. You could just read a line, if it's a header write to output file. If it's a sequence, write in chunks of n characters.

ADD REPLYlink modified 15 months ago • written 15 months ago by WouterDeCoster23k
1

@WouterDeCoster, I wrote it the way it is so that the OP can understand what's going on (maybe learn something). I doubt he's using genome size fa files, or if he was, needn't be asking the question in the first place. Otherwise, everything would be contained in the first 'with open' statement of the input file, and not hold everything in memory in both lists, and a dict.

ADD REPLYlink modified 15 months ago • written 15 months ago by st.ph.n1.9k
1

I definitely like your second script. I would say this is as easy to follow as the first (without using a dict etc) for an inexperienced user, but I'm obviously biased.

Just two more comments, if you don't mind:

  • os.path.splitext(inputfile)[0] would be a cleaner way to get the name of the outputfile, or alternatively inputfile.replace('.fasta', '_multiline.fasta'). os.path.splitext(inputfile)[0] has the additional advantage that if the file is .fa instead of .fasta things will still be alright.
  • The print >> synthax is not portable to python 3. You should write outfile.write(line). While python 2 is still around, it will not be supported forever and if you plan to write more scripts, try to write in python3. The >> synthax is even no longer described in the official python documentation of both python2.7 and python3.
ADD REPLYlink written 15 months ago by WouterDeCoster23k
2
gravatar for Guangchuang Yu
15 months ago by
Guangchuang Yu1.6k
China/Hong Kong/The University of Hong Kong
Guangchuang Yu1.6k wrote:

seqmagick:

library(seqmagick)
fa_read(INFILE) %>% fa_write(OUTFILE, type = "interleaved")
ADD COMMENTlink written 15 months ago by Guangchuang Yu1.6k

Is there any perl/python equivalent code?

ADD REPLYlink written 15 months ago by User 677710
3
gravatar for venu
15 months ago by
venu4.5k
Germany
venu4.5k wrote:

I would use fold command

fold -w 60 file.fa > out.fa

Check man fold

ADD COMMENTlink written 15 months ago by venu4.5k
3

fold will wrongly fold the header line when it's too long, for example:

$ fold -w 60 dataset_A.fa | cat -A | more
>ADMP01000001 Bacteroides xylanisolvens SD CC 2a contig00137$
, whole genome shotgun sequence. [Bacteroides xylanisolvens $
SD CC 2a]$
GCATGGACCCTTAAACCTACTATATAGTCTATACTAATAATACTCTCTGTTTAACCCATA$
ACAATGCGACAGACTTAAATCCCATATAAATAAAAAAAACGAGCTAATTCTTTTTATAGA$
ACCACTCGCCACCGCATCAGAAAAGCCTAAAGGCTTTTCCATAGTACCTCAAATTTCCTG$
ADD REPLYlink modified 15 months ago • written 15 months ago by shenwei3563.4k

He is using Windows™ from Microsoft Corp...

ADD REPLYlink written 15 months ago by Gabriel R.2.1k
3
gravatar for shenwei356
15 months ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

Since you are using Windows, I highly recommend seqkit, which is cross-platform and ultrafast!

It's light weight (~8M) and out-of-the-box, no dependencies, no compilation, no configuration. Just download compressed executable file of your operating system, and uncompress it and then use in command line terminal.

For your case, just:

seqkit seq -w 60 seqs.fa > seqs2.fa
ADD COMMENTlink modified 15 months ago • written 15 months ago by shenwei3563.4k
0
gravatar for Akliao
6 weeks ago by
Akliao0
San Diego
Akliao0 wrote:

Will this work?

https://en.wikipedia.org/wiki/Fold_(Unix)

ADD COMMENTlink written 6 weeks ago by Akliao0

Maybe, but not guaranteed: it might also 'fold' the description/fasta identifier and that's definitely not what you want.

See this reaction in this same thread: C: single line fasta to mult line fasta

ADD REPLYlink written 6 weeks ago by WouterDeCoster23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 847 users visited in the last hour