Question: Multiline Fasta To Single Line Fasta
14
8.3 years ago by
Palu170
Palu170 wrote:

I have a FASTA file in the following format:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

But I want to see each sequence on a single line, not split across many lines as they are now. Is there a quick method?

fasta sequence • 46k views
ADD COMMENTlink modified 20 months ago by teckpor10 • written 8.3 years ago by Palu170
55
8.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

using awk:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa


>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

Edit: for Window$.

:-)

ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by Pierre Lindenbaum123k
14

Suggestion for Windows --> Switch to Linux :p

ADD REPLYlink written 8.3 years ago by Eric Normandeau10k
9

Good solution but be careful. If you redirect the result to a file,

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > out.fa

the first line is left empty.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by joreamayarom120
4

+1 for the windows fix :-)

ADD REPLYlink written 8.3 years ago by Michael Schubert6.9k
2

There will be an empty line at the beginning; it should be removed, e.g.: tail -n +2 filein.fa > fileout.fa

ADD REPLYlink written 2.5 years ago by Medhat8.4k

yeah.. that was ~6 years ago. Now: http://stackoverflow.com/documentation/bioinformatics/4194

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Pierre Lindenbaum123k

my bad I just saw it :)

ADD REPLYlink written 2.5 years ago by Medhat8.4k

...the link is dead and I still get the first line empty after redirecting the output into a file - could you update the answer if there is a more elegant way (not piping through tail) to avoid this?

ADD REPLYlink modified 6 months ago • written 6 months ago by al-ash110
1

ADD REPLYlink written 6 months ago by Pierre Lindenbaum123k

Modified the original awk code from @Pierre Lindenbaum to below

awk '/^>/ { if(NR>1) print "";  printf("%s\n",$0); next; } { printf("%s",$0);}  END {printf("\n");}'

Uses NR (awk's count of input records read so far) to print a newline only before headers that are not on the first line. Not the most elegant but I hope this will help someone out there.
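For anyone without awk at hand (e.g. on Windows), the same idea can be sketched in Python using only the standard library. This is a minimal sketch, not part of the original answers: the function name `linearize` and the demo records are made up for illustration, and any lines before the first header are assumed to be ignorable.

```python
def linearize(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    chunks = []        # sequence pieces of the record currently being read
    in_record = False  # becomes True once the first header has been seen
    for raw in lines:
        line = raw.rstrip("\r\n")  # also drops Windows (CRLF) line endings
        if line.startswith(">"):
            if in_record:
                # A new header closes the previous record: emit its sequence.
                yield "".join(chunks)
            chunks = []
            in_record = True
            yield line
        elif in_record:
            chunks.append(line.strip())
        # Assumption: anything before the first header is skipped.
    if in_record:
        yield "".join(chunks)  # sequence of the last record


# Hypothetical demo input; in practice pass an open file handle instead.
demo = [">seq1 first\n", "MGIK\n", "GLTG\n", ">seq2\n", "ACGT\n"]
print("\n".join(linearize(demo)))
```

Because the sequence is only written when the next header (or the end of input) is reached, no empty first line is produced.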

ADD REPLYlink modified 4 months ago • written 4 months ago by ljq20

you can use the answer provided by Jorge Amigo below. (it's not using tail)

ADD REPLYlink modified 6 months ago • written 6 months ago by lieven.sterck5.8k

Unfortunately I am a Windows lover. Please suggest something for Windows.

ADD REPLYlink written 8.3 years ago by Palu170
6
4.9 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

here is a quick and simple perl one-liner:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta > out.fasta

which will output an empty first line. you could use tail (which is faster than sed) to remove it:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta

EDIT: even easier (do not use!):

perl -pe 'chomp unless /^>/' in.fasta > out.fasta

EDIT2: this last one liner does not work as expected. use this one instead, which performs inside a single perl call all the logic needed in all lines but the first one by using perl's $. internal line counter variable:

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' in.fasta > out.fasta
ADD COMMENTlink modified 6 months ago • written 4.9 years ago by Jorge Amigo11k

Thank you!! This worked perfectly!

ADD REPLYlink written 13 months ago by rllombardi0

Is there a simple way to embed this one-liner into a script that just takes a fasta file as input? I tried doing this myself, but I have no idea how to actually write perl scripts. Normally I would just embed this into a system call via an R script, but all the quotes are throwing me off.

ADD REPLYlink written 13 months ago by caverill40

just create a simple script.pl file containing this (do not use!)

while (<>) { chomp unless /^>/; print }

EDIT: the previous code is wrong. use this one into script.pl instead:

while (<>) { $. > 1 and /^>/ ? print "\n" : chomp; print }

and run it

perl script.pl <in.fasta >out.fasta
ADD REPLYlink modified 6 months ago • written 10 months ago by Jorge Amigo11k

(for future reference:) Sorry to tell but the 'EDIT' version does not work as expected.

Problem is that it will have chomped the line previous to /^>/ and as such will add the header line to the previous sequence line. the other version works perfectly though.
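To make the problem concrete, here is a small Python sketch that mimics the logic of `chomp unless /^>/` (strip the newline from every non-header line, leave header lines untouched). The function name and the demo records are made up for illustration.

```python
def chomp_unless_header(lines):
    """Mimic: perl -pe 'chomp unless /^>/' (the version marked 'do not use')."""
    out = []
    for line in lines:
        if not line.startswith(">"):
            line = line.rstrip("\n")  # sequence lines lose their newline
        out.append(line)              # header lines keep theirs
    return "".join(out)


# Hypothetical two-record input.
demo = [">seq1\n", "AAAA\n", "CCCC\n", ">seq2\n", "GGGG\n"]
result = chomp_unless_header(demo)
print(repr(result))
```

Nothing ever puts a newline back between the last sequence line of one record and the next header, so `>seq2` ends up glued to the end of the first sequence.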

ADD REPLYlink written 6 months ago by lieven.sterck5.8k
1

thanks for pointing it out. I've corrected my previous answer and tested thoroughly the new one.

ADD REPLYlink written 6 months ago by Jorge Amigo11k
4
8.3 years ago by
toni2.1k
Lyon
toni2.1k wrote:

Also a quick & dirty solution with Perl (fa2oneline.pl):

#!/usr/bin/perl -w
use strict;

my $input_fasta=$ARGV[0];
open(IN,"<$input_fasta") || die ("Error opening $input_fasta $!");

my $line = <IN>; 
print $line;

while ($line = <IN>)
{
chomp $line;
if ($line=~m/^>gi/) { print "\n",$line,"\n"; }
else { print $line; }
}

print "\n";

then run :

perl fa2oneline.pl sample.fa > out.fa

Result :

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF
ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by toni2.1k
2

The regex needs to be changed?

$line=~m/^>gi/) should be $line=~m/^>/gi)
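A short Python sketch of the difference, in case it helps: a pattern anchored on `>gi` only recognises GenBank-style headers like the ones in the question, while a pattern anchored on `>` alone recognises any FASTA header. The second header below is invented for illustration.

```python
import re

headers = [
    ">gi|321257144|ref|XP_003193485.1| flap endonuclease",  # from the question
    ">my_protein_1 some description",                       # hypothetical non-gi header
]

# Equivalent of the Perl test m/^>gi/: matches only headers starting with ">gi".
only_gi = [bool(re.match(r">gi", h)) for h in headers]

# Equivalent of m/^>/: matches every header line.
any_header = [bool(re.match(r">", h)) for h in headers]

print(only_gi)
print(any_header)
```

For this purpose a plain `m/^>/` is enough; the `g` and `i` flags add nothing to a single anchored match.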

ADD REPLYlink written 6.9 years ago by stoker.neil50

sorry tony, thank you for this great help

ADD REPLYlink written 8.3 years ago by Palu170

this still doesnt seem to work, even with the regex change.

ADD REPLYlink written 13 months ago by caverill40
4
8.3 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Biopieces (www.biopieces.org) is another way:

read_fasta -i file.fna | write_fasta -x

Cheers,

Martin

ADD COMMENTlink written 8.3 years ago by Martin A Hansen3.0k
2
2.9 years ago by
adhil.md20
adhil.md20 wrote:

Python version to convert multi-line to two-line fasta format. It also converts multiple files. Output directory will have all the files with '_twoline.fasta' as suffix.

from Bio import SeqIO
import os
import re
import argparse

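def multi2linefasta(indir,outdir,filelist):
    for items in filelist:
        # Output name: input name up to the first dot, plus the '_twoline.fasta' suffix
        mfasta = os.path.join(outdir, re.sub(r'\..*','',items)+'_twoline.fasta')
        # 'rU' mode was removed in Python 3.11; plain 'r' already handles universal newlines
        with open(os.path.join(indir,items),'r') as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                # Note: record.id is only the first word of the header; the description is dropped
                ofile.write('>'+record.id+'\n'+sequence+'\n')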


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert multiple files of multi-line fasta into two-line fasta format')
    parser.add_argument('-i',type=str,dest='ind',required=True,help="Input directory where all the fasta files are present")
    parser.add_argument('-o',type=str,dest='outd',required=True,help="Output directory")
    parser.add_argument('-f',type=str,dest='ffiles',required=True,help="Comma separated fasta file names without spaces")
    args = parser.parse_args()
    print (args)
    filelist = args.ffiles.split(',')
    if not os.path.exists(args.outd):
        os.makedirs(args.outd)
    multi2linefasta(args.ind,args.outd,filelist)

To run

save the above code to 'convertfasta.py' file

python convertfasta.py -i "/path/to/inputfolder/" -o "/path/to/outputfolder/" -f "file1.fasta,file2.fasta,file3.fasta"

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by adhil.md20
1

I would like to give you some suggestions or comments...

  • Use Biopython for parsing fasta files to avoid assumptions about the format. When a parser exists, use it. It will make your code quicker and shorter.
  • Use the os module for handling directories and paths
  • Use the sys module to get input as a python script rather than running this function interactively
  • Use the with open(file) as ofile syntax to handle opening of files
ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by WouterDeCoster41k

Thank You ...................................

ADD REPLYlink written 2.9 years ago by adhil.md20
1
8.3 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

Kent source's "faToTab". I'm sure EMBOSS has something for this as well.

If you are on a Windows machine, I'd just use Galaxy's fasta2tab.

ADD COMMENTlink written 8.3 years ago by Aaronquinlan11k
1
20 months ago by
teckpor10
teckpor10 wrote:

It seems that seqtk (https://github.com/lh3/seqtk) can be used for this task, although the webpage only mentions multiline fastq. The command to give is simply (I am using Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz, but one can use any fasta or gzipped fasta):

seqtk seq -l0 Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | gzip > Homo_sapiens.GRCh37.dna.primary_assembly.singleLines.fa.gz

For a quick proof that this works, try:

zcat Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | head -n10 | seqtk seq -l0 | cat -A
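If eyeballing the `cat -A` output is not practical for a large file, the check can also be scripted. The following is a rough Python sketch (the function name is made up, and blank lines are assumed to be harmless and skipped): it returns True only when headers and sequence lines strictly alternate.

```python
def is_single_line_fasta(lines):
    """Return True if every header is followed by exactly one sequence line."""
    expect_header = True  # the first non-blank line must be a header
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are tolerated
        if line.startswith(">") != expect_header:
            return False  # two headers in a row, or two sequence lines in a row
        expect_header = not expect_header
    # Ending while a sequence line is still expected means the last
    # header had no sequence; empty input is treated as valid.
    return expect_header


print(is_single_line_fasta([">a\n", "ACGT\n", ">b\n", "TT\n"]))
print(is_single_line_fasta([">a\n", "AC\n", "GT\n"]))
```

For a gzipped file, the lines could come from `gzip.open(path, "rt")` instead of a plain `open`.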
ADD COMMENTlink written 20 months ago by teckpor10
0
8.3 years ago by
Woa2.7k
United States
Woa2.7k wrote:

use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file => "myseq.fasta" , '-format' => 'Fasta');

while ( my $seq = $in->next_seq ) {
        print ">",$seq->id()," ",$seq->desc(),"\n",$seq->seq(),"\n";
}

Output:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK

>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF
ADD COMMENTlink modified 8.3 years ago by Neilfws48k • written 8.3 years ago by Woa2.7k

I messed up the formatting. The ">" symbols at the beginning of the FASTA headers are not shown for some reason.

ADD REPLYlink written 8.3 years ago by Woa2.7k

Just indent with 4 spaces, otherwise ">" is interpreted as blockquote.

ADD REPLYlink written 8.3 years ago by Neilfws48k
0
4.9 years ago by
oigl60
oigl60 wrote:

You can open the merged sequences in the UGENE Sequence View. To do it:

  1. Select "File>Open as" in the main UGENE menu.
  2. Select the "FASTA" format.
  3. Select "Merge sequences into a single sequence to show in the sequence viewer".

By default, UGENE will show the sequence itself, the complementary sequence and translations. It looks like here.

If required, you can export the merged sequence into a new file.

ADD COMMENTlink written 4.9 years ago by oigl60
0
4.9 years ago by
France
sayuj.koyyappurath0 wrote:

Hi

Hope this works

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
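Note that this one-liner puts the header and the sequence on the same line, separated by a tab, rather than on two lines. A rough Python equivalent of what it does, as a sketch (the function name and demo text are invented; like the awk version, it assumes ">" never occurs inside a header or sequence):

```python
def fasta_to_tab(text):
    """Split on '>' like awk's RS=">" and return one 'header<TAB>sequence' row per record."""
    rows = []
    # The piece before the first '>' is empty, hence [1:] (awk's NR>1).
    for record in text.split(">")[1:]:
        # First newline separates header from sequence (awk's sub("\n","\t")).
        header, _, body = record.partition("\n")
        # Remaining newlines are dropped (awk's gsub("\n","")).
        rows.append(">" + header + "\t" + body.replace("\n", ""))
    return rows


# Hypothetical two-record input.
demo_text = ">seq1 desc\nAAAA\nCCCC\n>seq2\nGG\n"
for row in fasta_to_tab(demo_text):
    print(row)
```

Replacing the tab with a newline when writing the rows gives the usual two-line-per-record layout.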

ADD COMMENTlink written 4.9 years ago by sayuj.koyyappurath0