Question: Multiline Fasta To Single Line Fasta
14
7.6 years ago by
Palu170
Palu170 wrote:

I have a FASTA file in the following format:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

But I want each sequence on a single line, not wrapped across many lines as above. Is there a quick method?

fasta sequence • 38k views
ADD COMMENTlink modified 11 months ago by teckpor10 • written 7.6 years ago by Palu170
52
7.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

using awk:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa


>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

Edit: for Window$.

:-)
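
For anyone without awk at hand (for example on Windows), the same line-by-line logic can be sketched in Python. This is only a minimal sketch and not part of the original answer: the function name unwrap_fasta is made up for illustration, and it assumes every record starts with a ">" header line. It buffers the sequence lines of a record and emits them when the next header arrives, so no empty first line is produced.

```python
def unwrap_fasta(lines):
    """Yield header lines unchanged and each record's sequence joined onto one line.

    `lines` can be any iterable of strings, e.g. an open file handle.
    (unwrap_fasta is a hypothetical helper name, not from any library.)
    """
    seq_parts = []
    for line in lines:
        line = line.rstrip("\r\n")          # also copes with Windows line endings
        if line.startswith(">"):
            if seq_parts:                   # flush the previous record's sequence
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:                          # skip blank lines
            seq_parts.append(line)
    if seq_parts:                           # flush the last record
        yield "".join(seq_parts)


# Small in-memory demonstration
demo = [">seq1 test\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
for out_line in unwrap_fasta(demo):
    print(out_line)
# prints:
# >seq1 test
# ACGTTTGA
# >seq2
# GGCC
```

To use it on a real file, pass an open file handle instead of the demo list, e.g. `with open("file.fa") as fh:` followed by a loop over `unwrap_fasta(fh)`.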

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Pierre Lindenbaum116k
13

Suggestion for Windows --> Switch to Linux :p

ADD REPLYlink written 7.6 years ago by Eric Normandeau10k
8

Good solution but be careful. If you redirect the result to a file,

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > out.fa

the first line is left empty.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by joreamayarom110
4

+1 for the windows fix :-)

ADD REPLYlink written 7.6 years ago by Michael Schubert6.8k
2

There will be an empty line at the beginning. It can be removed like this: tail -n +2 filein.fa > fileout.fa

ADD REPLYlink written 21 months ago by Medhat8.1k

yeah.. that was ~6 years ago. Now: http://stackoverflow.com/documentation/bioinformatics/4194

ADD REPLYlink modified 21 months ago • written 21 months ago by Pierre Lindenbaum116k

my bad I just saw it :)

ADD REPLYlink written 21 months ago by Medhat8.1k

Unfortunately I am a Windows lover. Please suggest something for Windows.

ADD REPLYlink written 7.6 years ago by Palu170
6
4.2 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

here is a quick and simple perl one-liner:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta > out.fasta

which will output an empty first line. you could use tail (which is faster than sed) to remove it:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta

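EDIT: even easier, skipping the extra newline before the first header so that tail is not needed (the last sequence line still ends without a trailing newline):

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' in.fasta > out.fasta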
ADD COMMENTlink modified 8 weeks ago • written 4.2 years ago by Jorge Amigo11k

Thank you!! This worked perfectly!

ADD REPLYlink written 5 months ago by rllombardi0

Is there a simple way to embed this one-liner into a script that just takes a fasta file as input? I tried doing this myself but I have no idea how to actually write perl scripts. Normally I would just embed this into a system call via an R script, but all the quotes are throwing me off.

ADD REPLYlink written 5 months ago by caverill40

just create a simple script.pl file containing this

while (<>) { if (/^>/) { print "\n" if $. > 1 } else { chomp } print }
print "\n";  # final newline after the last sequence

and run it

perl script.pl <in.fasta >out.fasta
ADD REPLYlink written 8 weeks ago by Jorge Amigo11k
4
7.6 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Biopieces (www.biopieces.org) is another way:

read_fasta -i file.fna | write_fasta -x

Cheers,

Martin

ADD COMMENTlink written 7.6 years ago by Martin A Hansen3.0k
3
7.6 years ago by
toni2.1k
Lyon
toni2.1k wrote:

Also a quick & dirty solution with Perl... (fa2oneline.pl)

#!/usr/bin/perl -w
use strict;

my $input_fasta=$ARGV[0];
open(IN,"<$input_fasta") || die ("Error opening $input_fasta $!");

my $line = <IN>; 
print $line;

while ($line = <IN>)
{
chomp $line;
if ($line=~m/^>gi/) { print "\n",$line,"\n"; }
else { print $line; }
}

print "\n";

then run :

perl fa2oneline.pl sample.fa > out.fa

Result :

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF
ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by toni2.1k
1

The regex needs to be changed?

$line=~m/^>gi/ only matches headers that start with ">gi"; it should be $line=~m/^>/ to match any FASTA header.

ADD REPLYlink written 6.2 years ago by stoker.neil40

sorry tony, thank you for this great help

ADD REPLYlink written 7.6 years ago by Palu170

this still doesn't seem to work, even with the regex change.

ADD REPLYlink written 5 months ago by caverill40
2
2.2 years ago by
adhil.md20
adhil.md20 wrote:

Python version to convert multi-line to two-line fasta format. It also converts multiple files. Output directory will have all the files with '_twoline.fasta' as suffix.

from Bio import SeqIO
import os
import re
import argparse

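def multi2linefasta(indir,outdir,filelist):
    for items in filelist:
        # strip everything after the first dot, e.g. 'file1.fasta' -> 'file1'
        mfasta = os.path.join(outdir, re.sub(r'\..*','',items)+'_twoline.fasta')
        # the 'rU' mode was removed in Python 3.11; plain 'r' already handles all newline styles
        with open(os.path.join(indir, items),'r') as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                # record.id is the header up to the first whitespace; use record.description to keep the full header
                ofile.write('>'+record.id+'\n'+sequence+'\n')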


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert multiple files of multi-line fasta into two-line fasta format')
    parser.add_argument('-i',type=str,dest='ind',required=True,help="Input directory where all the fasta files are present")
    parser.add_argument('-o',type=str,dest='outd',required=True,help="Output directory")
    parser.add_argument('-f',type=str,dest='ffiles',required=True,help="Comma separated fasta file names without spaces")
    args = parser.parse_args()
    print (args)
    filelist = args.ffiles.split(',')
    if not os.path.exists(args.outd):
        os.makedirs(args.outd)
    multi2linefasta(args.ind,args.outd,filelist)

To run

save the above code to 'convertfasta.py' file

python convertfasta.py -i "/path/to/inputfolder/" -o "/path/to/outputfolder/" -f "file1.fasta,file2.fasta,file3.fasta"
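
If Biopython is not installed, a similar batch conversion can be sketched with the standard library only. This is a rough sketch, not the script above: the names read_records and convert_directory and the "*.fasta" glob pattern are assumptions made for illustration. Unlike the script above, it converts every matching file in the input directory and keeps the full header line rather than only the record id.

```python
from pathlib import Path


def read_records(path):
    """Yield (header, sequence) tuples; the sequence is joined into one string."""
    header, parts = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(parts)
                header, parts = line, []
            elif line and header is not None:
                parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def convert_directory(indir, outdir, pattern="*.fasta"):
    """Write '<name>_twoline.fasta' into outdir for every matching file in indir.

    Returns the list of files written. (Hypothetical helper, stdlib only.)
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for infile in sorted(Path(indir).glob(pattern)):
        if infile.stem.endswith("_twoline"):
            continue                        # do not re-convert earlier output
        outfile = outdir / (infile.stem + "_twoline.fasta")
        with open(outfile, "w") as out:
            for header, sequence in read_records(infile):
                out.write(header + "\n" + sequence + "\n")
        written.append(outfile)
    return written
```

A call such as `convert_directory("/path/to/inputfolder", "/path/to/outputfolder")` would then play the role of the command line above.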

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by adhil.md20
1

I would like to give you some suggestions or comments...

  • Use Biopython for parsing fasta files to avoid assumptions about the format. When a parser exists, use it. It will make your code quicker and shorter.
  • Use the os module for handling directories and paths
  • Use the sys module to get input as a python script rather than running this function interactively
  • Use the with open(file) as ofile syntax to handle opening of files
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by WouterDeCoster35k

Thank You ...................................

ADD REPLYlink written 2.2 years ago by adhil.md20
1
7.6 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

Kent source's "faToTab". I'm sure EMBOSS has something for this as well.

If you are on a Windows machine, I'd just use Galaxy's fasta2tab.

ADD COMMENTlink written 7.6 years ago by Aaronquinlan10k
1
11 months ago by
teckpor10
teckpor10 wrote:

It seems that seqtk (https://github.com/lh3/seqtk) can be used for this task, although the webpage only mentions multiline fastq. The command to give is simply (I am using Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz, but one can use any fasta or gzipped fasta):

seqtk seq -l0 Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | gzip > Homo_sapiens.GRCh37.dna.primary_assembly.singleLines.fa.gz

For a quick proof that this works, try:

zcat Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | head -n10 | seqtk seq -l0 | cat -A
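
Where seqtk is not available, the same gzip-in, gzip-out conversion can be approximated in Python. The following is a hedged sketch, not what seqtk does internally: the names open_text and unwrap_to_gzip are hypothetical, and input compression is detected from the two gzip magic bytes. It streams line by line, so a whole chromosome is never held in memory.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"   # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def unwrap_to_gzip(in_path, out_path):
    """Write in_path to a gzipped out_path with each sequence on one line.

    Returns the number of records seen. (Hypothetical helper names.)
    """
    n_records = 0
    need_newline = False                    # True while a sequence line is unfinished
    with open_text(in_path) as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if need_newline:            # terminate the previous sequence
                    dst.write("\n")
                    need_newline = False
                dst.write(line + "\n")
                n_records += 1
            elif line:
                dst.write(line)
                need_newline = True
        if need_newline:                    # terminate the last sequence
            dst.write("\n")
    return n_records
```

Pure Python will be much slower than seqtk on a full genome, so this is mainly useful where installing seqtk is not an option.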
ADD COMMENTlink written 11 months ago by teckpor10
0
7.6 years ago by
Woa2.7k
United States
Woa2.7k wrote:

use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file => "myseq.fasta" , '-format' => 'Fasta');

while ( my $seq = $in->next_seq ) {
        print ">",$seq->id()," ",$seq->desc(),"\n",$seq->seq(),"\n";
}

Output:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK

>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF
ADD COMMENTlink modified 7.6 years ago by Neilfws48k • written 7.6 years ago by Woa2.7k

I messed up the formatting. The ">" symbols at the beginning of the Fasta headers are not shown for some reason.

ADD REPLYlink written 7.6 years ago by Woa2.7k

Just indent with 4 spaces, otherwise ">" is interpreted as blockquote.

ADD REPLYlink written 7.6 years ago by Neilfws48k
0
4.2 years ago by
oigl60
oigl60 wrote:

You can open the merged sequences in the UGENE Sequence View. To do it:

  1. Select "File>Open as" in the main UGENE menu.
  2. Select the "FASTA" format.
  3. Select "Merge sequences into a single sequence to show in the sequence viewer".

By default, UGENE will show the sequence itself, the complementary sequence and translations.

If required, you can export the merged sequence into a new file.

ADD COMMENTlink written 4.2 years ago by oigl60
0
4.2 years ago by
France
sayuj.koyyappurath0 wrote:

Hi

Hope this works

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
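
Note that this awk command sets the record separator to ">" and prints each record as the header, a tab, and then the joined sequence, all on one line. A rough Python sketch of the same record-splitting idea is shown here; the function name fasta_to_tab is made up for illustration, it reads the whole text into memory, and, like the awk version, it assumes that ">" never appears inside a header.

```python
def fasta_to_tab(text):
    """Split FASTA text on '>' and return one 'header<TAB>sequence' string per record.

    (fasta_to_tab is a hypothetical name; assumes '>' only occurs at record starts.)
    """
    rows = []
    for chunk in text.split(">")[1:]:       # element 0 is whatever precedes the first '>'
        header, _, body = chunk.partition("\n")
        sequence = "".join(body.split())    # drop newlines and any other whitespace
        rows.append(">" + header.rstrip("\r") + "\t" + sequence)
    return rows


for row in fasta_to_tab(">seq1 test\nACGT\nTTGA\n>seq2\nGGCC\n"):
    print(row)
```

To get two-line FASTA rather than a tab-separated table, replace the "\t" in the function with "\n".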

ADD COMMENTlink written 4.2 years ago by sayuj.koyyappurath0
0
4.1 years ago by
zeeefa80
Sverige
zeeefa80 wrote:

You may have found a solution already, but here's a quick method: Open your file in a text editor (Notepad++?), find "\r\n" and replace it with blank, then type ">" in 'find what' and replace it with \r\n :) Hope it works/helps.

 

ADD COMMENTlink written 4.1 years ago by zeeefa80
2

I do not think this is good advice, for several reasons (even if it can work in a few cases):

- FASTA files are sometimes BIG, like 20 GB, containing millions of records. Not sure whether Notepad will survive this.

- (I think) This forum/site is more about learning/developing some coding skills while getting some help.

You should learn from the answers above ;-)

ADD REPLYlink written 4.1 years ago by toni2.1k

they asked for a quick method so I just posted what I knew off the top of my head really :( But you're right! Thank you :)

ADD REPLYlink written 4.1 years ago by zeeefa80