Multiline Fasta To Single Line Fasta
12.8 years ago
Palu ▴ 250

I have a FASTA file in the following format:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFY
RTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNE
ECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHL
DVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAAD
EEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAK
QLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKE
PAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYR
TIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDE
CKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLD
KILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEAR
RLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSS
KPSTPSTPASKRKVGCIIYLFLYF

but I want each sequence on a single line, not wrapped across many lines as it is now. Is there a quick method?

fasta • 97k views
12.8 years ago

Using awk:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

Edit: for Window$.

:-)


Suggestion for Windows --> Switch to Linux :p


Good solution but be careful. If you redirect the result to a file,

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > out.fa

the first line is left empty.


+1 for the windows fix :-)


There will be an empty line at the beginning; it can be removed with: tail -n +2 filein.fa > fileout.fa


my bad I just saw it :)


...the link is dead and I still get the first line empty after redirecting the output into a file - could you update the answer if there is a more elegant way (not piping through tail) to avoid this?


I modified the original awk code from @Pierre Lindenbaum as follows:

awk '/^>/ { if(NR>1) print "";  printf("%s\n",$0); next; } { printf("%s",$0);}  END {printf("\n");}'

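It uses NR (the number of records, here lines, read so far) to print the separating newline only before the second and later FASTA headers, so the output no longer starts with an empty line. Not the most elegant, but I hope this will help someone out there.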
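
For anyone who prefers Python, the same idea can be sketched roughly as below. This is only a minimal sketch, not a drop-in replacement for the awk command: the function name unwrap_fasta is made up for illustration, and it assumes blank lines in the input can be ignored. Instead of checking a line counter, it emits the joined sequence only when a record is complete, so no leading empty line is produced.

```python
def unwrap_fasta(lines):
    """Yield output lines with each record's sequence joined onto one line.

    `lines` is any iterable of text lines (for example an open file).
    The name `unwrap_fasta` is hypothetical, chosen for this sketch.
    """
    chunks = []  # sequence pieces of the record currently being read
    for raw in lines:
        line = raw.rstrip("\r\n")  # also copes with Windows line endings
        if line.startswith(">"):
            if chunks:
                # A new header closes the previous record
                yield "".join(chunks)
                chunks = []
            yield line
        elif line:
            chunks.append(line)
    if chunks:
        # Flush the last record
        yield "".join(chunks)


if __name__ == "__main__":
    wrapped = [">seq1 demo\n", "MGIK\n", "GLTG\n", ">seq2\n", "ACGT\n"]
    print("\n".join(unwrap_fasta(wrapped)))
```

With a real file this could be used as `for out in unwrap_fasta(open("file.fa")): print(out)`.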


you can use the answer provided by Jorge Amigo below. (it's not using tail)


Unfortunately I am a Windows user. Please suggest something for Windows.

9.4 years ago

here is a quick and simple perl one-liner:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta > out.fasta

which will output an empty first line. you could use tail (which is faster than sed) to remove it:

perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta

EDIT: even easier (do not use!):

perl -pe 'chomp unless /^>/' in.fasta > out.fasta

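EDIT2: this last one-liner does not work as expected: it chomps the sequence line just before each header, so the header ends up glued to the end of the previous sequence. use this one instead, which does everything in a single perl call by using perl's $. internal line counter variable to skip the extra newline on the first line only: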

perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' in.fasta > out.fasta

Thank you!! This worked perfectly!


Is there a simple way to embed this one-liner into a script that just takes a fasta file as input? I tried doing this myself, but I have no idea how to actually write perl scripts. Normally I would just embed this into a system call from an R script, but all the quotes are throwing me off.


just create a simple script.pl file containing this (do not use!)

while (<>) { chomp unless /^>/; print }

EDIT: the previous code is wrong. use this one into script.pl instead:

while (<>) { $. > 1 and /^>/ ? print "\n" : chomp; print }

and run it

perl script.pl <in.fasta >out.fasta

(for future reference:) Sorry to tell but the 'EDIT' version does not work as expected.

Problem is that it will have chomped the line previous to /^>/ and as such will add the header line to the previous sequence line. the other version works perfectly though.


thanks for pointing it out. I've corrected my previous answer and tested thoroughly the new one.

12.8 years ago
toni ★ 2.2k

Also a quick & dirty solution with Perl... (fa2oneline.pl)

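#!/usr/bin/perl -w
use strict;

my $input_fasta=$ARGV[0];
# Three-argument open with a lexical filehandle
open(my $in, '<', $input_fasta) || die ("Error opening $input_fasta $!");

# Print the first header as-is (it keeps its own newline)
my $line = <$in>;
print $line;

while ($line = <$in>)
{
    chomp $line;
    # Note: this pattern only matches headers that start with ">gi";
    # use m/^>/ to match any FASTA header
    if ($line=~m/^>gi/) { print "\n",$line,"\n"; }
    else { print $line; }
}

print "\n";
close($in);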

then run:

perl fa2oneline.pl sample.fa > out.fa

Result :

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK
>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

The regex needs to be changed?

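$line=~m/^>gi/ only matches headers that begin with >gi, because the gi is part of the pattern there. It should be $line=~m/^>/ to match any header.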


sorry tony, thank you for this great help


this still doesnt seem to work, even with the regex change.

12.8 years ago

Biopieces (www.biopieces.org) is another way:

read_fasta -i file.fna | write_fasta -x

Cheers,
Martin

7.4 years ago
adhil.md ▴ 40

Python version to convert multi-line to two-line fasta format. It also converts multiple files. Output directory will have all the files with _twoline.fasta as suffix.

from Bio import SeqIO
import os
import re
import argparse

def multi2linefasta(indir,outdir,filelist):
    for items in filelist:
        # Output name: input name up to the first dot, plus the _twoline.fasta suffix
        mfasta = os.path.join(outdir, re.sub(r'\..*','',items)+'_twoline.fasta')
        # Plain 'r' mode: the old 'rU' mode is no longer accepted by Python 3.11+
        with open(os.path.join(indir,items),'r') as ifile, open(mfasta, 'w') as ofile:
            for record in SeqIO.parse(ifile, "fasta"):
                sequence = str(record.seq)
                # record.id is the header up to the first whitespace;
                # use record.description instead to keep the full header line
                ofile.write('>'+record.id+'\n'+sequence+'\n')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert multiple files of multi-line fasta into two-line fasta format')
    parser.add_argument('-i',type=str,dest='ind',required=True,help="Input directory where all the fasta files are present")
    parser.add_argument('-o',type=str,dest='outd',required=True,help="Output directory")
    parser.add_argument('-f',type=str,dest='ffiles',required=True,help="Comma separated fasta file names without spaces")
    args = parser.parse_args()
    print (args)
    filelist = args.ffiles.split(',')
    if not os.path.exists(args.outd):
        os.makedirs(args.outd)
    multi2linefasta(args.ind,args.outd,filelist)

To run, save the above code to convertfasta.py file

python convertfasta.py -i "/path/to/inputfolder/" -o "/path/to/outputfolder/" -f "file1.fasta,file2.fasta,file3.fasta"

I would like to give you some suggestions or comments...

  • Use Biopython for parsing fasta files to avoid assumptions about the format. When a parser exists, use it. It will make your code quicker and shorter.
  • Use the os module for handling directories and paths
  • Use the sys module to get input as a python script rather than running this function interactively
  • Use the with open(file) as ofile syntax to handle opening of files

Thank You ...................................


Thank you, it also worked for me. Best,

6.2 years ago
teckpor ▴ 30

It seems that seqtk (https://github.com/lh3/seqtk) can be used for this task, although the webpage only mentions multiline fastq. The command to give is simply (I am using Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz, but one can use any fasta or gzipped fasta):

seqtk seq -l0 Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | gzip > Homo_sapiens.GRCh37.dna.primary_assembly.singleLines.fa.gz

For a quick proof that this works, try:

zcat Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz | head -n10 | seqtk seq -l0 | cat -A
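
For larger files, eyeballing the `$` line ends from cat -A gets tedious, so a scripted check may be handier. The following is only a sketch under my own assumptions: the function name is_single_line_fasta is made up, blank lines are ignored, and a header with no sequence at all is still counted as fine.

```python
def is_single_line_fasta(lines):
    """Return True if no record has more than one sequence line.

    `lines` is any iterable of text lines. The function name is
    hypothetical, chosen for this sketch.
    """
    seq_lines_in_record = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines do not count
        if line.startswith(">"):
            seq_lines_in_record = 0  # a new record starts here
        else:
            seq_lines_in_record += 1
            if seq_lines_in_record > 1:
                return False  # sequence is wrapped over several lines
    return True


if __name__ == "__main__":
    print(is_single_line_fasta([">a\n", "ACGT\n", ">b\n", "TT\n"]))
    print(is_single_line_fasta([">a\n", "AC\n", "GT\n"]))
```

The first call in the demo checks an already unwrapped input and the second a wrapped one.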
12.8 years ago

Kent source's faToTab. I'm sure EMBOSS has something for this as well. If you are on a Windows machine, I'd just use Galaxy's fasta2tab.

12.8 years ago
Woa ★ 2.9k
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new(-file => "myseq.fasta" , '-format' => 'Fasta');

while ( my $seq = $in->next_seq ) {
    print ">",$seq->id()," ",$seq->desc(),"\n",$seq->seq(),"\n";
}

Output:

>gi|321257144|ref|XP_003193485.1| flap endonuclease [Cryptococcus gattii WM276]
MGIKGLTGLLSENAPKCMKDHEMKTLFGRKVAIDASMSIYQFLIAVRQQDGQMLMNESGDVTSHLMGFFYRTIRMVDHGIKPCYIFDGKPPELKGSVLAKRFARREEAKEGEEEAKETGTAEDVDKLARRQVRVTREHNEECKKLLSLMGIPVVTAPGEAEAQCAELARAGKVYAAGSEDMDTLTFHSPILLRHLTFSEAKKMPISEIHLDVALRDLEMSMDQFIELCILLGCDYLEPCKGIGPKTALKLMREHGTLGKVVEHIRGKMAEKAEEIKAAADEEAEAEAEAEKYDSDPENEEGGETMINSDGEEVPAPSKPKSPKKKAPAKKKKIASSGMQIPEFWPWEEAKQLFLKPDVVNGDDLVLEWKQPDTEGLVEFLCRDKGFNEDRVRAGAAKLSKMLAAKQQGRLDGFFTVKPKEPAAKDAGKGKGKDTKGEKRKAEEKGAAKKKTKK

>gi|321473340|gb|EFX84308.1| hypothetical protein DAPPUDRAFT_47502 [Daphnia pulex]
MGIKGLTQVIGDTAPTAIKENEIKNYFGRKVAIDASMSIYQFLIAVRSEGAMLTSADGETTSHLMGIFYRTIRMVDNGIKPVYVFDGKPPDMKGGELTKRAEKREEASKQLVLATDAGDAVEMEKMNKRLVKVNKGHTDECKQLLTLMGIPYVEAPCEAEAQCAALVKAGKVYATATEDMDSLTFGSNVLLRYLTYSEAKKMPIKEFHLDKILDGLSYTMDEFIDLCIMLGCDYCDTIKGIGAKRAKELIDKHRCIEKVIENLDTKKYTVPENWPYQEARRLFKTPDVADAETLDLKWTQPDEEGLVKFMCGDKNFNEERIRSGAKKLCKAKTGQTQGRLDSFFKVLPSSKPSTPSTPASKRKVGCIIYLFLYF

I messed up the formatting: the ">" symbols at the beginning of the FASTA headers are not shown for some reason.


Just indent with 4 spaces, otherwise ">" is interpreted as blockquote.

9.4 years ago
oigl ▴ 60

You can open the merged sequences in the UGENE Sequence View. To do it:

  1. Select "File>Open as" in the main UGENE menu.
  2. Select the "FASTA" format.
  3. Select "Merge sequences into a single sequence to show in the sequence viewer".

By default, UGENE will show the sequence itself, the complementary sequence and translations. It looks like here.

If required, you can export the merged sequence into a new file.

9.4 years ago

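Hi,

Hope this works. Note that it prints each record on a single line, with the header and the sequence separated by a tab: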

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
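
The record-splitting approach used by RS=">" could look roughly like this in Python. Treat it as a hedged sketch: split_records is a made-up name, the whole input is read into memory (so it is not meant for whole genomes), and unlike the awk version it only splits on ">" at the start of a line, which is my own choice to protect headers that contain ">" in their description.

```python
def split_records(text):
    """Split FASTA text into (header, sequence) pairs.

    The name `split_records` is hypothetical. Prepending a newline lets
    "\n>" act as the record separator, so only a ">" that starts a line
    begins a new record.
    """
    pairs = []
    for record in ("\n" + text).split("\n>")[1:]:
        header, _, body = record.partition("\n")
        # Drop every newline (and any other whitespace) inside the sequence
        pairs.append((header.rstrip("\r"), "".join(body.split())))
    return pairs


if __name__ == "__main__":
    demo = ">seq1 demo\nMGIK\nGLTG\n>seq2\nACGT\n"
    for header, seq in split_records(demo):
        print(f">{header}\t{seq}")   # tab-separated, one record per line
        print(f">{header}\n{seq}")   # two-line FASTA
```

Keeping the pairs separate from the printing makes it easy to choose between the tab-separated layout and plain two-line FASTA.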
2.1 years ago
Amirosein ▴ 70

Hi

Short answer:

Use seqtk as follows:

$ seqtk seq multi-line.fasta > single-line.fasta

Explained:

A toy example in multi-line format:

$ head -n5 celegans_chr1.fa
>gi|449020133|emb|BX284601.5| Caenorhabditis elegans Bristol N2 genomic chromosome, I
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT
AAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC
CTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT

How to easily convert to single-line fasta (I only print head -c 170 as it would print the whole chromosome otherwise):

$ seqtk seq celegans_chr1.fa | head -c 170
>gi|449020133|emb|BX284601.5| Caenorhabditis elegans Bristol N2 genomic chromosome, I
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA

So you can simply redirect the stdout to a file as follows:

$ seqtk seq celegans_chr1.fa > celegans_chr1_single-line.fa

You may also use a file compressor like gzip to compress the file.
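
If seqtk is not available, a rough Python sketch that reads and writes gzipped or plain FASTA might look like this. The names open_maybe_gzip and unwrap_file are hypothetical, compression is guessed from a ".gz" file name ending (an assumption of this sketch), and blank lines are dropped. It writes sequence pieces as they arrive, so a whole chromosome never has to sit in memory.

```python
import gzip


def open_maybe_gzip(path, mode):
    """Open `path` in text mode, through gzip if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def unwrap_file(src, dst):
    """Copy FASTA from `src` to `dst` with each sequence on one line."""
    with open_maybe_gzip(src, "rt") as fin, open_maybe_gzip(dst, "wt") as fout:
        have_seq = False  # True while a sequence line is still unterminated
        for raw in fin:
            line = raw.rstrip("\r\n")
            if line.startswith(">"):
                if have_seq:
                    fout.write("\n")  # end the previous sequence line
                    have_seq = False
                fout.write(line + "\n")
            elif line:
                fout.write(line)  # no newline yet: more pieces may follow
                have_seq = True
        if have_seq:
            fout.write("\n")
```

Usage would be something like `unwrap_file("celegans_chr1.fa.gz", "celegans_chr1_single-line.fa.gz")`, where the ".gz" file names are only examples.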
