Question: Convert multiple-sequence fasta file to single long sequence
aberry814 (United States) wrote, 4.7 years ago:

I have a FASTA file containing millions of sequences, and I want a simple script to convert it into one long sequence, i.e., delete all headers and remove any spaces and line breaks. I can always add a ">seq_name" header to the first line afterwards, so keeping the top header is not necessary.

I've searched the forums but can only find scripts that do the reverse. I'm using millions of reads as a substitute for a complete genome, and my current pipeline cannot reconcile this, so I want to trick it into thinking that this is one long genome sequence.

Thanks for any help!!!


Homework?

grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print ">Sequence_name\n" } { print }' > new.fasta
modified 4 months ago by RamRS • written 4.7 years ago by geek_y

Haha, not homework. Actual work being attempted by a below-average programmer (me).

This appears to work perfectly, thanks!

written 4.7 years ago by aberry814

This removes all line breaks as well.

written 4.7 years ago by geek_y
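For anyone who prefers a single awk pass, here is a minimal sketch of the same idea as the grep/awk one-liner above. Unlike that one-liner, it also strips spaces, tabs, and carriage returns inside sequence lines and ends the output with a newline. The function name fasta_to_single and the default header seq_name are made up for illustration.

```shell
# Hypothetical helper: read multi-FASTA on stdin, write one record to stdout.
# $1 is the header name to use (defaults to "seq_name").
fasta_to_single() {
    awk -v name="${1:-seq_name}" '
        BEGIN { printf ">%s\n", name }   # one header for the merged record
        /^>/  { next }                   # drop every original header line
        {
            gsub(/[ \t\r]/, "")          # remove spaces, tabs, carriage returns
            printf "%s", $0              # append without a line break
        }
        END   { printf "\n" }            # finish with a single trailing newline
    '
}
```

Usage would look like `fasta_to_single seq_name < test.fasta > new.fasta`.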

Hi guys, I also want to remove the line breaks in a multi-line FASTA file, but I can't get it to work. Can anyone clarify this for me? I am very new to bioinformatics. Thanks in advance.

modified 2.3 years ago • written 2.3 years ago by 2013149180

using seqkit:

$ seqkit seq -w0 input.fa

Please move your question to a new post, and try one or all of the solutions provided above before posting.

modified 2.3 years ago • written 2.3 years ago by cpad0112
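If seqkit is not installed, unwrapping each record (keeping every header, but putting each sequence on one line) can be sketched in plain awk. This is only an approximation of what `seqkit seq -w 0` produces, and the function name unwrap_fasta is invented for this example.

```shell
# Hypothetical helper: turn multi-line FASTA records into
# one header line plus one sequence line per record.
unwrap_fasta() {
    awk '
        { sub(/\r$/, "") }                  # tolerate Windows line endings
        /^>/ {
            if (NR > 1) printf "\n"         # close the previous sequence line
            print                           # header stays on its own line
            next
        }
        { printf "%s", $0 }                 # join sequence lines together
        END { if (NR > 0) printf "\n" }     # terminate the last sequence
    '
}
```

Usage would look like `unwrap_fasta < input.fa > unwrapped.fa`.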
kloetzl (European Union) wrote, 4.7 years ago:
$ cat multi.fasta | grep -v '^>' | grep '^.' | tr -d '[:blank:]' | cat <( echo '>seq_name') - > all.fasta
modified 4 months ago by RamRS • written 4.7 years ago by kloetzl

Thanks! This works well except it doesn't delete the line breaks (easy enough to do after the fact.)

ADD REPLYlink written 4.7 years ago by aberry81460

This way it deletes the newlines as well:

cat multi.fasta | grep -v '^>' | grep '^.' | tr -d '[:blank:]' | tr -d '\n' | cat <( echo '>seq_name') - > multi_concat.fasta
modified 4 months ago by RamRS • written 6 months ago by chefarov
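One way to check that a concatenation dropped only headers and whitespace is to compare the number of sequence characters before and after. The sketch below assumes plain FASTA with headers starting with ">"; the function name count_bases is made up for illustration.

```shell
# Hypothetical helper: count sequence characters in a FASTA stream,
# ignoring header lines and all whitespace (including newlines).
count_bases() {
    grep -v '^>' | tr -d '[:space:]' | wc -c | tr -d ' '
}
```

If `count_bases < multi.fasta` and `count_bases < multi_concat.fasta` print the same number, no bases were lost or added along the way.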
Brian Bushnell (Walnut Creek, USA) wrote, 4.7 years ago:

An alternative, from BBTools:

fuse.sh in=sequences.fa out=fused.fa pad=0 fastawrap=2000000000

Note that this will fail when the length of the output sequence approaches 2 billion. But most programs will fail on unwrapped fasta lines exceeding 2Gbp anyway, so I don't really care about that. "pad" will put that many Ns in between discrete sequences.

It's generally better practice to write or use programs that can handle wrapped fasta, than to convert fasta to unwrapped before loading it. But there are always exceptions.

modified 4 months ago by RamRS • written 4.7 years ago by Brian Bushnell
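To illustrate the padding and wrapping ideas without BBTools, here is a rough awk and fold sketch. It is not how fuse.sh is implemented; the function name fuse_padded and its three arguments (header name, number of Ns between records, line width) are assumptions made for this example.

```shell
# Hypothetical helper, not part of BBTools.
# $1 = header name, $2 = number of Ns between records, $3 = wrap width.
fuse_padded() {
    printf '>%s\n' "${1:-seq_name}"
    awk -v pad="${2:-0}" '
        BEGIN { for (i = 0; i < pad; i++) ns = ns "N" }   # build the pad string
        /^>/  { if (seen) need_pad = 1; next }            # record boundary
        {
            gsub(/[ \t\r]/, "")                           # strip stray whitespace
            if ($0 == "") next                            # skip empty lines
            if (need_pad) { printf "%s", ns; need_pad = 0 }
            printf "%s", $0
            seen = 1                                      # at least one record written
        }
        END { printf "\n" }
    ' | fold -w "${3:-60}"                                # rewrap to fixed width
}
```

Usage would look like `fuse_padded seq_name 10 60 < sequences.fa > fused.fa`. Padding is written only between records that actually contain sequence, so empty records do not produce extra runs of Ns.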
Malcolm.Cook (Kansas, USA) wrote, 4.7 years ago:

Here's a perl one-liner:

perl -n -e 'print if 1 == $. || ! m/^>/' test.fa > out.fa

or, to stream edit destructively in-place:

perl -n -i -e 'print if 1 == $. || ! m/^>/' test.fa

This also does not delete newlines or whitespace, but is that really needed by your downstream process?

modified 4 months ago by RamRS • written 4.7 years ago by Malcolm.Cook