FASTA file of fixed length
6
2
Entering edit mode
7.0 years ago
waqasnayab ▴ 230

Hi,

I have a FASTA file like this:

>1
TCAAGAGGGGTGAATGTGTTTCGCATGCACAAGGGACAGGAGTCT
>2
ATCAGAGCTGGTGGGGTGGAGAGACAGAAACAAGTGGGAGAAGGT
>3
TTATACCTACCTTATAGATAAGGAAATTGAAGCTTATAGAGTTTA
>4
ATTTTTCCTTATGATACTCTATTGCCTCTCCATGGATAAAGACAG
>5
AAACTCCTGACCTCAGGTGATCCACCTGCCTCGGCCTCCCAAAGT
>6
TGCACACCTTCAGAACTGTGAACCAAATAAACCTCTCTTCTTTAAAATTATTCATCCTCT
GGTATTCCTTTATAACAA
>7
CTCTTGATGTCATTTCACTTCGGATTCTTCTTTAGAAAACTTGGA


Every sequence has fixed length of 45nt but some sequences like sequence no .6 has more length. There are some more sequences like sequence no. 6. I want to make each sequence of fixed length, that is, 45. I can truncate the sequence from 3' end. Is there any tool or awk or sed command?

Thanks,

Waqas.

sequence fasta • 7.0k views
0
Entering edit mode

My apologies for bad formatting, assume it as:

>1
seq
>2
seq
.
.
.
.


and so on

8
Entering edit mode
7.0 years ago
kloetzl ★ 1.1k
cat foo.fasta | awk -v 'RS=>' 'NR>1{print ">" $1; printf("%.45s\n",$2)}'

6
Entering edit mode

Nice. Just note that this does not work with multi-line fasta if you want to trim more than one line.

0
Entering edit mode

even shorter: cat foo.fasta | awk -v 'RS=>' 'NR>1{printf(">%s\n%.45s\n",$1,$2)}'

3
Entering edit mode
7.0 years ago
matted 7.7k

Why reinvent the wheel? Install FASTX-toolkit and run:

fasta_formatter -i input.fa | fastx_trimmer -l 45 > output.fa

2
Entering edit mode

I dont want to be a debbie-downer the whole time, but in defense of reinventing wheels, I tried this because I thought it might be nice to have some basic fastx tools on my system.

First I had to compile and install libgtextutils-0.7, because, you know, dependencies are awesome and everyone should have one. Then I tried to compile FASTX toolkit and got this:

So i was like, ugh, fine i'll do the whole apt-get thing to get the dependency of the dependency, then when i ran the above commands (which requires 2 separate programs piped together..?) and then I got this:

Now i'm just feeling really stupid - and maybe I am - I fell off my bike recently and hit my head... but I think reinventing the wheel is often easier than installing bioinformatics programs, learning how they work, their parameters, etc etc. Particularly when a sed/awk/python/ruby/perl/java script is so easy to copy-paste, and in the case of python/ruby, very easy for a newbie to read and know how it works. I like that the python does some checking (which is why i put it in the thread), and I bet fastx tools does even more and probably a lot better checking/sanitation that the python code above does, but still, and this isn't a criticism of you at all Matt, I really beg whomever is writing these tools to think about dumb Biologists like me trying to run their code with no knowledge of any of this stuff.

0
Entering edit mode

The second problem is probably a bug within the debian package. That really should get fixed. I will have another go at it tomorrow.

1
Entering edit mode

HI, I am the one who uploaded the fastx-toolkit package version 0.0.14-1 to Debian. On my machine the trimming commands above just work. Maybe in John's environment there was some cruft from the failed manual installation ?

0
Entering edit mode

You need to copy .so file to your system path, or to set LD_LIBRARY_PATH, or better, to use use its precompiled statically linked binary available from its website. That said, I also think fastx-toolkit is over engineered. The separation of libgtextutils is both unnecessary and adding overhead. I discourage the use of this tool on sequencing data as it is often ~10X slower than a fast implementation. Even a naive perl/python implementation is faster.

1
Entering edit mode

I would've preferred to give an answer using seqtk, but I couldn't find an option for absolute trimming.

2
Entering edit mode

Just added yesterday: seqtk trimfq -L45 in.fa

2
Entering edit mode
7.0 years ago
Charles Plessy ★ 2.9k

The grand old EMBOSS does it for you with a simple command, seqret. Imagine that the sequences you posted are in a file called toto.fa, you can truncate all these sequences to 45 base using Uniform Sequence Adresses.

seqret -filter toto.fa[:45]
>1
TCAAGAGGGGTGAATGTGTTTCGCATGCACAAGGGACAGGAGTCT
>2
ATCAGAGCTGGTGGGGTGGAGAGACAGAAACAAGTGGGAGAAGGT
>3
TTATACCTACCTTATAGATAAGGAAATTGAAGCTTATAGAGTTTA
>4
ATTTTTCCTTATGATACTCTATTGCCTCTCCATGGATAAAGACAG
>5
AAACTCCTGACCTCAGGTGATCCACCTGCCTCGGCCTCCCAAAGT
>6
TGCACACCTTCAGAACTGTGAACCAAATAAACCTCTCTTCTTTAA
>7
CTCTTGATGTCATTTCACTTCGGATTCTTCTTTAGAAAACTTGGA


EMBOSS is great. It even has SAM support despite its age.

2
Entering edit mode

And Emboss is easy to install on Debian:

apt-get install emboss

1
Entering edit mode
7.0 years ago

I deleted my previous answer, but I'll type it in again since you have multi-line input.

1. "Linearize" your multi-line FASTA input, to make it single-line FASTA
2. Use substr() in awk to truncate the sequence to 45 characters

Here's a one-liner to do it all, with each step on its own line so you can see what's going on:

$sed -e 's/$$^>.*$$/#\1#/' input.fa \ | tr -d "\r" \ | tr -d "\n" \ | sed -e 's/$/#/' \
| tr "#" "\n" \
| sed -e '/^$/d' \ | awk '{ if ($0 ~ /^>/) { print $0; } else { print substr($0, 1, 45); } }' \
> output.fa


Should be very fast and gets around LD_LIBRARY_PATH DLL-hell etc. To give credit where due, I cribbed the linearization part from a previous answer by Frederic Mahe and just added the awk part at the end. Hopefully this helps.

1
Entering edit mode
7.0 years ago
5heikki 11k

Linearize the file and then fold it:

awk '!/^>/ {printf "%s", $0; n = "\n"} /^>/ {print n$0; n = ""}' file.fasta | fold -w 45

0
Entering edit mode
7.0 years ago
John 13k

In python for fun and profit:

import sys
import itertools
with open(sys.argv[-2],'r') as infile:
with open(sys.argv[-1],'w') as outfile:
try:
line = next(infile)
assert line.startswith('>')
for counter in itertools.count(1):
DNA = next(infile)[:45]
assert len(DNA) == 45, 'FASTA entry %s contains DNA shorter than 45 characters!' % counter
assert DNA.strip('ACGT') == '', 'FASTA entry %s contains characters other than ACGT for DNA?' % counter
outfile.write(line + DNA + '\n')
line = next(infile)
while not line.startswith('>'): line = next(infile)
except StopIteration:
print 'All done processing %s fasta lines!' % counter


Saving it as whatever.py and run it with python whatever.py /path/to/input.fasta /path/to/output.fasta

It should give an error if something unexpected happens.

2
Entering edit mode

Out of curiosity and perhaps offtopic, but why do you use sys.argv[-2]? Is counting from the end better or safer?

0
Entering edit mode

One shouldn't really use argv ever to begin with (I was just being lazy, and i wanted to keep they code looking simple), but yeah I think its probably safer if you use argv to go from the right-end, eg. "time python ./the_filename.py input output", or if the user wants to load a module (-m cprofile...) as well, all that stuff usually comes before rather than after.

1
Entering edit mode

I'm happy to learn :-) What would you then suggest to use instead of argv? By the way, I checked, and using something like "time python script.py stuff blabla" does not alter sys.argv when compared to "python script.py stuff blabla".

1
Entering edit mode

Oh yes, you're quite right, anything before the "python" argv doesnt see - but modules loaded with -m like cprofile are still treated as arguments by argv. Instead of argv you should probably use either argparse or docopt depending on your style. Personally, I think positional arguments are always a bad idea - the user has a lot more flexibility when everything should be defined with position-independent parameter flags like --output ./somefile --input ./somefile. But again, thats just my style. Some people like remembering orders, some people like remembering parameter names.

2
Entering edit mode

I think 'if' statements would be more appropriate than assertions in your code. Assertions are really for asserting things that you believe will always be true, not validating user input. They should only be failing when there's a bug in your code, whereas your assertions are going to fail on e.g. ambiguous nucleotides, RNA input etc. etc. https://wiki.python.org/moin/UsingAssertionsEffectively So you might prefer something like

if not condition:
raise SomeException


or

if not condition:
sys.stderr.write("Oh no!")
sys.exit(1)


Or whatever. It makes it easier to understand what you're doing.

0
Entering edit mode

Yup, you're right gearoid i should really have defined and raised an exception there, but as the argv issue I wanted to keep the code simple-looking. At first i used an if ... exit(), but then i thought it might be nicer for a beginner to see that such-and-such is an optional check, rather than a functional piece of code. I worry that beginners who see the code-golf'd sed/awk/perl examples which are always incredibly short and sweet think to themselves "no way will i ever be able to program like that!", and then miss out on a good opportunity to learn the basics. No one ever upvotes my python code when there's a shorter awk already in the thread, so I know it sounds like a sneaky politician's answer, but i really only write these code examples for people who are confused by awk, not to show off my python :)