Question: FASTA file of fixed length
2
gravatar for waqasnayab
4.0 years ago by
waqasnayab200
Pakistan
waqasnayab200 wrote:

Hi,

I have a FASTA file like this:

>1
TCAAGAGGGGTGAATGTGTTTCGCATGCACAAGGGACAGGAGTCT
>2
ATCAGAGCTGGTGGGGTGGAGAGACAGAAACAAGTGGGAGAAGGT
>3
TTATACCTACCTTATAGATAAGGAAATTGAAGCTTATAGAGTTTA
>4
ATTTTTCCTTATGATACTCTATTGCCTCTCCATGGATAAAGACAG
>5
AAACTCCTGACCTCAGGTGATCCACCTGCCTCGGCCTCCCAAAGT
>6
TGCACACCTTCAGAACTGTGAACCAAATAAACCTCTCTTCTTTAAAATTATTCATCCTCT
GGTATTCCTTTATAACAA
>7
CTCTTGATGTCATTTCACTTCGGATTCTTCTTTAGAAAACTTGGA

Every sequence has fixed length of 45nt but some sequences like sequence no .6 has more length. There are some more sequences like sequence no. 6. I want to make each sequence of fixed length, that is, 45. I can truncate the sequence from 3' end. Is there any tool or awk or sed command?

Thanks,

Waqas.

sequence fasta • 3.1k views
ADD COMMENTlink modified 4.0 years ago by 5heikki8.6k • written 4.0 years ago by waqasnayab200

My apologies for bad formatting, assume it as:

>1
seq
>2
seq
.
.
.
.

and so on

ADD REPLYlink modified 4.0 years ago by John12k • written 4.0 years ago by waqasnayab200
8
gravatar for kloetzl
4.0 years ago by
kloetzl1.1k
European Union
kloetzl1.1k wrote:
cat foo.fasta | awk -v 'RS=>' 'NR>1{print ">" $1; printf("%.45s\n", $2)}'
ADD COMMENTlink written 4.0 years ago by kloetzl1.1k
5

Nice. Just note that this does not work with multi-line fasta if you want to trim more than one line.

ADD REPLYlink written 4.0 years ago by lh332k

even shorter: cat foo.fasta | awk -v 'RS=>' 'NR>1{printf(">%s\n%.45s\n",$1,$2)}'

ADD REPLYlink written 4.0 years ago by kloetzl1.1k
2
gravatar for matted
4.0 years ago by
matted7.2k
Boston, United States
matted7.2k wrote:

Why reinvent the wheel? Install FASTX-toolkit and run:

fasta_formatter -i input.fa | fastx_trimmer -l 45 > output.fa
ADD COMMENTlink written 4.0 years ago by matted7.2k
2

I dont want to be a debbie-downer the whole time, but in defense of reinventing wheels, I tried this because I thought it might be nice to have some basic fastx tools on my system.

First I had to compile and install libgtextutils-0.7, because, you know, dependencies are awesome and everyone should have one. Then I tried to compile FASTX toolkit and got this:

enter image description here

So i was like, ugh, fine i'll do the whole apt-get thing to get the dependency of the dependency, then when i ran the above commands (which requires 2 separate programs piped together..?) and then I got this:

enter image description here

Now i'm just feeling really stupid - and maybe I am - I fell off my bike recently and hit my head... but I think reinventing the wheel is often easier than installing bioinformatics programs, learning how they work, their parameters, etc etc. Particularly when a sed/awk/python/ruby/perl/java script is so easy to copy-paste, and in the case of python/ruby, very easy for a newbie to read and know how it works. I like that the python does some checking (which is why i put it in the thread), and I bet fastx tools does even more and probably a lot better checking/sanitation that the python code above does, but still, and this isn't a criticism of you at all Matt, I really beg whomever is writing these tools to think about dumb Biologists like me trying to run their code with no knowledge of any of this stuff.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by John12k

The second problem is probably a bug within the debian package. That really should get fixed. I will have another go at it tomorrow.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by kloetzl1.1k
1

HI, I am the one who uploaded the fastx-toolkit package version 0.0.14-1 to Debian. On my machine the trimming commands above just work. Maybe in John's environment there was some cruft from the failed manual installation ?

ADD REPLYlink written 4.0 years ago by Charles Plessy2.7k

You need to copy .so file to your system path, or to set LD_LIBRARY_PATH, or better, to use use its precompiled statically linked binary available from its website. That said, I also think fastx-toolkit is over engineered. The separation of libgtextutils is both unnecessary and adding overhead. I discourage the use of this tool on sequencing data as it is often ~10X slower than a fast implementation. Even a naive perl/python implementation is faster.

ADD REPLYlink written 4.0 years ago by lh332k
1

I would've preferred to give an answer using seqtk, but I couldn't find an option for absolute trimming.

ADD REPLYlink written 4.0 years ago by matted7.2k
2

Just added yesterday: seqtk trimfq -L45 in.fa

ADD REPLYlink written 4.0 years ago by lh332k
2
gravatar for Charles Plessy
4.0 years ago by
Charles Plessy2.7k
Japan
Charles Plessy2.7k wrote:

The grand old EMBOSS does it for you with a simple command, seqret. Imagine that the sequences you posted are in a file called toto.fa, you can truncate all these sequences to 45 base using Uniform Sequence Adresses.

seqret -filter toto.fa[:45]
>1
TCAAGAGGGGTGAATGTGTTTCGCATGCACAAGGGACAGGAGTCT
>2
ATCAGAGCTGGTGGGGTGGAGAGACAGAAACAAGTGGGAGAAGGT
>3
TTATACCTACCTTATAGATAAGGAAATTGAAGCTTATAGAGTTTA
>4
ATTTTTCCTTATGATACTCTATTGCCTCTCCATGGATAAAGACAG
>5
AAACTCCTGACCTCAGGTGATCCACCTGCCTCGGCCTCCCAAAGT
>6
TGCACACCTTCAGAACTGTGAACCAAATAAACCTCTCTTCTTTAA
>7
CTCTTGATGTCATTTCACTTCGGATTCTTCTTTAGAAAACTTGGA

EMBOSS is great. It even has SAM support despite its age.

ADD COMMENTlink written 4.0 years ago by Charles Plessy2.7k
2

And Emboss is easy to install on Debian:

apt-get install emboss

ADD REPLYlink written 4.0 years ago by piet1.7k
1
gravatar for Alex Reynolds
4.0 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

I deleted my previous answer, but I'll type it in again since you have multi-line input.

  1. "Linearize" your multi-line FASTA input, to make it single-line FASTA
  2. Use substr() in awk to truncate the sequence to 45 characters

Here's a one-liner to do it all, with each step on its own line so you can see what's going on:

$ sed -e 's/\(^>.*$\)/#\1#/' input.fa \
    | tr -d "\r" \
    | tr -d "\n" \
    | sed -e 's/$/#/' \
    | tr "#" "\n" \
    | sed -e '/^$/d' \
    | awk '{ if ($0 ~ /^>/) { print $0; } else { print substr($0, 1, 45); } }' \
    > output.fa

Should be very fast and gets around LD_LIBRARY_PATH DLL-hell etc. To give credit where due, I cribbed the linearization part from a previous answer by Frederic Mahe and just added the awk part at the end. Hopefully this helps.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Alex Reynolds29k
1
gravatar for 5heikki
4.0 years ago by
5heikki8.6k
Finland
5heikki8.6k wrote:

Linearize the file and then fold it:

awk '!/^>/ {printf "%s", $0; n = "\n"} /^>/ {print n $0; n = ""}' file.fasta | fold -w 45
ADD COMMENTlink written 4.0 years ago by 5heikki8.6k
0
gravatar for John
4.0 years ago by
John12k
Germany
John12k wrote:

In python for fun and profit:

import sys
import itertools
with open(sys.argv[-2],'r') as infile:
    with open(sys.argv[-1],'w') as outfile:
        try:
            line = next(infile)
            assert line.startswith('>')
            for counter in itertools.count(1):
                DNA = next(infile)[:45]
                assert len(DNA) == 45, 'FASTA entry %s contains DNA shorter than 45 characters!' % counter
                assert DNA.strip('ACGT') == '', 'FASTA entry %s contains characters other than ACGT for DNA?' % counter
                outfile.write(line + DNA + '\n')
                line = next(infile)
                while not line.startswith('>'): line = next(infile)
        except StopIteration:
            print 'All done processing %s fasta lines!' % counter

Saving it as whatever.py and run it with python whatever.py /path/to/input.fasta /path/to/output.fasta

It should give an error if something unexpected happens.

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by John12k
2

Out of curiosity and perhaps offtopic, but why do you use sys.argv[-2]? Is counting from the end better or safer?

ADD REPLYlink written 4.0 years ago by WouterDeCoster43k

One shouldn't really use argv ever to begin with (I was just being lazy, and i wanted to keep they code looking simple), but yeah I think its probably safer if you use argv to go from the right-end, eg. "time python ./the_filename.py input output", or if the user wants to load a module (-m cprofile...) as well, all that stuff usually comes before rather than after.

ADD REPLYlink written 4.0 years ago by John12k
1

I'm happy to learn :-) What would you then suggest to use instead of argv? By the way, I checked, and using something like "time python script.py stuff blabla" does not alter sys.argv when compared to "python script.py stuff blabla".

ADD REPLYlink written 4.0 years ago by WouterDeCoster43k
1

Oh yes, you're quite right, anything before the "python" argv doesnt see - but modules loaded with -m like cprofile are still treated as arguments by argv. Instead of argv you should probably use either argparse or docopt depending on your style. Personally, I think positional arguments are always a bad idea - the user has a lot more flexibility when everything should be defined with position-independent parameter flags like --output ./somefile --input ./somefile. But again, thats just my style. Some people like remembering orders, some people like remembering parameter names.

ADD REPLYlink written 4.0 years ago by John12k
2

I think 'if' statements would be more appropriate than assertions in your code. Assertions are really for asserting things that you believe will always be true, not validating user input. They should only be failing when there's a bug in your code, whereas your assertions are going to fail on e.g. ambiguous nucleotides, RNA input etc. etc. https://wiki.python.org/moin/UsingAssertionsEffectively So you might prefer something like

if not condition:
    raise SomeException

or

if not condition:
    sys.stderr.write("Oh no!")
    sys.exit(1)

Or whatever. It makes it easier to understand what you're doing.

ADD REPLYlink written 4.0 years ago by gearoid200

Yup, you're right gearoid i should really have defined and raised an exception there, but as the argv issue I wanted to keep the code simple-looking. At first i used an if ... exit(), but then i thought it might be nicer for a beginner to see that such-and-such is an optional check, rather than a functional piece of code. I worry that beginners who see the code-golf'd sed/awk/perl examples which are always incredibly short and sweet think to themselves "no way will i ever be able to program like that!", and then miss out on a good opportunity to learn the basics. No one ever upvotes my python code when there's a shorter awk already in the thread, so I know it sounds like a sneaky politician's answer, but i really only write these code examples for people who are confused by awk, not to show off my python :)

ADD REPLYlink written 4.0 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1010 users visited in the last hour