Question: How to extract the last 1000 nt from a group of sequences in a FASTA file?
0
gravatar for Andrei
4 months ago by
Andrei10
Andrei10 wrote:

Hi everyone!

I would like to know how I can extract the 1000 last nucleotides from sequence in a FASTA file and their headers using linux command line.

There are about 2000 sequence in this FASTA file and need to extract the last 1000 nucleotides from each sequence keeping their original headers.

I am a beginner at bioinformatics and I am still learning how to manipulate FASTA files, and I'm stuck with this problem and I can not solve it.

Thanks!

sequence • 357 views
ADD COMMENTlink modified 4 months ago by Pierre Lindenbaum121k • written 4 months ago by Andrei10
1

See this post, which uses awk to do some similar: https://stackoverflow.com/questions/30264809/trim-first-n-bases-in-multi-fasta-file-with-awk-and-print-with-max-width-format

The command in the last code block should be somewhat easy to modify. Instead of using the left_trim argument to trim the first 1000 bases as in the example, you could modify the script to keep the last 1000 bases (something like replacing start=left_trim+1; for strat=length(rec) - left_trim + 1). Try it out and test on some examples.

ADD REPLYlink modified 4 months ago • written 4 months ago by manuel.belmadani870
1

I could do it! I'm very thankful for your help.

ADD REPLYlink written 4 months ago by Andrei10

Could you please paste here the explanation that you did about sed syntax?

ADD REPLYlink written 4 months ago by Andrei10
1

I removed it because it didn't handle sequences that spanned multiple lines, which is common for FASTA files. sed works line by line, so awk was more appropriate in that use case. But the command was cat file.txt | sed 's|^[^>].*\(.{1000}\)|\1|g, if you're intrested. Briefly, sed 's|PATTERN|REPLACEMENT|g' will match lines that have PATTERN and substitute it with REPLACEMENT. The pattern is a regular expression ^[^>].*\(.{1000}\), which means from the a line that begins (^) with anything except a > (^[^>]) followed by anything of unlimited length .* until you can't match anything but 1000 characters of anything (.{1000}, and see greedy matching which applies here.). The last part is wrapped in parenthesis (\(.{1000}\), or (GROUP)) where GROUP is a part of the regex you want to keep for later. The REPLACEMENT part just says "replace with group 1 " (\1). The backslashes before { } ( ) are there to say you don't want to match those characters, instead they are part of the regular expression syntax. I'd highly recommend learning more about sed and regexes for bioinformatics, they're really useful and widely used.

ADD REPLYlink modified 4 months ago • written 4 months ago by manuel.belmadani870
1

seqkit subseq supports this, if you want a fast solution.

seqkit subseq -r -1000:-1 seqs.fa > result.fa

If you want to learn programming, write some Python scripts using Biopython.

ADD REPLYlink modified 4 months ago • written 4 months ago by shenwei3564.7k
1
gravatar for 5heikki
4 months ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

This requires Biopython. Save it as extractEnds.py and chmod +x extractEnds.py to make it executable and then just run it

#!/usr/bin/python

import sys
from sys import argv
from Bio import SeqIO

if (len(sys.argv) < 2):
    print("Usage: extractEnds.py file.fna")
    sys.exit()
else:
    FNAME = argv[1]
    IFILE = open(FNAME, "rU")

for record in SeqIO.parse(IFILE, "fasta"):
    print(">%s\n%s" % (record.description, record.seq[-1000:]))

IFILE.close()

I'm quite a Python noob so if there's a way to make this better or there are mistakes, please let me know..

ADD COMMENTlink modified 4 months ago • written 4 months ago by 5heikki8.4k
2

No need for open(FNAME, "rU") or IFILE.close(). SeqIO handles all opening and closing itself.

You can modify this simply to:

for record in SeqIO.parse(sys.argv[2], "fasta"):
ADD REPLYlink modified 4 months ago • written 4 months ago by jrj.healey12k
1

Thanks. What's the difference between (argv[1], ..) and (sys.argv[1], ..)?

edit. nevermind, found it, it just depends on what gets imported in the start..

edit2. For years I sticked with epic GNU coreutil oneliners, but I'm starting to see the advantages of biopython. Especially if you write scripts that someone else inherits some day.. they will be much happier if you used biopython

ADD REPLYlink modified 4 months ago • written 4 months ago by 5heikki8.4k
3

If you want it to be robust, or you want the code to be used by anyone but yourself (especially if they're a novice), a dedicated parser like BioPython which covers lots and lots of edge cases is worth its weight in gold.

You could write that code in a reasonably concise one-liner too (this one also has the benefit of taking a second CLI arg for any length of sequence):

python3 -c "import sys; from Bio import SeqIO; [print(f'>{r.id}\n{r.seq[-int(sys.argv[2]):]}') for r in SeqIO.parse(sys.argv[1],'fasta')] ;" seqs.fa 1000

(Python 3 only because of a print-inside-list-comprehension and use of fstrings.
ADD REPLYlink modified 4 months ago • written 4 months ago by jrj.healey12k
1
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

linearize and use a second awk to extract the last 100 characters.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fasta |\
awk -F '\t' '{x=100;L=length($2);printf("%s\n%s\n",$1,(L<=x?$2:substr($2,1+L-x,x)));}'
ADD COMMENTlink written 4 months ago by Pierre Lindenbaum121k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 844 users visited in the last hour