Question: Extract N amino acids from fasta file
10 months ago, martha.chapa.mc18 wrote:

Hi, I want to extract the first N amino acids from each sequence in a FASTA file. I have sequences like this one:

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQAAAIANAPVK
QGVVHEPIDAGVMAGNVPAEQRNAASIVQSIDGSKLSQISDRLPKFIKQGSDEVVYGKHV
VVSKLGPEVIGLILDLIKAQPANRALLLAKLQAISNDGNPEASNFMGFVFEYGLFGAVKN

For example, I want only the first 30 amino acids of this sequence, like:

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQ

Is there a program that can do this for all sequences from the Linux terminal? I hope you can help me. Thank you.


You could convert to tabular format with seqkit and use the substring function from awk:

seqkit fx2tab file.fasta  | awk -v FS="\t" '{print ">"$1"\n"substr($2,1,30)}'
written 10 months ago by alex.zaccaron

seqkit subseq -r 1:30 is enough.

written 10 months ago by shenwei356

Be careful! This approach makes a lot of assumptions about the structure of the FASTA file.

written 10 months ago by Alex Reynolds

Yes, it does. Sorry, I thought the input file was tabular format. I updated the comment.

written 10 months ago by alex.zaccaron

10 months ago, finswimmer wrote:

Using seqkit:

$ seqkit subseq -r 1:30 input.fasta

10 months ago, yztxwd wrote:

awk '/^>/ {print; next} {print substr($0, 1, 30)}' test.fa

test.fa

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQAAAIANAPVKQGVVHEPIDAGVMAGNVPAEQRNAASIVQSIDGSKLSQISDRLPKFIKQGSDEVVYGKHVVVSKLGPEVIGLILDLIKAQPANRALLLAKLQAISNDGNPEASNFMGFVFEYGLFGAVKN

output

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQ
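
Note that the awk one-liner above truncates every non-header line, so it gives the expected result only when each sequence sits on a single line, as in test.fa. For a wrapped FASTA file like the one in the question, the sequence lines would need to be joined first. A minimal sketch in plain Python with no dependencies; `first_n_residues` is a hypothetical helper name, and it assumes headers start with ">" and that everything up to the next header belongs to the same record:

```python
def first_n_residues(lines, n=30):
    """Yield (header, prefix) pairs from FASTA lines.

    Sequence lines are joined before truncating, so records wrapped
    over several lines are handled the same way as single-line ones.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)[:n]
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)[:n]


def print_first_n(path, n=30):
    """Print the truncated records of the FASTA file at `path`."""
    with open(path) as handle:
        for header, prefix in first_n_residues(handle, n):
            print(header)
            print(prefix)
```

Calling `print_first_n("test.fa", 30)` should print each header followed by at most 30 residues, whether or not the input is wrapped.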

10 months ago, Joe wrote:

With biopython:

#Usage: python3 scriptname.py file.fasta
import sys
from Bio import SeqIO

for i in SeqIO.parse(sys.argv[1], "fasta"):
    print(f">{i.description}\n{i.seq[0:30]}")

Or as a one-liner:

$ python3 -c 'import sys; from Bio import SeqIO; [print(f">{i.description}\n{i.seq[0:30]}") for i in SeqIO.parse(sys.argv[1], "fasta")];' file.fasta

Replace [0:30] with whatever range you like (it doesn't have to start at zero either).
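
When picking a range, keep in mind that Python slices are 0-based and exclude the end index, whereas a seqkit region such as 1:30 is 1-based and includes both ends. The conversion can be sketched as follows; `region_to_slice` is a hypothetical helper (not part of Biopython or seqkit), it only handles positive coordinates, and a plain string stands in for the sequence:

```python
def region_to_slice(start, end):
    """Convert a 1-based, inclusive region (start..end) to a Python slice."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)


# First 50 residues of the sequence from the question.
seq = "MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQ"

first_30 = seq[region_to_slice(1, 30)]            # same as seq[0:30]
residues_11_to_20 = seq[region_to_slice(11, 20)]  # same as seq[10:20]
```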
