Question: Fasta header trimming for multiple delimiters
kor272 wrote (3.0 years ago):

I am relatively new to Linux. I have read through the post "Fasta header trimming", but it does not quite solve my problem.

This is the format of the sequences in my file:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1

.. followed by the amino acid sequence.

I would like the format to be:

>P48347

+ sequence

As you can see, there are multiple delimiters, and I'm struggling to extract the characters I want correctly.

So far, my code is:

$ cut -d ' ' -f 1 example.fasta | cut -d '|' -f 2 > out.fasta

Which outputs:

P48347

+ sequence

I considered using sed to add the ">" back, but this seems a bit messy. I have also tried awk, but I am confused by how to use it with multiple delimiters and fasta format.

How do I extract the unique identifier in the header (P48347), without losing the '>' at the beginning?

Thanks in advance.

5heikki wrote (3.0 years ago):

awk 'BEGIN{FS="|"}{if(/^>/){print ">"$2}else{print $0}}' input > output
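
On the "multiple delimiters" part of the question: awk treats a field separator longer than one character as a regular expression, so a bracket expression can name several delimiters at once. The following is only a sketch under that assumption. The function name trim_headers and the sample record are made up for illustration and are not from the original post.

```shell
# trim_headers: hypothetical helper that reads FASTA on stdin.
# -F'[| ]' makes every '|' and every space a field separator, so the
# header splits into ">sp", "P48347", "14310_ARATH", "14-3-3-like", ...
trim_headers() {
    awk -F'[| ]' '/^>/ { print ">" $2; next } { print }'
}

# Made-up two-line record, used only to show the behaviour.
printf '%s\n' '>sp|P48347|14310_ARATH 14-3-3-like protein' 'MENEREKQ' | trim_headers
# prints:
# >P48347
# MENEREKQ
```

On a real file this would be called as: trim_headers < example.fasta > out.fasta. For this particular task, splitting on '|' alone (as in the one-liner above) is enough, because the wanted identifier is the second '|'-separated field.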

kor272 replied (3.0 years ago):

Thanks, this works perfectly!

bioplanet wrote (3.0 years ago):

Also in perl (if you want):

perl -ne 'if (/^>[^|]*\|([^|]*)\|/) { print ">$1\n" } else { print }' input.fasta > output.fasta

Joe wrote (3.0 years ago):

Pure bash alternative:

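#!/bin/bash
# usage:
# $ bash extract_header_field.sh seqs.fasta

while IFS= read -r line ; do
        if [[ ${line:0:1} == ">" ]] ; then
                # Header line: split on '|' and print only the second field.
                IFS='|' read -r -a header <<< "$line"
                printf '>%s\n' "${header[1]}"
        else
                # Sequence line: print unchanged (also works for multi-line sequences).
                printf '%s\n' "$line"
        fi
done < "$1"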

As a more general note, you can change this script to split a fasta up and retrieve any field you like by changing the IFS='|' part to whatever "internal field separator" you like (e.g. IFS=',').

Then just change the number in the line ...${header[1]}... to whatever chunk you like.

In this case, >sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1, there are two | symbols, so the header splits into three elements and the array $header becomes:

>sp   # "${header[0]}"
P48347   # "${header[1]}"
14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1   # "${header[2]}"

(remember that it's 0-based indexing)
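
The "change the separator, change the index" idea can be wrapped in a small function so nothing has to be edited by hand. This is a rough sketch, not part of the script above: the name extract_field and its argument order are made up for illustration. It also assumes you want the leading ">" stripped before splitting, so element 0 is "sp" rather than ">sp"; the other indices are unchanged.

```shell
#!/bin/bash
# extract_field DELIM INDEX   (hypothetical helper; reads FASTA on stdin)
#   DELIM = single-character separator, INDEX = 0-based field number
extract_field() {
    local delim=$1 idx=$2 line
    local -a header
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == ">"* ]]; then
            # Drop the leading '>' and split the rest of the header on DELIM.
            IFS="$delim" read -r -a header <<< "${line#>}"
            printf '>%s\n' "${header[idx]}"
        else
            # Sequence line: pass through unchanged.
            printf '%s\n' "$line"
        fi
    done
}

# Made-up record, used only to show the behaviour.
printf '%s\n' '>sp|P48347|14310_ARATH 14-3-3-like protein' 'MENEREKQ' | extract_field '|' 1
# prints:
# >P48347
# MENEREKQ
```

For a whole file: extract_field '|' 1 < seqs.fasta > out.fasta. An index past the last field gives an empty identifier (a bare ">"), so it is worth checking the output when changing the separator.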


cpad0112 wrote (3.0 years ago):

Command (headers with sequence):

$ sed '/^>/ s/\(>\).*|\(P[0-9]\+\)|.*/\1\2/' test.fa

Output with sequence:

>P48347
atgc
>P48348
tgac

Command (headers only):

$ sed -n '/^>/p' test.fa | sed 's/\(>\).*|\(P[0-9]\+\)|.*/\1\2/'

Output only headers:

>P48347
>P48348

input:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
atgc
>sp|P48348|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
tgac
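
The sed patterns above assume the identifier is "P" followed by digits, and \+ is a GNU extension. If the file may contain accessions of another shape, a possible variant is to capture whatever sits between the first two | symbols instead. This is only a sketch: the function name trim_with_sed and the second sample header are made up for illustration.

```shell
# trim_with_sed: hypothetical helper that reads FASTA on stdin.
# [^|]* matches any run of characters other than '|', so the capture group
# is whatever lies between the first and second '|' of a header line.
# Only basic regular expression syntax is used (no \+).
trim_with_sed() {
    sed '/^>/ s/^>[^|]*|\([^|]*\)|.*/>\1/'
}

# The second header is a made-up entry, not a real record.
printf '%s\n' \
    '>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon' \
    'atgc' \
    '>tr|Q00000|DEMO_ARATH made-up header for illustration' \
    'tgac' | trim_with_sed
# prints:
# >P48347
# atgc
# >Q00000
# tgac
```

Header lines that do not contain two | symbols are left unchanged by this pattern, so they are easy to spot in the output.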