Question: Fasta header trimming for multiple delimiters

kor272 wrote (17 months ago):

I am relatively new to Linux. I have read through the post "Fasta header trimming", but it does not quite solve my problem.

This is the format of the sequences in my file:

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1

... followed by the amino acid sequence.

I would like the format to be:

>P48347

+ sequence

As you can see, there are multiple delimiters, and I'm struggling to extract the characters I want correctly.

So far, my code is:

$ cut -d ' ' -f 1 example.fasta | cut -d '|' -f 2 > out.fasta

Which outputs:

P48347

+ sequence

I considered using sed to add the ">" back, but this seems a bit messy. I have also tried awk, but I am confused by how to use it with multiple delimiters and fasta format.

How do I extract the unique identifier in the header (P48347), without losing the '>' at the beginning?

Thanks in advance.

5heikki wrote (17 months ago):

awk 'BEGIN{FS="|"}{if(/^>/){print ">"$2}else{print $0}}' input > output

kor272 replied (17 months ago):

Thanks, this works perfectly!

bioplanet wrote (17 months ago):

Also in perl (if you want):

perl -e 'while(<>) {if($_=~/^>.*?\|(.*?)\|/) {print ">$1\n";} else {print $_;}}' input > output

jrj.healey wrote (17 months ago):

Pure bash alternative:

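#!/bin/bash
# usage:
# $ bash extract_header_field.sh seqs.fasta

# IFS= and -r keep each line intact (no whitespace trimming, no backslash handling)
while IFS= read -r line ; do
        if [[ ${line:0:1} == ">" ]] ; then
                # header line: split on '|' and print only the second field
                IFS='|' read -r -a header <<< "$line"
                printf '>%s\n' "${header[1]}"
        else
                # sequence line: pass through unchanged, so multi-line sequences survive
                printf '%s\n' "$line"
        fi
done < "$1"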

As a more general note, you can change this script to split a fasta up and retrieve any field you like by changing the IFS='|' part to whatever "internal field separator" you like (e.g. IFS=',').

Then just change the number in the line ...${header[1]}... to whatever chunk you like.

In this case, the header >sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1 contains 2 | symbols, so it splits into 3 fields and the elements of the array header become:

>sp   # "${header[0]}"
P48347   # "${header[1]}"
14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1   # "${header[2]}"

(remember that Bash arrays use 0-based indexing)
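
The point about swapping the separator and the index can be sketched as a small function. This is only an illustration, not part of the script above: the name fasta_field and its argument order are made up here, and unlike the script it strips the leading ">" before splitting, so index 0 gives "sp" rather than ">sp" (index 1 is still the accession).

```shell
#!/bin/bash
# Hypothetical helper: print a FASTA file with each header reduced to a
# single delimiter-separated field.
#   $1 = delimiter, $2 = 0-based field index, $3 = FASTA file
fasta_field() {
        local delim="$1" idx="$2" file="$3" line
        local -a header
        # "|| [[ -n $line ]]" also handles a last line with no trailing newline
        while IFS= read -r line || [[ -n "$line" ]] ; do
                if [[ ${line:0:1} == ">" ]] ; then
                        # drop the leading '>' so that index 0 is a clean field
                        IFS="$delim" read -r -a header <<< "${line:1}"
                        printf '>%s\n' "${header[$idx]}"
                else
                        # sequence lines are printed unchanged
                        printf '%s\n' "$line"
                fi
        done < "$file"
}
```

For the header in the question, fasta_field '|' 1 seqs.fasta > out.fasta should give >P48347, while fasta_field ' ' 0 seqs.fasta would keep everything up to the first space.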

cpad0112 wrote (17 months ago):

Command (headers with sequences):

$ sed '/^>/ s/\(>\).*|\(P[0-9]\+\)|.*/\1\2/' test.fa

Output:

>P48347
atgc
>P48348
tgac

Command (headers only):

$ sed -n '/^>/p' test.fa | sed 's/\(>\).*|\(P[0-9]\+\)|.*/\1\2/'

Output:

>P48347
>P48348

Input (test.fa):

>sp|P48347|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
atgc
>sp|P48348|14310_ARATH 14-3-3-like protein GF14 epsilon OS=Arabidopsis thaliana GN=GRF10 PE=1 SV=1
tgac
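
The pattern P[0-9]\+ only matches identifiers made of a P followed by digits, and \+ is a GNU sed extension. If the file may contain other identifiers, a possible variant is to keep whatever sits between the first two | characters. A minimal sketch, assuming every header has at least two | symbols; the function name trim_headers is made up for illustration:

```shell
#!/bin/bash
# Hypothetical variant: on header lines, keep only the text between the
# first and second '|'; all other lines are printed unchanged.
trim_headers() {
        sed '/^>/ s/^>[^|]*|\([^|]*\)|.*/>\1/' "$1"
}
```

Usage would be trim_headers test.fa > out.fa. Headers without two | symbols are left as they are, because the substitution does not match them.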