fasta seq header
8
2
Entering edit mode
7.6 years ago
#### ▴ 220

I have multifasta sequence and i am trying to rename the header and print the first id only

>gene_1|id=abc|protein=capA|ABD5672.4
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTCG
TACGGGTTTT

desired output:

>gene_1
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTC
GTACGGGTTTT

I tried following using sed but didnt work it seems sed -r 's/^([^|]+\|){0}//;s/\|//'any improvement on this? or any other alternative ?

sequence fasta • 8.0k views
ADD COMMENT
4
Entering edit mode

Hi ####

It's perfectly possible that the discussion in this thread didn't make you happy, but deleting the thread isn't the right way to cope with that. I've reversed that now. There are many interesting ideas and solutions, so there is no reason to wipe this all away "forever".

Cheers, Wouter

ADD REPLY
0
Entering edit mode

print the second id only

Second ID is not gene_1. Do you want abc instead?

ADD REPLY
0
Entering edit mode

No in this case I want first id ..i.e. gene_1

ADD REPLY
1
Entering edit mode

Why'd you say "second id" in the question then?

ADD REPLY
3
Entering edit mode
7.6 years ago
Joe 21k

Is this what you want?

cat gene.fasta | sed 's/|.*//'

It'll just remove anything after the pipe symbol | which separates the field you want from everything else

ADD COMMENT
0
Entering edit mode

Thanks alot for such a quick and short reply.

ADD REPLY
0
Entering edit mode

Simpler: sed -i 's/|.*//' gene.fasta

ADD REPLY
1
Entering edit mode

Yeah I alluded to this in a comment below. I used a cat pipe solution though in case the OP was to make a mistake in his command, editing in-place with sed is risky. At least this way you get to see what it will do to your file before you commit to the change.

ADD REPLY
0
Entering edit mode

That's fair, yeah. Don't use -i until you're sure it's doing what you think it's doing!

ADD REPLY
2
Entering edit mode
7.6 years ago
Medhat 9.8k

cat gene.fasta | paste - - |awk '{ split($1,a,"|");print a[1] "\n" $2}'

ADD COMMENT
0
Entering edit mode

In this case if instead of printing 1st I tried to print lets say 4th id after the pipe :

cat gene.fasta | paste - - |awk '{ split($1,a,"|");print a[4] "\n" $2}'

its just printing the id and not the sequences

ADD REPLY
0
Entering edit mode

there is no 4th Id you have only 2 here

1 = gene_1

and

2 = id=abc;protein=capA

also your sequence is multiline format meanwhile the command above assumes that is is one line in this case you can linearize it first

then cat gene.fasta | awk '{ split($1,a,"|");print a[4] "\n" $2}' > result.fa

but this will be too much work the above solution is much better

ADD REPLY
0
Entering edit mode

#### : @Pierre has posted solutions to convert multi-line fasta to single line in many of his fasta file solutions. Search biostars.

ADD REPLY
0
Entering edit mode

I think this awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa

ADD REPLY
0
Entering edit mode

Linearizing the input is not necessary. You can simply awk for lines that are sequence headers — lines which start with > — and those lines which are not headers (sequences). You can then modify the header lines by whatever logic you want and print out the modified header. In the case of non-header lines — sequences — you just print out the line as-is.

ADD REPLY
5
Entering edit mode
7.6 years ago

Looking for a rigorous solution?

As @genomax2 said, "Fasta format is not a strictly defined standard". So you need to check FASTA format very carefully.

Here's a fake but may existing FASTA file. Note that the last record gene_6 is appending to sequence of gene_5, this often happens by cating too FASTA files where the first one is not ending with new-line-character \n. Similarly, concatenating files could result in the space in front of >gene_5.

$ cat seqs.fa
>gene_1|id=abc;protein=capA
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGT
>gene_2 id=abc;protein=capA
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGAC
>gene_3
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCG
>gene_4|
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGT
 >gene_5|
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGC>gene_6|id=123;protein=fake
ACTG

Let's do this step by step.

  1. Counting sequences with different methods

    $ grep -c '>' seqs.fa
    6
    
    $ grep -c '^>' seqs.fa
    4
    
    $ seqtk seq seqs.fa | grep -c '^>'
    4
    
    $ seqkit stat seqs.fa
    file     format  type  num_seqs  sum_len  min_len  avg_len  max_len
    seqs.fa  FASTA   DNA          4      251       42     62.8      119
    
  2. Seems some bad formats are there. Let's check the sequences:

    $ seqkit seq -v seqs.fa
    [ERRO] error when parsing seq: gene_4| (seq: invalid DNAredundant letter: >)
    
  3. Retrieving gene_4, there they are!!!

    $ seqkit grep -p "gene_4|"  seqs.fa
    >gene_4|
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGT >gene_5| ATGCGTCGAC
    GTCGTACGGGTTTTCGTACGGGTTATGC>gene_6|id=123;protein=fakeACTG
    
  4. Checking gene_5 and gene_6 in original file

    $ grep "gene_5|" -A 1 -B 1 seqs.fa
    CGTACGGGTTATGCGT
    >gene_5| 
    ATGCGTCGACGTCGTACGGGTTTT
    
    $ grep "gene_6|" -A 1 -B 1 seqs.fa
    ATGCGTCGACGTCGTACGGGTTTT
    CGTACGGGTTATGC>gene_6|id=123;protein=fake
    ACTG
    
  5. Fix it:

    $ sed -i 's/ >gene_5/>gene_5/; s/>gene_6/\n>gene_6/' seqs.fa
    
  6. Validate again

    $ seqkit seq -v seqs.fa | seqkit stat
    file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
    -     FASTA   DNA          6      214        4     35.7       46
    
  7. Format is OK now, you may want to remove gaps in sequences.

    $ seqkit seq -g seqs.fa| seqkit stat
    file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
    -     FASTA   DNA          6      214        4     35.7       46
    

    Well, no gaps.

  8. Finaly, we can handle the sequence IDs.

    $ seqkit seq -n -i seqs.fa           
    gene_1|id=abc;protein=capA
    gene_2
    gene_3
    gene_4|
    gene_5|
    gene_6|id=123;protein=fake
    

    Oh no! some headers do not have |. The Methods our friends kindlly provide may not work.

  9. Let's use a general way provided by seqkit, i.e., using regular expression to retrieve sequence id.

    $ seqkit seq -n -i seqs.fa --id-regexp '^([^| ]+)[\| ]?'
    gene_1
    gene_2
    gene_3
    gene_4
    gene_5
    gene_6
    

    ^([^| ]+)[\| ]? captures leading non-space and non-| characters.

    $ seqkit seq -i seqs.fa --id-regexp '^([^| ]+)[\| ]?'
    >gene_1
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCGACGT
    >gene_2
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCGAC
    >gene_3
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCG
    >gene_4
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGT
    >gene_5
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGC
    >gene_6
    ACTG
    

Learn more about the SeqKit.

An introductory post on Biostars: SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang

ADD COMMENT
2
Entering edit mode
7.6 years ago
arnstrm ★ 1.9k

yet another option with cut

cut -f 1 -d "|" gene.fasta

-f 1 to print first field

-d "|" to set the delimiter for fields as pipe

ADD COMMENT
2
Entering edit mode
7.6 years ago
$ pip install pyfaidx
$ faidx -e "lambda x: x.split('|')[0]" genes.fa
>gene_1
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTC
GTACGGGTTTT
...
ADD COMMENT
1
Entering edit mode
7.6 years ago

You can just use awk directly. No need to use cat:

$ awk '{ if ($0 ~ /^>/) { n = split($0, a, "|"); printf("%s\n", a[1]); } else { print $0; } }' in.fa > out.fa

Really no need to use slow parsers, either, until you do something more complex. This is a very basic text-editing exercise that all informaticists should be able to do.

ADD COMMENT
0
Entering edit mode

You don't need cat with the sed solutions either, if you want to use inplace editing...my answer could be re-written sed -i 's/|.*//' filename but I figured simply printing the output avoids the OP making mistakes and irreversibly overwriting their files till they are sure what it'll do ;)

ADD REPLY
1
Entering edit mode

Using sed works too, definitely. I tend to shy away from sed because of differences between BSD and GNU versions (but that's solved on OS X with Homebrew, I guess).

ADD REPLY
0
Entering edit mode

along the same line,

sed -i.old 's/|.*//' filename

keeps the old file with .oldextension and does the inplace edit too. More elegant than

 sed 's/|.*//' filename > filename.new

imho

ADD REPLY
0
Entering edit mode

Do you think so? :P

I actually find a cat - sed - redirect, while not computationally efficient maybe because of an extra system call, to be a bit more elegant because you don't have to create extra messy intermediate files and you can test/build the command as you go.

Personal preference though!

ADD REPLY
1
Entering edit mode

True. I'm usually building pipelines as I go, and I start off with cat file. Then I hit up to go to the previous command, and I append a sed to it. It is easier building that way.

But once I build it the first time, I optimize it for later re-use. At this point, all the unnecessary cats are thrown out.

ADD REPLY
0
Entering edit mode

Same, and piping the end of the pipeline to a head to look how it's doing (if the file is big).

ADD REPLY
0
Entering edit mode

I start with a head -n and replace it with a cat once I know the pipeline works fine :)

ADD REPLY
2
Entering edit mode

I just define a function in bashrc that strings head and tail together to inspect both ends of a file incase something horrible has happened somewhere in the middle :p

ADD REPLY
0
Entering edit mode

cat <(head $1) <(tail $1)?

ADD REPLY
0
Entering edit mode

Even simpler, just 2 calls one after the other, but your way would totally work too - I wrapped it in a slightly more complicated if so that I could pass a number of lines to print if I wanted more than default 10

ends(){ if [ "$2" == "" ]; then head $1 tail $1 elif [ "$2" != "" ] ; then head -"${2}" $1 tail -"${2}" $1 else echo "Parameter not recognised" fi }

** It's not written as a oneliner in bashrc, I've just copied it though, so there are some missing ";" as a oneliner FYI.

ADD REPLY
0
Entering edit mode

A million solutions :p

ADD REPLY
0
Entering edit mode
7.6 years ago
seta ★ 1.9k

Please try this command: awk '{print $1}' file.fa > output.fa

ADD COMMENT
1
Entering edit mode

For this to work, you need to set the input file delimiter, something like this:

awk -F"|" '{print $1}'
ADD REPLY
0
Entering edit mode

I tried its not working

ADD REPLY
0
Entering edit mode

Why do you think this works?

ADD REPLY
0
Entering edit mode
7.6 years ago
John 13k

Please just use a proper FASTA parser. awk/grep/sed cannot parse FASTA reliably. Why do people even bother writing FASTA/FASTQ parsers if no one is going to use them? Why do people bother making custom bioinformatic file formats if people are going to treat them like a weird CSV? Convert all your data to CSV if you want CSV. But if you want to stick with FASTA - for whatever reason - use a FASTA parsing API like BioPython/Perl/Ruby/etc.

Everything else is loaded with assumptions about the rest of your data which we cannot possible know.

ADD COMMENT
2
Entering edit mode

I cleaned up the unpleasant off-topic discussion which originated from this reply.
A summary of the useful bits of replies on this is below. I apologise for any context that has been lost by this operation.

jrj.healey

For complex input data, or situations where you don't know your own files that well I very much agree. However, for working with files on the commandline that don't require you to have installed Biopython (for example, can be a right pain in the ass to install), I think knowing the necessary oneliners with sed/cut/grep/awk and so on is an absolute requirement...

the point he's getting at is that you shouldn't necessarily treat a fasta as a file that can be coerced in to something which is simple a delimited text file, when parsers already exist, which is what a couple of the suggestions in this thread do.

genomax2

By using a standard file parser one is able to account for even edge cases and get the desired output. You are not able to post your entire file (due to space constraints) and the command line solutions people suggest consider only those snippets. Fasta format is not a strictly defined standard so if your file does not follow the same exact structure throughout then you could end up with silent corruption of data that you may not discover until some other downstream tool throws an error.

Ram

I'd use cut or awk or sed only after ensuring all known edge cases, and even then I'm risking unknown/unpredictable edge cases. AN actively maintained tool would bypass that.

Why one would use cut or awk with CSV files without running a few checks beforehand) when they are insensitive to quoted commas is beyond me :) Also, encoding, quoted new lines, vertical tabs, etc. Takeaway: Not even CSV file are as simple as they seem.

John

That's OK Ram - i'm not high value i'm just chatty

Thank you for keeping Biostars a nice place for all users!

ADD REPLY

Login before adding your answer.

Traffic: 820 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6