Question: fasta seq header
2
gravatar for ####
20 months ago by
####170
####170 wrote:

I have multifasta sequence and i am trying to rename the header and print the first id only

>gene_1|id=abc|protein=capA|ABD5672.4
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTCG
TACGGGTTTT

desired output:

>gene_1
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTC
GTACGGGTTTT

I tried following using sed but didnt work it seems sed -r 's/^([^|]+\|){0}//;s/\|//'any improvement on this? or any other alternative ?

sequence fasta • 1.1k views
ADD COMMENTlink modified 20 months ago by Matt Shirley8.7k • written 20 months ago by ####170
3

Hi ####

It's perfectly possible that the discussion in this thread didn't make you happy, but deleting the thread isn't the right way to cope with that. I've reversed that now. There are many interesting ideas and solutions, so there is no reason to wipe this all away "forever".

Cheers, Wouter

ADD REPLYlink written 20 months ago by WouterDeCoster34k

print the second id only

Second ID is not gene_1. Do you want abc instead?

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax58k

No in this case I want first id ..i.e. gene_1

ADD REPLYlink modified 20 months ago • written 20 months ago by ####170
1

Why'd you say "second id" in the question then?

ADD REPLYlink written 20 months ago by RamRS19k
3
gravatar for jrj.healey
20 months ago by
jrj.healey8.6k
United Kingdom
jrj.healey8.6k wrote:

Is this what you want?

cat gene.fasta | sed 's/|.*//'

It'll just remove anything after the pipe symbol | which separates the field you want from everything else

ADD COMMENTlink written 20 months ago by jrj.healey8.6k

Thanks alot for such a quick and short reply.

ADD REPLYlink modified 20 months ago • written 20 months ago by ####170

Simpler: sed -i 's/|.*//' gene.fasta

ADD REPLYlink written 20 months ago by Daniel3.7k
1

Yeah I alluded to this in a comment below. I used a cat pipe solution though in case the OP was to make a mistake in his command, editing in-place with sed is risky. At least this way you get to see what it will do to your file before you commit to the change.

ADD REPLYlink written 20 months ago by jrj.healey8.6k

That's fair, yeah. Don't use -i until you're sure it's doing what you think it's doing!

ADD REPLYlink written 20 months ago by Daniel3.7k
2
gravatar for Medhat
20 months ago by
Medhat8.0k
Poland
Medhat8.0k wrote:

cat gene.fasta | paste - - |awk '{ split($1,a,"|");print a[1] "\n" $2}'

ADD COMMENTlink modified 20 months ago • written 20 months ago by Medhat8.0k

In this case if instead of printing 1st I tried to print lets say 4th id after the pipe :

cat gene.fasta | paste - - |awk '{ split($1,a,"|");print a[4] "\n" $2}'

its just printing the id and not the sequences

ADD REPLYlink written 20 months ago by ####170

there is no 4th Id you have only 2 here

1 = gene_1

and

2 = id=abc;protein=capA

also your sequence is multiline format meanwhile the command above assumes that is is one line in this case you can linearize it first

then cat gene.fasta | awk '{ split($1,a,"|");print a[4] "\n" $2}' > result.fa

but this will be too much work the above solution is much better

ADD REPLYlink modified 20 months ago • written 20 months ago by Medhat8.0k

#### : @Pierre has posted solutions to convert multi-line fasta to single line in many of his fasta file solutions. Search biostars.

ADD REPLYlink written 20 months ago by genomax58k

I think this awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa

ADD REPLYlink written 20 months ago by Medhat8.0k

Linearizing the input is not necessary. You can simply awk for lines that are sequence headers — lines which start with > — and those lines which are not headers (sequences). You can then modify the header lines by whatever logic you want and print out the modified header. In the case of non-header lines — sequences — you just print out the line as-is.

ADD REPLYlink modified 20 months ago • written 20 months ago by Alex Reynolds26k
5
gravatar for shenwei356
20 months ago by
shenwei3564.2k
China
shenwei3564.2k wrote:

Looking for a rigorous solution?

As @genomax2 said, "Fasta format is not a strictly defined standard". So you need to check FASTA format very carefully.

Here's a fake but may existing FASTA file. Note that the last record gene_6 is appending to sequence of gene_5, this often happens by cating too FASTA files where the first one is not ending with new-line-character \n. Similarly, concatenating files could result in the space in front of >gene_5.

$ cat seqs.fa
>gene_1|id=abc;protein=capA
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGT
>gene_2 id=abc;protein=capA
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGAC
>gene_3
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCG
>gene_4|
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGT
 >gene_5|
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGC>gene_6|id=123;protein=fake
ACTG

Let's do this step by step.

  1. Counting sequences with different methods

    $ grep -c '>' seqs.fa
    6
    
    $ grep -c '^>' seqs.fa
    4
    
    $ seqtk seq seqs.fa | grep -c '^>'
    4
    
    $ seqkit stat seqs.fa
    file     format  type  num_seqs  sum_len  min_len  avg_len  max_len
    seqs.fa  FASTA   DNA          4      251       42     62.8      119
    
  2. Seems some bad formats are there. Let's check the sequences:

    $ seqkit seq -v seqs.fa
    [ERRO] error when parsing seq: gene_4| (seq: invalid DNAredundant letter: >)
    
  3. Retrieving gene_4, there they are!!!

    $ seqkit grep -p "gene_4|"  seqs.fa
    >gene_4|
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGT >gene_5| ATGCGTCGAC
    GTCGTACGGGTTTTCGTACGGGTTATGC>gene_6|id=123;protein=fakeACTG
    
  4. Checking gene_5 and gene_6 in original file

    $ grep "gene_5|" -A 1 -B 1 seqs.fa
    CGTACGGGTTATGCGT
    >gene_5| 
    ATGCGTCGACGTCGTACGGGTTTT
    
    $ grep "gene_6|" -A 1 -B 1 seqs.fa
    ATGCGTCGACGTCGTACGGGTTTT
    CGTACGGGTTATGC>gene_6|id=123;protein=fake
    ACTG
    
  5. Fix it:

    $ sed -i 's/ >gene_5/>gene_5/; s/>gene_6/\n>gene_6/' seqs.fa
    
  6. Validate again

    $ seqkit seq -v seqs.fa | seqkit stat
    file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
    -     FASTA   DNA          6      214        4     35.7       46
    
  7. Format is OK now, you may want to remove gaps in sequences.

    $ seqkit seq -g seqs.fa| seqkit stat
    file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
    -     FASTA   DNA          6      214        4     35.7       46
    

    Well, no gaps.

  8. Finaly, we can handle the sequence IDs.

    $ seqkit seq -n -i seqs.fa           
    gene_1|id=abc;protein=capA
    gene_2
    gene_3
    gene_4|
    gene_5|
    gene_6|id=123;protein=fake
    

    Oh no! some headers do not have |. The Methods our friends kindlly provide may not work.

  9. Let's use a general way provided by seqkit, i.e., using regular expression to retrieve sequence id.

    $ seqkit seq -n -i seqs.fa --id-regexp '^([^| ]+)[\| ]?'
    gene_1
    gene_2
    gene_3
    gene_4
    gene_5
    gene_6
    

    ^([^| ]+)[\| ]? captures leading non-space and non-| characters.

    $ seqkit seq -i seqs.fa --id-regexp '^([^| ]+)[\| ]?'
    >gene_1
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCGACGT
    >gene_2
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCGAC
    >gene_3
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCG
    >gene_4
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGT
    >gene_5
    ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGC
    >gene_6
    ACTG
    

Learn more about the SeqKit.

An introductory post on Biostars: SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang

ADD COMMENTlink modified 20 months ago • written 20 months ago by shenwei3564.2k
2
gravatar for arnstrm
20 months ago by
arnstrm1.7k
Ames, IA
arnstrm1.7k wrote:

yet another option with cut

cut -f 1 -d "|" gene.fasta

-f 1 to print first field

-d "|" to set the delimiter for fields as pipe

ADD COMMENTlink written 20 months ago by arnstrm1.7k
2
gravatar for Matt Shirley
20 months ago by
Matt Shirley8.7k
Cambridge, MA
Matt Shirley8.7k wrote:
$ pip install pyfaidx
$ faidx -e "lambda x: x.split('|')[0]" genes.fa
>gene_1
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTC
GTACGGGTTTT
...
ADD COMMENTlink written 20 months ago by Matt Shirley8.7k
1
gravatar for Alex Reynolds
20 months ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

You can just use awk directly. No need to use cat:

$ awk '{ if ($0 ~ /^>/) { n = split($0, a, "|"); printf("%s\n", a[1]); } else { print $0; } }' in.fa > out.fa

Really no need to use slow parsers, either, until you do something more complex. This is a very basic text-editing exercise that all informaticists should be able to do.

ADD COMMENTlink modified 20 months ago • written 20 months ago by Alex Reynolds26k

You don't need cat with the sed solutions either, if you want to use inplace editing...my answer could be re-written sed -i 's/|.*//' filename but I figured simply printing the output avoids the OP making mistakes and irreversibly overwriting their files till they are sure what it'll do ;)

ADD REPLYlink written 20 months ago by jrj.healey8.6k
1

Using sed works too, definitely. I tend to shy away from sed because of differences between BSD and GNU versions (but that's solved on OS X with Homebrew, I guess).

ADD REPLYlink written 20 months ago by Alex Reynolds26k

along the same line,

sed -i.old 's/|.*//' filename

keeps the old file with .oldextension and does the inplace edit too. More elegant than

 sed 's/|.*//' filename > filename.new

imho

ADD REPLYlink modified 20 months ago • written 20 months ago by arnstrm1.7k

Do you think so? :P

I actually find a cat - sed - redirect, while not computationally efficient maybe because of an extra system call, to be a bit more elegant because you don't have to create extra messy intermediate files and you can test/build the command as you go.

Personal preference though!

ADD REPLYlink written 20 months ago by jrj.healey8.6k
1

True. I'm usually building pipelines as I go, and I start off with cat file. Then I hit up to go to the previous command, and I append a sed to it. It is easier building that way.

But once I build it the first time, I optimize it for later re-use. At this point, all the unnecessary cats are thrown out.

ADD REPLYlink written 20 months ago by RamRS19k

Same, and piping the end of the pipeline to a head to look how it's doing (if the file is big).

ADD REPLYlink written 20 months ago by WouterDeCoster34k

I start with a head -n and replace it with a cat once I know the pipeline works fine :)

ADD REPLYlink written 20 months ago by RamRS19k
2

I just define a function in bashrc that strings head and tail together to inspect both ends of a file incase something horrible has happened somewhere in the middle :p

ADD REPLYlink written 20 months ago by jrj.healey8.6k

cat <(head $1) <(tail $1)?

ADD REPLYlink written 20 months ago by RamRS19k

Even simpler, just 2 calls one after the other, but your way would totally work too - I wrapped it in a slightly more complicated if so that I could pass a number of lines to print if I wanted more than default 10

ends(){ if [ "$2" == "" ]; then head $1 tail $1 elif [ "$2" != "" ] ; then head -"${2}" $1 tail -"${2}" $1 else echo "Parameter not recognised" fi }

** It's not written as a oneliner in bashrc, I've just copied it though, so there are some missing ";" as a oneliner FYI.

ADD REPLYlink modified 20 months ago • written 20 months ago by jrj.healey8.6k

A million solutions :p

ADD REPLYlink written 20 months ago by WouterDeCoster34k
0
gravatar for seta
20 months ago by
seta1.0k
Sweden
seta1.0k wrote:

Please try this command: awk '{print $1}' file.fa > output.fa

ADD COMMENTlink written 20 months ago by seta1.0k
1

For this to work, you need to set the input file delimiter, something like this:

awk -F"|" '{print $1}'
ADD REPLYlink written 20 months ago by arnstrm1.7k

I tried its not working

ADD REPLYlink written 20 months ago by ####170

Why do you think this works?

ADD REPLYlink written 20 months ago by Alex Reynolds26k
0
gravatar for John
20 months ago by
John12k
Germany
John12k wrote:

Please just use a proper FASTA parser. awk/grep/sed cannot parse FASTA reliably. Why do people even bother writing FASTA/FASTQ parsers if no one is going to use them? Why do people bother making custom bioinformatic file formats if people are going to treat them like a weird CSV? Convert all your data to CSV if you want CSV. But if you want to stick with FASTA - for whatever reason - use a FASTA parsing API like BioPython/Perl/Ruby/etc.

Everything else is loaded with assumptions about the rest of your data which we cannot possible know.

ADD COMMENTlink modified 20 months ago • written 20 months ago by John12k
2

I cleaned up the unpleasant off-topic discussion which originated from this reply.
A summary of the useful bits of replies on this is below. I apologise for any context that has been lost by this operation.

jrj.healey

For complex input data, or situations where you don't know your own files that well I very much agree. However, for working with files on the commandline that don't require you to have installed Biopython (for example, can be a right pain in the ass to install), I think knowing the necessary oneliners with sed/cut/grep/awk and so on is an absolute requirement...

the point he's getting at is that you shouldn't necessarily treat a fasta as a file that can be coerced in to something which is simple a delimited text file, when parsers already exist, which is what a couple of the suggestions in this thread do.

genomax2

By using a standard file parser one is able to account for even edge cases and get the desired output. You are not able to post your entire file (due to space constraints) and the command line solutions people suggest consider only those snippets. Fasta format is not a strictly defined standard so if your file does not follow the same exact structure throughout then you could end up with silent corruption of data that you may not discover until some other downstream tool throws an error.

Ram

I'd use cut or awk or sed only after ensuring all known edge cases, and even then I'm risking unknown/unpredictable edge cases. AN actively maintained tool would bypass that.

Why one would use cut or awk with CSV files without running a few checks beforehand) when they are insensitive to quoted commas is beyond me :) Also, encoding, quoted new lines, vertical tabs, etc. Takeaway: Not even CSV file are as simple as they seem.

John

That's OK Ram - i'm not high value i'm just chatty

Thank you for keeping Biostars a nice place for all users!

ADD REPLYlink modified 20 months ago • written 20 months ago by WouterDeCoster34k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 842 users visited in the last hour