8
2
Entering edit mode
5.5 years ago
#### ▴ 220

I have multifasta sequence and i am trying to rename the header and print the first id only

>gene_1|id=abc|protein=capA|ABD5672.4
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTCG
TACGGGTTTT

desired output:

>gene_1
ATGCGTCGACGTCGTACGGGTTTT
CGTACGGGTTATGCGTCGACGTC
GTACGGGTTTT


I tried following using sed but didnt work it seems sed -r 's/^([^|]+\|){0}//;s/\|//'any improvement on this? or any other alternative ?

sequence fasta • 3.5k views
4
Entering edit mode

Hi ####

It's perfectly possible that the discussion in this thread didn't make you happy, but deleting the thread isn't the right way to cope with that. I've reversed that now. There are many interesting ideas and solutions, so there is no reason to wipe this all away "forever".

Cheers, Wouter

0
Entering edit mode

print the second id only

Second ID is not gene_1. Do you want abc instead?

0
Entering edit mode

No in this case I want first id ..i.e. gene_1

1
Entering edit mode

Why'd you say "second id" in the question then?

3
Entering edit mode
5.5 years ago
Joe 20k

Is this what you want?

cat gene.fasta | sed 's/|.*//'

It'll just remove anything after the pipe symbol | which separates the field you want from everything else

0
Entering edit mode

Thanks alot for such a quick and short reply.

0
Entering edit mode

Simpler: sed -i 's/|.*//' gene.fasta

1
Entering edit mode

Yeah I alluded to this in a comment below. I used a cat pipe solution though in case the OP was to make a mistake in his command, editing in-place with sed is risky. At least this way you get to see what it will do to your file before you commit to the change.

0
Entering edit mode

That's fair, yeah. Don't use -i until you're sure it's doing what you think it's doing!

2
Entering edit mode
5.5 years ago
Medhat 9.4k

cat gene.fasta | paste - - |awk '{ split($1,a,"|");print a[1] "\n"$2}'

0
Entering edit mode

In this case if instead of printing 1st I tried to print lets say 4th id after the pipe :

cat gene.fasta | paste - - |awk '{ split($1,a,"|");print a[4] "\n"$2}'


its just printing the id and not the sequences

0
Entering edit mode

there is no 4th Id you have only 2 here

1 = gene_1

and

2 = id=abc;protein=capA

also your sequence is multiline format meanwhile the command above assumes that is is one line in this case you can linearize it first

then cat gene.fasta | awk '{ split($1,a,"|");print a[4] "\n"$2}' > result.fa

but this will be too much work the above solution is much better

0
Entering edit mode

#### : @Pierre has posted solutions to convert multi-line fasta to single line in many of his fasta file solutions. Search biostars.

0
Entering edit mode

I think this awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa

0
Entering edit mode

Linearizing the input is not necessary. You can simply awk for lines that are sequence headers — lines which start with > — and those lines which are not headers (sequences). You can then modify the header lines by whatever logic you want and print out the modified header. In the case of non-header lines — sequences — you just print out the line as-is.

5
Entering edit mode
5.5 years ago

Looking for a rigorous solution?

As @genomax2 said, "Fasta format is not a strictly defined standard". So you need to check FASTA format very carefully.

Here's a fake but may existing FASTA file. Note that the last record gene_6 is appending to sequence of gene_5, this often happens by cating too FASTA files where the first one is not ending with new-line-character \n. Similarly, concatenating files could result in the space in front of >gene_5.

$cat seqs.fa >gene_1|id=abc;protein=capA ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGCGTCGACGT >gene_2 id=abc;protein=capA ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGCGTCGAC >gene_3 ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGCGTCG >gene_4| ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGCGT >gene_5| ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGC>gene_6|id=123;protein=fake ACTG  Let's do this step by step. 1. Counting sequences with different methods $ grep -c '>' seqs.fa
6

$grep -c '^>' seqs.fa 4$ seqtk seq seqs.fa | grep -c '^>'
4

$seqkit stat seqs.fa file format type num_seqs sum_len min_len avg_len max_len seqs.fa FASTA DNA 4 251 42 62.8 119  2. Seems some bad formats are there. Let's check the sequences: $ seqkit seq -v seqs.fa
[ERRO] error when parsing seq: gene_4| (seq: invalid DNAredundant letter: >)

3. Retrieving gene_4, there they are!!!

$seqkit grep -p "gene_4|" seqs.fa >gene_4| ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGT >gene_5| ATGCGTCGAC GTCGTACGGGTTTTCGTACGGGTTATGC>gene_6|id=123;protein=fakeACTG  4. Checking gene_5 and gene_6 in original file $ grep "gene_5|" -A 1 -B 1 seqs.fa
CGTACGGGTTATGCGT
>gene_5|
ATGCGTCGACGTCGTACGGGTTTT

$grep "gene_6|" -A 1 -B 1 seqs.fa ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGC>gene_6|id=123;protein=fake ACTG  5. Fix it: $ sed -i 's/ >gene_5/>gene_5/; s/>gene_6/\n>gene_6/' seqs.fa

6. Validate again

$seqkit seq -v seqs.fa | seqkit stat file format type num_seqs sum_len min_len avg_len max_len - FASTA DNA 6 214 4 35.7 46  7. Format is OK now, you may want to remove gaps in sequences. $ seqkit seq -g seqs.fa| seqkit stat
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   DNA          6      214        4     35.7       46


Well, no gaps.

8. Finaly, we can handle the sequence IDs.

$seqkit seq -n -i seqs.fa gene_1|id=abc;protein=capA gene_2 gene_3 gene_4| gene_5| gene_6|id=123;protein=fake  Oh no! some headers do not have |. The Methods our friends kindlly provide may not work. 9. Let's use a general way provided by seqkit, i.e., using regular expression to retrieve sequence id. $ seqkit seq -n -i seqs.fa --id-regexp '^([^| ]+)[\| ]?'
gene_1
gene_2
gene_3
gene_4
gene_5
gene_6


^([^| ]+)[\| ]? captures leading non-space and non-| characters.

$seqkit seq -i seqs.fa --id-regexp '^([^| ]+)[\| ]?' >gene_1 ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCGACGT >gene_2 ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCGAC >gene_3 ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGTCG >gene_4 ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGCGT >gene_5 ATGCGTCGACGTCGTACGGGTTTTCGTACGGGTTATGC >gene_6 ACTG  Learn more about the SeqKit. An introductory post on Biostars: SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang ADD COMMENT 2 Entering edit mode 5.5 years ago arnstrm ★ 1.8k yet another option with cut cut -f 1 -d "|" gene.fasta  -f 1 to print first field -d "|" to set the delimiter for fields as pipe ADD COMMENT 2 Entering edit mode 5.5 years ago $ pip install pyfaidx
$faidx -e "lambda x: x.split('|')[0]" genes.fa >gene_1 ATGCGTCGACGTCGTACGGGTTTT CGTACGGGTTATGCGTCGACGTC GTACGGGTTTT ...  ADD COMMENT 1 Entering edit mode 5.5 years ago You can just use awk directly. No need to use cat: $ awk '{ if ($0 ~ /^>/) { n = split($0, a, "|"); printf("%s\n", a[1]); } else { print $0; } }' in.fa > out.fa  Really no need to use slow parsers, either, until you do something more complex. This is a very basic text-editing exercise that all informaticists should be able to do. ADD COMMENT 0 Entering edit mode You don't need cat with the sed solutions either, if you want to use inplace editing...my answer could be re-written sed -i 's/|.*//' filename but I figured simply printing the output avoids the OP making mistakes and irreversibly overwriting their files till they are sure what it'll do ;) ADD REPLY 1 Entering edit mode Using sed works too, definitely. I tend to shy away from sed because of differences between BSD and GNU versions (but that's solved on OS X with Homebrew, I guess). ADD REPLY 0 Entering edit mode along the same line, sed -i.old 's/|.*//' filename  keeps the old file with .oldextension and does the inplace edit too. More elegant than  sed 's/|.*//' filename > filename.new  imho ADD REPLY 0 Entering edit mode Do you think so? :P I actually find a cat - sed - redirect, while not computationally efficient maybe because of an extra system call, to be a bit more elegant because you don't have to create extra messy intermediate files and you can test/build the command as you go. Personal preference though! ADD REPLY 1 Entering edit mode True. I'm usually building pipelines as I go, and I start off with cat file. Then I hit up to go to the previous command, and I append a sed to it. It is easier building that way. But once I build it the first time, I optimize it for later re-use. At this point, all the unnecessary cats are thrown out. ADD REPLY 0 Entering edit mode Same, and piping the end of the pipeline to a head to look how it's doing (if the file is big). ADD REPLY 0 Entering edit mode I start with a head -n and replace it with a cat once I know the pipeline works fine :) ADD REPLY 2 Entering edit mode I just define a function in bashrc that strings head and tail together to inspect both ends of a file incase something horrible has happened somewhere in the middle :p ADD REPLY 0 Entering edit mode cat <(head$1) <(tail $1)? ADD REPLY 0 Entering edit mode Even simpler, just 2 calls one after the other, but your way would totally work too - I wrapped it in a slightly more complicated if so that I could pass a number of lines to print if I wanted more than default 10 ends(){ if [ "$2" == "" ]; then head $1 tail$1 elif [ "$2" != "" ] ; then head -"${2}" $1 tail -"${2}" $1 else echo "Parameter not recognised" fi } ** It's not written as a oneliner in bashrc, I've just copied it though, so there are some missing ";" as a oneliner FYI. ADD REPLY 0 Entering edit mode A million solutions :p ADD REPLY 0 Entering edit mode 5.5 years ago seta ★ 1.7k Please try this command: awk '{print$1}' file.fa > output.fa

1
Entering edit mode

For this to work, you need to set the input file delimiter, something like this:

awk -F"|" '{print \$1}'

0
Entering edit mode

I tried its not working

0
Entering edit mode

Why do you think this works?

0
Entering edit mode
5.5 years ago
John 13k

Please just use a proper FASTA parser. awk/grep/sed cannot parse FASTA reliably. Why do people even bother writing FASTA/FASTQ parsers if no one is going to use them? Why do people bother making custom bioinformatic file formats if people are going to treat them like a weird CSV? Convert all your data to CSV if you want CSV. But if you want to stick with FASTA - for whatever reason - use a FASTA parsing API like BioPython/Perl/Ruby/etc.

Everything else is loaded with assumptions about the rest of your data which we cannot possible know.

2
Entering edit mode

I cleaned up the unpleasant off-topic discussion which originated from this reply.
A summary of the useful bits of replies on this is below. I apologise for any context that has been lost by this operation.

jrj.healey

For complex input data, or situations where you don't know your own files that well I very much agree. However, for working with files on the commandline that don't require you to have installed Biopython (for example, can be a right pain in the ass to install), I think knowing the necessary oneliners with sed/cut/grep/awk and so on is an absolute requirement...

the point he's getting at is that you shouldn't necessarily treat a fasta as a file that can be coerced in to something which is simple a delimited text file, when parsers already exist, which is what a couple of the suggestions in this thread do.

genomax2

By using a standard file parser one is able to account for even edge cases and get the desired output. You are not able to post your entire file (due to space constraints) and the command line solutions people suggest consider only those snippets. Fasta format is not a strictly defined standard so if your file does not follow the same exact structure throughout then you could end up with silent corruption of data that you may not discover until some other downstream tool throws an error.

Ram

I'd use cut or awk or sed only after ensuring all known edge cases, and even then I'm risking unknown/unpredictable edge cases. AN actively maintained tool would bypass that.

Why one would use cut or awk with CSV files without running a few checks beforehand) when they are insensitive to quoted commas is beyond me :) Also, encoding, quoted new lines, vertical tabs, etc. Takeaway: Not even CSV file are as simple as they seem.

John

That's OK Ram - i'm not high value i'm just chatty

Thank you for keeping Biostars a nice place for all users!