Trim The Fasta Title
6
1
8.9 years ago
KJ Lim ▴ 130

Good day.

I parsed the titles of my desired sequences from a Fasta file. The titles are rather long, so I want to trim them a bit with the Unix sed command. A title looks like:

AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla


I would like to trim the sequence title to:

AC166615      1-aminocyclopropane-1-carboxylate bla bla bla


I tried the sed command below:

sed 's/^.*\:/'$'\t''/g' seqTitle.txt

I got the output below, with the sequence ID removed as well. But I wish to keep the sequence ID.

1-aminocyclopropane-1-carboxylate bla bla bla

Could someone kindly give me some guidance on Unix sed manipulation? Thanks a lot. Have a nice weekend.

fasta

3
8.9 years ago

I can give you a small hack: if all the titles are like that, you can cut the ID first and merge it with the description that you are getting with your own sed code. So:

cut -f1 -d" " seqTitle.txt > id && sed 's/^.*\:/'$'\t''/g' seqTitle.txt > desc

paste -d"\t" id desc > wanted.txt

You can then remove the temporary files produced:

rm id desc

Cheers
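The cut-and-paste steps above can be sketched end to end as follows. This is only a minimal sketch under some assumptions: the scratch directory and file names are hypothetical, the description is taken to be everything after the last colon, and the sed expression here strips the colon and following spaces (rather than inserting a tab) so that paste's default tab delimiter is the only separator.

```shell
# Scratch directory and file names are hypothetical, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' 'AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla' > "$tmpdir/seqTitle.txt"

# Column 1: the sequence ID, i.e. the first space-delimited field.
cut -d' ' -f1 "$tmpdir/seqTitle.txt" > "$tmpdir/id"

# Column 2: drop everything up to the last colon and any spaces after it.
sed 's/^.*: *//' "$tmpdir/seqTitle.txt" > "$tmpdir/desc"

# paste joins corresponding lines with a tab by default.
result=$(paste "$tmpdir/id" "$tmpdir/desc")
printf '%s\n' "$result"

# Clean up the scratch files.
rm -r "$tmpdir"
```

Using a directory from mktemp keeps the intermediate files out of the working directory, so nothing named id or desc gets overwritten by accident.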

2
8.9 years ago

My regex skills suck, so I often find it a lot faster to just write a dirty one-liner Python script. It's a bit of a Linux hack, but it works:

echo "print('\n'.join(line.split()[0] + '\t' + line.split(': ')[-1].strip() for line in open('yourFile')))" | python3

1
8.9 years ago
Random ▴ 160

I find awk to be more intuitive:

awk 'BEGIN{OFS=":"}{split($0,a,":"); print$1,a[2]}'


But I also managed to do it with sed, although I doubt this is the best way to do it:

sed 's/^\([^ ]*\) .*\(:\)/\1\2/'


I can't point you to a guide, except maybe for O'Reilly's "Sed and Awk", but I found this list of explained one-liners to be particularly useful to me:

The same site also has tips on awk and perl one-liners, in case you are interested.

0

Random, thanks for your suggestion. The sed command you mentioned here will have the ":" included:

AC166615: 1-aminocyclopropane-1-carboxylate bla bla bla

I think that should be fine. Thanks.

0

For some reason I had assumed you wanted the ":" to separate the two fields. If not, you can still use awk and do:

awk '{split($0,a,":"); print $1,a[2]}'


Or use sed and do:

sed 's/^\([^ ]*\) .*:\(.*\)/\1\2/'

0

Thanks for the suggestion.

1
8.9 years ago
Joachim ★ 2.9k

If I get you right, then you want to remove the string content between the accession and the description following ":". You can do that on a Mac with:

sed -E 's/( |       )[^:]+://'


Note 1: The big white space after the "|" symbol is a tab character, which I inserted by pressing Ctrl-V and then TAB (same on Mac).

Note 2: On Linux with an older GNU sed you need to replace the "-E" option with "-r"; recent GNU sed versions accept "-E" as well.

The regexp itself works as follows:

1. find the first space or tab character ("( | )")
2. proceed as long as the characters are not a colon ("[^:]+")
3. match a single colon character (":")

Sed is instructed to:

1. carry out a substitution via "s"
2. replace the matched text with an empty string ("//")
3. carry out the match and replace only once per line (no "/g" option)

Hope that helps.
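As a rough illustration of the regexp described above, here is the same substitution run on the sample header from the question. One swap is made and is an assumption on my side, not part of the original command: the POSIX class [[:space:]] stands in for the literal "space or tab" group, so no tab has to be typed into the terminal.

```shell
# Sample header taken from the question.
line='AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla'

# [[:space:]] matches a space or a tab; [^:]+ then runs up to the first colon,
# and the whole match (including the colon) is replaced by nothing.
result=$(printf '%s\n' "$line" | sed -E 's/[[:space:]][^:]+://')

printf '%s\n' "$result"
```

Because the match starts at the first whitespace character, the accession is left untouched, and the space that followed the colon ends up separating it from the description.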

0

The command works! Thanks a lot.

0
3.6 years ago

Suppose you have some 10,000 such headers. I think this approach should help. Keep all the headers in a file (file_name).

1) cut -d ":" -f2 file_name > temp2

will fetch you this part of your string: 1-aminocyclopropane-1-carboxylate bla bla bla

2) awk '{print $1}' file_name > temp1

will fetch you this part of your string: AC166615

3) paste -d ":" temp1 temp2 > final_headers.txt

will give you AC166615: 1-aminocyclopropane-1-carboxylate bla bla bla

4) rm temp1 temp2
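For comparison, the same "ID: description" result can probably be had in a single awk pass without temporary files. This is a sketch, not a replacement for the steps above, and it assumes that every header contains at least one colon and that the text before the first colon can be discarded.

```shell
# Sample header taken from the question.
line='AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla'

# Save the ID first, because editing $0 with sub() re-splits the fields.
# sub() then deletes everything up to and including the first colon.
result=$(printf '%s\n' "$line" | awk '{ id = $1; sub(/^[^:]*:/, ""); print id ":" $0 }')

printf '%s\n' "$result"
```

On a real file, the printf pipeline would be replaced by passing the header file as an argument to awk and redirecting the output to a new file.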