Question: Trim The Fasta Title
1
6.6 years ago by
KJ Lim120
KJ Lim120 wrote:

Good day.

I extracted the titles of the sequences I want from a FASTA file. The titles are rather long, so I would like to trim them with the Unix sed command. A title looks like this:

AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla

I would like to trim the title to:

AC166615      1-aminocyclopropane-1-carboxylate bla bla bla

I tried with the sed command as below:

sed 's/^.*\:/'$'\t''/g' seqTitle.txt

The output, shown below, has the sequence ID removed as well, but I want to keep the sequence ID.

      1-aminocyclopropane-1-carboxylate bla bla bla
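
The ID disappears because `^.*:` is anchored at the start of the line and `.*` is greedy, so the match runs from the first character of the ID up to the last colon. A minimal sketch of the same behaviour, using Python's `re` module as a stand-in for sed (the variable names are only for illustration):

```python
import re

title = ("AC166615 weakly similar to UniRef100_A1EGX0 Cluster: "
         "1-aminocyclopropane-1-carboxylate bla bla bla")

# Same pattern as the sed attempt: anchored at line start, greedy up to the last colon.
attempt = re.sub(r"^.*:", "\t", title)

# Starting the match at the first space instead leaves the ID untouched;
# the trailing " *" also swallows the space that follows the colon.
fixed = re.sub(r" .*: *", "\t", title, count=1)

print(repr(attempt))
print(repr(fixed))
```

Only the starting point of the match changes between the two calls; the replacement text is the same tab in both.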

Could someone please give me some guidance on using sed for this?

Thanks a lot. Have a nice weekend.

fasta • 2.1k views
modified 14 months ago by shubhra.bhattacharya120 • written 6.6 years ago by KJ Lim120
3
6.6 years ago by
Sukhdeep Singh9.6k
Netherlands
Sukhdeep Singh9.6k wrote:

I can give you a small hack: if all the titles look like that, you can cut out the ID first and then merge it with the description that you are already getting from your own sed code.

So, cut -f1 -d" " seqTitle.txt > id && sed 's/^.*\:/'$'\t''/g' seqTitle.txt > desc

paste -d"\t" id desc > wanted.txt

You can then remove the temporary files that were produced:

rm id desc

Cheers

written 6.6 years ago by Sukhdeep Singh9.6k
2
6.6 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

My regex skills suck so I often find it a lot faster to just write a dirty one-liner python script. It's a bit of a linux hack, but it works:

echo "print('\n'.join([line.split()[0] + '\t' + line.split(': ')[-1].strip() for line in open('yourFile','r')]))" | python
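
For readers who prefer a script over a piped one-liner, the same idea can be written out as below. This is a sketch rather than the code from the answer: the function name `trim_title` and the sample line are assumptions, and it reads from a list instead of from 'yourFile'.

```python
def trim_title(line):
    """Return 'ID<TAB>description' for one title line."""
    seq_id = line.split()[0]                    # first whitespace-delimited token
    description = line.split(": ")[-1].strip()  # text after the last ': '
    return seq_id + "\t" + description

# Sample input; in practice these lines would come from open("yourFile").
lines = [
    "AC166615 weakly similar to UniRef100_A1EGX0 Cluster: "
    "1-aminocyclopropane-1-carboxylate bla bla bla\n",
]
trimmed = [trim_title(line) for line in lines]
print("\n".join(trimmed))
```

Note that a title containing no ': ' comes back whole after the ID, because `split(': ')[-1]` then returns the entire line.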
modified 6.6 years ago • written 6.6 years ago by Damian Kao15k
1
6.6 years ago by
Random160
Random160 wrote:

I find awk to be more intuitive:

awk 'BEGIN{OFS=":"}{split($0,a,":"); print $1,a[2]}'

But I also managed to do it with sed, although I doubt this is the best way to do it:

sed 's/\(\)\ .\+\(:\)/\1\2/'

I can't point you to a guide, except maybe for O'Reilly's "Sed and Awk", but I found this list of explained one-liners to be particularly useful to me:

The same site also has tips on awk and perl one-liners, in case you are interested.

modified 6.6 years ago • written 6.6 years ago by Random160

Random, thanks for your suggestion. The sed command you mentioned keeps the ":" in the output: AC166615: 1-aminocyclopropane-1-carboxylate bla bla bla. I think that should be fine. Thanks.

written 6.6 years ago by KJ Lim120

For some reason I had assumed you wanted the ":" to separate the two fields. If you don't, you can still use awk:

  awk '{split($0,a,":"); print $1,a[2]}'

Or use sed and do:

sed 's/\(\)\ .\+:\(\)/\1\2/'
modified 6.6 years ago • written 6.6 years ago by Random160

Thanks for the suggestion.

written 6.5 years ago by KJ Lim120
1
6.6 years ago by
Joachim2.8k
San Francisco, California
Joachim2.8k wrote:

If I get you right, then you want to remove the string content between the accession and the description following ":". You can do that on a Mac with:

sed -E 's/( |       )[^:]+://'

Note 1: The big white space after the "|" symbol is a tab character, which I inserted by pressing Ctrl-V and then TAB (same on Mac).

Note 2: On Linux with GNU sed, replace the "-E" option with "-r" (recent GNU sed versions also accept "-E").

The regexp itself works as follows:

  1. find the earliest space or tab character ("( | )")
  2. continue as long as the characters are not a colon ("[^:]+")
  3. match a single colon character (":")

Sed is instructed to:

  1. carry out a substitution via "s"
  2. replace the matched text with an empty string ("//")
  3. carry out the match and replacement only once per line (no "/g" option)
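
The same pattern can be tried outside sed. Here is a small sketch with Python's `re` module, assuming the separator after the accession is a space or a tab (`[ \t]` stands in for the space-or-tab alternation); `count=1` plays the role of leaving off the "/g" option:

```python
import re

# [ \t]  : the first space or tab after the accession
# [^:]+  : one or more characters that are not a colon
# :      : the colon itself
pattern = re.compile(r"[ \t][^:]+:")

title = ("AC166615 weakly similar to UniRef100_A1EGX0 Cluster: "
         "1-aminocyclopropane-1-carboxylate bla bla bla")

# Replace the first match with an empty string, once per line.
result = pattern.sub("", title, count=1)
print(result)
```

Because `[^:]+` stops at the first colon, a description that itself contains a colon is kept intact, which a greedy `.*:` would not do.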

Hope that helps.

written 6.6 years ago by Joachim2.8k

The command works! Thanks a lot.

written 6.6 years ago by KJ Lim120
0
14 months ago by
shubhra.bhattacharya120 wrote:

Suppose you have some 10,000 such headers. I think this approach should help. Keep all the headers in a file (file_name).

1) cut -d ":" -f2 file_name > temp2

will fetch you this part of your string: 1-aminocyclopropane-1-carboxylate bla bla bla

2) awk '{print $1}' file_name > temp1

will fetch you this part of your string: AC166615

3) paste -d ":" temp1 temp2 > final_headers.txt

will give you AC166615: 1-aminocyclopropane-1-carboxylate bla bla bla

4) rm temp1 temp2
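
The same steps can also be done in one pass without temporary files. Here is a sketch in Python, assuming, as the cut command does, that the description is the second colon-delimited field; `join_headers` is a hypothetical helper name, not part of any tool mentioned above:

```python
def join_headers(lines):
    """Mimic cut -d ":" -f2, awk '{print $1}' and paste -d ":" in one pass."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue                      # skip blank lines (an added safeguard)
        seq_id = line.split()[0]          # like awk '{print $1}'
        fields = line.split(":")
        # like cut -f2, which prints the whole line when there is no delimiter
        description = fields[1] if len(fields) > 1 else line
        out.append(seq_id + ":" + description)   # like paste -d ":"
    return out

headers = [
    "AC166615 weakly similar to UniRef100_A1EGX0 Cluster: "
    "1-aminocyclopropane-1-carboxylate bla bla bla\n",
]
print("\n".join(join_headers(headers)))
```

As with cut, any text after a second colon in the description is dropped, so this only suits titles with a single ":".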

written 14 months ago by shubhra.bhattacharya120