Question: Command To Change Fasta File Format
5.3 years ago by
biolab1.1k
biolab1.1k wrote:

Hi, everyone,

I want to convert format 1 into format 2, as shown below. The command I used is:

cut -f 1 inputfile | sed -e 's/\n/\t/g' -e 's/>/\n>' > outputfil.

However, it doesn't work. What's wrong with this command?

FORMAT 1

>gene1 ATPaseII
CTGATGCA
>gene2 Actin1
CGGGCGGTA
>gene3 pesudogene
ATGACTGACTG

FORMAT 2

>gene1  CTGATGCA
>gene2  CGGGCGGTA
>gene3  ATGACTGACTG

Thank you very much.

command-line linux • 1.9k views
ADD COMMENTlink modified 3.8 years ago by Biostar ♦♦ 20 • written 5.3 years ago by biolab1.1k
5.3 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Your attempt is fairly close. You may want to try the following command. The parts of the pipeline are split across separate lines to make the whole thing easier to read. The \ characters tell the shell that the command continues on the next line.

cut -d " " -f 1 f1 | \
    perl -pe 's/>/_newline_>/; s/\n/\t/' | \
    perl -pe 's/_newline_//' | \
    perl -pe 's/_newline_/\n/g' | \
    perl -pe 's/\t$//' > f2

Here are some details about the steps:

  1. cut -d " " -f 1, where -d " " specifies that the delimiter is a space. This removes everything after the first space on each line.
  2. perl -pe is used much like sed -e here. I sometimes find perl more convenient, so rather than learning both sed and perl, I suggest learning only perl.
  3. 's/>/_newline_>/' adds a unique marker string in front of each > so that the lines can be recreated later.
  4. 's/\n/\t/' replaces the newlines with tabs. At this point, the whole file is a single line.
  5. perl -pe 's/_newline_//' removes the first occurrence of _newline_ in the file, to avoid starting the file with an empty line later.
  6. perl -pe 's/_newline_/\n/g' replaces each remaining _newline_ marker with a newline.
  7. perl -pe 's/\t$//' removes the tabs at the end of the lines.

In this example, I use pipes (|) a few times in places where the need for them may not be obvious. Perl processes the file, or the input it receives through a pipe, one line at a time, with lines delimited by a newline character (\n or similar). Thus, for example, once I remove all the newline characters at step 4, I have created one long line and must use a pipe so that the next transformation is applied to the whole file, not only to the line currently being processed. This permits the trick in item 5, where I remove only the first occurrence of _newline_ in the whole file, which is now on one line.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Eric Normandeau10k

Hi Eric, Could you please briefly tell me what's the difference between _newline_ and \n? Thanks a lot!

ADD REPLYlink modified 5.3 years ago by Eric Normandeau10k • written 5.3 years ago by biolab1.1k

As a side note, I edited your comment to use full English. "Could u pls" is just as easy to write as "Could you please". The latter is more polite and is also more pleasant to read for a person who spent a few minutes helping you and future users ;)

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Eric Normandeau10k

Yes, you are right. As a perl beginner, I think I really learned something about this language, especially from the three _newline_ substitution commands in your code, which look alike but differ. THANK YOU.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by biolab1.1k

_newline_ is just a string I decided to use to mark the positions where I will later put a \n back. I could just as well have used any string, like INSERT_NEWLINE_HERE :)

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Eric Normandeau10k

Really helpful and informative, THANKS!!

ADD REPLYlink written 5.3 years ago by biolab1.1k
5.3 years ago by
France, Montpellier, CIRAD
Frédéric Mahé2.9k wrote:

Here is a solution with Awk:

awk 'BEGIN {RS = ">"} NR > 1 {print ">"$1"\t"$NF}' inputfile

It uses > as the record separator, skips the empty first record with NR > 1, and prints the first field ($1) and the last field ($NF) of each record.
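
To see what awk makes of the file with this record separator, one can print the record number, the field count, and the first and last fields of every record. This is only a small inspection sketch on made-up sample data. The output layout and the variable name records are my own.

```shell
# Sample input in the layout of FORMAT 1 (shortened to two records).
tmpdir=$(mktemp -d)
printf '>gene1 ATPaseII\nCTGATGCA\n>gene2 Actin1\nCGGGCGGTA\n' > "$tmpdir/f1"

# With RS = ">", the text before the first > forms an empty record 1.
# Inside a record, spaces and newlines both separate fields.
records=$(awk 'BEGIN {RS = ">"}
    {print NR ": NF=" NF " first=[" $1 "] last=[" $NF "]"}' "$tmpdir/f1")
printf '%s\n' "$records"

rm -r "$tmpdir"
```

Record 1 should show no fields at all, which is why the one-liner needs NR > 1. The other records have three fields each: the gene name, the description, and the sequence.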

ADD COMMENTlink written 5.3 years ago by Frédéric Mahé2.9k

Will this work if the sequences are long and span multiple lines?

ADD REPLYlink written 5.3 years ago by Eric Normandeau10k

No, it will not. If the sequences span multiple lines, one can first linearize the FASTA file (that is, write each sequence on a single line). This can be done with Awk too: awk 'NR==1 {print ; next} {printf (/^>/) ? "\n"$0"\n" : $1}' file.fas
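
As a possible alternative, the linearization and the conversion to FORMAT 2 can be combined in a single Awk pass. The following is a minimal sketch, not a tested replacement for the commands above. The file name wrapped.fas, the sample data and the variable names seen and converted are assumptions made for illustration. It uses explicit "%s" formats so that a % character in a header is not interpreted by printf.

```shell
# Sample input with one sequence wrapped over two lines.
tmpdir=$(mktemp -d)
printf '>gene1 ATPaseII\nCTGA\nTGCA\n>gene2 Actin1\nCGGGCGGTA\n' > "$tmpdir/wrapped.fas"

converted=$(awk '
    /^>/ {
        if (seen) printf "\n"    # finish the previous record, if any
        printf "%s\t", $1        # keep only the name, followed by a tab
        seen = 1
        next
    }
    { printf "%s", $1 }          # append sequence lines without a newline
    END { if (seen) printf "\n" }
' "$tmpdir/wrapped.fas")
printf '%s\n' "$converted"

rm -r "$tmpdir"
```

Each header starts a new output line, and every sequence line that follows is appended to it, so wrapped sequences end up on one line next to their name.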

ADD REPLYlink written 5.3 years ago by Frédéric Mahé2.9k

It's great to know these awk commands. I did not expect to learn so many commands when I originally posted this question. THANKS.

ADD REPLYlink written 5.3 years ago by biolab1.1k