Parsing a fasta file
4.8 years ago
AP ▴ 80

Hello everyone,

I have a FASTA file in which each sequence is wrapped over multiple lines, like this:

>jgi|Necha2|9923|gw1.3.792.1
CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAA
GAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGAC
TGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004
CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTC
TTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGG
CCTCTGGTTCCAAGAACGCGCAATCGCGTTCTTCCAAGGCTGGTCTCGCGTTCCCTGTTGGTCGTGTCCA
CCGTCTTCTCCGAAAGGGCAACTACGCTCAGCGTGTCGGTGCCGGTGCCCCTGTGTACCTCGCTGCCGTC
CTTGAGTATCTTGCTGCCGAAATCCTCGAGTTGGCTGGCAACGCTGCCCGTGACAACAAGAAGACCCGTA

How can I convert it to a format like this, with each sequence on a single line?

>jgi|Necha2|9923|gw1.3.792.1
CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAAGAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGAC

>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004
CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTCTTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGGCCTCTGGTTCCAAGAACGCGCAATCGCGTTCTTCCAAGGCTGGTCTCGCGTTCCCTGTTGGTCGTGTCCACCGTCTTCTCCGAAAGGGCAACTACGCTCAGCGTGTCGGTGCCGGTGCCCCTGTGTACCTCGCTGCCGTCCTTGAGTATCTTGCTGCCGAAATCCTCGAGTTGGCTGGCAACGCTGCCCGTGACAACAAGAAGACCCGTA

Thank you, Ambika

Assembly sequence • 877 views

What have you tried?


I tried this awk command:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input_best_transcripts.fasta > output.fa

It gives me each record on a single line, with the header and the sequence separated by a tab:

>jgi|Necha2|9923|gw1.3.792.1    CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAAGAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGACTGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004 CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTCTTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGGCCTCTGGTTCCAAGAACGCGC

But I think what I need is the header (name, gene, and scaffold details) on one line and the sequence on the next line.


Try changing from \t to \n:

awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'

Or use seqkit, which has a global flag for the output line width:

-w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

For example:

seqkit seq -w 0 input_best_transcripts.fasta
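
To make the awk one-liner above easier to follow, here is the same logic spread over several lines with comments. This is only a sketch: the demo file names and the tiny input are made up for illustration, so substitute your own FASTA file.

```shell
# Tiny made-up input, for demonstration only.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\nAA\n' > demo_wrapped.fasta

awk '
  /^>/ {
    # Header line: end the previous sequence (if any), then print the header.
    printf("%s%s\n", (N > 0 ? "\n" : ""), $0)
    N++
    next
  }
  {
    # Sequence line: print it without a newline so the pieces join up.
    printf("%s", $0)
  }
  END {
    # Finish the last sequence with a newline.
    printf("\n")
  }
' demo_wrapped.fasta > demo_unwrapped.fasta

cat demo_unwrapped.fasta
```

Each header ends up on its own line, followed by its whole sequence on the next line. Note that this assumes Unix line endings; a file with Windows line endings would need the carriage returns stripped first.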

I think I figured it out with sed. I replaced the tab in the output.fa file with a newline using this command:

 sed 's/\t/\n/g' output.fa > output1.fa

>jgi|Necha2|9923|gw1.3.792.1
CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAAGAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGACTGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004
CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTCTTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGGCCTCTGGTTCCAAGAACGCGCAATCGCGTTCTTCCAAGGCTGGTCTCGCGTTCCCTGTTGGTCGTGTCCACCGTCTTCTCCGAAAGGGC
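
As far as I know, the \t and \n escapes in that sed expression are not guaranteed by POSIX, so they may behave differently outside GNU sed. A possibly more portable sketch of the same tab-to-newline step uses tr; the demo file names and contents below are invented for illustration.

```shell
# Made-up tab-separated records, mimicking the intermediate output.fa layout.
printf '>seq1\tACGTTTGA\n>seq2\tGGCCAA\n' > demo_tabbed.fa

# Turn every tab into a newline (like the sed /g version, this would also
# split a header that itself contains a tab).
tr '\t' '\n' < demo_tabbed.fa > demo_two_line.fa

cat demo_two_line.fa
```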

That's great! In your earlier awk command, you could have either used > as the record separator and rebuilt each record, or replaced \n with '' (that is, removed the newline) on every line that does not start with a >.
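
The record-separator idea might look roughly like the sketch below. It assumes that > never appears inside a header line (otherwise the record would be split there), and the demo file names and input are made up for illustration.

```shell
# Tiny made-up input, for demonstration only.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\nAA\n' > demo_wrapped_rs.fasta

awk '
  BEGIN {
    RS = ">"     # each record is everything between two ">" characters
    FS = "\n"    # each field is one line of that record
  }
  NR > 1 {       # record 1 is the empty text before the first ">"
    printf(">%s\n", $1)                         # field 1 is the header
    for (i = 2; i <= NF; i++) printf("%s", $i)  # join the sequence lines
    printf("\n")
  }
' demo_wrapped_rs.fasta > demo_rs.fasta

cat demo_rs.fasta
```

Compared with matching /^>/ line by line, this treats a whole FASTA entry as one record, which can be handy if you also want to filter or reorder entries in the same pass.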
