Question: Parsing a fasta file
Asked 26 days ago by Ambika (United States/Auburn/Auburn University):

Hello everyone,

I have a FASTA file in which each sequence is wrapped across multiple lines, like this:

>jgi|Necha2|9923|gw1.3.792.1
CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAA
GAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGAC
TGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004
CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTC
TTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGG
CCTCTGGTTCCAAGAACGCGCAATCGCGTTCTTCCAAGGCTGGTCTCGCGTTCCCTGTTGGTCGTGTCCA
CCGTCTTCTCCGAAAGGGCAACTACGCTCAGCGTGTCGGTGCCGGTGCCCCTGTGTACCTCGCTGCCGTC
CTTGAGTATCTTGCTGCCGAAATCCTCGAGTTGGCTGGCAACGCTGCCCGTGACAACAAGAAGACCCGTA

How can I convert it to a format like this, with each sequence on a single line?

>jgi|Necha2|9923|gw1.3.792.1
CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAAGAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGACTGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004
CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTCTTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGGCCTCTGGTTCCAAGAACGCGCAATCGCGTTCTTCCAAGGCTGGTCTCGCGTTCCCTGTTGGTCGTGTCCACCGTCTTCTCCGAAAGGGCAACTACGCTCAGCGTGTCGGTGCCGGTGCCCCTGTGTACCTCGCTGCCGTCCTTGAGTATCTTGCTGCCGAAATCCTCGAGTTGGCTGGCAACGCTGCCCGTGACAACAAGAAGACCCGTA

Thank you, Ambika


What have you tried?

written 26 days ago by RamRS

I tried this awk command:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input_best_transcripts.fasta > output.fa

It puts each record on a single line, with the header and the sequence separated by a tab:

>jgi|Necha2|9923|gw1.3.792.1    CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAAGAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGACTGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004 CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTCTTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGGCCTCTGGTTCCAAGAACGCGC

But I think what I need is one line with the header (name, gene and scaffold details), followed by a second line with the sequence.

written 26 days ago by Ambika (modified 26 days ago)

Try changing \t to \n in the first printf, so that the header ends with a newline instead of a tab:

awk '/^>/ {printf("%s%s\n",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'

Or use seqkit, which has a global flag for the output line width:

-w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

For example:

seqkit seq -w 0 input_best_transcripts.fasta
written 26 days ago by SMK (modified 26 days ago)
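
To see what the awk one-liner above does, here is a minimal sketch run on a tiny made-up file. The file names (sample.fasta, sample_unwrapped.fasta) and the two records are hypothetical and not from the thread. Header lines are printed on their own line, sequence lines are printed without a trailing newline so that they join, and a final check confirms that every record ends up on exactly two lines.

```shell
# Hypothetical two-record input with wrapped sequences.
cat > sample.fasta <<'EOF'
>seq1 demo
ACGT
TTGA
>seq2
GGCC
AA
EOF

# Header lines: terminate the previous sequence line (if any), then print the header.
# All other lines: print without a newline so wrapped sequence lines are joined.
awk '/^>/ { printf("%s%s\n", (N > 0 ? "\n" : ""), $0); N++; next }
     { printf("%s", $0) }
     END { printf("\n") }' sample.fasta > sample_unwrapped.fasta

# Sanity check: the line count should be exactly twice the header count.
headers=$(grep -c '^>' sample_unwrapped.fasta)
lines=$(awk 'END { print NR }' sample_unwrapped.fasta)
echo "headers=$headers lines=$lines"
```

If the two numbers do not match the 1:2 ratio on a real file, some sequence is probably still wrapped or the file contains blank lines.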

I think I figured it out with sed. I replaced the tab that separated the header from the sequence in output.fa with a newline:

sed 's/\t/\n/g' output.fa > output1.fa

which gives:

>jgi|Necha2|9923|gw1.3.792.1
CGTGGCCAGGCTCTTTATATTCGCTCACTCTTCGAGGCCAACCGCAATGTGACTGATCCGAGACACCAAAGAGCTCTGTTGACGGAGACAGAGAAGCTACTGGAGAGCTGGAAGCACCCCGATCCCTACACGCCCCCGACTGCTCCCGGAGGCTCAAAGTTCGAGCGAAACCTGCCATCGCCTATCCTCGACCGTGAGCCG
>jgi|Necha2|59698|estExt_Genewise1.C_sca_3_chr4_2_00004
CAATTGTCATCACCACGACGCTCTCCGACTCATCCATTCGTCGCAAAGTCTCACCAGACTCACTCACTTCTTTTTTCACCCAAACCACCCCAACCAACCCACTTTCAAAATGACTGGCGGCGGCAAGTCTGGCGGCAAGGCCTCTGGTTCCAAGAACGCGCAATCGCGTTCTTCCAAGGCTGGTCTCGCGTTCCCTGTTGGTCGTGTCCACCGTCTTCTCCGAAAGGGC
written 26 days ago by Ambika (modified 26 days ago)
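
One caveat with the sed approach: the \t and \n escapes in s/\t/\n/g are understood by GNU sed but not by every sed implementation (BSD/macOS sed, for example, handles them differently). A more portable sketch uses tr instead. It assumes the tab-separated layout produced by the first awk command and that headers contain no tabs of their own; the file names and sample records below are made up for illustration.

```shell
# Hypothetical tab-separated file, shaped like the output of the first awk attempt.
printf '>seq1 demo\tACGTTTGA\n>seq2\tGGCCAA\n' > sample_tab.fa

# tr translates every tab into a newline and does not rely on sed escape handling.
tr '\t' '\n' < sample_tab.fa > sample_two_line.fa

cat sample_two_line.fa
```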

That's great! In your earlier awk command, you could have either used > as the record separator and reassembled each record, or replaced \n with '' (nothing, which removes the newline) on every line that does not start with >.

written 26 days ago by RamRS
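
The record-separator idea can be sketched roughly as follows. This is one possible reading of the suggestion, not the poster's own code: it sets RS to > and FS to a newline, so the first field of each record is the header and the remaining fields are the wrapped sequence lines. It assumes that > appears only at the start of header lines; the file names and sample records are hypothetical.

```shell
# Hypothetical two-record input with wrapped sequences.
cat > sample_rs.fasta <<'EOF'
>seq1 first record
ACGT
TTGA
>seq2
GGCC
AA
EOF

# RS=">" splits the file into one record per FASTA entry;
# FS="\n" makes $1 the header and $2..$NF the sequence lines.
# The first record is the empty text before the first ">", so it is skipped.
awk 'BEGIN { RS = ">"; FS = "\n" }
     NR > 1 {
         seq = ""
         for (i = 2; i <= NF; i++) seq = seq $i   # join the wrapped lines
         printf(">%s\n%s\n", $1, seq)
     }' sample_rs.fasta > unwrapped_rs.fasta

cat unwrapped_rs.fasta
```

A header that itself contains a > character would be split into two records by this sketch, so the line-by-line approach is the safer choice for such files.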