5.0 years ago
mxlsherry1992
Hi,
I have a FASTA file like this:
>TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039::m.3039 TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039 ORF type:complete len:100 (-) TRINITY_DN100000_c1_g1_i3:1027-1326(-)
MVWIKFRGLHRVLTSTPLVKSGKTPSQTWAFLDISVELIVFLFLNVHKSPMPHFKIYSEA
FSEEWSLLWLQYSRHLIQKPKPWQIKIELLHLCCCNRLC*
>TRINITY_DN100000_c1_g6::TRINITY_DN100000_c1_g6_i2::g.84365::m.84365 TRINITY_DN100000_c1_g6::TRINITY_DN100000_c1_g6_i2::g.84365 ORF type:complete len:112 (-) TRINITY_DN100000_c1_g6_i2:379-714(-)
MEMMQEIIPFAREMLSARPSKGTMKVYLVGGTFAVLGIVSGMVEAACSLFPEQEESTLTK
LMEDCLTVTAQNQEPQTFIPEDDEQDAEMEAKAKDLPMFRQRRMSFRAHAS*
I want to keep only the second field of each header and leave the amino acid sequences unchanged. How should I correct this command: sed 's/::.*//' input > output
The output I want looks like this:
>TRINITY_DN100000_c1_g1_i3
MVWIKFRGLHRVLTSTPLVKSGKTPSQTWAFLDISVELIVFLFLNVHKSPMPHFKIYSEA
FSEEWSLLWLQYSRHLIQKPKPWQIKIELLHLCCCNRLC*
>TRINITY_DN100000_c1_g6_i2
MEMMQEIIPFAREMLSARPSKGTMKVYLVGGTFAVLGIVSGMVEAACSLFPEQEESTLTK
LMEDCLTVTAQNQEPQTFIPEDDEQDAEMEAKAKDLPMFRQRRMSFRAHAS*
The command above keeps only the first field, >TRINITY_DN100000_c1_g1, because it deletes everything from the first "::" onward.
I want to keep the second field, which carries the isoform information: TRINITY_DN100000_c1_g1_i3.
How should I correct the command?
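One possible fix, offered as a sketch rather than a definitive answer: instead of deleting from the first "::", capture the text between the first and second "::" and rewrite the header from that capture. The file names sample.fasta and sample.renamed.fasta are hypothetical, and the pattern assumes every header has at least two "::" separators and that the first two fields contain no single ":". Headers that do not match are left unchanged.

```shell
# Build a two-record sample file (hypothetical name) with headers shaped like the ones above.
cat > sample.fasta <<'EOF'
>TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039::m.3039 ORF type:complete len:100 (-)
MVWIKFRGLHRVLTSTPLVKSGKTPSQTWAFLDISVELIVFLFLNVHKSPMPHFKIYSEA
>TRINITY_DN100000_c1_g6::TRINITY_DN100000_c1_g6_i2::g.84365::m.84365 ORF type:complete len:112 (-)
MEMMQEIIPFAREMLSARPSKGTMKVYLVGGTFAVLGIVSGMVEAACSLFPEQEESTLTK
EOF

# On header lines only (/^>/), capture the text between the first and second "::"
# and rewrite the header as ">" plus that capture. Sequence lines pass through untouched.
sed -E '/^>/ s/^>[^:]*::([^:]*)::.*/>\1/' sample.fasta > sample.renamed.fasta

cat sample.renamed.fasta
```

Restricting the substitution to lines starting with ">" keeps the sequences safe. To run it on the real data, replace the sample file names with input and output.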
Hi, thanks! This command gives this:
Some of the sequences look good, but some still look messed up. Do you know how to modify the command further?
Thank you