Question

How to add > to FASTA file headers and merge two-line headers into one line (see below)


3.6 years ago

yaqinguo629 • 0

Q1: add > to each header; Q2: merge the two header lines into one line, keeping a space between them; Q3: remove the space between header and sequence

MT657978.1

Acaulospora foveata isolate 

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

MT626044.1

Claroideoglomus etunicatum 

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Hi, everyone. I have a sequence file like the one above, but I want it to look like this:

>MT657978.1 Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
>MT626044.1 Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

sequence next-gen • 1.1k views

ADD COMMENT • link 3.5 years ago by yaqinguo629 • 0


Use awk and operate on NR. Extract every first line and every second line in separate subshells, and paste their output together with a blank space as the separator. Then paste that result with the output of a similar awk command that extracts every third line, using a unique delimiter, and replace that delimiter with a newline character using sed.
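
A minimal sketch of that recipe, under some assumptions: the file has a strict three-line layout (ID, description, sequence) with no blank lines, and the character `|` never occurs in the data. The function name `merge_three_line_records` is made up for illustration. Temporary files stand in for the subshells so that it runs in plain sh, `tr` is swapped in for sed to insert the newline (a newline in a sed replacement is not portable across sed versions), and a sed step adds the leading `>`.

```shell
merge_three_line_records() {
  # $1: input file laid out strictly as ID / description / sequence, no blank lines.
  input=$1
  tmpdir=$(mktemp -d) || return 1
  awk 'NR % 3 == 1' "$input" > "$tmpdir/ids"    # every first line
  awk 'NR % 3 == 2' "$input" > "$tmpdir/desc"   # every second line
  awk 'NR % 3 == 0' "$input" > "$tmpdir/seqs"   # every third line
  # Join ID and description with a space, attach the sequence with "|"
  # (assumed to be absent from the data), prefix ">", then turn "|" into a newline.
  paste -d ' ' "$tmpdir/ids" "$tmpdir/desc" |
    paste -d '|' - "$tmpdir/seqs" |
    sed 's/^/>/' |
    tr '|' '\n'
  rm -rf "$tmpdir"
}
```

Usage would be along the lines of `merge_three_line_records test.fa > fixed.fa`.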

ADD REPLY • link 3.6 years ago by Ram 43k


Input:

$ cat test.fa 

MT657978.1
Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
MT626044.1
Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Output:

$ awk 'NR%3==1 {getline desc; print ">"$0, desc}; NR%3==0 {print}' test.fa

>MT657978.1 Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
>MT626044.1 Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

ADD REPLY • link 3.6 years ago by cpad0112 21k


Later in the thread, OP says that they could have multiple sequence lines, so NR%x is not going to work. OP's data is quite mangled.

ADD REPLY • link 3.6 years ago by Ram 43k


 MT657978

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

AB626044

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Hello, everyone. If the sequences look like this, it is a different story. How do I add > to the header and remove the space? Using the previous script, I couldn't add > at all. Sorry, I am a novice. I have many questions related to this. I really appreciate your effort.

ADD REPLY • link updated 3.5 years ago by Ram 43k • written 3.5 years ago by yaqinguo629 • 0

Ram · Accepted Answer · 2020-10-07

2


3.6 years ago

oakhamwolf ▴ 20

Assuming that your file has a consistent structure of

ID
#new line#
Genus species
#new line#
sequence
#new line#

then the following will work:

cat test | paste - - - - - - | awk -F "\t" -v OFS="\n" '{print ">"$1" "$3,$5;}'

If you don't have the #new line# lines in there then use:

cat test | paste - - - | awk -F "\t" -v OFS="\n" '{print ">"$1" "$2,$3;}'
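
Both commands rely on every record spanning the same number of lines. A cheap sanity check before running either one, sketched here with a hypothetical helper named `check_record_size`, is to confirm that the total line count is a multiple of the record size. Passing this check is necessary but not sufficient: a file can have the right line count and still be laid out inconsistently.

```shell
check_record_size() {
  # $1: input file; $2: lines per record (6 with the #new line# lines, 3 without).
  # Succeeds (exit status 0) only if the total line count is a multiple of $2.
  awk -v n="$2" 'END { exit (NR % n != 0) }' "$1"
}
```

For example, `check_record_size test 6 && echo "layout looks plausible"` prints the message only when the line count fits the six-line layout.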

Hope this helps

ADD COMMENT • link updated 3.6 years ago by Ram 43k • written 3.6 years ago by oakhamwolf ▴ 20


Thanks very much. Yep, if I have a consistent structure like you proposed, the first one works perfectly. Thanks again! But the issue is that I found out my fasta file doesn't have a consistent structure (see below), so do you have any solution for that?

ID-1
#new line#
Genus species
#new line#
sequence
#new line#
ID-2
#new line#
Genus species
#new line#
sequence
#new line#
sequence
#new line#

ADD REPLY • link 3.6 years ago by yaqinguo629 • 0


First off, remove all empty lines with sed '/^$/d'. Then, find a way to replace [ATGC]{3}\n[ATGC]{3} with the same thing sans \n, so that a sequence split across several lines ends up on a single line. Once these two steps are done, you can use cpad0112's awk (or the second awk command from oakhamwolf) to get to the final result.
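
One way to sketch those two clean-up steps in a single awk pass, under some assumptions: a sequence line is taken to be any line made up only of the letters A, C, G, T or N (upper or lower case), and no ID or species line matches that pattern. Instead of the regex replacement, the sketch accumulates consecutive sequence lines, and it also trims leading and trailing whitespace, which goes slightly beyond the two steps. The function name `normalize_records` is made up for illustration.

```shell
normalize_records() {
  # $1: mangled input file; cleaned lines are written to stdout.
  awk '
    { gsub(/^[ \t\r]+|[ \t\r]+$/, "") }        # trim surrounding whitespace
    $0 == "" { next }                          # step 1: drop empty lines
    /^[ACGTNacgtn]+$/ { seq = seq $0; next }   # step 2: collect wrapped sequence lines
    {
      if (seq != "") { print seq; seq = "" }   # a non-sequence line closes the previous record
      print
    }
    END { if (seq != "") print seq }           # flush the last sequence
  ' "$1"
}
```

The output has one sequence line per record, so `normalize_records test.fa > clean.fa` should leave a file that the three-line commands earlier in the thread can process.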

ADD REPLY • link 3.6 years ago by Ram 43k


Please don't use pre tags. Use the 101010 button on the toolbar instead.


ADD REPLY • link 3.6 years ago by Ram 43k


I GOT IT. THANKS VERY MUCH!

ID
#NEW LINES#
SEQUENCES
#NEW LINES#
ID
#NEW LINES#
SEQUENCES
SEQUENCES
#NEW LINES#

ADD REPLY • link 3.5 years ago by yaqinguo629 • 0


I was speaking to oakhamwolf - but I understand the confusion. Replies can get confusing.

ADD REPLY • link 3.5 years ago by Ram 43k