How to add > to FASTA file headers and merge two-line headers into one line (see below)
3.6 years ago

Q1: add > to each header; Q2: merge the two header lines into a single line, keeping a space between them; Q3: remove the blank lines between header and sequence.

MT657978.1

Acaulospora foveata isolate 

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

MT626044.1

Claroideoglomus etunicatum 

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Hi, everyone. I have a sequence file like the one above, but I want it to look like this:

>MT657978.1 Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
>MT626044.1 Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA
sequence next-gen • 1.1k views

Use awk and operate on NR. Run it once for every first line and once for every second line, in separate subshells, and paste their output together with a single space as the separator. Then paste that result together with the output of a similar awk operating on every third line, using a unique delimiter that you afterwards replace with a newline character using sed.
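A rough sketch of that idea, in case it helps. It assumes exactly three lines per record (ID, description, sequence), no blank lines, and that "@" never occurs in the data. The intermediate file names (ids.txt, desc.txt, seqs.txt, headers.txt) are made up for the example, and temporary files stand in for the subshells so that it runs in any POSIX shell.

```shell
#!/bin/sh
# Sketch only: assumes three lines per record and no blank lines.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample input from the question, with the blank lines already removed.
cat > test.fa <<'EOF'
MT657978.1
Acaulospora foveata isolate
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
MT626044.1
Claroideoglomus etunicatum
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA
EOF

awk 'NR % 3 == 1 { print ">" $0 }' test.fa > ids.txt   # every first line, ">" added
awk 'NR % 3 == 2' test.fa > desc.txt                   # every second line
awk 'NR % 3 == 0' test.fa > seqs.txt                   # every third line

# Join ID and description with a space.
paste -d ' ' ids.txt desc.txt > headers.txt

# Attach the sequence with "@" (assumed absent from the data), then turn
# "@" into a newline. The backslash-newline form works in GNU and BSD sed.
result=$(paste -d '@' headers.txt seqs.txt | sed 's/@/\
/')
printf '%s\n' "$result"
```

In bash the three awk calls can be handed to paste directly with process substitution, e.g. paste -d ' ' <(awk ...) <(awk ...), which avoids the temporary files.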


input:

$ cat test.fa 

MT657978.1
Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
MT626044.1
Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Output:

$ awk 'NR%3==1 {getline seq;print ">"$0,seq}; NR %3 ==0 {print}' test.fa

>MT657978.1 Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
>MT626044.1 Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Later in the thread, OP says that they could have multiple sequence lines, so NR%x is not going to work. OP's data is quite mangled.

 MT657978

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

AB626044

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Hello, everyone. If the sequences look like this, it is another story. How do I add > to the header and remove the blank lines? Using the script above, I couldn't add > at all. Sorry, I am a novice. I have many questions related to this. I really appreciate your effort.

3.6 years ago
oakhamwolf ▴ 20

Assuming that your file has a consistent structure of

ID
#new line#
Genus species
#new line#
sequence
#new line#

then the following will work:

cat test | paste - - - - - - | awk -F "\t" -v OFS="\n" '{print ">"$1" "$3,$5;}'

If you don't have the #new line# lines in there then use:

cat test | paste - - - | awk -F "\t" -v OFS="\n" '{print ">"$1" "$2,$3;}'

Hope this helps


Thanks very much. Yep, if I have a consistent structure like you proposed, the first one works perfectly. Thanks again! But the issue is that I found out my fasta file doesn't have a consistent structure (see below), so do you have any solution for that?

ID-1
#new line#
Genus species
#new line#
sequence
#new line#
ID-2
#new line#
Genus species
#new line#
sequence
#new line#
sequence
#new line#

First off, remove all empty lines with sed '/^$/d'. Then find a way to join sequence lines that were split across a line break, i.e. replace [ATGC]{3}\n[ATGC]{3} with the same thing sans \n. Once these two steps are done, you can use cpad0112's awk (or the second awk command from oakhamwolf) to get to the final result.
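For what it's worth, both steps and the header merge can also be folded into a single awk pass. This is only a sketch under assumptions the thread does not confirm: sequence lines contain nothing but A, C, G, T or N, and every other non-blank line is header text. The file names mangled.txt and fixed.fa are made up for the example.

```shell
#!/bin/sh
# Sketch only. Assumes sequence lines consist solely of A/C/G/T/N and that
# any other non-blank line belongs to the header.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical input: blank lines, stray indentation, one wrapped sequence.
cat > mangled.txt <<'EOF'
 MT657978.1

Acaulospora foveata isolate

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

MT626044.1

Claroideoglomus etunicatum

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGG

CTACCACATCCAAGGAAGGCAGCAGGCGCGCAA
EOF

awk '
    { gsub(/^[ \t]+|[ \t]+$/, "") }             # trim leading/trailing blanks
    $0 == "" { next }                            # drop empty lines
    /^[ACGTNacgtn]+$/ { seq = seq $0; next }     # sequence line: append to record
    {                                            # anything else is header text
        if (seq != "") {                         # new record begins: flush the old one
            print ">" hdr; print seq
            hdr = ""; seq = ""
        }
        hdr = (hdr == "" ? $0 : hdr " " $0)      # merge header lines with a space
    }
    END { if (hdr != "") { print ">" hdr; print seq } }   # flush the last record
' mangled.txt > fixed.fa

cat fixed.fa
```

Because records are delimited by the switch from sequence lines back to header lines rather than by NR, the same command also copes with the ID-plus-sequence layout that has no description line. It will misfire if an ID happens to consist only of the letters A, C, G, T and N.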


Please don't use pre tags. Use the 101010 button on the toolbar instead.


I GOT IT. THANKS VERY MUCH!

ID
#NEW LINES#
SEQUENCES
#NEW LINES#
ID
#NEW LINES#
SEQUENCES
SEQUENCES
#NEW LINES#

I was speaking to oakhamwolf - but I understand the confusion. Replies can get confusing.
