Question: How to add > to FASTA file headers and merge two-line headers into one line (see below)
yaqinguo629 wrote:

Q1: add > to each header; Q2: merge the two header lines into one line, separated by a space; Q3: remove the blank lines between header and sequence.

MT657978.1

Acaulospora foveata isolate 

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

MT626044.1

Claroideoglomus etunicatum 

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Hi, everyone. I have a sequence file like the one above, but I want it to look like this:

>MT657978.1 Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
>MT626044.1 Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA
modified 7 days ago • written 12 days ago by yaqinguo629

Use awk and select lines by NR. Run one awk for the first line of each record and another for the second line, in separate subshells, and paste their output together with a space as the separator. Then paste that result with the output of a similar awk that selects every third line, using a unique delimiter that you afterwards replace with a newline character using sed.
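
A minimal sketch of the steps described above, not code posted in the thread. It assumes strict three-line records (ID, description, sequence) with no blank lines, that `@` never occurs in the data, and that `mktemp -d` is available. Temporary files stand in for the subshells and `tr` stands in for sed so that it runs under plain sh; the function name `merge_three_line_records` is made up for illustration.

```shell
# merge_three_line_records FILE
# Assumes: records are exactly 3 lines (ID, description, sequence),
# no blank lines, and '@' does not appear anywhere in FILE.
merge_three_line_records() {
  tmpdir=$(mktemp -d) || return 1
  # Line 1 of each record: the ID, with ">" prepended.
  awk 'NR%3==1 {print ">" $0}' "$1" > "$tmpdir/id"
  # Line 2 of each record: the description.
  awk 'NR%3==2' "$1" > "$tmpdir/desc"
  # Line 3 of each record: the sequence.
  awk 'NR%3==0' "$1" > "$tmpdir/seq"
  # Join ID and description with a space, attach the sequence with '@',
  # then turn '@' into a newline.
  paste -d ' ' "$tmpdir/id" "$tmpdir/desc" |
    paste -d '@' - "$tmpdir/seq" |
    tr '@' '\n'
  rm -rf "$tmpdir"
}
```

Usage would be `merge_three_line_records test.fa > fixed.fa`. If `@` can occur in the descriptions, pick another delimiter that cannot.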

written 12 days ago by RamRS

input:

$ cat test.fa 

MT657978.1
Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
MT626044.1
Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Output:

$ awk 'NR%3==1 {getline seq;print ">"$0,seq}; NR %3 ==0 {print}' test.fa

>MT657978.1 Acaulospora foveata isolate 
AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC
>MT626044.1 Claroideoglomus etunicatum 
ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA
written 11 days ago by cpad0112

Later in the thread, OP says that they could have multiple sequence lines, so NR%x is not going to work. OP's data is quite mangled.

written 11 days ago by RamRS
 MT657978

AAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTC

AB626044

ACATACGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAA

Hello, everyone. If the sequences look like this, it is a different story. How do I add > to the header and remove the spaces? Using the earlier script, I couldn't add > at all. Sorry, I am a novice and have many questions about this. I really appreciate your effort.

modified 7 days ago by RamRS • written 7 days ago by yaqinguo629
oakhamwolf wrote:

Assuming that your file has a consistent structure of

ID
#new line#
Genus species
#new line#
sequence
#new line#

then the following will work:

cat test | paste - - - - - - | awk -F "\t" -v OFS="\n" '{print ">"$1" "$3,$5;}'

If you don't have the #new line# lines in there then use:

cat test | paste - - - | awk -F "\t" -v OFS="\n" '{print ">"$1" "$2,$3;}'
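
To see why the number of dashes matters: each `-` makes paste read one more line from standard input, so `paste - - -` folds every three input lines into one tab-separated row, which awk then splits on tabs. A tiny sketch (the function name `fold3` is made up for illustration):

```shell
# Fold every three lines of standard input into one tab-separated row.
fold3() {
  paste - - -
}

# Six input lines become two rows of three tab-separated fields each.
printf 'ID1\nGenus one\nACGT\nID2\nGenus two\nGGCC\n' | fold3
```

With blank lines between the fields, six dashes are needed per record, and the useful fields end up in columns 1, 3 and 5.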

Hope this helps

modified 11 days ago by RamRS • written 12 days ago by oakhamwolf

Thanks very much. Yes, when the file has the consistent structure you describe, the first command works perfectly. Thanks again! The problem is that my FASTA file turns out not to have a consistent structure (see below). Do you have a solution for that?

ID-1
#new line#
Genus species
#new line#
sequence
#new line#
ID-2
#new line#
Genus species
#new line#
sequence
#new line#
sequence
#new line#
modified 11 days ago • written 11 days ago by yaqinguo629

First off, remove all empty lines with sed '/^$/d'. Then, find a way to replace [ATGC]{3}\n[ATGC]{3} with the same thing sans \n, so that each sequence sits on a single line. Once these two steps are done, you can use cpad0112's awk (or the second awk command from oakhamwolf) to get to the final result.
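
As an alternative sketch, not code from the thread, the same clean-up can be done in a single awk pass that does not depend on line counts. It rests on one loud assumption: after trimming whitespace, any line made up only of the letters A, C, G, T, U or N is sequence, and every other non-blank line is header text. An ID that happens to consist only of those letters would be misread as sequence. The function name `to_fasta` is made up for illustration.

```shell
# to_fasta [FILE...]
# Assumption: a line containing only A/C/G/T/U/N letters is sequence;
# any other non-blank line is part of a header.
to_fasta() {
  awk '
    { gsub(/^[ \t\r]+|[ \t\r]+$/, "") }   # trim leading/trailing whitespace
    $0 == "" { next }                      # drop blank lines
    /^[ACGTUNacgtun]+$/ {                  # sequence line
        if (hdr != "") { print ">" hdr; hdr = "" }   # flush pending header
        seq = seq $0                       # join wrapped sequence lines
        next
    }
    {                                      # header line
        if (seq != "") { print seq; seq = "" }       # flush previous sequence
        hdr = (hdr == "" ? $0 : hdr " " $0)          # merge header lines with a space
    }
    END {
        if (hdr != "") print ">" hdr
        if (seq != "") print seq
    }
  ' "$@"
}
```

Usage would be `to_fasta test.fa > fixed.fa`. Because it keys on line content rather than NR, it copes with records that have one or two header lines and one or several sequence lines.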

modified 11 days ago • written 11 days ago by RamRS

Please don't use pre tags. Use the 101010 button on the toolbar instead.

written 11 days ago by RamRS

I got it. Thanks very much!

ID
#new lines#
sequences
#new lines#
ID
#new lines#
sequences
sequences
#new lines#
modified 11 days ago • written 11 days ago by yaqinguo629

I was speaking to oakhamwolf - but I understand the confusion. Replies can get confusing.

written 10 days ago by RamRS