Question

How to append strings (from one file) to Fasta headers (in another file)

2

7.6 years ago

al-ash ▴ 200

Hi! I have two input files. The first, fastas.txt, contains multiple FASTA sequences, as in the example below:

>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK

and a second file, additional_header_information.txt, with strings such as:

XYZ
aksjdkasdj

And I'd like to merge the strings from the second file with the headers in the first file to generate:

>XYZ_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
>aksjdkasdj_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK

I previously used bioawk to put one specific prefix to all FASTA headers:

bioawk -c fastx '{ print ">PREFIX_" $name "\n" $seq }' input.txt >output.txt

So I thought I might be able to use some sort of loop to make bioawk go through the lines of additional_header_information.txt and of fastas.txt and combine them...but I did not get anything functional.

I also tried to modify a Python script from the post "replace fasta headers with another name in a text file" (see the original script):

fasta = open('fastas.txt')
newnames = open('additional_header_information.txt')
newfasta = open('additional_header_information_fastas.txt', 'w')

for line in fasta:
    if line.startswith('>'):
        newname = newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()

so that it would not replace the FASTA headers but rather add a string from additional_header_information.txt to them, but this is also not working for me.

I'd be thankful for your tips on how to use bioawk for this, but any other solution would also be most welcome!

bioawk FASTA header append • 5.6k views

ADD COMMENT • link updated 7.6 years ago by Joe 21k • written 7.6 years ago by al-ash ▴ 200

score 4 · Answer 1 · 2016-09-18

4

7.6 years ago

shenwei356 8.4k

I assume that the number of lines in additional_header_information.txt equals the number of sequences in fastas.txt.

I use SeqKit (seqkit fx2tab) to convert the FASTA to tabular format, and then paste it together with additional_header_information.txt. awk then reorders the columns so the prefix comes first. Finally, seqkit tab2fx converts the tabular format back to FASTA.

$ paste <(seqkit fx2tab fastas.txt | cut -f 1,2 ) additional_header_information.txt      |   awk '{print $3"_"$1"\t"$2}'  |  seqkit tab2fx -w 0
>XYZ_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
>aksjdkasdj_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
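
The assumption above (one prefix line per sequence) can be checked before running the pipeline. A minimal sketch in Python; `count_records` is a hypothetical helper name and not part of SeqKit:

```python
def count_records(fasta_path, prefix_path):
    """Return (number of FASTA headers, number of non-blank prefix lines)."""
    with open(fasta_path) as fasta:
        # Every FASTA record starts with a line beginning with ">".
        n_headers = sum(1 for line in fasta if line.startswith(">"))
    with open(prefix_path) as prefixes:
        # Lines holding only whitespace (or a lone line ending) are not counted.
        n_prefixes = sum(1 for line in prefixes if line.strip())
    return n_headers, n_prefixes
```

Calling `count_records("fastas.txt", "additional_header_information.txt")` should return two equal numbers; if it does not, paste will pair prefixes with the wrong sequences.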

ADD COMMENT • link 7.6 years ago by shenwei356 8.4k

0

Almost there - only my output looks like:

>XYZ
_BlapFAR9_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKKLIPVRGDTSVEGLGLGPVERRTITERVSVIFHVAANVRF
>aksjdkasdj_BlucFAR9_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKKLIPVCGDTSVEGLGLGPVERRTITERVSVIFHVAANVRF

...it seems that the problem is the newline in additional_header_information.txt that separates the individual lines (when I test it on files with more than 2 FASTA records, every header except the last one contains the additional newline after running the command). Is there a way to keep the newline out of the new header?

PS. thanks for your explanation of each step!

ADD REPLY • link 7.6 years ago by al-ash ▴ 200

0

You mean this?

XYZ

aksjdkasdj 

anotherline

If yes, just filter the additional_header_information.txt file with awk by omitting blank lines.

$ paste <( seqkit fx2tab fastas.txt | cut -f 1,2 ) <( awk '{if (length($0) > 0) print} ' additional_header_information.txt )  | awk '{print $3"_"$1"\t"$2}'  |  seqkit tab2fx -w 0

ADD REPLY • link 7.6 years ago by shenwei356 8.4k

0

Actually my input files really look like

>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

and

info1
info2
info3
info4

and the output is

>info1
_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2
_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3
_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

So I'm not sure what is going on there...some excessive line breaks?

EDIT: explained: the CR+LF line endings came from the Windows text editor Notepad.

ADD REPLY • link 7.6 years ago by al-ash ▴ 200

score 3 · Answer 2 · 2016-09-18

3

7.6 years ago

Pierre Lindenbaum 161k

just paste and awk:

$ paste <(cat header.txt) <(cat input.fasta  | paste - - | cut -c 2-)  | awk '{printf(">%s_%s\n%s\n",$1,$2,$3);}'

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 161k

0

Great! But the output is actually the same as for the command using SeqKit: I still have the problem with line breaks in the FASTA headers (all but the last one in the list):

>info1
_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2
_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3
_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

Obviously there is something wrong with my input files - might you have any suggestions?

ADD REPLY • link 7.6 years ago by al-ash ▴ 200

1

Do your files come from Windows? Excel?

What is the output of

file header.txt 
file input.fa

if it's something like

 ASCII text, with CRLF, LF line terminators

then

1) you should remove the \r with tr -d '\r' (see https://en.wikipedia.org/wiki/Newline#Common_problems)
2) you should never use Excel
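
If `file` and `tr` are not at hand, the same check and cleanup can be sketched in Python. The function names `has_crlf` and `strip_cr` are made up for this illustration; the sketch deletes every carriage return, which is what `tr -d '\r'` does:

```python
def has_crlf(path):
    """Return True if the file contains at least one CRLF line ending."""
    with open(path, "rb") as handle:
        # Binary mode keeps the bytes untouched, so CR is still visible.
        return b"\r\n" in handle.read()


def strip_cr(src_path, dst_path):
    """Copy src_path to dst_path with every carriage return removed."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.replace(b"\r", b""))
```

For example, `strip_cr("header.txt", "header.unix.txt")` writes a cleaned copy and leaves the original untouched.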

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 161k

0

New.txt: ASCII text, with CRLF line terminators

Yes, to make the test files I used the Windows text editor Notepad (I run Linux in VirtualBox), which I thought was primitive enough not to introduce any unwanted characters. Obviously I was wrong, as I had no idea that Windows uses CR+LF line endings; thanks for your help!
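
For anyone adapting the Python script from the question, here is a minimal sketch that prepends each prefix to the header instead of replacing it. It assumes one non-blank prefix line per FASTA record; `prefix_headers` is a hypothetical name chosen for this example. Stripping the line ending from each prefix is what keeps the prefix and the old header on one line:

```python
def prefix_headers(fasta_path, prefix_path, out_path):
    """Copy a FASTA file, prepending one prefix and "_" to each header."""
    with open(fasta_path) as fasta, open(prefix_path) as prefixes, \
            open(out_path, "w") as out:
        # Lazily yield non-blank prefix lines without their line ending.
        names = (p.rstrip("\r\n") for p in prefixes if p.strip())
        for line in fasta:
            # Drop the line ending (LF, and a stray CR if one survived).
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                prefix = next(names, None)
                if prefix is None:
                    raise ValueError("fewer prefixes than FASTA headers")
                # line[1:] is the old header without its leading ">".
                out.write(">" + prefix + "_" + line[1:] + "\n")
            else:
                out.write(line + "\n")
```

With the file names from the question this would be called as `prefix_headers("fastas.txt", "additional_header_information.txt", "additional_header_information_fastas.txt")`.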

ADD REPLY • link 7.6 years ago by al-ash ▴ 200

score 0 · Answer 3 · 2016-09-20

0

7.6 years ago

Joe 21k

Many Linux distributions ship dos2unix or offer it as a package, and it can often deal with odd line endings etc. It doesn't always work, but it does no harm to run a file through it if the file comes from Windows.

ADD COMMENT • link 7.6 years ago by Joe 21k