How to append strings (from one file) to Fasta headers (in another file)
4.9 years ago
al-ash ▴ 150

Hi! I have two input files, fastas.txt with multiple FASTA sequences, such as shown in an example below:

>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK


and second file additional_header_information.txt with strings, such as:

XYZ
aksjdkasdj


And I'd like to merge the strings from the second file with the headers in the first file to generate:

>XYZ_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
>aksjdkasdj_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK


I previously used bioawk to put one specific prefix to all FASTA headers:

bioawk -c fastx '{ print ">PREFIX_" $name "\n" $seq }' input.txt > output.txt


So I thought I might be able to use some sort of loop to make bioawk go through the lines of additional_header_information.txt and of fastas.txt and combine them...but I did not get anything functional.

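I also tried to modify the Python script from the thread "replace fasta headers with another name in a text file" (the original script is shown here):

fasta = open('fastas.txt')
# The two open() calls below are not shown in the post; the output file name is a placeholder.
newnames = open('additional_header_information.txt')
newfasta = open('fastas_renamed.txt', 'w')

for line in fasta:
    if line.startswith('>'):
        # Replaces the whole header with the next line from newnames
        newname = newnames.readline()
        newfasta.write(newname)
    else:
        newfasta.write(line)

fasta.close()
newnames.close()
newfasta.close()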


My idea was that, instead of replacing the FASTA headers, it would prepend a string from additional_header_information.txt to each of them, but this is not working for me either.

I'd be thankful for tips on how to do this with bioawk, but any other solution is also most welcome!

bioawk FASTA header append • 3.4k views
4.9 years ago

I assume that the number of lines in additional_header_information.txt equals the number of sequences in fastas.txt.

I use SeqKit (seqkit fx2tab) to convert the FASTA to tabular format and paste it side by side with additional_header_information.txt. Then awk reorders the columns. Finally, seqkit tab2fx converts the tabular format back to FASTA.

$ paste <(seqkit fx2tab fastas.txt | cut -f 1,2) additional_header_information.txt | awk '{print $3"_"$1"\t"$2}' | seqkit tab2fx -w 0
>XYZ_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
>aksjdkasdj_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKK
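
For readers without SeqKit, the same pairing idea can be sketched in plain Python. This is only an illustration under the assumption stated above (one prefix per sequence, in the same order); the function name prefix_headers and the shortened sequences are made up for the example.

```python
def prefix_headers(fasta_lines, prefixes):
    """Prepend the i-th prefix to the i-th FASTA header.

    fasta_lines must be a list (it is scanned twice); prefixes is any iterable.
    """
    # Drop blank prefix lines and surrounding whitespace.
    prefixes = [p.strip() for p in prefixes if p.strip()]
    n_headers = sum(1 for line in fasta_lines if line.startswith(">"))
    if n_headers != len(prefixes):
        raise ValueError(f"{n_headers} headers but {len(prefixes)} prefixes")

    remaining = iter(prefixes)
    out = []
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # ">name" becomes ">prefix_name"
            out.append(f">{next(remaining)}_{line[1:]}")
        elif line:
            # Sequence lines are passed through unchanged.
            out.append(line)
    return out


# Shortened, made-up sequences; the header names follow the question.
fasta = [">Blap_contig79", "MSTDVDAK", ">Bluc_contig23663", "MSTNVDAK"]
extra = ["XYZ", "aksjdkasdj"]
print("\n".join(prefix_headers(fasta, extra)))
```

With real files, the two arguments could be filled with open("fastas.txt").readlines() and open("additional_header_information.txt").readlines(). The count check makes a mismatch between the two files fail loudly instead of silently shifting prefixes.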


Almost there - only my output looks like:

>XYZ
_BlapFAR9_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKKLIPVRGDTSVEGLGLGPVERRTITERVSVIFHVAANVRF
>aksjdkasdj_BlucFAR9_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSIDDRLKKMLELPLFDRLRKERPSNLKKLIPVCGDTSVEGLGLGPVERRTITERVSVIFHVAANVRF


...it seems that the problem is the newline in additional_header_information.txt that separates the individual lines: when I test it on files with more than two sequences, every header except the last one contains the extra line break after running the command. Is there a way to keep the newline out of the new header?

PS. thanks for your explanation of each step!


You mean this?

XYZ

aksjdkasdj

anotherline


If yes, just filter the additional_header_information.txt file with awk by omitting blank lines.

$ paste <( seqkit fx2tab fastas.txt | cut -f 1,2 ) <( awk '{if (length($0) > 0) print}' additional_header_information.txt ) | awk '{print $3"_"$1"\t"$2}' | seqkit tab2fx -w 0

Actually my input files really look like

>Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

and

info1
info2
info3
info4

and the output is

>info1
_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2
_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3
_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI

So I'm not sure what is going on there...some excessive line breaks?

EDIT: explained: CR+LF originating from the Windows text editor "notepad"

4.9 years ago

Just paste and awk:

$ paste <(cat header.txt) <(cat input.fasta | paste - - | cut -c 2-) | awk '{printf(">%s_%s\n%s\n",$1,$2,$3);}'
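
One thing to keep in mind with paste - - is that it joins every two consecutive lines, so it relies on each record being exactly one header line followed by one sequence line, as in the example files. If a FASTA file wraps sequences over several lines, it could be flattened first. Below is a minimal Python sketch of that flattening; the function name linearize and the toy records are made up for illustration.

```python
def linearize(fasta_lines):
    """Return header/sequence pairs with each sequence joined onto one line."""
    records = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line, ""])  # start a new record
        elif records:
            records[-1][1] += line  # extend the current sequence
        # Sequence lines that appear before any header are ignored.
    return [part for record in records for part in record]


# Toy records: the first sequence is wrapped over two lines.
wrapped = [">seq1", "MSTD", "VDAK", ">seq2", "MSTN"]
print(linearize(wrapped))
```

After this step every record occupies two lines, which is the layout the paste - - command expects.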


Great! But the output is actually the same as for the command using SeqKit: I still have the problem with line breaks in the FASTA headers (all but the last one in the list):

>info1
_Blap_contig79
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info2
_Bluc_contig23663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info3
_Blap_contig7988
MSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI
>info4_Bluc_contig1223663
MSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI


Obviously there is something wrong with my input files - might you have any suggestions?


Do your files come from Windows? Excel?

What is the output of

file header.txt
file input.fa


if it's something like

 ASCII text, with CRLF, LF line terminators


then

1) you should remove the \r with tr -d '\r' (see https://en.wikipedia.org/wiki/Newline#Common_problems)
2) you should never use excel
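
The check and the fix can also be sketched in Python, working on raw bytes so that nothing is translated behind the scenes. This is a rough illustration, not a replacement for file or tr; the function names count_line_endings and remove_cr and the sample bytes are made up for the example.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF and bare LF line terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    return {"CRLF": crlf, "LF": lf}


def remove_cr(data: bytes) -> bytes:
    """Delete every carriage return, like tr -d '\\r'."""
    return data.replace(b"\r", b"")


# Sample resembling the header file: CRLF between lines, none after the last.
raw = b"info1\r\ninfo2\r\ninfo3\r\ninfo4"
print(count_line_endings(raw))
print(remove_cr(raw))
```

For a real file, the bytes could be read with pathlib.Path("header.txt").read_bytes() and written back with write_bytes() after cleaning. In the sample, only the last line lacks a trailing \r, which would be consistent with the last header being the only one that came out intact.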

New.txt: ASCII text, with CRLF line terminators


Yes, to make the test files I used the Windows text editor "notepad" (I run Linux in VirtualBox), which I thought was primitive enough not to introduce any unwanted characters. Obviously I was wrong, as I had no idea that Windows uses something like CR+LF; thanks for your help!

4.9 years ago
Joe 19k

dos2unix is available on most Linux distributions (it may need to be installed through the package manager) and can often deal with odd line endings etc. It doesn't do any harm to run a file through it if the file comes from Windows.