A quick way to connect two sets of multifasta files with a number of Ns in between sequences
7.8 years ago
xi100f • 0

Hi. I am looking for a quick way to connect two sets of multifasta files with a number of Ns in between the sequences. I would appreciate it if anyone could point me to an existing script or other solution.

sort of:

file1
>1_set1
AAA

>2_set1
CCC
...

file2

>1_set2
GGG

>2_set2
TTT

...

to get something like

>1_set1_set2
AAANNNGGG

>2_set1_set2
CCCNNNTTT

Sorry for formatting, I am kinda new here... Thanks in advance, Xi

sequence Assembly sequencing alignment • 1.5k views

Thanks for the replies! I can use grep 'set' | sed 's/>\|_set*//g' | sort | comm -3 file1 file2 to make a sorted list of common sequences, then use fastafetch -F to pull only the relevant sequence pairs. Some time ago I wrote a little Python script that removes the end-of-line characters within the sequence part and writes two lines per record, so it should be fast and easy to test your solutions. I will let you know how it went soon.

Thanks for help!
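
The unwrapping step mentioned above (joining wrapped sequence lines so that each record is exactly one header line plus one sequence line) could also be done in awk instead of Python. This is a minimal sketch, not the original script; the function name unwrap_fasta is made up for illustration, and it assumes every record starts with a ">" header line.

```shell
# Hypothetical helper: unwrap a FASTA file to two lines per record.
# Usage: unwrap_fasta input.fasta > output.fasta
unwrap_fasta() {
  awk '
    /^>/ {                       # header line
      if (seq != "") print seq   # flush the sequence of the previous record
      print                      # print the header unchanged
      seq = ""
      next
    }
    {                            # sequence (or blank) line
      gsub(/[ \t\r]/, "")        # drop stray whitespace and carriage returns
      seq = seq $0               # append to the current sequence
    }
    END { if (seq != "") print seq }
  ' "$1"
}
```

Blank lines between records are dropped as a side effect, which is what the paste-based solutions in this thread need.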

7.8 years ago

Assuming both files have two lines (header + sequence) per record and list the records in the same order:

paste file1.fasta file2.fasta | \
awk -F '\t' '{if(NR%2==1) {printf(">%s_and_%s\n",substr($1,2),substr($2,2));} else  {printf("%sNNN%s\n",$1,$2);}}'

Slight modification for the awk part to match the requested headers in original post

awk -F '\t' '{if(NR%2==1) {printf(">%s%s\n",substr($1,2),substr($2,3));} else  {printf("%sNNN%s\n",$1,$2);}}'

Yes this worked a treat! Exactly what I was looking for. Thanks!
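
Since the question asks for "a number of" Ns, the spacer length in the paste + awk approach can be made a parameter. The following is only a sketch under the same assumptions as the answer (two lines per record, no blank lines, records in the same order in both files); the function name join_with_spacer and its third argument are invented for this example.

```shell
# Hypothetical helper: join paired records with a spacer of n Ns.
# Usage: join_with_spacer file1.fasta file2.fasta 10 > union.fasta
join_with_spacer() {
  paste "$1" "$2" | awk -F '\t' -v n="$3" '
    # Build the spacer once: n copies of "N".
    BEGIN { for (i = 0; i < n; i++) spacer = spacer "N" }
    # Odd lines are headers: merge both names, dropping the leading ">".
    NR % 2 == 1 { printf(">%s_and_%s\n", substr($1, 2), substr($2, 2)); next }
    # Even lines are sequences: concatenate them around the spacer.
    { printf("%s%s%s\n", $1, spacer, $2) }
  '
}
```

With a spacer length of 3 this produces the same output as the original one-liner.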


How do I delete replies/comments?

7.8 years ago
5heikki 11k

Also assuming that the sequences are in the same order in both files and that each record is two lines (header + sequence):

paste File1 File2 | sed -e 's/\t/NNN/' -e 's/NNN>/_/' > union

Slight modification of the sed part to match the original header request.

sed -e 's/\t/NNN/' -e 's/NNN>/_/' -e 's/_[0-9]\+//'

This one produced the same output as the awk solution, and it was noticeably faster than awk (my file has 53000 lines). Thanks!
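
The paste-based answers in this thread pair records by position, so they rely on both files listing the same sequences in the same order. If that cannot be guaranteed, records could be matched by ID instead. The sketch below is one possible way, not something tested by the posters: it assumes two lines per record, and that the shared ID is the part of the header before the first underscore (as in 1_set1 and 1_set2). The function name join_by_id is hypothetical.

```shell
# Hypothetical helper: pair records by the ID prefix before the first "_".
# Usage: join_by_id file1.fasta file2.fasta > union.fasta
join_by_id() {
  # file2 is read first and kept in memory; file1 is then streamed,
  # so the output follows the record order of file1.
  awk '
    NF == 0 { next }                                   # skip blank lines
    /^>/ {                                             # header: remember name and key
      name = substr($0, 2)
      key = name
      sub(/_.*/, "", key)                              # key = text before first "_"
      next
    }
    NR == FNR { seq2[key] = $0; name2[key] = name; next }   # still reading file2
    (key in seq2) {                                    # file1 record with a partner
      printf(">%s_and_%s\n%sNNN%s\n", name, name2[key], $0, seq2[key])
    }
  ' "$2" "$1"
}
```

Records present in only one of the two files are silently dropped, which may or may not be the desired behaviour.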
