Question

A quick idea for how to connect two sets of multifasta files with a number of Ns in between sequences

7.8 years ago

xi100f • 0

Hi. I am looking for a quick way to connect two sets of multifasta files, record by record, with a number of Ns in between the sequences. I would appreciate it if anyone could point me to an existing script or other solution.

sort of:

file1
>1_set1
AAA

>2_set1
CCC
...

file2

>1_set2
GGG

>2_set2
TTT

...

to get something like

>1_set1_set2
AAANNNGGG

>2_set1_set2
CCCNNNTTT

Sorry for the formatting, I am kinda new here... Thanks in advance, Xi

sequence Assembly sequencing alignment • 1.5k views

updated 7.8 years ago by Pierre Lindenbaum 161k • written 7.8 years ago by xi100f • 0

Thanks for the replies! I can use grep 'set' | sed 's/>\|_set*//g' | sort | comm -3 file1 file2 to make a sorted list of common sequences, then use fastafetch -F to pull only the relevant sequence pairs. I wrote a little Python script some time ago which removes the end-of-line characters from the sequence part and writes two lines per record. After that it should be fast and easy to test your solutions. I will let you know how it went soon.

Thanks for help!

updated 7.8 years ago by GenoMax 141k • written 7.8 years ago by xi100f • 0
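The comment above describes a script that strips line breaks from the sequence part and writes two lines per record. That script is not shown in the thread, so the following is only a minimal sketch of the idea; the function names `read_fasta` and `linearize` are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header is returned without '>'."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def linearize(lines: Iterable[str]) -> List[str]:
    """Return the records as exactly two lines each: header, then sequence."""
    out: List[str] = []
    for header, seq in read_fasta(lines):
        out.append(">" + header)
        out.append(seq)
    return out


if __name__ == "__main__":
    wrapped = [">1_set1", "AA", "A", "", ">2_set1", "CCC"]
    print("\n".join(linearize(wrapped)))
```

Both inputs need this two-line layout before any line-by-line pairing (for example with paste) can work.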

score 2 · Answer 1 · 2016-06-21

7.8 years ago

Pierre Lindenbaum 161k

Assuming both files have exactly two lines (header + sequence) per record:

paste file1.fasta file2.fasta | \
awk -F '\t' '{if(NR%2==1) {printf(">%s_and_%s\n",substr($1,2),substr($2,2));} else  {printf("%sNNN%s\n",$1,$2);}}'
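For readers who prefer to see the same logic spelled out, here is a rough Python equivalent of the paste + awk pipeline above. It is a sketch under the same assumption (two lines per record, same record order in both files); the function name `join_two_line_fasta` and the extra length and header checks are additions for illustration, not part of the original answer.

```python
from itertools import zip_longest
from typing import Iterable, List


def join_two_line_fasta(lines1: Iterable[str], lines2: Iterable[str],
                        spacer: str = "NNN") -> List[str]:
    """Pair line i of one input with line i of the other, as `paste` would."""
    out: List[str] = []
    for i, (a, b) in enumerate(zip_longest(lines1, lines2)):
        if a is None or b is None:
            raise ValueError("inputs have different numbers of lines")
        a, b = a.rstrip("\n"), b.rstrip("\n")
        if i % 2 == 0:
            # Header lines: awk's NR%2==1, because NR starts at 1.
            if not (a.startswith(">") and b.startswith(">")):
                raise ValueError(f"expected FASTA headers on line {i + 1}")
            out.append(f">{a[1:]}_and_{b[1:]}")
        else:
            # Sequence lines: concatenate with the spacer in between.
            out.append(f"{a}{spacer}{b}")
    return out


if __name__ == "__main__":
    file1 = [">1_set1", "AAA", ">2_set1", "CCC"]
    file2 = [">1_set2", "GGG", ">2_set2", "TTT"]
    print("\n".join(join_two_line_fasta(file1, file2)))
```

Unlike the one-liner, this version stops with an error when the inputs differ in length instead of silently producing half-empty records.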

7.8 years ago by Pierre Lindenbaum 161k

A slight modification of the awk part to match the headers requested in the original post:

awk -F '\t' '{if(NR%2==1) {printf(">%s%s\n",substr($1,2),substr($2,3));} else  {printf("%sNNN%s\n",$1,$2);}}'
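Note that `substr($2,3)` skips the `>` and exactly one more character, so it fits single-character record IDs such as `1` and `2` in the example. A sketch of a more general header merge follows, assuming (this is an assumption, not something stated in the thread) that the record ID is everything before the first underscore; `merged_header` is a hypothetical helper name.

```python
def merged_header(header1: str, header2: str) -> str:
    """Build '>ID_rest1_rest2' from two headers that share the same ID.

    Assumption: the ID is the text before the first underscore.
    """
    id1, _, rest1 = header1.lstrip(">").strip().partition("_")
    id2, _, rest2 = header2.lstrip(">").strip().partition("_")
    if id1 != id2:
        raise ValueError(f"record IDs differ: {id1!r} vs {id2!r}")
    return f">{id1}_{rest1}_{rest2}"


if __name__ == "__main__":
    print(merged_header(">1_set1", ">1_set2"))
    print(merged_header(">10_set1", ">10_set2"))
```

Raising on mismatched IDs also catches the case where the two files are not in the same order.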

7.8 years ago by GenoMax 141k

Yes, this worked a treat! Exactly what I was looking for. Thanks!

7.8 years ago by xi100f • 0

How to delete replies/comments?

7.8 years ago by xi100f • 0

score 2 · Answer 2 · 2016-06-21

7.8 years ago

5heikki 11k

Also assuming that the sequences are in the correct order in both files:

paste File1 File2 | sed -e 's/\t/NNN/' -e 's/NNN>/_/' > union
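Because paste pairs lines purely by position, it may be worth confirming the record order before joining. The sketch below reports the positions at which the two header lists disagree; it assumes the record ID is the text before the first underscore, and `order_mismatches` is a hypothetical helper name.

```python
from typing import List, Sequence


def record_id(header: str) -> str:
    """Return the ID part of a header (assumed: text before the first '_')."""
    return header.lstrip(">").strip().split("_", 1)[0]


def order_mismatches(headers1: Sequence[str], headers2: Sequence[str]) -> List[int]:
    """Return 1-based record positions where the two files have different IDs."""
    if len(headers1) != len(headers2):
        raise ValueError("files contain different numbers of records")
    return [
        pos
        for pos, (h1, h2) in enumerate(zip(headers1, headers2), start=1)
        if record_id(h1) != record_id(h2)
    ]


if __name__ == "__main__":
    print(order_mismatches([">1_set1", ">2_set1"], [">1_set2", ">2_set2"]))
    print(order_mismatches([">1_set1", ">2_set1"], [">2_set2", ">1_set2"]))
```

An empty list means the records line up and position-based pairing is safe.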

7.8 years ago by 5heikki 11k

Slight modification of the sed part to match the original header request.

sed -e 's/\t/NNN/' -e 's/NNN>/_/' -e 's/_[0-9]\+//'

7.8 years ago by GenoMax 141k

This one produced the same output as the awk solution, and it was noticeably faster than awk (my file has 53,000 lines). Thanks!

7.8 years ago by xi100f • 0