Pairwise comparison of AA fasta files which contain multiple domains
3.9 years ago
MaheJaan • 0

I have about 5000 .fasta files, each containing two versions (transcripts) of the same gene, split into their domains, like so:

>zf-C4_1_ENST00000512784
RLCLVCGDIASGYHYGVASCEACKAFFKRTIQGNIEYSCPATNECEITKRRRKSCQACRF
MKCLKVGMLK

>Hormone_recep_1_ENST00000512784
IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQSAWMEILILGIVYRSLPYD
DKLVYAEDYIMDEEHSRLAGLLELYRAILQLVRRYKKLKVEKEEFVTLKALALANSDSMY
IEDLEAVQKLQDLLHEALQDYELSQRHEEPWRTGKLLLTLPLLRQTAAKAVQHFYSVKLQ

>zf-C4_1_ENST00000644823
RLCLVCGDIASGYHYGVASCEACKAFFKRTIQGNIEYSCPATNECEITKRRRKSCQACRF
MKCLKVGMLK

>Hormone_recep_1_ENST00000644823
IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQSAWMEILILGIVYRSLPYD
DKLVYAEDYIMDEEHSRLAGLLELYRAILQLVRRYKKLKVEKEEFVTLKALALANSDSMY
IEDLEAVQKLQDLLHEALQDYELSQRHEEPWRTGKLLLTLPLLRQTAAKAVQHFYSVKLQ

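I would like to run a pairwise comparison between the matching domains of the two transcripts, i.e. zf-C4_1 against zf-C4_1 and Hormone_recep_1 against Hormone_recep_1.

When I try to run this with msa(mySequences), the file is treated as 4 sequences to compare, rather than as two pairs of matching domains. Any help on how I can do this, and maybe how to write a loop in R for the other 5000 files?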

R • 535 views
3.9 years ago
Mensur Dlakic ★ 29k

You can use this script to sort your sequences by size, since there seems to be a clear difference in domain sizes. After that, split the resulting file into the two domains, so that each file holds one domain from both transcripts, and proceed as planned.

The syntax:

sort_contigs.pl -b -z -p your_file.fas sorted_file.fas
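
If the domain sizes are not always distinct, another option is to group records by the domain name in the header instead of by length. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes every header has the form <domain>_<transcript ID>, as in the example above. The function names (read_fasta, group_by_domain, compare_domains, compare_directory) are made up for illustration, and difflib.SequenceMatcher is only a crude stand-in for a real pairwise aligner.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path

# Two domains from the question; the Hormone_recep_1 sequences are
# truncated to their first line to keep the example short.
EXAMPLE = """\
>zf-C4_1_ENST00000512784
RLCLVCGDIASGYHYGVASCEACKAFFKRTIQGNIEYSCPATNECEITKRRRKSCQACRF
MKCLKVGMLK
>Hormone_recep_1_ENST00000512784
IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQSAWMEILILGIVYRSLPYD
>zf-C4_1_ENST00000644823
RLCLVCGDIASGYHYGVASCEACKAFFKRTIQGNIEYSCPATNECEITKRRRKSCQACRF
MKCLKVGMLK
>Hormone_recep_1_ENST00000644823
IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQSAWMEILILGIVYRSLPYD
"""


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_domain(records):
    """Group sequences as {domain: {transcript: sequence}}.

    Assumption: the transcript ID is everything after the last underscore.
    A domain name repeated within one transcript would overwrite the
    earlier entry.
    """
    groups = defaultdict(dict)
    for header, seq in records:
        domain, transcript = header.rsplit("_", 1)
        groups[domain][transcript] = seq
    return groups


def compare_domains(groups):
    """Return (domain, id_a, id_b, similarity) for each transcript pair."""
    results = []
    for domain, by_transcript in groups.items():
        pairs = combinations(sorted(by_transcript.items()), 2)
        for (id_a, seq_a), (id_b, seq_b) in pairs:
            # Crude similarity in [0, 1]; swap in a real aligner here.
            matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
            results.append((domain, id_a, id_b, matcher.ratio()))
    return results


def compare_directory(directory, pattern="*.fasta"):
    """Yield (file name, domain, id_a, id_b, similarity) for every file."""
    for path in sorted(Path(directory).glob(pattern)):
        with open(path) as handle:
            groups = group_by_domain(read_fasta(handle))
        for row in compare_domains(groups):
            yield (path.name, *row)


for row in compare_domains(group_by_domain(read_fasta(EXAMPLE.splitlines()))):
    print(row)
```

For the example records, both domains are identical between the two transcripts, so each pair scores 1.0. To process all 5000 files, compare_directory("your_fasta_dir") can be iterated the same way, where your_fasta_dir is a placeholder for the real directory. The same grouping step would also work before calling an aligner in R.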