I am aligning CDS sequences that I downloaded from Ensembl to some published draft assemblies. I used LAST to align them, and it appears to have worked; here is an example:
# name start alnSize strand seqSize alignment
# batch 1
a score=54 EG2=6.6e-09 E=8.8e-16
s L_fuelleborni.EQ177488.1 627 100 + 1093 GTGCTCCACTCTGCTCATTGTAGGCGATAATTCTCCAGCTGTGGAAGCTGTGGTGGACTGCAACTCTAA
s O_niloticus.ENSONIG00000002093_ENSONIT00000002611 704 100 + 1074 GTGCCCTTCTCTGCTGGTGGTTGGCGATAGCTCTCCTGCCGTGGAGGCCGTGGTTGAGTGCAACACTAA
a score=42 EG2=0.0034 E=4.7e-10
s L_fuelleborni.EQ177488.1 278 148 + 1093 AGGATGAAATCCTGACCAATCACGACCTCATCGCCACATACCGCCACCGCatcacaacaacaatgaACC
s O_niloticus.ENSONIG00000002093_ENSONIT00000002611 541 148 + 1074 AGGAGGAAATCCACCACAACCATGATCTAATCGCCACATACCGCCACCACATCATGAATGACATGAACC
a score=41 EG2=0.01 E=1.4e-09
s L_fuelleborni.ABPK01036261.1 317 93 + 1071 ACTGACCTTTGAAGCAGCCCAGTCAATCCAGCCTTCAGCACACGGGTCGACATTAATGAGTACCAGCCCC
s O_niloticus.ENSONIG00000002093_ENSONIT00000002611 582 93 - 1074 ACTGATCTTGTGTGCAGCCCAGTCCATCCATCCCTCAGCACACGAGTTGATGTTGATGAGAACAAGGCCT |
What I can't figure out is this: for most of the query CDSs (the "O_niloticus" lines in the example), some parts of the transcript align to the + strand of the assembly and other parts to the - strand. This seems to happen only when a single CDS aligns to several different contigs (e.g. L_fuelleborni.EQ177488.1 and L_fuelleborni.ABPK01036261.1 are different contigs, I think). In other words, all of the alignments of a CDS within any one contig seem to be on the same strand.
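The per-contig strand pattern described above can be checked with something like the following minimal Python sketch. It assumes each alignment block has exactly two `s` lines, with the assembly contig first and the query CDS second, as in the example; the function name `strands_by_contig` is made up for illustration.

```python
from collections import defaultdict


def strands_by_contig(maf_lines):
    """Map each contig name to the set of query strands aligned to it.

    Assumes two 's' lines per block: contig first, query CDS second.
    """
    strands = defaultdict(set)
    block = []
    for line in maf_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An 'a' line opens a new alignment block.
            block = []
        elif fields[0] == "s":
            block.append(fields)
            if len(block) == 2:
                contig_name = block[0][1]   # 's' line field 1 is the name
                query_strand = block[1][4]  # 's' line field 4 is the strand
                strands[contig_name].add(query_strand)
    return dict(strands)
```

A contig whose set contains both "+" and "-" would contradict the pattern; a contig with a single strand is consistent with it.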
So, my question is: why would this be the case? These are fairly closely related species, so I don't think chimerism or rearrangement is a plausible explanation (again, it's nearly every CDS). Is it possible that the orientation of the contigs in the assembly has not yet been determined? If so, can I just reverse complement the - strand alignments and join them (with spaces) to the + strand pieces? The end application is one FASTA file per query CDS, containing all the matching sequences from a number of other species (for PAML).
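A rough Python sketch of that stitching idea follows, under stated assumptions rather than as a tested pipeline. It assumes each block has already been parsed into the query start, query strand, query sequence size, and the two aligned texts; it relies on the MAF convention that a "-" strand start is counted on the reverse-complemented sequence; and it pads with "-" gap characters instead of spaces, which is a guess at what the downstream alignment should contain. The function names are hypothetical.

```python
# Complement table; gap characters are left unchanged.
COMPLEMENT = str.maketrans("ACGTacgtNn-", "TGCAtgcaNn-")


def revcomp(seq):
    """Reverse complement a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def project_onto_cds(q_start, q_strand, q_size, q_text, t_text):
    """Return (forward CDS start, contig bases per CDS position) for one block."""
    # Keep only columns where the CDS row has a base, so the result has
    # exactly one character for each covered CDS position.
    cols = "".join(t for q, t in zip(q_text, t_text) if q != "-")
    if q_strand == "-":
        # Convert the reverse-strand start to a forward-strand start.
        q_start = q_size - (q_start + len(cols))
        cols = revcomp(cols)
    return q_start, cols


def stitch(blocks, cds_len):
    """Lay all blocks for one CDS onto a gap-padded row of length cds_len."""
    row = ["-"] * cds_len
    for q_start, q_strand, q_size, q_text, t_text in blocks:
        start, cols = project_onto_cds(q_start, q_strand, q_size, q_text, t_text)
        # Later blocks overwrite earlier ones where they overlap.
        row[start:start + len(cols)] = cols
    return "".join(row)
```

For example, `stitch([(0, "+", 10, "ACG", "ACT"), (0, "-", 10, "AA-C", "AGTC")], 10)` places the first piece at CDS positions 0 to 2 and the reverse-complemented second piece at positions 7 to 9. Overlapping hits are resolved here simply by letting the last block win, which may not be the right choice for real data.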
Any insight would be greatly appreciated!