Hope you're well. I started a PhD in Jan this year and frankly I am struggling. I'm coming into the second week of trying to figure out how to attempt this problem and it's honestly starting to get to me a bit - the current task I'm working on is not my main project and every day I don't make progress is a delay on my actual project. So! A bit worried to say the least (!)
Here is the issue: I ran some sequences using BLASTn. I downloaded the Aligned Fasta file for each. I have sorted reads by if they are a single read (ie the accession number only appears once) or multiple reads of the same entry (example below, edited for brevity):
>XM_042246932|1:88-371_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182) TGAGGGCTGTCCTTATTCTGGGGCTTCTCCTCCTTTCTGTCACAGTCCAGGGCAAGAAATTTCAGAGGTGTGAGCTTGCC.. >XM_042246932|1:388-668_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182) GGTTGTGTTTGACCAAATGGGAAAGCAGTTATAACACAAAAACGACAAACTACAATCCTAACAGTGAAAGCACTGATTAT...
This is so I can use the seqkit to combine the multiple hit entries. Once I've done that, I'll be filtering based on query sequence length (using bioawk which I know how to do) and carrying on with the pipeline (alignment, analysis, ect.).
The problem is, some of these multiple hits are overlapping and I have no idea how to treat them! For example:
>XM_023558748|1:**49-362**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513) ATGAAGGCTCTCCTTATTCTGGGCCTTCTCTTCCTTTCTGGCACTGCCCAGGGCAAGGTCTTTGAAAGGTGCGAGTTGGC.. >XM_023558748|1:**349-416**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513) GGGTGGCATGGAGAAACCATTGTCAGAACCACGATGTCACTCAGTACATTCAAGGCTGTGGAGTCTAA
My two questions:
Is there a way I could isolate these overlapping sequences from the nonoveralpping multiple hits? I just can't fathom how I can achieve that right now with my current skill set.
Aside from using the EMBOSS merger command, is there another way for me to merge these overlapping files together? I did give it a go and it had wonderfully merged the above Elephant example correctly. But I am going to be dealing with thousands of hits, is it possible to write a python script to run merger through the overlapping files? I realise this is a tall order. I'd try to configure this script to use seqkit to merge the non-overlapping multiple hits too.
Thank you kindly for reading - I have trawled these forums the last few months and frankly I only think I've managed to get where I am thanks to your answers. Any advice, suggestions or critiques are gratefully received.