Is it possible to isolate and merge overlapping hits from a BLAST search Aligned Fasta file?
1
0
Entering edit mode
4 months ago
Leah ▴ 10

Hello all,

Hope you're well. I started a PhD in Jan this year and frankly I am struggling. I'm coming into the second week of trying to figure out how to attempt this problem and it's honestly starting to get to me a bit - the current task I'm working on is not my main project and every day I don't make progress is a delay on my actual project. So! A bit worried to say the least (!)

Here is the issue: I ran some sequences using BLASTn. I downloaded the Aligned Fasta file for each. I have sorted reads by if they are a single read (ie the accession number only appears once) or multiple reads of the same entry (example below, edited for brevity):

SINGLE HIT

>XM_040233432|1:208-654_PREDICTED:_Oryx_dammah_lysozyme_C_kidney_isozyme_(LOC120857726)_mRNA
ATGAAGGTTCTCCTTATTCTGGGGCTTCTCCTCCTTTCAGTCGCTGTCCAAGGCAAGGTCTTTGAGAGATGTGAGCT...


MULTIPLE HIT

>XM_042246932|1:88-371_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
TGAGGGCTGTCCTTATTCTGGGGCTTCTCCTCCTTTCTGTCACAGTCCAGGGCAAGAAATTTCAGAGGTGTGAGCTTGCC..
>XM_042246932|1:388-668_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
GGTTGTGTTTGACCAAATGGGAAAGCAGTTATAACACAAAAACGACAAACTACAATCCTAACAGTGAAAGCACTGATTAT...


This is so I can use the seqkit to combine the multiple hit entries. Once I've done that, I'll be filtering based on query sequence length (using bioawk which I know how to do) and carrying on with the pipeline (alignment, analysis, ect.).

The problem is, some of these multiple hits are overlapping and I have no idea how to treat them! For example:

>XM_023558748|1:**49-362**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
ATGAAGGCTCTCCTTATTCTGGGCCTTCTCTTCCTTTCTGGCACTGCCCAGGGCAAGGTCTTTGAAAGGTGCGAGTTGGC..

>XM_023558748|1:**349-416**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
GGGTGGCATGGAGAAACCATTGTCAGAACCACGATGTCACTCAGTACATTCAAGGCTGTGGAGTCTAA


My two questions:

1. Is there a way I could isolate these overlapping sequences from the nonoveralpping multiple hits? I just can't fathom how I can achieve that right now with my current skill set.

2. Aside from using the EMBOSS merger command, is there another way for me to merge these overlapping files together? I did give it a go and it had wonderfully merged the above Elephant example correctly. But I am going to be dealing with thousands of hits, is it possible to write a python script to run merger through the overlapping files? I realise this is a tall order. I'd try to configure this script to use seqkit to merge the non-overlapping multiple hits too.

Thank you kindly for reading - I have trawled these forums the last few months and frankly I only think I've managed to get where I am thanks to your answers. Any advice, suggestions or critiques are gratefully received.

GenBank Fasta Linux Python BLAST • 570 views
0
Entering edit mode

Not sure if I understand what are you trying to do. But, if you get your blastn results as a table (outfmt 6) it would be easier to analyze them. You will be able to classify/filter blastn hits by evalue, score, etc. and also, to identify querys matching multiple subjects.

0
Entering edit mode

Thank you for the response Buffo - I used the online BLAST instead of command line. I did suggest to my supervisor about going about it the command line way but we went with online. I am kicking myself slightly as it does seem to be more commonly used.

It does sound like it would help but not necessarily solve the issue I am having which is how to treat these overlapping hits. While I'm reticent to recollect my data and set myself back a few weeks, if it means making this issue easier to solve, I'll give it a go. Thank you for taking the time out to answer

0
Entering edit mode
4 months ago
Jiyao Wang ▴ 210

I am not sure whether this will help. When you get the result page, a link near the top left allows you to download the search result in a JSON blob to be parsed.

0
Entering edit mode

Thank you kindly for your response Jiyao Wang - I've never come across JSON before but it looks pretty interesting. I am going to attempt to have a go another way but thank you for the suggestion and I appreciate you taking the time out to answer