Question: Multi-fasta sequence file alignment using sliding window
11 months ago, susheelbhanu (0) wrote:

Dear all,

I am trying to align two fasta files with multiple sequences. The idea would be to use a sliding window approach, as done on this site:

I want to use the same 80aa sliding window and find all alignments with greater than 35% identity. Finally, I would like an overall identity score for the aligned sequences.

I've tried BLAST, FASTA36, and DIAMOND so far. None of them has a 'sliding window' option. I even tried adjusting the word size in BLAST, but no luck yet.

Any ideas on how one can pursue this?

Thank you!

modified 11 months ago • written 11 months ago by susheelbhanu (0)

Since this seems to be too confusing, here's a breakdown of what I want to do:

  1. Break each protein in a multifasta file into 80aa sub-peptides

  2. Run BLAST on each sub-peptide against a database of interest

  3. Collect and screen BLAST outputs for > 35% identity over the 80aa lengths
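
Step 1 can be sketched in plain Python as below. This is a minimal sketch, not the method of any existing tool: the function names, the `<id>:<start>` naming scheme for sub-peptides, the default step of 1, and the choice to emit proteins shorter than 80aa as a single window are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Keep only the first token of the header as the record ID.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def windows(seq: str, size: int = 80, step: int = 1) -> List[Tuple[int, str]]:
    """Return (1-based start, sub-peptide) pairs.

    Assumption: a protein no longer than `size` is returned once, unchanged.
    With step > 1 the last few residues may not fall inside any window.
    """
    if len(seq) <= size:
        return [(1, seq)]
    return [(i + 1, seq[i:i + size]) for i in range(0, len(seq) - size + 1, step)]


def window_records(lines: Iterable[str], size: int = 80, step: int = 1) -> Iterator[str]:
    """Yield each window as FASTA text, named <id>:<start> (hypothetical scheme)."""
    for name, seq in read_fasta(lines):
        for start, sub in windows(seq, size, step):
            yield f">{name}:{start}\n{sub}\n"
```

The records can be written to a file (say `windows.fasta`, a made-up name) and used as the query for step 2, for example with `blastp -query windows.fasta -db mydb -outfmt 6 -out hits.tsv`, where `mydb` and `hits.tsv` are likewise placeholders.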

Thank you for your help!

written 11 months ago by susheelbhanu (0)

This is almost certainly going to need some custom code.

What exactly are you trying to achieve? Please describe the collection of sequences you have. How long are they? Are they similar to each other? How were they collected? A sliding window is one thing, but are you looking for similarity without introducing any breaks in the sequence?

modified 11 months ago • written 11 months ago by GenoMax (96k)

@genomax: My file looks like so:


>DEF1_ARAHY UniProt Ara h 12 UniProt B3EWP3 Arachis hypogaea Peanut Defensin 1 71
ktvagfcifflvlflaqegvvkteaklcnhladtyrgpcftnascddhcknkehfvsgtcmkmacwcahnc

>DEF2_ARAHY UniProt Ara h 13 UniProt B3EWP4 Arachis hypogaea Peanut Defensin 2 79
vqkrtiimekkmagfcifflilflaqeygvegkeclnlsdkfkgpclgskncdhhcrdiehllsgvcrddfrcwcnrkc

>DEF3_ARAHY UniProt Ara h 13 UniProt C0HJZ1 Arachis hypogaea Peanut Defensin 3 72
mekkmagfcifflvlflaqeygvegkvclnlsdkfkgpclgtkncdhhcrdiehllsgvcrddfrcwcnrnc

>Q0PKR4_ARAHY UniProt Ara h 8 UniProt Q0PKR4 Arachis hypogaea Peanut Pathogenesis-related protein 10 157
mgvftfedeitstlppaklynamkdadsltpkiiddvksveivegsggpgtikkltivedgetrfilhkveaideanyaynysvvggvalpptaekitfetklveghnggstgklsvkfhskgdakpeeedmkkgkakgealfkaiegyvlanptqy

>Q0GM57_ARAHY UniProt Ara h 3 UniProt Q0GM57 Arachis hypogaea Peanut Iso-Ara h3 512
makllalslcfcvlvlgassvtfrqggeenecqfqrlnaqrpdnrieseggyietwnpnnqefqcagvalsrtvlrrnalrrpfysnapleiyvqqgsgyfglifpgcpstyeepaqegrryqsqkpsrrfqvgqddpsqqqqdshqkvhrfdegdliavptgvafwmyndedtdvvtvtlsdtssihnqldqfprrfylagnqeqeflryqqqqgsrphyrqisprvrgdeqenegsnifsgfaqeflqhafqvdrqtvenlrgenereeqgaivtvkgglrilspdeedessrsppsrreefdedrsrpqqrgkydenrrgykngieeticsasvkknlgrssnpdiynpqagslrsvneldlpilgwlglsaqhgtiyrnamfvphytlnahtivvalngrahvqvvdsngnrvydeelqeghvlvvpqnfavaakaqsenyeylafktdsrpsianlagensiidnlpeevvansyrlpreqarqlknnnpfkffvppfdhqsmreva

The idea is to break these into 80aa sub-peptides and then BLAST them against a blastdb (generated from another file). I would use BLAST's default gap penalties with the BLOSUM62 matrix. Eventually, the scores from all the sub-peptides should be collated into a single file, with results reported for each original peptide (**not for each sub-peptide**).

This is similar to the "allermatch" website, where they describe their method as follows: **80-amino-acid sliding window: The input sequence is chopped up in 80-amino-acid windows. For each 80-amino-acid window, the program counts which allergen it hits (with a specific identity).**

Maybe "sliding window" is not the right terminology, but I hope this makes sense. Please let me know if I should clarify further.
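
The collation step described above might look like the following sketch. It assumes BLAST tabular output with the default `-outfmt 6` columns, and it assumes each query ID has the form `<protein>:<start>` (a hypothetical naming scheme for the sub-peptides). Rescaling the identity to the full 80aa window is also an assumption about what "identity over the 80aa length" means, not a confirmed description of how allermatch computes it.

```python
import csv
from collections import defaultdict


def collate_hits(tsv_lines, min_identity=35.0, window=80):
    """Group qualifying window hits by parent protein and database sequence.

    Expects the default -outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    per_protein = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        # Assumption: rescale identity to the whole window, so that a short,
        # highly identical local alignment does not pass on pident alone.
        identity_over_window = pident * length / window
        if identity_over_window > min_identity:
            parent, start = qseqid.rsplit(":", 1)
            per_protein[parent][sseqid].append((int(start), round(identity_over_window, 1)))
    return {parent: dict(hits) for parent, hits in per_protein.items()}


def summarise(collated):
    """For each protein and database hit: (number of windows hit, best identity)."""
    return {
        parent: {
            sseqid: (len(wins), max(identity for _, identity in wins))
            for sseqid, wins in hits.items()
        }
        for parent, hits in collated.items()
    }
```

`collate_hits` accepts any iterable of lines, so an open file handle on the BLAST output works directly. Proteins shorter than 80aa would need a different denominator than `window`; that case is left out here.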


modified 11 months ago • written 11 months ago by susheelbhanu (0)

Please use ADD COMMENT/ADD REPLY when responding to existing comments to keep threads logically organized. Please use the formatting bar (especially the code option) to present your post better.
It would be great if you can clean up your sequences posted above.

I am afraid it is still not clear what you want to achieve. An 80-aa window may make sense for the tool you linked in the original post, but it may not be valid for your application. A sliding window also involves a step size, and using one when BLASTing sequences may complicate things unnecessarily.

I would suggest that you BLAST the sequences in one file against all those in the other. Then, instead of trying to parse the BLAST results, you could pick out the homologous sequences that BLAST identifies and run multiple sequence alignments on them.

modified 11 months ago • written 11 months ago by GenoMax (96k)
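
To make the point about the step concrete, here is a small sketch of how the step changes the number of 80-aa windows, and so the number of BLAST queries, for the 512-aa sequence posted above. The function names are made up for illustration, and the thread does not say which step the linked tool uses.

```python
def window_starts(length, size=80, step=1):
    """1-based start positions of full-size windows along `length` residues."""
    if length < size:
        return []
    return list(range(1, length - size + 2, step))


def uncovered_tail(length, size=80, step=1):
    """Number of C-terminal residues that no window covers for this step."""
    starts = window_starts(length, size, step)
    if not starts:
        return length
    return length - (starts[-1] + size - 1)


for step in (1, 10, 80):
    print(step, len(window_starts(512, step=step)), uncovered_tail(512, step=step))
# prints:
# 1 433 0
# 10 44 2
# 80 6 32
```

A step of 1 covers every position but multiplies the number of queries; larger steps are cheaper but can leave the end of the sequence outside any window.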

Thanks for the pointers, @genomax. For certain reasons, I need to replicate the analyses performed by the tool. Others who have scripts to do this do not want to share them, hence my reliance on the experts here.

I've already run BLAST, DIAMOND, etc., with variable results. However, the people I am working with suggest that the above approach needs to be used.

Thanks anyways for the insights.

written 11 months ago by susheelbhanu (0)