I uploaded a single FASTA file with multiple gene clusters from different organisms to an online program called antiSMASH, or fungiSMASH in my case. Most clusters have a ketosynthase (KS) gene in them. antiSMASH identified the putative KS genes and provided me with an output in a text file. Can anyone assist me in extracting (parsing may be the correct term?) the genes of interest (nucleotide sequences) with associated accession number and definition, or just the taxon name from the definition?
I believe all of the genes of interest will say:
And if this is present then I will want the range of nucleotides indicated adjacent to the heading aSDomain. For example:
However, sometimes the adjacent numbers will say something like:
In which case I believe I would want to concatenate each range indicated.
In which case I believe I would want to concatenate all ranges and then take the reverse complement of the result.
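To make the range handling above concrete, here is a minimal Python sketch using only the standard library. It assumes the ranges follow the usual GenBank location syntax (`a..b`, `join(...)`, `complement(...)`), with 1-based inclusive coordinates. The function names `extract_location` and `reverse_complement` and the toy sequence are hypothetical and are not part of antiSMASH. In GenBank files `complement(...)` marks a feature on the reverse strand, so the sketch reverse-complements the joined pieces rather than only complementing them.

```python
import re

# Complement table for unambiguous DNA bases plus N
# (assumption: no other IUPAC ambiguity codes need handling).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_location(seq, location):
    """Extract the subsequence described by a GenBank-style location string.

    Supports "a..b", "join(a..b,c..d)" and "complement(...)" wrapped around
    either form. Coordinates are 1-based and inclusive.
    """
    location = location.strip()
    is_complement = location.startswith("complement(") and location.endswith(")")
    if is_complement:
        location = location[len("complement("):-1]
    if location.startswith("join(") and location.endswith(")"):
        location = location[len("join("):-1]
    if "complement" in location:
        # Mixed-strand forms such as join(complement(a..b),c..d) are out of scope here.
        raise ValueError("nested complement() is not handled by this sketch")
    # Partial-feature markers (< and >) are ignored; only the numbers are used.
    pieces = [
        seq[int(start) - 1:int(end)]
        for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    ]
    joined = "".join(pieces)
    return reverse_complement(joined) if is_complement else joined


toy = "AAACCCGGGTTT"  # made-up sequence, for illustration only
print(extract_location(toy, "4..6"))                         # CCC
print(extract_location(toy, "join(1..3,7..9)"))              # AAAGGG
print(extract_location(toy, "complement(join(1..3,7..9))"))  # CCCTTT
```

A GenBank parser such as Biopython would probably be a more robust way to do this on the real file; the sketch is only meant to show the logic.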
I would like to do an alignment and then make a phylogenetic tree based on the extracted KS genes. I believe FASTA format would be a good output to have my KS genes in, but I can convert if necessary.
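For the output step, a rough sketch of the FASTA records I have in mind is below, with the accession and taxon name in the header line. The function name `fasta_record`, the `|` separator, and the placeholder accession and taxon are all hypothetical choices, not anything antiSMASH produces.

```python
def fasta_record(accession, taxon, sequence, width=60):
    """Format one sequence as a FASTA record, wrapping lines at `width` bases."""
    # Spaces in the taxon name are replaced so the whole header acts as one ID token.
    header = ">{}|{}".format(accession, taxon.replace(" ", "_"))
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Placeholder values, for illustration only.
print(fasta_record("ACC0001", "Genus species", "ACGT" * 20), end="")
```

Most alignment programs accept multi-record FASTA, so concatenating one such record per KS domain should be enough input for the alignment and tree steps.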
This is my first time using antiSMASH and I'm new to coding, so I apologize for any obvious blunders. I would have preferred to attach a file of my output data, but I didn't see that as an option. Thanks in advance for any help!
Here's a link to my output:
If someone has a better way of attaching a large text file (~1.7 million characters), I'm all ears.
Here's a link to all of the output that antiSMASH generated (not just the text file of all annotated genes):