I need to build a table of amino acid frequencies per position along a ~900 amino acid amplicon that is all coding, based on the alignment of 1-2 million illumina 100mers to that amplicon. I'm trying to emulate results I've done before, but in that earlier case I had PacBio long reads that each mapped across the entire 900aa amplicon. In that earlier case I was able to use MUSCLE to generate fasta format multiple alignments and then used a local script to print out the frequency of each amino acid at each residue in the reference amplicon.
Now my collaborator wants me to generate the same table but this time I don't have PacBio long reads, I only have illumina 100mers. I've translated the Illumina data into 33mer aa sequences but I don't think typical multiple alignment tools will properly align these many (millions of) short reads against a long reference. FAMSA finished quickly for such a large batch but as I feared all the 33mers were stretched out and piled towards the center of the long ~900aa amplicon. Clustalw2 was unable to finish after almost a week, and I also tried PRALINE on a small set and saw that even with just my small test set the 33aa reads were already getting stretched out. These reads should all align well, mostly intact (few/no gaps) to discrete positions in the amplicon.
I then had a brief moment of hope when a forum post led me to the MView program. I then ran ncbi blastp on the pool of 33aa short reads against the amplicon and fed that to MView, but it was also not suited for my needs. I guess it also is expecting all the aligned reads to be of about the same length, and to line up in the same spot within a reference.
Is there some way I can do a multiple alignment of these short 33aa reads against this 900aa amplicon, letting them each independently fall in their proper alignment to the amplicon, but then generate an MSA file in fasta format from those alignments?