Question: Per position amino acid frequency from mapping short reads vs. long coding amplicon
gravatar for jmartin77777
22 months ago by
jmartin777770 wrote:

I need to build a table of amino acid frequencies per position along a ~900 amino acid amplicon that is all coding, based on the alignment of 1-2 million illumina 100mers to that amplicon. I'm trying to emulate results I've done before, but in that earlier case I had PacBio long reads that each mapped across the entire 900aa amplicon. In that earlier case I was able to use MUSCLE to generate fasta format multiple alignments and then used a local script to print out the frequency of each amino acid at each residue in the reference amplicon.

Now my collaborator wants me to generate the same table but this time I don't have PacBio long reads, I only have illumina 100mers. I've translated the Illumina data into 33mer aa sequences but I don't think typical multiple alignment tools will properly align these many (millions of) short reads against a long reference. FAMSA finished quickly for such a large batch but as I feared all the 33mers were stretched out and piled towards the center of the long ~900aa amplicon. Clustalw2 was unable to finish after almost a week, and I also tried PRALINE on a small set and saw that even with just my small test set the 33aa reads were already getting stretched out. These reads should all align well, mostly intact (few/no gaps) to discrete positions in the amplicon.

I then had a brief moment of hope when a forum post led me to the MView program. I then ran ncbi blastp on the pool of 33aa short reads against the amplicon and fed that to MView, but it was also not suited for my needs. I guess it also is expecting all the aligned reads to be of about the same length, and to line up in the same spot within a reference.

Is there some way I can do a multiple alignment of these short 33aa reads against this 900aa amplicon, letting them each independently fall in their proper alignment to the amplicon, but then generate an MSA file in fasta format from those alignments?

Thanks, John

alignment • 562 views
ADD COMMENTlink modified 22 months ago • written 22 months ago by jmartin777770

Thats worth a try. I'm concerned about whether tools will correctly handle a sam file with protein alignments, but if it works that would be what I need!

ADD REPLYlink written 22 months ago by jmartin777770
gravatar for h.mon
22 months ago by
h.mon30k wrote:

DIAMOND can align the Illumina reads to the reference database and output to a number of formats, including SAM. You can probably get the information you need parsing the SAM file directly, or convert the SAM file to a MSA fasta.

ADD COMMENTlink written 22 months ago by h.mon30k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1043 users visited in the last hour