Question: Extract sequence subset from multiple sequence alignment based on position in a specific species.
0
gravatar for nad7wf
2 days ago by
nad7wf0
nad7wf0 wrote:

I would like to generate a logo for the amino acid domain surrounding a particular mutation of interest. I am using the "msa" package in R to generate the multiple sequence alignment based on the full peptide sequence, then I would like to extract a short subset of the consensus matrix that surrounds my AA of interest in order to generate the logo. I know the position of the amino acid in my species (Glycine max). However, as expected, when aligned to other sequences, gaps in my sequence relative to the sequence in other species cause my amino acid of interest to be shifted to a new position in the final alignment. This makes it difficult to automate the process of extracting the amino acids surrounding my mutation based on position.

My question is, is there a way to extract a subset of the multiple sequence alignment based on the position in my species of interest so that I get the amino acid I'm interested in, regardless of its new position in the MSA? Ultimately, I plan to wrap this into an R function so I can run this on many sequences, so finding the new position by eye, based on the surrounding sequence, is not really an option.


Example

Before alignment, AA of interest is at position 82
Screen-Shot-2020-10-17-at-12-34-32-PM

After alignment, AA of interest is now at position 95
Screen-Shot-2020-10-17-at-12-39-32-PM


I would like to pull a few amino acids on either side of this Lysine in order to generate a logo (in reality, I have many more species in my MSA). However, once I've aligned the sequences, I can no longer use "position 82" to pull the correct amino acids. Is there a way to extract the desired amino acids based on their position in Glycine max, specifically?


Example Fasta to work with (is there a better way to format this when posting a question? I'm new here, sorry!):

>Glycine_max MSNPSDEKEQCQKKRKSTICEASNFKTSRRRFFSNKNEEDMNKGVSTTLKLYDDPWKIKKTLTDSDLGILSRLSLAADLVKKQI >Jatropha_curcas RMRFYTAEDLEEERRKGISTNLSLSYRYNTVNILDNNLDLWKIKKRLKESDLGNLSRLLVPSNLVKKDVLPLLSSDLVNG IKSKRGAPVVFWDVDTETTHNLVFKKWDSSGS


Code used for generating MSA and consensus matrix:

library(msa)
alignment <- msaClustalOmega(readAAStringSet("/path/to/fasta_file.fa"))

# This pulls the wrong amino acids.
consensus_mat <- consensusMatrix(alignment)[, 72:84]
msa biostrings R • 62 views
ADD COMMENTlink written 2 days ago by nad7wf0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2133 users visited in the last hour