Extract sequence subset from multiple sequence alignment based on position in a specific species.
0
0
Entering edit mode
6 months ago
nad7wf • 0

I would like to generate a logo for the amino acid domain surrounding a particular mutation of interest. I am using the "msa" package in R to generate the multiple sequence alignment based on the full peptide sequence, then I would like to extract a short subset of the consensus matrix that surrounds my AA of interest in order to generate the logo. I know the position of the amino acid in my species (Glycine max). However, as expected, when aligned to other sequences, gaps in my sequence relative to the sequence in other species cause my amino acid of interest to be shifted to a new position in the final alignment. This makes it difficult to automate the process of extracting the amino acids surrounding my mutation based on position.

My question is, is there a way to extract a subset of the multiple sequence alignment based on the position in my species of interest so that I get the amino acid I'm interested in, regardless of its new position in the MSA? Ultimately, I plan to wrap this into an R function so I can run this on many sequences, so finding the new position by eye, based on the surrounding sequence, is not really an option.


Example

Before alignment, AA of interest is at position 82
Screen-Shot-2020-10-17-at-12-34-32-PM

After alignment, AA of interest is now at position 95
Screen-Shot-2020-10-17-at-12-39-32-PM


I would like to pull a few amino acids on either side of this Lysine in order to generate a logo (in reality, I have many more species in my MSA). However, once I've aligned the sequences, I can no longer use "position 82" to pull the correct amino acids. Is there a way to extract the desired amino acids based on their position in Glycine max, specifically?


Example Fasta to work with (is there a better way to format this when posting a question? I'm new here, sorry!):

>Glycine_max MSNPSDEKEQCQKKRKSTICEASNFKTSRRRFFSNKNEEDMNKGVSTTLKLYDDPWKIKKTLTDSDLGILSRLSLAADLVKKQI >Jatropha_curcas RMRFYTAEDLEEERRKGISTNLSLSYRYNTVNILDNNLDLWKIKKRLKESDLGNLSRLLVPSNLVKKDVLPLLSSDLVNG IKSKRGAPVVFWDVDTETTHNLVFKKWDSSGS


Code used for generating MSA and consensus matrix:

library(msa)
alignment <- msaClustalOmega(readAAStringSet("/path/to/fasta_file.fa"))

# This pulls the wrong amino acids.
consensus_mat <- consensusMatrix(alignment)[, 72:84]
R Biostrings msa • 393 views
ADD COMMENT

Login before adding your answer.

Traffic: 2382 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6