Efficient ways to retrieve all sequences associated with a disease?
13 months ago

Hello everyone,

Recently I was given the task of collecting all available amino acid sequences for a protein associated with a disease. The idea is to gather all known mutations of the protein and draw conclusions from their pattern.

My first approach was BioMart from Ensembl. I added the Ensembl gene ID and the disease name as filters (the disease name is already available as an option), and selected the protein ID and sequences as output. However, this returned some 67K sequences, which is unlikely to be correct, and I noticed that some normal, healthy sequences were among the results, so I ditched it all (comment if you think I did something wrong).

My second approach was going through all the protein sequences of that gene in NCBI and checking them one by one to see whether each is actually a mutant. Obviously, this is taking forever...

Any advice on a more efficient way to do this, i.e. search for a protein and retrieve the sequences of all its mutants? The protein whose mutant forms I am trying to find is Beta-amyloid precursor protein (associated with Alzheimer's disease).

Sequences data-mining monogenetic-disorders • 493 views
1

You can download the sequences from whichever source you want (you did not include UniProt in your list) and then do a multiple sequence alignment to find the differences. You will notice many redundant sequences, which you could eliminate while keeping track of the IDs of those you remove, so creating the multiple sequence alignment becomes manageable.
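As a rough illustration of the redundancy-removal step, here is a minimal Python sketch that collapses exactly identical sequences while remembering which IDs shared each one. The function names (`read_fasta`, `collapse_identical`) and the toy records are made up for the example; real downloads would be read from a file rather than an in-memory string.

```python
from collections import defaultdict
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            identifier, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def collapse_identical(records):
    """Map each distinct sequence to the list of identifiers that share it."""
    groups = defaultdict(list)
    for identifier, sequence in records:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


# Toy input (not real data): seqA and seqB are identical, seqC differs by one residue.
example = StringIO(">seqA\nMLPGLALLLL\n>seqB\nMLPGLALLLL\n>seqC\nMLPGLAVLLL\n")
groups = collapse_identical(read_fasta(example))
for sequence, identifiers in groups.items():
    print(identifiers[0], "represents", identifiers)
```

Only one sequence per group then needs to go into the alignment, and the ID lists let you map any observed change back to every entry that carries it.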

1

To follow on from this, one could get the UniProt IDs of the disease-associated proteins from https://www.uniprot.org/ and then retrieve their sequences in FASTA format via e-utils (I have code in the thread "Need help to retrive sequences" and at https://github.com/kevinblighe/PythonScripts). After that, a multiple sequence alignment (MSA), as GenoMax suggests, may help; alternatively, do a pairwise alignment of each disease-associated protein sequence against a reference / healthy sequence.
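For the simplest case of the pairwise comparison, where a candidate has the same length as the reference and differs only by substitutions, no alignment is needed and the changes can be listed directly. The sketch below assumes exactly that (it rejects length differences rather than handling indels); `list_substitutions` and the toy sequences are hypothetical, not real APP data.

```python
def list_substitutions(reference, candidate):
    """Return substitutions as strings like 'I6L' (reference residue, 1-based position, new residue).

    Assumes both sequences have the same length, i.e. no insertions or deletions.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences differ in length; align them first")
    return [
        f"{ref}{position}{alt}"
        for position, (ref, alt) in enumerate(zip(reference, candidate), start=1)
        if ref != alt
    ]


# Toy sequences for illustration only.
reference = "MKTAYIAKQR"
candidate = "MKTAYLAKQR"
print(list_substitutions(reference, candidate))
```

Sequences with insertions or deletions would need a real pairwise aligner before positions can be compared this way.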

0

If you're expecting redundancies, why not just cluster those 67,000 sequences OP got at 100% coverage and 100% identity first?

Alternatively, if OP could come up with some heuristic (e.g., 100% coverage, 98% identity), clustering at that cutoff and inspecting the cluster(s) containing a few "landmark" sequences might already net them all the sequences they're interested in.
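To make the clustering idea concrete, here is a toy greedy clustering sketch in Python. It is only a stand-in for a dedicated clustering tool: identity is computed position by position with no alignment, and sequences of different length are never grouped (a crude reading of "100% coverage"). The names `identity` and `greedy_cluster`, the 0.9 threshold, and the sequences are all assumptions for illustration.

```python
def identity(a, b):
    """Fraction of matching positions; 0.0 if lengths differ (no alignment is attempted)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(records, threshold):
    """Put each sequence into the first cluster whose representative meets the threshold."""
    clusters = []  # list of (representative sequence, [member identifiers])
    for identifier, sequence in records:
        for representative, members in clusters:
            if identity(representative, sequence) >= threshold:
                members.append(identifier)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append((sequence, [identifier]))
    return clusters


# Toy records: "variant1" differs from "landmark" at 1 of 20 positions.
records = [
    ("landmark", "MKTAYIAKQRMKTAYIAKQR"),
    ("variant1", "MKTAYLAKQRMKTAYIAKQR"),
    ("unrelated", "GGSGGSGGSGGSGGSGGSGG"),
]
clusters = greedy_cluster(records, threshold=0.9)
for representative, members in clusters:
    print(members)
```

Inspecting only the cluster that contains the landmark ID then gives the near-identical candidates, while unrelated hits fall into other clusters.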

(Also, just curious, but how would one go about inspecting an MSA with 67,000 sequences?)

1

I am not sure where OP got 67K sequences, since I am seeing the following with EntrezDirect:

For all entries (check the count line):

$ esearch -db protein -query "Beta amyloid protein"
<ENTREZ_DIRECT>
<Db>protein</Db>
<WebEnv>MCID_60e4c25c7344f057e13e6e76</WebEnv>
<QueryKey>1</QueryKey>
<Count>1747</Count>
<Step>1</Step>
</ENTREZ_DIRECT>

For human sequences:

$ esearch -db protein -query "Beta amyloid protein AND human [ORGN]"
<ENTREZ_DIRECT>
<Db>protein</Db>
<WebEnv>MCID_60e4c24021d65503e5566464</WebEnv>
<QueryKey>1</QueryKey>
<Count>134</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
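If you want to compare counts across several queries without reading the XML by eye, the count can be pulled out with Python's standard library. This is a small sketch that parses the human-query output shown above, pasted in as a string; `parse_esearch_count` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_esearch_count(xml_text):
    """Return the integer in the <Count> element of an esearch result."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


# Output of the human-restricted query, copied from the post.
human_result = """<ENTREZ_DIRECT>
<Db>protein</Db>
<WebEnv>MCID_60e4c24021d65503e5566464</WebEnv>
<QueryKey>1</QueryKey>
<Count>134</Count>
<Step>1</Step>
</ENTREZ_DIRECT>"""

print(parse_esearch_count(human_result))  # 134
```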


Most sequences are expected to be near-identical, so they can indeed be clustered (e.g. with CD-HIT) and only a single representative kept. OP wants all changes, so a sequence with even a single AA change becomes a candidate to keep. If there are large deletions/insertions, creating the MSA would be more challenging.

0

Ah, thanks for clarifying that, GenoMax! With OP's exact phrasing ("Beta-amyloid precursor protein"), there are 4638 candidates. That's definitely not 67K, I suppose.

"Most sequences are expected to be near identical so they can indeed be clustered (e.g. CD-HIT) and only a single representative kept."

I was going more along the lines of using the clusters themselves as a way of disambiguating sequences of interest from others in that 67K set, since a cluster defined at ~100% coverage and identity would be these highly identical sequences you mention.