Evaluate amino acid diversity at each position of a peptide library
3.9 years ago
toth.joe ▴ 30

I made a library of randomized amino acids in a region of a protein. The length of the randomized region varies from 5 to 15 residues. I want to use an NGS dataset of 1 million reads to evaluate the library for diversity (are there any duplicates?) and for the distribution of amino acids at each position. Using Python, I can parse the reads with a regex and split the randomized regions into separate files of 5-mer, 6-mer, etc. peptides. What is the most efficient way to sort and count uniques? Is there an efficient way to align 100,000 5-mer peptides and find the distribution of amino acids at positions 1, 2, 3, etc.?

sequence sequencing alignment • 745 views

Are these peptides always the same length - or to be more specific, do you actually need alignment?

If you want to know how many unique entries there are in a list of elements, you can simply take the len() of the set() of the list in Python. I doubt you'll find anything more efficient than set while staying within Python itself.
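A minimal sketch of that idea, assuming the parsed peptides are already in a Python list of strings (the `peptides` list below is made-up example data, not from the actual library). It also uses collections.Counter to show which peptides are duplicated and how often, since len(set(...)) alone only gives the number of uniques:

```python
from collections import Counter

# Hypothetical example data; in practice this would be the parsed reads.
peptides = ["GAVLI", "GAVLI", "ACDEF", "MKTWY", "ACDEF", "GAVLI"]

# Number of distinct peptides: build a set and take its length.
n_unique = len(set(peptides))

# Counter tallies how many times each peptide occurs in one pass.
counts = Counter(peptides)

# Keep only the peptides that were seen more than once.
duplicates = {pep: n for pep, n in counts.items() if n > 1}

print(n_unique)
print(duplicates)
```

Both set and Counter are hash-based, so each makes a single pass over the list; a million short strings should be well within reach on an ordinary machine.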


The peptides have different lengths, from 5 to 15 amino acids. I will try the set() method to solve the number-of-uniques problem. Could I use the same idea to sort the 5-mer peptides that start with GLY into one set, the ones that start with ALA into another, etc.? That way I could get the distribution of amino acids at position 1, then repeat for position 2, and so on. Is that efficient for 1 million peptides?


You can't use a set for the grouping part itself, since a set only removes duplicates, but you can apply set() after you've filtered if you need to de-duplicate.

It would probably look something like: set(filter(lambda x: x.startswith("GLY"), peptides)). I doubt running this over and over for every possible starting amino acid is very efficient though, since each call is another full pass over the list.

You might be able to use collections.Counter if you only need the counts of which peptides start with which amino acid, and so on, but it won't do any filtering, so it depends on what the end goal is.
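A rough sketch of the Counter approach, assuming the peptides are plain strings in one-letter amino acid code (so glycine is "G" rather than "GLY"). The function name `position_counts` and the example list are hypothetical. Peptides are grouped by length first, so no alignment is needed, and each peptide is visited only once:

```python
from collections import Counter, defaultdict

def position_counts(peptides):
    """Map each peptide length to a list of Counters, one per position."""
    by_length = defaultdict(list)
    for pep in peptides:
        n = len(pep)
        # Create one Counter per position the first time a length is seen.
        if n not in by_length:
            by_length[n] = [Counter() for _ in range(n)]
        # Tally the residue found at each position of this peptide.
        for i, residue in enumerate(pep):
            by_length[n][i][residue] += 1
    return dict(by_length)

# Hypothetical example data mixing 5-mers and 6-mers.
example = ["GAVLI", "GCVLI", "AAVLK", "GAVLIK", "AAVLIK"]
result = position_counts(example)

# Distribution at position 1 (index 0) of the 5-mers.
print(result[5][0])
```

Dividing each count by the number of peptides of that length would turn the tallies into per-position frequencies.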
