toth.joe, 3.9 years ago
I made a library of randomized amino acids in a region of a protein. The length of the randomized region varies from 5 to 15 residues. I want to use an NGS dataset of 1 million reads to evaluate the library for diversity (are there any duplicates?) and for the distribution of amino acids at each position. Using Python, I can parse the reads with a regex and split the randomized regions into separate files of 5-mer, 6-mer, etc. peptides. What is the most efficient way to sort and count uniques? And is there an efficient way to align 100,000 5-mer peptides and find the distribution of amino acids at positions 1, 2, 3, etc.?
Are these peptides always the same length - or to be more specific, do you actually need alignment?
If you want to know how many unique entries there are in a list of elements, you can simply take the `len()` of the `set()` of the list in Python. I doubt you'll find anything more efficient than `set` whilst staying within Python itself.

The peptides have different lengths, from 5 to 15 amino acids. I will try the `set()` method to solve the number-of-uniques problem. Could I use the same idea to sort 5-mer peptides that start with GLY into one set, the ones that start with ALA into another, etc.? That way I could get a count of the distribution of amino acids at position 1, then repeat for position 2, and so on. Is that efficient for 1 million peptides?
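A minimal sketch of the `set()` approach, using a toy peptide list in place of the parsed NGS reads (the one-letter amino-acid strings here are made-up examples):

```python
from collections import defaultdict

# Toy stand-in for the parsed randomized regions
# (assumption: each peptide is a one-letter-code string)
peptides = ["GACDE", "GACDE", "ALKYF", "MWPQRS"]

# Count unique sequences: set() de-duplicates, len() counts
n_unique = len(set(peptides))
print(n_unique)  # 3

# Bin peptides by length (5-mer, 6-mer, ...) in a single pass,
# instead of writing separate files per length
by_length = defaultdict(list)
for p in peptides:
    by_length[len(p)].append(p)

for k in sorted(by_length):
    print(k, "total:", len(by_length[k]), "unique:", len(set(by_length[k])))
```

Building a `set` is O(n) on average, so this scales fine to a million reads held in memory.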
You can't use a `set` for the latter part specifically, but you can apply `set` after you've filtered, to de-duplicate if needed. It would probably look something like `set(filter(lambda x: x.startswith("GLY"), peptides))`. I doubt running this over and over for all possible starting residues is very efficient, though. You might be able to use `collections.Counter` if you only need counts of what starts with what, but this won't do any filtering, so it depends what the end goals are.
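For the per-position distribution, one pass over the peptides with a `Counter` per position avoids re-filtering for every amino acid. A sketch, assuming one-letter codes and peptides already binned by length (the sequences are toy data):

```python
from collections import Counter

# Toy 5-mer list (assumption: equal-length one-letter-code strings)
fivemers = ["GACDE", "GACDF", "AACDE"]

k = 5
# One Counter per position; a single pass over all peptides
position_counts = [Counter() for _ in range(k)]
for pep in fivemers:
    for i, aa in enumerate(pep):
        position_counts[i][aa] += 1

# Distribution of amino acids at position 1 (index 0)
print(position_counts[0])  # Counter({'G': 2, 'A': 1})
```

Because the peptides are already length-sorted, no alignment is needed: position i in one peptide lines up with position i in every other, so this single pass over 100,000 5-mers gives all the positional distributions at once.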