I have a fasta file with 100 random permutations of a genome. It is then broken up into ORFs. Here is a sample:
>Shuffled1_1 [5 - 79] Based on NC_000908.2 >MNGLISMLIYFDEYENEIFVEKKDL >Shuffled1_2 [36 - 101] Based on NC_000908.2 >MMSMKMKFSLKKKICNYIPKLQ >Shuffled1_3 [19 - 111] Based on NC_000908.2 >MNVNLLWWVWKWNFRWKKRFVTTYQSSNKLK >Shuffled1_4 [105 - 134] Based on NC_000908.2 >MKVERQISTL >Shuffled1_5 [148 - 177] Based on NC_000908.2 >MINNSSSPAL >Shuffled1_6 [174 - 242] Based on NC_000908.2 >MINKFPELLLTQLSYVRYYLFAI
There are over 3 million records. I need to calculate the 95th percentile by averaging the estimates of each shuffled permutation. I am having trouble figuring out how to get the 95th percentile of shuffled1 before moving onto shuffled2 and so on. Any help would be appreciated. I was able to pull out just the first shuffled permutation using awk and get the percentile using this code:
from Bio import SeqIO import numpy as np seqLen= fasta_sequence = SeqIO.parse(open("firstperm.fasta"),"fasta") for fasta in fasta_sequence: name,sequence = fasta.id, str(fasta.seq) seqLen.append(len(fasta)) a = np.array(seqLen) p = np.percentile(a,95) print p
I am just trying to figure out a way to duplicate this over the entire fasta file for all 100 shuffled permutations. Basically I am looking for 100 numbers in a list for each 95th percentile.