PROVEAN input script
Entering edit mode
3.8 years ago
s_herrera ▴ 10

Hi everyone,

I would like to write a python script to extract amino acid substitutions and indels from several protein alignments, and use it as an input for variant effect prediction with PROVEAN. So far I have modified an initial script (C: How to extract polymorphic positions from several alignments?) to extract only amino acid substitutions, but I haven't been able to extract correctly the indels according to PROVEAN input format ( I have been able to extract indels position by position, but this doesn't work for PROVEAN. Any help would be very welcome!!

Here is the script:

for sequence in sequences: 
    print('Alignment:    ' + sequence.split('/')[-1].split('_')[0])
    alignment =, 'fasta')
    align_array_trim = prune_aln(alignment) #prune_aln is a function that I've created.
    align_array = np.array([list(rec) for rec in alignment], np.character)
    single = {}

    if ((align_array_trim.shape[1] / alignment.get_alignment_length()) > 0.7): 
    #Only alignments over 70% of initial length after gaps are removed
        for record in alignment:

        refseq = list(alignment)[-1]
        file1 ='|')[1] + '.fasta'
        f = open(filepath1,'w')
        f.write('>' + + '\n' + str(refseq.seq) + '\n')

        for column in range(align_array.shape[1]):
            c = str(align_array[:,column])
            counter = Counter(align_array[:,column])
            if (len(counter) == 2):
            # single polymorphism
                main_letter = counter.most_common(1)[0][0]
                letter = counter.most_common(2)[1][0]
                for pos in np.where(align_array[:,column] == letter)[0]:
                    change = main_letter.decode('UTF-8') + str(column + 1) + letter.decode('UTF-8')
                    single.setdefault(change, [])

        file2 ='|')[1] + '_variants.txt'
        f2 = open(filepath2,'w')              
        for change in single:
            if not ('-' in change): #Here I'm avoiding indels because I don't know how to include them
                number_spp = len(single.get(change))
                if (number_spp == 1) and (single.get(change)[0] == '<Species_of_interest>'): #I only want changes that are otherwise conserved and just change in my focus species.
                    f2.write(str(re.findall('\d+', change)[0]) + ',' + change[0] + ',' + change[-1] + '\n')


       #Input: a protein sequence and amino acid variants
       #Provean input format per alignment: <position>,<reference_amino_acids>,<variant_amino_acids>
python provean variant effect prediction • 1.7k views

Login before adding your answer.

Traffic: 1381 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6