I have 2 fasta files (File A, File B) each one contains sequence, How can I report the relative frequencies of the nucleotides in set A and set B with python?
Thanks in advance
I have 2 fasta files (File A, File B) each one contains sequence, How can I report the relative frequencies of the nucleotides in set A and set B with python?
Thanks in advance
Hi check the link in the comments and read the biopython documentation. Until then, try this:
#!/usr/bin/env python
from Bio import SeqIO
for fasta_path in ['/path/to/fasta1', '/path/to/fasta2']:
    count_nucletides = dict([(i,0) for i in "ACTG"])
    for record in SeqIO.parse(fasta_path, 'fasta'):
        for nucleotide in count_nucletides:
            count_nucletides[nucleotide] += record.seq.count(nucleotide)
    print(fasta_path)
    print('\n'.join(['{}: {}'.format(i,count_nucletides[i]) for i in count_nucletides]))
                    
                
                Try this:
#!/usr/bin/env python3
import os
import sys
from Bio import SeqIO
def main(input_path: str, output_path: str) -> None:
    mono = [(i,0) for i in "ACTG"]
    di = [(i+y,0) for i in "ACTG" for y in "ACTG"]
    count_nucletides = dict(mono + di)
    with open(input_path, 'r') as input_handle:
        with open(output_path, 'w') as output_handle:
            output_handle.write(
                'records,{}\n'.format(
                    ','.join(map(lambda x: x[0], sorted(count_nucletides.items(), key=lambda x: x[0])))
                )
            )
            for record in SeqIO.parse(input_handle, 'fasta'):
                for nucleotide in count_nucletides:
                    count_nucletides[nucleotide] += record.seq.count(nucleotide)
                count = list(map(lambda x: x[1], sorted(count_nucletides.items(), key=lambda x: x[0])))
                output_handle.write("{},{}\n".format(
                    record.id,
                    ','.join(map(str, count))
                ))
    return None
if __name__ == "__main__":
    main( 
        input_path=sys.argv[1],
        output_path=sys.argv[2]
    )
Save to a file "count_mono_di.py". Run:
python3 count_mono_di.py input.fasta output.csv
I use this sample to test:
>record_1
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
CGATATAGCGCTAG
>record_2
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
ATATAGCGCTAGAGATAGATCGATAGCATCGATACG
This was the output:
records,A,AA,AC,AG,AT,C,CA,CC,CG,CT,G,GA,GC,GG,GT,T,TA,TC,TG,TT
record_1,50,4,12,14,18,25,8,0,10,7,32,16,9,2,2,27,19,4,4,0
record_2,105,6,15,29,51,51,11,0,28,12,69,41,20,2,2,65,41,16,8,0
                    
                Hugo, I really appreciate your effort. This script helps! Would it be possible that you modify this script so that the result is the frequencies of mono- or dinucleotides instead of counts? For example, would it be possible that the output looks like below:
records,Freq_A, Freq_C, Freq_G, Freq_T, Freq_AA, Freq_AC, Freq_AG, Freq_AT, Freq_CA, Freq_CC, ....
record_1, values of corresponding frequencies
record_2, values of corresponding frequencies
Besides, I would like to acknowledge your effort in my publications when I use this script, please let me know how can I cite this script (your papers, or your GitHub website, etc) ?
Hi ! Well, i made a few mistakes so first i will correct then and format the output as you want:
#!/usr/bin/env python3
import os
from sys import argv
from collections import OrderedDict
from itertools import product
from Bio import SeqIO
def get_word_dict(word_size: int = 1, recursive = True):
    count_dict = []
    if not recursive:
        for y in product("ACGT", repeat=word_size):
            count_dict.append(''.join(y))
    else:
        for i in range(1, word_size + 1):
            for y in product("ACGT", repeat=i):
                count_dict.append(''.join(y))
    return OrderedDict.fromkeys(count_dict, 0)
def freq_from_dict(nucleotide_dict, record):
    for nucleotide in nucleotide_dict:
        if len(nucleotide_dict) == 1:
            nucleotide_dict[nucleotide] += record.seq.count(nucleotide)
        else:
            nucleotide_dict[nucleotide] += record.seq.count_overlap(nucleotide)
    sum_dict = {}
    for nucleotide in nucleotide_dict:
        length_kmer = len(nucleotide)
        if not length_kmer in sum_dict:
            sum_dict[length_kmer] = sum([y for i,y in nucleotide_dict.items() if len(i) == length_kmer])
    for nucleotide in nucleotide_dict:
        nucleotide_dict[nucleotide] = nucleotide_dict[nucleotide]/sum_dict[len(nucleotide)]
    return nucleotide_dict
def main(input_path: str, output_path: str, word_size: int = 1, recursive = True) -> None:
    result = []
    with open(input_path, 'r') as input_handle:
        for record in SeqIO.parse(input_handle, 'fasta'):
            freq_dict = freq_from_dict(get_word_dict(word_size=word_size, recursive=recursive), record=record)
            result.append('{},{}\n'.format(record.id, ','.join(map(str ,freq_dict.values()))))
    with open(output_path, 'w') as output_handle:
        output_handle.write('records,{}\n'.format(','.join(freq_dict.keys())))
        for i in result:
            output_handle.write(i)
if __name__ == "__main__":
    main( 
        input_path=argv[1],
        output_path=argv[2],
        word_size=(int(argv[3])),
        recursive=(True if argv[4] else False)
    )
I used the same sample as above, check the results, maybe I'm wrong again. To obtain the frequency, i divide the total count by the sum of all k mers of the same length.
I made other changes, now you can pass a word length and if you want to build the dictionary count recursively, for example if your word length is 3 the script will count all 3-mer, 2-mer and 1-mer.
You can run like this (mind the order of args):
# python3 count_uni_di.py input_path output_path word_size true
# or
# python3 count_uni_di.py input_path output_path word_size
python3 count_uni_di.py sample.fasta output.csv 2 true
As for acknowledgement, a positive vote is more than enough. Check out this EMBOSS tool, I think it's best suited for heavy use and citation. After that, if you still want to use this script i can post it on github as a snippet.
Hi, Hugo,
I really appreciate your effort in writing this script! I test your script with a fasta file containing 98,913 sequences, and your script works without any problem of generating a table containing mono- and dinucleotide frequencies of each sequence in the fasta file! I used EMBOSS tool to check a couple of sequences from the fasta file and the results I got from EMBOSS match the results from your script. I would say this is amazing! I hope you can put this on github so that other scientists (especially someone like me, who does not have computational background) could benefit from it. I am sure I will keep using your script and I will acknowledge it in my publications.
Regards,
Xiao
Hi, Hugo, I acknowledged you in my paper recently published. Thanks for your help! Xiao PS, paper link https://elifesciences.org/articles/83548#info
Hi Xiao,
Thanks for getting back to me! I'm so glad to hear that my script was able to make a difference for you.
I wanted to also thank you for including me in the acknowledgment section of your publication. I haven't had a chance to read the paper in full yet, but it looks very cool. I'm honored to be a part of your work and I'm glad my orcid information is there to help others find me.
Once again, thanks for your support. I hope your research continues to be successful and I wish you all the best !
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
The rationale behind getting nucleotides' relative frequencies is very simple: read each fasta file, discard headers, split sequences into nucleotides, and count each one independently. If you need help coding it you'd better show what you've tried so far.
duplicate: nucleotides content determination using python
Hi Hugo, I found this script super helpful. How can I change it to give the frequencies for triplets? I ultimately want to know the frequencies between two or more sequences of aminoacid changes. Thanks in advance!
Hello :v
Try this:
If that doesn't work, post a sample input and desired output so we can figure out how to do what you want ;)