Relative frequencies of the nucleotides in Fasta files
2
0
Entering edit mode
20 months ago
Amr ▴ 140

I have 2 fasta files (File A, File B) each one contains sequence, How can I report the relative frequencies of the nucleotides in set A and set B with python?

fasta • 2.4k views
1
Entering edit mode

The rationale behind getting nucleotides' relative frequencies is very simple: read each fasta file, discard headers, split sequences into nucleotides, and count each one independently. If you need help coding it you'd better show what you've tried so far.

0
Entering edit mode
0
Entering edit mode

Hi Hugo, I found this script super helpful. How can I change it to give the frequencies for triplets? I ultimately want to know the frequencies between two or more sequences of aminoacid changes. Thanks in advance!

0
Entering edit mode

Hello :v

Try this:

python3 count_uni_di.py sample.fasta output.csv 3


If that doesn't work, post a sample input and desired output so we can figure out how to do what you want ;)

1
Entering edit mode
20 months ago
hugo.avila ▴ 410

#!/usr/bin/env python
from Bio import SeqIO
for fasta_path in ['/path/to/fasta1', '/path/to/fasta2']:
count_nucletides = dict([(i,0) for i in "ACTG"])
for record in SeqIO.parse(fasta_path, 'fasta'):
for nucleotide in count_nucletides:
count_nucletides[nucleotide] += record.seq.count(nucleotide)
print(fasta_path)
print('\n'.join(['{}: {}'.format(i,count_nucletides[i]) for i in count_nucletides]))

0
Entering edit mode

Thanks, does this script count mono-nucleotide frequencies (not di-nucleotide frequencies) of a fasta file?

1
Entering edit mode

Hi, it counts only mono-nucleotide frequencies (A, C, G and T). If you want to count di-nucleotides, replace the 4° line for this:

dict([(i+y,0) for i in "ACTG" for y in "ACTG"])


it is not very pretty but i think will do the job.

0
Entering edit mode

Thanks, Hugo, if I have a fasta file containing thousands of sequences, and I wonder if the script could generate a table containing mono- and di-nucleotides frequencies of each sequence in the fasta file?

1
Entering edit mode

Try this:

#!/usr/bin/env python3

import os
import sys
from Bio import SeqIO

def main(input_path: str, output_path: str) -> None:
mono = [(i,0) for i in "ACTG"]
di = [(i+y,0) for i in "ACTG" for y in "ACTG"]
count_nucletides = dict(mono + di)
with open(input_path, 'r') as input_handle:
with open(output_path, 'w') as output_handle:
output_handle.write(
'records,{}\n'.format(
','.join(map(lambda x: x[0], sorted(count_nucletides.items(), key=lambda x: x[0])))
)
)
for record in SeqIO.parse(input_handle, 'fasta'):
for nucleotide in count_nucletides:
count_nucletides[nucleotide] += record.seq.count(nucleotide)
count = list(map(lambda x: x[1], sorted(count_nucletides.items(), key=lambda x: x[0])))
output_handle.write("{},{}\n".format(
record.id,
','.join(map(str, count))
))
return None

if __name__ == "__main__":
main(
input_path=sys.argv[1],
output_path=sys.argv[2]
)


Save to a file "count_mono_di.py". Run:

python3 count_mono_di.py input.fasta output.csv


I use this sample to test:

>record_1
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
CGATATAGCGCTAG
>record_2
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
ATATAGCGCTAGAGATAGATCGATAGCATCGATACG


This was the output:

records,A,AA,AC,AG,AT,C,CA,CC,CG,CT,G,GA,GC,GG,GT,T,TA,TC,TG,TT
record_1,50,4,12,14,18,25,8,0,10,7,32,16,9,2,2,27,19,4,4,0
record_2,105,6,15,29,51,51,11,0,28,12,69,41,20,2,2,65,41,16,8,0

0
Entering edit mode

Hugo, I really appreciate your effort. This script helps! Would it be possible that you modify this script so that the result is the frequencies of mono- or dinucleotides instead of counts? For example, would it be possible that the output looks like below:

records,Freq_A, Freq_C, Freq_G, Freq_T, Freq_AA, Freq_AC, Freq_AG, Freq_AT, Freq_CA, Freq_CC, ....

record_1, values of corresponding frequencies

record_2, values of corresponding frequencies

Besides, I would like to acknowledge your effort in my publications when I use this script, please let me know how can I cite this script (your papers, or your GitHub website, etc) ?

1
Entering edit mode

Hi ! Well, i made a few mistakes so first i will correct then and format the output as you want:

#!/usr/bin/env python3

import os
from sys import argv
from collections import OrderedDict
from itertools import product
from Bio import SeqIO

def get_word_dict(word_size: int = 1, recursive = True):
count_dict = []
if not recursive:
for y in product("ACGT", repeat=word_size):
count_dict.append(''.join(y))
else:
for i in range(1, word_size + 1):
for y in product("ACGT", repeat=i):
count_dict.append(''.join(y))
return OrderedDict.fromkeys(count_dict, 0)

def freq_from_dict(nucleotide_dict, record):
for nucleotide in nucleotide_dict:
if len(nucleotide_dict) == 1:
nucleotide_dict[nucleotide] += record.seq.count(nucleotide)
else:
nucleotide_dict[nucleotide] += record.seq.count_overlap(nucleotide)
sum_dict = {}
for nucleotide in nucleotide_dict:
length_kmer = len(nucleotide)
if not length_kmer in sum_dict:
sum_dict[length_kmer] = sum([y for i,y in nucleotide_dict.items() if len(i) == length_kmer])
for nucleotide in nucleotide_dict:
nucleotide_dict[nucleotide] = nucleotide_dict[nucleotide]/sum_dict[len(nucleotide)]
return nucleotide_dict

def main(input_path: str, output_path: str, word_size: int = 1, recursive = True) -> None:
result = []
with open(input_path, 'r') as input_handle:
for record in SeqIO.parse(input_handle, 'fasta'):
freq_dict = freq_from_dict(get_word_dict(word_size=word_size, recursive=recursive), record=record)
result.append('{},{}\n'.format(record.id, ','.join(map(str ,freq_dict.values()))))
with open(output_path, 'w') as output_handle:
output_handle.write('records,{}\n'.format(','.join(freq_dict.keys())))
for i in result:
output_handle.write(i)

if __name__ == "__main__":
main(
input_path=argv[1],
output_path=argv[2],
word_size=(int(argv[3])),
recursive=(True if argv[4] else False)
)


I used the same sample as above, check the results, maybe I'm wrong again. To obtain the frequency, i divide the total count by the sum of all k mers of the same length.

I made other changes, now you can pass a word length and if you want to build the dictionary count recursively, for example if your word length is 3 the script will count all 3-mer, 2-mer and 1-mer.

You can run like this (mind the order of args):

# python3 count_uni_di.py input_path output_path word_size true
# or
# python3 count_uni_di.py input_path output_path word_size

python3 count_uni_di.py sample.fasta output.csv 2 true


As for acknowledgement, a positive vote is more than enough. Check out this EMBOSS tool, I think it's best suited for heavy use and citation. After that, if you still want to use this script i can post it on github as a snippet.

1
Entering edit mode

Hi, Hugo,

I really appreciate your effort in writing this script! I test your script with a fasta file containing 98,913 sequences, and your script works without any problem of generating a table containing mono- and dinucleotide frequencies of each sequence in the fasta file! I used EMBOSS tool to check a couple of sequences from the fasta file and the results I got from EMBOSS match the results from your script. I would say this is amazing! I hope you can put this on github so that other scientists (especially someone like me, who does not have computational background) could benefit from it. I am sure I will keep using your script and I will acknowledge it in my publications.

Regards,

Xiao

0
Entering edit mode

Hi, Hugo, May I know your full name? I would like to put your name into the acknowledgment session of my paper. Thanks!

1
Entering edit mode

Hi, I'm glad this snippet helped you. Of course you can have my name, here is my orcid, it contains my information. Please attach it next to my name, it helps me to show up in Google searches.

1
Entering edit mode
13 months ago
hugo.avila ▴ 410

Thanks !

0
Entering edit mode