How to calculate the Genetic Distance (P-Distance) by using mtDNA pacBio data?
0
0
Entering edit mode
9 days ago

Dear researchers and experts,

I am using mtDNA PacBio data to calculate genetic distance (P distance). For my analysis, I have been utilizing bioinformatics tools on a Linux platform. Initially, I tried using BCFtools, followed by running a script in Python (PyCharm) to measure genetic distance (P distance). Could you please suggest some tools for these calculations? I humbly request your guidance from the beginning. Could someone please help me? Below, I have attached a screenshot of my calculations. Currently, I am using data from three species, with each species having data from three different locations. Please guide me about this.

These are the commands I have used:

Step 2: Call Variants

Identify the differences between your PacBio reads and the reference genome using a variant caller such as BCFtools.

Commands

Call variants using BCFtools:

bcftools mpileup -f reference.fasta sorted_aligned_reads_with_rg.bam | bcftools call -mv -Oz -o variants.vcf.gz
bcftools filter -i 'QUAL > 30 && DP > 10' variants.vcf.gz -Oz -o filtered_variants.vcf.gz
bcftools index filtered_variants.vcf.gz


Step 3: Calculate Genetic Distance

Use the filtered variants to calculate the genetic distance (P distance) between the sequences.

Example Python Script to Calculate P Distance

1. Extract allele frequencies:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AF\n' filtered_variants.vcf.gz > allele_frequencies.txt


2. Calculate P distance using Python:

import pandas as pd
import matplotlib.pyplot as plt

allele_freqs = pd.read_csv('allele_frequencies.txt', sep='\t', names=['CHROM', 'POS', 'REF', 'ALT', 'AF'])

# Calculate P distance
total_sites = len(allele_freqs)
diff_sites = sum(allele_freqs['AF'] > 0)  # Count sites with non-reference alleles

p_distance = diff_sites / total_sites
print(f'P distance: {p_distance}')

# Visualization
# Histogram of allele frequencies
plt.figure(figsize=(10, 6))
plt.hist(allele_freqs['AF'], bins=20, edgecolor='black')
plt.title('Histogram of Allele Frequencies')
plt.xlabel('Allele Frequency')
plt.ylabel('Count')
plt.grid(True)
plt.show()

# Scatter plot of positions vs allele frequencies
plt.figure(figsize=(10, 6))
plt.scatter(allele_freqs['POS'], allele_freqs['AF'], alpha=0.5)
plt.title('Allele Frequencies by Position')
plt.xlabel('Position')
plt.ylabel('Allele Frequency')
plt.grid(True)
plt.show()


pacBio Genetic-Distance mtDNA • 194 views
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (text becomes text), or use one of (a) the option highlighted in the image below/ (b) fenced code blocks for multi-line code. Fenced code blocks are useful in syntax highlighting. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.