convert population allele count data into population allele frequency
1
0
Entering edit mode
4.0 years ago
Ana ▴ 200

Hi all,

I have a SNPs file contains the allele counts across populations of each SNP are represented by two lines in the file, with the counts of allele one on the first line and the counts for second allele on the second (more precisely each column corresponds to one population and the number of rows are twice the number of SNPs because each pair of numbers corresponds to each allele)

1   0   0   0   0   0   0   0   0   0   1   2   1   0   0   0   0
0   2   0   0   0   0   0   0   0   0   0   4   0   2   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2   2   0   0   2   1   0   0   0   0   2   4   0   0   0   2   0


The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population. instead of population allele count how can I get population allele frequency for each allele? I actually know how to do it in R but my file is so big (nearly 10G), so I am already stuck! I am learning perl but have not figured out yet how to do that Any help is sincerely appreciated.

allele frequency alle count python • 1.7k views
2
Entering edit mode
4.0 years ago

From what I understand, you have 17 populations (columns), and that the sum of each column should be equal to the total number of alleles in each population? For example, in your pasted data, the first population n=3, the second population n=4, etc?

If that is the case, then I believe that this code works:

awk 'NR==FNR {for(i=1; i<=NF; i++) $i=(a[i]+=$i); next} {for(i=1; i<=NF; i++) if(a[i]==0) a[i]=1; print ($1/a[1])"\t"($2/a[2])"\t"($3/a[3])"\t"($4/a[4])"\t"($5/a[5])"\t"($6/a[6])"\t"($7/a[7])"\t"($8/a[8])"\t"($9/a[9])"\t"($10/a[10])"\t"($11/a[11])"\t"($12/a[12])"\t"($13/a[13])"\t"($14/a[14])"\t"($15/a[15])"\t"($16/a[16])"\t"($17/a[17])}' MyData MyData 0.333333 0 0 0 0 0 0 0 0 0 0.333333 0.2 1 0 0 0 0 0 0.5 0 0 0 0 0 0 0 0 0 0.4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.666667 0.5 0 0 1 1 0 0 0 0 0.666667 0.4 0 0 0 1 0  This reads through the file twice: the first time, It calculates the totals of each columns. The second time, it performs a division. It can certainly be tidied up though. ADD COMMENT 0 Entering edit mode Hi @Kevin Blighr. I am not quite sure about your solution. For instance if I take only the first and second row from the first column which are allele counts for alleles A and a at locus X1 in the first population then I would calculate allele frequency like this for allele A in population 1 frequency of allele A at locus X1 in population 1 =Number of copies of allele A in population/Total number of A/a gene copies in population  . This will be equal to 1 for allele A and 0 for allele a. In addition missing genotypes should be accounted as well! ADD REPLY 0 Entering edit mode Hi Ana, okay, I think that I understand it now: This will print the minor allele frequency for each SNP (1 row per SNP): awk '{for(i=1; i<=NF; i++) tally[i]+=$i}; (NR%2)==1{for(i=1; i<=NF; i++) allele1[i]=$i}; (NR%2)==0{for(i=1; i<=NF; i++) allele2[i]=$i; for(i=1; i<=NF; i++) if(tally[i]==0) tally[i]=1; for(i=1; i<=NF; i++) if (allele1[i]<=allele2[i]) printf allele1[i]/tally[i]"\t"; else printf allele2[i]/tally[i]"\t"; for(i=1; i<=NF; i++) tally[i]=0; print "\n"}' MyData | sed '/^$/d' 0 0 0 0 0 0 0 0 0 0 0 0.333333 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  This prints the major allele frequencies: awk '{for(i=1; i<=NF; i++) tally[i]+=$i}; (NR%2)==1{for(i=1; i<=NF; i++) allele1[i]=$i}; (NR%2)==0{for(i=1; i<=NF; i++) allele2[i]=$i; for(i=1; i<=NF; i++) if(tally[i]==0) tally[i]=1; for(i=1; i<=NF; i++) if (allele1[i]>=allele2[i]) printf allele1[i]/tally[i]"\t"; else printf allele2[i]/tally[i]"\t"; for(i=1; i<=NF; i++) tally[i]=0; print "\n"}' MyData | sed '/^$/d' 1 1 0 0 0 0 0 0 0 0 1 0.666667 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0  Hope that this helps!!! ADD REPLY 0 Entering edit mode Hi @Kevin Blighe, Thank you so much for your incredible help.... I think there is still a problem and I cannot use your new codes. But instead of calculating minor and major allele frequency, how can I calculate the frequency of allele 1 at each locus at odd row and allele 2 at even row to get something like below, this will solve my problem. Can you help me with that? Thanks a lot 1 0 0 0 0 0 0 0 0 0 1 0.33 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0.66 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0  ADD REPLY 1 Entering edit mode Maybe this is what you mean? awk '{for(i=1; i<=NF; i++) tally[i]+=$i}; (NR%2)==1{for(i=1; i<=NF; i++) allele1[i]=$i}; (NR%2)==0{for(i=1; i<=NF; i++) allele2[i]=$i; for(i=1; i<=NF; i++) if(tally[i]==0) tally[i]=1; for(i=1; i<=NF; i++) printf allele1[i]/tally[i]"\t"; printf "\n"; for(i=1; i<=NF; i++) printf allele2[i]/tally[i]"\t"; printf "\n"; for(i=1; i<=NF; i++) tally[i]=0}' MyData | sed 's/\t\$//g'
1   0   0   0   0   0   0   0   0   0   1   0.333333    1   0   0   0   0
0   1   0   0   0   0   0   0   0   0   0   0.666667    0   1   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0           0   0   0   0   0
1   1   0   0   1   1   0   0   0   0   1   1           0   0   0   1   0

1
Entering edit mode

yesss, this is exactly what I want. thanks a lot.

0
Entering edit mode

Phewww! Thanks! Goodnight :)