Export Allele Frequency Of All Alleles In The Whole Data For A Subsets Of Individuals
2
0
Entering edit mode
11.8 years ago
Jianfengmao ▴ 310

Background: I am new to genomics and have few programing skills. I depends on VCF and VCFTools for my genomic analysis.

I have my population genomic data in VCF format which is popular in human 1000 genome project. By using VCFTools, I can get allele frequency statistics for each locus and this allele frequency is summarized by all the individuals in the VCF file.

my question:
I want to prepare an input file for some population genetic software, like BayeScan (http://cmpg.unibe.ch/software/bayescan/index.html) to predicting selection along the genome. I need to export an allele frequency statistics for each populations (a group of individuals), and the allele frequency for the same loci (genomic position) should have the same alleles and in the same order.

But, the allele frequency from VCFTools do not satisfy my need. For some ZERO count allele, it outputs nothing. I need the allele frequency was output for all alleles in the locus for all the populations.

I would like to hearing your directions on my problem. All who depends on VCF will benefit from your suggestions.

Example:

This is an example for my data, here euro, ice, norA are three populations. For the position, 1, 3, 32 there are frequency for two allele each in population euro, but only one for the other two populations. I need the Zero count alleles be output as 0 not nothing.

abt6-mao-mbp:BayeScan jianfengmao$head euro_chr1.frq.count CHROM POS NALLELES NCHR {ALLELE:COUNT} 1 1 2 9 T:8 A:1 1 3 2 9 A:8 G:1 1 32 2 9 C:7 G:2 1 69 1 9 A:9 1 75 1 9 G:9 1 87 1 9 C:9 1 88 1 9 T:9 1 116 1 9 A:9 1 141 2 9 C:0 T:9 abt6-mao-mbp:BayeScan jianfengmao$ head ice_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 1 3 T:3

1 3 1 3 A:3

1 32 1 3 C:3

1 69 1 3 A:3

1 75 1 3 G:3

1 87 1 3 C:3

1 88 1 3 T:3

1 116 1 3 A:3

1 141 2 3 C:0 T:3

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 1 6 T:6

1 3 1 6 A:6

1 32 1 7 C:7

1 69 1 6 A:6

1 75 2 6 G:5 A:1

1 87 1 6 C:6

1 88 1 6 T:6

1 116 1 6 A:6

1 141 2 7 C:0 T:7

vcf vcftools • 5.9k views
3
Entering edit mode

1
Entering edit mode

Also, several of your questions, including this one, can be easily solved by writing a script. As you have been in this field for at least several months, I would not call you a newcomer. You should really learn a whatever scripting language and solve the problem by yourself. This is your work.

0
Entering edit mode

Ih3, Your suggestion is just what I want and what I am doing now. I intend to do that by myself and I used all my time on that. And, I have gotten much till now. But, I have not learn everything I need. I have none training on computer and programming, I am sorry for my so many questions here, really sorry. I will try my best to do that by myself. I will learn programming in perl and python, like I started to learn R four years ago in an isolated environment in China. Thanks for what I have gotten from you all.

0
Entering edit mode

Pierre Lindenbaum, I will do what you have mentioned. Thanks for your kindness.

2
Entering edit mode
11.8 years ago
Joachim ★ 2.9k

Have a look at a scripting language that allows you to easily read your data files line-by-line. Then split the lines into columns and evaluate the data in those columns respectively. You can use a hash-map to keep track of the counts for each allele (allele names become keys, their accumulated frequency is the respective value).

For Python, it would look a bit like this:

Regarding the frequency count, you would use the hash-map/dictionary as follows (I call the Python dictionary frequencies):

• you can check whether you have stored a frequency for an allele x already using if _x_ in frequencies:
• if you have a frequency for that allele stored already, do accumulated_count = frequencies[_x_] followed by frequencies[_x_] = accumulated_count + count_of_this_line
• if there is no frequency stored for that allele yet, then do frequency[_x_] = count_of_this_line
0
Entering edit mode

Joachim, Thanks for your kindness. I will try to do that following you.

0
Entering edit mode

Dear Joachim, I checked what you have written for me. Thanks for your patience for me. Though I have begun to learn programming, I still can not do such a simple thing by myself. You have given me a good directions. Thanks a lot.

1
Entering edit mode
11.8 years ago
lh3 33k

This Perl script takes a VCF and a sample-population pair list as input and outputs the allele count at each site for each population (you need the -s option). The Lua version has a few comments. Example:

zcat ALL.2of4intersection.20100804.genotypes.vcf.gz | vcf2afs.pl -s 20100804.ALL.panel -


This DOESN'T answer your question, but you can learn from the script.

0
Entering edit mode