Question: The difference between singleton SNPs and SNPs with MAF less than or equal to 0.01
haneenih7 (KAUST) wrote, 7 weeks ago:

Hello all,

I struggled the whole day trying to figure out what the difference really is between SNPs present in only one individual (singletons) and SNPs with a minor allele frequency (MAF) of ≤ 1%.

I have one SNP data set covering 2 plant species (14 wild individuals and 157 cultivated individuals), with a total of 11,046,501 SNPs.

I used the --singletons option of vcftools to get a file listing the SNPs that occur in only one individual.

This gave 3,671,719 singleton SNPs. Filtering with the vcftools option --max-maf 0.01 (which keeps sites with MAF less than or equal to 0.01) and counting the SNPs in the resulting VCF file gave 4,901,160 SNPs (44% of the total SNPs). That's a lot :(

The big issue came when I made two subsets of the VCF file. The first contains the wild individuals only (14 samples, 9,839,152 SNPs). I ran the same --singletons and --max-maf 0.01 calculations on it, and what surprised me was the large number of singletons (3,513,130) against the small number of SNPs with MAF ≤ 1% (only 160,140 SNPs).

The other VCF subset contains the cultivated samples only (157 individuals, 2,617,322 SNPs). It has 836,609 singleton SNPs and 989,630 SNPs with MAF ≤ 1% (the two numbers are more or less similar, unlike in the wild subset). I am so confused!

I tried outputting SNPs with MAF ≤ 1% with plink as well, and it gave exactly the same results.

How should I interpret these results? I was expecting similar numbers of singletons and SNPs with MAF ≤ 1%, but it seems it doesn't work that way. So back to the main question:

What is the difference between singleton SNPs and SNPs with MAF ≤ 0.01?

P.S. Just to clarify: removing the wild individuals (n=14), along with the SNPs that are then no longer polymorphic, causes a huge drop in the number of SNPs. This is because the wild samples are highly diverse compared to the cultivated ones (there are a lot of SNPs and differences between the wild genome sequences and the reference genome).

Thanks for your help!

haneenih7 (KAUST) wrote, 7 weeks ago:

In the end, this observation was easy to explain.

Allele frequencies computed from at most 14 individuals are not comparable with allele frequencies from a bigger population. In the wild population (n=14), hardly any variable site can pass the same MAF cutoff (0.01): to get an allele frequency < 0.01, all individuals must carry the same allele, so the frequency is actually 0. If only one individual has a different allele (that is, a singleton), the frequency of that allele is already about 0.07 counted per individual (1/14), or about 0.036 counted per chromosome if the samples are diploid (1/28). Either way it is higher than 0.01.
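
To make the arithmetic concrete, here is a rough Python sketch. It assumes diploid samples and no missing genotypes, neither of which is stated in the thread, and the function names are purely illustrative (they are not part of vcftools or plink). It computes the smallest non-zero allele frequency a site can have for a given sample size, and the largest minor-allele count that still passes a MAF ≤ 0.01 filter.

```python
from fractions import Fraction

# Assumption: diploid samples with no missing genotypes, so every site is
# typed on n_individuals * 2 chromosomes. Real data with missing calls will
# have fewer chromosomes per site and therefore even larger minimum MAFs.

def min_nonzero_maf(n_individuals, ploidy=2):
    """Smallest non-zero allele frequency: one allele copy among all chromosomes."""
    return Fraction(1, n_individuals * ploidy)

def max_minor_count_under_cutoff(n_individuals, cutoff=Fraction(1, 100), ploidy=2):
    """Largest minor-allele count whose frequency is still <= cutoff."""
    n_chromosomes = n_individuals * ploidy
    # int() truncates, which equals floor for these non-negative values.
    return int(cutoff * n_chromosomes)

if __name__ == "__main__":
    for label, n in [("wild", 14), ("cultivated", 157), ("combined", 171)]:
        smallest = min_nonzero_maf(n)
        max_count = max_minor_count_under_cutoff(n)
        print(f"{label:>10} (n={n}): min non-zero MAF = {smallest} "
              f"(~{float(smallest):.4f}), max allele count with MAF <= 0.01 = {max_count}")
```

Under these assumptions, a single allele copy in the wild subset already has a frequency of 1/28 (about 0.036), so no polymorphic site can have MAF ≤ 0.01 there and only sites with MAF = 0 pass the filter. In the cultivated subset one copy is 1/314 (about 0.003), so sites with up to 3 copies of the minor allele pass. That would explain why the MAF ≤ 1% count there is somewhat larger than the singleton count rather than equal to it.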
