The difference between singleton SNPs and SNPs with MAF less than or equal to 0.01
3.9 years ago
Hann ▴ 110

Hello all,

I struggled the whole day trying to figure out what the real difference is between SNPs present in only one individual (singletons) and SNPs with a minor allele frequency (MAF) of ≤ 1%.

I have one SNP data set covering 2 plant species (14 wild individuals and 157 cultivated individuals), with a total of 11,046,501 SNPs.

I used the --singletons option of vcftools to get a list of the SNPs that occur in only one individual.

This gave 3,671,719 singleton SNPs. I then used the vcftools option --max-maf 0.01 to keep sites with MAF less than or equal to 0.01; counting the SNPs in the resulting VCF file gave 4,901,160 SNPs (44% of the total SNPs). That's a lot :(

The big issue appears when I make two subsets of the VCF file. The first has the wild individuals only (14 samples with 9,839,152 SNPs). I ran the same --singletons and --max-maf 0.01 calculations on it, and what surprised me was the large number of singletons (3,513,130) and the small number of SNPs with MAF ≤ 1% (only 160,140 SNPs).

The other VCF subset has the cultivated samples only (157 individuals with 2,617,322 SNPs). It has 836,609 singleton SNPs and 989,630 SNPs with MAF ≤ 1% (the two numbers are more or less similar, unlike in the wild subset). I am so confused!

I tried outputting SNPs with MAF ≤ 1% with plink as well, and it gave exactly the same results.

How should I interpret these results? I was expecting similar numbers of singletons and SNPs with MAF ≤ 1%, but it seems it doesn't work that way. So back to the main question:

What is the difference between singleton SNPs and SNPs with MAF ≤ 0.01?

P.S. Just to clarify: removing the wild individuals (n=14), together with the SNPs that are then no longer polymorphic, results in a huge drop in the number of SNPs. This is because the wild samples are highly diverse compared to the cultivated ones (there are a lot of SNPs and differences between the wild genome sequences and the reference genome).

Thanks for your help!

SNP sequencing population genetics • 2.4k views
3.9 years ago
Hann ▴ 110

In the end, this observation was easy to explain.

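An allele frequency calculated from at most 14 individuals is not comparable with an allele frequency calculated from a bigger population. In the wild population (n=14), the same MAF cutoff (0.01) keeps very few SNPs, because the only way to get a minor allele frequency ≤ 0.01 there is for all individuals to carry the same allele, which means the frequency is actually 0. If only one individual has a different allele (that is, a singleton), the frequency of that allele is already about 1/14 ≈ 0.07 when counted per individual (or 1/28 ≈ 0.036 when counted per chromosome, if the samples are diploid), which is higher than 0.01. In the cultivated set (n=157), by contrast, a single differing individual gives a frequency below 0.01, so singletons fall inside the MAF ≤ 1% class and the two counts are similar.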
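The arithmetic behind this can be checked with a few lines of Python. This is only a minimal sketch under assumptions that are not stated in the thread: diploid genotypes, frequencies counted per chromosome, and no missing calls. The function names are hypothetical helpers, not part of vcftools or plink.

```python
from fractions import Fraction

MAF_CUTOFF = Fraction(1, 100)  # the 0.01 threshold passed to --max-maf


def singleton_frequency(n_individuals, ploidy=2):
    """Frequency of an allele seen exactly once among all sampled chromosomes.

    Assumes every individual has a called genotype (no missing data).
    """
    return Fraction(1, n_individuals * ploidy)


def max_copies_under_cutoff(n_individuals, cutoff=MAF_CUTOFF, ploidy=2):
    """Largest minor-allele count whose frequency is still <= cutoff."""
    total_chromosomes = n_individuals * ploidy
    return int(cutoff * total_chromosomes)  # floor for non-negative values


# Sample sizes taken from the question: wild, cultivated, and combined.
for label, n in (("wild", 14), ("cultivated", 157), ("all", 171)):
    freq = singleton_frequency(n)
    copies = max_copies_under_cutoff(n)
    verdict = "passes" if freq <= MAF_CUTOFF else "fails"
    print(f"{label:>10}: singleton frequency = {float(freq):.4f} "
          f"({verdict} MAF <= 0.01); up to {copies} minor-allele copies allowed")
```

Under these assumptions, the wild subset allows 0 copies of the minor allele below the cutoff, so only sites that are monomorphic within those 14 samples pass, while the cultivated and combined sets allow up to 3 copies and therefore include all true singletons. Note also that the vcftools documentation describes --singletons as reporting private doubletons (one individual homozygous for the minor allele) alongside true singletons, and such sites carry 2 copies rather than 1.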
