SNP data analysis - advice on statistics
8.4 years ago
dovah ▴ 40

Hi,

I hope this is the right forum to ask you for some advice on how to analyse SNP data.

I have analysed the SNPs in different strains of the same species (I have 30 of them). I have recorded the SNP count for each strain in four different genomic regions (coding, non-coding, etc.). The data are stored in a file with three columns: $1-SNPcount; $2-strain; $3-region. "SNP" is thus a numerical variable, while "strain" and "region" are factors.

How would you advise analysing these data statistically? I plotted the percentage of SNPs in each region per strain, but this obviously does not take into account how SNP-rich each strain is overall. I could fit a glm(SNP~strain+region), but the coefficients of the model would depend on which level of each factor is chosen as the "reference".

I am grateful for any constructive advice :)

snp genome R • 2.1k views

I don't see what question you want answered. Do you want to know whether some strains/regions have more SNPs than others?


Yes, exactly.


I would suggest that you first control for coverage, to see whether it may bias your results. Then you could compare the proportions of coding and non-coding SNPs (or any other "category") between strains using Fisher's exact test.
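
To make the Fisher's exact test suggestion concrete, here is a minimal Python sketch for one pair of strains, using only the standard library. The SNP counts are hypothetical placeholders, not data from this thread, and `fisher_exact_2x2` is an illustrative helper written for this example rather than a library function. In R, the built-in `fisher.test` runs the same kind of test on a contingency matrix.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are strains, columns are SNP categories (e.g. coding, non-coding).
    Returns the p-value under the null hypothesis that the category
    proportions are the same in both strains.
    """
    row1 = a + b          # total SNPs in the first strain
    row2 = c + d          # total SNPs in the second strain
    col1 = a + c          # total SNPs in the first category
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)

    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance absorbs floating-point rounding.
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


# Hypothetical counts: (coding SNPs, non-coding SNPs) for two strains.
strain_a = (120, 880)
strain_b = (90, 410)

p = fisher_exact_2x2(*strain_a, *strain_b)
print(f"two-sided p-value: {p:.4g}")
```

With 30 strains there are 435 possible pairwise comparisons, so the resulting p-values would need a multiple-testing correction. A single chi-square test on the full strain-by-region table is one alternative for asking whether the proportions differ at all.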

8.2 years ago
reza.jabal ▴ 580

Hi,

I believe you should first investigate population structure with PCA (Principal Component Analysis) to detect potential outliers, before committing yourself to any further analysis. You may find this from the "Cross Validated" forum useful!
