Question: Coding genotypes from nucleotides to 0/1/2 in R.
0
gravatar for devenvyas
4.0 years ago by
devenvyas570
Stony Brook
devenvyas570 wrote:

I have a massive data table with dbSNP rs ids as rows and samples as columns in this kind of format

dbSNP    Sample    Sample    Sample    Sample    Sample    Sample
rs10000011    CC    CC    CC    CC    TC    TC
rs1000002    TC    TT    CC    TT    TT    TC
rs10000023    TG    TG    TT    TG    TG    TG
rs1000003    AA    AG    AG    AA    AA    AG
rs10000041    TT    TG    TT    TT    TG    GG
rs10000046    GG    GG    AG    GG    GG    GG
rs10000057    AA    AG    GG    AA    AA    AA
rs10000073    TC    TT    TT    TT    TT    TT
rs10000092    TC    TC    CC    TC    TT    TT

There are over a 1000 samples and >547,000 loci in this table from a HGDP dataset (ftp://ftp.cephb.fr/hgdp_supp10/), and I would like to do a massive Principle Component Analysis (with samples colored based on population).

In order to do that, I need to code my genotypes first. I was wondering, how would I do this (preferably in R, as the file is probably too big for JMP Genomics)?

Also, I have some spots lacking data, which are indicated by --- or 00. I am going to standardize those to NA using a find and replace script, but how do I code it so R will still be able to run the PCA. Thanks!

 

snp R • 4.6k views
ADD COMMENTlink modified 3.9 years ago by brownhaircolor0 • written 4.0 years ago by devenvyas570
1

I am not sure if R can easily handle this big dataset either. I would suggest you to use PLINK (that directly evaluates the PCA), but I am afraid you will need to create extra files to describe your data. See: https://www.cog-genomics.org/plink2/input and here https://www.cog-genomics.org/plink2/strat#pca.

ADD REPLYlink written 3.9 years ago by alesssia520

I can run R on UF's HPC cluster though. It will be able to handle it there.

ADD REPLYlink written 3.9 years ago by devenvyas570
2
gravatar for alesssia
3.9 years ago by
alesssia520
London, UK
alesssia520 wrote:

I still believe that using PLINK is a better idea, but if you want to use R, you can write a script that, for each line of your matrix (SNP), selects the major/minor allele and translates each genotype data (CC, TC, CT, TT, NN, ...) into 0,1,2, or missing accordingly. Use apply to cycle over the matrix to have better performance.

ADD COMMENTlink written 3.9 years ago by alesssia520

But how do I get my data into Plink in the first place?

ADD REPLYlink written 3.9 years ago by devenvyas570

You can modify your file in order to have the correct PLINK format. See https://www.cog-genomics.org/plink2/input and https://www.cog-genomics.org/plink2/formats. You may be specifically interested in the .tped file (https://www.cog-genomics.org/plink2/formats#tped). Note that you need to add extra columns (mastering shell scripting will be useful) and  that you should 'separate' the alleles in your genomic data (that is "CC" should be "C C").

To recode see: https://www.cog-genomics.org/plink2/data#recode.

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by alesssia520

I'm not even sure plink can handle something this large. If it uses SVD then it'll need to use its own linear algebra functions, since BLAS/LAPACK use 32 bit matrix indices (at least in the fortran code, not sure about the C versions).

ADD REPLYlink written 3.9 years ago by Devon Ryan90k

PLINK's --pca only passes a [# of samples] x [# of samples] matrix to BLAS/LAPACK, so this dataset is not too large.  (I think the current limit is around 46k samples, though I haven't tested this.)

EIGENSOFT 6 has a fast PCA approximation which may work in some cases PLINK does not.

ADD REPLYlink written 3.9 years ago by chrchang5235.0k

I am looking into Eigensoft. For my eventual, primary data application (calculating admixture) I need Eigenstrat file format for Admixtools.

For now, JMP Genomics was able to code the genotypes for just my actual samples (92 samples, 587297 SNPs) and got R to do a PCA on just those (at least I think it is), which ate through 24GB of RAM in 3 hours. I am re-running those job with 48 GB to see if that is enough; I cannot imagine how much RAM the merged dataset would take.

I got a script from support at UF HPC that was able to code the fully data set (but it excludes the sites where the two sets are reading opposite strands).

ADD REPLYlink written 3.9 years ago by devenvyas570
2
gravatar for Jimbou
3.9 years ago by
Jimbou690
Germany
Jimbou690 wrote:

You can use the SNPassoc package. But if the data is big plink would be the better choice as mentioned from alesssia already.

library(SNPassoc)
data("SNPs")
d <- SNPs[,-c(1:5)]

Create a SNP Matrix

myDat<-setupSNP(d,1:ncol(d),sep="")

Use the additive function and apply to create a numerical variable.

mydatnum <- apply(myDat,2,additive)

Prior the transformation you should exchange 00 or --- to NA with something like this:

d[ d == 00 ] <- NA
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Jimbou690

Many thanks to you Jimbou ( Vielen Dank) It took two weeks to find an answer how to code my genotype into numerical variables in R

Still ill test your way hoping that it works. Otherwise i want to contact with you for extra help if thats ok and possible.

Best, Halah

ADD REPLYlink written 2.3 years ago by um_helal_9990
0
gravatar for devenvyas
3.9 years ago by
devenvyas570
Stony Brook
devenvyas570 wrote:

Anyone have any suggestions? I tried stackoverflow, but they sent me back here.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by devenvyas570
0
gravatar for george.ry
3.9 years ago by
george.ry1.1k
United Kingdom
george.ry1.1k wrote:
head -1 yourbigfile \
> yournewfile \
&& paste \
    <(tail -n+2 yourbigfile | cut -f1) \
    <(tail -n+2 yourbigfile | cut -f2- | sed $(python2 -c "from itertools import product; print 's/---/NA/g;s/00/NA/g;' + ';'.join(['s/{}/{}/g'.format(''.join(p), i) for i, p in enumerate(product('ACGT', 'ACGT'))])")) \
>> yournewfile
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by george.ry1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 817 users visited in the last hour