Convert plink format to genotype input format
1
0
Entering edit mode
17 months ago
berndmann ▴ 10

How can I convert from plink format (bed,bim and fam) to some input genotype formats which are defined as followed:

Format 1:

The first value in each line is the individual’s id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing).

Example:

id1 1 1 2 0 1 1 1 1 1 0
id2 0 2 1 1 0 1 1 1 2 2
id3 1 2 0 1 2 1 0 1 2 0
id4 2 1 1 1 1 1 1 1 2 1

Format 2:

For each individual there are two lines. The first line gives the individual’s id and the read counts for the reference allele. The second line gives the individual’s id and the read counts for the alternative allele.

Example:

id1 0 0 0 2 1 1 2 0 2 1
id1 3 1 2 1 2 2 1 4 4 3
id2 1 0 1 1 4 2 1 2 3 1
id2 0 1 2 1 1 1 3 2 3 2
id3 0 2 1 3 2 1 3 1 2 2
id3 2 3 3 1 2 2 0 2 1 2
id4 1 1 4 0 0 1 1 2 1 1
id4 1 3 2 3 2 1 2 2 1 3
genotype plink • 1.4k views
ADD COMMENT
0
Entering edit mode

the thing is, plink .bed files dont really include an entry for read count. so this makes me suspect you have a second file (perhaps a .fasta, a .bam, or a .vcf file?) what other inputs would you be drawing from here?

ADD REPLY
0
Entering edit mode

Sure! I should have been more precise. I have a full set of plink format files, bed,bim and fam, which I can use. I will edit the question. Thanks pointing to this.

ADD REPLY
0
Entering edit mode

So I guess we are looking at some weird version of the EIGENSTRAT format (https://reich.hms.harvard.edu/software/InputFileFormats). So converting from .ped to EIGENSTRAT and adjust the file might be the way to go?. Maybe one can confirm this?

ADD REPLY
1
Entering edit mode
16 months ago
berndmann ▴ 10

For the first format one could do something like this in R :

system("plink --bfile 1kg_22_trio --snps-only --recode")
#we have to create a par.PED.EIGENSTRAT file to use convertf see example in EIGENSTRAT software repository
#call covertf afterwards
system("convertf -p par.PED.EIGENSTRAT")

#read eigenstrat format
eigen <- read_tsv("1kg_22_trio.eigenstratgeno", col_names = F)

#format is long format we need wide format 
#first splitting by each character
split <- eigen$X1 %>% strsplit('')

#convert to wide matrix and add sample id to first value of each row
wide <- rbind(c("HG00421",sapply(split, "[[", 1)),c("HG00422",sapply(split, "[[", 2)) ,c("HG00423",sapply(split, "[[", 3)),
              c("HG00445",sapply(split, "[[", 4)), c("HG00446",sapply(split, "[[", 5)),c("HG00447",sapply(split, "[[", 6))) %>%
  as_tibble()

#write as genotype.txt
write_delim(wide,'genotype.txt',col_names = F)

Still puzzeling about the second format.

ADD COMMENT

Login before adding your answer.

Traffic: 2835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6