Question

Convert plink format to genotype input format

0

Entering edit mode

17 months ago

berndmann ▴ 10

How can I convert from plink format (bed,bim and fam) to some input genotype formats which are defined as followed:

Format 1:

The first value in each line is the individual’s id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing).

Example:

id1 1 1 2 0 1 1 1 1 1 0
id2 0 2 1 1 0 1 1 1 2 2
id3 1 2 0 1 2 1 0 1 2 0
id4 2 1 1 1 1 1 1 1 2 1

Format 2:

For each individual there are two lines. The first line gives the individual’s id and the read counts for the reference allele. The second line gives the individual’s id and the read counts for the alternative allele.

Example:

id1 0 0 0 2 1 1 2 0 2 1
id1 3 1 2 1 2 2 1 4 4 3
id2 1 0 1 1 4 2 1 2 3 1
id2 0 1 2 1 1 1 3 2 3 2
id3 0 2 1 3 2 1 3 1 2 2
id3 2 3 3 1 2 2 0 2 1 2
id4 1 1 4 0 0 1 1 2 1 1
id4 1 3 2 3 2 1 2 2 1 3

genotype plink • 1.4k views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 17 months ago by berndmann ▴ 10

0

Entering edit mode

the thing is, plink .bed files dont really include an entry for read count. so this makes me suspect you have a second file (perhaps a .fasta, a .bam, or a .vcf file?) what other inputs would you be drawing from here?

ADD REPLY • link 17 months ago by LauferVA 4.2k

0

Entering edit mode

Sure! I should have been more precise. I have a full set of plink format files, bed,bim and fam, which I can use. I will edit the question. Thanks pointing to this.

ADD REPLY • link 17 months ago by berndmann ▴ 10

0

Entering edit mode

So I guess we are looking at some weird version of the EIGENSTRAT format (https://reich.hms.harvard.edu/software/InputFileFormats). So converting from .ped to EIGENSTRAT and adjust the file might be the way to go?. Maybe one can confirm this?

ADD REPLY • link 16 months ago by berndmann ▴ 10

score 1 · Accepted Answer · 2022-12-03

For the first format one could do something like this in R :

system("plink --bfile 1kg_22_trio --snps-only --recode")
#we have to create a par.PED.EIGENSTRAT file to use convertf see example in EIGENSTRAT software repository
#call covertf afterwards
system("convertf -p par.PED.EIGENSTRAT")

#read eigenstrat format
eigen <- read_tsv("1kg_22_trio.eigenstratgeno", col_names = F)

#format is long format we need wide format 
#first splitting by each character
split <- eigen$X1 %>% strsplit('')

#convert to wide matrix and add sample id to first value of each row
wide <- rbind(c("HG00421",sapply(split, "[[", 1)),c("HG00422",sapply(split, "[[", 2)) ,c("HG00423",sapply(split, "[[", 3)),
              c("HG00445",sapply(split, "[[", 4)), c("HG00446",sapply(split, "[[", 5)),c("HG00447",sapply(split, "[[", 6))) %>%
  as_tibble()

#write as genotype.txt
write_delim(wide,'genotype.txt',col_names = F)

Still puzzeling about the second format.