Convert plink format to genotype input format
1
0
Entering edit mode
9 weeks ago
berndmann ▴ 10

How can I convert from plink format (bed,bim and fam) to some input genotype formats which are defined as followed:

Format 1: The first value in each line is the individual’s id. The remaining values are the genotypes of the individual at each locus, either 0, 1, or 2 (or 9 if missing).

Example:

id1 1 1 2 0 1 1 1 1 1 0
id2 0 2 1 1 0 1 1 1 2 2
id3 1 2 0 1 2 1 0 1 2 0
id4 2 1 1 1 1 1 1 1 2 1


Format 2 : For each individual there are two lines. The first line gives the individual’s id and the read counts for the reference allele. The second line gives the individual’s id and the read counts for the alternative allele.

Example:

id1 0 0 0 2 1 1 2 0 2 1
id1 3 1 2 1 2 2 1 4 4 3
id2 1 0 1 1 4 2 1 2 3 1
id2 0 1 2 1 1 1 3 2 3 2
id3 0 2 1 3 2 1 3 1 2 2
id3 2 3 3 1 2 2 0 2 1 2
id4 1 1 4 0 0 1 1 2 1 1
id4 1 3 2 3 2 1 2 2 1 3

format plink genotype convert • 393 views
0
Entering edit mode

the thing is, plink .bed files dont really include an entry for read count. so this makes me suspect you have a second file (perhaps a .fasta, a .bam, or a .vcf file?) what other inputs would you be drawing from here?

0
Entering edit mode

Sure! I should have been more precise. I have a full set of plink format files, bed,bim and fam, which I can use. I will edit the question. Thanks pointing to this.

0
Entering edit mode

So I guess we are looking at some weird version of the EIGENSTRAT format (https://reich.hms.harvard.edu/software/InputFileFormats). So converting from .ped to EIGENSTRAT and adjust the file might be the way to go?. Maybe one can confirm this?

1
Entering edit mode
9 weeks ago
berndmann ▴ 10

For the first format one could do something like this in R :

system("plink --bfile 1kg_22_trio --snps-only --recode")
#we have to create a par.PED.EIGENSTRAT file to use convertf see example in EIGENSTRAT software repository
#call covertf afterwards
system("convertf -p par.PED.EIGENSTRAT")

eigen <- read_tsv("1kg_22_trio.eigenstratgeno", col_names = F)

#format is long format we need wide format
#first splitting by each character
split <- eigen\$X1 %>% strsplit('')

#convert to wide matrix and add sample id to first value of each row
wide <- rbind(c("HG00421",sapply(split, "[[", 1)),c("HG00422",sapply(split, "[[", 2)) ,c("HG00423",sapply(split, "[[", 3)),
c("HG00445",sapply(split, "[[", 4)), c("HG00446",sapply(split, "[[", 5)),c("HG00447",sapply(split, "[[", 6))) %>%
as_tibble()

#write as genotype.txt
write_delim(wide,'genotype.txt',col_names = F)


Still puzzeling about the second format.