Question

Calculate average parental genotype

1

Entering edit mode

5.2 years ago

coleman_jonathan ▴ 470

I am interested in creating an averaged value for genotype across two parents:

E.g. Input (three unrelated couples, 1, 2 and 3):

Individual    SNP1    SNP2    SNP3    SNP4 ...
Father_1      1       0       1       2    ...
Mother_1      0       0       2       1    ...
Father_2      1       0       1       1    ...
Mother_2      2       1       0       1    ...
Father_3      1       2       0       1    ...
Mother_3      1       .       0       2    ...

E.g. Output (the midparental genotypes for couple 1, 2 and 3):

Individual    SNP1    SNP2    SNP3    SNP4 ...
Midparent_1   0.5     0       1.5     1.5  ...
Midparent_2   1.5     0.5     0.5     1    ...
Midparent_3   1       .       0       1.5  ...

I can think of a way to do this using e.g. a PLINK .raw file and manipulating that in R, but am concerned that would become memory- and storage-heavy when assessing 100000s of SNPs and 1000s of pairs.

Does anyone know of an existing tool that can do this?

Thanks

gwas SNP • 1.4k views

ADD COMMENT • link updated 5.2 years ago by finswimmer 16k • written 5.2 years ago by coleman_jonathan ▴ 470

1

Entering edit mode

Can you provide more example rows, add more parent pairs. Pretty sure it is easily done using R. Something like split by ped lapply, then use colMeans.

ADD REPLY • link 5.2 years ago by zx8754 11k

0

Entering edit mode

Thanks - have added. My concern is less the method to do so in R (if we end up doing that, I'll add the code as an answer), but more whether there's a less computationally-intensive / more sophisticated method that I'm missing.

ADD REPLY • link 5.2 years ago by coleman_jonathan ▴ 470

0

Entering edit mode

Do we assume rows 1,2 is one parent pair and rows 3,4 is another parent pair, etc?

ADD REPLY • link 5.2 years ago by zx8754 11k

0

Entering edit mode

That can be assumed (in that it would be straightforward to ensure the file is formatted in that way)

ADD REPLY • link 5.2 years ago by coleman_jonathan ▴ 470

0

Entering edit mode

Hello,

you've also posted this question on stackexchange, but with slightly (but important) differences in your in- and output. You should tell us when you cross-post your questions.

Thanks.

fin swimmer

ADD REPLY • link 5.2 years ago by finswimmer 16k

score 0 · Answer 1 · 2019-02-08

We could use R:

# example data
df1 <- read.table(text = "
Individual  SNP1    SNP2    SNP3    SNP4
Father_1    1   0   1   2
Mother_1    0   0   2   1
Father_2    1   0   1   1
Mother_2    2   1   0   1
Father_3    1   2   0   1
Mother_3    1   .   0   2
", header = TRUE, na.string = ".")

Using loops

t(sapply(split(df1[, -1], rep(1:(nrow(df1)/2), each = 2)), colMeans))
#   SNP1 SNP2 SNP3 SNP4
# 1  0.5  0.0  1.5  1.5
# 2  1.5  0.5  0.5  1.0
# 3  1.0   NA  0.0  1.5

Or

# subset using recycling then get mean
(df1[ c(TRUE, FALSE), -1] + df1[ c(FALSE, TRUE), -1]) / 2

Using awk:

# get odd rows
sed -n 'p;n' original.txt > file1.txt
# get even rows
sed -n 'n;p' original.txt > file2.txt

Then use the awk script from SO: Average of multiple files in shell:

awk 'FNR == 1 { nfiles++; ncols = NF }
     { for (i = 1; i < NF; i++) sum[FNR,i] += $i
       if (FNR > maxnr) maxnr = FNR
     }
     END {
         for (line = 1; line <= maxnr; line++)
         {
             for (col = 1; col < ncols; col++)
                  printf "  %f", sum[line,col]/nfiles;
             printf "\n"
         }
     }' file*.txt

score 0 · Answer 2 · 2019-02-08

In python you can do this with reading the file two lines at a time, so you never hold more than two lines at once. This will use next to no memory, and I'm pretty sure it will be much faster than the R solution. There is probably also a solution with pandas and chunking the input file read.

infile = open("infile.tsv")
header = infile.readline()
header = header.strip().split()[1:]
print("\t".join(header))

while True:

    line1 = infile.readline()
    line2 = infile.readline()

    if (not line1) or (not line2): break # EOF

    assert "Father" in line1 and "Mother" in line2

    genotypes1 = map(float, line1.strip().split()[1:])
    genotypes2 = map(float, line2.strip().split()[1:])

    average_genotypes = [str((x+y)/2) for x, y in zip(genotypes1, genotypes2)]

    print("\t".join(average_genotypes))