Calculate average parental genotype
2
1
Entering edit mode
5.2 years ago

I am interested in creating an averaged value for genotype across two parents:

E.g. Input (three unrelated couples, 1, 2 and 3):

Individual    SNP1    SNP2    SNP3    SNP4 ...
Father_1      1       0       1       2    ...
Mother_1      0       0       2       1    ...
Father_2      1       0       1       1    ...
Mother_2      2       1       0       1    ...
Father_3      1       2       0       1    ...
Mother_3      1       .       0       2    ...

E.g. Output (the midparental genotypes for couple 1, 2 and 3):

Individual    SNP1    SNP2    SNP3    SNP4 ...
Midparent_1   0.5     0       1.5     1.5  ...
Midparent_2   1.5     0.5     0.5     1    ...
Midparent_3   1       .       0       1.5  ...

I can think of a way to do this using e.g. a PLINK .raw file and manipulating that in R, but am concerned that would become memory- and storage-heavy when assessing 100000s of SNPs and 1000s of pairs.

Does anyone know of an existing tool that can do this?

Thanks

gwas SNP • 1.4k views
ADD COMMENT
1
Entering edit mode

Can you provide more example rows, add more parent pairs. Pretty sure it is easily done using R. Something like split by ped lapply, then use colMeans.

ADD REPLY
0
Entering edit mode

Thanks - have added. My concern is less the method to do so in R (if we end up doing that, I'll add the code as an answer), but more whether there's a less computationally-intensive / more sophisticated method that I'm missing.

ADD REPLY
0
Entering edit mode

Do we assume rows 1,2 is one parent pair and rows 3,4 is another parent pair, etc?

ADD REPLY
0
Entering edit mode

That can be assumed (in that it would be straightforward to ensure the file is formatted in that way)

ADD REPLY
0
Entering edit mode

Hello,

you've also posted this question on stackexchange, but with slightly (but important) differences in your in- and output. You should tell us when you cross-post your questions.

Thanks.

fin swimmer

ADD REPLY
0
Entering edit mode
5.2 years ago
zx8754 11k

We could use R:

# example data
df1 <- read.table(text = "
Individual  SNP1    SNP2    SNP3    SNP4
Father_1    1   0   1   2
Mother_1    0   0   2   1
Father_2    1   0   1   1
Mother_2    2   1   0   1
Father_3    1   2   0   1
Mother_3    1   .   0   2
", header = TRUE, na.string = ".")

Using loops

t(sapply(split(df1[, -1], rep(1:(nrow(df1)/2), each = 2)), colMeans))
#   SNP1 SNP2 SNP3 SNP4
# 1  0.5  0.0  1.5  1.5
# 2  1.5  0.5  0.5  1.0
# 3  1.0   NA  0.0  1.5

Or

# subset using recycling then get mean
(df1[ c(TRUE, FALSE), -1] + df1[ c(FALSE, TRUE), -1]) / 2

Using awk:

# get odd rows
sed -n 'p;n' original.txt > file1.txt
# get even rows
sed -n 'n;p' original.txt > file2.txt

Then use the awk script from SO: Average of multiple files in shell:

awk 'FNR == 1 { nfiles++; ncols = NF }
     { for (i = 1; i < NF; i++) sum[FNR,i] += $i
       if (FNR > maxnr) maxnr = FNR
     }
     END {
         for (line = 1; line <= maxnr; line++)
         {
             for (col = 1; col < ncols; col++)
                  printf "  %f", sum[line,col]/nfiles;
             printf "\n"
         }
     }' file*.txt
ADD COMMENT
0
Entering edit mode
5.2 years ago

In python you can do this with reading the file two lines at a time, so you never hold more than two lines at once. This will use next to no memory, and I'm pretty sure it will be much faster than the R solution. There is probably also a solution with pandas and chunking the input file read.

infile = open("infile.tsv")
header = infile.readline()
header = header.strip().split()[1:]
print("\t".join(header))

while True:

    line1 = infile.readline()
    line2 = infile.readline()

    if (not line1) or (not line2): break # EOF

    assert "Father" in line1 and "Mother" in line2

    genotypes1 = map(float, line1.strip().split()[1:])
    genotypes2 = map(float, line2.strip().split()[1:])

    average_genotypes = [str((x+y)/2) for x, y in zip(genotypes1, genotypes2)]

    print("\t".join(average_genotypes))
ADD COMMENT

Login before adding your answer.

Traffic: 2658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6