Question

Closed: Standardization of A1 / A2 designations across multiple .ped files for downstream processing

7.9 years ago

LauferVA 4.2k

Hello,

I am working with transethnic data. I have 4 different .ped files covering the same genomic locus. Because the populations differ, the major and minor alleles for certain SNPs differ between ethnicities. In addition, we are combining in-house data with data from the 1000 Genomes Project, so there are strand issues as well. Here is a sample of a few of the SNPs in the region.

Chr  Pos      RSID        RA_A1  RA_A2  ASW_A1  ASW_A2  CEU_A1  CEU_A2  YRI_A1  YRI_A2
5    8017939  rs2167898   A      G      A       G       A       G       A       G
5    7991791  rs6879924   G      A      G       A       G       A       G       A
5    8049644  rs6896973   G      A      G       A       G       A       G       A
5    8048942  rs17249829  G      A      C       T       C       T       C       T
5    8012069  rs2924461   A      G      A       G       G       A       A       G
5    7985855  rs2654520   G      A      C       T       C       T       T       C
5    8026075  rs6893120   G      A      C       T       C       T       C       T
5    8047543  rs876712    T      G      A       C       A       C       A       C
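To make the kinds of disagreement in this table explicit, here is a minimal Python sketch that compares one cohort's (A1, A2) pair against a chosen reference pair. Treating the in-house RA columns as the reference is only an assumption for illustration, and the function name and category labels are made up here, not taken from any existing tool.

```python
# Watson-Crick complements, used to express an allele on the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify(ref_pair, other_pair):
    """Describe how other_pair (A1, A2) relates to ref_pair (A1, A2).

    Returns one of: "ambiguous", "match", "swap", "flip",
    "flip+swap", "mismatch".
    """
    ref_a1, ref_a2 = ref_pair
    a1, a2 = other_pair

    # A/T and C/G SNPs look identical on both strands, so the strand
    # cannot be inferred from the allele codes alone.
    if COMPLEMENT[ref_a1] == ref_a2:
        return "ambiguous"

    flipped = (COMPLEMENT[a1], COMPLEMENT[a2])
    if (a1, a2) == (ref_a1, ref_a2):
        return "match"        # same strand, same A1/A2 order
    if (a2, a1) == (ref_a1, ref_a2):
        return "swap"         # same strand, A1 and A2 exchanged
    if flipped == (ref_a1, ref_a2):
        return "flip"         # opposite strand, same A1/A2 order
    if flipped[::-1] == (ref_a1, ref_a2):
        return "flip+swap"    # opposite strand and A1/A2 exchanged
    return "mismatch"         # alleles cannot be reconciled


# Rows from the table above, RA columns used as the reference.
print(classify(("G", "A"), ("C", "T")))  # rs17249829, RA vs ASW
print(classify(("A", "G"), ("G", "A")))  # rs2924461, RA vs CEU
print(classify(("G", "A"), ("T", "C")))  # rs2654520, RA vs YRI
```

None of the SNPs in this sample are A/T or C/G, but any that are elsewhere in the region would come back as "ambiguous" and need allele frequencies or another source of strand information to resolve.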

I have 2 downstream applications for this data.

The first is to make haplotypes for ~20 to 40 SNPs in the region. For this application, I think I can simply flip the strand of certain SNPs and then use the --recodeHV feature in plink. I do not think I need to worry if the major and minor designations are opposite, because the .info file output by --recodeHV and read by Haploview does not have an A1 or A2 column in the first place. Is this accurate? Are there concerns I should be aware of with simply strand flipping and then recoding?
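As a rough sketch of the strand-flipping step for this first application, the Python below picks out the SNPs whose allele set in one cohort is the complement of the reference allele set, ignoring A1/A2 order since only strand matters here. The helper names and the output file name are hypothetical, and using the RA columns as the reference is an assumption.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def needs_flip(ref_alleles, cohort_alleles):
    """True if the cohort's alleles are the reference alleles on the other strand."""
    ref = set(ref_alleles)
    cohort = set(cohort_alleles)
    if cohort == ref:
        # Already on the same strand. This also covers A/T and C/G SNPs,
        # whose strand cannot be resolved from allele codes alone.
        return False
    return {COMPLEMENT[a] for a in cohort} == ref


def flip_list(rows):
    """rows: iterable of (rsid, ref_alleles, cohort_alleles) tuples."""
    return [rsid for rsid, ref, cohort in rows if needs_flip(ref, cohort)]


def write_flip_list(rsids, path):
    """Write one SNP ID per line."""
    with open(path, "w") as handle:
        for rsid in rsids:
            handle.write(rsid + "\n")


# RA (reference) vs CEU, taken from the table in this post.
rows = [
    ("rs2167898", ("A", "G"), ("A", "G")),
    ("rs17249829", ("G", "A"), ("C", "T")),
    ("rs2924461", ("A", "G"), ("G", "A")),
    ("rs2654520", ("G", "A"), ("C", "T")),
    ("rs876712", ("T", "G"), ("A", "C")),
]
to_flip = flip_list(rows)
write_flip_list(to_flip, "ceu_flip.txt")  # hypothetical file name
```

A one-ID-per-line file like this is the kind of input PLINK's --flip option takes, so it could be applied to the CEU fileset before recoding.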

The second is to run the PAINTOR2 algorithm on people with RA from 3 global populations. It requires 3 things: Z-scores for each SNP, an LD matrix, and functional annotation information (the last is not the subject of this post). I will be using the 1000 Genomes data to get an LD matrix for the SNPs in the region, but not the association statistics; those come directly from the meta-analysis. In this case, using PLINK could be problematic. The first reason is that the designation of ref vs. alt alleles for the LD computations must match the ref/alt alleles for the Z-score computations for each SNP. If the LD computations are performed in PLINK, it would automatically use the minor allele as the "1" allele even if it is not necessarily the reference allele in one of the datasets. The second is that the A1/A2 designations in the meta-analysis might not match those from 1000 Genomes, meaning that those would need to be made to match. After this, I would need to make THOSE match with the designations from our in-house data.
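To illustrate what "making them match" means numerically, here is a small Python sketch. It assumes the LD matrix holds signed correlations r between allele counts (not r²), and that a per-SNP flag already says whether the counted allele differs between the Z-score source and the LD source. Under those assumptions, re-expressing a SNP on the other allele negates its Z-score, or equivalently negates its row and column in the LD matrix. The function names are made up for this example, and only one of the two functions should be applied to a given dataset.

```python
def flip_z(z_scores, swapped):
    """Negate the Z-score of every SNP whose counted allele is swapped."""
    return [-z if s else z for z, s in zip(z_scores, swapped)]


def flip_ld(ld, swapped):
    """Negate row and column i of a signed-r matrix for every swapped SNP i.

    Entry (i, j) changes sign only when exactly one of the two SNPs is
    swapped, so the diagonal stays at 1.
    """
    signs = [-1.0 if s else 1.0 for s in swapped]
    n = len(ld)
    return [[signs[i] * signs[j] * ld[i][j] for j in range(n)] for i in range(n)]


# Toy numbers, not real data: three SNPs, the second one is swapped.
z = [2.0, -1.5, 0.5]
ld = [
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.3],
    [-0.2, 0.3, 1.0],
]
swapped = [False, True, False]

z_matched = flip_z(z, swapped)      # option 1: align Z-scores to the LD coding
ld_matched = flip_ld(ld, swapped)   # option 2: align the LD matrix to the Z-scores
```

Strand flips alone do not change any signs; only a change in which allele is counted does, which is why the two issues have to be tracked separately.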

I could write all my own software to do this. For instance, I could write a script to get a Z-score for each SNP in the way that I want. But I am anxious to avoid a lot of the error-prone I/O steps if possible, not to mention anxious to avoid rewriting functionality that has already been written for these exact applications. Can anyone suggest options for getting LD relationships between SNPs in the region from 1kG, appropriately combining those with A1/A2 designations from a meta-analysis that might not be the same (including both major/minor allele issues and strand issues), and appropriately combining my data with this data?

Here are references for the software involved:

plink data management - http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml

haploview input files - https://www.broadinstitute.org/scientific-community/science/programs/medical-and-population-genetics/haploview/chapter-2-files

PAINTOR 2 - https://github.com/gkichaev/PAINTOR_FineMapping

Strand-flip trans-ethnic meta-analysis
