Phasing with SHAPEIT
2
0
Entering edit mode
2.4 years ago
mel22 ▴ 100

Hi,

I am trying to phase genotyping data (Plink format file) for imputation later. But I have an alignment problem between my data and the reference data set , because of it the most of my SNPs are excluded ...

Is there any thing to do to avoid losing data ?

SNP impute shapeit plink • 5.0k views
ADD COMMENT
1
Entering edit mode

Have you tried to flip the SNPs that are discordant between your data and your reference (e.g.: --flip function in PLINK) and see if fewer SNPs are excluded?

ADD REPLY
0
Entering edit mode

thank you I will try this

ADD REPLY
1
Entering edit mode

What is the source of the data that you are imputing, and what is the reference panel? Also, can you share the command(s) that you are using? I have recently completed 2 imputations - for each, I ran 3 separate SHAPEIT commands in order to ensure that the data was correctly pre-phased.

ADD REPLY
0
Entering edit mode

Hi Kevin, It's a genotyping data from Illumina chip, Its 37 buit but I did the liftover to 38. Firstly I run the check command with shapeit to compare my data to reference panel (1000 genomes) and this step infom me that a lot of SNPs (for example for chr1 more 30000 will be excluded...) Thank you for your help

ADD REPLY
1
Entering edit mode

I see - thank you! From where did you obtain the 1000 Genomes data? - most data that is available is GRCh37.

ADD REPLY
8
Entering edit mode
2.4 years ago

Edit June 7, 2020:

The code below is for pre-phasing with SHAPEIT2. For phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs, see my answer here: A: ERROR: You must specify a valid interval for imputation using the -int argument,

So, the steps are usually:

  1. pre-phasing into pre-existing haplotypes available from HERE ( C: Phasing with SHAPEIT )
  2. phased imputation and generation of phaed VCFs ( A: ERROR: You must specify a valid interval for imputation using the -int argument, )

----------------

Thanks - good to know that there is now a GRCh38 version of that data! - I utilise GRCh37, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

If your Illumina microarray is a special design, then it may only target very specific regions and have minimal overlap.

Otherwise, as mentioned, the use of SHAPEIT involves a 3 step process:

  1. checking overlap of your data with the reference panel (via -check)
  2. removal of problematic variants (also via check)
  3. pre-phase using filtered input data

1, first run a QC check (will throw error)

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --output-log Prephased/MyData_chr"${chr}"_alignments \
        -T 8 ;
  fi
done ;
rm temp.* ;

This should generate files with extensions _alignments.snp.strand.exclude. Use these in the next step via --exclude-snp:

2, exclude problematic variants that were found

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
        -T 8 ;
  fi
done ;
rm temp.* ;

This should now run to completion and not return any error.

NB - this does not actually remove the variants from your data. It just excludes them when SHAPEIT is trying to determine the alignment between your data and the reference. If this command runs to completion without error, then you can proceed to the next step, #3

3, now perform pre-phasing

Here, we again instruct SHAPEIT to not include the problematic variants. Ultimately, these will therefore be lost from the dataset from this point.

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=12 --cpus-per-task=8 --partition=serial \
      shapeit \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/GSA_QCd_chr"${chr}"_alignments.snp.strand.exclude \
        -O Prephased/MyData_chr"${chr}"_1KGphased \
        -T 8 ;
  fi
done ;
rm temp.* ;
ADD COMMENT
1
Entering edit mode

Thank you Kevin your code and explanation

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thanks so much for your time and examples. I have a question about Step 1 . I get both

ERROR: Reference and Main panels are not well aligned:

and

ERROR: Duplicate site pos=5961743 ref=T alt=TCAAAA

the later inhibits the production of a .snp.strand.exclude file for the chromosome. Are both errors expected? I'm assuming no since in Step 2 errors are still thrown since some files cannot be opened (they weren't ever made). I'm using the same reference data, for example, 1000GP_Phase3_chr${i}.legend.gz and my study data is in the GRCh37 build.

Thank you!

ADD REPLY
0
Entering edit mode

Hi Heidi, the first error is expected, if I recall, but the second error is not. I think that you should aim to remove that variant from the original data, if at all possible? It is likely as a result of a multi-allelic site, and may represent a false-positive variant call.

ADD REPLY
0
Entering edit mode

Is there any limitation that I only want to perform a phasing inference to a small genomic region, for example 100Kbp region in chr2? Thanks

ADD REPLY
0
Entering edit mode

No, there is no llimitation

ADD REPLY
0
Entering edit mode
3 months ago

enter image description hereHi Kevin Blighe ,

Thank you for your extremely useful guide on pre-phasing using SHAPEITv2. I have about 1300 trios that I am pre-phasing and I run into very high (>5-10%) rates of Mendelian errors with all of them. I have already checked for kinship using KING and they are definitely correctly annotated parenthoods. Sharing a snippet of my log file at step 2. Do you think this is expected? Or is there a way to solve this? They arise whether I use a reference panel or not.

Additionally, I have filtered to remove duplicates, and only include biallelic SNPs with missingness rate <10% as QC.

ADD COMMENT

Login before adding your answer.

Traffic: 1473 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6