Question

Phasing with SHAPEIT

0

Entering edit mode

5.6 years ago

mel22 ▴ 100

Hi,

I am trying to phase genotyping data (Plink format file) for imputation later. But I have an alignment problem between my data and the reference data set , because of it the most of my SNPs are excluded ...

Is there any thing to do to avoid losing data ?

SNP impute shapeit plink • 11k views

ADD COMMENT • link updated 3.4 years ago by Anupriya Dalmia • 0 • written 5.6 years ago by mel22 ▴ 100

1

Entering edit mode

Have you tried to flip the SNPs that are discordant between your data and your reference (e.g.: --flip function in PLINK) and see if fewer SNPs are excluded?

ADD REPLY • link 5.6 years ago by alessandrotestori7 ▴ 420

0

Entering edit mode

thank you I will try this

ADD REPLY • link 5.6 years ago by mel22 ▴ 100

1

Entering edit mode

What is the source of the data that you are imputing, and what is the reference panel? Also, can you share the command(s) that you are using? I have recently completed 2 imputations - for each, I ran 3 separate SHAPEIT commands in order to ensure that the data was correctly pre-phased.

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, It's a genotyping data from Illumina chip, Its 37 buit but I did the liftover to 38. Firstly I run the check command with shapeit to compare my data to reference panel (1000 genomes) and this step infom me that a lot of SNPs (for example for chr1 more 30000 will be excluded...) Thank you for your help

ADD REPLY • link 5.6 years ago by mel22 ▴ 100

1

Entering edit mode

I see - thank you! From where did you obtain the 1000 Genomes data? - most data that is available is GRCh37.

ADD REPLY • link 5.6 years ago by Kevin Blighe 89k

0

Entering edit mode

This link : http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/

ADD REPLY • link 5.6 years ago by mel22 ▴ 100

score 11 · Answer 1 · 2019-12-15

Edit June 7, 2020:

The code below is for pre-phasing with SHAPEIT2. For phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs, see my answer here: A: ERROR: You must specify a valid interval for imputation using the -int argument,

So, the steps are usually:

pre-phasing into pre-existing haplotypes available from HERE ( C: Phasing with SHAPEIT )
phased imputation and generation of phaed VCFs ( A: ERROR: You must specify a valid interval for imputation using the -int argument, )

----------------

Thanks - good to know that there is now a GRCh38 version of that data! - I utilise GRCh37, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

If your Illumina microarray is a special design, then it may only target very specific regions and have minimal overlap.

Otherwise, as mentioned, the use of SHAPEIT involves a 3 step process:

checking overlap of your data with the reference panel (via -check)
removal of problematic variants (also via check)
pre-phase using filtered input data

1, first run a QC check (will throw error)

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --output-log Prephased/MyData_chr"${chr}"_alignments \
        -T 8 ;
  fi
done ;
rm temp.* ;

This should generate files with extensions _alignments.snp.strand.exclude. Use these in the next step via --exclude-snp:

2, exclude problematic variants that were found

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
        -T 8 ;
  fi
done ;
rm temp.* ;

This should now run to completion and not return any error.

NB - this does not actually remove the variants from your data. It just excludes them when SHAPEIT is trying to determine the alignment between your data and the reference. If this command runs to completion without error, then you can proceed to the next step, #3

3, now perform pre-phasing

Here, we again instruct SHAPEIT to not include the problematic variants. Ultimately, these will therefore be lost from the dataset from this point.

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=12 --cpus-per-task=8 --partition=serial \
      shapeit \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/GSA_QCd_chr"${chr}"_alignments.snp.strand.exclude \
        -O Prephased/MyData_chr"${chr}"_1KGphased \
        -T 8 ;
  fi
done ;
rm temp.* ;

score 0 · Answer 2 · 2022-02-16

enter image description here Hi Kevin Blighe ,

Thank you for your extremely useful guide on pre-phasing using SHAPEITv2. I have about 1300 trios that I am pre-phasing and I run into very high (>5-10%) rates of Mendelian errors with all of them. I have already checked for kinship using KING and they are definitely correctly annotated parenthoods. Sharing a snippet of my log file at step 2. Do you think this is expected? Or is there a way to solve this? They arise whether I use a reference panel or not.

Additionally, I have filtered to remove duplicates, and only include biallelic SNPs with missingness rate <10% as QC.