Question: Phasing with SHAPEIT
0
11 months ago by
melania 228290
melania 228290 wrote:

Hi,

I am trying to phase genotyping data (a PLINK-format file) for later imputation, but I have an alignment problem between my data and the reference data set. Because of it, most of my SNPs are excluded.

Is there anything I can do to avoid losing data?

snp plink impute shapeit • 1.4k views
modified 8 months ago by officialoxybreath0 • written 11 months ago by melania 228290
1

Have you tried to flip the SNPs that are discordant between your data and your reference (e.g.: --flip function in PLINK) and see if fewer SNPs are excluded?
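
The flipping idea can be sketched as follows. This is a minimal sketch, not a tested pipeline: it assumes that the SHAPEIT -check report (the .snp.strand file) has a header line, the problem type in column 1 and your own SNP ID in column 3, so please verify that layout against your own report. The helper name make_flip_list and the file names flip.list and MyData_flipped are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect the IDs of SNPs that SHAPEIT flagged as "Strand"
# problems so that they can be passed to PLINK's --flip option.
# Assumed report layout: header line, column 1 = type, column 3 = SNP ID.
make_flip_list() {
  local report="$1" out="$2"
  awk 'NR > 1 && $1 == "Strand" { print $3 }' "$report" | sort -u > "$out"
}

# Example usage (only runs when the report and PLINK are both available)
if [ -f Prephased/MyData_chr1_alignments.snp.strand ] && command -v plink >/dev/null 2>&1; then
  make_flip_list Prephased/MyData_chr1_alignments.snp.strand flip.list
  plink --bfile MyData --flip flip.list --make-bed --out MyData_flipped
fi
```

After flipping, re-running the check on MyData_flipped shows whether fewer SNPs are excluded. Note that flipping cannot resolve strand-ambiguous A/T and C/G SNPs.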

written 11 months ago by alessandrotestori7390

Thank you, I will try this.

written 11 months ago by melania 228290
1

What is the source of the data that you are imputing, and what is the reference panel? Also, can you share the command(s) that you are using? I have recently completed 2 imputations - for each, I ran 3 separate SHAPEIT commands in order to ensure that the data was correctly pre-phased.

written 11 months ago by Kevin Blighe67k

Hi Kevin, it is genotyping data from an Illumina chip. It was on build 37, but I lifted it over to build 38. First, I ran the check command in SHAPEIT to compare my data to the reference panel (1000 Genomes), and this step informed me that a lot of SNPs will be excluded (for example, more than 30,000 on chr1). Thank you for your help.

modified 11 months ago • written 11 months ago by melania 228290
1

I see - thank you! Where did you obtain the 1000 Genomes data? Most of the data that is available is on GRCh37.
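
One quick way to see whether your data and the reference panel are on the same build is to count how many of your positions occur in the reference legend. The sketch below assumes that the legend has a header line followed by lines of the form "id position a0 a1 ..." and that the .bim file holds the base-pair position in column 4. The helper name position_overlap is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many SNP positions in a PLINK .bim file
# (column 4) also occur in a reference legend file (column 2, after a header).
position_overlap() {
  local bim="$1" legend="$2"
  awk 'NR == FNR { if (FNR > 1) ref[$2] = 1; next }
       { total++; if ($4 in ref) shared++ }
       END { printf "%d of %d positions found in reference\n", shared, total }' \
    "$legend" "$bim"
}

# Example usage for one chromosome (the legend is gzipped, so decompress it first):
# position_overlap temp.bim <(gzip -dc library/1000GP_Phase3/1000GP_Phase3_chr1.legend.gz)
```

A very low overlap for a standard genotyping chip would suggest that the two data sets are on different builds rather than on different strands.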

written 11 months ago by Kevin Blighe67k
4
11 months ago by
Kevin Blighe67k
Republic of Ireland
Kevin Blighe67k wrote:

Edit June 7, 2020:

The code below is for pre-phasing with SHAPEIT2. For phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs, see my answer here: A: ERROR: You must specify a valid interval for imputation using the -int argument,

So, the steps are usually:

  1. pre-phasing into pre-existing haplotypes available from HERE ( C: Phasing with SHAPEIT )
  2. phased imputation and generation of phased VCFs ( A: ERROR: You must specify a valid interval for imputation using the -int argument, )

----------------

Thanks - good to know that there is now a GRCh38 version of that data! - I utilise GRCh37, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

If your Illumina microarray is a special design, then it may only target very specific regions and have minimal overlap.

Otherwise, as mentioned, the use of SHAPEIT involves a 3-step process:

  1. checking the overlap of your data with the reference panel (via -check)
  2. excluding problematic variants and re-running the check (also via -check)
  3. pre-phasing using the filtered input data

1, first run a QC check (this will throw an error if problematic variants are found)

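# Loop over chromosomes; X is listed but skipped by the test below
for chr in X {1..22}; do
  # Extract a single chromosome into temporary PLINK files
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    # Slurm reads --mem without a unit as megabytes; 8G is assumed to be the intent.
    # The SHAPEIT thread count (-T) is matched to --cpus-per-task.
    srun --mem=8G --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --output-log Prephased/MyData_chr"${chr}"_alignments \
        -T 4
  fi
done
# Remove the temporary per-chromosome PLINK files
rm temp.*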

This should generate one file per chromosome with the suffix _alignments.snp.strand.exclude. Use these in the next step via --exclude-snp:
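
To see how many variants would be lost on each chromosome, the exclude files can be counted. This is a minimal sketch that assumes each exclude file lists one position per line and uses the same output prefix as the check step. The helper name count_excluded is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many positions SHAPEIT listed for exclusion
# on each autosome. Takes the part of the --output-log prefix that precedes
# the chromosome number.
count_excluded() {
  local prefix="$1" chr file
  for chr in {1..22}; do
    file="${prefix}${chr}_alignments.snp.strand.exclude"
    # Chromosomes without an exclude file are skipped
    [ -f "$file" ] && printf 'chr%s\t%s\n' "$chr" "$(awk 'END { print NR }' "$file")"
  done
  return 0
}

count_excluded Prephased/MyData_chr
```

If the counts are in the tens of thousands for a single chromosome, it is worth checking the genome build and strand of the input data before accepting the loss.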

2, exclude problematic variants that were found

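# Loop over chromosomes; X is listed but skipped by the test below
for chr in X {1..22}; do
  # Extract a single chromosome into temporary PLINK files
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    # Slurm reads --mem without a unit as megabytes; 8G is assumed to be the intent.
    # The SHAPEIT thread count (-T) is matched to --cpus-per-task.
    # --exclude-snp points to the exclude file written by the first check.
    srun --mem=8G --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
        -T 4
  fi
done
# Remove the temporary per-chromosome PLINK files
rm temp.*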

This should now run to completion and not return any error.

NB - this does not actually remove the variants from your data. It just excludes them while SHAPEIT checks the alignment between your data and the reference. If this command runs to completion without error, then you can proceed to the next step, #3.

3, now perform pre-phasing

Here, we again instruct SHAPEIT to not include the problematic variants. They are therefore lost from the dataset from this point onwards.

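# Loop over chromosomes; X is listed but skipped by the test below
for chr in X {1..22}; do
  # Extract a single chromosome into temporary PLINK files
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    # Slurm reads --mem without a unit as megabytes; 12G is assumed to be the intent.
    # --exclude-snp must point to the exclude file written in step 1 (MyData prefix).
    srun --mem=12G --cpus-per-task=8 --partition=serial \
      shapeit \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
        -O Prephased/MyData_chr"${chr}"_1KGphased \
        -T 8
  fi
done
# Remove the temporary per-chromosome PLINK files
rm temp.*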
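
Before moving on to imputation, it can help to confirm that every chromosome produced output. The sketch below relies on SHAPEIT2 writing a .haps and a .sample file for the prefix given to -O, and on the MyData_chr*_1KGphased naming used in the pre-phasing step. The helper name missing_prephased is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list autosomes whose pre-phased output is missing or empty.
# Checks <dir>/MyData_chr<N>_1KGphased.haps and the matching .sample file.
missing_prephased() {
  local dir="$1" chr ext
  for chr in {1..22}; do
    for ext in haps sample; do
      # -s is true only for files that exist and are not empty
      [ -s "${dir}/MyData_chr${chr}_1KGphased.${ext}" ] || { echo "chr${chr}"; break; }
    done
  done
}

missing_prephased Prephased
```

Any chromosome that is printed needs its pre-phasing job to be inspected and re-run.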
modified 5 months ago • written 11 months ago by Kevin Blighe67k
1

Thank you, Kevin, for your code and explanation.

written 11 months ago by melania 228290