Question: Phasing with SHAPEIT
0
gravatar for melania 2282
5 weeks ago by
melania 228250
melania 228250 wrote:

Hi, I am trying to phase genotyping data (Plink format file) for imputation later. But I have an alignement problem between my data and the refernce data set , because of it the most of my SNPs are excluded ...

Is there any thing to do to avoid losing data ?

snp • 152 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by melania 228250
1

Have you tried to flip the SNPs that are discordant between your data and your reference (e.g.: --flip function in PLINK) and see if fewer SNPs are excluded?

ADD REPLYlink written 5 weeks ago by alessandrotestori7390

thank you I will try this

ADD REPLYlink written 5 weeks ago by melania 228250
1

What is the source of the data that you are imputing, and what is the reference panel? Also, can you share the command(s) that you are using? I have recently completed 2 imputations - for each, I ran 3 separate SHAPEIT commands in order to ensure that the data was correctly pre-phased.

ADD REPLYlink written 5 weeks ago by Kevin Blighe53k

Hi Kevin, It's a genotyping data from Illumina chip, Its 37 buit but I did the liftover to 38. Firstly I run the check command with shapeit to compare my data to reference panel (1000 genomes) and this step infom me that a lot of SNPs (for example for chr1 more 30000 will be excluded...) Thank you for your help

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by melania 228250
1

I see - thank you! From where did you obtain the 1000 Genomes data? - most data that is available is GRCh37.

ADD REPLYlink written 5 weeks ago by Kevin Blighe53k
1
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe53k
Kevin Blighe53k wrote:

Thanks - good to know that there is now a GRCh38 version of that data! - I utilise GRCh37, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

If your Illumina microarray is a special design, then it may only target very specific regions and have minimal overlap.

Otherwise, as mentioned, the use of SHAPEIT involves a 3 step process:

  1. checking overlap of your data with the reference panel (via -check)
  2. removal of problematic variants (also via check)
  3. pre-phase using filtered input data

    echo "Pre-phasing Global Screening Array" ;

1, first run a QC check (will throw error)

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --output-log Prephased/MyData_chr"${chr}"_alignments \
        -T 8 ;
  fi
done ;
rm temp.* ;

This should generate files with extensions _alignments.snp.strand.exclude. Use these in the next step via --exclude-snp:

2, exclude problematic variants that were found

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=8 --cpus-per-task=4 --partition=serial \
      shapeit \
        -check \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/MyData_chr"${chr}"_alignments.snp.strand.exclude \
        -T 8 ;
  fi
done ;
rm temp.* ;

This should now run to completion and not return any error.

NB - this does not actually remove the variants from your data. It just excludes them when SHAPEIT is trying to determine the alignment between your data and the reference. If this command runs to completion without error, then you can proceed to the next step, #3

3, now perform pre-phasing

Here, we again instruct SHAPEIT to not include the problematic variants. Ultimately, these will therefore be lost from the dataset from this point.

for chr in X {1..22}; do
  plink --bfile MyData --chr "${chr}" --make-bed --out temp

  if [ "${chr}" != "X" ]
  then
    srun --mem=12 --cpus-per-task=8 --partition=serial \
      shapeit \
        -B temp \
        -M library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        --input-ref library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz library/1000GP_Phase3/1000GP_Phase3.sample \
        --exclude-snp Prephased/GSA_QCd_chr"${chr}"_alignments.snp.strand.exclude \
        -O Prephased/MyData_chr"${chr}"_1KGphased \
        -T 8 ;
  fi
done ;
rm temp.* ;
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe53k
1

Thank you Kevin your code and explanation

ADD REPLYlink written 5 weeks ago by melania 228250
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1214 users visited in the last hour