Question: ERROR: You must specify a valid interval for imputation using the -int argument, -use_prephased_g: command not found, in IMPUTE2
jmukisa90 wrote (6 days ago):

Hi there,

I am new to bioinformatics and imputation. I would like to impute genotypes for my phased SNP data (phased with adapted SHAPEIT2 scripts following this link: Phasing with SHAPEIT). I downloaded IMPUTE2 using the commands below:

  wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
  tar -xvzf impute_v2.3.2_x86_64_static.tgz

and adapted a script for imputation based on this link: https://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook#Imputation

I would like to use the 1000G_Phase 3 reference data and the .haps files from the earlier phasing of the data for imputation in IMPUTE2.

When I run the adapted IMPUTE2 script, whose final commands are:

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`
impute2      \
    -use_prephased_g \
    -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt\
    -sample_g library/file_chr"${chr}"_1KGphased.sample \
    -known_haps_g  library/file_chr"${chr}"_1KGphased.haps \
    -h  library/1000GP_Phase3/genetic_map_chr"${chr}".hap.gz \
    -Ne 20000 \
    -l library/1000GP_Phase3/genetic_map_chr"${chr}".legend.gz \
     -int $CHUNK_START $CHUNK_END \
    -buffer 250 \
    -o library/file_chr${CHR}_1KGphased.pos${CHUNK_START}-${CHUNK_END}.impute2\
    -allow_large_regions \
    -seed 367946

I get the error below:

======================
 IMPUTE version 2.3.2
======================

Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

The seed for the random number generator is 2097578927.

Command-line input: impute2

ERROR: You must specify a valid interval for imputation using the -int argument.
line 48: -use_prephased_g: command not found

Questions:

  1. What would be the best way of setting the -int boundaries in this case given that I want to impute across whole chromosomes?
  2. Can the -int boundaries be applied to all 22 autosomal chromosomes in a single script? If yes, how?
  3. Why are the impute2 options specified here not working? I have tried switching which option comes first in the impute2 command, but I get a similar error, with the new first option reported as "command not found".

Thank you all for your help.

Kevin Blighe wrote (2 days ago):

Hi,

I gave the answer in the other thread, regarding the pre-phasing of data using SHAPEIT2. I can see that you are now a different user (?) who is doing the next step, i.e., the imputation, using the pre-phased haplotypes?

Unless you have a stick of RAM that's the size of the Sun, you will indeed have to do the imputation in chunks. You therefore also need to know the lengths of your chromosomes. Basically, this can be achieved via shell scripting. Here is how I did it for interval ('chunk') sizes of 5 megabases (5 million bases):

for chr in {1..22}; do
  case "${chr}" in
    1)
      max=249250621
    ;;
    2)
      max=243199373
    ;;
    3)
      max=198022430
    ;; 
    4)
      max=191154276
    ;;
    5)
      max=180915260
    ;;
    6)
      max=171115067
    ;;
    7)
      max=159138663
    ;;
    8)
      max=146364022
    ;;
    9)
      max=141213431
    ;;
    10)
      max=135534747
    ;;
    11)
      max=135006516
    ;;
    12)
      max=133851895
    ;;
    13)
      max=115169878
    ;;
    14)
      max=107349540
    ;;
    15)
      max=102531392
    ;;
    16)
      max=90354753
    ;;
    17)
      max=81195210
    ;;
    18)
      max=78077248
    ;;
    20)
      max=63025520
    ;;
    19)
      max=59128983
    ;;
    22)
      max=51304566
    ;;
    21)
      max=48129895
    ;;
  esac

  chunk=1 ;
  interval=5000000 ;
  start=0 ;
  end="${interval}" ;

  # Loop on start rather than end, so the final partial chunk (up to max) is imputed too.
  while [ "${start}" -lt "${max}" ] ;
  do
    srun --mem=32 --cpus-per-task=32 --partition=serial \
      impute \
        -phase \
        \
        -use_prephased_g \
        -known_haps_g Prephased/GSA_QCd_chr"${chr}"_1KGphased.haps \
        -strand_g GSA/GSA_strandinfo_chr"${chr}".list \
        \
        -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        \
        -h library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz \
        -l library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz \
        \
        -align_by_maf_g \
        -int $((start+1)) "${end}" \
        -Ne 20000 \
        -o Imputed_Phased/GSA_chr"${chr}"_chunk"${chunk}"_1KG ;

    start=$(($start+$interval)) ;
    end=$(($end+$interval)) ;
    chunk=$(($chunk+1)) ;

    echo "${chr}" "${start}" "${end}" "${chunk}" ;
  done ;
done ;
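The interval arithmetic can be checked on its own before any jobs are submitted. What follows is a minimal sketch rather than part of the original script: print_chunks is a hypothetical helper, and clipping the last chunk at the chromosome length is an assumption about what you want. It prints one "chunk start end" line per interval, using the chr21 length from the case statement above.

```shell
# Hypothetical helper: print "chunk start end" for every interval of a chromosome.
# $1 = chromosome length (max), $2 = interval size in bases.
print_chunks() {
  local max=$1 interval=$2
  local start=0 end chunk=1
  while [ "${start}" -lt "${max}" ]; do
    end=$((start + interval))
    # Clip the last interval so it stops at the chromosome end (an assumption).
    if [ "${end}" -gt "${max}" ]; then end=${max}; fi
    echo "${chunk} $((start + 1)) ${end}"
    start=$((start + interval))
    chunk=$((chunk + 1))
  done
}

# chr21 with 5 Mb chunks
chunks=$(print_chunks 48129895 5000000)
echo "${chunks}"
```

The second and third fields of each line are the two values that would be passed to -int.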

I got the chromosome lengths from the fai file that's produced from samtools faidx for the GRCh37 1000 Genomes FASTA reference genome. You can see the link for this genome in step 3, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
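Instead of typing the lengths into a case statement, they can be read from the .fai index directly. This is a sketch under stated assumptions: chrom_length is a hypothetical helper, and it assumes the sequence names in the index are plain numbers ("21", not "chr21"). The demo index is a two-line stand-in, not the real file.

```shell
# Hypothetical helper: look up one chromosome's length in a samtools faidx index.
# Column 1 of a .fai file is the sequence name, column 2 its length in bases.
chrom_length() {
  local fai=$1 chr=$2
  awk -v c="${chr}" '$1 == c { print $2; exit }' "${fai}"
}

# Stand-in index for the demo: names and lengths come from the case statement
# above; the remaining three columns are dummy values.
demo_fai=$(mktemp)
printf '21\t48129895\t0\t60\t61\n22\t51304566\t0\t60\t61\n' > "${demo_fai}"

max=$(chrom_length "${demo_fai}" 21)
echo "chr21 max=${max}"
rm -f "${demo_fai}"
```

Inside the chromosome loop, max=$(chrom_length path/to/reference.fa.fai "${chr}") would then replace the case statement; the path here is a placeholder.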

Also note that I add the -phase parameter, which will perform a phased imputation. With your code, an un-phased imputation will be performed. Some of your other parameters differ from mine, so, please check those.
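On the "command not found" part of the question: that message means the shell ran -use_prephased_g as a separate command, so the line continuation after impute2 was lost and impute2 itself started with no options (hence the -int error). The raw script is not shown, so the cause is an assumption, but a common one is whitespace or a Windows carriage return after a trailing backslash. A minimal sketch for checking a script follows; find_broken_continuations is a hypothetical helper name and the demo file is made up. Separately, note that the posted script sets CHR but then uses "${chr}" in its file paths, and shell variable names are case-sensitive.

```shell
# Hypothetical helper: report lines where a backslash is followed by trailing
# whitespace (spaces, tabs, or a carriage return). Such a backslash no longer
# continues the line, so the next line runs as a separate command.
find_broken_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}

# Demo on a throwaway script: line 1 has a space after its backslash,
# line 2 ends correctly with the backslash as the last character.
demo=$(mktemp)
printf 'impute2 \\ \n    -use_prephased_g \\\n    -Ne 20000\n' > "${demo}"

broken=$(find_broken_continuations "${demo}")
echo "${broken}"
rm -f "${demo}"
```

Any line number it reports is a place to delete the characters after the backslash.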

Once your imputation is complete, you can convert the resulting haps files to vcf via:

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

After that, you'll need BCFtools commands to piece your data back together, and more time and RAM.
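As a rough sketch of that last step, the per-chunk VCFs need to be joined in numeric chunk order, because a plain glob would sort chunk10 before chunk2. Everything named here is an assumption: ordered_chunk_vcfs is a hypothetical helper, and the .vcf file names simply mirror the -o prefix used above. The demo uses empty placeholder files, and the bcftools call is shown only as a comment.

```shell
# Hypothetical helper: print the per-chunk VCFs of one chromosome in numeric
# chunk order, skipping chunks for which no file exists.
# $1 = directory, $2 = chromosome, $3 = highest chunk number to look for.
ordered_chunk_vcfs() {
  local dir=$1 chr=$2 n_chunks=$3
  local chunk=1 f
  while [ "${chunk}" -le "${n_chunks}" ]; do
    f="${dir}/GSA_chr${chr}_chunk${chunk}_1KG.vcf"
    if [ -e "${f}" ]; then echo "${f}"; fi
    chunk=$((chunk + 1))
  done
}

# Demo with empty placeholder files (chunk 2 is deliberately missing).
demo_dir=$(mktemp -d)
touch "${demo_dir}/GSA_chr21_chunk1_1KG.vcf" \
      "${demo_dir}/GSA_chr21_chunk3_1KG.vcf" \
      "${demo_dir}/GSA_chr21_chunk10_1KG.vcf"
listed=$(ordered_chunk_vcfs "${demo_dir}" 21 10 | sed 's|.*/||')
echo "${listed}"
rm -rf "${demo_dir}"

# With real VCFs, the ordered list could be handed to bcftools, for example:
#   bcftools concat -O z -o GSA_chr21_1KG.vcf.gz $(ordered_chunk_vcfs Imputed_Phased 21 10)
```

bcftools concat expects every input to contain the same samples in the same order, which should hold when all chunks come from the same pre-phased dataset.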

Trust that this assists you.

Kevin


Thank you Kevin, I am trying to follow through with your reply. I am missing only the GSA/GSA_strandinfo_chr"${chr}".list files. Is there a way to generate them from the .ped/.map files? John

Reply written 15 hours ago by jmukisa90

I got that file from the array manufacturer (Illumina) - I don't think that it is necessary, particularly when your input is from NGS(?) Are you imputing NGS data?

Reply written 15 hours ago by Kevin Blighe