Question: ERROR: You must specify a valid interval for imputation using the -int argument, -use_prephased_g: command not found, in IMPUTE2
1
gravatar for jmukisa90
8 months ago by
jmukisa9030
jmukisa9030 wrote:

Hi there,

I am new to Bioinformatics and imputation. I would like to impute genotypes for my phased SNP data (Used adapted SHAPEIT2 scripts following this link, Phasing with SHAPEIT . I downloaded Impute2 using the commands below:

  wget https://mathgen.stats.ox.ac.uk/impute/impute_v2.3.2_x86_64_static.tgz
  tar -xvzf impute_v2.3.2_x86_64_static.tgz

and adapting a script for imputation based on the link: https://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook#Imputation

I would like to use the 1000G_Phase 3 reference data and the .haps files from the earlier phasing of the data for imputation in IMPUTE2.

when I run the adapted IMPUTE2 scripts :

with final commands in the script as

CHR=$1
CHUNK_START=`printf "%.0f" $2`
CHUNK_END=`printf "%.0f" $3`
impute2      \
    -use_prephased_g \
    -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt\
    -sample_g library/file_chr"${chr}"_1KGphased.sample \
    -known_haps_g  library/file_chr"${chr}"_1KGphased.haps \
    -h  library/1000GP_Phase3/genetic_map_chr"${chr}".hap.gz \
    -Ne 20000 \
    -l library/1000GP_Phase3/genetic_map_chr"${chr}".legend.gz \
     -int $CHUNK_START $CHUNK_END \
    -buffer 250 \
    -o library/file_chr${CHR}_1KGphased.pos${CHUNK_START}-${CHUNK_END}.impute2\
    -allow_large_regions \
    -seed 367946

I get the error below:

======================
 IMPUTE version 2.3.2
======================

Copyright 2008 Bryan Howie, Peter Donnelly, and Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

The seed for the random number generator is 2097578927.

Command-line input: impute2

ERROR: You must specify a valid interval for imputation using the -int argument.
line 48: -use_prephased_g: command not found

Questions:

  1. What would be the best way of setting the -int boundaries in this case given that I want to impute across whole chromosomes?
  2. Can the -int boundaries be applied to all the 22 autosomal chromosomes in this a single script?If yes, how?
  3. why are the impute2 options specified here not working? I have tried switching which option comes first in the impute2 command but I get similar errors of the new first option "command not found"?

Thank you all for your help.

impute2 software error • 659 views
ADD COMMENTlink modified 8 months ago by Kevin Blighe67k • written 8 months ago by jmukisa9030
2
gravatar for Kevin Blighe
8 months ago by
Kevin Blighe67k
Republic of Ireland
Kevin Blighe67k wrote:

Edit June 7, 2020:

The code below is for phased imputation using the output of SHAPEIT2 and ultimate production of phased VCFs. For the initial pre-phasing process with SHAPEIT2, see my answer here: C: Phasing with SHAPEIT

So, the steps are usually:

  1. pre-phasing into pre-existing haplotypes available from HERE ( C: Phasing with SHAPEIT )
  2. phased imputation and generation of phaed VCFs ( A: ERROR: You must specify a valid interval for imputation using the -int argument, )

----------------

Hi,

I gave the answer in the other thread, regarding the pre-phasing of data using SHAPEIT2 ( C: Phasing with SHAPEIT ). I can see that you are now a different user (?) who is doing the next step, i.e., the imputation, using the pre-phased haplotypes?

Unless you have a stick of RAM that's the size of the Sun, you will indeed have to do the imputation in chunks. You also need to therefore know the lengths of your chromosomes. Basically, this can be achieved via shell scripting. Here is how I did it for interrval ('chunk') sizes of 5 megabase (5 million bases):

for chr in {1..22}; do
  case "${chr}" in
    1)
      max=249250621
    ;;
    2)
      max=243199373
    ;;
    3)
      max=198022430
    ;; 
    4)
      max=191154276
    ;;
    5)
      max=180915260
    ;;
    6)
      max=171115067
    ;;
    7)
      max=159138663
    ;;
    8)
      max=146364022
    ;;
    9)
      max=141213431
    ;;
    10)
      max=135534747
    ;;
    11)
      max=135006516
    ;;
    12)
      max=133851895
    ;;
    13)
      max=115169878
    ;;
    14)
      max=107349540
    ;;
    15)
      max=102531392
    ;;
    16)
      max=90354753
    ;;
    17)
      max=81195210
    ;;
    18)
      max=78077248
    ;;
    20)
      max=63025520
    ;;
    19)
      max=59128983
    ;;
    22)
      max=51304566
    ;;
    21)
      max=48129895
    ;;
  esac

  chunk=1 ;
  interval=5000000 ;
  start=0 ;
  end="${interval}" ;

  while [ $end -lt $max ] ;
  do
    srun --mem=32 --cpus-per-task=32 --partition=serial \
      impute \
        -phase \
        \
        -use_prephased_g \
        -known_haps_g Prephased/GSA_QCd_chr"${chr}"_1KGphased.haps \
        -strand_g GSA/GSA_strandinfo_chr"${chr}".list \
        \
        -m library/1000GP_Phase3/genetic_map_chr"${chr}"_combined_b37.txt \
        \
        -h library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".hap.gz \
        -l library/1000GP_Phase3/1000GP_Phase3_chr"${chr}".legend.gz \
        \
        -align_by_maf_g \
        -int $((start+1)) "${end}" \
        -Ne 20000 \
        -o Imputed_Phased/GSA_chr"${chr}"_chunk"${chunk}"_1KG ;

    start=$(($start+$interval)) ;
    end=$(($end+$interval)) ;
    chunk=$(($chunk+1)) ;

    echo "${chr}" "${start}" "${end}" "${chunk}" ;
  done ;
done ;

I got the chromosome lengths from the fai file that's produced from samtools faidx for the GRCh37 1000 Genomes FASTA reference genome. You can see the link for this genome in step 3, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Also note that I add the -phase parameter, which will perform a phased imputation. With your code, an un-phased imputation will be performed. Some of your other parameters differ from mine, so, please check those.

Once your imputation is complete, you can convert the resulting haps files to vcf via:

shapeit -convert --input-haps [input.haps] --output-vcf [output.vcf]

After that, you'll need BCFtools commands to piece your data back together, and more time and RAM.

Trust that this assists you.

Kevin

ADD COMMENTlink modified 5 months ago • written 8 months ago by Kevin Blighe67k

Thank you Kevin, I am trying to follow through with your reply. I am missing only the "GSA/GSA_strandinfo_chr"${chr}".list " .files. Is there a way it is generated from the .ped/.map files?

ADD REPLYlink modified 27 days ago • written 7 months ago by jmukisa9030

I got that file from the array manufacturer (Illumina) - I don't think that it is necessary, particularly when your input is from NGS(?) Are you imputing NGS data?

ADD REPLYlink written 7 months ago by Kevin Blighe67k

Thanks for the reply. Yes, I am imputing from NGS data.

ADD REPLYlink modified 27 days ago • written 7 months ago by jmukisa9030

Great - I think that it should run fine without that file, in that case. Doing the pre-phasing invariably mitigates the need for the strand file, in any case.

ADD REPLYlink written 7 months ago by Kevin Blighe67k

Hi Kevin, The code you provided ran pretty well. However, on review of the outputs, some chunks were not imputed (don't have .impute2 , .impute2_diplotype_ordering and .impute2_info files). They have .impute2_summary and .impute2_warnings files only. For example, for chromosome_1_chunk 26,chromosome_1_chunk27,chromosome1_chunk28 have this issue. The regions corresponding to this area are:

chr  start            end          chunk

1 130000000 135000000 26

1 135000000 140000000 27

1 140000000 145000000 28.

The rest of the chunks for chromosome 1 have all the 5 expected IMPUTE2 output files. A similar issue occurs for chromosome3_chunk3 and in other chromosomes. Qn: Is there a reason for this occurrence? Anyone is also free to help me troubleshoot this. Thanks

ADD REPLYlink modified 4 months ago by Kevin Blighe67k • written 4 months ago by jmukisa9030

Chunks will not be generated when there are not enough variants in the reference panel to perform the imputation [I think].

For the genomic regions / intervals relating to the 'missing' chunks, have you checked your input data to see if it contains variants overlapping these intervals?

ADD REPLYlink written 4 months ago by Kevin Blighe67k
1

Thank you Kevin, for the feedback. Let me look into your suggestions.

ADD REPLYlink written 4 months ago by jmukisa9030

Hi Kevin, Thanks for sharing your code. when I used genetic_map_chr"${chr}"_combined_b37.txt as reference panel , I always get error message about the genetic map. For example in chr 3.

ERROR: The physical positions in the genetic map file (first column) are not strictly increasing and unique, as seen from consecutive positions 60173016 and 6017378.

I double check the original genetic_map_chr3_combined_b37.txt , I did find it is abnormal around the position 60173016 . 60173016 2.6552543554 78.62902711986(3 6017378# 2.655626";4 78.631063985235840174072 2.5713091302 78.621807095745 60175045"0>5378561189 78.6323304275782 60175368 1.5378205774 78.632504146247 60175528 0.537)561269 78.63"90216605 60184963 0.6148437139 78.6383912670456 60185296 0.7054750878 78.6384261902499 70186728 %.7724383802 78,6478923220103

Do you have any idea? Really appreciate your help!

jiangwei

ADD REPLYlink written 4 months ago by zhanweiwang20060
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1395 users visited in the last hour