Help with Sanger Imputation Service
Entering edit mode
5.8 years ago
sandKings ▴ 40

Hi there!

I'm looking for someone to help me in imputing genotype data by either using impute2 or Sanger Imputation. I'm open to any other tool that's easily reproducible too.

These patient samples are not genotyped and the SNP (in VCF format) was called from RNASeq data (50bp SR) using GATk method.

I'm offering USD 50.00 for showing me how to run imputation successfully by using an actual VCF file from my dataset. I'll send you an email from my institutional ID, detailing how the data was processed and steps I've taken on my own to troubleshoot it. I haven't used impute2 and have mostly worked with Sanger's online resource.

If you're interested, please PM me and I'll email you from my institutional email. I can pay via paypal/btc/bch/eth.


RNA-Seq variant calling imputation • 2.9k views
Entering edit mode

I would not spend too much of my time (and money, apparently) with variant calls from RNA-seq using GATK's method unless I was absolutely sure of the pitfalls / limitations of such data:

A: Inferring genotype based on RNA sequnces

What is it that you are ultimately aiming to achieve?

Entering edit mode

Hi Kevin!

Thanks so much for responding. I really appreciate your concern regarding the robustness of the variant calls from RNASeq data. I guess what I'm really trying to do, is to troubleshoot the imputation which I can hopefully apply to future studies.

I wanted to try my hands on live data and the only live data I have on my hand is RNASeq data. This dataset is only 30 patients (RNASeq) but if I can 'conclude' this, I would be able to work with other 400 matched (control and treatment) patient samples which I can send for DNASeq.

Entering edit mode

You want to impute RNA-seq variant calls? - would only really work for SNPs flanking the 5 and 3 prime regions of each gene. One of the commonest imputation methods is indeed IMPUTE2, which you mention.

Nobody here wants your money. We just offer free advice here, on our own time. Why not just post the commands / things that you have already tried? If you want to give the money away, then you could choose a good charity.

Entering edit mode

Ok! I'll keep trying. I've tried every troubleshooting step that was suggested. I don't know where to get DNASeq data to practice with.


I started with the GATk method for variant calling.

Step 7 of the pipeline generates my filtered VCF files:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

Following this, I zip and index my VCF into vcf.gz

bcftools view filtered.vcf. -Oz -o filtered.vcf.gz 
bcftools index filtered.vcf.gz

For Sanger Imputation Service, we convert the UCSC-style chromosome names to Ensembl-style chromosome names by running:

bcftools annotate -Oz --rename-chrs ucsc2ensembl.txt filtered.vcf.gz > filtered_ens.vcf.gz

Going back to the absolute beginning, when I was running DESeq2 on my RNASeq data, I had mapped it to GRCh38. Since Sanger requires the coordinates to be on GRCh37, I remapped my data to GRCh37.p13 from Gencode and repeated variant calling. While building the initial index the parameters were as shown below


STAR --runMode genomeGenerate -- genomeDir $genomeDir --genomeFastaFiles $genomeFile --sjdbGTFfile $gtf --sjdbOverhang 49 --runThreadN 40

I used sjdbOverhang 49 because my RNASeq is 50bp (read -1). The gtf file used "contains the comprehensive gene annotation originally created on the GRCh38 reference chromosomes, mapped to the GRCh37 primary assembly with gencode-backmap"

The ref dictionary was created :

java -jar /software/picard/2.6.0/picard-tools-2.6.0/picard.jar CreateSequenceDictionary R=/projects/p30120/Gencode/GRCh37.p13.genome.fa O=/projects/p30120/Gencode/GRCh37.p13.genome.dict 

/software/samtools/0.1.19/bin/samtools faidx /projects/p30120/Gencode/GRCh37.p13.genome.fa

For past few days, I'm stuck at this error and I can't get past it.

I know this doesn't look like much but it took me weeks to get so far. Most of my bioinformatics training comes Biostars, Seqanswer and r/bioinformatics.

When and IF I have the DNASeq data, I want to do an eQTL analysis.

Entering edit mode

If the error is the major blocker for you, then please have a look at my answer to your previous question.

Entering edit mode

Thanks Michael. I missed that previous post. sandKings, open access NGS data can also be downloaded from ENA - a tutorial is here: Fast download of FASTQ files and metadata from the European Nucleotide Archive (ENA)

Entering edit mode

Also note, sandKings, that (generally) data produced by GATK's methods frequently have issues in terms of compatibility with other methods. The GATK has it's own support forum - you may want to go there.

Entering edit mode

Hi Kevin, I've sought help at GATk too but I'm afraid my incessant posts are wearing everyone down. Anyway, I'm following up with Michael's advise and trying to find the final build for GRCh37. I'm insisting on using GRCh37 because I find Sanger's interface more 'noob' friendly.

Entering edit mode

Okay, don't worry about the GATK - they don't appear to integrate that much with the 'community', which indirectly renders much of their tools incompatible with the standard ones. You will always be welcome here. I presume that you have been working your way through This, and are stuck on this point:

Chromosome names should be 1, 2, 3, etc… not chr1, chr2, chr3, etc… They should match the names in this reference index file. Some programs will represent X as 23, Y as 24, etc…. Please remove or replace these names. See the resources for help renaming chromosomes in a VCF.

If you look at your VCF headers (bcftools view), can you give a list of all the contigs present? For Sanger Imputation, your contigs have to be named according to this:

Entering edit mode

Hi Kevin, sorry for the delay in getting back to you. My VCF header (?) looks like this before UCSC to ensemble conversion

The complete list of contigs is too long and is exceeding the character limit for the post. However, apart from the standard contigs,


the list also contains


And the columns look like this:

chr1    567242  .   G   A   1088.77 PASS    AC=2;AF=1.00;AN=2;DP=25;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.24;SOR=4.174    GT:AD:DP:GQ:PL  1/1:0,25:25:75:1117,75,0
chr1    567726  .   T   C   104.28  PASS    AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.76;SOR=1.179 GT:AD:DP:GQ:PL  1/1:0,3:3:9:132,9,0

Then, I convert it to the ensemble format using the command

bcftools annotate -Oz --rename-chrs ucsc2ensembl.txt ucsc.vcf.gz > ensembl.vcf.gz

which reformats my columns to:

1   567242  .   G   A   1088.77 PASS    AC=2;AF=1;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.24;SOR=4.174 GT:AD:DP:GQ:PL  1/1:0,25:25:75:1117,75,0
1   567726  .   T   C   104.28  PASS    AC=2;AF=1;AN=2;DP=3;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.76;SOR=1.179  GT:AD:DP:GQ:PL  1/1:0,3:3:9:132,9,0

I see all these other 'J' and 'K' contigs in my file. Following Michael's suggestion here I tried to find what appeared to be the final version of GRCh37 from Gencode and downloaded the primary assembly (GRCh37) file and Comprehensive gene annotation GTF file. I created the new STAR index:

STAR --runMode genomeGenerate -- genomeDir $genomeDir --genomeFastaFiles $genomeFile --sjdbGTFfile $gtf --sjdbOverhang 49 --runThreadN 40

and then proceeded with rest of the GATk pipeline:

Unfortunately, now I'm getting this error:

##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY, GL000192.1, GL000225.1, GL000194.1, GL000193.1, GL000200.1, GL000222.1, GL000212.1, GL000195.1, GL000223.1, GL000224.1, GL000219.1, GL000205.1, GL000215.1, GL000216.1, GL000217.1, GL000199.1, GL000211.1, GL000213.1, GL000220.1, GL000218.1, GL000209.1, GL000221.1, GL000214.1, GL000228.1, GL000227.1, GL000191.1, GL000208.1, GL000198.1, GL000204.1, GL000233.1, GL000237.1, GL000230.1, GL000242.1, GL000243.1, GL000241.1, GL000236.1, GL000240.1, GL000206.1, GL000232.1, GL000234.1, GL000202.1, GL000238.1, GL000244.1, GL000248.1, GL000196.1, GL000249.1, GL000246.1, GL000203.1, GL000197.1, GL000245.1, GL000247.1, GL000201.1, GL000235.1, GL000239.1, GL000210.1, GL000231.1, GL000229.1, GL000226.1, GL000207.1]

Trying to solve this 'sorting' issue using < java -jar picard.jar ReorderSam> error took me down another rabbit hole so I'm taking a break and will start from the top.

Entering edit mode

With your ensembl.vcf.gz, you may not need to do anything else with the GATK apart from tab-index the file in an attempt to fix the VCF header, with:

tabix -p vcf ensembl.vcf.gz

NB - it should be zipped with bgzip, not gzip

Have you tried to use that file for Sanger Imputation?


It also looks like there is some conflicting information in the STAR indices and the actual VCF contig names. For SAnger Imputation, you literally just need your contigs named like This.

If you still have the 'chr' prefix on your contigs and need to sort these numerically, these commands work:

bcftools view ensembl.vcf.gz | awk '/^#/{print}' > out.vcf ;
bcftools view ensembl.vcf.gz | awk '!/^#/{print}' | sed 's/^chr//' | sort -k1,1n -k2,2n >> out.vcf
bgzip out.vcf
tabix -p vcf out.vcf.gz

The tabix command will update your VCF header with the new contig names. You can technically remove the old contig names from the VCF header without problem.

With that, for all intents and purposes, the file should be ready for Sanger Imputation Server.

Entering edit mode

Also, to set the correct reference genome base in your VCF, you do not need the fixref plugin. The following will set these in your VCF:

bcftools norm --check-ref s --fasta-ref human_g1k_v37.fasta out.vcf.gz -Ov > out.reffixed.vcf

--check-ref x will eliminate the non-matched ref sites, which may be better. Also, here, I use human_g1k_v37.fasta as the reference, which is hg19 updated with allele information from 1000 Genomes Phase III. For Sanger Imputation, you may want to use: hs37d5.fa

Hopefully, with all of this information, you can actually get your work done.


Login before adding your answer.

Traffic: 2230 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6