Question: WGS data for pipeline development
gravatar for win
16 months ago by
win810 wrote:

Hi all, i am looking for paired end Illumina FASTA files from whole genome NGS run. I am developing some pipelines and was using a BAM file from 1000 genomes. After I run the GATK Best Practices pipeline I am generating a VCF with no data at all. Basically, i am converting BAM to FASTQ, then aligning to hg38 and then other steps as recommended. I am wondering what is so incorrect that no variants are detected?

Before I post questions about the pipeline itself, I wanted to be sure to use an input file which has been recommended by someone in this group.

Any help here is much appreciated and if there was an accompanying VCF that would be awesome.

##Align with BWA-mem.
#sudo ./algorithms/bwa/bwa mem -M references/hg38gatkbundle/Homo_sapiens_assembly38.fasta data/HG100/HG100.1.exome.fastq data/HG100/HG100.2.exome.fastq > data/HG100/HG100.aligned.reads.sam

##Sort the SAM file and it will also convert to BAM
#sudo java -jar algorithms/picard/picard.jar SortSam INPUT=data/HG100.aligned.reads.sam OUTPUT=data/HG100.sorted.reads.bam SORT_ORDER=coordinate

##Reheader the BAM. This is required for GATK.
#sudo java -Xmx8g -jar algorithms/picard/picard.jar AddOrReplaceReadGroups I=data/HG100/HG100.sorted.reads.bam O=data/HG100/HG100.sorted.reheader.reads.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20

##Build BAM index
#sudo java -Xmx8g -jar algorithms/picard/picard.jar BuildBamIndex INPUT=data/HG100/HG100.sorted.reheader.reads.bam

##Perform local realignment
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T RealignerTargetCreator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -known references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -L references/hg38gatkbundle/wgs_calling_regions.hg38.interval_list -I data/HG100/HG100.sorted.reheader.reads.bam -o data/HG100/HG100.realignment.targets.list

##Perform local realignment of reads around indels
#sudo java -jar -Xmx8g algorithms/gatk3/gatk3.8.jar -T IndelRealigner -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG100.sorted.reheader.reads.bam -known references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -known references/hg38gatkbundle/dbsnp_138.hg38.vcf -known references/hg38gatkbundle/Homo_sapiens_assembly38.known_indels.vcf -targetIntervals data/HG100.realignment.targets.list -o data/HG100/HG100.local.realigned.bam

##Perform Base Recalibration
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T BaseRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG15/HG15.local.realigned.bam -knownSites references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG15/

##Run PrintReads
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T PrintReads -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG15/HG15.local.realigned.bam -o data/HG15/HG15.output.bam --read_filter MappingQualityZero

##Run Unified Genotyper
#Use the following command if you also need to generate a gVCF file.
#sudo java -Xmx8g -jar algorithms/gatk3/gatk3.8.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --output_mode EMIT_ALL_SITES -I data/HG100.output.bam --dbsnp references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG100.output.raw.snps.indels.g.vcf

 ##Use the following command to generate a VCF file
    #sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --output_mode EMIT_VARIANTS_ONLY -I data/HG15/HG15.output.bam --dbsnp references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG15/HG15.output.raw.snps.indels.vcf

#### May have to use this one
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -A BaseQualityRankSumTest -A Coverage -A QualByDepth -A FisherStrand -A ReadPosRankSumTest --output_mode EMIT_VARIANTS_ONLY -I data/HG15/HG15.output.bam --dbsnp references/hg38gatkbundle/Homo_sapiens_assembly38.dbsnp138.vcf -o data/HG15/HG15.output.raw.combined.vcf

## Run Variant Recalibrator for SNP
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T VariantRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG15/HG15.output.raw.snps.indels.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 references/hg38gatkbundle/hapmap_3.3.hg38.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 references/hg38gatkbundle/1000G_omni2.5.hg38.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 references/hg38gatkbundle/1000G_phase1.snps.high_confidence.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/hg38gatkbundle/dbsnp_138.hg38.vcf -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile data/HG15/HG15.recalibrate.SNP.recal -tranchesFile data/HG15/HG15.recalibrate.SNP.tranches -rscriptFile data/HG15/HG15.recalibrate.SNP.plots.R

## Apply Variant Recalibration for SNPs 
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T ApplyRecalibration -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG15/HG15.output.raw.snps.indels.vcf -mode SNP --ts_filter_level 99.0 -recalFile data/HG15/HG15.recalibrate.SNP.recal -tranchesFile data/HG15/HG15.recalibrate.SNP.tranches -o data/HG15/HG15.recalibrated.SNP.raw.indels.vcf

## Run Variant Recalibrator for INDEL
## ?? Need to check the QD problem here, for now we have eliminated the QD annotation flag.
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T VariantRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG100.recalibrated.SNP.raw.indels.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/hg38gatkbundle/dbsnp_138.hg38.vcf -an DP -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile data/HG100.recalibrate_INDEL.recal -tranchesFile data/HG100.recalibrate_INDEL.tranches -rscriptFile data/HG100.recalibrate_INDEL_plots.R

## Apply Variant Recalibrator
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T ApplyRecalibration -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG100.recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile data/HG100.recalibrate_INDEL.recal -tranchesFile data/HG100.recalibrate_INDEL.tranches -o data/HG100.recalibrated_variants.vcf
ngs wgs • 848 views
ADD COMMENTlink modified 16 months ago by Sean Davis25k • written 16 months ago by win810

Not what you are looking for, but you shouldn't use sudo to perform tasks like this. Sudo should only be used when you absolutely have to, for "sysadmin tasks".

ADD REPLYlink written 16 months ago by WouterDeCoster39k

It is highly recommended to use HaplotypeCaller instead of UnifiedGenotyper


ADD REPLYlink written 16 months ago by Medhat8.3k

I am not quite sure what you were running there, but could it be that the parameters of the pipeline are for use with multiple samples, not just one?

ADD REPLYlink written 16 months ago by Michael Dondrup46k

Please be more specific what you need. Tumor-only or matched normal or normal only? There are a couple of standard datasets that include validated mutations. Better start with these, e.g. NA12878. Still, please provide some details on your pipeline. Best would be to provide the exact commands you used and the alignment statistics. No results at all is suspicious. Every WGS sample, also healthy ones, contains thousands of germline mutations. There must be a bug somewhere in your pipeline.

ADD REPLYlink written 16 months ago by ATpoint18k

OK! So I am looking for germline mutation so no need for tumor vs. sample. I am posting the entire pipeline just now.

ADD REPLYlink written 16 months ago by win810

Not really answering your question but maybe useful tips... To add Read Group to your bam files, run bwa mem with the -R STR options. This way you can eliminate the AddOrReplaceReadGroup step. Also, consider piping the output of bwa mem to samtools sort so you get the sorted bam file without passing by SAM. Even better, you could pipe to bamsort in biobambam so you can also index and mark duplicates on the fly (which probably you need to do). (I can't comment on the rest as I'm not familiar with germline variant calling).

ADD REPLYlink written 16 months ago by dariober10k

Thanks, will implement some of the steps you have outlined.

ADD REPLYlink written 16 months ago by win810
gravatar for Sean Davis
16 months ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

While I am not diagnosing your problem directly, I have three general points that are very actionable.

  1. The approach to debugging a pipeline is to proceed through each step, one at a time, verifying that at each step you produce something like what you expect. This is especially important when the pipeline does not produce the expected results at the end.
  2. A successful pipeline will also include MANY quality control steps. These can help discover problems when they occur. The picard library has several useful tools for quality control (see the "Metrics" functions).
  3. As you learn about how your pipeline can fail, attempt to implement a test or a metric that can help you quickly diagnose your failure the next time it occurs. This can save you days of effort down the road.
ADD COMMENTlink written 16 months ago by Sean Davis25k

I put this comment here to be more relevant to the answer. Beside what was said by Sean Davis it is also nice if you follow one of the pipelines mentioned here in this review A review of bioinformatic pipeline frameworks, that will be more convenient.

ADD REPLYlink written 16 months ago by Medhat8.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1989 users visited in the last hour