Merge VCF files by chromosome and Across samples
0
0
Entering edit mode
8 weeks ago
Maverick ▴ 10

Hello,

I had a gatk pipeline setup successfully to call SNPs and Indels. however it was running whole samples and to reduce time I split them by chromosomes (after mapping, sorting and marking duplicates) and went all the way upto calling variants (GVCF mode). now when I try to concat the chromosomes of the same sample I get an error saying it's not sorted (no matter what tool I run vcftools, Picard) and it's particularly when trying to concat chr9 and chr10. it says that 1st position mentioned in chr 10 is before last position of chr9. I manually checked it and it's not the case. however I sorted them (using vcftools) and concatenated (using vcftools). now the concatenation is successful.

Next step was to combing the gvcf file across samples into one for joint calling. I ran combinegvcf and I got this error:

> 20:37:57.292 INFO  ProgressMeter -       chr9:137752186           
> 250.6             858634000        3426981.3 20:38:02.220 INFO  CombineGVCFs - Shutting down engine [October 13, 2024 at 8:38:02 PM
> GMT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done.
> Elapsed time: 250.69 minutes. Runtime.totalMemory()=1224736768
> java.lang.IllegalStateException: The elements of the input Iterators
> are not sorted according to the comparator
> htsjdk.variant.variantcontext.VariantContextComparator    at
> htsjdk.samtools.util.MergingIterator.next(MergingIterator.java:107)
>   at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
>   at
> java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
>   at
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>   at
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>   at
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>   at
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
>   at
> org.broadinstitute.hellbender.engine.MultiVariantWalker.traverse(MultiVariantWalker.java:136)
>   at
> org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:165)
>   at
> org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
>   at
> org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
>   at
> org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
>   at
> org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
>   at
> org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
>   at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)  at
> org.broadinstitute.hellbender.Main.main(Main.java:306)\

Again this is only after chr9 in all the samples. Why do I have to sort the combined files again?

I have seen in the logs that the files are usually processed in the order 1,10,11-19,2,20,21-22,3-9,X,Y,M. could the issue be because I am trying to concatenated in the natural order as in 1-22,X,Y M?

Thank you for your time and guidance.

GATK PICARD MergeVcf vcftools • 887 views
ADD COMMENT
0
Entering edit mode

(no matter what tool I run vcftools, Picard)

show us the picard command tool

ADD REPLY
0
Entering edit mode

why calling CombineGVCFs with merged vcfs when you could combine+genotype each chromosome ?

ADD REPLY
0
Entering edit mode

Hi,

My assumption was combine and genotype had to be done for the whole vcf file and so I merged them first. So the joint calling isn't affected if the samples have only specific chromosomes and not all of them?

My pipeline follows this:

Phase 1:

(For each sample in a batch)

  1. Map to reference using BWA-MEM

  2. Mark Duplicates and Sort - SortSam

  3. Mark Duplicates and Sort - MarkDuplicates

  4. Mark Duplicates and Sort - samtools_index

  5. Split BAM by Chromosome & index (using samtools)

    For each chromosome in a sample,

  6. Base Quality Recalibration - BaseRecalibrator

  7. Base Quality Recalibration - ApplyBQSR

  8. Collect Alignment & Insert Size Metrics - CollectAlignmentSummaryMetrics

  9. Collect Alignment & Insert Size Metrics - CollectInsertSizeMetrics

  10. Variant Calling and Filtering - HaplotypeCaller (GVCF mode)


  1. CombineGVCFs per Sample (previously used combinegvcf , gather vcf and all failed with the error like First record in file chr10 is not after first record in previous file chr9)

    so I used vcf tools to sort

>  zcat "$vcf_file" | $vcftools/perl/vcf-sort -c -t "$tmp" | bgzip -c >
> "$sorted_vcf_file"

   and then vcftools to concat

 > $vcftools/perl/vcf-concat $gvcf_sorted_inputs >
 > "$results/${sample_name}_combined.g.vcf"

Phase 2:

  1. CombineGVCFs - combine all sample gvcfs in the batch into one:
  > $gatk CombineGVCFs -R ${ref} $VARIANT_ARGS -O
  > ${results_comb}/$batch_name"_GATK_combined.g.vcf"
  1. GenotypeGVCFs
  2. VariantRecalibrator (SNPs)
  3. ApplyVQSR (SNPs)
  4. VariantRecalibrator (INDELs)
  5. ApplyVQSR (INDELs)
  6. SelectVariants (extract SNPs and INDELS)
  7. VariantFiltration (SNPs)
  8. VariantFiltration (INDELs)
  9. MergeVcfs (SNPs and INDELs)
  10. bcftools (split vcf files per sample)

Regarding the error,

I believe the contig order is wrong in one of my samples. For example when I run gatk sort sam (same command for all my samples),

  $gatk SortVcf \
        -I "$file" \
        -O "$sorted_file" \
        --CREATE_INDEX true

I can see in the std output that its reading is in the order 1,10-19,2,20-22,X,Y as usual but for only one sample it's reading 1,2,3.. and so on. why would this happen?

The only difference I noticed between the samples was when generating fastq files from the original cram file I changed my command for this particular sample in question where this specific fastq file was generated having reads without the read direction indicator such as /1 or /2 . this should not be an issue at all since there is no question of order before mapping right?

ADD REPLY
1
Entering edit mode

and then vcftools to concat

vcftools is deprecated for this task. Use bcftools.

ADD REPLY
1
Entering edit mode

I would remove the sample with a probkem and re-rerun everything for this sample.

Furthemore, you should use a workflow manager like snakemake or nextflow.

ADD REPLY
0
Entering edit mode

Thank you for that. currently have a master script that submits jobs to LSF for each step. I will try to shift to a workflow manager. And yes, I am currently removing the sample and have run it to see if that is the case.

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

My assumption was combine and genotype had to be done for the whole vcf file and so I merged them first.

no. imagine you're interested in calling only one gene. The other contigs are unrelated to your gene.

You'll need to merge everything if you use VQSR

ADD REPLY
1
Entering edit mode

Split BAM by Chromosome & index (using samtools)

this is useless, you can use option -L of gatk to call only one chromosome.

ADD REPLY
0
Entering edit mode

the sole reason to split the chromosomes is to save time. the steps are run in parallel for all chromosomes before combining. its saving about ~25 hrs. In reality, the og pipeline ran the entire bam file without intervals.

ADD REPLY
1
Entering edit mode

the sole reason to split the chromosomes is to save time.

you don't save time when creating a new bam when GATK can use intervals.

ADD REPLY
0
Entering edit mode

Yeah, I am using VQSR.The pipeline is expected to find rare germline variants across the whole genome. Currently not looking for anything specific. Will be adding annotations and rankers to filter possible pathogenic variants.

ADD REPLY
0
Entering edit mode

I tried with entirely different files. It was WES data of 3 samples this time. it ran successfully when I restricted the pipeline to just 2 chromosomes for testing(I used 21 and 22) . When I ran it again for all chromosomes I got the same error.

My adjusted pipeline follows this flow currently:

Phase 1:

(For each sample in a batch)

  1. Map to reference using BWA-MEM
  2. Mark Duplicates and Sort - SortSam
  3. Mark Duplicates and Sort - MarkDuplicates
  4. Mark Duplicates and Sort - samtools_index

Steps 2,3 and 4 are as shown below and is the only time I sort.

$gatk SortSam -I ${aligned_reads}/$sample_name"_GATK_aligned.bam" -O ${aligned_reads}/$sample_name"_GATK_sorted_reads.bam" -SO coordinate --TMP_DIR $TMP_DIR

$gatk MarkDuplicates -I ${aligned_reads}/$sample_name"_GATK_sorted_reads.bam" -O ${aligned_reads}/$sample_name"_GATK_sorted_dedup_reads.bam" -M ${aligned_reads}/$sample_name"_metrics.txt" --TMP_DIR $TMP_DIR

$samtools index ${aligned_reads}/$sample_name"_GATK_sorted_dedup_reads.bam"

Not splitting bam anymore; directly feeding chromosome name as interval for base recalibration and haplotype caller.

$gatk BaseRecalibrator \ -I "${aligned_reads}/${sample_name}_GATK_sorted_dedup_reads.bam" \ -R "${ref}" \ --known-sites "${DBSNP_VCF}" \ --known-sites "${MILLSIND_VCF}" \ -L "${chr_name}" \ -O "${data}/${sample_name}_${chr_name}_recal_data.table"

$gatk ApplyBQSR \ -I "${aligned_reads}/${sample_name}_GATK_sorted_dedup_reads.bam" \ -R "${ref}" \ --bqsr "${data}/${sample_name}_${chr_name}_recal_data.table" \ -L "${chr_name}" \ -O "${aligned_reads}/${sample_name}_${chr_name}_recalibrated.bam"

$gatk HaplotypeCaller \ -R "${ref}" \ -I "${aligned_reads}/${sample_name}_${chr_name}_recalibrated.bam" \ -O "${results}/${sample_name}_${chr_name}_raw_variants.g.vcf.gz" \ -ERC GVCF \ -L "${chr_name}" \ --native-pair-hmm-threads 10

Then I concatenate the chromosomes of each sample into respective files and index it using bcftools

(the order I concatenate is 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M)

$bcftools concat -O z -o "$output_vcf" "${gvcf_list[@]}"

$bcftools index -t "$output_vcf"

I don't have any errors till this point.

Then I run combine gvcf for combining across samples using:

$gatk CombineGVCFs \ -R "${ref}" \ "${variant_args[@]}" \ -O "${results_comb}/${batch_name}_GATK_combined.g.vcf.gz"

and I get the same error.

However the error was rectified when I ran sorting operation of the individual sample gvcf files after concatenation and before combinegvcf using :

$gatk SortVcf -I "$file" -O "$sorted_file" --CREATE_INDEX true

I can see that all samples are being read and sorted and the only difference is the contig order and nothing else.

So, is the whole error because of the order I am concatenating the chromosomes? should it have been (1,10,11,12,13,14,15,16,17,18,19,2,20,21,22,3,4,5,6,7,8,9,X,Y,M). Why is this a problem for combinegvcf?

Thank you so much for all the guidance once again!

ADD REPLY
0
Entering edit mode

$gatk CombineGVCFs \ -R "${ref}" \ "${variant_args[@]}" \ -O "${results_comb}/${batch_name}_GATK_combined.g.vcf.gz"

what's in \ "${variant_args[@]}" ?

Furthermore what is the output of

find /path/to -type f -name "*.g.vcf.gz" | while read F; do bcftools view --header-only | grep "##contig=" | md5sum - ; done | sort | uniq -c
ADD REPLY
0
Entering edit mode

I search for the chromosome-concatenated vcf files of each sample using a loop like:

for file in $(find "${results}" -type f -name "*combined.g.vcf.gz");do variant_args+=("--variant" "$file"); done

I pass this as input to the combine gvcf command.

As for the command you have mentioned, I assume you are wanting to check if the contigs are same across the samples I am combining using combine gvcf. yes, they are same.

Here is the output:

find . -type f -name "*combined.g.vcf.gz" | while read F; do bcftools view --header-only "$F" | grep "##contig=" | md5sum - ; done | sort | uniq -c
      3 9f2d6b8b12cb54747284b610c0b0bef7  -  

I tested 3 samples and all have the exact same header. The only error is the order I was concatenating the chromosomes.

I don't have the error anymore after changing the order to "1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9 M X Y" . I believe this is because the reference genome's dictionary is in the lexicographical order (seems they are sorted in order by considering them as strings) and the downstream tools also expecting the contigs to be in the same order as the reference.

ADD REPLY

Login before adding your answer.

Traffic: 1395 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6