PureCN docker workflow
0
1
Entering edit mode
3.2 years ago
biologist ▴ 20

Hello everyone, I am currently using the Docker image for PureCN and this is my workflow. Could anyone advise me on whether this workflow is right? The reason I ask this is because the results generated using this workflow are significantly different from the results I get from using Sequenza for 10 paired tumor-germline samples:

#Step 1 - Generating interval files

Rscript $PURECN/IntervalFile.R --in-file $BAIT_COORDINATES \
    --fasta $REF_GENOME_b37 --out-file $OUT_REF/baits_hg19_intervals.txt \
    --off-target --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed  \
    --mappability $MAPPABILITY_FILE_B37

#Step 2 - Run mutect on 10 germline samples, create genomics db and use CreateSomaticPanelOfNormals

> $PURECN_G/G_sample_map

        count=1
        while IFS='' read -r line
        do
        if [ ! -z $GERMLINE_PFX ] && [ ! -f $PURECN_G/${GERMLINE_PFX}.vcf.gz ] && [ $count -le 10 ]; then
                count=$((count+1))
                echo $GERMLINE_PFX
                #For each germline bam files
                echo "Running Mutect2 on germline bams.................."

                #Run mutect on 10 germline samples
                gatk Mutect2 \
                -R $REF_GENOME_b37 \
                -I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
                --max-mnp-distance 0 \
                --genotype-germline-sites true \
                --genotype-pon-sites true \
                --interval-padding 75 \
                --L $BEDFILE_0B \
                -O $PURECN_G/${GERMLINE_PFX}.vcf.gz

                #Create a map file of all germline vcfs
                echo -e "$GERMLINE_PFX\t$PURECN_G/${GERMLINE_PFX}.vcf.gz" >> $PURECN_G/G_sample_map
        fi

       done<$LIST_OF_SAMPLES

        #Create a Genomics DB from the normal Mutect2 calls 
        gatk GenomicsDBImport \
        -R $REF_GENOME_b37 \
        -L $BEDFILE_0B \
        --genomicsdb-workspace-path $PURECN_G/pon_db \
        --sample-name-map $PURECN_G/G_sample_map

        #Combine the normal calls using CreateSomaticPanelOfNormals
        gatk CreateSomaticPanelOfNormals \
        -R $REF_GENOME_b37 \
        --germline-resource $GERMLINE_RESOURCE \
        -V gendb://$PURECN_G/pon_db \
        -O $PURECN_G/pon.vcf.gz

#Step 3 - Run Mutect on unmatched tumor samples

while IFS='' read -r line
        do
                if [ ! -z $TUMOR_PFX ]  && [ ! -f $PURECN_T/${TUMOR_PFX}.vcf.gz ]; then
                        #Run Mutect2 for each tumor sample
                        gatk Mutect2 \
                        -R $REF_GENOME_b37 \
                        -I $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
                        -pon $PURECN_G/pon.vcf.gz \
                        --germline-resource $GERMLINE_RESOURCE\
                        --genotype-germline-sites true \
                        --genotype-pon-sites true \
                        --interval-padding 75 \
                        -L $BEDFILE_0B \
                        -O $PURECN_T/${TUMOR_PFX}.vcf.gz

                fi

       done<$LIST_OF_SAMPLES

#Step 4 - calculate gc normalized coverages for germlines

count=1
            while IFS='' read -r line
            do
            TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
            GERMLINE_PFX=$(echo $line| awk -F'|' '{print $2}')
            if [ ! -z $GERMLINE_PFX ] && [ $count -le 10 ] && [ ! -f $PURECN_G/${GERMLINE_PFX}_coverage_loess.txt.gz ] ; then
                    count=$((count+1))
                    echo $GERMLINE_PFX

                    Rscript $PURECN/Coverage.R \
                    --out-dir $PURECN_G \
                    --bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${GERMLINE_PFX}*.bam \
                    --intervals $OUT_REF/baits_hg19_intervals.txt
            fi

           done<$LIST_OF_SAMPLES

#Step 5 - calculate gc normalized coverages for tumors

while IFS='' read -r line
        do
        TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')
        if [ ! -z $TUMOR_PFX ] && [ ! -f $PURECN_T/${TUMOR_PFX}_coverage_loess.txt.gz ]; then
                echo $TUMOR_PFX
                Rscript $PURECN/Coverage.R \
                --out-dir $PURECN_T \
                --bam $SAVEROOT/$TUMOR_PFX/recal/bqsr_${TUMOR_PFX}*.bam \
                --intervals $OUT_REF/baits_hg19_intervals.txt
        fi

        done<$LIST_OF_SAMPLES

#Step 6 - create normal db

ls -a $PURECN_G/*_loess.txt.gz | cat > $PURECN_G/normal_coverages.list
        ulimit -n 5000
        #For a Mutect2/GATK4 normal panel GenomicsDB (beta)
        Rscript $PURECN/NormalDB.R --out-dir $PURECN_G \
        --coverage-files $PURECN_G/normal_coverages.list \
        --normal-panel $PURECN_G/pon_db \
        --genome hg19

#Step 7 - calculate tumor cellularity and ploidy

while IFS='' read -r line
        do
                TUMOR_PFX=$(echo $line| awk -F'|' '{print $1}')

        if [ ! -z $TUMOR_PFX ] && ! grep $TUMOR_PFX $PURECN_RESULTS/purecn_unpaired_tumor_cellularity_ploidy.txt; then
                echo $TUMOR_PFX
                #For each tumor bam files
                # Production pipeline run (dont need stats file if Mutect2 was run)
                Rscript $PURECN/PureCN.R --out $PURECN_T \
                --tumor $PURECN_T/bqsr_${TUMOR_PFX}_T_coverage_loess.txt.gz \
                --sampleid $TUMOR_PFX \
                --vcf $PURECN_T/${TUMOR_PFX}.vcf.gz \
                --fun-segmentation PSCBS \
                --normaldb $PURECN_G/normalDB_hg19.rds \
                --mapping-bias-file $PURECN_G/mapping_bias_hg19.rds \
                --intervals $OUT_REF/baits_hg19_intervals.txt \
                --genome hg19 \
                --model betabin \
                --force --post-optimize --seed 123

My assay is targeted capture with an off target read of 30% and but if I specify --off-target in Step 1, in Step 6, I see an error:

INFO [2022-07-12 03:14:00] Removing 15793 intervals with low coverage in normalDB.
FATAL [2022-07-12 03:14:00] No intervals left after coverage checks.
FATAL [2022-07-12 03:14:00]
FATAL [2022-07-12 03:14:00] This is most likely a user error due to invalid input data or
FATAL [2022-07-12 03:14:00] parameters (PureCN 2.2.0).
Error: No intervals left after coverage checks.
This is most likely a user error due to invalid input data or
parameters (PureCN 2.2.0).
Execution halted

If I exclude --off-target in Step 1, the pipeline runs but produces different cellularity and ploidy estimates from that produced by Sequenza.

Hoping to get some idea on whether I am following the right steps for my pipeline, thank you!

purecn • 912 views
ADD COMMENT

Login before adding your answer.

Traffic: 3288 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6