Question

PHG -imputeTarget pathToVCF plugin not writing expected output files?

0

Entering edit mode

13 months ago

twrl8 • 0

Hello!

I am trying to do the last few steps of the Practical Haplotype Graph pipeline as described here . I am running PHG v1.2, have loaded the Haplotypes to the database, created the pangenome fasta and am now trying to impute the best paths for different samples using WGS reads in fastq format using the Plugin: -ImputePipelinePlugin -imputeTarget pathToVCF. When I start this as a test for one sample I get a vcf file as an output (against the reference I am guessing?), but I do not get any information about the calculated path. The keyfiles created by the plugin (_pathKeyFile and _withMappingIds) are empty apart from the header. These files, to my understanding, should contain haplotype counts?

If I go into the database and look at the "paths" table, I find it has entries for all the assemblies I used to load the haplotypes plus one for the reference (at least judging by the method_id). The tables haplotype_counts and read_mapping_paths are empty, but I suppose there should be information about the alignment, haplotype counts and path saved there, so there seems to be an issue.

The plugin seems to run till the end (and output a vcf), but I did get one ERROR during its run:

ERROR net.maizegenetics.pangenome.hapCalling.Minimap2Utils - input directory does not contain both of lineA_1.trim.gz and lineA_2.trim.gz. The directory name should not be part of the filename in the keyfile.

How could it do any alignment and output a vcf if it can not find these files? I checked the path listed in the config file:

fastqDir=/PHG/inputDir/imputation/fastq/

And the two files are located at that location.

I would greatly appreciate help or direction into what to check!

Many thanks!

If helpful this is my keyfile:

cultivar        flowcell_lane   filename        filename2
lineA  lineA  lineA_1.trim.gz        lineA_2.trim.gz

And this is my config file:

# Imputation Pipeline parameters for fastq or SAM files

# Required Parameters!!!!!!!
#--- Database ---
host=localHost
user=xxx
password=xxx
DB=/PHG/phg_run2.db
DBtype=sqlite


#--- Used by liquibase to check DB version ---
liquibaseOutdir=/PHG/outputDir/

#--- Used for writing a pangenome reference fasta(not needed when inputType=vcf) ---
pangenomeHaplotypeMethod=assembly_by_anchorwave
pangenomeDir=/PHG/outputDir/pangenome
indexKmerLength=21
indexWindowSize=11
indexNumberBases=90G

#--- Used for mapping reads
inputType=fastq
readMethod=lineA_20230306_run2
keyFile=/PHG/20230306_wgsconfigs/lineA_20230306readMapping_key_file.txt
fastqDir=/PHG/inputDir/imputation/fastq/
samDir=/PHG/inputDir/imputation/sam/
lowMemMode=true
maxRefRangeErr=0.25
outputSecondaryStats=false
maxSecondary=20
fParameter=f15000,16000
minimapLocation=minimap2

#--- Used for path finding
pathHaplotypeMethod=assembly_by_anchorwave
pathMethod=lineA_20230306_run2
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=1
minTransitionProb=0.0005
numThreads=36
probCorrect=0.99
removeEqual=false
splitNodes=true
splitProb=0.99
usebf=true
minP=0.8

#   used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=classic

#--- Used to output a vcf file for pathMethod
#~~~ Optional Parameters ~~~
outVcfFile=/PHG/outputDir/align/lineA_20230306_run2_variants.vcf
localGVCFFolder=/PHG/inputDir/loadDB/gvcf
#pangenomeIndexName=**OPTIONAL**
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
debugDir=/PHG/debugDir/
#bfInfoFile=**OPTIONAL**

phg • 1.1k views

ADD COMMENT • link 13 months ago by twrl8 • 0

0

Entering edit mode

13 months ago

lcj34 ▴ 420

The db has UNIQUE constraints relating to the reads and corresponding method. If any of the read_mappings were populated from the previous run, you will need to either run the DeleteReadMappingPlugin to remove the older entries, or use a different method name to store the entries for this run.

ADD COMMENT • link 13 months ago by lcj34 ▴ 420

0

Entering edit mode

Thank you very much! I will keep it in mind for future runs and choose a different name just in case.

ADD REPLY • link 13 months ago by twrl8 • 0

score 1 · Accepted Answer · 2023-03-09

1

Entering edit mode

13 months ago

zrm22 ▴ 40

Hello,

That error is generally thrown when the PHG cannot find the files specified in the keyfile. The issue is that the PHG is requiring files with the following extensions:

.fastq.gz,.fq.gz,.fastq,.fq

Because you have .trim.gz, it is not realizing that you have fastq files there.

What is inside the VCF file? Did it create all the SNPs but there are no taxa? Does it just have the header? Do the taxa lines match your input Haplotype sample names?

ADD COMMENT • link 13 months ago by zrm22 ▴ 40

0

Entering edit mode

Hello,

Thank you so much for the quick response!!

I will try again renaming the fastq files. If I rerun the script with the same readMethod and pathMethod it should overwrite old entries in the database, correct?

I am not sure which taxa information there should be, but the vcf contains the header and seemingly SNPs across all chromosomes of the reference genome (at least judging by the chromosome names).

These are the first few lines of the vcf if you are interested:

##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=3,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AF,Number=3,Type=Integer,Description="Allele Frequency">
##INFO=<ID=ASM_Chr,Number=1,Type=String,Description="Assembly chromosome">
##INFO=<ID=ASM_End,Number=1,Type=Integer,Description="Assembly end position">
##INFO=<ID=ASM_Start,Number=1,Type=Integer,Description="Assembly start position">
##INFO=<ID=ASM_Strand,Number=1,Type=String,Description="Assembly strand">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1H_1    49588   .   A   G   .   .   .
1H_1    49595   .   A   C   .   .   .
1H_1    49668   .   C   T   .   .   .
1H_1    49681   .   G   A   .   .   .
1H_1    49741   .   C   G   .   .   .
1H_1    49760   .   G   A   .   .   .
1H_1    49779   .   C   T   .   .   .
1H_1    49822   .   G   A   .   .   .
1H_1    49831   .   A   C   .   .   .
1H_1    49884   .   C   T   .   .   .
1H_1    49952   .   A   G   .   .   .
1H_1    49974   .   C   T   .   .   .
1H_1    50003   .   C   A   .   .   .
1H_1    50019   .   G   A   .   .   .

ADD REPLY • link 13 months ago by twrl8 • 0

0

Entering edit mode

My guess is that there is nothing in the DB for the readMethod or pathMethod, so it is ok to use the same. If the pipeline had found the sample correctly it would have added a column for lineA.

ADD REPLY • link 13 months ago by zrm22 ▴ 40

0

Entering edit mode

Yes, I think. Looking at the "methods" table in the DB there is no entry named the same or similar to the read/pathMethod.

I apprechiate your advice and will try to rerun this step. Thank you!

ADD REPLY • link 13 months ago by twrl8 • 0

0

Entering edit mode

Sorry, perhaps also as a clarification for later regarding the VCF file.

The REF column will contain the allele of the reference genome, while the ALT contains all the alleles of the pangenome, i.e. those of my assemblies and not of my samples (fastq files based), correct?

1) Is there a way to tell the best calculated path from the VCF? My chromosomes are unfortunately called the same across assemblies, so I would guess the "ASM_Chr" field would not help? Or is the path only saved in the BLOB in the "paths" table inside the DB? If yes, how do I access the BLOB object to create something readable (I unfortunately do not have prior experience with sql)?

2) Is the alignment against the pangenome itself saved somewhere? Since I might want to continue work with it.

Sorry, if this is too far from the first topic, I will create a new post.

ADD REPLY • link 13 months ago by twrl8 • 0