Question: PureCN prepare environment error
0
gravatar for cg_ref_database
6 weeks ago by
cg_ref_database20 wrote:

I'm trying to prepare the environment and assay specific files for the R package PureCN following https://www.bioconductor.org/packages/devel/bioc/vignettes/PureCN/inst/doc/Quick.html

The R version that I am using is R 4.0.1

The command from the PureCN vignette that I am using is:

$ Rscript $PURECN/IntervalFile.R --infile baits_hg19.bed \ 
    --fasta hg19.fa --outfile $OUT_REF/baits_hg19_intervals.txt \
    --offtarget --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig

I am using hg38, so I have followed the instructions to change 19 to 38 for the previous steps, so my command from is:

$ Rscript $PURECN/IntervalFile.R --infile baits_hg38.bed \ 
    --fasta hg38.fa --outfile $OUT_REF/baits_hg38_intervals.txt \
    --offtarget --genome hg38 \
    --export $OUT_REF/baits_optimized_hg38.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig

The error message that I am encountering and have not been able to figure out is:

Warning message:
package 'optparse' was built under R version 4.0.2
Warning message:
****no function found corresponding to methods exports from 'BSgenome' for: 'releaseName'**
Error in normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="hg38.fa": The system cannot find the file specified**
Execution halted

Any help or suggestions would be much appreciated.

purecn R software error • 174 views
ADD COMMENTlink modified 5 weeks ago by markus.riester500 • written 6 weeks ago by cg_ref_database20

Hello cg_ref_database,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLYlink written 6 weeks ago by lieven.sterck8.3k
1

Thank you for doing that! I'll be sure to do so for future posts.

ADD REPLYlink written 6 weeks ago by cg_ref_database20
0
gravatar for markus.riester
6 weeks ago by
markus.riester500 wrote:

Hi, you sure hg38.fa is the correct path to the FASTA file?

  path[1]="hg38.fa": The system cannot find the file specified
  
ADD COMMENTlink written 6 weeks ago by markus.riester500

Thank you Markus, that was the problem for that part of the command. A new problem while still at the IntervalFile.R stage is that it gives the following error: [E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?)

Warning message:
package 'optparse' was built under R version 4.0.2
Warning message:
no function found corresponding to methods exports from 'BSgenome' for: 'releaseName'
Warning message:

package 'GenomeInfoDb' was built under R version 4.0.2
INFO [2020-07-02 19:55:07] Loading PureCN 1.19.2...
Warning messages:
1: package 'VariantAnnotation' was built under R version 4.0.2
2: package 'Rsamtools' was built under R version 4.0.2
3: package 'XVector' was built under R version 4.0.2
INFO [2020-07-02 19:55:07] Processing hglft_genome_probes_65b83_cdc6f0.bed.gz...
WARN [2020-07-02 19:55:08] Found 27928 overlapping intervals, starting at line 38.
WARN [2020-07-02 19:55:09] Target intervals were not sorted.
INFO [2020-07-02 19:55:11] Splitting 16807 large targets to an average width of 400.
INFO [2020-07-02 19:55:11] Tiling off-target regions to an average width of 200000.
INFO [2020-07-02 19:55:11] Removing following contigs from off-target regions: chr4_GL000008v2_random,chr14_GL000009v2_random,chr14_GL000225v1_random,chr15_KI270727v1_random,chr16_KI270728v1_random,chr17_KI270729v1_random,chrUn_KI270442v1,chrUn_KI270743v1,CMV
WARN [2020-07-02 19:55:12] No mappability scores provided.
WARN [2020-07-02 19:55:12] No reptiming scores provided.
INFO [2020-07-02 19:55:12] Calculating GC-content...
[E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?)
Error in value[[3L]](cond) :
   record 172301 (chr13:40366455-40558971) failed
  file: GRCh38.d1.vd1.fa
Calls: preprocessIntervals ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
In addition: Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1_random, chr4_GL000008v2_random, chr5_GL000208v1_random, chr9_KI270717v1_random, chr9_KI270718v1_random, chr9_KI270719v1_random, chr9_KI270720v1_random, chr11_KI270721v1_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr14_KI270722v1_random, chr14_GL000194v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_GL000205v2_random, chr17_KI270729v1_random, chr17_KI270730v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_ [... truncated]
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr4_GL000008v2_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_KI270729v1_random, chrUn_KI270442v1, chrUn_KI270743v1, CMV
  - in 'y': chr1_KI270766v1_alt, chr7_KI270803v1_alt, chr17_KI270909v1_alt, chr19_KI270938v1_alt, chr22_KI270879v1_alt
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
Execution halted
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by cg_ref_database20

The problem that I was encountering above was that the infile needed to be sorted, compressed and indexed:

bedtools sort -i [infile] > [infile.sorted]
bgzip -i [infile.sorted]
ADD REPLYlink written 6 weeks ago by cg_ref_database20

Glad you figured it out. Bioconductor should be able to work with uncompressed Fasta files if they are sorted (the official genome build Fasta files are). See the underlying ?scanFa function.

ADD REPLYlink written 6 weeks ago by markus.riester500

My apologies, I still need assistance with the prepare environment step. Any help would be very much appreciated.

The new error that I'm encountering now is:

INFO [2020-07-07 01:07:10] Loading GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw...

INFO [2020-07-07 01:08:04] Loading PureCN 1.18.0...
INFO [2020-07-07 01:08:04] Processing hglft_genome_probes_65b83_cdc6f0.sorted.bed.gz...
WARN [2020-07-07 01:08:05] Found 27928 overlapping intervals, starting at line 38.
WARN [2020-07-07 01:08:07] Target intervals were not sorted.
INFO [2020-07-07 01:08:17] Splitting 16807 large targets to an average width of 400.
INFO [2020-07-07 01:08:45] Tiling off-target regions to an average width of 200000.
INFO [2020-07-07 01:08:45] Removing following contigs from off-target regions: chr4_GL000008v2_random,chr14_GL000009v2_random,chr14_GL000225v1_random,chr15_KI270727v1_random,chr16_KI270728v1_random,chr17_KI270729v1_random,chrUn_KI270442v1,chrUn_KI270743v1
WARN [2020-07-07 01:09:46] 298 intervals without mappability score (298 on-target).
INFO [2020-07-07 01:09:46] Removing 5109 intervals with low mappability score (<0.50).
WARN [2020-07-07 01:09:46] No reptiming scores provided.
INFO [2020-07-07 01:09:46] Calculating GC-content...
Error in value[[3L]](cond) :
   record 172300 (chr14:16023139-16053427) contains invalid DNA letters
  file: GRCh38.d1.vd1.compress.bgzip.fa.gz
Calls: preprocessIntervals ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
In addition: Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1_random, chr4_GL000008v2_random, chr5_GL000208v1_random, chr9_KI270717v1_random, chr9_KI270718v1_random, chr9_KI270719v1_random, chr9_KI270720v1_random, chr11_KI270721v1_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr14_KI270722v1_random, chr14_GL000194v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_GL000205v2_random, chr17_KI270729v1_random, chr17_KI270730v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_ [... truncated]
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': CMV
  - in 'y': chr11_KI270721v1_random, chr14_GL000194v1_random, chr14_KI270722v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr17_GL000205v2_random, chr17_KI270730v1_random, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_KI270736v1_random, chr22_KI270737v1_random, chr22_KI270738v1_random, chr22_KI270739v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1_random, chr5_GL000208v1_random, chr9_KI270717v1_random, chr9_KI270718v1_random, chr9_KI270719v1_random, chr9_KI270720v1_random, chrEBV, chrM, chrUn_GL000195v1, chrUn_GL00021 [... truncated]
3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr4_GL000008v2_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_KI270729v1_random, chrUn_KI270442v1, chrUn_KI270743v1, CMV, chr11_KI270721v1_random, chr14_GL000194v1_random, chr14_KI270722v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr17_GL000205v2_random, chr17_KI270730v1_random, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_KI270736v1_random, chr22_KI270737v1_random, chr22_KI270738v1_random, chr22_KI270739v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1 [... truncated]
Execution halted
ADD REPLYlink written 5 weeks ago by cg_ref_database20

Okay, the problem was with the GRCh38.d1.vd1.fa that I obtained from the cancer.gov website (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/#step-1-converting-bams-to-fastqs-with-biobambam-biobambam2-2054) and that I used to process my tumor-only samples with. When I used GATK's fasta file, bundle/hg38/Homo_sapiens_assembly38.fasta, I was finally able to create the baits' txt and bed files.

My new fear is that I'll have to re-process all of the VCF files for all of the tumor-only samples using GATK's Homo_sapiens_assembly38.fasta file. I'll find out when I get to the step in the PureCN manual. Wish me luck, and thank you everyone (Markus) for any and all feedback that you provided as that helped me to finally complete this step.

To be continued...

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by cg_ref_database20

Hm, this shouldn't be that difficult.

Did you run samtools faidx GRCh38.d1.vd1.fa? This should probably warn you when something is wrong with the Fasta.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by markus.riester500

Thank you for the samtools command to check if there was something wrong with the fasta. The fasta was fine. The problem appears to be with the liftover (37 -> 38) of the probe.bed file. A comparison between the liftover bed-file and the probe.bed in hg38 coordinates (I was unaware that this file was available, hence the liftover) revealed that there were five intervals that were not present in the liftover bed-file that were present in the bed-file that was already in hg38 coordinates.

ADD REPLYlink written 5 weeks ago by cg_ref_database20

Did you end up converting the input files for --mappability and --reptiming flags when you were trying to run with hg38? It seems like the input files you used were for hg19 configuration.

ADD REPLYlink written 5 days ago by newbio17120
1

See section 2 of https://bioconductor.org/packages/release/bioc/vignettes/PureCN/inst/doc/Quick.html for links to precomputed map-ability files. Rep timing should be available from UCSC genome browser. It’s not critical and safe to run without.

ADD REPLYlink written 5 days ago by markus.riester500
0
gravatar for markus.riester
5 weeks ago by
markus.riester500 wrote:

Thanks. If you think this is something that should have been handled with a better error message, feel free to open a github issue with a minimal example to reproduce.

ADD COMMENTlink written 5 weeks ago by markus.riester500
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 809 users visited in the last hour