I'm hitting a wall and need some help with this issue.
Overview: I have NGS data for a human tumor cell line (Onco_comb_rd.bam) for which I need to identify CNVs compared to a wild type human genome sequence. I understand I need a tumor/normal comparison for samtools mpileup | varscan copynumber (I am perfectly willing to use bcftools cnv if someone can tell me what is supposed to go after the -c and -s options)
I have mapped, sorted, and rmdup'd the tumor data. I do not have a "normal" set of data, so I generated a synthetic data set using one of the bbmap related tools. For the synthetic set, I tried to match the # of tumor reads as well as the approximate insert size. Mapped etc (hg38_synth.rd.bam)
It took a while to get mpileup to function without a memory error, but I eventually got it to work. I have tried this whoel process without piping the data, and with (latest command below), but keep running into the same (or close) error:
[mpileup] 2 samples in 2 input files
Min coverage: 10
Min avg qual: 15
P-value thresh: 0.01
<mpileup> Set max per-file depth to 4000
Reading input from STDIN
Reading mpileup input...
Error: Invalid format or not enough samples in mpileup: ##fileformat=VCFv4.2
At any rate, this is the latest version of the command:
samtools mpileup -Avuf /oasis/projects/nsf/csd414/pdehoff/Reference/GRCh38_latest_genomic.fna hg38_synth.rd.bam Onco_comb_rd.bam | java -Xmx4G -jar /oasis/projects/nsf/csd414/pdehoff/programs/VarScan.v2.3.9.jar copynumber --mpileup Onco_hg38
I've been pulling my hair out over this for a while and I need to understand what I'm doing wrong here. If any of you folks have some suggestions, please let me know. Thank you.