Question: Only one cluster in multiple tumor phylogeny samples
1
gravatar for ceruleanivy
9 months ago by
ceruleanivy20
ceruleanivy20 wrote:

I would like to create phylogenetic trees for 50 patients with breast cancer based purely on SNP VAFs as I don't have copy number variations. Although I've been having several issues with patients having less than 10 mutations, there are many examples with ostensible VAF differences where sciClone keeps returning just one cluster. Here's an example https://www.sendspace.com/filegroup/Wn8S9o8HzjkcUaLzTpR%2BIQ

Do you have any ideas on what is the default/lowest minimum depth of the software ? I've also read in another paper (PhyloSub) that choosing another platform might increase potential number of clusters you get, so I am interested in anything that someone has to suggest that works on the above data.

sequencing snp sciclone next-gen R • 484 views
ADD COMMENTlink modified 9 months ago by Chris Miller17k • written 9 months ago by ceruleanivy20
2
gravatar for Chris Miller
9 months ago by
Chris Miller17k
Washington University in St. Louis, MO
Chris Miller17k wrote:

First of all, your variant files need to be reformatted. VAFs should be expressed as percentages (0-100), not fractions (0-1). That's not helping things

There are several other oddities about your variant files:

1) There's no way to have a decimal number of reference-supporting reads

17  41209079  1.791 195 0.1

2) Why do you have exactly 1000 reads for all of these sites? Your variant lists need to be merged, such that you get readcounts for each variant in each bam, regardless of whether it was called or not. You may have (for example) a cluster that's present at 2-10% in one sample and at 35% in the other, but wouldn't have been identified in the former, since callers have limited low-VAF sensitivity.

17  41276047  1000  0 0
17  41197799  1000  0 0
17  41209140  1000  0 0

3) Take a look at this image from a recent paper of ours (http://www.cell.com/cell-systems/fulltext/S2405-4712(15)00113-1). The downsampling here gives you a pretty good idea of how depth affects the ability to distinguish individual clusters

Figure 5

ADD COMMENTlink written 9 months ago by Chris Miller17k

Thanks for the response, for some reason my obsolete software for mutation calling gives me wrong outputs. Fixed that and the program finally gave me 3 clusters. Apart from the computational time, there are some patients with six samples (bilateral primary with metastatic sites) where I cannot figure how to assign clusters across all the 5-6 files and plots are not available for more than 3 VAF entries.

ADD REPLYlink written 8 months ago by ceruleanivy20

Your variant lists need to be merged, such that you get readcounts for each variant in each bam, regardless of whether it was called or not.

So what you are saying is that if I have 800 variants in the primary tumor, and 1000 in the relapse tumor, I have to merge them so that I end up with 1000 variants in both primary and tumor? Then I need to include the number of reads for the samples where the variants are not present, I suppose it is not enough to say Variant allelic frequency = 0 at that position.

ADD REPLYlink written 3 months ago by chloe.steen120

Exactly. There may be variants present at 3% that were not called in one samples, but since you're able to see them clearly at 40% the other tumor, you can infer that they were present.

ADD REPLYlink written 3 months ago by Chris Miller17k

what tools are you using for getting readcounts for each variant in each bam? I saw bam-readcount https://github.com/genome/bam-readcount

ADD REPLYlink written 3 months ago by tangming20051.8k
3

I used samtools mpileup and provided the list of positions based on the merged vcf files for the two samples.

samtools mpileup -f myreferencefastafile.fasta -s merged_variants_file.txt path/to/bam/file.bam -o output.vcf -v -t DP4

DP4 is a tag to access the number of reads in the output file. It gives the number of reference reads in the forward strand, number of reference reads in the reverse strand, number of alternative reads in forward strand, and number of alternative reads in reverse strand.

I used these numbers in the postprocessing to generate the new vcf files for my individivual samples.

Edit: DP4 is a tag in an older version of samtools (1.1), but if you read the documentation you should figure out how to extract the number of reads.

ADD REPLYlink modified 3 months ago • written 3 months ago by chloe.steen120

Thanks, this is very helpful

ADD REPLYlink written 3 months ago by tangming20051.8k

for the mpileup output

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
1       10234   .       C       T,<*>                PL:DP:SP:ADF:ADR:AD     0,2,171,69,183,207:27:10:11,4,0:12,0,0:23,4,0

what's does the <*> mean? and the PL has 6 combinations.

Thanks.

ADD REPLYlink written 11 weeks ago by tangming20051.8k
1

We typically use bam-readcount, but any tool that gives you the desired info is fine.

ADD REPLYlink written 3 months ago by Chris Miller17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1738 users visited in the last hour