Question: Too many distinct mutations in matched normal and cancer cells
prashant10991 wrote 10 days ago:

Hi,

I am working on the exome sequencing data described in https://www.nature.com/articles/sdata201610. To summarize the dataset: the authors sequenced exomes of cancer tissue and of blood cells (as the matched normal) from 7 patients. I have acquired the VCF files of 3 patients with the same type of cancer. The shared VCF files are already filtered. I tried to find the mutations unique to the cancer samples and to the matched normal samples by taking different types of joins on the VCF files.

I found that many mutations are unique to the cancer cells and many are unique to the matched normal cells. I was expecting the matched normal cells to have very few unique mutations. Can you help me understand this behavior? The exact numbers are below:

For patient 1/2/3:

Total common mutations (in both cancer tissue and matched normal blood cell): 75961/88110/82211

Total unique cancerous mutations (only in tissue): 15909/17694/17464

Total unique matched mutations (only in matched normal): 14825/13826/21555

These were the steps followed to compute the common, unique cancerous, and unique matched mutations, using the VCF files of SNPs only.

  1. Only SNP mutations with PASS in the FILTER column were kept. This filter was applied to both the matched normal and the cancer mutations.
  2. Mutations present in both the matched normal and the cancer sample are referred to as common mutations above.
  3. Mutations exclusive to either the matched normal or the cancer sample are referred to as unique matched and unique cancerous mutations, respectively.
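
The three steps above can be sketched as a set comparison. This is only a minimal illustration, not the poster's actual commands (which are not shown in the thread): it assumes tab-separated VCF records, keys each variant on (CHROM, POS, REF, ALT), and uses made-up toy records. The names `variant_keys` and `compare` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_keys(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (CHROM, POS, REF, ALT) keys from records whose FILTER is PASS."""
    keys: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        if flt != "PASS":
            continue
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare(tumor_lines: Iterable[str], normal_lines: Iterable[str]) -> Dict[str, Set[Variant]]:
    """Split variants into common, tumor-only and normal-only sets."""
    tumor = variant_keys(tumor_lines)
    normal = variant_keys(normal_lines)
    return {
        "common": tumor & normal,
        "tumor_only": tumor - normal,
        "normal_only": normal - tumor,
    }


# Toy records, invented for illustration only.
tumor_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
    "1\t300\t.\tG\tA\t50\tlow_coverage\t.\tGT\t0/1",
]
normal_vcf = [
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "1\t400\t.\tT\tC\t50\tPASS\t.\tGT\t0/1",
]

result = compare(tumor_vcf, normal_vcf)
print({name: len(variants) for name, variants in result.items()})
```

Note that a variant missing from one PASS-filtered file only means it was not called, or did not pass the filters, in that sample. It does not show that the variant is absent from that sample.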
cancer snp exome vcf • 133 views
modified 10 days ago • written 10 days ago by prashant10991 (0)

You should use a somatic variant caller rather than trying to overlap variants between different VCFs. See this similar earlier discussion: "Why do people not call normal and tumor variant separately for somatic mutation identification?"

written 10 days ago by igor (7.3k)

You should at least give links to the file sources and provide the command lines that were used for the filtering. Otherwise it is almost impossible to reproduce what you did. See "Brief Reminder On How To Ask A Good Question".

written 10 days ago by ATpoint (13k)

The data is not publicly available. It can be downloaded on request from http://txcrb.org/data.html. The same link also appears in the article I mentioned above, along with all the preprocessing steps. I am using the filtered VCF files shared by the authors.

However, I see no issue in sharing a snippet containing the header and a few lines from the file.

##fileformat=VCFv4.0
##fileDate=20151218
##reference=/hgsc_software/cancer-analysis/resources/references/human/hg19/hg19.fa
##INFO=<ID=P,Number=1,Type=Float,Description="Indel p value">
##INFO=<ID=ReqIncl,Number=0,Type=Flag,Description="Loci is in the list of sites required to be included">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RR,Number=1,Type=Integer,Description="Reference Read Depth">
##FORMAT=<ID=VR,Number=1,Type=Integer,Description="Major Variant Read Depth">
##FILTER=<ID=NonVar,Description="No variant at this site">
##FILTER=<ID=NoData,Description="No sequencing data at this site">
##FILTER=<ID=low_qual,Description="indel posterior probability is less than 0.0">
##FILTER=<ID=low_VariantReads,Description="Number of variant reads is less than 2">
##FILTER=<ID=low_VariantRatio,Description="Variant read ratio is less than 0.06">
##FILTER=<ID=single_strand,Description="All variant reads are in the same strand direction">
##FILTER=<ID=low_coverage,Description="Total coverage is less than 5">
##FILTER=<ID=read_end_ratio,Description="Ratio of variant reads within 5bp of read end is greater than 0.8">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TCRBOA1-T-WEX
1       17961   .       TG      T       0       PASS    P=0.0011        GT:VR:RR:DP:GQ  1/0:6:29:35:.
1       745370  .       TA      T       5       PASS    P=0.6892        GT:VR:RR:DP:GQ  1/0:44:221:265:.
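
As a reading aid for the records above, the per-sample column can be paired with the FORMAT keys declared in the header (GT, VR, RR, DP, GQ). The sketch below is an assumption-laden illustration: it splits on any whitespace because the pasted snippet is space-aligned, and the helper names `sample_fields` and `variant_read_ratio` are hypothetical.

```python
def sample_fields(vcf_line: str) -> dict:
    """Pair the FORMAT keys (column 9) with the sample values (column 10)."""
    cols = vcf_line.split()  # the pasted snippet uses runs of spaces, not tabs
    keys = cols[8].split(":")
    values = cols[9].split(":")
    return dict(zip(keys, values))


def variant_read_ratio(vcf_line: str) -> float:
    """Fraction of reads supporting the variant: VR / DP."""
    sample = sample_fields(vcf_line)
    return int(sample["VR"]) / int(sample["DP"])


# First record from the snippet above.
record = "1       17961   .       TG      T       0       PASS    P=0.0011        GT:VR:RR:DP:GQ  1/0:6:29:35:."

ratio = variant_read_ratio(record)
print(round(ratio, 3))  # 6 / 35, about 0.171
```

For this record the variant is supported by 6 of 35 reads, which is above the 0.06 threshold named in the low_VariantRatio filter description.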

The only processing I did on the data was to select the mutations that passed the criteria set by the authors.

cat atlas-snp.vcf | grep PASS > atlas-snp_mod.vcf

For now I am only focusing on SNPs.
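
One caveat about the `grep PASS` command above: it matches the string anywhere on a line, and it drops the header lines, since none of the header lines shown contain "PASS". A stricter variant of the same filter, combined with the SNP-only restriction, might look like the sketch below. It assumes tab-separated records, and `filter_pass_snvs` is a hypothetical helper name, not something from the dataset's pipeline.

```python
from typing import Iterable, Iterator

BASES = set("ACGT")


def filter_pass_snvs(lines: Iterable[str]) -> Iterator[str]:
    """Keep header lines, plus records that are PASS and single-base substitutions."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the header so the output is still a valid VCF
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed or blank line
        ref, alt, flt = cols[3], cols[4], cols[6]
        is_snv = ref in BASES and all(a in BASES for a in alt.split(","))
        if flt == "PASS" and is_snv:  # exact match on the FILTER column
            yield line


# Toy input: the second record mirrors the indel shown in the snippet above.
lines = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "1\t17961\t.\tTG\tT\t0\tPASS\tP=0.0011\n",
    "1\t300\t.\tC\tT\t50\tlow_coverage\t.\n",
]

kept = list(filter_pass_snvs(lines))
print("".join(kept), end="")
```

Here the indel record (TG to T) is dropped even though it is PASS, because REF is longer than one base.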

modified 10 days ago by finswimmer (10k) • written 10 days ago by prashant10991 (0)

I did not look at the paper. Based on the keyword "matched", I assumed that you had matched-normal VCF files, i.e. files produced with a somatic caller, with cancer vs. normal in one file. That is why I asked for the command you used to extract unique mutations, but apparently you do not have this kind of file. Following igor's recommendation, you should not (and probably cannot) properly identify cancer (somatic) variants by intersecting individual files. Download the raw data, e.g. using https://www.biostars.org/p/325010/, and then use a somatic variant caller such as Strelka2, GATK, etc. (please use the search function on this) to distinguish tumor and germline variants.
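
To illustrate why intersecting separately called files is unreliable, here is a toy sketch of the extra check a tumor-only call would need: the read evidence at the same position in the normal sample. This is not how Strelka2 or GATK work internally. The function name, the labels, and the depth threshold are all hypothetical, and the normal read counts would have to come from the raw alignments, which the filtered per-sample VCFs do not provide.

```python
def classify_tumor_only_call(normal_depth: int,
                             normal_variant_reads: int,
                             min_normal_depth: int = 10) -> str:
    """Toy classification of a variant called in the tumor but not in the normal.

    min_normal_depth is an arbitrary illustrative threshold.
    """
    if normal_depth < min_normal_depth:
        # Too few reads in the normal to say the variant is absent there.
        return "undetermined: normal under-covered"
    if normal_variant_reads > 0:
        # Evidence exists in the normal but was not called or was filtered out.
        return "possibly germline"
    return "candidate somatic"


print(classify_tumor_only_call(normal_depth=3, normal_variant_reads=0))
print(classify_tumor_only_call(normal_depth=40, normal_variant_reads=5))
print(classify_tumor_only_call(normal_depth=40, normal_variant_reads=0))
```

Only the last case supports a somatic interpretation. The first two show how a site can look "unique to cancer" in a file intersection without being somatic.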

modified 9 days ago • written 9 days ago by ATpoint (13k)