Question

Too many distinct mutations in matched normal and cancer cells

5.2 years ago

prashant10991 • 0

Hi,

I am working on the exome sequencing data shared by https://www.nature.com/articles/sdata201610. To summarize the dataset, the authors sequenced exomes of cancer tissue and of blood cells (as the matched normal) from 7 different patients. I have acquired VCF files for 3 patients with the same type of cancer. The shared VCF files are already filtered. I tried to find the mutations unique to the cancer cells and to the matched normal cells by taking different types of joins on the VCF files.

I found that there are many mutations unique to the cancer cells and many unique to the matched normal cells. I was expecting the matched normal cells to have very few unique mutations. Can you help me understand this behavior? Exact stats are shared below:

For patient 1/2/3:

Total common mutations (in both cancer tissue and matched normal blood cell): 75961/88110/82211

Total unique cancerous mutations (only in tissue): 15909/17694/17464

Total unique matched mutations (only in matched normal): 14825/13826/21555

These were the steps followed to compute the common, unique cancerous and unique matched mutations, using the SNP-only VCF files.

Only those SNP mutations that satisfied the PASS criterion in the FILTER field were kept. Both the matched normal and the cancer mutations were filtered this way.
Mutations present in both the matched normal and the cancer files are referred to as common mutations above.
Mutations exclusive to either the matched normal or the cancer file are referred to as unique matched and unique cancerous mutations, respectively.
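For readers who want to reproduce this kind of comparison, here is a minimal sketch of the overlap described in the steps above. It assumes tab-delimited VCF lines and treats two records as the same mutation when CHROM, POS, REF and ALT all match. The function names and the toy records are made up for illustration and are not the poster's actual code or data.

```python
from typing import Dict, Iterable, Set, Tuple

# A variant is identified by (CHROM, POS, REF, ALT); this key choice is an assumption.
Variant = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (CHROM, POS, REF, ALT) keys from VCF data lines, skipping headers."""
    keys: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header line
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def compare(tumor_lines: Iterable[str], normal_lines: Iterable[str]) -> Dict[str, Set[Variant]]:
    """Split variants into common, tumor-only and normal-only sets."""
    tumor = variant_keys(tumor_lines)
    normal = variant_keys(normal_lines)
    return {
        "common": tumor & normal,
        "tumor_only": tumor - normal,
        "normal_only": normal - tumor,
    }


# Toy records (hypothetical, not taken from the dataset).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER"
tumor_vcf = [header, "1\t100\t.\tA\tG\t50\tPASS", "1\t200\t.\tC\tT\t50\tPASS"]
normal_vcf = [header, "1\t100\t.\tA\tG\t50\tPASS", "2\t300\t.\tG\tA\t50\tPASS"]

result = compare(tumor_vcf, normal_vcf)
print({name: len(variants) for name, variants in result.items()})
```

Note that this only compares which records survived filtering in each file. It says nothing about whether a site absent from one file was actually covered there.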

vcf cancer SNP exome • 1.2k views

ADD COMMENT • link 5.2 years ago by prashant10991 • 0

You should use a somatic variant caller rather than trying to overlap variants between different VCFs. Check this similar earlier discussion: Why do people not call normal and tumor variant separately for somatic mutation identification?

ADD REPLY • link 5.2 years ago by igor 13k

You should at least give links to the file sources and provide the command lines that were used for the filtering. Otherwise it is almost impossible to reproduce what you did. See: Brief Reminder On How To Ask A Good Question

ADD REPLY • link 5.2 years ago by ATpoint 82k

The data is not publicly available. It can be downloaded on request from http://txcrb.org/data.html. The same link is also given in the article I mentioned above, along with all the preprocessing steps. I am using the filtered VCF files shared by the authors.

However, I see no issue in sharing a snippet containing the header and a few lines from the file.

##fileformat=VCFv4.0
##fileDate=20151218
##reference=/hgsc_software/cancer-analysis/resources/references/human/hg19/hg19.fa
##INFO=<ID=P,Number=1,Type=Float,Description="Indel p value">
##INFO=<ID=ReqIncl,Number=0,Type=Flag,Description="Loci is in the list of sites required to be included">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RR,Number=1,Type=Integer,Description="Reference Read Depth">
##FORMAT=<ID=VR,Number=1,Type=Integer,Description="Major Variant Read Depth">
##FILTER=<ID=NonVar,Description="No variant at this site">
##FILTER=<ID=NoData,Description="No sequencing data at this site">
##FILTER=<ID=low_qual,Description="indel posterior probability is less than 0.0">
##FILTER=<ID=low_VariantReads,Description="Number of variant reads is less than 2">
##FILTER=<ID=low_VariantRatio,Description="Variant read ratio is less than 0.06">
##FILTER=<ID=single_strand,Description="All variant reads are in the same strand direction">
##FILTER=<ID=low_coverage,Description="Total coverage is less than 5">
##FILTER=<ID=read_end_ratio,Description="Ratio of variant reads within 5bp of read end is greater than 0.8">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TCRBOA1-T-WEX
1       17961   .       TG      T       0       PASS    P=0.0011        GT:VR:RR:DP:GQ  1/0:6:29:35:.
1       745370  .       TA      T       5       PASS    P=0.6892        GT:VR:RR:DP:GQ  1/0:44:221:265:.

The only processing I did on the data was to select the mutations that passed the criteria set by the authors.

cat atlas-snp.vcf | grep PASS > atlas-snp_mod.vcf
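A side note on the command above: none of the header lines in the snippet shown contains the string PASS, so the grep drops the whole header, and it would also keep any record that has PASS anywhere in the line rather than only in the FILTER column. A minimal sketch of a stricter filter follows, assuming tab-delimited VCF input. The function name and the toy records are hypothetical.

```python
from typing import Iterable, Iterator


def keep_pass_records(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged, plus data lines whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the #CHROM column header
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th column (index 6) of a VCF data line.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Toy input (hypothetical records, not from the dataset).
toy = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t50\tPASS\tP=0.5\n",
    "1\t200\t.\tC\tT\t50\tlow_coverage\tNOTE=BYPASS\n",  # contains "PASS" but did not pass
]
filtered = list(keep_pass_records(toy))
print("".join(filtered), end="")

# Possible file-based use (paths as in the command above):
# with open("atlas-snp.vcf") as src, open("atlas-snp_mod.vcf", "w") as dst:
#     dst.writelines(keep_pass_records(src))
```

A plain substring match would have kept the second data line as well, because its INFO value happens to contain "PASS".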

For now I am only focusing on SNPs.

ADD REPLY • link updated 5.2 years ago by finswimmer 16k • written 5.2 years ago by prashant10991 • 0

I did not look at the paper. Based on the keyword "matched" I was assuming that you had matched-normal VCF files, i.e. files produced with a somatic caller, with cancer vs. normal in one file. That is why I was asking for the command you used to extract unique mutations, but apparently you do not have this kind of file. Following igor's recommendation, you should not (and probably cannot) do proper identification of cancer (somatic) variants by intersecting individual files. Download the raw data, e.g. using https://www.biostars.org/p/325010/, and then use a somatic variant caller such as Strelka2, GATK etc. (please use the search function on this) to distinguish tumor from germline variants.
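One plausible contributor to the large "unique" counts is that each file was filtered on its own, so the same site can pass in one sample and fail in the other purely because of coverage. The sketch below illustrates this with three of the thresholds listed in the FILTER header lines posted earlier in the thread (low_coverage, low_VariantReads, low_VariantRatio). The function name, the order of the checks and the read counts are assumptions for illustration, not the behavior of the caller the authors used.

```python
def per_sample_filter(variant_reads: int, total_depth: int) -> str:
    """Apply three per-sample thresholds taken from the VCF header shown in the thread."""
    if total_depth < 5:
        return "low_coverage"        # "Total coverage is less than 5"
    if variant_reads < 2:
        return "low_VariantReads"    # "Number of variant reads is less than 2"
    if variant_reads / total_depth < 0.06:
        return "low_VariantRatio"    # "Variant read ratio is less than 0.06"
    return "PASS"


# Hypothetical germline heterozygous site, well covered in tumor but not in normal.
tumor_status = per_sample_filter(variant_reads=19, total_depth=40)
normal_status = per_sample_filter(variant_reads=2, total_depth=4)
print(tumor_status, normal_status)
```

After keeping only PASS records and intersecting the two files, this site would be counted as unique to the cancer sample even though half of the normal reads carry the variant. Swapping the coverage values produces an apparent "normal-only" mutation in the same way. A somatic caller avoids this by evaluating tumor and normal reads at each site jointly.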

ADD REPLY • link 5.2 years ago by ATpoint 82k