I am starting to experiment with open-source variant callers such as MuTect, SomaticSniper, Strelka, and VarScan2. All of these algorithms require both a normal.bam and a tumor.bam. My question is: where can I get normal.bam? Do I need to include one normal (non-cancer) patient in my sequencing run? Or can I download a normal.bam from the Internet and use it in my variant calling?
Thank you for sharing your experiences with variant calling in somatic mode.
This is very basic, but I'll try to explain it as simply as I can.
There are two kinds of mutations: germline and somatic. Most cancer analyses focus on somatic mutations (you can search for why somatic mutations are given so much importance). When a tumor sample is collected for sequencing, a normal tissue sample (for example, blood) should also be collected from the same patient. Both samples are used in the downstream analysis. Most somatic variant callers take both the normal and tumor BAM files so that variants found in both are not treated as suspicious. Because such a variant is also present in the normal tissue, it is germline and not specific to the tumor.
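To make the idea concrete, here is a toy sketch of "remove whatever is also in the matched normal". It is only an illustration: real somatic callers model both samples statistically instead of doing a plain set difference, and the function name and coordinates below are made up for the example.

```python
# Toy illustration only, NOT how MuTect/Strelka/etc. work internally.
# A variant is keyed by (chromosome, position, ref allele, alt allele).
# All coordinates below are invented for the example.

def somatic_candidates(tumor_variants, normal_variants):
    """Return tumor variants that are absent from the matched normal."""
    return sorted(set(tumor_variants) - set(normal_variants))

tumor = [
    ("chr1", 1000, "A", "G"),   # also seen in the normal -> germline
    ("chr17", 2000, "C", "T"),  # tumor only -> somatic candidate
    ("chr7", 3000, "T", "G"),   # tumor only -> somatic candidate
]
normal = [("chr1", 1000, "A", "G")]

for variant in somatic_candidates(tumor, normal):
    print(variant)
```

The subtraction only makes sense when the normal comes from the same patient, because the germline variants being removed are specific to that person.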
Do I need to include one normal (non-cancer) patient in my sequencing run?
No, you should use a normal tissue (or blood) sample from the same patient.
Can I download a normal.bam from the Internet and use it in my variant calling?
No, you cannot use a random normal sample for your analysis, because each patient/person is different.
There are other ways to handle tumor data without a matched normal. You can search this forum for tips on such analyses.
For instance VarDict, MuTect, Platypus, Scalpel (InDel calling only), FreeBayes, and Pindel (Indel Only) can all be run in tumour only calling mode. In addition, some of these tools also allow for things like Panels of Normals, where you can put together a panel of local controls to use instead of a matched normal sample.
Thank you Dan - this is very useful! Have you compared algorithms that call variants from tumor tissue only with algorithms that use both tumor and normal tissue?
I haven't. I am doing tumour only calling in my clinical pipeline because we won't be sequencing matched normal tissues. I'd be interested in such a comparison running on some controlled and well-studied tumour data but I haven't done so myself. I'm also working with amplicon sequencing data and not whole exome or whole genome sequencing.
Thank you for the very nice explanation. So, if I understand correctly, I have two options. 1: Always prepare libraries from two samples per patient, one from tumor and one from blood, which halves my flowcell capacity (one patient, two tissues). 2: Use an algorithm that calls variants from tumor tissue only, as @Dan Gaston mentioned. Am I right? Thank you.
I would collect normal tissue as well when collecting the tumor tissue. Sometimes people do not collect normal tissue, and by the time they realize they need it, the patient may have died. This is a fairly typical situation (it happened to me). In these cases, one can follow different alternatives, some of which @Dan Gaston pointed out. But I strongly recommend collecting a matched normal if it is possible.
It's important to stress what sort of research you are doing. Or if this is clinical work and not research. In a clinical setting, no one doing routine sequencing for molecular diagnostics will be doing matched normal tissues because it doubles the cost of the diagnostic test. Generating something like a panel of normals might be as good as it gets. For research, again it depends on your setting. We are doing several studies with tumour tissue only because it is what we have to work with. And for doing tumour profiling it should be sufficient. If it is a research project and you can afford it, matched normal tissue (blood or surrounding normal tissue from the organ if that's all you have) is the best way to go.
Thank you so much Dan for sharing your experience. I am working in a diagnostic lab and absolutely agree about the doubled cost of collecting and sequencing normal + cancer tissue. So could you recommend some of the algorithms you mentioned for variant calling without normal tissue - VarDict, MuTect, Platypus, Scalpel (InDel calling only), FreeBayes, and Pindel (Indel Only)? Do you have a "winner"? Thank you so much!
I've found MuTect (I'm using the first MuTect and not the newer MuTect2) + Scalpel (MuTect only does SNVs and Scalpel only does indels, so they are complementary) to probably be the most reliable overall. But I have found some edge cases. For instance, very low-frequency variants (0.5-1% range) are not always picked up by MuTect if the depth of coverage is also low. FreeBayes will usually find these. However, I find that FreeBayes without a paired normal sample tends to call a lot of variants at this low frequency, many of which turned out to be likely false positives when I did an in-depth analysis. I see similar issues with Pindel. I currently use all of the variant callers and am finishing constructing various filters and decision trees on top of the results. For specific diagnostic/prognostic indicators, I'm more liberal in calling them. The odds of them being a false positive are, in my experience, lower than for just some random variant.
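As a rough illustration of what a filter layered on top of several callers could look like, here is a minimal sketch. It is not the actual filter set from this pipeline: the thresholds, class, function, and variant identifiers are all hypothetical placeholders chosen only to mirror the points above (low-frequency calls from a single caller are suspicious, known clinical markers are treated more liberally).

```python
from dataclasses import dataclass

# All thresholds and names below are hypothetical placeholders.
LOW_VAF = 0.01      # "very low frequency" cutoff (1%)
MIN_CALLERS = 2     # callers that must agree on a low-frequency call

@dataclass(frozen=True)
class Call:
    variant: str        # made-up identifier, e.g. "chrA:100:C>T"
    vaf: float          # variant allele fraction, 0..1
    callers: frozenset  # names of the callers that reported it

def triage(call, hotspots=frozenset()):
    """Return 'keep', 'review', or 'likely_false_positive' for one call."""
    if call.variant in hotspots:
        # Known diagnostic/prognostic markers are treated more liberally.
        return "keep"
    if call.vaf < LOW_VAF and len(call.callers) < MIN_CALLERS:
        # Low-frequency call seen by a single caller only.
        return "likely_false_positive"
    if call.vaf < LOW_VAF:
        # Low frequency but supported by several callers: look at it by hand.
        return "review"
    return "keep"

hotspots = frozenset({"chrB:200:G>A"})
calls = [
    Call("chrA:100:C>T", 0.006, frozenset({"freebayes"})),
    Call("chrB:200:G>A", 0.007, frozenset({"freebayes"})),
    Call("chrC:300:T>C", 0.008, frozenset({"freebayes", "vardict"})),
    Call("chrD:400:A>G", 0.250, frozenset({"mutect", "freebayes"})),
]

for c in calls:
    print(c.variant, triage(c, hotspots))
```

In practice the cutoffs would need to be tuned against controls for the specific assay and depth of coverage.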
You can find most of the software I am using in my GitHub repository. It isn't exactly documented, and the required config files I use live in a private repository, but the various "ddb" components form the core of what I am doing. It plugs into the Toil workflow management system. ddb-scripts has various analysis scripts (somatic amplicon is my general-purpose workflow for this stuff) and ddb-ngsflow has all of the wrappers for the actual tools. It may be useful. I'm working on tidying lots of this stuff up now for publication and a better release.
Feel free to contact me anytime. Always happy to help another diagnostic lab get set up.
Thank you so much, Dan, for sharing your experience. I will probably contact you after I have done some research of my own :-)
If you have a say in the experimental design, try to get 1 tumor + 1 normal per patient.
If you have only tumor available, call variants using only the tumor sample (check GATK, samtools/bcftools, FreeBayes, ...) but annotate your variants with a healthy population database like gnomAD.
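Once the variants carry a population allele frequency, a simple way to use that annotation is to set aside anything that is common in the population as probably germline. The sketch below assumes the annotation has already been loaded into a plain dictionary; the function name, the 1% cutoff, and the variant keys are assumptions for illustration and do not come from any particular tool.

```python
# Minimal sketch: split tumor-only calls by population allele frequency (AF).
# `population_af` stands in for frequencies looked up in a database such as
# gnomAD; the values and variant keys here are invented.

def split_by_population_af(variants, population_af, max_af=0.01):
    """Return (rare_or_absent, common) lists based on population AF."""
    rare, common = [], []
    for v in variants:
        af = population_af.get(v, 0.0)  # not in the database -> treated as AF 0
        if af > max_af:
            common.append(v)  # common in the population: probably germline
        else:
            rare.append(v)    # rare or unseen: keep as a somatic candidate
    return rare, common

tumor_only_calls = ["chrA:100:C>T", "chrB:200:G>A", "chrC:300:T>C"]
population_af = {"chrA:100:C>T": 0.23, "chrC:300:T>C": 0.0004}

rare, common = split_by_population_af(tumor_only_calls, population_af)
print("somatic candidates:", rare)
print("likely germline:", common)
```

Keep in mind that rare germline variants private to the patient will still pass this kind of filter, so it reduces but does not replace what a matched normal gives you.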