From the ANNOVAR documentation, we know that SIFT predicts whether an amino acid substitution affects protein function [D: deleterious (SIFT <= 0.05); T: tolerated (SIFT > 0.05)].
Now I have a lot of variants, and I want to use SIFT to select candidate variants that may cause cancer. Assuming we use ANNOVAR to get the SIFT scores, its documentation says we get two output files:
"_filtered" file: variants are written to this file if they cannot be matched in the database
"_dropped" file: variants are written to this file if they can be matched in the database
[kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped
1000g2012apr_eur 0.04 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2012apr_eur 0.81 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
I'm not sure which of the following is correct:
M1: Select only the variants with SIFT score <= 0.05 from the "_dropped" file and ignore the "_filtered" file.
M2: Select the variants with SIFT score <= 0.05 from the "_dropped" file, combine them with the "_filtered" file, and then use other methods for further filtering.
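
To make the difference between M1 and M2 concrete, here is a rough Python sketch of both. It assumes the SIFT "_dropped" file has the same layout as the example above (database name, then score, then the original variant line) and that the "_filtered" file holds the original variant lines unchanged. The function names and the "sift" database label in the usage lines are made up for illustration and are not part of ANNOVAR.

```python
SIFT_CUTOFF = 0.05  # D: deleterious if score <= 0.05, as stated at the top of the post


def parse_dropped(lines, cutoff=SIFT_CUTOFF):
    """Return the variant part of each _dropped line whose score is <= cutoff.

    Assumed layout per line: <database name> <score> <original variant line>.
    """
    kept = []
    for line in lines:
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue  # blank or malformed line
        _db_name, score_text, variant = parts
        try:
            score = float(score_text)
        except ValueError:
            continue  # second column is not a number, skip it
        if score <= cutoff:
            kept.append(variant)
    return kept


def select_m1(dropped_lines, cutoff=SIFT_CUTOFF):
    """M1: keep only scored variants predicted deleterious."""
    return parse_dropped(dropped_lines, cutoff)


def select_m2(dropped_lines, filtered_lines, cutoff=SIFT_CUTOFF):
    """M2: deleterious scored variants plus every variant that has no score."""
    unscored = [line.strip() for line in filtered_lines if line.strip()]
    return parse_dropped(dropped_lines, cutoff) + unscored


if __name__ == "__main__":
    # Hypothetical input lines; real ones would be read from the two output files.
    dropped = [
        "sift 0.04 1 1404001 1404001 G T",
        "sift 0.87 1 162736463 162736463 C T",
    ]
    filtered = ["1 5935162 5935162 A T"]
    print(select_m1(dropped))
    print(select_m2(dropped, filtered))
```

With real data, the two lists would come from something like open(path).readlines() on each output file. The only difference between the two functions is whether variants without a score are carried forward to the later filtering steps.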