Skip to the last paragraph to go straight to the question. I am interested in calling variants from an RNA-seq sample of GM12878 and comparing them to the variants reported by GIAB. The basic outline of my pipeline is the following:
1) Trim reads with ngShoRT.
2) Align reads with STAR to the reference genome. Keep only uniquely mapped reads.
3) Call SNPs with samtools/bcftools.
Comparing to GIAB, ~60% of the variants I discovered match GIAB calls (is that a reasonable fraction with just 1 sample?).
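For concreteness, here is a minimal sketch of how a match fraction like this could be computed. It assumes a "match" means the same chromosome, position, reference allele, and alternate allele, and that both call sets use the same chromosome naming and the same reference build. The function names and the toy records are made up for illustration; they are not my actual data or any tool's API.

```python
def load_snps(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys for SNP records in VCF-formatted lines."""
    snps = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        for allele in alt.split(","):
            # keep single-base substitutions only; indels and "." are skipped
            if len(ref) == 1 and len(allele) == 1 and allele in "ACGT":
                snps.add((chrom, pos, ref, allele))
    return snps


def fraction_matching(called, truth):
    """Fraction of called SNPs that also appear in the truth set."""
    if not called:
        return 0.0
    return len(called & truth) / len(called)


# Toy records (hypothetical, for illustration only).
rna_calls = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1\t2000\t.\tC\tT\t40\tPASS\t.",
    "chr2\t1500\t.\tG\tA\t60\tPASS\t.",
]
giab_calls = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1\t2500\t.\tAT\tA\t50\tPASS\t.",  # indel, ignored by load_snps
    "chr2\t1500\t.\tG\tA\t50\tPASS\t.",
    "chr2\t3000\t.\tT\tC\t50\tPASS\t.",
]

called = load_snps(rna_calls)
truth = load_snps(giab_calls)
print(f"{fraction_matching(called, truth):.1%}")  # prints 66.7%
```

Note that this treats every called SNP absent from GIAB as a mismatch, including calls that fall outside the regions GIAB considers high-confidence, so the number depends on whether the calls are first restricted to those regions.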
To understand why some of the variants found by samtools/bcftools don't match GIAB, I would like to focus only on SNPs within highly expressed genes. To calculate gene expression, I want to use TPM values from Salmon. I've learned that popular quantification tools (Salmon, RSEM, etc.) are based on alignments to the transcriptome. However, only 82% of the reads aligned by STAR overlap an exon. A related question has previously been addressed. So why should I trust a quantification tool that is based on the transcriptome when 18% of the reads don't align to an exon?