Hi!
I had a look at a script written by my predecessor (a PhD student) that runs ASEReadCounter with WASP for unbiased mapping, and I don't think it makes sense.
Below is the workflow followed in his script:
More specifically, in the step below, the script uses variants called from the data itself,
using this flag:
I thought the SNPs should come from an external source such as 1000 Genomes, not from variants called on the same data?
Am I missing something here?
Help pls:)
/Jonas
Thank you so much for your answer, i.sudbery! Yes, I have read it too and I don't think it's totally clear either, so I guess I'm not completely clueless after all :) No, he was not using DNA-seq, only RNA-seq. This project has now landed in my lap :)
But if I use the variants called from my own data, doesn't that mean the bias has already been introduced in the STAR alignReads step? So I'd be running WASP with an already biased reference? Does that sound reasonable? What do you think?
It's always unclear what to do for ASE/eQTL analysis when you don't have matching DNA-seq. In fact, ideally you want matched, phased haplotypes!
I think using, say, 1000G SNPs is likely to be conservative, and therefore safe. I'm not sure if you should provide them as phased haplotypes or just as SNPs - if your samples match the known haplotypes, then my feeling is that this will be advantageous. However, it might cause issues where your samples don't match the common haplotypes.
My worry with this approach is that you will end up discarding reads unnecessarily: WASP takes reads that overlap a "known" SNP, generates all possible haplotypes other than the one seen in the read, and tests whether they map elsewhere. If they do, the read is discarded. With an external panel you might end up doing this for lots of SNPs that your sample doesn't actually carry, which could lead to discarding too many reads.
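To make the "all possible haplotypes other than the one seen in the read" step concrete, here is a minimal sketch in Python. It is not WASP's actual code: the function name `alternative_reads`, the SNP dictionary layout, and the assumption of an ungapped alignment are all mine, purely for illustration. It shows that every extra "known" SNP under a read adds sequences that have to be remapped, whether or not the sample carries that SNP.

```python
from itertools import product


def alternative_reads(read_seq, read_start, snps):
    """Return every allele combination of read_seq at known SNPs, except the observed one.

    Hypothetical helper, not part of WASP.
    snps: dict mapping 0-based reference position -> (ref_allele, alt_allele).
    read_start: 0-based reference position of the first read base
                (an ungapped alignment is assumed).
    """
    # Offsets within the read that overlap a known SNP.
    overlapping = sorted(
        (pos - read_start, alleles)
        for pos, alleles in snps.items()
        if 0 <= pos - read_start < len(read_seq)
    )
    variants = set()
    # One candidate sequence per combination of alleles across the overlapped SNPs.
    for combo in product(*(alleles for _, alleles in overlapping)):
        bases = list(read_seq)
        for (offset, _), allele in zip(overlapping, combo):
            bases[offset] = allele
        variants.add("".join(bases))
    # The sequence as observed does not need to be remapped.
    variants.discard(read_seq)
    return sorted(variants)


# Made-up example: two panel SNPs fall inside the read, one lies far away.
panel = {101: ("C", "T"), 105: ("C", "G"), 200: ("A", "G")}
for seq in alternative_reads("ACGTACGT", 100, panel):
    print(seq)
```

In a real pipeline each of these sequences would be realigned, and the original read kept only if all of them map back to the same place. With n biallelic SNPs under a read, that is up to 2^n - 1 extra alignments per read.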
On the other hand, if you use RNA-seq variant calling, you will be limited to only those variants that are actually in your sample. You will also find variants that are in your sample but not in, say, 1000G. The general worry with reference bias is that you under-call variants, because reads carrying variants are less likely to map to the reference. You might under-quantify variants, which might lead to false ASE (this is what WASP corrects for), but you are unlikely to call false positive variants because of reference bias.
However, you may call false positives due to other problems with calling variants from RNA-seq (such as RNA editing or base modification).
One solution might be to take the intersection of something like 1000G and the RNA-seq variants, and therefore use SNPs you are pretty confident are real, but only the ones you have some evidence are present in your sample. However, this has the disadvantage of being the least powerful/most conservative of all the options.
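The intersection idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name `vcf_snp_keys` and the example records are made up, and it assumes both VCFs use the same reference build and the same chromosome naming ("1" vs "chr1" would silently give an empty intersection). For real files you would more likely use a dedicated VCF tool.

```python
def vcf_snp_keys(lines):
    """Collect (chrom, pos, ref, alt) keys for biallelic SNPs from VCF text lines.

    Hypothetical helper for illustration; it reads only the first five VCF columns.
    """
    keys = set()
    for line in lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Keep single-base substitutions only (no indels, no multi-allelic sites).
        if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
            keys.add((chrom, int(pos), ref, alt))
    return keys


# Made-up records standing in for the two call sets.
rnaseq_calls = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t.\tA\tG",   # also in the panel
    "1\t2000\t.\tC\tT",   # sample-specific (or an RNA editing site)
    "1\t3000\t.\tG\tGA",  # indel, ignored
]
panel_calls = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\trs1\tA\tG",
    "1\t4000\trs2\tT\tC",  # in the panel, not seen in the sample
]

shared = vcf_snp_keys(rnaseq_calls) & vcf_snp_keys(panel_calls)
print(sorted(shared))
```

Matching on position plus both alleles, rather than on position alone, avoids keeping a site where the RNA-seq call and the panel disagree about the alternative allele.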