Dear all, I have a more conceptual question. I used the Trinity superTranscript pipeline to call SNPs between 2 individuals reared under 2 conditions (4 samples / 4 libraries / 4 VCF files; each sample was always compared to the reference). The reference was built with the superTranscript method (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543425/), which is applied to non-model organisms: although superTranscripts do not represent any true biological molecule, they provide a practical replacement for a reference genome. So, in my case the reference was built from the combined de novo assembly of my own data.
In the above paper it is stated:
"Only heterozygous SNPs, which we defined as those with at least one read supporting the reference allele, were analysed. Reported homozygous SNPs were removed because they are likely to be false positives of the assembly or alignment. True homozygous SNPs should be assembled into the reference and are therefore not detectable. Note that this is a general limitation of using the same sample to create the reference and call variants and is not unique to the superTranscript method. However, homozygous variants could be detected for non-model organisms if multiple samples were available or if superTranscripts were constructed and called with respect to a control."
I suppose that heterozygous SNPs are those with GT 0/1 in the VCF files, while homozygous SNPs are those with GT 1/1 (0/0 would be homozygous for the reference allele, i.e. not a variant call). Excluding the homozygous SNPs, which are actually the ones I am more interested in because they explain the variation between my two individuals, reduces the number of SNPs in each file to 1/10. And of course far fewer of those fall on genes and positions shared among my 4 VCF files. In the end I obtained only 10 SNPs that are shared between the 2 individuals. Also, if I understood correctly, the SNPs I am left with are at loci where one allele matches the reference and the other allele carries the alternative variant, which is different in individuals 1 and 2. This sounds to me more like allele-specific expression, which is interesting but not what I am looking for. Keep in mind that I do not have a reference genome, only a reference transcriptome that does not come from my data (completely different treatments though).
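To make the filtering concrete, here is a minimal Python sketch of how I understand the genotype classes and the intersection across files. It is not part of the Trinity pipeline: the function names (`classify_gt`, `parse_vcf_lines`, `sites_by_class`) and the two toy records lists are hypothetical, and it assumes single-sample VCFs with a `GT` key in the FORMAT column.

```python
from collections import defaultdict


def classify_gt(gt):
    """Classify a VCF GT string as 'hom_ref', 'het', 'hom_alt' or 'missing'."""
    alleles = gt.replace("|", "/").split("/")  # treat phased and unphased alike
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"  # e.g. 0/1 or 1/2
    return "hom_ref" if alleles[0] == "0" else "hom_alt"  # 0/0 vs 1/1


def parse_vcf_lines(lines):
    """Yield (chrom, pos, ref, alt, gt) from single-sample VCF lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        yield chrom, pos, ref, alt, sample_values[format_keys.index("GT")]


def sites_by_class(lines):
    """Group sites (chrom, pos, ref, alt) by genotype class."""
    grouped = defaultdict(set)
    for chrom, pos, ref, alt, gt in parse_vcf_lines(lines):
        grouped[classify_gt(gt)].add((chrom, pos, ref, alt))
    return grouped


# Toy records (made up for illustration, not real data).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample"
ind1 = [
    header,
    "gene1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "gene1\t250\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:25",
    "gene2\t40\t.\tG\tA\t45\tPASS\t.\tGT:DP\t0/1:18",
]
ind2 = [
    header,
    "gene1\t100\t.\tA\tG\t55\tPASS\t.\tGT:DP\t0/1:22",
    "gene2\t40\t.\tG\tA\t70\tPASS\t.\tGT:DP\t1/1:40",
]

s1, s2 = sites_by_class(ind1), sites_by_class(ind2)

# Heterozygous sites present in both individuals.
shared_het = s1["het"] & s2["het"]

# Homozygous-alternate calls seen only in individual 1. A site absent from
# the other VCF may be homozygous reference OR simply have no coverage.
hom_alt_only_ind1 = s1["hom_alt"] - s2["hom_alt"]

print(sorted(shared_het))
print(sorted(hom_alt_only_ind1))
```

The caveat in the last comment is the part I am unsure how to handle: a per-sample VCF usually lists only variant sites, so a missing record does not by itself tell me the individual is homozygous for the reference allele.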
Any suggestions on which pipeline might be suitable for this kind of data, or on how I could recover homozygous SNPs without a high number of false positives, would be really helpful.