I used DiscoSnp (specifically discoSnpRAD) with a reference genome and I have two questions:
Out of 700k SNPs, only 30% are in genomic regions (there is a scaffold number in the .vcf file). The remaining 70% of SNPs have SNP_higher_path or SNP_lower_path in the chromosome field. Does this mean that these SNPs did not map to the genome but were called de novo? Can I still retain them?
Do SNP_higher_path and SNP_lower_path represent two alleles of the same variant? If so, I need to remove one of them, correct?
Thanks for the clarifications. I've been trying out some of the filtering options that you mentioned. But does it matter which of the two alleles I retain, the higher or the lower path? Is there any parameter to consider when choosing one over the other?
Also, I see that mapped SNPs (with scaffold number) don't have higher and lower path flags. Does this mean one allele was already removed when the results were output?
Higher or lower are meaningless. It's a (deterministic) random choice.
Once a variant is mapped, one knows which of the two nucleotides is the reference: that one becomes REF and the other becomes ALT.
One special case, however: when the predicted variant is, say, A/T and the reference genome contains, say, G at the mapped position, the VCF still contains REF and ALT alleles (although the assignment is then meaningless), but the 'Genome' field contains the nucleotide 'G'.
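
If it helps, here is a rough Python sketch of how one might sort VCF records into the three situations discussed above: unmapped, mapped, and mapped where the genome nucleotide matches neither allele. It rests on assumptions not confirmed in this thread: that unmapped records have a CHROM value starting with SNP_higher_path or SNP_lower_path, and that the 'Genome' field appears in the INFO column as `Genome=<nucleotide>`, with `.` when absent. The function names and category labels are made up for illustration, so check them against your own file before relying on this.

```python
# Assumption: unmapped records carry one of these prefixes in the CHROM column.
UNMAPPED_PREFIXES = ("SNP_higher_path", "SNP_lower_path")


def parse_info(info):
    """Turn a VCF INFO string such as 'Ty=SNP;Genome=G' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def classify_record(line):
    """Classify one tab-separated VCF data line (hypothetical labels)."""
    cols = line.rstrip("\n").split("\t")
    chrom, ref, alt, info = cols[0], cols[3], cols[4], cols[7]

    if chrom.startswith(UNMAPPED_PREFIXES):
        return "unmapped"

    # Assumption: the 'Genome' field is an INFO key, '.' when not available.
    genome = parse_info(info).get("Genome", ".")
    alleles = [ref.upper()] + alt.upper().split(",")
    if genome not in (".", "") and genome.upper() not in alleles:
        # REF/ALT are arbitrary here: the genome has a third nucleotide.
        return "mapped_genome_differs"
    return "mapped"


def classify_vcf(path):
    """Count records per category, skipping header lines."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            label = classify_record(line)
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `classify_vcf("my_results.vcf")` (file name hypothetical) would return a dict of counts per category, which gives a quick check of the mapped/unmapped split before deciding on filters.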