I am trying to do an analysis that requires sequence data, and not SNPs directly (Extended Bayesian Skyline Plot). The input required for the program are multiple fasta files, each with an alignment (for multiple individuals) for a particular 'region'. However, each locus/ region represented in a fasta file should have at least 3, or more SNPs to be informative.
The data I have is ddRAD seq data. SNPs were called after reference aligning the reads to the genome (scaffold level) using samtools and freebayes.
Is there any way to identify regions/reads that have multiple SNPs on them, so that they can be selected for all individuals and a fasta file (of 80 bp) can be written for each such region?
Ideally, is there a way that one could get information on how many SNPs each of the reads (the 80 bp reads that have been mapped to the genome), then classify them as reads with 0 or 1 or 2 or 3 or 4 etc SNPs, and finally select the subset of reads/ regions that have say 4 SNPs, and get consensus reads (2 alleles per individual per 'locus')?
If anyone has any suggestions, I'd be very grateful!