Through samtools, I merged all the .bam files that I wanted to be part of my reference genome.
samtools merge F1.bam *F1*_sorted.bam
I ordered and created the index of my bam file.
samtools sort -m 768M F1.bam -o F1_sorted.bam samtools index F1_sorted.bam
After that I created a fasta file from my bam.
samtools mpileup -uf genome.fa F1.bam | bcftools call -c | vcfutils.pl vcf2fq > F1.fastq seqtk seq -a F1.fastq > F1.fasta
And finally I extracted the desired sequences, which corresponds to 2 bases of each side of an detected SNP t in my study population. For that, I used SNPregionfile.txt as the coordinate map of the SNP list.
samtools faidx F1.fasta --region-file SNPregionfile.txt > F1_seq.txt cat F1_seq.txt |grep -v ">" > F1_seq_final.txt
Then I performed the same process with subgroups of the F1 population. My idea was to compare the references of the main group with the sub-groups, to see which alleles were fixed in each one of the subgroups, but variable in the entire population. The problem is that I realized that for the same allele, the reference genome considering all individuals was often homozygous and considering a subset of the population, it was heterozygous. Then I discovered that the heterozygous call algorithm is more complex than I imagined and it is clear that it is not enough to have some heterozygous individuals for that allele to be considered as such, but a considerable proportion of the counts.
So, my question is, how to create a reference genome fasta file using the pipeline presented here but considering any possibility of my data ending up as heterozygous in any of the individuals used to create the merged bam file.