My situation is as follows:
I have two groups of reads/Individuals that differ in terms of indels (one group has the indels, the other doesn't). I already Mapped them and generated bam files. So, now I am struggling to get any form of usable VCF for my purposes. I am a little bit used to FST analysis of PoolSeq reads and the comparison of pupulations in regard to SNPs and indels. In that case I usually went with straight forward mpileup and sync and then used R for the FST analysis and IGV for visualisation of the differences. But here I have several hundred individual's to look at.
What would you suggest? So far I came up with something like this:
#!/bin/bash (...) module load BCFtools/1.3.1 module load SAMtools/1.3.1 # List all BAM files in current directory bam_files=$(ls *.bam | tr '\n' ' ') # Create a space-separated string of bam files BAM_FILES=$(echo $bam_files | tr ' ' '\n' | paste -s -d ' ') # Run mpileup on all bam files and pipe the output to BCFtools samtools mpileup -g -B $BAM_FILES | bcftools view -Ou - > all_samples.bcf
Now this resulted in a bcf file which 1. seems to be suspiciously small and 2. I don't really know how to use it. I know I can assess the quality of the variant calls with a tool like bcftools stats to generate summary statistics for the BCF file, such as the number of variants, transition/transversion ratios, and other quality metrics. But is this useful here? In the end I just need a straight forward way to asses the indels of the one group of individuals of my samples. Is pooling here an option too ? Or am I maybe completely wrong ? Thanks