Hi all,
I am new to bioinformatics analysis. I have a sorted BAM file which is WES. In order to get chromosome 1 from the given BAM file in fasta format, I use the following set of commands
samtools view -b file_name.bam chr1>chr1.bam
samtools mpileup -E -uf ~/hg38.fa chr1.bam > chr1.mpileup
bcftools call -mv -Oz chr1.mpileup>chr1.vcf.gz
tabix -p vcf chr1.vcf.gz
samtools faidx ~/hg38.fa chr1|bcftools consensus chr1.vcf.gz -o chr1.fa
In the obtained chr1.fa file, it seems that the non-exome parts of the sequence are also there which doesn't make sense. Further, given the human genome is diploid, shouldn't I be getting 2 fasta files??
Thanks. I see. So if the bam file is WES, the vcf file obtained will only have variants from the exome regions, however for the consensus command, we need to put a filter to remove non-exome regions. Am I correct?
You never get 100% reads on target, you will always have some reads that align to non-exome loci, and your software will try to call variants there. You can filter your the bam to only have reads that cross exons, of you could filter your vcf to only have exomic variants. I don't know if you can filter in the consensus command.