Hi,
I have two sets of exome samples, a and b, each containing 5-10 individuals.
1) I used the following commands for variant calling:
>samtools mpileup -ugf ref.fa a*_sorted.bam > a.bcf # pileup of 5 individuals (5 bam)
>bcftools call -vmO v a.bcf > a.vcf
>vcfutils.pl varFilter -Q 10 -d 10 -D 200 a.vcf > a_filtered.vcf
Similarly, b_filtered.vcf was also generated.
2) I have a list of 10 genes for which I want to find variants in these two datasets (a and b), and I used bcftools for annotation:
>bgzip genes_10sorted.bed
>tabix -p bed genes_10sorted.bed.gz
>bcftools annotate -a genes_10sorted.bed.gz -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') a_filtered.vcf.gz > a_filtered_ann10.vcf
3) I can now see the gene names in the filtered and annotated VCF file a_filtered_ann10.vcf, but I can't tell which sample is which, because the sample columns are labelled ERS561518, ERS561535, ERS561560, ERS561566, ERS561638.
How can I keep the file names as sample names when running mpileup and retain them through the later steps?
Any guidance in this regard would be appreciated.
Thanks!
I edited the title to make it more specific. You probably need to modify the read groups of your BAM files, since the sample names in the VCF are taken from them.
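To check which sample names the BAM files currently carry, something like the following could be used. It is only a sketch: `sm_from_header` is a hypothetical helper name, and the `a*_sorted.bam` pattern is taken from the commands in the question.

```shell
# Hypothetical helper: read SAM header text on stdin and print the
# unique SM (sample) values found on the @RG lines.
sm_from_header() {
    grep '^@RG' | tr '\t' '\n' | sed -n 's/^SM://p' | sort -u
}

# List each BAM file next to the sample name(s) stored in its header.
for bam in a*_sorted.bam; do
    [ -e "$bam" ] || continue   # skip if the glob matched nothing
    printf '%s\t%s\n' "$bam" "$(samtools view -H "$bam" | sm_from_header)"
done
```

If the SM values printed here are the ERS accessions, that would explain the column names in the VCF.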
Yes, the @RG line of my first sample has an SM tag set to ERS561518.
Should I edit it manually, or is there an automatic way?
Use Picard AddOrReplaceReadGroups.
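A minimal sketch of how that could look for all BAM files of set a, assuming the file names follow the `a*_sorted.bam` pattern from the question. The `picard.jar` location, the `_rg.bam` output suffix, the `sample_from_bam` helper, and the RGLB/RGPL/RGPU values are placeholders, not values taken from your data.

```shell
# Hypothetical helper: derive the sample name from the BAM file name,
# e.g. "a1_sorted.bam" -> "a1".
sample_from_bam() {
    basename "$1" _sorted.bam
}

# Assumed location of the Picard jar; adjust to your installation.
PICARD_JAR=${PICARD_JAR:-picard.jar}

for bam in a*_sorted.bam; do
    [ -e "$bam" ] || continue   # skip if the glob matched nothing
    sample=$(sample_from_bam "$bam")
    # Replace all read groups with a single one whose SM is the file-derived name.
    # RGLB, RGPL and RGPU are required by Picard; the values here are placeholders.
    java -jar "$PICARD_JAR" AddOrReplaceReadGroups \
        I="$bam" \
        O="${sample}_rg.bam" \
        RGID="$sample" \
        RGLB=lib1 \
        RGPL=ILLUMINA \
        RGPU=unit1 \
        RGSM="$sample"
    samtools index "${sample}_rg.bam"
done
```

Running mpileup on the resulting `*_rg.bam` files should then give sample columns named after the files. If the variants are already called, `bcftools reheader -s` with a file of new sample names may be a quicker way to rename the columns in the existing VCF, and `samtools mpileup` also has an `-R` (`--ignore-RG`) option that treats each BAM as one sample.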