Hello,
I'm working on variant calling to compare two strains and identify variants unique to strain 2. I performed RNA-seq on three replicates from each strain. I aligned all the samples to a reference genome using STAR, then used bedtools intersect
to extract the genes positions I was interested in.
With the 6 new BAM files (2 strains X 3 replicates), I performed the following steps:
samtools sort
,samtools index
,samtools rmdup
,samtools index
bcftools mpileup
,bcftools call
on the three replicates for each strain, which resulted in two VCF files (one for each strain).- I applied
tabix
to both VCF files and usedbcftools view
. - Finally, I filtered for variants that appear in the VCF for strain 2 but not in the VCF for strain 1, and applied filters for depth (DP) and quality (QUAL).
The issue I'm facing is that I’m seeing variants in strain 2 that don’t appear in strain 1. However, upon examining the data in IGV, it seems that these variants unique to strain 2 may be due to a lack of coverage (or very low coverage) in strain 1, not an actual difference between the two strains. In other words, I’m not sure if these variants in strain 2 are truly absent in strain 1, or if they just aren’t detectable in strain 1 due to poor or no coverage.
How can I determine which variants are truly unique to strain 2 and not just due to a lack of coverage in strain 1?
I’d greatly appreciate any help or suggestions.
Thank you!
You could quality filter your VCF to remove variants with low confidence (coverage, missing data etc) (https://speciationgenomics.github.io/filtering_vcfs/ ) .
Thanks for your response! I don’t quite understand this situation:
Position x strain 1: 25 reads with T instead of C
Position x strain 2: 0 reads
In this case, the VCF for strain 1 will show a high-quality SNV at that position, while strain 2 will show nothing. When comparing the two VCF files, it may seem like there's a mutation in strain 1, but without reads from strain 2, I can’t confirm if the C to T change is unique to strain 1.
Thank you!
You should be easily able to discern that there is no coverage in strain 2 at position x. You can use a program like
pandepth
ormosdepth
to get per base coverage.