Identifying true variants between two strains despite low coverage
0
0
Entering edit mode
8 weeks ago
bioinfo1994 ▴ 20

Hello,

I'm working on variant calling to compare two strains and identify variants unique to strain 2. I performed RNA-seq on three replicates from each strain. I aligned all the samples to a reference genome using STAR, then used bedtools intersect to extract the genes positions I was interested in.

With the 6 new BAM files (2 strains X 3 replicates), I performed the following steps:

  • samtools sort, samtools index, samtools rmdup, samtools index
  • bcftools mpileup, bcftools call on the three replicates for each strain, which resulted in two VCF files (one for each strain).
  • I applied tabix to both VCF files and used bcftools view.
  • Finally, I filtered for variants that appear in the VCF for strain 2 but not in the VCF for strain 1, and applied filters for depth (DP) and quality (QUAL).

The issue I'm facing is that I’m seeing variants in strain 2 that don’t appear in strain 1. However, upon examining the data in IGV, it seems that these variants unique to strain 2 may be due to a lack of coverage (or very low coverage) in strain 1, not an actual difference between the two strains. In other words, I’m not sure if these variants in strain 2 are truly absent in strain 1, or if they just aren’t detectable in strain 1 due to poor or no coverage.

How can I determine which variants are truly unique to strain 2 and not just due to a lack of coverage in strain 1?

I’d greatly appreciate any help or suggestions.
Thank you!

VCF bcftools mpileup variant-calling • 390 views
ADD COMMENT
0
Entering edit mode

You could quality filter your VCF to remove variants with low confidence (coverage, missing data etc) (https://speciationgenomics.github.io/filtering_vcfs/ ) .

ADD REPLY
0
Entering edit mode

Thanks for your response! I don’t quite understand this situation:

Position x strain 1: 25 reads with T instead of C

Position x strain 2: 0 reads

In this case, the VCF for strain 1 will show a high-quality SNV at that position, while strain 2 will show nothing. When comparing the two VCF files, it may seem like there's a mutation in strain 1, but without reads from strain 2, I can’t confirm if the C to T change is unique to strain 1.

Thank you!

ADD REPLY
0
Entering edit mode

You should be easily able to discern that there is no coverage in strain 2 at position x. You can use a program like pandepth or mosdepth to get per base coverage.

ADD REPLY

Login before adding your answer.

Traffic: 1933 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6