Determine INDELs number (both classes separately) from reference and graph-based VCF files
0
0
Entering edit mode
6 months ago
Matteo Ungaro ▴ 100

Hi there,

this is more so of a hint/suggestion post than a real question since I could manage to find some related posts here on Biostars but appreciate a feedback on the procedure/results for the analysis.

In principle, I'm trying to compare the bwa-mem_GATK pipeline working on the linear reference against the HPRC draft pangenome graph (Giraffe-DV) for a number of samples which are not phased haplotypes in the graph space (three Papuans at first), using HG002 as baseline comparison.

That said both pipelines, bwa-mem_GATK and Giraffe-DV, generate viable VCFs although slightly different in structure. My first intent is to determine the set of INDELs for these four samples (HG002 + Papuan samples) starting from their VCF files, comparing both the reference-based approach to the graph-based one; ideally, the second should be more representative for genotyping samples.

Also, I'm trying to tease apart the two classes of small variants e.g. insertions from deletions in this analysis.

Currently, I experimented with two approaches for which I found suggestions on this forum, BEDOPS and bcftools. Basically, with the former there is a very simple set of commands:

vcf2bed --insertions < <sample_reference>.vcf | wc -l
vcf2bed --deletions < <sample_reference>.vcf | wc -l

vcf2bed --insertions < <sample_graph>.vcf | wc -l
vcf2bed --deletions < <sample_graph>.vcf | wc -l

whereas, the latter requires something like this:

bcftools view --types indels --include 'ILEN<0' <sample>.vcf | bcftools stats > <sample>_deletions.txt
bcftools view --types indels --include 'ILEN>0' <sample>.vcf | bcftools stats > <sample>_insertions.txt

grep "DP" <sample>_<INDEL_class>.txt | awk -F'\t' '{sum+=$6;} END{print sum;}'

One thing to be noted is that the Giraffe-DV VCF does not have a type filed nor a ilen information in it; hence, I'm not sure the approach that uses bcftools works correctly... What would be interesting is having some sort of visual interpretation of this analysis; however, I'm kind of concerned with the numbers I'm getting, they appear to be odd and possibly erroneous, see below.

| sample  | variant_type | approach  | BEDOPS  | bcftools |
| ------- | ------------ | --------- | ------- | -------- |
| HG002   |      INS     | reference | 475809  | 453944   |
| HG002   |      INS     |   graph   | 690414  |          |
| HG002   |      DEL     | reference | 501174  | 475132   |
| HG002   |      DEL     |   graph   | 852921  |          |
| 6103671 |      INS     | reference | 436799  | 417400   |
| 6103671 |      INS     |   graph   | 821559  |          |
| 6103671 |      DEL     | reference | 437471  | 417417   |
| 6103671 |      DEL     |   graph   | 1080141 |          |
| 6103673 |      INS     | reference | 448359  | 427776   |
| 6103673 |      INS     |   graph   | 1336050 |          |
| 6103673 |      DEL     | reference | 446712  | 425586   |
| 6103673 |      DEL     |   graph   | 1333631 |          |
| 6103675 |      INS     | reference | 453935  | 432375   |
| 6103675 |      INS     |   graph   | 812122  |          |
| 6103675 |      DEL     | reference | 448073  | 426747   |
| 6103675 |      DEL     |   graph   | 1061532 |          |

Is there anything I'm doing wrong or could be done differently? Maybe using any type of filter on specific FORMAT filed in the VCF? Let me know, thanks!

P. S. happy to share portion of the files whether that helps

INDEL vcf • 311 views
ADD COMMENT

Login before adding your answer.

Traffic: 1790 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6