Question

Count and filter sites with degenerate bases in VCF files


16 months ago

ja569116 • 0

Hi,

I genotyped samples from methylation reads (bisulfite sequencing). I was surprised that many of the alternate alleles were degenerate (IUPAC ambiguity) bases: R or Y.

V00001.vcf

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
NW_022882922.1  28895   .       C       T       0       PASS    NS=1:DP=52      GT:GQ:DP        0/1:0:52
NW_022882922.1  36586   .       C       T,Y     0       PASS    NS=1:DP=23:GU=T/C       GT:GQ:DP        1/2:0:23
NW_022882922.1  36640   .       G       A       0       PASS    NS=1:DP=40      GT:GQ:DP        1/1:0:40
NW_022882922.1  39071   .       A       G       0       PASS    NS=1:DP=43      GT:GQ:DP        1/1:0:43

V0021

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001
NW_022882922.1  25160   .       G       Y       0       PASS    NS=1:DP=34:GU=T/C       GT:GQ:DP        0/1:0:34
NW_022882922.1  25676   .       T       C       0       PASS    NS=1:DP=41      GT:GQ:DP        0/1:0:41
NW_022882922.1  28342   .       G       A,R     0       PASS    NS=1:DP=35:GU=A/G       GT:GQ:DP        1/2:0:35
NW_022882922.1  29887   .       C       A       0       PASS    NS=1:DP=48      GT:GQ:DP        0/1:0:48

One sample had way more degenerate bases:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001
NW_022882922.1  8082    .   G   A   0   PASS    NS=1:DP=6   GT:GQ:DP    0/1:0:6
NW_022882922.1  11106   .   T   G   0   PASS    NS=1:DP=19  GT:GQ:DP    0/1:0:19
NW_022882922.1  17828   .   C   G   0   PASS    NS=1:DP=27  GT:GQ:DP    0/1:0:27
NW_022882922.1  25160   .   G   Y   0   PASS    NS=1:DP=37:GU=T/C   GT:GQ:DP    0/1:0:37
NW_022882922.1  27396   .   G   A,R 0   PASS    NS=1:DP=33:GU=A/G   GT:GQ:DP    1/2:0:33
NW_022882922.1  28342   .   G   A,R 0   PASS    NS=1:DP=27:GU=A/G   GT:GQ:DP    1/2:0:27
NW_022882922.1  28895   .   C   T   0   PASS    NS=1:DP=32  GT:GQ:DP    0/1:0:32
NW_022882922.1  29887   .   C   A   0   PASS    NS=1:DP=35  GT:GQ:DP    0/1:0:35
NW_022882922.1  40905   .   T   C,Y 0   PASS    NS=1:DP=17:GU=T/C   GT:GQ:DP    1/2:0:17
NW_022882922.1  43671   .   A   C   0   PASS    NS=1:DP=11  GT:GQ:DP    0/1:0:11
NW_022882922.1  43859   .   A   T   0   PASS    NS=1:DP=18  GT:GQ:DP    0/1:0:18
NW_022882922.1  46336   .   G   A,R 0   PASS    NS=1:DP=26:GU=A/G   GT:GQ:DP    1/2:0:26

When I tried to combine these VCFs with GATK, I got an error because of the degenerate alleles.

I have preprocessed my samples in two different ways. My goals are:

1. Count the degenerate sites (ALT containing R or Y) and estimate their percentage. I can count the total number of sites with bcftools, but I don't know how to count the degenerate ones.
2. Once I know which preprocessing is better, filter out those degenerate bases/sites to build my final dataset.
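
Both goals could be approached with a short script. The following is only a minimal sketch in Python, not a tested pipeline. It assumes a tab-delimited VCF with simple SNV/indel alleles and treats any ALT allele containing a character other than A, C, G or T (so R and Y, but also N and other IUPAC codes) as degenerate. The function names `has_degenerate_alt`, `split_vcf` and `percent_degenerate` are hypothetical helpers introduced for illustration.

```python
STANDARD_BASES = set("ACGT")


def has_degenerate_alt(alt_field):
    """Return True if any ALT allele contains a base other than A, C, G or T."""
    for allele in alt_field.split(","):
        # Skip missing (.), spanning-deletion (*) and symbolic (<...>) alleles.
        if allele in {".", "*"} or allele.startswith("<"):
            continue
        if not set(allele.upper()) <= STANDARD_BASES:
            return True
    return False


def split_vcf(lines):
    """Split VCF lines into (header, clean_records, degenerate_records)."""
    header, clean, degenerate = [], [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        alt = fields[4]  # ALT is the fifth VCF column
        (degenerate if has_degenerate_alt(alt) else clean).append(line)
    return header, clean, degenerate


def percent_degenerate(n_degenerate, n_total):
    """Percentage of sites with a degenerate ALT allele (0.0 for an empty file)."""
    return 100.0 * n_degenerate / n_total if n_total else 0.0
```

To use it, one would open the VCF, pass the file handle to `split_vcf`, report `percent_degenerate(len(degenerate), len(clean) + len(degenerate))`, and write `header + clean` to a new file as the filtered dataset. Note that this drops the whole site when any ALT allele is degenerate (e.g. `T,Y`). Keeping the site and removing only the degenerate allele would also require rewriting the genotypes.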

Thanks,

VCF degenerate-bases bisulfite-sequencing • 806 views

updated 16 months ago by Jeremy Leipzig 23k • written 16 months ago by ja569116 • 0


Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code, or, for multi-line code, either (a) the code option in the formatting bar or (b) fenced code blocks. Fenced code blocks also enable syntax highlighting. If your code has a long single-command line, break it across multiple lines with proper escape sequences so it is easier to read and still runs when copy-pasted. I've done it for you this time.