Question: Parse raw vcf in R from CollapsedVCF (VariantAnnotation library)
1
gravatar for ricardoguerreiro2121
5 months ago by
ricardoguerreiro212120 wrote:

Hello

Simple question ( I hope!)

I'm using this VariantAnnotation package from bioconductor to work with SNPs.

myparam = ScanVcfParam(               # this function is a fast positional filter
    samples="3441_pseudo_gatk", 
    which= GRanges(chr[14],          # This is a chromossome name
                   ranges=IRanges(   # these are custom filters
                       start= 10,
                       end=10000)))

    VCF <- readVcf(TabixFile("my_potato.vcf.gz"), "Potato", param=myparam)

VCF is a "Formal class CollapsedVCF". It's very complex and there are extensive instructions, but I cannot find the answer to the most simple thing of them all:

How do I parse the raw data of VCF in R?

I know that if I write it out with:

writeVcf(obj = VCF, filename =  "tmp.vcf")

I get exactly what I want in a file in my working directory. However, I would like parsing all lines in R as a vector to filter, as I already wrote a filter that gives index positions.

All of this because I don't like the filterVcf() function. It doesn't take advantage of the fact that VCF is already filtered by positions and is thus smaller than the original "my_potato.vcf.gz".

EDIT : Actually, I got a detail wrong. FilterVcf() can indeed use the positional filter myparam with param=myparam.

Thank you for the attention!
Ricardo

ADD COMMENTlink modified 5 months ago • written 5 months ago by ricardoguerreiro212120

What do you mean by "parsing all lines in R as a vector to filter, as I already wrote a filter that gives index positions."... You can pretty easily convert all the VCF information to data.frames but getting it back into VCF format is not as easy.

What exactly do you want to extract? and what is your filter?

ADD REPLYlink written 5 months ago by benformatics920

I have a filtering function that outputs desired indexes (based on, for example, the value of info(VCF)$MQ ) but I can't subscript VCF with indexes because it's not a dataframe or a vector. That is why I thought of parsing the raw lines as elements of a vector. I can just concatenate them to the header to create a new vcf file.

I would simply like to filter out certain lines of the vcf. For example, to filter out lines 1,5,7:

index <- c(1,5,7)

vector[-index]  # this would be a vector on which each element is a raw line of the vcf file `

A raw line should look like this:

ChrUn|ref0000006|ref0000012|ref0000001|ref0000010   24754   .   T   A94.77  10X_ALLELE_FRACTION_FILTER  AC=1;AF=0.5;AN=2;BaseQRankSum=2.155;ClippingRankSum=-1.643;DP=276;FS=2.9;MLEAC=1;MLEAF=0.5;MQ=46.59;MQRankSum=-0.851;QD=0.34;ReadPosRankSum=0.077;SOR=0.782;MUMAP_REF=33.0841;MUMAP_ALT=37.2581;AO=23;RO=259;MMD=2.51716;RESCUED=4;NOT_RESCUED=455;HAPLOCALLED=0    GT:AD:DP:GQ:PL:BX:PS    0/1:234,22:256:99:123,0,10080:GAAGTTCAGCTGCCGT-1_60;CAAGGCCGTCAAGAGC-1_74;ACGTCAAAGCACTCAT-1_74;AACGTTGAGGGTGTGT-1_65;GCTGCAGCAATGGACG-1_74;ACTACGAGTTGCCCGA-1_70;AAAGCGGCATGTGGGA-1_74;ACGTATGAGCCAGTAG-1_60;CCAAGATTCACGGCCA-1_74;GCGCAACCAATCGCAT-1_74_70;GGAACGATCTCCAGCT-1_60;ATAGCGTAGGTGATAT-1_74;CTCATTATCTTCGACC-1_74;CTCTGTGAGATTCACC-1_74;ATTCTACGTGTAACCT-1_70;TCAGCTCCACAGTGAG-1_74;CTTCATAAGC
ADD REPLYlink modified 5 months ago • written 5 months ago by ricardoguerreiro212120
1

Check out bcftools, especially bcftools view.

Much easier to use and faster than R. Just remember to be on top of alt-alleles, or to avoid confusion, split multi-allelic entries to biallelic entries using bcftools norm.

ADD REPLYlink modified 5 months ago • written 5 months ago by RamRS22k

The format that the package parses a VCF to is insanely confusing! Why even use R to work with huge VCF files?

ADD REPLYlink written 5 months ago by RamRS22k

I know, right?? I initially did a python script and it was much easier, besides having better performance. But I need to deliver an R filter to a client that does not know (or even have installed) python...

ADD REPLYlink written 5 months ago by ricardoguerreiro212120

The client wants the filter in R specifically? You can always system("bcftools view") :-D

ADD REPLYlink modified 5 months ago • written 5 months ago by RamRS22k

Thank you for the suggestion. I already had the answer I was looking for. The client wants an R script to run in windows, hehe.

ADD REPLYlink written 5 months ago by ricardoguerreiro212120
3
gravatar for d-cameron
5 months ago by
d-cameron2.0k
Australia
d-cameron2.0k wrote:

How do I parse the raw data of VCF in R?

You already know the answer to that: it's the readVcf() function. You use the VariantAnnotation library to fully parse the file, or you use a text file parser if you need the raw input lines for some reason (you very rarely do).

See here for an example of where I needed to do exactly that to work around a bug in the VariantAnnotation library.

I have a filtering function that outputs desired indexes (based on, for example, the value of info(VCF)$MQ ) but I can't subscript VCF with indexes because it's not a dataframe or a vector.

I would simply like to filter out certain lines of the vcf. For example, to filter out lines 1,5,7:

Sub-setting a VCF using VariantAnnotation works exactly like you describe. Just subset the VCF using the notation you would use for a vector: filteredVcf = vcf[-c(1, 5, 7)] will removes the first, firth, and seventh variant from the VCF.

You can also subset using a logical vector. For example: highqualvcf = vcf[VariantAnnotation::fixed(vcf)$QUAL > 20]

ADD COMMENTlink modified 5 months ago • written 5 months ago by d-cameron2.0k

Perfect answer, thank you!

I didn't realize VCF[-c(1, 5, 7)] would subset the vcf like if it was a vector. It only prints the metainformation but when doing info(filteredVcf) you can can see that the rows are gone.

Once again thanks and have a nice day/evening!

ADD REPLYlink written 5 months ago by ricardoguerreiro212120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1113 users visited in the last hour