Question: upSet R plot for variant calling
0
gravatar for evelyn
11 days ago by
evelyn20
evelyn20 wrote:

hi,

I have tried using upset plot for three vcf files from different pipelines. I extracted the variant column (SNPs) and used these csv files (with one column) for R import. I have used this code:

set1 <- read.csv("set1.vcf", sep="")
set2 <- read.csv("set2.vcf", sep="")
set3 <- read.csv("set3.vcf", sep="")

set1 <- as.vector(set1$V1)
set2 <- as.vector(set2$v1)
set3 <- as.vector(set3$V1)

read_sets = list(set1_reads = set1,
set2_reads = set2,
set3_reads = set3)

upset(fromList(read_sets),
       sets = c("set1_reads", "set2_reads", "set3_reads"),
       number.angles = 20, point.size = 2.5, line.size = 1.5,
       mainbar.y.label = "read intersection", sets.x.label = "read set size",
       text.scale = c(1.5, 1.5, 1.25, 1.25, 1.5, 1.5), mb.ratio = c(0.65, 0.35),
       group.by = "freq", keep.order = TRUE)

It gives an intersection plot but when the number of SNPs from upset plot are really low when I compared these with vcf-compare results using same vcf files. I am not sure why I am getting different numbers with upset plot.

snp R • 161 views
ADD COMMENTlink modified 11 days ago by zx87547.3k • written 11 days ago by evelyn20
1

If you don't mind shifting to python, I have some code which does what you are looking for, more or less, in this repository: https://github.com/wdecoster/surpyvor/

Let me know if you can't figure it out - sooner or later I'll implement it for SNVs as well.

ADD REPLYlink written 11 days ago by WouterDeCoster39k

I have never used python.I will give it a try and will let you know how it works.

ADD REPLYlink written 11 days ago by evelyn20

Can you post example lines from your VCF files? Especially lines that are found to overlap via upsetR and those that are not so we can see the difference. My instinct would be slightly different formatting if you are seeing some overlapping, but looking at some example lines would help.

ADD REPLYlink written 11 days ago by shawn.w.foley580

I am not sure how to extract the overlapping SNPs/vcf lines.

ADD REPLYlink written 11 days ago by evelyn20

You can use grep to do that. By the way, why are you reading VCF files using read.csv? Even if the defaults set for read.csv (as opposed to read.table which is slightly better here) are being overridden because you set the sep, you set the sep to "", which means the result data frame has one column per character.

ADD REPLYlink written 11 days ago by RamRS21k

If I use read.table to import, I get this error:

Error in start_col:end_col : argument of length 0
ADD REPLYlink written 11 days ago by evelyn20

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLYlink written 11 days ago by RamRS21k

Are your vcf file standard vcf format files ? Would be nice if you can share example "set1.vcf".

ADD REPLYlink written 11 days ago by zx87547.3k

I extracted Chrom REF ALT columns by removing the headers for R import like this (some lines from set1.vcf):

#CHROM  REF ALT
ch00    A   AATCATCATCATCATCATC
ch00    A   G
ch00    A   C
ch00    T   A
ch00    G   C
ch00    G   A
ch00    C   T
ch00    A   T
ch00    TTC T
ch00    C   A
ch00    C   T
ch00    T   A
ADD REPLYlink written 11 days ago by evelyn20

It is hard to guess without actual data, could you post outputs of:

vcf-compare -H set1.vcf set2.vcf set3.vcf

versus in R:

length(intersect(set1, set2))
length(intersect(set1, set3))
length(intersect(set2, set3))

length(setdiff(set1, set2))
length(setdiff(set1, set3))
length(setdifft(set2, set3))

etc...
ADD REPLYlink modified 11 days ago • written 11 days ago by zx87547.3k
3
gravatar for SMK
11 days ago by
SMK430
Gentopia, Belgium
SMK430 wrote:

Try this one (note that set1.vcf, set2.vcf, and set3.vcf are the original/real vcf without grep, and it's read.vcfR):

rm(list = ls())
library(UpSetR)
library(vcfR)

set1 <- read.vcfR("set1.vcf")
set2 <- read.vcfR("set2.vcf")
set3 <- read.vcfR("set3.vcf")

set1 <-
  as.vector(paste(set1@fix[, "CHROM"], set1@fix[, "POS"], sep = "_"))
set2 <-
  as.vector(paste(set2@fix[, "CHROM"], set2@fix[, "POS"], sep = "_"))
set3 <-
  as.vector(paste(set3@fix[, "CHROM"], set3@fix[, "POS"], sep = "_"))

read_sets <- list(set1_reads = set1,
                  set2_reads = set2,
                  set3_reads = set3)

upset(fromList(read_sets), order.by = "freq")

If you would like to compare genotypes (like -g, --cmp-genotypes Compare genotypes, not only positions in vcf-compare), adding "ALT" in addition to "CHROM" and "POS" should work.

ADD COMMENTlink modified 11 days ago • written 11 days ago by SMK430

It worked. I got same results as of vcf-compare. Thank you!

ADD REPLYlink written 11 days ago by evelyn20

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.

Upvote|Bookmark|Accept

ADD REPLYlink written 10 days ago by RamRS21k

I tried to use the same codes for six vcf files but it shows the plot for five files only.

ADD REPLYlink written 10 days ago by evelyn20

This is already a new question, please post a new question with a link to this one. In this post the question was based on 3 sets of files, and this solution worked for you, consider accepting this as answer - tick.

ADD REPLYlink written 10 days ago by zx87547.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 822 users visited in the last hour