Finding important predispositions using R
2
1
7.6 years ago

Hi,

I have a list of variants called from an individual genome and I'm trying to filter out the important predispositions from it. My approach was to download the variant_summary.txt.gz file from the ClinVar website, in which most of the variants related to human health are recorded, so that I can intersect my variants with it.

I loaded variant_summary.txt into R, and it reports that the dataset has 154358 rows and 25 columns. But when I check with wc -l on Linux, the number of records is 198661. I double-checked the number of rows by viewing the data in Excel; it had 198,661. My questions are:

1. Why does R not load all the records of my file?
2. Given that I'm still a novice in bioinformatics, do you think my approach to finding predispositions is feasible if I fix the R issue?

Thank you very much.

ClinVar R read.csv • 2.2k views
1
Most likely there is a problem with line 154358 in your file. It may contain a different number of columns or may not use the same delimiter as the other lines. Unfortunately, read.delim does not give a warning in these cases; try reading with read.csv.
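One way to check this hypothesis is to count the tab-separated fields on every line with quoting and comment handling switched off. The sketch below uses a small made-up file (its contents and column names are assumptions for illustration, not the real ClinVar data) and the base R functions readLines() and count.fields().

```r
# Build a small tab-delimited test file (hypothetical content).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "#AlleleID\tName\tGeneSymbol",
  "1\tvariant A\tGENE1",
  "2\tFriedreich's ataxia variant\tGENE2",
  "3\tvariant C\tGENE3",
  "4\tLeber's optic atrophy variant\tGENE4"
), tmp)

# Number of physical lines, comparable to what `wc -l` reports.
n_lines <- length(readLines(tmp))

# Fields per line, treating "'" and "#" as ordinary characters.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Line numbers whose field count differs from the header line.
bad <- which(n_fields != n_fields[1])

n_lines
table(n_fields)
bad
```

If bad comes back empty, every line has the same number of columns, and the missing rows are more likely caused by how the reader treats quote or comment characters than by a malformed line.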
3
7.6 years ago

You can read in the file like so:

var.anno <- read.delim("Downloads/variant_summary.txt", header = TRUE)
dim(var.anno)
[1] 198660     25


That gave me the correct dimensions; the first row contains the header.

Regarding your second question, I think it is reasonable to try to annotate detected variants and their association with phenotypes from available databases. Depending on what you are after, you might also want to consult other variant databases, e.g. dbSNP. dbSNP also links to ClinVar if there is an entry there and to OMIM.

There are also R packages for this purpose, some here:

1

You can also add the arguments comment.char = "" and quote = "" to read.table(). The input file contains both "#" and single-quote characters, which are causing the truncated read issue.
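A minimal sketch of that suggestion, using a made-up file rather than the real ClinVar download (the rows and column names here are assumptions for illustration):

```r
# Small tab-delimited file containing apostrophes and a "#" (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "AlleleID\tName\tGeneSymbol",
  "1\tvariant A\tGENE1",
  "2\tFriedreich's ataxia variant\tGENE2",
  "3\tvariant C\tGENE3",
  "4\tLeber's optic atrophy variant\tGENE4",
  "5\tvariant E #1\tGENE5"
), tmp)

# With read.table() defaults, "'" opens a quoted string and "#" starts a
# comment, so rows can be merged, cut short, or the call can fail outright.
# The outcome is not asserted here; it is only captured for inspection.
naive <- tryCatch(
  suppressWarnings(read.table(tmp, header = TRUE, sep = "\t")),
  error = function(e) NULL
)
naive_rows <- if (is.null(naive)) NA else nrow(naive)

# Disable quote and comment handling so every line is read as data.
fixed <- read.table(tmp, header = TRUE, sep = "\t",
                    quote = "", comment.char = "",
                    stringsAsFactors = FALSE)

naive_rows
dim(fixed)
```

Note that the header of the real file may itself begin with "#"; with comment.char = "" that line is read as the header, and R may alter the first column name unless check.names = FALSE is also passed.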

0

Thank you very much. I'm using ANNOVAR to annotate my variants, but unfortunately it does not provide annotations from phenotype association databases. That's why I tried merging the ClinVar text file with my variant list using location coordinates. Could you please mention any alternatives you know of?
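The coordinate-based merge described above could look roughly like the following sketch. Both data frames are toy data, and the column names (Chromosome, Start, ReferenceAllele, AlternateAllele, ClinicalSignificance, PhenotypeList) are assumptions that should be checked against the header of the downloaded file. The sketch also assumes both tables use the same genome assembly and the same coordinate convention.

```r
# Hypothetical variant calls from one genome.
my_variants <- data.frame(
  Chromosome      = c("1", "7", "17"),
  Start           = c(1000L, 2000L, 3000L),
  ReferenceAllele = c("A", "G", "C"),
  AlternateAllele = c("G", "T", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical excerpt of a ClinVar-like annotation table.
clinvar <- data.frame(
  Chromosome           = c("1", "17", "X"),
  Start                = c(1000L, 3000L, 4000L),
  ReferenceAllele      = c("A", "C", "G"),
  AlternateAllele      = c("G", "T", "A"),
  ClinicalSignificance = c("Pathogenic", "Benign", "Pathogenic"),
  PhenotypeList        = c("Condition A", "Condition B", "Condition C"),
  stringsAsFactors = FALSE
)

# Inner join on position and alleles: keeps only variants present in both.
keys <- c("Chromosome", "Start", "ReferenceAllele", "AlternateAllele")
annotated <- merge(my_variants, clinvar, by = keys)

# Keep the entries labelled as (likely) pathogenic.
pathogenic <- annotated[
  annotated$ClinicalSignificance %in% c("Pathogenic", "Likely pathogenic"),
]

annotated
pathogenic
```

Matching on the alleles as well as the position avoids annotating a call with a different variant that happens to sit at the same coordinate.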

Anyway my questions are solved. Thanks again :)

1

Some more tools:

• Ensembl VEP
• SNPedia: links to many other databases for each SNP and also has a bulk API
0

Thanks :)

1
7.6 years ago

Dear nilakshafreezon,

a) How are you reading the CSV (as it is in .txt format)? Are you using the correct delimiter? That is, is your text separated by commas, spaces, tabs, or semicolons?

IMP: For newcomers, R is mostly like a command-line version of Excel, so whatever you do in Excel to open a file, you do the same on the command line.

The image above shows how you import a text file in Excel; you need to import it the same way in R, e.g.

> tree <- read.csv(file="trees91.csv",header=TRUE,sep=",")


Here the delimiter is set with the "sep" argument. [http://www.cyclismo.org/tutorial/R/input.html#reading-a-csv-file]

Note: This might be of some help:

b) For bioinformatics, one has to be equally proficient in both biology and (computer) languages. But it doesn't matter; you can learn the other in no time if you know the basics of one.

Regards,
Devashish Das

0

Dear Devashish,

1. Of course, I used the correct 'sep' value. In this case it's "\t", as it's a tab-delimited file. Furthermore, the file does load into R without a problem, but only a portion of the records shows up.
2. I was asking more experienced users about their approaches to finding important predispositions: whether their approaches are similar, or what they do in addition.

And thank you for your kind consideration.