Question: Acquiring and combining hg38 SNP and gene data
0
gravatar for spiral01
23 months ago by
spiral0180
spiral0180 wrote:

I am trying to put together a dataset of hg38 human genome SNP data (CDS only) that contains information on:

  • SNP chromosomal location
  • Allele frequencies
  • Allele change
  • Codon change
  • Peptide change
  • Variant consequence
  • Gene within which SNP is located

Initially I tried to use BiomaRt's getBM to get variation data, which included everything I wanted except allele frequencies (you can get minor allele frequencies but only very few of the entries actually have any data for this).

I eventually gave up on BiomaRt for variation data as it keeps crashing R (I know my code is correct because it works on the smaller chromosomes - 22 for instance - but fails on the larger ones). From reading other threads I can see that the variation data on BiomaRt can be fiddly to acquire due to the size of the datasets.

So my new approach is to acquire data via the UCSC table browser which gives me all the information above except the gene within which the SNP is located. I have managed to get gene information from BiomaRt using:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
attributes = listAttributes(ensembl)
#Get all chromosome names
chroms <- getBM(attributes = 'chromosome_name',
                mart = ensembl)

for(i in 1:length(chroms[,])){
  gene.table <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 
                                     'ensembl_transcript_id', 'transcript_start', 'transcript_end', 
                                     'exon_chrom_start', 'exon_chrom_end', 'phase', 'end_phase', 
                                     'genomic_coding_start', 'genomic_coding_end', 'cds_start', 'cds_end'), 
                      filters = c('chromosome_name'),
                      values = chroms[i,1], 
                      mart = ensembl)

  gene.table <- data.frame(t(sapply(gene.table,c)))
  gene.table <- t(gene.table)
  gene.table <- gene.table[complete.cases(gene.table),]

  write.table(gene.table, file=paste("chrom.",chroms[i,1],".gene.txt", sep=''), sep='\t', quote=FALSE)
}

From here I can use the SNP chromosomal location and the gene_coding_start and gene_coding_end coordinates to identify to which gene each SNP belongs.

The reason for my post is that this seems like a rather unwieldy approach to acquiring what I assume is a commonly used dataset. What I would like to know is whether there are more efficient ways of getting what I need, and if not is my own methodology sound.

As an aside, I did intend to use VEP but have had real issues attempting to install it or use it via the virtual machine.

snp gene • 984 views
ADD COMMENTlink modified 23 months ago by genecats.ucsc560 • written 23 months ago by spiral0180
2
gravatar for Pierre Lindenbaum
23 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

I would 'just' get the dbsnp vcf file for hg38 and annotate it with snpeff or vep.

ADD COMMENTlink written 23 months ago by Pierre Lindenbaum119k

HI Pierre, thanks for your reply. I have tried installing snpeff but have had no luck. I get the error: Unsupported major.minor version 51.0. I understand that I need to upgrade my Java version, but this is proving troublesome too. Using Java -version I get:

java version "1.6.0_65" Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833) Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)

However, using /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -version I get:

java version "1.8.0_131" Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

I know I need 1.7 or higher, but am unsure why it won't work.

ADD REPLYlink written 23 months ago by spiral0180

what is exactly your snpeff command please, including the java cmd + What is your OS ?

ADD REPLYlink written 23 months ago by Pierre Lindenbaum119k

I am running OS X El Capitan. I downloaded snpEff here: http://snpeff.sourceforge.net/download.html and had no trouble instilling it. However, when I run the command:

java -jar snpEff.jar databases

I get the following errors:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/snpeff/SnpEff : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

To check my java version I use:

java -version

Which gives me:

java version "1.6.0_65" Java(TM) SE Runtime Environment (build 1.6.0_65-b14-468-11M4833) Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-468, mixed mode)

However, when I use the command:

 /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -version

I get:

java version "1.8.0_131" Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

So my guess is that although I have Java 1.8 installed it is not configured?

ADD REPLYlink written 23 months ago by spiral0180

and what is the output of

 /Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java  -jar snpEff.jar databases

???

ADD REPLYlink written 23 months ago by Pierre Lindenbaum119k

Ah that provides a list of databases! So to run it I must call Java from the correct place? Many thanks. Is there no way to set Java 1.8 as the default? Again, many thanks for your help.

ADD REPLYlink written 23 months ago by spiral0180

Do you remember installing Java from Java.com? If not do that.

ADD REPLYlink written 23 months ago by genomax65k

Actually I am still receiving errors. Whilst the above command worked, when I use:

/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -jar snpEff.jar download database.dbsnp.GRCh38

I get the following error:

java.lang.RuntimeException: Property: 'database.dbsnp.GRCh38.genome' not found
    at org.snpeff.interval.Genome.<init>(Genome.java:106)
    at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:680)
    at org.snpeff.snpEffect.Config.readConfig(Config.java:648)
    at org.snpeff.snpEffect.Config.init(Config.java:479)
    at org.snpeff.snpEffect.Config.<init>(Config.java:117)
    at org.snpeff.SnpEff.loadConfig(SnpEff.java:451)
    at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.runDownloadGenome(SnpEffCmdDownload.java:80)

The database is the one listed within the snpEff.config file (as advised in the documentation). at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72) at org.snpeff.SnpEff.run(SnpEff.java:1182) at org.snpeff.SnpEff.main(SnpEff.java:162)

ADD REPLYlink written 23 months ago by spiral0180
1

Looks like you are not able to download the database automatically (behind a firewall?). You may need to download manually as described in the documentation.

ADD REPLYlink written 23 months ago by genomax65k
1
gravatar for genecats.ucsc
23 months ago by
genecats.ucsc560
genecats.ucsc560 wrote:

You should be able to use the UCSC Table Browser to get this information. Select the snp147CodingDbSnp table as your primary table, and then choose "selected fields from primary and related tables" as your output option, where you can select fields from the snp147 table.

Then I think you want the following selections from snp147CodingDbSnp: chrom, chromStart, chromEnd, name, transcript, alleles, codons, peptides

and the following fields from snp147: refNCBI/UCSC, observed, func, alleleFreqs

this leads to output like the following:

# position rsId NM_transcript codons peptides ref observed function frequency
chr9    133255667   133255668   rs782400398 NM_020469   TGA,CGA,    *,R,    A   A/G stop-loss   0.999974,0.000026,
chr9    133255668   133255669   rs782672373 NM_020469   CCG,CCA,    P,P,    C   C/T coding-synon    
chr9    133255669   133255670   rs56392308  NM_020469   CCG,CGT,    P,R,    G   -/C frameshift  0.106304,0.893696,

If you have any more questions about the UCSC Genome Browser, feel free to send your question to our public mailing list at genome@soe.ucsc.edu so our whole team can see your question. If your question involves private data, you can send it to genome-www@soe.ucsc.edu instead.

Thanks, ChrisL from the UCSC Genome Browser

ADD COMMENTlink written 23 months ago by genecats.ucsc560

Hi Chris,

Many thanks for your response and apologies for my own slow reply. I have been using UCSC data such as you've shown above, but I also need to identify in which gene each SNP is located. As far as I can see this is not possible from the table browser and would involve cross referencing co-ordinates? I have given this a go and due to overlapping transcripts have ended up finding the same SNP in multiple genes?

ADD REPLYlink written 22 months ago by spiral0180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 901 users visited in the last hour