ensembl - selecting SNPs of specific genes
3
0
Entering edit mode
7 months ago
Eliza ▴ 30

Hi , I was wondering if there is any code or software inside ENSEMBL that can help me to get all SNPs that are annotated to a specific gene ( from the 1000 genome project) for example : getting a VCF file of all SNPs for the gene :LINC01435.

Thank you for help

SNP VCF ensembl gene genome • 1.1k views
ADD COMMENT
3
Entering edit mode
7 months ago
bk11 ★ 3.0k

You also can use tabix to pull all the SNPs for LINC01435 (chr10:109517591-109871360) for 1000 genome vcf file as follows-

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 10:109517591-109871360 |cut -f1-5 |awk '!/##/' |head
#CHROM  POS ID  REF ALT
10  109517629   .   G   A
10  109517659   .   C   G
10  109517694   .   C   T
10  109517945   .   A   G
10  109517971   rs59481676  T   C
10  109518131   rs74587027  G   A
10  109518278   rs77125356  A   T
10  109518322   .   C   T
10  109518361   rs2900779   G   C
ADD COMMENT
0
Entering edit mode

bk11 Thank you and i was wondering if there is a list of genes I'm interested in , is also possible to extract them using tabix like you did?

ADD REPLY
1
Entering edit mode

Absolutely, you can do it pretty easily. All you need is to provide the loci of genes (CHROM:START-END). I am demonstrating using a simple for loop here-

#Prepare a file containing loci list
cat loci.list
10:109517591-109871360
11:119517591-109871360
12:109517591-109871360

#Write a for loop in extract_snps.sh
#!/bin/bash

input=`cat loci.list`

for i in  ${input}; do
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ${i} |cut -f1-5 |awk '!/##/' >${i}_snps.list
done

#Run the loop
chmod a+x extract_snps.sh
./extract_snps.sh
ADD REPLY
0
Entering edit mode

bk11 thank you vey much , I was also wondering when I download the VCF file for all SNPs in a specific gene and read it in R ( using the readVCF package) a lot of SNPs ( some which i need) are filtered out for example :

vcf_HLA_DRB5_final=as.data.frame(vcf_HLA_DRB5_2$vcf)

View(vcf_HLA_DRB5_final) vcf_sorcs=file.path("SORCS3.vcf") vcf_sorcs_2=readVCF( vcf_sorcs , min_maf = 0.02 , max_maf = NA )

[#.......] Reading VCF file.. Rows: 1135 Columns: 638 Column specification Delimiter: "\t" chr (636): ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, HG00098, HG00100, HG00106, HG00112, HG00114, HG0... dbl (2): #CHROM, POS Use spec() to retrieve the full column specification for this data. Specify the column types or set show_col_types = FALSE to quiet this message. [##......] Chromosome: 6 Mbp: 32.51738 Region Size: 12.905 kb Num of individuals: 629 [##......] Before filtering Num of variants: 1135 Num of individuals: 629 [###.....] After filtering Num of variants: 400 Num of individuals: 629

and i get the warnings:

In FUN(newX[, i], ...) : NAs introduced by coercion

how can i prevent it from happening ? and why is it occurring ?

ADD REPLY
0
Entering edit mode

I am not sure what are trying to do here. But, I do not have any problem in reading the vcf file generated for a specific loci using tabix.

##Here I am reading one of the vcf file generated above

library(VariantAnnotation)
vcf <- readVcf("10:109517591-109871360_snps.vcf")

vcf
class: CollapsedVCF 
dim: 3616 629 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 6 columns: AF, DP, CB, EUR_R2, AFR_R2, ASN_R2
info(header(vcf)):
          Number Type    Description                                                                                     
   AF     .      Float   Allele Frequency, for each ALT allele, in the same order as listed                              
   DP     1      Integer Total Depth                                                                                     
   CB     .      String  List of centres that called, UM (University of Michigan), BI (Broad Institute), BC (Boston Co...
   EUR_R2 1      Float   R2 From Beagle based on European Samples                                                        
   AFR_R2 1      Float   R2 From Beagle based on AFRICAN Samples                                                         
   ASN_R2 1      Float   R2 From Beagle based on Asian Samples                                                           
geno(vcf):
  List of length 7: GT, AD, DP, GL, GQ, GD, OG
geno(header(vcf)):
      Number Type    Description                                                                                         
   GT 1      String  Genotype                                                                                            
   AD .      Integer Allelic depths for the ref and alt alleles in the order listed                                      
   DP 1      Integer Read Depth (only filtered reads used for calling)                                                   
   GL 3      Float   Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is no...
   GQ 1      Float   Genotype Quality                                                                                    
   GD 1      Float   Genotype dosage.  Expected count of non-ref alleles [0,2]                                           
   OG 1      String  Original Genotype input to Beagle  
ADD REPLY
0
Entering edit mode

bk11 so what i wanted to do is to read the VCF file in R and then convert it to data frame for easier usage : for example :

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 18:52340197-53535899 > DCC.vcf

library("vcfR") vcf_dcc=file.path("DCC.vcf") vcf_dcc_2=readVCF( vcf_dcc , min_maf = 0.02 , max_maf = NA ) vcf_dcc_final=as.data.frame(vcf_dcc_2$vcf)

but after reading it I see that vcfR filtered many of the variants out :

Rows: 9315 Columns: 638 Column specification Delimiter: "\t" chr (636): ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, HG00098, HG00100, HG00106, HG00112, HG00114, HG0... dbl (2): #CHROM, POS

Use spec() to retrieve the full column specification for this data. Specify the column types or set show_col_types = FALSE to quiet this message. [##......] Chromosome: 18 Mbp: 52.34026 Region Size: 1195.465 kb Num of individuals: 629 [##......] Before filtering Num of variants: 9315 Num of individuals: 629 [###.....] After filtering Num of variants: 400 Num of individuals: 629

so originally there where 9315 but after filtering only 400 and i was wondering why it is filtering it out and how to prevent it

ADD REPLY
1
Entering edit mode

It is because you are filtering while reading vcf using criterin min_maf = 0.02. Without filter you will see the number as you saw before.

    vcf_file <- "DCC.vcf"
    vcf <- read.vcfR(vcf_file, verbose=F)
    vcf
    ***** Object of Class vcfR *****
    629 samples
    1 CHROMs
    9,315 variants
    Object size: 168 Mb
    0 percent missing data
    *****        *****         *****
ADD REPLY
0
Entering edit mode

bk11 HI , I was wondering from where did you get the position of the gene , in Gwas it is :LINC01435 10:107694973-108197849 but here in the code you wrote :chr10:109517591-109871360. I'm simulating data with the sim1000G package and it seems that there is a error of mismatch between chromosomes in genetic map and vcf , the package uses GRCh37 coordinates and Im suspecting maybe Im using not the correct database

ADD REPLY
1
Entering edit mode
7 months ago
dthorbur ★ 2.5k

You can download the dbSNP database used on Ensembl in VCF format directly through the NCBI ftp page. This is where the FAQ on ensembl lead you.

If you just want a single gene, you can download a spreadsheet with the information using ensembl's BioMart tool and filter to only include your gene(s) of interest. You could then manipulate the data into a VCF format if you really needed to.

ADD COMMENT
0
Entering edit mode

dthorbur thank you , can the SNPs be from the 1000 genome project?

ADD REPLY
0
Entering edit mode

Those are part of dbSNP on NCBI. Docs here. Alternatively, you can download 1000 genome variant data on GRCh38 here.

ADD REPLY
1
Entering edit mode
7 months ago
Ben Moore ★ 2.4k

Hi Eliza,

You could use the Data Slicer tool or BioMart. To use BioMart, you'll need to select the Human Variation dataset, then use 1000 Genomes 3 - All as a 'Variant set name' Filter and the gene ID as a second filter in the same query)

ADD COMMENT

Login before adding your answer.

Traffic: 1506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6