Question

SNPs in 1000 genome project

0

Entering edit mode

6.0 years ago

Javad ▴ 150

Hello every body,

I want to have SNPs of 1000 genome project only for exons. The database of 1000 genome project includes SNPs for the whole genome. Is there an easy way to filter and download data only for exons. I don't want to spend time on writing scripts to filter data. I would be grateful if some body can help.

Best, Ehsan

SNP • 2.7k views

ADD COMMENT • link updated 6.0 years ago by Alex Reynolds 35k • written 6.0 years ago by Javad ▴ 150

1

Entering edit mode

The answer is probably bedtools

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

1

Entering edit mode

What is your next step, why do you want subset of 1000 genome data? At the moment "I want this, and don't want to code" seems to me an unclear request.

ADD REPLY • link 6.0 years ago by zx8754 11k

1

Entering edit mode

Refer to dbSNP. In dbSNP, kgvalidated and kgprod tags denote the variants are from 1000 genomes project. Then filter by syn, nsf, nsm, nsn , u3 and u5 tags. These tags are for coding variants with calculated variant effect. For filtering you can use bcftools.

otherway is to intersect dbsnp vcf with exon coordinates.

ADD REPLY • link 6.0 years ago by cpad0112 21k

1

Entering edit mode

Javad : Don't forget to follow up on this thread.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer as long as it works.
Upvote|Bookmark|Accept

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

Yeah, But I didn't want to write scripts. I thought maybe this data is already stored somewhere. Thank you anyway.

ADD REPLY • link 6.0 years ago by Javad ▴ 150

1

Entering edit mode

Filtering by tags is one line code if one knows how to use bcftools.

ADD REPLY • link 6.0 years ago by cpad0112 21k

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

It's just a single command but okay.

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

0

Entering edit mode

I don't think it would be just a single command. because the coordinates of exons are not included in the vcf file. Am I missing some thing? Could you please give me some hints to go through it? Thanks

ADD REPLY • link 6.0 years ago by Javad ▴ 150

1

Entering edit mode

You would need a bed file of the targets of interest, essentially the exome. You can get those from UCSC.

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

0

Entering edit mode

no vcf file will have exon coordinates, in general. VCF fill have coordinates for variants only. When you filter for variants in coding and UT regions, this automatically covers exonic regions, mostly.

ADD REPLY • link 6.0 years ago by cpad0112 21k

score 5 · Accepted Answer · 2018-04-17

Hello Javad,

without a little work you will not reach your goal. The most difficult part here is, to clarify who said what is an exon? UCSC, NCBI, Ensembl,...? And do you want just coding exons or all?

Let's assume you like all exons defined by NCBI.

Go to UCSC Table browser
Choose hg19 in the assembly field, "Genes and Gene Predictions" in group and "NCBI RefSeq" in track.
Choose Bed as the output format, and give the output file a name e.g. exons.bed
Click on get output

In the next dialog choose "Exons".

Now you have a file called exons.bed which contain the coordinates.

What we have to do now, is to sort this file by position and remove the "chr" from the chromome names. You can do it like this:

cut -c4- exons.bed|sort -k1,1V -k2,2g -k3,3g > exons_sorted.bed

This file we can use to query the 1000 Genomes file directly on the ftp server using tabix:

tabix -R exons_sorted.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > exon_variants.vcf

If this is to slow you have to download the compressed vcf file and the tabix index file to your pc and adopt the tabix command.

fin swimmer

score 4 · Accepted Answer · 2018-04-17

Ehsan, following from Wouter's suggestion, first download the 1000 Genomes VCF data and then filter these using bedtools intersect using a GTF file of all coding exons.

For example (for GRCh37 / hg19):

download 1000 Genomes by following the initial steps to my tutorial: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format
download a GENCODE GTF for all coding / non-coding exons (https://www.gencodegenes.org/releases/grch37_mapped_releases.html)
use bedtools intersect -a 1000Genomes.vcf -b GENCODE.GTF

score 3 · Accepted Answer · 2018-04-17

Here is a command-line solution that does not involve scripting or navigating through web pages, via Gencode and BEDOPS convert2bed and bedmap.

Build a BED file of exons:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip -c - \
    | awk '$3 == "exon"' - \
    | convert2bed -i gff - \
    > exons.bed

Given a BED file of exons and a VCF file of variants from 1000Genomes (e.g., 1kG.variants.vcf), you can map SNPs to exons via BEDOPS bedmap:

$ bedmap --echo --echo-map-id --delim '\t' exons.bed <(vcf2bed --sort-tmpdir=${PWD} < 1kG.variants.vcf) > answer.bed

Each line of the file answer.bed will contain the exon record, along with the IDs of all SNPs that overlap the exon's genomic interval.

If you want the positions of those variants, replace --echo-map-id with --echo-map.