Question: SNPs in 1000 genome project
0
gravatar for Javad
4 months ago by
Javad20
Javad20 wrote:

Hello every body,

I want to have SNPs of 1000 genome project only for exons. The database of 1000 genome project includes SNPs for the whole genome. Is there an easy way to filter and download data only for exons. I don't want to spend time on writing scripts to filter data. I would be grateful if some body can help.

Best, Ehsan

snp • 459 views
ADD COMMENTlink modified 4 months ago by Alex Reynolds25k • written 4 months ago by Javad20
1

The answer is probably bedtools

ADD REPLYlink written 4 months ago by WouterDeCoster31k
1

What is your next step, why do you want subset of 1000 genome data? At the moment "I want this, and don't want to code" seems to me an unclear request.

ADD REPLYlink written 4 months ago by zx87544.8k
1

Refer to dbSNP. In dbSNP, kgvalidated and kgprod tags denote the variants are from 1000 genomes project. Then filter by syn, nsf, nsm, nsn , u3 and u5 tags. These tags are for coding variants with calculated variant effect. For filtering you can use bcftools.

otherway is to intersect dbsnp vcf with exon coordinates.

ADD REPLYlink written 4 months ago by cpad01128.3k
1

Javad : Don't forget to follow up on this thread.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer as long as it works.
Upvote|Bookmark|Accept

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax54k

Yeah, But I didn't want to write scripts. I thought maybe this data is already stored somewhere. Thank you anyway.

ADD REPLYlink written 4 months ago by Javad20
1

Filtering by tags is one line code if one knows how to use bcftools.

ADD REPLYlink written 4 months ago by cpad01128.3k

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 4 months ago by genomax54k

It's just a single command but okay.

ADD REPLYlink written 4 months ago by WouterDeCoster31k

I don't think it would be just a single command. because the coordinates of exons are not included in the vcf file. Am I missing some thing? Could you please give me some hints to go through it? Thanks

ADD REPLYlink written 4 months ago by Javad20
1

You would need a bed file of the targets of interest, essentially the exome. You can get those from UCSC.

ADD REPLYlink written 4 months ago by WouterDeCoster31k

no vcf file will have exon coordinates, in general. VCF fill have coordinates for variants only. When you filter for variants in coding and UT regions, this automatically covers exonic regions, mostly.

ADD REPLYlink written 4 months ago by cpad01128.3k
5
gravatar for finswimmer
4 months ago by
finswimmer4.5k
Germany
finswimmer4.5k wrote:

Hello Javad,

without a little work you will not reach your goal. The most difficult part here is, to clarify who said what is an exon? UCSC, NCBI, Ensembl,...? And do you want just coding exons or all?

Let's assume you like all exons defined by NCBI.

  1. Go to UCSC Table browser
  2. Choose hg19 in the assembly field, "Genes and Gene Predictions" in group and "NCBI RefSeq" in track.
  3. Choose Bed as the output format, and give the output file a name e.g. exons.bed
  4. Click on get output

In the next dialog choose "Exons".

Now you have a file called exons.bed which contain the coordinates.

What we have to do now, is to sort this file by position and remove the "chr" from the chromome names. You can do it like this:

cut -c4- exons.bed|sort -k1,1V -k2,2g -k3,3g > exons_sorted.bed

This file we can use to query the 1000 Genomes file directly on the ftp server using tabix:

tabix -R exons_sorted.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > exon_variants.vcf

If this is to slow you have to download the compressed vcf file and the tabix index file to your pc and adopt the tabix command.

fin swimmer

ADD COMMENTlink modified 4 months ago • written 4 months ago by finswimmer4.5k
3
gravatar for Kevin Blighe
4 months ago by
Kevin Blighe26k
USA / Europe / Brazil
Kevin Blighe26k wrote:

Ehsan, following from Wouter's suggestion, first download the 1000 Genomes VCF data and then filter these using bedtools intersect using a GTF file of all coding exons.

For example (for GRCh37 / hg19):

  1. download 1000 Genomes by following the initial steps to my tutorial: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format
  2. download a GENCODE GTF for all coding / non-coding exons (https://www.gencodegenes.org/releases/grch37_mapped_releases.html)
  3. use bedtools intersect -a 1000Genomes.vcf -b GENCODE.GTF
ADD COMMENTlink modified 4 months ago • written 4 months ago by Kevin Blighe26k
2
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

Here is a command-line solution that does not involve scripting or navigating through web pages, via Gencode and BEDOPS convert2bed and bedmap.

Build a BED file of exons:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip -c - \
    | awk '$3 == "exon"' - \
    | convert2bed -i gff - \
    > exons.bed

Given a BED file of exons and a VCF file of variants from 1000Genomes (e.g., 1kG.variants.vcf), you can map SNPs to exons via BEDOPS bedmap:

$ bedmap --echo --echo-map-id --delim '\t' exons.bed <(vcf2bed --sort-tmpdir=${PWD} < 1kG.variants.vcf) > answer.bed

Each line of the file answer.bed will contain the exon record, along with the IDs of all SNPs that overlap the exon's genomic interval.

If you want the positions of those variants, replace --echo-map-id with --echo-map.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1497 users visited in the last hour