Question: SNPs in 1000 genome project
0
gravatar for Javad
5 weeks ago by
Javad0
Javad0 wrote:

Hello every body,

I want to have SNPs of 1000 genome project only for exons. The database of 1000 genome project includes SNPs for the whole genome. Is there an easy way to filter and download data only for exons. I don't want to spend time on writing scripts to filter data. I would be grateful if some body can help.

Best, Ehsan

snp • 275 views
ADD COMMENTlink modified 5 weeks ago by Alex Reynolds24k • written 5 weeks ago by Javad0
1

The answer is probably bedtools

ADD REPLYlink written 5 weeks ago by WouterDeCoster28k
1

What is your next step, why do you want subset of 1000 genome data? At the moment "I want this, and don't want to code" seems to me an unclear request.

ADD REPLYlink written 5 weeks ago by zx87544.3k
1

Refer to dbSNP. In dbSNP, kgvalidated and kgprod tags denote the variants are from 1000 genomes project. Then filter by syn, nsf, nsm, nsn , u3 and u5 tags. These tags are for coding variants with calculated variant effect. For filtering you can use bcftools.

otherway is to intersect dbsnp vcf with exon coordinates.

ADD REPLYlink written 5 weeks ago by cpad01125.3k
1

Javad : Don't forget to follow up on this thread.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer as long as it works.
Upvote|Bookmark|Accept

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by genomax48k

Yeah, But I didn't want to write scripts. I thought maybe this data is already stored somewhere. Thank you anyway.

ADD REPLYlink written 5 weeks ago by Javad0
1

Filtering by tags is one line code if one knows how to use bcftools.

ADD REPLYlink written 5 weeks ago by cpad01125.3k

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 5 weeks ago by genomax48k

It's just a single command but okay.

ADD REPLYlink written 5 weeks ago by WouterDeCoster28k

I don't think it would be just a single command. because the coordinates of exons are not included in the vcf file. Am I missing some thing? Could you please give me some hints to go through it? Thanks

ADD REPLYlink written 5 weeks ago by Javad0
1

You would need a bed file of the targets of interest, essentially the exome. You can get those from UCSC.

ADD REPLYlink written 5 weeks ago by WouterDeCoster28k

no vcf file will have exon coordinates, in general. VCF fill have coordinates for variants only. When you filter for variants in coding and UT regions, this automatically covers exonic regions, mostly.

ADD REPLYlink written 5 weeks ago by cpad01125.3k
4
gravatar for finswimmer
5 weeks ago by
finswimmer2.1k
Germany
finswimmer2.1k wrote:

Hello Javad,

without a little work you will not reach your goal. The most difficult part here is, to clarify who said what is an exon? UCSC, NCBI, Ensembl,...? And do you want just coding exons or all?

Let's assume you like all exons defined by NCBI.

  1. Go to UCSC Table browser
  2. Choose hg19 in the assembly field, "Genes and Gene Predictions" in group and "NCBI RefSeq" in track.
  3. Choose Bed as the output format, and give the output file a name e.g. exons.bed
  4. Click on get output

In the next dialog choose "Exons".

Now you have a file called exons.bed which contain the coordinates.

What we have to do now, is to sort this file by position and remove the "chr" from the chromome names. You can do it like this:

cut -c4- exons.bed|sort -k1,1V -k2,2g -k3,3g > exons_sorted.bed

This file we can use to query the 1000 Genomes file directly on the ftp server using tabix:

tabix -R exons_sorted.bed ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > exon_variants.vcf

If this is to slow you have to download the compressed vcf file and the tabix index file to your pc and adopt the tabix command.

fin swimmer

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by finswimmer2.1k
3
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe19k
University College London Cancer Institute
Kevin Blighe19k wrote:

Ehsan, following from Wouter's suggestion, first download the 1000 Genomes VCF data and then filter these using bedtools intersect using a GTF file of all coding exons.

For example (for GRCh37 / hg19):

  1. download 1000 Genomes by following the initial steps to my tutorial: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format
  2. download a GENCODE GTF for all coding / non-coding exons (https://www.gencodegenes.org/releases/grch37_mapped_releases.html)
  3. use bedtools intersect -a 1000Genomes.vcf -b GENCODE.GTF
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe19k
2
gravatar for Alex Reynolds
5 weeks ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

Here is a command-line solution that does not involve scripting or navigating through web pages, via Gencode and BEDOPS convert2bed and bedmap.

Build a BED file of exons:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip -c - \
    | awk '$3 == "exon"' - \
    | convert2bed -i gff - \
    > exons.bed

Given a BED file of exons and a VCF file of variants from 1000Genomes (e.g., 1kG.variants.vcf), you can map SNPs to exons via BEDOPS bedmap:

$ bedmap --echo --echo-map-id --delim '\t' exons.bed <(vcf2bed --sort-tmpdir=${PWD} < 1kG.variants.vcf) > answer.bed

Each line of the file answer.bed will contain the exon record, along with the IDs of all SNPs that overlap the exon's genomic interval.

If you want the positions of those variants, replace --echo-map-id with --echo-map.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Alex Reynolds24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 662 users visited in the last hour