Question: Extract vcf from 1000 genomes based on population and region
0
gravatar for ammarsabir15
2.6 years ago by
ammarsabir1560
ammarsabir1560 wrote:

I want to see the mutation in SCN5A (which is on chromosome 3:38548066-38649628) in Punjabi population and for this purpose I have downloaded the vcf file for chromosome 3 from here.

The file is very large and contain variants (on chromosome 3) for all the populations in 1000 genomes project and I do not know how to filter it based on population and region that's why I am not able to get variants for the region of interest.

How this vcf file can be filtered based on region and population ?

sequencing snp alignment • 2.3k views
ADD COMMENTlink modified 2.6 years ago by Emily_Ensembl19k • written 2.6 years ago by ammarsabir1560
1

There are a lot of tools to subset a VCF file by region/sample - this is a routine requirement. Did you try searching the forum? GATK has a couple of tools, and so does vcftools/tabix/plinkseq.

You will need to generate the list of samples that form the population of your choice - that is, the samples being a part of a population is more of a logical grouping than something an algorithm can figure out.

ADD REPLYlink written 2.6 years ago by RamRS24k

But there is still a problem. I want to compare these blast results part1 of blast results and part2 of blast results i.e the G shown in yellow in part2 with the variants file for punjabi population from 1000 genomes project. The variant file is here. I want to check that whether at the same position this mutation exists in punjabi population or not. But I am not getting the right way to do it because the blast file has different numbers for position of bases than the vcf file. So what can be the way to compare these?

ADD REPLYlink written 2.6 years ago by ammarsabir1560

Do not add answers unless you're answering the question. This deserves to be a comment reply on an appropriate post. I'm moving it to a comment on the top level post now.

ADD REPLYlink written 2.6 years ago by RamRS24k

I tried to add it as a comment but it gave error that's why I used this option.

ADD REPLYlink written 2.6 years ago by ammarsabir1560

What was the exact error you faced?

ADD REPLYlink written 2.6 years ago by RamRS24k
1
gravatar for Emily_Ensembl
2.6 years ago by
Emily_Ensembl19k
EMBL-EBI
Emily_Ensembl19k wrote:

The Data Slicer (originally from 1000 Genomes, now passed to Ensembl) does just that.

ADD COMMENTlink written 2.6 years ago by Emily_Ensembl19k

@Emily thanks for the tool its great. As I said I want to compare mutations from above given blast results with one here.. But in above blast results I am confused about the right position of mutation on the subject (the mutation which is highlighted). I think it is on position 38666125 of chromosome # 3 , is it right ?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by ammarsabir1560
0
gravatar for aham
2.6 years ago by
aham40
aham40 wrote:

You can do this with vcftools and bcftools:
First extract the variants in Punjabi population using 'vcftools':

vcf-subset -e -c list_of_Punjabi_persons.txt 1000_genomes_chr3.vcf > PJL_chr3.vcf

Then, extract the variants in specified region using bcftools:

bcftools view -R target_region.bed -o PJL_SCN5A.vcf PJL_chr3.vcf.gz
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by aham40

I was successful in getting list of variants for punjabi population i.e PJL_chr3.vcf. The command used for this purpose was

zcat genotypes.vcf.gz | vcf-subset -e -c list1.txt > PJL_chr3.vcf

But when I tried to extract desired region i.e chr3:38540899-38649799 from thenvcf file then I get a vcf file containing only headers for the vcf file. The command used for this was

bcftools view -r chr3:38540899-38649799 -o PJL_SCN5A.vcf PJL_chr3.vcf.gz

Is there any mistake in this command?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by ammarsabir1560

You may want to double check the chromosome naming scheme - it may be 3:start-end and not chr3:start-end

ADD REPLYlink written 2.6 years ago by RamRS24k

Your solution worked in this way.

   bcftools view -r 3:38548066-38649628  -o PJL_SCN5A.vcf PJL_chr3.vcf.gz
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by ammarsabir1560

I suggest you are using the wrong coordinates of SCN5A gene. If your 'PJL_chr3.vcf.gz' file is on hg19 build, which I am sure so, then the correct coordinates of this gene are: '3:38589548-38691164' (from gencode v.19). Please try using these coordinates. Also check the chromosome naming scheme in input file, as suggested by @Ram.

ADD REPLYlink written 2.6 years ago by aham40

mshakeel yes you are right, the vcf is on hg19 build as it states in header file of vcf

##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Should I only use '3:38589548-38691164' or a list of coordinates for SCN5A? Also for the same gene there are listed many coordinates so which to use? For example the first one is '3:38674526-38687267' and the second one is which you have told.

ADD REPLYlink modified 2.6 years ago by RamRS24k • written 2.6 years ago by ammarsabir1560

Please see here to find the coordinates of this gene according to hg19/hs37d5 assembly. The correct coordinates of the gene '3:38589548-38691164'.

ADD REPLYlink written 2.6 years ago by aham40

@aham your solution worked thanks for help.

ADD REPLYlink written 2.6 years ago by ammarsabir1560

in case you need to perform this 2-step process, it'd be much faster first to filter by region and then to select the samples. anyway, you can merge both commands in one:

bcftools view -r 3:38589548-38691164 -S list_of_Punjabi_persons.txt -o PJL_SCN5A.vcf 1000_genomes_chr3.vcf
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Jorge Amigo11k

following your suggestions ,use "vcf-subset -e -c ceu_sample.txt ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > ceu_chr7.vcf" I can't get the vcf files for CEU, ceu_sample.txt including all the sample name of cue,could you give me some suggestions

ADD REPLYlink written 2.3 years ago by gongrui09180

Why not use the Data Slicer posted by Emily above in this post ??

ADD REPLYlink written 2.3 years ago by ammarsabir1560
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1895 users visited in the last hour