Question: Hot to retrieval a SNP from a very big vcf file by position with tabix
0
gravatar for vw
16 days ago by
vw20
vw20 wrote:

Hi all,

I want to retrieval SNPs in a very large vcf file.

I stored the chromosome and position of my interested SNPs in a temp file as below:

chr1    2487663 rs2227312       C       A       100.0   PASS    DP=1825;ASP=true;CAF=0.4008,0.5992;COMMON=1;G5=true;G5A=true;GENEINFO=LOC115110:115110|TNFRSF14:8764;GNO=true;HD=true;KGPhase1=true;KGPhase3=true;R5=true;RS=2227312;RSPOS=24

I used to retrieval records from the source vcf with tabix by inputting a range:

tabix source.vcf.gz chr1: 222-245

But this time, since it a snp, I can only input a begin site:

tabix source.vcf.gz chr1: 2,487,663

or

tabix source.vcf.gz chr1:2,487,663-2,487,663

But it doesn't work. Furthermore, not all of SNPs have dbSNP ID, so, I cannot retrieval them by ID.

Could you give me some suggestions?

Thank you!

snp vcftools tabix vcf • 125 views
ADD COMMENTlink modified 16 days ago • written 16 days ago by vw20
2
gravatar for Pierre Lindenbaum
16 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

source.vcf MUST be compressed with bgzip and indexed with tabix or bcftools

bgzip source.vcf 
tabix -p vcf  source.vcf.gz
tabix source.vcf.gz "chr1:2487663-2487663"

(no space , no comma in "chr1:2487663-2487663" )

so, I cannot retrieval them by ID.

gatk select variants: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

--keepIDs / -IDs

List of variant IDs to select

If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line.

ADD COMMENTlink modified 16 days ago • written 16 days ago by Pierre Lindenbaum121k

Thanks for your response. I have compressed and indexed the vcf file. A few first lines of the source.vcf.gz appear as below:

gzip -cd source.vcf.gz | head -60 | tail -4

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10019   rs775809821     TA      T       .       .       RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP

I try to retrieval above SNP from the source.cvf.gz:

tabix source.vcf.gz "chr1:10019-10019"

But it doesn't return anything. Did I make any mistake?

ADD REPLYlink modified 16 days ago • written 16 days ago by vw20

Did I make any mistake?

yes it's "1:10019-10019" not "chr1:10019-10019"

ADD REPLYlink written 16 days ago by Pierre Lindenbaum121k

I figure out the issue:

In the source vcf file, the content in the chromosome column is a plant number without prefix "chr". So, my command line should be:

tabix source.vcf.gz "1:10019-10019"

It works well now. Thank you so much!

ADD REPLYlink written 16 days ago by vw20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1954 users visited in the last hour