Question: Extract samples from a bcf files based on genotype?
0
gravatar for ste.lu
10 months ago by
ste.lu60
ste.lu60 wrote:

Hi All,

I am looking for a way (possibly using bcftools) to extract samples based on the variant, but I haven't found it yet.

For instance:

CHROM        POS            REF           ALT         SAMPLE.A              SAMPLE.B               SAMPLE.C
chr1         10             A             TAA         0/0                   0/1                    1/1
chr1         10             C             GGA         0/0                   0/0                    0/1

I want to have samples that are Het or Hom for the SNP 1-10-A-TAA. so the Ideal output should be something similar to:

chr1         10             A             TAA         SAMPLE.B               SAMPLE.C

The file I am working with are huge so I would need the best computational approach.

So far I came out with this:

bcftools query -r chr1:1-15 -i "GT=="AA" & GT=="AR"' -f '%CHR %POS %REF %ALT [/t%SAMPLE=%GT]\n' file_name.bcf

Any suggestions which can improve the speed?

Thanks to whoever spend some time to help me.

sequencing snp next-gen genome • 360 views
ADD COMMENTlink modified 10 months ago by chrchang5237.3k • written 10 months ago by ste.lu60
1
gravatar for chrchang523
10 months ago by
chrchang5237.3k
United States
chrchang5237.3k wrote:

What's the speed bottleneck here? Are you performing thousands of queries of this type? Or are you performing a smaller number of very large queries?

If you're performing lots of queries, but the number of variants and samples is not that large, bcftools query should be fine, as long as your BCF is indexed.

If you're performing a smaller number of very large queries, and you only have biallelic variants, plink 1.9 is likely to outperform bcftools. You can convert your data to plink-format with

plink --bcf <full BCF filename> --const-fid 0 --out converted ;

your sample query can then be executed with

plink --bfile converted --keep-allele-order --chr 1 --from-bp 1 --to-bp 15 --recode rlist --out result .

See https://www.cog-genomics.org/plink/1.9/formats#rlist for a description of the format of this command's main output file (result.rlist). Positional information for the variants can be looked up in converted.bim, or if you use a plink 1.9 build from 30 Nov 2019 or later, result.map (--recode rlist forgot to generate the .map file in older plink 1.9 builds).

ADD COMMENTlink modified 10 months ago • written 10 months ago by chrchang5237.3k
0
gravatar for Pierre Lindenbaum
10 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

 java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+ " "+V.getStart()+ " "+V.getReference().getDisplayString()+" "+V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+" "+V.getGenotypes().stream().filter(G->!(G.isNoCall() || G.isHomRef())).map(G->G.getSampleName()).collect(Collectors.joining("\t"))));' src/test/resources/rotavirus_rf.vcf.gz


RF01 970 A C S5
RF02 251 A T S2 S3
RF02 578 G A S4
RF02 877 T A S1
RF02 1726 T G S2    S3
RF02 1962 TACA TA S1
RF03 1221 C G S2    S3
RF03 1242 C A S4
RF03 1688 T G S5
RF03 1708 G T S5
(...)
ADD COMMENTlink written 10 months ago by Pierre Lindenbaum130k

Hi Pierre,

Thanks a lot for your quick answer. Unfortunately, is note that easy to try it, I am working on HPC and they are quite strict about installing new packages. I'll try to git clone it.

Thanks again.

ADD REPLYlink written 10 months ago by ste.lu60
1

compile on your side and 'scp' the tool on your cluser...

ADD REPLYlink written 10 months ago by Pierre Lindenbaum130k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1914 users visited in the last hour