Extract samples from a bcf files based on genotype?
Entering edit mode
4.4 years ago
ste.lu ▴ 80

Hi All,

I am looking for a way (possibly using bcftools) to extract samples based on the variant, but I haven't found it yet.

For instance:

CHROM        POS            REF           ALT         SAMPLE.A              SAMPLE.B               SAMPLE.C
chr1         10             A             TAA         0/0                   0/1                    1/1
chr1         10             C             GGA         0/0                   0/0                    0/1

I want to have samples that are Het or Hom for the SNP 1-10-A-TAA. so the Ideal output should be something similar to:

chr1         10             A             TAA         SAMPLE.B               SAMPLE.C

The file I am working with are huge so I would need the best computational approach.

So far I came out with this:

bcftools query -r chr1:1-15 -i "GT=="AA" & GT=="AR"' -f '%CHR %POS %REF %ALT [/t%SAMPLE=%GT]\n' file_name.bcf

Any suggestions which can improve the speed?

Thanks to whoever spend some time to help me.

sequencing next-gen genome snp • 1.9k views
Entering edit mode
4.4 years ago

What's the speed bottleneck here? Are you performing thousands of queries of this type? Or are you performing a smaller number of very large queries?

If you're performing lots of queries, but the number of variants and samples is not that large, bcftools query should be fine, as long as your BCF is indexed.

If you're performing a smaller number of very large queries, and you only have biallelic variants, plink 1.9 is likely to outperform bcftools. You can convert your data to plink-format with

plink --bcf <full BCF filename> --const-fid 0 --out converted ;

your sample query can then be executed with

plink --bfile converted --keep-allele-order --chr 1 --from-bp 1 --to-bp 15 --recode rlist --out result .

See https://www.cog-genomics.org/plink/1.9/formats#rlist for a description of the format of this command's main output file (result.rlist). Positional information for the variants can be looked up in converted.bim, or if you use a plink 1.9 build from 30 Nov 2019 or later, result.map (--recode rlist forgot to generate the .map file in older plink 1.9 builds).

Entering edit mode
4.4 years ago

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

 java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+ " "+V.getStart()+ " "+V.getReference().getDisplayString()+" "+V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+" "+V.getGenotypes().stream().filter(G->!(G.isNoCall() || G.isHomRef())).map(G->G.getSampleName()).collect(Collectors.joining("\t"))));' src/test/resources/rotavirus_rf.vcf.gz

RF01 970 A C S5
RF02 251 A T S2 S3
RF02 578 G A S4
RF02 877 T A S1
RF02 1726 T G S2    S3
RF02 1962 TACA TA S1
RF03 1221 C G S2    S3
RF03 1242 C A S4
RF03 1688 T G S5
RF03 1708 G T S5
Entering edit mode

Hi Pierre,

Thanks a lot for your quick answer. Unfortunately, is note that easy to try it, I am working on HPC and they are quite strict about installing new packages. I'll try to git clone it.

Thanks again.

Entering edit mode

compile on your side and 'scp' the tool on your cluser...


Login before adding your answer.

Traffic: 1105 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6