Question

Using bcftools to find unique alt homozygous sites

1

Entering edit mode

6 months ago

Axzd ▴ 70

Hello,

I have a vcf with 20 samples. I want to find for each sample the sites that are 1/1, only in that sample (so other samples must have genotypes 0/1 or 0/0). I know I can use filters such as GT="aa"'

However, how do I say GT="aa" for sample 1 and 'is not "aa"' in all other samples? Is there an easy way without manually specifying the GT of each sample, which would end up in a very long command? I know you can specify the exact genotype you expect per sample, but I could end up with a vcf with 50 samples, which might be tedious. I could write a Python script that would print the commands.

How would you approach this? There must be a trivial command, but I don't find it.

Thanks a lot

EDIT: of course, it will be easier for you with a sample of the data. Please note I have excluded sites with unknown genotypes but I might not filter them out. It's still an open question. I manually edited this snippet to include a line with a "1/1"

bcftools view -e 'GT = "./."' all_bcftools2_merged.vcf|head -n 50
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.18+htslib-1.18
##bcftoolsCommand=mpileup -Ou -I --fasta-ref vaga.fa ../ancestor.sorted.bam
##reference=file://vaga.fa
##contig=<ID=Chrom_3,length=20354777>
##contig=<ID=Chrom_1,length=18146847>
##contig=<ID=Chrom_5,length=16930519>
##contig=<ID=Chrom_2,length=16274841>
##contig=<ID=Chrom_4,length=15224634>
##contig=<ID=Chrom_6,length=13893210>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric, http://samtools.github.io/bcftools/rd-SegBias.pdf">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.18+htslib-1.18
##bcftools_callCommand=call -mv -Ob -o ancestor_bcftools.bcf; Date=Wed Sep 13 10:01:56 2023
##bcftools_viewVersion=1.18+htslib-1.18
##bcftools_viewCommand=view ancestor_bcftools.bcf; Date=Thu Sep 14 18:21:29 2023
##bcftools_viewCommand=view -Ou -o ancestor_bcftools2.bcf; Date=Thu Sep 14 18:21:29 2023
##bcftools_mergeVersion=1.18+htslib-1.18
##bcftools_mergeCommand=merge --threads 24 ancestor_bcftools2.bcf D2C150g_bcftools2.bcf D2C1_bcftools2.bcf D2C350g_bcftools2.bcf D2C3_bcftools2.bcf D3A350g_bcftools2.bcf D3A3_bcftools2.bcf D4B450g_bcftools2.bcf D4B4_bcftools2.bcf D5C150g_bcftools2.bcf D5C1_bcftools2.bcf D5C350g_bcftools2.bcf D5C3_bcftools2.bcf H2B450g_bcftools2.bcf H2B4_bcftools2.bcf H3A450g_bcftools2.bcf H3A4_bcftools2.bcf H3C450g_bcftools2.bcf H3C4_bcftools2.bcf H5A250g_bcftools2.bcf H5A2_bcftools2.bcf H5A450g_bcftools2.bcf H5A4_bcftools2.bcf H5C250g_bcftools2.bcf H5C2_bcftools2.bcf; Date=Mon Sep 25 14:45:31 2023
##bcftools_viewCommand=view -e 'GT = "./."' all_bcftools2_merged.vcf; Date=Thu Oct 12 16:23:22 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ancestor    D2C150g D2C1    D2C350g D2C3    D3A350g D3A3    D4B450g D4B4    D5C150g D5C1    D5C350g D5C3    H2B450g H2B4    H3A450g H3A4    H3C450g H3C4    H5A250g H5A2    H5A450g H5A4    H5C250g H5C2
Chrom_3 7253    .   T   A   221.3   .   VDB=0.535543;SGB=-0.693147;RPBZ=-1.50431;MQBZ=1.34343;MQSBZ=-1.02125;BQBZ=0.323777;SCBZ=0.975289;MQ0F=0.00398406;MQ=43;DP=4623;DP4=1443,1221,494,349;AN=50;AC=25    GT:PL   0/1:136,0,255   0/1:97,0,255    0/1:175,0,255   0/1:240,0,255   0/1:255,0,255   0/1:125,0,255   0/1:255,0,255   0/1:208,0,255   0/1:230,0,255   0/1:255,0,255   0/1:255,0,255   0/1:218,0,255   0/1:255,0,255   0/1:177,0,255   0/1:255,0,202   0/1:255,0,255   0/1:241,0,255   0/1:255,0,255   0/1:249,0,255   0/1:136,0,255   0/1:220,0,255   0/1:219,0,255   0/1:255,0,255   0/1:166,0,255   0/1:172,0,255
Chrom_3 7906    .   G   A   222.406 .   VDB=1.00557e-07;SGB=-0.693147;RPBZ=-1.81995;MQBZ=-4.11869;MQSBZ=0.0750985;BQBZ=-2.55896;SCBZ=4.94681;MQ0F=0;MQ=58;DP=2581;DP4=522,513,514,371;AN=50;AC=25   GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:67,0,255    0/1:93,0,255    0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:243,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7911    .   C   T   222.411 .   VDB=2.66194e-07;SGB=-0.693147;RPBZ=-0.981365;MQBZ=-4.00625;MQSBZ=0.0765345;BQBZ=-3.22205;SCBZ=5.27961;MQ0F=0;MQ=58;DP=2537;DP4=501,513,511,380;AN=50;AC=25  GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:45,0,255    0/1:40,0,255    0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:243,0,239   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7912    .   G   A   222.405 .   VDB=5.09897e-07;SGB=-0.693147;RPBZ=-0.98308;MQBZ=-4.06079;MQSBZ=0.0765345;BQBZ=-3.32505;SCBZ=4.87485;MQ0F=0;MQ=58;DP=2539;DP4=503,515,512,378;AN=50;AC=25   GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:59,0,255    0/1:90,0,255    0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:245,0,236   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7949    .   G   C   222.406 .   VDB=0.201825;SGB=-0.693147;RPBZ=1.81576;MQBZ=-4.12627;MQSBZ=-0.328279;BQBZ=0.966014;SCBZ=2.96348;MQ0F=0.00518135;MQ=57;DP=2899;DP4=507,536,593,488;AN=50;AC=25  GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:206,0,255   0/1:197,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,253   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7972    .   A   C   222.406 .   VDB=0.0949276;SGB=-0.693147;RPBZ=-2.4423;MQBZ=-0.00764329;MQSBZ=0.425233;BQBZ=-2.16904;SCBZ=3.54574;MQ0F=0.0060241;MQ=59;DP=2334;DP4=510,545,346,326;AN=50;AC=25    GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:76,0,255    0/1:118,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:230,0,245   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7974    .   A   T   222.409 .   VDB=0.075728;SGB=-0.693147;RPBZ=-2.13329;MQBZ=-0.527608;MQSBZ=-0.123898;BQBZ=-1.0772;SCBZ=3.61746;MQ0F=0;MQ=59;DP=2331;DP4=511,542,346,323;AN=50;AC=25  GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:76,0,255    0/1:120,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:223,0,245   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7980    .   C   A   222.404 .   VDB=0.135845;SGB=-0.693147;RPBZ=-1.43568;MQBZ=-0.50627;MQSBZ=-0.15972;BQBZ=-0.961533;SCBZ=3.49738;MQ0F=0;MQ=59;DP=2335;DP4=509,541,356,318;AN=50;AC=25  GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:86,0,255    0/1:100,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:235,0,228   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 7983    .   C   T   222.403 .   VDB=0.106311;SGB=-0.693147;RPBZ=-1.30284;MQBZ=-0.492116;MQSBZ=-0.161765;BQBZ=0.152925;SCBZ=3.94198;MQ0F=0;MQ=59;DP=2341;DP4=511,541,357,322;AN=50;AC=25 GT:PL   0/1:255,0,255   0/1:255,0,255   1/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:86,0,255    0/1:111,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:233,0,236   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 8004    .   T   G   222.406 .   VDB=0.557767;SGB=-0.693147;RPBZ=0.23356;MQBZ=-1.95341;MQSBZ=-1.52365;BQBZ=-0.360925;SCBZ=4.40809;MQ0F=0;MQ=59;DP=2386;DP4=522,519,356,344;AN=50;AC=25   GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:98,0,255    0/1:108,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,204   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255
Chrom_3 8010    .   T   C   222.406 .   VDB=0.621464;SGB=-0.693147;RPBZ=0.764096;MQBZ=-0.452836;MQSBZ=-1.52365;BQBZ=-0.468479;SCBZ=4.32313;MQ0F=0;MQ=59;DP=2399;DP4=529,513,360,349;AN=50;AC=25 GT:PL   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:98,0,255    0/1:118,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,204   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255   0/1:255,0,255

EDIT2: edited for spelling and grammar, sorry not a native speaker I hope my wording makes sense.

bcftools • 2.5k views

ADD COMMENT • link updated 6 months ago by Pierre Lindenbaum 161k • written 6 months ago by Axzd ▴ 70

1

Entering edit mode

Would you please head -n 20 to see how is your data?

ADD REPLY • link 6 months ago by gonzlezb ▴ 10

0

Entering edit mode

Of course, sorry, I am going to edit the original post

ADD REPLY • link 6 months ago by Axzd ▴ 70

score 4 · Answer 1 · 2023-10-13

4

Entering edit mode

6 months ago

LChart 3.9k

The trivial solution is the GATK Jexl Expressions

gatk SelectVariants \
    -R reference.fasta \
    -V variants.vcf \
    -select 'vc.getGenotype("TARGET_SAMPLE").isHomVar() && vc.GetHomVarCount() == 1'

one homozygous variant, and it's the target sample.

ADD COMMENT • link 6 months ago by LChart 3.9k

0

Entering edit mode

This is indeed trivial. Well, thanks a lot.

ADD REPLY • link 6 months ago by Axzd ▴ 70

0

Entering edit mode

JEXL of course! And neat trick with coupling the count with the isHomVar(). I wonder how GATK's JEXL expression is so much simpler than Pierre's jvarkit expression. Or maybe it's just a difference in the logic as in the exact same expression will work with vcffilterjdk as well?

ADD REPLY • link 6 months ago by Ram 43k

2

Entering edit mode

jexl is scripted java. But I you cannot do things like loops with jexl. I didn't known that function https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html#getHomVarCount-- but I can use it in jvarkit:

 java -jar jvarkit.jar vcffilterjdk -e 'return variant.getGenotype("S1").isHomVar() && variant.GetHomVarCount()==1;' in.vcf

ADD REPLY • link 6 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

I assume these are all just granting access to the htsjdk VariantContext https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html#method.summary

ADD REPLY • link 6 months ago by LChart 3.9k

0

Entering edit mode

yes, JEXL is a scripted layer over java using https://commons.apache.org/proper/commons-jexl/

ADD REPLY • link 6 months ago by Pierre Lindenbaum 161k

Ram · Answer 2 · 2023-10-12

3

Entering edit mode

6 months ago

Pierre Lindenbaum 161k

A one liner using jvarkit vcffilterjdk: https://jvarkit.readthedocs.io/en/latest/VcfFilterJdk/

 java -jar jvarkit.jar vcffilterjdk -e 'return variant.getGenotype("S1").isHomVar() && variant.getGenotypes().stream().filter(G->!G.getSampleName().equals("S1")).noneMatch(G->G.isHomVar());' in.vcf

ADD COMMENT • link updated 6 months ago by Ram 43k • written 6 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you Pierre. I always love your one-liner solutions. But I am never able to write them correctly.

ADD REPLY • link 6 months ago by Axzd ▴ 70

0

Entering edit mode

Is it supposed to be variant.getGenotype("S1") or variant.getGenotypes("S1")? I'm assuming variant is of type VariantContext and that class doesn't seem to have a getGenotype() method.

ADD REPLY • link 6 months ago by Ram 43k

1

Entering edit mode

 variant.getGenotype("S1")

get genotype for sample S1 output is a https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/Genotype.html . Method is : https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html#getGenotype-java.lang.String-

 variant.getGenotypes()

returns all the genotypes for this variant. output is https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/GenotypesContext.html , a List/an array of https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/Genotype.html

ADD REPLY • link 6 months ago by Pierre Lindenbaum 161k

score 1 · Answer 3 · 2023-10-12

I think that this is a tricky question that probably does not have simple solutions

while it probably is possible to come up with some filtering with bcftools I would recommend writing a Python script using cyvcf2

https://github.com/brentp/cyvcf2

the code would be fairly simple if you know even a minimal Python,

iterate over VCF and sample and print the line the sample genotype is 1/1 and no other is such.

I feel that any other solution would probably be even more complicated,

I asked the chatGPT for a solution and claims the following works with cyvcf,

looking at the code, I don't think it is quite right as the 1/1 genotype is not always at index [3] in my opinion, you should find the index by looking up the genotype in the list (if it exists)

but the solution is very close and just needs a little troubleshooting and tweaking

import cyvcf2
import os

input_vcf = "path_to_your_file.vcf"
output_dir = "path_to_output_directory"

# Ensure output directory exists
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

vcf = cyvcf2.VCF(input_vcf)

samples = vcf.samples

for sample in samples:
    print(f"Processing {sample}...")

    out_name = os.path.join(output_dir, f"{sample}_unique_1_1.vcf")
    writer = cyvcf2.Writer(out_name, vcf)

    for variant in vcf:
        if variant.gt_types[vcf.samples.index(sample)] == 3:  # 3 denotes 1/1 genotype
            unique_flag = True
            for i, s in enumerate(samples):
                if s != sample and variant.gt_types[i] == 3:
                    unique_flag = False
                    break
            if unique_flag:
                writer.write_record(variant)

    writer.close()

print("Finished processing all samples.")

I also asked the ChatGPT about using bcftools and gave me this, looks a little too complicated to my liking, I would use the previous solution after fixing it up

#!/bin/bash

input_vcf="path_to_your_file.vcf"
output_dir="path_to_output_directory"

for sample in $(bcftools query -l $input_vcf); do
    echo "Processing $sample..."

    # First extract sites where the sample is 1/1
    bcftools view -s $sample -i 'GT="1/1"' $input_vcf > ${output_dir}/${sample}_temp.vcf

    # Now exclude sites where any other sample is 1/1
    for exclude_sample in $(bcftools query -l $input_vcf); do
        if [ "$exclude_sample" != "$sample" ]; then
            bcftools view -S ^<(echo $exclude_sample) -i 'GT!="1/1"' ${output_dir}/${sample}_temp.vcf > ${output_dir}/${sample}_temp2.vcf
            mv ${output_dir}/${sample}_temp2.vcf ${output_dir}/${sample}_temp.vcf
        fi
    done

    # Rename the final output for the sample
    mv ${output_dir}/${sample}_temp.vcf ${output_dir}/${sample}_unique_1_1.vcf
done

echo "Finished processing all samples."

score 1 · Answer 4 · 2023-10-12

1

Entering edit mode

6 months ago

Ram 43k

I'll outline an approach that does not answer your question specifically but gets you to a point where you can use a data processing utility (R/pandas) to write custom queries of any sort.

First, create a matrix of samples and genotypes:

bcftools query -Hf '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%GT]\n' vcf_file > genotype_matrix.tsv

The header on the file will look a little weird but you can tweak it a little then read it into any data-table processing software. You could then "flatten" the table (pivot to a longer form) so you get something in this format:

CHROM  POS  ID  REF  ALT  sample_name  GT

where each sample gets its own row. It now becomes easy to filter based on a sample_name - GT combination (say (sample_name == "ancestor" and GT == "1/1") || (sample_name!="ancestor" && (GT == "0/1" || GT == "0/0"))) and loop through each sample_name value to get this.

It is a programming heavy approach but the intermediate file makes for easier lifting and enables a broader range of queries. Pierre's vcffilterjdk tool enables this sort of querying at a much more abstract level.

ADD COMMENT • link 6 months ago by Ram 43k

0

Entering edit mode

I think I can get bcftools to print the coordinates and the genotypes, then I load in a pandas dataframe and use apply and a lambda function. As you said it's programming heavy. But I am intellectually unable to think at the abstract level displayed by Pierre. I have used a lot of his one liners and I am always mesmerized.

ADD REPLY • link 6 months ago by Axzd ▴ 70

1

Entering edit mode

You can't get to that level of thinking in a single step. One spends considerable time at each level before they start thinking at the next higher level. Once you solve variations of a problem a whole bunch of times, you end up writing a way to solve that entire family of problems out of sheer annoyance. You'll get there, just keep thinking "there has to be a better way"

ADD REPLY • link 6 months ago by Ram 43k

0

Entering edit mode

Thank you, appreciate it.

ADD REPLY • link 6 months ago by Axzd ▴ 70

Ram · Answer 5 · 2023-10-12

0

Entering edit mode

6 months ago

gonzlezb ▴ 10

I don't know a faster way to do it, but I guess this will do the job:

#!/bin/bash

input_file="your_file.vcf"

Do the analysis for each sample
awk '$10 ~ "1/1"' "$input_file" > ancestor.txt
awk '$11 ~ "1/1"' "$input_file" > D2C150g.txt

.....

awk '$NF ~ "1/1"' "$input_file" > last_sample

Then you save this file as your_script.sh and then you just type in the terminal

$ bash your_script.sh your_data.vcf

and you will have a separate text file for each of your samples with all the sites where there is 1/1 genotype.You can re use this script to any of your vcfs.

Now if you want to keep the same file and to exclude all the lines that do not have any 1/1 genotype, you can just do grep -E '1/1' your_file > filtered. Let me know if this can help you or not.

ADD COMMENT • link updated 6 months ago by Ram 43k • written 6 months ago by gonzlezb ▴ 10

1

Entering edit mode

This is a bad approach even if it will "do the job". You're writing VCF lines to output files with zero context AND having awk read through the file multiple times plus outlining an approach that will work only on uncompressed data.

ADD REPLY • link 6 months ago by Ram 43k

0

Entering edit mode

You can re use this script to any of your vcfs.

As long as the exact same set of samples is found at the exact same locations, sure. This statement is potentially misleading to new users. Even an added step like a loop with an index variable that adds 9 to the awk column and indexes the output to bcftools query -l to use as the output file name would have made it compatible with other VCFs but the code right now is specific to this file alone.

ADD REPLY • link 6 months ago by Ram 43k

0

Entering edit mode

OP says "I want to find for each sample the sites that are 1/1, only in that sample". Your post says "if you want to keep the same file and to exclude all the lines that do not have any 1/1 genotype, you can just do grep -E '1/1' your_file > filtered". That does not address OP's requirements.

ADD REPLY • link 6 months ago by Ram 43k

0

Entering edit mode

I'm cleaning up the comment section here. I went overboard and crossed boundaries with some criticism and I apologize to gonzlezb for that. I still think this approach is heavily flawed and would not follow or recommend it.

ADD REPLY • link 6 months ago by Ram 43k