extracting only one sex from gnomad genome v.3.1 file
0
0
Entering edit mode
2.8 years ago
storm1907 ▴ 30

Hi,

I try to use public gnomad genomes for downstream tests, but the problem is that I only need male controls from these files. The file looks like this (also header, but not shown due to large size)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69511   rs2691305       A       G       .       PASS    AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;AC_non_v2_XX=28734;AN_non_v2_XX=33774;AF_non_v2_XX=0.850773;nhomalt_non_v2_XX=13253;AC_non_cancer_fin_XX=1080;AN_non_cancer_fin_XX=1090;AF_non_cancer_fin_XX=0.990826;nhomalt_non_cancer_fin_XX=537;AC_non_neuro_nfe=32992;AN_non_neuro_nfe=34106;AF_non_neuro_nfe=0.967337;nhomalt_non_neuro_nfe=16243;AC_non_neuro_afr_XY=5275;AN_non_neuro_afr_XY=8862;AF_non_neuro_afr_XY=0.595238;nhomalt_non_neuro_afr_XY=1908;AC_non_neuro_nfe_XY=13529;AN_non_neuro_nfe_XY=13954;AF_non_neuro_nfe_XY=0.969543;nhomalt_non_neuro_nfe_XY=6668;AC_controls_and_biobanks_eas_XY=1209;AN_controls_and_biobanks_eas_XY=1210;AF_controls_and_biobanks_eas_XY=0.999174;nhomalt_controls_and_biobanks_eas_XY=604;AC_non_neuro_sas_XX=623;AN_non_neuro_sas_XX=646;AF_non_neuro_sas_XX=0.964396;nhomalt_non_neuro_sas_XX=304;AC_non_v2=53211;AN_non_v2=62346;AF_non_v2=0.853479;nhomalt_non_v2=24615;AC_non_topmed_nfe_XX=4123;AN_non_topmed_nfe_XX=4274;AF_non_topmed_nfe_XX=0.96467;nhomalt_non_topmed_nfe_XX=2022;AC_non_v2_mid=132;AN_non_v2_mid=146;AF_non_v2_mid=0.90411;nhomalt_non_v2_mid=63;AC_non_topmed_sas=2561;AN_non_topmed_sas=2620;AF_non_topmed_sas=0.977481;nhomalt_non_topmed_sas=1262;AC_non_cancer_eas_XX=1882;AN_non_cancer_eas_XX=1882;AC_nfe=34637;AN_nfe=35806;AF_nfe=0.967352;nhomalt_nfe=17050;AC_popmax=4423;AN_popmax=4424;AF_popmax=0.999774;nhomalt_popmax=2211;faf95_sas=0.945756;faf99_sas=0.93297;faf95_eas=0.975176;faf99_eas=0.965135;faf95_amr=0.876957;faf99_amr=0.869526;faf95_afr=0.586993;faf99_afr=0.583779;faf95=0.840765;faf99=0.838605;faf95_nfe=0.958818;faf99_nfe=0.955301;age_hist_het_bin_freq=123|146|149|217|287|242|200|187|124|72;age_hist_het_n_smaller=460;age_hist_het_n_larger=24;age_hist_hom_bin_freq=569|655|765|1405|2028|1746|1691|1518|1005|654;age_hist_hom_n_smaller=1206;age_hist_hom_n_larger=215;FS=0;MQ=41.6849;MQRankSum=-3.98;QD=26.0414;ReadPosRankSum=0.51;VarDP=1761397;QUALapprox=45869299;AS_FS=0;AS_MQ=41.6847;AS_MQRankSum=-3.972;AS_pab_max=1;AS_QD=26.0413;AS_ReadPosRankSum=0.51;AS_SOR=0.865485;InbreedingCoeff=0.521183;AS_SB_TABLE=99037,96244|856526,709549;AS_VQSLOD=0.4266;AS_culprit=AS_MQRankSum;NEGATIVE_TRAIN_SITE;allele_type=snv;n_alt_alleles=2;variant_type=multi-snv;segdup;gq_hist_alt_bin_freq=0|0|0|0|39|89|5016|5173|2494|4247|3343|1353|2059|1764|858|1286|1197|599|785|7691;gq_hist_all_bin_freq=0|0|0|0|2833|660|5232|5237|2506|4249|3343|1354|2060|1764|859|1287|1197|599|785|7691;dp_hist_alt_bin_freq=0|0|12706|9449|5594|4053|2707|1532|828|486|262|166|89|45|36|12|6|7|2|3;dp_hist_alt_n_larger=10;dp_hist_all_bin_freq=0|0|13838|11344|6042|4209|2735|1535|828|487|262|166|89|45|36|12|6|7|2|3;dp_hist_all_n_smaller=0;dp_hist_all_n_larger=10;ab_hist_alt_bin_freq=0|0|0|0|193|367|513|688|876|676|772|504|439|263|124|56|25|8|0|0;cadd_raw_score=0.112916;cadd_phred=2.209;revel_score=0.053;splice_ai_max_ds=0.02;splice_ai_consequence=donor_gain;primate_ai_score=0.632586;vep=G|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding|1/1||ENST00000335137.4:c.421A>G|ENSP00000334393.3:p.Thr141Ala|457|421|141|T/A|Aca/Gca|1||1|SNV||HGNC|HGNC:14825|YES||P1|CCDS30547.1|ENSP00000334393|||||tolerated(0.820)|benign(0.000)|Gene3D:1&Pfam:PF13853&PROSITE_profiles:PS50262&Superfamily:SSF81321&Transmembrane_helices:TMhelix&CDD:cd15226|||||||||,G|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding|3/3||ENST00000641515.2:c.484A>G|ENSP00000493376.2:p.Thr162Ala|544|484|162|T/A|Aca/Gca|1||1|SNV||HGNC|HGNC:14825|||||ENSP00000493376|||||tolerated(0.850)|benign(0.000)|Transmembrane_helices:TMhelix&CDD:cd15226&PANTHER:PTHR26451&PANTHER:PTHR26451&Pfam:PF13853&PROSITE_profiles:PS50262&Gene3D:1&Superfamily:SSF81321|||||||||,G|missense_variant|MODERATE|OR4F5|79501|Transcript|NM_001005484.1|protein_coding|1/1||NM_001005484.1:c.421A>G|NP_001005484.1:p.Thr141Ala|421|421|141|T/A|Aca/Gca|1||1|SNV||EntrezGene|HGNC:14825|YES||||NP_001005484.1|||||tolerated(0.820)|benign(0.000)||||||||||

I have to remove all parts of file that contains _XX in the end ( need to do test against men control, so all women data should be removed from control file)

I tried bcftools query bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO/AC_nfe_XY\t%INFO/AN_nfe_XY\n' $file > $outfile

but got Error: no such tag defined in the VCF header: INFO/_XY

Also bcftools query removes file header, and to save all men control numbers, I have to write script row for three meters.

Is there some easier and smarter way to remove female controls from this file without damaging file structure?

I also tried to modify this MIT sample code to take only controls related with men, but without success. https://github.com/mhguo1/TRAPD/blob/master/code/count_controls.py

Thank you!

gnomad • 1.1k views
ADD COMMENT
0
Entering edit mode

works on my machine (1.12):

$ wget -q -O - "https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr1.vcf.bgz" | bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO/AC_nfe_XY\t%INFO/AN_nfe_XY\n'  | head
chr1    10031   .   T   C   .   AC0;AS_VQSR 0   10548
chr1    10037   .   T   C   .   AS_VQSR 0   14172
chr1    10043   .   T   C   .   AS_VQSR 0   15992
chr1    10055   .   T   C   .   AS_VQSR 0   18072
chr1    10057   rs1570391741    A   C   .   AS_VQSR 0   21806
chr1    10061   .   T   C   .   AC0 0   20726
chr1    10061   .   T   TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC.   AC0 0   20726
chr1    10064   .   C   CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAA .   AC0 0   28238
chr1    10067   rs1489251879    T   TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC  .   PASS    0   22630
chr1    10108   .   C   CA  .   AS_VQSR 0   2088
ADD REPLY
0
Entering edit mode

Thank you!

Also the problem is that bcftools removes names of controls, but actually I need to save them, so there is no additional tabs and new lines, but only "XX=" has to be removed

ADD REPLY
0
Entering edit mode

bcftools removes names of controls,

what is 'controls' ? the VCF header ? why don't you call bcftools view --header-only before a second call to bcftools query ?

ADD REPLY
0
Entering edit mode

ok, thank you! The command should remove all characters between ; symbols,( _XX )

So that from

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69511   rs2691305       A       G       .       PASS    AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;AC_non_v2_XX=28734;AN_non_v2_XX=33774;AF_non_v2_XX=0.850773;nhomalt_non_v2_XX=13253;

stays only

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    69511   rs2691305       A       G       .       PASS    AC=70482;AN=83312;AF=0.846001;popmax=eas;faf95_popmax=0.975176;

I also try to implement sed commands, but no success

s/[^ ]*_XX[^ ]*//ig

| sed -e 's/[^ ]*_XX[^ ]*//ig'

sed '/^_XX/s//'


 sed 's/*XX//' 
ADD REPLY
0
Entering edit mode

it's not clear how the command above is related to the bcftools query....

you want bcftools annotate --rename-annots file.txt -x '^INFO/....

ADD REPLY
0
Entering edit mode

Maybe bcftools are not necessary in this case

there are a lot of female controls in gnomad file, so maybe bcftools with defining INFo field is not the smartest approach

ADD REPLY

Login before adding your answer.

Traffic: 2929 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6