Question

How to remove samples with no genotypes from vcf

0

Entering edit mode

3.1 years ago

moxu ▴ 510

My vcf file has many samples with no genotypes, i.e. all genotypes are missing ('./.'). How to remove such samples?

Thanks.

VCF • 4.6k views

ADD COMMENT • link updated 21 months ago by Vanish007 ▴ 40 • written 3.1 years ago by moxu ▴ 510

1

Entering edit mode

3.1 years ago

pbpanigrahi ▴ 420

First separate header and data

grep -vP  '^#' input.vcf > data.vcf
grep -P  '^#' input.vcf > header.vcf

Use R to remove . Open R terminal. Move to the direcotry in which data.vcf is there and run below

df = read.table("data.vcf", sep="\t")

# till format column
df1 = df[,1:9]
# sample columns
df2 = df[,10:ncol(df)]
# remove samples with ./.
df2 = df2[,-which(apply(df2,2,function(x){sum(x=="./.")==length(x)})), drop = F]
# rejoin with data till format
df = cbind(df1,df2)
# export
write.table(df, file = "temp2.vcf", col.names = F, row.names = F, quote = F,sep="\t", eol = "\n")

Rejoin

cat header.vcf > final.vcf
cat temp2.vcf >> final.vcf

Above solution will work if

Multiple sample present in vcf file
At least one sample have non ./. entry

ADD COMMENT • link 3.1 years ago by pbpanigrahi ▴ 420

0

Entering edit mode

3.1 years ago

Pierre Lindenbaum 163k

not tested, a one-liner, use bcftools stats to count the number of called genotypes per samples, use those samples to select the called samples with bcftools view.

bcftools view --samples `bcftools stats -s '-' in.vcf.gz | awk '$1=="PSC" && (int($4)+int($5)+int($6) > 0) {print $3}' | paste -s -d',' ` in.vcf.gz

ADD COMMENT • link 3.1 years ago by Pierre Lindenbaum 163k

0

Entering edit mode

Hi Pierre, I just tested this myself and received the error:

bcftools/1.16/bin/bcftools: Argument list too long

ADD REPLY • link 21 months ago by Vanish007 ▴ 40

0

Entering edit mode

3.1 years ago

sbstevenlee ▴ 480

If you are a Python user, you may want to check out the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'Steven': ['0/1', '1/1'],
...     'Rachel': ['./.', './.']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data) # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Rachel
0  chr1  100  .   G   A    .      .    .     GT    0/1    ./.
1  chr2  101  .   T   C    .      .    .     GT    1/1    ./.
>>> f = lambda r: r[9:].apply(pyvcf.gt_miss)
>>> s = vf.df.apply(f, axis=1).all()
>>> l = s[s == False].index.to_list()
>>> filtered_vf = vf.subset(l)
>>> #filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1

ADD COMMENT • link 3.1 years ago by sbstevenlee ▴ 480

0

Entering edit mode

3.1 years ago

sbstevenlee ▴ 480

FYI, your question motivated me to write the pyvcf.VcfFrame.cfilter_empty method for the v0.11.0 version.

ADD COMMENT • link 3.1 years ago by sbstevenlee ▴ 480

score 1 · Accepted Answer · 2021-06-06

1

Entering edit mode

3.1 years ago

4galaxy77 2.8k

This will give you the samples with all missing genotypes (not tested).

plink2 --vcf in.vcf --missing -out missing
awk ' $3 == 0 ' missing.smiss | tail -n +2 > missing_samples.txt
bcftools view -S ^missing_samples.txt in.vcf > out.vcf

ADD COMMENT • link 3.1 years ago by 4galaxy77 2.8k

0

Entering edit mode

Hi, I just gave this a try, but missing_samples.txt has nothing in it.

I ran

plink2 --vcf Annot_Filtered.vcf.gz --missing -out missing 
awk ' $3 == 0 ' missing.smiss | tail -n +2 > missing_samples.txt
bcftools view -S ^missing_samples.txt in.vcf > out.vcf

Looking at missing.smiss:

head missining.smiss
#IID    MISSING_CT      OBS_CT  F_MISS
S001    301     16195   0.018586
S002    356     16195   0.0219821
S003    286     16195   0.0176598
S004    290     16195   0.0179068
S005    654     16195   0.0403828
S006    473     16195   0.0292065
S007    661     16195   0.0408151
S008    820     16195   0.0506329
S009    417     16195   0.0257487

Which looks like it's not showing any values missing. However when I take a look at my vcf, there are definitely samples with missing genotypes (./.): enter image description here

ADD REPLY • link 21 months ago by Vanish007 ▴ 40