How to remove samples with no genotypes from vcf
5
0
Entering edit mode
5 months ago
moxu ▴ 500

My vcf file has many samples with no genotypes, i.e. all genotypes are missing ('./.'). How to remove such samples?

Thanks.

VCF • 698 views
ADD COMMENT
1
Entering edit mode
5 months ago
4galaxy77 ★ 1.2k

This will give you the samples with all missing genotypes (not tested).

plink2 --vcf in.vcf --missing -out missing
awk ' $3 == 0 ' missing.smiss | tail -n +2 > missing_samples.txt
bcftools view -S ^missing_samples.txt in.vcf > out.vcf
ADD COMMENT
1
Entering edit mode
5 months ago
pbpanigrahi ▴ 410

First separate header and data

grep -vP  '^#' input.vcf > data.vcf
grep -P  '^#' input.vcf > header.vcf

Use R to remove . Open R terminal. Move to the direcotry in which data.vcf is there and run below

df = read.table("data.vcf", sep="\t")

# till format column
df1 = df[,1:9]
# sample columns
df2 = df[,10:ncol(df)]
# remove samples with ./.
df2 = df2[,-which(apply(df2,2,function(x){sum(x=="./.")==length(x)})), drop = F]
# rejoin with data till format
df = cbind(df1,df2)
# export
write.table(df, file = "temp2.vcf", col.names = F, row.names = F, quote = F,sep="\t", eol = "\n")

Rejoin

cat header.vcf > final.vcf
cat temp2.vcf >> final.vcf

Above solution will work if

  • Multiple sample present in vcf file
  • At least one sample have non ./. entry
ADD COMMENT
0
Entering edit mode
5 months ago

not tested, a one-liner, use bcftools stats to count the number of called genotypes per samples, use those samples to select the called samples with bcftools view.

bcftools view --samples `bcftools stats -s '-' in.vcf.gz | awk '$1=="PSC" && (int($4)+int($5)+int($6) > 0) {print $3}' | paste -s -d',' ` in.vcf.gz
ADD COMMENT
0
Entering edit mode
5 months ago
sbstevenlee ▴ 250

If you are a Python user, you may want to check out the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'Steven': ['0/1', '1/1'],
...     'Rachel': ['./.', './.']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data) # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven Rachel
0  chr1  100  .   G   A    .      .    .     GT    0/1    ./.
1  chr2  101  .   T   C    .      .    .     GT    1/1    ./.
>>> f = lambda r: r[9:].apply(pyvcf.gt_miss)
>>> s = vf.df.apply(f, axis=1).all()
>>> l = s[s == False].index.to_list()
>>> filtered_vf = vf.subset(l)
>>> #filtered_vf.to_file('filtered.vcf')
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Steven
0  chr1  100  .   G   A    .      .    .     GT    0/1
1  chr2  101  .   T   C    .      .    .     GT    1/1
ADD COMMENT
0
Entering edit mode
5 months ago
sbstevenlee ▴ 250

FYI, your question motivated me to write the pyvcf.VcfFrame.cfilter_empty method for the v0.11.0 version.

ADD COMMENT

Login before adding your answer.

Traffic: 2723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6