Filter out related individuals from a vcf file using Plink or Hail
1
0
Entering edit mode
6.4 years ago
ATCG ▴ 380

I have a data set for 11,000 individuals and would like to filter out related individuals and also based on a phenotype.

I have a vcf file format and prefer to work with Plink, Hail or vcftools?

Hail Plink VCF • 3.9k views
ADD COMMENT
0
Entering edit mode

Kevin thank you very much, however; what I want to do is to: 1.Remove related individuals 2.Remove individuals with a specific phenotype for which I have a txt file. If I understand correctly your code changes vcf to plink files only right? How can I filter out individuals as in 1,2.? Also, I don't have families I would be grateful four your advice.

ADD REPLY
0
Entering edit mode

Hello, it would have been better to have replied as a comment to my answer / Mejor hubiera sido escrito su comentario a mi respuesta encima.

You mentioned Plink in your question, so I presumed that you already knew how to use Plink. If you get your dataset in Plink format (with my code), you can then identify related individuals by following the procedures here: http://zzz.bwh.harvard.edu/plink/strat.shtml

For removing (or keeping) certain individuals, take a look here: https://www.cog-genomics.org/plink/1.9/filter


If you want to avoid Plink, just use BCFtools to remove individuals (if you already know who you want to keep/remove):

/Programs/bcftools-1.3.1/bcftools view -Ov --samples-file SamplesToKeep.list MyData.vcf.gz > MyData.Filtered.vcf

SamplesToKeep.list just contains a single column of IDs that you want to keep

ADD REPLY
1
Entering edit mode
6.4 years ago

You have 11,000 individuals in a VCF file? - that's impressive. VCFTools does have a relatedness function that it calculates pairwise for all samples. It generally seems to work fine, as I tried it on data that contained monozygotic twins once; however, in the context of false-negatives and false-positives that plague NGS, not even biologicaly monozygotic twins look the same after their DNA has passed through a 'next gen' sequencer and analysis pipeline.

I would favour getting the data into PLINK Format, which provides analyses across all sorts of things including relatedness. To do that, first gzip and tab-index your VCF and then normalise it:

#1st pipe, splits multi-allelic calls into separate variant calls
#2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
/Programs/bcftools-1.3.1/bcftools norm -m-any My.gz.vcf | /Programs/bcftools-1.3.1/bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > My.Norm.bcf ;

This will keep the VCF ID field as it is (assuming it's rs ID?). If you want to set it to something unique, use:

/Programs/bcftools-1.3.1/bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT'

To read this into PLINK:

/Programs/plink1.90/plink --noweb --bcf My.Norm.bcf --keep-allele-order --indiv-sort file IDsort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyPlinkData ;

It's critical when reading VCF/BCF into PLINK that you fix the order of the samples, because in virtually all cases PLINK will not order the samples as they appear in the VCF/BCF. You do this by specifying --indiv-sort file IDsort.list. IDsort.list contains a listing of FID and IID. It's easier to keep FID as 0 here:

0 16422

0 16427

0 16451

0 17412

0 17422

0 17413

If you have families in your data and want to do family-specific analysis, you'll have to reheader your VCF/BCF to have sample IDs as FID_IID, change --const-fid above to --id-delim _, and also modify the IDsort file:

Family1 16422

Family1 16427

Family2 16451

Family2 17412

Family3 17422

Family3 17413

Based on your ID sort file, you should also create a custom FAM file in which the samples are ordered the same. When you read data from VCF/BCF into PLINK, it won't know he gender or case control status.

In all analyses, you then just include your custom FAM as follows:

#Perform family TDT comparisons
/Programs/plink1.90/plink --noweb --bfile MyPlinkData --tdt --model mperm=500 --ci 0.95 --fam Custom.fam ;

Other people will hopefully have other suggestions.

ADD COMMENT

Login before adding your answer.

Traffic: 1919 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6