Question

Using PLINK to filter VCF files

2

Entering edit mode

7.8 years ago

Diego.Morales ▴ 30

Hello,

I am trying to filter a VCF file based on the R2 value that each SNP has. I would like to only keep the entries which have a R2 >= 0.3.

Are these the right commands/procedure?

plink --vcf dataset.dose.vcf ----recode --out dataset.dose #(vcf to map/ped)

plink --file dataset.dose  --indep-pairwise  1345835 5 0.3 --out dataset.dose #(to get a list of SNPs R2 >= 0.3)

plink --file dataset.dose --extract plink.prune.in --make-bed --out pruneddata #(to perform the extraction of the SNPs and convert map/ped files to bed/bim/fam)

Diego

GWAS PLINK vcf • 8.7k views

ADD COMMENT • link updated 2.8 years ago by Kevin Blighe 87k • written 7.8 years ago by Diego.Morales ▴ 30

score 4 · Answer 1 · 2019-03-21

4

Entering edit mode

5.1 years ago

Kevin Blighe 87k

Just be careful because PLINK assumes that you want to exclude the variants with r-squared >0.3, i.e., the variants that are in linkage disequilibrium. As your intention is to include these variants, you should be extracting variants from the plink.prune.out file.

So, 2 possible ways to get what you want:

--extract plink.prune.out
--exclude plink.prune.in

From the manual:

Variant pruning

--indep <window size="">['kb'] <step size="" (variant="" ct)>="" <vif="" threshold="">

--indep-pairwise <window size="">['kb'] <step size="" (variant="" ct)>="" <r^2="" threshold="">

--indep-pairphase <window size="">['kb'] <step size="" (variant="" ct)>="" <r^2="" threshold="">

These commands produce a pruned subset of markers that are in approximate linkage equilibrium with each other, writing the IDs to plink.prune.in (and the IDs of all excluded variants to plink.prune.out).

[source: https://www.cog-genomics.org/plink/1.9/ld]

Also, your window size of 1345835 seems very random.

Kevin

ADD COMMENT • link 5.1 years ago by Kevin Blighe 87k

1

Entering edit mode

Kevin Blighe any suggestions for running --indep-pairwise with unmapped SNPs? These are genotype by sequencing data and there is not a reference genome available for this species. Wondering what the best approach would be in this scenario in terms of parameter selection.

ADD REPLY • link 2.9 years ago by Gary Longo • 0

0

Entering edit mode

Can't really do that because, by its very nature, it needs positional information - it is assumed that variants are inherited throughout generations through 'haploblocks' [chunks of DNA], that give rise to the phenomena of linkage.

In which format is your input data?

ADD REPLY • link 2.9 years ago by Kevin Blighe 87k

0

Entering edit mode

That makes sense. The input data are in VCF format. They are RADseq data that have already been thinned to a single SNP from each marker. The concern is that some of these RADseq loci may be co-located on haploblocks but there is no real way to know other than maybe aligning to a closely related genome.

ADD REPLY • link 2.9 years ago by Gary Longo • 0

0

Entering edit mode

Sounds like it could become a bit of a 'rabbit hole' analysis. Not sure what data you have, exactly, but perhaps a synteny analysis may help if you just have sequences of your species of interest and no positional information. By synteny, you could infer positional information of your sequences in other defined species:

For example, take a look: https://rest.ensembl.org/documentation/info/genomic_alignment_region [main Ensembl REST page: https://rest.ensembl.org/]

I did this previously for mapping between Gallus gallus and Homo sapiens, with moderate success.

Should you not do genome assembly and then go from there to obtain positional info? - again, not sure which data it is that you have.

ADD REPLY • link 2.9 years ago by Kevin Blighe 87k

0

Entering edit mode

There is a closely related genome to our taxa of interest so I can give that a go.

Basically, we have around 10k SNPs derived from RADseq data (thinned to a single SNP per marker) and we want to LD thin these loci before running in STRUCTURE, where correlated markers can bias results. However it seems that a lot of population genetic studies using RADseq don't take this step (some of my previous work included). I was hoping there was a simple r^2 test to thin these unmapped SNPs but it seems like this is an inappropriate approach to the solution.

Thanks for the insight Kevin Blighe , much appreciated.

ADD REPLY • link 2.8 years ago by Gary Longo • 0

0

Entering edit mode

I see, molecular ecology? Questions on this topic are not common here, as you can infer. The synteny approach may be novel in your field, which does equally make that somewhat interesting. Other than that, there should not, technically, be anything stopping you from generating r^2 for each pairwise combination of markers from the VCF. Quite a few programs calculate r^2 for LD pruning, as you probably know, and I imagine that you'll find one that doesn't technically require positional info, or at least one with which you can instruct it to just calculate LD across all of your SNPs by setting a large window size. Im not sure how CHROM and BP are encoded in the VCF from RADseq.

ADD REPLY • link 2.8 years ago by Kevin Blighe 87k