plink produces does pruning according to log file but prune.in is full of dots
5
6
Entering edit mode
5.4 years ago

Hi,

I'm running a big SNP database (160GB vcf file) I pruned with plink;

plink --vcf SNPs_clean1.vcf --allow-extra-chr --indep-pairwise 50 10 0.1 --out SNP_50_10_01

It starts running producing the temporary files and the output files prune.in and prune.out). In the log file it shows having filtered the variants, 49595998 of 57089576 variants removed according to log file attached below. When I count the number of lines of the prune.in file, it does match the number of variants removed.

The problem is that when I open the file it's just a bunch of dots (one dot per line). The same happens for the prune.out file. I've looked for similar errors in this page but I've only seen blank prune.in and prune.out files so I hope this isn't a repeated question.

Thanks in advance for your help, Ana

Log file:

PLINK v1.90b5.3 64-bit (21 Feb 2018)

Options in effect:

  --allow-extra-chr
  --indep-pairwise 50 10 0.1
  --out SNP_50_10_01
  --vcf SNPs_camsud_clean1.vcf

Hostname: XXXXXX
Working directory: /home/aagapito/POTW/mapeo_vicugna/cardiff
Start time: Tue Nov 13 08:23:34 2018

Random number seed: 1542108214
257764 MB RAM detected; reserving 128882 MB for main workspace.
--vcf: SNP_50_10_01-temporary.bed + SNP_50_10_01-temporary.bim +
SNP_50_10_01-temporary.fam written.

57089576 variants loaded from .bim file.
56 people (0 males, 0 females, 56 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SNP_50_10_01.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 56 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.976739.
57089576 variants and 56 people pass filters and QC.
Note: No phenotypes present.
Pruned 532 variants from chromosome 27, leaving 67.
Pruned 438334 variants from chromosome 28, leaving 68336.
Pruned 3957 variants from chromosome 29, leaving 364.
Pruned 281964 variants from chromosome 30, leaving 41240.
Pruned 247860 variants from chromosome 31, leaving 38534.
Pruned 201756 variants from chromosome 32, leaving 29178.
(the list goes on...)
**Pruning complete.  49595998 of 57089576 variants removed.**
Marker lists written to SNP_50_10_01.prune.in and SNP_50_10_01.prune.out .
End time: Tue Nov 13 08:32:21 2018
SNP software error genome plink • 9.6k views
ADD COMMENT
6
Entering edit mode
5.4 years ago

This is because the .prune.in and .prune.out files contain variant IDs, but your VCF has all variant IDs set to ‘.’.

plink 2.0’s --set-all-var-ids flag provides one way to assign IDs.

ADD COMMENT
5
Entering edit mode
5.1 years ago
jena ▴ 290

I wrote a script to solve exactly this, you can find it here as a gist. In the script I also explain what's going on.

ADD COMMENT
0
Entering edit mode

I was thinking about this problem few days ago and realized it could probably be solved with a simple awk script. After a little bit of playing around and short testing, I found this oneliner to do the trick:

awk 'BEGIN{OFS="\t"} !/#/ {sub(/\./, $1"_"$2, $3)}1' input.vcf > annotated_input.vcf

However I tested it only briefly, so use it with caution. If you find some issue with this oneliner, please let me know, either here or by commenting under the gist on github.

The oneliner works like this:

  • BEGIN{OFS="\t"} = output will be tab-delimited
  • !/#/ = skip comment lines containing # (i.e. print them unmodified)
  • {sub(/\./, $1"_"$2, $3)} = substitute . in third column ($3, i.e. the ID column) with combination of $1 and $2 (i.e. CHROM and POS), separated by "_"
  • 1 = This condition (always true) prints all lines, modified or not.

FYI: Both the original script and this oneliner assume that your variants have unique positions, i.e. you don't have the same CHROM & POS combo on more than one row (this could be violated e.g. after decomposing complex variants). If your VCF does have this issue, the scripts would need to be updated (one way would be to merge not just CHROM & POS columns, but to include also e.g. ALT column, so the ID would be CHROM_POS_ALT). Again, if you need that, let me know and I can show how to do that (if it's not clear from the code itself).

ADD REPLY
2
Entering edit mode
5.4 years ago
YocelynGG ▴ 70

I solved the same problem on this way:

1. First, I obtained the list of variants in linkage desequilibrium (r2 coefficient)

plink --noweb --vcf file.vcf --no-sex --maf 0.05 --recode --allow-extra-chr --r2 --ld-window-kb 1 --ld-window 1000 --ld-window-r2 0 --out file_vcf_ld

With the output.ld, I filtered the last column (r2 coefficient >= 0.7).

awk '{if($NF>=0.7) print$0}' file_vcf_ld.ld > filter_snp_ld.txt

Create a list with those positions (based on column 1, 2, 4 and 5)

2. Remove those positions from the original file.vcf

grep -Fwvf filter_snp_ld.txt file.vcf > file_filter_ld.vcf

(It may take a long time)

3. With the new file.vcf perform a second filter based on Minor Allele Frequency (--maf)

plink --noweb --vcf file_filter_ld.vcf --allow-extra-chr --no-sex --maf 0.05 --out new_file_filtered_ld_maf --make-bed

I hope this solution could help you

ADD COMMENT
0
Entering edit mode
2.2 years ago
jena ▴ 290

I just noticed in the bcftools docs a parameter that I haven't noticed before: bcftools annotate --set-id. They include the following example:

bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf

This will set IDs for all variants that have missing ID (+). You could also reset all IDs by ommiting the +.

See more here: https://samtools.github.io/bcftools/bcftools.html#annotate

ADD COMMENT
0
Entering edit mode
12 months ago
Afrooz • 0

For plink2 you can use --set-missing-var-ids @:#\$r:\$a this will set the variant id to Chr:positionRef:Alt

plink2 --vcf yourVCF --double-id --allow-extra-chr --indep-pairwise 50 10 0.1 --set-missing-var-ids @:#\$r:\$a --new-id-max-allele-len 75 --out output

ADD COMMENT

Login before adding your answer.

Traffic: 1795 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6