Merging VCF files into one
3
0
Entering edit mode
4.4 years ago

Hi everyone! I have 3 vcf files that I have to merge into one single vcf and then convert to binary bed format using plink, because I need to run a paternity test. I have all my files in a directory in HPC, in which all the softwares that I would need (like GATK or Plink and then King for relatedness inference) are already installed.

Let's say that I have 3 vcf files: proband.vcf.gz, mother.vcf.gz, father.vcf.gz. First of all, should I gunzip them? Now, I don't have anything else in my directory other than these 3 files. What are the steps to merge them into one and then convert them? Do I need other files that I have to download to do this?

This is the only command that I'm sure of, after merging the vcfs.

plink --vcf file.vcf --biallelic-only strict --double-id --vcf-filter --make-bed --out outfile

Thank you in advance.

genome sequencing • 5.7k views
ADD COMMENT
0
Entering edit mode

You may use vcftools function vcf-merge. It should work with zipped files.

ADD REPLY
0
Entering edit mode

VCFtools is more or less deprecated. People should move to BCFtools.

ADD REPLY
0
Entering edit mode
vcf-merge EPBL-0004_0200356071.filtered.vcf.gz EPBL-0005_0200356081.filtered.vcf.gz EPBL-0006_0200356091.filtered.vcf.gz | gunzip -c > all.vcf

This is what I used. But I get an error that tells me 'tabix: commad not found' What is tabix? I didn't find anything in the vcftools manual.

ADD REPLY
1
Entering edit mode
4.4 years ago

Please look up bcftools merge

Prior to merging, you may want to split multi-allelic records via:

bcftools norm -m-any ...

Kevin

ADD COMMENT
0
Entering edit mode

What do the three dots stand for?

ADD REPLY
0
Entering edit mode

Your files

ADD REPLY
0
Entering edit mode

So I tried with vcf-merge and bcftools, but since I've never worked with vcfs I don't know what a multi-vcf should look like. Could you please help me?

This is the output using vcf-merge. Looks like some fields aren't in the right place. vcf-merge output

And this is the output with bcftools.

bcftools output Doesn't look bad but those ./.:.:., should I remove them?

Also, next step of my work is to run a relatedness inference on the vcf file using King. I have to convert it to binary using plink, like this:

plink --vcf all.vcf --biallelic-only strict --double-id --vcf-filter --make-bed --out 4_merged_family

And then I have to run King, to check if the father is the real father.

king -b 4_merged_family.bed --related --degree 2

Should I modify my vcf file in some way to do all of this? Or the vcf output after the merging should work fine?

ADD REPLY
0
Entering edit mode

It seems that you are viewing your VCFs in Excel or LibreOffice. Just to confirm: the VCF specification is tab-delimited, so, if you need to view a VCF in Excel, ensure that only tab is set as your delimiter.

You should not worry too much about calls like this: ./.. These are missing calls, i.e., there was no information in the data to call any variant for that particular individual at that particular position.

Just to clarify these points, first.

ADD REPLY
0
Entering edit mode

yes of course, I was just viewing them, also so I can post them easily (I'm viewing them only tab delimited). I'm working form terminal, but since there are many columns, the lines are overlapping. Yeah, I figured out they are missing calls. So can I leave them? They won't mess with the running of the other softwares?

So the output of bcftools is the right one, isn't it? And for the other procedures, are they correct? As I asked you above, should I modify my vcf file in some way to run them? Thank you for your patience.

ADD REPLY
0
Entering edit mode

They won't mess with the running of the other softwares?

I am not to know the answer to this. It depends on how each program / software is coded.

Also, leaving 'missing' genotypes in your data is a study design choice that has to be made by you.

############

The output of bcftools in your Excel worksheet is not correct - there are extra columns that should not be there. However, again, this could merely be a formatting issue with Excel. Take a look at the VCF in VIM and check the tab-spacing.

You could also let me know how you produced these VCFs(?)

ADD REPLY
0
Entering edit mode

Blood samples have been sequenced using Illumina sequencers and whole exome sequencing technology. Sequences have been aligned and variants called using GATK pipeline.

Do you have knowledge about the KING software for relatedness inference? I'm supposed to use it after the merging step and I should get an out put like this.

FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    HetConc HomIBS0 Kinship IBD1Seg IBD2Seg PropIBD InfType Error
Y001    NA18484 NA18486 18250   0.000   0.2500  0.2324  0.0002  0.3368  0.0006  0.2515  0.9750  0.0000  0.4875  PO      0
Y001    NA18484 NA18488 18249   0.000   0.2500  0.2332  0.0002  0.3379  0.0004  0.2522  1.0000  0.0000  0.5000  PO      0
Y001    NA18486 NA18488 18270   1.000   0.0000  0.2141  0.1053  0.3036  0.2460  0.0039  0.0000  0.0000  0.0000  UN      0

In which it tells me clearly the inferred relationship type: PO (parent of offspring (I guess)) UN (unrelated)

Instead it doesn't give anything and I get and error saying: Segmentation fault

ADD REPLY
0
Entering edit mode

Hey, segmentation faults are almost always (from my experience) caused by you not having enough RAM / memory. Where are you running the command (local computer?)?

ADD REPLY
0
Entering edit mode

Yeah i figure that out, and I don't know why I get this kind of error. I'm running it on the HPC (32 core) of the hospital.

ADD REPLY
0
Entering edit mode

I see. In that case, it could be some bad formatting in your input data? Some programs are designed poorly and will not issue useful error messages.

ADD REPLY
1
Entering edit mode

Thank you Kevin, I really appreciate your patience. It really was a format issue. I did the joint calling on the vcfs and created a unique file containing all of them. Let's see how it works.

ADD REPLY

Login before adding your answer.

Traffic: 3296 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6