Question: Merging VCF files into one
0
gravatar for sara.aglantarek
6 months ago by
sara.aglantarek10 wrote:

Hi everyone! I have 3 vcf files that I have to merge into one single vcf and then convert to binary bed format using plink, because I need to run a paternity test. I have all my files in a directory in HPC, in which all the softwares that I would need (like GATK or Plink and then King for relatedness inference) are already installed.

Let's say that I have 3 vcf files: proband.vcf.gz, mother.vcf.gz, father.vcf.gz. First of all, should I gunzip them? Now, I don't have anything else in my directory other than these 3 files. What are the steps to merge them into one and then convert them? Do I need other files that I have to download to do this?

This is the only command that I'm sure of, after merging the vcfs.

plink --vcf file.vcf --biallelic-only strict --double-id --vcf-filter --make-bed --out outfile

Thank you in advance.

sequencing genome • 387 views
ADD COMMENTlink modified 6 months ago by Kevin Blighe61k • written 6 months ago by sara.aglantarek10

You may use vcftools function vcf-merge. It should work with zipped files.

ADD REPLYlink written 6 months ago by zubenel90

VCFtools is more or less deprecated. People should move to BCFtools.

ADD REPLYlink written 6 months ago by Kevin Blighe61k
vcf-merge EPBL-0004_0200356071.filtered.vcf.gz EPBL-0005_0200356081.filtered.vcf.gz EPBL-0006_0200356091.filtered.vcf.gz | gunzip -c > all.vcf

This is what I used. But I get an error that tells me 'tabix: commad not found' What is tabix? I didn't find anything in the vcftools manual.

ADD REPLYlink modified 6 months ago • written 6 months ago by sara.aglantarek10
1
gravatar for Kevin Blighe
6 months ago by
Kevin Blighe61k
University College London
Kevin Blighe61k wrote:

Please look up bcftools merge

Prior to merging, you may want to split multi-allelic records via:

bcftools norm -m-any ...

Kevin

ADD COMMENTlink written 6 months ago by Kevin Blighe61k

What do the three dots stand for?

ADD REPLYlink written 6 months ago by sara.aglantarek10

Your files

ADD REPLYlink written 6 months ago by Kevin Blighe61k

So I tried with vcf-merge and bcftools, but since I've never worked with vcfs I don't know what a multi-vcf should look like. Could you please help me?

This is the output using vcf-merge. Looks like some fields aren't in the right place. vcf-merge output

And this is the output with bcftools.

bcftools output Doesn't look bad but those ./.:.:., should I remove them?

Also, next step of my work is to run a relatedness inference on the vcf file using King. I have to convert it to binary using plink, like this:

plink --vcf all.vcf --biallelic-only strict --double-id --vcf-filter --make-bed --out 4_merged_family

And then I have to run King, to check if the father is the real father.

king -b 4_merged_family.bed --related --degree 2

Should I modify my vcf file in some way to do all of this? Or the vcf output after the merging should work fine?

ADD REPLYlink written 6 months ago by sara.aglantarek10

It seems that you are viewing your VCFs in Excel or LibreOffice. Just to confirm: the VCF specification is tab-delimited, so, if you need to view a VCF in Excel, ensure that only tab is set as your delimiter.

You should not worry too much about calls like this: ./.. These are missing calls, i.e., there was no information in the data to call any variant for that particular individual at that particular position.

Just to clarify these points, first.

ADD REPLYlink written 6 months ago by Kevin Blighe61k

yes of course, I was just viewing them, also so I can post them easily (I'm viewing them only tab delimited). I'm working form terminal, but since there are many columns, the lines are overlapping. Yeah, I figured out they are missing calls. So can I leave them? They won't mess with the running of the other softwares?

So the output of bcftools is the right one, isn't it? And for the other procedures, are they correct? As I asked you above, should I modify my vcf file in some way to run them? Thank you for your patience.

ADD REPLYlink written 6 months ago by sara.aglantarek10

They won't mess with the running of the other softwares?

I am not to know the answer to this. It depends on how each program / software is coded.

Also, leaving 'missing' genotypes in your data is a study design choice that has to be made by you.

############

The output of bcftools in your Excel worksheet is not correct - there are extra columns that should not be there. However, again, this could merely be a formatting issue with Excel. Take a look at the VCF in VIM and check the tab-spacing.

You could also let me know how you produced these VCFs(?)

ADD REPLYlink modified 6 months ago • written 6 months ago by Kevin Blighe61k

Blood samples have been sequenced using Illumina sequencers and whole exome sequencing technology. Sequences have been aligned and variants called using GATK pipeline.

Do you have knowledge about the KING software for relatedness inference? I'm supposed to use it after the merging step and I should get an out put like this.

FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    HetConc HomIBS0 Kinship IBD1Seg IBD2Seg PropIBD InfType Error
Y001    NA18484 NA18486 18250   0.000   0.2500  0.2324  0.0002  0.3368  0.0006  0.2515  0.9750  0.0000  0.4875  PO      0
Y001    NA18484 NA18488 18249   0.000   0.2500  0.2332  0.0002  0.3379  0.0004  0.2522  1.0000  0.0000  0.5000  PO      0
Y001    NA18486 NA18488 18270   1.000   0.0000  0.2141  0.1053  0.3036  0.2460  0.0039  0.0000  0.0000  0.0000  UN      0

In which it tells me clearly the inferred relationship type: PO (parent of offspring (I guess)) UN (unrelated)

Instead it doesn't give anything and I get and error saying: Segmentation fault

ADD REPLYlink modified 5 months ago by Kevin Blighe61k • written 5 months ago by sara.aglantarek10

Hey, segmentation faults are almost always (from my experience) caused by you not having enough RAM / memory. Where are you running the command (local computer?)?

ADD REPLYlink written 5 months ago by Kevin Blighe61k

Yeah i figure that out, and I don't know why I get this kind of error. I'm running it on the HPC (32 core) of the hospital.

ADD REPLYlink written 5 months ago by sara.aglantarek10

I see. In that case, it could be some bad formatting in your input data? Some programs are designed poorly and will not issue useful error messages.

ADD REPLYlink written 5 months ago by Kevin Blighe61k
1

Thank you Kevin, I really appreciate your patience. It really was a format issue. I did the joint calling on the vcfs and created a unique file containing all of them. Let's see how it works.

ADD REPLYlink written 5 months ago by sara.aglantarek10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1207 users visited in the last hour