Question: Merge vcf files with multiple samples into one vcf with all variants
4 days ago, bioinfo89 wrote:

Hi All,

I am working on 1000 Genomes (1000g) data. I have 25 tab-delimited text files, one per population. Each file holds jointly genotyped data, so it contains the genotypes of all samples (~60-120) of that population in VCF layout.

Format of the File:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA19625 NA19703 NA19711 NA19818 NA19835 NA19904 NA19917 NA19922 NA19984 NA20127 NA20278 NA20287 NA20294 NA20299 NA20318 NA20322 NA20336 NA20341 NA20346 NA20356 NA20361 NA19700 NA19704 NA19712 NA19819 NA19900 NA19908 NA19914 NA19920 NA19923 NA19985 NA20281 NA20289 NA20296 NA20314 NA20320 NA20332 NA20339 NA20342 NA20351 NA20357 NA20362 NA19701 NA19707 NA19713 NA19834 NA19901 NA19909 NA19916 NA19921 NA19982 NA20126 NA20276 NA20282 NA20291 NA20298 NA20317 NA20321 NA20334 NA20340 NA20344 NA20355 NA20359 NA20412
chr 1234    .   TT  T   .   .   VRT=2   GT  0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./.

What I want to do is create a single VCF file from all 25 population VCFs. It should list every unique site across the populations, including the sites shared between them, with the samples of all populations as columns.

Format I want:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA19625 NA19703 NA19711 NA19818 NA19835 HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103
chr 1234    .   TT  T   .   .   VRT=2   GT  0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0 ./. 0/0

Is there a way to do this?

Thank you!

modified 4 days ago by WouterDeCoster • written 4 days ago by bioinfo89

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

written 4 days ago by genomax

Thanks @genomax, I will keep that in mind.

written 4 days ago by bioinfo89

Have you already looked into VCFtools? I believe their merging, comparing, and consensus options may be beneficial.
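To make the requested merge concrete, here is a minimal Python sketch of the semantics only: rows are keyed on (CHROM, POS, REF, ALT), sample columns are concatenated in input order, and samples from a file that lacks a site get `./.`. The function names `read_vcf` and `merge_vcfs` are hypothetical, and the sketch assumes tab-separated input, a FORMAT column of just `GT`, and sample names that are unique across files. For real data, a dedicated tool such as `vcf-merge` from VCFtools is the more robust route once the files validate.

```python
FIXED = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def read_vcf(lines):
    """Return (sample_names, {site_key: (fixed_fields, genotypes)})."""
    samples, sites = [], {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        # A site is identified by chromosome, position, REF and ALT.
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        sites[key] = (fields[:9], fields[9:])
    return samples, sites

def merge_vcfs(inputs, missing="./."):
    """Merge several parsed VCFs; absent genotypes are filled with `missing`."""
    parsed = [read_vcf(lines) for lines in inputs]
    all_samples = [name for samples, _ in parsed for name in samples]
    merged = {}
    offset = 0
    for samples, sites in parsed:
        for key, (fixed, genotypes) in sites.items():
            row = merged.setdefault(key, (fixed, [missing] * len(all_samples)))
            row[1][offset:offset + len(genotypes)] = genotypes
        offset += len(samples)
    header = "\t".join(FIXED + all_samples)
    # Note: chromosomes sort as plain strings here (chr10 before chr2).
    body = ["\t".join(fixed + gts) for _, (fixed, gts) in sorted(merged.items())]
    return [header] + body
```

Each input is any iterable of lines, so `merge_vcfs([open(p) for p in paths])` would work on uncompressed files; the first file's ID, QUAL, FILTER and INFO values are kept when a site occurs in several files.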

written 4 days ago by Giovanni.madrigal1220

I tried it. But since the file I am using is not a standard VCF file, I am not getting the desired output. The vcf-validator throws lots of errors when I check the files I am using.

written 4 days ago by bioinfo89

More details on the commands you are using and the errors you get would be helpful. I am aware of another post with the same issue, "Merge individual vcf files". There is also another toolkit called vcflib (https://github.com/vcflib/vcflib) if you would care to test your data there.

written 4 days ago by Giovanni.madrigal1220

Yes, sure, I will test the vcflib toolkit. Thanks for the info.

I used the following commands for vcftools:

/home/tools/tabix-0.2.6/bgzip Validated.vcf

/home/tools/tabix-0.2.6/tabix -p vcf Validated.vcf.gz
/home/tools/vcftools_0.1.13/perl/vcf-compare Validated1.vcf.gz Validated2.vcf.gz

Error:

Broken VCF header, no column names?
 at /home/perl5/perlbrew/perls/perl-5.27.4/lib/5.27.4/Vcf.pm line 172, <__ANONIO__> line 34.
    Vcf::throw(Vcf4_1=HASH(0x269da80), "Broken VCF header, no column names?") called at /home/perl5/perlbrew/perls/perl-5.27.4/lib/5.27.4/Vcf.pm line 866
    VcfReader::_read_column_names(Vcf4_1=HASH(0x269da80)) called at /home/perl5/perlbrew/perls/perl-5.27.4/lib/5.27.4/Vcf.pm line 601
    VcfReader::parse_header(Vcf4_1=HASH(0x269da80)) called at /home/tools/vcftools_0.1.13/perl/vcf-compare line 198
    main::compare_vcfs(HASH(0x268a148)) called at /home/tools/vcftools_0.1.13/perl/vcf-compare line 19

Command to validate vcf:

/home/tools/vcftools_0.1.13/perl/vcf-validator Validated1.vcf.gz

Error:

'he column name contains leading/trailing spaces, removing: 'NA19088
The header tag 'contig' not present for CHROM=chrMT. (Not required but highly recommended.)
INFO field at chrMT:1738 .. INFO tag [VRT] not listed in the header
column NA18971 at chrMT:1738 .. FORMAT tag [GT] not listed in the header
], expected integersrMT:1738 .. Unable to parse the GT field [0/0
The header tag 'contig' not present for CHROM=chr1. (Not required but highly recommended.)
], expected integersr1:75198868 .. Unable to parse the GT field [0/0
], expected integersr1:111970424 .. Unable to parse the GT field [0/0
modified 2 days ago by WouterDeCoster • written 2 days ago by bioinfo89

Then just fix those errors ;-) You will make things a lot easier if you follow the VCF specification. bcftools could also help you, but that tool is quite strict about the VCF specification as well.
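The validator messages above are consistent with two problems: Windows-style line endings or stray whitespace (the overwritten output and the "leading/trailing spaces" warning), and missing `##INFO`/`##FORMAT` definitions for `VRT` and `GT`. The Python sketch below shows one way such a cleanup could look. The function name `normalize_vcf_lines` is hypothetical, the shortened `VRT` description is a placeholder rather than the original dbSNP text, and the `VCFv4.1` default is an assumption based on the `Vcf4_1` parser in the stack trace.

```python
# Placeholder definitions; the descriptions are assumptions, not the dbSNP originals.
INFO_VRT = '##INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type">'
FORMAT_GT = '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'

def normalize_vcf_lines(lines):
    """Strip CR/whitespace and add missing VRT/GT header definitions."""
    meta, header, records = [], None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # drop empty lines
        if line.startswith("##"):
            meta.append(line.rstrip())
        elif line.startswith("#CHROM"):
            # Remove leading/trailing spaces around column (sample) names.
            header = "\t".join(col.strip() for col in line.split("\t"))
        else:
            records.append("\t".join(f.strip() for f in line.split("\t")))
    if header is None:
        raise ValueError("no #CHROM header line found")
    if not any(m.startswith("##fileformat=") for m in meta):
        meta.insert(0, "##fileformat=VCFv4.1")  # assumed version
    if not any(m.startswith("##INFO=<ID=VRT,") for m in meta):
        meta.append(INFO_VRT)
    if not any(m.startswith("##FORMAT=<ID=GT,") for m in meta):
        meta.append(FORMAT_GT)
    return meta + [header] + records
```

The cleaned lines would then be written back out, compressed with bgzip and indexed with tabix before validating again. The "contig" warnings are only recommendations and are not addressed here.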

written 4 days ago by WouterDeCoster

Yes, I am trying my best. :)

written 2 days ago by bioinfo89

"But since the file I am using is not a standard VCF file, I am not getting the desired output."

Could you please tell us what makes your VCF a non-standard VCF file?

fin swimmer

written 4 days ago by finswimmer

By non-standard VCF I mean that it is in the dbSNP submission VCF format, which carries additional information about the study, methods, etc., along with the reference assembly ID and the INFO and FORMAT fields. I also had to remove the INFO and FORMAT fields because the tabix step was not able to parse that information.

Command and Error:

/home//tools/tabix-0.2.6/tabix -p vcf Validated1.vcf.gz

[get_intv] the following line cannot be parsed and skipped: "##INFO=<ID=VRT,Number=1,Type=Integer,Description=""Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used),8 - MNV: multiple nucleotide variation with alleles of common length greater than 1,9 - Exception"">"
[ti_index_core] the indexes overlap or are out of bounds
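In the tabix message above, the `##INFO` line is wrapped in an outer pair of double quotes and its inner quotes are doubled, so the line no longer starts with `#` and tabix treats it as a data record. That pattern is typical of a spreadsheet or CSV export, although the thread does not say how the file was produced, so this is an assumption. A minimal Python sketch for undoing the wrapping, with the hypothetical helper name `unwrap_quoted_line`:

```python
def unwrap_quoted_line(line):
    """Undo CSV-style quoting: drop the outer quotes, collapse "" to "."""
    text = line.rstrip("\r\n")
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        # Remove the outer quote pair, then restore doubled inner quotes.
        text = text[1:-1].replace('""', '"')
    return text

def unwrap_header_lines(lines):
    """Apply the unwrapping only to quoted lines that hide a '#' header."""
    cleaned = []
    for line in lines:
        if line.startswith('"#'):
            cleaned.append(unwrap_quoted_line(line))
        else:
            cleaned.append(line.rstrip("\r\n"))
    return cleaned
```

With the header lines restored this way, the INFO and FORMAT definitions could stay in the file instead of being removed, which would also address the "not listed in the header" validator messages.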
modified 2 days ago by WouterDeCoster • written 2 days ago by bioinfo89

Could you please post the complete header of the original vcf file and the first few variants?

Thanks.

fin swimmer

written 2 days ago by finswimmer

I shortened your title to make it readable.

written 4 days ago by WouterDeCoster