How Can I Merge A Large Amount Of Vcf Files?
7
9
Entering edit mode
8.8 years ago
Matt W ▴ 250

I have about one thousand vcf files, and this is a ridiculous amount to type by hand into vcf-merge. Is there a way to merge all of these files without typing each name?

I tried putting all of the files into a text file and using stdin, but vcf-merge doesn't support this, i.e.

vcf-merge < list.txt > merge.vcf

Maybe this problem stems from creating this many vcf files to begin with. When using vcf-tools, I need to select multiple locations (i.e.--from-bp 100 --to-bp 200) which means I need to run vcf-tools for each location. Is there a better way to do this?

Thanks in advance for any help!

vcf snp merge vcftools • 37k views
ADD COMMENT
13
Entering edit mode
8.8 years ago

If the problem is merely getting the list into vcf-merge, it's as easy as:

vcf-merge $(ls -1 *.vcf | perl -pe 's/\n/ /g') > merge.vcf

(assumes all your vcfs to be merged are in the same directory)

ADD COMMENT
1
Entering edit mode

This is definitely along the track of what I'm trying to achieve. However, I keep getting an error: http://pastebin.com/H5AQZEtp Any ideas?

ADD REPLY
5
Entering edit mode

A quick guess: the argument to ls is the number one (1), not the letter L (l).

ADD REPLY
0
Entering edit mode

That did the trick! Thanks for your help!

ADD REPLY
8
Entering edit mode
8.8 years ago

deprecated : do not use this

I wrote a tool named scanvcf : it concatenates the VCF, adds an extra column with the sample name.

$ head -n3 input.txt

#Sample VCF
Sample1 data/sample1.vcf.gz
Sample2 data/sample1.vcf.gz
Sample2 data/sample1.vcf.gz


$ cat input.txt |scanvcf 
#CHROM POS ID REF ALT QUAL FILTER . FORMAT Call SAMPLE
1 879317 rs7523549 C T 71 0 . GT:GQ:DP:FLT 0/1:34:8:0 Sample1
1 880238 rs3748592 A G 51 0 . GT:GQ:DP:FLT 1/1:51:8:0 Sample1
1 880390 rs3748593 C A 99 0 . GT:GQ:DP:FLT 1/0:99:30:0 Sample1
1 881627 rs2272757 G A 99 0 . GT:GQ:DP:FLT 1/0:59:20:0 Sample1
(...)
Y 13524507 . C T 99 0 . GT:GQ:DP:FLT 1/1:99:233:0 Sample20
Y 21154323 rs10465459 G A 99 0 . GT:GQ:DP:FLT 1/1:99:215:0 Sample20
Y 21154426 rs52812045 G A 99 0 . GT:GQ:DP:FLT 1/0:99:143:0 Sample20
Y 21154466 rs10465460 T A 99 0 . GT:GQ:DP:FLT 1/1:99:134:0 Sample20
Y 21154529 . G A 51 0 . GT:GQ:DP:FLT 1/1:51:8:0 Sample20

You can later filter the result using samplespersnp.

  $ cat list.tsv | scanvcf  |\
  sort -t'  ' -k1,1 -k2,2n -k4,4 -k5,5 -k11,11 |\
  samplespersnp --sample 11 | awk '($8=".")'
1 753269 rs61770172 C G 99 0 . GT:GQ:DP:FLT 1/1:99:116:0 Sample16 1
1 753405 rs61770173 C A 99 0 . GT:GQ:DP:FLT 1/1:63:31:0 Sample10 7
1 753405 rs61770173 C A 81 0 . GT:GQ:DP:FLT 1/1:51:19:0 Sample12 7
1 753405 rs61770173 C A 35 0 . GT:GQ:DP:FLT 1/0:35:66:0 Sample19 7
1 753405 rs61770173 C A 99 0 . GT:GQ:DP:FLT 1/1:99:35:0 Sample3 7
1 753405 rs61770173 C A 90 0 . GT:GQ:DP:FLT 1/1:90:21:0 Sample5 7
1 753405 rs61770173 C A 99 0 . GT:GQ:DP:FLT 1/1:99:36:0 Sample6 7
1 753405 rs61770173 C A 90 0 . GT:GQ:DP:FLT 1/1:90:21:0 Sample9 7
1 876499 rs4372192 A G 39 0 . GT:GQ:DP:FLT 1/1:39:4:0 Sample12 6
1 876499 rs4372192 A G 42 0 . GT:GQ:DP:FLT 1/1:42:5:0 Sample16 6
1 876499 rs4372192 A G 39 0 . GT:GQ:DP:FLT 1/1:39:4:0 Sample17 6
1 876499 rs4372192 A G 45 0 . GT:GQ:DP:FLT 1/1:45:6:0 Sample18 6
1 876499 rs4372192 A G 45 0 . GT:GQ:DP:FLT 1/1:45:6:0 Sample4 6
1 876499 rs4372192 A G 42 0 . GT:GQ:DP:FLT 1/1:42:5:0 Sample6 6
1 877831 rs6672356 T C 42 0 . GT:GQ:DP:FLT 1/1:42:5:0 Sample14 2
1 877831 rs6672356 T C 39 0 . GT:GQ:DP:FLT 1/1:39:4:0 Sample4 2
1 878601 . C T 98 0 . GT:GQ:DP:FLT 0/1:50:11:0 Sample14 1

This tool also takes an option -e for a query over the samples. e.g: "variation must contains Sample11, Sample12 BUT NOT Sample1 to Sample5:"

-e '(Sample11  && Sample12 && (!(Sample1 || Sample2 || Sample3 || Sample4 || Sample5)))'
ADD COMMENT
0
Entering edit mode

2.5 years later: my solution is deprecated. This is not what you're looking for.

ADD REPLY
0
Entering edit mode

what do you use now?

ADD REPLY
0
Entering edit mode
ADD REPLY
6
Entering edit mode
8.8 years ago
matted 7.5k

If I understand your task correctly, xargs would be a quick way to go (similar to the perl example above):

ls *vcf | xargs vcf-merge > merged.vcf

Or if it's a bit more complicated and you want to use the file list as input:

cat files.txt | xargs vcf-merge > merged.vcf
ADD COMMENT
0
Entering edit mode

Definitely what I am looking for, but each time I try there is some error. When I had plain vcf files it asked for the tabix index. So I used bgzip | tabix and tried this again doing ls *.vcf.gz and another error reading columns comes up: http://pastebin.com/aDpgkqjw Any ideas? Thanks for your help! EDIT: I also know vcf-merge works correctly because I have used it successfully on a small number of files.

ADD REPLY
0
Entering edit mode

Resolved. My blank output was being picked up by the *vcf and throwing an error. Thanks for the help!

ADD REPLY
4
Entering edit mode
7.9 years ago

Another option is: joinx

joinx vcf-merge [OPTIONS] file1.vcf file2.vcf [file3.vcf ...]
ADD COMMENT
2
Entering edit mode
8.8 years ago
Vivek ★ 2.5k

I think GATK CombineVariants walker would support a list of files as input.

ADD COMMENT
1
Entering edit mode
8.8 years ago

What about, time while read line; do vcf-merge file1.vcf $line >> merge.cvf; done < list.txt and your list.txt should have filenames except the first one.

Cheers

ADD COMMENT
1
Entering edit mode

Hello,

The script help to merge all vcf files, but how can you make it possible to add columns with sample name.?

ADD REPLY
0
Entering edit mode
2.8 years ago
ATpoint 49k

Years later, only using Unix tool, producing a coord-sorted and bgzipped VCF:

ls *.vcf | head -n 1 | xargs grep '#' > ${BASENAME}_header.txt
cat *.vcf | mawk '$1 ~ /^#/ {next} {print $0 | "sort -k1,1 -k2,2n --parallel=8"}' | \
  cat ${BASENAME}_header.txt /dev/stdin | bgzip -@ 8 > out.catted.vcf.gz
ADD COMMENT
0
Entering edit mode

how to add, one more column with for sample names (e.g base name of the file) Tried the following, it worked sed 's/$/\t'$input'/' $input

ADD REPLY

Login before adding your answer.

Traffic: 2437 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6