Question: Efficient way of vcf files handling
0
gravatar for Earendil
4.3 years ago by
Earendil20
Sweden
Earendil20 wrote:

Hi all

I have a .tgz file in which there are folders, one for each individual.  Each folder contains vcf files, one for each chromosome. It's in this fashion: .tgz has ./individual1/chr1.vcf ./individual1/chr2.vcf etc, ./individual2/chr1.vcf ./individual2/chr2.vcf etc.

I want to create vcf files, one for each chromosome, containing individuals of a population using bcftools merge, so that later I can use the vcf files for an Fst estimation among populations using vcftools.

Having no previous experience for this kind of work, the only thing I can think is to extract the .tgz file, compress each vcf with bgzip and index it with tabix, run bcftools merge for each chromosome and population and finally run the Fst test on vcftools.

But is there a better way to do this? Extracting the .tgz file will chew up alot of storage space.

 

Thanks in advance.

vcftools bcftools vcf genome • 1.7k views
ADD COMMENTlink modified 4.3 years ago by Jonathan Dursi270 • written 4.3 years ago by Earendil20
3
gravatar for Jonathan Dursi
4.3 years ago by
Toronto
Jonathan Dursi270 wrote:

It's a not-widely-appreciated feature of tar that you can just untar the files you want; so that, for instance, you could do this by chromosome if disk space is the concern:

#!/bin/bash
mkdir -p merged
for chrom in X Y $( seq 1 22 )
do
    tar xzvf vcfs.tgz individual{1,2,3,4,5}/chr${chrom}.vcf
    for individual in individual*
    do
        cd $individual
        bgzip chr${chrom}.vcf
        tabix -p vcf chr${chrom}.vcf.gz
        cd ..
    done
    bcftools merge -m all --force-samples -O z individual*/chr${chrom}.vcf.gz  > merged/chr${chrom}.vcf.gz
    rm -f individual*/chr${chrom}.*
done

 

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by Jonathan Dursi270

I don't see the "tar's under-appreciated feature" part. Really good use of a bash loop and the bash 3.0 {} syntax, but all bash commands can "just work on the files one wants".

ADD REPLYlink written 4.3 years ago by RamRS24k
1

It's been my experience lately that a number of people - including many who are quite proficient in the linux shell environment - don't realize you can just pull out the individual files you want out of a tar file, rather than untarring the whole thing. That may indeed be already familiar to you - I assumed everyone knew that myself until recently.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Jonathan Dursi270

Apologies - I see what you're highlighting now. I'd misread OP's question. It is indeed an underused feature, because we rarely have need for that today. In fact, it was only when I had to pick a script from a large project archive on my HPC that I discovered this nifty little feature :-)

ADD REPLYlink written 4.3 years ago by RamRS24k

Thank you very much for your answer! I am just a starter on the linux shell environment so I need to explore various ways to do things. I have been experimenting with scripts based on your answer and the storage problem is solved.

To try to optimize it a bit more, would there be a way to make the process faster? With brace and parameter expansion, the tar xzvf command runs anew each time for every different individual/chromosome combination, so it needs to sweep the whole tar file each time. This makes things slow; is there any possible way to avoid it? I could leave bcftools merge aside for the time being if that is a problem.

Thanks again

ADD REPLYlink written 4.3 years ago by Earendil20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2774 users visited in the last hour