Question: How to make bash script operate fast for a 13gb VCF.gz file with 900000 rows and 10000 fields?
0
gravatar for DK
8 days ago by
DK50
Canada
DK50 wrote:

Dear Friends,

I mostly script in shell/bash as am comfortable with it and also use a lot of linux commands and Entrez utilities for various big file analysis; however, when I stared working on VCF files, dealing with million rows and columns I find issues in operating these files with bash. For instance, below is an example script to count number of SNPs for samples(TCGA...barcode...) in a VCF file:

for i in {10..10000}
do
snp_count=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep -v "TCGA" | grep "0/1\|1/1" | wc -l)
sample=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep "TCGA")
echo -e "$sample\t$snp_count\n"
done

Output:
TCGA...barcode1     60123
TCGA...barcode2      45245
.
.

The script works but it has been running for 2 days now, and will take time. I would really appreciate if you could share/suggest on the best ways to deal with VCF files using bash/shell or if another programming language could make a difference? However, if possible, I would prefer bash scripting ways to tackle VCF files.

Thanks, DK

bash vcf • 349 views
ADD COMMENTlink modified 4 days ago by Pierre Lindenbaum110k • written 8 days ago by DK50
7

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

But I think you should write a script to avoid read the data file thousands times.

ADD REPLYlink modified 8 days ago • written 8 days ago by shenwei3564.0k
1

At the very least just leave the file decompressed...

ADD REPLYlink modified 7 days ago • written 8 days ago by pld4.7k
2

use the tools meant for vcf parsing (IMO).@OP

ADD REPLYlink written 8 days ago by cpad01127.6k
2

I don't understand why we see TCGA in your Output, because there is a grep -v "TCGA".

What is that awk "{print \$$i}" ?

ADD REPLYlink written 8 days ago by Pierre Lindenbaum110k

@OP and @PIerre! OP's code and output are incorrect (IMO) for following reasons.

For eg in OP loop (as pointed out in your post):

sample=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep -v "TCGA" | grep "0/1\|1/1" | wc -l)

There are several problems with this code (IMO). But last one (wc -l) counts and lists the number. So variable sample is a number not a string. grep is done only for unphased genotypes.

snp_count=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep -v "TCGA")

The end command grep -v "TCGA prints the columns (from awk field) with column names not matching with string TCGA. This no count and snp_count variable should output columns not strings.

As I understand, OP is negating bash variable inside awk with {print \$$i} There are other ways to call bash variable inside awk.

Line echo -e "$sample\t$snp_count\n" as per the code, should output number first (wc -l from sample variable) and columns without sample string (TCGA) ( grep -v from sample_number). But in OP, output is other way around and is also incorrect as it should be printing columns, instead of strings. There is some thing wrong with code and output.

If OP is clear about what OP wants with example data and expected output, that would be more helpful.

ADD REPLYlink modified 6 days ago • written 6 days ago by cpad01127.6k

Thanks! Made a mistake in putting the code here, I have corrected it.

ADD REPLYlink written 5 days ago by DK50
2

Decompress the file will help. Set your system's locale to ASCII with LC_ALL=C should speed things up too. You might also be running out of memory.

I agree with @Shenwei356 that you should write a custom VCF parser to avoid reading the files so many times. You're only interested in the reading the letter once; why open the envelope so many times?

ADD REPLYlink written 8 days ago by Eric Lim740
1

Hello DK,

I haven't such a big vcf file for testing here. So I don't know how my suggestions will perform.

First have a look at the output of bcftools stats if this maybe already fits your needs.

Another way I have tested with a small dataset here, is to extract the genotype for each sample, replace the genotype by 0 of it's hom REF or unknown otherwise by 1 to indicate that there is a variant. You will get a matrix where each column is for one sample and each row is for each variant.

1 1 1 
0 1 1 
1 1 1 
1 0 0 
1 1 1 
1 1 1 
1 1 1 
1 1 1 
0 0 1 
0 1 1

On can then sum up each column to get the number of variants in each sample.

$ bcftools query -f '[%GT ]\n' input.vcf|sed -E  's/(0\/0|\.)/0/g;s/[0-9]\/[1-9]/1/g'|datamash -t\  sum 1-3 > SNPcounts.txt

Using gnu parallel, like already suggested by the other, will improve the speed, if you do this for example for each chromosome.

fin swimmer

ADD REPLYlink modified 7 days ago • written 7 days ago by finswimmer3.6k

DK : You now have multiple answers that look interesting. Once you have had a chance to test (if you can do all that would be great) please come back and validate the ones that work (and others that don't) by accepting the answers.

ADD REPLYlink written 4 days ago by genomax51k

I will certainly. Thanks!

ADD REPLYlink written 3 days ago by DK50
2
gravatar for chrchang523
8 days ago by
chrchang5233.6k
United States
chrchang5233.6k wrote:

This depends on implementation details of the programs you are calling with your bash script.

You should time ("profile") the components of your script on a smaller test dataset. Most likely, some steps will be very fast and you don't need to worry about them at all, and at least one step will be... not so fast. You can then look for faster replacements for the problematic components (cough rhymes with 'talk' ahem).

In this particular case, as another commenter has pointed out, the outer for loop is the biggest problem of all. Look for a way to avoid repeating every step 9991 times...

ADD COMMENTlink modified 8 days ago • written 8 days ago by chrchang5233.6k
2
gravatar for ATpoint
8 days ago by
ATpoint5.5k
Germany
ATpoint5.5k wrote:

I have no script ready for this but here are some inspirations:

You could index your file with Tabix. This allows fast retrieval of e.g. variants per chromosome. Write a script that subsets the vcf.gz per chromosome and pipe the output to your SNP-count script. To be efficient, I would use GNU parallel to run this for all chromosomes in parallel, given you have enough cores for this. Could be something like:

## Index:
tabix -p vcf big.vcf.gz
## Get entries per chromosome, including VCF header (-h) in parallel:
cat chromosomes.txt | parallel "tabix -h big.vcf.gz {} | ./your_script.sh"

The chromosomes.txt would have the 1-based range for each chromosome in the format:

chr1:start-end

chr2:start-end

<(...)

Also replace awk by mawk, which is considerably faster.

ADD COMMENTlink written 8 days ago by ATpoint5.5k
2
gravatar for finswimmer
4 days ago by
finswimmer3.6k
Germany
finswimmer3.6k wrote:

Hello again,

some more improvements on my previous code and an explanation to it. On my laptop with 4 cores and 16GB Ram the run time is < 7 minutes for the test dataset.

The Data

For testing I've download the vcf file for chromosome 1 with genotypes from the 1000 genomes project. This file is 1.1GB large and consists of around 6 500 000 variants in around 2 500 samples.

The columns look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 [...]
1       10177   rs367896724     A       AC      100     PASS    AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL GT      1|0     0|1
1       10235   rs540431307     T       TA      100     PASS    AC=6;AF=0.00119808;AN=5008;NS=2504;DP=78015;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0051;AA=|||unknown(NO_COVERAGE);VT=INDEL  GT      0|0     0|0

For the code it is important to see, that:

  • the sample data starts in column 10
  • the genotypes are given in phased format e.g. 0|0, 0|1
  • a NON-ALT genotype looks like this: 0|0

The Task

Create a list with the total count of variants for each sample in the vcf file. The output should look like this:

NA21141 133382
NA21142 131027
NA21143 130295
NA21144 131932

The Code

$ zcat input.vcf.gz|cut -f10-|grep -v "#"|parallel --pipe -N200000 -j4 'mawk -f count.awk'|mawk -v OFS="\t" -f sum.awk > snp_counts.txt

For readability reasons I've put the awk code in a file and haven't write it inline.

count.awk

{ for(i=1; i<=NF; i++) { if($i !~ /0|0/) snp[i]++ } }

END {for (id in snp) print id, snp[id]}

sum.awk

{snp[$1] += $2} END {for (id in snp) print id, snp[id]}

Let's go through the command line:

zcat input.vcf.gz: decompress the vcf file and print the data to stdout.

cut -f10-: As we know that the samples data start at column 10 we can discard the columns before.

grep -v "#": Furthermore we have to get rid of header lines.

parallel --pipe -N200000 -j4 'mawk -f count.awk': We split up the list of variants in chunks of 200 000 lines and run the counting of variants in each sample in 4 threads in parallel.

awk -v OFS="\t" -f sum.awk: We combine the results of the parallel counting of variants by sum up the values for each sample.

What are the awk script doing?

{ for(i=1; i<=NF; i++) { if($i !~ /0|0/) snp[i]++ } }

We iterate over each sample column and check whether it contains 0|0. If it doesn't we assume that the sample has a variant here. We increment the value in the list, where the index is the same as the column number.

END {for (id in snp) print id, snp[id]}

If we reach the end, we print out each column number and the number of variants we have found.

{snp[$1] += $2}

We sum up the count for each column number.

{for (id in snp) print id, snp[id]}

And print out again the column number with the number of variants.

One last step is needed. We have to replace the column number with the sample names. There are many ways to do so with grep, awk, .... I choose this way:

paste <(zgrep -m 1 "#CHROM" input.vcf.gz|cut -f10-|tr "\t" "\n") <(sort -k1,1g snp_counts.txt|cut -f2) > result.txt

More Speed?

More speed is possible by increasing the number of parallel processes with the -j option. More parallel processes need more memory. So it might be necessary to adopt the chunk size (-N) or one can run out of memory.

The more parallel processes we start the more important it is to get the data fast enough. So decompressing the vcf file might become a critical step here. One can try replacing zcat with pigz as it allows parallel decompressing.

Good luck :)

fin swimmer

ADD COMMENTlink modified 4 days ago • written 4 days ago by finswimmer3.6k

zcat input.vcf.gz|grep "#CHROM" -m 1= zgrep "#CHROM" -m 1 input.vcf.gz?

ADD REPLYlink modified 4 days ago • written 4 days ago by cpad01127.6k

There was a time when I knew about zgrep. But somehow it faded away ...

Post edited.

I should start closing my posts with "And cpad0112 will tell us how one can shorten the syntax" :)

ADD REPLYlink written 4 days ago by finswimmer3.6k

@ finswimmer you are one of the fast and efficient responders here. In general, I get/understand things only after one of you post a solution

ADD REPLYlink modified 3 days ago • written 4 days ago by cpad01127.6k

This is incredible! Thanks much Fin! I will implement this and all given suggestions and let you know :-)

ADD REPLYlink written 4 days ago by DK50
2
gravatar for Pierre Lindenbaum
4 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum110k wrote:

Here is oneliner using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ wget  -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz" |\
gunzip -c |\
java -jar dist/bioalcidaejdk.jar -e 'stream().flatMap(V->V.getGenotypes().stream()).filter(G->!(G.isHomRef()||G.isNoCall())).map(G->G.getSampleName()).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((k,v)->println(k+" "+v));' -F VCF
(...)
NA20852 441
HG02489 1233
HG02008 112
NA11919 905
HG02002 119
HG01398 133
HG03577 837
HG02009 1217
NA18549 819

detail:

  • stream(). get the stream of variants
  • flatMap(V->V.getGenotypes().stream()). convert it to a stream of genotypes
  • filter(G->!(G.isHomRef()||G.isNoCall())). remove the hom_ref and the nocall
  • map(G->G.getSampleName()). extract the sample name from the genotype
  • collect(Collectors.groupingBy(Function.identity(),Collectors.counting())). count the number of distinct names
  • forEach((k,v)->println(k+" "+v)); print it
ADD COMMENTlink written 4 days ago by Pierre Lindenbaum110k
1
gravatar for finswimmer
5 days ago by
finswimmer3.6k
Germany
finswimmer3.6k wrote:

I worked a bit more on this with the help of the 1000 Genomes vcf genotype file for chromosome 1. Thiss contains abou 2500 columns and 6 500 000 rows. This here runs about 25 minutes on my laptop.

zcat ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz|cut -f10-|grep -v "#"|awk '{for(i=1; i<=1250; i++) {if($i !~ "0|0") {j[i]++}}} END {for (k in j) print j[k]}'

fin swimmer

ADD COMMENTlink written 5 days ago by finswimmer3.6k
2

You should consider switching from awk to mawk for operations on large datasets. The increase in speed is amazing. Not sure if that holds true in this example because of the decompression at the beginning that might be the limiting factor, but in general, give mawk a try. You'll not regret it.

ADD REPLYlink written 5 days ago by ATpoint5.5k
1

Wooohaaaa! mawk! Running the above code with mawk melt down the runtime to 17 minutes. Amazing!

But mawk handles list or looping over lists in another way. The output is not sorted in the same way. We have to take care about this.

I've found a way to decrease the runtime to <7 minutes. I'll prepare the answer in some minutes.

fin swimmer

ADD REPLYlink written 4 days ago by finswimmer3.6k

Thank you all! Great advices! I will implement these and let you know the result. Thanks much again!

ADD REPLYlink written 5 days ago by DK50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1097 users visited in the last hour