Question: How to filter huge VCF files for data mining?
jespinoz wrote, 3.3 years ago:

I have 88 human samples from a NextSeq run. I generated BCF files by first running HISAT2 against the pre-built Ensembl GRCh38 index available from their documentation, then converting the SAM files to sorted BAM files and finally to individual BCF files. Altogether, these 88 BCF files total 5.7 TB (5.8 TB below because I copied 2 of them to test merging), which is gigantic. I need to reduce the file size substantially, and the SNP analysis is only a secondary addition to one of our studies.

I plan to look at co-occurrence of SNPs between the 88 samples. I understand SNP studies tend to focus on rare SNPs, but sparse matrices are not useful for the data mining algorithms I wish to use. My question is: how can I filter these BCF files and combine them into a single data matrix that is manageable on a 4 GB laptop? What kind of filtering is typically done on SNPs to dramatically reduce the complexity while preserving as much information content as possible? I realize this sounds contradictory, but prototyping is my main goal for this analysis. How should they be filtered, and what would be good cutoffs to use?

We have 88 samples whose reads together total about 746 G in size. (Symbolic links below, which is why the reads directory appears smaller.)

We used HISAT2 for the mapping, using human assembly GRCh38. HISAT2 supplies pre-indexed files, which we used, located at

The genome assembly used for the indexing was retrieved from

$ du -sh *
5.8T    bcf_files
8.5G    grch38
4.7G    grch38.tar.gz
435K    reads
176G    sam_files
38G sorted_bam_files

My pipeline:

Create the SAM file

hisat2 -q -p 2 --fast -x ./grch38/genome -1 {r1_path} -2 {r2_path} -S ./sam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam

SAM => sorted BAM

samtools view -bS ./sam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam | samtools sort -@ 16 -o ./sorted_bam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted

Sorted BAM => BCF

samtools mpileup -uf ./grch38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -C 50 --BCF -o ./bcf_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted.bcf ./sorted_bam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted
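A note on this last step: `samtools mpileup -u --BCF` with no downstream caller writes genotype likelihoods for every covered site, which is likely why the files are so huge. A common variant-only pattern (a sketch, assuming samtools/bcftools 1.x; in later releases mpileup moved into bcftools itself) pipes into `bcftools call`:

```shell
# mpileup emits per-site genotype likelihoods (-u: uncompressed, -g: BCF);
# bcftools call -m (multiallelic caller) -v (variant sites only) discards
# the reference-matching sites that make all-sites BCFs enormous.
# {lib} is the per-sample placeholder from the pipeline above; the output
# name .calls.bcf is illustrative.
samtools mpileup -ugf ./grch38/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    -C 50 ./sorted_bam_files/{lib}.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted \
  | bcftools call -mv -Ob -o ./bcf_files/{lib}.calls.bcf
```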

Tags: genotype bcf snps genome vcf
modified 3.2 years ago • written 3.3 years ago by jespinoz

You are probably producing so-called all-sites VCFs. Such VCFs can be helpful for popgen analyses, but they are rarely used for other applications. Most of the time we produce variant-only multi-sample VCFs, which are by far smaller. What is your target application?

written 3.3 years ago by lh3

Pretty sure your VCF shouldn't be the largest file - excluding the index files, it should be the smallest, in fact. With 88 samples, it might be just over a gig in size. BAM files, OTOH, are around 60G in size for a 30X-depth run.

VCF filtration is generally ad hoc and depends on the kind of analysis you wish to perform, but if you want to work on a 4 GB laptop (I'm assuming 4 GB RAM, which is not good for most bioinformatics analyses), you're going to have to put up with lower speeds when you use software like plink to analyze your VCF file.

written 3.3 years ago by RamRS
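To make the "ad hoc" filtering concrete, here is a hedged bcftools sketch. The thresholds (QUAL >= 30, minor-allele frequency >= 5%, under 10% missing genotypes) are illustrative starting points, not recommendations, and the `F_MISSING` expression assumes a reasonably recent bcftools:

```shell
# Merge the per-sample BCFs into one multi-sample BCF (each file must be
# indexed first with `bcftools index`), then keep only biallelic SNPs with
# minor-allele frequency >= 5%, QUAL >= 30, and < 10% missing genotypes.
bcftools merge ./bcf_files/*.bcf -Ob -o merged.bcf
bcftools view -m2 -M2 -v snps -q 0.05:minor \
    -i 'QUAL>=30 && F_MISSING<0.1' \
    -Ob -o filtered.bcf merged.bcf

# Dump a variant x sample table and recode GT strings as 0/1/2 dosages,
# giving a dense numeric matrix instead of a sparse one.
bcftools query -f '%CHROM:%POS[\t%GT]\n' filtered.bcf \
  | awk 'BEGIN{OFS="\t"} {for(i=2;i<=NF;i++){g=$i; gsub(/\|/,"/",g); if(g=="0/0")$i=0; else if(g=="1/1")$i=2; else if(g=="0/1"||g=="1/0")$i=1; else $i="NA"} print}' \
  > genotype_matrix.tsv
```

The resulting TSV has one row per variant and one numeric column per sample, i.e. the dense matrix the question asks for, and pandas can load it directly with `read_csv(..., sep="\t", header=None)`.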

I am also surprised here; even 88 times the size of the human genome does not equal 4 TB.

written 3.3 years ago by cmdcolin

These are human reads extracted from metagenomic reads, so the read depth is all over the place.

modified 3.3 years ago • written 3.3 years ago by jespinoz

I've repeated this using and I'm creating individual bcf files. It's 71.5% done and the directory is 4.2 TB.

written 3.3 years ago by jespinoz

Are you creating so-called gVCF files (I think this is what @lh3 referred to above)? It just doesn't seem like a normal VCF can conceivably have that much data. Also, have you considered doing joint genotyping?

modified 3.3 years ago • written 3.3 years ago by cmdcolin
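On joint genotyping: instead of 88 separate per-sample runs, mpileup can take all the sorted BAMs at once, and bcftools call then emits a single variant-only multi-sample BCF. A sketch under the same samtools/bcftools 1.x assumption, with a shell glob standing in for the 88 files:

```shell
# One pass over all 88 samples; the output has one genotype column per
# sample and only rows where at least one sample differs from the reference.
samtools mpileup -ugf ./grch38/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    -C 50 ./sorted_bam_files/*.sorted \
  | bcftools call -mv -Ob -o all_88_samples.calls.bcf
```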

I'm running this command for each of the 88 samples: samtools mpileup -uf ./grch38/Homo_sapiens.GRCh38.dna.primary_assembly.fa -C 50 --BCF -o ./bcf_files/1054.2_RD1.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted.bcf ./sorted_bam_files/1054.2_RD1.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted

modified 3.3 years ago • written 3.3 years ago by jespinoz

gVCFs are not super large either - I have a 250+ sample gVCF and it's less than ~10G in size. gVCF does contain all sites, but it compresses contiguous variant-free sites into regions, unless you force it to emit every site individually. See

written 3.3 years ago by RamRS

Thanks for your patience; I've been traveling, so I haven't had a chance to respond to the comments.

@lh3, I think you are right. I used (4.6 GB) instead of (3.9 GB), which I think is creating the large files and being overly comprehensive for my desired analysis. Do you know if the SNP db I used may have generated the "all-sites VCF" you were referring to? My main goal is to look at groups of SNPs that correlate with a specific trait, and I want to do this in Python using pandas.

@Ram, my work computer has 8 GB RAM and my home MacBook Air has 4 GB RAM, which I may use from time to time. I have access to the Grid (which is how I'm generating these files), but I don't want to prototype new ideas on this giant dataset.

@cmdcolin, I've updated my question to include my directory sizes.

written 3.3 years ago by jespinoz
Powered by Biostar version 2.3.0