Question

Further splitting single-chromosomes VCF files to decrease their size to annotate with VEP

0

Entering edit mode

16 months ago

manuel.moradielloscorpus ▴ 10

Hi all!

I'm trying to annotate various big TCGA variant calling files using VEP, running the process in parallel in a cluster providing 200GB and 10 cores to each annotation process, but apparently it isn't enough so I thought of reducing the file size to increase the speed of the jobs.

I already obtained those .vcf separated by chromosome and by their FILTER (either "PASS" or ".") Also I've been following the recommendations from VEP's manual to make it run faster but VEP still takes a lot of time (many days) and it doesn't finish the annotation process even for small chromosomes like 22 with the resources mentioned above...

Has anyone tried to split the .vcf files by number of lines/size to make them smaller without losing potential data or with any other alternative method?

I know that .vcf format is a bit finicky and simply removing the header, splitting with bash and adding back the header wouldn't be the best solution for this issue.

Thanks in advance~!

ensembl-vep annotation vcf vep variant • 1.6k views

ADD COMMENT • link 16 months ago by manuel.moradielloscorpus ▴ 10

score 2 · Accepted Answer · 2022-12-12

2

Entering edit mode

16 months ago

Pierre Lindenbaum 161k

split the genome into parts with bedtools makewindows

annotate each window / chromosome:

bcftools view --regions "${INTERVAL}" in.${CHROM}.vcf |  vep --vcf | bcftools view -O b -o out.${i}.bcf
bcftools index  out.${i}.bcf

and then concatenate everything with bcftools concat

ADD COMMENT • link 16 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre!

Thanks for the quick answer. I've been taking a look and playing around with bedtools makewindows but I believe it's not exactly what I need, maybe I missed out on explaining a bit more about my input.

What I need to split due to (possible) size limitations are .vcf files obtained from TCGA, not a human genome that can be divided in equal-sized windows as I would lose the position (POS) of each variant called and VEP would not return the desired result which is adding more columns with extra information for each variant.

I have each .vcf separated by chromosome and I can also further do it by their FILTER ("PASS" or "."), but it doesn't seem to be enough to work correctly.

    ##above we have ~200 lines more of the vcf header
    ##bcftools_pluginCommand=plugin missing2ref --output-type v --output PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.miss2ref.vcf PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.vcf.gz; Date=Wed May  5 01:16:55 2021
    ##bcftools_concatVersion=1.9+htslib-1.9
    ##bcftools_concatCommand=concat -o PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.noinfo.fix.split.miss2ref.vcf PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.miss2ref.vcf.gz /bio/scratch/sosug32/Data/TCGA/TCGA_Germline/TCGA-TG_Noinfo/PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.noinfo.fix.split.miss2ref.vcf.gz; Date=Wed May  5 03:43:16 2021
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TCGA-02-0003-10A-01D-1490-08    TCGA-02-0033-10A-01D-1490-08    TCGA-02-0047-10A-01D-1490-08
    X   60052   .   T   A   .   PASS    AC=2;AN=2   GT  0/0 0/0 0/0
    X   60147   .   A   G   92.9    .   AC=2;AN=2   GT  0/0 0/0 0/0
    X   60185   .   T   C   219.8   .   AC=35;AN=38 GT  0/0 0/0 0/0
    X   60215   .   A   C   340.77  .   AC=59;AN=64 GT  0/0 0/0 0/0
    X   60256   .   C   G   242.78  .   AC=9;AN=12  GT  0/0 0/0 0/0
    X   60336   .   G   A   601.77  .   AC=55;AN=60 GT  0/0 0/0 0/0
    X   60337   .   C   T   601.77  .   AC=53;AN=58 GT  0/0 0/0 0/0
    X   60408   .   A   C   175.77  .   AC=2;AN=4   GT  0/0 0/0 0/0
    X   60425   .   C   A   206.77  .   AC=8;AN=12  GT  0/0 0/0 0/0
    X   60454   .   A   G   356.78  .   AC=58;AN=62 GT  0/0 0/0 0/0
    X   60466   .   T   C   526.77  .   AC=68;AN=72 GT  0/0 0/0 0/0
    X   60579   .   G   A   470.77  .   AC=70;AN=74 GT  0/0 0/0 0/0
    X   60583   .   G   C   560.77  .   AC=71;AN=76 GT  0/0 0/0 0/0
    X   60593   .   A   T   590.77  .   AC=66;AN=72 GT  0/0 0/0 0/0
    X   60604   .   G   A   200.84  .   AC=3;AN=4   GT  0/0 0/0 0/0
    X   60692   .   T   C   343.77  .   AC=67;AN=74 GT  0/0 0/0 0/0
    X   60727   .   G   A   145.41  .   AC=3;AN=4   GT  0/0 0/0 0/0
    X   60814   .   G   C   196.84  .   AC=4;AN=6   GT  0/0 0/0 0/0

I'm looking for something akin to bash's $split -n l/8 -a 1 chr.vcf chr.split_preffix to divide each file into equal sized ones, but I've been advised to not remove the header and paste it after splitting with bash as that can tinker with the .vcf in a bad way, leading to some possible information loss.

I've tried to find a tool that would split the .vcf in a similar fashion to taking the previous example and obtaining various files:

    ##above we have ~200 lines more of the vcf header
    ##bcftools_pluginCommand=plugin missing2ref --output-type v --output PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.miss2ref.vcf PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.vcf.gz; Date=Wed May  5 01:16:55 2021
    ##bcftools_concatVersion=1.9+htslib-1.9
    ##bcftools_concatCommand=concat -o PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.noinfo.fix.split.miss2ref.vcf PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.miss2ref.vcf.gz /bio/scratch/sosug32/Data/TCGA/TCGA_Germline/TCGA-TG_Noinfo/PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.noinfo.fix.split.miss2ref.vcf.gz; Date=Wed May  5 03:43:16 2021
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TCGA-02-0003-10A-01D-1490-08    TCGA-02-0033-10A-01D-1490-08    TCGA-02-0047-10A-01D-1490-08
    X   60052   .   T   A   .   PASS    AC=2;AN=2   GT  0/0 0/0 0/0
    X   60147   .   A   G   92.9    .   AC=2;AN=2   GT  0/0 0/0 0/0
    X   60185   .   T   C   219.8   .   AC=35;AN=38 GT  0/0 0/0 0/0
    X   60215   .   A   C   340.77  .   AC=59;AN=64 GT  0/0 0/0 0/0
    X   60256   .   C   G   242.78  .   AC=9;AN=12  GT  0/0 0/0 0/0
    X   60336   .   G   A   601.77  .   AC=55;AN=60 GT  0/0 0/0 0/0
    X   60337   .   C   T   601.77  .   AC=53;AN=58 GT  0/0 0/0 0/0
    X   60408   .   A   C   175.77  .   AC=2;AN=4   GT  0/0 0/0 0/0
    X   60425   .   C   A   206.77  .   AC=8;AN=12  GT  0/0 0/0 0/0
    X   60454   .   A   G   356.78  .   AC=58;AN=62 GT  0/0 0/0 0/0


    ##above we have ~200 lines more of the vcf header
    ##bcftools_pluginCommand=plugin missing2ref --output-type v --output PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.miss2ref.vcf PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.vcf.gz; Date=Wed May  5 01:16:55 2021
    ##bcftools_concatVersion=1.9+htslib-1.9
    ##bcftools_concatCommand=concat -o PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.noinfo.fix.split.miss2ref.vcf PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.pass.fix.split.miss2ref.vcf.gz /bio/scratch/sosug32/Data/TCGA/TCGA_Germline/TCGA-TG_Noinfo/PCA.r1.TCGAbarcode.merge.tnSwapCorrected.10389.X.noinfo.fix.split.miss2ref.vcf.gz; Date=Wed May  5 03:43:16 2021
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TCGA-02-0003-10A-01D-1490-08    TCGA-02-0033-10A-01D-1490-08    TCGA-02-0047-10A-01D-1490-08
    X   60466   .   T   C   526.77  .   AC=68;AN=72 GT  0/0 0/0 0/0
    X   60579   .   G   A   470.77  .   AC=70;AN=74 GT  0/0 0/0 0/0
    X   60583   .   G   C   560.77  .   AC=71;AN=76 GT  0/0 0/0 0/0
    X   60593   .   A   T   590.77  .   AC=66;AN=72 GT  0/0 0/0 0/0
    X   60604   .   G   A   200.84  .   AC=3;AN=4   GT  0/0 0/0 0/0
    X   60692   .   T   C   343.77  .   AC=67;AN=74 GT  0/0 0/0 0/0
    X   60727   .   G   A   145.41  .   AC=3;AN=4   GT  0/0 0/0 0/0
    X   60814   .   G   C   196.84  .   AC=4;AN=6   GT  0/0 0/0 0/0

Maybe I'm using bedtools makewindows incorrectly but in my case I cannot make windows of equal size as each variant could be a SNP or an INDEL and it also doesn't identify possible windows to be created within my variants positions.

ADD REPLY • link 16 months ago by manuel.moradielloscorpus ▴ 10

3

Entering edit mode

I'm looking for something akin to bash's $split -n l/8 -a 1 chr.vcf chr.split_preffix to divide each file into equal sized ones

I recently wrote https://lindenb.github.io/jvarkit/VcfSplitNVariants.html

$ java -jar dist/vcfsplitnvariants.jar --vcf-count 10 src/test/resources/rotavirus_rf.vcf.gz -o SPLITVCF --manifest manifest.mf --index -R src/test/resources/rotavirus_rf.fa
[INFO][VcfSplitNVariants]open SPLITVCF.00001.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00002.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00003.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00004.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00005.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00006.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00007.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00008.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00009.vcf.gz
[INFO][VcfSplitNVariants]open SPLITVCF.00010.vcf.gz

ADD REPLY • link 16 months ago by Pierre Lindenbaum 161k

1

Entering edit mode

Thanks Pierre, this has worked wonders!

After three weeks of differentes approaches VEP has finally been able to annotate my (now splitted) input files and I'm starting to get the results I wanted.

Thanks for all the help~~

ADD REPLY • link 16 months ago by manuel.moradielloscorpus ▴ 10

0

Entering edit mode

Please accept the answer so the question is marked solved on the website. To do that, click on the green check mark on the left side of the answer.

ADD REPLY • link 16 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Yep, sorry! I was looking for a way to exactly highlight the VcfSplitNVariants comment that was the final solution for my issue

ADD REPLY • link 16 months ago by manuel.moradielloscorpus ▴ 10

0

Entering edit mode

as I would lose the position (POS) of each variant called and VEP would not return the desired result which is adding more columns with extra information for each variant.

i don't understand that sentence.

ADD REPLY • link 16 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Sorry, I didn't explain myself properly here!

As my current working files weren't genome sequence but variant calling data, bedtools makewindows wasn't working properly.

Each line includes either a SNP or an INDEL so creating small windows from the only available intervals (more than one base pair change) would result in "duplicating" some lines instead of reducing the filesize to launch smaller jobs for VEP to handle.

What I meant with

would lose the position (POS) of each variant called

is that creating windows from a certain START - END position would alter the original information about the expanse of that given variant, the coordinates would be slightly different and the extra information obtained from VEP (more columns) would be duplicated in those cases.

ADD REPLY • link 16 months ago by manuel.moradielloscorpus ▴ 10