Question

Forum:VCF bulk handling

3 months ago

Javier • 0

Hello, I am currently working on my master's thesis in bioinformatics (2 weeks in). My tutor has asked me to work with VCF files from the ICGC database. Our goal is to use this data to train an ML model for the accurate selection of immunogenic neoantigens in PAAD.

In the first step we need to annotate the VCFs and filter them before the immunogenicity test phase.

Initially, my data consists of 4 VCF files for each patient. There are two types, SNV and INDEL, and they come from either Mutect2 or Sanger sequencing. For example:

APGI-AU.DO32825.SA407790.wgs.20210623.gatk-mutect2.somatic.snv.vcf.gz

My plan was to use bcftools to concatenate the SNV and INDEL files for each patient from Mutect2 and Sanger sequencing, resulting in two versions per patient. Then, I would merge all patients into a single file. This would give me files like:

Step 1, e.g.: APGI-AU_DO32825_gatk-mutect2_indel_snv.vcf.gz
Step 2, e.g.: gatk-mutect2_indel_snv.vcf.gz
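
The two steps above could be sketched roughly as follows. This is only a sketch under assumptions: every input is bgzipped and tabix-indexed, and the SNV and INDEL files of one patient already carry the same sample columns in the same order. The function names (`patient_prefix`, `concat_patient`, `merge_patients`) are made up for illustration.

```shell
# Hypothetical helper: build "<project>_<donor>_<caller>" from an ICGC-style
# file name, e.g.
#   APGI-AU.DO32825.SA407790.wgs.20210623.gatk-mutect2.somatic.snv.vcf.gz
# Assumes dot-separated fields with the caller in field 6.
patient_prefix() {
    basename "$1" | cut -d. -f1,2,6 | tr . _
}

# Step 1: concatenate one patient's SNV and INDEL calls into a single file.
# -a lets the two inputs overlap in position; it needs indexed inputs.
concat_patient() {
    local snv="$1"
    local indel="$2"
    local out="$3"
    bcftools concat -a -O z -o "$out" "$snv" "$indel" &&
        bcftools index -t "$out"
}

# Step 2: merge the per-patient files of one caller into one multi-sample VCF.
# --force-samples is only needed when the files share sample names such as
# NORMAL/TUMOUR; bcftools then prefixes duplicates with the file index.
merge_patients() {
    local out="$1"
    shift
    bcftools merge --force-samples -O z -o "$out" "$@" &&
        bcftools index -t "$out"
}
```

For the example file, `patient_prefix` gives the stem used in step 1, so a call could look like `concat_patient snv.vcf.gz indel.vcf.gz "$(patient_prefix snv_file)_indel_snv.vcf.gz"`.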

Next, I intend to filter the VCF files using RNA data to remove genes that show no expression. After filtering, I will annotate the files using two different annotators for comparison: VEP and SnpEff.
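
One possible way to do the expression filter before annotation is to restrict the VCF to the coordinates of expressed genes. The sketch below assumes a hypothetical tab-separated expression table with a header line and the columns chrom, start, end, gene, TPM (0-based, half-open coordinates); neither the table layout nor the function names come from the thread.

```shell
# Hypothetical helper: read the expression table on stdin and print BED rows
# (chrom, start, end, gene) for genes whose TPM is above the given minimum.
expressed_regions() {
    awk -F'\t' -v min="$1" \
        'BEGIN { OFS = "\t" } NR > 1 && ($5 + 0) > (min + 0) { print $1, $2, $3, $4 }'
}

# Keep only variants that fall inside expressed genes.
# The region file is given a .bed suffix so bcftools reads it as 0-based,
# half-open intervals; the input VCF must be indexed for -R to work.
filter_expressed() {
    local in="$1"
    local table="$2"
    local out="$3"
    local tmp
    tmp=$(mktemp -d)
    expressed_regions 0 < "$table" > "$tmp/expressed.bed"
    bcftools view -R "$tmp/expressed.bed" -O z -o "$out" "$in"
    rm -r "$tmp"
}
```

The threshold of 0 only drops genes with no expression at all, which matches the stated goal; a stricter cutoff would be a separate decision.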

At the end of this process, I expect to have:

gatk-mutect2_ann_VEP.vcf.gz
sanger_ann_VEP.vcf.gz
gatk-mutect2_ann_SnpEff.vcf.gz
sanger_ann_SnpEff.vcf.gz

However, I am unsure how to handle the combination of VCF files, especially considering their different column structures (NORMAL and TUMOUR in SNV, and SA407790 and SA407795 in INDEL). I am not certain how this will affect downstream analysis, or whether I should merge them differently.

Additionally, I am aware of tools like Sarek that can streamline this process using both annotators simultaneously. However, we encountered issues with system overload and errors when attempting this approach on our computers.

So I would just like some advice, or to hear your opinion on my "workflow" idea. Thank you.

Update:

Right now I am still trying to combine the VCFs. I am facing 2 problems:

As I said, half of the files were generated with Mutect2, and the last two columns contain different sample IDs for each patient. I need to change these IDs to 'NORMAL' and 'TUMOUR' in each file. I'm having trouble figuring out the command to accomplish this.

I would also like to be able to identify which patient each mutation in the collective VCF file comes from. I read that I can achieve this by adding an INFO tag, but I'm struggling to understand how to implement this.
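
A rough sketch of the INFO-tag idea, applied to each per-patient file before merging. It assumes `bcftools`, `bgzip` and `tabix` are on the PATH; the tag name `PATIENT` and all function names are invented for this example and are not a standard VCF field.

```shell
# Header line declaring the new (made-up) INFO tag.
patient_header() {
    printf '##INFO=<ID=PATIENT,Number=1,Type=String,Description="Patient of origin">\n'
}

# Append the patient ID as a fifth column to CHROM/POS/REF/ALT rows on stdin.
annotation_rows() {
    awk -v p="$1" 'BEGIN { OFS = "\t" } { print $0, p }'
}

# Add INFO/PATIENT=<id> to every record of one per-patient VCF.
tag_patient() {
    local in="$1"
    local patient="$2"
    local out="$3"
    local tmp
    tmp=$(mktemp -d)
    patient_header > "$tmp/hdr.txt"
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' "$in" |
        annotation_rows "$patient" | bgzip > "$tmp/ann.tsv.gz"
    tabix -s1 -b2 -e2 "$tmp/ann.tsv.gz"
    bcftools annotate -a "$tmp/ann.tsv.gz" -h "$tmp/hdr.txt" \
        -c CHROM,POS,REF,ALT,INFO/PATIENT -O z -o "$out" "$in"
    rm -r "$tmp"
}
```

If the same variant occurs in several patients, the tag values would collide at merge time; passing something like `-i PATIENT:join` to `bcftools merge` may keep all of them, but that should be checked on a small test file first.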

For the first problem I am using the command bcftools reheader -s new_samples.txt "$out_dir/$output_vcf" -o "$out_dir/$output_vcf". It does the job, but later, when I try to manipulate these files, I get this error:

[E::bgzf_read_block] Invalid BGZF header at offset 36076

index: failed to create index for ...

The new_samples.txt file is only this:

NORMAL TUMOUR

And when checking the modified file with gzip, it seems to be all right:

gzip: APGI-AU_DO32825_gatk-mutect2.vcf.gz: decompression OK, trailing garbage ignored
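
One possible cause, not confirmed anywhere in this thread, is that the command above reads from and writes to the same path, which can leave a damaged BGZF file behind. A minimal sketch that writes to a separate output instead is shown below. It assumes the normal sample is the first column and the tumour the second, and it writes one name per line, which is the layout `bcftools reheader -s` expects for a plain list of new names. The function names are hypothetical.

```shell
# Print the new sample names, one per line, in VCF column order.
write_sample_names() {
    printf '%s\n' "$@"
}

# Rename the two sample columns to NORMAL and TUMOUR, writing to a new file.
# Assumption: column 1 is the normal sample and column 2 the tumour sample.
rename_samples() {
    local in="$1"
    local out="$2"
    local names
    if [ "$in" = "$out" ]; then
        echo "rename_samples: input and output must be different files" >&2
        return 1
    fi
    names=$(mktemp)
    write_sample_names NORMAL TUMOUR > "$names"
    bcftools reheader -s "$names" -o "$out" "$in" &&
        bcftools index -t "$out"
    rm -f "$names"
}
```

A call could look like `rename_samples "$out_dir/$output_vcf" "$out_dir/renamed_$output_vcf"`, where the `renamed_` prefix is just a placeholder.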

bcftools SnpEff WGS VEP VCF • 519 views

3 months ago by Javier • 0

Why annotate using 2 tools? Comparing annotations won't be as straightforward as you imagine once you get into the details of the various settings. Go with one (I recommend VEP with MANE) and stick to it.

3 months ago by Ram 44k

Well, as I mentioned, my tutor asked me to do that. I thought that a good way to compare them would be to use MAFTools to visualize the data and create some cool visuals for my master's thesis.

3 months ago by Javier • 0