Hello all,
I have a large number of samples that are aligned to a reference genome and variant called. The variants have been analysed with SnpEff. I want to create a fasta file with the alternative coding sequences so that I can visualise them in a phylogenetic tree. I am currently using GATK4 to do so. I have already split the VCF file into a VCF file per sample, and GATK does seem to be doing something, but when I merged all fasta sequences into one file, all sequences were 100% identical. There are variants in my VCF file, this I have already checked. The merging of the fasta files is also working as it should, so, in my workflow, the problem lies with GATK.
My GATK command is this:
for vcf in /path/to/vcfs/*; do
gatk --java-options "-Xmx64G" FastaAlternateReferenceMaker -R ref.fasta -V $vcf -O out_dir/ \
$(basename "$vcf" .vcf).fasta -L "linkage_group:start-end"
done
I get all of the seperate fasta files with the desired filenames and merge them later. I doubt that the problem lies there, as the header re-naming process works fine. The GATK log says this:
Running:
/gnu/store/00mf4q4xyy1bc5lxblggzylfzh7wyyla-icedtea-3.7.0/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx64G -jar /gnu/store/109pqziyqiqf2qdfw3x60s99sq4170s8-gatk4-4.1.5.0/share/java/user-classes/gatk.jar FastaAlternateReferenceMaker -R path/to/my/reference.fa -V /path/to/my/sample.vcf -O /path/to/my/output.fasta -L linkage_group:start-end
13:24:00.239 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gnu/store/109pqziyqiqf2qdfw3x60s99sq4170s8-gatk4-4.1.5.0/share/java/user-classes/gatk-package-4.1.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jun 10, 2020 1:24:00 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:24:00.514 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
13:24:00.515 INFO FastaAlternateReferenceMaker - The Genome Analysis Toolkit (GATK) v4.1.5.0
13:24:00.515 INFO FastaAlternateReferenceMaker - For support and documentation go to https://software.broadinstitute.org/gatk/
13:24:00.515 INFO FastaAlternateReferenceMaker - Executing as <user_name> on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
13:24:00.515 INFO FastaAlternateReferenceMaker - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b12
13:24:00.515 INFO FastaAlternateReferenceMaker - Start Date/Time: June 10, 2020 1:24:00 PM CEST
13:24:00.515 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
13:24:00.515 INFO FastaAlternateReferenceMaker - ------------------------------------------------------------
13:24:00.516 INFO FastaAlternateReferenceMaker - HTSJDK Version: 2.21.2
13:24:00.516 INFO FastaAlternateReferenceMaker - Picard Version: 2.21.9
13:24:00.516 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:24:00.516 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:24:00.516 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:24:00.516 INFO FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:24:00.516 INFO FastaAlternateReferenceMaker - Deflater: IntelDeflater
13:24:00.516 INFO FastaAlternateReferenceMaker - Inflater: IntelInflater
13:24:00.516 INFO FastaAlternateReferenceMaker - GCS max retries/reopens: 20
13:24:00.516 INFO FastaAlternateReferenceMaker - Requester pays: disabled
13:24:00.516 INFO FastaAlternateReferenceMaker - Initializing engine
13:24:00.843 INFO FeatureManager - Using codec VCFCodec to read file file:///path/to/my/sample.vcf
13:24:00.889 INFO IntervalArgumentCollection - Processing 2066 bp from intervals
13:24:00.894 INFO FastaAlternateReferenceMaker - Done initializing engine
13:24:00.902 INFO ProgressMeter - Starting traversal
13:24:00.902 INFO ProgressMeter - Current Locus Elapsed Minutes Bases Processed Bases/Minute
13:24:01.005 INFO ProgressMeter - linkage_group:start 0.0 2066 1215294.1
13:24:01.005 INFO ProgressMeter - Traversal complete. Processed 2066 total bases in 0.0 minutes.
13:24:01.030 INFO FastaAlternateReferenceMaker - Shutting down engine
[June 10, 2020 1:24:01 PM CEST] org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2211446784
I have changed the exact file locations, the rest is authentic. To me, as a beginner user of GATK, it looks like everything is going fine. But, again, none of the output CDS were different. I am running it on VCF files with only SNPs as I saw that indels can cause problems for this programme.
EDIT: Forgot to mention it but they are also all identical to the reference sequence. Have also exchanged the log for the right one; I had accidentally put the index log there, which was in the same file as I ran them both together.
Honestly, I'm lost on this one. Can anyone with more GATK experience maybe enlighten me on what I am doing wrong?
Many thanks in advance.