Alternative coding sequences with GATK
0
1
Entering edit mode
3.9 years ago
AlBaars ▴ 10

Hello all,

I have a large number of samples that are aligned to a reference genome and variant called. The variants have been analysed with SnpEff. I want to create a fasta file with the alternative coding sequences so that I can visualise them in a phylogenetic tree. I am currently using GATK4 to do so. I have already split the VCF file into a VCF file per sample, and GATK does seem to be doing something, but when I merged all fasta sequences into one file, all sequences were 100% identical. There are variants in my VCF file, this I have already checked. The merging of the fasta files is also working as it should, so, in my workflow, the problem lies with GATK.

My GATK command is this:

for vcf in /path/to/vcfs/*; do
   gatk --java-options "-Xmx64G" FastaAlternateReferenceMaker -R ref.fasta -V $vcf -O out_dir/ \
   $(basename "$vcf" .vcf).fasta -L "linkage_group:start-end"
done

I get all of the seperate fasta files with the desired filenames and merge them later. I doubt that the problem lies there, as the header re-naming process works fine. The GATK log says this:

Running:
    /gnu/store/00mf4q4xyy1bc5lxblggzylfzh7wyyla-icedtea-3.7.0/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx64G -jar /gnu/store/109pqziyqiqf2qdfw3x60s99sq4170s8-gatk4-4.1.5.0/share/java/user-classes/gatk.jar FastaAlternateReferenceMaker -R path/to/my/reference.fa -V /path/to/my/sample.vcf -O /path/to/my/output.fasta -L linkage_group:start-end
13:24:00.239 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gnu/store/109pqziyqiqf2qdfw3x60s99sq4170s8-gatk4-4.1.5.0/share/java/user-classes/gatk-package-4.1.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jun 10, 2020 1:24:00 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:24:00.514 INFO  FastaAlternateReferenceMaker - ------------------------------------------------------------
13:24:00.515 INFO  FastaAlternateReferenceMaker - The Genome Analysis Toolkit (GATK) v4.1.5.0
13:24:00.515 INFO  FastaAlternateReferenceMaker - For support and documentation go to https://software.broadinstitute.org/gatk/
13:24:00.515 INFO  FastaAlternateReferenceMaker - Executing as <user_name> on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
13:24:00.515 INFO  FastaAlternateReferenceMaker - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b12
13:24:00.515 INFO  FastaAlternateReferenceMaker - Start Date/Time: June 10, 2020 1:24:00 PM CEST
13:24:00.515 INFO  FastaAlternateReferenceMaker - ------------------------------------------------------------
13:24:00.515 INFO  FastaAlternateReferenceMaker - ------------------------------------------------------------
13:24:00.516 INFO  FastaAlternateReferenceMaker - HTSJDK Version: 2.21.2
13:24:00.516 INFO  FastaAlternateReferenceMaker - Picard Version: 2.21.9
13:24:00.516 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:24:00.516 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:24:00.516 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:24:00.516 INFO  FastaAlternateReferenceMaker - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:24:00.516 INFO  FastaAlternateReferenceMaker - Deflater: IntelDeflater
13:24:00.516 INFO  FastaAlternateReferenceMaker - Inflater: IntelInflater
13:24:00.516 INFO  FastaAlternateReferenceMaker - GCS max retries/reopens: 20
13:24:00.516 INFO  FastaAlternateReferenceMaker - Requester pays: disabled
13:24:00.516 INFO  FastaAlternateReferenceMaker - Initializing engine
13:24:00.843 INFO  FeatureManager - Using codec VCFCodec to read file file:///path/to/my/sample.vcf
13:24:00.889 INFO  IntervalArgumentCollection - Processing 2066 bp from intervals
13:24:00.894 INFO  FastaAlternateReferenceMaker - Done initializing engine
13:24:00.902 INFO  ProgressMeter - Starting traversal
13:24:00.902 INFO  ProgressMeter -        Current Locus  Elapsed Minutes       Bases Processed     Bases/Minute
13:24:01.005 INFO  ProgressMeter - linkage_group:start              0.0                  2066        1215294.1
13:24:01.005 INFO  ProgressMeter - Traversal complete. Processed 2066 total bases in 0.0 minutes.
13:24:01.030 INFO  FastaAlternateReferenceMaker - Shutting down engine
[June 10, 2020 1:24:01 PM CEST] org.broadinstitute.hellbender.tools.walkers.fasta.FastaAlternateReferenceMaker done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2211446784

I have changed the exact file locations, the rest is authentic. To me, as a beginner user of GATK, it looks like everything is going fine. But, again, none of the output CDS were different. I am running it on VCF files with only SNPs as I saw that indels can cause problems for this programme.

EDIT: Forgot to mention it but they are also all identical to the reference sequence. Have also exchanged the log for the right one; I had accidentally put the index log there, which was in the same file as I ran them both together.

Honestly, I'm lost on this one. Can anyone with more GATK experience maybe enlighten me on what I am doing wrong?

Many thanks in advance.

SNP gene GATK • 852 views
ADD COMMENT

Login before adding your answer.

Traffic: 1925 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6