Problem with Calling Variants from RNA-Seq data
Entering edit mode
14 days ago
Esraa • 0

Hello, I have previously posted that i was trying to find a benchmark RNA-Seq dataset for variant calling of SNPs and Indels. Someone recommended using the GIAB RNA-Seq reads and comparing it with the truth set of the same sample.

I tried running the GATK Best practices pipeline for short variant discovery on the data and comparing my results with theirs using, but got very unexpected bad results of F1 Scores 0.04, which to me indicates that i seem to be doing something wrong.

I tried troubleshooting by re-checking my references, tool parameters, even running the variant calling step on the GIAB Bam provided with the reads, but still nothing. Keeping in mind that Google's Deepvariant has used the same approach during their model evaluation and had acquired F1 Scores for the gatk results around >0.7.

So I would be really grateful if anyone could please help me identify the problem with my run, as i seem to have run out of ideas to what could be the possible cause of those results.

vcf gatk benchmark giab rna-seq • 203 views
Entering edit mode

My references



Reads: (hg002_gm24385.mrna.R1.fastq.gz and hg002_gm24385.mrna.R2.fastq.gz)

GTF: gencode.v45.basic.annotation.gtf

My workflow

1. Index preparation

STAR --runThreadN 15 --runMode genomeGenerate --genomeDir ./HG002/Index --genomeFastaFiles ./HG002/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta --sjdbGTFfile ./HG002/gencode.v31.basic.annotation.gtf --sjdbOverhang 149 --genomeSAsparseD 2

2. Mapping

STAR --runThreadN 15 --readFilesIn hg002_gm24385.mrna.R1.fastq.gz hg002_gm24385.mrna.R2.fastq.gz --genomeDir Index/ --outSAMtype BAM SortedByCoordinate --outFileNamePrefix mapped --outSAMunmapped Within --outSAMattrRGline ID:HG002 LB:lib1 SM:ONE PL:ILLUMINA --outSAMmapqUnique 60 --twopassMode Basic --outSAMattributes XS --readFilesCommand gunzip -c

3. Mark Duplicates

gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" MarkDuplicates -I mappedAligned.sortedByCoord.out.bam -O marked_duplicates.bam -M marked_dup_metrics.txt

4. SplitNCigarReads

gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" SplitNCigarReads -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta -I ./HG002/marked_duplicates.bam -O marked_split.bam

5. HaplotypeCaller

gatk --java-options "-Xmx4g -XX:ParallelGCThreads=14" HaplotypeCaller -R GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta -I ./HG002/marked_split.bam -O gatk.vcf --native-pair-hmm-threads 14

Note: It is also worth mentioning that i have not done the BQSR step from the pipeline for personal study reasons, but i have searched and do not think it could be the thing causing this huge gap in results.

Entering edit mode

Have you accounted for A>I RNA editing?

Entering edit mode

Yes, i specifically ran the RNA-seq best practices, and no i have not taken RNA editing into account considering this is not mentioned in the actual pipeline. Also, Deepvariant's publication tried the GATK pipeline as it is without any improvements and still got good results.


Login before adding your answer.

Traffic: 2765 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6