Hello BioStars Community,
I've been working with VCF files generated by the nf-core RNAvar pipeline, which employs the GATK best practices for RNA-seq variant calling. Upon reviewing the genotype information in my VCF files, I encountered a variety of genotype formats beyond the common ones I'm familiar with ("./.", "1/1", "0/1", "1/2").
Here are the unique genotype formats I extracted from my data:
# > unique_values
# [1] "./." "1/1" "0/1" "1/2" "2/2" "0/2" "2/1" "0/3" "1/3" "2/3" "3/2" "3/1" "3/3" "3/4" "0/4" "4/3" "1/4" "2/4" "4/2" "4/4"
# [21] "5/2" "3/5" "5/5" "4/5" "5/4" "0/5" "4/1" "5/6" "5/1" "1/5" "6/7" "2/5" "0/6" "1/7" "5/3" "6/5" "4/8" "3/6" "4/6" "6/2"
# [41] "0/7" "4/7" "6/1" "6/6" "6/3" "0/8" "7/7" "2/7" "1/6" "7/2" "9/7" "2/6" "5/8" "7/5" "8/2" "6/4" "3/7" "0/9" "9/3" "7/4"
# [61] "7/8" "9/9" "8/9" "5/10" "7/1" "10/4" "0/10" "9/1" "10/10" "2/8" "9/2" "8/5" "7/3" "8/8" "5/7" "8/3" "10/3" "3/8" "9/5" "1/8"
# [81] "3/9" "1/10" "3/10" "2/11" "8/1" "8/10" "1/9" "0/11" "8/6" "2/9" "10/6" "4/11" "8/4" "7/6" "6/8" "12/13" "9/10" "11/10" "11/7" "4/9"
# [101] "7/9" "0/14" "6/9" "11/12" "5/9" "6/10" "10/11" "12/1" "12/12" "11/11" "2/10" "9/11" "0/13" "8/7" "5/11" "10/1" "0/12" "2/12" "9/8" "10/2"
# [121] "4/10" "10/5" "8/11" "10/7" "1/11" "1/13" "9/6" "0/15" "11/2" "15/1" "0/16" "7/10" "11/8" "3/11" "17/17" "12/2" "4/12" "8/16" "11/9" "13/13"
# [141] "0/18" "0/17" "11/3" "13/8" "6/11" "11/5" "10/9" "11/14" "9/14" "11/1" "13/12" "1/12" "13/14" "9/4"
The genotypes I recognize are "./." (missing data), "0/1" (heterozygous: reference plus the first alternate allele), "1/1" (homozygous for the first alternate allele), and "1/2" (heterozygous for two different alternate alleles). However, genotypes with higher allele indices, such as "1/3", "2/4", or "5/5", are confusing, and I am unsure how to interpret them, especially considering the RNA-seq context and the GATK pipeline used.
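For reference, this is how I currently read the GT field, based on the VCF specification: each number is an index into the alleles at that site (0 = REF, 1 = first ALT, 2 = second ALT, and so on), "/" separates unphased alleles, "|" separates phased ones, and "." means missing. Below is a minimal Python sketch of that reading. The function name `classify_gt` and the category labels are my own, not part of GATK or any library, and it assumes diploid calls.

```python
def classify_gt(gt: str) -> str:
    """Classify a diploid VCF GT string such as "0/1" or "3|5".

    Hypothetical helper for illustration; labels are my own naming.
    """
    # "/" separates unphased alleles, "|" separates phased alleles.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    a, b = (int(x) for x in alleles)  # assumes exactly two alleles (diploid)
    if a == b:
        # Index 0 is REF; any other index is one of the ALT alleles.
        return "hom_ref" if a == 0 else "hom_alt"
    # Heterozygous: either REF plus one ALT, or two different ALTs.
    return "het_ref_alt" if 0 in (a, b) else "het_alt_alt"


for gt in ["./.", "0/1", "1/1", "1/2", "1/3", "5/5", "0/18"]:
    print(gt, classify_gt(gt))
```

Under this reading, "5/5" would be homozygous for the fifth alternate allele and "1/3" heterozygous for the first and third alternate alleles, which is only possible at a site that lists at least that many ALT alleles.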
I suspect these unusual genotypes might have originated from specific nuances in the RNA-seq variant calling process or might represent multi-allelic sites, but I'm not entirely sure. Before I proceed with filtering these genotypes or interpreting the variant calls, I'd appreciate any insights or recommendations on how to handle these genotypes.
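To check the multi-allelic explanation, I could translate the genotype indices back into actual alleles using the REF and ALT columns of each record. Here is a minimal sketch of that idea. The helper `resolve_alleles` and the example record are made up for illustration and do not come from my data; it assumes ALT is the usual comma-separated list from the VCF.

```python
def resolve_alleles(ref: str, alt: str, gt: str) -> list:
    """Map GT indices to allele strings using the REF and ALT columns.

    Hypothetical helper; `alt` is the comma-separated ALT field.
    """
    # Index 0 is REF, index i (i >= 1) is the i-th ALT allele.
    alleles = [ref] + alt.split(",")
    resolved = []
    for idx in gt.replace("|", "/").split("/"):
        # Keep missing alleles as "." instead of looking them up.
        resolved.append("." if idx == "." else alleles[int(idx)])
    return resolved


# Made-up multi-allelic record with three ALT alleles.
ref, alt = "A", "C,G,T"
print(resolve_alleles(ref, alt, "1/3"))  # ['C', 'T']
print(resolve_alleles(ref, alt, "0/2"))  # ['A', 'G']
```

If a genotype index were larger than the number of ALT alleles at its site, the lookup would raise an `IndexError`, which would point to a malformed record rather than a genuine multi-allelic call.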
Should I be concerned about the validity of these unusual genotypes, or are they expected outcomes given the RNA-seq variant calling methodology? Any advice on best practices for filtering or interpreting them would be greatly appreciated. Also, this is human data, so the samples are diploid. Thank you in advance for your help!