Understanding Uncommon Genotype Formats in VCF Files Generated by nf-core RNAvar Pipeline
6 weeks ago
ASid ▴ 40

Hello BioStars Community,

I've been working with VCF files generated by the nf-core RNAvar pipeline, which employs the GATK best practices for RNA-seq variant calling. Upon reviewing the genotype information in my VCF files, I encountered a variety of genotype formats beyond the common ones I'm familiar with ("./.", "1/1", "0/1", "1/2").

Here are the unique genotype formats I extracted from my data:

# > unique_values
# [1] "./."   "1/1"   "0/1"   "1/2"   "2/2"   "0/2"   "2/1"   "0/3"   "1/3"   "2/3"   "3/2"   "3/1"   "3/3"   "3/4"   "0/4"   "4/3"   "1/4"   "2/4"   "4/2"   "4/4"  
# [21] "5/2"   "3/5"   "5/5"   "4/5"   "5/4"   "0/5"   "4/1"   "5/6"   "5/1"   "1/5"   "6/7"   "2/5"   "0/6"   "1/7"   "5/3"   "6/5"   "4/8"   "3/6"   "4/6"   "6/2"  
# [41] "0/7"   "4/7"   "6/1"   "6/6"   "6/3"   "0/8"   "7/7"   "2/7"   "1/6"   "7/2"   "9/7"   "2/6"   "5/8"   "7/5"   "8/2"   "6/4"   "3/7"   "0/9"   "9/3"   "7/4"  
# [61] "7/8"   "9/9"   "8/9"   "5/10"  "7/1"   "10/4"  "0/10"  "9/1"   "10/10" "2/8"   "9/2"   "8/5"   "7/3"   "8/8"   "5/7"   "8/3"   "10/3"  "3/8"   "9/5"   "1/8"  
# [81] "3/9"   "1/10"  "3/10"  "2/11"  "8/1"   "8/10"  "1/9"   "0/11"  "8/6"   "2/9"   "10/6"  "4/11"  "8/4"   "7/6"   "6/8"   "12/13" "9/10"  "11/10" "11/7"  "4/9"  
# [101] "7/9"   "0/14"  "6/9"   "11/12" "5/9"   "6/10"  "10/11" "12/1"  "12/12" "11/11" "2/10"  "9/11"  "0/13"  "8/7"   "5/11"  "10/1"  "0/12"  "2/12"  "9/8"   "10/2" 
# [121] "4/10"  "10/5"  "8/11"  "10/7"  "1/11"  "1/13"  "9/6"   "0/15"  "11/2"  "15/1"  "0/16"  "7/10"  "11/8"  "3/11"  "17/17" "12/2"  "4/12"  "8/16"  "11/9"  "13/13"
# [141] "0/18"  "0/17"  "11/3"  "13/8"  "6/11"  "11/5"  "10/9"  "11/14" "9/14"  "11/1"  "13/12" "1/12"  "13/14" "9/4"

The common genotypes make sense to me: "./." is a missing call, "0/1" is heterozygous (one reference and one alternate allele), and "1/1" is homozygous for the alternate allele; "1/2" I take to be heterozygous for two different alternate alleles. However, genotypes with higher allele indices, such as "1/3", "2/4", or "5/5", are confusing, and I am unsure how to interpret them, especially given the RNA-seq context and the GATK pipeline used.

I suspect these unusual genotypes might have originated from specific nuances in the RNA-seq variant calling process or might represent multi-allelic sites, but I'm not entirely sure. Before I proceed with filtering these genotypes or interpreting the variant calls, I'd appreciate any insights or recommendations on how to handle these genotypes.
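
To make the multi-allelic reading concrete, here is a minimal sketch of how I understand GT indices to map onto alleles under the VCF specification (0 is REF, 1 is the first ALT allele, 2 the second, and so on). The function name `decode_gt` and the example site are hypothetical and not taken from my data.

```python
import re

def decode_gt(ref, alt, gt):
    """Translate a VCF GT string such as "1/3" into allele sequences."""
    # Index 0 is REF; indices 1..n are the comma-separated ALT alleles in order.
    alleles = [ref] + alt.split(",")
    decoded = []
    # "/" separates unphased alleles, "|" separates phased alleles.
    for idx in re.split(r"[/|]", gt):
        decoded.append(None if idx == "." else alleles[int(idx)])
    return decoded

# Hypothetical multi-allelic site: REF=A, ALT=C,G,T
print(decode_gt("A", "C,G,T", "1/3"))  # ['C', 'T']
print(decode_gt("A", "C,G,T", "./."))  # [None, None]
```

If this reading is right, a call like "5/5" would only be possible at a site listing at least five ALT alleles.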

Should I be concerned about the validity of these unusual genotypes, or are they expected outcomes given the RNA-seq variant calling methodology? Any advice on best practices for filtering or interpreting them would be greatly appreciated. Also, this is human data, so the samples are diploid. Thank you in advance for your help!
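
In case it helps frame the filtering question, below is a rough sketch of how I could count ALT alleles per record to see which sites are multi-allelic before deciding what to do with them. It assumes an uncompressed, tab-delimited VCF read line by line. The helper name `alt_allele_counts` and the example records are hypothetical.

```python
def alt_allele_counts(vcf_lines):
    """Yield (chrom, pos, n_alt) for each data line of an uncompressed VCF."""
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        # ALT is "." when no alternate allele is listed.
        n_alt = 0 if alt == "." else len(alt.split(","))
        yield chrom, pos, n_alt

# Hypothetical records for illustration only
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tA\tC\t50\tPASS\t.\tGT\t0/1\n",
    "chr1\t200\t.\tAT\tA,ATT,ATTT\t50\tPASS\t.\tGT\t1/3\n",
]

# Keep only records with more than one ALT allele.
multi = [(c, p, n) for c, p, n in alt_allele_counts(example) if n > 1]
print(multi)  # [('chr1', 200, 3)]
```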
