Question

Understanding Uncommon Genotype Formats in VCF Files Generated by nf-core RNAvar Pipeline

6 weeks ago

ASid ▴ 40

Hello BioStars Community,

I've been working with VCF files generated by the nf-core RNAvar pipeline, which employs the GATK best practices for RNA-seq variant calling. Upon reviewing the genotype information in my VCF files, I encountered a variety of genotype formats beyond the common ones I'm familiar with ("./.", "1/1", "0/1", "1/2").

Here are the unique genotype formats I extracted from my data:

# > unique_values
# [1] "./."   "1/1"   "0/1"   "1/2"   "2/2"   "0/2"   "2/1"   "0/3"   "1/3"   "2/3"   "3/2"   "3/1"   "3/3"   "3/4"   "0/4"   "4/3"   "1/4"   "2/4"   "4/2"   "4/4"  
# [21] "5/2"   "3/5"   "5/5"   "4/5"   "5/4"   "0/5"   "4/1"   "5/6"   "5/1"   "1/5"   "6/7"   "2/5"   "0/6"   "1/7"   "5/3"   "6/5"   "4/8"   "3/6"   "4/6"   "6/2"  
# [41] "0/7"   "4/7"   "6/1"   "6/6"   "6/3"   "0/8"   "7/7"   "2/7"   "1/6"   "7/2"   "9/7"   "2/6"   "5/8"   "7/5"   "8/2"   "6/4"   "3/7"   "0/9"   "9/3"   "7/4"  
# [61] "7/8"   "9/9"   "8/9"   "5/10"  "7/1"   "10/4"  "0/10"  "9/1"   "10/10" "2/8"   "9/2"   "8/5"   "7/3"   "8/8"   "5/7"   "8/3"   "10/3"  "3/8"   "9/5"   "1/8"  
# [81] "3/9"   "1/10"  "3/10"  "2/11"  "8/1"   "8/10"  "1/9"   "0/11"  "8/6"   "2/9"   "10/6"  "4/11"  "8/4"   "7/6"   "6/8"   "12/13" "9/10"  "11/10" "11/7"  "4/9"  
# [101] "7/9"   "0/14"  "6/9"   "11/12" "5/9"   "6/10"  "10/11" "12/1"  "12/12" "11/11" "2/10"  "9/11"  "0/13"  "8/7"   "5/11"  "10/1"  "0/12"  "2/12"  "9/8"   "10/2" 
# [121] "4/10"  "10/5"  "8/11"  "10/7"  "1/11"  "1/13"  "9/6"   "0/15"  "11/2"  "15/1"  "0/16"  "7/10"  "11/8"  "3/11"  "17/17" "12/2"  "4/12"  "8/16"  "11/9"  "13/13"
# [141] "0/18"  "0/17"  "11/3"  "13/8"  "6/11"  "11/5"  "10/9"  "11/14" "9/14"  "11/1"  "13/12" "1/12"  "13/14" "9/4"
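A minimal sketch of how values like these could be grouped by type, assuming diploid GT strings and the VCF convention that index 0 is the REF allele and index n is the n-th ALT allele listed for that record. `classify_gt` is a hypothetical helper name, not something the pipeline provides:

```python
from collections import Counter


def classify_gt(gt: str) -> str:
    """Classify a diploid GT string such as '0/1' or '3|3' (hypothetical helper)."""
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    a, b = (int(x) for x in alleles)  # assumes exactly two alleles (diploid)
    if a == b:
        return "hom_ref" if a == 0 else "hom_alt"
    if 0 in (a, b):
        return "het_ref_alt"
    return "het_alt_alt"  # two different ALT alleles, e.g. '1/2'


# A few of the values listed above.
sample = ["./.", "1/1", "0/1", "1/2", "0/18", "17/17"]
counts = Counter(classify_gt(g) for g in sample)
print(counts)
```

Under this reading, a value such as "0/18" is an ordinary heterozygous call whose ALT allele happens to be the 18th one listed at that site.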

The common genotypes ("./.", "1/1", "0/1", "1/2") make sense to me: as I understand the VCF convention, they represent missing data, homozygous alternate, heterozygous reference/alternate, and heterozygous with two different alternate alleles, respectively (homozygous reference would be "0/0"). However, the presence of higher allele indices such as "1/3", "2/4", or "5/5" is confusing, and I am unsure how to interpret these, especially considering the RNA-seq context and the GATK pipeline used.

I suspect these unusual genotypes might have originated from specific nuances in the RNA-seq variant calling process or might represent multi-allelic sites, but I'm not entirely sure. Before I proceed with filtering these genotypes or interpreting the variant calls, I'd appreciate any insights or recommendations on how to handle these genotypes.

Should I be concerned about the validity of these unusual genotypes, or are they expected outcomes given the RNA-seq variant calling methodology? Any advice on best practices for filtering or interpreting them would be greatly appreciated. Also, this is human data, so the samples are diploid. Thank you in advance for your help!
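To make the interpretation question concrete, here is a rough sketch of how a GT value could be resolved against the REF and ALT columns of the same record, assuming the standard VCF layout where ALT is a comma-separated list. The record used (REF "A", ALT "G,T,C") is made up for illustration, and `decode_gt` and `n_alt_alleles` are hypothetical helper names:

```python
def decode_gt(ref: str, alt: str, gt: str) -> tuple:
    """Map GT indices to allele strings using the record's REF and ALT columns."""
    # Index 0 is REF; indices 1..n follow the order of the ALT column.
    alleles = [ref] + ([] if alt == "." else alt.split(","))
    decoded = []
    for idx in gt.replace("|", "/").split("/"):
        # A '.' index means the allele was not called.
        decoded.append(None if idx == "." else alleles[int(idx)])
    return tuple(decoded)


def n_alt_alleles(alt: str) -> int:
    """Number of ALT alleles in the record ('.' means none)."""
    return 0 if alt == "." else len(alt.split(","))


# Made-up record: REF=A, ALT=G,T,C, genotype 1/3.
print(decode_gt("A", "G,T,C", "1/3"))
# More than one ALT allele marks the record as multi-allelic.
print(n_alt_alleles("G,T,C") > 1)
```

If the high indices do turn out to come from multi-allelic records, one option to consider before filtering is splitting them into one ALT allele per line, for example with the `-m` option of `bcftools norm`. Whether that suits this dataset is something I have not verified.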

genotype rnaseq • 172 views

updated 6 weeks ago by Pierre Lindenbaum 161k • written 6 weeks ago by ASid ▴ 40

Don't forget to follow up on your threads; leaving them unattended is bad etiquette. If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work. If an answer was not really helpful or did not work, provide detailed feedback so others know not to use that answer.

6 weeks ago by Pierre Lindenbaum 161k