I'm trying to improve the genomic annotation of a T-cell receptor gamma locus, in a genome with an intermediate level of annotation, using bulk RNA-seq data.
To briefly explain the situation, T-cells are born in the bone marrow and their T-cell receptor gamma locus has the germline genome structure that looks like this: '5--V1---V2---V3---J1---J2---C--3'
The V, J, and C segments are protein coding and stand for Variable, Joining, and Constant; the dashes represent intervening non-coding DNA.
When these T-cells reach the thymus (their next stop after the bone marrow), their T-cell receptor gamma locus gets rearranged: a randomly selected V segment (Vx below) is joined to a randomly selected J segment (Jy below) and the intervening DNA is cut out, giving this DNA structure: '5--VxJy---C--3'
The rearranged locus is then transcribed, and the sequence between the Jy and C segments is spliced out like an intron. The choice of V and J is random and differs between T-cells.
The genome annotation I have for the T-cell receptor gamma locus looks like this: '5--C---C---C---C---C---C--3'
Here all the C segments are annotated as exons of the same gene. However, I suspect that this annotation is wrong, because I can only find sequence similarity to a C protein domain at the most 3' C segment, whereas upstream of it I find similarity to V/J protein domains. These hits are not at exactly the same coordinates as the upstream annotated C segments, but they are close.
I have bulk RNA-seq data from spleen of the species, which includes among other cells, T-cells after they have been through the thymus and hence rearranged.
I thought of trying something as simple as obtaining the coordinates of every splice junction supported by RNA-seq reads between the 5' and 3' ends of this locus, which might help resolve the situation.
So my question is whether there is a way to obtain this information from tools such as samtools, bedtools, etc.
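To make this concrete, here is a minimal sketch of what I have in mind, not a tested pipeline. It assumes the reads were mapped with a splice-aware aligner, so that spliced reads carry `N` operations in their CIGAR strings, and that the alignments for the locus are available as SAM text (for example from `samtools view` restricted to a region). The contig name `chrT`, the coordinates, and the toy reads are made up for illustration.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def junctions_from_sam_line(line):
    """Yield (contig, intron_start, intron_end) for each N operation.

    Coordinates are 1-based and inclusive, like the POS field in SAM.
    """
    fields = line.rstrip("\n").split("\t")
    contig, pos, cigar = fields[2], int(fields[3]), fields[5]
    if cigar == "*":  # unmapped read or no CIGAR available
        return
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference = putative intron
            yield (contig, ref, ref + length - 1)
        if op in REF_CONSUMING:
            ref += length


def count_junctions(sam_lines, contig, start, end):
    """Count reads supporting each junction fully inside contig:start-end."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        for c, s, e in junctions_from_sam_line(line):
            if c == contig and s >= start and e <= end:
                counts[(c, s, e)] += 1
    return counts


# Toy SAM records (hypothetical): QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
toy_sam = [
    "@SQ\tSN:chrT\tLN:1000",
    "r1\t0\tchrT\t100\t60\t50M200N50M\t*\t0\t0\t*\t*",
    "r2\t0\tchrT\t120\t60\t30M200N70M\t*\t0\t0\t*\t*",
    "r3\t0\tchrT\t400\t60\t100M\t*\t0\t0\t*\t*",
    "r4\t0\tchrT\t110\t60\t10S40M500N60M\t*\t0\t0\t*\t*",
]

counts = count_junctions(toy_sam, "chrT", 1, 1000)
for (contig, s, e), n in sorted(counts.items()):
    print(contig, s, e, n, sep="\t")
```

On real data the same function could be fed the lines produced by something like `samtools view aln.bam chrT:1-1000` (file name and region are hypothetical). Junctions that share one end but differ at the other would be consistent with several J segments splicing to a single C exon.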
But perhaps I'm better off just running a de novo transcriptome assembler dedicated to reconstructing T-cell receptor loci, in which case, any recommendations? Thanks!