Getting all the genomic coordinates where RNA-seq reads are split within a given genomic interval in a bam file
17 months ago
rubic ▴ 240

Hi,

I'm trying to use bulk RNA-seq data to improve the annotation I have for a T-cell receptor gamma locus, in a genome with an intermediate level of annotation.

To briefly explain the situation: T-cells are born in the bone marrow, where their T-cell receptor gamma locus has a germline structure that looks like this: '5--V1---V2---V3---J1---J2---C--3'

The V, J, and C segments are protein coding and stand for Variable, Joining, and Constant; the dashes represent intervening non-coding DNA.

When these T-cells reach the thymus (their next stop after the bone marrow), their T-cell receptor gamma locus gets rearranged: a randomly selected V segment (Vx below) is joined to a randomly selected J segment (Jy below), and the intervening DNA is cut out, giving this DNA structure: '5--VxJy---C--3'

This is then transcribed, and the DNA between the Jy and C segments is spliced out like an intron. The choice of V and J is random and differs for each T-cell.

The genome annotation I have for the T-cell receptor gamma locus looks like this: '5--C---C---C---C---C---C--3'

Here all these C segments are annotated as exons of the same gene. However, I suspect that this annotation is wrong, because I can only find sequence similarity to a C protein domain at the most 3' C segment, whereas I'm finding similarity to V/J protein domains upstream of it - not at exactly the same coordinates as the upstream annotated C segments, but close.

I have bulk RNA-seq data from the spleen of this species, which includes, among other cells, T-cells that have been through the thymus and hence carry a rearranged locus.

I thought of trying something as simple as obtaining all the coordinates at which the RNA-seq reads are spliced between the 5' and 3' ends of this locus, which might help resolve this situation.

So my question is whether there is a way to obtain this information from tools such as samtools, bedtools, etc.

But perhaps I'm better off just running a de novo transcriptome assembler dedicated to reconstructing T-cell receptor loci, in which case, any recommendations? Thanks!

RNA-Seq bam splicing cigar reads • 671 views
17 months ago

I thought of trying something as simple as obtaining all the coordinates at which the RNA-seq reads are spliced between the 5' and 3' ends of this locus, which might help resolve this situation.

samtools view -b in.bam "chr1:2345-6789" | bedtools bamtobed -i stdin -split
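
Note that bamtobed -split reports the aligned blocks of each read rather than the gaps between them, so a splice junction is the space between two consecutive blocks of the same read. If you would rather get the junction coordinates directly, with a count of the reads supporting each one, a rough sketch along these lines might work. It parses the CIGAR strings from plain SAM text and reports each N (skipped region) operation as a 0-based, half-open interval. The function names and the script name junctions.py are made up for illustration, and it assumes your aligner encodes introns as N, as spliced RNA-seq aligners normally do.

```python
import re
import sys
from collections import Counter

# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_sam_line(line):
    """Yield (chrom, start, end) for every N operation in one SAM record.

    Coordinates are 0-based, half-open (BED style) and span the skipped
    region, i.e. the putative intron. Function name is hypothetical.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
    if cigar == "*":  # unaligned record or missing CIGAR
        return
    ref_pos = pos - 1  # SAM POS is 1-based
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            yield (chrom, ref_pos, ref_pos + length)
        if op in REF_CONSUMING:
            ref_pos += length


def count_junctions(lines):
    """Count how many records support each junction, skipping header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        counts.update(junctions_from_sam_line(line))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects the path to a SAM text file as the first argument.
    with open(sys.argv[1]) as handle:
        for (chrom, start, end), n in sorted(count_junctions(handle).items()):
            print(chrom, start, end, n, sep="\t")
```

To use it, write the region out as SAM text first (samtools view in.bam "chr1:2345-6789" > region.sam, which needs an indexed BAM) and then run python junctions.py region.sam. Junctions supported by many reads and shared across V/J combinations (such as the J-to-C intron) should stand out in the counts column.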