Getting all the genomic coordinates where RNA-seq reads are split within a given genomic interval in a bam file
3.3 years ago
rubic ▴ 270

Hi,

I'm trying to improve the genomic annotation I have for a T-cell receptor gamma locus of a genome of intermediate annotation level, from bulk RNA-seq data.

To briefly explain the situation, T-cells are born in the bone marrow and their T-cell receptor gamma locus has the germline genome structure that looks like this: '5--V1---V2---V3---J1---J2---C--3'

The V, J, and C segments are protein coding and stand for Variable, Joining, and Constant; the dashes represent intervening non-coding DNA.

When these T-cells reach the thymus (their next stop after the bone marrow), their T-cell receptor gamma locus gets rearranged: a randomly selected V segment (Vx below) is joined to a randomly selected J segment (Jy below) and the intervening DNA is cut out, giving this DNA structure: '5--VxJy---C--3'

This is then transcribed, and the DNA between the Jy and C segments is spliced out like an intron. The choice of V and J is random and different for each T-cell.

The genome annotation I have for the T-cell receptor gamma locus looks like this: '5--C---C---C---C---C---C--3'

Where all these C segments are annotated as exons of the same gene. However, I suspect that this annotation is wrong, because I can only find sequence similarity to a C protein domain at the most 3' C segment, whereas I'm finding similarities to V/J protein domains upstream of it: not exactly at the same coordinates as the upstream annotated C segments, but close.

I have bulk RNA-seq data from the spleen of this species, which includes, among other cells, T-cells that have been through the thymus and have therefore rearranged.

I thought of trying something as simple as obtaining all the coordinates at which the RNA-seq reads are spliced between the 5' and 3' ends of this locus, which might help resolve this situation.

So my question is whether there is a way to obtain this information from tools such as samtools, bedtools, etc.

But perhaps I'm better off just running a de novo transcriptome assembler dedicated to reconstructing T-cell receptor loci, in which case, any recommendations? Thanks!

RNA-Seq bam splicing cigar reads • 948 views
3.3 years ago

I thought of trying something as simple as obtaining all the coordinates at which the RNA-seq reads are spliced between the 5' and 3' ends of this locus, which might help resolve this situation.

samtools view -b in.bam "chr1:2345-6789" | bedtools bamtobed -i stdin -split
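The command above prints one BED line per aligned block of each read, so the split points are the gaps between consecutive blocks of the same read. If you would rather have the gaps themselves, with a count of supporting reads, here is a minimal sketch that reads SAM text (what `samtools view in.bam "chr1:2345-6789"` prints without `-b`) and collects every CIGAR `N` operation. The function name `junctions_from_sam` and the demo records are made up for illustration; it assumes your aligner reports split reads as `N` gaps, which may not hold for every V-J join, since those come from DNA rearrangement rather than splicing.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def junctions_from_sam(lines):
    """Count CIGAR 'N' gaps in SAM text lines.

    Returns a Counter keyed by (chrom, start, end), where start and end are
    the 0-based, half-open coordinates of the skipped reference span
    (the same convention BED uses).
    """
    counts = Counter()
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        chrom, pos, cigar = fields[2], fields[3], fields[5]
        if cigar == "*":  # no alignment information
            continue
        ref = int(pos) - 1  # SAM POS is 1-based
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op == "N":
                counts[(chrom, ref, ref + length)] += 1
            if op in REF_CONSUMING:
                ref += length
    return counts


# Hypothetical records, only to show the call; real input would be the
# lines printed by `samtools view` for the region of interest.
demo = [
    "r1\t0\tchr1\t101\t60\t10M100N10M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t106\t60\t5S5M100N10M\t*\t0\t0\t*\t*",
]
for (chrom, start, end), n in sorted(junctions_from_sam(demo).items()):
    print(chrom, start, end, n, sep="\t")
```

On real data you would pass the lines from `samtools view` (for example via `sys.stdin`) instead of `demo`. Junctions supported by many reads and ending just upstream of the 3' C segment would be the ones to compare against your V/J domain hits.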