Question: Detecting potentially unannotated genes from RNA-seq alignments?
gravatar for MACRODER
8 weeks ago by
MACRODER0 wrote:

Hi there. I was hoping someone can help me. I’m new to RNA-seq. I have my own RNA-seq data aligned to the reference genome as a bam file. Since this genome is poorly annotated, I would like to know if there is a way to automatically detect regions that are not present in the gtf annotation file but that have high coverage of reads in the bam file. I imagine this would allow to detect potentially unannotated genes. Then I would like to somehow mark these regions and save them in another file to inspect and perform BLAST. The genome does not have introns, so it should be straightforward but unfortunately I am not sure where to start. I hope I have explained myself well. Thank you in advance.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by MACRODER0

I think people usually do this through de novo assembly of the RNASeq. You assemble the reads into transcripts, then see if what you have matches annotated transcripts.

ADD REPLYlink written 8 weeks ago by swbarnes27.9k
gravatar for husensofteng
8 weeks ago by
husensofteng230 wrote:

I would recommend using stringTie in reference-guided mode. You can find examples here.

stringtie -l novelregions -G ref_genes.gtf -o assembled_transcripts.gtf input.bam

It basically, tries to assemble your reads into transcripts and annotates those that map to the known genes.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by husensofteng230

Thank you, your suggestion was very usefull. I used stringtie and it found many novel transcripts that are not annotated in the refference. I visually inspect in IGV the original bam file and the gtf produced by stringtie, and found that some of the novel transcripts have a really low coverage of RNA-Seq reads. So, I would like to re-do the analysis with some threshold in coverage. I know that this has to do with the -c parameter (from the manual: Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 1). Now my problem is, how is this coverage calculated? What number should I try? I'm not sure if this parameter goes from 1 to some fixed value... Any advice?

ADD REPLYlink written 8 weeks ago by MACRODER0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1175 users visited in the last hour