Question

Detecting potentially unannotated genes from RNA-seq alignments?

0

Entering edit mode

3.9 years ago

MACRODER ▴ 10

Hi there. I was hoping someone can help me. I’m new to RNA-seq. I have my own RNA-seq data aligned to the reference genome as a bam file. Since this genome is poorly annotated, I would like to know if there is a way to automatically detect regions that are not present in the gtf annotation file but that have high coverage of reads in the bam file. I imagine this would allow to detect potentially unannotated genes. Then I would like to somehow mark these regions and save them in another file to inspect and perform BLAST. The genome does not have introns, so it should be straightforward but unfortunately I am not sure where to start. I hope I have explained myself well. Thank you in advance.

RNA-Seq alignment annotation new genes • 977 views

ADD COMMENT • link 3.9 years ago by MACRODER ▴ 10

1

Entering edit mode

I think people usually do this through de novo assembly of the RNASeq. You assemble the reads into transcripts, then see if what you have matches annotated transcripts.

ADD REPLY • link 3.9 years ago by swbarnes2 14k

score 1 · Answer 1 · 2020-05-14

1

Entering edit mode

3.9 years ago

husensofteng ▴ 410

I would recommend using stringTie in reference-guided mode. You can find examples here.

stringtie -l novelregions -G ref_genes.gtf -o assembled_transcripts.gtf input.bam

It basically, tries to assemble your reads into transcripts and annotates those that map to the known genes.

ADD COMMENT • link 3.9 years ago by husensofteng ▴ 410

0

Entering edit mode

Thank you, your suggestion was very usefull. I used stringtie and it found many novel transcripts that are not annotated in the refference. I visually inspect in IGV the original bam file and the gtf produced by stringtie, and found that some of the novel transcripts have a really low coverage of RNA-Seq reads. So, I would like to re-do the analysis with some threshold in coverage. I know that this has to do with the -c parameter (from the manual: Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 1). Now my problem is, how is this coverage calculated? What number should I try? I'm not sure if this parameter goes from 1 to some fixed value... Any advice?

ADD REPLY • link 3.9 years ago by MACRODER ▴ 10