mapping RNAseq reads to a genbank isoforme that is not in Reference Annotation
0
0
Entering edit mode
6.6 years ago

Hi, I'm trying to quantify different isoforms of a gene using RNAseq data. I have some bam files and I've tried assembling transcripts by cufflinks -G option using reference annotation (downloaded from UCSC). but one of the isoformes is not annotated in reference annotation that I have (maybe because its new!) and I can quantify it. however I can see this isoform in human mRNA from genbank track in UCSC. so is there any other reference annotation that I can use that may contain this new isoform? or all available reference are the same?

i have to mention that I've tried de novo assembling using -g option and the simple cufflinks command but it couldnt find the whole isoform, instead it predicted the extra exon(presented in this isoform) solely as a distinct transcript. so what is the possible remedy to this problem.

RNA-Seq sequence cufflinks new isoform in rnaseq • 2.4k views
2
Entering edit mode

I guess your reference may be a GTF or GFF3 file so you can add in command line.

0
Entering edit mode

Yes my reference is a gtf file but I'm quite new to this area. How can I add an isoform to my reference annotation? Could you please suggest me a paper or discussion on how to do that.

1
Entering edit mode

First, you create another gtf file(isoform.gtf) of your new isoform. The contents are your isoform.GTF format is described at

The line MUST be tab-separated, and chromosome name ( column-1 ) MUST be "1" not "chr1". If you prepare above new gtf file,

cat isoform.gtf UCSC_annotation.gtf > new.gtf


in your command line.(Linux or OS X)

0
Entering edit mode

thank you very much, i added 3 overlapping isoforms into my reference annotations file. I ran cufflinks to quantify them. it seems that only one of the added isoforms have been taken to analysis by cufflinks..because for 2 isoforms RPKM is 0.0000 . could it be because they share exons and cufflinks somehow ignore some reads that mapped to those exon for 2 out of 3 similar isoforms?

0
Entering edit mode

That is possible, but I cannot say the exact thing about Cufflinks isoform assembly.

So I think you should confirm it by reading Cufflinks original paper. I'm sorry that I couldn't be of any help to you.

1
Entering edit mode

Hi,

Just playing devil's advocate here. If you load your RNA-seq BAM in IGV and go to your gene and then make a Sashimi plot (click the left hand side panel which shows the name of the BAM loaded and from the drop down menu choose 'Sashimi plot').

You should see something like this -

Basically the exons should have connecting bridges/loops which means reads split-aligned and hence supporting the transcript structure/ isoform. Do you see your novel/ extra exon with a loop connecting it to the preceding exon?

If not, then it might be the reason why Cufflinks fails to generate the desired isoform but instead generates a single exon transcript.

I might be wrong but its worth checking.

0
Entering edit mode

Thanks for the tip. I checked and there is no loop. In other words, there is no junction read that covers these two exons. I did not map the reads myself. I started my work on bam files. Is there any possibility that these junction reads are discarded in previous processing step? By the way this new exon is 10 kb up stream of the conventional first exon. Could this large distance affect cufflinks ability to generate the novel isoform?

1
Entering edit mode

There is a parameter regulating the max. intron length in splice-aware aligners. But like in TopHat by deafult its 500kb and in STAR also to nearly same tune. So 10kb is definitely in safe zone unless deafult parameters were altered. The other thing to check is whether the novel exon overlaps a repeat region. You can import RepeatMasker track in IGV or you could visualize your gene on the UCSC Genome Browser and turn on the necessary track. If it is so then an imbalanced coverage (more or lesser reads aligining on the novel exon of repeat nature) maybe leading to not creating the isoform.

It could also be that your novel region is transcribed but not as part of mRNA isoforms. Also take help of public RNA-seq data (BodyMap, Encode) to check if there are known cases lurking somewhere. Really hand-waiving here

0
Entering edit mode

well Amitm this new exon transcribes from a LTR element, and as you said the read coverage on this repeat elements is really higher than other exons. I think I have to try to find a way for refining reads alignments into this repeat element first.

0
Entering edit mode

Have you checked Ensembl annotation?

0
Entering edit mode

Yeah! But the isoform is not included. It seems that Ensembl is quite like Refseq.

1
Entering edit mode

ok, have you checked the origin of the Genbank track, is it from an experiment, or does it have a release number?

0
Entering edit mode

thanks Andrew, well it dose have an id number but it comes from an experiment because in information page some researcher are listed as authors however there is no link to a publication.

0
Entering edit mode

Then likely it's an isoform that hasn't been proven, or predicted yet outside of that experiment. If it's not predicted computationally, or experimentally by Havana, then I'd consider it to be novel to a degree. I guess if you could confirm the transcript through qPCR and it's predicted in the Genbank track, then you'd have a good amount of evidence to suggest it's a real transcript variant.