Question: Mapping a de novo built transcriptome to a reference transcriptome
1
gravatar for cyril-cros
4.1 years ago by
cyril-cros890
France
cyril-cros890 wrote:

Hi,

I am interested in building an updated annotation of the mouse olfactory receptor genes. They have a tissue specific expression obviously, with various 3' isoforms, so the current annotation is not very good.
Hand annotation using IGV to visualize my data is not an option with this gene family ;)

I have mapped RNA Seq paired ends reads from olfactory epithelia to the mm10 genome with STAR, and then I used a new tool called IsoSCM (https://github.com/shenkers/isoscm) to obtain a more precise annotation (as a GTF file). It is a de novo approach.

My problem is that I can't easily (it seems) merge this new annotation with the 'official' one (mm10.gtf), which has less precise 5' and 3' (as seen in IGV) but includes Ensembl Gene IDs which I would very much like to have. I am including examples below of both syntaxes.

I would like to know if there is a tool that allows to add the Ensembl Gene IDs to a de novo transcriptome according to the features' position in the genome. I think someone already had to answer this type of question....

Otherwise, I can code simple stuff such as sorting features by their position in the genome and finding reasonable rules for merging (a feature from the new genome which has the same start or end position in mm10.gtf could inherit the mm10.gtf gene ID). Determining the feature type (5', CDS, exon, 3') is a bit more complex but possible.

Here is the mm10.gtf annotation:

[cyril@synapse ~]$ head -20 mm10.gtf 
chr1    mm10_ensGene    stop_codon    134202951    134202953    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    CDS    134202954    134203590    0.000000    -    1    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    exon    134199223    134203590    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    CDS    134234015    134234355    0.000000    -    0    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    start_codon    134234353    134234355    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    exon    134234015    134234412    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    exon    134235228    134235431    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 

And the one I obtain with IsoSCM:

[cyril@synapse ~]$ head -20 isoSCM.gtf 
chr4_JH584293_random    sol    exon    10822    11047    0.0    +    .    locus_id "locus.00000000";type "5p_exon"
chr4_JH584293_random    sol    exon    11191    11251    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12077    12246    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12375    12489    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12643    12675    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12754    12920    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    13477    13640    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    14817    14989    0.0    +    .    locus_id "locus.00000000";type "3p_exon"
chr4_JH584293_random    sol    exon    15884    15933    0.0    +    .    locus_id "locus.00000001";type "5p_exon"

Thanks for your help!

 

ADD COMMENTlink modified 4.1 years ago by poisonAlien2.7k • written 4.1 years ago by cyril-cros890
2
gravatar for poisonAlien
4.1 years ago by
poisonAlien2.7k
Asgard
poisonAlien2.7k wrote:

Use cuffcompare to compare your de novo gtf with mm10 gtf. Then look for class code j , which I think stands for novel isoform. You can play around. 

ADD COMMENTlink written 4.1 years ago by poisonAlien2.7k

Thanks a lot for the idea. Cuffcompare runs quickly and successfully, even if the input GTF is not a cufflinks output. The class code is also a nice idea, I can filter all unusual isoforms.
I will run comparisons using IGV and update this thread when I am done.

ADD REPLYlink written 4.1 years ago by cyril-cros890

After a few trials: I have roughly what I want, but some 3'UTR are truncated by cufflinks.
It seems bedtools might offer a more hands-on solution, I will look into that.

ADD REPLYlink written 4.1 years ago by cyril-cros890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1395 users visited in the last hour