Question: What is wrong with my custom GTF ??
gravatar for Joel TM
3.9 years ago by
Joel TM50
Joel TM50 wrote:

Good day, I've been trying to make the tuxedo pipeline for RNAseq data analysis work with my custom GTF but it haven't succeeded yet. I have tried many things, read many many posts...still nothing. I want to see differential expression in lncRNA. Some of them are known, some are unknown. I've had success with HTSeq --> DESeq, EdgeR...but I want to corroborate my results with cuffdiff.

GTF with known lncRNA only looks like that, (it should work but doesn't):

chr1 unknown exon 12567300 12567451 0 + 0 gene_id "SNORA59A"; gene_name "SNORA59A"; transcript_id "NR_003025_1"; tss_id "TSS3103";
chr1 unknown exon 24171572 24172345 0 - 0 gene_id "FUCA1"; gene_name "FUCA1"; p_id "P16785"; transcript_id "NM_000147"; tss_id "TSS3487";
chr1 unknown stop_codon 24172205 24172207 0 - 0 gene_id "FUCA1"; gene_name "FUCA1"; p_id "P16785"; transcript_id "NM_000147"; tss_id "TSS3487";
chr1 unknown CDS 24172208 24172345 0 - 0 gene_id "FUCA1"; gene_name "FUCA1"; p_id "P16785"; transcript_id "NM_000147"; tss_id "TSS3487";

[..] etc...

But tophat returns me with:

format:          fastq
        quality scale:   phred33 (default)
[2015-07-07 08:59:30] Reading known junctions from GTF file
        Warning: TopHat did not find any junctions in GTF file
[2015-07-07 08:59:30] Preparing reads
         left reads: min. length=20, max. length=50, 28359825 kept reads (14 discarded)
        right reads: min. length=20, max. length=50, 28359834 kept reads (5 discarded)
[2015-07-07 09:05:45] Building transcriptome data files..
[2015-07-07 09:06:01] Building Bowtie index from lncRNA_toUse.fa
Error: Couldn't build bowtie index with err = 1


My first goal is to fix this. Then I wil have to try with my unknown lncRNA as well, which the only format I get is:

chr1 unknown exon 47562325 47644943 0 + 0 gene_id "CYP4A22-AS1"; gene_name "CYP4A22-AS1";

It is worth noting that with the "official" Genes.gtf, the pipeline works. I am puzzled, and thank you very much for any insight.

have a great day,




gtf custom gtf rnaseq • 3.1k views
ADD COMMENTlink modified 17 months ago by Biostar ♦♦ 20 • written 3.9 years ago by Joel TM50

Just as a total guess, does it have something to do with lncRNA_toUse.fa ? (as in, does it use the same labels as the gft file)

Might be the same problem as this:

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by John12k

Tophat creates a log file and stores the exact commands that it uses. Try to run the last one and you'll (presumably) get the actual error that bowtie2 produces.

ADD REPLYlink written 3.9 years ago by Devon Ryan90k
gravatar for Joel TM
3.9 years ago by
Joel TM50
Joel TM50 wrote:

I thank you both for the time you took reading my issue. As it so happens often with bioinformatics, I simply re-ran my original extracting command providing me with a virtually identical .gtf file, yet, it is now working. I was looking so deep into the issue that I didn't think of the very, very simple things.

I wish you the best !


ADD COMMENTlink written 3.9 years ago by Joel TM50

FWIW this is called "programming by coincidence" and, at least in principle, it is a bad idea to continue on without understanding what went wrong in the first place.

On the other hand as you say this is bioinformatics - so many things go haywire all the time - there is rarely time to investigate every issue

ADD REPLYlink written 3.9 years ago by Istvan Albert ♦♦ 80k

Well My issue is only partially resolved. As soon as I add data for the lncRNAs I am interested in but are absent from the reference gtf, it crashes even if that data is from the UCSC Genome Browser table.... I will keep updating this post but there's gotta be a way to make this work with the tuxedo pipeline...otherwise it's just not very flexible...

ADD REPLYlink written 3.9 years ago by Joel TM50

Again, look in the logs for the commands being run and run the last one there manually. You'll then likely get a more informative error message.

ADD REPLYlink written 3.9 years ago by Devon Ryan90k

I wasn't aware of this "programming by coincidence"  but now I realize that is a thing a saw a few times ago.

ADD REPLYlink written 3.9 years ago by JC7.8k

I would not go as far as saying I do bioinformatics in blind luck; having said that, I only started working full time in the field this year. I am still filling my toolbox, and the learning process is never ending but oh so rewarding. I come here because of the community. I hope that seeking help in the form of an open question on Biostars, when I have scoured the web and forums without finding my answer, isn't frowned upon. In the end, my problem was ridicule, but honest. And now I know. It might even serve someone else one day.

ADD REPLYlink written 3.9 years ago by Joel TM50
gravatar for Joel TM
3.9 years ago by
Joel TM50
Joel TM50 wrote:


I think all is well now. Here is what I did.

1.  I used " sed " to extract the referenced functional lncRNAs from the genes.gtf
2. I utilized lncRNAdb + UCSC Table Browser to extract the not so well documented functional lncRNAs I am interested in, making sure the output was in the GTF format.
3. I created independant .txt files for each of those lncRNA. I added for each of those the field    gene_name "xxxxx";  (time consuming)
4. I used "cat" to merge everything back together and voila.  Basically I did everything from the command line instead of adding a quick line here and there with my .gtf open.

I now have 0 warning, and no problem. I work exclusively in linux and never encountered this issue before (it was just a matter of time I am sure). May be it had something to do with encoding or end of line character...? I do not know.

Thanks to all again, it was a rare scenario where the .log didn't help me ...

have a great day,


ADD COMMENTlink written 3.9 years ago by Joel TM50

Hi Joel, 

I am wondering if it would be possible for you to run me through the steps in making a custom GTF file. I have RNASeq data for Mouse neurons with GFP expression (within a gene). I have GFP sequence and insertion sites where this was incorporated in the gene. I am trying to incorporate this sequence in the reference genome to quantify the expression of this GFP incorporated gene. I believe after I insert this sequence in the reference genome (either by inserting it in the gene of interest or copying the whole gene with inserted GFP as new chromosome), I will have to update the GTF file somehow. What do you think is a best approach to get it to work? Can you help me with this and share your experience?



ADD REPLYlink written 3.5 years ago by Mnoon0

[N.B., I'm obviously not Joel]

The simplest method would be to add the GFP sequence to the genomic fasta file and add a single line to the GTF file like:

GFP     ensembl exon    0       2000    .       +       .       gene_id "GFP"; transcript_id "GFP"; exon_number "1"; gene_name "GFP"; transcript_name "GFP";

You'll need to change the name to match whatever you use and also the size (I used to know the length of EGFP, but I can't recall it at the moment). If you would prefer to include the entire fusion gene sequence then you can also do that, but you then must hard mask the sequence for the original gene in the genome (otherwise, most of the reads you're interested will be multimappers and will then get ignored!). For that you can use "bedtools maskfasta". This method would probably produce the best results.

ADD REPLYlink written 3.5 years ago by Devon Ryan90k

Hi Devon, 

Thanks so much for your prompt response. I was waiting for more information from my collaborator before I tried this. As you suggested I hard-masked the original gene. Should I add entire fusion gene sequence as new chromosome? or at the end of same chromosome (chr6)? I have never worked on such projects before therefore I am confused. Since I have hard-masked the original gene, should I edit/delete the annotation in GTF for chr6? and edit it to be a new chromosome? lets say "gfp" or "24"?? 

I also have some questions about the reporter sequence. I found out from the paper that there are other elements flanking the reporter gene (ZsGreen) such as (Rosa26sor + CAG-prm + loxSTOPlox + ZsGreen + WPRE + etc) from paper " A robust and high-throughput Cre reporting and characterization system for the whole mouse brain
10.1038/nn.2467" and my original gene is Rosa26sor. I have ZsGreen and WPRE sequence only and from the paper I know that reporter gene was inserted between exon 1 and 2 but didn't mention insertion-nucleotide positions. After contacting the Author, she sent me Genbank file of whole vector again, I dont know the positions. What do you suggest I do about the exact positions for the reporter gene? I would love to email you the details and all these sequences. Let me email you the files. I really appreciate your inputs and support.

ADD REPLYlink written 3.5 years ago by Mnoon0

Hey Devon,

I have a similar problem as Mnoon. When I try including the entire fusion gene sequence, not only the reads that I'm interested in but a lot of other reads as well end up being multimappers. Could you explain what you mean by hard masking the sequence? Should I be removing the entries for the original gene in the gtf file? Thanks!

ADD REPLYlink written 2.2 years ago by Mukund0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1775 users visited in the last hour