Question: Stringtie GTF naming convention error
6 months ago, emcc wrote:

I have looked through similar posts with the same warning:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.

The indexes were built using the same -G file so the naming conventions should be exactly the same. An ERCC control has been included in the dataset but the same error occurs when the control sequences are not included.
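
One way to test this assumption directly is to compare the sequence names recorded in the BAM header with those in the first column of the GTF. The sketch below assumes the header has already been dumped to a text file with `samtools view -H sample.bam > header.sam`; the helper name `shared_seqnames` and the file names are illustrative and not part of samtools or StringTie.

```shell
# Sketch: print sequence names that occur in BOTH a SAM header and a GTF.
# Assumes the header was dumped beforehand, e.g.:
#   samtools view -H sample.bam > header.sam
# shared_seqnames is a hypothetical helper name.
shared_seqnames() {
    awk -F'\t' '
        # First file (SAM header): collect the SN: value of every @SQ line.
        FNR == NR {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) bam[substr($i, 4)] = 1
            next
        }
        # Second file (GTF): column 1 is the sequence name; skip comments.
        !/^#/ && ($1 in bam) && !seen[$1]++ { print $1 }
    ' "$1" "$2"
}
```

Calling `shared_seqnames header.sam genome/genome_ERCC92.gtf` with no output would mean that the two files share no sequence names at all, which is the situation the warning describes.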

The reference.gtf looks as it should, but I'm concerned there may be a problem with the attribute column (9th), which holds the gene ID:

scaffold1	WormBase_imported	exon	7437	7876	.	+	.	transcript_id "transcript:BN1106_s1B000532.mRNA-1"; gene_id "gene:BN1106_s1B000532"; gene_name "BN1106_s1B000532";
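
GTF is a tab-delimited format with nine columns, and the format expects both `transcript_id` and `gene_id` in the attribute column, so a quick structural check of the whole file can rule out formatting problems in column 9. This is a minimal sketch, not a full validator; `check_gtf_fields` is an illustrative name.

```shell
# Sketch: report GTF records that do not have nine tab-separated fields,
# and exon records lacking transcript_id or gene_id in column 9.
# check_gtf_fields is a hypothetical helper name.
check_gtf_fields() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }   # skip comment and empty lines
        NF != 9 { printf "line %d: %d fields\n", NR, NF; next }
        $3 == "exon" && ($9 !~ /transcript_id "[^"]+"/ || $9 !~ /gene_id "[^"]+"/) {
            printf "line %d: exon missing transcript_id or gene_id\n", NR
        }
    ' "$1"
}
```

Running `check_gtf_fields genome/genome_ERCC92.gtf` prints nothing when every record passes both checks.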

Has anyone else seen something similar?

Could there be a problem with the sort-and-convert step from SAM to BAM files? Should I be using the -n option and sorting by read name? I'm using the command below, which sorts by leftmost coordinate by default, as that's what the protocols paper used.

samtools sort -@ 8 -o sample.bam sample.sam
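
For reference, the StringTie documentation asks for alignments sorted by genomic coordinate, which is what `samtools sort` produces without `-n`. The sort order can be confirmed from the `SO:` tag on the `@HD` header line. The sketch below assumes the header was dumped with `samtools view -H sample.bam > header.sam`; `sort_order` is an illustrative helper name.

```shell
# Sketch: print the sort order recorded in a SAM header dump.
# Assumes: samtools view -H sample.bam > header.sam
# sort_order is a hypothetical helper name.
sort_order() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); exit }
        }
    ' "$1"
}
```

A value of `coordinate` from `sort_order header.sam` indicates the file is sorted by position; `queryname` would indicate a name-sorted file.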

Thank you in advance for any help :-)

p.s. I don't think my script has any problem but here's a sample:

stringtie -p 8 -G genome/genome_ERCC92.gtf -o sample.gtf sample.bam
modified 6 months ago • written 6 months ago by emcc

A better title for your post would be "Stringtie GTF naming convention error", which is succinct and conveys the gist of your question. Details are better suited for the actual body of the post.

written 6 months ago by Ram

Changed. Thank you :)

written 6 months ago by emcc

An ERCC control has been included in the dataset and I'm currently rerunning the pipeline without these sequences to assess any effects.

How exactly was the ERCC included? Was it added to the reference genome and annotation prior to building the index?

written 6 months ago by h.mon

I used the commands

cat genome.fa ERCC.fa >genome_ERCC.fa
cat genome.gtf ERCC.gtf >genome_ERCC.gtf

These outputs were used to build the index for the alignments. I have since run the pipeline with the basic genome files (not including the ERCC sequences) and I get the same problem.

Could there be a problem with the sort-and-convert step from SAM to BAM files? Should I be using the -n option and sorting by read name? I'm using the command below, which sorts by leftmost coordinate by default, as that's what the protocols paper used.

samtools sort -@ 8 -o sample.bam sample.sam

Any advice appreciated.

modified 6 months ago • written 6 months ago by emcc