Stringtie GTF naming convention error
3.1 years ago
emcc ▴ 10

I have looked through similar posts with the same warning:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.

The indexes were built using the same -G file so the naming conventions should be exactly the same. An ERCC control has been included in the dataset but the same error occurs when the control sequences are not included.

The reference.gtf looks as it should, but I'm wondering whether the problem could be in the attributes column (9th), which holds the gene ID?

scaffold1       WormBase_imported       exon    7437    7876    .       +       .       transcript_id "transcript:BN1106_s1B000532.mRNA-1"; gene_id "gene:BN1106_s1B000532"; gene_name "BN1106_s1B000532";


Has anyone else seen something similar?

Could there be a problem with the step that sorts and converts the SAM files to BAM? Should I be using the -n option to sort by read name? I'm using the command below, which sorts by leftmost coordinate by default, as that's what the protocols paper used.

samtools sort -@ 8 -o sample.bam sample.sam


Thank you in advance for any help :-)

p.s. I don't think my script has any problem but here's a sample:

stringtie -p 8 -G genome/genome_ERCC92.gtf -o sample.gtf sample.bam
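
One way to test the naming-convention explanation directly is to compare the sequence names declared in the BAM header with those in column 1 of the -G file. Below is a minimal sketch, assuming bash and that samtools is on the PATH. The function names are hypothetical helpers, not part of samtools or StringTie.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing sequence names; not part of samtools or StringTie.

# Sequence names used by the annotation: column 1 of every non-comment GTF line.
gtf_seqnames() {
    awk -F'\t' '!/^#/ && NF > 1 { print $1 }' "$1" | sort -u
}

# Sequence names declared in the BAM header: the SN: field of each @SQ line.
bam_seqnames() {
    samtools view -H "$1" |
        awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' |
        sort -u
}

# Names present in both sorted lists (plain files or process substitutions).
shared_seqnames() {
    comm -12 "$1" "$2"
}
```

With the files from this post, the call would look like `shared_seqnames <(gtf_seqnames genome/genome_ERCC92.gtf) <(bam_seqnames sample.bam)`. If that prints nothing, the two files share no sequence names, which is the situation the warning describes.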


A better title for your post would be "Stringtie GTF naming convention error", which is succinct and conveys the gist of your question. Details are better suited for the actual body of the post.


Changed. Thank you :)


An ERCC control has been included in the dataset and I'm currently rerunning the pipeline without these sequences to assess any effects.

How exactly was the ERCC included? Was it added to the reference genome and annotation before building the index?


I used the commands

cat genome.fa ERCC.fa >genome_ERCC.fa
cat genome.gtf ERCC.gtf >genome_ERCC.gtf


These outputs were used to build the index for the alignments. I have since run the pipeline with the basic genome files (not including the ERCC sequences) and I get the same problem.
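
One thing worth ruling out with cat-based merging: cat does not add a newline between files, so if genome.fa or genome.gtf does not end in one, its last line is fused with the first line of the ERCC file. Since the warning persists without the ERCC sequences, this may well not be the cause here, but it is cheap to check. A sketch for bash follows; ends_with_newline is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file is non-empty and its final byte is a newline.
# Command substitution strips trailing newlines, so a final newline leaves an empty string.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}
```

For example, `ends_with_newline genome.fa || echo "genome.fa lacks a trailing newline"` run before the cat commands above would flag a file that needs a newline appended first.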

Could there be a problem with the step that sorts and converts the SAM files to BAM? Should I be using the -n option to sort by read name? I'm using the command below, which sorts by leftmost coordinate by default, as that's what the protocols paper used.

samtools sort -@ 8 -o sample.bam sample.sam
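
On the sorting question: StringTie expects alignments sorted by genomic coordinate, which is what samtools sort produces without -n, so sorting by read name should not be needed. A quick way to confirm what a given BAM claims is to read the SO: tag on the @HD header line. A sketch, assuming samtools is installed; header_sort_order is a hypothetical helper that reads header text on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the SO: (sort order) value from the @HD line of a SAM header.
# Prints nothing if the header has no @HD line or no SO: tag.
header_sort_order() {
    awk -F'\t' '$1 == "@HD" { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) print substr($i, 4) }'
}
```

Running `samtools view -H sample.bam | header_sort_order` should report coordinate for a file produced by the command above, and queryname for one sorted with -n.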