Stringtie GTF naming convention error
3.1 years ago
emcc ▴ 10

I have looked through similar posts with the same warning:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.

The indexes were built using the same -G file so the naming conventions should be exactly the same. An ERCC control has been included in the dataset but the same error occurs when the control sequences are not included.
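
One way to test that assumption directly is to compare the sequence names the aligner wrote into the alignment header with the names in column 1 of the GTF. A minimal sketch, assuming the uncompressed sample.sam is still on disk; the function name shared_seqnames is made up for illustration and is not part of StringTie or samtools:

```shell
# Hypothetical helper: print sequence names that occur both in the @SQ
# header lines of a SAM file ($1) and in column 1 of a GTF file ($2).
shared_seqnames() {
    awk -F'\t' '
        # First file (SAM): collect the SN: value of every @SQ header line.
        FNR == NR {
            if ($1 == "@SQ") {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^SN:/) { sam[substr($i, 4)] = 1 }
                }
            }
            next
        }
        # Second file (GTF): print each column-1 name once if the SAM has it too.
        !/^#/ && NF > 0 && ($1 in sam) && !seen[$1]++ { print $1 }
    ' "$1" "$2"
}

# Example call (file names taken from the post):
#   shared_seqnames sample.sam genome/genome_ERCC92.gtf
```

If this prints nothing, the two files share no sequence names, which would be consistent with the warning. If it prints the expected scaffolds, the cause probably lies elsewhere.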

The reference.gtf looks as it should, but I'm concerned there may be a problem with the attributes column (9th), which holds the gene ID:

scaffold1       WormBase_imported       exon    7437    7876    .       +       .       transcript_id "transcript:BN1106_s1B000532.mRNA-1"; gene_id "gene:BN1106_s1B000532"; gene_name "BN1106_s1B000532";
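
Since GTF is tab-delimited with exactly nine fields per record, one cheap sanity check is to look for lines that do not split into nine columns on tabs. A rough sketch (the function name gtf_bad_lines is made up here); it only checks the column layout and says nothing about whether the attribute values are what StringTie expects:

```shell
# Hypothetical helper: print the line numbers of non-comment, non-empty
# GTF lines that do not have exactly 9 tab-separated fields.
gtf_bad_lines() {
    awk -F'\t' '!/^#/ && NF > 0 && NF != 9 { print FNR }' "$1"
}

# Example call (file name taken from the post):
#   gtf_bad_lines genome/genome_ERCC92.gtf | head
```

No output means every record has nine tab-separated fields.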

Has anyone else seen something similar?

Could there be a problem with the step that sorts the SAM files and converts them to BAM? Should I be using the -n option to sort by read name? I'm using the command below, which sorts by leftmost coordinate by default, as that's what the protocols paper used.

samtools sort -@ 8 -o sample.bam sample.sam
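
The sort order a file declares can be read from the SO: tag on its @HD header line. StringTie expects coordinate-sorted input, which is what samtools sort produces without -n. A small sketch that reads header text on stdin so it can be fed from samtools view -H; the function name sam_sort_order is illustrative only:

```shell
# Hypothetical helper: read SAM header text on stdin and print the value
# of the SO: tag from the @HD line (e.g. coordinate, queryname, unsorted).
sam_sort_order() {
    awk -F'\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SO:/) { print substr($i, 4); exit }
        }
    }'
}

# Example call (file name taken from the post):
#   samtools view -H sample.bam | sam_sort_order
```

If this prints coordinate for sample.bam, the sort step is unlikely to be the source of the warning.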

Thank you in advance for any help :-)

p.s. I don't think my script has any problem but here's a sample:

stringtie -p 8 -G genome/genome_ERCC92.gtf -o sample.gtf sample.bam

A better title for your post would be "Stringtie GTF naming convention error", which is succinct and conveys the gist of your question. Details are better suited for the actual body of the post.

Changed. Thank you :)

"An ERCC control has been included in the dataset and I'm currently rerunning the pipeline without these sequences to assess any effects."

How exactly was the ERCC included? Was it added to the reference genome and annotation before building the index?

I used the commands:

cat genome.fa ERCC.fa >genome_ERCC.fa
cat genome.gtf ERCC.gtf >genome_ERCC.gtf

These outputs were used to build the index for the alignments. I have since rerun the pipeline with the basic genome files (without the ERCC sequences) and I get the same problem.
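
One thing worth ruling out when joining files with plain cat: if the first file does not end in a newline, the first line of the second file is glued onto its last line, so the first ERCC header or GTF record would be merged into the preceding line. A minimal check, with a made-up helper name:

```shell
# Hypothetical helper: succeed only if the file is non-empty and its last
# byte is a newline. Command substitution strips a trailing newline, so an
# empty result from tail means the final byte was a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Example calls (file names taken from the commands above):
#   ends_with_newline genome.fa  || echo "genome.fa lacks a final newline"
#   ends_with_newline genome.gtf || echo "genome.gtf lacks a final newline"
```

Since the same warning appears with the basic genome files, this is unlikely to be the whole story, but it is quick to exclude.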

Could there be a problem with the step that sorts the SAM files and converts them to BAM? Should I be using the -n option to sort by read name? I'm using the command below, which sorts by leftmost coordinate by default, as that's what the protocols paper used.

samtools sort -@ 8 -o sample.bam sample.sam

Any advice appreciated.
