Custom reference sequence length discrepancy with STAR/Salmon/gffread
1
0
Entering edit mode
10 weeks ago
ki • 0

I am working with a custom reference sequence called gene_ref which is combined with hg38. The original reference genome FASTA has a full length of 9709 bp for this seq, which I confirmed with samtools faidx. I created a splice table and then generated a custom GTF with a structured annotation (gene, transcript, 5′UTR, multiple exons, and 3′UTR). I have share gtf file too here:

Using gffread, I produced a transcriptome FASTA from the genome + GTF, and that file gives a transcript length of 9103 bp because it includes only annotated exons and UTRs, while the unannotated regions are excluded.

However, I got error from quantification part with salmon that *SAM header report the sequence length of gene_ref as only 8828 bp*, which does not match either the genome FASTA (9709 bp) or the transcript FASTA (9103 bp).

I verified length in .fna , .fai (9709), _transcript.fna (9103) but in the bam file it is 8828 .---which is sum of exon cordinates in total for this ref_gene

I believe this question has been asked several times in Biostar/GitHub and none of the solution is helping me here.

My pipeline is Align with star and quantification with salmon.______________

Screenshot of gtf file is here


Thanks Ki

RefSeq ENSEMBL fasta STAR GTF • 6.5k views
ADD COMMENT
0
Entering edit mode
17 days ago
Kevin Blighe ★ 90k

Hey Ki,

The issue stems from a mismatch in how STAR and gffread interpret your custom GTF for transcript length calculation. STAR sums the lengths of 'exon' features in the GTF for the transcript coordinates in TranscriptomeSAM, giving 8828 bp. gffread also concatenates 'exon' features for the transcriptome FASTA, but your 9103 bp suggests possible overlapping or mis-annotated exons/UTRs in the GTF, leading to different effective sums.

Check your GTF for overlapping exon coordinates—STAR may merge them implicitly during indexing, while gffread concatenates as-is, potentially duplicating overlaps.

To fix, validate and correct the GTF: ensure 'exon' features cover full transcript regions (including UTRs as part of exons or add separate non-overlapping exon lines for UTRs). Use this bash snippet to calculate sum of exon lengths:

awk '$3=="exon" {len += $5 - $4 + 1} END {print len}' your.gtf

Re-generate STAR index with the fixed GTF (--sjdbGTFfile), re-run gffread for transcriptome FASTA, and confirm lengths match in transcriptInfo.tab (from STAR genome dir) and FASTA.

Then, align with --quantMode TranscriptomeSAM, and quantify with Salmon using the matching FASTA index and Transcriptome BAM. This should resolve the header mismatch.

If overlaps persist, share the full GTF for a closer look.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 3254 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6