I am working with a custom reference sequence called gene_ref, which is combined with hg38. In the reference genome FASTA this sequence has a full length of 9709 bp, which I confirmed with samtools faidx. I created a splice table and then generated a custom GTF with a structured annotation (gene, transcript, 5′ UTR, multiple exons, and 3′ UTR). I have also shared the GTF file here:
Using gffread, I produced a transcriptome FASTA from the genome and the GTF. That file gives a transcript length of 9103 bp, because it includes only the annotated exons and UTRs and excludes the unannotated regions.
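To make the length arithmetic concrete, here is a minimal Python sketch of how the exon-only total and the exon-plus-UTR total can be computed from a GTF. It assumes the standard 9-column, tab-separated GTF layout with 1-based inclusive coordinates. The feature names (`exon`, `five_prime_utr`, `three_prime_utr`), the sequence name `toy_ref`, and all coordinates in `toy` are hypothetical and are not taken from my actual annotation.

```python
from collections import defaultdict


def _records(gtf_lines, seqname):
    """Yield (feature, start, end) for rows that belong to one sequence."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[0] != seqname:
            continue
        yield fields[2], int(fields[3]), int(fields[4])


def feature_lengths(gtf_lines, seqname):
    """Sum the lengths of each feature type (GTF is 1-based, inclusive)."""
    totals = defaultdict(int)
    for feature, start, end in _records(gtf_lines, seqname):
        totals[feature] += end - start + 1
    return dict(totals)


def merged_length(gtf_lines, seqname, features):
    """Length of the union of all intervals of the given feature types.

    Overlapping or directly adjacent intervals are merged first, so a
    UTR that overlaps an exon is not counted twice.
    """
    intervals = sorted(
        (start, end)
        for feature, start, end in _records(gtf_lines, seqname)
        if feature in features
    )
    total = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Hypothetical toy annotation, NOT my real GTF.
toy = [
    'toy_ref\tcustom\tgene\t1\t1000\t.\t+\t.\tgene_id "g1";',
    'toy_ref\tcustom\ttranscript\t1\t1000\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'toy_ref\tcustom\tfive_prime_utr\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'toy_ref\tcustom\texon\t101\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'toy_ref\tcustom\texon\t601\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'toy_ref\tcustom\tthree_prime_utr\t901\t1000\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]

exon_only = merged_length(toy, "toy_ref", {"exon"})
exon_plus_utr = merged_length(
    toy, "toy_ref", {"exon", "five_prime_utr", "three_prime_utr"}
)
print(exon_only, exon_plus_utr)
```

In this toy case the exon-only total is smaller than the exon-plus-UTR total, which is the same kind of gap I see between two of my reported lengths. Whether my real GTF behaves this way depends on how its UTR rows relate to its exon rows.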
However, the quantification step with Salmon fails with an error: *the SAM header reports the sequence length of gene_ref as only 8828 bp*, which matches neither the genome FASTA (9709 bp) nor the transcript FASTA (9103 bp).
I verified the lengths in the .fna and .fai files (9709 bp) and in _transcript.fna (9103 bp), but in the BAM file the length is 8828 bp, which is the total exon length computed from the exon coordinates for gene_ref.
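For anyone who wants to reproduce this kind of check, here is a minimal Python sketch that compares the lengths in a FASTA index with the `@SQ` lengths in a SAM/BAM header (the header text can be obtained with `samtools view -H`). It assumes the usual `.fai` layout (name and length in the first two tab-separated columns) and `@SQ` lines carrying `SN:` and `LN:` tags. The sample strings only mimic the numbers I reported above; the `.fai` offset columns and the `@HD` line are placeholders, not copied from my files.

```python
def fai_lengths(fai_text):
    """Map sequence name -> length from the text of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def sam_header_lengths(header_text):
    """Map sequence name -> length from the @SQ lines of a SAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            # Each field after "@SQ" looks like TAG:value, e.g. SN:gene_ref
            tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def length_mismatches(expected, observed):
    """Return {name: (expected_len, observed_len)} where the two disagree."""
    return {
        name: (expected[name], observed[name])
        for name in observed
        if name in expected and expected[name] != observed[name]
    }


# Placeholder inputs that mimic the reported numbers (not real file contents).
fai_example = "gene_ref\t9103\t10\t60\t61\n"
header_example = "@HD\tVN:1.4\n@SQ\tSN:gene_ref\tLN:8828\n"

mismatches = length_mismatches(
    fai_lengths(fai_example), sam_header_lengths(header_example)
)
print(mismatches)
```

Running the same comparison on the real transcript FASTA index and the real BAM header would list every sequence whose lengths disagree, not only gene_ref.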
I believe this question has been asked several times on Biostars and GitHub, but none of the suggested solutions has helped me here.
My pipeline is alignment with STAR followed by quantification with Salmon.
Thanks Ki