Custom reference sequence length discrepancy with STAR/Salmon/gffread
0
0
Entering edit mode
1 day ago
ki • 0

I am working with a custom reference sequence called gene_ref which is combined with hg38. The original reference genome FASTA has a full length of 9709 bp for this seq, which I confirmed with samtools faidx. I created a splice table and then generated a custom GTF with a structured annotation (gene, transcript, 5′UTR, multiple exons, and 3′UTR). I have share gtf file too here:

Using gffread, I produced a transcriptome FASTA from the genome + GTF, and that file gives a transcript length of 9103 bp because it includes only annotated exons and UTRs, while the unannotated regions are excluded.

However, I got error from quantification part with salmon that *SAM header report the sequence length of gene_ref as only 8828 bp*, which does not match either the genome FASTA (9709 bp) or the transcript FASTA (9103 bp).

I verified length in .fna , .fai (9709), _transcript.fna (9103) but in the bam file it is 8828 .---which is sum of exon cordinates in total for this ref_gene

I believe this question has been asked several times in Biostar/GitHub and none of the solution is helping me here.

My pipeline is Align with star and quantification with salmon.______________

Screenshot of gtf file is here


Thanks Ki

RefSeq ENSEMBL fasta STAR GTF • 1.4k views
ADD COMMENT

Login before adding your answer.

Traffic: 3520 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6