Salmon (error: SAM file and fasta file have different transcript length)
3
0
Entering edit mode
3.6 years ago

Hi everyone,
I am trying to count aligned reads, using Salmon in alignment-based-mode

Command: salmon quant -t transcript.fa -l A -a accepted_hits.bam' -o salmon_quant

I am running into following error: " SAM file says target ENST00000618686.1 has length 228, but the FASTA file contains a sequence of length [2238 or 2237] "

RNA-Seq salmon • 2.7k views
0
Entering edit mode
3.6 years ago
Rob 5.2k

Hi,

This means that the length of the target length in the fasta file provided to salmon was 2238, while the length in the as inferred by the aligner used to produce the SAM file was _much_ shorter (228). Since salmon builds and applies an error model of alignment to help determine the appropriate allocation for each fragment, these features must be of the same length, so that, for example, the aligner does not report that the read maps to a position that doesn't exist in the transcript file provided. Can you see what the length of this feature is in the input GTF you provided to gffread?

0
Entering edit mode

The portion of GTF file, containing, transcript_id: ENST00000618686.1, looks like this,

GL000009.2 ENSEMBL transcript 56140 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL exon 56140 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL CDS 58084 58308 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL start_codon 58306 58308 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL stop_codon 58081 58083 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL UTR 56140 58083 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL UTR 58309 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";

0
Entering edit mode
3.6 years ago
jomo018 ▴ 630

It seems that your transcript.fa contains the correct, 2237, length of this transcript. So the BAM file was incorrectly aligned. If you have the original fastq files, it may be better to use them directly with salmon.

0
Entering edit mode

Thanks for reply, I aligned my fastq files, against coding-region(cds for Human from ensemble), using TopHat2. Command: tophat Homo_sapiens --library-type fr-secondstrand r1.fq r2.fq

What could be the possible reason that: BAM file was incorrectly aligned??

Can you please suggest : salmon is better or sailfish? and Does salmon deals with multi-mapping reads?

1
Entering edit mode

Wait; did you align against the transcriptome using TopHat2? I would highly recommend against doing that. In alignment-based mode, salmon requires alignments to the transcriptome. The best ways to obtain such alignments are either to (1) align to the genome using STAR and use it's built-in facilities to _project_ the alignments into transcriptome coordinates or (2) use an un-spliced aligner (e.g. Bowtie2) to align against the transcriptome fasta file directly (the same one you then feed to salmon to built its error model, so they can't disagree).

Can you please suggest : salmon is better or sailfish? and Does salmon deals with multi-mapping reads?

Salmon is preferred to sailfish (both deal with multimapping reads). Also, sailfish cannot accept external alignments, it performs its own mapping. This brings me to my last question / suggestion. Is there a specific reason you want to run salmon in alignment-based mode? This is, of course, a perfectly legitimate thing to do. However, you can also simply index the transcriptome fasta file with salmon and let it map the reads and quantify the abundances itself --- you don't need to use an external alignment tool.

0
Entering edit mode

Thanks for explaining, in such easy manner, I actually aligned against coding-sequences(CDS), as aligning with genome requires high computational memory.

This brings me to my last question / suggestion. Is there a specific reason you want to run salmon in alignment-based mode?

I want to count reads of my BAM-file in alignment-based-mode, as pseudo-alignment is not very much preferred. With pseudo-alignment I can miss , novel isoforms, as I want to go for Differential-gene-expression analysis, Do you think, novel isoforms are going to make any difference??

1
Entering edit mode

As you point out, whether or not you use traditional alignment pr the builtin mapping algorithm, you won't be finding novel isoforms. If you want to do that, you should perform assembly via a tool like Stringtie or Scallop prior to quantification.

Regardless of if you do that, if you want to use traditional alignments, one easy way would be to first extract the transcripts from the genome, and then produce alignments against this reference using e.g. Bowtie2 (being sure to allow multimapping). Then, you will be sure the reference lengths in the BAM are the same as those in the input fasta file. If you want to use the builtin mapping algorithm, i'll mention that, if you use a recent version of salmon (I always recommend the most recent release), the --validateMappings flag will actually use an alignment algorithm (with some algorithmic enhancements to account for the fact that you are aligning against a highly repetitive reference transcriptome) to simultaneously improve mapping sensitivity and to ensure that mappings yield good underlying alignments and are not spurious.

0
Entering edit mode

Thanks a lot, this explanation is very helpful!! Should I go for quasi-mapping-based mode or FMD-index-based mode, as I want to go for , Differential-gene-expression analysis.

1
Entering edit mode

I'd recommend going with the quasi-mapping, and including the --validateMappings option among your command line flags (including others that seem relevant for your data given the documentation).

0
Entering edit mode

Thanks Rob !!!!!!!!!!!!!! :)

0
Entering edit mode

Hi Rob, what does -g [ --geneMap ] do?? If this flag is passed, then tximport is no more required?? which is used to input quant.sf in DESeq2??

1
Entering edit mode

The effect of the --geneMap flag is to cause salmon to output, in addition to it's transcript level quantifications, aggregated gene-level quantifications. However, this flag exists only for backward compatibility and is deprecated in favor of tximport. Though it attempts to do the same thing, tximport has a number of distinct advantages (a big one being that tximport can estimate an average gene length by looking at the salmon output over all samples in an experiment, while salmon's approach is, necessarily, single-sample). As such, I recommend to aggregate transcript-level abundance estimates to the gene level using tximport.

0
Entering edit mode

Thanks !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

0
Entering edit mode
29 days ago

Hello Rob, I'd be super thankful if you could help me understand what's going on with my data because I'm going nuts.

I have metagenomic and metatranscriptomic data extracted from the same samples from the sea. I want to do differential expression analysis on the metagenomes, from a mix of organisms, at different times. I thought I would align DNA and RNA, salmon quant, tximport and then DESeq2. I'm so stuck and I hope you can help me get unstuck!

So what I've done so far: I did QC and assembly with the metaG data followed by annotation with MetaEuk, so I have a DNA.fasta file and a DNA.gff file. I used gffread to convert DNA.gff to DNA.gtf file. The paired metaT data I just QC'ed and merged. I then used STAR to align the RNA.fastq to the DNA.fa using the DNA.gtf as reference, with these parameters (that I got from https://rdrr.io/bioc/BANDITS/f/README.md):

STAR --runMode genomeGenerate --runThreadN 6 --limitGenomeGenerateRAM 458444417974 --genomeSAindexNbases 13 --genomeDir path/genomeDir --genomeFastaFiles DNA.fa --sjdbGTFfile DNA.gtf

STAR --genomeDir path/genomeDir --runThreadN 6 --readFilesIn RNA.fastq --outFileNamePrefix path/STAR_Results/D1_R1_aln --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode TranscriptomeSAM

# Then (not sure what this does exactly) again used gffread to "build a reference transcriptome (fasta format) compatible with the DNA fasta and gtf files used for STAR":

gffread -w cDNA.fa -g DNA.fa DNA.gtf

# Then when I try to do salmon quant with the alignment:

salmon quant -t cDNA.fa -l A -a STAR_Results/D1_R1_alnAligned.sortedByCoord.out.bam -o salmon_quant -p 4 --dumpEq

# It gives me the different lengths error mentioned above:

SAM file says target k75_658346 has length 629, but the FASTA file contains a sequence of length [532 or 531]

I think Bowtie 2 won't let me use a gtf reference, right? I thought aligning with gtf was better. Btw I tried STAR without gtf and it also didn't work. As for the salmon mapping-based mode, you need to build a decoy-aware transcriptome index (https://salmon.readthedocs.io/en/latest/salmon.html) which I don't know how to work because it requires a genome and a transcriptome, I think it's meant for ONE organism genome/transcriptome? Would Stringtie or Scallop work for what I'm trying to do? I can't remember why anymore but when selecting what programmes to use, STAR and then salmon quant seemed like the best option.

As you can see I'm very lost and running out of ideas. Any explanation / more things to try would be highly appreciated!! E