Question: Salmon (error: SAM file and fasta file have different transcript length)
0
gravatar for shivyasoni1994
8 months ago by
shivyasoni19940 wrote:

Hi everyone,
I am trying to count aligned reads, using Salmon in alignment-based-mode

Command: salmon quant -t transcript.fa -l A -a accepted_hits.bam' -o salmon_quant

I am running into following error: " SAM file says target ENST00000618686.1 has length 228, but the FASTA file contains a sequence of length [2238 or 2237] "

I had generated transcript.fa from gffread.

Thanks in advance!!!!

rna-seq salmon • 732 views
ADD COMMENTlink modified 8 months ago by jomo018480 • written 8 months ago by shivyasoni19940
0
gravatar for Rob
8 months ago by
Rob3.3k
United States
Rob3.3k wrote:

Hi,

This means that the length of the target length in the fasta file provided to salmon was 2238, while the length in the as inferred by the aligner used to produce the SAM file was _much_ shorter (228). Since salmon builds and applies an error model of alignment to help determine the appropriate allocation for each fragment, these features must be of the same length, so that, for example, the aligner does not report that the read maps to a position that doesn't exist in the transcript file provided. Can you see what the length of this feature is in the input GTF you provided to gffread?

ADD COMMENTlink modified 8 months ago • written 8 months ago by Rob3.3k

Thanks for reply,

The portion of GTF file, containing, transcript_id: ENST00000618686.1, looks like this,

GL000009.2 ENSEMBL transcript 56140 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL exon 56140 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL CDS 58084 58308 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL start_codon 58306 58308 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL stop_codon 58081 58083 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL UTR 56140 58083 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; GL000009.2 ENSEMBL UTR 58309 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";

ADD REPLYlink written 8 months ago by shivyasoni19940
0
gravatar for jomo018
8 months ago by
jomo018480
jomo018480 wrote:

It seems that your transcript.fa contains the correct, 2237, length of this transcript. So the BAM file was incorrectly aligned. If you have the original fastq files, it may be better to use them directly with salmon.

ADD COMMENTlink written 8 months ago by jomo018480

Thanks for reply, I aligned my fastq files, against coding-region(cds for Human from ensemble), using TopHat2. Command: tophat Homo_sapiens --library-type fr-secondstrand r1.fq r2.fq

What could be the possible reason that: BAM file was incorrectly aligned??

Can you please suggest : salmon is better or sailfish? and Does salmon deals with multi-mapping reads?

ADD REPLYlink written 8 months ago by shivyasoni19940
1

Wait; did you align against the transcriptome using TopHat2? I would highly recommend against doing that. In alignment-based mode, salmon requires alignments to the transcriptome. The best ways to obtain such alignments are either to (1) align to the genome using STAR and use it's built-in facilities to _project_ the alignments into transcriptome coordinates or (2) use an un-spliced aligner (e.g. Bowtie2) to align against the transcriptome fasta file directly (the same one you then feed to salmon to built its error model, so they can't disagree).

Can you please suggest : salmon is better or sailfish? and Does salmon deals with multi-mapping reads?

Salmon is preferred to sailfish (both deal with multimapping reads). Also, sailfish cannot accept external alignments, it performs its own mapping. This brings me to my last question / suggestion. Is there a specific reason you want to run salmon in alignment-based mode? This is, of course, a perfectly legitimate thing to do. However, you can also simply index the transcriptome fasta file with salmon and let it map the reads and quantify the abundances itself --- you don't need to use an external alignment tool.

ADD REPLYlink written 8 months ago by Rob3.3k

Thanks for explaining, in such easy manner, I actually aligned against coding-sequences(CDS), as aligning with genome requires high computational memory.

This brings me to my last question / suggestion. Is there a specific reason you want to run salmon in alignment-based mode?

I want to count reads of my BAM-file in alignment-based-mode, as pseudo-alignment is not very much preferred. With pseudo-alignment I can miss , novel isoforms, as I want to go for Differential-gene-expression analysis, Do you think, novel isoforms are going to make any difference??

ADD REPLYlink modified 7 months ago • written 8 months ago by shivyasoni19940
1

As you point out, whether or not you use traditional alignment pr the builtin mapping algorithm, you won't be finding novel isoforms. If you want to do that, you should perform assembly via a tool like Stringtie or Scallop prior to quantification.

Regardless of if you do that, if you want to use traditional alignments, one easy way would be to first extract the transcripts from the genome, and then produce alignments against this reference using e.g. Bowtie2 (being sure to allow multimapping). Then, you will be sure the reference lengths in the BAM are the same as those in the input fasta file. If you want to use the builtin mapping algorithm, i'll mention that, if you use a recent version of salmon (I always recommend the most recent release), the --validateMappings flag will actually use an alignment algorithm (with some algorithmic enhancements to account for the fact that you are aligning against a highly repetitive reference transcriptome) to simultaneously improve mapping sensitivity and to ensure that mappings yield good underlying alignments and are not spurious.

ADD REPLYlink written 8 months ago by Rob3.3k

Thanks a lot, this explanation is very helpful!! Should I go for quasi-mapping-based mode or FMD-index-based mode, as I want to go for , Differential-gene-expression analysis.

ADD REPLYlink written 8 months ago by shivyasoni19940
1

I'd recommend going with the quasi-mapping, and including the --validateMappings option among your command line flags (including others that seem relevant for your data given the documentation).

ADD REPLYlink written 8 months ago by Rob3.3k

Thanks Rob !!!!!!!!!!!!!! :)

ADD REPLYlink written 8 months ago by shivyasoni19940

Hi Rob, what does -g [ --geneMap ] do?? If this flag is passed, then tximport is no more required?? which is used to input quant.sf in DESeq2??

ADD REPLYlink written 8 months ago by shivyasoni19940
1

The effect of the --geneMap flag is to cause salmon to output, in addition to it's transcript level quantifications, aggregated gene-level quantifications. However, this flag exists only for backward compatibility and is deprecated in favor of tximport. Though it attempts to do the same thing, tximport has a number of distinct advantages (a big one being that tximport can estimate an average gene length by looking at the salmon output over all samples in an experiment, while salmon's approach is, necessarily, single-sample). As such, I recommend to aggregate transcript-level abundance estimates to the gene level using tximport.

ADD REPLYlink written 8 months ago by Rob3.3k

Thanks !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

ADD REPLYlink written 7 months ago by shivyasoni19940
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1496 users visited in the last hour