Salmon alignment based quantification
1
1
Entering edit mode
2.1 years ago

Hi, I aligned my sequences with STAR alignment and I need to quantify them with salmon with alignment based quantification mode. So I wonder

> ./bin/salmon quant -t transcripts.fa -l <LIBTYPE> -a aln.bam -o salmon_quant

In the command above, what should I write to transcripts.fa ? Should I write my human primary assembly file? Or my sample’s ngs file?

Salmon rna-seq quantification • 3.7k views
ADD COMMENT
1
Entering edit mode
2.1 years ago

transcripts.fa should contain the sequences of the transcripts you wish to quantitate, in fasta format. You should note however, that Salmon expects your reads to be aligned to the transcriptome (that is, aligned to the sequences in the transcripts.fa file, not the genome. If you look in your BAM file, the contigs should be the names of transcripts, not the names of chromosomes. See this note from the salmon documentation:

Note

Genomic vs. Transcriptomic alignments

Salmon expects that the alignment files provided are with respect to the transcripts given in the corresponding fasta file. That is, Salmon expects that the reads have been aligned directly to the transcriptome (like RSEM, eXpress, etc.) rather than to the genome (as does, e.g. Cufflinks). If you have reads that have already been aligned to the genome, there are currently 3 options for converting them for use with Salmon. First, you could convert the SAM/BAM file to a FAST{A/Q} file and then use the lightweight-alignment-based mode of Salmon described below. Second, given the converted FASTA{A/Q} file, you could re-align these converted reads directly to the transcripts with your favorite aligner and run Salmon in alignment-based mode as described above. Third, you could use a tool like sam-xlate to try and convert the genome-coordinate BAM files directly into transcript coordinates. This avoids the necessity of having to re-map the reads. However, we have very limited experience with this tool so far.

ADD COMMENT
0
Entering edit mode

Firstly thanks for your response. I am still having problems in understanding this. 😅 Now I have bam files from STAR and I have gtf and Genome sequence, primary assembly (GRCh38) file from gencode. Lastly I have my sample's fastq files. As I understood from your reply, my sample's transcript file (because I want to quantify them :D)

ADD REPLY
1
Entering edit mode

You're going to need to do some things:

  1. Convert your .bam back to fastq. You'll want to sort your bam by query and not by coord for this step. This will make a subset of your original fastqs to only include the reads that mapped to the genome.
  2. Convert your GTF file to a .fasta file. There's some examples around the web (gffread was incredibly simple). This will be your transcripts.fa file
  3. Run Salmon
ADD REPLY
0
Entering edit mode

Thank you, but I didnt understand why am I converting my bam file to fastq because salmon accepts the bam file. (I am doing alignment-based quant)

ADD REPLY
1
Entering edit mode

This is because you likely mapped your original fastq files to your genome reference and not your transcriptome reference. My understanding of the alignment-based quant in Salmon is that your .bam file must have been aligned to the same .fasta file you are giving it (in this case, your transcripts.fa).

ADD REPLY
0
Entering edit mode

Hi, I'm about to use Salmon in the alignment-based mode, and I have a couple of questions.

1.) I used minimap to create alignment.sam files. Then I used SamTools to convert the .sam files to .bam files. Do I need to sort the .bam files by reference coordinates prior to giving the .bam file to Salmon for the following command...

./bin/salmon quant -t transcripts.fa -l <LIBTYPE> -a aln.bam -o salmon_quant

Or can I just give Salmon my .bam files without sorting?

2.) Does Salmon do any normalizing? For alignment-based mode, is there capability to give Salmon all of my .bam files which contain 3 replicates of 6 samples, so 18 .bam files.

I will be doing differential expression analysis downstream using edgeR, StageR and DEXseq.

ADD REPLY
1
Entering edit mode

Usually, STAR is used to align your reads to the genome. But the alignment mode of Salmon doesn't take reads aligned to the genome, it takes reads aligned to the transcriptome. Consider the following situation:

             AGATAGCGATATAGCATTTCGGGACAGCTAACGATACGATACTCAGACTACGAGACTCATCTAGAGACTATACGACTACAGATA
     Genome: 1-------10----------20--------30--------40--------50--------60--------70--------80---    chr1
Transcripts:                     ||||||||||----------||||||||||   transcriptA
      Reads:                     -------- read1

If you just run STAR in the usual way, you index a fasta file of the genome (called something like genome.fa or hg38.fa) that starts something like:

>chr1
AGATAGCGATATAGCATTTCGGGACAGCTAACGATACGATACTCAGACTACGAGACTCATC
TAGAGACTATACGACTACAGATA

and will get out a BAM file that tells you that read1 maps to base 20 of chr1.

However, what salmon expects is for you to index a fasta file of the transcriptome sequence (called something like transcripts.fa or ensembl105.fa), that starts something like:

>transcript1
GGGACAGCTTACTCAGACTAC

align against that, and get a BAM file that says read1 maps to base 1 of transcriptA. This is inline with what tools like RSEM expect.

So, if you want to use salmon in alignment mode, you have two options, both involve repeating the alignment:

  • Generate the transcripts.fa by using the GTF file and the genome.fa and gffread, index this with STAR --genomeGenerate, and align your reads against this index.
  • STAR has an special mode where you can align to the genome, but output BAMs in transcript coordinates. You activate this mode by passing --quantMode TranscriptomeSAM to STAR. You will still need transcriptome.fa in order to run salmon.

Alternatively, if you don't have access to the original fastq files, you can turn you BAM alignment back into fastq using samtools with the right options, and then use that as input to salmon in fastq mode.

ADD REPLY
0
Entering edit mode

I aligned with --quantMode TranscriptomeSAM again. So with toTranscriptome.out.bam files and transcript.fa files. I tried to run salmon but I am getting error below:

[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000463753.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309002.1|COL8A1-205|COL8A1|2964|processed_transcript| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000483969.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309004.1|COL8A1-207|COL8A1|832|processed_transcript| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000474648.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309005.1|COL8A1-206|COL8A1|841|processed_transcript| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000452013.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309003.2|COL8A1-203|COL8A1|659|protein_coding| appears in the reference but did not appear in the BAM^Z
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000261037.7|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309001.1|COL8A1-201|COL8A1|5705|protein_coding| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000652472.1|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000504360.2|COL8A1-209|COL8A1|5515|protein_coding| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000273342.8|ENSG00000144810.16|OTTHUMG00000148669.5|-|COL8A1-202|COL8A1|3029|protein_coding| appears in the reference but did not appear in the BAM

(Which is keeps going in the .err file)

ADD REPLY
0
Entering edit mode

This is a warning, not an error. It just says that you've got genes that get no reads. Its only a problem if every transcript is in the list and your output tables as zeros for everything. In that case it probably means that STAR is using different names for things than salmon is.

ADD REPLY
0
Entering edit mode

I was able to remove the warnings (which eventually ended up in an error) by creating transcriptome file from the genome and it's annotations (gtf):

gffread -w output.transcriptome.fa -g genome.fa genome.gtf
ADD REPLY
0
Entering edit mode

I am trying to create transcriptome file via gffread, but I see that I can only pass in gff annotation file not gtf.

seeing above "gffread -w output.transcriptome.fa -g genome.fa genome.gtf" I tried the following "gffread -w GCF_000146045.2_R64_genomic_transcriptome_gtf.fa -g GCF_000146045.2_R64_genomic.fna GCF_000146045.2_R64_genomic.gtf"

but it resulted with following error "Error: no valid ID found for GFF record"

I tried to see if there is a specific flag need to be used for gtf , but noticed the only way to create transcriptome.fa file using gffread is via gff file and genome.fa. Is this correct?

ADD REPLY

Login before adding your answer.

Traffic: 2406 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6