Running salmon in alignment mode?
0
0
Entering edit mode
5 months ago
pearl2070 • 0

This is my first time working with RNA data. So far, I've run the dataset through trimmomatic, sortmeRNA, megahit and BWA. I'm trying to run Salmon using the SAM file output from BWA and the .fa resulting from megahit as the transcriptome.

I run this line:

./salmon-1.8.0_linux_x86_64/bin/salmon quant -p 12 -t Sample1_megahit.contigs.fa -l A -a Sample1_megahit.annotation_bwa.sam -o Sample1_salmon


I get this error at the end of a stream of lines that all say a variation of "this transcript not found in reference" and I'm not sure what reference it's referring to: Please provide a reference FASTA file that includes all targets present in the BAM header.

Should I be passing the unassembled transcriptome from before megahit or something? The megahit file filtered to have only transcripts that had successful BWA alignments? I'm not sure how to do that. The data was originally paired end, if that is relevant.

1
Entering edit mode

Which fasta did you use for BWA alignment?

0
Entering edit mode

For BWA, I ran:
bwa index -a bwtsw microbial_all_cds.fasta

Followed by:
bwa mem -t 32 microbial_all_cds.fasta Sample1_megahit.contigs.fa > Sample1_megahit.annotation_bwa.sam

0
Entering edit mode

Salmon is telling you that the names of the contigs in Sample1_megahit.contigs.fa are not the same as the names of the contifg in Sample1_megahit.annotation_bwa.sam

0
Entering edit mode

What could be causing this? Could it be because there are many contigs in Sample1_megahit.contigs.fa that didn't get annotated, and therefore aren't present in Sample1_megahit.annotation_bwa.sam?

0
Entering edit mode

In your comment to @Shred you say that you aligned to microbial_all_cds.fasta. If that's the case, then you must pass microbial_all_cds.fasta to salmon.

0
Entering edit mode

Where? If I run

./salmon-1.8.0_linux_x86_64/bin/salmon quant -p 12 -t microbial_all_cds.fasta -l A -a Sample1_megahit.annotation_bwa.sam -o Sample1_salmon


Then I encounter errors that read " Transcript appears twice in the transcript FASTA file" and "Transcript appears in the reference but did not appear in the BAM."

1
Entering edit mode

This means that you have multiple entires in your FASTA file that have the same name, which isn't allowed.

0
Entering edit mode

How do I resolve this? Is there some way to remove duplicates? I don't think I had a specific step to dereplicate sequences in my pipeline, actually. Could that have caused this problem?

0
Entering edit mode

Are you sure megahit is appropriate to use with RNAseq data? It appears to be a genome assembler.

0
Entering edit mode

I have seen it used in a few metatranscriptomics studies, but if it's likely to be the cause of this issue, I can try a different assembler.