Question

Salmon Index

0


10 months ago

ExtentHonest56 ▴ 10

Hello,

I have a question about indexing with Salmon. I saw in the Salmon GitHub pipeline that you can use the cDNA sequences with no alterations to create the index:

curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz
salmon index -t athal.fa.gz -i athal_index

But Salmon's documentation says there are two ways to create indices:

- The first is to compute a set of decoy sequences by mapping the annotated transcripts you wish to index against a hard-masked version of the organism’s genome. This can be done with e.g. MashMap2, and we provide some simple scripts to greatly simplify this whole process. Specifically, you can use the generateDecoyTranscriptome.sh script, whose instructions you can find in this README.
- The second is to use the entire genome of the organism as the decoy sequence. This can be done by concatenating the genome to the end of the transcriptome you want to index and populating the decoys.txt file with the chromosome names. Detailed instructions on how to prepare this type of decoy sequence is available here. This scheme provides a more comprehensive set of decoys, but, obviously, requires considerably more memory to build the index.

So, can I just use the cDNA file from Ensembl as shown above, or do I have to create the index as described in the documentation?

Thank you!

transcriptomics Bulk RNA-sequencing • 1.1k views

ADD COMMENT • link updated 10 months ago by GenoMax 141k • written 10 months ago by ExtentHonest56 ▴ 10

score 2 · Accepted Answer · 2023-05-30

2


10 months ago

GenoMax 141k

can I just use the cDNA file from Ensembl as mentioned above

Yes, you can. That said, the Salmon developers recommend using the genome as a decoy when possible. You can append the A. thaliana genome to the file above, list its chromosome names in decoys.txt (as described in the documentation you quoted), and build a new index. Decoy-containing indexes will require more RAM when running Salmon.
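For reference, the genome-decoy route can be sketched roughly as below. This is a minimal sketch, not the Salmon developers' exact script: the function names (make_decoys, make_gentrome, build_decoy_index) and all file names are hypothetical, and it assumes gzipped FASTA inputs whose sequence name is the first whitespace-delimited token of each header. The `-t`, `-d` and `-i` options are the transcript, decoy list and index directory options of `salmon index`.

```shell
#!/bin/sh
# Sketch of preparing a decoy-aware Salmon index (hypothetical helper names).

# Write the genome sequence names, one per line and without the leading '>',
# to a decoys file. Only the first token of each FASTA header is kept.
make_decoys() {
    # $1 = gzipped genome FASTA, $2 = output decoys file
    gunzip -c "$1" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > "$2"
}

# Concatenate transcriptome first, genome second. Gzip files can be
# concatenated directly without decompressing them.
make_gentrome() {
    # $1 = gzipped cDNA FASTA, $2 = gzipped genome FASTA, $3 = output file
    cat "$1" "$2" > "$3"
}

# Build the index; requires salmon on PATH.
build_decoy_index() {
    # $1 = gentrome FASTA, $2 = decoys file, $3 = output index directory
    salmon index -t "$1" -d "$2" -i "$3"
}

# Example usage (file names are placeholders):
#   make_decoys genome.fa.gz decoys.txt
#   make_gentrome cdna.fa.gz genome.fa.gz gentrome.fa.gz
#   build_decoy_index gentrome.fa.gz decoys.txt salmon_decoy_index
```

The order of concatenation matters here: the transcripts come first and the decoy (genome) sequences are appended at the end, matching the description quoted from the documentation.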

ADD COMMENT • link 10 months ago by GenoMax 141k

0


Thank you! I've made the decoy before, but I'm trying to redo it since I believe I did it incorrectly the first time. The page Salmon references for creating decoys suggests using GENCODE, but they don't have what I am looking for. I usually use Ensembl. I am working with Bos taurus, so I decided to try NCBI RefSeq. For some reason the README file won't load after downloading. I believe I would use the files "GCF_002263795.2_ARS-UCD1.3_genomic.fna.gz" and "GCF_002263795.2_ARS-UCD1.3_rna.fna.gz" for this, right? Or would I use "GCF_002263795.2_ARS-UCD1.3_rna_from_genomic.fna.gz"? Apologies for the simple question, I am not used to their annotation.

ADD REPLY • link 10 months ago by ExtentHonest56 ▴ 10

1


GENCODE only covers human and mouse data.

You can use the Arabidopsis genome from Ensembl.

Not sure why you are using an old release of Ensembl. The current release is 56 (but if you want to use release 28, then get the genome file from the same release). The following are links to the files for the current release (as of this writing).

cDNA: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
genome: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

ADD REPLY • link 10 months ago by GenoMax 141k

0


Thank you, yes, I am using the newer version of Ensembl. The Arabidopsis command above is the example Salmon gave; I am using the Bos taurus files. I have read that you should not use toplevel, but instead combine all the chromosome primary files into one. Is toplevel okay for this, then?

ADD REPLY • link 10 months ago by ExtentHonest56 ▴ 10

1


I am using the Bos Taurus files

Which organism are you actually working on? You have mentioned three so far: Arabidopsis, human and now cow.

As for using the toplevel file, it is equivalent to the primary assembly file when the following condition is met:

If the primary assembly file is not present, that indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent.
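That rule can be sketched as a small helper that prefers a primary assembly file and falls back to toplevel. This is only an illustration: the function name pick_genome_fasta is hypothetical, and it assumes the files were downloaded into one directory and keep the Ensembl naming pattern (`*.dna.primary_assembly.fa.gz`, `*.dna.toplevel.fa.gz`).

```shell
#!/bin/sh
# Print the genome FASTA to use from a download directory:
# the primary assembly if one exists, otherwise the toplevel file.
pick_genome_fasta() {
    # $1 = directory holding the downloaded Ensembl dna/ files
    for f in "$1"/*.dna.primary_assembly.fa.gz; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    for f in "$1"/*.dna.toplevel.fa.gz; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    # Neither file is present.
    return 1
}

# Example usage (directory name is a placeholder):
#   genome=$(pick_genome_fasta ./ensembl_dna)
```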

ADD REPLY • link 10 months ago by GenoMax 141k

0


I am working on cow, Bos taurus. The Arabidopsis code was an example on Salmon's GitHub, and the indexing example they gave used files from GENCODE, which is why I was hoping to use a different reference source. Thank you, this was very helpful!

ADD REPLY • link 10 months ago by ExtentHonest56 ▴ 10