Understanding salmon codes
1
1
Entering edit mode
3.2 years ago
Morris_Chair ▴ 280

Hello everyone, I want to count reads from RNAseq using salmon tool, probably because I'm not a bioinformatician (I'm trying hard to be..) but reading the guideline of salmon I'm not able to make it working. I was taught first to create an index which i did it by downloading the transcriptome and generate an index by this command line.

salmon index -t gencode.v29.transcripts.fa.gz -i gencode_v29_idx


Ideally I shoud quantify with gencode_v29_idx as salmon index but in the gencode_v29_idx folder there are banches of files which is not clear to me which file I shuould add to where.. On the manual it says to run this command line

salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq -o˓→transcripts_quant


plus studying on the web I endeded up with this command line, I used the file indexing.log but I don't know if this is correct

(salmon) [@ws7910 RNAseq]$salmon quant -i /gencode_v29_idx/indexing.log -l --libType A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -o salm_treated1  It needs to be adjusted but I don't know what Thank you for help RNA-Seq • 3.7k views ADD COMMENT 0 Entering edit mode Not sure but I don't think that salmon can index gzipped files. Might be anyway best to first gunzip the file and only then index it. Edited ADD REPLY 2 Entering edit mode Actually, salmon can accept gzipped FASTA files for reference index creation as well as gzipped FASTA/Q files for quantification. I believe that Morris was able to build the index, but is running into problems while quantifying. ADD REPLY 0 Entering edit mode OK, point taken. indeed it does. ADD REPLY 3 Entering edit mode 3.2 years ago Rob 5.2k Hi Morris, You should add the index directory as the parameter to -i, so, try: (salmon) [@ws7910 RNAseq]$ salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -o salm_treated1


I've also removed a spurious / before gencode_v29_idx and also -l and --libType are redundant, so I fixed that. Let me know if this works for you.

0
Entering edit mode

Hi Rob, thank you, it worked :) What surprises me is the value of the mapping rate which I guess is very low.

[2019-03-11 23:07:14.345] [jointLog] [warning] Only 125906 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.
[2019-03-11 23:07:14.345] [jointLog] [info] **Mapping rate = 18.5059%**


grazie

1
Entering edit mode

That can have plenty of reason. Insufficient read length, gDNA, rRNA or bacterial/whatever contaminations come to mind. What is the read length? Try blasting some of the unassigned reads (check the documentation on how to get them, it is well-explained in there).

0
Entering edit mode

Hi ATpoint, The read length is =101

(salmon) [@ws7910 my_fastq]$cat Treated_1_m1.fastq | head @HWI-ST571:93:C00B8ACXX:1:1301:8908:31103/1 CTGGACTCCACACTCTCCTGGGTTTCACCTTTGTAGCAGGATCCCTGCAGACCAGGCCCATGACAAACACCGTCTCCAGCGGGCAGAGCAAAGGAAGGGCA  The flag to write out the name of unassigned reads is this --writeUnmappedNames but now there is another problem ... When I run the salmon code like salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -- writeUnmappedNames -o salmon_unmappeed  I have this **activate does not accept more than one argument:** ['salmon', 'quant', '-i', 'gencode_v29_idx', '-l', 'A', '-1', '/Treated_1_m1.fastq', '-2',.......  I couldn't find this specific problematic online to see how to fix it, do you guys know how to solve it? thank you ADD REPLY 1 Entering edit mode Hi Morris, It looks like you have a big space between the -- and the flag name that should follow immediately after. That is, your command line should be: salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --writeUnmappedNames -o salmon_unmappeed  not salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -- writeUnmappedNames -o salmon_unmappeed  The spaces matter in how a program parses arguments (not just salmon, but all programs). The -- sequence by itself is a special separator and indicates the end of a series of arguments for the program. ADD REPLY 1 Entering edit mode Hi Rob, I found the problem, basically today I modified the environmental variable with alias salmon='source activate salmon' in order to access to salmon just typing in the terminal salmon, but there was a conflict with conda ..,this problem solved :) ADD REPLY 0 Entering edit mode Hi ATpoint, The sequences unmapped that I found where human (those one that I checked), .. I will try to change parameters and see if I can improve the percentage of mapped reads. (salmon) [@ws7910 my_fastq]$ grep -A 3 "C00B8ACXX:1:2205:7234:25095/1" Treated_1_m1.fastq
@HWI-ST571:93:C00B8ACXX:1:2205:7234:25095/1
CTATCCACAGAGACACACCGGACTGAGGGGGGCCACCTGCCTCCTTCCACACGGGGAAGACCAGGGGGTGGCAACAGATGGGGCCCTGCCTCTGCCCTCAG
+
CCCFFFDFHHGGHJJJJIIIJIIJJJJIIIJBDD?BBBDBDDDDD@ACDD>@BBD@BDDD?CCDDDDD8;@DCDB?ACDDCBBDDDDDDDDDDDDDBCBBD
(salmon) [p.panelli@ws7910 my_fastq]grep -A 3 "C00B8ACXX:1:2201:7717:123882/1" Treated_1_m1.fastq @HWI-ST571:93:C00B8ACXX:1:2201:7717:123882/1 AGATGCCTCATCCTGGGTTCACCCCAGGGACCATCGGGGTCTGCCTTTCCCAGCATTCTCTCTCCCTTGAGGATTCTCCATTCCATCATTTTGGTGGCGTT +CCCFFFFFHHHHHJJJJIJJJJIJJJJJJJJJJIIJIJJ@GHJJJJJJIJJHJJJIHHHHHHHFFFFFCEEEEDDCDDDDDEEFEEDDDDDEC=AB?ADBB  Thank you for your thought ADD REPLY 1 Entering edit mode Hi Morris, Can I ask how you verified that these are "human"? I took the sequences you gave above, and searched against the gencode transcriptome using Bowtie2 as the aligner. Both of these reads came back as completely unmapped. I did get a hit with blast (https://www.ncbi.nlm.nih.gov/nucleotide/NG_009253.1?report=genbank&log=nuclalign&blast_rank=1&RID=8G9GH598015), but I wonder if this is related to the presence or absence of this gene in the transcriptome against which you are aligning (I realize that I suggested gencode, and that's because I usually find it to have among the best annotations, so not sure why this would be missing).

Edit: With the help of some folks on Biostars chat, it seems that these reads are intronic, which explains why they are not mapping to the spliced transcriptome.

0
Entering edit mode

Hello Rob, I did a nucleotide blast with this this tool

selecting any species and I obtained this result (for the first one )

Human DNA sequence from clone RP11-169K16and I got this

The extract is RNA so it's weird to be intronic.

Thank you

1
Entering edit mode

If you take the sequence above (CTATCCACAGAGACACACCGGACTGAGGGGGGCCACCTGCCTCCTTCCACACGGGGAAGACCAGGGGGTGGCAACAGATGGGGCCCTGCCTCTGCCCTCAG)and blat it at UCSC you can see that the only hit is intronic. Black bar with name test is the sequence above.

Perhaps this molecule was captured before splicing?

0
Entering edit mode

Yes, it might be that it was captured before splicing, I don't think that the vast majority of the unmapped reads are intronic also because when I used featureCounts I got 75% of aligned reads.

Thank you genomax

0
Entering edit mode

That's a very surprising difference if the annotations are really the same. Would you be able to share the reads from one of your samples? We'd be interested to take a look.

0
Entering edit mode

sure, I can share the files let me know how

0
Entering edit mode

How large are they? You could use a file-sharing website, or if it is easier, you should be able to upload the data to this gdrive link.

0
Entering edit mode

Hi Rob, you should be able to download the files from here. There is a bam file and two fastq files derived from this Bam file

cheers