Question: Understanding salmon codes
1
gravatar for Morris_Chair
11 days ago by
Morris_Chair60
ISBReMIT
Morris_Chair60 wrote:

Hello everyone, I want to count reads from RNAseq using salmon tool, probably because I'm not a bioinformatician (I'm trying hard to be..) but reading the guideline of salmon I'm not able to make it working. I was taught first to create an index which i did it by downloading the transcriptome and generate an index by this command line.

salmon index -t gencode.v29.transcripts.fa.gz -i gencode_v29_idx

Ideally I shoud quantify with gencode_v29_idx as salmon index but in the gencode_v29_idx folder there are banches of files which is not clear to me which file I shuould add to where.. On the manual it says to run this command line

salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq -o˓→transcripts_quant

plus studying on the web I endeded up with this command line, I used the file indexing.log but I don't know if this is correct

(salmon) [@ws7910 RNAseq]$ salmon quant -i /gencode_v29_idx/indexing.log -l --libType A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -o salm_treated1

It needs to be adjusted but I don't know what

Thank you for help

rna-seq • 266 views
ADD COMMENTlink modified 9 days ago • written 11 days ago by Morris_Chair60

Not sure but I don't think that salmon can index gzipped files. Might be anyway best to first gunzip the file and only then index it.

Edited

ADD REPLYlink modified 11 days ago • written 11 days ago by lieven.sterck4.2k
1

Actually, salmon can accept gzipped FASTA files for reference index creation as well as gzipped FASTA/Q files for quantification. I believe that Morris was able to build the index, but is running into problems while quantifying.

ADD REPLYlink written 11 days ago by Rob3.2k

OK, point taken. indeed it does.

ADD REPLYlink written 11 days ago by lieven.sterck4.2k
2
gravatar for Rob
11 days ago by
Rob3.2k
United States
Rob3.2k wrote:

Hi Morris,

You should add the index directory as the parameter to -i, so, try:

(salmon) [@ws7910 RNAseq]$ salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -o salm_treated1

I've also removed a spurious / before gencode_v29_idx and also -l and --libType are redundant, so I fixed that. Let me know if this works for you.

ADD COMMENTlink written 11 days ago by Rob3.2k

Hi Rob, thank you, it worked :) What surprises me is the value of the mapping rate which I guess is very low.

[2019-03-11 23:07:14.345] [jointLog] [warning] Only 125906 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.
[2019-03-11 23:07:14.345] [jointLog] [info] **Mapping rate = 18.5059%**

grazie

ADD REPLYlink modified 11 days ago • written 11 days ago by Morris_Chair60
1

That can have plenty of reason. Insufficient read length, gDNA, rRNA or bacterial/whatever contaminations come to mind. What is the read length? Try blasting some of the unassigned reads (check the documentation on how to get them, it is well-explained in there).

ADD REPLYlink written 11 days ago by ATpoint14k

Hi ATpoint, The read length is =101

(salmon) [@ws7910 my_fastq]$ cat Treated_1_m1.fastq | head
@HWI-ST571:93:C00B8ACXX:1:1301:8908:31103/1
CTGGACTCCACACTCTCCTGGGTTTCACCTTTGTAGCAGGATCCCTGCAGACCAGGCCCATGACAAACACCGTCTCCAGCGGGCAGAGCAAAGGAAGGGCA

The flag to write out the name of unassigned reads is this --writeUnmappedNames

but now there is another problem ... When I run the salmon code like

salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --       writeUnmappedNames -o salmon_unmappeed

I have this

**activate does not accept more than one argument:**
['salmon', 'quant', '-i', 'gencode_v29_idx', '-l', 'A', '-1', '/Treated_1_m1.fastq', '-2',.......

I couldn't find this specific problematic online to see how to fix it, do you guys know how to solve it?

thank you

ADD REPLYlink modified 10 days ago • written 10 days ago by Morris_Chair60
1

Hi Morris,

It looks like you have a big space between the -- and the flag name that should follow immediately after. That is, your command line should be:

salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --writeUnmappedNames -o salmon_unmappeed

not

salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --       writeUnmappedNames -o salmon_unmappeed

The spaces matter in how a program parses arguments (not just salmon, but all programs). The -- sequence by itself is a special separator and indicates the end of a series of arguments for the program.

ADD REPLYlink written 10 days ago by Rob3.2k
1

Hi Rob, I found the problem, basically today I modified the environmental variable with alias salmon='source activate salmon' in order to access to salmon just typing in the terminal salmon, but there was a conflict with conda ..,this problem solved :)

ADD REPLYlink written 10 days ago by Morris_Chair60

Hi ATpoint, The sequences unmapped that I found where human (those one that I checked), .. I will try to change parameters and see if I can improve the percentage of mapped reads.

(salmon) [@ws7910 my_fastq]$ grep -A 3 "C00B8ACXX:1:2205:7234:25095/1" Treated_1_m1.fastq
@HWI-ST571:93:C00B8ACXX:1:2205:7234:25095/1
 CTATCCACAGAGACACACCGGACTGAGGGGGGCCACCTGCCTCCTTCCACACGGGGAAGACCAGGGGGTGGCAACAGATGGGGCCCTGCCTCTGCCCTCAG
 +
CCCFFFDFHHGGHJJJJIIIJIIJJJJIIIJBDD?BBBDBDDDDD@ACDD>@BBD@BDDD?CCDDDDD8;@DCDB?ACDDCBBDDDDDDDDDDDDDBCBBD
(salmon) [p.panelli@ws7910 my_fastq]$ grep -A 3 "C00B8ACXX:1:2201:7717:123882/1" Treated_1_m1.fastq
@HWI-ST571:93:C00B8ACXX:1:2201:7717:123882/1
AGATGCCTCATCCTGGGTTCACCCCAGGGACCATCGGGGTCTGCCTTTCCCAGCATTCTCTCTCCCTTGAGGATTCTCCATTCCATCATTTTGGTGGCGTT +CCCFFFFFHHHHHJJJJIJJJJIJJJJJJJJJJIIJIJJ@GHJJJJJJIJJHJJJIHHHHHHHFFFFFCEEEEDDCDDDDDEEFEEDDDDDEC=AB?ADBB

Thank you for your thought

ADD REPLYlink modified 10 days ago • written 10 days ago by Morris_Chair60
1

Hi Morris,

Can I ask how you verified that these are "human"? I took the sequences you gave above, and searched against the gencode transcriptome using Bowtie2 as the aligner. Both of these reads came back as completely unmapped. I did get a hit with blast (https://www.ncbi.nlm.nih.gov/nucleotide/NG_009253.1?report=genbank&log$=nuclalign&blast_rank=1&RID=8G9GH598015), but I wonder if this is related to the presence or absence of this gene in the transcriptome against which you are aligning (I realize that I suggested gencode, and that's because I usually find it to have among the best annotations, so not sure why this would be missing).

Edit: With the help of some folks on Biostars chat, it seems that these reads are intronic, which explains why they are not mapping to the spliced transcriptome.

ADD REPLYlink modified 10 days ago • written 10 days ago by Rob3.2k

Hello Rob, I did a nucleotide blast with this this tool

https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome,

selecting any species and I obtained this result (for the first one )

Human DNA sequence from clone RP11-169K16and I got this

The extract is RNA so it's weird to be intronic.

Thank you

ADD REPLYlink written 9 days ago by Morris_Chair60
1

If you take the sequence above (CTATCCACAGAGACACACCGGACTGAGGGGGGCCACCTGCCTCCTTCCACACGGGGAAGACCAGGGGGTGGCAACAGATGGGGCCCTGCCTCTGCCCTCAG)and blat it at UCSC you can see that the only hit is intronic. Black bar with name test is the sequence above.

blat

Perhaps this molecule was captured before splicing?

ADD REPLYlink modified 9 days ago • written 9 days ago by genomax64k

Yes, it might be that it was captured before splicing, I don't think that the vast majority of the unmapped reads are intronic also because when I used featureCounts I got 75% of aligned reads.

Thank you genomax

ADD REPLYlink written 9 days ago by Morris_Chair60

That's a very surprising difference if the annotations are really the same. Would you be able to share the reads from one of your samples? We'd be interested to take a look.

ADD REPLYlink written 9 days ago by Rob3.2k

sure, I can share the files let me know how

ADD REPLYlink written 9 days ago by Morris_Chair60

How large are they? You could use a file-sharing website, or if it is easier, you should be able to upload the data to this gdrive link.

ADD REPLYlink written 9 days ago by Rob3.2k

https://drive.google.com/file/d/1gAItT7JfdmbDo2AiTuNy_asXHRwlP7yv/view?usp=sharing https://drive.google.com/file/d/1fx0S7rdNs3qbAtMFBDL4zRg04F5EEWRx/view?usp=sharing https://drive.google.com/file/d/13yry1LqU8Q6Ob1NzsFvlpuTNCUPEDVPU/view?usp=sharing

Hi Rob, you should be able to download the files from here. There is a bam file and two fastq files derived from this Bam file

cheers

ADD REPLYlink modified 8 days ago • written 8 days ago by Morris_Chair60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 900 users visited in the last hour