Metadata about RNA-seq data are often missing or unclear. Deducing that information frequently takes substantial work and several tools.
Several approaches currently exist:
- The best known is infer_experiment.py, but it needs an annotation (most of the time I work on species that have no annotation available, so I always have to run a draft annotation to get a few usable genes).
- Run tophat or hisat twice (using fr-firststrand first and then fr-secondstrand) and compare the results, as explained here. The outcome is not always clear.
- Map your reads (or a subsample) and look at how Read1 and Read2 are aligned in a genome browser, as explained here.
- Use Salmon (but the result is only relative, because it does not use an annotation).
- ...
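The annotation-based approach in the list above comes down to comparing the strand of each aligned read with the strand of the gene it overlaps. Here is a minimal sketch of that logic, assuming the alignments have already been reduced to (read number, read strand, gene strand) tuples. The names `strand_category` and `tally` are hypothetical and this is not the actual code of infer_experiment.py or GUESSmyLT.

```python
from collections import Counter

def strand_category(read_number, read_strand, gene_strand):
    """Classify one aligned read as evidence for a 'first' or 'second' strand library.

    Assumption: in a first-strand library, read 2 lies on the gene's strand
    and read 1 on the opposite strand; a second-strand library is the reverse.
    """
    same_strand = read_strand == gene_strand
    if read_number == 1:
        return "second" if same_strand else "first"
    return "first" if same_strand else "second"

def tally(alignments):
    """Count the evidence over an iterable of (read_number, read_strand, gene_strand)."""
    return Counter(strand_category(*aln) for aln in alignments)

# Hypothetical toy input, not real data.
example = [
    (1, "-", "+"),
    (2, "+", "+"),
    (1, "+", "-"),
    (1, "+", "+"),
]
print(tally(example))
```

A real implementation would also need the relative positions of the two mates to tell fr, rf and ff layouts apart, which this sketch leaves out.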
Tired of receiving RNA-seq data without any information about the library type used, I came up with the idea of developing a single tool that automates this task and gives me all the information I need, whatever input data is available:
- With or without a fasta file (if no fasta file is provided, it does a transcriptome assembly to map the reads against)
- With or without an annotation (if no gff/gtf is provided, it does an annotation using BUSCO)
- ...
Here is the result: GUESSmyLT
I hope it helps many of us resolve this recurrent problem.
You are welcome to try it and provide feedback to improve it.
An example result:
Results of paired library inferring of reads 4_r1.sub.100000 on ref 4:
Library type   Reads   Percent   Vizualization according to firststrand
undecided      1       0.0%      3' -------??------- 5'
                                 5' -------??------- 3'
ff_second      2       0.0%      3' ----------==2==> 5'
                                 5' ==1==>---------- 3'
fr_first       4019    47.2%     3' ----------<==1== 5'
                                 5' ==2==>---------- 3'
ff_first       5       0.1%      3' ----------<==1== 5'
                                 5' <==2==---------- 3'
rf_second      19      0.2%      3' ----------==2==> 5'
                                 5' <==1==---------- 3'
rf_first       21      0.2%      3' ----------==1==> 5'
                                 5' <==2==---------- 3'
fr_second      4454    52.3%     3' ----------<==2== 5'
                                 5' ==1==>---------- 3'
Roughly 50/50 split between the strands of the same library orientation should be interpreted as unstranded.
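The interpretation rule above can be sketched in a few lines. This is only an illustration using the counts from the example output. The function name `interpret` and the 0.8 cutoff are my own assumptions, not GUESSmyLT's actual decision rule.

```python
def interpret(counts, dominant=0.8):
    """Turn per-library-type read counts into a single call.

    `dominant` is an assumed cutoff: the fraction one strand must reach
    within the winning orientation for the library to be called stranded.
    """
    # Group the "first"/"second" counts under each orientation (fr, rf, ff).
    by_orientation = {}
    for name, n in counts.items():
        orientation, sep, strand = name.partition("_")
        if not sep:
            continue  # e.g. "undecided" carries no orientation
        by_orientation.setdefault(orientation, {"first": 0, "second": 0})[strand] += n
    if not by_orientation:
        return "undecided"
    # Pick the orientation supported by the most reads.
    orientation, strands = max(by_orientation.items(),
                               key=lambda kv: kv[1]["first"] + kv[1]["second"])
    pair_total = strands["first"] + strands["second"]
    if pair_total == 0:
        return "undecided"
    frac_first = strands["first"] / pair_total
    if frac_first >= dominant:
        return f"{orientation}_first"
    if frac_first <= 1 - dominant:
        return f"{orientation}_second"
    return f"{orientation}_unstranded"

# Counts copied from the example output above.
example_counts = {
    "undecided": 1, "ff_second": 2, "fr_first": 4019, "ff_first": 5,
    "rf_second": 19, "rf_first": 21, "fr_second": 4454,
}
print(interpret(example_counts))
```

With these counts the fr orientation dominates and its two strands are split close to evenly, so the sketch reports the library as fr and unstranded. Anything between the two cutoffs is lumped into "unstranded" here, which is cruder than a real tool should be.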
Interesting tool; the descriptions of the possible library types are also quite handy.
Some feedback on usage:
- The example invocations are needlessly lengthy. You should not need to list the files as home/.../read1.fastq; just call the files read1.fq and read2.fq. Why bother with the absolute paths?
- The use cases should be labeled by the information that is available to the end user:
- Requiring snakemake to run your tool seems to add unneeded complexity.
- In general there seem to be too many dependencies. It feels like the task at hand (determining the library type) ought to be much simpler than having to first assemble a transcript. Not sure what the right answer is here, but this might be an interesting research problem on its own: how to detect the library type without assembling transcripts?
What I am basically saying is that transcript assembly is a different and much bigger/complicated task than library type detection.
Thank you for your feedback. It's true that snakemake seems unnecessary. I will think about it and decide whether to keep it or not.