I am close to getting the velocyto pipeline working, but I am running into some issues that I think are related to sparse data. I am not sure what the solution would be, and I would appreciate any tips or guidance on whether things look appropriate.
I am following along with this tutorial: https://github.com/BUStools/getting_started/blob/master/velocity_tutorial.ipynb
After this step:
vlm.score_cv_vs_mean(2000, plot=True, max_expr_avg=50, winsorize=True, winsor_perc=(1,99.8), svr_gamma=0.01, min_expr_cells=50)
vlm.filter_genes(by_cv_vs_mean=True)
I get the following graph, which doesn't look as smooth as the one in the example. What might this mean?
Then I run the steps below. Note that I had to use a fairly high value, 60, for min_perc_U; otherwise I would get complaints like "min_perc_U=0.5 corresponds to total Unspliced of 1 molecule of less. Please choose higher value or filter our these cell":
vlm.score_detection_levels(min_expr_counts=0, min_cells_express=0, min_expr_counts_U=25, min_cells_express_U=20)
vlm.score_cluster_expression(min_avg_U=0.007, min_avg_S=0.06)
vlm.filter_genes(by_detection_levels=True)
vlm.normalize_by_total(plot=True, min_perc_U=60)
I get the following graph:
The step that is ultimately giving me trouble is after that:
vlm.adjust_totS_totU(normalize_total=True, fit_with_low_U=False, svr_C=1, svr_gamma=1e-04)
I get the message:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I am guessing this means that some of my genes are not expressed highly enough at the unspliced or spliced level. Does this just mean I have to do some more filtering? Or is something about my dataset perhaps not optimal for velocity analysis?
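One way to test this guess before filtering further is to look at the raw count matrices directly. The following is only a minimal sketch: it assumes the spliced and unspliced counts are available as dense genes x cells NumPy arrays (for example `vlm.S` and `vlm.U`), and `summarize_unspliced_depth` is a hypothetical helper name, not part of velocyto. It counts non-finite entries, cells and genes with zero total counts (which could plausibly turn into NaN or infinity once totals are used for normalization), and the per-cell unspliced total at a given percentile, on the assumption that `min_perc_U` is a percentile of that distribution, as the warning message suggests.

```python
import numpy as np


def summarize_unspliced_depth(S, U, perc=0.5):
    """Summarize depth problems in spliced (S) and unspliced (U) counts.

    S, U: genes x cells count matrices (assumed layout, dense arrays).
    perc: percentile of per-cell unspliced totals to report (assumption:
          this mirrors what min_perc_U refers to).
    """
    S = np.asarray(S, dtype=float)
    U = np.asarray(U, dtype=float)
    tot_S = S.sum(axis=0)  # total spliced molecules per cell
    tot_U = U.sum(axis=0)  # total unspliced molecules per cell
    return {
        "nonfinite_S": int((~np.isfinite(S)).sum()),
        "nonfinite_U": int((~np.isfinite(U)).sum()),
        "cells_zero_S": int((tot_S == 0).sum()),
        "cells_zero_U": int((tot_U == 0).sum()),
        "genes_zero_U": int((U.sum(axis=1) == 0).sum()),
        "U_total_at_perc": float(np.percentile(tot_U, perc)),
    }


# Toy data standing in for vlm.S and vlm.U (2 genes x 3 cells).
S_toy = np.array([[5, 3, 0], [2, 0, 1]])
U_toy = np.array([[1, 0, 0], [0, 0, 2]])
summary = summarize_unspliced_depth(S_toy, U_toy)
print(summary)
```

If many cells show zero or near-zero unspliced totals, that would be consistent with the `min_perc_U` warning above and would point to low unspliced depth per cell rather than to individual genes.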