I have a miRNA data set that looks unusual to me, and I would like the community's opinion on it.
The data set consists of 90 miRNA samples from cancer tissues. After trimming adapters and filtering by size (18-25 bp) with cutadapt, the percentage of reads retained varies widely across samples: from 3.5% to 50%, with a median of 19% (see the details below, or the complete descriptive file, "File with statistics"). The lowest absolute number of reads after trimming and filtering is 343,321 and the highest is 34,482,005.
I want to do a differential expression analysis between groups of samples (across tissues). Can the high variability in the number of reads across samples cause problems for this analysis? If so, what can be done about them?
Sample # | Total number of reads | Total number of trimmed and filtered reads | Trimmed and filtered reads (%) | Reads mapped to miRNAs | Reads mapped to miRNAs (%)
3 20264513 1011838 5.0 855474 84.5
6 22279183 1517941 6.8 1331150 87.7
11 41346575 12452439 30.1 10484237 84.2
12 13421631 825000 6.1 660365 80.0
25 17442351 5609984 32.2 4629579 82.5
29 22323963 3018897 13.5 2756814 91.3
32 22097225 1050964 4.8 887537 84.4
34 32368666 9933623 30.7 6039261 60.8
55 24319059 5289647 21.8 4383139 82.9
57 28842256 3291841 11.4 2850177 86.6
60 15407714 1253426 8.1 1103150 88.0
61 21409705 9814410 45.8 8642218 88.1
62 28707347 12635163 44.0 10764864 85.2
65 21955057 7394967 33.7 6109353 82.6
66 26624176 11839221 44.5 10535026 89.0
68 27987570 7319352 26.2 6290405 85.9
69 9638790 2136508 22.2 1859750 87.0
82 30422344 3207930 10.5 2867819 89.4
83 30297304 2402661 7.9 2137548 89.0
85 41137933 11554100 28.1 9066972 78.5
87 27224594 8826819 32.4 7536977 85.4
88 30989860 14273861 46.1 12847943 90.0
91 50343291 3862888 7.7 3257365 84.3
92 21730109 1894990 8.7 1552786 81.9
93 36191161 1431614 4.0 901351 63.0
94 51992345 1805736 3.5 1556177 86.2
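
To make the columns concrete, here is a minimal Python sketch that recomputes the two percentage columns for a few rows copied from the table, and the spread in miRNA-mapped depth between them. It assumes that "Reads mapped to miRNAs (%)" is relative to the trimmed and filtered reads, which is what the numbers suggest. The function name `summarize` and the choice of rows are only for illustration.

```python
# Rows copied from the table above:
# (sample, total reads, trimmed and filtered reads, reads mapped to miRNAs)
rows = [
    (3, 20264513, 1011838, 855474),
    (11, 41346575, 12452439, 10484237),
    (34, 32368666, 9933623, 6039261),
    (88, 30989860, 14273861, 12847943),
    (94, 51992345, 1805736, 1556177),
]


def summarize(rows):
    """Recompute the percentage columns for each sample (hypothetical helper)."""
    out = []
    for sample, total, kept, mapped in rows:
        out.append({
            "sample": sample,
            # share of raw reads that survive trimming and size filtering
            "kept_pct": round(100 * kept / total, 1),
            # assumption: mapped % is relative to trimmed and filtered reads
            "mapped_pct": round(100 * mapped / kept, 1),
            "mapped": mapped,
        })
    return out


summary = summarize(rows)
for s in summary:
    print(f"sample {s['sample']:>2}: kept {s['kept_pct']}%, "
          f"mapped {s['mapped_pct']}%, mapped reads {s['mapped']}")

# Fold difference in miRNA-mapped depth between the deepest and the
# shallowest of these five samples only (not the full set of 90).
mapped_counts = [s["mapped"] for s in summary]
depth_ratio = max(mapped_counts) / min(mapped_counts)
print(f"max/min mapped depth: {depth_ratio:.1f}x")
```

Even among these five samples the usable depth differs by more than an order of magnitude, which is the variability the question is about.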