I have Illumina RNA-seq data from individuals of two groups (two different phenotypes). After preprocessing the reads and running de novo assembly with Trinity, the assemblies in group 1 are 45-55 MB in size (megabytes, FASTA format), while those in group 2 are only 2-9 MB. This difference could be due to the high duplication rate in the raw reads of group 2 (checked with FastQC's "Sequence Duplication Levels" module).
I want to find out whether these assemblies are usable for further analysis (e.g. differential expression, SNP discovery) or whether, in the unfortunate case, we have to redo everything from the library preparation step. To that end, I would like to check what proportion of transcripts (at some arbitrary similarity threshold) is shared between the individual transcriptomes of the two groups. Which tool or method could help me do that? Any thoughts on the usefulness of these data (i.e. to what extent we can still exploit this bad data) are also welcome.
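To make concrete what I mean by "shared proportion", here is a minimal Python sketch, not a validated method. It assumes one assembly has already been aligned against another with BLAST in tabular format (for example `blastn -outfmt 6`), and it counts a transcript as shared if it has at least one hit above an identity and query-coverage cutoff. The function names `fasta_lengths` and `shared_fraction` and the default thresholds (90% identity, 50% coverage) are my own placeholders, and coverage is computed per hit rather than merged across hits, which is a simplification.

```python
def fasta_lengths(fasta_lines):
    """Return {transcript_id: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID,
            # which is also what BLAST reports as qseqid.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def shared_fraction(blast_lines, query_lengths, min_identity=90.0, min_coverage=0.5):
    """Fraction of query transcripts with at least one qualifying hit.

    blast_lines: lines in BLAST tabular format (outfmt 6), whose columns are
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore.
    The thresholds are arbitrary placeholders, not recommendations.
    """
    if not query_lengths:
        return 0.0
    shared = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid = fields[0]
        if qseqid not in query_lengths:
            continue
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        # Per-hit query coverage (simplification: hits are not merged).
        coverage = (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            shared.add(qseqid)
    return len(shared) / len(query_lengths)
```

In practice one would pass an open FASTA file handle to `fasta_lengths` and an open BLAST output file handle to `shared_fraction`, then repeat this for every pair of individuals to see whether the group 2 assemblies are largely contained in the group 1 ones.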
Thank you in advance for your suggestions!