I have a de novo transcriptome assembly of a polyploid tree species, assembled with K=31 and a minimum length of 200 bp. The assembly contains almost 400K genes; after redundancy reduction with CD-HIT-EST (cut-off = 0.97), around 350K genes remain. Mapping ca. 1/4 of the total reads back to the assembly showed that the majority of the reads align >1 times. Do you think this would pose a problem if I aim to work at the gene level? I could try CD-HIT-EST with cut-off = 0.95. Or would it be better to use Lace to stitch the different isoforms together and take it from there?
Thank you very much in advance for your suggestions and comments!
$ bowtie2 --local --no-unal -x cdhit_e97_Trinity_Famer_K31 -p 24 -q -1 cat_70x_R1.fq.gz -2 cat_70x_R2.fq.gz | samtools view -b | samtools sort -o 70x_bowtie2.bam
78850917 reads; of these:
  78850917 (100.00%) were paired; of these:
    2584872 (3.28%) aligned concordantly 0 times
    11430984 (14.50%) aligned concordantly exactly 1 time
    64835061 (82.22%) aligned concordantly >1 times
    ----
    2584872 pairs aligned concordantly 0 times; of these:
      201798 (7.81%) aligned discordantly 1 time
    ----
    2383074 pairs aligned 0 times concordantly or discordantly; of these:
      4766148 mates make up the pairs; of these:
        627598 (13.17%) aligned 0 times
        410463 (8.61%) aligned exactly 1 time
        3728087 (78.22%) aligned >1 times
99.60% overall alignment rate
[bam_sort_core] merging from 80 files and 1 in-memory blocks...
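
To make it easier to compare the multi-mapping rate between clustering cut-offs (0.97 vs. 0.95), the relevant number can be pulled out of a saved bowtie2 summary. This is only a minimal sketch: the file name `bowtie2_summary.txt` is hypothetical (it assumes the summary was captured by redirecting bowtie2's stderr to a file), and the heredoc simply reproduces the first lines of the log from this run so the snippet runs on its own.

```shell
# Hypothetical file: in a real run this would hold bowtie2's stderr summary.
# Here it is filled with the first lines of the summary from this run.
cat > bowtie2_summary.txt <<'EOF'
78850917 reads; of these:
  78850917 (100.00%) were paired; of these:
    2584872 (3.28%) aligned concordantly 0 times
    11430984 (14.50%) aligned concordantly exactly 1 time
    64835061 (82.22%) aligned concordantly >1 times
EOF

# Take the total number of pairs and the number of pairs aligning
# concordantly more than once, then report the latter as a percentage.
multi_pct=$(awk '
  /were paired/                   { total = $1 }
  /aligned concordantly >1 times/ { multi = $1 }
  END { printf "%.2f", 100 * multi / total }
' bowtie2_summary.txt)

echo "pairs aligned concordantly >1 times: ${multi_pct}%"
```

Running the same extraction on the summary from a 0.95 clustering would show directly whether the stricter reduction lowers the share of multi-mapped pairs.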