Okay, so I aligned my reads to the genome and used the UCSC knownGene GTF to get counts via -t transcripts. Some of the resulting rows are complete duplicates: the coordinates are identical and so are the counts. From what I have read, this may be due to isoforms that map to the same region (same start and stop, as well as everything else, including the counts). I have also augmented my counts using BioMart to get the gene biotype. I know I can graph these biotypes to get a general view of what is in my sample. The issue is that BioMart only returns annotation for the "reviewed (Swiss-Prot)" entries, so my "unreviewed (TrEMBL)" entries go without augmented info. I guess I'm just stuck on where to go from here. Sorry for the various questions, but there is very limited info on small RNA-seq and the protocol. This is where I'm currently at:
1. Decide whether to collapse duplicated counts?
The gene biotypes I got are:
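To make the collapsing question concrete, here is a minimal Python sketch of one possible approach, assuming each row carries a transcript ID, coordinates, strand, and a count. The row layout, the IDs, and the `collapse_duplicates` helper are all made up for illustration and are not taken from any particular counting tool. Rows that agree on every field except the ID are merged into one, and the merged row keeps all of the original IDs so nothing is lost.

```python
from collections import defaultdict

# Hypothetical rows: (transcript_id, chrom, start, end, strand, count).
# These values are invented purely to illustrate the idea.
rows = [
    ("tx1.1", "chr1", 100, 200, "+", 42),
    ("tx1.2", "chr1", 100, 200, "+", 42),  # same region and count as tx1.1
    ("tx2.1", "chr2", 500, 650, "-", 7),
]


def collapse_duplicates(rows):
    """Merge rows that share identical coordinates, strand, and count."""
    groups = defaultdict(list)
    for tx_id, chrom, start, end, strand, count in rows:
        groups[(chrom, start, end, strand, count)].append(tx_id)
    # Keep one row per group and record every transcript ID it stands for.
    return [
        (";".join(sorted(ids)), chrom, start, end, strand, count)
        for (chrom, start, end, strand, count), ids in groups.items()
    ]


collapsed = collapse_duplicates(rows)
for row in collapsed:
    print(row)
```

Keeping the joined IDs in the collapsed row means the biotype annotation for any of the original transcripts can still be looked up afterwards.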
artifact                               19
IG_C_gene                              16
IG_C_pseudogene                        11
IG_D_gene                              46
IG_J_gene                               4
IG_J_pseudogene                         6
IG_pseudogene                           1
IG_V_gene                             207
IG_V_pseudogene                       289
lncRNA                              60210
miRNA                                1926
misc_RNA                             2402
Mt_rRNA                                 2
Mt_tRNA                                22
non_stop_decay                         10
nonsense_mediated_decay              3297
processed_pseudogene                10139
processed_transcript                   26
protein_coding                      49902
protein_coding_CDS_not_defined      28795
protein_coding_LoF                    105
pseudogene                             20
retained_intron                     37115
ribozyme                                8
rRNA                                   71
rRNA_pseudogene                       514
scaRNA                                 51
snoRNA                               1009
snRNA                                2071
sRNA                                    6
TEC                                  1162
TR_D_gene                               5
TR_J_gene                              21
TR_J_pseudogene                         4
TR_V_gene                             136
TR_V_pseudogene                        46
transcribed_processed_pseudogene     1216
transcribed_unitary_pseudogene        188
transcribed_unprocessed_pseudogene   1764
translated_processed_pseudogene         2
unitary_pseudogene                     87
unprocessed_pseudogene               2675
vault_RNA                               4
The above counts everything except the TrEMBL genes. I'm wondering whether graphing these would be an acceptable output, and whether the TrEMBL annotations should be added, given that they are unreviewed and low confidence.
I'm genuinely just lost on how to move forward yet again.
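One way to look at the graphing question is to keep the entries without a biotype in the tally as their own category, so the plot shows how much of the sample they represent instead of silently dropping them. The Python sketch below assumes a mapping from feature ID to biotype, with `None` wherever no biotype came back. The IDs, the `biotype_summary` helper, and the "unannotated" label are hypothetical.

```python
from collections import Counter

# Hypothetical annotation result: feature ID -> biotype.
# None marks an entry for which no biotype was returned.
annotations = {
    "feat1": "miRNA",
    "feat2": "miRNA",
    "feat3": "snoRNA",
    "feat4": None,
}


def biotype_summary(annotations, missing_label="unannotated"):
    """Count features per biotype, keeping unannotated ones as a category."""
    counts = Counter(
        biotype if biotype else missing_label
        for biotype in annotations.values()
    )
    total = sum(counts.values())
    # Map each biotype to (number of features, fraction of all features).
    return {bt: (n, n / total) for bt, n in counts.items()}


summary = biotype_summary(annotations)
for biotype, (n, fraction) in summary.items():
    print(f"{biotype}\t{n}\t{fraction:.2%}")
```

The resulting counts or fractions can go straight into a bar chart. If the "unannotated" bar turns out to be small, leaving those entries out is easier to justify.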