Identifying 5p and 3p in miRNA isoform expression data from TCGA for feature selection
3.0 years ago
jiaqiwu ▴ 20

I've downloaded some miRNA expression data from TCGA (for CHOL) and the isoform quantification files look like this:

miRNA_ID    isoform_coords  read_count  reads_per_million_miRNA_mapped  cross-mapped    miRNA_region
hsa-let-7a-1    hg38:chr9:94175939-94175962:+   1   0.706072    N   precursor
hsa-let-7a-1    hg38:chr9:94175942-94175962:+   1   0.706072    N   precursor
hsa-let-7a-1    hg38:chr9:94175961-94175984:+   2   1.412144    N   mature,MIMAT0000062
hsa-let-7a-1    hg38:chr9:94175962-94175981:+   45  31.773244   N   mature,MIMAT0000062


However, in other projects and papers, I always see selected features labeled as hsa-let-7a-1-3p or hsa-let-7a-5p, etc. Where is the 3p/5p coming from? Does it correspond with the +/- strand?

Additionally, how do I pool this data across samples so I can run differential expression analysis between CHOL samples and other cancer types (e.g., BRCA)? My end goal is to apply feature selection methods and then use the selected features to predict cancer types, but I am unsure how to process this data.

tcga microrna mirna isoform data processing • 1.4k views

Did you find a way to do this? I want to figure out the 3p/5p forms from the isoform quantification files too, but don't know how or where to begin!

8 months ago

Here is how I solved it:

The 3p and 5p strands correspond to the MIMAT IDs in the miRNA_region column. Sum the counts that share the same MIMAT ID within each sample, and you'll have the raw counts you need for differential expression analysis and so on.

First, delete the "precursor" and "stemloop" rows, because we only want counts for mature strands:

mature = your_data[!grepl("precursor|stemloop", your_data$miRNA_region),]

I also remove "mature," from in front of the MIMAT IDs, because afterwards we'll have to convert the MIMAT IDs to mature strand names using miRBaseConverter:

mature$miRNA_region = gsub("mature,", "", mature$miRNA_region, fixed = TRUE)
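To make the filtering and ID conversion concrete, here is a minimal sketch on a toy table built from the four rows shown in the question. The `iso` data frame and the `name_map` variable are illustrative names, not part of the TCGA files. The conversion step assumes miRBaseConverter is installed and that its `miRNA_AccessionToName()` function accepts a vector of MIMAT accessions and a `targetVersion`; it is skipped if the package is missing.

```r
# Toy rows copied from the question's isoform file (one sample).
iso <- data.frame(
  miRNA_ID = "hsa-let-7a-1",
  read_count = c(1, 1, 2, 45),
  miRNA_region = c("precursor", "precursor",
                   "mature,MIMAT0000062", "mature,MIMAT0000062"),
  stringsAsFactors = FALSE
)

# Keep only rows annotated as mature, then strip the "mature," prefix
# so that only the MIMAT accession remains.
mature <- iso[grepl("^mature,", iso$miRNA_region), ]
mature$miRNA_region <- sub("^mature,", "", mature$miRNA_region)

# Optional: map accessions to mature names such as "...-5p" / "...-3p".
# Assumption: miRBaseConverter is installed; "v22" is an example version.
if (requireNamespace("miRBaseConverter", quietly = TRUE)) {
  ids <- unique(mature$miRNA_region)
  name_map <- miRBaseConverter::miRNA_AccessionToName(ids, targetVersion = "v22")
  print(name_map)
}
```

Matching on `^mature,` rather than excluding "precursor" and "stemloop" is a deliberate variation: it also drops any other non-mature label that might appear in the column.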


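Now sum the counts that share the same MIMAT ID and patient barcode. (The barcode column is not in the raw file shown in the question; it has to be added when the per-sample files are combined.)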

x = aggregate(read_count ~ miRNA_region + barcode, data=mature, sum)
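The `aggregate()` call above needs a `barcode` column, which the individual isoform files do not carry. A rough sketch of one way to add it, assuming each sample's filtered table is held in a named list: the barcodes `sample_A` and `sample_B` and the counts are made up for illustration, and with real data each list element would come from something like `read.delim()` on one file.

```r
# Hypothetical per-sample tables (already filtered to mature rows).
# The sample names and counts are invented for this example.
per_sample <- list(
  sample_A = data.frame(
    miRNA_region = c("MIMAT0000062", "MIMAT0000062", "MIMAT0000063"),
    read_count = c(2, 45, 10)
  ),
  sample_B = data.frame(
    miRNA_region = c("MIMAT0000062", "MIMAT0000063", "MIMAT0000063"),
    read_count = c(5, 3, 4)
  )
)

# Tag every row with the barcode of the sample it came from,
# then stack all samples into one long table.
tagged <- Map(function(df, bc) {
  df$barcode <- bc
  df
}, per_sample, names(per_sample))
mature_all <- do.call(rbind, tagged)

# Sum isoform-level counts per MIMAT ID within each sample.
x <- aggregate(read_count ~ miRNA_region + barcode, data = mature_all, sum)
print(x)
```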


Then you get something like this:

 miRNA_region                      barcode    read_count
MIMAT0000062 TCGA-05-4244-01A-01T-1108-13      14492
MIMAT0000063 TCGA-05-4244-01A-01T-1108-13       8767
MIMAT0000064 TCGA-05-4244-01A-01T-1108-13        610
MIMAT0000065 TCGA-05-4244-01A-01T-1108-13        750
MIMAT0000066 TCGA-05-4244-01A-01T-1108-13        804
MIMAT0000067 TCGA-05-4244-01A-01T-1108-13       4748


Unfortunately, I don't yet know how to separate the counts by their sample barcodes.

edit1: I asked for help on this issue here
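For what it's worth, one possible way to spread the long table into one column per barcode is base R's `xtabs()`. This is only a sketch on invented numbers: the toy `x` below mimics the shape of the aggregated table, and `count_matrix` is an illustrative name. MIMAT IDs with no reads in a sample come out as 0.

```r
# Toy long-format table in the same shape as the aggregate() output.
# sample_B deliberately has no row for MIMAT0000063.
x <- data.frame(
  miRNA_region = c("MIMAT0000062", "MIMAT0000063", "MIMAT0000062"),
  barcode = c("sample_A", "sample_A", "sample_B"),
  read_count = c(47, 10, 5)
)

# Cross-tabulate: rows are MIMAT IDs, columns are barcodes,
# cells are summed read counts (missing combinations become 0).
counts <- xtabs(read_count ~ miRNA_region + barcode, data = x)

# Convert the table object into a plain numeric matrix.
count_matrix <- as.matrix(as.data.frame.matrix(counts))
print(count_matrix)
```

A matrix with features in rows and samples in columns is the layout that count-based differential expression tools generally expect, so tables from CHOL and another project could be built this way and then joined on the MIMAT IDs.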