I would like to extract all TargetScan predictions for human (and later mouse), but I can't quite wrap my head around the different files TargetScan offers and the relationships between them: http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_61
I would like to get what is, from a biological point of view, the basic result of target prediction: associations of miRNAs (e.g. hsa-miR-100-3p) with genes from the same organism (e.g. Gene ID 123445). Ideally, I would get a list of all binding sites, each with miRNA, gene, and position on the UTR. More specifically, this is what I am looking for:
All targets for all human miRNAs (i.e. all targets that have at least one conserved binding site), a file looking like this:
| miRNA name | gene id | aggregate context score |
| hsa-miR-1 | 12345 | 0.89 |
Here you only have one line per miRNA-gene combination.
All binding sites for all human miRNAs, a file looking like:
| miRNA name | gene id | position on UTR | context score |
| hsa-miR-1 | 12345 | 128 | 0.89 |
Here you have multiple lines for each miRNA-gene pair, one for each binding site.
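To make the relationship between the two tables concrete, here is a minimal Python sketch that collapses per-site rows (the second table) into one row per miRNA-gene pair (the first table). The rows and the `aggregate_sites` helper are made up for illustration, and summing the per-site context scores is only my assumption about how an aggregate score would be built; I have not confirmed that this is what TargetScan does.

```python
from collections import defaultdict

# Hypothetical per-site rows: (miRNA name, gene id, position on UTR, context score).
# The values are invented for illustration, not taken from TargetScan.
sites = [
    ("hsa-miR-1", 12345, 128, 0.5),
    ("hsa-miR-1", 12345, 560, 0.25),
    ("hsa-miR-100-3p", 12345, 77, 0.125),
]


def aggregate_sites(rows):
    """Collapse per-site rows into a {(miRNA, gene id): summed score} mapping.

    Assumption: the aggregate score is the plain sum of the per-site scores.
    """
    totals = defaultdict(float)
    for mirna, gene_id, _position, score in rows:
        totals[(mirna, gene_id)] += score
    return dict(totals)


pairs = aggregate_sites(sites)
for (mirna, gene_id), total in sorted(pairs.items()):
    print(mirna, gene_id, total)
```

If this is right, the per-pair file is fully derivable from the per-site file, so the per-site file would be the only one I strictly need.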
As far as I can see there are multiple ways to achieve this:
Use the 'Predicted Conserved Targets Info' file. It contains the miRNA family name, the Gene ID and a species ID. Here I can extract everything for one species (e.g. human) and then map the miRNA family name to the actual miRNA names. The file contains one binding site per line, so I can use it to get all binding sites and then collapse them into single miRNA-gene pairs. But how would I get the context score?
Use the 'Conserved site context+ scores' file. It contains the full miRNA name, Gene ID, context+ score and position. Again, there is one binding site per line. I can parse this, extract all lines with human miRNAs and get what I want. But am I missing something by not using the other files? How does TargetScan get from 'Predicted Conserved Targets Info' to 'Conserved site context+ scores'?
Use the 'Summary Counts' file. It contains miRNA family, Gene ID, number of conserved sites on gene, aggregate score. After mapping from miRNA family to full miRNA name this would give me the first of my desired results, i.e. a file with miRNA-gene associations.
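For the options that are keyed by miRNA family, the mapping step I have in mind looks roughly like the Python sketch below. The two inline tables are made-up stand-ins for the downloaded files, and the column names (`miR Family`, `Species ID`, `miRNA name`, `UTR start`) and helper functions are my assumptions, not the real TargetScan headers. The only fixed values are the NCBI taxonomy IDs 9606 (human) and 10090 (mouse).

```python
import csv
import io

HUMAN_TAX_ID = "9606"  # NCBI taxonomy ID for Homo sapiens

# Made-up stand-in for a per-site file keyed by miRNA family (assumed columns).
SITES_TSV = (
    "miR Family\tGene ID\tSpecies ID\tUTR start\n"
    "miR-1/206\t12345\t9606\t128\n"
    "miR-1/206\t12345\t10090\t130\n"
    "miR-100\t67890\t9606\t42\n"
)

# Made-up stand-in for a family-to-miRNA mapping file (assumed columns).
FAMILY_TSV = (
    "miR Family\tSpecies ID\tmiRNA name\n"
    "miR-1/206\t9606\thsa-miR-1\n"
    "miR-1/206\t9606\thsa-miR-206\n"
    "miR-1/206\t10090\tmmu-miR-1\n"
    "miR-100\t9606\thsa-miR-100\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def family_to_mirnas(family_rows, tax_id):
    """Build {family name: [miRNA names]} for a single species."""
    mapping = {}
    for row in family_rows:
        if row["Species ID"] == tax_id:
            mapping.setdefault(row["miR Family"], []).append(row["miRNA name"])
    return mapping


def expand_sites(site_rows, mapping, tax_id):
    """Keep one species' sites and emit one row per miRNA in each family."""
    expanded = []
    for row in site_rows:
        if row["Species ID"] != tax_id:
            continue
        for mirna in mapping.get(row["miR Family"], []):
            expanded.append((mirna, row["Gene ID"], int(row["UTR start"])))
    return expanded


mapping = family_to_mirnas(read_tsv(FAMILY_TSV), HUMAN_TAX_ID)
human_sites = expand_sites(read_tsv(SITES_TSV), mapping, HUMAN_TAX_ID)
for site in human_sites:
    print(*site)
```

Note that one family row fans out into several miRNA rows, so every member of a family would end up with identical sites and scores under this approach.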
I have read all the TargetScan publications, but I still don't know where to start, or whether I would miss something by using only one specific file. What is the best practice here?