We are looking for feedback on a new Ensembl tool being developed to help researchers download the reference files they need in the right format directly from Ensembl.
We understand that different tools need slightly different formatting, and that sometimes identifiers need to be remapped to make datasets match. One example is Ensembl chromosome names (1, 2, 3...) versus UCSC chromosome names (chr1, chr2, chr3...). Some analyses need N padding in a chromosome, while for others it might cause issues.
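To make the chromosome-name difference concrete, here is a minimal Python sketch of the kind of remapping people often do by hand. It is only an illustration, not how the planned tool works: the function names `to_ucsc_name` and `rename_first_column` are hypothetical, it assumes tab-separated input with the sequence name in the first column (as in BED or GTF), and it only covers the primary human chromosomes. Scaffolds, patches and haplotypes follow naming schemes that a simple prefix cannot handle, so this sketch drops them rather than guess.

```python
from typing import Optional

# Primary human chromosomes only (an assumption of this sketch);
# scaffolds, patches and haplotypes are deliberately not handled.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def to_ucsc_name(name: str) -> Optional[str]:
    """Map a bare chromosome name (1, 2, X...) to UCSC style (chr1, chr2, chrX...)."""
    if name == "MT":
        # The mitochondrial genome is the usual special case: MT becomes chrM.
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    return None  # unknown sequence: no safe mapping


def rename_first_column(line: str) -> Optional[str]:
    """Rename the sequence in column 1 of a tab-separated line.

    Comment lines are passed through unchanged; lines on sequences
    without a mapping return None so the caller can skip them.
    """
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    new_name = to_ucsc_name(fields[0])
    if new_name is None:
        return None
    fields[0] = new_name
    return "\t".join(fields)
```

Even this small case shows why a ready-made option would help: the mitochondrial name needs special handling, and anything beyond the primary chromosomes needs a proper lookup table rather than a prefix.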
So we're creating a tool that gives you the datasets you need, in the format you need, so you can spend less time preparing reference sets and get down to running your analysis. For example, NCBI has a number of premade datasets, with different combinations of regions, and with prepared indexes for common tools.
As the first step in this project, we want to hear from you about what filtering and transformations you apply to our datasets to make them useful for your analysis, or what changes to our datasets would make your analysis easier and faster to run. We are interested in everything from identifier types and extra attributes needed, to which combinations of regions you want in a reference set (patches, haplotypes, scaffolds, etc.), to masking and filtering of regions.
Once we have a list of how our users use our data, and what programs they're trying to use it with, we can start this initiative to make our datasets and tools better adapted to your needs. We also hope we'll be able to follow up with anyone replying in case we need some clarification to better understand your needs.
Thank you to everyone. We're committed to making our reference data better fit your analysis needs.