Forum: How do you process your reference data? (in genomics, metabolomics, transcriptomics, proteomics?)
Chloe Riou40 wrote:

Hi everyone,

I need your help and advice (the ###question### is at the end of the paragraph for the really busy person). I am currently working on BioMAJ, which is a workflow engine dedicated to data synchronization and processing. Its purpose is to manage and supervise databank updates; it can also run the pre- or post-processing for each bank (GenBank, EMBL, DDBJ, PDB...). This management saves a considerable amount of time: you do not have to think about when to check a databank for updates or which version of it you have locally. BioMAJ aims to adapt to new uses of databases, and to do this I need your advice as a bioinformatician/bioinformaticist, a biologist, or anyone interested in the question:

###How do you process your reference databank or reference genome? (for example BLAST, indexing, extraction of gene sets, querying with RDF/SPARQL...) Which data types do you work with?###

Thank you for reading this post. I would be grateful for every contribution!

https://github.com/genouest/biomaj/wiki

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Chloe Riou40
6

How do you season your data?

Spicy all the way!

On a more serious note: What exactly did you want to ask there? How does one pre-process data?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax69k

I was waiting for this one! And really?

I would like to know which operations you usually run on your reference data. For example, if your reference data is a genome, do you extract a set of genes? A chromosome? Do you index it? Do you BLAST it? Another example: if you download all of GenBank, what will you do with the data? Will you extract a part of it with RDF/SPARQL? Is that clearer?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Chloe Riou40

Ideally you would not want to mess with the reference (in terms of changing things). When available from an authoritative source (e.g. NCBI, Ensembl, UCSC) I would get the source data (sequence, indexes) as is. If derived things are needed (e.g. BBMap indexes) then I build them from the reference files mentioned before. This is pretty much a do-it-once thing that is not repeated until absolutely needed.
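
As a rough illustration of the "build once, rebuild only when needed" idea, here is a minimal Python sketch. The function name `index_is_stale` and the file names are hypothetical, and comparing modification times is only one possible staleness check (a checksum comparison would be stricter).

```python
import os


def index_is_stale(reference_path, index_path):
    """Return True if the derived index is missing or older than the reference."""
    if not os.path.exists(index_path):
        return True
    # Rebuild only when the reference changed after the index was built.
    return os.path.getmtime(reference_path) > os.path.getmtime(index_path)
```

A wrapper script could call this before launching the (possibly slow) index build and skip the build when it returns False.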

There are tools to extract information about a gene/chromosome that can be run on the fly (e.g. bedtools, eutils) so it does not make sense to precompute those things.
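
To make the "extract on the fly" point concrete, here is a small pure-Python sketch that pulls a region out of FASTA text at request time instead of precomputing it. It is a stand-in for what tools like bedtools do, not their implementation; `extract_region` is a hypothetical helper, and it assumes BED-style 0-based, half-open coordinates.

```python
def extract_region(fasta_text, chrom, start, end):
    """Return bases [start, end) of the record named `chrom` (0-based, half-open)."""
    current = None
    found = False
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if found:
                break  # the requested record is complete
            # The record name is the first word after '>'.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            found = current == chrom
            continue
        if found:
            chunks.append(line)
    if not found:
        raise KeyError(chrom)
    return "".join(chunks)[start:end]
```

For a real genome you would read from an indexed file rather than scanning text, but the principle is the same: the slice is computed when it is asked for.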

If you start internally processing the reference data then you get on a slippery slope. You would need to redo the processing every time a new version of the reference data comes out (e.g. every night with GenBank).

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax69k

I hear a James Bond dialogue here:

  • Villain: How would you like your data, Sir?
  • Bond: I like my data how I like my cars. Fast, wild and European.
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Istvan Albert ♦♦ 80k

Et tu Istvan? Then fall NCBI.

On a serious note, do you prefer EMBL over NCBI? Any reason?

ADD REPLYlink written 2.6 years ago by RamRS22k
1

Yes he has been on record saying so .. for annotations at least :)

ADD REPLYlink written 2.6 years ago by genomax69k
1

hey it is Bond that said it not me :-)

But yes, as genomax2 stated, I think human gene annotations are better at Ensembl. And a lot of data is much easier to obtain from their FTP sites.

ADD REPLYlink written 2.6 years ago by Istvan Albert ♦♦ 80k
1

Hi- Just a suggestion... From the github wiki you link I read:

BioMAJ (BIOlogie Mise A Jour) is a workflow engine dedicated to data synchronization and processing. The Software automates the update cycle and the supervision of the locally mirrored databank repository.

I think there is a little too much jargon and technicality here, especially if you are trying to attract an audience that is not very familiar with the topic. Maybe a gentler explanation would help...?

ADD REPLYlink written 2.6 years ago by dariober10k

Thanks for your remark. BioMAJ is software for updating and managing databanks on your computer. For example, if you want to download every new version of the human genome, you can do that with BioMAJ: it checks whether a newer version of the genome is available and downloads it automatically. Maybe I should remove the link, because I just want information about the processing people apply to their databanks, not really to talk about BioMAJ.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Chloe Riou40
1

Umm -- that's not a good thing man. I don't want my reference genome updated in the middle of an analysis :-/ Gosh, on some projects I'm involved in we're using software and data that is over 5 years old. Deliberately.

ADD REPLYlink written 2.6 years ago by John12k
3

More precisely, with BioMAJ you choose which version you want to use, and you can deliberately keep the old one (it will "publish" the new version only if you want it to).
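
One way to picture this is a directory per release plus a `current` symlink that moves only when you publish. The sketch below is an assumption for illustration, not BioMAJ's actual implementation; the `publish` function and the directory layout are hypothetical, and it assumes a POSIX filesystem with symlink support.

```python
import os


def publish(bank_dir, version):
    """Point bank_dir/current at bank_dir/<version>; older releases stay on disk."""
    target = os.path.join(bank_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    link = os.path.join(bank_dir, "current")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    # Create the new link under a temporary name, then rename it into place
    # so readers never see a missing "current".
    os.symlink(version, tmp)
    os.replace(tmp, link)
    return link
```

An analysis that must stay on an old release can simply use the versioned directory path instead of `current`.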

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Chloe Riou40
1

Ah, ok that's pretty convenient :)

ADD REPLYlink written 2.6 years ago by John12k

How do you process your reference databank?

I'm not sure if I understood this correctly, but a reference genome is indexed for alignment tools such as bwa/tophat2/hisat2/STAR..., so that's a 'useful' processing step.
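
For reference, the indexing step for these aligners can be sketched as a lookup of command lines. The commands follow each tool's usual invocation as far as I know, but check the tool's own help before relying on them; `index_command` and the output prefix are hypothetical, and nothing is executed here.

```python
def index_command(tool, fasta, prefix):
    """Return the argv list that would build an index for `tool` (not executed)."""
    commands = {
        "bwa": ["bwa", "index", "-p", prefix, fasta],
        "hisat2": ["hisat2-build", fasta, prefix],
        # TopHat2 aligns against a Bowtie2 index.
        "tophat2": ["bowtie2-build", fasta, prefix],
        # For STAR the "prefix" is an output directory.
        "star": ["STAR", "--runMode", "genomeGenerate",
                 "--genomeDir", prefix, "--genomeFastaFiles", fasta],
    }
    if tool not in commands:
        raise ValueError("unknown tool: " + tool)
    return commands[tool]
```

The returned list could be passed to `subprocess.run` once per reference release.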

ADD REPLYlink written 2.6 years ago by WouterDeCoster40k

Yes, this is one of the possible answers. Is that what you do with reference data? Do you do other types of processing? In which domain do you work (genomics, metabolomics, transcriptomics, proteomics, etc.)?

ADD REPLYlink written 2.6 years ago by Chloe Riou40