Reference genomes repository
9.1 years ago
Ana ▴ 20

Hello everyone,

I'm fairly new to this area, so I apologize in advance if the following question is silly xD. I'm trying to build a bioinformatics tool with a feature that retrieves reference genomes for the user. The idea is that the user provides only the reads, and I use some API to find the matching reference genome and then run a job on their behalf.

Given this, I'm a little confused about how to clearly identify the reference genome for a set of reads in the online repositories available. For example, I went to https://www.encodeproject.org/ and used their API, but on closer inspection I think those are just data and not reference genomes. I also found http://www.ensembl.org/info/data/ftp/index.html, but I find it "weird" that there is only one sample for each species, and I'm also not sure which FASTA file I should present to the user and then download.

I would like some suggestions for websites that I can use. Some explanations and clarifications on the topic would also be really appreciated!

Thanks in advance.

reference-genome repository • 2.0k views

Well, that truly is a question. If you intend to identify the reference genome from a set of reads, then you need to align those reads to all available reference genomes. With a lot of reads this is quite a job, especially if you are doing a nucleotide (nc) alignment. Usually, if the reads are long, they are translated into amino acid (AA) sequences and aligned against the nr database. This is how some groups I know of identify species in various ecological and metagenomic studies, but it isn't as straightforward as just aligning them to reference genomes: they use massive pipelines. I've been involved in building some of them, and it is a science in itself. Moreover, the alignment procedure depends on the experiment used to produce the reads.

Also, if you are going to do nucleotide read mapping, you need a genome sequence (DNA). These are usually split by chromosome into smaller files, as you saw in the links you provided. The technical reason is that smaller files are easier to download and manage; once downloaded, they can be concatenated if you need the entire genomic sequence for some reason.

As far as suggestions go, I think it is fine to stick to one repository over the course of an experiment, and Ensembl or UCSC are excellent choices! OK, I just realized your post is really loaded, so I'm going to stop here. Hope I helped a bit.

cheers
mxs
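
The concatenation step mentioned above (joining per-chromosome files into one genome file) could look roughly like the following. This is only a minimal sketch using the Python standard library; the function name `concatenate_fasta` and the file names in the usage note are hypothetical, and it assumes the parts are plain or gzip-compressed FASTA files that you have already downloaded.

```python
import gzip
from pathlib import Path


def concatenate_fasta(parts, combined):
    """Write the given FASTA files, in order, into one uncompressed FASTA file."""
    with open(combined, "wb") as out:
        for part in parts:
            # Assumption: a ".gz" suffix means the file is gzip-compressed.
            opener = gzip.open if str(part).endswith(".gz") else open
            last_byte = b"\n"
            with opener(part, "rb") as src:
                # Copy in 1 MiB chunks so whole chromosomes never sit in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next ">" header starts on its own line.
            if last_byte != b"\n":
                out.write(b"\n")
    return Path(combined)
```

For example, `concatenate_fasta(["chr1.fa.gz", "chr2.fa.gz"], "genome.fa")` would produce a single `genome.fa` containing both records. The order of the input list determines the order of the sequences in the output.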


Thanks for the answers. I probably didn't explain myself well before xD. Anyway, I realized that Ensembl and UCSC have reference genomes that I can download. But one thing I can't understand is this: when a site says "biosamples", "annotations" or "assembly" and then shows some files, what are they exactly? Where can I find more information? (Like I said, I just need reference genomes.) And another thing: how can I use their APIs to retrieve that data in particular?

Thanks :)

9.1 years ago
deanna.church ★ 1.1k

There is an assembly database at NCBI: www.ncbi.nlm.nih.gov/assembly

It stores multiple assemblies per organism, though not all of them are of 'reference' quality.
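
As a rough illustration of querying this database programmatically, here is a minimal Python sketch. It assumes NCBI's E-utilities `esearch` endpoint with `db=assembly` and a JSON response whose IDs sit under `esearchresult` / `idlist`; the function names are hypothetical, and the search term only filters by organism, so the hits will include assemblies that are not of reference quality.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint (assumed here; check the NCBI docs for usage limits).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def assembly_search_url(organism, max_results=20):
    """Build an esearch URL that looks up assemblies for one organism."""
    params = {
        "db": "assembly",
        "term": f'"{organism}"[Organism]',
        "retmode": "json",
        "retmax": max_results,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def search_assembly_ids(organism, max_results=20):
    """Run the search and return the list of assembly record IDs (needs network)."""
    with urlopen(assembly_search_url(organism, max_results)) as response:
        payload = json.load(response)
    # Assumption: the JSON layout is {"esearchresult": {"idlist": [...]}}.
    return payload["esearchresult"]["idlist"]
```

Calling `search_assembly_ids("Homo sapiens")` should return a list of record IDs, which can then be passed to `esummary` to look up each assembly's name, quality category and download location. Narrowing the search to reference-quality assemblies would need an extra filter in the `term` string, which is left out here.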

