Question: Beginning Bioinformatics Student In Need Of Advice And Clarification.
Caitlin (90) wrote 7.1 years ago:

Hi all.

I am a beginning bioinformatics student enrolled in the one bioinformatics course my community college offers. The pace of the course is relaxed. It includes an extremely fundamental series of discussions of Perl (a language I am very experienced with) in the form of very simple programming assignments which even a programming neophyte could probably complete within 15-20 minutes, basic searching and retrieval from GenBank, multiple sequence alignment with various software tools (e.g., Clustal Omega and MUSCLE), and an introduction to BLAST.

Since I am very interested in the field of bioinformatics, I felt compelled to ask for clarification regarding several fundamental topics that are, unfortunately, not addressed in the course syllabus. Apologies if my questions are overly simplistic:

1) If I were to download a complete human genome sequence, what format would it be in? FASTA? Would it be a monolithic FASTA file or 23 files (one per chromosome) in FASTA format?

2) I'm interested in using either Perl or Java to examine various genes. Would locating specific genes be feasible?

3) I have tried in vain to locate public data consisting of a "normal" gene and one from an individual afflicted with cancer, for example a healthy BRCA1 and a copy of BRCA1 with mutations that lead to the development of a neoplasm. I would like to compare them and identify the locations of the mutations, etc. GenBank does not seem to store "mutated" sequence info. Rather, I have only been able to locate BRCA1 and BRCA2 sequence data for various organisms, with no indication of whether the Homo sapiens donor was or was not afflicted with a form of cancer.

If anyone could provide some helpful feedback, I would be very appreciative. Having such a strong interest in the field and no mentor to consult is, as you may imagine, frustrating.

Thanks all.


perl bioinformatics java cancer • 3.8k views
modified 7.1 years ago by Chris Miller (21k) • written 7.1 years ago by Caitlin (90)

"very simple programming assignments which even a programming neophyte could probably complete within 15-20 minutes" I like this statement ;) and I wish that were true for everyone taking such courses, but I have been seeing people post course assignments of this difficulty here (EDIT: I don't mean your specific course/class), trying to get immediate solutions out of Biostar.

modified 7.1 years ago • written 7.1 years ago by Michael Dondrup (47k)

Thanks Michael!


written 7.1 years ago by Caitlin (90)
Emily_Ensembl (21k) wrote 7.1 years ago:

Hi Caitlin

  1. We keep all the Ensembl FASTA files here. If you go into our DNA files for human (here), you'll see that we have not just 25 files (1-22 + X + Y + mitochondria), but each of those unmasked (dna.chromosome), soft repeat-masked (dna_sm.chromosome) and hard repeat-masked (dna_rm.chromosome), plus a complete genome file and loads of patches and haplotypes (see here for more info on patches and haplotypes).

  2. Have a play with the Ensembl Perl API. Here's the tutorial, the documentation and the installation instructions. We also have a REST API that you could try from Java or any other language you like.

  3. We do have the option in Ensembl to search by a disease state. From the Ensembl homepage you can search for a disease, for example breast cancer. You can then get a list of all variants in the genome associated with breast cancer (;end=906;idx=Variation;q=breast%20cancer;species=Homo_sapiens), then click through to find the associated alleles and the genes affected, plus a bunch of other stuff.
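To make the difference between the unmasked, soft-masked and hard-masked files in point 1 concrete, here is a minimal Java sketch that counts lowercase (soft-masked) bases and N bases per FASTA record. It assumes the usual convention that soft-masked repeats are written in lowercase and hard-masked repeats are replaced by N; the class name, method name and toy record are made up for illustration, and a real download would need to be decompressed first.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

class FastaMaskStats {

    /**
     * Per FASTA record, returns {total bases, lowercase (soft-masked) bases, N bases}.
     * Takes a stream of lines so a whole chromosome never has to sit in memory.
     */
    static Map<String, long[]> countMasked(Stream<String> lines) {
        Map<String, long[]> stats = new LinkedHashMap<>();
        long[] current = null;
        Iterator<String> it = lines.iterator();
        while (it.hasNext()) {
            String line = it.next().trim();
            if (line.startsWith(">")) {
                // Header: use the first whitespace-delimited token as the record name.
                String name = line.substring(1).trim().split("\\s+")[0];
                current = new long[3];
                stats.put(name, current);
            } else if (current != null) {
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    current[0]++;
                    if (Character.isLowerCase(c)) current[1]++;
                    if (c == 'N' || c == 'n') current[2]++;
                }
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Toy record (hypothetical); for a real file, pass Files.lines(Path.of(...)) instead.
        Map<String, long[]> stats =
                countMasked(Stream.of(">chrToy demo", "ACGTacgtNN", "acgtACGT"));
        stats.forEach((name, s) ->
                System.out.println(name + " total=" + s[0] + " softMasked=" + s[1] + " N=" + s[2]));
    }
}
```

Run on a dna_sm file this should report a large lowercase count, and on a dna_rm file a large N count, which is a quick way to check which flavour you downloaded.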

You may have guessed - I work for Ensembl, but just because I'm biased, it doesn't mean our database/website isn't awesome.



Hi Emily.

Thanks for the help. I have no doubt the info you provided will prove beneficial. I have heard of REST but I don't know anything about it (currently). Thankfully, my course project isn't due until early December of this year, so I should have ample time to familiarize myself with the API and the Ensembl resources you provided links to!


written 7.1 years ago by Caitlin (90)

Hi Caitlin

Our REST service lets you programme in another language but still access our API. We do this using simple URLs which generate data in an easily readable (by a computer) format. For example, try this URL:;content-type=application/json

You can see that it gives you a bunch of data in text format (it does this by accessing the Perl API). You can write code in any language you like to first generate that URL then read that data string, extract the bits of data you're interested in and display them in the form you like.

It's still in beta and there are a limited number of endpoints (unlike the Perl API which will allow you to extract every bit of data in our database), but it's still pretty cool.
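The "generate the URL, read the data string, extract the bits you want" workflow could look roughly like this in Java 11+. This is only a sketch under assumptions that are not in the post above: it uses the present-day public address `https://rest.ensembl.org` and its `lookup/symbol` endpoint rather than the beta URL, and the `display_name` field in the usage note is what I would expect in the response, so check the actual output. The regex extraction is deliberately crude; a proper JSON library is the better choice for real work.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnsemblRestSketch {

    // Assumption: current public base address, not the beta service discussed in this thread.
    static final String BASE = "https://rest.ensembl.org";

    /** Step 1: generate the URL for a gene-symbol lookup. */
    static String lookupUrl(String species, String symbol) {
        // URLEncoder is form-style encoding; adequate for plain species names and gene symbols.
        return BASE + "/lookup/symbol/"
                + URLEncoder.encode(species, StandardCharsets.UTF_8) + "/"
                + URLEncoder.encode(symbol, StandardCharsets.UTF_8)
                + "?content-type=application/json";
    }

    /** Step 2: read the data string returned by the service. */
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("HTTP " + response.statusCode());
        }
        return response.body();
    }

    /** Step 3: pull one string-valued field out of the JSON text (null if absent). */
    static String extractField(String json, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        String json = fetch(lookupUrl("homo_sapiens", "BRCA1"));
        System.out.println(json);
    }
}
```

Once `main` prints the raw JSON, something like `extractField(json, "display_name")` is how you would pick out a single value to display.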


written 7.1 years ago by Emily_Ensembl (21k)
Pierre Lindenbaum (130k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 7.1 years ago:

1) both

2) Yes, in whatever language you use. Search for 'BioMart' or 'UCSC MySQL'.

3) Start from PubMed and follow the links to the related sequences, but I'm afraid there is no way to tell whether the patient was affected or not.

modified 7.1 years ago • written 7.1 years ago by Pierre Lindenbaum (130k)

You can also use the COSMIC database, which catalogs somatic variants in cancer. This may be of use for your cancer-related question. For familial cancers you will find at least some of the possible germline mutations in OMIM.

written 7.1 years ago by DG (7.1k)

Thank you very much for the help, Dr. Lindenbaum!

written 7.1 years ago by Caitlin (90)

Just a quick comment: you can also get human genes here. If you click "get output", the default setting is human genes. Or, if you want a fun Perl exercise, you can parse the genes out of this file ... look up the GTF format to get a handle on it.
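For anyone who would rather try the parsing exercise in Java than Perl, here is a minimal sketch. It assumes a GTF with the standard nine tab-separated columns, rows whose feature column is `gene`, and a `gene_name "..."` attribute in the last column; not every GTF has gene rows or that attribute, so adjust to the file you actually download. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GtfGeneSketch {

    // Attribute column looks like: gene_id "X"; gene_name "Y"; ...
    private static final Pattern GENE_NAME = Pattern.compile("gene_name \"([^\"]+)\"");

    /**
     * Returns "name<TAB>chrom<TAB>start<TAB>end<TAB>strand" for a gene row,
     * or null for comments, blank lines and any other feature type.
     */
    static String parseGeneLine(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        // GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
        String[] f = line.split("\t");
        if (f.length < 9 || !f[2].equals("gene")) {
            return null;
        }
        Matcher m = GENE_NAME.matcher(f[8]);
        String name = m.find() ? m.group(1) : "(unnamed)";
        return String.join("\t", name, f[0], f[3], f[4], f[6]);
    }

    public static void main(String[] args) {
        // Toy row with made-up coordinates, not real annotation.
        String row = "17\tensembl\tgene\t100\t200\t.\t-\t.\tgene_id \"G1\"; gene_name \"BRCA1\";";
        System.out.println(parseGeneLine(row));
    }
}
```

Feeding every line of the file through `parseGeneLine` and keeping the non-null results gives a simple gene table you can search by name.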

written 7.1 years ago by KCC (4.0k)
Michael Dondrup (47k), Bergen, Norway, wrote 7.1 years ago:

Hi, welcome to BioStar.

With respect to 3): as far as I remember, the mutations in the BRCA genes were part of a partially invalidated Myriad patent; see "The Myriad ruling - What do gene patents now mean to bioinformatics?". I haven't checked, but the sequences of the variants should be in the patent application. You could detect the variants described in their genetic test as long as you do not generate cDNA (covered by the patent) but search in genomic DNA sequences. Myriad's opponent in that case had developed and offered such a genetic test, so the data for the causal variants should exist, and searching the patent archives might reveal them.


Thanks Michael.

I didn't know there was a patent issue, but I will certainly check that link out.

written 7.1 years ago by Caitlin (90)
Chris Miller (21k), Washington University in St. Louis, MO, wrote 7.1 years ago:

You can find all sorts of mutations from cancer on the TCGA Data Portal. They won't come as FASTA files, but the data will give the exact coordinates of the base change(s). If you needed, for some reason, to introduce them into your sequence, it would be easy enough to do with a little script (and maybe a good exercise for someone brand-new to bioinformatics). Best of luck.
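A minimal Java sketch of that "little script", assuming single-base substitutions given as a 1-based position relative to the start of your sequence (a genomic coordinate would first need the sequence's start offset subtracted). The names and the toy sequence are made up. It checks the reference base before substituting, which catches off-by-one and wrong-strand mistakes early, and includes a helper to list where two equal-length sequences differ.

```java
import java.util.ArrayList;
import java.util.List;

class ApplySnvSketch {

    /** Returns a copy of seq with the base at 1-based position pos1 changed from ref to alt. */
    static String applySnv(String seq, int pos1, char ref, char alt) {
        if (pos1 < 1 || pos1 > seq.length()) {
            throw new IllegalArgumentException("Position " + pos1 + " is outside the sequence");
        }
        char found = seq.charAt(pos1 - 1);
        // Case-insensitive check so soft-masked (lowercase) sequence still matches.
        if (Character.toUpperCase(found) != Character.toUpperCase(ref)) {
            throw new IllegalArgumentException(
                    "Expected " + ref + " at position " + pos1 + " but found " + found);
        }
        StringBuilder sb = new StringBuilder(seq);
        sb.setCharAt(pos1 - 1, alt);
        return sb.toString();
    }

    /** Lists the 1-based positions at which two equal-length sequences differ. */
    static List<Integer> diffPositions(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Sequences must have the same length");
        }
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                positions.add(i + 1);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String normal = "ACGTAC";                       // toy sequence, not real data
        String mutated = applySnv(normal, 3, 'G', 'T'); // G>T at position 3
        System.out.println(mutated + " differs at " + diffPositions(normal, mutated));
    }
}
```

Insertions and deletions shift every downstream coordinate, so they need more care than this position-for-position comparison; an aligner is the right tool once lengths differ.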


Thanks Chris. I will definitely check that site out.

written 7.1 years ago by Caitlin (90)