Question

Beginning Bioinformatics Student In Need Of Advice And Clarification.

6

10.7 years ago

Caitlin ▴ 100

Hi all.

I am a beginning bioinformatics student enrolled in the one bioinformatics course my community college offers. The pace of the course is relaxed. It includes an extremely fundamental series of discussions regarding Perl (a language I am very experienced with) in the form of very simple programming assignments which even a programming neophyte could probably complete within 15-20 minutes, basic searching and retrieval from GenBank, multiple sequence alignment with various software tools (e.g., Clustal Omega and MUSCLE), and an introduction to BLAST.

Since I am very interested in the field of bioinformatics, I felt compelled to ask for clarification regarding several fundamental topics that are, unfortunately, not addressed in the course syllabus. Apologies if my questions are overly simplistic:

1.) If I were to download a complete human genome sequence, what format would it be in? FASTA? Would it be a single monolithic FASTA file or 23 files (one per chromosome) in FASTA format?

2.) I'm interested in using either Perl or Java to examine various genes. Would locating specific genes be feasible?

3.) I have tried in vain to locate public data consisting of a "normal" gene and the same gene from an individual afflicted with cancer. Example: a healthy BRCA1 and a copy of BRCA1 with mutations that lead to the development of a neoplasm. I would like to compare them and identify the location of the mutations, etc. GenBank does not seem to store "mutated" sequence info. Rather, I have only been able to locate BRCA1 and BRCA2 sequence data for various organisms, with no indication of whether the Homo sapiens donor was afflicted with a form of cancer.

If anyone could provide some helpful feedback, I would be very appreciative. Having such a strong interest in the field and no mentor to consult is, as you may imagine, frustrating.

Thanks all.

~Caitlin

perl java cancer • 4.9k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 10.7 years ago by Caitlin ▴ 100

1

"very simple programming assignments which even a programming neophyte could probably complete within 15-20 minutes" I like this statement ;) and I wish that were true for everyone taking such courses, but I have been seeing people post course assignments of this difficulty here (EDIT: not meaning your specific course/class), trying to get immediate solutions out of BioStar.

ADD REPLY • link 10.7 years ago by Michael 54k

0

Thanks Michael!

;)

ADD REPLY • link 10.7 years ago by Caitlin ▴ 100

score 7 · Answer 1 · 2013-08-09

Hi Caitlin

We keep all the Ensembl FASTA files here: ftp://ftp.ensembl.org/pub/current_fasta. If you go into our DNA files for human (here: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/), you'll see that we have not just 25 files (1-22 + X + Y + mitochondria), but each of those unmasked (dna.chromosome), soft repeat-masked (dna_sm.chromosome) and hard repeat-masked (dna_rm.chromosome), plus a complete genome file and loads of patches and haplotypes (see http://www.ensembl.org/Help/Faq?id=291 for more info on patches and haplotypes).
Have a play with the Ensembl Perl API. Here's the tutorial (http://www.ensembl.org/info/docs/api/core/core_tutorial.html), the documentation (http://www.ensembl.org/info/docs/Doxygen/index.html) and the installation instructions (http://www.ensembl.org/info/docs/api/api_installation.html). We also have a REST API that you could try from Java or any other language you like: http://beta.rest.ensembl.org/documentation.
We do have the option in Ensembl to search by a disease state. From the Ensembl homepage (http://www.ensembl.org/index.html) you can search for a disease, for example breast cancer. You can then get a list of all variants in the genome associated with breast cancer (http://www.ensembl.org/Homo_sapiens/Search/Details?db=core;end=906;idx=Variation;q=breast%20cancer;species=Homo_sapiens), click through to find the associated allele and the genes affected, plus a bunch of other stuff.

You may have guessed - I work for Ensembl, but just because I'm biased, it doesn't mean our database/website isn't awesome.

Emily
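As a rough illustration of the file variants described in this answer, here is a minimal sketch of reading a FASTA file and summarising its masking. It is written in Python only for brevity; the same logic ports directly to Perl or Java. The function names (`read_fasta`, `masking_summary`, `open_maybe_gzip`) and the demo records are made up for this example, and the sketch assumes the usual conventions that soft-masked files mark repeats in lowercase and hard-masked files replace them with N.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masking_summary(seq):
    """Count lowercase (soft-masked) bases and N bases in a sequence."""
    soft = sum(1 for base in seq if base.islower())
    n_bases = sum(1 for base in seq if base in "Nn")
    return {"length": len(seq), "soft_masked": soft, "n_bases": n_bases}


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file (helper for real downloads)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Tiny made-up records standing in for a downloaded chromosome file.
demo = io.StringIO(">chrDemo1 toy record\nACGTacgt\nNNNNACGT\n>chrDemo2\nacgtnn\n")
summaries = {
    header.split()[0]: masking_summary(seq) for header, seq in read_fasta(demo)
}
for name, stats in summaries.items():
    print(name, stats)
```

For a real download, something like `with open_maybe_gzip("some_chromosome.fa.gz") as fh:` (a hypothetical local path) could replace the `io.StringIO` demo. Note that this sketch holds each whole record in memory, which is workable for a single human chromosome but worth keeping in mind for the complete genome file.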

score 4 · Answer 2 · 2013-08-09

10.7 years ago

Pierre Lindenbaum 161k

1) both

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

2) yes for whatever-your-language-is. Search biostars.org for 'biomart' or 'ucsc mysql'

3) go from PubMed and find the related sequences, e.g. http://www.ncbi.nlm.nih.gov/nuccore?LinkName=pubmed_nuccore&from_uid=8533757 but I'm afraid there is no way to say if the patient was affected or not.

1

You can also use the COSMIC database (http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/), which catalogs somatic variants in cancer. This may be of use for your cancer-related question. For familial cancers you will find at least some of the possible germline mutations in OMIM.

ADD REPLY • link 10.7 years ago by DG 7.3k

0

Thank you very much for the help, Dr. Lindenbaum!

ADD REPLY • link 10.7 years ago by Caitlin ▴ 100

1

Just a quick comment: you can also get human genes here: http://genome.ucsc.edu/cgi-bin/hgTables. With the default settings, clicking "get output" returns human genes. Or, if you want a fun Perl exercise, you can parse the genes out of this file ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/ ... look up the GTF format to get a handle on the layout.

ADD REPLY • link 10.7 years ago by KCC ★ 4.1k
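The GTF parsing exercise suggested in the comment above could look roughly like the sketch below. It is in Python for brevity, but it is a line-by-line job that maps naturally onto Perl or Java. It assumes the standard nine tab-separated GTF columns and `key "value";` attributes, and it derives each gene's extent from its exon lines rather than relying on dedicated gene lines, which may not be present in every file. The function names and the toy records are invented for illustration.

```python
import io
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute names to values."""
    return dict(ATTR_RE.findall(field))


def gene_extents(handle):
    """Return {gene_id: record} with the span covered by each gene's exons.

    Coordinates stay 1-based and inclusive, as in the GTF itself.
    """
    genes = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        seqname, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        info = parse_attributes(cols[8])
        gene_id = info.get("gene_id")
        if gene_id is None:
            continue
        rec = genes.setdefault(gene_id, {
            "name": info.get("gene_name", gene_id),
            "chrom": seqname,
            "strand": strand,
            "start": start,
            "end": end,
        })
        # Widen the span as further exons of the same gene are seen.
        rec["start"] = min(rec["start"], start)
        rec["end"] = max(rec["end"], end)
    return genes


# Made-up toy records, not real annotation.
attrs_g1 = 'gene_id "G1"; transcript_id "T1"; gene_name "DEMO1";'
attrs_g2 = 'gene_id "G2"; transcript_id "T2"; gene_name "DEMO2";'
demo_rows = [
    ["1", "toy", "exon", "100", "200", ".", "+", ".", attrs_g1],
    ["1", "toy", "CDS", "120", "200", ".", "+", "0", attrs_g1],
    ["1", "toy", "exon", "300", "450", ".", "+", ".", attrs_g1],
    ["2", "toy", "exon", "50", "80", ".", "-", ".", attrs_g2],
]
demo = io.StringIO("\n".join("\t".join(row) for row in demo_rows) + "\n")
genes = gene_extents(demo)
for gene_id, rec in genes.items():
    print(gene_id, rec["name"], rec["chrom"], rec["start"], rec["end"], rec["strand"])
```

Looking up a specific gene is then a matter of filtering the resulting records by name, which also speaks to question 2 in the original post.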

score 2 · Answer 3 · 2013-08-09

Hi, welcome to BioStar.

With respect to 3), the mutations in BRCA genes were, as far as I remember, part of a partially invalid patent of Myriad; see "The Myriad ruling - What do gene patents now mean to bioinformatics?". I haven't checked, but the sequence of the variants should be in the patent application. You could apply detection of the variants described in their genetic test as long as you do not generate cDNA (covered by the patent) but search in genomic DNA sequences. Myriad's opponent in that case had developed and offered such a genetic test, so the data for the causal variants should exist, and searching the patent archives might reveal them.

score 2 · Answer 4 · 2013-08-09

10.7 years ago

Chris Miller 22k

You can find all sorts of mutations from cancer on the TCGA Data Portal. They won't come as FASTA files, but will provide the exact coordinates of the base change(s). If you needed, for some reason, to introduce them into your sequence, it would be easy enough to do with a little script (and maybe a good exercise for someone brand-new to bioinformatics). Best of luck.
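The "little script" mentioned above might be sketched as follows, in Python for brevity (the idea carries over directly to Perl or Java). It assumes the variants are single-base substitutions given as 1-based genomic coordinates on the same strand as the reference string, and that you know the genomic coordinate of the first base of that string. The function names, the sequence and the variant list are all made up for illustration; they are not real BRCA1 data or a real TCGA file layout.

```python
def apply_substitutions(reference, region_start, variants):
    """Return a copy of `reference` with single-base substitutions applied.

    reference    -- sequence string for the region of interest
    region_start -- 1-based genomic coordinate of reference[0]
    variants     -- iterable of (position, ref_base, alt_base) tuples
    """
    bases = list(reference)
    for pos, ref, alt in variants:
        idx = pos - region_start
        if not 0 <= idx < len(bases):
            raise ValueError(f"position {pos} lies outside the region")
        # Guard against coordinate or strand mix-ups before editing.
        if bases[idx].upper() != ref.upper():
            raise ValueError(
                f"expected {ref} at {pos}, found {bases[idx]} in the reference"
            )
        bases[idx] = alt
    return "".join(bases)


def diff_positions(seq_a, seq_b, region_start=1):
    """List (position, base_a, base_b) wherever two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (region_start + i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# Made-up example: a 10-base region starting at coordinate 1001.
reference = "ACGTACGTAC"
variants = [(1003, "G", "T"), (1008, "T", "A")]
mutated = apply_substitutions(reference, 1001, variants)
print(mutated)
print(diff_positions(reference, mutated, region_start=1001))
```

The reference-base check is there because off-by-one and strand errors are easy to make with genomic coordinates. Insertions and deletions are deliberately left out: they shift every downstream coordinate and need more careful bookkeeping than this sketch provides.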