Question: Mapping GRCh38 to Ensembl / UCSC gene, transcript, cDNA and protein IDs and sequences
21 months ago, inambioinfo wrote:

Hello Folks,

I am interested in mapping the latest genome assembly, GRCh38, to other standard databases such as Ensembl, UCSC and RefSeq.

So far I have GRCh38 in hand. I would like to know which files I need from the above databases to map the complete genome to the corresponding gene IDs and sequences, coding (CDS) IDs and sequences, and protein IDs and sequences.

Ultimately I want the following mappings:

1. Chromosome co-ordinates ---> gene ID, gene sequence and gene name

2. Gene ID, gene sequence and gene name ---> CDS ID & sequence / exon ID & sequence / start & end of CDS

3. CDS ID & sequence / exon ID & sequence / start & end of CDS ---> protein ID & sequence

I also want to use dbSNP and COSMIC to identify variations at the level of protein sequence, exon sequence, gene sequence and chromosome co-ordinates.

I have already looked at the information from Ensembl and learned that this is possible with BioMart if I use R or Bioconductor. However, I would prefer to do it manually and program it locally to get the mapping data mentioned above.

Is there a single source of information, such as a GTF file, from which I can draw all of this mapping information? I would also be grateful to hear of any cross-references or correlation among the three databases (Ensembl, UCSC, NCBI) that would help me map gene, CDS and protein in any of the three.

More detailed suggestions would be appreciated. Thanks in advance for your response.

21 months ago, Ben_Ensembl (EMBL-EBI) wrote:

Hello,

As I work within the Ensembl team, I'll answer your question from an Ensembl point of view. You may want to get advice from others on how to link the information together between the different resources.

Although we don't have a web-based tool for this sort of analysis, many of the queries you wish to perform 'manually' can be done using our REST API (rest.ensembl.org).

The particular endpoints that will be relevant for your 3 mappings are in the 'Mappings' and 'Overlap' sections:

For example, genomic co-ordinates to gene ID: http://rest.ensembl.org/documentation/info/overlap_region
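A minimal Python sketch of calling that overlap endpoint might look like the following. The helper names `overlap_url` and `genes_in_region` are made up for illustration, and the response fields used (`id`, `external_name`, `start`, `end`) are what the endpoint documentation describes for gene features, so please check them against the live documentation before relying on them.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # GRCh38 server


def overlap_url(region, species="human", feature="gene"):
    """Build the URL for GET overlap/region/:species/:region (hypothetical helper)."""
    return f"{REST_SERVER}/overlap/region/{species}/{region}?feature={feature}"


def genes_in_region(region, species="human"):
    """Return (gene ID, gene name, start, end) for each gene overlapping the region."""
    request = urllib.request.Request(
        overlap_url(region, species),
        headers={"Content-Type": "application/json"},  # ask for a JSON response
    )
    with urllib.request.urlopen(request) as response:
        features = json.load(response)
    # Field names are assumed from the endpoint documentation.
    return [(f["id"], f.get("external_name"), f["start"], f["end"]) for f in features]
```

Calling `genes_in_region("7:140424943-140624564")` needs network access and should return one tuple per overlapping gene; passing `feature="transcript"` or `feature="cds"` to `overlap_url` would be the natural next step for the other mappings.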

You may want to explore each of the GET or POST endpoints to see which ones suit the query you wish to perform.

Finally, you can download the GTF file (as well as many other files containing dumps from our databases for all species available in Ensembl) from our FTP site: http://www.ensembl.org/info/data/ftp/index.html
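If you prefer to work from the GTF locally, a rough sketch of pulling the ID relationships out of it could look like this. It assumes the usual nine tab-separated GTF columns, with `gene_id`, `transcript_id` and (on CDS lines) `protein_id` in the attribute column; the function names are invented for this example.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "ENSG..."
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Return a dict for one GTF record, or None for comments and malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        return None
    seqname, _source, feature, start, end, _score, strand, _frame, attrs = fields
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        # Repeated keys (e.g. several "tag" entries) keep only the last value.
        "attributes": dict(ATTRIBUTE.findall(attrs)),
    }


def build_id_maps(lines):
    """Collect gene -> transcripts and transcript -> protein ID relationships."""
    gene_to_transcripts = defaultdict(set)
    transcript_to_protein = {}
    for line in lines:
        record = parse_gtf_line(line)
        if record is None:
            continue
        attrs = record["attributes"]
        if "transcript_id" in attrs:
            gene_to_transcripts[attrs["gene_id"]].add(attrs["transcript_id"])
        if record["feature"] == "CDS" and "protein_id" in attrs:
            transcript_to_protein[attrs["transcript_id"]] = attrs["protein_id"]
    return gene_to_transcripts, transcript_to_protein
```

To run it on a downloaded file, something like `build_id_maps(gzip.open(path, "rt"))` should work, since the function only needs an iterable of text lines.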

Best wishes

Ben Ensembl Helpdesk

21 months ago, Ben_Ensembl (EMBL-EBI) wrote:

Hello,

Yes, these files look right for the analysis you wish to perform. With regard to the two GTF files, the README should clear this up for you: ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens//README

.gtf: this is the default file, it should contain the full annotation for all species except human and mouse. For human and mouse, it will contain all annotation on the primary assembly, ie excluding patch and haplotype regions

.chr.gtf: contains only annotation on chromosomes, so toplevel scaffolds are excluded (patch and haplotypes are not included)

For the list of variants, I would suggest using the Variant Effect Predictor: http://www.ensembl.org/info/docs/tools/vep/index.html

The VEP is an online tool that will allow you to retrieve the genomic co-ordinates of each of the variants, along with mappings to genes as well as cDNA and protein sequences.
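The same kind of lookup can also be scripted through the REST API's VEP endpoint (GET vep/:species/id/:id). The sketch below is only an illustration: the helper names are hypothetical, and the JSON field names (`transcript_consequences`, `cdna_start`, `protein_start`, `amino_acids`, `consequence_terms`) are assumed from the REST documentation, so check them against a real response.

```python
import json
import urllib.request


def vep_url(variant_id, species="human"):
    """Build the URL for GET vep/:species/id/:id (hypothetical helper)."""
    return f"https://rest.ensembl.org/vep/{species}/id/{variant_id}"


def fetch_vep(variant_id, species="human"):
    """Fetch VEP results for a variant identifier such as an rsID (needs network)."""
    request = urllib.request.Request(
        vep_url(variant_id, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarise(vep_records):
    """Flatten VEP records into one row per affected transcript."""
    rows = []
    for record in vep_records:
        location = (record.get("seq_region_name"), record.get("start"), record.get("end"))
        for tc in record.get("transcript_consequences", []):
            rows.append({
                "location": location,                      # genomic co-ordinates
                "gene_id": tc.get("gene_id"),
                "transcript_id": tc.get("transcript_id"),
                "cdna_start": tc.get("cdna_start"),        # position in the cDNA
                "protein_start": tc.get("protein_start"),  # position in the protein
                "amino_acids": tc.get("amino_acids"),
                "consequences": tc.get("consequence_terms", []),
            })
    return rows
```

`summarise(fetch_vep(some_rsid))` would then give, for each transcript the variant falls in, the genomic location alongside the cDNA and protein positions where those are reported.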

Best wishes

Ben

21 months ago, inambioinfo wrote:

Hi Ben,

Thanks for your information.

I am curious about a few things and would like to clarify them.

Which GTF file from the Ensembl FTP site should I use for the mapping:

http://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz

or

http://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz

For the mappings I mentioned earlier, I am planning to use the following files:

DNA : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/ ....................... (All Chromosomes.fa)

cDNA : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

CDS : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz

Peptide : http://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz

Variation : http://ftp.ensembl.org/pub/release-87/variation/vcf/homo_sapiens/Homo_sapiens.vcf.gz

Clinical Variation : http://ftp.ensembl.org/pub/release-87/variation/vcf/homo_sapiens/Homo_sapiens_clinically_associated.vcf.gz

If I want to map from genome co-ordinates all the way through to the protein/peptide FASTA sequence, do you think I can use these files to start my work?

I would also like to know how I can correlate an ID from the variation/dbSNP files with the genome, cDNA, CDS or protein, with the exact position.
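One way the cDNA, CDS and peptide FASTA files listed above can be tied back to the GTF is through their header lines. The sketch below assumes Ensembl-style headers of the form `>ID type location gene:... transcript:... description:...`; that layout is an assumption to verify against the downloaded files, and the function names are invented for this example.

```python
import gzip


def parse_ensembl_header(header):
    """Parse one FASTA header into a dict of its key:value fields."""
    tokens = header.lstrip(">").split()
    record = {"id": tokens[0], "type": tokens[1] if len(tokens) > 1 else None}
    for token in tokens[2:]:
        key, sep, value = token.partition(":")
        if key == "description":
            break  # free text follows, so stop parsing here
        if sep:
            record[key] = value
    return record


def iter_headers(path):
    """Yield parsed headers from a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield parse_ensembl_header(line)
```

With the peptide file, each parsed header would give a protein ID together with the `gene` and `transcript` IDs it belongs to, which can then be joined to the IDs found in the GTF.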

Thanks once again Ben.
