Question: How do I get the gene annotation for the latest version (GRCh38)?
line1438 wrote:

I have the gene annotation for all chromosomes and their SNPs for version GRCh37.68, like this:


34554 36081 ENSG00000237613 FAM138A

69091 70008 ENSG00000186092 OR4F5

367640 368634 ENSG00000235249 OR4F29

621059 622053 ENSG00000185097 OR4F16

721320 722513 ENSG00000197049 AL669831.1

860260 879955 ENSG00000187634 SAMD11

879584 894670 ENSG00000188976 NOC2L

895967 901095 ENSG00000187961 KLHL17


The file records all the SNPs on a chromosome and which gene each SNP belongs to.

I want to get the gene annotation for chromosomes 1 to 23 for the latest version (GRCh38), in the same format as above.

What should I do?

Thanks a lot!

gwas gene • 2.2k views
Modified 3.5 years ago by EagleEye • written 3.5 years ago by line1438

It can be obtained from Ensembl BioMart.

Reply written 3.5 years ago by Prasad
EagleEye wrote:

Unzip using:

gunzip -d Homo_sapiens.GRCh38.85.gtf.gz

Convert into table format:

awk 'BEGIN{FS="\t"} $3=="gene" {split($9,a,";"); print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"$7}' Homo_sapiens.GRCh38.85.gtf | sed 's/gene_id "//' | sed 's/ *gene_name "//' | sed 's/"//g' > Homo_sapiens.GRCh38.85_table.txt

The above command converts the GTF into an annotation table like the one below:

ENSG00000223972  DDX11L1    1:11869-14409   +
ENSG00000227232  WASH7P 1:14404-29570   -
ENSG00000278267  MIR6859-1  1:17369-17436   -
ENSG00000243485  MIR1302-2  1:29554-31109   +
ENSG00000237613  FAM138A    1:34554-36081   -
ENSG00000268020  OR4G4P 1:52473-53312   +
ENSG00000240361  OR4G11P    1:62948-63887   +
ENSG00000186092  OR4F5  1:69091-70008   +
ENSG00000238009  RP11-34P13.7   1:89295-133723  -
ENSG00000239945  RP11-34P13.8   1:89551-91105   -
ENSG00000233750  CICP27 1:131025-134836 +
ENSG00000268903  RP11-34P13.15  1:135141-135895 -
ENSG00000269981  RP11-34P13.16  1:137682-137965 -
ENSG00000239906  RP11-34P13.14  1:139790-140339 -
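
For the four-column layout asked for in the question (start, end, Ensembl gene ID, gene name), a minimal sketch along these lines may help. It looks attributes up by key rather than by position, so it does not depend on the attribute order in column 9. The function name gtf_gene_positions, the file sample_genes.gtf and the KI270728.1 record are made up for illustration; the two chromosome 1 genes reuse coordinates shown in this thread, and the other attribute values are placeholders.

```shell
# Build a tiny stand-in for Homo_sapiens.GRCh38.85.gtf (illustrative records only).
{
  printf '%s\n' '#!genome-build GRCh38.p7'
  printf '1\thavana\tgene\t34554\t36081\t.\t-\t.\t%s\n' \
    'gene_id "ENSG00000237613"; gene_version "2"; gene_name "FAM138A"; gene_source "havana"; gene_biotype "lincRNA";'
  printf '1\tensembl_havana\tgene\t69091\t70008\t.\t+\t.\t%s\n' \
    'gene_id "ENSG00000186092"; gene_version "4"; gene_name "OR4F5"; gene_source "ensembl_havana"; gene_biotype "protein_coding";'
  printf '1\tensembl_havana\ttranscript\t69091\t70008\t.\t+\t.\t%s\n' \
    'gene_id "ENSG00000186092"; gene_version "4"; transcript_id "ENST00000335137"; gene_name "OR4F5";'
  # Hypothetical gene on a scaffold, to show that other sequences are filtered out.
  printf 'KI270728.1\tensembl\tgene\t100\t200\t.\t+\t.\t%s\n' \
    'gene_id "ENSG_HYPOTHETICAL"; gene_version "1"; gene_name "HYPO1";'
} > sample_genes.gtf

# usage: gtf_gene_positions FILE.gtf CHROMOSOME
# Prints "start end gene_id gene_name" for every gene record on that chromosome.
gtf_gene_positions() {
  awk -F '\t' -v chr="$2" '
    # Return the value of attribute "key" from a GTF attribute string, or "" if absent.
    function attr(s, key,    n, parts, i, kv) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++) {
        kv = parts[i]
        sub(/^ +/, "", kv)                    # drop the space after each ";"
        if (index(kv, key " \"") == 1) {      # field starts with: key "
          kv = substr(kv, length(key) + 3)    # skip past: key "
          sub(/"$/, "", kv)                   # drop the closing quote
          return kv
        }
      }
      return ""
    }
    $3 == "gene" && $1 == chr {
      print $4, $5, attr($9, "gene_id"), attr($9, "gene_name")
    }
  ' "$1"
}

gtf_gene_positions sample_genes.gtf 1
```

To cover chromosomes 1-22 and X (assuming "chromosome 23" means X), a loop such as `for c in $(seq 1 22) X; do gtf_gene_positions Homo_sapiens.GRCh38.85.gtf "$c" > "chr$c.genes.txt"; done` should write one file per chromosome.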
Written 3.5 years ago by EagleEye

Thank you very much!

Your answer is the best one for me.

Thanks a lot.

Reply written 3.5 years ago by line1438

Good luck :)

Reply written 3.5 years ago by EagleEye

I tried to find the same site myself on ensembl.org, but I can't seem to find it.

Could you tell me what I am doing wrong?

I want to find the site you mentioned:

If you are just looking for ensembl gene annotation,

ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/

These are the steps I took:

  1. Go to http://asia.ensembl.org/index.html

  2. Human GRCh38.p7 http://asia.ensembl.org/Homo_sapiens/Info/Index

Then I looked for the full download page, but I still could not find the site you mentioned.

Reply written 3.5 years ago by line1438

Go to the FTP site.

Reply written 3.5 years ago by Prasad

Thank you!

I now know how to get to the site and download the file.

Reply written 3.5 years ago by line1438

I guess you are looking for this,

ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz

Reply written 3.5 years ago by EagleEye

GRCh38 is the latest genome version now.

If GRCh39 comes out in the future, can I get the GRCh39 gene annotation from the same FTP site? (ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/)

Reply modified 3.5 years ago • written 3.5 years ago by line1438

Yes, this FTP link is updated to the current genome version whenever a new one is released. If GRCh39 is released in the future, the current_gtf link will be updated to the new assembly version.

FYI: keep in mind that the transcript/gene annotation version keeps being updated. Here you got annotation version 85 for GRCh38 (GRCh38.85); the earlier versions run from GRCh38.76 to GRCh38.84. For example, for the GRCh37 assembly there were 18 different Ensembl transcript/gene annotation versions (GRCh37.57-75).
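
Because current_gtf moves with each release, pinning a release number makes a download reproducible. The sketch below only builds a release-specific URL. The helper name ensembl_gtf_url is hypothetical, and the release-NN/gtf/homo_sapiens/ path layout is an assumption based on how the Ensembl FTP site is organized, so check it against the site before relying on it.

```shell
# usage: ensembl_gtf_url RELEASE [ASSEMBLY]
# Prints the assumed release-specific location of the human GTF file.
ensembl_gtf_url() {
  release=$1
  assembly=${2:-GRCh38}   # default assembly; pass GRCh37 for old releases
  printf 'ftp://ftp.ensembl.org/pub/release-%s/gtf/homo_sapiens/Homo_sapiens.%s.%s.gtf.gz\n' \
    "$release" "$assembly" "$release"
}

url=$(ensembl_gtf_url 85)
echo "$url"

# To actually fetch the file (needs network access), something like:
# wget "$url"
```

Recording the release number next to any derived table makes it possible to tell later which annotation version the gene coordinates came from.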

The link below gives an idea of the version history. You can also use the GTF annotation from GENCODE, but its gene IDs are slightly different: each ENSG ID carries an extra version suffix, so ENSGXXXX is written as, for example, ENSGXXXX.2.

http://www.gencodegenes.org/releases/
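
The version suffix matters when GENCODE gene IDs have to be matched against plain Ensembl IDs. A minimal sketch, assuming a tab-separated table with the gene ID in the first column (the helper name strip_ensg_version and the two sample rows are illustrative):

```shell
# Reads a tab-separated table on stdin and removes a trailing ".<digits>"
# from the first column only. IDs without such a suffix pass through unchanged.
strip_ensg_version() {
  awk -F '\t' -v OFS='\t' '{ sub(/\.[0-9]+$/, "", $1); print }'
}

printf 'ENSG00000223972.5\tDDX11L1\nENSG00000227232.5\tWASH7P\n' | strip_ensg_version
```

Stripping the suffix discards the information about which revision of the gene model was used, so it is probably best done only for matching IDs, not in the archived table.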

If you want to convert a GENCODE GTF to a simple table format, download:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz

Unzip it as I described in my earlier post, then run:

cat gencode.v25.annotation.gtf | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[5]"\t"$1":"$4"-"$5"\t"a[3]"\t"$7}' |sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_type "//'| sed 's/gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2];}' > gencode.v25.annotation_annotation.txt
Reply modified 3.5 years ago • written 3.5 years ago by EagleEye

Small correction,

(printf 'Geneid\tGeneSymbol\tChromosome\tStart\tEnd\tClass\tStrand\tLength\n'; zcat gencode.v25.annotation.gtf.gz | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3=="gene") print a[1]"\t"a[4]"\t"$1":"$4"-"$5"\t"a[2]"\t"$7}' | sed 's/gene_id "//' | sed 's/ *gene_type "//' | sed 's/ *gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2]+1;}') > gencode.v25.annotation_annotation.txt
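
Since the position of gene_name and gene_type inside column 9 differs between Ensembl and GENCODE files (and possibly between releases), it may be worth listing the attribute keys of the first gene record before choosing the a[...] indexes. A rough sketch; the helper name list_gene_attributes and the single-record sample file are illustrative, not taken from a real download:

```shell
# One GENCODE-style gene record as a stand-in for gencode.v25.annotation.gtf.
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t%s\n' \
  'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";' \
  > sample_gencode.gtf

# usage: list_gene_attributes FILE.gtf
# Prints "index key" for each attribute of the first gene record, where index
# is the position that split($9, a, ";") would assign to it.
list_gene_attributes() {
  awk -F '\t' '
    $3 == "gene" {
      n = split($9, a, ";")
      for (i = 1; i <= n; i++) {
        sub(/^ +/, "", a[i])          # trim the space after each ";"
        if (a[i] == "") continue      # skip the empty piece after the last ";"
        split(a[i], kv, " ")          # kv[1] is the attribute key
        print i, kv[1]
      }
      exit                            # the first gene record is enough
    }
  ' "$1"
}

list_gene_attributes sample_gencode.gtf
```

For a compressed file, the same awk program could read from `zcat file.gtf.gz` on stdin instead of a file argument.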
Reply modified 3.4 years ago • written 3.4 years ago by EagleEye

Thank you so much. :)

Reply written 3.5 years ago by line1438
EagleEye wrote:

If you are just looking for the Ensembl gene annotation:

ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/

For more downloads,

http://www.ensembl.org/info/data/ftp/index.html

FTP:

ftp://ftp.ensembl.org/pub/

Modified 3.5 years ago • written 3.5 years ago by EagleEye

Excuse me, I don't know how to get the gene annotation file I want from the site you mentioned.

Can you describe in detail how to obtain the gene annotation for chromosomes 1-23?

The format I want is:

34554 36081 ENSG00000237613 FAM138A

69091 70008 ENSG00000186092 OR4F5

367640 368634 ENSG00000235249 OR4F29

621059 622053 ENSG00000185097 OR4F16

721320 722513 ENSG00000197049 AL669831.1

Thanks a lot.

Reply written 3.5 years ago by line1438

I found a file named "Homo_sapiens.GRCh38.85.gtf.gz". Is this the gene annotation you mean?

I also have a question: how do I convert this GTF file to a text file, or to some format in which the positions of all genes are easy to see?

Reply written 3.5 years ago by line1438