Annotate Illumina SNP file with Human GRCh37 genes file
8.7 years ago
gcastaigne ▴ 10

Hello everybody.

This is my first post, so don't hesitate to tell me if my explanations are not clear enough.

I would like to annotate an Illumina SNP file, and to do so I need to compare it with an annotated human genome file for the GRCh37 build (I don't care about the patch; only the build matters).

To make the comparison efficient, I need several pieces of information from the human genome file.

At a minimum, I need:

• HGNC symbol
• GeneID
• start gene position (bp)
• end gene position (bp)
• chromosomeID

Getting this information is not a real problem in general; I found it in UCSC and Biomart.

But I have a problem with NCBI symbols starting with LOC (e.g. LOC100287633, LOC100128613, etc.).

I compared the NCBI and UCSC data: I can find every LOC symbol in NCBI, but not in UCSC or Biomart.

I know that many LOC symbols are "discontinued" or no longer updated. However, plenty of them are still listed as reviewed in NCBI but cannot be found in Biomart, UCSC, or other databases.

I could download them from NCBI, but their start and end positions (bp) have been updated to GRCh38, and I absolutely need the GRCh37 positions.

Guillaume

GRCh37 SNP annotation gene LOC • 3.0k views

Hi Guillaume

Could you let me know how you output the HGNC symbol from UCSC? I tried to do the same task as you, but I only need the genes known to HGNC. For example, I used track=UCSC Genes and selected "geneSymbol", but the hg19.kgXref.geneSymbol column of the output listed some genes that are not known to HGNC.

I also have trouble annotating intergenic SNPs. For example, SNP rs188746275 should be located between PABPC4L and PCDH18,

but the UCSC tables also list cDNA-based entries, so when I annotated it the SNP was placed between BC032916 and BC031238. Neither BC032916 nor BC031238 is known to HGNC or NCBI.

Many thanks if you could guide me on how to output the HGNC symbol.

Thanks!

Ake
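One possible way to get HGNC-named flanking genes for an intergenic SNP is to drop every record without an HGNC identifier before searching for neighbours. The sketch below assumes the genes of a single chromosome are already loaded as dictionaries with hypothetical keys `start`, `end`, `symbol`, and `hgnc_id`; the symbols and coordinates in the toy data are invented, not real annotations.

```python
def flanking_genes(pos, genes):
    """Locate a 1-based position relative to (start, end, symbol) gene tuples.

    Returns the genes that contain the position, plus the nearest gene
    ending before it ("left") and the nearest gene starting after it ("right").
    A linear scan is used so that overlapping genes are handled correctly.
    """
    inside = [g[2] for g in genes if g[0] <= pos <= g[1]]
    before = [g for g in genes if g[1] < pos]
    after = [g for g in genes if g[0] > pos]
    left = max(before, key=lambda g: g[1])[2] if before else None
    right = min(after, key=lambda g: g[0])[2] if after else None
    return {"inside": inside, "left": left, "right": right}


def hgnc_only(records):
    """Keep only records that carry an HGNC identifier (hypothetical key)."""
    return [(r["start"], r["end"], r["symbol"]) for r in records if r["hgnc_id"]]


# Toy records with invented coordinates; MRNA_X stands in for a cDNA-only entry.
records = [
    {"start": 100, "end": 200, "symbol": "GENE_A", "hgnc_id": "1"},
    {"start": 250, "end": 300, "symbol": "MRNA_X", "hgnc_id": None},
    {"start": 400, "end": 500, "symbol": "GENE_B", "hgnc_id": "2"},
]

all_genes = [(r["start"], r["end"], r["symbol"]) for r in records]
unfiltered = flanking_genes(350, all_genes)
filtered = flanking_genes(350, hgnc_only(records))
print(unfiltered)
print(filtered)
```

With the unfiltered list the left neighbour is the cDNA-style entry, while after filtering it becomes the nearest HGNC-named gene, which is the behaviour asked for above.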

8.7 years ago
Zhaorong ★ 1.4k

Check out the NCBI FTP archive for Annotation release 104, especially the GFF folder.
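As a rough illustration of how the fields requested in the question could be pulled out of such a GFF file, here is a minimal sketch. It assumes the usual nine-column GFF3 layout and that gene records carry `gene`, `Name`, and `Dbxref` attributes (with entries like `GeneID:...` and, for named genes, `HGNC:...`); check the actual file before relying on this. The example line is invented for illustration: its coordinates are not real.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Turn a GFF3 attribute string 'k1=v1;k2=v2' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def parse_gene_line(line):
    """Return a dict for a 'gene' feature line, or None for any other line."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "gene":
        return None
    attrs = parse_attributes(cols[8])
    # Dbxref is assumed to be a comma-separated list of 'database:identifier' pairs.
    xrefs = dict(
        x.split(":", 1) for x in attrs.get("Dbxref", "").split(",") if ":" in x
    )
    return {
        "seqid": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "symbol": attrs.get("gene") or attrs.get("Name"),
        "gene_id": xrefs.get("GeneID"),
        "hgnc_id": xrefs.get("HGNC"),  # None for LOC-style genes without an HGNC entry
    }


# Invented example line (coordinates are placeholders, not real positions).
example = "\t".join([
    "NC_000001.10", "RefSeq", "gene", "1000", "2000", ".", "+", ".",
    "ID=gene0;Name=LOC100287633;Dbxref=GeneID:100287633;gbkey=Gene;gene=LOC100287633",
])
record = parse_gene_line(example)
print(record)
```

Applied to every line of the file, this would give one record per gene, including the LOC symbols, with coordinates in whatever build the annotation release was made on.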


Do you know the difference between the GFF files in this folder? I can see two kinds of file, top_level and scaffolds.


Check the first column of each file and you'll see. :) The "top_level" file has coordinates on assembled chromosomes, i.e. NC_*, while the "scaffolds" file has coordinates on scaffolds (or contigs), e.g. NT_* and NW_*.
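To make the distinction concrete, the first-column accession can be classified by its prefix, and NC_* accessions can be translated into plain chromosome names. This is a small sketch; the function names are made up for illustration, and the mapping assumes the standard human RefSeq numbering (NC_000001 to NC_000022 for autosomes, NC_000023 for X, NC_000024 for Y, NC_012920 for the mitochondrial genome).

```python
import re

# Matches human chromosome accessions such as NC_000001.10 (any version suffix).
_NC_PATTERN = re.compile(r"^NC_0000(\d{2})\.\d+$")


def classify_seqid(seqid):
    """Classify a RefSeq accession by its prefix."""
    prefix = seqid.split("_", 1)[0]
    if prefix == "NC":
        return "chromosome"
    if prefix in ("NT", "NW"):
        return "scaffold"
    return "other"


def chromosome_name(seqid):
    """Map an NC_* accession to '1'..'22', 'X', 'Y' or 'MT'; None otherwise."""
    if seqid.startswith("NC_012920"):
        return "MT"
    match = _NC_PATTERN.match(seqid)
    if not match:
        return None
    number = int(match.group(1))
    if 1 <= number <= 22:
        return str(number)
    return {23: "X", 24: "Y"}.get(number)


for accession in ("NC_000001.10", "NC_000023.10", "NT_167207.1", "NW_003315905.1"):
    print(accession, classify_seqid(accession), chromosome_name(accession))
```

Note that the version suffix after the dot differs between builds, so it is worth confirming that the accessions in the file correspond to GRCh37 before using the coordinates.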


Thank you Zhaorong, I will try to do something with this. You saved me hours of searching! ;)