Question: Genomic coordinates from GENCODE annotation file
0
gravatar for seta
9 weeks ago by
seta1.1k
Sweden
seta1.1k wrote:

Hi all,

I’m going to extract all variants within some specific genes from 1K genome project and genomAD, so to get their genomic coordinates, I used the main GTF annotation file for human (version 28, GRCh 38) from GENCODE and made a bed file for the genes of interest. Here is the bed file for one of my genes (ISG15), as you can see, there are multiple records for the same gene. I think this redundancy is related to different transcripts of the same gene, yes is it right? However, I’m confused which start and end positions should be used for variant extraction; could you please help me out?

chr1    1001137 1014541 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001137 1001281 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014541 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014475 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014007 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014478 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001137 1001281 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014004 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014541 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001144 1014435 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001144 1001263 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014435 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014435 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014004 1014007 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1001144 1001263 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1008193 1008279 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014004 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013422 1014540 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013422 1013576 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013573 1013576 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013573 1013576 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014540 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013983 1014475 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014478 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1013422 1013573 ENSG00000187608.8    gene_name ISG15    .   +
chr1    1014475 1014540 ENSG00000187608.8    gene_name ISG15    .   +

Also, for including promoter region and getting variants in this region, please kindly tell me should I add about 200 or 500 bp (which one do you suggest?) to the starting position of genes? Please share me if there is any point for more consideration.

Thanks in advance

ADD COMMENTlink modified 9 weeks ago by Emily_Ensembl16k • written 9 weeks ago by seta1.1k
1
gravatar for PedroBarbosa
9 weeks ago by
PedroBarbosa10
Braga, Portugal
PedroBarbosa10 wrote:

You can look in your 3rd column of the GTF file for gene records and create a bed out of it. Then, you extract the variants in those regions using bcftools/bedtools/etc

awk -v OFS="\t" '$3 == "gene" { print $1,$4-1,$5,".",".",$7 }'  gencode.gtf | bedtools slop -s -r 0 -l 5000 -i stdin -g genome.size.file.txt > genes_with_promotor.bed

bcftools view -Oz -R genes_with_promotor.bed in.vcf > out.vcf.gz

or

bedtools intersect -header -a in.vcf -b genes_with_promotor.bed | bgzip > out.vcf.gz
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by PedroBarbosa10

Thank you. I made the bed file from gtf file as I posted. My problem was multiple records for a single gene, which as I found from your response, I should use all genomic coordinate from all records for getting variants, yes?

ADD REPLYlink written 9 weeks ago by seta1.1k

How did you make that bed file ? There should be only one gene record in the whole gtf pointing for ISG15.

It seems you have all the sub features of the gene in that bed. First line of your file seems to be the whole gene feature, which you would obtain with the command I referred.

Edit: Just saw now the promotor question. Updated the command.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by PedroBarbosa10

Thanks Pedro, yes, I used all sub features of the genes as you mentioned that I corrected with your help. Just two questions, maybe silly one. In your awk command, why you put ".", "." ? and for promoter section, I think using -l 1000 -r 500, please kindly share me your idea about it?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by seta1.1k
1
gravatar for Emily_Ensembl
9 weeks ago by
Emily_Ensembl16k
EMBL-EBI
Emily_Ensembl16k wrote:

It might be easier to just use Ensembl BioMart or the Ensembl REST API to put in your list of gene IDs and find all overlapping variants from 1000 Genomes and gnomAD.

ADD COMMENTlink written 9 weeks ago by Emily_Ensembl16k

Thanks Emily, it sounds easier, however, I tested BioMart with 10 gene names (and also with Entrez ID) and it returned many variants with no information about minor allele, minor allele count and frequency for most of them. Also, I could not find these variants came from 1000 genome (and which population) or genomAD. Could you please kindly advise me in this regard?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by seta1.1k

If you use the variation database rather than the gene database, you can add 1000 Genomes and/or gnomAD as an additional filter to get only those data. There are a lot of variants that don't have frequency data.

ADD REPLYlink written 9 weeks ago by Emily_Ensembl16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1398 users visited in the last hour