Reference sequence and GTF annotation
8.0 years ago
juncheng ▴ 200

A GTF annotation file looks like this:

chr1    hg19_refFlat    exon    11874    12227    0.000000    +    .    gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1    hg19_refFlat    exon    12613    12721    0.000000    +    .    gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1    hg19_refFlat    exon    13221    14409    0.000000    +    .    gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1    hg19_refFlat    exon    14362    14829    0.000000    -    .    gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    14970    15038    0.000000    -    .    gene_id "WASH7P"; transcript_id "WASH7P";
chr1    hg19_refFlat    exon    15796    15947    0.000000    -    .    gene_id "WASH7P"; transcript_id "WASH7P";


My question is about the strand information. Does "+" mean that the reference sequence is the forward strand (and therefore the same strand as the mRNA, coding sequence, or sense strand)? And does "-" mean that the reference sequence is the reverse strand, so that the 'real' gene sequence is its reverse complement?

In other words, is the reference sequence always given on one strand (+?), or is it on the forward or reverse strand depending on the "+" and "-" in the annotation?

I'm interested in this because I want to know whether the SNPs in the VCF file are on the forward or the reverse strand. For example, for a T > C conversion, does this happen on the sense strand or the antisense strand?

Jun

8.0 years ago
Bert Overduin ★ 3.7k

The + in the GTF file means that the coding strand of the gene (and thus the mRNA sequence that encodes the protein) is on the forward / positive / plus strand of the genome, and the - means that the coding strand of the gene is on the reverse / negative / minus strand of the genome.

Variants can be reported on either strand (e.g. variants in dbSNP can be reported on either the forward or the reverse strand, while Ensembl reports all variants on the forward strand). So, always be aware of which strand your variants are reported on!
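
To check the strand of each gene programmatically, something like the following rough Python sketch could work. It assumes the file is tab-separated with the strand in the seventh column and a gene_id attribute in the ninth, as in the lines quoted in the question; the function name strand_by_gene is made up for illustration.

```python
def strand_by_gene(gtf_lines):
    """Map each gene_id to the strand given in column 7 of its GTF records."""
    strands = {}
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        strand = fields[6]       # seventh column: "+" or "-"
        attributes = fields[8]   # e.g. gene_id "DDX11L1"; transcript_id "DDX11L1";
        gene_id = attributes.split('gene_id "')[1].split('"')[0]
        strands[gene_id] = strand
    return strands


# Two records from the question, rebuilt with tab separators (an assumption).
example = [
    "\t".join(["chr1", "hg19_refFlat", "exon", "11874", "12227", "0.000000",
               "+", ".", 'gene_id "DDX11L1"; transcript_id "DDX11L1";']),
    "\t".join(["chr1", "hg19_refFlat", "exon", "14362", "14829", "0.000000",
               "-", ".", 'gene_id "WASH7P"; transcript_id "WASH7P";']),
]
print(strand_by_gene(example))  # {'DDX11L1': '+', 'WASH7P': '-'}
```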

Thanks!

Is the sequence given by the hg19 reference genome always the forward/+ strand, or is it the same as the coding sequence (mRNA sequence)?

Basically, if I see a T on the reference genome in a CDS, does this mean that the mRNA sequence also has a T at that position (ignoring SNPs in this case), or could it be an A instead, depending on the "+" or "-" annotation in the GTF file?

I'm not sure I understand what you mean. If there is a + behind a gene in the GTF file then that means that the gene / mRNA / protein is on the forward strand of the genome. As the sequence of the genome is normally only given as the forward strand, in this case gene sequence and genome sequence will be identical. If there is a - behind a gene in the GTF file then that means that the gene / mRNA / protein is on the reverse strand of the genome. In this case you have to reverse complement the genome sequence to get the gene sequence. Is that what you meant?
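
As a small illustration of the reverse-complement step, here is a minimal Python sketch. The helper name gene_strand_sequence and the six-base sequence are made up for the example; they are not taken from hg19 or from any library.

```python
# Translation table for complementing DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def gene_strand_sequence(genome_seq, strand):
    """Return the sequence read 5'->3' on the strand given in the GTF."""
    if strand == "+":
        # Gene is on the forward strand: identical to the genome sequence.
        return genome_seq
    if strand == "-":
        # Gene is on the reverse strand: complement each base, then reverse.
        return genome_seq.translate(_COMPLEMENT)[::-1]
    raise ValueError(f"unexpected strand: {strand!r}")


forward = "ATGCCT"  # made-up forward-strand genome sequence
print(gene_strand_sequence(forward, "+"))  # ATGCCT
print(gene_strand_sequence(forward, "-"))  # AGGCAT
```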

Thanks! That is exactly the answer I have been looking for everywhere. Thanks again.

8.0 years ago
juncheng ▴ 200

Or, more simply: could a T > C conversion in a VCF file actually be an A > G in the 'real' gene (coding sequence)?
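
For what it's worth, the VCF format gives REF and ALT on the forward strand of the reference genome, so for a gene annotated "-" the change on the coding strand is the complement of each allele. A minimal sketch for single-base substitutions only; the function name snv_on_gene_strand is hypothetical.

```python
_BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def snv_on_gene_strand(ref, alt, strand):
    """Express a forward-strand single-base change on the gene's own strand."""
    if strand == "+":
        # Gene strand equals the genome forward strand: nothing to change.
        return ref, alt
    if strand == "-":
        # Gene is on the reverse strand: complement both alleles.
        return _BASE_COMPLEMENT[ref], _BASE_COMPLEMENT[alt]
    raise ValueError(f"unexpected strand: {strand!r}")


print(snv_on_gene_strand("T", "C", "+"))  # ('T', 'C')
print(snv_on_gene_strand("T", "C", "-"))  # ('A', 'G')
```

So a T > C in the VCF at a position inside WASH7P (annotated "-" above) would read as A > G on that gene's coding strand.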