Question: Reference sequence and gtf annotation.
juncheng (köln) wrote, 4.9 years ago:

The entries in a GTF annotation file look like this:

chr1    hg19_refFlat    exon    11874    12227    0.000000    +    .    gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1    hg19_refFlat    exon    12613    12721    0.000000    +    .    gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1    hg19_refFlat    exon    13221    14409    0.000000    +    .    gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1    hg19_refFlat    exon    14362    14829    0.000000    -    .    gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg19_refFlat    exon    14970    15038    0.000000    -    .    gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg19_refFlat    exon    15796    15947    0.000000    -    .    gene_id "WASH7P"; transcript_id "WASH7P"; 
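
For readers who want to inspect the strand column programmatically, here is a minimal sketch. It assumes standard tab-delimited GTF with nine columns, where strand is column 7; `parse_gtf_line` is a hypothetical helper written for this illustration, not part of any library.

```python
# Sketch: extract the strand (column 7) and gene_id from one GTF line.
# Assumes the standard nine tab-separated GTF columns.

def parse_gtf_line(line):
    """Return the fields relevant to the strand question (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    # Attributes look like: gene_id "X"; transcript_id "Y";
    attrs = {}
    for item in attributes.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, value = item.split(" ", 1)
        attrs[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,     # "+" or "-"
        "gene_id": attrs.get("gene_id"),
    }


line = ('chr1\thg19_refFlat\texon\t14362\t14829\t0.000000\t-\t.\t'
        'gene_id "WASH7P"; transcript_id "WASH7P";')
record = parse_gtf_line(line)
print(record["gene_id"], record["strand"])
```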

 

My question is about the strand information. Does "+" mean that the reference sequence is the forward strand (and therefore the same strand as the mRNA, coding sequence, or sense strand)? And does "-" mean that the reference sequence is the reverse strand, so that the 'real' gene sequence is its reverse complement?

In other words, is the reference sequence always given on one strand (+?), or is it on the forward or reverse strand depending on the "+" and "-" in the annotation?

I'm interested in this because I want to know whether the SNPs in the VCF file are on the forward or the reverse strand. For example, for a T > C conversion, does this happen on the sense strand or the antisense strand?

Thanks for any help,

Jun

 

rna-seq • 3.1k views
Bert Overduin (Edinburgh Genomics, The University of Edinburgh) wrote, 4.9 years ago:

The + in the GTF file means that the coding strand of the gene (and thus the mRNA and protein sequence) is on the forward / positive / plus strand of the genome, and the - means that the coding strand of the gene is on the reverse / negative / minus strand of the genome.

Variants can be reported on either strand (e.g. variants in dbSNP can be reported on either the forward or the reverse strand, while Ensembl reports all variants on the forward strand). So, always be aware of which strand your variants are reported on!

 


Thanks!

Is the sequence given by the hg19 reference genome always the forward (+) strand, or does it match the coding sequence (mRNA sequence)?

Basically, if I see a T in the reference genome within a CDS, does this mean that the mRNA sequence also has a T at that position (ignoring SNPs in this case), or could it also be an A, depending on the "+" or "-" annotation in the GTF file?

(reply written 4.9 years ago by juncheng)

I'm not sure I understand what you mean. If there is a + behind a gene in the GTF file then that means that the gene / mRNA / protein is on the forward strand of the genome. As the sequence of the genome is normally only given as the forward strand, in this case gene sequence and genome sequence will be identical. If there is a - behind a gene in the GTF file then that means that the gene / mRNA / protein is on the reverse strand of the genome. In this case you have to reverse complement the genome sequence to get the gene sequence. Is that what you meant?
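
The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chromosome is given as a plain forward-strand string, coordinates are 1-based and inclusive as in GTF, and `feature_sequence` and the toy sequence are made up for the example.

```python
# Sketch: get a feature's sequence from the forward-strand genome sequence.
# For "+" features the slice is used as-is; for "-" features it is
# reverse complemented.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(chrom_seq, start, end, strand):
    """Return the feature sequence in its own 5'->3' orientation.

    chrom_seq: forward-strand chromosome sequence (assumption)
    start, end: 1-based inclusive coordinates, as in GTF
    strand: "+" or "-" from the GTF strand column
    """
    forward = chrom_seq[start - 1:end]
    if strand == "+":
        return forward
    if strand == "-":
        return reverse_complement(forward)
    raise ValueError("strand must be '+' or '-'")


toy_chrom = "ATGCCGTA"  # made-up sequence, not real hg19
print(feature_sequence(toy_chrom, 2, 5, "+"))  # TGCC
print(feature_sequence(toy_chrom, 2, 5, "-"))  # GGCA
```

Note that this handles a single exon; for a multi-exon gene on the minus strand, the exons would also need to be joined in reverse genomic order.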

(reply written 4.9 years ago by Bert Overduin)

Thanks! This is exactly the answer I have been looking for everywhere. Thanks again.

(reply written 4.9 years ago by juncheng)
juncheng (köln) wrote, 4.9 years ago:

Or, more simply: could a T > C conversion in a VCF file actually be an A > G in the 'real' gene (coding sequence)?
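
A small sketch of that conversion, assuming the VCF reports REF and ALT on the forward strand of the reference (the usual convention, but worth checking for your own file) and that the gene strand comes from the GTF. The function name `allele_on_gene_strand` is hypothetical.

```python
# Sketch: express a forward-strand allele on the strand of the gene.
# Assumes the input allele is reported on the forward (+) genome strand.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def allele_on_gene_strand(allele, gene_strand):
    """Return the allele as it reads on the gene's coding strand."""
    if gene_strand == "+":
        return allele.upper()
    if gene_strand == "-":
        # Reverse complement; for a single base this is just the complement.
        return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))
    raise ValueError("gene_strand must be '+' or '-'")


ref, alt = "T", "C"  # a T > C variant as written in the VCF
print(allele_on_gene_strand(ref, "-"), ">", allele_on_gene_strand(alt, "-"))  # A > G
print(allele_on_gene_strand(ref, "+"), ">", allele_on_gene_strand(alt, "+"))  # T > C
```

Under these assumptions, a T > C in the VCF reads as A > G on the coding strand of a gene annotated with "-", and stays T > C for a gene annotated with "+".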
