SAMtools mpileup / BCFtools call

Question

annotating strand orientation

1

6.4 years ago

jan ▴ 170

Is there a way to annotate my VCF with the transcript orientation?

Is there any reference for strand orientation?

vcf sequencing • 4.7k views

ADD COMMENT • link updated 6.4 years ago by prasundutta87 ▴ 660 • written 6.4 years ago by jan ▴ 170

0

Hi,

My VCF has information regarding strand bias, which is not what I'm asking about.

What I need to know is the orientation of the transcripts, i.e., whether each one is on the forward or the reverse strand. For example, the BRCA1 transcript is on the reverse strand. I have looked in my VCF and it is missing the transcript orientation, hence the question of whether there is a reference dataset that I can use to annotate the orientation in my VCF.

The reason is that I want to fetch flanking sequences in the correct orientation and use them as input for another program.

I have now found a reference dataset that I can use.
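For anyone with the same goal, fetching flanks "in the correct orientation" could look roughly like the sketch below. It assumes the forward-strand reference sequence of the contig is already in memory as a string and that the gene's strand ('+' or '-') is known from an annotation. The function names are illustrative and do not belong to any particular tool.

```python
# Complement table for DNA bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanks_in_transcript_orientation(ref_seq, pos, ref_len, flank, strand):
    """Return (upstream, downstream) flanks relative to the transcript.

    ref_seq : forward-strand reference sequence of the contig (assumed input)
    pos     : 1-based VCF POS of the variant
    ref_len : length of the REF allele
    flank   : number of bases to take on each side
    strand  : '+' or '-' as given by the gene annotation
    """
    start = pos - 1                 # 0-based index of the first REF base
    end = start + ref_len           # 0-based index just past the REF allele
    left = ref_seq[max(0, start - flank):start]
    right = ref_seq[end:end + flank]
    if strand == "+":
        return left, right
    if strand == "-":
        # On the minus strand, the genomic right flank is upstream.
        return reverse_complement(right), reverse_complement(left)
    raise ValueError(f"unexpected strand: {strand!r}")
```

For a gene on the reverse strand, such as BRCA1, the two flanks swap places and are both reverse-complemented.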

ADD REPLY • link 6.4 years ago by jan ▴ 170

0

You should have added your response as a comment to my answer, in order to maintain 'fluidity' of the thread/conversation.

If you read through my answer, you'll see that I'm not actually focusing just on strand bias. I provide information on how you can obtain strand orientation. Most likely, you will have both forward- and reverse-oriented reads across BRCA1.

Best of luck

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Please read the answer of prasundutta87 below. It is now clear that you were referring to whether variants were called on the coding or non-coding strand. I don't believe that you can infer this from any typical VCF. I explain this in my comment to prasundutta87.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

score 2 · Answer 1 · 2017-11-14

[Edit 15 November 2017: it became apparent that the question related to the forward and reverse strand, i.e., coding/non-coding, sense/antisense, etc. My initial answer (below) assumed that the question pertained to forward and reverse read orientation.]

Good question.

Strand information is first recorded in BAMs when the reads are aligned to the chosen reference genome: the FLAG field of each alignment carries a bit (0x10) that marks reads mapped to the reverse strand. For further information on filtering forward and reverse reads from BAMs, take a look around Biostars, particularly the thread 'Samtools View: Only Forward Or Reverse Strand'.
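As a rough illustration of how read strand is stored, the sketch below decodes the FLAG column of a SAM record in plain Python. The bit values (0x4 for unmapped, 0x10 for reverse strand) come from the SAM specification; the function names are made up for this example.

```python
FLAG_UNMAPPED = 0x4    # read is unmapped, so it has no strand
FLAG_REVERSE = 0x10    # read is aligned to the reverse strand


def read_strand(flag: int) -> str:
    """Return '+', '-' or '.' (unmapped) for a SAM FLAG value."""
    if flag & FLAG_UNMAPPED:
        return "."
    return "-" if flag & FLAG_REVERSE else "+"


def strand_from_sam_line(line: str) -> str:
    """Extract the strand from one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return read_strand(int(fields[1]))   # FLAG is the second column
```

On the command line, `samtools view -f 16` keeps only reverse-strand reads and `samtools view -F 16` excludes them, which is the same test applied as a filter.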

Regarding the VCF, the information is not always recorded and, if it is, it may be recorded differently based on the variant caller used. Nothing new here as there are no concrete rules in bioinformatics. A good variant caller will take strand biases into account when calling variants, though, even if it may not report forward and reverse read numbers from which the variants are called.

SAMtools mpileup / BCFtools call

If you use samtools mpileup piped into BCFtools call, then strand orientation information is encoded with the DP4 INFO tag:

##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
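A minimal sketch of pulling the four DP4 counts out of an INFO string, assuming the order given in the header description (ref-forward, ref-reverse, alt-forward, alt-reverse). The helper names are illustrative, and the sample INFO value in the usage note is invented.

```python
def parse_dp4(info: str):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev), or None if DP4 is absent."""
    for entry in info.split(";"):
        if entry.startswith("DP4="):
            values = tuple(int(v) for v in entry[len("DP4="):].split(","))
            if len(values) != 4:
                raise ValueError(f"DP4 should hold 4 integers, got {entry!r}")
            return values
    return None


def alt_forward_fraction(dp4):
    """Fraction of ALT-supporting bases on the forward strand, or None if no ALT bases."""
    _, _, alt_fwd, alt_rev = dp4
    total = alt_fwd + alt_rev
    return alt_fwd / total if total else None
```

For example, `parse_dp4("DP=30;DP4=10,8,7,5;MQ=60")` gives `(10, 8, 7, 5)`. A fraction close to 0 or 1 means that the ALT allele is seen almost entirely on one strand.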

GATK

If you use the GATK, using the default settings with HaplotypeCaller, you'll see an INFO tag for strand orientation in the form of an odds ratio to detect strand bias, but there's nothing on exact read numbers:

##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">

To get the actual read counts with the GATK, I believe that you have to run GATK's VariantAnnotator on your VCF and then look for the StrandBiasBySample and StrandAlleleCountsBySample tags.

I cannot comment on other variant callers, but they presumably record strand orientation in some other tags. I checked the current VCF format specification and it does not mention anything specific about strand orientation. It has the following for the INFO tags:

INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):

AA : ancestral allele

AC : allele count in genotypes, for each ALT allele, in the same order as listed

AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes

AN : total number of alleles in called genotypes

BQ : RMS base quality at this position

CIGAR : cigar string describing how to align an alternate allele to the reference allele

DB : dbSNP membership

DP : combined depth across samples, e.g. DP=154

END : end position of the variant described in this record (for use with symbolic alleles)

H2 : membership in hapmap2

H3 : membership in hapmap3

MQ : RMS mapping quality, e.g. MQ=52

MQ0 : Number of MAPQ == 0 reads covering this record

NS : Number of samples with data

SB : strand bias at this position

SOMATIC : indicates that the record is a somatic mutation, for cancer genomics

VALIDATED : validated by follow-up experiment

1000G : membership in 1000 Genomes

The most relevant entry here is SB (strand bias at this position).
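The `<key>=<data>[,data]` layout quoted from the specification can be read with a few lines of Python. This is only a sketch: values are kept as strings, keys without a value (flags such as DB or H2) are stored as `True`, and the function name is made up.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO column into {key: [values]} or {key: True} for flags."""
    parsed = {}
    if info in ("", "."):          # '.' marks an empty INFO column
        return parsed
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, data = entry.partition("=")
        if sep:
            parsed[key] = data.split(",")   # comma-delimited list of values
        else:
            parsed[key] = True              # flag: the key is present without a value
    return parsed
```

With this, `parse_info("NS=3;DP=14;AF=0.5,0.25;DB;H2")["AF"]` returns `['0.5', '0.25']`. Converting values to int or float would require the Type declared in the corresponding header line.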

Kevin

score 0 · Answer 2 · 2017-11-15

0

6.4 years ago

prasundutta87 ▴ 660

There is no such thing as strand orientation in variant calling. By default, variants are reported on the forward strand only. This implies the complementary change at the same position on the reverse strand. Variant callers also have no information about transcript orientation, because they use only the reads that mapped at a particular position, irrespective of which transcript those reads came from.

What you can do is annotate your VCF file using programs such as SnpEff, ANNOVAR or Ensembl VEP, and associate each variant record with a gene and its associated transcript. Note that this is a prediction: if your gene of interest is on the forward strand and another gene lies on the opposite (reverse) strand at the same locus, you cannot be sure which transcript is affected by the variation.

This is my understanding. Anyone, please correct me if I am wrong.
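As a rough sketch of the lookup such annotators perform, the code below reads 'gene' records from GTF-style lines and reports the strand of every gene overlapping a variant position. It assumes the usual nine tab-separated GTF columns with 1-based inclusive coordinates. The gene names and coordinates in the usage note are invented, and the helper names are illustrative.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # '+' or '-'


def _attribute(attributes: str, key: str):
    """Return the value of one key from a GTF attribute column, or None."""
    for part in attributes.split(";"):
        part = part.strip()
        if part.startswith(key + " "):
            return part[len(key):].strip().strip('"')
    return None


def parse_gtf_genes(lines):
    """Collect Gene records from GTF lines whose feature type is 'gene'."""
    genes = []
    for line in lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "gene":
            continue
        name = _attribute(f[8], "gene_name") or _attribute(f[8], "gene_id")
        genes.append(Gene(name, f[0], int(f[3]), int(f[4]), f[6]))
    return genes


def strands_at(genes, chrom, pos):
    """Return [(gene name, strand)] for every gene overlapping chrom:pos."""
    return [(g.name, g.strand) for g in genes
            if g.chrom == chrom and g.start <= pos <= g.end]
```

If a toy GENE_A spans chr1:100-500 on '+' and a toy GENE_B spans chr1:400-900 on '-', a variant at chr1:450 returns both genes with opposite strands. That is exactly the ambiguous case described above. A linear scan is fine for a handful of genes; a whole annotation would call for an interval index.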

ADD COMMENT • link 6.4 years ago by prasundutta87 ▴ 660

0

I think that the terminology we're using is getting somewhat confusing, but perhaps the original question was not clear enough.

Forward/Reverse read orientation

For your typical mate-pair sequencing, forward and reverse reads will align and overlap each other at certain points. This serves to increase confidence in both the alignment and variant calling. If a variant is genuine, it should appear in both forward and reverse reads over the same position.

Forward/reverse strand | coding/non-coding strand | template/non-template | plus/minus.

It is now evident that the original question was about coding and non-coding strands (or 'template'/'non-template', 'plus'/'minus', etc). There are methods currently in use that will sequence both of these and then corroborate variant calling on both, again, in order to increase confidence further.

"Now, this is a prediction because if your gene of interest is present in the forward strand and there is another gene which is present in the opposite (reverse) strand, you cannot be sure which transcript is being affected by the variation."

That's a good point, as there are many locations in the genome where antisense transcripts (or other coding genes) overlap.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k