Question

Strand specific quantification with strand-splitted BAM files

0

Entering edit mode

2.7 years ago

compuTE ▴ 140

Hello!

I am on my way to understanding more deeply SAM flags, and in one of my tests I bumped into a question that I'm having a hard time finding an answer to it.

I had a BAM file with all my alignments. I wanted to split it by the strand the alignments were mapping to, so I followed the script in this tutorial , I'm using paired end data.

Now I have a reverse and a forward BAM file. I'm interested on a locus (chr3:85289931-85295940) which has a forward stranded L1PA2 in the human genome. Now, if I use featureCounts with -s 0 and quantify both BAM files (forward and reverse) in this feature, I get let's say 1,000 and 10 respectively.

I don't understand why I get 10 reads from my reverse BAM file as the L1PA2 is forward stranded. There is nothing else in this region (I also checked other regions just in case this was a special case, but no - this is not a unique scenario).

Here are some lines from the reverse strand SAM file version of my BAM (some of the reads escaping the filtering of the strands?):

A00681:387:HCVJTDRXY:1:2245:2419:34898 99 chr3 85295832 255 139M = 85295832 139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF:F:::FFFFFF:F,FFF,F:FFFFFF:,FFFF:F,FF NH:i:1 HI:i:1 AS:i:274 nM:i:1 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,1,139,85295832,85295970 MC:Z:139M A00681:387:HCVJTDRXY:1:2245:2709:35055 99 chr3 85295832 255 139M = 85295832 139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFF:FFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:276 nM:i:0 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,1,139,85295832,85295970 MC:Z:139M A00681:387:HCVJTDRXY:1:2245:2419:34898 147 chr3 85295832 255 139M = 85295832 -139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTACTAAACCT FFF,FFFFFFFFF:FFFFF::,,FFF,FF,F,F,F:FFFFFFFFFFFFFFFFFFFF,FF:FFFF:F:,FFFF,:FFF:FFFFFF:FFFF,:FFFFFFF,FFFFF:FFFFFFFFFF,,FFFFFFFFFFFFFF,FFF,FFF NH:i:1 HI:i:1 AS:i:274 nM:i:1 NM:i:1 MD:Z:131G7 jM:B:c,-1 jI:B:i,-1 rB:B:i,141,279,85295832,85295970 MC:Z:139M A00681:387:HCVJTDRXY:1:2245:2709:35055 147 chr3 85295832 255 139M = 85295832 -139 ACGAGTTAGTGGGTGCAACACACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACAAGTACCCTAAAACTTAAAGTATAATAAAAAAAAGAAAAAAAGAAGAAGAAAAAAAAAGTAGTAAACCT FFFFFFFFFFF:F,FFFFFFFFFFFF:FF:FF:FFFFFF:FFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFF,FF,FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:276 nM:i:0 NM:i:0 MD:Z:139 jM:B:c,-1 jI:B:i,-1 rB:B:i,141,279,85295832,85295970 MC:Z:139M

Any idea what could be going on?

I checked and my flags for the reverse strand BAM (99 and 147) and I think it makes sense..

99:

read paired (0x1)
read mapped in proper pair (0x2)
mate reverse strand (0x20)
first in pair (0x40)

147:

read paired (0x1)
read mapped in proper pair (0x2)
read reverse strand (0x10)
second in pair (0x80)

featureCounts SAM strand RNAseq • 2.0k views

ADD COMMENT • link updated 2.7 years ago by Istvan Albert 100k • written 2.7 years ago by compuTE ▴ 140

0

Entering edit mode

I am not sure I follow the question, you should be getting different counts, after all you separated the data by strand. What is unexpected about getting different counts?

you don't need to separate the BAM files as Featurecounts can handle the proper stranded counting in the first place.

In this case you should be getting the same numbers using -s 1 and -s 2

ADD REPLY • link 2.7 years ago by Istvan Albert 100k

0

Entering edit mode

Hi! Thank you so much for your answer! I really appreciate the tutorial you did about filtering alignments by strand :-) Yes, I understand I am expecting different counts. I just realised my question had an important typo (-s 0 --> -s 2), it's now corrected, sorry. The question is why don't I get zero counts from the reverse transcription BAM file if the feature I am quantifying is on the forward strand? (when forcing strandness with -s 2) I would assume if I am isolating the alignments from each transcriptional direction that none of the reads from the reverse BAM file should be quantified in a forward stranded feature..

Thank you! I know featurecounts can handle stranded counting - The reason I am doing it this way is that I have stranded and unstranded data of the same samples, and I want to quantify how much "noise" (more like, antisense transcription relative to the features I am interested in) I would be looking at if I would only have the unstranded data.

If I count with featureCounts (-s 2) I would get the quantification in the correct strand the features are in, but in my particular case I want to get numbers of the amount of reads in antisense to the features as well. That's why I filter first, although I guess I could do -s 1 and then -s 2 and should give me the same information, right? :-)

ADD REPLY • link 2.7 years ago by compuTE ▴ 140

score 1 · Answer 1 · 2021-07-30

1

Entering edit mode

2.7 years ago

benformatics 3.9k

This is normal, if you are analyzing stranded data just ignore the reads going in the incorrect direction. Could be caused but a multitude of things but not super important unless it's a huge chunk of your reads. The reason the filtering has this side product is that splitting the BAM by the strand is based on the orientation of the read and not the feature that is present on the genome.

ADD COMMENT • link 2.7 years ago by benformatics 3.9k

0

Entering edit mode

Hi! Thank you for the answer, it's good to hear this is normal :-).

I think I understand what you are saying, but I believe my question still holds. Please correct me if I'm wrong but from the filtering, I think I end up with:

One file containing all alignments to the reverse strand - Containing (but not exclusively containing) all reads mapping to features annotated in the negative strand and no reads coming from the forward strand
Another one with all alignments to the forward strand - Containing (but not exclusively containing) all reads mapping to features annotated in the positive strand and no reads coming from the reverse strand.

If I am correct in this, and as featureCounts quantifies over the features I give in the input GTF (which contains the strand of the features) - shouldn't it know from the flags in each of the BAM files (and the strand of the feature as well as the -s 2 parameter) that it should give me zero counts on a feature annotated in the positive strand if I am quantifying from the BAM file containing the alignments to the reverse strand? (and viceversa?)

I'm sorry if I just didn't catch well what you said at the very last bit!

ADD REPLY • link 2.7 years ago by compuTE ▴ 140

1

Entering edit mode

this is a common situation where one tries to understand exactly how certain RNA-Seq counts are computed, and it turns out that it is often surprisingly convoluted

for example, when a read is paired the other read in pair will be on the opposite strand, what does featureCounts do here? are you running it in paired mode? I would investigate that.

What does IGV show? Do you see reads or fragments overlapping?

ADD REPLY • link 2.7 years ago by Istvan Albert 100k

1

Entering edit mode

Couldn't have said it better myself. This is convoluted haha

FeatureCounts by default only takes in consideration properly paired reads and each pair is counted as one. I put in IGV the bigwig files of each BAM file and looking at the feature I was talking about, I do see a lot of reads in the forward strand, and a little in the reverse.

Although, I shouldn't be getting any reads from the reverse BAM file counted by featureCounts with the GTF specifying that the feature is in the forward strand.

So I looked again what was going on with featureCounts and finally found the silly error: I was missing -p to tell featureCounts this is paired end data. I assumed it would be sufficient from the BAM flags, but no. I tried several things, I'll make a comment with my observations.

Thank you both so much for pointing me in the right direction!

ADD REPLY • link 2.7 years ago by compuTE ▴ 140

score 1 · Answer 2 · 2021-07-30

So, I was missing -p when running featureCounts. Silly me.

I also said a couple of errors in this question (I thought I made a typo which was not a typo - I now corrected the original post). Sorry about that. But regardless, I think it's valuable to make note on the expected behaviour using different parameters if you for some reason filter the BAM files by strands before quantifying (as I did).

But I did some testing with the region I specified before (chr3:85289931-85295940), so for any other lost soul that comes across this question:

I was looking at an L1PA2 that I was interested at in those coordinates. This element is forward stranded and it is of course specified it as such in the GTF I'm using as an input for featureCounts.

When running featureCounts without -p, featureCounts quantifies both ends of the paired-end mates as an individual reads. So a BAM file with forward reads only and specifying -s 0 will count all reads (both mates in a pair) in the region without forcing strandness. In one of my samples and for this particular region I got 621 reads.

Using the same arguments (-s 0 and no -p) but a BAM with alignments to the reverse strand I got 14.

Using the forward BAM -s 0 and -p I got 325. Reverse BAM -s 0 and -p got me 7. So here it picks up that my reads are paired-end and a pair is counted as one fragment. This is what I was originally looking to do - To count as if the data was unstranded, but keeping track in which direction the transcription is coming from (as I am quantifying the two strands separately).

Using just -s 2 in either forward or reverse BAM files got me zero counts (I guess this is expected but don't really understand it to the point that I can explain it, sorry).

And finally the correct way, or the way that would give you the same results as if you would have just one BAM file with both strands, using -s 2 and -p . Where the forward strand BAM file got me 325 fragments and the reverse one zero.

Again thank you for the people that pointed me in the right direction, and I apologize if I confused you with my typo-not-typo.