Question

Another infer_experiment.py interpretation question

2

Entering edit mode

3.3 years ago

schmittelyn ▴ 20

Hello there,

So I was given a bacterial RNA-Seq experiment to analyze and after generating alignments I wanted to know if the reads were stranded or not. So I ran infer_experiment.py on each BAM file which returned the following results:

This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.1704
Fraction of reads explained by "1+-,1-+,2++,2--": 0.8296

This is PairEnd Data
Fraction of reads failed to determine: 0.0030
Fraction of reads explained by "1++,1--,2+-,2-+": 0.3413
Fraction of reads explained by "1+-,1-+,2++,2--": 0.6556
Unknown Data type

This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2744
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7256

This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2090
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7910
Unknown Data type

This is PairEnd Data
Fraction of reads failed to determine: 0.0015
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2698
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7288
Unknown Data type

This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.2101
Fraction of reads explained by "1+-,1-+,2++,2--": 0.7899

Clearly, this does not answer my question. I've read the many posts on here and elsewhere about interpreting infer_experiment results, but this appears to be a unique case. Do you think aberrant antisense transcription played a part?

My main questions are, (1) what might have caused this, and (2) is the data usable for downstream analysis (differential expression)?

Many thanks for your help, it is much appreciated.

Lyn

UPDATE

It's supposed to be STRANDED. So apparently that's not completely the case. Any advice before moving forward?

RNA-Seq sequencing • 2.5k views

ADD COMMENT • link updated 17 months ago by GenoMax 141k • written 3.3 years ago by schmittelyn ▴ 20

0

Entering edit mode

Are you not able to find out which kit was used to generate these libraries? Have you visually examined the alignments in a genome browser? That can give you an idea as well.

ADD REPLY • link 3.3 years ago by GenoMax 141k

0

Entering edit mode

Thanks GenoMax. Unfortunately I don't have any information about the kit that was used (but I'm working on it). I've just looked at the alignments in IGV and it appears to be an unstranded library:

https://imgur.com/2lLcNqN

But does the infer_experiment output indicate a bias towards fr-firststrand - could this be a problem when counting features? Or should I just treat it as unstranded and move on?

Thanks!

ADD REPLY • link 3.3 years ago by schmittelyn ▴ 20

0

Entering edit mode

Are you sure you are using the correct reference? Double check that. Perhaps the strain you have has genomic rearrangements (and you may be using standard reference from NCBI?) Data does look stranded was opinion of a fellow moderator.

ADD REPLY • link 3.3 years ago by GenoMax 141k

0

Entering edit mode

Thanks again. I've tried now with an NCBI, Ensembl, and custom reference and infer_experiment gives the same results. I've actually got PacBio data from the same samples used for RNA-Seq and de novo assembly of each gives a single contig that is identical to the reference. So at this point do you think that it could be a poor annotation for this organism (GTF from both NCBI and Ensembl) or a failed library prep? And is this going to pose a problem? Thaaanks! :-)

ADD REPLY • link 3.3 years ago by schmittelyn ▴ 20

0

Entering edit mode

Poor annotation might be the reason, Do you know from where the annotation is coming?

ADD REPLY • link 3.1 years ago by Juke34 8.5k

0

Entering edit mode

Hi Lyn, it's a while ago now (over a year since you posted!), but did you find a resolution to this issue offline? I have the same, I expect it is quite common. Allegedly reverse stranded library that has ambiguous stranding. Of course I took a look with IGV and, well it just shows the same, but in subjective terms rather than a specific number. Also I assumed perhaps an error at the library concatenation phase? (split libraries) so re-downloaded raw data and tried again. Same. Updated the annotation genome. No effect. Quality all looks pretty good, so I've no clue why really.

ADD REPLY • link 18 months ago by rbj ▴ 40

0

Entering edit mode

If you have the sequence of the transcriptome then also try salmon. You can run it with -l A option and it will auto-detect the library type. This all assumes that your annotation/transcriptome is of good quality.

Also I assumed perhaps an error at the library concatenation phase?

What does that mean? What did you do? Are you using public SRA data? If so let us know what accession number?

ADD REPLY • link 18 months ago by GenoMax 141k

0

Entering edit mode

Thanks for the speedy reply; the “concatenation phase”: When you sequence to great depth the resulting reads are often split between FASTQ files. Sample-1-file-1-R1, sample-1-file-2-R1, sample-1-file-1-R2 et seq…. Therefore the standard way to kick off analysis (after adapter trimming) is to concatenate the R1 file-1s and file-2s together etc.
I have not used salmon on these data, although perhaps i should install it! The problem, however (my interpretation of the OP and my own similar issue) is not really to learn the stranding, but why it is not what it ought to be? I thought the error/anomalies were likely to be coming in at the concatenation phase, but that turned out to be a red herring, with further investigation I can see that the problem in my case is that some pairs are RF and others are reporting in stranded. Specifically my file-1R1/R2 mapped and then inferred (or inspected with IGV) gives 99% RF… but the file-2R1/R2 give you 50:50 i.e., unstranded. Then once you concatenate the lot together of course it gives you somewhere between 50% to 99% RF. To me this now smacks of an error in the sequencing process? No we have not uploaded the data publically yet sorry…..

ADD REPLY • link 18 months ago by rbj ▴ 40

1

Entering edit mode

Ah so this is your own data and is still unpublished. Yes it is fine to concatenate lane specific R1/R2 files together as long as the order of the pieces is kept exactly the same for R1 and R2.

but why it is not what it ought to be?

There may be a failure of the kit/experiment. If you are sure about your analysis then you could start working backwards. If sequencing metrics look fine then check with library kit manufacturer. If nothing conclusive turns up you may need to bite the bullet and re-do the experiment.

Give salmon a try with the individual file pieces first.

ADD REPLY • link 18 months ago by GenoMax 141k

0

Entering edit mode

Thanks, yes installed and run Salmon… will take me a while to read the manual to see what I have though. Thanks… RBJ

ADD REPLY • link 18 months ago by rbj ▴ 40

1

Entering edit mode

In my case I finally solved the issue. As suggested (thank you) I ran Salmon on all the individual sample R1/R2 pairs. It turns out I was misunderstanding the Illumina output. Incredibly, I thought the sample1_1 and sample1_2 were just split files. It turns out (facepalm) that our lab had actually split sample 1 into two technical replicates and prepared two libraries per sample (and I missed the email telling me)!! Therefore all that had happened was that one of the library preps produced a high quality library and the other was dismal. I had no business concatenating such fastqs. I was just supposed to have picked whichever was higher quality. Thanks for the help, if nothing else I have gotten reasonably familiar with Salmon now and I kinda like it!! Given our research question it seems great.

ADD REPLY • link 17 months ago by rbj ▴ 40

2

Entering edit mode

Devil is in the details and they are important as turned out in this case. Glad you found an explanation.

ADD REPLY • link 17 months ago by GenoMax 141k

score 2 · Answer 1 · 2021-03-31

2

Entering edit mode

3.1 years ago

DareDevil ★ 4.3k

Ironically the output of the tool supposed to help you decide only adds to the confusion.
Here is a better way to do it, that does not require running tools, instead asks you to understand your own data.
Go to the documentation for GUESSmyLT you don't have to run the tool, instead study and understand what the possible orientations are as described in the docs:
https://github.com/NBISweden/GUESSmyLT
Now open up your BAM and transcript GTG files in IGV and select "view as pairs" options (right-click the panel). Now all you need is to look at your data relative to a gene and visually evaluate which case do you have. The patterns are usually absolutely evident at first glance.

courtesy

ADD COMMENT • link 18 months ago by DareDevil ★ 4.3k

0

Entering edit mode

When you quote someome verbatim, you need to make that clear by using quotation marks or a blockquote liks so:

this text is quoted

(see the quote icon above when you are in edit mode)

In addition you need to provide a reference to the original source. What you wrote is a copy-paste from:

Answer: Infering strand specificity of bam files using rseqc?

ADD REPLY • link 3.1 years ago by Istvan Albert 100k