Question: Identifying covered genes from nanopore cDNA-seq reads
0
gravatar for Florian Bernard
7 months ago by
Florian Bernard50 wrote:

I am working with long reads generated by Nanopore technology.

I tried to write a small script in python (using pysam) that would allow, for every read, to 'detect' to which gene they actually map to. To do that, I was using the following informations: the chromosome, the strand, and the start and end genomic position of my read.

However, I just came to realize there is a flaw in my design. For a given gene, I will get reads that map to both strands (because of the way the library is prepared and the way the molecule is sequenced). So I cannot use the 'strand' information to detect which gene my read is actually mapping to. And I'm afraid that if I only use the chromosome information and the genomic coordinates, I will end up with 'false positive' in places where I have nested genes...

Would anyone have an idea how I could fix that ? Thanks in advance !

rna-seq alignment gene • 327 views
ADD COMMENTlink modified 7 months ago by WouterDeCoster37k • written 7 months ago by Florian Bernard50
2

This could help you: "nanopore_transcript_abundance.py estimates transcript abundance from reads mapped to a transcriptome. "

ADD REPLYlink written 7 months ago by WouterDeCoster37k

I see, it actually takes all possible genes and then compute the one that match the best. It's a great idea. You can see my post below where I figured it wasn't necessarily a problem to have more than one possible gene for a given read. It's not perfect but if I reduce the list of possible genes from 20K (every genes possible in C. elegans) to just a few, then I can actually carry on with downstream analysis in a more efficient way ! Thanks for your help.

ADD REPLYlink written 7 months ago by Florian Bernard50

What is the actual aim? Especially with long reads, you will see lots of reads overlapping multiple genes, e.g in human NKIRAS1 and RPL15.

ADD REPLYlink written 7 months ago by ATpoint14k

To know which gene my reads correspond to. It is long reads but coming from RNAs, so there shouldn't be a problem with overlapping.. unless with nested genes. if gene A is on the opposite strand of gene B, then by using genomic position I cannot distingues them unless I know the strand.

ADD REPLYlink written 7 months ago by Florian Bernard50

Can you clarify if you are using sequencing cDNA or RNA? With nanopore this terminology starts to get important.

ADD REPLYlink written 7 months ago by WouterDeCoster37k
1

In a perfect world I would only be using direct-RNA sequencing and the problem would be gone : for a given gene, all the reads would map to the 'correct' strand . But I've got some data generated with the classical 1D ligation kit (so depending which strand will actually enter the nanopore I will get reads mapping to one or the other strand)..

ADD REPLYlink written 7 months ago by Florian Bernard50
1
gravatar for Florian Bernard
7 months ago by
Florian Bernard50 wrote:

As pointed by some of you, there will be cases where it's absolutely impossible to determine if one read belong to gene A or gene B without having a strand information that I can actually rely on (like with data generated by direct-RNA or direct-cDNA kits). And after some thinking, I realized it's not necessarily a problem.

My "gene prediction" function is part of a larger script and it is just supposed to ease downstream computational work by first identifying the gene my reads correspond to, so that I can focus the downstream analysis on that particular gene and not all 20K genes in C. elegans. But I eventually figured that even if my "gene prediction" algorithm returns 2 or 3 possible genes, it is still a great improvement.

So I guess for all ambiguous cases where I can't rely on the strand information I will just make the function return all possible genes for a given read. And maybe add the possibility to input the type of kit used to generate the data and if it is direct-RNA or direct-cDNA kit then make it so the function takes in account the strand information and then only returns one possible gene.

Thanks a lot to everyone for your help, it made me see the problem in a different way and made me realized it wasn't necessarily a problem !

ADD COMMENTlink written 7 months ago by Florian Bernard50
0
gravatar for Friederike
7 months ago by
Friederike3.2k
United States
Friederike3.2k wrote:

As ATpoint notes, there is no perfect solution because the ambiguity cannot be solved (at least not by looking at every read individually). Check out the code and info provided by htseqcount, they demonstrate the different strategies you could use. Which one you end up using should depend on the question you're trying to address.

ADD COMMENTlink written 7 months ago by Friederike3.2k

I am more concerned by case when one gene is entirely overlapping with another on the opposite strand ( see figure C here: http://researcherslinks.com/current-issues/Association-of-Types-Overlapping-Genes/20/1/983/figures). Which doesn't seem to be a case covered by htseqcount ?

ADD REPLYlink written 7 months ago by Florian Bernard50
1

I think, the second case from the bottom ("ambiguous") seems to fit the bill: from the read's perspective, gene A and gene B are completely overlapping and it has no way of telling where it's from unless it has the strand information handy.

ADD REPLYlink written 7 months ago by Friederike3.2k

I think for those cases, given your data, you simply have to admit you don't know, at least for those reads that doesn't contain any anchoring sequence that allows you to pin-point it to one gene or the other (i.e., the case that is shown third from the bottom of that table by HTSeqcount). Just out of curiosity: how many reads actually contain completely ambiguous sequences?

ADD REPLYlink written 7 months ago by Friederike3.2k

I would need to look more closely at the data but according to this papers (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2747821/), C. elegans has ~450 nested genes. I would need to look at how high this genes are expressed but the chances that some reads may map to more than one gene are somewhat high.. But you're right, there will be cases where it's absolutely impossible to determine if it's one or the other without having a strand information that I can rely on. And after some thinking, I realized it's not necessarily a problem (see my more detailed answer in the post below). Thanks for your time !

ADD REPLYlink written 7 months ago by Florian Bernard50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1301 users visited in the last hour