Question: Identifying covered genes from nanopore cDNA-seq reads
0
gravatar for Florian Bernard
13 months ago by
Florian Bernard60 wrote:

I am working with long reads generated by Nanopore technology.

I tried to write a small script in python (using pysam) that would allow, for every read, to 'detect' to which gene they actually map to. To do that, I was using the following informations: the chromosome, the strand, and the start and end genomic position of my read.

However, I just came to realize there is a flaw in my design. For a given gene, I will get reads that map to both strands (because of the way the library is prepared and the way the molecule is sequenced). So I cannot use the 'strand' information to detect which gene my read is actually mapping to. And I'm afraid that if I only use the chromosome information and the genomic coordinates, I will end up with 'false positive' in places where I have nested genes...

Would anyone have an idea how I could fix that ? Thanks in advance !

rna-seq alignment gene • 411 views
ADD COMMENTlink modified 13 months ago by WouterDeCoster40k • written 13 months ago by Florian Bernard60
2

This could help you: "nanopore_transcript_abundance.py estimates transcript abundance from reads mapped to a transcriptome. "

ADD REPLYlink written 13 months ago by WouterDeCoster40k

I see, it actually takes all possible genes and then compute the one that match the best. It's a great idea. You can see my post below where I figured it wasn't necessarily a problem to have more than one possible gene for a given read. It's not perfect but if I reduce the list of possible genes from 20K (every genes possible in C. elegans) to just a few, then I can actually carry on with downstream analysis in a more efficient way ! Thanks for your help.

ADD REPLYlink written 13 months ago by Florian Bernard60

What is the actual aim? Especially with long reads, you will see lots of reads overlapping multiple genes, e.g in human NKIRAS1 and RPL15.

ADD REPLYlink written 13 months ago by ATpoint22k

To know which gene my reads correspond to. It is long reads but coming from RNAs, so there shouldn't be a problem with overlapping.. unless with nested genes. if gene A is on the opposite strand of gene B, then by using genomic position I cannot distingues them unless I know the strand.

ADD REPLYlink written 13 months ago by Florian Bernard60

Can you clarify if you are using sequencing cDNA or RNA? With nanopore this terminology starts to get important.

ADD REPLYlink written 13 months ago by WouterDeCoster40k
1

In a perfect world I would only be using direct-RNA sequencing and the problem would be gone : for a given gene, all the reads would map to the 'correct' strand . But I've got some data generated with the classical 1D ligation kit (so depending which strand will actually enter the nanopore I will get reads mapping to one or the other strand)..

ADD REPLYlink written 13 months ago by Florian Bernard60
1
gravatar for Florian Bernard
13 months ago by
Florian Bernard60 wrote:

As pointed by some of you, there will be cases where it's absolutely impossible to determine if one read belong to gene A or gene B without having a strand information that I can actually rely on (like with data generated by direct-RNA or direct-cDNA kits). And after some thinking, I realized it's not necessarily a problem.

My "gene prediction" function is part of a larger script and it is just supposed to ease downstream computational work by first identifying the gene my reads correspond to, so that I can focus the downstream analysis on that particular gene and not all 20K genes in C. elegans. But I eventually figured that even if my "gene prediction" algorithm returns 2 or 3 possible genes, it is still a great improvement.

So I guess for all ambiguous cases where I can't rely on the strand information I will just make the function return all possible genes for a given read. And maybe add the possibility to input the type of kit used to generate the data and if it is direct-RNA or direct-cDNA kit then make it so the function takes in account the strand information and then only returns one possible gene.

Thanks a lot to everyone for your help, it made me see the problem in a different way and made me realized it wasn't necessarily a problem !

ADD COMMENTlink written 13 months ago by Florian Bernard60
0
gravatar for Friederike
13 months ago by
Friederike5.0k
United States
Friederike5.0k wrote:

As ATpoint notes, there is no perfect solution because the ambiguity cannot be solved (at least not by looking at every read individually). Check out the code and info provided by htseqcount, they demonstrate the different strategies you could use. Which one you end up using should depend on the question you're trying to address.

ADD COMMENTlink written 13 months ago by Friederike5.0k

I am more concerned by case when one gene is entirely overlapping with another on the opposite strand ( see figure C here: http://researcherslinks.com/current-issues/Association-of-Types-Overlapping-Genes/20/1/983/figures). Which doesn't seem to be a case covered by htseqcount ?

ADD REPLYlink written 13 months ago by Florian Bernard60
1

I think, the second case from the bottom ("ambiguous") seems to fit the bill: from the read's perspective, gene A and gene B are completely overlapping and it has no way of telling where it's from unless it has the strand information handy.

ADD REPLYlink written 13 months ago by Friederike5.0k

I think for those cases, given your data, you simply have to admit you don't know, at least for those reads that doesn't contain any anchoring sequence that allows you to pin-point it to one gene or the other (i.e., the case that is shown third from the bottom of that table by HTSeqcount). Just out of curiosity: how many reads actually contain completely ambiguous sequences?

ADD REPLYlink written 13 months ago by Friederike5.0k

I would need to look more closely at the data but according to this papers (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2747821/), C. elegans has ~450 nested genes. I would need to look at how high this genes are expressed but the chances that some reads may map to more than one gene are somewhat high.. But you're right, there will be cases where it's absolutely impossible to determine if it's one or the other without having a strand information that I can rely on. And after some thinking, I realized it's not necessarily a problem (see my more detailed answer in the post below). Thanks for your time !

ADD REPLYlink written 13 months ago by Florian Bernard60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2774 users visited in the last hour