We have generated sequencing data with the direct-cDNA sequencing kit from ONT. During the library prep, prior to adding the adapters on each cDNA, we performed a ligation step to concatenate cDNAs together.
The aim was to see whether we could increase the total number of cDNAs sequenced without actually increasing the number of molecules sequenced (i.e. an increase in the total number of bases sequenced and in mean read length).
However, I'm having trouble actually identifying reads that come from such concatenated molecules, and I'm not sure what the most efficient way to find them would be.
I tried mapping my reads onto the genome and then looking at the alignments. Correct me if I'm wrong, but I assumed I would see an increase in the number of chimeric reads versus primary reads? However, I see no such thing and, even worse, I came to realize that I always have 13-15% chimeric reads in datasets generated by direct-cDNA sequencing. Could it be that some of those chimeric reads are not relevant? If so, how would one filter them out while still retaining reads that actually come from 'real' cDNAs? Nanopore reads being quite noisy, I believe this makes the analysis even more complicated.
Otherwise, would there be a way to split my reads before alignment, or to identify reads that result from concatenated cDNAs?
I thank you for your cooperation.
Just thinking aloud here. Are you sure the ligation strategy actually worked? Or, if it did, perhaps it generated concatemers that were very long and were not actually sequenced?
Yes, you are right, I'm actually trying to see whether it worked or not (and hence whether there is any point in doing that additional step). But being a novice at bioinformatics, I'm not sure whether what I'm observing comes from an erroneous way of analyzing my dataset, or whether there is indeed no real difference from the other datasets that were generated following the classical protocol (= no extra ligation step).
Also: I didn't do the library prep myself, but if I recall correctly we saw an increase in fragment length (by running some of the library on an agarose gel). But we didn't quantify the efficiency of the ligation, and it might concern a smaller fraction of the reads than we expected. For some reason, shorter reads might also be easier to sequence, resulting in those longer reads being found less often.
However, I'm still surprised to find that many chimeric reads in my other datasets. So either reads reported as chimeric are not always relevant (and thus not an accurate marker for observing concatenation of cDNAs), or ligation between cDNAs already happens during a normal library prep.
Just to clarify: by chimeric reads you mean that one part of the read maps to gene A (an exon therein) and the other to gene B? And these were present in the datasets even when no ligation was tried?
Yes, based on the definition of chimeric reads given here: A: definition of chimeric vs multiple-mapping (SAM)
As it seems that only chimeric reads are flagged as supplementary alignments, I used
samtools view -c -f 0x800
to count the number of chimeric reads in each dataset. Across all 4 direct-cDNA datasets (1 with the ligation step, 3 without), the percentage of chimeric reads stays the same.
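One caveat with the count above: -f 0x800 counts supplementary alignment records, not reads, so a read split into three segments contributes two records. Below is a rough Python sketch of a per-read alternative, assuming the aligner writes the standard SA tag on the primary alignment of a split read. The function name, the category names and the min_gap threshold are all made up for illustration and would need tuning. The idea is that a read joining two unrelated cDNAs should mostly fall in the "inter_chrom" or "distant" classes, whereas "local" splits are more likely alignment artifacts or same-locus events.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_chimeras(sam_lines, min_gap=100_000):
    """Count split (chimeric) reads once per read from SAM text lines.

    min_gap is an arbitrary, hypothetical threshold (in bp) used to call two
    segments on the same reference "distant"; it is not a recommended value.
    """
    counts = {"mapped_reads": 0, "chimeric_reads": 0,
              "inter_chrom": 0, "distant": 0, "local": 0}
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a full alignment record
            continue
        name, flag = fields[0], int(fields[1])
        # Keep only the primary record of each mapped read.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY) or name in seen:
            continue
        seen.add(name)
        counts["mapped_reads"] += 1

        # The SA tag lists the other segments: rname,pos,strand,CIGAR,mapQ,NM;
        sa = next((f[5:] for f in fields[11:] if f.startswith("SA:Z:")), None)
        if sa is None:
            continue
        segments = [(fields[2], int(fields[3]))]
        for entry in sa.split(";"):
            if entry:
                rname, pos = entry.split(",")[:2]
                segments.append((rname, int(pos)))

        counts["chimeric_reads"] += 1
        chroms = {rname for rname, _ in segments}
        positions = [pos for _, pos in segments]
        if len(chroms) > 1:
            counts["inter_chrom"] += 1
        elif max(positions) - min(positions) > min_gap:
            counts["distant"] += 1
        else:
            counts["local"] += 1
    return counts
```

The function takes any iterable of SAM text lines, so it could be fed the text output of samtools view through sys.stdin, or an opened .sam file. Comparing chimeric_reads / mapped_reads and the share of each class between the ligated dataset and the three controls might be more informative than the raw supplementary count, though the classes here are only a crude proxy for "two different genes".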