Question: Screen reads based on certain sequences. Write code myself or designated tools.
0
gravatar for omer.k
11 months ago by
omer.k10
omer.k10 wrote:

Hi Guys, I intend on acquiring transcripts (cDNA) from certain human-cell line culture. I'd be acquiring all the reads using the MinION. The reads are expected to be 2-10kb (don't know yet the distribution).

From all the reads I obtain, I'd like to move forward to analysis with only certain read:

  • The reads which will have a certain known tagging sequence (45 bases on the 3' end).
  • The reads which will have and expected, known, open read frame (~2.5kb).

Obviously, I'd have to optimize the balance between sensitivity and specificity when targeting these reads.

Is there a tool designed for these specific types of screening? If not, would you suggest to write a code from scratch or revise an already somewhat-similar code?

I'm a novice to UNIX but have knowledge in coding (MATLAB, C), so keep that in mind please when answering.

Thanks.

sequencing next-gen sequence • 398 views
ADD COMMENTlink modified 8 months ago by Biostar ♦♦ 20 • written 11 months ago by omer.k10

It's probably important to note that the data is noisy and therefore exact matching of the known tagging sequence won't give you the best results. You'll probably need to do some fuzzy matching of your "recognition sequence".

Since you are looking for cDNA sequences I'm not sure why the second requirement would be an issue?

ADD REPLYlink written 11 months ago by WouterDeCoster24k
1
gravatar for apa@stowers
11 months ago by
apa@stowers320
Kansas City
apa@stowers320 wrote:

I typically use BLAT or cross_match for barcode detection. For ORFs on reads that long, why not just align to the transcriptome fasta? Or, given the read length, align the known ORFs to the reads? ORF prediction on every read will kill your runtime. And the error rate of long reads will kill your ORF detection.

ADD COMMENTlink modified 11 months ago • written 11 months ago by apa@stowers320
0
gravatar for omer.k
11 months ago by
omer.k10
omer.k10 wrote:

You'll probably need to do some fuzzy matching of your "recognition sequence".

Yes, that's what I meant by

Obviously, I'd have to optimize the balance between sensitivity and specificity when targeting these reads.

Regarding you second comment

Since you are looking for cDNA sequences I'm not sure why the second requirement would be an issue?

The presence of the ORF is suggestive of having full transcript. It could be the ORF or another ~45 bases on the 5' end. Either way, still we'd like to screen based on these two sequences.

ADD COMMENTlink modified 11 months ago • written 11 months ago by omer.k10

Please use ADD COMMENT or ADD REPLY to answer to earlier posts, as such this thread remains logically structured and easy to follow.

ADD REPLYlink written 11 months ago by WouterDeCoster24k
0
gravatar for omer.k
11 months ago by
omer.k10
omer.k10 wrote:

Hi, apa@stowers, thanks for commenting.

I typically use BLAT or cross_match for barcode detection.

Admittedly I'm not familiar with cross_match. I'll read of it.

Or, given the read length, align the transcriptome CDS fasta to the reads?

Did you offer this try and "thin down" to runtime? Not sure if this is what you meant. Either way the coding region is what I refereed to as the ORF and as I mentioned, ~2.5kb. Which tool would you suggest for alignment of the transcriptome CDS to the reads?

I could also unwillingly give up on this 2.5kb read and just suffice with the two 45-bases long tag, on the 5' and 3' ends of the reads.

ADD COMMENTlink written 11 months ago by omer.k10

Please use ADD COMMENT or ADD REPLY to answer to earlier posts, as such this thread remains logically structured and easy to follow.

ADD REPLYlink written 11 months ago by WouterDeCoster24k

Actually your ORF-ends-only approach would certainly be faster. I'm not sure what would be the fastest aligner for very long reads. If it were me, I might try STAR first?

ADD REPLYlink written 11 months ago by apa@stowers320

We have had good results with GMAP for nanopore cDNA reads.

ADD REPLYlink written 11 months ago by WouterDeCoster24k
0
gravatar for Brian Bushnell
11 months ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

I suggest BBDuk for barcode detection. It works by matching kmers, so the efficiency is related to the error rate of your reads. But, it can allow an edit distance, and besides that, Nanopore tends to have bursty errors that are correct for a while (hopefully long enough for a kmer) then wrong for a while, rather than random errors like PacBio.

A command like this:

bbduk.sh in=reads.fq outm=matches.fq literal=ACTG.....GACTG k=19 edist=1 restrictleft=50

...will catch all the reads that match any 19-mer in your literal, allowing 1 edit, looking for kmers only in the first 50bp (use restrictright=50 for the last 50bp). To demultiplex multiple barcodes at once you can use Seal which has similar functionality but produces multiple output files, one per barcode (if you supply a fasta file of barcodes).

ADD COMMENTlink written 11 months ago by Brian Bushnell15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 738 users visited in the last hour