Question: Looking for a tool like fastq screen but for ONT data
1
12 months ago by
Roxane Boyer950
France / Toulouse / GeT-Plage
Roxane Boyer950 wrote:

Hello Biostars !

I am currently looking for a tool similar to fastq screen: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/fastq_screen_documentation.html , which can roughly characterize genome composition (did we sequence the right species? do we have contamination in our sequences?) with nice graphs directly from a subset sample of fastq reads (like showing the number of hits for several species and such).

I will probably try fastq screen, but it is specified that the tool may be more suitable for short-read technologies, as it uses short-read aligners such as bowtie2. I thought maybe such a tool exists for longer and more error-prone reads. Or maybe a combination of a suitable aligner (like ngmlr, which handles long reads and higher error rates) and then another tool could do it?

Does anyone have any ideas or suggestions ? I'll keep you updated on my own findings !

Cheers,

Roxane

gridion nanopore tool minion ont • 720 views
ADD COMMENTlink modified 12 months ago by jrj.healey13k • written 12 months ago by Roxane Boyer950
1

Take a look at https://www.biorxiv.org/content/early/2018/07/20/372474

ADD REPLYlink written 12 months ago by WouterDeCoster40k

Interesting tool. So as I understand, it's possible to specify a database of our choice right ?

ADD REPLYlink written 12 months ago by Roxane Boyer950

What I meant was: is it possible to compare against genomes other than bacteria? I would like to be able to detect whether it's plant DNA, bacteria, mammals, fish... Something way more general. The output of MetaMaps shows very high precision and recall, but at the genus and family level for bacteria. And I wonder if the tool works outside the context of metagenomics.

ADD REPLYlink written 12 months ago by Roxane Boyer950
1

I should think it's just a case of building a representative dataset for what you're interested in, as it is with my suggestion of Kraken. Since it's a new tool there probably aren't many benchmark datasets outside of the authors' lab. Adam Phillippy and crew are nice though, so I'd just mail them and explain what you want to do!

ADD REPLYlink written 12 months ago by jrj.healey13k

Roxane Boyer : I would suggest using DIAMOND against nr, if you have enough compute resources available. Your long reads are not going to be more than a million so it should be a workable option.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax69k

That is a nice suggestion. Is it well adapted to long ONT reads tho? On their page they specify that it's faster than blast in the case of short Illumina reads, but it must be different for ONT reads.

ADD REPLYlink written 12 months ago by Roxane Boyer950
1

I was able to search nr using DIAMOND with ~1070 fastq sequences contained in an ONT data file. Reads ranged from 375 bp to 126000 bp. I left all settings at default. It appears that DIAMOND reports a max of 25 hits per sequence. It took about 6 h. DIAMOND can make a SAM file.
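
To turn a DIAMOND run like this into a rough "what is in my reads" summary, one option is to keep only the best hit per read and count how often each subject comes up. The sketch below is only an illustration: it assumes DIAMOND's default tabular output (the 12 BLAST-style columns, with the query in column 1, the subject in column 2 and the bit score in column 12), and the function names `best_hits` and `subject_counts` are made up for this example. Mapping subject accessions to species is a separate step that is not shown.

```python
from collections import Counter


def best_hits(tabular_lines):
    """Map each query read to the subject of its highest bit-score hit.

    Assumes 12-column BLAST-style tabular lines (an assumption about how
    DIAMOND was run, not something taken from this thread).
    """
    best = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip blank or malformed lines
            continue
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def subject_counts(tabular_lines):
    """Count how many reads have each subject as their best hit."""
    return Counter(best_hits(tabular_lines).values())
```

With a results file on disk, something like `subject_counts(open("matches.tsv")).most_common(10)` would list the ten most frequent best-hit subjects (`matches.tsv` is a placeholder file name).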

ADD REPLYlink modified 11 months ago • written 12 months ago by genomax69k

I am going to try with a set of ONT reads I have. Will let you know.

ADD REPLYlink written 12 months ago by genomax69k

not sure if it would serve your request but you might have a look at NanoPlot ?

ADD REPLYlink written 12 months ago by lieven.sterck5.5k

From what I understand of NanoPlot, it doesn't seem that it will help me answer the "What did I sequence?" question. I was looking more for a tool or a method that can determine what organism the reads originate from, which would serve both as a contamination check and as genome characterization (like a QC check). But thanks for the suggestion, I'll keep that tool in mind!

ADD REPLYlink written 12 months ago by Roxane Boyer950

NanoPlot does not exactly do what OP is asking for :)

ADD REPLYlink written 12 months ago by WouterDeCoster40k

From the NanoPlot author :o)

ADD REPLYlink written 12 months ago by Roxane Boyer950
1

I can imagine (using minimap2) it would be fairly easy to write a fastq-screen-for-long-reads. But then again, I should be writing my thesis instead.

ADD REPLYlink written 12 months ago by WouterDeCoster40k
1

I may give that task a try myself. Here is a "good luck" upvote for your thesis :)

ADD REPLYlink written 12 months ago by Roxane Boyer950
1

Thanks!

I would approach it in Python, roughly similar to what I did for NanoLyse (which removes lambda reads from a fastq file): https://github.com/wdecoster/nanolyse/blob/master/nanolyse/NanoLyse.py#L101

I would use the python API for minimap2, mappy, and just check "does an alignment exist for this read on that genome" and keep count of that.
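
A rough sketch of that counting idea, to make it concrete. This is not code from NanoLyse or FastQ Screen; the function names `load_aligners` and `screen_reads` and the genome labels are hypothetical. The only mappy calls assumed are `mappy.Aligner(fasta, preset=...)` and `Aligner.map(seq)`, and mappy is imported inside `load_aligners` so the counting part can be tried out without it.

```python
def load_aligners(genomes, preset="map-ont"):
    """Build one minimap2 index per reference genome (needs `pip install mappy`).

    `genomes` maps a label (e.g. "human") to a FASTA path; both are placeholders.
    """
    import mappy as mp  # imported lazily so screen_reads() works without mappy

    aligners = {}
    for name, fasta in genomes.items():
        aligner = mp.Aligner(fasta, preset=preset)
        if not aligner:
            raise RuntimeError(f"could not load or build an index for {fasta}")
        aligners[name] = aligner
    return aligners


def screen_reads(reads, aligners):
    """Count, per genome, the reads that have at least one alignment.

    `reads` yields (name, sequence) pairs; `aligners` maps a genome label to
    any object whose .map(sequence) yields alignments (such as mappy.Aligner).
    Returns (total reads, hits per genome, reads with no hit anywhere).
    """
    hits = {name: 0 for name in aligners}
    total = 0
    no_hit = 0
    for _, seq in reads:
        total += 1
        matched = False
        for name, aligner in aligners.items():
            # Only the existence of an alignment matters, so stop at the first one.
            if next(iter(aligner.map(seq)), None) is not None:
                hits[name] += 1
                matched = True
        if not matched:
            no_hit += 1
    return total, hits, no_hit
```

Reads could be fed in with `((name, seq) for name, seq, qual in mappy.fastx_read("reads.fastq"))`. A read that aligns to several genomes is counted once for each of them, so the per-genome counts can add up to more than the total.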

ADD REPLYlink written 12 months ago by WouterDeCoster40k

I was thinking of Python as well: the main goal is to integrate this analysis into an existing Python workflow, so that seemed to be the easiest solution indeed. Even better if there is an API. Thanks for the reference to NanoLyse, I'll have a look :)

ADD REPLYlink written 12 months ago by Roxane Boyer950

Let me know if you get stuck - I need some coding during my writing to remain sane.

ADD REPLYlink modified 12 months ago • written 12 months ago by WouterDeCoster40k

Who doesn't ;) I'll keep you updated when I try something then!

ADD REPLYlink written 12 months ago by Roxane Boyer950

Did you guys manage to write a fastq-screen-for-long-reads based on minimap2? I am considering the same idea and wanted to make sure I was not reinventing the wheel.

ADD REPLYlink written 5 months ago by Benoit Dessailly10

Hi Benoit! Nope, in the end I had some other projects to finish before this one, so I did not have any time to think about this... So you won't be reinventing the wheel ;) I'll be very interested to test it if you are going to work on that! Cheers, Roxane

ADD REPLYlink written 5 months ago by Roxane Boyer950
1

Hi Roxane, I've modified FastQ Screen so it now includes minimap2 as one of the alignment options, and can therefore process long-read data. I've submitted the code changes to the team maintaining FastQ Screen and I am hoping they are going to look into releasing it as the next version of FastQ Screen.

ADD REPLYlink written 5 months ago by Benoit Dessailly10

my bad indeed :/ , understood OP's question wrongly (completely wrong even)

ADD REPLYlink written 12 months ago by lieven.sterck5.5k

Do you expect specific contaminants or are you just trying to do a survey of what is there? You could always use minimap with the expected genome and figure out what remains unaligned.
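
As a rough illustration of the "what remains unaligned" step: assuming the reads were mapped with minimap2 in SAM mode (for example `minimap2 -ax map-ont expected.fa reads.fastq > aln.sam`, where the file names are placeholders), unaligned reads are the records with the 0x4 bit set in the FLAG column. The helper name `unaligned_read_names` is made up for this sketch.

```python
def unaligned_read_names(sam_lines):
    """Return the names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    names = []
    seen = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4 and name not in seen:
            seen.add(name)
            names.append(name)
    return names
```

Calling `unaligned_read_names(open("aln.sam"))` gives the read names to pull out of the original fastq for a closer look, for instance with a search against a broader database.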

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax69k

I'm not looking for a particular contaminant; indeed, it's more like a survey of my reads' content. Minimap and miniasm do seem interesting, thanks!

ADD REPLYlink written 12 months ago by Roxane Boyer950
0
12 months ago by
jrj.healey13k
United Kingdom
jrj.healey13k wrote:

I'd run the reads and/or assembled contigs through Bracken/Kraken.

It's designed for metagenomic studies, but it will tell you the distributions of your reads amongst different taxa. If you've got 2 different genomes in there (one as a contaminant for example) it should stick out like a sore thumb.

Some info in this link for instance to get started:

https://www.microbe.net/2017/04/27/why-use-bracken-instead-of-kraken/

ADD COMMENTlink written 12 months ago by jrj.healey13k

Do you know if these can use long reads?

ADD REPLYlink written 12 months ago by genomax69k

I’m not 100% sure, but they can bin contigs, so it should work or at least be kinda coercible.

ADD REPLYlink written 12 months ago by jrj.healey13k

That was a nice suggestion, but I feel like it won't be suited to my purpose, as it's a metagenomics-oriented tool that will only work for bacterial DNA, right..?

ADD REPLYlink written 12 months ago by Roxane Boyer950

I assume you can replace the target database with your sequences of interest.

ADD REPLYlink written 12 months ago by genomax69k

Ah yes, I believe it will only work for bacteria 'out of the box'. No organism was specified so I thought it was worth suggesting ;)

You may be able to modify/expand on the approach however.

ADD REPLYlink modified 12 months ago • written 12 months ago by jrj.healey13k

Yeah, I did not specify, and as I said, it was a good suggestion anyway because I did not know the tool :) I'm surveying all the tools to think about a nice approach and I'll keep that one in mind for sure. The thing is that I don't really know yet what my sequences of interest are... I have to make my question more precise!

ADD REPLYlink written 12 months ago by Roxane Boyer950

Update:

I spoke to a few people including one of the authors who had this to say:

We use Kraken to filter contaminants in a lot of our projects as well. It's actually the program I used to filter out contaminant sequences from the eukaryotic draft genomes in my latest paper. Other people in the lab use it in assembly projects to filter potential contaminating bacterial sequences. We also use it to remove any non-informative vector or human sequences from our samples when working on diagnoses.

With regard to building a database:

Yes. It would be easiest to just build a database of everything you want to exclude from your sample and then take all unclassified reads to the next step, but you can also just use any database and exclude sequences that are classified as particular taxa. But that would require writing another script to parse those out.

The first option can be achieved by kraken/kraken2 itself using the --unclassified-out flag, I think.
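
For the second option, the extra parsing script could look roughly like this. It is a sketch under assumptions: it expects Kraken's standard per-read output (tab-separated, with "C" or "U" in the first column, the read ID in the second and the assigned taxonomy ID in the third), the name `reads_to_keep` is made up, and it only matches the exact taxonomy IDs given. Excluding everything below a taxon would also need the taxonomy tree, which is not handled here.

```python
def reads_to_keep(kraken_lines, excluded_taxids):
    """Return IDs of reads that were not assigned to any excluded taxon.

    Unclassified reads ("U") are always kept. Only exact taxonomy ID matches
    are excluded; descendants of an excluded taxon are not resolved.
    """
    excluded = {str(taxid) for taxid in excluded_taxids}
    keep = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip blank or malformed lines
            continue
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "U" or taxid not in excluded:
            keep.append(read_id)
    return keep
```

For example, `reads_to_keep(open("kraken_output.txt"), {9606})` would keep every read not assigned directly to human (NCBI taxonomy ID 9606); `kraken_output.txt` is a placeholder file name.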

ADD REPLYlink written 12 months ago by jrj.healey13k