Using the right reference to identify sequence origin
5.7 years ago
Rox ★ 1.4k

Hello Biostar,

I have a question I have been unable to answer myself for several weeks now.

I would like to add a new QC analysis to my pipeline (ONT data) that would be able to both:

  • detect whether our sequences contain any contaminants (bacteria? virus? human DNA?)
  • detect whether the sequences belong to the species we expected to sequence (for example, if the sequenced DNA is from a European perch, the expected result would be that my sequences map to a fish reference genome)

For that, the approach I can think of is to compare my reads to a set of reference genomes. My idea was to choose, for each category, a genome that is "common" enough (I know that doesn't really mean anything) for my reads to map well to it. For example, I would like to use one virus genome to detect any kind of viral contamination, and likewise one human genome, one bacterial genome, one fish genome, etc. I don't want to use a gigantic database containing a bit of everything, because I don't want something too heavy to align against and because I don't need that level of precision.

I know this is far from the best approach, because there is no such thing as a reference genome for all viruses or for all bacteria. But I thought that for my simple purpose (I am not trying to identify the contaminant exactly, nor to retrieve the contaminated reads), it could possibly fit. Still, I struggle to know which reference genomes I should use for optimal results. Escherichia coli for bacteria? Drosophila melanogaster for insects? The other option I was thinking of is to create a hybrid FASTA file for each category I would like to detect, each containing 4 or 5 different species from that category (5 insects for insects, 5 bacteria for bacteria...).
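To make the hybrid FASTA idea concrete, here is a minimal Python sketch of how such a file could be assembled. It is only an illustration under assumptions: the function name `build_hybrid_fasta`, the `category|` header prefix and the example file names are all hypothetical, and the choice of genomes is left entirely open.

```python
def build_hybrid_fasta(genomes_by_category, out_path):
    """Concatenate FASTA files into one reference, tagging each header.

    genomes_by_category maps a category label (e.g. "bacteria") to a list
    of FASTA paths. Each header ">name ..." becomes ">category|name ...".
    Returns the number of sequences written.
    """
    n_written = 0
    with open(out_path, "w") as out:
        for category, fasta_paths in genomes_by_category.items():
            for fasta_path in fasta_paths:
                with open(fasta_path) as handle:
                    for line in handle:
                        if line.startswith(">"):
                            # Hypothetical convention: prefix the category
                            # so it can be recovered from alignment output.
                            line = f">{category}|{line[1:]}"
                            n_written += 1
                        if not line.endswith("\n"):
                            # Guard against files lacking a final newline.
                            line += "\n"
                        out.write(line)
    return n_written


# Example call (file names are placeholders, not real resources):
# build_hybrid_fasta(
#     {"bacteria": ["ecoli.fa"], "insects": ["dmel.fa"]},
#     "hybrid_reference.fa",
# )
```

Putting the category in the sequence name means any aligner that reports target names lets you count hits per category without a separate lookup table.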

What do you think of my ideas? Do you see any major drawbacks that would make this approach unsuitable for the analysis I'm trying to do? Do you have any other suggestions I couldn't think of? Thanks a lot for your advice!

Cheers,

Roxane

genome sequence • 2.0k views
5.7 years ago
5heikki 11k

You could Mash your reads (or a random subset of reads) as individual sequences against a reference database such as pre-sketched RefSeq. See here

This tool is awesome... I'm running a lot of tests with it, and it's a fast and smart approach... Thanks a lot for recommending it!

It's IMO by far the most overlooked bioinfo tool of the last few years..

I can understand why! I still need to run a few more tests to make it fit my purpose, though. For now, a mash dist of a FASTQ of fish sequences against the whole of RefSeq gives mammalian genomes as best matches, ahead of fish genomes... If you have any experience with the parameters I should use to build the sketch or to measure the distance, your advice would be welcome!

Are you sure your fastq includes only fish DNA? Mash dist individual reads and maybe you'll see that some of them are fish and others something else? I've built my own RefSeq bacterial genomes DB with -k 21 and -s 5000
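For anyone wondering what the k-mer size and sketch size actually control, here is a toy Python sketch of a bottom-s MinHash comparison with the Mash distance formula, D = -ln(2j / (1 + j)) / k, where j is the estimated Jaccard index. It is not Mash's implementation: the hash function (BLAKE2b) is a stand-in chosen because it ships with Python, and the function names are illustrative.

```python
import hashlib
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield each k-mer as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
            rc = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, rc)


def sketch(seq, k=21, s=5000):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
        )
        for kmer in canonical_kmers(seq, k)
    }
    return sorted(hashes)[:s]


def mash_distance(sketch_a, sketch_b, k=21, s=5000):
    """Estimate Jaccard from two sketches, then convert it to a distance."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]  # bottom-s of the union
    if not merged:
        return 1.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    j = shared / len(merged)
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

A larger s gives a finer-grained Jaccard estimate at the cost of a bigger database, while k trades specificity (longer k-mers) against tolerance to sequence divergence and read errors (shorter k-mers).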

That brings me to another question: don't you think this tool, which kind of compresses a sequence using its most representative k-mers, would be strongly affected by the sequencing technology used? That applies both to the raw reads and to the resulting assemblies. Raw PacBio/ONT reads are longer but more error-prone, while raw Illumina reads are short but of high quality. Are there any studies comparing sequences from the same species generated with different sequencing technologies?

I'm kind of worried about the impact of the high error rate in my ONT reads.
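A back-of-the-envelope way to frame that worry: if errors are assumed to be independent and uniform along the read (a simplification), the chance that a k-mer contains no error is (1 - e)^k for a per-base error rate e. The short Python sketch below just evaluates that expression; the two error rates are illustrative placeholders, not measured values for any platform.

```python
def error_free_kmer_fraction(error_rate, k):
    """Probability that all k bases of a k-mer are correct.

    Assumes independent, uniformly distributed per-base errors.
    """
    return (1.0 - error_rate) ** k


# Illustrative rates only: one low-error and one high-error scenario.
for label, rate in [("low-error reads", 0.001), ("high-error reads", 0.10)]:
    for k in (15, 21):
        fraction = error_free_kmer_fraction(rate, k)
        print(f"{label}, e={rate}, k={k}: {fraction:.3f}")
```

With e = 0.10 and k = 21 this comes out at about 0.11, so under this simple model only around one k-mer in nine from such a read would match the reference exactly, which is why k and the noise filtering matter more for long noisy reads than for short accurate ones.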

What you are doing is only a qualitative analysis, correct? This isn't a case where you need to be 100% sure about everything that is in there. For a qualitative check, this should be adequate.

Exclude singleton k-mers and it should be fine..
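The idea behind dropping singletons is that a k-mer created by a random sequencing error is unlikely to appear twice, whereas a genuine k-mer should recur once coverage is high enough. A minimal Python sketch of that filter follows; the function name `solid_kmers` is illustrative, and for brevity it does not merge a k-mer with its reverse complement. I believe Mash exposes a similar filter as the `-m` option of `mash sketch`, but check `mash sketch -h` rather than taking my word for it.

```python
from collections import Counter


def solid_kmers(reads, k=21, min_copies=2):
    """Return the set of k-mers seen at least min_copies times in the reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # min_copies=2 is what excludes singleton k-mers.
    return {kmer for kmer, n in counts.items() if n >= min_copies}
```

One caveat: at low coverage, genuine k-mers can be singletons too, so this filter only helps when the genome is covered several times over.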

Thanks, I'll have a look at it, but as I said to genomax, I'm not really asking for advice on a tool, but rather on how I should build my own small reference database... ;(

but rather on how I should build my own small reference database... ;(

I don't think there is a good/correct answer for that. No matter what you select you will likely miss some other thing. It will depend on what you are comfortable with.

Does the idea of building a hybrid FASTA file containing several genomes for each target group I want to identify seem unreasonable to you? I realise it may be the main point of my post in the end.

No. But you are again likely to miss many things by cherry picking.

Instead of doing alignments, you could use sketches/hashes, as suggested by @5heikki, with RefSeq (that should be comprehensive). BBMap has tools for that kind of search as well.

More than that, these Mash databases can be really small, e.g. all of RefSeq in less than 100 MB.

5.7 years ago
GenoMax 141k

Rather than chasing after what is there (you should only expect odd things in specific circumstances, e.g. environmental samples), why not select what you need using BBSplit and not worry about the rest? An advantage is that you can provide more than one reference genome and have the reads binned accordingly, if you need that.

Otherwise we just had a discussion of how to do this: BLAST based on 100k sequences rnaseq?

The authors of FastQC also have a program called FastQ Screen that does this.

The main reason I'm not selecting what I want is because the sequences I'm producing may be from de novo sequencing. But indeed, I guess such an approach could work. Isn't BBSplit going to discard the reads I don't want, though? That is not what I am aiming for. I would just like to report what's in there.

Hehe, you got me there, I'm trying to make a tool like FastQ Screen... but for ONT data! I already asked about it here: C: Looking for a tool like fastq screen but for ONT data. I have already run several tests with FastQ Screen and discussed them with its author, and it seems that a tool made for short Illumina reads with a low error rate isn't suited at all to long, messy ONT reads. The results are very noisy.

Thanks for the discussion, I'll have a look at it!

The main reason I'm not selecting what I want is because the sequences I'm producing may be from de novo sequencing.

Fair point. The BBSplit option is only useful when you have reference genome(s) in hand that you can use.

In the thread that you referred to above I had posted some metrics of doing a DIAMOND search against nr (C: Looking for a tool like fastq screen but for ONT data). If you need a reasonably complete answer then that may be the way to go.

What scares me about DIAMOND is that it seems too heavy for the simple purpose I want to accomplish, and it is not even suited to long ONT reads. I'm no longer at the stage of looking for tools (unless what I want already exists and I haven't seen it). I think I'm going to use a "homemade" solution based on minimap2, as wouter advised me to. But now my main question is how I should build a small reference database.
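As a rough illustration of what such a homemade report could look like, the Python sketch below tallies reads per category from minimap2's PAF output (for example from `minimap2 -x map-ont reference.fa reads.fastq > hits.paf`). It relies on assumptions that are not part of any existing tool: reference sequence names are assumed to carry a hypothetical `category|` prefix, the function name `tally_categories` is made up, and the mapping-quality cutoff of 10 is arbitrary.

```python
from collections import Counter


def tally_categories(paf_lines, min_mapq=10):
    """Count reads per reference category from minimap2 PAF lines.

    For each read, keeps the alignment with the most matching bases and
    assumes target names look like "category|sequence_name".
    """
    best = {}  # read name -> (matching bases, category)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        read, target = fields[0], fields[5]
        matches, mapq = int(fields[9]), int(fields[11])
        if mapq < min_mapq:
            continue  # arbitrary cutoff to drop ambiguous placements
        category = target.split("|", 1)[0]
        if read not in best or matches > best[read][0]:
            best[read] = (matches, category)
    return Counter(category for _, category in best.values())


# Example usage (path is a placeholder):
# with open("hits.paf") as handle:
#     print(tally_categories(handle))
```

Reads with no alignment do not appear in PAF at all, so the total read count has to come from the FASTQ itself if the report should include an "unassigned" fraction.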
