Question

What is the best way to compute genetic distances between FASTQ files?

0

Entering edit mode

6 months ago

mikazon ▴ 10

I have downloaded a couple hundred genome reads from the SRA and want to compute the genetic distances between each pair of reads. So far, the only way that I've been able to do so is to convert the SRA files to FASTQ format and use Dashing2 to compute the genetic distance matrix. Is this an appropriate method to obtain genetic distances between each pair of reads, or is there a better approach that you recommend?

genetic-distance fastq • 1.4k views

ADD COMMENT • link updated 6 months ago by GenoMax 141k • written 6 months ago by mikazon ▴ 10

0

Entering edit mode

Not sure why you're messing around with read (fastq) files. Just use Mash on the genome FASTA files: https://github.com/marbl/Mash .

ADD REPLY • link 6 months ago by Dunois ★ 2.5k

2

Entering edit mode

Not sure why you're messing around with read files

Likely because OP has short read files that are from genomes i.e. not actual genomes.

couple hundred genomic sequences from the SRA

mikazon are you looking to compute this for reads or for full genomes represented by those reads?

ADD REPLY • link 6 months ago by GenoMax 141k

1

Entering edit mode

Thank you for clarifying that. I misunderstood "sequenced genomes" in their statement to imply assembled genomes.

ADD REPLY • link 6 months ago by Dunois ★ 2.5k

1

Entering edit mode

mikazon will still need to clarify.

There are two separate statements in original post that are non-consistent

compute the genetic distances between each pair of sequenced genomes

Is this an appropriate method to obtain genetic distances between each pair of sequences

Does not make sense to do this for every short read.

ADD REPLY • link 6 months ago by GenoMax 141k

0

Entering edit mode

Thank you for your replies. Sorry if my original post did not make sense. Hopefully, I can clarify here. I am downloading reads such as this one; I am downloading these in fastq format, and I'm hoping to compute genetic dissimilarity between reads. So, with n reads, I should have n(n-1)/2 dissimilarity measures. I'm hoping that these dissimilarity measures are analogous to genetic dissimilarities between individuals from which these reads were obtained.

ADD REPLY • link 6 months ago by mikazon ▴ 10

0

Entering edit mode

I changed my language for clarity. Thanks for pointing out the confusion!

ADD REPLY • link 6 months ago by mikazon ▴ 10

0

Entering edit mode

To my knowledge, I can only download the reads in SRA format or FASTQ format from the SRA. However, if you know how I can download fasta files, I would greatly appreciate that.

ADD REPLY • link 6 months ago by mikazon ▴ 10

1

Entering edit mode

For this purpose it doesn't matter much whether you get fasta or fastq from SRA. When he said "Use FASTAs" he meant to use assemblies instead of reads.

ADD REPLY • link 6 months ago by Brian Bushnell 20k

score 3 · Accepted Answer · 2023-10-25

3

Entering edit mode

6 months ago

cmdcolin ★ 3.8k

fastq files are generally raw reads, so you would probably want to be careful about how you are doing the computations. one example of calculating 'genetic distances from raw reads' is read2tree

https://github.com/DessimozLab/read2tree

pub https://www.nature.com/articles/s41587-023-01753-4

ADD COMMENT • link 6 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

Okay, so I would use the alignment output of read2tree, then compute "genetic distances" from the alignment, correct?

ADD REPLY • link 6 months ago by mikazon ▴ 10

1

Entering edit mode

the genetic distances would be encoded by the tree (e.g. https://github.com/DessimozLab/read2tree#output-files). you can convert the tree into a distance matrix using other packages (example thread here Script To Calculate A Distance Matrix Based On Tree File)

ADD REPLY • link 6 months ago by cmdcolin ★ 3.8k

1

Entering edit mode

random other note: there are other tools than just read2tree that do "alignment free" phylogenetic tree reconstruction, or kmer based tree reconstruction. it may be worth literature review. the read2tree is just one that came to mind. finally, note also, you might consider, especially if all your fastq are from like the same species with a reference genome, is to align your fastq against a reference genome, variant calling, and computing tree from the resulting VCF

ADD REPLY • link 6 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

I definitely should do a bigger literature review. All the comments and posts here have been extremely helpful. I'll let you know if I have any more questions over the next few days. Thanks again fro your help.

ADD REPLY • link 6 months ago by mikazon ▴ 10

score 3 · Accepted Answer · 2023-10-25

3

Entering edit mode

6 months ago

Brian Bushnell 20k

Comparing fastqs is a bit more complex than assemblies due to the presence of errors. BBSketch gives you some control over that; e.g.

comparesketch.sh in=a.fastq ref=b.fastq minkeycount=2

...will ignore singleton kmers caused by errors. It's also very fast, so you can do an all-to-all comparison with lots of fastqs in a reasonable amount of time:

comparesketch.sh *.fastq alltoall format=3 records=9999

The fact that it does entropy filtering and has a blacklist of taxonomically uninformative conserved kmers present in many clades also prevents spurious hits of e.g. algae to octopus that otherwise confound this kind of methodology, since you are interested in "genetic distance" rather than spurious hits of ATATATATA...ATAT.

ADD COMMENT • link 6 months ago by Brian Bushnell 20k

1

Entering edit mode

Thank you for your comment. I'm looking into BBSketch now. Several follow-up questions for you: (1) Am I correct in assuming that the choice in ref is arbitrary for BBSketch? That is, will two different refs still result in the same genetic distance matrix? (2) It seems that alltoall does not require a reference, so does that compute genetic distances between each pair of reads independent of some reference? Additionally, what are the format and records arguments?

I found your introductory post on BBSketch. Looking forward to trying this new tool!

ADD REPLY • link 6 months ago by mikazon ▴ 10

1

Entering edit mode

There is a slight difference between query and reference for BBSketch. It does not affect the kmer distance or estimated ANI, but some extended metrics (like completeness) are calculated for the query assuming that the reference is a single organism, so swapping them would give a different result there. For your purpose those metrics are irrelevant.

In all-to-all mode everything is considered a "reference" and it calculates the distance between each pair of files (NOT each pair of reads). You CAN calculate distance on a per-sequence rather than per-file basis with the "persequence" flag but you definitely do not want to do that in all-to-all mode with short read files.
The default format is designed for displaying the best matches of a query to a bunch of references, kind of like BLAST results. "format=3" is designed for all-to-all comparisons, where it produces one line per query/ref pair indicating the kmer identity, estimated ANI, etc.
records=9999 is to uncap the number of records displayed per query. By default, to be concise, at most the top 20 hits will be displayed (normally what I care about, or what people I work with care about, is the identity of the primary organism and any major contaminants when comparing a file to a reference database). But for an all-to-all matrix you need to set that to a large number. You still will not get a line for every single file pair, though, because only pairs with at least 3 shared kmers will be reported. That's adjustable with the minhits flag but even if you set it at 0 it won't show pairs with no shared kmers. Although I am intending to add that functionality for convenience in generating matrices from the output.

ADD REPLY • link 6 months ago by Brian Bushnell 20k

0

Entering edit mode

Hi, Brian. I have been playing around with BBSketch this weekend, and I think I got everything working - your tools made it pretty easy! With my goal of computing genetic dissimilarities, do you think I should use ANI, KID, or WKID? I'm leaning towards WKID, but I imagine that you have much more experience than I do in selecting which metric(s) to use for my purpose.

ADD REPLY • link 6 months ago by mikazon ▴ 10

1

Entering edit mode

WKID is better than KID because it takes into account difference in genome sizes. ANI is strictly calculated from WKID, so they're basically the same number, just scaled differently. Personally I like to use ANI because it's the most straightforward and simple to understand, just bear in mind that it's an estimate that could be a little misleading in some cases (particularly when the ANI varies a lot across the genome).

ADD REPLY • link 6 months ago by Brian Bushnell 20k

0

Entering edit mode

Perfect - thank you for the clarification and for the amazing tools!

ADD REPLY • link 6 months ago by mikazon ▴ 10

1

Entering edit mode

Please accept this (and other answer, you can accept multiple answers if both were useful, green check mark) to provide closure to this thread.

ADD REPLY • link 6 months ago by GenoMax 141k