Question

FASTQ File Comparison

0

3.1 years ago

biohacker_tobe ▴ 80

Hello Community,

Is there existing software or an algorithm to compare genome files, and possibly determine whether they are the same or not?

Thanks :)

sequencing genome • 1.8k views

ADD COMMENT • link 3.1 years ago by biohacker_tobe ▴ 80

1

Do a hash (md5sum) and compare the hashes, or post an example of how you want to compare. Please note that I am aware of the FASTQ format, so there is no need to share a link to it.

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

That's an interesting take on this problem. Here is an example. As you can see, both files are the same; I would just like a positive or negative answer depending on whether they are the same or not.

FASTQ file 1:

@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

FASTQ file 2:

@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

ADD REPLY • link 3.1 years ago by biohacker_tobe ▴ 80

0

If you have visual evidence like this, then using a hash may be fine.

Let me illustrate a variation. Even a single difference, e.g. a switched order of sequences, changes the hash. Here is file 1:

$ more test1.fq 
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
@SIM:1:FCX:1:15:6330:1045 1:N:0:2
TCGCACTCAACGCCCTTTTTATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

Here is file 2:

$ more test2.fq 
@SIM:1:FCX:1:15:6330:1045 1:N:0:2
TCGCACTCAACGCCCTTTTTATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

This produces a different sum even though the data is the same.

$ md5sum test1.fq
814777bc0b8d5fbbbf91f98586cda920  test1.fq
$ md5sum test2.fq
b7c2dddb3041158b1e113443e27127dd  test2.fq
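
If read order is the only thing that differs, one option is to hash the records after sorting them. Below is a minimal Python sketch of that idea, not a standard tool. It assumes uncompressed FASTQ with exactly four lines per record and no blank lines, and the function names (`parse_fastq`, `order_independent_digest`, `digest_file`) are made up for this illustration.

```python
import hashlib


def parse_fastq(lines):
    """Group an iterable of FASTQ lines into (header, seq, plus, qual) tuples.

    Assumption: exactly four lines per record, no wrapped sequences,
    no blank lines.
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return [tuple(stripped[i:i + 4]) for i in range(0, len(stripped), 4)]


def order_independent_digest(lines):
    """MD5 over the records after sorting, so read order does not matter."""
    digest = hashlib.md5()
    for record in sorted(parse_fastq(lines)):
        digest.update("\n".join(record).encode() + b"\n")
    return digest.hexdigest()


def digest_file(path):
    """Convenience wrapper for a file on disk (hypothetical helper)."""
    with open(path) as handle:
        return order_independent_digest(handle)
```

With this, `digest_file("test1.fq") == digest_file("test2.fq")` should hold for the two files above, while a plain `md5sum` differs. The sketch loads every record into memory, so it only suits files that fit in RAM.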

ADD REPLY • link 3.1 years ago by GenoMax 141k

1

$ cat test1.fq
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

$ cat test2.fq
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

$ md5sum *.fq
cf06ffb0724f3a928bbca54626824a8e  test1.fq
cf06ffb0724f3a928bbca54626824a8e  test2.fq

$ sha1sum *.fq
132ae93ffb80fbd70b69651c556cfe07156afc18  test1.fq
132ae93ffb80fbd70b69651c556cfe07156afc18  test2.fq

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

This looks awesome, I will definitely try this out :)

ADD REPLY • link 3.1 years ago by biohacker_tobe ▴ 80

0

How do you want to compare them? Are they exactly the same? Or the same sequences but in a different order?

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

I'm not sure if these files are the same. I have a directory with different FASTQ files, and basically I want to see if they are exactly the same. I was thinking of comparing sequence/quality lengths and labels...
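
For the "exactly the same" case across a whole directory, the hash suggestion above could be applied to every file at once. A rough Python sketch follows. The `*.fq` pattern, the choice of SHA-256, and the function names are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the raw bytes of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(directory, pattern="*.fq"):
    """Return lists of file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        groups[file_digest(path)].append(path.name)
    # Keep only digests shared by more than one file.
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_identical_files(".")` would list groups of duplicated files in the current directory. This only detects byte-identical copies. The same reads in a different order, or the same data compressed separately, will not match.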

ADD REPLY • link 3.1 years ago by biohacker_tobe ▴ 80

2

"if they are exactly the same"

You need to be very specific in defining your requirement. Do you think there are identical copies of the data under different file names, or do you think the same sample(s) were re-sequenced?

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

Sorry for the lack of clarity on my part... I believe it's possible that I have samples that have been re-sequenced.

ADD REPLY • link 3.1 years ago by biohacker_tobe ▴ 80

1

So basically you want to see if these are technical sequencing replicates or not.

You could align the data independently to a reference and see if you are able to call identical SNPs for the data files. Short of knowing the real experimental provenance, this is likely to be the closest you can get informatically to deciding if the data came from the same sample.
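
Once each file has been aligned and variants called (those steps are not shown here), the final comparison could look roughly like the Python sketch below. It assumes tab-separated VCF text as input, compares only position and alleles (genotypes are ignored), and uses made-up function names. What counts as "identical enough" is left open.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for every record in VCF-formatted lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and the '#'-prefixed header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several comma-separated ALT alleles.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def call_concordance(vcf_a, vcf_b):
    """Fraction of variant calls shared by both sets (Jaccard index)."""
    a, b = variant_keys(vcf_a), variant_keys(vcf_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

A value close to 1 would suggest the two files come from the same sample. Low coverage in either file will lower the score even for true replicates, so it is worth restricting the comparison to well-covered positions.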

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

I think you can still use the hash approach, but look at mash distances instead.

I think you can do it with fastqs, but I'm not 100% sure. This will tell you, to some level of accuracy, whether the genomes are very similar or the same. An actual md5sum will only work if the files are identical, as others pointed out, so a resequencing of the same sample/genome will not necessarily give you an identical md5, but a mash distance should be instructive.

If you can't use fastqs, you can definitely use contigs, so you can just assemble your data first.
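
To show the idea behind a mash distance, here is a simplified Python sketch of a bottom-k MinHash comparison. It is not Mash itself: it ignores reverse complements and does no filtering of sequencing errors, and the hash function, default `k`, and sketch size are arbitrary choices for illustration. The distance formula D = -ln(2j / (1 + j)) / k is the one Mash uses.

```python
import hashlib
import math


def kmers(sequence, k):
    """All distinct k-mers of one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def sketch(sequences, k=21, size=1000):
    """Bottom-`size` sketch: the smallest k-mer hashes seen across all sequences."""
    hashes = set()
    for seq in sequences:
        for kmer in kmers(seq.upper(), k):
            raw = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
            hashes.add(int.from_bytes(raw, "big"))
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate k-mer Jaccard similarity from two bottom-`size` sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_style_distance(j, k=21):
    """Convert a Jaccard estimate j into a distance, clamped to [0, 1]."""
    if j == 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))
```

Because the sketch depends only on the set of k-mers, the order of reads in the input does not affect it. With raw reads, sequencing errors add many spurious k-mers, so a real comparison would need to filter low-count k-mers or work from assembled contigs, as suggested above.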

ADD REPLY • link 3.1 years ago by Joe 21k