Comparing Two Fastq Files Linux
2
0
Entering edit mode
6 days ago
joe_genome ▴ 10

Hello community,

I've read different methods on using linux commands such as diff to compare the content between text files, was trying to apply this methods to fastq files. I'm looking to compare the fastq.gz created by using two separate bcl conversion methods, see if they produce the same fastq files.

Steps

1. Obtain BCL files from Illumina Sequencer
2. Convert BCL files to FASTQ (fastq.gz) format with: a. Method Conversion 1 (bcl2fastq) b. Method Conversion 2 (bcl-convert)
3. Determine if files are valid with validfastq
4. Compare fastq file content

Objective: Determine if sequence and quality score lengths match between files and that there are no mismatched labels found between files.

Is there any current methods other than diff that con provide this? Should I sort the files first and then compare via diff?

genomics illumina bioinformatics • 499 views
1
Entering edit mode

It would be good to use Comparing sequence files generated by bcl-convert and bcl2fastq as title instead of Comparing two files Linux. Consider editing the title.

2
Entering edit mode
6 days ago
5heikki 9.7k

For exact matching you just hash the files:

$cat file1 aaaaa aaaaa 11111 aaaaa$ cat file2
aaaaa
aaaaa
11111
aaaab

$md5sum file* d10d7704b2f469251f21258d828e4c2c file1 8645da65516e4b632f2acc572dd7f616 file2  ADD COMMENT 0 Entering edit mode Thanks for the clarification, I wanted to go more in depth and view the content of the files themselves, see where labels and other lines are differing. ADD REPLY 0 Entering edit mode Well, then it's either diff or comm.. ADD REPLY 0 Entering edit mode Thanks :) ADD REPLY 0 Entering edit mode 5heikki OP wants to compare fastq records generated by two methods so a simple md5sum may not work since we are not assured that the order of reads would be identical in two files. ADD REPLY 0 Entering edit mode Well the order part is easy to deal with.. md5sum <(sort file1) md5sum <(sort file2)  ADD REPLY 0 Entering edit mode Can you sort fastq files like this? I have never tried it. ADD REPLY 1 Entering edit mode You can sort any txt file. Could you recover a valid fastq file post this sorting? Of course not. In case there were some weird linebreaks this of course wouldn't work (but given that it's fastq format, I think the chances of having something odd are rather low..) $ cat file1
@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTA
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC66

$cat file2 @SEQ_ID2 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTA + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC66 @SEQ_ID1 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65$ md5sum file1
cd5813859537cd1b44f4d77b81770fca  file1

$md5sum file2 ff19aaf0c31e0f2473db7c98f3a5788e file2$ md5sum <(sort file1)
724584b21d1fd044a77ab3b69c2a1505  /dev/fd/63

\$ md5sum <(sort file2)
724584b21d1fd044a77ab3b69c2a1505  /dev/fd/63

1
Entering edit mode

Good to know! This should satisfy OP's requirement then, if the files are identical in all aspects except the order. You should add this as example in your original answer.

0
Entering edit mode

Wasn't aware of this capability either, will gunzip the original fastq.gz files and apply this method, appreciate it a lot :)

0
Entering edit mode

You can do this:

md5sum <(zcat file.fq.gz | sort)

0
Entering edit mode

The following error is presented @

sort: write failed: /tmp/sortLwtZzI: No space left on device
d41d8cd98f00b204e9800998ecf8427e  /dev/fd/63


Is this because of the tmp file?

0
Entering edit mode

I was afraid of that. /tmp is generally mapped to RAM on many servers.

0
Entering edit mode

oh darn, I have a server that is quite large, so I believe I could allocate here (possibly)?

1
Entering edit mode

Either export TMPDIR=some_local_directory with enough space or use sort -T /dir_you_want_to_use. sort then should use that space for its work.

1
Entering edit mode

Also given that it's a large task it pays off to define:

-S, --buffer-size=SIZE
use SIZE for main memory buffer


and

--parallel=N
change the number of sorts run concurrently to N


e.g. -S 90% --parallel=16

0
Entering edit mode

Thanks, I changed the TMP file to larger local directory, seems to be doing the trick for the moment. I'll also take into consideration these flags as well. Appreciate all the help :)

2
Entering edit mode
6 days ago
GenoMax 99k

joe_genome : My guess is pending differences in filtering options implemented in the two packages the fastq files generated should be very similar if not identical. The order of reads may even be the same (try out a FC with both). bcl-convert is meant to be a replacement for bcl2fastq so I would be surprised if the two packages generate very different results.

I found this nice comparison chart for bcl-convert and bcl2fastq on Illumina's site.