How To Efficiently Sort A Fastq File By Entry Id?
4
2
Entering edit mode
12.4 years ago
User 9996 ▴ 840

How can I efficiently sort a FASTQ file by the entry ID? e.g. by the read name, if it's illumina data? I'd prefer to use a Python library or a command line C utility for this, but not awk/sed if possible. as far as I can tell,standard libraries like BioPython can't do this.

EDIT: I've been trying to make use of the solution posted here but it does not seem to work. It still thinks that @42EBKAAXX090828:6:73:204:1871/2 comes before @42EBKAAXX090828:6:1:270:128/2, for example.

Example:

$ cat test.fq | perl mergelines.pl | sort --stable -k1,1 | perl splitlines.pl > sorted
$ head sorted
@42EBKAAXX090828:6:100:1699:328/2
TTATTGCTTAATATTTATCACTGCTGAGTCCCGTGGGGGTGTGGCTAAAAGAGGAGGGGTCTAGCTTTTTTTTTTG
+
-557459:<8<:7:;:798=<:=:;;8;8:;;58=77::8####################################
@42EBKAAXX090828:6:10:1077:1883/2
GGCCTTATAATTAATTAGAGGTAAAATTACACATGCAAACCTCCATAGACCGGTGTAAAATCCCTTAAACATTTAC
+

as you can see it thinks that @42EBKAAXX090828:6:100:1699:328/2 comes before @42EBKAAXX090828:6:10:1077:1883/2 which is clearly wrong, since 10 is actually smaller than 100.

Does anyone know how to fix this? I can't tell it when the numeric part begins. That trick doesn't work.

If anyone else has efficient suggestions on how to do this I would be very interested to know.

Thanks

fastq illumina sort next-gen-sequencing alignment • 21k views
ADD COMMENT
2
Entering edit mode

In case you want to either: (1) pair it with the FASTQ file of another library, e.g. with fastq of the other mate in paired-end sequencing, or (2) to pull out all unmapped reads relative to a BAM file, see http://biostar.stackexchange.com/questions/15049/getting-unmapped-reads-comparing-fastq-to-bam

ADD REPLY
0
Entering edit mode

Just out of curiousity - why would one need to sort a fastq file by sequence ID? Aren't sequencer assigned IDs essentially arbitrary?

ADD REPLY
0
Entering edit mode

"@42EBKAAXX090828:6:100:1699:328/2" does come before "@42EBKAAXX090828:6:10:1077:1883/2" because as strings "0" comes before ":". If you wanted to sort on the fields as integers you'd have to split at the ":" and use string/integer sorting as appropriate. That part is possible in Python but specific to the ID scheme.

Also if you want random access to a (large) FASTQ file, you could try Biopython's Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions.

ADD REPLY
10
Entering edit mode
9.9 years ago

This question is now very old but since I'm facing the same problem... Here's my solution, input and output compressed:

zcat reads.fq.gz \
  | paste - - - - \
  | sort -k1,1 -S 3G \
  | tr '\t' '\n' \
  | gzip > sorted/reads.fq.gz

On my system it takes about 50min for a file of ~70M reads (3.9 Gb compressed)

ADD COMMENT
0
Entering edit mode

Thanks for your answer, that worked great.

ADD REPLY
0
Entering edit mode

This is a very fast and good solution, the only problem is in sorting since it sorts string gives a higher rank to @SRR823377.10 for example comparing to @SRR823377.2 . Do yo have a suggestion to fix it ?

ADD REPLY
0
Entering edit mode

Hi- If you want xxxx.10 to come after xxxx.2, use the --V/--version-sort option in the sort command.

ADD REPLY
0
Entering edit mode

Hi,

It's been a while that someone updated this post but I'm looking to find help regarding the above one liner I used to sort my barcodes fastq file. When I ran the linux command on my barcodes.fastq file, the later part of the header "1:N:0:1" has been moved to next line and I'm not sure why. Please find the example sorted reads here :

@M00189:187:000000000-CK2YF:1:1101:10001:13101
1:N:0:1
CCTCGTTCGACT
+
CCCCCGGGGGGG
@M00189:187:000000000-CK2YF:1:1101:10001:13222
1:N:0:1
CCAGTGTATGCA
+
CCCCCGGGGGGG
@M00189:187:000000000-CK2YF:1:1101:10001:16772
1:N:0:1
TCCGACACAATT
+
CCCCCGGGGGGG

I would appreciate any help or any other way to sort fastq file.

Thanks,
Chetana

ADD REPLY
0
Entering edit mode

Hi Chetana- It may be that there is a tab instead of a backspace between the read name (@M00189:187:000000000-CK2YF:1:1101:10001:13101) and the 1:N:0:1 part. You can check with e.g. zcat my.fastq.gz | grep -P '\t' | head (which should print nothing if there is no tab).

If you do a have a tab in the read header, you could edit the script above by using paste -d'|' - - - - and tr '|' '\n' (assuming the | does not appear anywhere else, which should be the case)

Failing that, upload a small sample of your file somewhere where we can have a look...

ADD REPLY
0
Entering edit mode

Hi dariober,

Thank you so much the quick fix, it worked. The subseq function from seqtk is adding tabs in the read name for reason. But this would fix it when someone's trying to sort their fastq after the subseq. I appreciate your help.

Chetana

ADD REPLY
0
Entering edit mode

For my a minor modification was needed to avoid this

tr: extra operand ' '
Try 'tr --help' for more information.
paste:  : No such file or directory
sort: cannot read:  : No such file or directory

This worked for me

zcat /1.fastq.gz | paste - - - - | sort -k1,1 -S 3G | tr '\t' '\n' | gzip > 2.fastq.gz

Thank you

ADD REPLY
0
Entering edit mode

Thanks a lot! It worked really fast. But the issue in my case is it gives the deflines as @SRR5002342.10000008/1, @SRR5002342.10000011/1. I assume it creates its own number after the (.). what if I want my previous defline numbers (for example @SRR5002342.9966534/1) and sort it according?

ADD REPLY
0
Entering edit mode

sort without any options uses dictionary sorting, where 1000 comes before 2 because 1 < 2. When dealing with alphanumeric strings, use sort -V

$ echo "@SRR5002342.10000008/1\n@SRR5002342.10000011/1\n@SRR5002342.9966534/1"
@SRR5002342.10000008/1
@SRR5002342.10000011/1
@SRR5002342.9966534/1

$ echo "@SRR5002342.10000008/1\n@SRR5002342.10000011/1\n@SRR5002342.9966534/1" | sort
@SRR5002342.10000008/1
@SRR5002342.10000011/1
@SRR5002342.9966534/1

$ echo "@SRR5002342.10000008/1\n@SRR5002342.10000011/1\n@SRR5002342.9966534/1" | sort -V
@SRR5002342.9966534/1
@SRR5002342.10000008/1
@SRR5002342.10000011/1
ADD REPLY
2
Entering edit mode
12.4 years ago

It depends on how big your fastq file is. If it's pretty big, I would go with that method you posted. One caveat about the method you posted is that it's assuming the fastq file is in 4 lines. Some fastq files separate the sequence or quality scores over several lines.

If you have a smaller fastq file, you can try this in python:

import sys, operator

inFile = open(sys.argv[1],'r')

header = ''
data = ''

allData = []

for line in inFile:
    if line[0] == "@":
        if data != '':
            allData.append((header,data))

        header = line.strip()[1:]
        data = line
    else:
        data += line

allData.append((header,data))
allData.sort(key = operator.itemgetter(0))

for item in allData:
    print item[1]

Save as yourName.py. Usage:

python yourName.py sequences.fastq > sorted.fastq
ADD COMMENT
0
Entering edit mode

Hi. Can you estimate the time necessary to process a 4Gb file?

ADD REPLY
1
Entering edit mode
2.1 years ago
GenoMax 141k

sortbyname.sh from BBMap suite should be performant.

Description:  Sorts reads by name or other keys such as length,
quality, mapping position, flowcell coordinates, or taxonomy.
Writes temp files if memory is exceeded.

Usage:   sortbyname.sh in=<file> out=<file>

Input may be fasta, fastq, or sam, compressed or uncompressed.
ADD COMMENT
0
Entering edit mode
2.1 years ago
Ram 43k

Adding an answer to keep things updated

See: https://github.com/dcjones/fastq-tools

Unlike what it says, v0.8.2 doesn't seem to work on gzipped FASTQs, but it's fast.

ADD COMMENT

Login before adding your answer.

Traffic: 2017 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6