I've added a new feature to my pyfaidx Python module which, in addition to current object-based random access to indexed Fasta files, allows the user to replace same-length portions of the Fasta file in-place on disk. As an example of the utility of this method I've written a blog post benchmarking in-place masking of Heng-Li's low-complexity regions against bedtools maskfasta. Note: I am not "slamming" current well-established Fasta masking tools, but was simply curious about the feasibility of modifying a Fasta file while maintaining its line wrapping and line breaks.
The blog post is on my site, as well as below.
Masking low-complexity regions using pyfaidx MutableFastaRecord
Masking a sequence is simply the process of changing some letters of the sequence to another character, or capitalization. All FASTA masking tools that I can find (such as bedtools maskfasta seem to read an entire file, or entire records in a file, and the perform character replacement in order from start to end of the file. This strategy requires that the list of regions (potentially specified in a BED file) are sorted in the same way as the FASTA file, or that the regions are all stored in memory. Also, the entire FASTA file must be read, and then subsequently modified and written back to disk. This round trip seems unnecessary if you can just modify the FASTA file in place. Therefore, I started thinking about the safest way to modify the sequence content of a FASTA file on-disk while maintaining the original line breaks and line wrapping length. The result is the pyfaidx MutableFastaRecord. As an example of this in-place modification of a FASTA file I decided to mask low-complexity regions of the human genome.
Heng Li's recent publication describing genomic regions with high variant calling artifacts provides a definition of low-complexity regions to filter for variant calling, along with a BED file for download. Using this bed file, the GRCv37human reference genome, and a [script] that will be included in the next version of pyfaidx, I benchmarked my in-place FASTA masking against bedtools maskfasta. Below is the main function of the script:
def mask_sequence(args): assert len(args.default_seq) == 1 fasta = Fasta(args.fasta, mutable=True) for line in args.bed: rname, start, end = bed_split(line) # fasta[rname] will return a MutableFastaRecord object if args.action == 'replace': fasta[rname][start:end] = (end - start) * args.default_seq elif args.action == 'lowercase': fasta[rname][start:end] = fasta[rname][start:end].lowercase()
As you can see,
fasta[rname][start:end] is a slice of a MutableFastaRecord object, which when called alone uses the
__getitem__ method and when assigned a value uses the
__setitem__ method. The
__getitem__ method fetches the sequence (without line breaks) from the FASTA file. The
__setitem__ method writes the sequence including the line breaks in the file, as they appear on disk. Timings for both methods:
2 minutes, 7 seconds
661 MB max memory usage
98% CPU usage
1 minute 32 seconds
121 MB max memory usage
78% CPU usage
Two things to note:
- At such small runtimes this process does not really need optimization
- The in-place masking runtime should scale with the number and length of regions to mask. For a small number of regions the operation will be almost instantaneous. For extensive masking, covering most of the file, the overhead of seeking, reading and then writing to/from the file for each region will overcome the performance gain over the streaming method.