Question

How to convert hard-masked genome to soft-masked?

1

Entering edit mode

13 months ago

BioinfoBee • 0

Hello All,

I was curious if anyone is aware of any tools or approach that can convert hard-masked genome to soft-masked format. For example,

genome: ....ATGCATGCATGC......

conversion required: ....ATGCNNNNATGC...... to ....ATGCatgcATGC......

Regards,
B

repeatmasking repeats • 1.5k views

ADD COMMENT • link updated 6 months ago by Dr.Animo ▴ 130 • written 13 months ago by BioinfoBee • 0

1

Entering edit mode

6 months ago

Dr.Animo ▴ 130

Hi, You can also use the following Python script instead of seqkit to get hardmasked genome coordinates. It is faster than seqkit and then use bedtools as mention rpolicastro

from Bio import SeqIO

# Input and output file paths
input_fasta_file = "hard_masked_genome.fasta"
output_bed_file = "hard_masked_regions.bed"

# Function to write a BED record
def write_bed_record(file, chrom, start, end, name=".", score="."):
    file.write(f"{chrom}\t{start}\t{end}\t{name}\t{score}\n")

# Open the input FASTA file and output BED file
with open(output_bed_file, "w") as bed_file:

    for record in SeqIO.parse(input_fasta_file, "fasta"):
        chrom = record.id  # Chromosome or contig name
        seq = str(record.seq)


        mask_start = None
        mask_end = None

        for i, base in enumerate(seq):
            if base == "N":
                # "N" indicates masked regions
                if mask_start is None:
                    mask_start = i
                mask_end = i
            elif mask_start is not None:

                write_bed_record(bed_file, chrom, mask_start, mask_end + 1)
                mask_start = None
                mask_end = None

        # Check if a masked region ends at the end of the sequence
        if mask_start is not None:
            write_bed_record(bed_file, chrom, mask_start, mask_end + 1)

ADD COMMENT • link 6 months ago by Dr.Animo ▴ 130

score 4 · Accepted Answer · 2023-03-06

4

Entering edit mode

13 months ago

rpolicastro 13k

I would first get the coordinates of Ns in the hard-masked genome and output them as a bed file. Here's an example using seqkit.

seqkit locate --bed -rPp "N+" hardmasked.fasta > N_coords.bed

You can then use bedtools to soft mask the non-masked genome.

bedtools maskfasta -soft -fi unmasked.fasta -bed N_coords.bed -fo softmasked.fasta

ADD COMMENT • link 13 months ago by rpolicastro 13k

0

Entering edit mode

rpolicastro Thank you. This works like a magic. Also, as I am dealing with large genome, is there a way to increase the speed of seqkit?. I tried increasing number of threads (-j/--threads) in seqkit but it doesn't seems to work. Kindly suggest!

Regards, B

ADD REPLY • link 13 months ago by BioinfoBee • 0

1

Entering edit mode

Besides increasing the number of threads further, the speed of the operation will likely remain slow with large file sizes and a regular expression. The author of the tool frequents biostars, so perhaps they have an idea.

ADD REPLY • link 13 months ago by rpolicastro 13k

2

Entering edit mode

I think when you use -j THREAD, the job will be paralleled only if you have a lot of files to work with.

So if you only have 1 file, it will not be faster.

ADD REPLY • link 13 months ago by Zhitian Wu ▴ 60