Question

Finding the Refseq Soft-masked Regions

0

Entering edit mode

4.0 years ago

benformatics 3.9k

I noticed that in IGV there are lower case regions (soft-masked). I want to exclude these in an analysis and was wondering if there was an annotation for the sites that are soft-masked. Like a bed file or GTF with all the specific soft-masked locations.

I could go through and identify the lower case regions manually but before I do that maybe I'm overlooking an available data source?

IGV Refseq • 2.2k views

ADD COMMENT • link updated 4.0 years ago by vkkodali_ncbi ★ 3.7k • written 4.0 years ago by benformatics 3.9k

0

Entering edit mode

I'm looking at human data (hg38) but this is probably not super relevant to the question

ADD REPLY • link 4.0 years ago by benformatics 3.9k

0

Entering edit mode

A recent past thread about this : Repeat masked gtf files from ensembl

I don't think GTF files with masked locations are available.

ADD REPLY • link 4.0 years ago by GenoMax 141k

3

Entering edit mode

4.0 years ago

vkkodali_ncbi ★ 3.7k

If you are downloading the genome FASTA file from NCBI, the corresponding repeat masker file is also available in the same FTP path that you can use to construct a BED file. For example, you can download the current human genome data from the following FTP path: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/109.20200228/GCF_000001405.39_GRCh38.p13. The FASTA file is GCF_000001405.39_GRCh38.p13_genomic.fna.gz and the RepeatMasker file GCF_000001405.39_GRCh38.p13_rm.out.gz has the data about the masked regions. Specifically, columns 5-7 in this file are query_sequence, start_position and end_position that you can use to construct a BED file. You may have to offset the start_position by 1 to make the coordinates BED-compatible.

ADD COMMENT • link 4.0 years ago by vkkodali_ncbi ★ 3.7k

score 1 · Accepted Answer · 2020-04-21

To solve this I used the script from steve found here: find positions of a short sequence in a genome using the pattern [agct]. The square brackets are standard for python regex and allow for matching to one of any of the characters within the brackets.

The script is used to find the index of all matches from the given short sequence to a fasta file.

The input fasta in my case was Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa from Ensembl

Then I collapsed all the identified regions in R and converted their Ensembl style chromosomes (i.e. "1","MT") to UCSC format (i.e. "chr1","chrM").

Be aware that this requires a lot of memory as I'm loading the location of all the soft-masked sites (~34GB file) into R. The whole method could be simplified and sped up significantly by identifying and converting all matches for each chromosome in parallel (the current method took a little over one hour).

library(rtracklayer)
library(GenomicRanges)

sites <- import.bed('output_from_script')
sites.fix <- reduce(sites)
seqlevels(sites.fix) <- gsub(" " ,"",seqlevels(sites.fix))
seqlevelsStyle(sites.fix) <- 'UCSC'

export(sites.fixed,'soft_masked_sites_hg38.bed')

The final bed file is 135 MB.