Question

How to extract repeat element coordinates from an Ensembl genome?

0

Entering edit mode

18 months ago

cchen635 • 0

I am trying to extract the repeat element coordinates, or bed specifically, from an Ensembl soft-masked fasta with format:

>1 dna_rm:primary_assembly primary_assembly:bTaeGut1_v1.p:1:1:114557415:1 REF
AAGCCAGCCATATCAGTATCCCAGCCGCGCAGGATCTGAGCGCCACCCAGCAATGGCCAG
CCACTATGTGTGCCCCTGTATCACTGGAATTCAAGGCCCACCACCCTGTTCTCAGCCATA

I know it is available via Ensembl API but my computer is hard to handle the whole dataset. Can anyone provide a code or guide me to a possible answer?

Ensembl repeat-element genome • 724 views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 18 months ago by cchen635 • 0

Ram · Answer 1 · 2022-12-08

If you don't want to download the whole data at once you can stream fastq files from ensembl directly and process.

for chrom in $(echo $(seq 1 22)"\nX\nY\nMT"); do  # loop over chromosomes
    wget -O- -q "https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.chromosome.$chrom.fa.gz" | # stream the fasta files
    zcat | # Unzip the fasta files
    sed 1d |  # Skip the fasta id line
    grep -Ebo '[[:lower:]]+' | # Capture the sm group -E for expanded regex -b for start character position -o for matches only
    {IFS=":"; while read start seq; do # Split the grep output
        len=$(echo "$seq" | wc -c);  # get the length of the sequence
        echo "$chrom\t$start\t$(( start + len ))"; # print out the bed
    done};
done > sm.bed # write to file

Ram · Answer 2 · 2022-12-08

Here is the python script I use, search_fasta_for_regex.py:

"""
Search a fasta file for a given
regex, and output matches coordinates
in BED format
"""

import sys
import re
from Bio import SeqIO

in_fasta = sys.argv[1]
regex = sys.argv[2]
out_bed = sys.argv[3]


regex = re.compile(regex)
with open(out_bed, 'w') as fo:
  for rec in SeqIO.parse(in_fasta, 'fasta'):
    for m in regex.finditer(str(rec.seq)):
      l = [ rec.id, m.start(), m.start() + len(m.group()) ]
      l = [str(x) for x in l]
      print('\t'.join(l), file=fo)

I then run it like this: python search_fasta_for_regex.py in.fasta "[atgcn]+" out.bed.