Question: Code golf: detecting homopolymers of length N in the (human) genome
2
gravatar for WouterDeCoster
16 months ago by
Belgium
WouterDeCoster44k wrote:

How would you feel about a code golf? Give us your pretiest, shortest, quickest pieces of code! Especially languages that nobody else has posted are welcome...

Objective: create a bed file of all homopolymers (repeats of the same nucleotides) of minimally length N based on the (human or other) reference genome.

The input is a fasta file, for example the human genome

Expected output example:

chr1    11540   11546
chr1    14908   14913
chr1    15468   15473
chr1    16318   16323
chr1    16505   16511
chr1    19735   19741
chr1    20316   20321

A useful benchmark set would be the human chromosome 22. My python code (see below), searching for 5-mers or longer, finds 244503 hits. The first 10 lines are:

chr22   16050521    16050526
chr22   16050548    16050553
chr22   16050570    16050575
chr22   16050578    16050583
chr22   16050679    16050684
chr22   16050835    16050840
chr22   16050932    16050937
chr22   16051192    16051198
chr22   16051303    16051310
chr22   16051311    16051317
ADD COMMENTlink modified 16 months ago • written 16 months ago by WouterDeCoster44k

DUPLICATE ! DUPLICATE ! DUPLICATE ! How to extract all the simple repeats from the hg19 reference genome

:-D

ADD REPLYlink written 16 months ago by Pierre Lindenbaum130k

We're waiting for your html solution Pierre

ADD REPLYlink written 16 months ago by WouterDeCoster44k
4
gravatar for Kevin Blighe
16 months ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

AWK solution

NB - this is a memory hog. It reads the entire FASTA file into memory. It could be designed to be more memory efficient, but works pretty quick on a cluster environment, and is also fine for smaller FASTA files on a laptop/desktop computer.

1,set desired minimum homopolymer length as shell variable

HOMOPOLYMER_LENGTH=10

2, determine homopolymers

This first 'flattens' the FASTA file and converts everything to upper case to avoid issues with soft- / repeat-masked sequence

awk '/^>/{if (NR==1) {print $0} else if (NR>1) {print "\n"$0}}; !/^>/ {printf toupper($0)}' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna | awk -v len="${HOMOPOLYMER_LENGTH}" -F '' '/^>/{gsub("^>", "", $0); chr=$0}; !/^>/ {base=$1; start=1; end=1; for (i=2; i<=NF+1; i++) {if (base==$(i)) {end=i} else if (base!=$(i)) {if (end-start>=len-1) {print chr"\t"start"\t"end"\t"base"\t"(end-start)+1}; start=i; end=i; base=$(i)}}}'

chr1    1   10000   N   10000
chr1    28589   28603   T   15
chr1    31720   31733   A   14
chr1    33450   33464   A   15
chr1    33531   33541   T   11
chr1    36352   36364   A   13
chr1    37241   37250   T   10
chr1    40640   40658   T   19
chr1    43797   43812   A   16
chr1    44174   44183   A   10
chr1    46403   46416   T   14
chr1    47319   47328   A   10
chr1    51865   51877   A   13
chr1    61351   61361   A   11
chr1    71176   71186   A   11
chr1    71847   71861   T   15
chr1    73843   73856   T   14
chr1    77175   77195   A   21
chr1    82134   82154   A   21
chr1    91201   91213   A   13

The extra 4th and 5th columns are the homopolymer base and its length, respectively.

Kevin

ADD COMMENTlink modified 15 months ago • written 16 months ago by Kevin Blighe65k
2
gravatar for Devon Ryan
16 months ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

This won't win any awards for brevity, but I'd already written it for other purposes. If it helps, the only bit of code it uses that I didn't write myself is argparse.

ADD COMMENTlink modified 16 months ago • written 16 months ago by Devon Ryan96k
2
gravatar for ATpoint
16 months ago by
ATpoint38k
Germany
ATpoint38k wrote:

Some R using Biostrings and BSgenomes, see on Git:

https://github.com/ATpoint/miscR/blob/master/CheckSequence.R

ADD COMMENTlink modified 16 months ago • written 16 months ago by ATpoint38k
2
gravatar for manuel.belmadani
16 months ago by
Canada
manuel.belmadani1.2k wrote:

I have a solution part in Go (because it does it really efficiently), part in Bash (for the golfing):

package main
import "fmt"
import "index/suffixarray"
import "io/ioutil"
import "regexp"
import "os"

func main() {
     M := os.Args[1]
     r := "[aA]{"+M+",}|[cC]{"+M+",}|[gG]{"+M+",}|[tT]{"+M+",}"
     R, _ := regexp.Compile(r)
     b, _ :=  ioutil.ReadAll(os.Stdin)
     index := suffixarray.New(b)
     fmt.Println(index.FindAllIndex( R, -1))        
}

The Go standard library includes a suffix array, which does all the heavy lifting for us.

Then you can do this for each chromosome (maybe in parallel for the whole genome, why not.) The chromosome is input by the standard input, and 5 in this case is the minimum size (called M in the Go code.)

Using chr22:

$ time tail -n+2 chr22.fa | ./homopolymer 5 | sed -e 's|\] \[|\n|g' | tr -d "[" | tr -d "]"  | sed 's|^|chr22\t|g' > homopolymer.bed


$ head homopolymer.bed 
chr22   16371531 16371536
chr22   16371581 16371586
chr22   16371589 16371594
chr22   16371692 16371697
chr22   16371851 16371856
chr22   16371950 16371955
chr22   16372215 16372221
chr22   16372329 16372336
chr22   16372337 16372343
chr22   16372418 16372425

Timing:

real    1m11.996s
user    1m12.240s
sys     0m0.408s
ADD COMMENTlink modified 16 months ago • written 16 months ago by manuel.belmadani1.2k
2
gravatar for Pierre Lindenbaum
16 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

C solution:

$ time ./biostar379454 chr22.fa 5
(...)
real    0m2.439s
user    0m0.475s
sys 0m0.538s

ADD COMMENTlink written 16 months ago by Pierre Lindenbaum130k

I suspect this will be faster by not mmaping the file but instead reading in block sized chunks.

ADD REPLYlink written 16 months ago by Devon Ryan96k
1
gravatar for WouterDeCoster
16 months ago by
Belgium
WouterDeCoster44k wrote:

I'll start with a Python snippet I had lying around, but can it be improved?

As input it takes just a fasta file (sys.argv[1]). This example searches for homopolymers of length 5 and up.

import sys
import re
from Bio import SeqIO

for nucl in ['A','C','T','G']:
    pattern = re.compile(nucl + "{5,}")
    for record in SeqIO.parse(sys.argv[1], "fasta"):
        for match in pattern.finditer(str(record.seq).upper()):
            print("\t".join([record.id, str(match.start()), str(match.end())]))
ADD COMMENTlink modified 16 months ago • written 16 months ago by WouterDeCoster44k
1
gravatar for Damian Kao
16 months ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Takes in fasta file and a second parameter for homopolymer length.

It streams through each line to find homopolymers. Outputs chromosome, start, end, homopolymer base, length of homopolymer.

I tested it out on this fasta file:

>A
AGTCAAAA
GGGGTTTTCCCC
>B
AGTCCCCCTTTTAAAA
GGGGTTTTCCCC
AAAATTTT

And it outputs:

A       4       8       A       4
A       8       12      G       4
A       12      16      T       4
A       16      20      C       4
B       3       8       C       5
B       8       12      T       4
B       12      16      A       4
B       16      20      G       4
B       20      24      T       4
B       24      28      C       4
B       28      32      A       4
B       32      36      T       4

Here's the script

import sys

inFile = open(sys.argv[1],'r')
l = int(sys.argv[2])
chrome = base = start = end = -1

for line in inFile:
    if line[0] == '>':
        if end - start > l:
            print('\t'.join(map(str,[chrome,start,end,base,end - start])))
        chrome = line.strip().split()[0][1:]
        base = -1
        start = end = 0
    else:
        for b in line.strip():
            if b != base:
                if end - start > l:
                    print('\t'.join(map(str,[chrome,start,end,base,end - start])))
                start = end
                base = b
            end += 1

if end - start > l:
    print('\t'.join(map(str,[chrome,start,end,base,end - start])))
ADD COMMENTlink modified 16 months ago • written 16 months ago by Damian Kao15k
1
gravatar for zx8754
16 months ago by
zx87549.6k
London
zx87549.6k wrote:

Using R, there must be a Biostrings-way of doing this, I am just using rle:

library(Biostrings)

N = 10

# http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz
x <- readBStringSet("../chr22.fa")
xRLE <- rle(unlist(strsplit(toupper(toString(x)), "")))

pos <- cumsum(c(1, xRLE$lengths))
ix <- which(xRLE$lengths >= N & xRLE$values %in% c("A", "T", "C", "G"))

res <- data.frame(chrom = "chr22",
                  start = pos[ ix ],
                  end = pos[ ix ] + xRLE$lengths[ ix ] - 1)

head(res)
#   chrom    start      end
# 1 chr22 16069784 16069795
# 2 chr22 16072433 16072447
# 3 chr22 16082599 16082608
# 4 chr22 16085351 16085376
# 5 chr22 16085557 16085568
# 6 chr22 16096059 16096068

Let's test:

subseq(x, 16069784, 16069795)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    12 aaaaaaaaaaaa                 chr22
subseq(x, 16072433, 16072447)
# A BStringSet instance of length 1
# width seq                          names               
# [1]    15 AAAAAAAAAAAAAAA              chr22
ADD COMMENTlink written 16 months ago by zx87549.6k

Yep, my Biostrings solution see above. Edited the original script I posted (wrote like > 2 years ago as a beginner), now made it like 10 times shorter and parallelized.

ADD REPLYlink modified 16 months ago • written 16 months ago by ATpoint38k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1185 users visited in the last hour