Question: How to generate random genomic sequences that are GC- and length-matched to my input sequences?
gravatar for Xianjun
2.4 years ago by
Great Boston Area
Xianjun280 wrote:


I want to generate a list of random genomic sequences that match GC-content and length to my input DNA sequences. Note that my input sequences are not of the same length. My expected output will be the same number of random sequences that match length and GC-content to the individual input sequence.


Could anyone please suggest a proper tool for this purpose?

btw, I have looked into NullSeq, GC_compo, and BiasAway. But none of them fits my needs.



random background sequence gc • 1.6k views
ADD COMMENTlink modified 5 weeks ago by Houyu Zhang20 • written 2.4 years ago by Xianjun280

Take a look at from BBMap suite. It should produce fasta files.

ADD REPLYlink written 2.4 years ago by genomax87k

Thanks for the reply. I looked into the option of, right it can produce fasta files, but it doesn't match the length distribution and GC content.

ADD REPLYlink written 2.4 years ago by Xianjun280

You may need to write something yourself perhaps since you have a specific need.

ADD REPLYlink written 2.4 years ago by genomax87k

Sure. I thought there might already be some tools available since it's a common task when people are asked to compare to a length- and GC-matched background set, e.g. TFBS enrichment analysis.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Xianjun280
gravatar for finswimmer
2.4 years ago by
finswimmer13k wrote:

The easiest way could be to read in every sequence and shuffle the characters.

fin swimmer

ADD COMMENTlink written 2.4 years ago by finswimmer13k

Simple solutions are the best. Depending on input from OP you can move this to an answer.

ADD REPLYlink written 2.4 years ago by genomax87k

yes, that's also what I have in my mind so far. I am wondering if any "better" background (e.g. part of genomic region). I like the idea of in oPOSSUM, where they use background regions as human open chromatin regions compiled from the lowest scoring ChIP-Seq peaks in the Encode project. The short thing is their output sequences have a fixed length, not matching my input length distribution.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Xianjun280

I did something similar for fastq format, should give some ideas on how to adapt this to fasta:

ADD REPLYlink written 2.4 years ago by WouterDeCoster44k
gravatar for dariober
2.4 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

Among the many programs that I hacked together, there is MarkovSequenceGenerator which may be you can make use of. As I remember, just like the OP, I needed a null distribution of sequences matching nucleotides and di-nucleotide content. So this program shuffles sequences in a fasta file using a Markov model so that the shuffled sequences look like the original one.


java -jar MarkovSequenceGenerator-0.1.0.jar -h

usage: MarkovChainGenerator [-h] -f FASTA [-nr NRANDOM] [-l LENGTH] [-o OUTPUT] [-n NORDER]
                            [-p PCT_PSEUDOCOUNTS] [--version]

Generate random sequences from input sequence using Markov chain model

optional arguments:
  -h, --help             show this help message and exit
  -f FASTA, --fasta FASTA
                         Input fasta file to generate sequences from. Can be gzipped.
  -nr NRANDOM, --nrandom NRANDOM
                         Number of chains to generate. *Each* sequence in fastainput will be randomized this many times (default: 1)
  -l LENGTH, --length LENGTH
                         Lenght of each random sequence. Default to length of input sequence (default: -1)
  -o OUTPUT, --output OUTPUT
                         Output file. Use - to stdout (default: -)
  -n NORDER, --norder NORDER
                         Order of the Markov model. Must be greater than 0. (default: 1)
                         If the input sequence does not contain all the possible  kmers,add  the  missing  ones  at  this percentage of the total counts. E.g. -p 0.1
                         makes all the missing kmers to sum up to no more  than  10%  of  the  total  actual counts. Setting to zero might terminatethe chain with an
                         error. (default: 0.1)
ADD COMMENTlink written 2.4 years ago by dariober11k
gravatar for Houyu Zhang
5 weeks ago by
Houyu Zhang20
Houyu Zhang20 wrote:

I wrote a pure bash script for this purpose, verbose but useful. Wish to see better versions :)

This script is for generating length and GC content matched background sequence, heavily using bedtools [input bed file] [genome size] [genome fasta] [iteration times]
[input bed file]: A bed file as template, just first 3 cols are used
[genome size]: Genome size for bedtools
[genome fatsa]: Genome Fatsa file for bedtools
[iteration times]: Sequence pools for comparison, 100 is pretty good
ADD COMMENTlink written 5 weeks ago by Houyu Zhang20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 616 users visited in the last hour