Question: How to generate random genomic sequences that are GC- and length-matched to my input sequences?
gravatar for Xianjun
16 months ago by
Great Boston Area
Xianjun250 wrote:


I want to generate a list of random genomic sequences that match GC-content and length to my input DNA sequences. Note that my input sequences are not of the same length. My expected output will be the same number of random sequences that match length and GC-content to the individual input sequence.


Could anyone please suggest a proper tool for this purpose?

btw, I have looked into NullSeq, GC_compo, and BiasAway. But none of them fits my needs.



random background sequence gc • 916 views
ADD COMMENTlink modified 16 months ago by dariober10k • written 16 months ago by Xianjun250

Take a look at from BBMap suite. It should produce fasta files.

ADD REPLYlink written 16 months ago by genomax69k

Thanks for the reply. I looked into the option of, right it can produce fasta files, but it doesn't match the length distribution and GC content.

ADD REPLYlink written 16 months ago by Xianjun250

You may need to write something yourself perhaps since you have a specific need.

ADD REPLYlink written 16 months ago by genomax69k

Sure. I thought there might already be some tools available since it's a common task when people are asked to compare to a length- and GC-matched background set, e.g. TFBS enrichment analysis.

ADD REPLYlink modified 16 months ago • written 16 months ago by Xianjun250
gravatar for finswimmer
16 months ago by
finswimmer11k wrote:

The easiest way could be to read in every sequence and shuffle the characters.

fin swimmer

ADD COMMENTlink written 16 months ago by finswimmer11k

Simple solutions are the best. Depending on input from OP you can move this to an answer.

ADD REPLYlink written 16 months ago by genomax69k

yes, that's also what I have in my mind so far. I am wondering if any "better" background (e.g. part of genomic region). I like the idea of in oPOSSUM, where they use background regions as human open chromatin regions compiled from the lowest scoring ChIP-Seq peaks in the Encode project. The short thing is their output sequences have a fixed length, not matching my input length distribution.

ADD REPLYlink modified 16 months ago • written 16 months ago by Xianjun250

I did something similar for fastq format, should give some ideas on how to adapt this to fasta:

ADD REPLYlink written 16 months ago by WouterDeCoster40k
gravatar for dariober
16 months ago by
WCIP | Glasgow | UK
dariober10k wrote:

Among the many programs that I hacked together, there is MarkovSequenceGenerator which may be you can make use of. As I remember, just like the OP, I needed a null distribution of sequences matching nucleotides and di-nucleotide content. So this program shuffles sequences in a fasta file using a Markov model so that the shuffled sequences look like the original one.


java -jar MarkovSequenceGenerator-0.1.0.jar -h

usage: MarkovChainGenerator [-h] -f FASTA [-nr NRANDOM] [-l LENGTH] [-o OUTPUT] [-n NORDER]
                            [-p PCT_PSEUDOCOUNTS] [--version]

Generate random sequences from input sequence using Markov chain model

optional arguments:
  -h, --help             show this help message and exit
  -f FASTA, --fasta FASTA
                         Input fasta file to generate sequences from. Can be gzipped.
  -nr NRANDOM, --nrandom NRANDOM
                         Number of chains to generate. *Each* sequence in fastainput will be randomized this many times (default: 1)
  -l LENGTH, --length LENGTH
                         Lenght of each random sequence. Default to length of input sequence (default: -1)
  -o OUTPUT, --output OUTPUT
                         Output file. Use - to stdout (default: -)
  -n NORDER, --norder NORDER
                         Order of the Markov model. Must be greater than 0. (default: 1)
                         If the input sequence does not contain all the possible  kmers,add  the  missing  ones  at  this percentage of the total counts. E.g. -p 0.1
                         makes all the missing kmers to sum up to no more  than  10%  of  the  total  actual counts. Setting to zero might terminatethe chain with an
                         error. (default: 0.1)
ADD COMMENTlink written 16 months ago by dariober10k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1409 users visited in the last hour