Question: How to generate random genomic sequences that are GC- and length-matched to my input sequences?
gravatar for Xianjun
5 weeks ago by
Great Boston Area
Xianjun200 wrote:


I want to generate a list of random genomic sequences that match GC-content and length to my input DNA sequences. Note that my input sequences are not of the same length. My expected output will be the same number of random sequences that match length and GC-content to the individual input sequence.


Could anyone please suggest a proper tool for this purpose?

btw, I have looked into NullSeq, GC_compo, and BiasAway. But none of them fits my needs.



random background sequence gc • 198 views
ADD COMMENTlink modified 5 weeks ago by dariober9.0k • written 5 weeks ago by Xianjun200

Take a look at from BBMap suite. It should produce fasta files.

ADD REPLYlink written 5 weeks ago by genomax46k

Thanks for the reply. I looked into the option of, right it can produce fasta files, but it doesn't match the length distribution and GC content.

ADD REPLYlink written 5 weeks ago by Xianjun200

You may need to write something yourself perhaps since you have a specific need.

ADD REPLYlink written 5 weeks ago by genomax46k

Sure. I thought there might already be some tools available since it's a common task when people are asked to compare to a length- and GC-matched background set, e.g. TFBS enrichment analysis.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Xianjun200
gravatar for finswimmer
5 weeks ago by
finswimmer1.5k wrote:

The easiest way could be to read in every sequence and shuffle the characters.

fin swimmer

ADD COMMENTlink written 5 weeks ago by finswimmer1.5k

Simple solutions are the best. Depending on input from OP you can move this to an answer.

ADD REPLYlink written 5 weeks ago by genomax46k

yes, that's also what I have in my mind so far. I am wondering if any "better" background (e.g. part of genomic region). I like the idea of in oPOSSUM, where they use background regions as human open chromatin regions compiled from the lowest scoring ChIP-Seq peaks in the Encode project. The short thing is their output sequences have a fixed length, not matching my input length distribution.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Xianjun200

I did something similar for fastq format, should give some ideas on how to adapt this to fasta:

ADD REPLYlink written 5 weeks ago by WouterDeCoster27k
gravatar for dariober
5 weeks ago by
Glasgow - UK
dariober9.0k wrote:

Among the many programs that I hacked together, there is MarkovSequenceGenerator which may be you can make use of. As I remember, just like the OP, I needed a null distribution of sequences matching nucleotides and di-nucleotide content. So this program shuffles sequences in a fasta file using a Markov model so that the shuffled sequences look like the original one.


java -jar MarkovSequenceGenerator-0.1.0.jar -h

usage: MarkovChainGenerator [-h] -f FASTA [-nr NRANDOM] [-l LENGTH] [-o OUTPUT] [-n NORDER]
                            [-p PCT_PSEUDOCOUNTS] [--version]

Generate random sequences from input sequence using Markov chain model

optional arguments:
  -h, --help             show this help message and exit
  -f FASTA, --fasta FASTA
                         Input fasta file to generate sequences from. Can be gzipped.
  -nr NRANDOM, --nrandom NRANDOM
                         Number of chains to generate. *Each* sequence in fastainput will be randomized this many times (default: 1)
  -l LENGTH, --length LENGTH
                         Lenght of each random sequence. Default to length of input sequence (default: -1)
  -o OUTPUT, --output OUTPUT
                         Output file. Use - to stdout (default: -)
  -n NORDER, --norder NORDER
                         Order of the Markov model. Must be greater than 0. (default: 1)
                         If the input sequence does not contain all the possible  kmers,add  the  missing  ones  at  this percentage of the total counts. E.g. -p 0.1
                         makes all the missing kmers to sum up to no more  than  10%  of  the  total  actual counts. Setting to zero might terminatethe chain with an
                         error. (default: 0.1)
ADD COMMENTlink written 5 weeks ago by dariober9.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 956 users visited in the last hour