How to generate random genomic sequences that are GC- and length-matched to my input sequences?
3
0
Entering edit mode
6.1 years ago
Xianjun ▴ 310

Hi,

I want to generate a list of random genomic sequences that match GC-content and length to my input DNA sequences. Note that my input sequences are not of the same length. My expected output will be the same number of random sequences that match length and GC-content to the individual input sequence.

>seq1
AACAATGAGTTGTCATTTTATCCAAATCTAAAAAAAAACATACATACAGTATCAAGTTCTGGTTAAAGTAGTAGATGTACTTTACGTGGTACATAAGAGTATCTATACAGTTTGTGGG
>seq2
atacaacctgaaaaacagagagggaaaaataattaagaaaaatgaatagaagcttgggacatccttaatgagagtaccagaggagatgtgagagagaaagaagcagaaaaaaagttccaagcagaaaaaatgtccaagaaataatagctgaaaagctcccaaatttgctgaaaaatgttaacctacacattcaagaagcccaaaaagctctacacaaagagacacacctagacacacacacctggacacacacacctgga
>seq3
cacctaagctggagtgcagtggtgcaattacagcttactgcagcctcaatctcccaggcttaagggatcctcccatgtagctgagactacaggcatgagccactatgtccagctaatttttaaattttttgtagagacagggtctcgctaccttgaacgggctgaccttgaactcctgggctctggtggccctcctgtgttgacctcccaaagcattgggattacaggcatgagccactgcacccagccTAGAAGCTCTGTTCATATTTATTT
>seq4
CATGAGGCCCAGTCTGTGAACAGAGACCAGGTCTAACCCCTTCTTCCAGGAAAGCCTCGTAGGGCCTTCTGGCCAAGAGGCCACGagtggtgaagactgcagactctgaaatcagaaatacctgggctccactgtcagcatggcagctgaggaagagtgaaaattcctctaagttcttttagaagtcccagcctccccatgtaactggggaACTGATGGGAGGAGCAGAGCTGTCTGTGCACATAAGAAGTTCTCAGTAAATGGAGACAGTTACTATTTCTGTTATTATTGAATTTGAACAAATTCCCTGGGTATGTGTGGGGGGACACTTCAGGTGAAAACACGCCCCTCCTCCCCTGGTGCGGGGGCCTGTGCTGCCACCCTCTGGAAGCCTGCAGAGGGGCAGGGAAAACAGACCCTGAACAAAAGTGTGCACCCAGTGAGGAGGTGCAAGGGCACAAAGGTGGCACCAAGTGCCTCAAGGAGAGGCTGAAACGCGGCCTGGGGACCTCGCAGTGGTCTGGTCATATAGGC
>seq5
GGCATGTGGTGTCAGCAGAGGTGCCTCAAGGATAGAGTGAGTCCAGAGTCTAGAAAGGAGCAGATCACCAGGCTCTGGGAAGAGCACAGCATGGGTGCACACACTGCTCTACCCAGCATGGCTGCCGACCCAAAGACAGCAAAGCCAAGAAGGACACACAAGCGTGGCCAGATGCAGCCCTGTGAGGAAACTTACCCAAGAACGGGACGATGGGCTTGAGAAACCatccatctacaaggatggcgtttgctgcagcaatgtttataataaattgtgggaaactgtgaactgcctaaatgtctcacaataggaacaaattagtgcaccacaccatgaaactctctacagcTCCTGAGTTACAGAACGACAGTATAATACTA

Could anyone please suggest a proper tool for this purpose?

btw, I have looked into NullSeq, GC_compo, and BiasAway. But none of them fits my needs.

Thanks,

-Xianjun

sequence GC random background • 4.4k views
ADD COMMENT
1
Entering edit mode

Take a look at randomreads.sh from BBMap suite. It should produce fasta files.

ADD REPLY
0
Entering edit mode

Thanks for the reply. I looked into the option of randomreads.sh, right it can produce fasta files, but it doesn't match the length distribution and GC content.

ADD REPLY
0
Entering edit mode

You may need to write something yourself perhaps since you have a specific need.

ADD REPLY
0
Entering edit mode

Sure. I thought there might already be some tools available since it's a common task when people are asked to compare to a length- and GC-matched background set, e.g. TFBS enrichment analysis.

ADD REPLY
3
Entering edit mode
6.1 years ago

The easiest way could be to read in every sequence and shuffle the characters.

fin swimmer

ADD COMMENT
0
Entering edit mode

Simple solutions are the best. Depending on input from OP you can move this to an answer.

ADD REPLY
0
Entering edit mode

yes, that's also what I have in my mind so far. I am wondering if any "better" background (e.g. part of genomic region). I like the idea of http://opossum.cisreg.ca/GC_compo/ in oPOSSUM, where they use background regions as human open chromatin regions compiled from the lowest scoring ChIP-Seq peaks in the Encode project. The short thing is their output sequences have a fixed length, not matching my input length distribution.

ADD REPLY
0
Entering edit mode

I did something similar for fastq format, should give some ideas on how to adapt this to fasta: https://github.com/wdecoster/fastq-anonymous

ADD REPLY
1
Entering edit mode
6.1 years ago

Among the many programs that I hacked together, there is MarkovSequenceGenerator which may be you can make use of. As I remember, just like the OP, I needed a null distribution of sequences matching nucleotides and di-nucleotide content. So this program shuffles sequences in a fasta file using a Markov model so that the shuffled sequences look like the original one.

wget https://github.com/dariober/Java-cafe/releases/download/v0.1.0/MarkovSequenceGenerator-0.1.0.jar

java -jar MarkovSequenceGenerator-0.1.0.jar -h

usage: MarkovChainGenerator [-h] -f FASTA [-nr NRANDOM] [-l LENGTH] [-o OUTPUT] [-n NORDER]
                            [-p PCT_PSEUDOCOUNTS] [--version]

DESCRIPTION
Generate random sequences from input sequence using Markov chain model

optional arguments:
  -h, --help             show this help message and exit
  -f FASTA, --fasta FASTA
                         Input fasta file to generate sequences from. Can be gzipped.
  -nr NRANDOM, --nrandom NRANDOM
                         Number of chains to generate. *Each* sequence in fastainput will be randomized this many times (default: 1)
  -l LENGTH, --length LENGTH
                         Lenght of each random sequence. Default to length of input sequence (default: -1)
  -o OUTPUT, --output OUTPUT
                         Output file. Use - to stdout (default: -)
  -n NORDER, --norder NORDER
                         Order of the Markov model. Must be greater than 0. (default: 1)
  -p PCT_PSEUDOCOUNTS, --pct_pseudocounts PCT_PSEUDOCOUNTS
                         If the input sequence does not contain all the possible  kmers,add  the  missing  ones  at  this percentage of the total counts. E.g. -p 0.1
                         makes all the missing kmers to sum up to no more  than  10%  of  the  total  actual counts. Setting to zero might terminatethe chain with an
                         error. (default: 0.1)
ADD COMMENT
0
Entering edit mode
3.8 years ago
Hughie ▴ 30

I wrote a pure bash script for this purpose, verbose but useful. Wish to see better versions :)

Negetive_sequence_matched_length_GC.sh

This script is for generating length and GC content matched background sequence, heavily using bedtools
Negetive_sequence_matched_length_GC.sh [input bed file] [genome size] [genome fasta] [iteration times]
[input bed file]: A bed file as template, just first 3 cols are used
[genome size]: Genome size for bedtools
[genome fatsa]: Genome Fatsa file for bedtools
[iteration times]: Sequence pools for comparison, 100 is pretty good
ADD COMMENT

Login before adding your answer.

Traffic: 1884 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6