Question

How To Scramble A Sequence Using An Existing Script Or A Python Method?

3

Entering edit mode

12.9 years ago

Geparada ★ 1.5k

Hi!

I made a splice junction library to map the bowtie unmapped reads (to genome) like this:

chr167000051+67091529 ccaccatgatggaaggattgaaaaaacgta 30
chr167091593+67098752 ggacactgattctacaggttcaccagatag 30
chr167098777+67101626 gatagagatggaattcagcccagcccacac 30
chr167101698+67105459 ggaaaaaaagtttcgaagaaaagcaatggg 30
chr167105516+67108492 gattgggaaagatataactcacctgagctg 30
chr167108547+67109226 ccgaggaacccggctctaccaaaggaaagc 30


#It's a CSV file

Now we want to make a negative control to know if the reads aligns to the library more than expected to do by chance alone. To do that, we want to scramble the sequences at second column of the CSV generated.

I'm learning python and programming in general, and I made the script to generate the splice junction library by myself. So, you can help me telling me a python tip to scramble a string or letting me know about some tool to scramble a column of a CSV (or sequences of a multifasta file).

Bash tips using AWK, sed or wherever are also welcome.

Thanks a lot for your help!!!

using random.shuffle I get:

chr167000051+67091529 cgacaaaagtaacggactttaaaaggactg 30
chr167091593+67098752 ggcaaactattggtctataaatagccccgg 30
chr167098777+67101626 accgtgctcaatgaagacaaccgcgtaacg 30
chr167101698+67105459 atggttacggaacaaaaagaaaggaaaggt 30
chr167105516+67108492 ttaataggagaagcacagctcttagatcgg 30
chr167108547+67109226 gaaggcggcaaaccctacaatgacgccgca 30

thanks!

python mapping sequence next-gen sequencing • 7.1k views

ADD COMMENT • link updated 12.9 years ago by Martin A Hansen 3.0k • written 12.9 years ago by Geparada ★ 1.5k

Ram · Answer 1 · 2011-05-31

4

Entering edit mode

12.9 years ago

brentp 24k

This is mostly the same as @Michael Schuberts answer, but if you want to retain the exact nucleotide frequency of the existing sequence, you can use random.shuffle for your file above:

import random
import sys

for line in open(sys.argv[1]):
    toks = line.strip().split()
    toks[1] = list(toks[1])
    random.shuffle(toks[1])
    toks[1] = "".join(toks[1])
    print "\t".join(toks)

then call like:

$ python this-script.py your-bowtie-output.txt

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.9 years ago by brentp 24k

1

Entering edit mode

my mistake. I always thought random.sample was with replacement. But (as I am glad to learn) it is without replacement. http://docs.python.org/library/random.html

ADD REPLY • link 12.9 years ago by brentp 24k

1

Entering edit mode

you need to use "".join(shuffled_seq) as Michael did because sample returns a list.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.9 years ago by brentp 24k

0

Entering edit mode

random.sample doesn't retain the nucleotide frequency? are you sure?

ADD REPLY • link 12.9 years ago by Geparada ★ 1.5k

0

Entering edit mode

however random.sample generates a wired output, it isn't what I need, so I'll try with random.shuffle :)

ADD REPLY • link 12.9 years ago by Geparada ★ 1.5k

Ram · Answer 2 · 2011-05-31

3

Entering edit mode

12.9 years ago

Michael Schubert ★ 7.1k

Check out random.sample(), that should do exactly what you need.

seq = "atcgga"
print "".join(random.sample(seq, len(seq)))

I assume that you know how to read and write files ;-)

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.9 years ago by Michael Schubert ★ 7.1k

2

Entering edit mode

brent is right though, shuffle should be a (little) more memory- and syntax-efficient

ADD REPLY • link 12.9 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

thanks, I'll try to apply to my code :)

ADD REPLY • link 12.9 years ago by Geparada ★ 1.5k

0

Entering edit mode

tell me when you're stuck :)

ADD REPLY • link 12.9 years ago by Michael Schubert ★ 7.1k

Ram · Answer 3 · 2011-05-31

The following R script will work on a 2 column tab CSV file.

The input should be just like your example but without the 3rd size column and without comments.

#!/usr/bin/env Rscript

con <- file("uniprot_2011_02_sample.tab", "r")

while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  seq <- strsplit(line, "\t")[[1]]
  set.seed(1)
  cat(paste(sep="", collapse="",
    "shuffled", "\t",
    paste(sep="", collapse="", sample(strsplit(seq[[2]], "")[[1]])),
    "\n"
    )
  )
}

close(con)

https://github.com/alevchuk/nmercount-classifier/blob/master/030-shuffle

To run:

Save the code as the filename shuffle and run chmod +x ./shuffle
Change the hard-coded filename
Run ./shuffle > results.tab

Ram · Answer 4 · 2011-05-31

2

Entering edit mode

12.9 years ago

Martin A Hansen 3.0k

This can be done using Biopieces like this:

read_tab -i test.tab -k ID,SEQ,LEN | shuffle_seq | write_tab -k ID,SEQ,LEN -x

chr167000051+67091529   agaatccaagaaattggcgtctaaggaaac  30
chr167091593+67098752   cgaacaagaccttgctgtactaggagattc  30
chr167098777+67101626   aaacgtctggatactcagaacccggaccag  30
chr167101698+67105459   acgggaataaagtctgaaaagggtaaagaa  30
chr167105516+67108492   ggtcaacatttcaagagtgtacgagatcag  30
chr167108547+67109226   aaaggcgaatcatcagcaccaggccccgag  30

Of course, if you want to scramble the sequence of a multi FASTA file you do:

read_fasta -i test.fna | shuffle_seq | write_fasta -x

Cheers,
Martin

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.9 years ago by Martin A Hansen 3.0k

1

Entering edit mode

Nice collection of bioinformatics tools!! I already do the scramble, but I probably use other scripts from biopieces. Thanks!!

ADD REPLY • link 12.9 years ago by Geparada ★ 1.5k