How To Scramble A Sequence Using An Existing Script Or A Python Method?
4
3
Entering edit mode
12.9 years ago
Geparada ★ 1.5k

Hi!

I made a splice junction library to map the bowtie unmapped reads (to genome) like this:

chr167000051+67091529 ccaccatgatggaaggattgaaaaaacgta 30
chr167091593+67098752 ggacactgattctacaggttcaccagatag 30
chr167098777+67101626 gatagagatggaattcagcccagcccacac 30
chr167101698+67105459 ggaaaaaaagtttcgaagaaaagcaatggg 30
chr167105516+67108492 gattgggaaagatataactcacctgagctg 30
chr167108547+67109226 ccgaggaacccggctctaccaaaggaaagc 30


#It's a CSV file

Now we want to make a negative control to know if the reads aligns to the library more than expected to do by chance alone. To do that, we want to scramble the sequences at second column of the CSV generated.

I'm learning python and programming in general, and I made the script to generate the splice junction library by myself. So, you can help me telling me a python tip to scramble a string or letting me know about some tool to scramble a column of a CSV (or sequences of a multifasta file).

Bash tips using AWK, sed or wherever are also welcome.

Thanks a lot for your help!!!

using random.shuffle I get:

chr167000051+67091529 cgacaaaagtaacggactttaaaaggactg 30
chr167091593+67098752 ggcaaactattggtctataaatagccccgg 30
chr167098777+67101626 accgtgctcaatgaagacaaccgcgtaacg 30
chr167101698+67105459 atggttacggaacaaaaagaaaggaaaggt 30
chr167105516+67108492 ttaataggagaagcacagctcttagatcgg 30
chr167108547+67109226 gaaggcggcaaaccctacaatgacgccgca 30

thanks!

python mapping sequence next-gen sequencing • 7.1k views
ADD COMMENT
4
Entering edit mode
12.9 years ago
brentp 24k

This is mostly the same as @Michael Schuberts answer, but if you want to retain the exact nucleotide frequency of the existing sequence, you can use random.shuffle for your file above:

import random
import sys

for line in open(sys.argv[1]):
    toks = line.strip().split()
    toks[1] = list(toks[1])
    random.shuffle(toks[1])
    toks[1] = "".join(toks[1])
    print "\t".join(toks)

then call like:

$ python this-script.py your-bowtie-output.txt
ADD COMMENT
1
Entering edit mode

my mistake. I always thought random.sample was with replacement. But (as I am glad to learn) it is without replacement. http://docs.python.org/library/random.html

ADD REPLY
1
Entering edit mode

you need to use "".join(shuffled_seq) as Michael did because sample returns a list.

ADD REPLY
0
Entering edit mode

random.sample doesn't retain the nucleotide frequency? are you sure?

ADD REPLY
0
Entering edit mode

however random.sample generates a wired output, it isn't what I need, so I'll try with random.shuffle :)

ADD REPLY
3
Entering edit mode
12.9 years ago

Check out random.sample(), that should do exactly what you need.

seq = "atcgga"
print "".join(random.sample(seq, len(seq)))

I assume that you know how to read and write files ;-)

ADD COMMENT
2
Entering edit mode

brent is right though, shuffle should be a (little) more memory- and syntax-efficient

ADD REPLY
0
Entering edit mode

thanks, I'll try to apply to my code :)

ADD REPLY
0
Entering edit mode

tell me when you're stuck :)

ADD REPLY
2
Entering edit mode
12.9 years ago

The following R script will work on a 2 column tab CSV file.

The input should be just like your example but without the 3rd size column and without comments.

#!/usr/bin/env Rscript

con <- file("uniprot_2011_02_sample.tab", "r")

while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  seq <- strsplit(line, "\t")[[1]]
  set.seed(1)
  cat(paste(sep="", collapse="",
    "shuffled", "\t",
    paste(sep="", collapse="", sample(strsplit(seq[[2]], "")[[1]])),
    "\n"
    )
  )
}

close(con)

https://github.com/alevchuk/nmercount-classifier/blob/master/030-shuffle

To run:

  1. Save the code as the filename shuffle and run chmod +x ./shuffle
  2. Change the hard-coded filename
  3. Run ./shuffle > results.tab
ADD COMMENT
2
Entering edit mode
12.9 years ago

This can be done using Biopieces like this:

read_tab -i test.tab -k ID,SEQ,LEN | shuffle_seq | write_tab -k ID,SEQ,LEN -x

chr167000051+67091529   agaatccaagaaattggcgtctaaggaaac  30
chr167091593+67098752   cgaacaagaccttgctgtactaggagattc  30
chr167098777+67101626   aaacgtctggatactcagaacccggaccag  30
chr167101698+67105459   acgggaataaagtctgaaaagggtaaagaa  30
chr167105516+67108492   ggtcaacatttcaagagtgtacgagatcag  30
chr167108547+67109226   aaaggcgaatcatcagcaccaggccccgag  30

Of course, if you want to scramble the sequence of a multi FASTA file you do:

read_fasta -i test.fna | shuffle_seq | write_fasta -x

Cheers,
Martin

ADD COMMENT
1
Entering edit mode

Nice collection of bioinformatics tools!! I already do the scramble, but I probably use other scripts from biopieces. Thanks!!

ADD REPLY

Login before adding your answer.

Traffic: 2616 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6