Chunk That Fasta?
4.5 years ago
damonlbp ▴ 20

Hi all, another noob question from me.

I finally finished that FASTA splitter from my last post, which I will update soon. My new task is to chunk a FASTA file into, well... chunks, still in Python.

FileInput = input('What is full address of the file you need to work with? ' )
FileOutPut = input('What is the address of the Directory you want saving to?')
OrganismOI = input('What organism  are you looking at? ')
ChunkSize = int(input("What is the chunk size (in kb) you want? "))

Chunk = ChunkSize*1000
Counter = 0
FileName = FileOutPut + '/' + OrganismOI + str(Counter) + '.fa'

FILE = open(FileInput, 'r')
File = FILE.read()
FileSize = len(File)
print(str(FileSize) + ' is the length of the file in bp.') 
#This also includes Fasta TitleNumber
print('Final Number of Files to be expected is ' + str(FileSize / Chunk) + '.')

for chunk in File:
    Save = open(FileName, 'w')
    Save.write(File[:Chunk])
    Save.close()

For the life of me, as soon as it comes to loops, I die inside. No matter how much I read or practise, it just isn't going in. This script is supposed to go through the original FASTA file and, for every chunk (ChunkSize being user defined), spit out a file containing that chunk, named OrganismOI plus a counter (supposed to count file numbers). Once again and always, help is very much appreciated. If I can then add a coordinate system I'll be set for now. If anyone has any books or materials that could feasibly help me learn this stuff, please send them my way. Would the BioStar Handbook be a sensible purchase?

Again thanks everyone.

FASTA Python Chunk DNA Sequence • 1.7k views

Can you clarify: is this meant to segregate whole sequences into numbered chunks (e.g. a file with 30 sequences split into 3 files of 10 sequences each), or is the intention to chop up the sequences themselves into windows (e.g. a file with 30 sequences split into 30 files, each containing one sequence broken into windows of length n)?

To your last point, any number of books and websites exist to help you learn, but I am personally an advocate of just learning by doing. For me at least, it's the fastest way. Write code. Write lots. Write bad code. Pick up new and better habits from code you find online. Make the bad code a bit less bad. Rinse and repeat.


Of course. I have the whole genome of Salmonella CT18 in a basic, unannotated nucleotide FASTA format. Say I want to break it into chunks of 50 kb to ease processing downstream. I think later on it will be useful to break it into chunks containing a number of gene sequences, but not yet.

And thanks, that's what I'm trying to do, but I'm clearly going wrong somewhere.


Would the BioStar handbook be a sensible purchase?

Not directly, since the BioStar Handbook does not cover Python programming. If you do purchase the book, you get access to "Python programming in 100 hours", a separate series of video lectures that Prof. Istvan Albert is including with the book. Unfortunately he has not completed that series yet; only 4 videos are currently available.

4.5 years ago

Hi, try this link (just load the genome as a string and pass it to one of the implementations of chunker from the link).
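Since the link itself isn't reproduced here, a minimal slicing-based chunker along those lines might look like this (the function name and toy data are illustrative, not taken from the link):

```python
# Sketch: load the sequence as one string, then slice it into
# fixed-size windows with a generator.
def chunker(seq, size):
    """Yield consecutive slices of seq, each at most size bp long."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

genome = "ACGT" * 25                  # toy 100 bp sequence in place of a real genome
chunks = list(chunker(genome, 30))    # -> pieces of 30, 30, 30 and 10 bp
```

This keeps the whole genome in memory, which is fine for a bacterial chromosome but not for very large genomes, hence the streaming version below.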

If you have very little RAM or a very long genome, then you could do something like:

threshold = 50000            # chunk size in bp
f = open(bare_fasta, 'r')
l = f.readline()
to_write = []                # lines buffered for the current chunk
c = 0                        # bp accumulated in the current chunk
i = 0                        # output file counter
while l:
    to_write.append(l)
    c += len(l) - 1          # -1: don't count the newline
    if c > threshold:
        with open('fout{}'.format(i), 'w') as o:
            o.write(''.join(to_write))
        i += 1
        to_write = []
        c = 0
    l = f.readline()
if to_write:                 # flush whatever is left over
    with open('fout{}'.format(i), 'w') as o:
        o.write(''.join(to_write))
f.close()

This will split your original FASTA into chunks of whole lines, cutting just after the total number of nucleotides exceeds the defined threshold. Be aware that this expects a FASTA file wrapped over many lines, NOT one long line.
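If the sequence does sit on one long line, a workaround is to read fixed-size blocks with read() instead of readline(). A rough sketch (the function name and the in-memory return value are my own, for illustration; a real script would write each block to a fout file instead):

```python
import io

def split_single_line_fasta(handle, threshold):
    """Read a one-record, single-line FASTA and return the sequence
    as a list of blocks of at most `threshold` bp each."""
    handle.readline()                         # skip the '>' header line
    blocks = []
    while True:
        block = handle.read(threshold).replace('\n', '')
        if not block:                         # empty read -> end of file
            break
        blocks.append(block)
    return blocks

fa = io.StringIO(">chr1\n" + "A" * 120 + "\n")   # stand-in for a file handle
parts = split_single_line_fasta(fa, 50)          # -> blocks of 50, 50 and 20 bp
```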


Hi Massa, this worked amazingly yesterday after changing a few little things.

Opening it up today to try it again, each file now grows by the size of the threshold and I can't figure out what has changed: same input file, same script. Could you have a quick look please?

    FileInput = input('What is full address of the file you need to work with?, alternately save this script to your working directory. ' )
    FileOutPut = input('What is the address of the Directory you want saving to? (ending with a /)')
    OrganismOI = input('What organism  are you looking at? ')
    ChunkSize = int(input("What is the chunk size (in bp) you want per file? "))

    File = open(FileInput, 'r')
    Read = File.readline()
    ToWrite = []
    Length = 0
    Counter = 0

    while Read:                                                                             
        ToWrite.append(Read)                                                                
        Length += len(Read)-1                                                             
        if Length > ChunkSize:                                                              
            with open(FileOutPut + OrganismOI + '|{}'.format(Counter) + '.fa', 'w') as o:  
                o.write(''.join(ToWrite))
                Counter += 1
                to_write = []
                Length = 0                
        Read = File.readline()
    with open(FileOutPut + OrganismOI + '|{}'.format(Counter) + '.fa', 'w') as o:
        o.write(''.join(ToWrite))

You have to erase the ToWrite buffer after writing each new file (you forgot to change it from to_write = [] to ToWrite = []).
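To make the failure mode concrete, here is a minimal sketch of that write-and-reset step with the buffer cleared correctly (toy data, same variable names as the script above, collecting chunks in a list instead of writing files):

```python
# Toy input: 10 lines of 4 bp each. Without the ToWrite = [] reset,
# every later chunk would still contain all earlier lines, so each
# output would grow by roughly one threshold - the reported symptom.
lines = ["ACGT\n"] * 10
ChunkSize = 12
ToWrite, Length, files = [], 0, []

for Read in lines:
    ToWrite.append(Read)
    Length += len(Read) - 1          # -1 ignores the newline
    if Length > ChunkSize:
        files.append(''.join(ToWrite))
        ToWrite = []                 # the reset the original loop missed
        Length = 0
if ToWrite:
    files.append(''.join(ToWrite))   # flush the remainder
```

With the reset in place, 40 bp in 16 bp chunks comes out as two full chunks plus an 8 bp remainder.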


Thank you so very much, I thought I was starting to go insane.
