Question: receive GC-content profile for a given window-size and splitting multi fasta file into separate fasta files by contigs name for running control-free
0
gravatar for reza
2.6 years ago by
reza230
Iran
reza230 wrote:

i have two question about control-FREE

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

thanks in advance

fasta splitting • 1.4k views
ADD COMMENTlink modified 2.6 years ago by genomax85k • written 2.6 years ago by reza230

the following format is for GC-content profile for a given window-size (50000) that freec-control need it. first column:chromosome name; column two: window-size, column tree: GC-content, column four: percentage of ACGT-letter per window (1-poly(N)%). the file containing the following information is in .cnp file.

1   0       0.45896 1

1   50000   0.38424 1

1   100000  0.43834 1

1   150000  0.45            0.3456
ADD REPLYlink modified 2.6 years ago by genomax85k • written 2.6 years ago by reza230
1
gravatar for genomax
2.6 years ago by
genomax85k
United States
genomax85k wrote:

For #1: geecee from EMBOSS.

For #2:

how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

Jim Kent's (UCSC) faSplit utility is the fastest solution. You will need to do chmod a+x faSplit to add execute permissions after you download the file.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by genomax85k
0
gravatar for Macspider
2.6 years ago by
Macspider3.0k
Vienna - BOKU
Macspider3.0k wrote:

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

I would do a cut -c xx-yy where xx and yy are the two margins (included) of the region you're interested. Once you have that, you can easily count the number of G+C and divide it by the total number of chars.

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

I personally would make a python script. Something like:

#!/usr/bin/env python

import sys 

for line in sys.stdin:
  if line[0:1] == ">":
    if OUT:
      OUT.close()
    name = # process your line accordingly to get the sequence / scaffold name
    OUT = open("{0}.fa".format(name), "w")
    OUT.write(line)
  else:
    OUT.write(line)
else:
  OUT.close()

Then you:

  • chmod 755 pythonscript.py
  • cat filename.fa | pythonscript.py

Be careful: making that many files will clog your machine.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Macspider3.0k
1

Or something like this (untested pre-coffee code):

from Bio import SeqIO

for rec in SeqIO.parse("myfasta.fa", "fasta"):
    with openrec.id + ".fasta") as f:
        f.write(rec.format("fasta")
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by WouterDeCoster44k

it does not work for me.

cat DROM_5000.fa | ./pythonscript.py

File "./pythonscript.py", line 9 name = # process your line accordingly to get the sequence / scaffold name ^ SyntaxError: invalid syntax

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by reza230
1

That's because you have to fill in that line.

ADD REPLYlink written 2.6 years ago by WouterDeCoster44k
1

Simply copy-pasting code is never a good idea. I put a commented line on purpose because I don't know how your lines look like. But I guess a combination of .rstrip(), .split(), .replace() will give you a well formatted line. Those built-in functions are properly documented in the python documentation (https://docs.python.org/2/).

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

@reza you should install SeqIO and do as @Wouter suggested, that would be the easiest way. if not installed, use my code. :D

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Macspider3.0k

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

True, but they should install it anyway ;-D

ADD REPLYlink written 2.6 years ago by WouterDeCoster44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1960 users visited in the last hour