Question: Getting a GC-content profile for a given window size, and splitting a multi-FASTA file into separate FASTA files by contig name, for running Control-FREEC
4 days ago by
reza160
Iran
reza160 wrote:

I have two questions about Control-FREEC:

1. How can I get a GC-content profile for a given window size (50000) for a reference genome in a FASTA file?

2. How can I split a multi-FASTA file into separate FASTA files by contig name? (The original FASTA has ~32000 contigs, so after splitting I should have 32000 FASTA files.)

Thanks in advance.


The following is the GC-content profile format for a given window size (50000) that Control-FREEC needs. Column one: chromosome name; column two: window start position (a multiple of the window size); column three: GC content; column four: share of ACGT letters per window (1 - poly(N)%). The profile is stored in a .cnp file.

1   0       0.45896 1
1   50000   0.38424 1
1   100000  0.43834 1
1   150000  0.45    0.3456
written 4 days ago by reza160
4 days ago by
genomax39k
United States
genomax39k wrote:

For #1: geecee from EMBOSS.

For #2:

How can I split a multi-FASTA file into separate FASTA files by contig name? (The original FASTA has ~32000 contigs, so after splitting I should have 32000 FASTA files.)

Jim Kent's (UCSC) faSplit utility is the fastest solution. You will need to do chmod a+x faSplit to add execute permissions after you download the file.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
4 days ago by
Macspider1.6k
Vienna - BOKU
Macspider1.6k wrote:

1. How can I get a GC-content profile for a given window size (50000) for a reference genome in a FASTA file?

I would use cut -c xx-yy, where xx and yy are the two boundaries (inclusive) of the region you're interested in. Once you have that, you can easily count the number of G and C characters and divide it by the total number of characters.
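The counting step described above can be sketched in Python for consecutive windows rather than a single region. This is only a minimal sketch, not Control-FREEC's own method: the function name gc_profile is made up for illustration, and dividing by the number of A/C/G/T letters (instead of all characters, so that runs of N do not dilute the value) is an assumption. Each row follows the four-column layout given earlier in the thread.

```python
from __future__ import division  # keeps "/" as true division on Python 2 as well


def gc_profile(name, seq, window=50000):
    """Return (name, start, gc_fraction, acgt_fraction) for each window.

    Hypothetical helper for illustration; not part of Control-FREEC.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        # Assumption: GC is measured over A/C/G/T letters only
        gc_frac = gc / acgt if acgt else 0.0
        # Share of the window that is not N (or another non-ACGT character)
        acgt_frac = acgt / len(chunk)
        rows.append((name, start, round(gc_frac, 5), round(acgt_frac, 5)))
    return rows


if __name__ == "__main__":
    # Toy sequence and window size, just to show the output layout
    for row in gc_profile("1", "GGCCAATTNNNN", window=4):
        print("\t".join(str(x) for x in row))
```

For a real genome, the sequence of each contig would have to be read from the FASTA file first (for example with Biopython, if it is installed) and passed in together with the contig name.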

2. How can I split a multi-FASTA file into separate FASTA files by contig name? (The original FASTA has ~32000 contigs, so after splitting I should have 32000 FASTA files.)

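Personally, I would write a Python script. Something like:

#!/usr/bin/env python

import sys

OUT = None  # handle of the FASTA file currently being written

for line in sys.stdin:
  if line[0:1] == ">":
    # A header line starts a new record: close the previous file first
    if OUT:
      OUT.close()
    name = # process your line accordingly to get the sequence / scaffold name
    OUT = open("{0}.fa".format(name), "w")
    OUT.write(line)
  else:
    OUT.write(line)

# Close the last file once all input has been read
if OUT:
  OUT.close()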

Then you:

  • chmod 755 pythonscript.py
  • cat filename.fa | ./pythonscript.py

Be careful: making that many files will clog your machine.


Or something like this (untested pre-coffee code):

from Bio import SeqIO

# Write each record of the multi-FASTA file to its own file, named after the record ID
for rec in SeqIO.parse("myfasta.fa", "fasta"):
    with open(rec.id + ".fasta", "w") as f:
        f.write(rec.format("fasta"))
written 4 days ago by WouterDeCoster

It does not work for me:

cat DROM_5000.fa | ./pythonscript.py

  File "./pythonscript.py", line 9
    name = # process your line accordingly to get the sequence / scaffold name
           ^
SyntaxError: invalid syntax

written 4 days ago by reza160

That's because you have to fill in that line.

written 4 days ago by WouterDeCoster

Simply copy-pasting code is never a good idea. I left that line as a comment on purpose because I don't know what your header lines look like. I would guess that a combination of .rstrip(), .split() and .replace() will give you a well-formatted name. Those built-in string methods are documented in the Python documentation (https://docs.python.org/2/).
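As an illustration of the .rstrip() / .split() / .replace() approach, here is a minimal sketch of how the name could be derived from a header line and used for splitting. It assumes that the contig name is the first whitespace-separated token after ">", which may not match your headers. header_to_name and split_fasta are hypothetical names, and replacing "/" with "_" is only a precaution so that the name is usable as a file name.

```python
import os


def header_to_name(line):
    """Turn a FASTA header line into a file-safe contig name (assumed format)."""
    # Drop the trailing newline and the leading ">", then keep the first token
    name = line.rstrip()[1:].split()[0]
    return name.replace("/", "_")


def split_fasta(lines, outdir="."):
    """Write each record from an iterable of FASTA lines to outdir/<name>.fa."""
    out = None
    names = []
    for line in lines:
        if line.startswith(">"):
            # A header starts a new record: close the previous file first
            if out:
                out.close()
            name = header_to_name(line)
            names.append(name)
            out = open(os.path.join(outdir, name + ".fa"), "w")
        if out:
            # Header and sequence lines both go to the current record's file
            out.write(line)
    if out:
        out.close()
    return names
```

It could be called as split_fasta(sys.stdin) after import sys, or with an open file handle instead of standard input.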

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

@reza you should install Biopython (which provides SeqIO) and do as @Wouter suggested; that would be the easiest way. If it is not installed, use my code. :D

written 9 hours ago by Macspider

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

True, but they should install it anyway ;-D

written 8 hours ago by WouterDeCoster