Question: receive GC-content profile for a given window-size and splitting multi fasta file into separate fasta files by contigs name for running control-free
0
gravatar for reza
10 weeks ago by
reza160
Iran
reza160 wrote:

i have two question about control-FREE

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

thanks in advance

fasta splitting • 220 views
ADD COMMENTlink modified 10 weeks ago by genomax42k • written 10 weeks ago by reza160

the following format is for GC-content profile for a given window-size (50000) that freec-control need it. first column:chromosome name; column two: window-size, column tree: GC-content, column four: percentage of ACGT-letter per window (1-poly(N)%). the file containing the following information is in .cnp file.

1   0       0.45896 1

1   50000   0.38424 1

1   100000  0.43834 1

1   150000  0.45            0.3456
ADD REPLYlink modified 10 weeks ago by genomax42k • written 10 weeks ago by reza160
1
gravatar for genomax
10 weeks ago by
genomax42k
United States
genomax42k wrote:

For #1: geecee from EMBOSS.

For #2:

how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

Jim Kent's (UCSC) faSplit utility is the fastest solution. You will need to do chmod a+x faSplit to add execute permissions after you download the file.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by genomax42k
0
gravatar for Macspider
10 weeks ago by
Macspider2.2k
Vienna - BOKU
Macspider2.2k wrote:

1.how can i receive GC-content profile for a given window-size (50000) for a reference genome in fasta file?

I would do a cut -c xx-yy where xx and yy are the two margins (included) of the region you're interested. Once you have that, you can easily count the number of G+C and divide it by the total number of chars.

2.how can i split multi fasta file into separate fasta files by contigs name? (original fasta have ~32000 contig and after separation i must have 32000 fasta file)

I personally would make a python script. Something like:

#!/usr/bin/env python

import sys 

for line in sys.stdin:
  if line[0:1] == ">":
    if OUT:
      OUT.close()
    name = # process your line accordingly to get the sequence / scaffold name
    OUT = open("{0}.fa".format(name), "w")
    OUT.write(line)
  else:
    OUT.write(line)
else:
  OUT.close()

Then you:

  • chmod 755 pythonscript.py
  • cat filename.fa | pythonscript.py

Be careful: making that many files will clog your machine.

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Macspider2.2k
1

Or something like this (untested pre-coffee code):

from Bio import SeqIO

for rec in SeqIO.parse("myfasta.fa", "fasta"):
    with openrec.id + ".fasta") as f:
        f.write(rec.format("fasta")
ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by WouterDeCoster25k

it does not work for me.

cat DROM_5000.fa | ./pythonscript.py

File "./pythonscript.py", line 9 name = # process your line accordingly to get the sequence / scaffold name ^ SyntaxError: invalid syntax

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by reza160
1

That's because you have to fill in that line.

ADD REPLYlink written 10 weeks ago by WouterDeCoster25k
1

Simply copy-pasting code is never a good idea. I put a commented line on purpose because I don't know how your lines look like. But I guess a combination of .rstrip(), .split(), .replace() will give you a well formatted line. Those built-in functions are properly documented in the python documentation (https://docs.python.org/2/).

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

@reza you should install SeqIO and do as @Wouter suggested, that would be the easiest way. if not installed, use my code. :D

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Macspider2.2k

@Wouter yes, but I usually avoid suggesting SeqIO to someone: non-python programmers and beginners usually don't have it installed!

True, but they should install it anyway ;-D

ADD REPLYlink written 9 weeks ago by WouterDeCoster25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 509 users visited in the last hour