Question: Multifasta to Singlefasta (replace headers with 'NNNNNNNNNN', then join the multiple contigs)
7 months ago by
oseias.rf.junior0 wrote:

Hi to all,

I'm trying to use a script to find IS in my genomes, but the script runs only on single-FASTA files (i.e., complete genomes). Some of my genome files are multi-FASTA files. Is there a Python (or Perl) script that removes every contig header in a file (e.g. ">contig-header") except the first, replacing each removed header with something like "NNNNNNNNNN"? That way I could still see where the headers used to be, and I could run the IS-finding script on the resulting multifasta2singlefasta files.

ADD COMMENTlink modified 7 months ago • written 7 months ago by oseias.rf.junior0

Changing data drastically so it would work with a script seems like a really bad idea to me.

ADD REPLYlink written 7 months ago by RamRS20k

Why not split the multi-fasta files into individual ones and run the tool on those files instead of doing what you are proposing?
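
For reference, the splitting suggested here can be sketched in a few lines of Python. This is only a minimal illustration, not a tested pipeline: the function name `split_multifasta` and the numbered output naming scheme (`<stem>.contigNNNN.fasta`) are assumptions made for this example, chosen so that the contig order of the original file stays visible in the file names.

```python
from pathlib import Path


def split_multifasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own numbered file.

    Returns the list of paths written, in input order.
    (Hypothetical helper; the naming scheme is an assumption.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(fasta_path).stem
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new header starts a new output file.
                if out is not None:
                    out.close()
                path = out_dir / f"{stem}.contig{len(written) + 1:04d}.fasta"
                out = open(path, "w")
                written.append(path)
            if out is not None:
                # Header and sequence lines are copied unchanged.
                out.write(line)
    if out is not None:
        out.close()
    return written
```

The input file is left untouched, and a genome with 300 contigs simply produces 300 numbered files in `out_dir`.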

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax63k

Because some genomes have something like 300 contigs. Splitting them all doesn't seem practical to me, and I would probably lose track of each contig within a file. The IS script gives me coordinates, so I can track the IS location later.

ADD REPLYlink written 7 months ago by oseias.rf.junior0

Then you do not need the contig headers anyway. The coordinates will still be correct. Your proposed approach would actually be more difficult, since you’d be replacing the headers with an arbitrary number of Ns (and possibly different numbers of Ns depending on how you did the substitution), and so your base indices would be completely meaningless.
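
To make the point about base indices concrete, here is a rough Python sketch of the bookkeeping involved. It assumes, hypothetically, that the contigs are joined with a spacer of known, fixed length and that the contig names and lengths were recorded before joining; the function names `contig_offsets` and `locate` are made up for this illustration. If the number of Ns varied from junction to junction, this arithmetic would no longer hold.

```python
from bisect import bisect_right


def contig_offsets(lengths, spacer_len=10):
    """Return the 0-based start of each contig in the joined sequence."""
    offsets, pos = [], 0
    for length in lengths:
        offsets.append(pos)
        pos += length + spacer_len  # next contig starts after the spacer
    return offsets


def locate(position, names, lengths, spacer_len=10):
    """Map a 0-based position in the joined sequence back to a contig.

    Returns (contig name, 0-based position within that contig), or None
    if the position falls inside a spacer or outside the sequence.
    """
    offsets = contig_offsets(lengths, spacer_len)
    i = bisect_right(offsets, position) - 1  # last contig starting at or before position
    if i < 0:
        return None
    local = position - offsets[i]
    if local >= lengths[i]:
        return None  # inside the spacer, or past the final contig
    return names[i], local
```

For two 16-base contigs and a 10-base spacer, the second contig starts at offset 26 in the joined sequence, so a hit reported at position 26 belongs to the first base of the second contig.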

ADD REPLYlink written 7 months ago by jrj.healey11k

It didn't need to be an arbitrary number of Ns.

But anyway, thank you all very much, genomax, jrj.healey and Ram, for giving me some clues and advice so quickly. I'll follow what you suggested.

ADD REPLYlink written 7 months ago by oseias.rf.junior0

Please search the forum for these various tasks; each one has very well-documented solutions.

e.g.:

Concatenating sequences

Methods for manipulating fasta headers (one of many)

Give them a try. If you can’t make a solution work, come back and show us what you’ve tried and where you are stuck.

ADD REPLYlink written 7 months ago by jrj.healey11k

BTW,

An example of what I want the Python script to do:

~BEFORE_MULTIFASTA_FILE~
>some header
ATCGATCGATCGATCG

>another header
ATCGATCGATCGATCG

~AFTER_SINGLEFASTA_FILE~
>some header
ATCGATCGATCGATCGNNNNNNNNNNATCGATCGATCGATCG
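
A minimal Python sketch of this transformation might look like the following. It is one possible approach under stated assumptions, not an established tool: the names `read_fasta`, `join_contigs` and `SPACER` are invented for this example, the spacer is fixed at ten Ns, and the whole genome is held in memory, which should be fine for bacterial-sized assemblies.

```python
SPACER = "N" * 10  # fixed-length spacer (assumption: always ten Ns)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def join_contigs(records, spacer=SPACER):
    """Keep the first header and join all sequences with the spacer."""
    records = list(records)
    if not records:
        raise ValueError("no FASTA records found")
    header = records[0][0]
    sequence = spacer.join(seq for _, seq in records)
    return header, sequence
```

Typical use would be to open the multi-FASTA file, pass the file handle to `read_fasta`, and write `">" + header` followed by the joined sequence to a new file, leaving the original file unchanged.
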
ADD REPLYlink modified 7 months ago by genomax63k • written 7 months ago by oseias.rf.junior0

I’m with Ram and the others: this seems like a bad approach. Just concatenate the file normally, as a separate file, and keep the original.

Concatenating draft genomes (or more usually scaffolding them) is fairly normal.

ADD REPLYlink written 7 months ago by jrj.healey11k

To add to jrj.healey's point, always design tools to write to stdout or an explicitly specified output file. Modifying an input file is unexpected (verging on harmful) behavior.

ADD REPLYlink written 7 months ago by RamRS20k

It's not my script (the one that finds IS in complete genomes). Someone else designed it the way I described. I'm just trying to figure out a way to take a first look at my data (the genomes that are not complete).

ADD REPLYlink written 7 months ago by oseias.rf.junior0

OK, then I'd recommend creating a new single fasta file and working on that.

ADD REPLYlink written 7 months ago by RamRS20k

What is IS?

ADD REPLYlink written 7 months ago by cschu1811.5k

probably insertions

ADD REPLYlink written 7 months ago by cpad011211k