Multifasta to single fasta (replace headers with 'NNNNNNNNNN', then join the multiple contigs)
5.7 years ago

Hi to all,

I'm trying to use a script to find IS in my genomes, but the script only runs on single-fasta files (i.e., complete genomes). Some of my genome files are multifasta files. Is there a Python (or Perl) script that removes every contig header in a file (e.g. ">contig-header") EXCEPT the first one, replacing each removed header with something like "NNNNNNNNNN"? That way I could still see where the headers used to be, and I could run the IS-finding script on the resulting multifasta2singlefasta files.

python perl multifasta singlefasta • 1.9k views

Changing data drastically so it would work with a script seems like a really bad idea to me.


Why not split the multi-fasta files into individual ones and run the tool on those files instead of doing what you are proposing?


Because some genomes have around 300 contigs, splitting them all doesn't seem practical to me. I would probably lose track of which contig came from which file. The IS script gives me coordinates, so I can locate each IS later.


Then you do not need the contig headers anyway. The coordinates will still be correct. Your proposed approach would actually be more difficult, since you’d be replacing the headers with an arbitrary number of Ns (and possibly different numbers of Ns depending on how you did the substitution), and so your base indices would be completely meaningless.
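The indexing concern above can be made concrete. The following is only a minimal sketch, assuming the contigs were joined in a known order with a fixed-length spacer (10 Ns here) and that the contig names and lengths were recorded at join time. The function names `contig_offsets` and `to_contig_coord` are hypothetical, and positions are assumed to be 0-based.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def contig_offsets(lengths: List[int], spacer_len: int = 10) -> List[int]:
    """Return the 0-based start of each contig in the joined sequence."""
    offsets = []
    pos = 0
    for length in lengths:
        offsets.append(pos)
        pos += length + spacer_len  # next contig starts after the spacer
    return offsets


def to_contig_coord(
    pos: int, names: List[str], lengths: List[int], spacer_len: int = 10
) -> Optional[Tuple[str, int]]:
    """Map a 0-based position in the joined sequence back to
    (contig name, 0-based position in that contig).

    Returns None if the position falls inside a spacer or outside
    the joined sequence.
    """
    offsets = contig_offsets(lengths, spacer_len)
    i = bisect_right(offsets, pos) - 1  # last contig starting at or before pos
    if i < 0:
        return None
    local = pos - offsets[i]
    if local >= lengths[i]:
        return None  # inside the spacer (or past the end)
    return names[i], local
```

With a variable or unknown number of Ns per junction this mapping is not possible, which is the point made above; it only works because the spacer length is fixed and the contig lengths are kept.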


It didn't need to be an arbitrary number of Ns.

But anyway, thank you very much, genomax, jrj.healey and Ram, for giving me some clues and advice so quickly. I'll follow what you suggested.


Please search the forum for these various tasks; each one has very well documented solutions.

e.g.:

Concatenating sequences

Methods for manipulating fasta headers (one of many)

Give them a try. If you can’t make a solution work, come back and show us what you’ve tried and where you are stuck.


BTW,

An example of what I want the Python script to do:

~BEFORE_MULTIFASTA_FILE~
>some header
ATCGATCGATCGATCG

>another header
ATCGATCGATCGATCG

~AFTER_SINGLEFASTA_FILE~
>some header
ATCGATCGATCGATCGNNNNNNNNNNATCGATCGATCGATCG
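
For reference, the transformation in this example could be sketched in plain Python roughly as follows. This is a minimal sketch, not a tested tool: it assumes a fixed 10-N spacer, keeps only the first header, and the function names (`read_fasta`, `join_contigs`, `write_fasta`) are made up for illustration. It builds a new record rather than touching the input.

```python
from typing import Iterable, Iterator, List, TextIO, Tuple

SPACER = "N" * 10  # assumed fixed-length spacer between contigs


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def join_contigs(
    records: Iterable[Tuple[str, str]], spacer: str = SPACER
) -> Tuple[str, str]:
    """Keep the first header and join all sequences with the spacer."""
    records = list(records)
    if not records:
        raise ValueError("no FASTA records found")
    header = records[0][0]
    sequence = spacer.join(seq for _, seq in records)
    return header, sequence


def write_fasta(header: str, sequence: str, handle: TextIO, width: int = 60) -> None:
    """Write one record, wrapping the sequence at `width` characters."""
    handle.write(f">{header}\n")
    for i in range(0, len(sequence), width):
        handle.write(sequence[i:i + width] + "\n")
```

Usage would be along the lines of opening the multifasta file, passing the file handle to `read_fasta`, and writing the result of `join_contigs` to a different file with `write_fasta`.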

I’m with Ram and the others: this seems like a bad approach. Just concatenate the sequences normally, write the result to a separate file, and keep the original.

Concatenating draft genomes (or more usually scaffolding them) is fairly normal.


To add to jrj.healey's point, always design tools to write to stdout or an explicitly specified output file. Modifying an input file is unexpected (verging on harmful) behavior.
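
A minimal sketch of that convention in Python, using the standard `argparse` module. The option names, the `-` default for stdout, and the helper names (`parse_args`, `check_distinct`, `open_output`) are illustrative assumptions, not part of any existing tool.

```python
import argparse
import os
import sys


def parse_args(argv=None):
    """Parse an input path and an optional output path (default: stdout)."""
    parser = argparse.ArgumentParser(
        description="Process a FASTA file without modifying it."
    )
    parser.add_argument("input", help="path to the input FASTA file")
    parser.add_argument(
        "-o", "--output", default="-",
        help="output path, or '-' for stdout (default)",
    )
    return parser.parse_args(argv)


def check_distinct(input_path, output_path):
    """Refuse to write the output over the input file."""
    if output_path != "-" and os.path.abspath(output_path) == os.path.abspath(input_path):
        raise ValueError("output path must differ from input path")


def open_output(path):
    """Return stdout for '-', otherwise a newly opened text file."""
    return sys.stdout if path == "-" else open(path, "w")
```

Defaulting to stdout lets the tool sit in a shell pipeline, and the explicit check keeps the original file from being overwritten by accident.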


It's not my script (the one that finds IS in complete genomes). Someone else designed it to work the way I described. I'm just trying to find a way to take a first look at my data (the genomes that are not complete).


OK, then I'd recommend creating a new single fasta file and working on that.


What is IS?


probably insertions
