Question: Help With Creating Fasta Files For Dnasp And Other Programs!
1
gravatar for User 7433
7.7 years ago by
User 7433150
User 7433150 wrote:

Hi there,

I have an excel file which contains DNA sequence information (over 7000 nucleotides) for 4000 chromosomes. I want to create a fasta file that gives the nucleotide sequence for each chromosome..like this

>chromosome1
AGTGATGCGCGCTCGCGCTCGCCTCCGGCGCGCGT...
>chromosome2
GCCGCTCGCTCGCTCGCCGCGCGCGCGCGCGGCTC...

And so on...

In my excel file I can concatonate columns 1 and 2 to give something which is close to what I want..eg >chromosome1GATAGATGAGAGAGAGGA...

But then I have to copy this into word and MANUALLY (it takes me AGES! lol) hit enter after chromosome1 to give me the correct arrangement for the input file DNAsp needs.

Does anyone know of a program that easily makes fasta files for software like DNAsp? Or any other way I can do this without having to sit and do it manually until my fingers fall of?

I am not exactly a computer whizz so I am sure someone can help me?

Hope so - many thanks in advance!!

xx

ADD COMMENTlink modified 4.2 years ago by mikaellfl0 • written 7.7 years ago by User 7433150

For future reference, if you want ">" characters to display properly in your questions, you need to indent each line with 4 spaces; otherwise the text appears as a blockquote.

ADD REPLYlink written 7.7 years ago by Neilfws48k
2
gravatar for Neilfws
7.7 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Here is how a bioinformatician would deal with this problem.

First, they would roll their eyes in despair at the inappropriate use of Excel and Word for storing/manipulating sequence data.

Next, they would export the data from Excel to a portable, plain text format which could be used by many programs on multiple platforms. They would probably export to CSV (comma separated values) format, so each column would look like this:

chromosome1,AGTGATGCGCGCTCGCGCTCGCCTCCGGCGCGCGT...
chromosome2,GCCGCTCGCTCGCTCGCCGCGCGCGCGCGCGGCTC...

Finally, they would write a short script in the language of their choice to reformat the CSV file as FASTA. They might also do a web search for the term "CSV to fasta", to see if anyone else has already solved the problem.

Their script might look something like this, which is written in Awk:

awk 'BEGIN {FS=","} {print ">"$1"\n"$2}' sequences.csv > sequences.fasta

Which would not be perfect, since the sequence parts of the file will all be on one line, not a maximum of 80 characters long as they should be. But most programs ignore that, so the bioinformatician might decide this "quick and dirty" solution is acceptable. Or, assuming that the header line is never more than 80 characters, they might use another small text-processing tool named fold, to wrap the lines containing sequence:

awk 'BEGIN {FS=","} {print ">"$1"\n"$2}' sequences.csv | fold -w 80 > sequences.fasta

They may use another language: Python, Perl, Ruby... They may also use existing libraries from that language (Bioperl, BioPython, BioRuby) to handle sequence reading and writing.

Not being a "computer whizz", it may be that none of this means much to you, so here's the take-home message. When you stop using tools inappropriately because those are the only tools that you know; and you stop storing data in proprietary, locked-up formats and start using more open, versatile formats (simple plain text), then a whole world of simple solutions to your problems opens up to you. A lot of bioinformatics involves processing text in one way or another and processing text is a breeze for people who know how to use a command line in Linux.

So - find such a person, or become one :)

ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Neilfws48k
1
gravatar for Swbarnes2
7.7 years ago by
Swbarnes21.4k
Swbarnes21.4k wrote:

You could probably do it with a perl one-liner, which I'm not very good at making. Do you have access to a computer with Perl, or Python, or something?

Did you ask the person who gave you the file to format it for you? They probably didn't type it by hand.

If you really have to do it in Excel, make your file look like this, with a line between every entry

>chr1   atctcta

>chr2   tatcgat

>chr3   atatatt

>chr4   tttagcta

You can make dummy entries that have names that will fit between your entries, like chr1a. Then a couple of columns over, The top cell equals A1, and the next cell down equals A2, and copy and paste that pattern down the column.

ADD COMMENTlink written 7.7 years ago by Swbarnes21.4k
0
gravatar for mikaellfl
4.2 years ago by
mikaellfl0
Denmark
mikaellfl0 wrote:

This a very old post, but I actually had this problem, and solved it by using Excel so I will give my answer to anyone else with the same problem with name and sequence in 2 seperate columns. this is not difficult but requires a basic understanding of Excel, the data is not broken into 80character lines as FASTA should be, but I have had no problems with it.

Eks of the setup needed to be converted.

Column1 Column2

name1   ATCGCGG

name2   CGTGA...

name3   CGTATG...

name4   CGTATGG 

Solution:

you allready have column A and B.

C is the number of the rows starting with 1

D is "a" and "b" repeated loop, so all names get "a" and all sequnces get "b". Drag this column Down so it is minimum twice as long as your data columns. 

E The 2 first rows (name and sequence) both gets the value "1" inputted manually. from the 3th cell and downwards use the formula =If(D3="a";E2+1;E2) this ensures that all names and sequnces gets a paired number.Drag this column Down so it is min. twice as long as your data columns. 

C-D is guide columns and are only needed for the final formula to Work.

F this is the important column that puts your data in a fasta like format use the forumula: =if(D1="a";">"&INDIRECT("A"&E1);INDIRECT("B"&E1)) in F1, Drag this column Down so it is Exactly twice as long as your data columns - you can adjust this when you reach the end.

The final Excel document looks like below. Finally you copy column F to column A in a new sheet and save the new sheet as Text MS-DOS, the final product is in fasta format and can be used as you please

    A             B             C   D  E        F

name1   sequence1    1   a   1   >name1

name2   sequence2    2   b   1  sequence1    

name3   sequence3    3   a    2  >name2

name4   sequence4    4   b    2  sequence2    

                                    5  a    3  >name3

                                    6  b    3  sequence3

                                    7  a    4  >name4

                                    8  b    4  sequence4

ADD COMMENTlink written 4.2 years ago by mikaellfl0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1127 users visited in the last hour