Linux 'cat' command merges FASTA files but drops some headers
7.9 years ago

I've been trying to merge separate fasta files into a single file. I use cat *.fasta > outputname but every time I do it I lose some of the headers, which is puzzling.

Example:

File1

>scaffold001
AGTCATGAT

File2

>scaffold004
AGTATAAAA

after using cat, output is:

New file

>scaffold001
AGTCATGAT
AGTATAAAA

There is no obvious pattern: some scaffold headers appear and some don't. I have no duplicate scaffolds to merge, so that's not the cause. I double-checked, and all the names have the same format, differing only in the numbers; there are no spaces or anything.

I have no idea what could be going on or what else I can use to concatenate the files.

Thanks!


The command should work. Maybe one of your files lacks a \n after the last sequence. When you concatenate that file with the next one, the header of the next file gets attached to the last line of the previous file. Just guessing.
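
If a missing final newline is the cause, the merged file would show the next header glued to the end of a sequence line (something like AGTCATGAT>scaffold004). Here is a minimal sketch of a merge that guards against this, assuming a POSIX shell with tail and printf. The file names and contents are made up for the demo, and the scratch directory from mktemp is only there so nothing real gets overwritten:

```shell
# Demo files in a scratch directory (hypothetical names and contents)
workdir=$(mktemp -d)
printf '>scaffold001\nAGTCATGAT' > "$workdir/a.fasta"      # no trailing newline
printf '>scaffold004\nAGTATAAAA\n' > "$workdir/b.fasta"    # ends with a newline

# Plain cat: the second header ends up on the same line as the first sequence
cat "$workdir"/*.fasta > "$workdir/plain.fa"

# Guarded merge: add a newline only when a file's last byte is not one.
# $(tail -c 1 file) is empty when the last byte is a newline, because
# command substitution strips trailing newlines.
for f in "$workdir"/*.fasta; do
    cat "$f"
    if [ -n "$(tail -c 1 "$f")" ]; then
        printf '\n'
    fi
done > "$workdir/merged.fa"

grep -c '^>' "$workdir/plain.fa"     # prints 1
grep -c '^>' "$workdir/merged.fa"    # prints 2
```

The output files use a .fa extension on purpose, so the *.fasta glob never picks up the merged file itself.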


Can you please show us the output of:

 file *.fasta

This is the output of file *.fasta

scaffold001.fasta: ASCII text, with very long lines
scaffold004.fasta: ASCII text, with very long lines
scaffold014.fasta: ASCII text, with very long lines
scaffold019.fasta: ASCII text, with very long lines
scaffold059.fasta: ASCII text, with very long lines
scaffold074.fasta: ASCII text, with very long lines
scaffold080.fasta: ASCII text, with very long lines
scaffold081.fasta: ASCII text, with very long lines
scaffold098.fasta: ASCII text, with very long lines
scaffold108.fasta: ASCII text, with very long lines
scaffold117.fasta: ASCII text, with very long lines
scaffold123.fasta: ASCII text, with very long lines
scaffold138.fasta: ASCII text, with very long lines

I'm curious about the 'with very long lines' part of the output. Apparently, a line has to be longer than 300 characters before file reports that: http://superuser.com/questions/91660/how-long-is-long-for-the-unix-file-command

Do you really have lines longer than 300 characters in that file? Also, if you try to open it in emacs do you get weird characters like '^@', or '^M'?
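
Both questions can also be checked from the shell without opening an editor. This is a rough sketch, assuming POSIX grep, awk and tr; the two demo files are invented, one with Windows-style line endings (the '^M' case) and one without:

```shell
# Demo files in a scratch directory (hypothetical names and contents)
workdir=$(mktemp -d)
printf '>scaffold001\r\nAGTCATGAT\r\n' > "$workdir/dos.fasta"   # CRLF line endings
printf '>scaffold004\nAGTATAAAA\n' > "$workdir/unix.fasta"      # LF line endings

# A literal carriage return to search for
cr=$(printf '\r')

# List the files that contain at least one carriage return
grep -l "$cr" "$workdir"/*.fasta

# Report the longest line in each file
for f in "$workdir"/*.fasta; do
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$f"
done

# Write a copy with the carriage returns removed
tr -d '\r' < "$workdir/dos.fasta" > "$workdir/dos.clean.fa"
```

Only dos.fasta should be listed by the grep -l step here.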


Yes, as I have DNA sequencing data, my lines are huge! The problem was in the fasta file headers though. I finally saw the pattern yesterday: after any header containing a hyphen, the next header would be missing from the output. Simple, but I just didn't catch it until I posted the question and started looking at it. Thanks for the input!


I found out the problem. Some of my fasta headers had a dash in the name ( - ) and that is what made cat behave weirdly. Thanks for the input though!
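
For anyone hitting something similar, one way to pin down which headers go missing is to compare header counts before and after the merge and to list the headers that contain a hyphen. This is only a sketch, assuming POSIX grep; the file names and the hyphenated header are invented for the demo:

```shell
# Demo files in a scratch directory (hypothetical names and contents)
workdir=$(mktemp -d)
printf '>scaffold-001\nAGTCATGAT\n' > "$workdir/a.fasta"
printf '>scaffold004\nAGTATAAAA\n' > "$workdir/b.fasta"

cat "$workdir"/*.fasta > "$workdir/merged.fa"

# Count the headers in each input file and add them up
in_count=0
for f in "$workdir"/*.fasta; do
    n=$(grep -c '^>' "$f")
    in_count=$((in_count + n))
done

# Count the headers in the merged file
out_count=$(grep -c '^>' "$workdir/merged.fa")
echo "input headers: $in_count, merged headers: $out_count"

# Show, with line numbers, the merged headers that contain a hyphen
grep -n '^>.*-' "$workdir/merged.fa"
```

If the two counts differ on real data, looking at the lines just after each hyphenated header in the merged file should show where the next header went.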


Probably the dash was surrounded by spaces. If you have filenames with spaces, enclose the name in single or double quotes, e.g. cat "Escherichia coli - genome.fa".
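
Quoting matters here because an unquoted name with spaces is split into separate words, and cat treats a lone - as "read standard input". A small sketch of the quoted form and of a loop that keeps each name intact, assuming a POSIX shell; the second file name is made up for the demo:

```shell
# Demo files in a scratch directory (contents are hypothetical)
workdir=$(mktemp -d)
printf '>scaffold001\nAGTCATGAT\n' > "$workdir/Escherichia coli - genome.fa"
printf '>scaffold004\nAGTATAAAA\n' > "$workdir/other.fa"

# Quoted: the whole name, spaces and dash included, is one argument
cat "$workdir/Escherichia coli - genome.fa" > "$workdir/quoted.out"

# Glob results are not word-split, and "$f" keeps each name as one argument
for f in "$workdir"/*.fa; do
    cat "$f"
done > "$workdir/merged.out"

grep -c '^>' "$workdir/merged.out"    # prints 2
```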
