I have around 149 FASTA files of mouse gene CDS sequences in fasta.txt format. I need to combine them into a single file containing all the sequences and run it against a gene dataset that I downloaded from Ensembl BioMart. Is there a command I can use in cmd to combine them all, or some other way of doing this quickly? Any suggestions are appreciated.
The cat command is available as a PowerShell alias on Windows 7 and above. Press the Windows logo key + R, then type powershell and press Enter. Supposing your FASTA files are located in d:\data, just type:
cd d:\data
cat *.fasta.txt > d:\combined.fasta
For any serious bioinformatics analysis, learning the Linux command line is a must.
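On Linux (or any bash shell), one thing worth guarding against when concatenating FASTA files is a file that does not end in a newline: the next file's header line then gets glued onto the previous sequence. Below is a minimal sketch, assuming bash and standard tail/grep. The function name combine_fasta is hypothetical, not an existing tool, and it assumes the output file is not one of the inputs.

```shell
#!/usr/bin/env bash
# Sketch only: combine_fasta is a hypothetical helper name.
# Usage: combine_fasta OUTPUT INPUT...
# Assumes OUTPUT is not among the INPUT files.
combine_fasta() {
    local out=$1; shift
    local expected=0 actual n f
    : > "$out" || return 1              # create or truncate the output file
    for f in "$@"; do
        cat -- "$f" >> "$out" || return 1
        # If the last byte is not a newline, add one so the next '>' header
        # starts on its own line.
        if [ -n "$(tail -c 1 -- "$f")" ]; then
            printf '\n' >> "$out"
        fi
        # grep -c prints the count but exits non-zero when the count is 0.
        n=$(grep -c '^>' -- "$f" || true)
        expected=$((expected + n))
    done
    actual=$(grep -c '^>' -- "$out" || true)
    echo "expected $expected records, found $actual"
    # Succeed only if no header was lost or fused during concatenation.
    [ "$expected" -eq "$actual" ]
}
```

For the files in the question this could be called as combine_fasta ../combined.fasta *.fasta.txt, which writes the result one directory up so it cannot match the input pattern.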
updated 3.6 years ago by Ram • written 10.8 years ago by rtliu
If this does not work, you can use the following:
find . -maxdepth 1 -type f -name 'file_*.pdb' -print0 |
sort -zV |
xargs -0 cat >all.pdb
The find command finds all relevant files and passes their pathnames to sort, which does a "version sort"
to get them in the right order (if the numbers in the filenames had been zero-filled to a fixed width, we
would not have needed -V). xargs takes this list of sorted pathnames and runs cat on them in as large
batches as possible.
This should work even if the filenames contain strange characters such as newlines and spaces. We
use -print0 with find to give sort nul-terminated names, and sort handles these using -z. xargs too
reads nul-terminated names, with its -0 flag.
Note that I'm writing the result to a file whose name does not match the pattern file_*.pdb.
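To reuse that pipeline with other file names (for example the *.fasta.txt files from the question), it could be wrapped in a small function. This is only a sketch, assuming GNU find, sort and xargs (the -print0, -z, -V and -0 options). The name concat_sorted and its three parameters are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: concat_sorted is a hypothetical helper name.
# Usage: concat_sorted DIR PATTERN OUTPUT
# OUTPUT must not match PATTERN inside DIR, or it would be picked up as input.
concat_sorted() {
    local dir=$1 pattern=$2 out=$3
    # Find matching regular files in DIR only, nul-terminated.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -print0 |
        sort -zV |              # version sort: file_2 comes before file_10
        xargs -0 cat > "$out"   # cat the files in as few batches as possible
}
```

For example, concat_sorted . 'file_*.pdb' all.pdb reproduces the command above, and concat_sorted . '*.fasta.txt' ../combined.fasta would apply it to the FASTA files. The pattern is quoted so that find, not the shell, expands it.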
I suggest a slight change:
If the file you create matches the pattern of the files you are concatenating, you can get into an infinite loop where the file you create is being concatenated to itself. I've done this before :)
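One way to avoid that trap is to build the combined file under a temporary name and only move it into place at the end, skipping the output file if the wildcard happened to match it. A minimal sketch, assuming bash (for the -ef file comparison) and mktemp. The name safe_concat is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: safe_concat is a hypothetical helper name.
# Usage: safe_concat OUTPUT INPUT...
safe_concat() {
    local out=$1; shift
    local tmp f
    # Temporary file next to OUTPUT. It is created after the shell has
    # already expanded the wildcard, so it is never one of the inputs.
    tmp=$(mktemp "${out}.tmp.XXXXXX") || return 1
    for f in "$@"; do
        # Skip OUTPUT itself if an earlier run left it matching the pattern.
        if [ "$f" -ef "$out" ]; then
            continue
        fi
        cat -- "$f" >> "$tmp" || { rm -f -- "$tmp"; return 1; }
    done
    # Replace OUTPUT only once everything has been read successfully.
    mv -- "$tmp" "$out"
}
```

With this, something like safe_concat all.fasta *.fasta does not read all.fasta back into itself, even when it is run a second time in the same directory. Simply writing the output to a different directory or extension, as suggested above, remains the simpler fix.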
This was very useful!
Isn't cat a Linux command? I am working on Windows. Do you know the equivalent command for Windows?
"Any command-line or batch cmd to concatenate multiple files?" http://superuser.com/questions/111825/