Aligning thousands of genomes with Mugsy
23 months ago

Hello everyone, I'm fairly new to bioinformatics; my background is mainly in molecular biology (wet lab).

I have a project where I must align thousands of HIV-1 genomes, and the tool I found best suited for the job was Mugsy. I spent a few weeks understanding it and getting it to work. I tested it by aligning 2, 5, 10 and 50 genomes and the results were perfect; however, I now need to align 13K genomes.

I'm using a server with 128 GB of RAM, so I shouldn't face any memory-related issues. However, the mugsy command requires me to list every input file, and typing 13K file names is impractical. When I tried copying the file names into Notepad and pasting them into the terminal, the line was too long and got cut off partway. I also tried combining all the genomes into a single multi-FASTA file, but mugsy detected only the first genome.

Is there any way to give mugsy a directory rather than individual files, or any suggestion for how I could get mugsy to detect all the genomes in a multi-FASTA file? Any help would be really appreciated. I searched the forum for similar issues but couldn't find any. Thanks in advance :)


Doesn't *.fasta work?


Worth a try. Here is what the mugsy help says:

   mugsy --directory /data/output --prefix mygenomes genome1.fasta genome2.fasta genome3.fasta


This example will align three genomes and output a file /data/output/mygenomes.maf. The --directory setting is also used for storing temporary files during the run.

The prefix of each input filename will be used as the genome name in the output files (eg. genome1 from genome1.fsa). Header lines in the FASTA files should not contain ':' or '-' to avoid parsing problems.


"I tested it to align 2, 5 , 10 and 50 and the results were perfect however now I must align 13K genomes."

Did you do that by providing individual file names?

"I tried combining all the genomes in a single multi-fasta file however mugsy detected only the first genome."

It appears that mugsy expects each genome to be in its own file, so providing a multi-FASTA file is not likely to work (it looks like mugsy will take a multi-FASTA file as long as all the contigs are from one genome, which is not the case here).

You may want to look at an alternative tool (e.g. a multiple sequence alignment tool such as T-Coffee or MUSCLE).
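
If the sequences are only available as one combined file, a rough sketch of splitting it into one file per record could look like the following. The file names (combined.fasta, genome_00001.fasta, ...) and the toy sequences are made up for illustration, and this only prepares the files; I have not checked how mugsy behaves on the result.

```shell
# Hypothetical demo input: a tiny multi-FASTA file in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '>seqA\nACGT\nACGT\n>seqB\nTTGA\n' > combined.fasta

# At each header line, close the previous output file and open a new one
# named after the running record number; every line goes to the current file.
awk '
  /^>/ { if (out) close(out); n++; out = sprintf("genome_%05d.fasta", n) }
  out  { print > out }
' combined.fasta

# Count the files that were written.
set -- genome_*.fasta
n_files=$#
echo "wrote $n_files files"
```

Numbered names like these keep '.', ':' and '-' out of the part of the file name that mugsy would use as the genome name, but the header lines inside each file are copied unchanged.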


"Did you do that by providing individual file names?"

Yes, I typed the individual names; typing 5, 10 or 50 is not much of an issue.

"Doesn't *.fasta work?"

I tried:

mugsy --directory /output/mugsy --prefix *.fasta


And it returned:

Character '.' found in --prefix=*.fasta.  Please choose another --prefix that excludes '.'.


I'm sorry to bother you again, but I've been getting another error that I can't get around. When I try the command you suggested, it works perfectly if there are 100 FASTA files in the directory, but when I try it on the entire set of 12.9K sequences I get the error:

ERROR : Could not parse delta file, /home/...directory.../HIV_db_001.delta

.ERROR : Could not parse delta file, /home/...directory.../HIV_db_001.filt.delta

The error repeats for all the sequences. I'm fairly sure it's not a memory-related problem, because the server I'm running it on has 128 GB of RAM. Any idea what could be causing the problem?

23 months ago
h.mon 34k

Try:

mugsy \
--directory /output/mugsy \
--prefix HIV_genomes \
*.fasta


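HIV_genomes will be used as the prefix for the names of the output files. In your command, --prefix was followed directly by *.fasta, so mugsy took that as the prefix value (hence the error about '.') instead of as the list of input files.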
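
To see what the shell hands over when a command line ends in *.fasta, here is a minimal sketch. It does not run mugsy at all; the scratch directory and the three empty files are made up for the demonstration.

```shell
# Hypothetical demo: three empty FASTA files in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch genome1.fasta genome2.fasta genome3.fasta

# The shell replaces *.fasta with one argument per matching file name
# before the command starts; `set --` stores them as $1, $2, ...
set -- *.fasta
count=$#
first=$1
echo "number of arguments: $count"
printf '%s\n' "$@"
```

Running the `set -- *.fasta; echo $#` part in the directory that holds the real genomes shows how many file names a command would receive from the glob.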


I don't understand the logic behind the code but OMGGGGG, it worked perfectly. Thanks a lot, I owe you one :)