Split a concatenated alignment in multiple files
5
0
Entering edit mode
6.1 years ago
dago ★ 2.7k

I have a concatenated MS alignment of more then 3000 genes in fasta format.

 

I would like to split this alignment in 10-20 files, equally divided if it possible.

 

In my MSA I have 30 seqs, and each seq is made of milions of bp.

>1

ATTGCTGAAACGGCTTTGAAAAGGGCGGAAAATCTTTCTGGTGGT [..]

>2

ATTGCTGAAAC----GGCTTTGAAAAGGGC----GGAAAATCTTTCTGGTGG [..]

>3

ATTGCTGAAAC----GGCTTTGAAAAGGGC----GGAAAATCTTTCTGGTGGT [..]

>4

GATGAGCCGATTGCTTCGCTGGATCCGATGAATGCGCAGGTGGTGATGGACGCTCTTAAG [..]

........

 

 

Now I would like to split this file in multiple files having chunks of the MSA. for example file1 from position 1 to 300; file 2 from position 301 to 600; file 3 from 6001 to 900; and so on.

 

I could not find a way of doing that, any suggestion?

The problem is that most of the tools only split multiple fasta in different files or split single sequences.

 

PS: I do not have available the original single MSA

fasta alignment • 3.4k views
ADD COMMENT
0
Entering edit mode

How are they concatenated? Is there any example you could provide?

ADD REPLY
0
Entering edit mode

@mxs, it is just a simple MSA, but instead of having around 1000 bp, each entry (>Ids) has milions of bp

ADD REPLY
0
Entering edit mode

And what is the file format? Multi fasta? Phylip? Stockholm? Something else?

ADD REPLY
0
Entering edit mode

@5heikki, right, it is in fasta.

ADD REPLY
2
Entering edit mode
6.1 years ago
5heikki 9.7k

If the seqs have no linebreaks, then e.g.:

awk '{if(/^>/)print $0; else print substr($0,1,4)}' file > output 

Would print headers and first 4 characters of each seq into output, while:

awk '{if(/^>/)print $0; else print substr($0,5,4)}' file > output

Would print headers and characters 5-8 to output (substr(string,start index,length).

 

ADD COMMENT
0
Entering edit mode

that's a good one, but the seqs have line breacks
 

ADD REPLY
0
Entering edit mode

Well, you can remove those too with awk. I'm not going to write how but google is your friend..

ADD REPLY
0
Entering edit mode

hahah...great, sorry for my laziness. In this case `tr` is even easier: ` tr -d '\n\r'`

ADD REPLY
0
Entering edit mode

That tr would delete all linebreaks, including the ones after headers..

ADD REPLY
1
Entering edit mode
3.4 years ago
Prakki Rama ★ 2.4k

I wanted to do the samething. So I wrote this script. This perl script will break the aligned fasta file into smaller chunks from position1 to position2 (1 to 100, 100 to 200,...etc). The default chunk size used in the script is: 500KB.

syntax : <perl splitalignment.pl="">

$filename="seqnew.oneline.fa"; ##Input Aligned fasta file (after alignment)

open FH,"$filename";

while(<FH>)
{
    $header=$_;
    $seq=<FH>;
    @splitSeq=unpack("(A500000)*", $seq); ##Breaks into 500 KB sequence length
    for($i=0;$i<=$#splitSeq;$i++)
        {
        open FH2,">>$filename\_$i";
        print FH2"$header";
        print FH2"$splitSeq[$i]\n";
        }
}
ADD COMMENT
0
Entering edit mode
6.1 years ago
venu 6.8k

If MSA is in FASTA format.. faSplit is very useful.

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/?C=D;O=A

$ chmod +x faSplit

$ ./faSplit

$ prints usage information
ADD COMMENT
0
Entering edit mode

this is a realy cool tool, but I am afraid it splits the first seq only and does not take all the sqs in my MSA, did I miss something?

ADD REPLY
0
Entering edit mode

keep a backup file and/or do with a simple file (containing less number of sequences)..

try sequence, base, size flags individually. They must do this job. 

ADD REPLY
0
Entering edit mode

@venu, you mean use for each alignment single files containing only one sequence; split them; and put them together?

ADD REPLY
0
Entering edit mode

not yet all..that is a risky task right? create a MSA with less number of sequences. (I suggest create a protein MSA, in which each sequence length is very less)

when you do ./faSplit it prints some message on how to use. Try each online code on this new MSA file you created. For each output, check it whether it is what you intended. 

ADD REPLY
0
Entering edit mode

@venu, thanks I tried that but It is not what I want

ADD REPLY
0
Entering edit mode

fine..we will do this in some other way..you have a MSA of 3000 genes, under each sequence header there are plenty of bp. Now you need 10-20 files that contain your 3000 genes equally distributed. Is this what you exactly want?

ADD REPLY
0
Entering edit mode

Thanks, please see my edit

ADD REPLY
0
Entering edit mode
6.1 years ago
mxs ▴ 530

I have to admit I am a bit confused right now. The line below (regardless of the number of lines in a fasta entry) splits fasta entries in 3 files. each containing equal number of sequences.

 

perl -lne 'if(/>/){$t++;$x=$t%3; open(O,">>","file.$x.fa");print O $_}else{print O $_} ' fasta.fa

 

fasta.fa - you input

3  - number of files

file.0.fa, file.1.fa, file.2.fa  - output files

Is this what you are looking for?

 

UPDATE:

 

perl -lne 'chomp;if(/>(.*)/){if($s){$i = 0;$z = 3;for(1..$z){open(O,">>","file.$_.fa");print O ">$h"; $f = length($s)/$z; ;print O substr($s, int($i), int($f+1)); $i=int($i+$f+1);}};$h = $1;$i = 0; $f = 0;$s = ""}else{$s .=$_} ' fasta.fa

 

3 is the number of files . if modified it has to be changed to something else  $z = 5 for 5 output files.

Quick explanation:

the oneliner above takes your input fasta.fa file (which has many fasta records and each can be  be seperated in multiple lines).

>1
atggatgatag
gcgctgctcgtc
gctagctgat
>2
gcgctgctcgtc
gctagctgat

and divides this initial file into $z number of files splitting each fasta record accordingly. This means if fasta seq 1 has 33 nt's and number of files is 4 then 3 files will contain 9 nt's and the last one 6. The same dividing strategy holds for >2 and >3 ...

ADD COMMENT
0
Entering edit mode

thanks, but it seems that it separates the seqs and does not split the alignment. Lets say that in my file I have: >seq1: >seq2; >seq3. You script will divide this in 3 files containing 1 seq each. Instead I want multiple files having all 3 seqs but with a shorter length of the original

ADD REPLY
0
Entering edit mode

aha. is there any limitation on the length  ??  can they all be of equal size ?? (let say 300 )

ADD REPLY
0
Entering edit mode

I do not know which size suits me the nest, because I have to feed this seqs to another program. Duno what it likes.

ADD REPLY
0
Entering edit mode
6.1 years ago

You can do this using pyfaidx:

pip install --user pyfaidx
faidx -i file.fa | cut -f1 | xargs -n 300 | while read region; do faidx file.fa $region > file.$RANDOM.fa; done
ADD COMMENT

Login before adding your answer.

Traffic: 1686 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6