Question: Split a concatenated alignment in multiple files
0
gravatar for dago
3.2 years ago by
dago2.5k
Germany
dago2.5k wrote:

I have a concatenated MS alignment of more then 3000 genes in fasta format.

 

I would like to split this alignment in 10-20 files, equally divided if it possible.

 

In my MSA I have 30 seqs, and each seq is made of milions of bp.

>1

ATTGCTGAAACGGCTTTGAAAAGGGCGGAAAATCTTTCTGGTGGT [..]

>2

ATTGCTGAAAC----GGCTTTGAAAAGGGC----GGAAAATCTTTCTGGTGG [..]

>3

ATTGCTGAAAC----GGCTTTGAAAAGGGC----GGAAAATCTTTCTGGTGGT [..]

>4

GATGAGCCGATTGCTTCGCTGGATCCGATGAATGCGCAGGTGGTGATGGACGCTCTTAAG [..]

........

 

 

Now I would like to split this file in multiple files having chunks of the MSA. for example file1 from position 1 to 300; file 2 from position 301 to 600; file 3 from 6001 to 900; and so on.

 

I could not find a way of doing that, any suggestion?

The problem is that most of the tools only split multiple fasta in different files or split single sequences.

 

PS: I do not have available the original single MSA

alignment fasta • 1.7k views
ADD COMMENTlink modified 6 months ago by Prakki Rama2.1k • written 3.2 years ago by dago2.5k

How are they concatenated? Is there any example you could provide?

ADD REPLYlink written 3.2 years ago by mxs530

@mxs, it is just a simple MSA, but instead of having around 1000 bp, each entry (>Ids) has milions of bp

ADD REPLYlink written 3.2 years ago by dago2.5k

And what is the file format? Multi fasta? Phylip? Stockholm? Something else?

ADD REPLYlink written 3.2 years ago by 5heikki7.2k

@5heikki, right, it is in fasta.

ADD REPLYlink written 3.2 years ago by dago2.5k
2
gravatar for 5heikki
3.2 years ago by
5heikki7.2k
Finland
5heikki7.2k wrote:

If the seqs have no linebreaks, then e.g.:

awk '{if(/^>/)print $0; else print substr($0,1,4)}' file > output 

Would print headers and first 4 characters of each seq into output, while:

awk '{if(/^>/)print $0; else print substr($0,5,4)}' file > output

Would print headers and characters 5-8 to output (substr(string,start index,length).

 

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by 5heikki7.2k

that's a good one, but the seqs have line breacks
 

ADD REPLYlink written 3.2 years ago by dago2.5k

Well, you can remove those too with awk. I'm not going to write how but google is your friend..

ADD REPLYlink written 3.2 years ago by 5heikki7.2k

hahah...great, sorry for my laziness. In this case `tr` is even easier: ` tr -d '\n\r'`

ADD REPLYlink written 3.2 years ago by dago2.5k

That tr would delete all linebreaks, including the ones after headers..

ADD REPLYlink written 3.2 years ago by 5heikki7.2k
0
gravatar for venu
3.2 years ago by
venu5.1k
Germany
venu5.1k wrote:

If MSA is in FASTA format.. faSplit is very useful.

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/?C=D;O=A

$ chmod +x faSplit

$ ./faSplit

$ prints usage information
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by venu5.1k

this is a realy cool tool, but I am afraid it splits the first seq only and does not take all the sqs in my MSA, did I miss something?

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by dago2.5k

keep a backup file and/or do with a simple file (containing less number of sequences)..

try sequence, base, size flags individually. They must do this job. 

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by venu5.1k

@venu, you mean use for each alignment single files containing only one sequence; split them; and put them together?

ADD REPLYlink written 3.2 years ago by dago2.5k

not yet all..that is a risky task right? create a MSA with less number of sequences. (I suggest create a protein MSA, in which each sequence length is very less)

when you do ./faSplit it prints some message on how to use. Try each online code on this new MSA file you created. For each output, check it whether it is what you intended. 

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by venu5.1k

@venu, thanks I tried that but It is not what I want

ADD REPLYlink written 3.2 years ago by dago2.5k

fine..we will do this in some other way..you have a MSA of 3000 genes, under each sequence header there are plenty of bp. Now you need 10-20 files that contain your 3000 genes equally distributed. Is this what you exactly want?

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by venu5.1k

Thanks, please see my edit

ADD REPLYlink written 3.2 years ago by dago2.5k
0
gravatar for mxs
3.2 years ago by
mxs530
mxs530 wrote:

I have to admit I am a bit confused right now. The line below (regardless of the number of lines in a fasta entry) splits fasta entries in 3 files. each containing equal number of sequences.

 

perl -lne 'if(/>/){$t++;$x=$t%3; open(O,">>","file.$x.fa");print O $_}else{print O $_} ' fasta.fa

 

fasta.fa - you input

3  - number of files

file.0.fa, file.1.fa, file.2.fa  - output files

Is this what you are looking for?

 

UPDATE:

 

perl -lne 'chomp;if(/>(.*)/){if($s){$i = 0;$z = 3;for(1..$z){open(O,">>","file.$_.fa");print O ">$h"; $f = length($s)/$z; ;print O substr($s, int($i), int($f+1)); $i=int($i+$f+1);}};$h = $1;$i = 0; $f = 0;$s = ""}else{$s .=$_} ' fasta.fa

 

3 is the number of files . if modified it has to be changed to something else  $z = 5 for 5 output files.

Quick explanation:

the oneliner above takes your input fasta.fa file (which has many fasta records and each can be  be seperated in multiple lines).

>1
atggatgatag
gcgctgctcgtc
gctagctgat
>2
gcgctgctcgtc
gctagctgat

and divides this initial file into $z number of files splitting each fasta record accordingly. This means if fasta seq 1 has 33 nt's and number of files is 4 then 3 files will contain 9 nt's and the last one 6. The same dividing strategy holds for >2 and >3 ...

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by mxs530

thanks, but it seems that it separates the seqs and does not split the alignment. Lets say that in my file I have: >seq1: >seq2; >seq3. You script will divide this in 3 files containing 1 seq each. Instead I want multiple files having all 3 seqs but with a shorter length of the original

ADD REPLYlink written 3.2 years ago by dago2.5k

aha. is there any limitation on the length  ??  can they all be of equal size ?? (let say 300 )

ADD REPLYlink written 3.2 years ago by mxs530

I do not know which size suits me the nest, because I have to feed this seqs to another program. Duno what it likes.

ADD REPLYlink written 3.2 years ago by dago2.5k
0
gravatar for Matt Shirley
3.2 years ago by
Matt Shirley8.4k
Cambridge, MA
Matt Shirley8.4k wrote:

You can do this using pyfaidx:

pip install --user pyfaidx
faidx -i file.fa | cut -f1 | xargs -n 300 | while read region; do faidx file.fa $region > file.$RANDOM.fa; done
ADD COMMENTlink written 3.2 years ago by Matt Shirley8.4k
0
gravatar for Prakki Rama
6 months ago by
Prakki Rama2.1k
Singapore
Prakki Rama2.1k wrote:

I wanted to do the samething. So I wrote this script. This perl script will break the aligned fasta file into smaller chunks from position1 to position2 (1 to 100, 100 to 200,...etc). The default chunk size used in the script is: 500KB.

syntax : <perl splitalignment.pl="">

$filename="seqnew.oneline.fa"; ##Input Aligned fasta file (after alignment)

open FH,"$filename";

while(<FH>)
{
    $header=$_;
    $seq=<FH>;
    @splitSeq=unpack("(A500000)*", $seq); ##Breaks into 500 KB sequence length
    for($i=0;$i<=$#splitSeq;$i++)
        {
        open FH2,">>$filename\_$i";
        print FH2"$header";
        print FH2"$splitSeq[$i]\n";
        }
}
ADD COMMENTlink modified 6 months ago • written 6 months ago by Prakki Rama2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 725 users visited in the last hour