Question: How to use faSplit to split a FASTA file into x files
olechnwin wrote:

I'm so confused. I can't figure out how to use faSplit to split my fasta file into 2 files. From the documentation, it seems I can do this command:

~/opt/faSplit sequence 1SQ_reads.fasta 2 1SQ_reads_

But this generates files 1SQ_reads_0.fa, 1SQ_reads_1.fa, 1SQ_reads_2.fa, and so on...

What did I do wrong? How do I split my fasta file into several files?

modified 10 days ago • written 10 days ago by olechnwin

Make sure the faSplit you're using is the right faSplit. Run man faSplit to check the version as well as the usage documentation.

The actual faSplit binary can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

modified 10 days ago • written 10 days ago by RamRS

Thanks for your reply. I tried man faSplit, but it came back with 'No manual entry for faSplit'. I thought that's where I downloaded it from. I'll try to re-download.

written 10 days ago by olechnwin

Type faSplit and press Enter without any arguments; it will print the help. Copied from the help:

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries
written 10 days ago by cpad0112

Yes, I get that. I was wondering whether the ability to print the manual depends on having a newer version.

What is the latest version? The one I have is this one:

# Name                    Version                   Build  Channel
ucsc-fasplit              366                  h5eb252a_0    bioconda
modified 10 days ago • written 10 days ago by olechnwin

This seems to be the latest version on bioconda. I installed that version and I see the help text when I run fasplit without arguments. The faSplit binary from UCSC doesn't work for me on macOS but works fine on GNU/Linux.

modified 10 days ago • written 10 days ago by RamRS

Thanks for checking the version. So faSplit sequence from the bioconda version does not split my fasta into the desired number of files. Instead, I divided the size of my original file by the number of files I wanted and used faSplit about to split by size, which gave approximately the number of files I was after.
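
For reference, a minimal sketch of that workaround. chunk_bytes and split_about_n are hypothetical helper names, not part of faSplit, and the sketch assumes the `faSplit about <in.fa> <size> <outRoot>` form described in the tool's help text. Because `about` only breaks at record boundaries, the number of output files is approximate.

```shell
# chunk_bytes FILE N: print ceil(size of FILE in bytes / N).
chunk_bytes() {
    size=$(wc -c < "$1")
    echo $(( ($size + $2 - 1) / $2 ))
}

# split_about_n FILE N OUTROOT: ask faSplit for pieces of about size/N bytes.
# Assumes faSplit is on PATH; this function is defined here but not called.
split_about_n() {
    faSplit about "$1" "$(chunk_bytes "$1" "$2")" "$3"
}
```

Usage would be something like `split_about_n 1SQ_reads.fasta 2 1SQ_reads`.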

modified 10 days ago • written 10 days ago by olechnwin

I checked on my computer, it works fine. It does not split them into equally sized files, but it does split them into as many files as requested. My commands:

$ cat >seq.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTC
AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAG
TCTGCACACC
[Ctrl-D]

$ ls
seq.fa

$ fasplit sequence seq.fa 2 ssss
$ ls
seq.fa  ssss0.fa  ssss1.fa

$ cat ssss0.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCT
CGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCC
CGGGTGGCC

$ cat ssss1.fa
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGA
TCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAG
GACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCG
CGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATA
TACCTTTCCAGACGCGGGATCTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCA
TTATCGTTAATGGGAACTTCAGTGACCAGTCCTCAGACACGAAGGATGCT
CCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAGTCTGCACACC
written 10 days ago by RamRS

I have absolutely no idea why it doesn't work on mine. My commands:

~/opt/miniconda3/bin/faSplit sequence 1SQ_reads.fasta 2 test/test_

$ cd test    
$ ls | less

test_0.fa
test_10000.fa
test_10001.fa
test_10002.fa
test_10003.fa
test_10004.fa
test_10005.fa
test_10006.fa
test_10007.fa
test_10008.fa
test_10009.fa
test_1000.fa

and many more test_*.fa files.

My input:

$ less 1SQ_reads.fasta
>m54201_171012_165441/4653169/0_22251
TGCAGAATTTAAGGTTTTTCTAACTCTGATCCTGATCAAATATGCCCTCTGCTGTGGCCC
AACCTGTTCCATTATAAGGCAAGCTGATGAGAGTAAATACCCCCTACTGGTGCTGAGTAG
ATAAGCCCCTGTTGCTGAAAGGTCAGGGGGTATTTGTCTTAAACGGGCGTTTAGTTTATT
GTGGGTGTCACATCTTGTGCTACCGAGATTCCACGGAGATCACACAGAGGAGGGATAATT
CTGATGGTGGCCTCTTGAATGTTCGCAGGCACAACTTAAGCAAAGAGACAGAGAACGAGT
CTCTAAGACATCATAACTGCTGATTATAGCTCCTCCTGAAGCTTGATGGGACAATGTAAT
TGCCTACTTTATGAAACTATTAAATTATTATTAGCCGCATCTAATTTATAATCCTAAACG
AACTATATTTGTTTAACATTCAATCATCTTTGCTGTTTTATAAACTTTGATTTTAAAGCC
ATCTGTTTGATCCTTTAATACATGCCTTTGGGCCTGGATTGAAGACGCGTCTATTTGGCA
AGCAGATGATCCTTTCTCCTTCCCACGTTTTTTGGTTTTCTGTTTTGTCTGTAATGGCTT
ATTGAGAAAAACACAAATACTTTGTTGCAGAGAATTGGCAACAGCGGGAGCTTGCATATT
modified 7 days ago • written 7 days ago by olechnwin

Maybe check with your sysadmin on this? Can you also post output of uname -a?

written 7 days ago by RamRS
$ uname -a
Linux qmaster01.cluster 2.6.32-696.30.1.el6.x86_64 #1 SMP Tue May 22 03:28:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
written 7 days ago by olechnwin

Can you try with the fasta file I used and run the same commands and see if the output is different? I just wanna make sure your FASTA identifiers are not messing with the program (they shouldn't be, but just in case)

written 7 days ago by RamRS

@RamRS, I tried to use your fasta file and it worked!

$ ~/opt/miniconda3/bin/faSplit sequence seq.fa 2 test2/
$ ls test2/
0.fa  1.fa
$ ~/opt/miniconda3/bin/faSplit sequence seq.fa 3 test2/
$ ls test2/
0.fa  1.fa  2.fa

So, are my FASTA identifiers messing with the program? FYI, my fasta file is from PacBio. Should I be concerned about running faSplit on my fasta files then?

modified 7 days ago • written 7 days ago by olechnwin

Not sure if it's the identifiers. Why don't you try:

~/opt/miniconda3/bin/faSplit sequence <(sed 's#/#__#g' 1SQ_reads.fasta) 2 test/test_
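
To see what that sed expression does before running it on the full file, it can be tried on a single header line. rename_slashes is a hypothetical wrapper name; the sed expression is the one from the command above. Note that the <( ... ) process substitution needs bash or zsh, not plain sh.

```shell
# rename_slashes: replace every "/" on stdin with "__" (same sed expression as above).
rename_slashes() {
    sed 's#/#__#g'
}

# Try it on one of the PacBio-style headers from the input file.
printf '%s\n' '>m54201_171012_165441/4653169/0_22251' | rename_slashes
# prints: >m54201_171012_165441__4653169__0_22251
```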
written 7 days ago by RamRS

This doesn't work either:

$ ~/opt/miniconda3/bin/faSplit sequence <(sed 's#/#__#g' 1SQ_reads.fasta) 2 test/test_
$ ls test/test_* |wc -l
17188
written 6 days ago by olechnwin

Just to make sure that the previous run did not affect this one, you did rm -rf test before running the faSplit command, right? How big is your 1SQ_reads.fasta file?

written 6 days ago by RamRS

Yes. I removed the test folder before running the faSplit with sed. My 1SQ_reads.fasta file is about 40 GB.

written 6 days ago by olechnwin

You should not use faSplit sequence then; it seems to work in a way that doesn't really make sense. In your case, if it worked as it should, it would produce two files, one of a few kB and the other almost 40 GB. Maybe try faSplit about, and copy over a few lines if it breaks halfway through an entry?

written 6 days ago by RamRS

Hmm... thanks for the hint about checking the files. I switched to faSplit about when I realized faSplit sequence does not work. But upon checking the result, it seemed that although faSplit about appeared to be working properly, it wasn't!

Update: I made a mistake. It seems to be working, at least for the files I checked.

The beginning of file 2 is this:

$ sed -n -e 1p -e 2p 1SQ_reads02.fa
>m54201_171012_165441/42795207/24314_44191
CCGGTTAAGAAGAGATACACCACCAGAGCAACGAGTTGTGAGATAAAGAG

Searching for this line in original file:

$ grep -n ">m54201_171012_165441/42795207/24314_44191" ../1SQ_reads.fasta
66989054:>m54201_171012_165441/42795207/24314_44191

Printing the preceding lines from the original file:

$ sed -n -e 66989052,66989054p 1SQ_reads.fasta
TTTGAAGTGACATCTATCACTTTTGCTCTACATTTTATTTCATTAGAAGGGAGTGATCTT
GTAGG
>m54201_171012_165441/42795207/24314_44191

It does match! At least for the ones I checked.

$ tail -n 1 1SQ_reads01.fa
TTTTGCTCTACATTTTATTTCATTAGAAGGGAGTGATCTTGTAGG

But now I'm wary of using faSplit to split fasta files from PacBio.
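
If faSplit keeps misbehaving on this file, one alternative (not faSplit, and not something anyone in this thread ran) is a plain awk round-robin split. This is only a sketch under assumptions: split_fasta_rr is a hypothetical name, records are dealt out in turn so each file gets about the same number of reads rather than a contiguous block, and line wrapping is left as in the input.

```shell
# split_fasta_rr IN N PREFIX: write record i (0-based) to PREFIX<i mod N>.fa.
# Lines before the first ">" header, if any, are skipped.
split_fasta_rr() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ { out = sprintf("%s%d.fa", prefix, rec % n); rec++ }
        out != "" { print > out }
    ' "$1"
}
```

For example, `split_fasta_rr 1SQ_reads.fasta 2 test/test_` would be expected to write test/test_0.fa and test/test_1.fa, assuming the test directory already exists.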

modified 6 days ago • written 6 days ago by olechnwin

@RamRS, did you try the faSplit binary from http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/ ?

modified 10 days ago • written 10 days ago by cpad0112

I don't think I tried that, but on macOS I used the one from bioconda and it works fine.

written 10 days ago by RamRS

The faSplit binary does not work for me, since it was built on a more recent OS than the one I currently have.

written 7 days ago by olechnwin

I downloaded the binary from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64 and it is working as expected (faSplit base)

command (split test.fa into 2 files):

$ faSplit sequence test.fa 2 test_

output:

$ tail -n+1 test_*
==> test_0.fa <==
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCT
CGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCC
CGGGTGGCC

==> test_1.fa <==
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGA
TCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAG
GACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCG
CGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATA
TACCTTTCCAGACGCGGGATCTCCCCTCCCC

input:

$ cat test.fa 
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC
modified 9 days ago • written 9 days ago by cpad0112