how to use faSplit to split fasta into x files
3.9 years ago
olechnwin ▴ 60

I'm so confused. I can't figure out how to use faSplit to split my fasta file into 2 files. From the documentation, it seems I can do this command:

What did I do wrong? How do I split my FASTA file into several files?

next-gen faSplit sequence • 4.1k views

Ensure the faSplit you're using is the right faSplit. Run man faSplit to check the version as well as the usage documentation.

The actual faSplit binary can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/


Thanks for your reply. I tried man faSplit, but it came back with 'No manual entry for faSplit'. I thought that's where I downloaded it from. I'll try to re-download.


Type faSplit with no arguments and press Enter. It will print the help text. Copied from the help:

Examples:
faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries


Yes. I get that. I was wondering if the ability to print the manual is specific to a newer version.

What is the latest version? The one I have is this one:

# Name                    Version                   Build  Channel
ucsc-fasplit              366                  h5eb252a_0    bioconda


This seems to be the latest version on bioconda. I installed that version and I see the help text when I run fasplit without arguments. The faSplit binary from UCSC doesn't work for me on macOS but works fine on GNU/Linux.


Thanks for checking the version. So, with the bioconda version, faSplit sequence does not split my FASTA into the desired number of files. Instead, I divided the size of my original file by the number of files I wanted and used faSplit about to split by size, which gave me approximately that number of files.
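The size-based workaround described here comes down to a little arithmetic. Below is a minimal shell sketch, assuming the faSplit about argument order shown in the tool's help (input file, approximate size in bytes, output root). The demo file, the part count, and the part_ output root are made-up placeholders, and the faSplit command is only printed, not run:

```shell
# Hypothetical demo input; substitute your own FASTA file.
workdir=$(mktemp -d)
fasta="$workdir/demo.fa"
printf '>r1\nACGTACGT\n>r2\nACGT\n>r3\nACGTACGTACGT\n' > "$fasta"

parts=2                                   # desired number of output files
size=$(wc -c < "$fasta" | tr -d ' ')      # file size in bytes
chunk=$(( (size + parts - 1) / parts ))   # bytes per file, rounded up

echo "file size: $size bytes, target chunk: $chunk bytes"
# Print (rather than run) the command one would use; part_ is a placeholder root.
echo "faSplit about $fasta $chunk $workdir/part_"
```

For a real file, point fasta at it and run the printed command. Since the split is made by record, the number of files you get is only approximately parts.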


I checked on my computer, it works fine. It does not split them into equally sized files, but it does split them into as many files as requested. My commands:

$ cat >seq.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTC
AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAG
TCTGCACACC
[Ctrl-D]

$ ls
seq.fa

$ fasplit sequence seq.fa 2 ssss
$ ls
seq.fa  ssss0.fa  ssss1.fa

$ cat ssss0.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCT
CGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCC
CGGGTGGCC

$ cat ssss1.fa
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGA
TCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAG
GACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCG
CGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATA
TACCTTTCCAGACGCGGGATCTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCA
TTATCGTTAATGGGAACTTCAGTGACCAGTCCTCAGACACGAAGGATGCT
CCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAGTCTGCACACC


I have absolutely no idea why it doesn't work on mine. My commands:

~/opt/miniconda3/bin/faSplit sequence 1SQ_reads.fasta 2 test/test_

$ cd test
$ ls | less

test_0.fa
test_10000.fa
test_10001.fa
test_10002.fa
test_10003.fa
test_10004.fa
test_10005.fa
test_10006.fa
test_10007.fa
test_10008.fa
test_10009.fa
test_1000.fa


and many more test_*.fa files.

My input:

$ less 1SQ_reads.fasta
>m54201_171012_165441/4653169/0_22251
TGCAGAATTTAAGGTTTTTCTAACTCTGATCCTGATCAAATATGCCCTCTGCTGTGGCCC
AACCTGTTCCATTATAAGGCAAGCTGATGAGAGTAAATACCCCCTACTGGTGCTGAGTAG
ATAAGCCCCTGTTGCTGAAAGGTCAGGGGGTATTTGTCTTAAACGGGCGTTTAGTTTATT
GTGGGTGTCACATCTTGTGCTACCGAGATTCCACGGAGATCACACAGAGGAGGGATAATT
CTGATGGTGGCCTCTTGAATGTTCGCAGGCACAACTTAAGCAAAGAGACAGAGAACGAGT
CTCTAAGACATCATAACTGCTGATTATAGCTCCTCCTGAAGCTTGATGGGACAATGTAAT
TGCCTACTTTATGAAACTATTAAATTATTATTAGCCGCATCTAATTTATAATCCTAAACG
AACTATATTTGTTTAACATTCAATCATCTTTGCTGTTTTATAAACTTTGATTTTAAAGCC
ATCTGTTTGATCCTTTAATACATGCCTTTGGGCCTGGATTGAAGACGCGTCTATTTGGCA
AGCAGATGATCCTTTCTCCTTCCCACGTTTTTTGGTTTTCTGTTTTGTCTGTAATGGCTT
ATTGAGAAAAACACAAATACTTTGTTGCAGAGAATTGGCAACAGCGGGAGCTTGCATATT

Maybe check with your sysadmin on this? Can you also post output of uname -a?

$ uname -a
Linux qmaster01.cluster 2.6.32-696.30.1.el6.x86_64 #1 SMP Tue May 22 03:28:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


Can you try with the fasta file I used and run the same commands and see if the output is different? I just wanna make sure your FASTA identifiers are not messing with the program (they shouldn't be, but just in case)


@RamRS, I tried to use your fasta file and it worked!

$ ~/opt/miniconda3/bin/faSplit sequence seq.fa 2 test2/
$ ls test2/
0.fa  1.fa
$ ~/opt/miniconda3/bin/faSplit sequence seq.fa 3 test2/
$ ls test2/
0.fa  1.fa  2.fa


So, are my FASTA identifiers messing with the program? FYI, my FASTA file is from PacBio. Should I be concerned about running faSplit on my FASTA files then?


Not sure if it's the identifiers. Why don't you try:

~/opt/miniconda3/bin/faSplit sequence <(sed 's#/#__#g' 1SQ_reads.fasta) 2 test/test_


This doesn't work either:

$ ~/opt/miniconda3/bin/faSplit sequence <(sed 's#/#__#g' 1SQ_reads.fasta) 2 test/test_
$ ls test/test_* |wc -l
17188


Just to make sure that the previous run did not affect this one, you did rm -rf test before running the faSplit command, right? How big is your 1SQ_reads.fasta file?


Yes. I removed the test folder before running the faSplit with sed. My 1SQ_reads.fasta file is about 40 GB.


You should not use faSplit sequence then, it seems to work in a fashion that doesn't really make sense. In your case it would, if it worked as it should, produce two files where one is a few kB and the other almost 40GB. Maybe try faSplit about and copy over a few lines if it breaks halfway through an entry?
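If faSplit is not cooperating, a size-based split that only breaks at record boundaries can be approximated with plain awk. This is a rough sketch, not how faSplit itself works: the demo file, the part count, and the part_ output root are made-up placeholders, and a new output file is started at the first header line seen after the current file reaches the target size:

```shell
# Hypothetical demo input; substitute your own FASTA file.
workdir=$(mktemp -d)
fasta="$workdir/demo.fa"
printf '>r1\nACGTACGT\n>r2\nACGTACGT\n>r3\nACGTACGT\n>r4\nACGT\n' > "$fasta"

parts=2
size=$(wc -c < "$fasta" | tr -d ' ')
chunk=$(( (size + parts - 1) / parts ))   # target bytes per output file

awk -v chunk="$chunk" -v parts="$parts" -v root="$workdir/part_" '
    BEGIN { idx = 0; written = 0 }
    # Switch to the next file only at a header line, once the current
    # file has reached the target size, and never past the last part.
    /^>/ && written >= chunk && idx < parts - 1 {
        close(out)
        idx++
        written = 0
    }
    {
        out = root idx ".fa"
        print > out
        written += length($0) + 1   # +1 for the newline
    }
' "$fasta"

ls "$workdir"
```

Because a record is never cut, the files are only roughly equal in size, and a single very long record can make one file much larger than the others.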


Hmm, thanks for the hint about checking the files. I used faSplit about when I realized faSplit sequence does not work. But upon checking the result, it seemed that although faSplit about looked like it was working properly, it didn't!

Update: made a mistake. Seems to be working. At least for the ones I checked.

The beginning of file 2 is this:

$ sed -n -e 1p -e 2p 1SQ_reads02.fa
>m54201_171012_165441/42795207/24314_44191
CCGGTTAAGAAGAGATACACCACCAGAGCAACGAGTTGTGAGATAAAGAG

Searching for this line in original file:

$ grep -n ">m54201_171012_165441/42795207/24314_44191" ../1SQ_reads.fasta
66989054:>m54201_171012_165441/42795207/24314_44191


Printing the previous line from original file:

$ sed -n -e 66989052,66989054p 1SQ_reads.fasta
TTTGAAGTGACATCTATCACTTTTGCTCTACATTTTATTTCATTAGAAGGGAGTGATCTT
GTAGG
>m54201_171012_165441/42795207/24314_44191

It does match! At least for the ones I checked.

$ tail -n 1 1SQ_reads01.fa
TTTTGCTCTACATTTTATTTCATTAGAAGGGAGTGATCTTGTAGG


But now I'm wary of using faSplit to split FASTA files from PacBio.
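Spot checks like the ones above can be automated for every output file. Here is a small sketch, assuming the split files match a part_*.fa pattern (all file names below are made up for the demo): it checks that each piece starts with a '>' header, and that the header count of the pieces adds up to that of the original:

```shell
# Hypothetical demo files standing in for an original FASTA and its pieces.
workdir=$(mktemp -d)
printf '>r1\nACGT\n>r2\nACGT\n>r3\nAC\n' > "$workdir/orig.fa"
printf '>r1\nACGT\n' > "$workdir/part_0.fa"
printf '>r2\nACGT\n>r3\nAC\n' > "$workdir/part_1.fa"

status=ok
for f in "$workdir"/part_*.fa; do
    # A file cut in the middle of a record would start with sequence, not '>'.
    first=$(head -c 1 "$f")
    [ "$first" = ">" ] || { echo "bad start: $f"; status=fail; }
done

# The pieces together should hold as many records as the original.
orig_n=$(grep -c '^>' "$workdir/orig.fa")
split_n=$(cat "$workdir"/part_*.fa | grep -c '^>')
[ "$orig_n" -eq "$split_n" ] || status=fail

echo "records: original=$orig_n split=$split_n status=$status"
```

On a 40 GB input the grep passes take a while, but they read each file only once.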


I don't think I tried that, but on macOS I used the one from bioconda and it works fine.


The faSplit binary does not work for me since it was built on a more recent OS than the one I currently have.


I downloaded the binary from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64 and it is working as expected (faSplit base)

function (split test.fa to 2 files):

$ faSplit sequence test.fa 2 test_

output:

$ tail -n+1 test_*
==> test_0.fa <==
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCT
CGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCC
CGGGTGGCC

==> test_1.fa <==
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGA
TCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAG
GACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCG
CGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATA
TACCTTTCCAGACGCGGGATCTCCCCTCCCC


input:

$ cat test.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC