How to use faSplit to split a FASTA file into x files
5.4 years ago
olechnwin ▴ 60

I'm so confused. I can't figure out how to use faSplit to split my fasta file into 2 files. From the documentation, it seems I can do this command:

~/opt/faSplit sequence 1SQ_reads.fasta 2 1SQ_reads_

But this generates the files 1SQ_reads_0.fa, 1SQ_reads_1.fa, 1SQ_reads_2.fa, and so on.

What did I do wrong? How do I split my FASTA file into several files?

next-gen faSplit sequence • 6.2k views

Make sure the faSplit you're using is the right one. Run man faSplit to check the version and the usage documentation.

The actual faSplit binary can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/


Thanks for your reply. I tried man faSplit, but it came back with 'No manual entry for faSplit'. I thought that was where I downloaded it from, but I'll try re-downloading it.


Type faSplit with no arguments and press Enter; it will print the help text. Copied from the help:

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

Yes, I get that. I was wondering whether being able to print the manual depends on having a newer version.

What is the latest version? The one I have is this one:

# Name                    Version                   Build  Channel
ucsc-fasplit              366                  h5eb252a_0    bioconda

This seems to be the latest version on bioconda. I installed that version and I see the help text when I run fasplit without arguments. The faSplit binary from UCSC doesn't work for me on macOS, but it works fine on GNU/Linux.


Thanks for checking the version. So faSplit sequence from the bioconda build does not split my FASTA file into the requested number of files. Instead, I divided the size of my original file by the number of files I wanted and used faSplit about to split by size, which gave me approximately that number of files.
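
For anyone who wants to repeat the size-based workaround described above, here is a rough sketch of the arithmetic. The helper name chunk_bytes is made up for illustration, and it assumes (from my reading of the faSplit help text) that the number passed to faSplit about is an approximate size in bytes per output file. I have not checked how faSplit itself copes with very large size values.

```shell
# Hypothetical helper (not part of faSplit): chunk_bytes TOTAL_BYTES N_FILES
# Prints the ceiling of TOTAL_BYTES / N_FILES, i.e. the approximate
# per-file size to aim for when N_FILES output files are wanted.
chunk_bytes() {
    local total=$1 parts=$2
    echo $(( (total + parts - 1) / parts ))
}

# Example usage, left commented out because it needs faSplit and the input file.
# The file name and output root are the ones used in this thread:
#   size=$(chunk_bytes "$(wc -c < 1SQ_reads.fasta)" 2)
#   faSplit about 1SQ_reads.fasta "$size" 1SQ_reads
```

Because faSplit about only breaks at record boundaries, the number of files it writes will only be approximately the number requested.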


I checked on my computer and it works fine. It does not split the input into equally sized files, but it does produce as many files as requested. My commands:

$ cat >seq.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCATTATCGTTAATGGGAACTTC
AGTGACCAGTCCTCAGACACGAAGGATGCTCCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAG
TCTGCACACC
[Ctrl-D]

$ ls
seq.fa

$ fasplit sequence seq.fa 2 ssss
$ ls
seq.fa  ssss0.fa  ssss1.fa

$ cat ssss0.fa
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCT
CGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCC
CGGGTGGCC

$ cat ssss1.fa
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGA
TCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAG
GACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCG
CGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATA
TACCTTTCCAGACGCGGGATCTCCCCTCCCC
>sequence4
CAGCAGACATCTGAATGAAGAAGAGGGTGCCAGCGGGTATGAGGAGTGCA
TTATCGTTAATGGGAACTTCAGTGACCAGTCCTCAGACACGAAGGATGCT
CCCTCACCCCCAGTCTTGGAGGCAATCTGCACAGAGCCAGTCTGCACACC

I have absolutely no idea why it doesn't work on mine. My commands:

~/opt/miniconda3/bin/faSplit sequence 1SQ_reads.fasta 2 test/test_

$ cd test    
$ ls | less

test_0.fa
test_10000.fa
test_10001.fa
test_10002.fa
test_10003.fa
test_10004.fa
test_10005.fa
test_10006.fa
test_10007.fa
test_10008.fa
test_10009.fa
test_1000.fa

and many more test_*.fa files.

My input:

$ less 1SQ_reads.fasta
>m54201_171012_165441/4653169/0_22251
TGCAGAATTTAAGGTTTTTCTAACTCTGATCCTGATCAAATATGCCCTCTGCTGTGGCCC
AACCTGTTCCATTATAAGGCAAGCTGATGAGAGTAAATACCCCCTACTGGTGCTGAGTAG
ATAAGCCCCTGTTGCTGAAAGGTCAGGGGGTATTTGTCTTAAACGGGCGTTTAGTTTATT
GTGGGTGTCACATCTTGTGCTACCGAGATTCCACGGAGATCACACAGAGGAGGGATAATT
CTGATGGTGGCCTCTTGAATGTTCGCAGGCACAACTTAAGCAAAGAGACAGAGAACGAGT
CTCTAAGACATCATAACTGCTGATTATAGCTCCTCCTGAAGCTTGATGGGACAATGTAAT
TGCCTACTTTATGAAACTATTAAATTATTATTAGCCGCATCTAATTTATAATCCTAAACG
AACTATATTTGTTTAACATTCAATCATCTTTGCTGTTTTATAAACTTTGATTTTAAAGCC
ATCTGTTTGATCCTTTAATACATGCCTTTGGGCCTGGATTGAAGACGCGTCTATTTGGCA
AGCAGATGATCCTTTCTCCTTCCCACGTTTTTTGGTTTTCTGTTTTGTCTGTAATGGCTT
ATTGAGAAAAACACAAATACTTTGTTGCAGAGAATTGGCAACAGCGGGAGCTTGCATATT

Maybe check with your sysadmin on this? Can you also post output of uname -a?

$ uname -a
Linux qmaster01.cluster 2.6.32-696.30.1.el6.x86_64 #1 SMP Tue May 22 03:28:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Can you try with the fasta file I used and run the same commands and see if the output is different? I just wanna make sure your FASTA identifiers are not messing with the program (they shouldn't be, but just in case)


@RamRS, I tried to use your fasta file and it worked!

$ ~/opt/miniconda3/bin/faSplit sequence seq.fa 2 test2/
$ ls test2/
0.fa  1.fa
$ ~/opt/miniconda3/bin/faSplit sequence seq.fa 3 test2/
$ ls test2/
0.fa  1.fa  2.fa

So are my FASTA identifiers messing with the program? FYI, my FASTA file is from PacBio. Should I be concerned about running faSplit on my FASTA files, then?


I'm not sure whether it's the identifiers. Why don't you try:

~/opt/miniconda3/bin/faSplit sequence <(sed 's#/#__#g' 1SQ_reads.fasta) 2 test/test_

This doesn't work either:

$ ~/opt/miniconda3/bin/faSplit sequence <(sed 's#/#__#g' 1SQ_reads.fasta) 2 test/test_
$ ls test/test_* |wc -l
17188

Just to make sure that the previous run did not affect this one, you did rm -rf test before running the faSplit command, right? How big is your 1SQ_reads.fasta file?


Yes. I removed the test folder before running the faSplit with sed. My 1SQ_reads.fasta file is about 40 GB.


You should not use faSplit sequence then; it seems to work in a way that doesn't really make sense. In your case, if it worked as it should, it would produce two files, one of a few kB and the other of almost 40 GB. Maybe try faSplit about, and copy over a few lines by hand if it breaks halfway through an entry?
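
If faSplit keeps misbehaving, one alternative that never cuts an entry in half is to skip faSplit and deal records out with awk. The sketch below is not how faSplit works; it is a plain round-robin split, and the function name split_fasta_round_robin and the output naming (PREFIX0.fa, PREFIX1.fa, ...) are my own choices for illustration.

```shell
# Hypothetical helper (awk-based, does not use faSplit):
#   split_fasta_round_robin INPUT N PREFIX
# Writes PREFIX0.fa ... PREFIX(N-1).fa. Record k (counting from 0) goes to
# file number k mod N, so files are only ever broken at record boundaries.
split_fasta_round_robin() {
    awk -v n="$2" -v prefix="$3" '
        # A header line starts a new record: choose the next output file.
        /^>/ { out = prefix (count % n) ".fa"; count++ }
        # Copy every line (header and sequence) to the current output file;
        # lines before the first header are skipped.
        out != "" { print > out }
    ' "$1"
}

# Example usage, left commented out because it needs the input file:
#   mkdir -p test
#   split_fasta_round_robin 1SQ_reads.fasta 2 test/test_
```

With reads of varying length the resulting files will hold the same number of records (give or take one) but not exactly the same number of bytes. awk keeps one file open per output, so this is only meant for a small N.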


Hmm, thanks for the hint about checking the files. I switched to faSplit about when I realized faSplit sequence does not work. But upon checking the result, it seemed that although faSplit about appeared to be working properly, it didn't!

Update: I made a mistake. It seems to be working, at least for the files I checked.

The beginning of file 2 is this:

$ sed -n -e 1p -e 2p 1SQ_reads02.fa
>m54201_171012_165441/42795207/24314_44191
CCGGTTAAGAAGAGATACACCACCAGAGCAACGAGTTGTGAGATAAAGAG

Searching for this line in original file:

$ grep -n ">m54201_171012_165441/42795207/24314_44191" ../1SQ_reads.fasta
66989054:>m54201_171012_165441/42795207/24314_44191

Printing the preceding lines from the original file:

$ sed -n -e 66989052,66989054p 1SQ_reads.fasta
TTTGAAGTGACATCTATCACTTTTGCTCTACATTTTATTTCATTAGAAGGGAGTGATCTT
GTAGG
>m54201_171012_165441/42795207/24314_44191

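It does match! The end of file 1 is the sequence that comes just before this header in the original (the output is re-wrapped, so the line breaks differ). At least for the ones I checked.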

$ tail -n 1 1SQ_reads01.fa
TTTTGCTCTACATTTTATTTCATTAGAAGGGAGTGATCTTGTAGG

But now I'm wary of using faSplit to split FASTA files from PacBio.
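
A cheaper sanity check than inspecting boundaries by hand might be to compare record counts and make sure every piece starts with a header. This is only a sketch: the helper names count_records and starts_with_header are invented here, and matching counts do not prove the sequences themselves are intact.

```shell
# Hypothetical helper: count_records FILE...
# Prints the total number of FASTA header lines (lines starting with ">")
# across all the files given.
count_records() {
    cat "$@" | grep -c '^>'
}

# Hypothetical helper: starts_with_header FILE
# Succeeds (exit status 0) only if the first line of FILE is a header,
# i.e. the piece does not begin in the middle of an entry.
starts_with_header() {
    head -n 1 "$1" | grep -q '^>'
}

# Example usage, left commented out because it needs the real files.
# The glob assumes pieces named like 1SQ_reads01.fa, 1SQ_reads02.fa, ...:
#   [ "$(count_records 1SQ_reads.fasta)" = "$(count_records 1SQ_reads[0-9]*.fa)" ] \
#       && echo "record counts match"
#   for f in 1SQ_reads[0-9]*.fa; do
#       starts_with_header "$f" || echo "$f starts mid-record"
#   done
```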


I don't think I tried that, but on macOS I used the one from bioconda and it works fine.


The faSplit binary does not work for me because it was built on a more recent OS than the one I currently have.


I downloaded the binary from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64 and it is working as expected (faSplit base).

Command (split test.fa into 2 files):

$ faSplit sequence test.fa 2 test_

output:

$ tail -n+1 test_*
==> test_0.fa <==
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCT
CGCAGCCGCAGCCCGCGTGGACGCTCTCGCCTGAGCGCCGCGGACTAGCC
CGGGTGGCC

==> test_1.fa <==
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGA
TCGGCGCCGGAGATTCGCGAACCCGACACTCCGCGCCGCCCGCCGGCCAG
GACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCTCACTGGCGCG
CGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATA
TACCTTTCCAGACGCGGGATCTCCCCTCCCC

input:

$ cat test.fa 
>sequence1
ACTCCCCGTGCGCGCCCGGCCCGTAGCGTCCTCGTCGCCGCCCCTCGTCTCGCAGCCGCAGCCCGCGTGG
ACGCTCTCGCCTGAGCGCCGCGGACTAGCCCGGGTGGCC
>sequence2
CAGTCCGGCAGCGCCGGGGTTAAGCGGCCCAAGTAAACGTAGCGCAGCGATCGGCGCCGGAGATTCGCGA
ACCCGACACTCCGCGCCGCCCGCCGGCCAGGACCCGCGGCGCGATCGCGGCGCCGCGCTACAGCCAGCCT
CACTGGCGCGCGGGCGAGCGCACGGGCGCTC
>sequence3
CACGACAGGCCCGCTGAGGCTTGTGCCAGACCTTGGAAACCTCAGGTATATACCTTTCCAGACGCGGGAT
CTCCCCTCCCC