Splitting genome 2bit files
5.6 years ago
c7750 • 0

I want to run continuous integration for my software which depends on genome data from the UCSC genome browser in 2bit format. The file it depends on is 800 MB, which is too large for GitHub. How can I split one of these files to have a manageable size for testing? Is there a way I can split by chromosome or genome position?

python genome testing 2bit twobit • 2.7k views

"I want to run continuous integration for my software"

What does that mean? I'm asking as someone who is not a software developer.

If your software needs the 2bit files, can you not link to them directly at UCSC and provide instructions on what people should do with the download?

It means a third-party service runs my tests automatically whenever I push a change. I can't just link to the file, because the tests are run by a machine rather than by a person who could follow download instructions.

You probably want your tests to be over small examples, like a small chromosome, or even just a fragment of a chromosome.

You can split a FASTA file with samtools (among dozens of other options; see "How To Split A Multiple Fasta", which ironically doesn't include a samtools solution) and then convert the small FASTA to 2bit with faToTwoBit.
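
Since the question is tagged python, here is a rough sketch of that two-step route driven from Python. It assumes samtools and faToTwoBit are installed and on PATH; the helper names, file names and region are made up for illustration and are not from the thread.

```python
import subprocess
from pathlib import Path


def faidx_command(fasta, region):
    """Build a `samtools faidx` call that prints one region as FASTA.

    `region` uses samtools syntax, e.g. "chr21" or "chr21:1-100000"
    (1-based, inclusive coordinates).
    """
    return ["samtools", "faidx", str(fasta), region]


def fa_to_twobit_command(fasta, twobit):
    """Build a `faToTwoBit in.fa out.2bit` call."""
    return ["faToTwoBit", str(fasta), str(twobit)]


def make_test_twobit(fasta, region, out_prefix):
    """Write <out_prefix>.fa and <out_prefix>.2bit for a single region.

    Hypothetical helper: requires samtools and faToTwoBit on PATH.
    """
    small_fa = Path(f"{out_prefix}.fa")
    small_2bit = Path(f"{out_prefix}.2bit")
    # samtools writes the extracted sequence to stdout, so send it to a file.
    with small_fa.open("w") as handle:
        subprocess.run(faidx_command(fasta, region), stdout=handle, check=True)
    subprocess.run(fa_to_twobit_command(small_fa, small_2bit), check=True)
    return small_2bit


# Example with hypothetical file names:
# make_test_twobit("hg38.fa", "chr21:1-100000", "test_region")
```

One thing to watch: samtools names the extracted record after the requested region (for example chr21:1-100000), so the tests would have to look up that name in the small 2bit file.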

5.6 years ago

FASTA and 2bit can easily be converted back and forth with the UCSC utilities twoBitToFa and faToTwoBit: https://genome.ucsc.edu/goldenpath/help/twoBit.html

Per-chromosome FASTA files can be downloaded from here: http://hgdownload.cse.ucsc.edu/downloads.html#human
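
If you already have the big 2bit file locally, a round trip through FASTA can cut it down to one chromosome or one region. The sketch below is only an illustration: it assumes twoBitToFa and faToTwoBit are on PATH, and the function names and file names are hypothetical. It relies on twoBitToFa's -seq, -start and -end options, where start is zero-based and end is non-inclusive.

```python
import subprocess


def twobit_to_fa_command(twobit, fasta, seq=None, start=None, end=None):
    """Build a `twoBitToFa` call, optionally restricted to one region.

    start is zero-based and end is non-inclusive, following twoBitToFa's
    own usage text.
    """
    cmd = ["twoBitToFa"]
    if seq is not None:
        cmd.append(f"-seq={seq}")
    if start is not None:
        cmd.append(f"-start={start}")
    if end is not None:
        cmd.append(f"-end={end}")
    cmd += [str(twobit), str(fasta)]
    return cmd


def shrink_twobit(twobit, small_twobit, seq, start=None, end=None):
    """Extract one sequence (or part of it) and repack it as a small 2bit.

    Hypothetical helper: requires twoBitToFa and faToTwoBit on PATH.
    """
    tmp_fa = f"{small_twobit}.fa"  # intermediate FASTA next to the output
    subprocess.run(twobit_to_fa_command(twobit, tmp_fa, seq, start, end), check=True)
    subprocess.run(["faToTwoBit", tmp_fa, str(small_twobit)], check=True)
    return small_twobit


# Example with hypothetical file names:
# shrink_twobit("hg38.2bit", "test.2bit", "chr21", start=0, end=100000)
```

The resulting file should be small enough to commit to the repository, so the CI run never has to download the full genome.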
