Question

Small dataset for data analysis on one's own laptop

0

5.1 years ago

iibrams07 ▴ 10

Dear All,

I have little experience with bioinformatics. At the moment I do not have access to an HPC cluster, which seems to be indispensable for big data analysis.

Do you know of a publicly available dataset that is small enough, yet instructive, to be analyzed on one's own laptop? I have an HP laptop with 7 CPUs and 16 GB RAM. Ideally, the dataset would come from a eukaryotic organism with a reference genome, so that I can practice RNA-seq, ChIP-seq or ATAC-seq analysis.

Besides, is there any such thing as public cloud computing, so that one could analyze big data without many restrictions?

I would highly appreciate any comments.

RNA-Seq ChIP-Seq • 1.4k views

updated 5.1 years ago by Istvan Albert 100k • written 5.1 years ago by iibrams07 ▴ 10

1

What are the specifications of your laptop? I perform DNA-, RNA-, and ChIP-seq analyses on my laptop, and I have processed entire TCGA datasets. The only limitation I have run into is whole genome phasing.

Specs:

16GB RAM
2TB hard-disk (3TB on external drives)
4 CPU cores
Model: HP Pavilion

5.1 years ago by Kevin Blighe 87k

0

As I noted above, the specs are:

16GB RAM
1TB hard-disk (this should not be a problem since one can use external drives)
7 CPU
Model: HP Pavilion

I know of a dataset containing subsets of many RNA-seq, ATAC-seq and ChIP-seq files (mouse genome), with a total volume of about 1TB. Do you think it is feasible to download the files one at a time and process them sequentially on my laptop? How long would it take my HP Pavilion to process each such file? What do you mean by whole genome phasing?

Many thanks.

5.1 years ago by iibrams07 ▴ 10

1

Whole genome phasing analysis is a 'compute intensive' task. It involves haplotype-resolving variants called in a particular sample. If you are not sure, do not worry about it. The zipped FASTQ files can be 30GB each.

RNA-seq analysis can be very quick if you use a pseudo-aligner, like Salmon. Exome seq can take quite some time, e.g., 4-12 hours to align reads and call variants in a single sample. ChIP- and ATAC-seq will depend on the marker used and how extensive the binding (and thus number of reads) was.

5.1 years ago by Kevin Blighe 87k

0

Would upgrading the memory from 16GB to 32GB RAM make a big difference in processing time? Is it worth it? Concerning Salmon, do you mean using Salmon instead of DESeq? I am asking because I have not encountered Salmon in the RNA-seq tutorials I have looked at.

5.1 years ago by iibrams07 ▴ 10

2

Salmon is a transcript quantifier. It produces transcript-level abundance estimates. These are typically aggregated to the gene level with tximport (BioC package), followed by differential analysis with DESeq2 (or any other framework you prefer), see here. Salmon is not resource-hungry and is very fast. The limiting factor will be the alignment of the ChIP/ATAC-seq data. 16GB should be enough. CPU is the critical factor, as there is a somewhat linear relationship between time saved and CPUs used (at least in the range of 1 to about 16 cores; after that the IO bottleneck kicks in more and more). Memory is basically only required to load the alignment index, to hold the reads currently being processed by BWA (bowtie2 needs less memory in my experience, maybe give it a try), and in part to sort the alignment files. Still, do not waste too much private money. If you have limited computational resources, you simply have to wait longer for the job to finish. Test your scripts with small datasets and, once they are stable, just start and wait.
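To make the "aggregate to the gene level" step concrete, here is a minimal Python sketch of the idea only. It is not how tximport is implemented (tximport also handles transcript length offsets, among other things), and the transcript and gene IDs in the usage comment are made up for illustration.

```python
from collections import defaultdict


def aggregate_to_genes(tx_counts, tx2gene):
    """Sum transcript-level estimates into gene-level totals.

    tx_counts: dict mapping transcript ID -> estimated read count
    tx2gene:   dict mapping transcript ID -> gene ID
    Transcripts without a gene mapping are skipped.
    """
    gene_counts = defaultdict(float)
    for tx_id, count in tx_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            continue  # no gene known for this transcript
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Hypothetical IDs, for illustration only:
# aggregate_to_genes({"tx1": 10.0, "tx2": 5.5, "tx3": 2.0},
#                    {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"})
```

The gene-level table is what would then go into a differential expression framework such as DESeq2.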

5.1 years ago by ATpoint 81k

1

Not sure that you need 32GB RAM for most things. Aligning exome-seq reads to the human genome takes just 5-6GB of RAM. It may be better to think about parallelising the alignment process: A: Using parallel with Mem
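One way to picture the parallelisation idea is to split the reads into chunks that can be aligned independently and merged afterwards. The Python sketch below shows only the splitting step, under the assumption of an uncompressed FASTQ file with exactly 4 lines per record; the function name and chunk size are illustrative and not taken from the linked post. Note that aligners such as BWA also have their own multi-threading option (`bwa mem -t`), which is often the simpler route.

```python
from itertools import islice


def fastq_chunks(lines, records_per_chunk):
    """Yield lists of FASTQ lines holding at most `records_per_chunk` records.

    Assumes the standard 4-lines-per-record FASTQ layout
    (header, sequence, '+', qualities).
    """
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, 4 * records_per_chunk))
        if not chunk:
            return  # input exhausted
        yield chunk


# Usage sketch: each chunk could be written to its own file and
# aligned as a separate job.
# with open("reads.fastq") as handle:   # hypothetical file name
#     for i, chunk in enumerate(fastq_chunks(handle, 1_000_000)):
#         with open(f"chunk_{i}.fastq", "w") as out:
#             out.writelines(chunk)
```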

Salmon estimates transcript abundance (read counts). The output of Salmon can then be used as input to DESeq2 for normalisation and differential expression analysis. Another program of this kind is Kallisto.

5.1 years ago by Kevin Blighe 87k

score 2 · Answer 1 · 2019-03-16

If you search for tutorials on the types of data you mention above, you will find plenty of entries (e.g. ATAC-seq, ChIP-seq). You can look through some of these to find examples that use a reduced dataset, so that you can follow along on your laptop. Generally RAM is the limiting factor, but you have 16 GB, so that should be enough for small or reduced data. Look into getting Drosophila datasets, since the genome is complete and relatively small.

You can get an account on Cyverse to try some things out. Galaxy servers around the world will also let you try these analyses out. There are plenty of tutorials available for Galaxy. Some you could download for local use on your laptop.

Google, Amazon and Azure are all public cloud providers, but you will need to pay to use them productively. They do offer trials or small VMs that can be used for a small amount of free time each month, but those will likely not be enough.

score 1 · Answer 2 · 2019-03-16

I'd look for data from yeast or Drosophila. While the biases are slightly different from those in mouse and human (fewer headaches with alternative splicing, different GC biases), these are widely used model organisms with tons of datasets and lots of insights to be gained. Their genomes/transcriptomes are also considerably smaller, so fewer reads are often enough to get sufficient depth, and the alignment step in particular, which is typically the most computationally intensive one, will be much less heavy.

You can either start directly at GEO and just search for the respective data types in the model organism of your choice, or check out the resources of modENCODE and their data warehouse. You could also consider downloading already processed data (e.g., BAM files, peaks), depending on the types of analyses that you actually want to perform.

score 1 · Answer 3 · 2019-03-16

Saccharomyces cerevisiae has a genome size of a mere 12 million bp, and it is also one of the most studied model organisms. Thanks to the small genome size, you can re-run just about any analysis on a laptop.
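To get a feel for what the genome size means in terms of data volume, here is a back-of-the-envelope Python sketch using the usual approximation: average depth = number of reads × read length / genome size. The 30x depth and 100 bp read length are illustrative assumptions, and the calculation applies to whole-genome sequencing rather than RNA-seq or ChIP-seq, where reads are not spread evenly over the genome.

```python
def reads_needed(genome_size_bp, depth, read_length_bp):
    """Number of reads needed for a given average depth (rounded up)."""
    total_bases = genome_size_bp * depth
    # Integer ceiling division, avoids floating point rounding.
    return -(-total_bases // read_length_bp)


# Yeast, ~12 million bp, at an assumed 30x depth with 100 bp reads:
yeast_reads = reads_needed(12_000_000, 30, 100)  # 3,600,000 reads
```

Under these assumptions, a few million reads already cover the yeast genome well, which is a small fraction of what a mammalian genome would need at the same depth.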

Another common strategy, for, say, human-size genomes, is to find an experiment that also publishes BAM files, then use a BAM file to extract only the FASTQ data that aligns to a smaller chromosome. For example, chromosome 22 is around 50 million bp long. Doing so gives you access to the "original" data, but instead of using the entire human genome you can repeat the analysis using just the target chromosome. The reduced size of the data and reference would allow you to practice the data analysis on a laptop.
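In practice this extraction is usually done with samtools on an indexed BAM file. The Python sketch below only illustrates the logic on plain-text SAM lines and is a simplified assumption of how one might do it: keep primary alignments on one chromosome and turn them back into FASTQ records. Reads aligned to the reverse strand are stored reverse-complemented in SAM, so they are flipped back. The chromosome name ("chr22" vs "22") depends on the reference used, and paired-end handling is left out.

```python
# Translation table used to complement a DNA sequence.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_lines, chrom):
    """Yield FASTQ records for primary alignments located on `chrom`."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, rname = fields[0], int(fields[1]), fields[2]
        seq, qual = fields[9], fields[10]
        if rname != chrom:
            continue
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary/supplementary to avoid duplicate reads
        if flag & 0x10:
            # Reverse-strand alignment: restore the original read orientation.
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

The resulting reads, together with a reference containing only that chromosome, give a dataset small enough to practice on.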