Small RNAseq Dataset to test scripts
4.0 years ago
avelarbio46 ▴ 30

Hello everyone!

I will be processing a large amount of RNA-seq data over the coming months, but we do not have a server yet (it will arrive next March).

In the meantime, I have a small PC with 16 GB of RAM. I would like to test my scripts (cutadapt, STAR, STAR-Fusion, HTSeq, DESeq2, etc.). That is, I want to work out beforehand how to install and use these programs.

For this purpose, I was thinking that with a very small RNA-seq dataset I could test the scripts faster and without running into memory limits, just to get used to all the programs.

Basically, what I would need is:

GTF file for splice junctions;

FASTQ file

FASTA reference

maybe adaptor sequences?

Is there any online dataset for learning purposes like this?

RNA-Seq rna-seq • 2.6k views
ADD COMMENT
4.0 years ago
Ram 44k

You should search for sample S. cerevisiae RNAseq datasets - they are really useful in testing pipelines before you apply those pipelines on human data (assuming you'll be working with human data later).
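As a rough sketch of what a yeast-sized test could look like with STAR: for small genomes, the STAR manual recommends lowering `--genomeSAindexNbases` to min(14, log2(GenomeLength)/2 - 1). The helper functions and the file names (`yeast.fa`, `yeast.gtf`, `star_index`) below are made up for illustration, and the `--sjdbOverhang 100` value assumes reads of about 101 bases.

```shell
# Count the bases in a FASTA file (header lines excluded).
genome_length() {
    grep -v '^>' "$1" | tr -d '\n' | wc -c
}

# Suggested --genomeSAindexNbases for a genome of the given length:
# min(14, log2(length)/2 - 1), rounded down.
sa_index_nbases() {
    awk -v glen="$1" 'BEGIN {
        v = int(log(glen) / log(2) / 2 - 1)
        if (v > 14) v = 14
        print v
    }'
}

# Example use (commented out; assumes STAR is installed and the
# hypothetical files yeast.fa and yeast.gtf exist):
# nbases=$(sa_index_nbases "$(genome_length yeast.fa)")
# STAR --runMode genomeGenerate \
#      --genomeDir star_index \
#      --genomeFastaFiles yeast.fa \
#      --sjdbGTFfile yeast.gtf \
#      --sjdbOverhang 100 \
#      --genomeSAindexNbases "$nbases" \
#      --runThreadN 4
```

For a genome of roughly 12 million bases, such as S. cerevisiae, this gives a value of 10; for a human-sized genome it stays at the default of 14.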

ADD COMMENT

Please post this as an answer! I did the calculations, and it looks feasible to do everything I need with the S. cerevisiae RNA-seq datasets. They are small enough that I can build indexes and align quickly, and the organism is so well studied that there are many accessible datasets and references. I will try this and write a more detailed answer afterwards!

Thank you for your suggestion

ADD REPLY

Done. Please accept it (green check mark on the left) if it worked for you.

ADD REPLY
4.0 years ago
Qiongyi ▴ 180

You can go to NCBI SRA (https://www.ncbi.nlm.nih.gov/sra), search for RNA-Seq, and select the same species as in your own study. After downloading one RNA-Seq dataset, you can also subsample the files for quick testing.

The simplest way is to take the first 1 million reads from each FASTQ file (4 million lines, since each FASTQ record spans 4 lines), using the commands below.

head -n 4000000 in_read1.fastq > out_read1.fastq
head -n 4000000 in_read2.fastq > out_read2.fastq
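Downloaded FASTQ files are often gzip-compressed, so the same idea can be wrapped in a small helper that handles both cases. This is only a sketch: the function name `first_reads` and the file names in the usage comment are invented for illustration. Taking the same number of reads from both files keeps the read pairs in sync.

```shell
# Print the first N reads of a FASTQ file (plain or gzip-compressed).
# Each FASTQ record is 4 lines, so N reads = 4*N lines.
first_reads() {
    n_reads="$1"
    in_file="$2"
    n_lines=$((n_reads * 4))
    case "$in_file" in
        # Decompress to stdout and stop after the requested lines.
        *.gz) gzip -cd "$in_file" | head -n "$n_lines" ;;
        *)    head -n "$n_lines" "$in_file" ;;
    esac
}

# Example use (commented out; hypothetical file names):
# first_reads 1000000 in_read1.fastq.gz > out_read1.fastq
# first_reads 1000000 in_read2.fastq.gz > out_read2.fastq
```

Note that the first reads in a file are not a random sample of the run, which is usually fine for checking that a pipeline executes but less so for judging results.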
ADD COMMENT

This might work, but there is a big problem: I can't subsample the reference. With STAR, building the index for the hg38 reference needs 64 GB of memory, which is prohibitive on my system!

ADD REPLY

Who told you that you need 64 GB of memory to create the index for hg38? You can even download a prebuilt index from http://daehwankimlab.github.io/hisat2/download/#h-sapiens if you use HISAT2 for the alignment. My laptop only has 8 GB of memory, and I can do the indexing and alignment for test sets.

Alternatively, you can just use chromosome 1 for testing. I don't think there is a problem in either case...
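One possible way to cut a reference down to a single chromosome uses only awk. The function names and file names here are illustrative, and the chromosome label is an assumption: UCSC-style references call it `chr1`, while Ensembl-style references call it `1`, and the FASTA and the annotation must use the same naming.

```shell
# Print only the FASTA record whose header name matches the given chromosome.
# The name is the first whitespace-delimited word after ">".
extract_chrom_fasta() {
    awk -v chr="$1" '/^>/ { keep = ($1 == ">" chr) } keep' "$2"
}

# Print GTF comment lines plus the features located on the given chromosome
# (column 1 of the tab-separated GTF is the sequence name).
extract_chrom_gtf() {
    awk -F'\t' -v chr="$1" '/^#/ || $1 == chr' "$2"
}

# Example use (commented out; hypothetical file names):
# extract_chrom_fasta chr1 genome.fa > chr1.fa
# extract_chrom_gtf chr1 annotation.gtf > chr1.gtf
```

Reads from other chromosomes will simply fail to align or align poorly against such a reduced reference, which is acceptable for checking that the tools run.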

ADD REPLY

I really need to use STAR. We decided that it is the best aligner for our needs, and it also lets us use STAR-Fusion. We tried to create the index with 32 GB of RAM and got an out-of-memory error! The problem with downloading a prebuilt index is that, two or three years later, it can be hard to find the exact same index again. To prevent this, we keep the original reference, the index, and the GFF file in our own storage.

ADD REPLY

Are you using STAR or STAR-Fusion? STAR does not need 64GB of memory.

ADD REPLY

I'm using STAR and STAR-Fusion. The last time we tried, we got a memory error, but it worked on a machine with 64 GB.

ADD REPLY

STAR does alignments; STAR-Fusion does fusion gene detection. What is your goal?

ADD REPLY
