I'm a microbiology postdoc but new to bioinformatics, so could do with a little help. I recently received the output files (fastq) from an RNAseq experiment we ran. The RNA was library prep'd and sequenced by our in house genomics facility, but I wanted to learn how to do the analysis of it. The RNA was from Gram +ve bacteria, and I want to do differential expression. Using FastQC the quality of the data is great, so I'm not concerned about that.
I've been advised to use bowtie2 to align to the reference, so I started with bowtie2-build to create the indexes from the reference .fna file. That seemed fine, and created several files.
Then I used
/Volumes/Daniel_HDD/RNAseq/bin/bowtie2-2.4.1/bowtie2 -x /Volumes/Daniel_HDD/RNAseq/index/700669_index -1 /Volumes/Daniel_HDD/RNAseq/WT1_R1_fastq.gz -2 /Volumes/Daniel_HDD/RNAseq/WT1_R2.fastq.gz -S /Volumes/Daniel_HDD/RNAseq/WT1.sam
Now, I previously did a run through using a nature protocols pipeline (Hisat2, then stringtie), and it gave me BAM files, but the bowtie2 method suggests to output to SAM files (which I know are bigger), but even the sam files for the last attempt were not anywhere near this size.. They've been running for a 1.5h and its currently at 19gb for alignment of this first pair. (I'm running a MacBook pro, 2015, 2.2 GHz Quad-Core Intel Core i7, 16gb RAM)
Is this normal? Should the file be this big? The genome of the reference is only 2.2mbp, and there are about 2k genes.. so I wasn't expecting it to be this big a file. Should I have output to a bam file, or compressed, or would that matter for this stage? Is there another pipeline that would be better to follow? Thanks in advance