Question

Bowtie2 alignment issues - huge file size for SAM file creation

0

Entering edit mode

3.6 years ago

dnljmrs ▴ 20

Hi all,

I'm a microbiology postdoc but new to bioinformatics, so could do with a little help. I recently received the output files (fastq) from an RNAseq experiment we ran. The RNA was library prep'd and sequenced by our in house genomics facility, but I wanted to learn how to do the analysis of it. The RNA was from Gram +ve bacteria, and I want to do differential expression. Using FastQC the quality of the data is great, so I'm not concerned about that.

I've been advised to use bowtie2 to align to the reference, so I started with bowtie2-build to create the indexes from the reference .fna file. That seemed fine, and created several files.

Then I used /Volumes/Daniel_HDD/RNAseq/bin/bowtie2-2.4.1/bowtie2 -x /Volumes/Daniel_HDD/RNAseq/index/700669_index -1 /Volumes/Daniel_HDD/RNAseq/WT1_R1_fastq.gz -2 /Volumes/Daniel_HDD/RNAseq/WT1_R2.fastq.gz -S /Volumes/Daniel_HDD/RNAseq/WT1.sam

Now, I previously did a run through using a nature protocols pipeline (Hisat2, then stringtie), and it gave me BAM files, but the bowtie2 method suggests to output to SAM files (which I know are bigger), but even the sam files for the last attempt were not anywhere near this size.. They've been running for a 1.5h and its currently at 19gb for alignment of this first pair. (I'm running a MacBook pro, 2015, 2.2 GHz Quad-Core Intel Core i7, 16gb RAM)

Is this normal? Should the file be this big? The genome of the reference is only 2.2mbp, and there are about 2k genes.. so I wasn't expecting it to be this big a file. Should I have output to a bam file, or compressed, or would that matter for this stage? Is there another pipeline that would be better to follow? Thanks in advance

RNA-Seq bowtie2 alignment • 1.9k views

ADD COMMENT • link 3.6 years ago by dnljmrs ▴ 20

0

Entering edit mode

Size of files is never a good metric. How many reads are you working with? If you have a large dataset then you are likely to get a large SAM file (which is text and uncompressed as compared to your reads which are compressed). You may have secondary alignments (multi-mapping) which would lead to additional lines in SAM file.

I don't recollect if bowtie2 will output to standard out if you remove the -S option. If it does then you can directly pipe the output of bowtie2 into samtools view | samtools sort to generate a sorted BAM file directly.

ADD REPLY • link 3.6 years ago by GenoMax 141k

0

Entering edit mode

Do you have a reference transcriptome?

ADD REPLY • link 3.6 years ago by ATpoint 82k

0

Entering edit mode

I downloaded from the NCBI the genome assemblies folder for this strain, and in that are many different files - so one of those could be the transcriptome rather than the genome. They are (among other text files, all preceded by GCF_000026665.1_ASM266v1_) cds_from_genomic.fna.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, genomic.gtf.gz, protein.faa.gz ,protein.gpff.gz, rna_from_genomic.fna.gz < could this one be it?, translated_cds.faa.gz

Should I use that to create the index files instead of the genomic one?

ADD REPLY • link 3.6 years ago by dnljmrs ▴ 20

0

Entering edit mode

Since you are working with bacteria it can be the CDS file. But since you have already built an index from genome reference and there is GTF/GFF file available you can just use one of those files to do read counting with featureCounts.

ADD REPLY • link 3.6 years ago by GenoMax 141k