Bowtie2 alignment issues - huge file size for SAM file creation
0
0
Entering edit mode
6 months ago
dnljmrs ▴ 10

Hi all,

I'm a microbiology postdoc but new to bioinformatics, so could do with a little help. I recently received the output files (fastq) from an RNAseq experiment we ran. The RNA was library prep'd and sequenced by our in house genomics facility, but I wanted to learn how to do the analysis of it. The RNA was from Gram +ve bacteria, and I want to do differential expression. Using FastQC the quality of the data is great, so I'm not concerned about that.

I've been advised to use bowtie2 to align to the reference, so I started with bowtie2-build to create the indexes from the reference .fna file. That seemed fine, and created several files.

Then I used /Volumes/Daniel_HDD/RNAseq/bin/bowtie2-2.4.1/bowtie2 -x /Volumes/Daniel_HDD/RNAseq/index/700669_index -1 /Volumes/Daniel_HDD/RNAseq/WT1_R1_fastq.gz -2 /Volumes/Daniel_HDD/RNAseq/WT1_R2.fastq.gz -S /Volumes/Daniel_HDD/RNAseq/WT1.sam

Now, I previously did a run through using a nature protocols pipeline (Hisat2, then stringtie), and it gave me BAM files, but the bowtie2 method suggests to output to SAM files (which I know are bigger), but even the sam files for the last attempt were not anywhere near this size.. They've been running for a 1.5h and its currently at 19gb for alignment of this first pair. (I'm running a MacBook pro, 2015, 2.2 GHz Quad-Core Intel Core i7, 16gb RAM)

Is this normal? Should the file be this big? The genome of the reference is only 2.2mbp, and there are about 2k genes.. so I wasn't expecting it to be this big a file. Should I have output to a bam file, or compressed, or would that matter for this stage? Is there another pipeline that would be better to follow? Thanks in advance

RNA-Seq bowtie2 alignment • 354 views
ADD COMMENT
0
Entering edit mode

Size of files is never a good metric. How many reads are you working with? If you have a large dataset then you are likely to get a large SAM file (which is text and uncompressed as compared to your reads which are compressed). You may have secondary alignments (multi-mapping) which would lead to additional lines in SAM file.

I don't recollect if bowtie2 will output to standard out if you remove the -S option. If it does then you can directly pipe the output of bowtie2 into samtools view | samtools sort to generate a sorted BAM file directly.

ADD REPLY
0
Entering edit mode

Do you have a reference transcriptome?

ADD REPLY
0
Entering edit mode

I downloaded from the NCBI the genome assemblies folder for this strain, and in that are many different files - so one of those could be the transcriptome rather than the genome. They are (among other text files, all preceded by GCF_000026665.1_ASM266v1_) cds_from_genomic.fna.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, genomic.gtf.gz, protein.faa.gz ,protein.gpff.gz, rna_from_genomic.fna.gz < could this one be it?, translated_cds.faa.gz

Should I use that to create the index files instead of the genomic one?

ADD REPLY
0
Entering edit mode

Since you are working with bacteria it can be the CDS file. But since you have already built an index from genome reference and there is GTF/GFF file available you can just use one of those files to do read counting with featureCounts.

ADD REPLY

Login before adding your answer.

Traffic: 1662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6