Question

Running featureCounts separately for each sample and merging results

0

4.4 years ago

daniel22373 • 0

I am running a differential expression analysis project using HISAT2 for alignment and featureCounts for read counting.

I have 27 samples with paired-end reads, and the FASTQ files alone take up about 270 GB. After running alignment for a few samples, it seems like all of the resulting BAM files after alignment with HISAT2 will likely take up well over 300 GB of space. I do not have that much storage space on my local machine, and transferring the files to Galaxy to run the analyses there will take too long.

Is it possible to run featureCounts on each BAM file separately and then combine the resulting raw read count matrices, instead of running featureCounts on all of the BAM files from the alignment step at once? This way, I can align one sample's reads, run featureCounts on the BAM file, and then purge the original FASTQ and BAM files, leaving only the read count matrices, which would take up much less storage space.
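The per-sample workflow described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `count_one_sample`, the `HISAT2_INDEX` and `ANNOTATION_GTF` variables, and all file names are hypothetical placeholders, and the featureCounts options would need to match your library (strandedness, feature type, and so on).

```shell
# Sketch only: count_one_sample, HISAT2_INDEX, ANNOTATION_GTF and the file
# names are hypothetical placeholders, not part of HISAT2 or featureCounts.
count_one_sample() {
    # $1 = sample name, $2 = R1 FASTQ, $3 = R2 FASTQ
    local sample="$1" r1="$2" r2="$3"
    local index="${HISAT2_INDEX:-genome_index}"     # assumed HISAT2 index prefix
    local gtf="${ANNOTATION_GTF:-annotation.gtf}"   # assumed annotation file

    # Align the paired-end reads; HISAT2 writes SAM to the file given by -S.
    hisat2 -x "$index" -1 "$r1" -2 "$r2" -S "$sample.sam" || return 1

    # Count this one sample (featureCounts accepts SAM or BAM input).
    # -p marks the input as paired-end; newer Subread releases additionally
    # need --countReadPairs to count fragments rather than individual reads.
    featureCounts -p -a "$gtf" -o "$sample.featureCount.txt" "$sample.sam" || return 1

    # Only reached if counting succeeded: remove the large alignment file.
    # Deleting the FASTQ files is left as a deliberate manual step.
    rm -f "$sample.sam"
}
```

Calling it once per sample, for example `count_one_sample sample01 sample01_R1.fastq.gz sample01_R2.fastq.gz`, leaves one small `*.featureCount.txt` table per sample and never keeps more than one alignment file on disk at a time.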

Any help would be much appreciated.

RNA-Seq assembly rna-seq sequencing gene • 10k views

ADD COMMENT • link updated 3.4 years ago by DareDevil ★ 4.4k • written 4.4 years ago by daniel22373 • 0

0

If you have run featureCounts on each BAM file separately, this article describes how to merge the counts from the individual files: https://www.reneshbedre.com/blog/featurecounts-matrix.html

ADD REPLY • link 3.9 years ago by Renesh ★ 2.2k

score 4 · Answer 1 · 2021-02-16

4

4.4 years ago

harishk0201 ▴ 130

featureCounts can take multiple bam files as input: just use

featureCounts [parameters] *.bam    (or, equivalently, $(ls *.bam))

You'll get a counts file that is tab-delimited which you can then parse out

grep -v "^#" counts_file | cut -d$'\t' -f1,7- > counts.matrix

1

Their problem was that they didn't have the space to store all BAM files.

ADD REPLY • link 4.4 years ago by rpolicastro 13k

0

Ahh, that's what I get for not having my coffee ;)

I understood that they had already performed the alignments.

If they haven't, Salmon/kallisto would obviously be the best option!

ADD REPLY • link 4.4 years ago by harishk0201 ▴ 130

score 2 · Answer 2 · 2021-02-16

If you just need to quantify read counts, you can use a transcript quantifier like Salmon instead, which goes straight from FASTQ files to count estimates (among a few other things), thus saving you the space of storing BAM files. Salmon tends to be quicker, more memory efficient, and more accurate than STAR + featureCounts anyway.

If you really want to use, or need to use featureCounts, you can save the individual count matrices, and merge/join all of them by gene name/id using your favorite programming language.
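As one possible way to do the merge in plain shell, here is a minimal sketch using awk. It assumes each file is the default featureCounts output for a single sample (a leading `# Program: ...` comment line, a header row starting with `Geneid`, and the count in column 7). The function name `merge_counts` is made up for illustration. Joining on the gene ID means the files do not need to list genes in the same order; a gene missing from a file is reported as `NA`.

```shell
# merge_counts: join single-sample featureCounts tables by gene ID (column 1).
# Hypothetical helper; assumes the count for each sample sits in column 7.
# Usage: merge_counts a.featureCount.txt b.featureCount.txt > merged.tsv
merge_counts() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { nfile++ }                 # a new input file starts here
        /^#/     { next }                    # skip the "# Program: ..." comment line
        $1 == "Geneid" { header[nfile] = $7; next }   # remember the sample column name
        {
            # Record genes in order of first appearance
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            count[$1, nfile] = $7
        }
        END {
            line = "Geneid"
            for (f = 1; f <= nfile; f++) line = line OFS header[f]
            print line
            for (g = 1; g <= ngenes; g++) {
                line = order[g]
                for (f = 1; f <= nfile; f++)
                    line = line OFS (((order[g], f) in count) ? count[order[g], f] : "NA")
                print line
            }
        }
    ' "$@"
}
```

The result is a tab-delimited matrix with one row per gene and one column per input file, in the order the files were given on the command line.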

score 2 · Answer 3 · 2021-02-16

2

4.4 years ago

swbarnes2 15k

If you are running something separate on your bam, run RSEM. It's smarter.

But yes, Salmon or Kallisto would solve your space problem, and be faster.

score 2 · Answer 4 · 2022-02-01

2

3.4 years ago

DareDevil ★ 4.4k

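You can also do this in a bash script:

# Extract the count column (column 7) from each file, dropping the leading "# Program:" comment line
ls -1 *featureCount.txt | parallel 'sed 1d {} | cut -f7 > {/.}_clean.txt'
# Take the gene IDs (column 1) from the first file, dropping the same comment line
ls -1 *featureCount.txt | head -1 | xargs sed 1d | cut -f1 > genes.txt
# Paste gene IDs and per-sample counts side by side (assumes every file lists the genes in the same order)
paste genes.txt *featureCount_clean.txt > output.txt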