Running featureCounts separately for each sample and merging results
3.2 years ago

I am running a differential expression analysis project using HISAT2 for alignment and featureCounts for read counting.

I have 27 samples with paired-end reads, and the FASTQ files alone take up about 270 GB. After aligning a few samples with HISAT2, it looks like the full set of BAM files will take up well over 300 GB of space. I do not have that much storage space on my local machine, and transferring the files to Galaxy to run the analyses there will take too long.

Is it possible to run featureCounts on each BAM file separately and then combine the resulting raw read count matrices, instead of running featureCounts on all of the BAM files from the alignment step at once? This way, I can align one sample's reads, run featureCounts on the BAM file, and then purge the original FASTQ and BAM files, leaving only the read count matrices, which would take up much less storage space.
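
A rough sketch of the per-sample workflow described above, assuming HISAT2, samtools, and featureCounts are on the PATH. The index prefix, annotation file, thread count, and the <sample>_R1/_R2.fastq.gz naming are hypothetical placeholders, not anything from this thread, so adjust them to your data.

```shell
#!/usr/bin/env bash
# Sketch: align, count, and clean up one sample at a time.
# Hypothetical names: INDEX, GTF, THREADS, and the <sample>_R1/_R2.fastq.gz layout.
set -o pipefail   # make a failed hisat2 fail the whole alignment pipe

INDEX=${INDEX:-hisat2_index/genome}
GTF=${GTF:-annotation.gtf}
THREADS=${THREADS:-4}

process_sample() {
    local sample=$1
    local r1="${sample}_R1.fastq.gz" r2="${sample}_R2.fastq.gz"
    local bam="${sample}.bam" counts="${sample}.featureCount.txt"

    # Align and sort in one pipe so no intermediate SAM file is written.
    hisat2 -p "$THREADS" -x "$INDEX" -1 "$r1" -2 "$r2" \
        | samtools sort -@ "$THREADS" -o "$bam" - || return 1

    # -p marks the input as paired-end; recent Subread releases also need
    # --countReadPairs to count fragments instead of individual reads.
    featureCounts -p -T "$THREADS" -a "$GTF" -o "$counts" "$bam" || return 1

    # Delete the large files only once a non-empty count table exists.
    [ -s "$counts" ] && rm -f "$bam" "$r1" "$r2"
}
```

Calling `process_sample sampleA` for each sample in turn keeps at most one BAM file on disk at a time. Only let it delete the FASTQ files if another copy exists somewhere.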

Any help would be much appreciated.

RNA-Seq assembly rna-seq sequencing gene • 6.7k views

If you have run featureCounts on each BAM file separately and would like to merge the counts from the individual files, see this article: https://www.reneshbedre.com/blog/featurecounts-matrix.html

3.2 years ago
harishk0201 ▴ 130

featureCounts can take multiple BAM files as input. Just use

featureCounts -parameters etc *.bam (or $(ls *bam))

You'll get a tab-delimited counts file, which you can then trim down to the gene IDs and the per-sample count columns:

grep -v "^#" counts_file | cut -d$'\t' -f1,7- > counts.matrix


Their problem was that they didn't have the space to store all BAM files.


Ahh, that's what I get for not having my coffee ;)

I understood that they had already performed the alignments.

If they haven't, then obviously salmon/kallisto would be the best option!

3.2 years ago

If you just need to quantify read counts, you can use a transcript quantifier like Salmon instead, which goes straight from FASTQ files to per-sample count estimates (among a few other things), saving you the space of storing BAM files. Salmon tends to be quicker, more memory efficient, and more accurate than STAR + featureCounts anyway.
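
As a rough illustration of the Salmon route, a per-sample call might look like the sketch below. It assumes a Salmon index has already been built from a transcriptome FASTA; the index directory, thread count, and FASTQ naming are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: quantify one paired-end sample with Salmon, with no BAM file involved.
# Hypothetical names: SALMON_INDEX, THREADS, and the <sample>_R1/_R2.fastq.gz layout.
SALMON_INDEX=${SALMON_INDEX:-salmon_index}
THREADS=${THREADS:-4}

quantify_sample() {
    local sample=$1
    # -l A asks Salmon to infer the library type automatically;
    # results are written to <sample>_quant/quant.sf.
    salmon quant -i "$SALMON_INDEX" -l A \
        -1 "${sample}_R1.fastq.gz" -2 "${sample}_R2.fastq.gz" \
        -p "$THREADS" -o "${sample}_quant"
}
```

Each output directory holds a quant.sf table of per-transcript estimates. These are transcript-level, so summarizing them to genes (for example with tximport in R) is a separate step.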

If you really want to use, or need to use featureCounts, you can save the individual count matrices, and merge/join all of them by gene name/id using your favorite programming language.
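
One way to do that merge is sketched below with awk. It assumes every file is unmodified featureCounts output from a single BAM file (a leading "#" comment line, a header row starting with Geneid, gene ID in column 1, counts in column 7). The function name merge_counts is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: merge per-sample featureCounts tables into one gene x sample matrix.
# Assumes one count column (column 7) per input file; merge_counts is a
# hypothetical helper name, not part of featureCounts.
merge_counts() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                       # skip the "# Program: ..." comment line
        $1 == "Geneid" {                    # header row: one per input file
            nfiles++
            header = header OFS $7          # column 7 header is the BAM file name
            next
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            count[$1, nfiles] = $7          # index counts by gene ID and file number
        }
        END {
            print "Geneid" header
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (f = 1; f <= nfiles; f++)
                    line = line OFS (((order[i], f) in count) ? count[order[i], f] : "NA")
                print line
            }
        }
    ' "$@"
}
```

Something like `merge_counts *featureCount.txt > counts.matrix` would then write the combined table. Matching on gene ID rather than pasting columns side by side means a file with missing or reordered genes shows up as NA entries instead of silently shifting rows.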

3.2 years ago

If you are running something separate on your bam, run RSEM. It's smarter.

But yes, Salmon or Kallisto would solve your space problem, and be faster.

2.2 years ago
DareDevil ★ 4.3k

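You can also do this in a bash script:

# Drop the leading "# Program..." comment line and keep only the count column (column 7)
ls -1 *featureCount.txt | parallel "sed '1d' {} | cut -f7 > {/.}_clean.txt"
# Take the gene IDs (column 1) from the first file, dropping the same comment line
ls -1 *featureCount.txt | head -1 | xargs sed '1d' | cut -f1 > genes.txt
# Put the gene IDs and the per-sample count columns side by side
paste genes.txt *featureCount_clean.txt > output.txt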