Hello,
I have a bam file for the samples of RNA-Seq data. I need to get read counts for these to perform differential expression analysis. Which tools should I use for this and how?
Thanq
Hello,
I have a bam file for the samples of RNA-Seq data. I need to get read counts for these to perform differential expression analysis. Which tools should I use for this and how?
Thanq
featureCounts is the easiest option.
...as well as multithreaded and tailored to RNA-seq. Go with featureCounts. You will need a GFF/GTF file that fits your organism/the assembly you aligned to and the BAM files.
### Assign reads to exons based on the GTF annotation:
# --a = the GTF/GFF file
# --F = specify the format of -a
# --p = data are paired-end (unset it if you have single-end)
# --T = set number of threads
# -P  = only consider pairs with ISIZE defined by -d & -D, default 50-600bp
# -o  = output file
$FCOUNTS -a $GTF -F GTF -p -T 8 -P -o countMatrix_all.txt *.bam
                    
                Sorted BAMs are not necessary, FC will sort internally if you have paired-end data (still, sorting a file is as easy as samtools sort  -o sorted.bam unsorted.bam). I edited my post in that regard. For the checking, simply use samtools view -f 1 in.bam | head and then check if some reads pop up. -f 1 means only keep paired reads. If there are none, your data are not PE. Alternatively, use samtools view -H in.bam to get the header of the file, and then check the @PG line at the bottom where the alignment command is saved. If you need help, please post the output of both commands, so we can have a look.
I actually gave it like this samtools view -H file.bam
I see this
@HD VN:1.4  SO:coordinate   GO:none
@SQ SN:NR_039978    LN:984
@SQ SN:NM_130786    LN:1766
@SQ SN:NM_138932    LN:9293
@SQ SN:NM_033110    LN:2678
@SQ SN:NM_000014    LN:4678
@SQ SN:NM_144670    LN:5192
@SQ SN:NR_040112    LN:1201
                    
                I dont see the gene which I need in the differential expressed genes list.
While we like to get an expected result for our experiments, biology does not always work that way. Perhaps the difference is subtle (or variation is too high) and is not supported by number of samples/replicates etc.
I use StringTie... Another option is Cufflinks
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
There are many tools for that such as featureCounts, htseq, bedtools (multicov) etc... Also they are well documented.