Question: Count mapped reads to genes reference
gravatar for RiccPicc
16 months ago by
RiccPicc0 wrote:

Hello everyone,

I want to explore differential expression of a relatively small set of genes (~30). I used this set as reference to map reads.

I used Bowtie2 to build an index of these genes sequences, then I mapped transcriptomes reads to this reference, and I got the sam files. In the sam file, for each mapped reads, there is the gene name to which the read mapped as RNAME (reference sequence name).

Now, I want to obtain the matrix to see if there are differentially expressed genes. Looking to HTSeq, I noticed he wants gff or gtf file to build this matrix, but since I don't have the whole genome as reference, I cannot use it.

Do you have some suggestion on how to build the matrix? Should I use samtools and build a bed file to get the coverage for each gene?

rna-seq samtools bowtie2 • 639 views
ADD COMMENTlink modified 16 months ago by i.sudbery9.4k • written 16 months ago by RiccPicc0

You can make up a simple annotation format (SAF) file with the gene names and their lengths and use it with featureCounts. See help page here.

ADD REPLYlink written 16 months ago by genomax91k
gravatar for ATpoint
16 months ago by
ATpoint40k wrote:

There are a couple of problems with this appoach:

  1. bowtie2 is not a splice-aware aligner so reads spanning exon/intron junctions will likely go unmapped.
  2. Alignment against a subset of the total genome or transcriptome is problematic as the aligner will always try to find best matches for every read. If the true origin is a gene that is not included, you will get false-positives (maybe even a lot of them).
  3. Most differential analysis frameworks rely on normalization based on a large number of genes not changing between conditions, e.g. TMM from edgeR. With only 30 genes you will have a hard time doing proper normalization. Don't use RPKM or FPKM as they do not correct for library composition changes, e.g. where a few genes with high counts in one condition "dominate" the sequencing run (=take away reads from other genes leading to fewer counts even though these are not differentially expressed).
  4. How do you identify differentially-expressed genes (DEG)? I am pretty sure established tools do not work on such small subsets. I hope you do not plan custom statistics such as pairwise rank sum tests or t-tests?

I suggest you align against the entire genome with a splice-aware aligner such as hisat2 or star or use a lightweight quantifier such as salmon or kallisto. Then use established DEG tools such as DESeq2, edgeR or limma to get all DE genes and check how many of ones you are interested in are among them.

ADD COMMENTlink modified 16 months ago • written 16 months ago by ATpoint40k

Hi, many thanks for answering and clarifying the problems of this approach. I accept your suggestion and I will use the whole genome for mapping and one of the softwares, probably STAR, you suggested, since I didn't consider points 3 and 4 while planning.

ADD REPLYlink written 16 months ago by RiccPicc0
gravatar for i.sudbery
16 months ago by
Sheffield, UK
i.sudbery9.4k wrote:

You can do this by first converting your sam file to a BAM file:

samtools view -bh  samfile.sam > bamfile.bam

Now index the file:

samtools index bamfile.bam

And finall get the per reference counts:

samtools idxstats bamfile.bam

The first column will be the reference, and the third will contain the count of reads mapped to it.

One thing to be careful of though: this method will count multi-mapping reads twice. The chance of a read multi-mapping might be very high if you have included multiple transcripts for each gene in your reference.

The way to do this is to use a proper transcript quantification tool, such as RSEM or salmon, which can opperate on transcriptome alignments.

Be aware that a lot of the stats involved in differentially expression analysis rely on having many genes present in the dataset to do both library normalisation and variance estimation. Neither DESeq, nor edgeR, nor limma are likely to work well with only 30 genes.

ADD COMMENTlink written 16 months ago by i.sudbery9.4k

Many thanks for your answer and your warnings. I will proceed using genome instead of the small set.

ADD REPLYlink written 16 months ago by RiccPicc0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 967 users visited in the last hour