Question: Count mapped reads to genes reference
0
gravatar for RPuni
8 weeks ago by
RPuni0
Italy
RPuni0 wrote:

Hello everyone,

I want to explore differential expression of a relatively small set of genes (~30). I used this set as reference to map reads.

I used Bowtie2 to build an index of these genes sequences, then I mapped transcriptomes reads to this reference, and I got the sam files. In the sam file, for each mapped reads, there is the gene name to which the read mapped as RNAME (reference sequence name).

Now, I want to obtain the matrix to see if there are differentially expressed genes. Looking to HTSeq, I noticed he wants gff or gtf file to build this matrix, but since I don't have the whole genome as reference, I cannot use it.

Do you have some suggestion on how to build the matrix? Should I use samtools and build a bed file to get the coverage for each gene?

rna-seq samtools bowtie2 • 163 views
ADD COMMENTlink modified 8 weeks ago by i.sudbery5.2k • written 8 weeks ago by RPuni0
1

You can make up a simple annotation format (SAF) file with the gene names and their lengths and use it with featureCounts. See help page here.

ADD REPLYlink written 8 weeks ago by genomax70k
5
gravatar for ATpoint
8 weeks ago by
ATpoint21k
Germany
ATpoint21k wrote:

There are a couple of problems with this appoach:

  1. bowtie2 is not a splice-aware aligner so reads spanning exon/intron junctions will likely go unmapped.
  2. Alignment against a subset of the total genome or transcriptome is problematic as the aligner will always try to find best matches for every read. If the true origin is a gene that is not included, you will get false-positives (maybe even a lot of them).
  3. Most differential analysis frameworks rely on normalization based on a large number of genes not changing between conditions, e.g. TMM from edgeR. With only 30 genes you will have a hard time doing proper normalization. Don't use RPKM or FPKM as they do not correct for library composition changes, e.g. where a few genes with high counts in one condition "dominate" the sequencing run (=take away reads from other genes leading to fewer counts even though these are not differentially expressed).
  4. How do you identify differentially-expressed genes (DEG)? I am pretty sure established tools do not work on such small subsets. I hope you do not plan custom statistics such as pairwise rank sum tests or t-tests?

I suggest you align against the entire genome with a splice-aware aligner such as hisat2 or star or use a lightweight quantifier such as salmon or kallisto. Then use established DEG tools such as DESeq2, edgeR or limma to get all DE genes and check how many of ones you are interested in are among them.

ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by ATpoint21k

Hi, many thanks for answering and clarifying the problems of this approach. I accept your suggestion and I will use the whole genome for mapping and one of the softwares, probably STAR, you suggested, since I didn't consider points 3 and 4 while planning.

ADD REPLYlink written 8 weeks ago by RPuni0
2
gravatar for i.sudbery
8 weeks ago by
i.sudbery5.2k
Sheffield, UK
i.sudbery5.2k wrote:

You can do this by first converting your sam file to a BAM file:

samtools view -bh  samfile.sam > bamfile.bam

Now index the file:

samtools index bamfile.bam

And finall get the per reference counts:

samtools idxstats bamfile.bam

The first column will be the reference, and the third will contain the count of reads mapped to it.

One thing to be careful of though: this method will count multi-mapping reads twice. The chance of a read multi-mapping might be very high if you have included multiple transcripts for each gene in your reference.

The way to do this is to use a proper transcript quantification tool, such as RSEM or salmon, which can opperate on transcriptome alignments.

Be aware that a lot of the stats involved in differentially expression analysis rely on having many genes present in the dataset to do both library normalisation and variance estimation. Neither DESeq, nor edgeR, nor limma are likely to work well with only 30 genes.

ADD COMMENTlink written 8 weeks ago by i.sudbery5.2k

Many thanks for your answer and your warnings. I will proceed using genome instead of the small set.

ADD REPLYlink written 8 weeks ago by RPuni0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 825 users visited in the last hour