Question

Count mapped reads to genes reference

1

Entering edit mode

4.9 years ago

RiccPicc ▴ 10

Hello everyone,

I want to explore differential expression of a relatively small set of genes (~30). I used this set as reference to map reads.

I used Bowtie2 to build an index of these genes sequences, then I mapped transcriptomes reads to this reference, and I got the sam files. In the sam file, for each mapped reads, there is the gene name to which the read mapped as RNAME (reference sequence name).

Now, I want to obtain the matrix to see if there are differentially expressed genes. Looking to HTSeq, I noticed he wants gff or gtf file to build this matrix, but since I don't have the whole genome as reference, I cannot use it.

Do you have some suggestion on how to build the matrix? Should I use samtools and build a bed file to get the coverage for each gene?

samtools rna-seq bowtie2 • 3.6k views

ADD COMMENT • link updated 4.9 years ago by i.sudbery 19k • written 4.9 years ago by RiccPicc ▴ 10

1

Entering edit mode

You can make up a simple annotation format (SAF) file with the gene names and their lengths and use it with featureCounts. See help page here.

ADD REPLY • link 4.9 years ago by GenoMax 142k

5

Entering edit mode

4.9 years ago

i.sudbery 19k

You can do this by first converting your sam file to a BAM file:

samtools view -bh  samfile.sam > bamfile.bam

Now index the file:

samtools index bamfile.bam

And finall get the per reference counts:

samtools idxstats bamfile.bam

The first column will be the reference, and the third will contain the count of reads mapped to it.

One thing to be careful of though: this method will count multi-mapping reads twice. The chance of a read multi-mapping might be very high if you have included multiple transcripts for each gene in your reference.

The way to do this is to use a proper transcript quantification tool, such as RSEM or salmon, which can opperate on transcriptome alignments.

Be aware that a lot of the stats involved in differentially expression analysis rely on having many genes present in the dataset to do both library normalisation and variance estimation. Neither DESeq, nor edgeR, nor limma are likely to work well with only 30 genes.

ADD COMMENT • link 4.9 years ago by i.sudbery 19k

0

Entering edit mode

Many thanks for your answer and your warnings. I will proceed using genome instead of the small set.

ADD REPLY • link 4.9 years ago by RiccPicc ▴ 10

0

Entering edit mode

Thanks for this helpful answer. A couple of comments on my implementing of your recipie: 1) I needed to sort the bam file before indexing it; 2) the last command produces standard output, i.e., one needs to add > counts.txt.

ADD REPLY • link 3.4 years ago by vpuller • 0

score 8 · Accepted Answer · 2019-06-27

There are a couple of problems with this appoach:

bowtie2 is not a splice-aware aligner so reads spanning exon/intron junctions will likely go unmapped.
Alignment against a subset of the total genome or transcriptome is problematic as the aligner will always try to find best matches for every read. If the true origin is a gene that is not included, you will get false-positives (maybe even a lot of them).
Most differential analysis frameworks rely on normalization based on a large number of genes not changing between conditions, e.g. TMM from edgeR. With only 30 genes you will have a hard time doing proper normalization. Don't use RPKM or FPKM as they do not correct for library composition changes, e.g. where a few genes with high counts in one condition "dominate" the sequencing run (=take away reads from other genes leading to fewer counts even though these are not differentially expressed).
How do you identify differentially-expressed genes (DEG)? I am pretty sure established tools do not work on such small subsets. I hope you do not plan custom statistics such as pairwise rank sum tests or t-tests?

I suggest you align against the entire genome with a splice-aware aligner such as hisat2 or star or use a lightweight quantifier such as salmon or kallisto. Then use established DEG tools such as DESeq2, edgeR or limma to get all DE genes and check how many of ones you are interested in are among them.