Question: unable to count reads that are aligned to coding region
0
gravatar for shivyasoni1994
18 months ago by
shivyasoni19940 wrote:

I want to do differential-gene-expression analysis. I took the RNA-Seq (paired-end)data from BioProject(NCBI), and aligned the fastq files against coding region(cds) from ensemble, using tophat2, command=> tophat Homo_sapiens r1.fastq r2.fastq

the resultant BAM file looks like this =>

SRR7523525.12972962.1   339 ENST00000005286.8   1395    1   101M    =1379   -117    CAAGGAGAGCCGGGGCGCCCGGGGGGTGCGAGTGGACTTCTGGTGGCGCCGGCTCCGCGCCTCGCTGCGGCTGACCGTGTGGGCCCCGCTGCTACCGCTGC   @GDHHGHDHHCHDDHHDCDCC?HDHIIIHHHCIIHFHHHHFIHHD<CGHGHHFCDE<HDHHDDIIIIHGHHHIHIIIIIIHHDDGHHFG@GIIHGIDACDD   AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:87C13  YT:Z:UU NH:i:3  CC:Z:ENST00000453848.6  CP:i:1392   HI:i:0
SRR752525.24445544.1    83  ENST00000005286.8   1395    3   101M    =1319   -177    CAAGGAGAGCCGGGGCGCCCGGGGGGTGCGAGTGGACTTCTGGTGGCGCCGGCTCCGCGCCTCGCTGCGGCTGACCGTGTGGGCCCCGCTGCTACCGCCGC   HIIIIIIIIIIIIIIHHECIIIIIHIGIIIIHIIIIIIIIIIIIIHHIIIIHHDIHIIIIIIIIHIIHHIIHIIHIIIIIIIIIIIIIIIIIIIIIDDDDD   AS:i:-11    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:87C10T2    YT:Z:UU NH:i:2  CC:Z:ENST00000453848.6  CP:i:1392   HI:i:0

Now, i am unable to count the aligned-reads with HTSeq or featureCounts, as the BAM file contains only transcript_id ,,,,and the softwares count reads for gene_id or exon. I also set the -t option in HTSeq to transcript_id ,,and i also tried featureCounts with -g Parent parameter, but unable to count the reads. Please suggest me any possible solution , so that i can count my reads, and analyze that result for differential gene expression(in DESeq or edgeR).

thanks in advance for your kind support

rna-seq • 749 views
ADD COMMENTlink modified 18 months ago • written 18 months ago by shivyasoni19940
4

This is a sub-optimal way to count your transcripts, because you will have plenty of reads that are mapped to more than one transcript and you can either count them all or filter them out. Other tools like RSEM try to estimate the proportion of reads that are originating from each transcript. Better yet, you can use newer tools like salmon.

If your aim is to do a standard gene-level differential expression analysis instead, you can map your reads to the genome and count them with a GTF file (for example from ensembl).

ADD REPLYlink modified 18 months ago • written 18 months ago by Martombo2.6k
1

Look at Figure 1 from Why you should use alignment-independent quantification to see Martombo point.

ADD REPLYlink written 18 months ago by h.mon29k

aligned the fastq files against coding region(cds) from ensemble

You should align to the reference genome ideally, or use tools specifically for transcript pseudo-alignment such as Salmon.

ADD REPLYlink written 18 months ago by WouterDeCoster43k

Thankyou for suggestions. I counted the reads using Salmon , in non-alignment-based mode , using transcripts (cdna from ensemble} as index.

As the output from Salmon,that is= quant.sf, contains counts for whole transcripts (non-translated rna +translated rna).

Can ,counts for other RNAs be filtered out from quant.sf , so that ,,, only counts for mRNA is left in quant.sf(output).If this is possible, kindly suggest me the process.

Please correct me, if I am wrong. Thanks for your support.

ADD REPLYlink written 18 months ago by shivyasoni19940
2

You can use biomRt to get "gene_biotype" and filter quant.sf accordingly.

ADD REPLYlink written 18 months ago by h.mon29k

Thanks for the suggestion!!

ADD REPLYlink written 18 months ago by shivyasoni19940
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1488 users visited in the last hour