Question: Difference between two count table that created from same fastq files?
0
gravatar for cilgaiscan
12 months ago by
cilgaiscan50
Turkey
cilgaiscan50 wrote:

Hi everyone.

I am using publicly available RNA-seq data from GEO. I am using Tophat2>featureCounts>DESeq2 pipeline with hg19 genome build. The count table given in geo has 38k reads while the one I generated with featurecounts gives 66k. I do not have any idea what can cause this problem. Or how to fix? Is there someone had or having this problem?

Thank you for helps.

ADD COMMENTlink modified 12 months ago by h.mon27k • written 12 months ago by cilgaiscan50

Please add the commands used to map and count the reads, the version and source of the genome and annotation, and the GEO dataset you are comparing to. In addition to Devon's answer bellow, there are some other possibilities:

  • you are counting as single-end reads, but the sequencing is paired end
  • you have used an annotation from the same source, but a later version, with many new genes added.
ADD REPLYlink written 12 months ago by h.mon27k

I have pair-end and did pair-end alignment with tophat2 And featureCounts recognizes my bams from paired end data as expected.

I used tophat2 (v.2.1.0), and featureCounts (v.1.5.0) with hg19 (Febr. 2009, NCBI RefSeq).

I used default options.. for tophat2:

tophat2   hg19 (with a full path where my bt2 files are)  -o output.bam pair1.fastq.qz pair2.fastq.qz

for featurecounts:

featureCounts -a hg19.gtf -o lncap_STAR.txt LNCaP_control_rep1.bam \
            LNCaP_control_rep2.bam LNCaP_control_rep3.bam LNCaP_DHT_rep1.bam \
            LNCaP_DHT_rep2.bam LNCaP_DHT_rep3.bam
ADD REPLYlink modified 12 months ago by RamRS23k • written 12 months ago by cilgaiscan50
1

You did not add -p option for featureCounts to count the paired-end reads as fragments.

ADD REPLYlink modified 12 months ago • written 12 months ago by genomax70k

actually I runned featurecounts with -p option, too. but results come same. again i have 66k counts.and i checked for dublicates or different naming of same gene but all came negative. they are unique 66k counts.

ADD REPLYlink written 12 months ago by cilgaiscan50
0
gravatar for Devon Ryan
12 months ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

Never ever use tophat2.

The difference is most likely caused by using a GTF file from a different source. Most likely, you're using either a Gencode or Ensembl annotation and the GEO dataset was created from a UCSC annotation. Neither is wrong, just different due to containing more or fewer genes (I would say that the Ensembl one is better, but I'm kind of an Ensembl evangelist).

ADD COMMENTlink written 12 months ago by Devon Ryan91k

both of our count tables are created by NCBI refseq gtf files. I am sure that I use NCBI Refseq but theirs appears so. but there's 28k difference which is very huge. I checked for XM's and XR'S both count table does not contain any of those.

And firstly, I thought there might be a problem in my alignment and tried STAR aligner too. Both aligners gave same results. That's how I thought it might be an issue due to gtf file. I tried other NCBI gtf files, too, but none of them worked. Since its my first time in RNA-seq, I do not know what to do. I am kind of lost.

ADD REPLYlink written 12 months ago by cilgaiscan50

I suspect h.mon is correct and you're just looking at a difference in the date the refseq GTF was retrieved from NCBI. It annoying that NCBI doesn't have proper versioned releases. In general, it's better to use Ensembl for references and GTF files. The files then all have versions attached so it's easier to keep track of everything.

ADD REPLYlink written 12 months ago by Devon Ryan91k

I suppose, they are using NCBI Refseq UCSC for hg19. Because counts appear as NM_013943 instead of NM_013943.2 So, that leaves being ENSEMBL or GENCODE out I assume. Also it leaves to be NCBI curated, aligned or all, too. Assuming that they are using NCBI Refseq UCSC, I am using it where I gt from UCSC genome browser.

ADD REPLYlink written 12 months ago by cilgaiscan50
0
gravatar for h.mon
12 months ago by
h.mon27k
Brazil
h.mon27k wrote:

To summarize the hypothesis suggested so far:

  • you are using a different annotation than the GEO study used. Compare the number and names of the genes between your counts table and theirs. Although this could explain some differences in counts, I don't think it would result in almost twice as many counts.

  • you are counting each read separately, instead of as a pair, because you didn't provide the -p flag to featureCounts. Probably this is the main / sole cause of your inflated counts.

ADD COMMENTlink written 12 months ago by h.mon27k
  1. I suppose, they are using NCBI Refseq UCSC ann. hg19 as well as I am using it too. Because counts appear as NM_013943 instead of NM_013943.2 So, that leaves being ENSEMBL or GENCODE I assume. Also it leaves to be NCBI curated, aligned or all, too. Assuming that they are using NCBI Refseq UCSC, I am using it too.

  2. I used -p flag too. but still i got 66k counts. nothing is changed.

ADD REPLYlink written 12 months ago by cilgaiscan50

How about pointing to the exact files you're using.

ADD REPLYlink written 12 months ago by Devon Ryan91k

I am using bam files which are outputs of tophat2 with paired-end reads. To check that I am okay with alignment, I have bam files gathered by STAR aligner with exact fastq files. And in both alignments my alignment score is around 91-94%. And I gathered fastq files with corresponding SRR numbers with fastq. The GEO accession number is GSE64529.

ADD REPLYlink modified 12 months ago • written 12 months ago by cilgaiscan50

Here are the counts from the GEO you provided, I don't see 32k anywhere

LNCaP_10%dsFCS72h_1             72138441
LNCaP_10%dsFCS72h_2             76023029
LNCaP_10%dsFCS72h_3             76620894
3a_LNCaP_plus_DHT_10-7M_6h_1   100842500
3a_LNCaP_plus_DHT_10-7M_6h_2    94271818
3a_LNCaP_plus_DHT_10-7M_6h_3    94898450
ADD REPLYlink modified 12 months ago • written 12 months ago by h.mon27k

Yes it is listed as supplementary file down the website. I am sending the link where you can download and see Count table

ADD REPLYlink written 12 months ago by cilgaiscan50

I have downloaded the count table, loaded it into R, summed up all counts from each sample, and the numbers above are the result. The numbers above are the total counts per sample for the GEO GSE64529 samples.

ADD REPLYlink written 12 months ago by h.mon27k

I think cilgaiscan means 32k lines/genes/entries, since there are 38k of them.

ADD REPLYlink modified 12 months ago • written 12 months ago by Devon Ryan91k

I think you are right, but I (and I suspect genomax) thought cilgaiscan meant the number of reads mapping. The following comment made me think he referred to read counts:

count table given in geo has 38k reads

@cilgaiscan, could you clarify if you are talking about number of features (genes / transcripts) in the annotation, or number of reads mapping to genes after featureCounts summarization?

ADD REPLYlink modified 12 months ago • written 12 months ago by h.mon27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 777 users visited in the last hour