Difference between two count table that created from same fastq files?
2
0
Entering edit mode
3.2 years ago
cilgaiscan ▴ 50

Hi everyone.

I am using publicly available RNA-seq data from GEO. I am using Tophat2>featureCounts>DESeq2 pipeline with hg19 genome build. The count table given in geo has 38k reads while the one I generated with featurecounts gives 66k. I do not have any idea what can cause this problem. Or how to fix? Is there someone had or having this problem?

Thank you for helps.

RNA-Seq featureCounts gtf counttable tophat • 1.0k views
0
Entering edit mode

Please add the commands used to map and count the reads, the version and source of the genome and annotation, and the GEO dataset you are comparing to. In addition to Devon's answer bellow, there are some other possibilities:

• you are counting as single-end reads, but the sequencing is paired end
• you have used an annotation from the same source, but a later version, with many new genes added.
0
Entering edit mode

I have pair-end and did pair-end alignment with tophat2 And featureCounts recognizes my bams from paired end data as expected.

I used tophat2 (v.2.1.0), and featureCounts (v.1.5.0) with hg19 (Febr. 2009, NCBI RefSeq).

I used default options.. for tophat2:

tophat2   hg19 (with a full path where my bt2 files are)  -o output.bam pair1.fastq.qz pair2.fastq.qz


for featurecounts:

featureCounts -a hg19.gtf -o lncap_STAR.txt LNCaP_control_rep1.bam \
LNCaP_control_rep2.bam LNCaP_control_rep3.bam LNCaP_DHT_rep1.bam \
LNCaP_DHT_rep2.bam LNCaP_DHT_rep3.bam

1
Entering edit mode

You did not add -p option for featureCounts to count the paired-end reads as fragments.

0
Entering edit mode

actually I runned featurecounts with -p option, too. but results come same. again i have 66k counts.and i checked for dublicates or different naming of same gene but all came negative. they are unique 66k counts.

0
Entering edit mode
3.2 years ago

Never ever use tophat2.

The difference is most likely caused by using a GTF file from a different source. Most likely, you're using either a Gencode or Ensembl annotation and the GEO dataset was created from a UCSC annotation. Neither is wrong, just different due to containing more or fewer genes (I would say that the Ensembl one is better, but I'm kind of an Ensembl evangelist).

0
Entering edit mode

both of our count tables are created by NCBI refseq gtf files. I am sure that I use NCBI Refseq but theirs appears so. but there's 28k difference which is very huge. I checked for XM's and XR'S both count table does not contain any of those.

And firstly, I thought there might be a problem in my alignment and tried STAR aligner too. Both aligners gave same results. That's how I thought it might be an issue due to gtf file. I tried other NCBI gtf files, too, but none of them worked. Since its my first time in RNA-seq, I do not know what to do. I am kind of lost.

0
Entering edit mode

I suspect h.mon is correct and you're just looking at a difference in the date the refseq GTF was retrieved from NCBI. It annoying that NCBI doesn't have proper versioned releases. In general, it's better to use Ensembl for references and GTF files. The files then all have versions attached so it's easier to keep track of everything.

0
Entering edit mode

I suppose, they are using NCBI Refseq UCSC for hg19. Because counts appear as NM_013943 instead of NM_013943.2 So, that leaves being ENSEMBL or GENCODE out I assume. Also it leaves to be NCBI curated, aligned or all, too. Assuming that they are using NCBI Refseq UCSC, I am using it where I gt from UCSC genome browser.

0
Entering edit mode
3.2 years ago
h.mon 33k

To summarize the hypothesis suggested so far:

• you are using a different annotation than the GEO study used. Compare the number and names of the genes between your counts table and theirs. Although this could explain some differences in counts, I don't think it would result in almost twice as many counts.

• you are counting each read separately, instead of as a pair, because you didn't provide the -p flag to featureCounts. Probably this is the main / sole cause of your inflated counts.

0
Entering edit mode
1. I suppose, they are using NCBI Refseq UCSC ann. hg19 as well as I am using it too. Because counts appear as NM_013943 instead of NM_013943.2 So, that leaves being ENSEMBL or GENCODE I assume. Also it leaves to be NCBI curated, aligned or all, too. Assuming that they are using NCBI Refseq UCSC, I am using it too.

2. I used -p flag too. but still i got 66k counts. nothing is changed.

0
Entering edit mode

How about pointing to the exact files you're using.

0
Entering edit mode

I am using bam files which are outputs of tophat2 with paired-end reads. To check that I am okay with alignment, I have bam files gathered by STAR aligner with exact fastq files. And in both alignments my alignment score is around 91-94%. And I gathered fastq files with corresponding SRR numbers with fastq. The GEO accession number is GSE64529.

0
Entering edit mode

Here are the counts from the GEO you provided, I don't see 32k anywhere

LNCaP_10%dsFCS72h_1             72138441
LNCaP_10%dsFCS72h_2             76023029
LNCaP_10%dsFCS72h_3             76620894
3a_LNCaP_plus_DHT_10-7M_6h_1   100842500
3a_LNCaP_plus_DHT_10-7M_6h_2    94271818
3a_LNCaP_plus_DHT_10-7M_6h_3    94898450

0
Entering edit mode

Yes it is listed as supplementary file down the website. I am sending the link where you can download and see Count table

0
Entering edit mode

I have downloaded the count table, loaded it into R, summed up all counts from each sample, and the numbers above are the result. The numbers above are the total counts per sample for the GEO GSE64529 samples.

0
Entering edit mode

I think cilgaiscan means 32k lines/genes/entries, since there are 38k of them.

0
Entering edit mode

I think you are right, but I (and I suspect genomax) thought cilgaiscan meant the number of reads mapping. The following comment made me think he referred to read counts:

count table given in geo has 38k reads

@cilgaiscan, could you clarify if you are talking about number of features (genes / transcripts) in the annotation, or number of reads mapping to genes after featureCounts summarization?