Question

cut off value of fpkm

0

2.3 years ago

mdfardin374 ▴ 10

Hello all, I have GTF files from HISAT2/StringTie and I am interested in identifying novel genes, so I want to filter genes by their FPKM values to reduce the number of false positives. How should I proceed? Thank you.

fpkm RNA-Seq and • 1.0k views

ADD COMMENT • link updated 2.3 years ago by seidel 11k • written 2.3 years ago by mdfardin374 ▴ 10

2

You're going to need a lot more specifics to get anywhere with this question. What organism are you working with? What is your definition of a novel gene (what criteria would you use to distinguish novel from non-novel)? What is your definition of a false positive? What would be your definition of a false negative? Indeed, what is the question you are really asking? Why do you think FPKM is related to novelty? How would you be able to prove that it is or is not?

ADD REPLY • link 2.3 years ago by seidel 11k

0

The organism is cow, and by novel genes I mean genes that are not reported in any database (the GTF annotation file does not contain them). I just want to know what FPKM cutoff value I should use to reduce the number of identified transcripts. What is the FPKM distribution?

ADD REPLY • link 2.3 years ago by mdfardin374 ▴ 10

1

So do you have two sources of GTFs? You mention wanting to find transcripts not "reported in any database" but also files from StringTie. Are these the same or separate things? If you want to stick with FPKM, and presence or absence in a GTF as the criteria, you could try an empirical determination: subtract your GTF from the genome, take the remaining bits, and break them up into 500 bp bins (or maybe something approaching average exon size). Then quantify reads across those bins, which are non-gene areas by definition, since they do not occur in the database GTF. This would give you a view of the "background" distribution. Depending on its shape, you might be able to make a decision for your particular data set. If you have two sets of GTF files (the "database" and the StringTie results), you could compare them to each other and each to its own background.
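As a rough illustration of the "subtract the GTF and bin the rest" step described above, here is a minimal pure-Python sketch for a single chromosome. The function names (`merge_intervals`, `background_bins`) are made up for this example, and it assumes the GTF features have already been converted to 0-based, half-open `(start, end)` intervals (GTF itself is 1-based and inclusive). In practice, interval tools such as bedtools are typically used for this kind of operation on a whole genome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def background_bins(
    chrom_length: int, annotated: List[Interval], bin_size: int = 500
) -> List[Interval]:
    """Tile the un-annotated part of one chromosome into full-size bins.

    Leftover pieces shorter than bin_size are dropped (a simplifying
    assumption of this sketch, not something stated in the thread).
    """
    bins: List[Interval] = []
    cursor = 0  # start of the current un-annotated gap
    # A zero-length sentinel at the chromosome end closes the last gap.
    sentinel = [(chrom_length, chrom_length)]
    for start, end in merge_intervals(annotated) + sentinel:
        pos = cursor
        while pos + bin_size <= start:
            bins.append((pos, pos + bin_size))
            pos += bin_size
        cursor = max(cursor, end)
    return bins


if __name__ == "__main__":
    # Toy chromosome of 3,000 bp with two overlapping annotated features.
    for b in background_bins(3000, [(1000, 1600), (1500, 2000)]):
        print(b)
```

Reads would then be counted per bin with whatever quantifier is already in use; the bins only define where "background" is measured.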

Major caveat: as with any organism, your FPKM values come from cells, and there are many different types of cells that exist in many different states, at different points in time or under different conditions. Distinguishing "background" expression, or false positives, from true positives that result from divergent mixtures of cell types is troublesome and complicated, and requires experiments designed specifically to do so.

Anyway, looking at the expression space outside your GTF in your data set would be one way to operationally define a cutoff.
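One possible way to turn such a background into an operational cutoff is sketched below: convert per-bin fragment counts to FPKM and take an upper percentile of that distribution. The 95th percentile, the function names, and the toy numbers are all assumptions made for illustration; the thread does not prescribe a specific percentile, and the right choice depends on the shape of your own background distribution.

```python
import math
from typing import List


def fpkm(fragment_count: int, length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of feature per million mapped fragments."""
    return fragment_count * 1e9 / (length_bp * total_mapped_fragments)


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def background_cutoff(
    bin_counts: List[int],
    bin_size: int,
    total_mapped_fragments: int,
    q: float = 95.0,  # illustrative choice, not a recommendation from the thread
) -> float:
    """FPKM value below which q percent of the background bins fall."""
    background = [fpkm(c, bin_size, total_mapped_fragments) for c in bin_counts]
    return percentile(background, q)


if __name__ == "__main__":
    # Toy data: 100 intergenic 500 bp bins, most of them empty.
    counts = [0] * 90 + [1] * 5 + [10] * 5
    cutoff = background_cutoff(counts, bin_size=500, total_mapped_fragments=20_000_000)
    print(f"background-derived FPKM cutoff: {cutoff:.3f}")
```

Candidate novel transcripts whose FPKM does not exceed this value would then be treated as indistinguishable from background, subject to the cell-mixture caveat above.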

ADD REPLY • link 2.3 years ago by seidel 11k