In short, my question is:
1. What is the running time of building an FM-index? Is it linear in the length of the reference genome?

I was told that the FM-index can be built in linear time, i.e. O(n), where n is the length of the reference genome.
But I actually cannot find a paper describing this.

The most commonly cited paper, "Indexing Compressed Text" by Ferragina et al. (2004), focuses on the analysis of memory footprint and query time, not on the running time of building the FM-index.
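As an aside on the query side that those papers do analyze: FM-index counting queries use backward search over the BWT. A toy Python sketch, using naive scans in place of the rank/occ structures a real FM-index precomputes (the example string is illustrative):

```python
def backward_search(bwt, pattern):
    # Count occurrences of pattern using FM-index backward search.
    # C[ch] = number of characters in bwt lexicographically smaller than ch.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt)  # current range of matching rows in the sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        # occ(ch, i) computed by a naive scan here; a real index uses
        # precomputed rank structures to answer this in O(1).
        lo = C[ch] + bwt[:lo].count(ch)
        hi = C[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

# "GC$AAAC" is the BWT of "ACAACG$" ('$' = smallest sentinel); "AC" occurs twice.
print(backward_search("GC$AAAC", "AC"))  # 2
```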

It's mentioned in the intro of Simpson and Durbin 2010. I haven't looked in detail, but presumably they link to something with a bit more analysis on the time complexity.

In 2003, two (or maybe three) conference papers first showed that the suffix array can be constructed in linear time. Since generating the BWT from the suffix array also takes linear time, this implies that the FM-index can be constructed in linear time. One of the most influential linear-time algorithms, SA-IS, was invented by Nong et al. (2008), who provided a ~100-line implementation in C that is much simpler and faster than previous approaches. Yuta Mori optimized Nong et al.'s implementation into the sais library. A few years ago (when I was still following the literature), that library was the fastest linear-time implementation in practical benchmarks. However, at least at that time, libdivsufsort, an O(n log n) algorithm also developed by Yuta Mori, was widely believed to be the fastest open-source library for constructing suffix arrays, faster than sais. Linear-time algorithms are not necessarily faster in practice: if a linear algorithm has a large constant factor and many cache misses, an O(n log n) algorithm can win, just as Rob said.
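To make the "BWT from suffix array in linear time" step concrete, here is a toy Python sketch; the naive suffix-array builder stands in for SA-IS and is not itself linear-time:

```python
def suffix_array(s):
    # Naive O(n^2 log n) builder for illustration only; SA-IS does this in O(n).
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s, sa):
    # Given the suffix array, the BWT is a single linear pass:
    # the BWT character of row k is the character preceding suffix sa[k].
    return ''.join(s[i - 1] for i in sa)  # s[-1] wraps around to the sentinel

text = "ACAACG$"              # '$' is the lexicographically smallest sentinel
sa = suffix_array(text)       # [6, 2, 0, 3, 1, 4, 5]
bwt = bwt_from_sa(text, sa)   # "GC$AAAC"
```

The second pass is clearly O(n), so the overall bound is set by the suffix-array construction, which is why the linear-time SA papers settle the FM-index question.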

The running-time analysis actually matters if we want to build the FM-index iteratively (e.g. build it 100 times, integrating genetic variants in each iteration) or dynamically.

Yes, there are a few linear-time construction algorithms. See, e.g., http://www.sciencedirect.com/science/article/pii/S030439750700477X and the references therein. The bound is often written with a log σ factor, which goes away for constant-sized alphabets like DNA. One practical note: the worst-case linear-time algorithms are asymptotically optimal, but not always the fastest in practice on real data.
