Methodology FAQ


What is Triple Power Law?  What does it mean to parameterize the distribution?

What do the parameters represent?

How does Radar Logic determine the parameters?

Once you have the distribution, how does Radar Logic actually get the daily price?

Why doesn’t Radar Logic just use the data median?

Why not fit the data more exactly with a non-parametric approach?

How does Radar Logic adjust for the variations in individual properties?

How does the publication timeline work?  How are the 1-, 7-, and 28-day prices calculated?

By using a one-year average to fit the shape, isn’t Radar Logic introducing a lag on how fast the shape changes over time?

Why use median for subindices (zip code, property type, etc.) on the analytics site?

If a datapoint falls outside the distribution, does that mean it does not count in calculating the index?

Wouldn't the distribution over the year be influenced if there had been a sharp increase in house prices throughout the year, resulting in an increase in the "quality" of properties? Would a very high quality house that appeared in the upper tail during the first half of a year move down towards the middle by the end of the year?

How do you come up with a price on a day with low transaction volume? 

What is Triple Power Law?  What does it mean to parameterize the distribution?

In the context of the RPX methodology, parameterization is the process of fitting a theoretical distribution to empirical data.  For the RPX, we refer to this theoretical distribution as “Triple Power Law.”  The goal is to incorporate as much information as is available about the universe of data in order to help control for the idiosyncrasies of a particular sample.  In essence, we do not calculate the daily price based solely on the raw data in a short period of time (a day, a week, or a month).  We use an understanding of the dynamics derived from looking at the data over a longer period of time to give us a truer interpretation of the data in the shorter time period than simple statistics (e.g., average or median) allow.  This enables us to combine the timeliness of using data from the recent past with the robustness of using a longer time window.

For example, if you are trying to find the median or average height of male New Yorkers, you might stand on Seventh Avenue measuring the first 100 men to walk past.  If in the course of that time the New York Knicks basketball team passes by, simply taking a median or average of the raw heights in your sample might lead you to conclude that the height of the total population is higher than the true measure.  If you know, however, that height tends to be normally distributed (i.e., conforms to a “bell curve”), you can overlay a theoretical normal distribution on top of your sample distribution.  Assuming the rest of your sample was sufficiently random, it will likely line up closely, except that your sample distribution will have a cluster of high values on the right tail.  You would likely conclude that the appropriate height is the central value on your theoretical distribution.  This is not discounting the Knicks, just acknowledging that they would probably be balanced out by short people in a larger sample.   

The distribution of prices per square foot paid for residential real estate in a given period of time does not conform to a normal or other common distribution, but it has a consistent shape that can be characterized and used to shed light on the true value of the residential property market on a given day.  We call this distribution Triple Power Law, because the general shape is defined by three lines that appear straight on a logarithmic scale (a feature that characterizes mathematical expressions known as “power laws”).  Knowing the generalized form of the distribution, we can then determine the specific shape of the distribution and its position on the x-axis based on the actual data.   

In the height example above, in order to overlay a normal distribution you had to draw a conclusion about the appropriate height and width for the distribution.  In the absence of any other information, you would have to glean this from your 100-person sample.  You could get an even better answer if you took a sample of 100 men on each of 100 days in a row to get an idea of the true standard deviation, which determines the shape of the distribution.  Since height is very stable over time, this probably would not give you any more insight than measuring 10,000 men in one day; in real estate, however, dynamics can and do change, so a time series component is necessary.

Just as a normal distribution can be tall and narrow or short and wide, the shape of the Triple Power Law distribution can vary across time and geographies but can be established by looking at a good-sized dataset.  While not as stable as the standard deviation of the height of a population, the shape of Triple Power Law changes fairly slowly.  In concrete terms, we characterize this shape as the distribution of quality in the housing stock of an area (i.e., where properties fall relative to each other on the price spectrum, regardless of the absolute price level).  We can therefore use a relatively long time horizon to establish the shape.  The actual price level can change much more rapidly, which we account for by sliding the shape along the ppsf axis until it best fits data from recent transactions (either 1, 7, or 28 days' worth of transactions; see the question on the publication timeline below).


  

What do the parameters represent?

The parameters describe the shape and position of the Triple Power Law distribution.  There are six of them–five for the shape and one for the position–although one of the shape parameters is simply derived from the other four.  There are three distinct regions of the distribution: low, middle, and high (which are established by the process of fitting the distribution rather than an a priori notion on our part of which properties are supposed to fall into which regions). 

In the figure below, the x-axis is the natural logarithm of the price per square foot, and the y-axis is the natural logarithm of the frequency.   

For a set of shape parameters, the entire Triple Power Law structure is moved to the position on the x-axis where it best fits the data for a given day or set of days, and the process is iterated until both the optimal shape and position are established.  Given a shape, the position can be defined by any single point on the Triple Power Law distribution. We use the convention whereby we identify the position with the price per square foot of the first turning point, parameter b.  We sometimes refer to b as the mode, since the first turning point is typically the point of greatest frequency; however, occasionally that distinction belongs to the second turning point (i.e., βM is positive and hc is greater than 1). 
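As an illustration only, the three-region shape can be sketched as a piecewise-linear function on log-log axes. The parameter names below (`width`, `slope_low`, `slope_mid`, `slope_high`) are hypothetical and are not the actual RPX parameters; only `b`, the price per square foot at the first turning point, follows the convention described above. The sketch also assumes the curve is continuous at both turning points.

```python
import math

def tpl_log_density(ppsf, b, width, slope_low, slope_mid, slope_high):
    """Unnormalized log-frequency of a three-segment power law.

    Hypothetical parameterization (not taken from the RPX specification):
    b is the ppsf of the first turning point, width > 1 is the ratio of the
    second turning point to the first, and the slopes are the gradients of
    the three straight segments on log-log axes.
    """
    x = math.log(ppsf)
    x1 = math.log(b)             # first turning point (the position)
    x2 = x1 + math.log(width)    # second turning point
    if x <= x1:
        return slope_low * (x - x1)
    if x <= x2:
        return slope_mid * (x - x1)
    # Continue from the height reached at the second turning point.
    return slope_mid * (x2 - x1) + slope_high * (x - x2)

# Made-up shape: rising low region, gently falling middle, steep upper tail.
shape = dict(width=2.0, slope_low=3.0, slope_mid=-0.5, slope_high=-4.0)
for ppsf in (100.0, 200.0, 400.0, 800.0):
    print(ppsf, round(tpl_log_density(ppsf, b=200.0, **shape), 3))
```

Changing `b` alone slides the whole curve along the ppsf axis without altering its shape, which is the sense in which one parameter carries the position.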

[Figure: Triple Power Law chart]

Here’s a typical example of what the distribution looks like when fit to actual data: 

[Figure: Distribution chart]


How does Radar Logic determine the parameters?

We use a fitting algorithm that tests a large number of possible values for each parameter until it finds the combination that best fits the data.  Specifically, the algorithm fixes a set of shape parameters, starting with the parameters from the previous day as a first guess.  For each day in the previous year, the algorithm then varies the position parameter and computes the log likelihood function (a standard statistical calculation that conveys how likely it is that a specific dataset could have been generated by a given distribution).  It adds up the 365 individual log likelihood functions to get the cumulative log likelihood.  It then tweaks the shape parameters and repeats the process.  It continues to loop through these steps looking for better fits until its adjustments stop making a discernible improvement, at which point it locks down the optimized parameters for the day.
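The loop described above can be sketched as follows. This is a toy version under stated assumptions, not the production algorithm: the shape is a hypothetical tuple `(s_low, s_mid, s_high, w)` of three log-log slopes and the log-distance between turning points, frequency is treated as a density over log ppsf, the position search is a simple grid, and the shape search is a basic coordinate-wise hill climb.

```python
import math

def log_shape(u, s_low, s_mid, s_high, w):
    """Piecewise-linear log-density at offset u from the first turning point."""
    if u <= 0:
        return s_low * u
    if u <= w:
        return s_mid * u
    return s_mid * w + s_high * (u - w)

def log_norm(s_low, s_mid, s_high, w):
    """Log of the area under exp(log_shape); requires s_low > 0 > s_high."""
    mid = w if s_mid == 0 else (math.exp(s_mid * w) - 1) / s_mid
    return math.log(1 / s_low + mid + math.exp(s_mid * w) / -s_high)

def day_loglik(log_ppsf, pos, shape):
    """Log likelihood of one day's transactions for a given position."""
    z = log_norm(*shape)
    return sum(log_shape(x - pos, *shape) - z for x in log_ppsf)

def best_position(log_ppsf, shape, grid):
    """Vary the position over a grid and keep the best-fitting value."""
    return max(grid, key=lambda pos: day_loglik(log_ppsf, pos, shape))

def cumulative_loglik(days, shape, grid):
    """Sum of each day's log likelihood at that day's best position."""
    return sum(day_loglik(d, best_position(d, shape, grid), shape) for d in days)

def fit_shape(days, start, grid, step=0.25, tol=1e-6, max_iter=50):
    """Tweak one shape parameter at a time until no tweak helps (illustrative)."""
    shape, best = list(start), cumulative_loglik(days, start, grid)
    for _ in range(max_iter):
        improved = False
        for i in range(len(shape)):
            for delta in (step, -step):
                trial = list(shape)
                trial[i] += delta
                if not (trial[0] > 0 > trial[2] and trial[3] > 0):
                    continue  # skip shapes that cannot be normalized
                score = cumulative_loglik(days, trial, grid)
                if score > best + tol:
                    shape, best, improved = trial, score, True
        if not improved:
            break
    return tuple(shape), best

# Two made-up "days" of log(ppsf) values stand in for a year of data.
days = [[5.0, 5.2, 5.3, 5.5, 5.9], [5.1, 5.3, 5.4, 5.6, 6.0]]
grid = [4.5 + 0.05 * k for k in range(41)]
start = (2.0, -0.5, -3.0, 0.5)   # plays the role of yesterday's parameters
shape, score = fit_shape(days, start, grid)
print(shape, round(score, 3))
```

The `start` argument mirrors the use of the previous day's parameters as a first guess, and `tol` mirrors the stopping rule: the search ends once no single tweak improves the cumulative log likelihood by more than a negligible amount.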


 

Once you have the distribution, how does Radar Logic actually get the daily price?

We take the median of the distribution–that is, the price per square foot at which half the area under the distribution curve is below and half above.
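A minimal numerical sketch of "half the area below, half above", assuming the fitted curve is available as a log-density over log ppsf. The function name `median_log_ppsf` and the two-slope example curve are hypothetical and chosen only because the answer can be checked by hand.

```python
import math

def median_log_ppsf(log_density, lo, hi, n=20000):
    """Point that splits the area under exp(log_density) in half (midpoint rule)."""
    dx = (hi - lo) / n
    mids = [lo + (i + 0.5) * dx for i in range(n)]
    mass = [math.exp(log_density(x)) * dx for x in mids]
    half, running = sum(mass) / 2, 0.0
    for x, m in zip(mids, mass):
        if running + m >= half:
            # Interpolate inside the cell that crosses the halfway mark.
            return x - dx / 2 + dx * (half - running) / m
        running += m
    return hi

# Made-up curve: slope +3 below a turning point at 200 ppsf, slope -2 above it.
turn = math.log(200.0)

def two_slope(x):
    return 3.0 * (x - turn) if x <= turn else -2.0 * (x - turn)

median_ppsf = math.exp(median_log_ppsf(two_slope, turn - 15.0, turn + 15.0))
print(round(median_ppsf, 2))
```

In this example the area to the left of the turning point is 1/3 and the area to the right is 1/2, so the median sits above the turning point, at 200 times the square root of 1.2. The median and the mode of a fitted curve need not coincide.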


 

Why doesn’t Radar Logic just use the data median?

There are a couple of big advantages to Triple Power Law parameterization over the raw data median.  First, the data median can miss dynamics that are present only in a particular segment of the market.  For example, if low-end prices are declining while the rest of the market is remaining stable, the median would largely ignore it since those transactions started and ended below the median, while the left slope of the Triple Power Law distribution would become less steep (and consequently the index price lower) to reflect that decline.   A second major benefit is that a Triple Power Law approach does better in the presence of bad or idiosyncratic data since it does not blindly accommodate all datapoints but instead places them in a larger context.  As a corollary, Triple Power Law provides a criterion for determining whether there are issues with the data, as in the example above of the normal distribution of heights and an irregular bump on the right tail resulting from the passage of the Knicks. Knowing what the distribution should look like helps establish statistical filters that pick out apparently irregular clusters of transactions for further scrutiny. Such filters help identify and fix data problems, which helps clean up the actual data samples and in turn yield more accurate Triple Power Law distributions and eventually daily prices.


  

Why not fit the data more exactly with a non-parametric approach?

Non-parametric statistical methods (e.g., kernel estimators) assume no knowledge and in return supply no insights.  Non-parametric estimates are in effect black boxes, which are an unsatisfactory second resort.  We can do, and do, much better than that.

A parameterization of the distribution, if adequate information exists to peg down a model, can: a) incorporate all the known facts about the data and the underlying dynamics; b) provide a benchmark against which data can be assessed for errors or manipulation; and c) comprise meaningful parameters that in themselves reflect facts about the data or dynamics.


 

How does Radar Logic adjust for the variations in individual properties?

We do make one major property-specific adjustment—for size—by using a price per square foot calculation.  Beyond that, there are obviously an infinite number of other variables that affect the price of a home.  Moreover, the variables themselves vary by location (e.g., an oceanfront view has a big impact on a condo price on the coast but is irrelevant in the heartland).  The advantage of fitting a distribution to the data is that we don’t have to know all the factors contributing to the price of a home to determine where on the price spectrum it should fall. We take the pragmatic approach that it is impossible to determine all the individual attributes (objective or subjective and often intangible) that induce a buyer to pay a certain price for a home at a certain place and time, and instead seek an accurate statistical representation of the residential market price movement dynamics. 

We undertook to develop the methodology with a couple of index characteristics in mind.  First, we wanted to include as many transactions as possible.  A repeat-sales method fairly neatly adjusts for variations in individual properties but by definition excludes new properties and raises the question of which time periods price movements should be attributed to.  We also wanted to report a meaningful number at the MSA level every day.  Least-squares regression methods that pick a set of independent variables might have good explanatory power, but a multivariate regression involving that many variables requires a large amount of data, which typically means aggregating across larger geographical areas or timeframes.


  

How does the publication timeline work?  How are the 1-, 7-, and 28-day prices calculated?

The prices for a given transaction date are published exactly 63 days (nine weeks) later.  In the case of the 7- and 28-day prices, the price comes from establishing the position parameter by looking at transactions from the transaction date 63 days ago, plus the previous 6 or 27 calendar days, respectively.
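The date arithmetic can be written out as a short sketch. The function name `transaction_window` and the example publication date are hypothetical; the 63-day lag and the 1-, 7-, and 28-day windows come from the description above.

```python
from datetime import date, timedelta

PUBLICATION_LAG = timedelta(days=63)  # nine weeks, per the text above

def transaction_window(publication_date, window_days):
    """Return (first, last) transaction dates behind a published price."""
    last = publication_date - PUBLICATION_LAG
    # A 7-day window is the transaction date plus the previous 6 calendar days.
    first = last - timedelta(days=window_days - 1)
    return first, last

# Example publication date chosen arbitrarily for illustration.
published = date(2023, 3, 6)
for window in (1, 7, 28):
    first, last = transaction_window(published, window)
    print(window, first.isoformat(), last.isoformat())
```

For a price published on 2023-03-06, the transaction date is 2023-01-02; the 7-day window starts on 2022-12-27 and the 28-day window starts on 2022-12-06.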


 

By using a one-year average to fit the shape, isn’t Radar Logic introducing a lag on how fast the shape changes over time?

We studied a variety of options before settling on a year for the shape parameters.  We do, however, use statistics (namely, the Kolmogorov probability) to monitor on an ongoing basis how well the theoretical distribution fits the empirical one to be sure that we aren't missing any important dynamics.
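For readers unfamiliar with this kind of check, here is a sketch of a standard one-sample Kolmogorov-Smirnov calculation. It is an assumption that the monitoring works along these lines; the text above does not say how the Kolmogorov probability is computed. The function names are hypothetical, and the probability uses the plain asymptotic series with lambda equal to the square root of n times the largest gap.

```python
import math

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def kolmogorov_probability(d, n, terms=100):
    """Asymptotic chance of a gap at least d if the model were correct."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series is numerically unreliable here; true value is ~1
    total = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

# Evenly spread made-up sample on (0, 1), tested against two candidate models.
sample = [(k + 0.5) / 10 for k in range(10)]
d_good = ks_statistic(sample, lambda x: x)        # matching uniform model
d_bad = ks_statistic(sample, lambda x: x ** 3)    # deliberately wrong model
p_good = kolmogorov_probability(d_good, len(sample))
p_bad = kolmogorov_probability(d_bad, len(sample))
print(round(d_good, 3), round(d_bad, 3), round(p_good, 3), round(p_bad, 3))
```

A small gap and a probability near 1 indicate that the theoretical curve tracks the empirical one; a probability that drops toward 0 would flag a fit worth investigating.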


 

Why use median for subindices (zip code, property type, etc.) on the analytics site?

We chose to use the data median for the analytics site because without being able to control what property characteristics users select, we cannot guarantee that any combination that they put together will be appropriately captured by two or three power-law regions (Triple Power Law does fine with double power law scenarios; it just folds down one of the regions).  At the MSA level, or any region broad enough to encompass a full socioeconomic spectrum, Triple Power Law works well because it is a true representation of the data.  If you just pick out isolated chunks of the spectrum, however, applying Triple Power Law may not be the best bet.  Imagine the user who chooses Manhattan and the Bronx, which have very different price points.  That ppsf distribution is essentially bimodal, and so the median is volatile.  To conceptualize, imagine that between the two locations you have 100 transactions every day.  One day 51 of them occurred in the Bronx and 49 in Manhattan, so the median is a Bronx price; the next day the transaction counts are reversed, so you have a Manhattan price.  By virtue of being unimodal rather than bimodal, a Triple Power Law distribution would be more stable, and the price might actually be a pretty good reconciliation of the two disparate regions; however, we could not really say that the Triple Power Law distribution reflected the actual data distribution.
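The 51/49 example can be reproduced in a few lines. The two price points are made up for illustration, and every transaction in a borough is given the same price to keep the arithmetic obvious.

```python
from statistics import median

# Hypothetical price points; the real figures are not given in the text.
BRONX_PPSF, MANHATTAN_PPSF = 300.0, 1200.0

def pooled_median(n_bronx, n_manhattan):
    """Median ppsf of a day's combined transactions from both boroughs."""
    return median([BRONX_PPSF] * n_bronx + [MANHATTAN_PPSF] * n_manhattan)

day_one = pooled_median(51, 49)   # Bronx majority
day_two = pooled_median(49, 51)   # Manhattan majority
print(day_one, day_two)           # 300.0 1200.0
```

A shift of two transactions out of 100 moves the pooled median from one borough's price level to the other's.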


  

If a datapoint falls outside the distribution, does that mean it does not count in calculating the index?

Every individual datapoint is equally influential in fitting the distribution.  It is true that what matters in the final index price is the area under the curve, but just because a datapoint does not fall under that umbrella does not mean that it was not accounted for in the fitting process. 


 

Wouldn't the distribution over the year be influenced if there had been a sharp increase in house prices throughout the year, resulting in an increase in the "quality" of properties? Would a very high quality house that appeared in the upper tail during the first half of a year move down towards the middle by the end of the year?

A sharp increase in house prices does not necessarily have to affect the shape of the distribution, although it could.  For example, imagine one scenario in which the price increased $50 per square foot, and another in which the ppsf increased 20%.  The distribution for the first scenario simply shifts to the right (i.e., only the position parameter would change), and in the second the shape would flatten out since the percentage change makes a bigger difference in the absolute price the more expensive a property was to start. 
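A small numeric check of the two scenarios in absolute dollar terms, using made-up prices per square foot: adding a fixed amount leaves the spread of prices unchanged, while a percentage increase widens it.

```python
from statistics import pstdev

ppsf = [100.0, 150.0, 200.0, 300.0, 500.0]   # hypothetical prices per square foot

plus_50 = [p + 50.0 for p in ppsf]           # scenario 1: +$50 per square foot
up_20pct = [p * 1.2 for p in ppsf]           # scenario 2: +20%

def price_range(values):
    """Dollar gap between the most and least expensive property."""
    return max(values) - min(values)

print(price_range(ppsf), price_range(plus_50), price_range(up_20pct))
print(round(pstdev(ppsf), 2), round(pstdev(plus_50), 2), round(pstdev(up_20pct), 2))
```

In the first scenario the gap between the cheapest and most expensive property stays at $400; in the second it grows to $480, because the 20% is applied to a larger starting price at the top end.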

Note, though, that in both these scenarios the properties are all in the same positions on the spectrum relative to each other.  So a high-end, right-tail house is still in the right tail.  Now imagine the scenario in which a large section of lower-end housing is razed and re-built with luxury condos, all of which are more expensive than the high-end house.  That would indeed alter the shape of the distribution by moving strength from the left to the right tail relative to the median, and would move the reference house toward the middle of the distribution. Changes in the profile of properties on the ground occur gradually (over extended periods of time rather than overnight) and are picked up by the shape parameters, which are designed to capture precisely such effects.


  

How do you come up with a price on a day with low transaction volume? 

It is indeed harder to determine the appropriate position for the distribution on the ppsf axis on low-volume days, so we get the best fit we can; however, those days will tend to be more volatile. The fact that the shape of the distribution is established using a year’s worth of data helps reduce volatility.
