In the context of the RPX methodology, parameterization is the
process of fitting a theoretical distribution to empirical data.
For the RPX, we refer to this theoretical
distribution as “Triple Power Law.™”
The goal is to incorporate as much information as is available about the
universe of data in order to help control for the idiosyncrasies of a
In essence, we do not calculate the daily price
based solely on the raw data in a short period of time (a day, a week, or a month).
We use an understanding of the dynamics derived from
looking at the data over a longer period of time to give us a truer
interpretation of the data in the shorter time period than simple statistics
(e.g., average or median) allow.
This enables us to combine the timeliness of
using data from the recent past with the robustness of using a longer time
horizon. For example, if
you are trying to find the median or average height of male New Yorkers, you
might stand on Seventh Avenue measuring the first 100 men to walk past.
If in the course of that time the New York
Knicks basketball team passes by, simply taking a median or average of the
raw heights in your sample might lead you to conclude that the height of the
total population is higher than the true measure.
If you know, however, that height tends to be
normally distributed (i.e., conforms to a “bell curve”), you can overlay a
theoretical normal distribution on top of your sample distribution.
Assuming the rest of your sample was
sufficiently random, it will likely line up closely, except that your sample
distribution will have a cluster of high values on the right tail.
You would likely conclude that the appropriate
height is the central value on your theoretical distribution.
This is not discounting the Knicks, just
acknowledging that they would probably be balanced out by short people in a
larger sample. The distribution
of prices per square foot paid for residential real estate in a given period
of time does not conform to a normal or other common distribution, but it
has a consistent shape that can be characterized and used to shed light on
the true value of the residential property market on a given day.
We call this distribution Triple Power Law,
because the general shape is defined by three lines that appear straight on
a logarithmic scale (a feature that characterizes mathematical expressions
known as “power laws”).
Knowing the generalized form of the
distribution, we can then determine the specific shape of the distribution
and its position on the x-axis based on the actual data.
In the height
example above, in order to overlay a normal distribution you had to draw a
conclusion about the appropriate height and width for the distribution. In
the absence of any other information, you would have to glean this from your
sample. You could get an even
better answer if you took a sample of 100 men for 100 days in a row to get an
idea of the true standard deviation (which determines the shape of the
distribution; since height is very stable over time, this probably would not
give you any more insight than measuring 10,000 men in one day, but for the
purposes of comparing it to real estate, in which dynamics can and do
change, a time series component is necessary).
Just as a normal distribution can be tall and narrow or short and wide, the shape of the Triple Power Law distribution can vary across time and geographies but can be established by looking at a good-sized dataset. While not as stable as the standard deviation of the height of a population, the shape of Triple Power Law changes fairly slowly. In concrete terms, we characterize this shape as the distribution of quality in the housing stock of an area (i.e., where properties fall relative to each other on the price spectrum, regardless of the absolute price level). We can therefore use a relatively long time horizon to establish the shape. The actual price level can change much more rapidly, which we account for by sliding the shape along the ppsf axis until it best fits data from recent transactions (either 1, 7, or 28 days' worth of transactions – see question below).
The parameters describe the shape and position of the Triple Power Law distribution.
There are six of them–five for the shape and one
for the position–although one of the shape parameters is simply derived from
the other four.
There are three distinct
regions of the distribution: low, middle, and high (which are established by
the process of fitting the distribution rather than an a priori
notion on our part of which properties are supposed to fall into which
region). In the figure
below, the x-axis is the natural logarithm of the price per square foot, and
the y-axis is the natural logarithm of the frequency.
Shape parameters βL and βR define the left and right slopes.
Shape parameter p is the ratio of the price per square foot at the second turning point (between the middle and high regions) relative to the first turning point (between the low and middle regions).
Shape parameter hc is the ratio of the frequency of the second turning point to the first.
Shape parameter βM simply connects the dots defined by the other parameters.
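To make "connects the dots" concrete: on the log-log axes described above, the two turning points are ln(p) apart horizontally and ln(hc) apart vertically, so the middle slope follows from p and hc alone. The sketch below is illustrative only. The function names are hypothetical, and it assumes that βL and βR are expressed as slopes on the log-log plot (βL positive, βR negative) and that the curve is anchored at height zero at the first turning point.

```python
import math

def middle_slope(p: float, hc: float) -> float:
    """Slope of the middle segment on log-log axes: rise ln(hc) over run ln(p)."""
    if p <= 1.0 or hc <= 0.0:
        raise ValueError("expected p > 1 and hc > 0")
    return math.log(hc) / math.log(p)

def log_frequency(ppsf, b, beta_l, beta_r, p, hc):
    """Unnormalized ln(frequency) at a given price per square foot.

    Assumption: beta_l > 0 and beta_r < 0 are slopes on the log-log plot,
    and the curve has height 0 at the first turning point (ppsf == b).
    """
    d = math.log(ppsf / b)          # horizontal distance from the first turning point
    ln_p = math.log(p)
    beta_m = middle_slope(p, hc)    # derived, not fitted independently
    if d < 0:                       # low region
        return beta_l * d
    if d <= ln_p:                   # middle region
        return beta_m * d
    return beta_m * ln_p + beta_r * (d - ln_p)   # high region

# Hypothetical shape: second turning point at twice the price, half the frequency.
beta_m = middle_slope(p=2.0, hc=0.5)
```

With these conventions, hc below 1 gives a negative middle slope and hc above 1 gives a positive one.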
For a set of shape parameters, the entire Triple Power
Law structure is moved to the position on the x-axis where it best fits the
data for a given day or set of days, and the process is iterated until both
the optimal shape and position are established.
Given a shape, the position can be defined by
any single point on the Triple Power Law distribution. We use the convention
whereby we identify the position with the price per square foot of the first
turning point, parameter b.
We sometimes refer to b as the mode, since the first turning point is typically the point of greatest frequency; however, occasionally that distinction belongs to the second turning point (i.e., βM is positive and hc is greater than 1).
Here’s a typical
example of what the distribution looks like when fit to actual data:
How does Radar Logic determine the parameters?
We use a fitting algorithm that tests a large number of possible values for each parameter until it finds the combination that best fits the data. Specifically, the algorithm fixes a set of shape parameters, starting with the parameters from the previous day as a first guess. For each day in the previous year, the algorithm then varies the position parameter and computes the log likelihood function (a standard statistical calculation that conveys how likely it is that a specific dataset could have been generated by a given distribution). It adds up the 365 individual log likelihood functions to get the cumulative log likelihood. It then tweaks the shape parameters and repeats the process. It continues to loop through these steps looking for better fits until its adjustments stop making a discernible improvement, at which point it locks down the optimized parameters for the day.
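The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production algorithm: the density is taken over x = ln(ppsf), βL and βR are treated as log-log slopes, the position is searched over a fixed grid, and the shape is tweaked one parameter at a time with a shrinking step. All function names and the sample data are hypothetical.

```python
import math

def log_density(x, ln_b, beta_l, beta_r, p, hc):
    """Log of a normalized triple-power-law density over x = ln(ppsf)."""
    ln_p = math.log(p)
    beta_m = math.log(hc) / ln_p                      # middle slope, derived
    mid_area = ln_p if abs(beta_m) < 1e-12 else (hc - 1.0) / beta_m
    log_z = math.log(1.0 / beta_l + mid_area + hc / -beta_r)
    d = x - ln_b
    if d < 0:
        h = beta_l * d
    elif d <= ln_p:
        h = beta_m * d
    else:
        h = beta_m * ln_p + beta_r * (d - ln_p)
    return h - log_z

def valid(shape):
    beta_l, beta_r, p, hc = shape
    return beta_l > 0.1 and beta_r < -0.1 and p > 1.05 and hc > 0.05

def best_position_loglik(day, shape, grid):
    """Slide the fixed shape along the axis; keep the best log likelihood for one day."""
    return max(sum(log_density(x, ln_b, *shape) for x in day) for ln_b in grid)

def cumulative_loglik(days, shape, grid):
    """Sum of the per-day log likelihoods over the whole window."""
    return sum(best_position_loglik(day, shape, grid) for day in days)

def fit_shape(days, shape, grid, step=0.2, tol=1e-6, max_rounds=50):
    """Tweak one shape parameter at a time, keeping only changes that improve the fit."""
    best = cumulative_loglik(days, shape, grid)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(shape)):
            for delta in (step, -step):
                trial = list(shape)
                trial[i] += delta
                if not valid(trial):
                    continue
                score = cumulative_loglik(days, tuple(trial), grid)
                if score > best + tol:
                    shape, best, improved = tuple(trial), score, True
        if not improved:
            step /= 2                                  # refine the search
            if step < 0.01:                            # no discernible improvement left
                break
    return shape, best

# Hypothetical data: two "days" of prices per square foot.
raw = ([90, 120, 150, 155, 180, 220, 300, 450],
       [95, 130, 160, 170, 200, 240, 320, 500])
days = [[math.log(v) for v in day] for day in raw]
grid = [math.log(100) + 0.05 * k for k in range(30)]   # candidate ln(b) values
start = (3.0, -3.0, 2.0, 0.8)                          # beta_l, beta_r, p, hc
fitted, score = fit_shape(days, start, grid)
```

In the process described above the window would hold a year of daily samples rather than two; the structure of the loop is the same.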
Once you have the distribution, how does Radar Logic actually get the daily price?
We take the median of the distribution–that is, the price per square foot at which half the area under the distribution curve is below and half above.
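For a curve made of three straight segments on log-log axes, the half-area point can be found in closed form by adding up the area of each region. The sketch below assumes the density is defined over ln(ppsf) and that βL and βR are log-log slopes (βL positive, βR negative); the function name is hypothetical and the parameter values in the example are made up.

```python
import math

def median_ppsf(b, beta_l, beta_r, p, hc):
    """Price per square foot that splits the area under the curve in half."""
    ln_p = math.log(p)
    beta_m = math.log(hc) / ln_p                 # middle slope, derived from p and hc
    flat = abs(beta_m) < 1e-12
    area_left = 1.0 / beta_l
    area_mid = ln_p if flat else (hc - 1.0) / beta_m
    area_right = hc / -beta_r
    half = 0.5 * (area_left + area_mid + area_right)
    if half <= area_left:                        # median lies in the low region
        d = math.log(beta_l * half) / beta_l
    elif half <= area_left + area_mid:           # median lies in the middle region
        need = half - area_left
        d = need if flat else math.log1p(beta_m * need) / beta_m
    else:                                        # median lies in the high region
        d = ln_p + math.log(half * -beta_r / hc) / beta_r
    return b * math.exp(d)                       # d is the offset from ln(b)

# Hypothetical fitted parameters.
price = median_ppsf(b=150.0, beta_l=3.0, beta_r=-3.0, p=2.0, hc=0.8)
```

Because the position parameter b only multiplies the result, sliding the shape along the axis moves the median by the same proportion.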
Why doesn’t Radar Logic just use the data median?
There are a couple of big advantages to Triple Power Law parameterization over the raw data median. First, the data median can miss dynamics that are present only in a particular segment of the market. For example, if low-end prices are declining while the rest of the market is remaining stable, the median would largely ignore it since those transactions started and ended below the median, while the left slope of the Triple Power Law distribution would become less steep (and consequently the index price lower) to reflect that decline. A second major benefit is that a Triple Power Law approach does better in the presence of bad or idiosyncratic data since it does not blindly accommodate all datapoints but instead places them in a larger context. As a corollary, Triple Power Law provides a criterion for determining whether there are issues with the data, as in the example above of the normal distribution of heights and an irregular bump on the right tail resulting from the passage of the Knicks. Knowing what the distribution should look like helps establish statistical filters that pick out apparently irregular clusters of transactions for further scrutiny. Such filters help identify and fix data problems, which helps clean up the actual data samples and in turn yield more accurate Triple Power Law distributions and eventually daily prices.
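The first point can be seen with a toy sample. The prices below are invented for illustration: cutting every price below the median by 10% leaves the data median where it was, even though the low end of the market has clearly moved.

```python
from statistics import mean, median

# Hypothetical prices per square foot for one period.
before = [80, 90, 100, 150, 200, 250, 300]

# Low-end prices fall 10%; everything at or above the median is unchanged.
after = [price * 0.9 if price < 150 else price for price in before]

median_before, median_after = median(before), median(after)
mean_before, mean_after = mean(before), mean(after)
```

The median is identical in both samples while the mean falls, which is the kind of segment-specific movement the raw median cannot register.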
Why not fit the data more exactly with a non-parametric approach?
Non-parametric statistical methods (e.g., kernel
estimators) assume no knowledge and in return supply no insights.
Non-parametric estimates are in effect black
boxes, which are an unsatisfactory second resort.
We can do—and do—much
better than that.
A parameterization of the distribution, if adequate information exists to peg down a model, can: a) incorporate all the known facts about the data and the underlying dynamics; b) provide a benchmark against which data can be assessed for errors or manipulation; and c) comprise meaningful parameters that in themselves reflect facts about the data or dynamics.
We do make one major property-specific
adjustment—for size—by using
a price per square foot calculation. Beyond that, there are obviously
an infinite number of other variables that affect the price of a home.
Moreover, the variables themselves vary by location (e.g., an oceanfront
view has a big impact on a condo price on the coast but is irrelevant in the
heartland). The advantage of fitting a distribution to the data is
that we don’t have to know all the factors contributing to the price of a
home to determine where on the price spectrum it should fall. We take the
pragmatic approach that it is impossible to determine all the individual
attributes (objective or subjective and often intangible) that induce a
buyer to pay a certain price for a home at a certain place and time, and
instead seek an accurate statistical representation of the residential
market price movement dynamics.
We undertook to develop the methodology with a couple of index characteristics in mind. First, we wanted to include as many transactions as possible. A repeat-sales method fairly neatly adjusts for variations in individual properties but by definition excludes new properties and raises questions about which time periods price movements should be attributed to. We also wanted to report a meaningful number at the MSA level every day. Least-squares regression methods that pick a set of independent variables might have good explanatory power, but a multivariate regression involving that many variables requires a large amount of data–a data requirement that typically requires aggregating across larger geographical areas or timeframes.
How does the publication timeline work?
How are the 1-, 7-,
and 28-day prices calculated?
The prices for a given transaction date are published exactly 63 days (nine weeks) later. In the case of the 7- and 28-day prices, the price comes from establishing the position parameter by looking at transactions from the transaction date 63 days ago, plus the previous 6 or 27 calendar days, respectively.
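The date arithmetic can be written out as follows. The 63-day lag and the 1-, 7-, and 28-day spans come from the text above; the function and constant names are hypothetical.

```python
from datetime import date, timedelta

PUBLICATION_LAG_DAYS = 63   # nine weeks, per the text

def publication_date(transaction_date: date) -> date:
    """Day on which prices for a given transaction date are published."""
    return transaction_date + timedelta(days=PUBLICATION_LAG_DAYS)

def transaction_window(publication_day: date, span_days: int):
    """First and last transaction dates that feed an n-day price."""
    if span_days not in (1, 7, 28):
        raise ValueError("span_days must be 1, 7, or 28")
    last = publication_day - timedelta(days=PUBLICATION_LAG_DAYS)
    first = last - timedelta(days=span_days - 1)   # previous 0, 6, or 27 calendar days
    return first, last

# Example: which transactions feed the 28-day price published on a given day?
first, last = transaction_window(date(2008, 3, 31), span_days=28)
```

The window always ends on the transaction date 63 days before publication; only its starting point depends on the span.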
We studied a variety of options before settling on a year for the shape parameters. We do, however, use statistics (namely, the Kolmogorov probability) to monitor on an ongoing basis how well the theoretical distribution fits the empirical one to be sure that we aren't missing any important dynamics.
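As an illustration of this kind of monitoring, the sketch below computes a one-sample Kolmogorov-Smirnov statistic and its asymptotic probability. It assumes that "Kolmogorov probability" refers to the p-value of that test, which the text does not spell out, and it uses the standard large-sample series without small-sample corrections. The function names are hypothetical.

```python
import math

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def kolmogorov_probability(d, n, terms=200):
    """Asymptotic probability of a gap at least this large if the fit is right."""
    lam = math.sqrt(n) * d
    if lam < 0.2:                    # the series converges slowly here; the value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Toy check against a uniform distribution on [0, 1].
sample = [i / 10 for i in range(1, 10)]
d = ks_statistic(sample, lambda x: x)
prob = kolmogorov_probability(d, len(sample))
```

A probability near 1 indicates the sample is consistent with the theoretical distribution; a value near 0 suggests the fit is missing something.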
Why use median for subindices (zip code, property type, etc.) on the analytics site?
We chose to use the data median for the analytics site because without being able to control what property characteristics users select, we cannot guarantee that any combination that they put together will be appropriately captured by two or three power-law regions (Triple Power Law does fine with double power law scenarios; it just folds down one of the regions). At the MSA level, or any region broad enough to encompass a full socioeconomic spectrum, Triple Power Law works well because it is a true representation of the data. If you just pick out isolated chunks of the spectrum, however, applying Triple Power Law may not be the best bet. Imagine the user who chooses Manhattan and the Bronx, which have very different price points. That ppsf distribution is essentially bimodal, and so the median is volatile. To conceptualize, imagine that between the two locations you have 100 transactions every day. One day 51 of them occurred in the Bronx and 49 in Manhattan, so the median is a Bronx price; the next day the transaction counts are reversed, so you have a Manhattan price. By virtue of being unimodal rather than bimodal, a Triple Power Law distribution would be more stable, and the price might actually be a pretty good reconciliation of the two disparate regions; however, we could not really say that the Triple Power Law distribution reflected the actual data distribution.
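The 51/49 example can be reproduced directly. The two price levels below are invented placeholders, and every transaction in a borough is given the same price to keep the arithmetic obvious.

```python
from statistics import median

BRONX_PPSF = 250        # hypothetical price per square foot
MANHATTAN_PPSF = 1100   # hypothetical price per square foot

def daily_median(bronx_count: int, manhattan_count: int) -> float:
    """Data median for one day's combined transactions."""
    prices = [BRONX_PPSF] * bronx_count + [MANHATTAN_PPSF] * manhattan_count
    return median(prices)

day_one = daily_median(51, 49)   # Bronx holds the majority
day_two = daily_median(49, 51)   # Manhattan holds the majority
```

A shift of two transactions out of 100 moves the median from one price level to the other, which is the volatility described above.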
If a datapoint falls outside the distribution, does that mean it does not count
in calculating the index?
Every individual datapoint is equally influential in fitting the distribution. It is true that what matters in the final index price is the area under the curve, but just because a datapoint does not fall under that umbrella does not mean that it was not accounted for in the fitting process.
Wouldn't the distribution over the year be influenced if there had been a
sharp increase in house prices throughout the year, resulting in an increase
in the "quality" of properties? Would a very high quality house that
appeared in the upper tail during the first half of a year move down towards
the middle by the end of the year?
A sharp increase in house prices does not necessarily have
to affect the shape of the distribution, although it could. For
example, imagine one scenario in which the price increased $50 per square
foot, and another in which the ppsf increased 20%. The distribution
for the first scenario simply shifts to the right (i.e., only the position
parameter would change), and in the second the shape would flatten out since
the percentage change makes a bigger difference in the absolute price the
more expensive a property was to start.
Note, though, that in both these scenarios the properties are all in the same positions on the spectrum relative to each other. So a high-end, right-tail house is still in the right tail. Now imagine the scenario in which a large section of lower-end housing is razed and re-built with luxury condos, all of which are more expensive than the high-end house. That would indeed alter the shape of the distribution by moving strength from the left to the right tail relative to the median, and would move the reference house toward the middle of the distribution. Changes in the profile of properties on the ground occur gradually (over extended periods of time rather than overnight) and are picked up by the shape parameters, which are designed to capture precisely such effects.
How do you come up with a price on a day with low transaction volume?
It is indeed harder to determine the appropriate position for the distribution on the ppsf axis on low-volume days, so we get the best fit we can; however, those days will tend to be more volatile. The fact that the shape of the distribution is established using a year’s worth of data helps reduce volatility.