Methodology FAQ
What is Triple Power Law?
What does it mean to parameterize the distribution?
How does Radar Logic determine the parameters?
Once you have the distribution, how does Radar Logic actually get the daily price?
Why doesn’t Radar Logic just use the data median?
Why not fit the data more exactly with a nonparametric approach?
How does Radar Logic adjust for the variations in individual properties?
How does the publication timeline work?
How are the 1-, 7-, and 28-day prices calculated?
Why use median for subindices (zip code, property type, etc.) on the analytics site?
How do you come up with a price on a day with low transaction volume?
In the context of the RPX methodology, parameterization is the
process of fitting a theoretical distribution to empirical data.
For the RPX, we refer to this theoretical
distribution as “Triple Power Law™.”
The
goal is to incorporate as much information as is available about the
universe of data in order to help control for the idiosyncrasies of a
particular sample.
In essence, we do not calculate the daily price
based solely on the raw data in a short period of time (a day, a week, or a
month).
We use an understanding of the dynamics derived from
looking at the data over a longer period of time to give us a truer
interpretation of the data in the shorter time period than simple statistics
(e.g., average or median) allow.
This enables us to combine the timeliness of
using data from the recent past with the robustness of using a longer time
window.
For example, if
you are trying to find the median or average height of male New Yorkers, you
might stand on Seventh Avenue measuring the first 100 men to walk past.
If in the course of that time the New York
Knicks basketball team passes by, simply taking a median or average of the
raw heights in your sample might lead you to conclude that the height of the
total population is higher than the true measure.
If you know, however, that height tends to be
normally distributed (i.e., conforms to a “bell curve”), you can overlay a
theoretical normal distribution on top of your sample distribution.
Assuming the rest of your sample was
sufficiently random, it will likely line up closely, except that your sample
distribution will have a cluster of high values on the right tail.
You would likely conclude that the appropriate
height is the central value on your theoretical distribution.
This is not discounting the Knicks, just
acknowledging that they would probably be balanced out by short people in a
larger sample.
The distribution
of prices per square foot paid for residential real estate in a given period
of time does not conform to a normal or other common distribution, but it
has a consistent shape that can be characterized and used to shed light on
the true value of the residential property market on a given day.
We call this distribution Triple Power Law,
because the general shape is defined by three lines that appear straight on
a logarithmic scale (a feature that characterizes mathematical expressions
known as “power laws”).
Knowing the generalized form of the
distribution, we can then determine the specific shape of the distribution
and its position on the x-axis based on the actual data.
In the height
example above, in order to overlay a normal distribution you had to draw a
conclusion about the appropriate height and width for the distribution.
In the absence of any other information, you would have to glean this from your
100-person sample.
You could get an even better answer if you took a sample of 100 men for 100 days
in a row to get an idea of the true standard deviation, which determines the
shape of the distribution.
Since height is very stable over time, this probably would not give you any more
insight than measuring 10,000 men in one day; in real estate, however, dynamics
can and do change, so a time series component is necessary.
Just as a normal distribution can be tall and narrow or
short and wide, the shape of the Triple Power Law distribution can vary
across time and geographies, but it can be established by looking at a
good-sized dataset.
While not as stable as the standard deviation of the
height of a population, the shape of Triple Power Law changes fairly slowly.
In
concrete terms, we characterize this shape as the distribution of quality in
the housing stock of an area (i.e., where properties fall relative to each
other on the price spectrum, regardless of the absolute price level).
We can therefore use a relatively long time
horizon to establish the shape.
The actual price level can change much more
rapidly, which we account for by sliding the shape along the price per square
foot (ppsf) axis until it best fits data from recent transactions (1, 7, or
28 days' worth of transactions; see question below).
What do the parameters
represent?
The parameters describe the shape and position of the Triple Power Law distribution.
There are six of them, five for the shape and one for the position, although
one of the shape parameters is simply derived from the other four.
There are three distinct regions of the distribution: low, middle, and high
(which are established by the process of fitting the distribution rather than
an a priori notion on our part of which properties are supposed to fall into
which regions).
In the figure below, the x-axis is the natural logarithm of the price per
square foot, and the y-axis is the natural logarithm of the frequency.

Shape parameters β_{L} and β_{R} define the left and right slopes.

Shape parameter p is the ratio of the price per square foot at the second turning point (between the middle and high regions) relative to the first turning point (between the low and middle regions).

Shape parameter h_{c} is the ratio of the frequency of the second turning point to the first.

Shape parameter β_{M} simply connects the dots defined by the other parameters.
For a set of shape parameters, the entire Triple Power
Law structure is moved to the position on the x-axis where it best fits the
data for a given day or set of days, and the process is iterated until both
the optimal shape and position are established.
Given a shape, the position can be defined by
any single point on the Triple Power Law distribution. We use the convention
whereby we identify the position with the price per square foot of the first
turning point, parameter b.
We sometimes refer to b as the mode, since the first turning point is
typically the point of greatest frequency; however, occasionally that
distinction belongs to the second turning point (i.e., β_{M} is positive and
h_{c} is greater than 1).
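The geometry described above can be sketched in a few lines of Python. This is an illustrative sketch, not Radar Logic's implementation: the function names, the choice to set the log frequency to zero at the first turning point (so the curve is left unnormalized), and the parameter values in the usage lines are all assumptions made for the example. It shows how β_{M} follows from p and h_{c}: on log-log axes the middle segment rises by ln(h_{c}) over a run of ln(p).

```python
import math

def middle_slope(p, h_c):
    """Slope of the middle segment on log-log axes, derived from p and h_c."""
    return math.log(h_c) / math.log(p)

def tpl_log_frequency(ppsf, b, beta_l, beta_r, p, h_c):
    """Unnormalized log frequency at a given price per square foot.

    Assumption: the log frequency is pinned to 0 at the first turning point b.
    """
    x = math.log(ppsf / b)   # log distance from the first turning point
    x2 = math.log(p)         # log distance between the two turning points
    if x < 0:                # low region, slope beta_l
        return beta_l * x
    if x <= x2:              # middle region, slope derived from p and h_c
        return middle_slope(p, h_c) * x
    return math.log(h_c) + beta_r * (x - x2)  # high region, slope beta_r

# Hypothetical parameter values, chosen only to exercise the functions.
params = dict(b=100.0, beta_l=2.0, beta_r=-3.0, p=3.0, h_c=0.5)
print(middle_slope(3.0, 0.5))              # negative, because h_c < 1
print(tpl_log_frequency(300.0, **params))  # ln(h_c) at the second turning point
```

With h_{c} below 1, as in these made-up values, the middle slope is negative and the first turning point is the mode; with h_{c} above 1 the middle slope turns positive and the second turning point becomes the peak.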
Here’s a typical
example of what the distribution looks like when fit to actual data:
How
does Radar Logic determine the parameters?
We use a fitting algorithm that tests a large number of possible values for
each parameter until it finds the combination that best fits the data.
Specifically, the algorithm fixes a set of shape
parameters, starting with the parameters from the previous day as a first
guess.
For each day in the previous year, the algorithm then
varies the position parameter and computes the log likelihood function (a
standard statistical calculation that conveys how likely it is that a
specific dataset could have been generated by a given distribution).
It adds up the 365 individual log likelihood
functions to get the cumulative log likelihood.
It then tweaks the shape parameters and repeats
the process.
It continues to loop through these steps looking
for better fits until it gets to a point where its tinkering stops making a
discernible improvement, at which point it locks down the optimized
parameters for the day.
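The loop described above might be sketched as follows. Several things here are assumptions made purely for illustration: a lognormal density with a single shape parameter stands in for the Triple Power Law distribution, the position is chosen from a fixed grid of candidates, and the shape is "tweaked" by a simple step-halving search. The source does not say which search strategy or stopping rule the real algorithm uses.

```python
import math

def log_likelihood(prices, position, shape):
    """Log likelihood of one day's prices under a stand-in lognormal density."""
    total = 0.0
    for price in prices:
        z = (math.log(price) - math.log(position)) / shape
        total += -0.5 * z * z - math.log(price * shape * math.sqrt(2 * math.pi))
    return total

def cumulative_log_likelihood(days, shape, positions):
    """For a fixed shape, pick the best position per day and sum over days."""
    return sum(
        max(log_likelihood(day, pos, shape) for pos in positions)
        for day in days
    )

def fit_shape(days, first_guess, positions, step=0.2, tol=1e-6):
    """Tweak the shape until no candidate improves the fit by more than tol."""
    shape = first_guess
    best = cumulative_log_likelihood(days, shape, positions)
    while step > 1e-3:
        improved = False
        for candidate in (shape * (1 + step), shape / (1 + step)):
            score = cumulative_log_likelihood(days, candidate, positions)
            if score > best + tol:
                shape, best, improved = candidate, score, True
        if not improved:
            step /= 2  # refine the search once coarse tweaks stop helping
    return shape, best

# Two hypothetical days of ppsf data (the text describes using a full year).
days = [[100, 120, 90, 110], [105, 130, 95, 115]]
positions = list(range(80, 141, 5))
start = cumulative_log_likelihood(days, 0.5, positions)
shape, best = fit_shape(days, first_guess=0.5, positions=positions)
print(shape, best)
```

The structure mirrors the text: the shape is shared across all days, while each day gets its own best-fitting position.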
Once
you have the distribution, how does Radar Logic actually get the daily
price?
We take the median of the distribution, that is, the price per square foot
at which half the area under the distribution curve lies below and half above.
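As a hedged illustration of "half the area below and half above," the sketch below computes the median of a three-segment power-law curve in closed form. It assumes the frequency is a density per unit of price per square foot, that the left slope is greater than -1 and the right slope is less than -1 (so both tails have finite area), and that the middle slope is not exactly -1. The function name and the parameter values are hypothetical.

```python
import math

def tpl_median(b, beta_l, beta_r, p, h_c):
    """Median of an unnormalized three-segment power-law density in ppsf."""
    beta_m = math.log(h_c) / math.log(p)                 # derived middle slope
    left = b / (beta_l + 1)                              # area below b
    middle = b * (p ** (beta_m + 1) - 1) / (beta_m + 1)  # area from b to p*b
    right = -h_c * p * b / (beta_r + 1)                  # area above p*b
    half = (left + middle + right) / 2
    if half <= left:
        # Median falls in the low region.
        return b * (half * (beta_l + 1) / b) ** (1 / (beta_l + 1))
    if half <= left + middle:
        # Median falls in the middle region.
        return b * (1 + (half - left) * (beta_m + 1) / b) ** (1 / (beta_m + 1))
    # Median falls in the high region: the area above it equals half.
    return p * b * (half / right) ** (1 / (beta_r + 1))

# Hypothetical parameters: a flat middle segment (h_c = 1) between 100 and 200.
print(tpl_median(b=100.0, beta_l=1.0, beta_r=-3.0, p=2.0, h_c=1.0))
```

In this made-up case the three areas are 50, 100, and 100, so the halfway point of 125 lies 75 units into the flat middle segment, giving a median of 175.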
Why doesn’t Radar Logic just use the data median?
There are a
couple of big advantages to Triple Power Law parameterization over the raw
data median. First, the data median can miss dynamics that are present
only in a particular segment of the market. For example, if low-end
prices are declining while the rest of the market is remaining stable, the
median would largely ignore it since those transactions started and ended
below the median, while the left slope of the Triple Power Law distribution
would become less steep (and consequently the index price lower) to reflect
that decline. A second major benefit is that a Triple Power Law
approach does better in the presence of bad or idiosyncratic data since it
does not blindly accommodate all datapoints but instead places them in a
larger context. As a corollary, Triple Power Law provides a criterion
for determining whether there are issues with the data, as in the example
above
of the normal distribution of heights and an irregular bump on the right
tail resulting from the passage of the Knicks. Knowing what the distribution
should look like helps establish statistical filters that pick out
apparently irregular clusters of transactions for further scrutiny. Such
filters help identify and fix data problems, which helps clean up the actual
data samples and in turn yield more accurate Triple Power Law distributions
and eventually daily prices.
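The first point can be seen in a toy example. The prices below are invented for illustration; the sketch only demonstrates that a decline confined to transactions below the median leaves the raw data median untouched.

```python
import statistics

# Hypothetical prices per square foot for one period.
before = [80, 90, 100, 150, 160, 170, 300]

# Low-end prices (at or below 100) fall 10%; the rest of the market is stable.
after = [price * 0.9 if price <= 100 else price for price in before]

print(statistics.median(before), statistics.median(after))  # identical medians
print(statistics.mean(before), statistics.mean(after))      # the mean does move
```

The median is the same in both periods because every affected transaction started and ended below it, which is exactly the blind spot described above.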
Why not
fit the data more exactly with a nonparametric approach?
Nonparametric statistical methods (e.g., kernel
estimators) assume no knowledge and in return supply no insights.
Nonparametric estimates are in effect black
boxes, which are an unsatisfactory second resort.
We can do, and do, much
better than that.
A parameterization of the distribution, if adequate
information exists to peg down a model, can: a) incorporate all the known
facts about the data and the underlying dynamics; b) provide a benchmark
against which data can be assessed for errors or manipulation; and c)
comprise meaningful parameters that in themselves reflect facts about the
data or dynamics.
How
does Radar Logic adjust for the variations in individual properties?
We do make one major property-specific
adjustment, for size, by using
a price per square foot calculation. Beyond that, there are obviously
an infinite number of other variables that affect the price of a home.
Moreover, the variables themselves vary by location (e.g., an oceanfront
view has a big impact on a condo price on the coast but is irrelevant in the
heartland). The advantage of fitting a distribution to the data is
that we don’t have to know all the factors contributing to the price of a
home to determine where on the price spectrum it should fall. We take the
pragmatic approach that it is impossible to determine all the individual
attributes (objective or subjective and often intangible) that induce a
buyer to pay a certain price for a home at a certain place and time, and
instead seek an accurate statistical representation of the residential
market price movement dynamics.
We undertook to develop the methodology with a couple of
index characteristics in mind. First, we wanted to include as many
transactions as possible. A repeat-sales method fairly neatly adjusts
for variations in individual properties but by definition excludes new
properties and raises questions regarding to which time periods to attribute
price movements. We also wanted to report a meaningful number at the
MSA level every day. Least-squares regression methods that pick a set
of independent variables might have good explanatory power, but a
multivariate regression involving that many variables requires a large
amount of data, a requirement that typically means aggregating across
larger geographical areas or timeframes.
How
does the publication timeline work?
How are the 1-, 7-, and 28-day prices calculated?
The prices for a given transaction
date are published exactly 63 days (nine weeks) later.
In the case of the 7- and 28-day prices, the
price comes from establishing the position parameter by looking at
transactions from the transaction date 63 days ago, plus the previous 6 or
27 calendar days, respectively.
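The date arithmetic can be written out as a small sketch. The function and constant names are hypothetical, and the dates are arbitrary examples.

```python
from datetime import date, timedelta

PUBLICATION_LAG = timedelta(days=63)  # nine weeks, per the text

def publication_date(transaction_date):
    """Date on which prices for a transaction date are published."""
    return transaction_date + PUBLICATION_LAG

def transaction_window(transaction_date, window_days):
    """First and last transaction dates feeding a 1-, 7-, or 28-day price."""
    first = transaction_date - timedelta(days=window_days - 1)
    return first, transaction_date

print(publication_date(date(2023, 1, 1)))         # 2023-03-05
print(transaction_window(date(2023, 1, 28), 7))   # Jan 22 through Jan 28
print(transaction_window(date(2023, 1, 28), 28))  # Jan 1 through Jan 28
```

Because the lag is a whole number of weeks, a price always publishes on the same day of the week as its transaction date.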
By using a one-year average to fit the shape, isn’t Radar Logic introducing a
lag on how fast the shape changes over time?
We studied a variety of options
before settling on a year for the shape parameters.
We do, however, use statistics (namely, the
Kolmogorov probability) to monitor on an ongoing basis how well the
theoretical distribution fits the empirical one to be sure that we aren't
missing any important dynamics.
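For readers unfamiliar with this kind of check, here is a minimal sketch. It assumes that "Kolmogorov probability" refers to the significance level of the Kolmogorov-Smirnov statistic, computed here with the standard asymptotic series; the source does not spell out the exact calculation, and the function names and sample data are hypothetical.

```python
import math

def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the theoretical CDF with the empirical step on both sides.
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap

def kolmogorov_probability(gap, n, terms=100):
    """Asymptotic probability of a gap at least this large under a correct fit."""
    lam = math.sqrt(n) * gap
    if lam < 0.05:
        return 1.0  # the series converges slowly here; the value is ~1
    total = sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2 * total))

# Hypothetical check: an evenly spread sample against a uniform CDF on [0, 1].
sample = [0.1, 0.3, 0.5, 0.7, 0.9]
gap = ks_statistic(sample, lambda x: x)
print(gap, kolmogorov_probability(gap, len(sample)))
```

A probability near 1 means the empirical and theoretical distributions are hard to tell apart; a value near 0 would flag a poor fit worth investigating.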
Why
use median for subindices (zip code, property type, etc.) on the analytics
site?
We chose to use the data median for the analytics
site because without being able to control what property characteristics
users select, we cannot guarantee that any combination that they put
together will be appropriately captured by two or three power-law regions
(Triple Power Law does fine with double-power-law scenarios; it just folds
down one of the regions). At the MSA level, or any region broad enough
to encompass a full socioeconomic spectrum, Triple Power Law works well
because it is a true representation of the data. If you just pick out
isolated chunks of the spectrum, however, applying Triple Power Law may not
be the best bet. Imagine the user who chooses Manhattan and the Bronx,
which have very different price points. That ppsf distribution is
essentially bimodal, and so the median is volatile.
To conceptualize, imagine that between the two
locations you have 100 transactions every day.
One day 51 of them occurred in the Bronx and 49
in Manhattan, so the median is a Bronx price; the next day the transaction
counts are reversed, so you have a Manhattan price. By virtue of being
unimodal rather than bimodal, a Triple Power Law distribution would be more
stable, and the price might actually be a pretty good reconciliation of the
two disparate regions; however, we could not really say that the Triple
Power Law distribution reflected the actual data distribution.
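The 51/49 scenario is easy to reproduce. The two price points below are invented round numbers; only the transaction counts come from the example above.

```python
import statistics

BRONX_PPSF, MANHATTAN_PPSF = 200, 1000  # hypothetical price points

day_one = [BRONX_PPSF] * 51 + [MANHATTAN_PPSF] * 49
day_two = [BRONX_PPSF] * 49 + [MANHATTAN_PPSF] * 51

print(statistics.median(day_one))  # a Bronx price
print(statistics.median(day_two))  # a Manhattan price
```

Shifting just two of 100 transactions between boroughs moves the median from one price point to the other, even though the underlying market has barely changed.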
If a
datapoint falls outside the distribution, does that mean it does not count
in calculating the index?
Every individual datapoint is equally influential in
fitting the distribution. It is true that what matters in the final
index price is the area under the curve, but just because a datapoint does
not fall under that umbrella does not mean that it was not accounted for in
the fitting process.
Wouldn't the distribution over the year be influenced if there had been a
sharp increase in house prices throughout the year, resulting in an increase
in the "quality" of properties? Would a very high quality house that
appeared in the upper tail during the first half of a year move down towards
the middle by the end of the year?
A sharp increase in house prices does not necessarily have
to affect the shape of the distribution, although it could. For
example, imagine one scenario in which the price increased $50 per square
foot, and another in which the ppsf increased 20%. The distribution
for the first scenario simply shifts to the right (i.e., only the position
parameter would change), and in the second the shape would flatten out since
the percentage change makes a bigger difference in the absolute price the
more expensive a property was to start.
Note, though, that in both these scenarios the
properties are all in the same positions on the spectrum relative to each
other. So a high-end, right-tail house is still in the right tail.
Now imagine the scenario in which a large section of lower-end housing is
razed and rebuilt with luxury condos, all of which are more expensive than
the high-end house. That would indeed alter the shape of the
distribution by moving strength from the left to the right tail relative to
the median, and would move the reference house toward the middle of the
distribution. Changes in the profile of properties on the ground occur
gradually (over extended periods of time rather than overnight) and are
picked up by the shape parameters, which are designed to capture precisely
such effects.
How do
you come up with a price on a day with low transaction volume?
It is indeed harder to determine the appropriate position
for the distribution on the ppsf axis on low-volume days, so we get the best
fit we can; however, those days will tend to be more volatile. The fact that
the shape of the distribution is established using a year’s worth of data
helps reduce volatility.