Predicting Groundhog Day


There sure are some weird holidays in the United States (and Canada). One such holiday, Groundhog Day, occurs every February 2nd. On this day we let a groundhog pop out of a hole in the ground to predict how much longer winter will last. If he “sees his shadow” then there will be 6 more weeks of winter, and if he does not then spring will come early. Sounds crazy, but this holiday has been celebrated since the 1800s.

There are many groundhogs in different locations that make predictions, but in Philadelphia we look to Punxsutawney Phil for our forecast. Up to 40,000 people gather to watch Phil each year, making Punxsutawney, Pennsylvania, home of the largest Groundhog Day celebration. People trust his predictions even though he is only 39% accurate, according to StormFax Weather Almanac. Below is a picture of the cute little (not so little) guy helping answer interview questions for Channel 6 News.

How adorable!

Is it possible to predict the outcome of Groundhog Day based on past data? I set out to investigate this by compiling a database of variables that might affect whether or not Phil sees his shadow.

The obvious variables to include were weather factors such as average temperature, humidity, wind speed, dew point, amount of precipitation, sea level pressure, visibility, and heating degree day (the amount of energy needed to heat a building, related to outside temperature). I guessed that these variables would be the most significant.

Other interesting variables I included in the model were day of the week, percent moon visible, whether or not it was a leap year, and the party of the incumbent president. I did not really expect any of these wildcard variables to have a significant contribution, but it was still fun to investigate.

Just for kicks, I threw in a variable I named “number of Bill Murray movies” for each year, to appease those who have seen Groundhog Day and enjoy the undeniable talent of Mr. Murray (myself included).

In order to build a model to predict the outcome of Groundhog Day 2013, first I had to compile my database. I used 40 years of data (1973-2012) and included 13 different explanatory variables. The response variable was whether or not the groundhog saw his shadow.
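Some of the columns in a database like this can be derived from the calendar alone. The sketch below is only an illustration of that idea, not the code used for this post: the function name `calendar_features` and the column names are made up, and the weather columns (which need a real data source) are left out entirely.

```python
import calendar
from datetime import date

# Years covered by the database described above: 1973 through 2012.
YEARS = range(1973, 2013)

# Fixed English names so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def calendar_features(year):
    """Return the calendar-derived columns for February 2 of `year`.

    Column names are hypothetical; they are not taken from the original database.
    """
    groundhog_day = date(year, 2, 2)
    return {
        "year": year,
        "day_of_week": DAY_NAMES[groundhog_day.weekday()],
        "leap_year": int(calendar.isleap(year)),
    }

rows = [calendar_features(year) for year in YEARS]
print(len(rows), "rows")
print(rows[0])
print(rows[-1])
```

The remaining explanatory variables would be joined onto these rows by year.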

With this database in place, I was able to start building a model. Since my response variable is binary (two possibilities: yes or no), I had to treat it differently from the continuous response variable of a regular linear model. This meant rounding the model's final output up or down to the nearest integer (1 for a shadow, 0 for no shadow).

First I created 13 different models, each with one of the explanatory variables, to find the best one-variable model. I put the most significant variable in the model and then added each of the 12 remaining variables in turn to find the best two-variable model. I repeated this process to look for the best three-variable model. However, none of the additional variables were significant, so the two-variable model was the best.
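This add-one-variable-at-a-time procedure is usually called forward selection. Here is a minimal sketch of it using ordinary least squares in NumPy. Several things are assumptions made for illustration: the post does not say which software or significance test was used, so the cutoff of |t| > 2 (roughly the 5% level) is a stand-in, the helper names are hypothetical, and the data is synthetic with a continuous response rather than the real 0/1 shadow outcome.

```python
import numpy as np

def t_statistics(X, y):
    """Fit least squares y ~ X (X already contains an intercept column)
    and return the t-statistic of every coefficient."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance
    return beta / np.sqrt(np.diag(cov))

def forward_select(features, y, t_cutoff=2.0):
    """Greedy forward selection.

    features: dict mapping a variable name to a 1-D array.
    t_cutoff: assumed significance threshold on |t| (not from the post).
    Returns the selected names in the order they were added.
    """
    n = len(y)
    selected, remaining = [], list(features)
    while remaining:
        best_name, best_t = None, 0.0
        for name in remaining:
            cols = [np.ones(n)] + [features[s] for s in selected] + [features[name]]
            t = abs(t_statistics(np.column_stack(cols), y)[-1])
            if t > best_t:
                best_name, best_t = name, t
        # Stop once the best remaining candidate is not significant.
        if best_name is None or best_t < t_cutoff:
            break
        selected.append(best_name)
        remaining.remove(best_name)
    return selected

# Synthetic demonstration data (NOT the Groundhog Day database).
rng = np.random.default_rng(0)
n = 40
features = {
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "pure_noise": rng.normal(size=n),
}
y = 2.0 * features["signal_a"] - 3.0 * features["signal_b"] + rng.normal(scale=0.01, size=n)
print(forward_select(features, y))
```

With a 0/1 response the same loop applies unchanged; the fitted values are then read as a score to be rounded rather than as an exact prediction.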

The explanatory variables that ended up significant in predicting the outcome of Groundhog Day were the dew point (in degrees F) and the amount of precipitation (in inches) on that day. Unfortunately, the number of movies that Bill Murray acted in that year did not contribute significantly to the model. Sorry, Bill!

The final model:
Shadow = 1.158317 – 0.014294*Dew Point – 0.728101*Precipitation

Both variables had negative coefficients, indicating that as they increase, the likelihood of the groundhog seeing his shadow decreases. In other words the more precipitation in inches and the higher the dew point, the less likely it is that Phil will see his shadow and the more likely that spring will come early.
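To see the effect of the negative coefficients, the final model can be written as a small function and evaluated over a few inputs. The coefficients come straight from the model above; the function name `shadow_score` and the sample dew points and precipitation amounts are just illustrative choices.

```python
# Coefficients of the final two-variable model.
INTERCEPT = 1.158317
DEW_POINT_COEF = -0.014294   # change in score per degree Fahrenheit
PRECIP_COEF = -0.728101      # change in score per inch of precipitation

def shadow_score(dew_point_f, precipitation_in):
    """Model output: higher values mean a shadow is more likely."""
    return INTERCEPT + DEW_POINT_COEF * dew_point_f + PRECIP_COEF * precipitation_in

# Illustrative inputs only, not historical observations.
for dew_point in (10, 20, 30):
    for precip in (0.0, 0.25, 0.5):
        score = shadow_score(dew_point, precip)
        print(f"dew point {dew_point}F, precipitation {precip:.2f} in -> {score:.3f}")
```

Every step up in either input lowers the score, which is the pattern described above.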

In order to predict the outcome of this year’s Groundhog Day, I will use historical temperature data from the past two weeks for the dew point prediction and forecasts from multiple weather sources for the precipitation prediction.

To predict the dew point for Saturday, I used temperature and dew point data from the past 2 weeks. Even though temperature is not included in the final model, it trends similarly to dew point, so tomorrow’s temperature prediction should be a good indicator of tomorrow’s dew point. In the graph below, the temperature for tomorrow is predicted to be about 32°F, an increase from today’s temperature. I expect the dew point to increase similarly, so my best guess is that the dew point will be 19°F.

According to Yahoo! Weather, the Weather Channel, and NBC Philadelphia Weather, there is a low chance of precipitation for Saturday, February 2 (around 0-10%). All sources predict that any precipitation will most likely occur in the evening, so for my purposes I am going to use a precipitation amount of 0.0 inches.

I plugged these two numbers into my final model. If the result is greater than 0.5, then the groundhog will see his shadow. If the result is less than 0.5, the groundhog will not see his shadow. The final calculation is:
Shadow = 1.158317 – 0.014294*19 – 0.728101*0.0
Shadow = 1.158317 – 0.271586 = 0.887

Since the model output a value of 0.887, which is greater than 0.5 and relatively close to 1, I predict that the groundhog will see his shadow tomorrow. That means 6 more weeks of winter weather! Snow and icy wind lovers rejoice. For the rest of us warm weather folks, don’t put your scarves and gloves away just yet.
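For anyone who wants to rerun the calculation with a different forecast, here is a short sketch of the decision rule. The coefficients, the 0.5 cutoff, and the inputs (19°F, 0.0 inches) are the ones used above; the function names and the two break-even figures are additions for illustration, showing how far off the guesses would have to be before the prediction flips.

```python
INTERCEPT = 1.158317
DEW_POINT_COEF = -0.014294
PRECIP_COEF = -0.728101
THRESHOLD = 0.5  # above this, the prediction is "sees his shadow"

def shadow_score(dew_point_f, precipitation_in):
    """Output of the final two-variable model."""
    return INTERCEPT + DEW_POINT_COEF * dew_point_f + PRECIP_COEF * precipitation_in

def sees_shadow(score, threshold=THRESHOLD):
    """Round the model output to a yes/no prediction."""
    return score > threshold

score = shadow_score(19.0, 0.0)
print(f"score = {score:.6f}, shadow predicted: {sees_shadow(score)}")

# Dew point at which the score would hit the cutoff with no precipitation.
break_even_dew_point = (INTERCEPT - THRESHOLD) / -DEW_POINT_COEF
# Precipitation at which the score would hit the cutoff at a 19F dew point.
break_even_precip = (score - THRESHOLD) / -PRECIP_COEF

print(f"break-even dew point: {break_even_dew_point:.1f}F")
print(f"break-even precipitation: {break_even_precip:.2f} in")
```

Under this model, the prediction would only change if the dew point came in near 46°F or if roughly half an inch of precipitation fell, so the call is not sensitive to small errors in either guess.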

At MaassMedia, we apply these statistical techniques to a variety of data from multiple channels. Like our predictions for Groundhog Day, our analyses identify metrics correlated with success to create predictive models. Knowing what to expect from your audience gives you the insight to develop targeted marketing that optimizes the user experience and boosts ROI. Get in touch with us to learn how your organization can move further along the path to Transformative Insights™.
