The Driving Forces behind Democratic Voting in America: A Demographic View

There are certain groups that the United States Democratic Party considers its “base”: its most reliable voters, who will vote for a Democrat over any other candidate in a given political race. Through polling and experimentation, it has come to be accepted as fact that the Democratic base is made up largely of young people, non-whites, and lower-income earners. We wanted to put this “fact” to the test and find out not only whether it is true, but also what the strongest indicator of Democratic voting might be, and whether there are further indicators that are not yet known.


We began with demographic data on individual counties, taken from the United States census website. We determined which counties fell within which congressional districts, and then averaged each variable across the counties within a district to get a measure for the district as a whole. We then combined the districts into one spreadsheet per state and used the spreadsheets for California, Colorado, Utah, South Carolina, Ohio, New Jersey, Arizona, and Washington to produce the regression.
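The county-to-district averaging described above can be sketched in plain Python. The district labels, variable names, and values below are hypothetical placeholders, not rows from our spreadsheets.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical county rows: (district label, {variable: value}).
# All labels and numbers here are made up for illustration.
counties = [
    ("CO-01", {"percent_under_18": 20.0, "percent_white_only": 50.0}),
    ("CO-01", {"percent_under_18": 24.0, "percent_white_only": 60.0}),
    ("CO-02", {"percent_under_18": 22.0, "percent_white_only": 80.0}),
]

def average_by_district(rows):
    """Unweighted mean of each variable over the counties in a district."""
    grouped = defaultdict(list)
    for district, values in rows:
        grouped[district].append(values)
    return {
        district: {
            name: mean(member[name] for member in members)
            for name in members[0]
        }
        for district, members in grouped.items()
    }

districts = average_by_district(counties)
print(districts["CO-01"])
```

Note that a simple mean like this one gives a small county the same weight as a large one.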

I believe the sample to be roughly representative: three states have been traditionally and reliably Democratic in the Presidential elections for the past twenty years or so (California, New Jersey, Washington), two states have been Republican for the same amount of time (Utah, South Carolina), and three are “battleground” states that have switched between the two over the past twenty years (Colorado, Ohio, Arizona). These states are also spread throughout the nation, which should reduce geographic bias. The dependent variable was taken by hand from the 2014 House Congressional results as given on, and it is the percent of each district that voted for the Democratic candidate.

If there were two Democrats running, we took the higher value of the two, and if there were no Democrats running, we assigned the district a 0. Most of the other variables we’re using came straight from the census website. We created a few additional variables, including a dummy variable for each state that marks which state a datapoint belongs to, and a variable that measures the percent of racial minorities in the county. We created the latter by simply taking the percent of the county that was “White only, not Hispanic or Latino” and subtracting that from 100.
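As a rough sketch, the variable construction described above might look like the following. The function names and the lowercase state codes are assumptions made for illustration; the actual work was done by hand in spreadsheets.

```python
# Lowercase codes for the eight states in the sample (naming is an assumption).
STATES = ["az", "ca", "co", "nj", "oh", "sc", "ut", "wa"]

def dem_percent(democratic_shares):
    """Highest Democratic vote share in a district, or 0 if no Democrat ran."""
    return max(democratic_shares, default=0.0)

def minority_percent(white_only_percent):
    """100 minus the 'White only, not Hispanic or Latino' percentage."""
    return 100.0 - white_only_percent

def state_dummies(state):
    """One 0/1 indicator per state; exactly one of them is 1."""
    return {code: int(code == state) for code in STATES}

print(dem_percent([31.5, 42.0]), dem_percent([]))
print(minority_percent(62.5))
print(state_dummies("ca"))
```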

Simple Variable Model

Using this data, we built a linear regression model that we hope has some power to explain Democratic voting in the United States. We began with a regression model with only one independent variable, minority_percent, the variable derived by subtracting the “white only” percentage from 100. This regression model has the form dem_percent_2014 = minority_percent*x + u, where u is the error term. This is a simple regression model, and is essentially a line in the x-y plane. It may be related to the “y=mx+b” formula that many of us learned in algebra: dem_percent_2014, the dependent variable, is y, the individual values of minority_percent play the role of the algebra formula’s x, and the x in our regression is the weight (the slope, m) given to those values. Minority_percent had a p-value that rounds to 0, which is to say that if the two values were truly unrelated, a relationship this strong would almost never appear by chance. This model had an R-squared value of .2196, which means that roughly 22% of the variance in the dem_percent_2014 figure was explained by the minority_percent variable alone.
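A one-variable fit of this kind can be sketched in plain Python. The data points below are made up for illustration, and the sketch fits an intercept alongside the slope, which is an assumption in the spirit of the y=mx+b analogy. It is not the regression actually run for this post.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the variance in y that the fitted line explains.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical districts: minority_percent and dem_percent_2014 (made up).
minority = [10.0, 20.0, 30.0, 40.0, 50.0]
dem_vote = [30.0, 38.0, 41.0, 52.0, 54.0]

slope, intercept, r_squared = fit_line(minority, dem_vote)
print(slope, intercept, r_squared)
```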

Figure 1: minority_percent vs. dem_percent_2014

Multiple Variable Model

This model may be extended into the final multiple-variable linear regression model, with multiple independent variables in addition to the aforementioned minority_percent. The final model is as follows: dem_percent_2014 = _cons + minority_percent*x1 + percent_under_18*x2 + percent_wout_health_insurance*x3 + ca*x4 + oh*x5 + sc*x6 + ut*x7 + wa*x8. _cons would be the b in the y=mx+b model, or the intercept. We’ve already discussed minority_percent; percent_under_18 is the percent of each district that is under the age of 18, percent_wout_health_insurance is the percent of each district that is without health insurance, and all of the other variables are “dummy variables” that correspond to states we included in the data set. These “dummy variables” have values of either 0 or 1. For example, ca would be equal to 1 if the given observation is a district from California, and 0 if not.

I will now discuss the individual weights on the different variables and the reasons that those variables are statistically and practically significant and thus included in the regression. Minority_percent has a coefficient of 1.1731, meaning that an increase of 1 percentage point in the percent of a given district that is of minority descent is associated with an increase of about 1.17 percentage points in the Democratic vote for a candidate in that district, holding the other variables fixed.

This figure has a p-value of 0.000, so it is extremely statistically significant. Furthermore, this fits in with our intuitions regarding Democratic voters that many of them come from minority backgrounds.

This relationship is shown in Figure 1. Percent_under_18 has a coefficient of -3.799, and this may be interpreted in the same way as the previous coefficient, only with the opposite sign. That is to say, an increase of 1 percentage point in the share of a given district that is under the age of 18 is associated with a decrease of 3.799 percentage points in the vote for the Democratic candidate running in that district. This goes against my initial assumptions on the subject, as I surmised that a higher percentage of individuals under 18 would signify a large group of voters just over 18, which would indicate a more Democratic district. This didn’t hold up, though.

Instead, a large percent of individuals under 18 most likely signifies a more suburban district, and suburban districts are often Republican strongholds.

This value is also highly statistically significant, again with a p-value of 0.000. This relationship is depicted in Figure 2. Percent_wout_health_insurance is the final non-dummy variable I included, and it has a coefficient of -1.137 in this regression: an increase of 1 percentage point in the uninsured share of a district is associated with a decrease of 1.137 percentage points in the Democratic vote. Since the sample is taken from 2014, the individuals without health insurance are most likely either those that chose not to be covered under the ACA or individuals that are somehow outside of the governmental system, yet still counted in the census. Given that Democrats, broadly speaking, advocate for government programs to help those in need, Democratic voters would most likely have health insurance through the ACA or otherwise, so this coefficient makes sense. This p-value is higher than the other two, at around .067, which is just above the .05 threshold that corresponds to the normally accepted 95% confidence level. Furthermore, the inclusion of this variable increases the adjusted R-squared, which is like a regular R-squared except that it penalizes every added variable, so including a variable that provides too little additional explanatory power actually lowers its value. The fact that it increases the adjusted R-squared therefore suggests that the variable earns its place in the regression.
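The three slopes reported above can be read as "change in predicted dem_percent_2014 per 1 percentage point change in the variable, other variables held fixed". A minimal sketch of that reading follows; the function name and the dictionary are illustrative, and only the three coefficient values come from the regression output.

```python
# Slopes reported in the text for the multiple-variable model.
SLOPES = {
    "minority_percent": 1.1731,
    "percent_under_18": -3.799,
    "percent_wout_health_insurance": -1.137,
}

def predicted_change(deltas):
    """Change in predicted dem_percent_2014 (percentage points) for the given
    changes in the regressors, with every other variable held fixed."""
    return sum(SLOPES[name] * delta for name, delta in deltas.items())

# One more percentage point of minority residents.
print(predicted_change({"minority_percent": 1}))
# One more percentage point of residents under 18.
print(predicted_change({"percent_under_18": 1}))
```

These are associations in district-level data, so the function describes what the fitted line predicts rather than what would happen if a district's makeup were changed.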

Figure 2: percent_under_18 vs. dem_percent_2014

The coefficients and significances of the dummy variables are not interpreted in the same way as the other variables. The coefficients of the dummy variables may be thought of as changes to the intercept of the regression line. Returning to the y=mx+b form, one can think of the coefficient on ca as y=mx+(b+ca), where b is the constant, mx is representative of all three non-dummy variables, and ca is the coefficient on the ca variable. This is only the case if the district being looked at is a California district, though the process is identical for all states. For example, if the district being looked at is from South Carolina, the basic y=mx+b form is y=mx+(b+sc), with the exact same interpretation as above and sc substituted in for ca. Not all of the states we looked at are included as dummy variables, because not all of them are statistically significant: some have high p-values and decrease the adjusted R-squared. I am using New Jersey as my reference group in this regression, so if a dummy variable is not statistically significant, it simply means that that state’s voting is not significantly different from New Jersey’s once the other variables are accounted for. The only variable left out for this reason was Arizona; New Jersey is left out because it is the reference group, the group everything else is compared to.
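The intercept-shift reading of the dummy variables can be sketched as follows. The constant and the dummy coefficients below are hypothetical placeholders chosen for illustration; they are not the fitted values from this regression.

```python
def intercept_for(state, cons, dummy_coefs):
    """Effective intercept b + (state coefficient); states without a dummy,
    such as the reference group, keep the plain constant."""
    return cons + dummy_coefs.get(state, 0.0)

# Hypothetical numbers, NOT the estimates from this regression.
cons = 40.0
dummy_coefs = {"ca": 5.0, "sc": -8.0}

print(intercept_for("ca", cons, dummy_coefs))  # b + ca
print(intercept_for("sc", cons, dummy_coefs))  # b + sc
print(intercept_for("nj", cons, dummy_coefs))  # reference group: b alone
```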

The simple R-squared for this entire regression is .4746, meaning that around 47% of the variance in dem_percent_2014 is explained by the regression model created. I will not discuss heteroskedasticity or potential error arising from model misspecification in detail here, but the Breusch-Pagan test and the White test gave no evidence of heteroskedasticity, and through RESET and general experimentation I did not find that any linear or non-linear combination of the variables, whether included or not, was significant.
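For readers who want to connect the simple R-squared above to the adjusted R-squared used when choosing variables, the standard formula can be sketched as below. The number of regressors (8) is read off the model equation; the number of observations (118) is an assumption, namely that every 2014 House district in the eight states is in the sample.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Adjusted R-squared: penalizes R-squared for each regressor used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)

# R-squared is from the text; n_obs = 118 is an assumption (see above).
adj = adjusted_r_squared(0.4746, n_obs=118, n_regressors=8)
print(round(adj, 4))
```

Because the penalty grows with the number of regressors, adding a variable raises this figure only if it improves the fit by more than the penalty costs.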

Potential Sources of Error

Unfortunately, the dataset has no measure of wealth, only of income. This is disappointing, because I think that wealth can be quite independent of wages in some cases, and thus wealth might track a higher Republican vote more closely than our measures of wages did. But given the constraints of our dataset, we are unable to quantify the effect of this variable or to create a proxy for it. It may also be the case that California, given its size and many districts, is having an undue effect on the sample, since ~45% of the districts are from California. However, California is an enormous and extremely diverse area, and the inclusion of the California dummy should absorb any overall shift in the Democratic vote that is specific to California.
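The ~45% figure can be checked against the number of House districts each state had in the 2014 election. This sketch assumes every district in the eight states is in the sample, which is an assumption on my part.

```python
# U.S. House seats per state in the 2014 election (post-2010 apportionment).
districts_2014 = {
    "California": 53,
    "Colorado": 7,
    "Utah": 4,
    "South Carolina": 7,
    "Ohio": 16,
    "New Jersey": 12,
    "Arizona": 9,
    "Washington": 10,
}

total = sum(districts_2014.values())
california_share = districts_2014["California"] / total
print(total, round(100 * california_share, 1))
```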


Some bits of this regression didn’t fit in with my initial assumptions on the subject, but most did. Specifically, I didn’t think that a high youth percent would be indicative of a highly Republican district, but that indeed was the case to an extreme degree.

However, most of the common assumptions as to who the Democratic base is proved to be correct in this study, as the base seems to be individuals of minority descent, individuals who live in cities, and individuals who have healthcare in some form.

The model explains just under 50% of the variance in the dem_percent_2014 figure, so it is likely that there is still something large missing from this regression, but it may be something difficult or impossible to know, like the party of an individual’s parents and family. Not only this, but to figure out exactly how this affects party allegiance, you would also need to know how the individual feels about their parents and family, as that would affect party allegiance to a large degree. However, absent that variable, I stand behind the created regression as a fairly explanatory foray into finding the demographic hallmarks of the Democratic voter base.

Please feel free to contact me for further info on the specifics of this model, the data, or the methodology.

Research and analysis contributed by Eva Sachar and Kyle Scott


  1. McCormick, M., Scruton, P., & Rogers, S. (2012). US election 2012 results visualised. Retrieved March 31, 2016, from
  2. Population estimates, July 1, 2015, (V2015). (n.d.). Retrieved April 3, 2016, from
Merritt Smith
Tufts Class of 2018, Data Science and Public Policy Major.

