February 23, 2015 Engineering

Kibana 4 & Civic Hacking: Investigating Campaign Contributions

By Peter Kim

It's a great time to be a civic hacker. We're seeing increased transparency from local and national governments through more data sets released to the public every day around subjects such as traffic accidents, adverse drug reactions, financial aid applications for higher education loans, restaurant inspections and even public restroom locations. Now, anyone can access this data, analyze it and build apps that promote the common good. Yay for civic hacking!

The United States Federal Election Commission publishes campaign contributions data to its website (www.fec.gov), covering elections for President, Senate and House of Representatives. As stated on fec.gov:

“In 1975, Congress created the Federal Election Commission (FEC) to administer and enforce the Federal Election Campaign Act (FECA) – the statute that governs the financing of federal elections. The duties of the FEC, which is an independent regulatory agency, are to disclose campaign finance information, to enforce the provisions of the law such as the limits and prohibitions on contributions, and to oversee the public funding of Presidential elections.”

Providing this information to the public is critical to ensuring the integrity of the election process.

So now that the FEC has provided us with the raw data, what can we do with it? If you don't consider yourself a Data Scientist who knows how to analyze data in R or create pretty D3.js-based visualizations like the New York Times, you probably feel stuck at this point. Fortunately, the ELK stack makes it possible to perform rich, visual, interactive data analysis with very little coding. I'll describe data loading steps in a separate post but for now, I'll provide a glimpse at some of the data visualizations possible with Kibana 4.

Discover

In Kibana 4, you typically start on the Discover tab. This is where you get a high-level view of the data set, immediately seeing the distribution of data over time, a list of the fields that structure your documents, and the contents of some documents in your index.

In the screenshot above, we're looking at almost 2.1 million records representing campaign contributions from individuals to political committees during the 2013-2014 election cycle. We can see a clear trend in the number of contributions increasing over the election cycle, with a few interesting spikes at seemingly random points.

The column on the left side lists all of the fields contained in the data set. This is extremely helpful in giving us context for the questions we might want to ask of the data. For example, since we now know our data set potentially contains fields such as “name", “city", “state", “transactionAmount" and “transactionDate", we can start building a list of questions we'll want to ask of the data:

  • Which states had the largest number of contribution transactions?
  • Which states had the largest total sum value of contribution dollars?
  • How did the total sum value of contributions from individuals in Iowa change over time?
  • What were the top 3 cities in each of the top 10 states in terms of campaign contributions made?
  • Did my favorite actress (e.g. Gwyneth Paltrow) make any contributions to any candidates?
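Several of these questions map fairly directly onto Elasticsearch aggregations, which is what Kibana sends on your behalf. As a rough sketch rather than the exact requests Kibana generates, the two bodies below tackle the Iowa question and the top-cities question. They assume the field names mentioned above, a time field called `@timestamp`, states stored as two-letter codes such as "IA", the `usfec_indiv_contrib` index described in Appendix A, and the `interval` syntax of the Elasticsearch releases current at the time of writing. The variable and function names are made up for illustration.

```shell
# Sketch only: field names, the "IA" state code and the helper name are assumptions.

# Monthly sum of contribution amounts from Iowa.
IOWA_BY_MONTH='{
  "size": 0,
  "query": { "match": { "state": "IA" } },
  "aggs": {
    "per_month": {
      "date_histogram": { "field": "@timestamp", "interval": "month" },
      "aggs": {
        "total_amount": { "sum": { "field": "transactionAmount" } }
      }
    }
  }
}'

# Top 3 cities inside each of the top 10 states, ranked by number of contributions.
TOP_CITIES_BY_STATE='{
  "size": 0,
  "aggs": {
    "top_states": {
      "terms": { "field": "state", "size": 10 },
      "aggs": {
        "top_cities": { "terms": { "field": "city", "size": 3 } }
      }
    }
  }
}'

# Hypothetical helper: send a request body to the individual contributions index.
# The Content-Type header is required by newer Elasticsearch releases.
usfec_search() {
  curl -s -XGET 'localhost:9200/usfec_indiv_contrib/_search?pretty' \
    -H 'Content-Type: application/json' -d "$1"
}

# Usage, once Elasticsearch is running with the data loaded:
#   usfec_search "$IOWA_BY_MONTH"
#   usfec_search "$TOP_CITIES_BY_STATE"
```

Setting "size" to 0 asks Elasticsearch to skip the matching documents and return only the aggregation buckets, which is all a chart needs.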

The list of fields can also be helpful in identifying holes in the data set which may prevent you from getting answers to the questions you want to ask. For example, the file representing these individual contributions does not contain clear information about the committee receiving contributions or the candidate associated with the committee (technically, individual contributions go to a committee associated with a candidate). The raw data file simply contains the cryptic IDs of the committees and associated candidates.

This makes it difficult for me to ask a question like “What are the names of the top 10 committees receiving contributions?" Identifying these gaps using the Discover interface can lead us to decide we need to load additional data to make this application more useful.

Visualize

Once we've identified some questions we might want to ask, we can start building visualizations based on some of the attributes in the data set. Let's take one of the questions we brainstormed above as an example.

Here's a pie chart representing the top 10 states from which individual campaign contributions are made, measured by the sum value of contributions:

Not a lot of surprises here: the pie chart shows California, New York, Texas, Florida and Illinois (the five most populous states in the US) among the top sources of contributions. DC's position at the #3 spot is interesting and worth investigating. Even though DC would be the third least populous state if it were a state, perhaps its role as the seat of the federal government naturally draws residents who are more likely to be politically engaged.

The pie chart was easy to create:

  1. Select the Aggregation to use to determine the “Slice Size": Count, Sum or Unique Count. If you select Sum or Unique Count, Kibana needs to know what field's values to perform the Sum or Unique Count on.
  2. Select Split Slices to break up the pie into slices.
  3. Select the criteria to inform Kibana on how you want those slices to be drawn:
    1. Aggregation: Select “Terms" since we want to create slices based on some field's values (aka “terms" in Elasticsearch-speak).
    2. Field: Select the field whose values we want to represent the slices. In this case, we want to see the distribution of contributions by state, so we select “state".
    3. Order/Size: Select “Top" for the order and “10" for the size in order to create a pie chart based on the Top 10 values.
    4. Order by: Typically you'll want to order the slices/buckets based on the same function used to determine the size of the slices/buckets that we selected in Step 1 but some advanced use cases may warrant a different selection here.
  4. Click Apply and presto! Now you have a pretty pie chart.
  5. Click the Save icon at the upper right and name it appropriately so you can add it to a Dashboard.
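Behind the form, these settings boil down to a terms aggregation with a sum metric. The body below is a hand-written approximation of that request, not a copy of what Kibana actually sends. It assumes the "state" and "transactionAmount" fields and the `usfec_indiv_contrib` index from Appendix A; the aggregation, variable and function names are illustrative.

```shell
# Sketch only: an approximation of the query behind the pie chart.
#   Step 1 (Slice Size = Sum)      -> the "sum" sub-aggregation
#   Step 3.1 and 3.2 (Terms/state) -> "terms" on the "state" field
#   Step 3.3 (Top 10)              -> "size": 10 with descending order
#   Step 3.4 (Order by)            -> "order" pointing at the sum metric
TOP_STATES_BY_SUM='{
  "size": 0,
  "aggs": {
    "by_state": {
      "terms": {
        "field": "state",
        "size": 10,
        "order": { "total_amount": "desc" }
      },
      "aggs": {
        "total_amount": { "sum": { "field": "transactionAmount" } }
      }
    }
  }
}'

# Hypothetical helper; nothing is sent until you call it.
top_states_by_sum() {
  curl -s -XGET 'localhost:9200/usfec_indiv_contrib/_search?pretty' \
    -H 'Content-Type: application/json' -d "$TOP_STATES_BY_SUM"
}
```

Calling `top_states_by_sum` against a restored index should return one bucket per state, each carrying a document count and the summed amount that determines the slice size.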

If you've got some experience with data visualization, you might have been thinking “This guy is a total hack. A pie chart is the wrong visualization to use for this type of data representation." And you're right (well, hopefully not about the “total hack" part). There's some distortion of the data here because using a pie chart gives the viewer the misperception that all of the slices add up to 100% of the total data, which in this case, makes it look like contributions from California comprise a quarter of all campaign contributions.

You can change the “size" parameter to '51' so that the sum of the slices adds up to the actual total, but as you can see here, this makes the visualization less pretty:

An alternative is to use a different visualization, perhaps a Vertical Bar Chart.

The input parameters to creating the Vertical Bar Chart probably look familiar to you. They're exactly the same as the parameters used to create the Pie Chart because the underlying query used to drive the visualization is exactly the same. We're just visualizing it differently here, hopefully with a lower probability of misinterpretation.

Dashboard

Creating visualizations is fun but at some point, you'll want to package these together onto a nice Dashboard where you can perform some aggregate analysis, gain interesting insights across multiple, disparate fields of data, and share these findings with others.

The actual process of adding visualizations to dashboards is straightforward. Once you've created a number of visualizations you want to place on a dashboard together, you can click the Add Visualization icon in the upper right corner of the Dashboard tab and start adding visualizations!

Pro tip: Before you start going nuts creating visualizations and dashboards, it would be worthwhile to identify a naming convention to use when saving those elements. For example, prefixing your saved objects with the name of the Elasticsearch index and/or type is one idea.

At some point, you might have a dashboard that looks something like this:

Explore

Let's walk through two potential data discovery scenarios: one focused on a particular Super PAC and another looking at campaign contributions from your home town.

Who's behind these PACs?

Political Action Committees, or PACs as they're commonly called, aren't a new thing. The first PAC was formed by the Congress of Industrial Organizations in 1943, after Congress barred labor unions from contributing directly to federal candidates, and the Taft-Hartley Act of 1947 went on to prohibit labor unions or corporations from spending money to influence federal elections.

Super PACs were made possible by two court decisions in 2010, Citizens United v. FEC and SpeechNow.org v. FEC, which established that PACs that did not make contributions to candidates, parties, or other PACs could accept unlimited contributions from individuals, unions, and corporations (both for profit and not-for-profit) for the purpose of making independent expenditures. [http://en.wikipedia.org/wiki/Political_action_committee]

Super PACs have been the source of much controversy and debate because prior to the existence of Super PACs, there were clear restrictions on the amount of money that could be contributed towards elections.

In this screenshot above, we see a high-level view of the contributions, and in particular, the top committees receiving contributions, committee types (e.g. Super PAC, PAC, Party, etc.) and interest group categories (e.g. Corporation, Labor Organization, etc.). I can probably guess what a lot of these committees represent but some of these are less obvious — e.g. “ACTBLUE" and “NEXTGEN CLIMATE ACTION COMMITTEE". Over $77 million contributed to a single vaguely-named committee is worth taking a closer look at.

You can filter the data set simply by clicking on that element in the data table:

After clicking “NEXTGEN CLIMATE ACTION COMMITTEE", Kibana refreshes all of the other charts and tables to only show the relevant data for the contributions to this committee. We immediately discover some interesting insights:

The vast majority of the contributions to “NEXTGEN CLIMATE ACTION COMMITTEE" are by people:

  • whose self-declared “occupation” is “FOUNDER”
  • whose employer is Fahr, LLC
  • who live in San Francisco

When you drill down further by clicking on “FAHR, LLC", it's obvious all of these contributions are from the same person:

Prior to drilling down by employer, we noticed there were only 56 contribution transactions for “NEXTGEN CLIMATE ACTION COMMITTEE”. With just a few clicks, we've discovered that this Super PAC is funded predominantly by one person, plus a few others who we presume are friends, associates, or otherwise have more than a superficial relationship with that person.
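Clicking a table row in Kibana amounts to adding a filter to every query on the dashboard. A hand-written sketch of the same drill-down is shown below. The field names `committeeName`, `occupation` and `employer` are hypothetical (the post does not list the exact names for these fields), and the body assumes those fields are indexed so that whole values, not individual words, come back as terms.

```shell
# Sketch only: "committeeName", "occupation" and "employer" are hypothetical field names.
# Restrict the data to one committee, then list its most common
# contributor occupations, employers and cities.
NEXTGEN_BREAKDOWN='{
  "size": 0,
  "query": {
    "match_phrase": { "committeeName": "NEXTGEN CLIMATE ACTION COMMITTEE" }
  },
  "aggs": {
    "top_occupations": { "terms": { "field": "occupation", "size": 5 } },
    "top_employers":   { "terms": { "field": "employer",   "size": 5 } },
    "top_cities":      { "terms": { "field": "city",       "size": 5 } }
  }
}'

# Hypothetical helper; nothing is sent until you call it.
nextgen_breakdown() {
  curl -s -XGET 'localhost:9200/usfec_indiv_contrib/_search?pretty' \
    -H 'Content-Type: application/json' -d "$NEXTGEN_BREAKDOWN"
}
```

The total hit count in the response plays the role of the transaction count shown on the dashboard, and each aggregation corresponds to one of the data tables.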

The contributor base for the other large PAC, “ACTBLUE", is quite different.

There are far more transactions making up the contributions to this PAC (154,448 vs 56 for NextGen) and the contribution sources are far more geographically distributed:

One of the more interesting analytics functions provided by Elasticsearch is the significant terms aggregation. You can use significant terms for use cases such as fraud detection, anomaly detection, recommendations, and more. There's a great introduction to it on the Elasticsearch blog: Significant Terms Aggregation.

With the campaign contribution data set, one use of significant terms is to identify characteristics that are statistically unusual for the results of a particular query. For example, it's very common to see contributors with an Occupation of “Attorney”, “Retired” or “Lawyer” across many of the PACs. As a result, a list of the top Occupations for any particular PAC may not reveal much about the types of people who contribute to that PAC. Using the significant terms aggregation, as done in the table on the far right, exposes the Occupations that are uniquely common for ActBlue, such as “Professor”, “Self” and “Writer”:

We can filter the data by another PAC, the Democratic National Committee, and discover the occupations that are uniquely common for that PAC:
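To make the contrast concrete, the sketch below asks for both the most common and the most distinctive occupations of one committee's contributors in a single request. As before, `committeeName` and `occupation` are hypothetical field names, and the variable and function names are illustrative.

```shell
# Sketch only: "committeeName" and "occupation" are hypothetical field names.
# "terms" ranks occupations by raw frequency within the filtered set;
# "significant_terms" ranks them by how much more frequent they are in the
# filtered set than in the index as a whole.
ACTBLUE_OCCUPATIONS='{
  "size": 0,
  "query": { "match_phrase": { "committeeName": "ACTBLUE" } },
  "aggs": {
    "most_common": {
      "terms": { "field": "occupation", "size": 10 }
    },
    "most_distinctive": {
      "significant_terms": { "field": "occupation", "size": 10 }
    }
  }
}'

# Hypothetical helper; nothing is sent until you call it.
actblue_occupations() {
  curl -s -XGET 'localhost:9200/usfec_indiv_contrib/_search?pretty' \
    -H 'Content-Type: application/json' -d "$ACTBLUE_OCCUPATIONS"
}
```

Swapping the committee name in the query is all it takes to repeat the comparison for another PAC.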

Even though we started this exploration process not knowing anything about these PACs, it's triggered many more questions that I want to follow up on:

  • Who is Thomas Steyer and what's the relationship between him and the ~40-50 other contributors to his Super PAC?
  • Which candidates' campaigns are NextGen Climate and ActBlue supporting?
  • Is there any correlation in the timing of the spend from these two organizations?
  • Is there something intentional or unintentional about a particular PAC's philosophy or marketing that appeals to people employed within a particular industry?

The beautiful thing about this process is that in addition to helping provide some of the answers to these questions, using the ELK stack helps formulate questions that you didn't even know you wanted to ask!

Who are people in my home town giving money to?

Warning: Depending on the size of your hometown, your findings here may lead to awkward interactions with your neighbors. 

All contributions over $200 are required by law to be disclosed to the public so while it might be awkward seeing your neighbor's information here, data about campaign contributions is public information and the public has a legal right to know.

You can quickly drill down the data set to state and city and get some insightful information about who in your town is making contributions and to whom.

With only 449 transactions from Hoboken, New Jersey, it wouldn't be that time consuming to sift through each record one by one. However, if you had to analyze 70,850 contribution records from New York City, there's a clear benefit from being able to do so with the interactive user experience that the ELK stack provides:

Back to my hometown of Hoboken, New Jersey: with just a few clicks, you can start building a list of the top contributors to the campaigns for the local House and Senate seats. I've always been curious why people would contribute money to non-competitive races, such as those won by Cory Booker (with 56% of the vote) and Albio Sires (with 77.3% of the vote). Maybe it's just a matter of wanting to support a friend, but a cynic might keep an eye out for any attempts to return the favor.
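The same kind of hometown drill-down can be approximated outside Kibana. The sketch below filters by state and city and ranks contributors by the total amount given. It assumes the "state", "city", "name" and "transactionAmount" fields, that states are stored as two-letter codes and city names in upper case, and that "name" is indexed as a whole value; any of these may differ in the actual index.

```shell
# Sketch only: the stored values ("NJ", "HOBOKEN") and field mappings are assumptions.
HOBOKEN_TOP_CONTRIBUTORS='{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "state": "NJ" } },
        { "match_phrase": { "city": "HOBOKEN" } }
      ]
    }
  },
  "aggs": {
    "top_contributors": {
      "terms": {
        "field": "name",
        "size": 10,
        "order": { "total_amount": "desc" }
      },
      "aggs": {
        "total_amount": { "sum": { "field": "transactionAmount" } }
      }
    }
  }
}'

# Hypothetical helper; nothing is sent until you call it.
hoboken_top_contributors() {
  curl -s -XGET 'localhost:9200/usfec_indiv_contrib/_search?pretty' \
    -H 'Content-Type: application/json' -d "$HOBOKEN_TOP_CONTRIBUTORS"
}
```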

Conclusion

We've just scratched the surface of what's possible when using the ELK stack to explore the FEC campaign contributions data set. Hopefully this exercise has also expanded your vision for the ELK stack and shown how you can apply these same data discovery principles to any type of data, whether that's highly structured data (e.g. transaction data), unstructured data (e.g. plain-text documents), or a combination of the two.

Individuals, non-profits, government organizations and private companies from startups to large corporations are using the ELK stack to get real-time insights on data sets varying in size from a few megabytes to a few petabytes, and with the release of Kibana 4, this becomes even easier and more powerful.

Appendix A. How to get ELK with this data set running on your laptop

If you don't already have the ELK stack with the latest versions of each piece of the stack, you can download it here and follow the installation instructions on that page.

You don't actually need Logstash to get this up and running but if you wanted to tweak the Logstash configs and re-load the raw data yourself, it'd definitely be worthwhile installing it.

Restoring the Elasticsearch index snapshot

After downloading and installing the ELK stack, you'll need to download the index snapshot file for the campaign contributions data which can be obtained here (FYI it's a 1.4GB file; we take no responsibility for this download eating up your monthly mobile tethering quota):

http://download.elasticsearch.org/demos/usfec/snapshot_demo_usfec.tar.gz

Create a folder somewhere on your local drive called “snapshots" and uncompress the .tar.gz file into that directory. For example:

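# Create a working directory for the snapshot
mkdir -p ~/elk/snapshots
# Copy the downloaded archive into it
cp ~/Downloads/snapshot_demo_usfec.tar.gz ~/elk/snapshots
cd ~/elk/snapshots
# Unpack the archive (z: gunzip, f: read from the named file)
tar xzf snapshot_demo_usfec.tar.gz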
        

Once you have Elasticsearch running, restoring the index is a two-step process:

1) Register a file system repository for the snapshot (change the value of the “location" parameter below to the location of your usfec snapshot directory):

curl -XPUT 'http://localhost:9200/_snapshot/usfec' -d '{
    "type": "fs",
    "settings": {
        "location": "/tmp/snapshots/usfec",
        "compress": true,
        "max_snapshot_bytes_per_sec": "1000mb",
        "max_restore_bytes_per_sec": "1000mb"
    }
}'
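Before starting the restore, it can be worth checking that Elasticsearch actually sees the snapshot in the repository you just registered. A minimal sketch, assuming the repository name "usfec" used above and a local node on port 9200 (the variable and function names are illustrative):

```shell
# Sketch only: lists every snapshot visible in the "usfec" repository.
SNAPSHOT_LIST_URL='localhost:9200/_snapshot/usfec/_all?pretty'

# Hypothetical helper; nothing is sent until you call it.
list_usfec_snapshots() {
  curl -s -XGET "$SNAPSHOT_LIST_URL"
}
```

If the "location" setting points at the right directory, the response should list a snapshot whose name matches the one used in the restore call (here, "1"). An empty list usually suggests the path is wrong.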
        

2) Call the Restore API endpoint to start restoring the index data into your Elasticsearch instance:

curl -XPOST "localhost:9200/_snapshot/usfec/1/_restore"
        

At this point, go make yourself a coffee. When your delicious cup of single-origin, direct trade coffee has finished brewing, you can check to see if the restore operation is complete by calling the cat recovery API:

curl -XGET 'localhost:9200/_cat/recovery?v'
        

Or get a count of the documents in the expected indexes:

curl -XGET 'localhost:9200/usfec*/_count' -d '{
        "query": {
                "match_all": {}
        }
}'
        

which should return a count of approximately 4250251.

Pointing Kibana 4 to an Elasticsearch index

The first time you go to Kibana at localhost:5601, it'll ask you to define an “index pattern":

Since the Elasticsearch cluster may contain numerous indexes, you need to tell Kibana which index contains the data you want to build visualizations and dashboards against. In this case, the campaign contribution snapshot contained four indexes, so when you ran your index restore operation, it should have created four new indexes in your Elasticsearch instance:

  • usfec_indiv_contrib: contributions from individuals to committees
  • usfec_comm2cand_contrib: contributions from committees to candidates
  • usfec_comm2comm_contrib: contributions from committees to other committees
  • usfec_oppexp: committee operating expenditures
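One quick way to confirm that all four indexes arrived is the cat indices API, sketched below for a local node on port 9200. The variable and function names are illustrative.

```shell
# Sketch only: lists the indexes whose names start with "usfec".
# The URL is kept in a quoted variable so the shell does not try to expand the "*".
CAT_INDICES_URL='localhost:9200/_cat/indices/usfec*?v'

# Hypothetical helper; nothing is sent until you call it.
list_usfec_indexes() {
  curl -s -XGET "$CAT_INDICES_URL"
}
```

The output should contain one row per index, including its health, document count and size on disk.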

You can type one of these index names into the input field, select a time field (in our indexes, it will be '@timestamp'), then click Create:

In the examples in this blog post, we've exclusively looked at the individual contributions data but there's certainly a wealth of data to explore in the other three indexes. You could even set up an index pattern in Kibana to point to all four indexes and try to correlate data between data sets!

Go to the Discover tab, select a more suitable time frame (set an Absolute “From" date of about 2012-12-18), and go exploring!
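If you are unsure which time frame suits an index, one option is to ask Elasticsearch for the earliest and latest timestamps it holds. The sketch below assumes the '@timestamp' time field and the `usfec_indiv_contrib` index; the aggregation, variable and function names are illustrative.

```shell
# Sketch only: find the time span covered by the individual contributions index.
TIME_RANGE='{
  "size": 0,
  "aggs": {
    "earliest": { "min": { "field": "@timestamp" } },
    "latest":   { "max": { "field": "@timestamp" } }
  }
}'

# Hypothetical helper; nothing is sent until you call it.
usfec_time_range() {
  curl -s -XGET 'localhost:9200/usfec_indiv_contrib/_search?pretty' \
    -H 'Content-Type: application/json' -d "$TIME_RANGE"
}
```

The two values come back as numeric timestamps, which you can convert to dates and use as the Absolute "From" and "To" values in Kibana's time picker.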

Appendix B. Supporting Links

Raw data and data dictionary files from fec.gov
http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2013_2014

OpenSecrets.org resource center
Numerous resources for analyzing campaign finance data. Thanks for providing a more detailed data dictionary for the FEC data!
https://www.opensecrets.org/resources/create/

Github repo with supporting files
Logstash config, index template, Python script for denormalizing data and creating JSON, etc.
https://github.com/elasticsearch/demo/tree/master/usfec