For an introduction to the Elasticsearch #BYODemo environment, please check out our announcement blog post.
“…today in New York, approximately 4,000 New Yorkers are seriously injured and more than 250 are killed each year in traffic crashes. Being struck by a vehicle is the leading cause of injury-related death for children under 14, and the second leading cause for seniors. On average, vehicles seriously injure or kill a New Yorker every two hours.
This status quo is unacceptable. The City of New York must no longer regard traffic crashes as mere ‘accidents,’ but rather as preventable incidents that can be systematically addressed. No level of fatality on city streets is inevitable or acceptable. This Vision Zero Action Plan is the City’s foundation for ending traffic deaths and injuries on our streets.
The City will use every tool at its disposal to improve the safety of our streets.”
From NYC.gov’s Vision Zero site
Before making major public policy decisions that may cost significant taxpayer money based on anecdotal information, are there tools available to analyze the data, identify patterns and trends, and then drill down into the details of motor vehicle collisions? Can we get answers to questions such as:
- Are there noticeable trends in the number and types of accidents occurring over time?
- Are there certain areas that experience more accidents than others? Are there particular types of accidents more likely to occur in these areas?
- Are there intersections within my borough that experience more accidents involving pedestrians than others? What are the leading causes of accidents in those areas?
In less than an hour, we can set up Logstash, Elasticsearch, and Kibana (ELK) to start asking these questions. With our search capabilities, powerful APIs, and visualizations, it’s possible to calculate totals, counts, averages and have a very interactive experience with your data. And if you don’t know what to ask, you can use aggregations to discover aspects and anomalies of the data.
After following these instructions, you should see the following dashboard in your browser.
By clicking on the rows LIVES LOST and PEOPLE INJURED, several stats panels will be opened, displaying statistics for deaths and injuries at a glance.
The row REASONS FOR LOSS OF LIFE accommodates a hits panel displaying the varying reasons lives are lost in combination with an automobile accident.
In the row REASONS FOR INJURIES you will find the same for injuries.
We have set up a number of different panels to show the many different ways you can visualize and interact with the data in Kibana. We encourage you to expand each row and learn more about the data and how to visualize it in Kibana.
Discovering the uncommonly common
Clicking the row ALL ACCIDENTS OVER TIME displays a histogram of all accident types over time.
There are several spikes indicating some unusual behavior in our data. But which events and accident types are responsible for these? Can we find the reason among our all-time top 5 accident types? Let us open the row TOP 5 ACCIDENTS OVER THE TIME and see if we can answer this question.
By looking at the top 5 accident types (Driver Inattention/Distraction, Failure to Yield Right-of-Way, Fatigued/Drowsy, Backing Unsafely, Lost Consciousness) in the histogram TOP 5 EVENTS OVER THE TIME, it becomes clear that these events are not necessarily the major factors behind most of the spikes. How can we find out what the top contributing factors for these time periods are?
Let us zoom into the time around March 1, 2013 on the histogram ALL ACCIDENTS OVER TIME (just span a rectangle with the mouse around the spike).
Now let us see the top accident types for the selected time range. To do so, we can refer to the terms panel TOP ACCIDENT TYPES in the row ALL ACCIDENT TYPES AND THEIR DISTRIBUTIONS.
The order has changed. Suddenly, a new accident type appears among the top 5: Slippery. Is this the accident type responsible for the spike in accidents reported? If so, we would see some correlation of spikes between all events and this particular accident type. To confirm this, we are going to modify the histogram ALL EVENTS OVER TIME. Clicking on the cog symbol in the top right corner opens a settings dialog.
Please open the tab called Queries and activate Slippery by clicking on it. Now click on the Save button. The histogram should look like this.
The correlation between all events and accidents of the type Slippery on February 9, 2013 looks like this.
Let’s have a look at this eye-catching behavior. After selecting the time frame around this spike and removing the all events query, we get something like this.
Since we are looking at the query for Slippery, we assume that this unusual event is due to a change in weather conditions.
By doing some research, we discover that on February 9, 2013, the storm unofficially called “Winter Storm Nemo” or the “Blizzard of 2013” was very active over NYC and other parts of the East Coast. It was snowing and freezing cold, and the streets got slippery. No wonder there was a spike in accidents caused by slippery pavement!
Image of the nor’easter in February 2013 (source: Wikipedia)
By the way:
- We could improve our demo by using some weather data directly in Kibana. We leave it to you as an exercise.
- Elasticsearch version 1.0 introduced aggregations as a new way to explore data in multiple dimensions. To get started with aggregations using this dataset, use the sample query by clicking on the link. Please make sure that Marvel is up and running. The sample query will be loaded into the Marvel Sense console.
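As a rough illustration of what an aggregation request can look like, the sketch below builds a search body that asks for the top accident types inside a date range using a terms aggregation. It is not the sample query linked above. The field names `contributing_factor` and `@timestamp` are assumptions and should be replaced with the names used in your own index mapping.

```python
import json


def top_accident_types_body(start, end, factor_field="contributing_factor",
                            time_field="@timestamp", size=5):
    """Build a search body: top `size` values of `factor_field` between start and end.

    factor_field and time_field are hypothetical names; check your mapping.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"range": {time_field: {"gte": start, "lte": end}}},
        "aggs": {
            "top_accident_types": {
                "terms": {"field": factor_field, "size": size}
            }
        },
    }


if __name__ == "__main__":
    # Date range around the February 9, 2013 spike discussed above.
    print(json.dumps(top_accident_types_body("2013-02-08", "2013-02-10"), indent=2))
```

The printed JSON can be pasted into the Sense console or sent to the `_search` endpoint of the index that holds the accident data.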
Are there places that are more dangerous than others?
Let’s see if there are any areas of NYC that are more likely to have slippery streets (on January 21st, 2014, another date with a spike in activity). The pie chart DISTRIBUTION OF ACCIDENTS OVER BOROUGHS suggests that most accidents happened in Queens, followed by Manhattan, Brooklyn, the Bronx, and Staten Island.
Let us have a closer look at Queens by clicking on the corresponding part of the pie chart and analyze the distribution of the accidents on the map LOCATION OF ACCIDENTS. The map suggests that all events are almost evenly distributed across Queens and the limited slice of the data we’re looking at prevents us from making any clear conclusions on specific danger areas.
We can broaden our analysis by expanding the search and removing all filters to look at the entire data set for Queens. Now, some locations in Queens indicate multiple accidents. In the example below, we can observe seven accidents at the junction of Queens Blvd. and Skillman Avenue. With this information, the city may want to consider implementing safety measures at that particular location.
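The same idea, counting how often each intersection appears, can also be sketched outside Kibana. The snippet below is a minimal illustration, assuming each accident record is a dictionary with hypothetical `on_street` and `cross_street` keys. The records in the usage example are placeholders, not values from the dataset.

```python
from collections import Counter


def repeated_locations(accidents, min_count=2):
    """Count accidents per (on_street, cross_street) pair and keep the repeats.

    Records missing either street name are skipped.
    The key names are hypothetical; adapt them to your own documents.
    """
    counts = Counter(
        (a["on_street"], a["cross_street"])
        for a in accidents
        if a.get("on_street") and a.get("cross_street")
    )
    # most_common() yields locations ordered from most to fewest accidents.
    return [(loc, n) for loc, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Placeholder records for illustration only.
    sample = [
        {"on_street": "A ST", "cross_street": "1 AVE"},
        {"on_street": "A ST", "cross_street": "1 AVE"},
        {"on_street": "B ST", "cross_street": "2 AVE"},
    ]
    print(repeated_locations(sample))
```

Raising `min_count` narrows the result to the intersections that stand out most.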
Understanding where and why fatal accidents occur
It is in the interest of every modern city to make roads safer. One way to achieve this goal is to understand why and where accidents happen.
Let us have a look at the overall statistics for deaths across all accident types. Clicking on the row PEOPLE KILLED displays stats panels showing the number of people killed by category.
Let us now analyze the geographic distribution of accidents with a deadly outcome. To do so, we will modify the query All accidents. Please unpin the query.
After unpinning, an input field will be displayed.
In the input field, an arbitrary query can be entered and issued. To restrict the dataset to the events with a deadly outcome, please enter number_of_pedestrians_killed:[1 TO *] OR number_of_persons_killed:[1 TO *] OR number_of_motorists_killed:[1 TO *] OR number_of_cyclists_killed:[1 TO *].
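Typing this query by hand is error-prone, so here is a small sketch that assembles the same query string from the four field names used above. Wrapping it in a `query_string` search body is one possible way to send it to Elasticsearch directly rather than through Kibana; that part is an addition for illustration and not a step of the demo itself.

```python
import json

# Field names taken from the query in the text above.
KILLED_FIELDS = [
    "number_of_pedestrians_killed",
    "number_of_persons_killed",
    "number_of_motorists_killed",
    "number_of_cyclists_killed",
]


def fatal_accident_query(fields=KILLED_FIELDS):
    """Return a Lucene query string matching documents where any field is >= 1."""
    return " OR ".join(f"{field}:[1 TO *]" for field in fields)


def search_body(query):
    """Wrap a Lucene query string in a query_string search body."""
    return {"query": {"query_string": {"query": query}}}


if __name__ == "__main__":
    print(fatal_accident_query())
    print(json.dumps(search_body(fatal_accident_query()), indent=2))
```

The first printed line is the text to paste into the Kibana input field. Swapping `killed` for a different suffix in the field list would give the analogous query for other outcomes, provided such fields exist in your index.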
Let us go back to the pie chart DISTRIBUTION OF ACCIDENTS OVER BOROUGHS. It should look like this, suggesting that most accidents with a deadly outcome happen in Brooklyn, followed by Queens, Manhattan, the Bronx, and Staten Island.
Let us check the distribution on a map and see if there are any “hot spots”.
It looks like the accidents are evenly distributed and there is no particular hot spot. Another observation you can make by zooming into the map is that many accidents happen at road junctions. Checking the contributing factors, and ignoring the numbers for unspecified events, it becomes evident that Traffic Control Disregarded is the top contributing factor, followed by Failure to Yield Right-of-Way and Driver Inattention/Distraction.
By the way, Failure to Yield Right-of-Way is among the all-time top three factors for all accidents. It seems that NYC needs to address this issue in particular.
It is said that a picture is worth a thousand words. We believe this is true. That is one of the reasons why we’re proud of what the ELK stack (Elasticsearch, Logstash, Kibana) can do.
This New York City traffic incidents demo is the first of a series of demos we hope to publish in the coming months to show how easy it is to perform meaningful analysis on data using Elasticsearch, Logstash and Kibana. In this post, we presented some ideas on how to get started experimenting with this data set, but we encourage you to explore further: brainstorm interesting questions you want to ask of the data and experiment with the various features in Kibana to start getting answers to those questions. If you find anything interesting or have feedback you’d like to share with us, we’d love to hear it either via email or on Twitter – just make sure to use the hashtag #BYODemos.