
Box: Deploying the Elastic Stack for observability — one microservice at a time

AT A GLANCE

  • 180
    billion documents in Elasticsearch
  • 190 TB
    total data size
  • 20 TB
    daily ingest rate

Reduced costs

Migrating from Splunk to Elastic has cut Box’s logging ingestion costs in half

Increased observability

No longer turning away proposed logging projects because of high ingest pricing

Empowering engineers

Engineers are embracing logging and expanding observability as Box scales


Company Overview

Storage provider Box has morphed into a brand-name Cloud Content Management platform, enabling businesses to increase employee and customer productivity, all while protecting customers' most valuable data: their files and mission-critical workflows.

In today's competitive Cloud Content Management landscape, it's essential that Box detect, identify, and resolve issues more quickly than ever. That's because Box is integrated with millions of its customers' business-critical workflows, and the company must deliver on strict SLAs for stability, performance, and compliance purposes.

The SLAs it promises to customers leave the company only a sliver of room for error. As a result, Box requires a clear window of observability into its infrastructure to support an extraordinary number of use cases from its millions of users, in addition to Box's more than 95,000 enterprise customers.

"Our customers and users entrust their business-critical workloads on top of content in Box. So if Box is not working, it's a strict, state loss of revenue for them," says Deepak Wadhwani, the engineering manager for the Observability team at Box.

Box's website touts customers like Allstate, AstraZeneca, Coca-Cola Company, Morgan Stanley, Oxfam, Olympus, Pandora, and a Who's Who of others.

Box’s Journey with Elastic

Unboxing the Elastic journey

In recent years, Box's engineering team grew concerned that its legacy backend for reporting was not scaling. And as Box's roadmap grew, the Observability team became interested in adopting a more reliable and cost-effective solution for application and operational logging than what Splunk provided.

These were mission-critical processes taking on heightened importance — especially as Box continued on its path of transforming a monolithic infrastructure into hundreds of microservices to enable it to grow, innovate, and deliver new customer-facing features.

Because Box's legacy logging solution was priced on the amount of data ingested, sometimes Box would have to cut logging projects to limit costs or Box engineers would decide not to log events from newly deployed microservices.

That was a fact of life at the time, and it was at odds with Box's mission to rapidly transform itself into a leading Cloud Content Management platform. This transformation required Box to break down its monolith into microservices, a direction that requires more comprehensive logging, not less.

"We would have to reduce logging data volumes to cut costs. That went against our mission of making Box systems more observable. We wanted to build a system that was more cost effective and to continue on a mission of more visibility. That's why we chose Elastic. It's built for developers to help developers."

Deepak Wadhwani, Engineering manager for the Observability team | Box

What's more, Box's use of MySQL for enterprise reporting had its own problems. Among them was the inability to render usage log data for large enterprises (such as when files were opened, moved to a different folder, or shared) because of the enormous number of events being generated.

So beginning in 2015, after surveying the competitive landscape for pricing, expandability, support, and security, Box went with Elastic.

By harnessing the Elastic Stack, Box has adopted a future-proof, scalable approach to observability with a pricing model rooted in the amount of memory under management instead of data ingest. This changeover to Elastic, while strengthening Box's reporting capabilities and dramatically reducing costs, also captures logs from within Box's microservices to help engineers understand performance and behavior. These logs are available at a moment's notice, too.

The reporting use case was the entry point for Elastic at Box. That success in replacing MySQL for reporting with Elastic seeded confidence and trust in Elastic. These were the major factors prompting Box to begin replacing its logging solution with Elastic, one new microservice at a time.

Even so, a new reporting and logging system must win buy-in from engineers or risk slow adoption or outright failure. With Box's choice of the Elastic Stack, engineers are now more motivated than ever to run reports at scale and to add logging to existing and new microservices, all of which positions Box to succeed even more.

"Our engineers are much happier now and queries complete almost immediately. Our satisfaction index is much higher."

Salman Ahmed, Engineering director, Data Platform and Observability SRE teams | Box

Box's journey started with a few terabytes of storage and a handful of developers on the Storage team building and adopting Box's new reporting features, with the help of on-site Elastic consultants and Elastic training sessions. Today, about 500 engineers across a wide swath of teams at Box are embracing the Elastic Stack for both reporting and logging, and visualizing terabytes of data daily on hundreds of Kibana dashboards.

Read all about "Newsroom"

Box's first move with Elastic was to migrate enterprise logs to Elasticsearch for reporting purposes and to serve as the backend for data analytics. This data pipeline would later evolve to support Box's new, Elastic-backed logging environment.

For the reporting project, the company took advantage of mandated security features such as role-based access control and user authentication. These and other features, such as being able to tell who accessed which Box file, when, and where, enabled Box to satisfy compliance obligations involving security and privacy.

The initial switch to this Elasticsearch reporting project, known internally as "Newsroom," resolved filtering problems and inconsistencies between enterprise logs and business analytics. Box no longer had trouble producing reports showing that it was meeting its SLAs, for example. Previously, reports for large enterprises sometimes didn't load at all, and reports for medium enterprises took too long to return.

Other fixes included:

  • Usage logs can be filtered efficiently by user or folder
  • Events related to a specific user or a specific folder in the enterprise can be retrieved
  • Statistics reports shown in the admin console no longer break
  • Inconsistencies between enterprise logs and business analytics are resolved
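
As a rough illustration of the user and folder filtering described above, the sketch below builds an Elasticsearch Query DSL request body using a `bool` query with `term` filters. The field names (`enterprise_id`, `user_id`, `folder_id`, `@timestamp`) and the helper `build_usage_log_query` are hypothetical; the source does not describe Box's actual schema or queries.

```python
import json
from typing import Optional


def build_usage_log_query(enterprise_id: str,
                          user_id: Optional[str] = None,
                          folder_id: Optional[str] = None,
                          size: int = 100) -> dict:
    """Build a search body that filters usage-log events (hypothetical fields)."""
    # Filter clauses do not affect relevance scoring, which suits
    # exact-match lookups such as "all events for this user".
    filters = [{"term": {"enterprise_id": enterprise_id}}]
    if user_id is not None:
        filters.append({"term": {"user_id": user_id}})
    if folder_id is not None:
        filters.append({"term": {"folder_id": folder_id}})
    return {
        "size": size,
        "query": {"bool": {"filter": filters}},
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest events first
    }


if __name__ == "__main__":
    # Print the request body that would be sent to a search endpoint.
    print(json.dumps(build_usage_log_query("ent-1", user_id="u-42"), indent=2))
```

The resulting dictionary could be passed as the body of a search request; adding a `folder_id` narrows the same query to a single folder.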

Hello "Arta"

Box's former logging solution was priced on the amount of data ingested, which at the time was about 7 terabytes a day. That has since grown to around 20 terabytes a day, and is expected to increase even more. To keep costs under control, Box had to pick and choose where to implement observability features on its ever-growing microservices platform, an untenable situation that switching to Elastic has resolved.

Cost wasn't the only concern. Under its previous logging platform, indexers would sometimes fail. Latency, too, was a problem and alerting was not reliable.

"We were thinking, what does all of this mean in 5 years as we scale. Our switch to Elastic has helped reduce our per-terabyte cost by half, make life better for our developers and provide them with observability capabilities for the microservices that they are building. We are no longer turning away logging projects because of cost."

Deepak Wadhwani, Engineering manager for the Observability team | Box

With that in mind, and as the scale of data from daily logging was exploding, the company re-engineered its logging infrastructure in 2017.

Box is continuing to move all operational and application logging securely into Elasticsearch, using a three-way approach that separates retention, ad hoc, and interactive logging, via a pipeline internally codenamed Almost-Real-Time-Analytics, or "Arta."

Mission-critical business functions

Box says producing and maintaining enterprise-grade systems that perform at their guaranteed SLAs is critical. So Box engineers must have visibility into their systems and the logs they are generating. Ahmed says Elastic has provided the Observability team at Box with a logging system offering sophisticated querying capabilities at minimal indexing overhead, all at a reasonable cost as Box scales.

"The Elastic Stack is critical to us. Every day millions of users and customers worldwide trust Box to execute mission-critical business functions. Elasticsearch has enabled the Observability team at Box to work with a reliable and cost effective logging system."

Salman Ahmed, Engineering director, Data Platform and Observability SRE teams | Box

With this Elastic-driven observability, in addition to the hundreds of Kibana dashboards to visualize this data, Box engineers have insights at their fingertips. They can see whether customers are having issues opening their Box files, whether collaborators were added to files, whether files were added to a folder, and more. Logging also gives engineers a quick route to fixing issues, such as customer file upload latency, in order to stay on target for uptime SLA requirements.

One Kibana dashboard, for example, gives the Storage team at Box, which is responsible for critical business workflows, visibility into file uploads aggregated by file type, by web client used, and by file size.

In addition, to ensure that Box customers continue to have a great user experience, logging is central to development at Box as a way to confirm that committed code changes produce the desired result, says Dave Ward, senior director of engineering at Box.

"When you make a change to add new features or fix, did my change produce the desired result?" Ward asks. "Am I seeing errors when pushing code? If the observability pipeline isn't functioning as intended, we could not guarantee the sanity of our systems. Arta leveraging Elasticsearch is mission critical to sustaining our agility and release cadence at scale."

The Elastic future roadmap for Box

Today, Box is on the path to running much of its production-level, developer-focused logging streams in Arta. As the company continues to add new and existing logging streams from its ongoing development of microservices, the Box team is looking into further improvements in stability, resiliency, and operational efficiency, and into bringing costs down even more.

For the future, Box is considering centralizing more of its tracing metrics into one platform under the Elastic umbrella. Box also is actively partnering with Elastic engineers and refining and sharing product feedback.

The company is interested in other Elastic features like Elastic APM for application performance monitoring, machine learning to help detect and alert on issues, and Elastic's GeoIP mapping capabilities so Box can visualize in Kibana where in the world queries are coming from, based on IP address.

Box is also exploring Elastic SIEM to help drive security operations and Canvas to beautify its Kibana dashboards, in addition to surveying many other Elastic features.

Return on investment

In financial terms, the adoption of Elastic has helped cut Box's per-terabyte logging cost in half. In addition, the reporting problems and hiccups experienced under MySQL have been resolved. And the latency caused, in part, by the former logging solution indexing on read, whereas Elastic indexes on write, is becoming a thing of the past.
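
The index-on-read versus index-on-write distinction can be illustrated with a toy sketch. This is not how either product is implemented internally; it only contrasts scanning every raw log line at query time with building an inverted index once at ingest time so that a query becomes a lookup. The sample log lines and function names are made up for illustration.

```python
from collections import defaultdict

# Made-up sample log lines, for illustration only.
LOGS = [
    "upload failed for file report.pdf",
    "upload ok for file notes.txt",
    "share link created for folder finance",
    "upload failed for file budget.xlsx",
]


def search_by_scan(logs, term):
    """Index-on-read style: every query tokenizes and scans all raw lines."""
    return [i for i, line in enumerate(logs) if term in line.split()]


def build_inverted_index(logs):
    """Index-on-write style: pay the tokenizing cost once, at ingest time."""
    index = defaultdict(list)
    for i, line in enumerate(logs):
        for token in set(line.split()):
            index[token].append(i)
    return index


def search_by_index(index, term):
    """A query is a single dictionary lookup instead of a full scan."""
    return index.get(term, [])


if __name__ == "__main__":
    index = build_inverted_index(LOGS)
    print(search_by_scan(LOGS, "failed"))
    print(search_by_index(index, "failed"))
```

Both functions return the same matching line numbers; the difference is that the scan cost grows with the total volume of logs on every query, while the indexed lookup does not.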

"Boil the ocean or boil a pond to get the data." That's how Wadhwani describes the difference in latency. "Querying in Kibana with Elasticsearch is so much faster," Wadhwani says.

On a less quantifiable level, when it comes to ROI, Box is actively encouraging its engineers to produce the right amount of logs to achieve the level of observability required for building enterprise-grade products.

These and other nuanced changes are helping Box meet internal and external SLAs while at the same time encouraging Box's developers to innovate as Box scales.

The Box Cluster(s)

  • Number of Clusters
    1
  • Number of Nodes
    85 Data, 3 Master, 6 Client
  • Number of LS instances/Beats
    20 Logstash
  • Total Number of Documents
    180 Billion
  • Total Data Size
    190 TB
  • Daily Ingest Rate
    20 TB
  • Number of Indexes
    250
  • Query Rate
    25 Queries/Sec
  • Replicas
    1
  • Time-based indexes
    Daily
  • Node Specifications: Memory Total, CPU, Disk-type (SSD, HDD)
    AWS i3.4xlarge
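
A few back-of-the-envelope figures follow from the numbers listed above. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes) and treats "Total Data Size" as the full on-disk footprint; the source does not say whether that figure includes replicas, so the results are rough estimates only. The index prefix `arta-logs` is hypothetical, and the date suffix simply follows the common `prefix-YYYY.MM.DD` convention for daily indexes.

```python
from datetime import date

# Figures taken from the cluster list above.
TOTAL_DOCS = 180e9        # 180 billion documents
TOTAL_DATA_TB = 190       # total data size
DAILY_INGEST_TB = 20      # daily ingest rate
DATA_NODES = 85           # data nodes

BYTES_PER_TB = 1e12       # assumption: decimal terabytes


def average_document_bytes() -> float:
    """Average stored size per document."""
    return TOTAL_DATA_TB * BYTES_PER_TB / TOTAL_DOCS


def data_per_node_tb() -> float:
    """Average data held per data node, assuming an even spread."""
    return TOTAL_DATA_TB / DATA_NODES


def days_of_ingest_on_disk() -> float:
    """How many days of ingest the total size corresponds to."""
    return TOTAL_DATA_TB / DAILY_INGEST_TB


def daily_index_name(prefix: str, day: date) -> str:
    """Name for a daily time-based index (hypothetical prefix)."""
    return f"{prefix}-{day:%Y.%m.%d}"


if __name__ == "__main__":
    print(f"avg document size: {average_document_bytes():.0f} bytes")
    print(f"data per data node: {data_per_node_tb():.2f} TB")
    print(f"days of ingest on disk: {days_of_ingest_on_disk():.1f}")
    print(daily_index_name("arta-logs", date(2020, 1, 5)))
```

Under these assumptions the average document is about 1 KB, each data node holds a little over 2 TB, and the total footprint corresponds to roughly nine and a half days of ingest at the current rate.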