Elastic's collaboration with OpenTelemetry on improving the filelog receiver

As the newest generally available signal in OpenTelemetry (OTel), logging support currently lags behind tracing and metrics in terms of feature scope and maturity. At Elastic, we bring years of extensive experience with logging use cases and the challenges they present, and we are committed to putting that experience to work on advancing OpenTelemetry's logging capabilities.

Over the past few months, we have examined the capabilities of the filelog receiver in the OpenTelemetry Collector, leveraging our expertise as Filebeat's maintainers to help refine and expand its potential. Our goal is to contribute meaningfully to the evolution of OpenTelemetry's logging features, ensuring they meet the high standards required for robust observability.

Specifically, we focused on verifying that the receiver is well covered in the areas that have been painful for us in the past with Filebeat, such as failover handling, self-telemetry, test coverage, documentation and usability. Based on our exploration, we started conversations with the OTel project's maintainers, sharing our thoughts and suggestions drawn from our experience. Moreover, we've started putting up PRs to add documentation, make enhancements, improve tests, fix bugs, and even implement completely new features.

In this blog post we'll provide a sneak preview of the work that we've done so far in collaboration with the OpenTelemetry community and what's coming next as we continue to explore ways to improve the OpenTelemetry Collector for log collection.

Enhancing the filelog receiver's telemetry

Observability tools are software components like any other and therefore need to be monitored so that problems can be debugged and relevant settings tuned. In particular, users of the filelog receiver will want to know how it's performing. It's important that the filelog receiver emits sufficient telemetry data for common troubleshooting and optimization use cases. This includes sufficient logging and observable metrics that provide insight into the filelog receiver's internal state.

While the filelog receiver already provided a good set of self-telemetry data, we identified some areas for improvement. In particular, we contributed functionality to emit self-telemetry logs on crucial events, such as when log files are discovered, moved or truncated. Another contribution adds observable metrics about the filelog receiver's internal state, namely how many files are open and how many are being harvested. You can find more information in the respective tracking issue.
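
As a rough illustration of where this self-telemetry surfaces: the Collector's own log verbosity and metric detail are controlled under service.telemetry. The fragment below is a minimal sketch. The chosen levels are assumptions for illustration, and the level at which the filelog receiver's file events and file-count metrics are emitted may vary by Collector version.

```yaml
service:
  telemetry:
    logs:
      # Assumed level for illustration; lower-severity receiver events
      # may only be visible at debug.
      level: debug
    metrics:
      # Assumed level for illustration; "detailed" exposes the largest
      # set of internal Collector metrics.
      level: detailed
```

In production you would likely dial the log level back down once troubleshooting is done, since debug logging can be noisy.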

Improving the Kubernetes container logs parsing

The filelog receiver has been able to parse Kubernetes container logs for some time now. However, properly parsing logs from Kubernetes Pods required a fair bit of configuration to deal with different runtime formats and to extract important metadata, such as k8s.pod.name, k8s.container.name, etc. With this in mind, we proposed abstracting this complex set of configuration into a simpler, implementation-specific container parser and contributed this new feature to the filelog receiver. With that new feature, setting up log collection for Kubernetes is an order of magnitude easier: only eight lines of configuration versus roughly 80 lines before.
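
For illustration, a minimal filelog receiver configuration using the container parser might look like the sketch below. The include path is an assumption based on the conventional Kubernetes pod log location, and the operator id is an arbitrary name. Adjust both to your environment.

```yaml
receivers:
  filelog:
    # Assumed path: the conventional location of pod logs on a Kubernetes node.
    include:
      - /var/log/pods/*/*/*.log
    # The container parser derives Kubernetes metadata from the log file path,
    # so the path has to be attached to each record.
    include_file_path: true
    operators:
      # The id is an arbitrary, illustrative name.
      - id: container-parser
        type: container
```

The single container operator stands in for the chain of runtime-format detection and metadata-extraction operators that previously had to be written by hand.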

You can learn more about the details of the new container logs parser in the corresponding OpenTelemetry blog post.

Evaluating test coverage

Log collection from files can run into various unexpected situations, such as restarts, overload and errors. To ensure reliable and consistent collection of logs, tests need to cover these kinds of scenarios. Based on our experience with testing Filebeat, we evaluated the existing filelog receiver tests with respect to those scenarios. While most of the use cases and scenarios were already well tested, we identified a few where test coverage could be improved to ensure reliable log collection.

At the time of writing, we are working on contributing additional tests to address the identified test coverage gaps. You can learn more about it in this GitHub issue.

Persistence evaluation

Another important aspect of log collection that we often hear about from Elastic's log users is failover handling and delivery guarantees for logs. Some logging use cases, for example audit logging, have strict delivery guarantee requirements. Hence, it's important that the filelog receiver provides functionality to reliably handle situations such as temporary unavailability of the logging backend or unexpected restarts of the OTel Collector.

Overall, the filelog receiver already has the functionality needed to deal with such situations. However, user documentation on how to set up reliable log collection, with tangible examples, was an area with potential for improvement.

In this regard, beyond verifying the persistence and offset tracking capabilities, we worked on improving the respective documentation (1, 2) and are also collaborating on a community-reported issue to ensure delivery guarantees for logs.
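
As a sketch of what such a setup can look like, the fragment below persists the receiver's file offsets and the exporter's sending queue to disk through the file_storage extension, so that both can survive a Collector restart or a temporary backend outage. The directory, log path and endpoint are hypothetical placeholders, and the choice of the otlp exporter is an assumption for illustration.

```yaml
extensions:
  file_storage:
    # Hypothetical directory; it must exist and be writable by the Collector.
    directory: /var/lib/otelcol/file_storage

receivers:
  filelog:
    # Hypothetical log path.
    include:
      - /var/log/myapp/*.log
    # Persist read offsets so ingestion resumes where it left off after a restart.
    storage: file_storage

exporters:
  otlp:
    # Hypothetical backend endpoint.
    endpoint: backend.example.com:4317
    # Retry failed sends instead of dropping data.
    retry_on_failure:
      enabled: true
    # Buffer outgoing data on disk so it survives restarts.
    sending_queue:
      enabled: true
      storage: file_storage

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
```

Whether this is sufficient for a strict use case such as audit logging depends on queue sizing and retry limits, which are left at their defaults here.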

Helping users help themselves

Elastic has a long and varied history of supporting customers who use our products for log ingestion. Drawing from this experience, we've proposed a couple of documentation improvements to the OpenTelemetry Collector to help logging users get out of some tricky situations.

Documenting the structure of the tracking file

For every log file the filelog receiver ingests, it needs to track how far into the file it has already read, so it knows where to start reading from when new contents are added to the file. By default, the filelog receiver doesn't persist this tracking information to disk, but it can be configured to do so. We felt it would be useful to document the structure of this tracking file. When ingestion stops unexpectedly, peeking into this tracking file can often provide clues as to where the problem may lie.

Challenges with symlink target changes

The filelog receiver periodically refreshes its memory of the files it's supposed to be ingesting. The interval at which these refreshes happen is controlled by the poll_interval setting. In certain setups, the log files being ingested by the filelog receiver are symlinks pointing to the actual files, and these symlinks can be updated over time to point to newer files. If the symlink target changes twice before the filelog receiver has had a chance to refresh its memory, it will miss the first change and therefore not ingest the corresponding target file. We've documented this edge case, suggesting that users with such setups set poll_interval to a sufficiently low value.
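
A minimal sketch of such a configuration follows. The symlink path and the interval value are assumptions for illustration; the right value depends on how often the symlink is repointed in your setup.

```yaml
receivers:
  filelog:
    include:
      # Hypothetical symlink that is periodically repointed to a newer file.
      - /var/log/myapp/current.log
    # Assumed value for illustration: pick an interval shorter than the
    # smallest expected time between two consecutive symlink target changes.
    poll_interval: 100ms
```

A lower interval means more frequent file system checks, so there is a trade-off between missing a target change and the extra polling overhead.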

Planning ahead for the receiver's GA

Last but not least, we have raised the topic of making the filelog receiver a generally available (GA) component. Users need to be able to rely on the stability of the functionality they use, without having to deal with the risk of breaking changes in minor version updates. To this end, we have kicked off an initial plan with the maintainers to mark any issue that blocks stability of the filelog receiver with a required_for_ga label. Once the OpenTelemetry Collector reaches version v1.0.0, we will also be able to work towards GA for the receiver itself.

Conclusion

Overall, OTel's filelog receiver component is in good shape and provides important functionality for most log collection use cases. Where the filelog receiver still has minor gaps or room for improvement, we are glad to contribute our expertise and experience from Filebeat use cases. The above is just the beginning of our effort to help the OpenTelemetry Collector, and log collection specifically, get closer to a stable version. Moreover, we are happy to help the filelog receiver maintainers with general maintenance of the component: dealing with community issues and PRs, jointly working on the component's roadmap, etc.

We'd like to thank the OTel Collector group and, in particular, Daniel Jaglowski for the great and constructive collaboration on the filelog receiver so far!

Stay tuned to learn more about our future contributions and involvement in OpenTelemetry.