Time-based data (documents that are predominantly identified by their timestamp) often have associated retention policies to manage data growth. For example, your system may be generating 500 documents every second. That will generate 43 million documents per day, and nearly 16 billion documents a year.
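The daily and yearly totals quoted above correspond to a constant rate of 500 documents per second. As a rough check, the arithmetic can be sketched as follows; the function name `document_volume` is purely illustrative, and the sketch assumes a perfectly steady ingest rate and a 365-day year.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day
DAYS_PER_YEAR = 365             # assumption: ignore leap years


def document_volume(docs_per_second):
    """Return (documents per day, documents per year) for a constant ingest rate."""
    per_day = docs_per_second * SECONDS_PER_DAY
    per_year = per_day * DAYS_PER_YEAR
    return per_day, per_year


per_day, per_year = document_volume(500)
print(f"{per_day:,} documents per day")    # 43,200,000
print(f"{per_year:,} documents per year")  # 15,768,000,000
```

Because the totals scale linearly with the ingest rate, doubling the rate doubles the yearly volume, which is why retention policies tend to follow directly from storage cost.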
While your analysts and data scientists may wish you stored that data indefinitely for analysis, time is never-ending and so your storage requirements will continue to grow without bound. Retention policies are therefore often dictated by the simple calculation of storage costs over time, and what the organization is willing to pay to retain historical data. Often these policies start deleting data after a few months or years.
Storage cost is a fixed quantity. It takes X money to store Y data. But the utility of a piece of data often changes with time. Sensor data gathered at millisecond granularity is extremely useful right now, reasonably useful if from a few weeks ago, and only marginally useful if older than a few months.
So while the cost of storing a millisecond of sensor data from ten years ago is fixed, the value of that individual sensor reading often diminishes with time. It’s not useless (it could easily contribute to a useful analysis), but its reduced value often leads to deletion rather than paying the fixed storage cost.