Predicting delayed flights with classification analysis

Let’s try to predict whether a flight will be delayed or not by using the sample flight data. The data set contains information such as weather conditions, carrier, flight distance, origin, destination, and whether or not the flight was delayed. When you create a data frame analytics job for classification analysis, it learns the relationships between the fields in your data in order to predict the value of the dependent variable, which in this case is the boolean FlightDelay field. For an overview of these concepts, see Classification and Introduction to supervised learning.

If you want to view this example in a Jupyter notebook, click here.

Preparing your data

Each document in the sample flight data set contains details for a single flight, so this data is ready for analysis; it is already in a two-dimensional entity-based data structure. In general, you often need to transform the data into an entity-centric index before you can analyze the data.

In order to be analyzed, a document must contain at least one field with a supported data type (numeric, boolean, text, keyword or ip) and must not contain arrays with more than one item. If your source data consists of some documents that contain the dependent variable and some that do not, the model is trained on the subset of documents that contain it.
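The requirements above can be sketched as a small client-side pre-check. This is only an illustration under stated assumptions: the helper names (find_multi_item_arrays, usable_for_training) and the Tags field are hypothetical, the check inspects the raw _source of a document rather than the index mapping, and it does not verify that field types are supported.

```python
from typing import Any, Dict, List


def find_multi_item_arrays(source: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return dotted paths of fields whose value is an array with more than one item."""
    paths: List[str] = []
    for key, value in source.items():
        path = f"{prefix}{key}"
        if isinstance(value, list) and len(value) > 1:
            paths.append(path)
        elif isinstance(value, dict):
            # Recurse into object fields such as DestLocation.
            paths.extend(find_multi_item_arrays(value, prefix=path + "."))
    return paths


def usable_for_training(source: Dict[str, Any], dependent_variable: str) -> bool:
    """True if the document has the dependent variable and no multi-item arrays."""
    has_label = source.get(dependent_variable) is not None
    return has_label and not find_multi_item_arrays(source)


# A trimmed version of the example source document.
flight = {
    "FlightDelay": False,
    "DistanceMiles": 4802.864932998549,
    "DestLocation": {"lat": "34.85839844", "lon": "136.8049927"},
}
# Hypothetical variants: one without the dependent variable, one with an array field.
unlabeled = {k: v for k, v in flight.items() if k != "FlightDelay"}
tagged = {**flight, "Tags": ["a", "b"]}

print(usable_for_training(flight, "FlightDelay"))
print(usable_for_training(unlabeled, "FlightDelay"))
print(find_multi_item_arrays(tagged))
```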

Example source document
{
  "_index": "kibana_sample_data_flights",
  "_type": "_doc",
  "_id": "S-JS1W0BJ7wufFIaPAHe",
  "_version": 1,
  "_seq_no": 3356,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "FlightNum": "N32FE9T",
    "DestCountry": "JP",
    "OriginWeather": "Thunder & Lightning",
    "OriginCityName": "Adelaide",
    "AvgTicketPrice": 499.08518599798685,
    "DistanceMiles": 4802.864932998549,
    "FlightDelay": false,
    "DestWeather": "Sunny",
    "Dest": "Chubu Centrair International Airport",
    "FlightDelayType": "No Delay",
    "OriginCountry": "AU",
    "dayOfWeek": 3,
    "DistanceKilometers": 7729.461862731618,
    "timestamp": "2019-10-17T11:12:29",
    "DestLocation": {
      "lat": "34.85839844",
      "lon": "136.8049927"
    },
    "DestAirportID": "NGO",
    "Carrier": "ES-Air",
    "Cancelled": false,
    "FlightTimeMin": 454.6742272195069,
    "Origin": "Adelaide International Airport",
    "OriginLocation": {
      "lat": "-34.945",
      "lon": "138.531006"
    },
    "DestRegion": "SE-BD",
    "OriginAirportID": "ADL",
    "OriginRegion": "SE-BD",
    "DestCityName": "Tokoname",
    "FlightTimeHour": 7.577903786991782,
    "FlightDelayMin": 0
  }
}

The sample flight data set is used in this example because it is easily accessible. However, the data has been manually created and contains some inconsistencies. For example, a flight can be both delayed and canceled. This is a good reminder that the quality of your input data affects the quality of your results.

Creating a classification model

To predict whether a specific flight is delayed:

  1. Create a data frame analytics job.

    You can use the wizard on the Machine Learning > Data Frame Analytics tab in Kibana or the create data frame analytics jobs API.

    Creating a data frame analytics job in Kibana (shown in two screenshots)
    1. Choose kibana_sample_data_flights as the source index.
    2. Choose classification as the job type.
    3. Choose FlightDelay as the dependent variable, which is the field that we want to predict with the classification analysis.
    4. Add Cancelled, FlightDelayMin, and FlightDelayType to the list of excluded fields. These fields are excluded from the analysis. It is recommended to exclude fields that either contain erroneous data or describe the dependent variable.
    5. Choose a training percent of 10, which means the job randomly selects 10% of the source data for training. While that value is low for this example, for many large data sets a small training sample greatly reduces runtime without impacting accuracy.
    6. Use the default feature importance values.
    7. Use the default memory limit for the job. If the job requires more than this amount of memory, it fails to start. If the available memory on the node is limited, this setting makes it possible to prevent job execution.
    8. Add a job ID and optionally a job description.
    9. Add the name of the destination index that will contain the results of the analysis. It will contain a copy of the source index data where each document is annotated with the results. If the index does not exist, it will be created automatically.
    API example
    PUT _ml/data_frame/analytics/model-flight-delay-classification
    {
      "source": {
        "index": [
          "kibana_sample_data_flights"
        ]
      },
      "dest": {
        "index": "df-flight-delayed",
        "results_field": "ml" 
      },
      "analysis": {
        "classification": {
          "dependent_variable": "FlightDelay",
          "training_percent": 10
        }
      },
      "analyzed_fields": {
        "includes": [],
        "excludes": [
          "Cancelled",
          "FlightDelayMin",
          "FlightDelayType"
        ]
      }
    }

    In this request, results_field is the name of the field in the destination index that contains the analysis results.

  2. Start the job in Kibana or use the start data frame analytics jobs API.

    The job takes a few minutes to run. Runtime depends on the local hardware and also on the number of documents and fields that are analyzed. The more fields and documents, the longer the job runs. It stops automatically when the analysis is complete.

    API example
    POST _ml/data_frame/analytics/model-flight-delay-classification/_start
  3. Check the job stats to follow the progress in Kibana or use the get data frame analytics jobs statistics API.

    Statistics for a data frame analytics job in Kibana

    When the job stops, the results are ready to view and evaluate. To learn more about the job phases, see How it works.

    API example
    GET _ml/data_frame/analytics/model-flight-delay-classification/_stats

    The API call returns the following response:

    {
      "count" : 1,
      "data_frame_analytics" : [
        {
          "id" : "model-flight-delay-classification",
          "state" : "stopped",
          "progress" : [
            {
              "phase" : "reindexing",
              "progress_percent" : 100
            },
            {
              "phase" : "loading_data",
              "progress_percent" : 100
            },
            {
              "phase" : "feature_selection",
              "progress_percent" : 100
            },
            {
              "phase" : "coarse_parameter_search",
              "progress_percent" : 100
            },
            {
              "phase" : "fine_tuning_parameters",
              "progress_percent" : 100
            },
            {
              "phase" : "final_training",
              "progress_percent" : 100
            },
            {
              "phase" : "writing_results",
              "progress_percent" : 100
            },
            {
              "phase" : "inference",
              "progress_percent" : 100
            }
          ],
          "data_counts" : {
            "training_docs_count" : 1305,
            "test_docs_count" : 11754,
            "skipped_docs_count" : 0
          },
          "memory_usage" : {
            "timestamp" : 1597182490577,
            "peak_usage_bytes" : 316613,
            "status" : "ok"
          },
          "analysis_stats" : {
            "classification_stats" : {
              "timestamp" : 1597182490577,
              "iteration" : 18,
              "hyperparameters" : {
                "class_assignment_objective" : "maximize_minimum_recall",
                "alpha" : 11.630957564710283,
                "downsample_factor" : 0.9418550623091531,
                "eta" : 0.032382816833064335,
                "eta_growth_rate_per_tree" : 1.0198807182688074,
                "feature_bag_fraction" : 0.5504020748926737,
                "gamma" : 0.08388388780939579,
                "lambda" : 0.08628826657684924,
                "max_attempts_to_add_tree" : 3,
                "max_optimization_rounds_per_hyperparameter" : 2,
                "max_trees" : 644,
                "num_folds" : 5,
                "num_splits_per_feature" : 75,
                "soft_tree_depth_limit" : 7.550606337307592,
                "soft_tree_depth_tolerance" : 0.13448633124842999
              },
              "timing_stats" : {
                "elapsed_time" : 44206,
                "iteration_time" : 1884
              },
              "validation_loss" : {
                "loss_type" : "binomial_logistic",
                "fold_values" : [ ]
              }
            }
          }
        }
      ]
    }
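The statistics response from step 3 can also be checked programmatically. The following is a minimal sketch, assuming the response has already been fetched and parsed into a Python dictionary; the function name summarize_job_stats is hypothetical, and example_response is a trimmed copy of the response shown above.

```python
from typing import Any, Dict


def summarize_job_stats(stats_response: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the first job in a get-statistics response."""
    job = stats_response["data_frame_analytics"][0]
    phases = {p["phase"]: p["progress_percent"] for p in job["progress"]}
    counts = job["data_counts"]
    analyzed = counts["training_docs_count"] + counts["test_docs_count"]
    unfinished = [name for name, percent in phases.items() if percent < 100]
    return {
        "id": job["id"],
        # The job is done when it has stopped and every phase reports 100%.
        "finished": job["state"] == "stopped" and not unfinished,
        "unfinished_phases": unfinished,
        # Share of analyzed documents that were used for training.
        "training_fraction": counts["training_docs_count"] / analyzed if analyzed else 0.0,
    }


# Trimmed copy of the example response (only the fields the sketch reads).
example_response = {
    "count": 1,
    "data_frame_analytics": [
        {
            "id": "model-flight-delay-classification",
            "state": "stopped",
            "progress": [
                {"phase": "reindexing", "progress_percent": 100},
                {"phase": "loading_data", "progress_percent": 100},
                {"phase": "final_training", "progress_percent": 100},
                {"phase": "writing_results", "progress_percent": 100},
                {"phase": "inference", "progress_percent": 100},
            ],
            "data_counts": {
                "training_docs_count": 1305,
                "test_docs_count": 11754,
                "skipped_docs_count": 0,
            },
        }
    ],
}

summary = summarize_job_stats(example_response)
print(summary)
```

With the counts from the example, the training fraction comes out just under 0.1, which matches the training percent of 10 chosen when the job was created.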

Viewing classification results

Now you have a new index that contains a copy of your source data with predictions for your dependent variable.

When you view the classification results in Kibana, it shows contents of the destination index in a tabular format:

Results for a data frame analytics job in Kibana

In this example, the table shows a column for the dependent variable (FlightDelay), which contains the ground truth values that you are trying to predict. It also shows a column for the predicted values (ml.FlightDelay_prediction), which were generated by the classification analysis. The ml.is_training column indicates whether the document was used in the training or testing data set. You can filter the table and the confusion matrix such that they contain only testing or training data. You can also enable histogram charts to get a better understanding of the distribution of values in your data.

If you examine this destination index more closely in the Discover app in Kibana or use the standard Elasticsearch search command, you can see that the analysis predicts the probability of all possible classes for the dependent variable (in a top_classes object). In this case, there are two classes: true and false. The most probable class is the prediction, which is what’s shown in the classification results table. If you want to understand how sure the model is about the prediction, however, you might want to examine the class probability values. A higher number means that the model is more confident.

API example
GET df-flight-delayed/_search

The snippet below shows a part of a document with the annotated results:

          ...
          "FlightDelay" : false,
          ...
          "ml" : {
            "top_classes" : [
              {
                "class_probability" : 0.9198146781161334,
                "class_score" : 0.36964390728677926,
                "class_name" : false
              },
              {
                "class_probability" : 0.08018532188386665,
                "class_score" : 0.08018532188386665,
                "class_name" : true
              }
            ],
            "prediction_score" : 0.36964390728677926,
            "FlightDelay_prediction" : false,
            "prediction_probability" : 0.9198146781161334,
            "feature_importance" : [
              {
                "feature_name" : "DistanceMiles",
                "importance" : -3.039025449178423
              },
              {
                "feature_name" : "FlightTimeMin",
                "importance" : 2.4980756273399045
              }
            ],
            "is_training" : false
          }

In this snippet, top_classes is an array of values specifying the probability of the prediction and the class_score for each class.

The top_classes object contains the predicted classes with the highest scores. The class_probability is a value between 0 and 1. The higher the number, the more confident the model is that the data point belongs to the named class. In the example above, false has a class_probability of 0.92 while true has only 0.08, so the prediction is false. The class_score is a function of the probability.
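As a minimal sketch of how the top_classes values relate to the prediction, the snippet below picks the entry with the highest class_probability, following the description above. The function name most_probable_class is hypothetical, and the sketch does not settle how the analysis itself uses class_score when it assigns the predicted class.

```python
from typing import Any, Dict, List


def most_probable_class(top_classes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the top_classes entry with the highest class_probability."""
    return max(top_classes, key=lambda entry: entry["class_probability"])


# Values copied from the annotated document shown above.
top_classes = [
    {
        "class_probability": 0.9198146781161334,
        "class_score": 0.36964390728677926,
        "class_name": False,
    },
    {
        "class_probability": 0.08018532188386665,
        "class_score": 0.08018532188386665,
        "class_name": True,
    },
]

best = most_probable_class(top_classes)
# For a two-class problem the class probabilities should add up to (almost exactly) 1.
total_probability = sum(entry["class_probability"] for entry in top_classes)

print(best["class_name"], round(best["class_probability"], 2))
```

In this document the selected entry agrees with both ml.FlightDelay_prediction and ml.prediction_probability.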

Evaluating classification results

Though you can look at individual results and compare the predicted value (ml.FlightDelay_prediction) to the actual value (FlightDelay), you typically need to evaluate the success of your classification model as a whole.

Kibana provides a normalized confusion matrix that contains the percentage of occurrences where the analysis classified data points correctly with their actual class and the percentage of occurrences where it misclassified them.

Evaluation of a data frame analytics job in Kibana

As the sample data may change when it is loaded into Kibana, the results of the classification analysis can vary even if you use the same configuration as the example. Therefore, use this information as a guideline for interpreting your own results.

If you want to see the exact number of occurrences, select a quadrant in the matrix. You can optionally filter the table to contain only testing data so you can see how well the model performs on previously unseen data. In this example, there are 2952 documents in the testing data that have the true class. 1893 of them are predicted as false; this is called a false negative. 1059 are predicted correctly as true; this is called a true positive. The confusion matrix therefore shows us that 36% of the actual true values were correctly predicted and 64% were incorrectly predicted in the test data set.

Likewise, if you select other quadrants in the matrix, it shows the number of documents that have the false class as their actual value in the testing data set. In this example, the model labeled 1033 documents out of 8802 correctly as false; this is called a true negative. 7769 documents are predicted incorrectly as true; this is called a false positive. Thus 12% of the actual false values were correctly predicted and 88% were incorrectly predicted in the test data set. When you perform classification analysis on your own data, it might take multiple iterations before you are satisfied with the results and ready to deploy the model.
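The percentages quoted above follow directly from the document counts. A small worked example, assuming nothing beyond those counts (the helper name row_percentages and the dictionary keys are hypothetical):

```python
from typing import Dict


def row_percentages(counts: Dict[str, int]) -> Dict[str, int]:
    """Convert one row of a confusion matrix from counts to rounded percentages."""
    total = sum(counts.values())
    if total == 0:
        return {label: 0 for label in counts}
    return {label: round(100 * count / total) for label, count in counts.items()}


# Testing data counts from the example, grouped by actual class.
actual_true = {"predicted_false": 1893, "predicted_true": 1059}
actual_false = {"predicted_false": 1033, "predicted_true": 7769}

true_row = row_percentages(actual_true)
false_row = row_percentages(actual_false)

print(true_row)
print(false_row)
```

Each row is normalized by the number of documents in its actual class (2952 and 8802 respectively), which is why the two percentages in a row sum to 100.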

You can also generate these metrics with the data frame analytics evaluate API. For more information about interpreting the evaluation metrics, see Classification evaluation.

API example

First, we want to know the training error that represents how well the model performed on the training data set.

POST _ml/data_frame/_evaluate
{
  "index": "df-flight-delayed",
  "query": {
    "term": {
      "ml.is_training": {
        "value": true
      }
    }
  },
  "evaluation": {
    "classification": {
      "actual_field": "FlightDelay",
      "predicted_field": "ml.FlightDelay_prediction",
      "metrics": {
        "multiclass_confusion_matrix": {}
      }
    }
  }
}

Setting ml.is_training to true in the query means that we calculate the training error by evaluating only the training data.

Next, we calculate the generalization error that represents how well the model performed on previously unseen data:

POST _ml/data_frame/_evaluate
{
  "index": "df-flight-delayed",
  "query": {
    "term": {
      "ml.is_training": {
        "value": false
      }
    }
  },
  "evaluation": {
    "classification": {
      "actual_field": "FlightDelay",
      "predicted_field": "ml.FlightDelay_prediction",
      "metrics": {
        "multiclass_confusion_matrix": {}
      }
    }
  }
}

Setting ml.is_training to false in the query means that we evaluate only the documents that are not part of the training data.
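The two evaluation requests differ only in the value of the ml.is_training filter, so the request body can be generated from one helper. This is a sketch, not an official client: the function name build_evaluate_request and its parameters are hypothetical, and it only builds the JSON body that would be sent to the _ml/data_frame/_evaluate endpoint shown above.

```python
import json
from typing import Any, Dict


def build_evaluate_request(
    index: str,
    actual_field: str,
    predicted_field: str,
    is_training: bool,
    results_field: str = "ml",
) -> Dict[str, Any]:
    """Build an evaluate request body restricted to training or testing documents."""
    return {
        "index": index,
        "query": {"term": {f"{results_field}.is_training": {"value": is_training}}},
        "evaluation": {
            "classification": {
                "actual_field": actual_field,
                "predicted_field": predicted_field,
                "metrics": {"multiclass_confusion_matrix": {}},
            }
        },
    }


# Training error: evaluate only documents used for training.
training_request = build_evaluate_request(
    "df-flight-delayed", "FlightDelay", "ml.FlightDelay_prediction", is_training=True
)
# Generalization error: evaluate only documents the model has not seen.
testing_request = build_evaluate_request(
    "df-flight-delayed", "FlightDelay", "ml.FlightDelay_prediction", is_training=False
)

print(json.dumps(testing_request, indent=2))
```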

The returned confusion matrix shows us how many data points were classified correctly (where the actual_class matches the predicted_class) and how many were misclassified (actual_class does not match predicted_class):

{
  "classification" : {
    "multiclass_confusion_matrix" : {
      "confusion_matrix" : [
        {
          "actual_class" : "false", 
          "actual_class_doc_count" : 8802, 
          "predicted_classes" : [
            {
              "predicted_class" : "false", 
              "count" : 1033 
            },
            {
              "predicted_class" : "true",
              "count" : 7769
            }
          ],
          "other_predicted_class_doc_count" : 0
        },
        {
          "actual_class" : "true",
          "actual_class_doc_count" : 2952,
          "predicted_classes" : [
            {
              "predicted_class" : "false",
              "count" : 1893
            },
            {
              "predicted_class" : "true",
              "count" : 1059
            }
          ],
          "other_predicted_class_doc_count" : 0
        }
      ],
      "other_actual_class_count" : 0
    }
  }
}

In this response, actual_class is the name of the actual class. In this example, there are two actual classes: true and false.

actual_class_doc_count is the number of documents in the data set that belong to the actual class.

predicted_class is the name of the predicted class.

count is the number of documents belonging to the actual class that are labeled as the predicted class.
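Beyond the raw counts, summary metrics can be derived from the returned matrix. The sketch below computes overall accuracy and per-class precision from a parsed response; the function name summarize_confusion_matrix is hypothetical, and response is a trimmed copy of the example above containing only the fields the code reads.

```python
from typing import Any, Dict


def summarize_confusion_matrix(response: Dict[str, Any]) -> Dict[str, Any]:
    """Compute accuracy and per-class precision from an evaluate response."""
    matrix = response["classification"]["multiclass_confusion_matrix"]["confusion_matrix"]
    total = 0
    correct = 0
    predicted_totals: Dict[str, int] = {}
    correct_by_class: Dict[str, int] = {}
    for row in matrix:
        total += row["actual_class_doc_count"]
        for cell in row["predicted_classes"]:
            name = cell["predicted_class"]
            predicted_totals[name] = predicted_totals.get(name, 0) + cell["count"]
            if name == row["actual_class"]:
                correct += cell["count"]
                correct_by_class[name] = cell["count"]
    # Precision: of all documents predicted as a class, the share that really belong to it.
    precision = {
        name: correct_by_class.get(name, 0) / count if count else 0.0
        for name, count in predicted_totals.items()
    }
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        "precision": precision,
    }


response = {
    "classification": {
        "multiclass_confusion_matrix": {
            "confusion_matrix": [
                {
                    "actual_class": "false",
                    "actual_class_doc_count": 8802,
                    "predicted_classes": [
                        {"predicted_class": "false", "count": 1033},
                        {"predicted_class": "true", "count": 7769},
                    ],
                },
                {
                    "actual_class": "true",
                    "actual_class_doc_count": 2952,
                    "predicted_classes": [
                        {"predicted_class": "false", "count": 1893},
                        {"predicted_class": "true", "count": 1059},
                    ],
                },
            ]
        }
    }
}

metrics = summarize_confusion_matrix(response)
print(metrics)
```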

When you have trained a satisfactory model, you can deploy it to make predictions about new data. Those steps are not covered in this example. See Inference.

If you don’t want to keep the data frame analytics job, you can delete it by using the delete data frame analytics job API. When you delete data frame analytics jobs, the destination indices remain intact.
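As an illustration of the delete call, the sketch below builds the HTTP request for the job created in this example without sending it. The base URL http://localhost:9200 and the function name build_delete_job_request are assumptions; a real cluster typically also requires authentication, which is omitted here.

```python
import urllib.request


def build_delete_job_request(base_url: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for a data frame analytics job."""
    url = f"{base_url.rstrip('/')}/_ml/data_frame/analytics/{job_id}"
    return urllib.request.Request(url, method="DELETE")


# Assumed local cluster address; replace it with your own endpoint.
request = build_delete_job_request(
    "http://localhost:9200", "model-flight-delay-classification"
)

# urllib.request.urlopen(request) would send it; it is left out so the sketch
# has no side effects.
print(request.get_method(), request.full_url)
```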