Unfortunately, sometimes things do not go as expected and your elasticsearch-hadoop job execution might go awry: incorrect data might be read or written, the job might take significantly longer than expected or you might face some exception. This section tries to provide help and tips for doing your own diagnostics, identifying the problem and hopefully fixing it.
Test that Elasticsearch is reachable from the Spark/Hadoop cluster where the job is running. Your machine might reach it, but that is not where the actual code will be running. If ES is accessible, minimize the number of tasks and their bulk size; if Elasticsearch is overloaded, it will keep falling behind, GC will kick in and eventually its nodes will become unresponsive, causing clients to think the machines have died. See the Performance considerations section for more details.
Test your network
Way too many times, folks use their local development settings in a production environment. Double check that Elasticsearch is accessible from your production environment: check the host address and port, and check that the machines where the Hadoop/Spark job is running can access Elasticsearch (use telnet or whatever tool you have available).
localhost (aka the default) in a production environment is simply a misconfiguration.
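Where telnet is not installed, a few lines of Python can perform the same reachability check. The following is a minimal sketch, not part of elasticsearch-hadoop: the function name `can_connect` is made up for illustration, and the check only proves that a TCP connection can be opened, not that Elasticsearch itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Run it on one of the machines that executes the Hadoop/Spark tasks, not on your workstation, for example `can_connect("es-host.example", 9200)`. The host name is a placeholder; 9200 is the default Elasticsearch HTTP port.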
Triple check the classpath
Make sure to use only one version of elasticsearch-hadoop in your classpath. While it might not be obvious, the classpath in Hadoop/Spark is assembled from multiple folders; furthermore, there are no guarantees what version is going to be picked up first by the JVM. To avoid obscure issues, double check your classpath and make sure there is only one version of the library in there, the one you are interested in.
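One way to check for duplicates is to scan every classpath entry for connector jars. The sketch below assumes the jars follow the usual `elasticsearch-hadoop-<version>.jar` or `elasticsearch-spark-<version>.jar` naming; the prefixes and the function names are illustrative assumptions, so adjust them to the artifacts you actually deploy.

```python
import os
from typing import List

# Assumed file-name prefixes of the connector jars; adjust as needed.
PREFIXES = ("elasticsearch-hadoop", "elasticsearch-spark")


def find_connector_jars(classpath: str) -> List[str]:
    """Return the paths of all connector jars found on a classpath string."""
    found = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        # A trailing "*" is the JVM wildcard for "all jars in this folder".
        if entry.endswith("*"):
            entry = entry[:-1]
        if os.path.isdir(entry):
            candidates = [os.path.join(entry, n) for n in sorted(os.listdir(entry))]
        else:
            candidates = [entry]
        for path in candidates:
            name = os.path.basename(path)
            if name.startswith(PREFIXES) and name.endswith(".jar"):
                found.append(path)
    return found


def has_conflict(jars: List[str]) -> bool:
    """True if more than one distinct connector jar file name was found."""
    return len({os.path.basename(j) for j in jars}) > 1
```

Feed it the classpath the job really sees (for Hadoop, the output of `hadoop classpath`) and investigate whenever `has_conflict` returns True.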
Isolate the issue
When encountering a problem, do your best to isolate it. This can be quite tricky and is often the hardest part, so take your time with it. Take baby steps and eliminate unnecessary code or settings in small chunks until you end up with a minimal example that exposes your problem.
Use a speedy, local environment
A lot of Hadoop jobs are batch in nature, which means they take a long time to execute. To track down the issue faster, use whatever means possible to speed up the feedback loop: use a tiny dataset (no need to load millions of records, a few dozen will do) and use a local/pseudo-distributed Hadoop cluster alongside an Elasticsearch node running on your development machine.
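Cutting a tiny dataset out of a large line-oriented file (CSV, TSV, one JSON document per line) takes only a few lines. This is a sketch under that assumption; the function name `write_sample` and the default of 50 records are arbitrary choices for illustration.

```python
from itertools import islice


def write_sample(source_path: str, sample_path: str, max_records: int = 50) -> int:
    """Copy the first max_records lines of source_path into sample_path.

    Returns the number of lines actually written.
    """
    written = 0
    with open(source_path, encoding="utf-8") as src, \
            open(sample_path, "w", encoding="utf-8") as dst:
        # islice stops after max_records lines without reading the rest.
        for line in islice(src, max_records):
            dst.write(line)
            written += 1
    return written
```

Taking the head of the file is the simplest option; if the problem only shows up for particular records, pick those records by hand instead.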
Check your settings
Double check your settings and use constants or shared configurations wherever possible. It is easy to make typos, so try to reduce manual configuration by using properties files or constant interfaces/classes. If you are not sure what a setting is doing, remove it or change its value and see whether it affects your job output.
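As an illustration of the constants approach, the sketch below keeps setting names in one place and flags any `es.` key that is not on a known list. The list is deliberately a small subset of the elasticsearch-hadoop settings, and `unknown_settings` is a hypothetical helper, so extend both to match the options your job relies on.

```python
from typing import Dict, List

# Setting names defined once, so a typo becomes a NameError instead of a
# silently ignored option.
ES_NODES = "es.nodes"
ES_PORT = "es.port"
ES_RESOURCE = "es.resource"
ES_BATCH_SIZE_ENTRIES = "es.batch.size.entries"
ES_BATCH_SIZE_BYTES = "es.batch.size.bytes"

# Deliberately partial: add every setting your jobs use.
KNOWN_KEYS = {
    ES_NODES,
    ES_PORT,
    ES_RESOURCE,
    ES_BATCH_SIZE_ENTRIES,
    ES_BATCH_SIZE_BYTES,
}


def unknown_settings(settings: Dict[str, str]) -> List[str]:
    """Return the es.* keys that are not in KNOWN_KEYS (likely typos)."""
    return sorted(
        key for key in settings
        if key.startswith("es.") and key not in KNOWN_KEYS
    )
```

For example, `unknown_settings({ES_NODES: "es-host.example", "es.resouce": "logs/entry"})` returns `["es.resouce"]`, pointing at the misspelled key before the job runs.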
Verify the input and output
Take a close look at your input and output; this is typically easier to do with Elasticsearch (the service outlives the job/script, is real-time and can be accessed right away in a flexible manner, including from the command line). If your data is not persisted (either in Hadoop or Elasticsearch), consider doing that temporarily to validate each step of your workflow.
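A quick sanity check is to compare the number of input records with the number of documents that ended up in the index. The sketch below assumes a line-oriented input file and an Elasticsearch node reachable over plain HTTP without authentication; both function names are illustrative. It relies on the `_count` endpoint, which returns a JSON body containing a `count` field.

```python
import json
import urllib.request


def count_input_records(path: str) -> int:
    """Count the non-empty lines of a line-oriented input file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def count_indexed_docs(base_url: str, index: str, timeout: float = 10.0) -> int:
    """Ask Elasticsearch how many documents the given index holds."""
    url = f"{base_url.rstrip('/')}/{index}/_count"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["count"]
```

A mismatch between `count_input_records("sample.csv")` and `count_indexed_docs("http://localhost:9200", "my-index")` (placeholder file, address and index name) does not say what went wrong, but it tells you which step to look at next. Recently written documents only show up in the count after the index has been refreshed.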
While logging helps with bugs and errors, for runtime behavior we strongly recommend properly monitoring your Hadoop and Elasticsearch clusters. Both are outside the scope of this chapter; however, there are several popular, free solutions out there that are worth investigating. For Elasticsearch, we recommend Marvel, a free monitoring tool (for development) created by the team behind Elasticsearch. Monitoring gives insight into how the cluster is actually behaving and helps you correlate behavior. If a monitoring solution is not possible, use the metrics provided by Hadoop, Elasticsearch and elasticsearch-hadoop to evaluate the runtime behavior.
Logging gives you a lot of insight into what is going on. Hadoop, Spark and Elasticsearch have extensive logging mechanisms, as does elasticsearch-hadoop; however, use them judiciously: too much logging can hide the actual issue, so again, enable it in small increments.
Measure, do not assume
When encountering a performance issue, do some benchmarking first, in as much isolation as possible. Do not simply assume a certain component is slow; make sure/prove it actually is. Otherwise, more often than not, one might find herself ‘fixing’ the wrong problem (and typically creating a new one).
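Even a crude timer is better than a guess. The following sketch records how long each stage of a script takes, so the slow part can be identified before anything is "fixed"; the `timed` helper and the stage labels are made up for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, results: Dict[str, float]) -> Iterator[None]:
    """Store the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs are measured too.
        results[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("parse", timings):
    parsed = [int(value) for value in ["1", "2", "3"]]
with timed("sum", timings):
    total = sum(parsed)

# Print the stages from slowest to fastest.
for label, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{label}: {seconds:.6f}s")
```

Wrap the real stages of your job (reading, transforming, writing to Elasticsearch) the same way and repeat the measurement a few times, since a single run can be skewed by caching or by other load on the cluster.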
Find a baseline
Indexing performance depends heavily on the type of data being targeted and its mapping. The same goes for searching, with the query definition added to the mix. As mentioned before, experiment and measure the various parts of your dataset to find the sweet spot of your environment before importing or searching large amounts of data.
If something is not working, there are two possibilities:
- there is a bug
- you are doing something wrong
Whichever it is, a clear description of the problem will help other users to help you. The more complete your report is, the quicker you will receive help from users!
What information is useful?
- OS & JVM version
- Hadoop / Spark version / distribution
- if using a certain library (Cascading, Hive, Pig), the version used
- elasticsearch-hadoop version
- the job or script that is causing the issue
- Hadoop / Spark cluster size
- Elasticsearch cluster size
- the size of the dataset and a snippet of it in its raw format (CSV, TSV, etc.)
If you don’t provide all of the information, then it may be difficult for others to figure out where the issue is.
Where do I post my information?
Please don’t paste long lines of code in the mailing list or the IRC; it is difficult to read, and people will be less likely to take the time to help. Use a gist instead:
“Gist is a simple way to share snippets and pastes with others. All gists are git repositories, so they are automatically versioned, forkable and usable as a git repository.”
Please see the Elasticsearch help page for tips on how to create a detailed user report, fast and easy.