07 October 2014 · Engineering

Playing HTTP Tricks with Nginx

By Karel Minařík

Update November 2, 2015: If you're interested in advanced access control configuration or other security features, consider taking Shield, security for Elasticsearch, for a spin.

One of the defining features of Elasticsearch is that it’s exposed as a (loosely) RESTful service over HTTP.

The benefits are easy to spell out, of course: the API is familiar and predictable to all web developers. It’s easy to use with “bare hands” via the curl command, or in the browser. It’s easy to write API wrappers in various programming languages.
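For instance, querying a node with curl is a one-liner (the index name here is made up):

$ curl 'http://localhost:9200/myindex/_search?q=title:test'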

Nevertheless, the importance of the HTTP-based nature of Elasticsearch is rooted deeper: in the way it fits into the existing paradigm of software development and architecture.

HTTP As an Architectural Paradigm

The typical modern software or information system is quite frequently a collection of loosely coupled services, communicating over a network: typically, via HTTP. Design-wise, the important aspect of this approach is that you can always “rip apart” the chain of services and insert another component, which adds or changes functionality, into the “stack.” In the old days, this was traditionally called “middleware,” but the concept has resurfaced in the context of RESTful web services, for example as Rack middleware, used notably in the Ruby on Rails framework.

HTTP is particularly well suited for such architectures, because its perceived shortcomings (lack of state, text-based representation, URI-centric semantics, …) turn into an advantage: a piece of “middleware” doesn’t have to accommodate anything specific in the “chain”, and just passes along status codes, headers, and bodies. In this sense, HTTP is functionally transparent – it doesn’t matter, for example, whether you fetch an image from the original web server or from a cache on a different continent. It’s still the same “resource.”

Caching is a prime example of this aspect of HTTP, already present in Roy Fielding’s seminal work on RESTful architectures. (For a thorough treatment of the subject, see Ryan Tomayko’s Things Caches Do and Mark Nottingham’s Caching Tutorial.)
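To see the mechanics caches rely on, consider this sketch of a conditional request (the URL and the ETag value are made up):

$ curl -i http://example.com/logo.png
# HTTP/1.1 200 OK
# ETag: "abc123"
# ...

$ curl -i -H 'If-None-Match: "abc123"' http://example.com/logo.png
# HTTP/1.1 304 Not Modified
# ...

The second response carries no body at all – any cache sitting between the client and the server can keep serving its stored copy of the image.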

Technically, the cache operates as a proxy here – it “stands for” some other component in the stack.

But proxies can do so much more. A good example is authentication and authorization: a proxy can intercept requests to a service, perform authentication and/or authorization routines, and either allow or deny access to the client.

This type of proxy is usually called a reverse proxy. The name makes sense when you consider that a traditional proxy “forwards” traffic from a local network to the remote network (the internet), which is reversed here, because the “reverse” proxy forwards requests from the internet to a “local” backend. Such a proxy could be implemented in a programming language like Java or Go, or on a platform like Node.js. Alternatively, we could use a configurable web server like Nginx.

Nginx

Nginx is an open source web server, originally written by Igor Sysoev, focused on high performance, concurrency, and a low memory footprint. (For a detailed technical overview, see the relevant chapter of The Architecture of Open Source Applications book.)

Nginx has been designed with a proxy role in mind from the start, and supports many related configuration directives and options. It is fairly common to run Nginx as a load balancer in front of Ruby on Rails or Django applications. Many large PHP applications even put Nginx in front of Apache running mod_php, to accelerate serving static content and to scale the application. Most parts of this article assume a standard Nginx installation, but the advanced parts rely on the Lua module for Nginx.

To run Nginx as a “100% transparent” proxy for Elasticsearch, we need a very minimal configuration (note that the events section is required for Nginx to start at all):

events {
  worker_connections 1024;
}

http {
  server {
    listen 8080;

    location / {
      proxy_pass http://localhost:9200;
    }
  }
}

When we execute a request to http://localhost:8080, we’ll get a response from Elasticsearch running on port 9200.

This proxy is of course quite useless — it just hands over data between the client and Elasticsearch; though astute readers might have guessed that it already adds something to the “stack,” namely the logging of every request.
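By default, Nginx writes every request to its access log. As a sketch, the standard log_format and access_log directives let us shape what gets recorded (the format name and log path are made up):

events {
  worker_connections 1024;
}

http {
  log_format elasticsearch '$remote_addr [$time_local] "$request" $status $request_time';

  server {
    listen 8080;

    location / {
      access_log logs/elasticsearch.log elasticsearch;
      proxy_pass http://localhost:9200;
    }
  }
}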

Use Cases

In this article, we’ll go through some of the more interesting use cases for Nginx as a reverse proxy for Elasticsearch.

Persistent HTTP Connections

Let’s start with a very simple example: using Nginx as a proxy which keeps persistent (“keep-alive”) connections to Elasticsearch. Why would we want to do that? The primary reason is to relieve Elasticsearch of the stress of opening and closing a connection for each request when using a client without support for persistent connections. Elasticsearch has many more responsibilities than just handling the networking, and opening and closing connections wastes valuable time and resources (such as the open files limit).

The full configuration is available, like all examples in this article, in this gist.

events {
    worker_connections  1024;
}

http {

  upstream elasticsearch {
    server 127.0.0.1:9200;

    keepalive 15;
  }

  server {
    listen 8080;

    location / {
      proxy_pass http://elasticsearch;
      proxy_http_version 1.1;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";
    }

  }

}

Let’s launch Nginx with this configuration:

$ nginx -p $PWD/nginx/ -c $PWD/nginx_keep_alive.conf

When you execute a request directly to Elasticsearch, you’ll notice that the number of opened connections is increasing all the time:

$ curl 'localhost:9200/_nodes/stats/http?pretty' | grep total_opened
# "total_opened" : 13
$ curl 'localhost:9200/_nodes/stats/http?pretty' | grep total_opened
# "total_opened" : 14
# ...

But it’s a completely different story when using Nginx — the number of opened connections stays the same:

$ curl 'localhost:8080/_nodes/stats/http?pretty' | grep total_opened
# "total_opened" : 15
$ curl 'localhost:8080/_nodes/stats/http?pretty' | grep total_opened
# "total_opened" : 15
# ...

Simple Load Balancer

With a very small change to the configuration, we can pass requests to multiple Elasticsearch nodes, and use Nginx as a light-weight load balancer:

events {
    worker_connections  1024;
}

http {

  upstream elasticsearch {
    server 127.0.0.1:9200;
    server 127.0.0.1:9201;
    server 127.0.0.1:9202;

    keepalive 15;
  }

  server {
    listen 8080;

    location / {
      proxy_pass http://elasticsearch;
      proxy_http_version 1.1;
      proxy_set_header Connection "Keep-Alive";
      proxy_set_header Proxy-Connection "Keep-Alive";
    }

  }

}

As you can see, we’ve added two additional nodes in the upstream directive. Nginx will now automatically distribute requests, in a round-robin fashion, across these servers, spreading the load on the Elasticsearch cluster evenly across the nodes:

$ curl localhost:8080 | grep name
# "name" : "Silver Fox",
$ curl localhost:8080 | grep name
# "name" : "G-Force",
$ curl localhost:8080 | grep name
# "name" : "Helleyes",
$ curl localhost:8080 | grep name
# "name" : "Silver Fox",
# ...

This is desirable behaviour, because it prevents hitting a single “hot” node, which would have to perform its regular node duties and also route all the traffic and perform all the other associated actions. To change the configuration, we just need to update the list of servers in the upstream directive, and send Nginx a reload signal.
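For example, to reload the instance we started earlier, point the signal at the same prefix and configuration file:

$ nginx -p $PWD/nginx/ -c $PWD/nginx_keep_alive.conf -s reload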

For more information about Nginx’s load balancing features, including different balancing strategies, setting “weights” for different nodes, health checking and live monitoring, please see the Load Balancing with NGINX and NGINX Plus article.
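As a sketch, tuning the balancing takes only a couple of lines in the upstream block (the least_conn directive and the weight parameter are standard Nginx options; the values below are arbitrary):

upstream elasticsearch {
  least_conn;                      # pick the node with the fewest active connections
  server 127.0.0.1:9200 weight=3;  # this node receives roughly three times more requests
  server 127.0.0.1:9201;
  server 127.0.0.1:9202;

  keepalive 15;
}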

(Note that the official Elasticsearch clients can perform such load balancing by themselves, with the ability to automatically reload the list of nodes in the cluster, retry a request on another node, etc.)

Basic Authentication

Let’s focus on another functionality: authentication and authorization. By default, Elasticsearch doesn’t prevent unauthorized access, because it’s not designed to run as an openly accessible service. If you allowed open access to port 9200, you’d be vulnerable to data theft, data loss, and even a full system compromise.

The usual way of protecting your Elasticsearch cluster is to restrict access via a VPN, firewall rules, AWS security groups, etc. What if you want or need to connect to the cluster from the outside, though, authenticating with a username and password?

Well, if we consider the proxy concept, as outlined above, that could work, right? We just need to intercept requests to Elasticsearch, authorize the client, and allow or deny access. Since Nginx supports basic access authentication out-of-the-box, it’s absolutely trivial to do it:

events {
  worker_connections  1024;
}

http {

  upstream elasticsearch {
    server 127.0.0.1:9200;
  }

  server {
    listen 8080;

    auth_basic "Protected Elasticsearch";
    auth_basic_user_file passwords;

    location / {
      proxy_pass http://elasticsearch;
      proxy_redirect off;
    }
  }

}

We can generate the passwords file with many utilities, for example with openssl:

$ printf "john:$(openssl passwd -crypt s3cr3t)n" > passwords

Let’s run Nginx with this configuration (don’t forget to shut down the Nginx process first):

$ nginx -p $PWD/nginx/ -c $PWD/nginx_http_auth_basic.conf

When we attempt to access the proxy without proper credentials, the request will be denied:

$ curl -i localhost:8080
# HTTP/1.1 401 Unauthorized
# ...

With proper credentials, though, the access is allowed:

$ curl -i john:s3cr3t@localhost:8080
# HTTP/1.1 200 OK
# ...

We can now restrict access to port 9200 to the local network (e.g. with firewall rules), leaving only port 8080 open to the outside. Any client accessing Elasticsearch has to know the correct credentials.
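For illustration only, a hypothetical set of iptables rules on Linux could enforce such a restriction (192.168.1.0/24 stands in for your local network):

$ iptables -A INPUT -p tcp --dport 9200 -i lo -j ACCEPT
$ iptables -A INPUT -p tcp --dport 9200 -s 192.168.1.0/24 -j ACCEPT
$ iptables -A INPUT -p tcp --dport 9200 -j DROP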

Simple Authorization

Having a secure way of accessing the cluster from the outside is certainly great, but you might have noticed that there’s no granularity when it comes to authorization – once access is allowed, the client can do whatever it wants in the cluster: change or delete the data, inspect internal statistics, even shut down the cluster.

A very simple way of authorizing the access would be to flat out deny requests to certain endpoints, so they’re allowed only from a client running on the local machine or network. We can change the location directive a little bit:

location / {
  if ($request_filename ~ _shutdown) {
    return 403;
    break;
  }

  proxy_pass http://elasticsearch;
  proxy_redirect off;
}

Let’s shut down Nginx and run it with the new configuration:

$ nginx -p $PWD/nginx/ -c $PWD/nginx_http_auth_deny_path.conf

When we attempt a request to the shutdown API now, it will be denied even with correct credentials:

$ curl -i -X POST john:s3cr3t@localhost:8080/_cluster/nodes/_shutdown
# HTTP/1.1 403 Forbidden
# ....

We can also flip the approach – let’s allow only certain endpoints, such as the administrative APIs, and deny access to anything else. We’ll distinguish between them with two separate location directives:

events {
  worker_connections  1024;
}

http {

  upstream elasticsearch {
    server 127.0.0.1:9200;
  }

  server {
    listen 8080;

    auth_basic "Protected Elasticsearch";
    auth_basic_user_file passwords;

    location ~* ^(/_cluster|/_nodes) {
      proxy_pass http://elasticsearch;
      proxy_redirect off;
    }

    location / {
      return 403;
      break;
    }
  }

}

Authenticated requests to the /_cluster and /_nodes APIs will be allowed, but anything else will be denied:

$ curl -i john:s3cr3t@localhost:8080/
# HTTP/1.1 403 Forbidden
# ...

$ curl -i john:s3cr3t@localhost:8080/_cluster/health
# HTTP/1.1 200 OK
# ...

$ curl -i john:s3cr3t@localhost:8080/_nodes/stats
# HTTP/1.1 200 OK
# ...

Selective Authorization

Let’s have a look at another authorization use case: we want to protect the Elasticsearch cluster with basic authentication, but still allow a HEAD request to / – called “ping” in the client libraries, e.g. for monitoring purposes.

This might sound easy, but it’s not trivial to express in an Nginx configuration: we need to combine two conditions (the request URL and the request method) in a single rule, and Nginx’s if statement doesn’t allow that. (Nginx even considers the if statement “evil”.)

So, what should we do? As it happens, we can creatively use two central pieces of Nginx configuration syntax: variables and custom error codes:

events {
  worker_connections  1024;
}

http {

  upstream elasticsearch {
    server 127.0.0.1:9200;
  }

  server {
    listen 8080;

    location / {
      error_page 590 = @elasticsearch;
      error_page 595 = @protected_elasticsearch;

      set $ok 0;

      if ($request_uri ~ ^/$) {
        set $ok "${ok}1";
      }

      if ($request_method = HEAD) {
        set $ok "${ok}2";
      }

      if ($ok = 012) {
        return 590;
      }

      return 595;
    }

    location @elasticsearch {
      proxy_pass http://elasticsearch;
      proxy_redirect off;
    }

    location @protected_elasticsearch {
      auth_basic           "Protected Elasticsearch";
      auth_basic_user_file passwords;

      proxy_pass http://elasticsearch;
      proxy_redirect off;
    }
  }

}

First, we define two custom status “error” codes: 590, for accessing Elasticsearch without credentials, and 595, for accessing it with basic authentication, just as we did up to this point. We use Nginx’s “named locations” feature to distinguish between these two – both point to the same cluster, but one of them requires authentication.

Then we set up a variable $ok, which has a default value of 0. When the incoming request URL matches / (i.e. the root path), we append 1 to it. When the request is performed via the HEAD method as well, we append 2. Clearly, when both of those conditions are satisfied, the resulting value of $ok is 012.

And that’s exactly what we check in the last if. In that case, we return the 590 status code — in other words, we allow the request to go through to Elasticsearch. In any other case, we require authentication:

$ curl -i -X HEAD localhost:8080
# HTTP/1.1 200 OK
# ...

$ curl -i localhost:8080
# HTTP/1.1 401 Unauthorized
# ...

$ curl -i john:s3cr3t@localhost:8080
# HTTP/1.1 200 OK
# ...

Multiple Roles for Authorization

Up until now, we had a pretty simple authorization scheme. What if we need a wider scheme, based on roles, though? Something like:

  • unauthenticated clients can only access the “ping” URL (HEAD /),
  • clients authenticated with the user role credentials can perform _search and _analyze requests,
  • clients authenticated with the admin role credentials can perform any request.

We will use a different approach here — we’ll create a separate virtual server for each role:

events {
  worker_connections  1024;
}

http {

  upstream elasticsearch {
      server 127.0.0.1:9200;
  }

  # Allow HEAD / for all
  #
  server {
      listen 8080;

      location / {
        return 401;
      }

      location = / {
        if ($request_method !~ "HEAD") {
          return 403;
          break;
        }

        proxy_pass http://elasticsearch;
        proxy_redirect off;
      }
  }

  # Allow access to /_search and /_analyze for authenticated "users"
  #
  server {
      listen 8081;

      auth_basic           "Elasticsearch Users";
      auth_basic_user_file users;

      location / {
        return 403;
      }

      location ~* ^(/_search|/_analyze) {
        proxy_pass http://elasticsearch;
        proxy_redirect off;
      }
  }

  # Allow access to anything for authenticated "admins"
  #
  server {
      listen 8082;

      auth_basic           "Elasticsearch Admins";
      auth_basic_user_file admins;

      location / {
        proxy_pass http://elasticsearch;
        proxy_redirect off;
      }
  }

}

We’ll generate the credentials with the openssl command again:

$ printf "user:$(openssl passwd -crypt user)n"   > users
$ printf "admin:$(openssl passwd -crypt admin)n" > admins

Now, everybody can “ping” the cluster, but nothing else:

$ curl -i -X HEAD localhost:8080
# HTTP/1.1 200 OK
$ curl -i -X GET localhost:8080
# HTTP/1.1 403 Forbidden

Authenticated users can access the search and analyze APIs, but nothing else:

$ curl -i localhost:8081/_search
# HTTP/1.1 401 Unauthorized
# ...

$ curl -i user:user@localhost:8081/_search
# HTTP/1.1 200 OK
# ...

$ curl -i user:user@localhost:8081/_analyze?text=Test
# HTTP/1.1 200 OK
# ...

$ curl -i user:user@localhost:8081/_cluster/health
# HTTP/1.1 403 Forbidden
# ...

Authenticated admins, of course, can access any API:

$ curl -i admin:admin@localhost:8082/_search
# HTTP/1.1 200 OK
# ...

$ curl -i admin:admin@localhost:8082/_cluster/health
# HTTP/1.1 200 OK
# ...

As you might have noticed, each role accesses the proxy on a different port: that is the price we have to pay for this solution. On the other hand, it should be quite easy to configure any application for a scheme like this, for instance by using different clients connected to different URLs. (Alternatively, we could have used the server_name directive to distinguish between the servers, and run them all on the same port.)
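As a sketch of that alternative, assuming the hypothetical hostnames below resolve to the proxy, the server blocks would differ only in their server_name (only two roles are shown; the third follows the same pattern):

events {
  worker_connections  1024;
}

http {

  upstream elasticsearch {
      server 127.0.0.1:9200;
  }

  # Allow HEAD / for everybody connecting to this hostname
  #
  server {
      listen 8080;
      server_name ping.example.com;

      location = / {
        if ($request_method !~ "HEAD") {
          return 403;
          break;
        }

        proxy_pass http://elasticsearch;
        proxy_redirect off;
      }

      location / {
        return 401;
      }
  }

  # Allow access to anything for authenticated "admins", on the same port
  #
  server {
      listen 8080;
      server_name admins.example.com;

      auth_basic           "Elasticsearch Admins";
      auth_basic_user_file admins;

      location / {
        proxy_pass http://elasticsearch;
        proxy_redirect off;
      }
  }

}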

Access Control List with Lua

We have been able to support reasonably complex, non-”Hello World” scenarios with Nginx as a proxy for Elasticsearch so far. On the other hand, even the last example is pretty simple, and supporting a more complex, fine-grained authorization scheme would be unwieldy – imagine all those possible server blocks…

So, what if we need to support a much more complex set of rules, such as allowing not only certain endpoints for certain roles, but also only certain methods for them? And what if we’d like to store this information in a more familiar format?

In the next configuration, we’ll use the Lua module for Nginx, so we can express the rules in actual code, in a more expressive way. We’ll use the package provided by the OpenResty project, which bundles not only Lua, but also a JSON parser, a Redis library, and many other useful Lua modules with Nginx: please see the installation instructions at the OpenResty site. On a Mac, you can use the Homebrew formula:

$ brew install https://raw.githubusercontent.com/Homebrew/homebrew-nginx/master/openresty.rb

The OpenResty bundle turns Nginx into a full-featured web application server, allowing us to rewrite locations with Lua code, use external databases, manipulate responses on the fly, make HTTP sub-requests, and much more. In our example, we’ll use the access_by_lua_file directive in concert with the regular HTTP basic authentication to allow or deny the client.
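Just for a taste of the bundle, here’s a minimal, self-contained sketch which serves a response straight from Lua, without any backend (the location and the message are made up):

events {
  worker_connections 1024;
}

http {
  server {
    listen 8080;

    location /hello {
      # Generate the response body directly in Lua
      content_by_lua 'ngx.say("Hello from Lua inside Nginx!")';
    }
  }
}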

The Nginx configuration itself is fairly simple:

error_log logs/lua.log notice;

events {
  worker_connections 1024;
}

http {
  upstream elasticsearch {
    server 127.0.0.1:9200;
  }

  server {
    listen 8080;

    location / {
      auth_basic           "Protected Elasticsearch";
      auth_basic_user_file passwords;

      access_by_lua_file '../authorize.lua';

      proxy_pass http://elasticsearch;
      proxy_redirect off;
    }

  }
}

The Lua code is of course a bit more complicated – consult the online version for the full source with comments, debug logging, etc.:

-- authorization rules

local restrictions = {
  all  = {
    ["^/$"]                             = { "HEAD" }
  },

  user = {
    ["^/$"]                             = { "GET" },
    ["^/?[^/]*/?[^/]*/_search"]         = { "GET", "POST" },
    ["^/?[^/]*/?[^/]*/_msearch"]        = { "GET", "POST" },
    ["^/?[^/]*/?[^/]*/_validate/query"] = { "GET", "POST" },
    ["/_aliases"]                       = { "GET" },
    ["/_cluster.*"]                     = { "GET" }
  },

  admin = {
    ["^/?[^/]*/?[^/]*/_bulk"]          = { "GET", "POST" },
    ["^/?[^/]*/?[^/]*/_refresh"]       = { "GET", "POST" },
    ["^/?[^/]*/?[^/]*/?[^/]*/_create"] = { "GET", "POST" },
    ["^/?[^/]*/?[^/]*/?[^/]*/_update"] = { "GET", "POST" },
    ["^/?[^/]*/?[^/]*/?.*"]            = { "GET", "POST", "PUT", "DELETE" },
    ["^/?[^/]*/?[^/]*$"]               = { "GET", "POST", "PUT", "DELETE" },
    ["/_aliases"]                      = { "GET", "POST" }
  }
}

-- get authenticated user as role
local role = ngx.var.remote_user

-- exit 403 when no matching role has been found
if restrictions[role] == nil then
  ngx.header.content_type = 'text/plain'
  ngx.status = 403
  ngx.say("403 Forbidden: You don't have access to this resource.")
  return ngx.exit(403)
end

-- get URL
local uri = ngx.var.uri

-- get method
local method = ngx.req.get_method()

local allowed = false

-- check the request against every rule defined for this role
for path, methods in pairs(restrictions[role]) do

  -- does the request URI match the path pattern of this rule?
  local p = string.match(uri, path)

  -- does the request method match one of the allowed methods?
  local m = nil
  for _, _method in pairs(methods) do
    m = m or string.match(method, _method)
  end

  if p and m then
    allowed = true
  end
end

if not allowed then
  ngx.header.content_type = 'text/plain'
  ngx.log(ngx.WARN, "Role ["..role.."] not allowed to access the resource ["..method.." "..uri.."]")
  ngx.status = 403
  ngx.say("403 Forbidden: You don't have access to this resource.")
  return ngx.exit(403)
end

As you can see, we’re storing the list of roles as a Lua table, with a nested table for each role. The incoming request’s method and URL are matched against the patterns for the role, and if they match, the request is allowed:

$ curl -i -X HEAD 'http://localhost:8080'
# HTTP/1.1 401 Unauthorized
# ...

$ curl -i -X HEAD 'http://all:all@localhost:8080'
# HTTP/1.1 200 OK
# ...

$ curl -i -X GET 'http://all:all@localhost:8080'
# HTTP/1.1 403 Forbidden
# ...

$ curl -i -X GET 'http://user:user@localhost:8080'
# HTTP/1.1 200 OK
# ...

$ curl -i -X POST 'http://user:user@localhost:8080/myindex/mytype/1' -d '{"title" : "Test"}'
# HTTP/1.1 403 Forbidden
# ...

$ curl -i -X DELETE 'http://user:user@localhost:8080/myindex/'
# HTTP/1.1 403 Forbidden
# ...

$ curl -i -X POST 'http://admin:admin@localhost:8080/myindex/mytype/1' -d '{"title" : "Test"}'
# HTTP/1.1 200 OK
# ...

$ curl -i -X DELETE 'http://admin:admin@localhost:8080/myindex/'
# HTTP/1.1 200 OK
# ...

Of course, the restrictions table could be much more complicated, the authentication could be provided by an OAuth token instead of HTTP basic authentication, etc., but the authorization mechanics would be the same.
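As a rough, hypothetical sketch of the token variant, the role lookup at the top of authorize.lua could be replaced with something along these lines (the tokens and their mapping to roles are made up):

-- a hypothetical token-to-role lookup, replacing HTTP basic authentication
local tokens = { ["f00ba4"] = "user", ["s3cr3t"] = "admin" }

-- extract a bearer token from the Authorization header
-- (ngx.req.get_headers() performs a case-insensitive lookup)
local header = ngx.req.get_headers()["Authorization"] or ""
local token  = string.match(header, "Bearer (%w+)")

-- resolve the token to a role, denying unknown clients
local role = token and tokens[token]

if role == nil then
  ngx.header.content_type = 'text/plain'
  ngx.status = 401
  ngx.say("401 Unauthorized: Please provide a valid token.")
  return ngx.exit(401)
end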

Conclusion

In this article, we’ve made great use of the fact that Elasticsearch is a service exposed via HTTP – we’ve added a significant number of features to Elasticsearch, without extending or modifying the software itself in any way.

We’ve seen how HTTP fits very well conceptually into the current paradigm of designing software architectures as independent, decoupled services, and how Nginx can be used as a high-performing, customizable proxy. Happy proxying!