webhdfs

This is a community-maintained plugin. It does not ship with Logstash by default, but you can install it by running bin/plugin install logstash-output-webhdfs.

This plugin sends Logstash events into files in HDFS via the webhdfs REST API.

Dependencies

This plugin has no dependency on jars from Hadoop, which reduces configuration and compatibility problems. It uses the webhdfs gem from Kazuki Ohta and TAGOMORI Satoshi (see https://github.com/kzk/webhdfs). The zlib and snappy gems are optional dependencies, needed only if you use the compression functionality.

Operational Notes

If you get an error like:

Max write retries reached. Exception: initialize: name or service not known {:level=>:error}

make sure that the hostname of your namenode is resolvable on the host running Logstash. When creating or appending to a file, webhdfs sometimes sends a 307 TEMPORARY_REDIRECT with the hostname of the machine it is running on, so that hostname must resolve as well.

Usage

This is an example of Logstash config:

input {
  ...
}
filter {
  ...
}
output {
  webhdfs {
    host => "127.0.0.1"                 # (required)
    port => 50070                       # (optional, default: 50070)
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"  # (required)
    user => "hue"                       # (required)
  }
}

Synopsis

This plugin supports the following configuration options:

Required configuration options:

webhdfs {
    host => ...
    path => ...
    user => ...
}

Available configuration options:

Setting             Input type                                 Required  Default value
codec               codec                                      No        "line"
compression         string, one of ["none", "snappy", "gzip"]  No        "none"
flush_size          number                                     No        500
host                string                                     Yes
idle_flush_time     number                                     No        1
message_format      string                                     No
open_timeout        number                                     No        30
path                string                                     Yes
port                number                                     No        50070
read_timeout        number                                     No        30
retry_interval      number                                     No        0.5
retry_known_errors  boolean                                    No        true
retry_times         number                                     No        5
snappy_bufsize      number                                     No        32768
snappy_format       string, one of ["stream", "file"]          No        "stream"
use_httpfs          boolean                                    No        false
user                string                                     Yes
workers             number                                     No        1

Details

codec

  • Value type is codec
  • Default value is "line"

The codec used for output data. Output codecs are a convenient method for encoding your data before it leaves the output, without needing a separate filter in your Logstash pipeline.

compression

  • Value can be any of: none, snappy, gzip
  • Default value is "none"

Compresses the output. One of none, snappy, or gzip.
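
As a rough illustration, gzip compression could be enabled as in the sketch below. The host name, user, and path are placeholder values, and the .gz suffix in the path is only a naming choice made for this example, not something the plugin is documented to require.

```
output {
  webhdfs {
    host => "namenode.example.org"   # placeholder host
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log.gz"
    user => "logstash"               # placeholder user
    compression => "gzip"            # one of "none", "snappy", "gzip"
  }
}
```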

exclude_tags (DEPRECATED)

  • DEPRECATED WARNING: This configuration item is deprecated and may not be available in future versions.
  • Value type is array
  • Default value is []

Only handle events without any of these tags. Optional.

flush_size

  • Value type is number
  • Default value is 500

Sends data to webhdfs once the number of buffered events exceeds this value, even if the idle_flush_time interval has not yet elapsed.

host

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The server name for webhdfs/httpfs connections.

idle_flush_time

  • Value type is number
  • Default value is 1

Sends data to webhdfs at intervals of this many seconds.
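
The two flush settings, flush_size and idle_flush_time, are typically tuned together. The sketch below is one possible combination: a flush is triggered when more than 1000 events are buffered or when 5 seconds have passed, whichever happens first. The host name, user, and both numeric values are illustrative assumptions, not recommendations.

```
output {
  webhdfs {
    host => "namenode.example.org"   # placeholder host
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"               # placeholder user
    flush_size => 1000               # flush when the event count goes above 1000
    idle_flush_time => 5             # otherwise flush every 5 seconds
  }
}
```

Larger values mean fewer, bigger append requests, at the cost of events staying in the buffer longer before they reach HDFS.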

message_format

  • Value type is string
  • There is no default value for this setting.

The format to use when writing events to the file. This value supports any string and can include %{name} and other dynamic strings.

If this setting is omitted, the full JSON representation of the event is written as a single line.
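
A minimal sketch of a custom message_format follows. It assumes the events carry host and message fields, which depends entirely on your inputs and filters; the host name and user in the connection settings are placeholders as well.

```
output {
  webhdfs {
    host => "namenode.example.org"   # placeholder host
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"               # placeholder user
    # Write only two fields per line instead of the full event.
    # The field names are assumptions about the event contents.
    message_format => "%{host} %{message}"
  }
}
```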

open_timeout

  • Value type is number
  • Default value is 30

The WebHDFS open timeout in seconds. Defaults to 30.

path

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The path to the file to write to. Event fields can be used here, as well as date fields in the Joda-Time format, for example: /user/logstash/dt=%{+YYYY-MM-dd}/%{@source_host}-%{+HH}.log

port

  • Value type is number
  • Default value is 50070

The server port for webhdfs/httpfs connections.

read_timeout

  • Value type is number
  • Default value is 30

The WebHDFS read timeout in seconds. Defaults to 30.

retry_interval

  • Value type is number
  • Default value is 0.5

How long to wait between retries.

retry_known_errors

  • Value type is boolean
  • Default value is true

Retry some known webhdfs errors. These may be caused by, for example, race conditions when appending to the same file.

retry_times

  • Value type is number
  • Default value is 5

How many times to retry. If retry_times is exceeded, an error is logged and the event is discarded.
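
The three retry settings, retry_known_errors, retry_interval, and retry_times, can be combined as in the following sketch. The values are illustrative assumptions for a setup where losing events is more costly than writing slowly, and the host name and user are placeholders.

```
output {
  webhdfs {
    host => "namenode.example.org"   # placeholder host
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"               # placeholder user
    retry_known_errors => true       # retry known webhdfs errors (the default)
    retry_interval => 1              # wait longer between retries than the default 0.5
    retry_times => 10                # retry more often than the default 5
  }
}
```

Events are still discarded once retry_times is exceeded, so a higher value only widens the window in which a transient error can recover.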

snappy_bufsize

  • Value type is number
  • Default value is 32768

Sets the snappy chunk size. Only necessary for the stream format. Defaults to 32k; the maximum is 65536 (see http://code.google.com/p/snappy/source/browse/trunk/framing_format.txt).

snappy_format

  • Value can be any of: stream, file
  • Default value is "stream"

Sets the snappy format. One of "stream" or "file". Set it to stream to be Hive compatible.
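
The snappy-related settings might be combined as in this sketch, which requires the optional snappy gem to be installed. The host name and user are placeholders, and the .snappy suffix in the path is a naming choice made for this example only.

```
output {
  webhdfs {
    host => "namenode.example.org"   # placeholder host
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log.snappy"
    user => "logstash"               # placeholder user
    compression => "snappy"
    snappy_format => "stream"        # stream format, for Hive compatibility
    snappy_bufsize => 32768          # chunk size; must not exceed 65536
  }
}
```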

tags (DEPRECATED)

  • DEPRECATED WARNING: This configuration item is deprecated and may not be available in future versions.
  • Value type is array
  • Default value is []

Only handle events with all of these tags. Optional.

type (DEPRECATED)

  • DEPRECATED WARNING: This configuration item is deprecated and may not be available in future versions.
  • Value type is string
  • Default value is ""

The type to act on. If a type is given, then this output will only act on messages with the same type. See any input plugin’s type attribute for more. Optional.

use_httpfs

  • Value type is boolean
  • Default value is false

Use httpfs mode if set to true, else webhdfs.
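
When writing through an HttpFS gateway rather than directly to the namenode, the configuration might look like the sketch below. The gateway host name and user are placeholders, and port 14000 is assumed here because it is the usual HttpFS default; check which port your gateway actually listens on.

```
output {
  webhdfs {
    host => "httpfs.example.org"     # placeholder HttpFS gateway host
    port => 14000                    # assumed HttpFS port, verify for your cluster
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"               # placeholder user
    use_httpfs => true               # use httpfs mode instead of webhdfs
  }
}
```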

user

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The username for webhdfs.

workers

  • Value type is number
  • Default value is 1

The number of workers to use for this output. Note that this setting may not be useful for all outputs.