Web crawler API reference

App Search provides API operations for the web crawler. This document provides a reference for each API operation, beginning with concerns shared across all operations.

Shared concerns

All web crawler API operations share the following concerns.

Enterprise Search base URL

This is the base URL for your Enterprise Search deployment.

In the API curl examples, this is represented by the shell variable ${ENTERPRISE_SEARCH_BASE_URL}.

Engine

Most endpoints within the crawler API are scoped to a particular App Search engine. The engine is identified by the engine name value provided in the URL of the request. If the engine cannot be found, the API returns an empty HTTP 404 response.

In the API curl examples, the name of your engine is represented by the shell variable ${ENGINE_NAME}.

Access

Users of this API have three choices for access:

  • Basic authentication with a user and password
  • Elasticsearch token
  • App Search key

All curl examples in this API reference assume you are using either an Elasticsearch token or an App Search key, and that credential is represented by the shell variable ${TOKEN}. See Authentication.

Throughout this reference, these values are written as shell variables in the curl examples for easy substitution.

For example:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"
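
The same conventions can be sketched in Python. The helper names below (crawler_endpoint, auth_headers) are hypothetical, introduced only to mirror the shell variables used throughout this reference:

```python
def crawler_endpoint(base_url: str, engine_name: str, path: str = "") -> str:
    """Build an engine-scoped crawler API URL.

    base_url and engine_name correspond to ${ENTERPRISE_SEARCH_BASE_URL}
    and ${ENGINE_NAME} in the curl examples."""
    root = f"{base_url.rstrip('/')}/api/as/v1/engines/{engine_name}/crawler"
    return f"{root}/{path.lstrip('/')}" if path else root


def auth_headers(token: str) -> dict:
    """Headers sent with every request in this reference."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
```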

Crawler

Responds with domain objects configured for an engine.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler

Examples

A GET request to return the domain objects configured for an engine:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response:

# 200 OK
{
  "domains": [
    {
      "id": "{DOMAIN_ID}",
      "name": "{DOMAIN_NAME}",
      "document_count": 0,
      "entry_points": [
        {
          "id": "6087cec06dda9bdfb4a49e39",
          "value": "/"
        }
      ],
      "crawl_rules": [],
      "default_crawl_rule": {
        "id": "-",
        "order": 0,
        "policy": "allow",
        "rule": "regex",
        "pattern": ".*"
      },
      "sitemaps": []
    }
  ]
}

Crawl requests

Each crawl performed by the Enterprise Search web crawler has an associated crawl request object. The crawl requests API allows operators to create new crawl requests and to view and control the state of existing crawl requests.

Get current active crawl request

Returns a crawl request object for an active crawl or returns an HTTP 404 response if there is no active crawl for an engine:

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_requests/active

Examples

A GET request to return current active crawl information for an engine:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests/active" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response:

# 200 OK
{
  "id": "601b21adbeae67679b3b760a",
  "status": "running",
  "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
  "begun_at": "Wed, 03 Feb 2021 22:20:31 +0000",
  "completed_at": null
}

For cases when there is no active crawl for a given engine, the API responds with a 404 error:

# 404 Not Found
{
  "error": "There are no active crawl requests for this engine"
}

Cancel an active crawl

Cancels an active crawl for an engine or returns an HTTP 404 response if there is no active crawl for an engine.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_requests/active/cancel

It may take some time for the crawler to detect the cancellation request and gracefully stop the crawl. During this time, the status of the crawl request will remain canceling.
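
Because cancellation is asynchronous, a client typically polls the active-crawl endpoint until it returns 404 (no active crawl). A minimal polling sketch, with the HTTP call abstracted behind a hypothetical fetch_active callable:

```python
import time


def wait_for_cancel(fetch_active, poll_interval=2.0, timeout=60.0, sleep=time.sleep):
    """Poll until the active crawl disappears or the timeout elapses.

    fetch_active: callable returning the active crawl request dict,
    or None once the API responds 404 (no active crawl).
    Returns True if the crawl stopped within the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_active() is None:
            return True
        sleep(poll_interval)
    return False
```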

Examples

A POST request to cancel an active crawl for an engine:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests/active/cancel" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response returns a single crawl request object with a canceling state:

# 200 OK
{
  "id": "601b21adbeae67679b3b760a",
  "status": "canceling",
  "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
  "begun_at": "Wed, 03 Feb 2021 22:20:31 +0000",
  "completed_at": null
}

For cases when there is no active crawl for a given engine, the API responds with a 404 error:

# 404 Not Found
{
  "error": "There are no active crawl requests for this engine"
}

List crawl requests

Returns a list of crawl requests for a given engine.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_requests

page[current] (optional)
Current page number (default: 1).
page[size] (optional)
Page size (default: 25). The maximum is 100; larger values are truncated to 100.
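
These pagination parameters can be built client-side; a small sketch (page_params is a hypothetical helper) that also clamps the size to the documented maximum:

```python
def page_params(current: int = 1, size: int = 25) -> dict:
    """Build page[current]/page[size] query parameters.

    The API truncates sizes above 100, so clamp client-side for clarity."""
    return {
        "page[current]": max(1, current),
        "page[size]": min(max(1, size), 100),
    }
```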

Examples

A GET request to return a list of crawl requests for an engine:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response:

# 200 OK
{
  "meta": {
    "page": {
      "current": 1,
      "total_pages": 1,
      "total_results": 3,
      "size": 25
    }
  },
  "results": [
    {
      "id": "601b21adbeae67679b3b760a",
      "status": "running",
      "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
      "begun_at": "Wed, 03 Feb 2021 22:20:31 +0000",
      "completed_at": null
    },
    {
      "id": "60147e93beae67bf7ef72e86",
      "status": "success",
      "created_at": "Fri, 29 Jan 2021 21:30:59 +0000",
      "begun_at": "Fri, 29 Jan 2021 21:31:00 +0000",
      "completed_at": "Fri, 29 Jan 2021 21:35:20 +0000"
    },
    {
      "id": "60146c07beae67f397300128",
      "status": "canceled",
      "created_at": "Fri, 29 Jan 2021 20:11:51 +0000",
      "begun_at": "Fri, 29 Jan 2021 20:11:52 +0000",
      "completed_at": "Fri, 29 Jan 2021 20:12:51 +0000"
    }
  ]
}

Create a new crawl request

Requests a new crawl for an engine. If there is already an active crawl, the request returns an HTTP 400 response with an error message.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_requests

Examples

A POST request to start a new crawl for an engine:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

If the request is successful, the response contains a single crawl request object with a pending state:

# 200 OK
{
  "id": "601b21adbeae67679b3b760a",
  "status": "pending",
  "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
  "begun_at": null,
  "completed_at": null
}

If there is already an active crawl, the API returns an HTTP 400 response:

# 400 Bad Request
{
  "error": "There is an active crawl for the engine \"your-engine\", please wait for it to finish or abort it before requesting another one"
}

Create a new partial crawl request

Certain properties of a crawl can be overridden at request time, allowing the user to request a crawl against a specified subset of their corpus. Such a crawl is considered a partial crawl.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_requests

overrides (optional)

Allows the user to override a subset of crawl configuration for the requested crawl.

max_crawl_depth (optional)
Maximum depth to follow links while discovering new content.
domain_allowlist (optional)
Array of domain names for restricting which links to follow.
seed_urls (optional)
Array of initial URLs to crawl. Defaults to the configured entrypoints for each crawler domain.
sitemap_urls (optional)
Array of sitemap URLs to be used for content discovery.
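
A payload builder can guard against typos in override keys before the request is sent. A sketch with a hypothetical partial_crawl_body helper, assuming the four overrides listed above are the full set:

```python
ALLOWED_OVERRIDES = {"max_crawl_depth", "domain_allowlist", "seed_urls", "sitemap_urls"}


def partial_crawl_body(**overrides) -> dict:
    """Build the request body for a partial crawl, rejecting unknown override keys."""
    unknown = set(overrides) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unsupported overrides: {sorted(unknown)}")
    return {"overrides": overrides}
```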

Examples

A POST request to start a partial crawl for an engine, setting max_crawl_depth to 2 and specifying two initial URLs to crawl:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "overrides": {
    "max_crawl_depth": 2,
    "seed_urls": ["https://www.elastic.co/blog", "https://www.elastic.co/docs"]
  }
}'

A POST request to start a partial crawl for an engine, passing a domain_allowlist array:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "overrides": {
    "domain_allowlist": ["http://www.elastic.co", "https://www.example.com"]
  }
}'

A POST request to start a partial crawl for an engine, setting max_crawl_depth to 2 and passing an array of sitemap URLs to be used for content discovery:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "overrides": {
    "max_crawl_depth": 2,
    "sitemap_urls": ["http://www.elastic.co/sitemap.xml", "https://www.example.com/sitemap.xml"]
  }
}'

A successful response:

# 200 OK
{
  "id":"6275340b23bd23196eb41a29",
  "type":"partial",
  "status":"pending",
  "created_at":"2022-05-06T11:18:34Z",
  "begun_at":null,
  "completed_at":null
}

View details for a crawl request

Returns details of a given crawl request. The crawl request is identified with a unique Crawl Request ID value.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_requests/<crawl_request_id>

Examples

A GET request to return the details of a given crawl request, identified by its Crawl Request ID value.

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_requests/{CRAWL_REQUEST_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response:

# 200 OK
{
  "id": "60147e93beae67bf7ef72e86",
  "status": "success",
  "created_at": "Fri, 29 Jan 2021 21:30:59 +0000",
  "begun_at": "Fri, 29 Jan 2021 21:31:00 +0000",
  "completed_at": "Fri, 29 Jan 2021 21:35:20 +0000"
}

Crawl schedules

Each engine using the Enterprise Search web crawler has an associated crawl schedule object. The crawl schedule API allows operators to specify a frequency at which new crawls will be started. If there is an active crawl, new crawls will be skipped.

Get current crawl schedule

Returns a crawl schedule object or returns an HTTP 404 response if there is no crawl schedule object for an engine:

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_schedule

Examples

A GET request to return the crawl schedule for an engine:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_schedule" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response:

# 200 OK
{
  "engine": "{ENGINE_NAME}",
  "frequency": 2,
  "unit": "week"
}

For cases when there is no crawl schedule for a given engine, the API responds with a 404 error:

# 404 Not Found
{
  "errors": ["No crawl schedule found"]
}

Create or update a crawl schedule

Upserts a crawl schedule for an engine:

PUT <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_schedule

frequency (required)
A positive integer.
unit (required)
Should be one of: hour, day, week, month.
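
These constraints can be checked client-side before issuing the PUT; a sketch with a hypothetical crawl_schedule_body helper:

```python
VALID_UNITS = ("hour", "day", "week", "month")


def crawl_schedule_body(frequency: int, unit: str) -> dict:
    """Validate and build the PUT body for a crawl schedule."""
    if not isinstance(frequency, int) or frequency < 1:
        raise ValueError("frequency must be a positive integer")
    if unit not in VALID_UNITS:
        raise ValueError(f"unit must be one of {VALID_UNITS}")
    return {"frequency": frequency, "unit": unit}
```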

Examples

A PUT request to create or update a crawl schedule for an engine, setting the frequency to 2 and the unit to week.

curl \
--request "PUT" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_schedule" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "frequency": 2,
  "unit": "week"
}'

A successful response contains the crawl schedule object:

# 200 OK
{
  "engine": "{ENGINE_NAME}",
  "frequency": 2,
  "unit": "week"
}

When the parameters are invalid, the API returns an HTTP 400 response:

# 400 Bad Request
{
  "errors": [
    "Crawl schedule frequency must be an integer",
    "Crawl schedule unit must be one of hour, day, week, month"
  ]
}

Delete a crawl schedule

Deletes a crawl schedule for an engine:

DELETE <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/crawl_schedule

Examples

A DELETE request to delete a crawl schedule for an engine:

curl \
--request "DELETE" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/crawl_schedule" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response contains an object showing the crawl schedule has been deleted:

# 200 OK
{
  "deleted": true
}

For cases when there is no crawl schedule for a given engine, the API responds with a 404 error:

# 404 Not Found
{
  "errors": ["No crawl schedule found"]
}

Process crawls

A process crawl is an operation that re-processes the documents in your engine using the current crawl rules, without waiting for a full re-crawl. See Process crawl.

Use the following operations to manage process crawls for a given engine.

List process crawls

Returns a list of process crawls for the given engine.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/process_crawls

page[current] (optional)
Current page number (default: 1).
page[size] (optional)
Page size (default: 25). The maximum is 100; larger values are truncated to 100.

Examples

A GET request to return the process crawls for an engine, specifying a page size of 25 (the page[size] parameter is passed as a URL-encoded query parameter):

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/process_crawls?page%5Bsize%5D=25" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful request returns a list of process crawls for the given engine:

# 200 OK
{
  "meta": {
    "page": {
      "current": 1,
      "total_pages": 1,
      "total_results": 3,
      "size": 25
    }
  },
  "results": [
    {
      "id": "61421fe6e725695876d72d2e",
      "dry_run": true,
      "total_url_count": 167,
      "denied_url_count": 92,
      "domains": [
        "https://swiftype.com"
      ],
      "process_all_domains": false,
      "created_at": "2021-09-15T16:31:34Z",
      "begun_at": "2021-09-15T16:31:35Z",
      "completed_at": "2021-09-15T16:31:52Z"
    },
    {
      "id": "61421e15e72569a22ad6f549",
      "dry_run": false,
      "total_url_count": 7793,
      "denied_url_count": 1028,
      "domains": [
        "https://swiftype.com",
        "https://www.elastic.co"
      ],
      "process_all_domains": true,
      "created_at": "2021-09-15T16:23:49Z",
      "begun_at": "2021-09-15T16:23:49Z",
      "completed_at": "2021-09-15T16:25:48Z"
    },
    {
      "id": "61421b93e725694296d664f0",
      "dry_run": false,
      "total_url_count": 8525,
      "denied_url_count": 930,
      "domains": [
        "https://www.elastic.co"
      ],
      "process_all_domains": true,
      "created_at": "2021-09-15T16:13:07Z",
      "begun_at": "2021-09-15T16:13:07Z",
      "completed_at": "2021-09-15T16:15:29Z"
    }
  ]
}

View details for a process crawl

View the details of the process crawl with the given ID.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/process_crawls/<process_crawl_id>

Examples

A GET request to view a process crawl for a given engine, identified by the process crawl ID.

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/process_crawls/{PROCESS_CRAWL_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response includes summary information of the process crawl, identifying the total number of URLs that were re-processed along with the number of URLs identified for deletion. For example:

# 200 OK
{
  "id": "61421fe6e725695876d72d2e",
  "dry_run": true,
  "total_url_count": 167,
  "denied_url_count": 92,
  "domains": [
    "https://swiftype.com"
  ],
  "process_all_domains": false,
  "created_at": "2021-09-15T16:31:34Z",
  "begun_at": "2021-09-15T16:31:35Z",
  "completed_at": "2021-09-15T16:31:52Z"
}

View denied URLs for a process crawl

View a sample of 100 URLs identified for deletion by a given process crawl. This API complements the dry run capability of a process crawl to test configuration before committing to deletion. See Create a new process crawl.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/process_crawls/<process_crawl_id>/denied_urls

Examples

A GET request to view a sample of denied URLs for a given process crawl.

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/process_crawls/{PROCESS_CRAWL_ID}/denied_urls" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response looks like this:

# 200 OK
{
  "total_url_count": 167,
  "denied_url_count": 92,
  "sample_size": 100,
  "denied_urls_sample": [
    "https://swiftype.com/documentation",
    "https://swiftype.com/documentation/site-search/site_search",
    "https://swiftype.com/documentation/site-search/guides/crawler-optimization",
    "https://swiftype.com/documentation/site-search/guides/multiple-domains",
    "https://swiftype.com/documentation/site-search/guides/engine-cloning",
    ...
  ]
}

The denied URLs API reads from event logs generated during the process crawl. If the logs for a given process crawl have been deleted or are otherwise unavailable, an empty array is returned.

{
  "total_url_count": 167,
  "denied_url_count": 92,
  "sample_size": 100,
  "denied_urls_sample": []
}
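
Given either response shape, the denial rate is simply denied_url_count over total_url_count; a hypothetical helper:

```python
def denial_rate(response: dict) -> float:
    """Fraction of processed URLs that the current configuration would deny."""
    total = response["total_url_count"]
    return response["denied_url_count"] / total if total else 0.0
```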

Create a new process crawl

Request a new process crawl for the given engine.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/process_crawls

dry_run (optional)
If true, the process crawl identifies but does not delete documents that are no longer permitted by the current crawl configuration. Defaults to false.
domains (optional)
Limits the process crawl to a subset of crawler domains. By default, the process crawl will apply to all crawler domains on the engine.
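
A hypothetical builder for the request body, reflecting that omitting domains defaults to all configured crawler domains:

```python
def process_crawl_body(dry_run: bool = False, domains=None) -> dict:
    """Build the POST body for a process crawl.

    Omitting `domains` lets the server default to all crawler domains
    configured on the engine."""
    body = {"dry_run": dry_run}
    if domains is not None:
        body["domains"] = list(domains)
    return body
```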

Examples

A POST request to create a new process crawl for the given engine, with the dry_run parameter set to true:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/process_crawls" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "dry_run": true
}'

A successful response includes the ID of the new process crawl.

# 200 OK

{
  "id": "61421b93e725694296d664f0",
  "dry_run": true,
  "total_url_count": 0,
  "denied_url_count": 0,
  "domains": [
    "https://www.elastic.co"
  ],
  "process_all_domains": true,
  "created_at": "2021-09-15T16:13:07Z",
  "begun_at": null,
  "completed_at": null
}

URL validation and debugging

The web crawler provides several API operations to help you troubleshoot crawls. Use the following operations to validate and debug specific URLs from the perspective of the web crawler.

Validate a domain

Validate the given domain from the perspective of the web crawler.

Optionally, use the checks parameter to choose the specific validations to perform. The response includes an ordered set of results. Each provides the detailed results of a specific check. If any checks fail, the crawler skips remaining checks.

Use this operation to validate a domain before adding it to the crawl configuration for an engine, or to troubleshoot issues crawling a specific domain. This operation visits the domain as the web crawler, without starting a full crawl. The results reflect how the web crawler sees that domain now; they do not represent the state of previous crawls. For historical information, see Trace a URL.

You can use this operation to identify issues with a domain. The domain owner can fix the issues, and you can verify the fixes. Repeat this process until all checks pass. This process is more convenient than requesting a new crawl to confirm the fixes.

POST <enterprise_search_base_url>/api/as/v1/crawler/validate_url

url (required)
The URL of the domain to validate.
checks (optional)

Checks to perform. Array including any of the following supported values. To run additional checks, validate a specific page within the context of an engine. See Validate a URL.

  • dns
  • robots_txt
  • tcp
  • url
  • url_content
  • url_request

Examples

A POST request to validate a domain, passing the parameters url: https://www.elastic.co and checks: ["url", "tcp", "url_request"]:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/crawler/validate_url" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "url": "https://www.elastic.co",
  "checks": ["url", "tcp", "url_request"]
}'

A successful response includes the results of the checks:

# 200 OK
{
  "url": "https://www.elastic.co",
  "normalized_url": "https://www.elastic.co/",
  "valid": true,
  "results": [
    {
      "result": "ok",
      "name": "url",
      "details": {},
      "comment": "URL structure looks valid"
    },
    {
      "result": "ok",
      "name": "tcp",
      "details": {
        "host": "www.elastic.co",
        "port": 443
      },
      "comment": "TCP connection successful"
    },
    {
      "result": "ok",
      "name": "url_request",
      "details": {
        "status_code": 200,
        "content_type": "text/html; charset=utf-8",
        "request_time_msec": 166
      },
      "comment": "Successfully fetched https://www.elastic.co: HTTP 200."
    }
  ]
}
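
A client can scan the results array for failing checks; a sketch with a hypothetical failed_checks helper (recall that the crawler skips remaining checks after the first failure, so at most one failure appears per response):

```python
def failed_checks(response: dict) -> list:
    """Return the names of validation checks that did not pass."""
    return [r["name"] for r in response.get("results", []) if r["result"] != "ok"]
```
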

Validate a URL

Validate the given URL from the perspective of the web crawler. The given URL is assumed to be a web page or other web content. To validate a domain, see Validate a domain.

Optionally, use the checks parameter to choose the specific validations to perform. The response includes an ordered set of results. Each provides the detailed results of a specific check. If any checks fail, the crawler skips remaining checks.

Use this operation to troubleshoot issues crawling web content at a specific URL. This operation visits the URL as the web crawler, without starting a full crawl. The results reflect how the web crawler sees that URL now; they do not represent the state of previous crawls. For historical information, see Trace a URL.

You can use this operation to identify issues with a specific URL. The domain owner can fix the issues, and you can verify the fixes. Repeat this process until all checks pass. After fixing all issues, start a new crawl to discover and extract the content at the URL. See Create a new crawl request.

To troubleshoot content extraction issues at a specific URL, see Extract content from a URL. The url_content check for this operation confirms that content was extracted, but it does not include the content within the response.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/validate_url

url (required)
The URL to validate.
checks (optional)

Checks to perform. Array including any of the following supported values. This operation offers more checks than the operation to validate a domain, since this operation runs in the context of a specific engine.

  • crawl_rules
  • domain_access
  • dns
  • robots_txt
  • tcp
  • url
  • url_content
  • url_request

Examples

A POST request to validate a URL, passing a url and checks: ["url", "domain_access", "crawl_rules"]:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/validate_url" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
  "url": "https://www.elastic.co",
  "checks": ["url", "domain_access", "crawl_rules"]
}'

A successful response includes the results of the checks:

# 200 OK
{
  "url": "https://www.elastic.co",
  "normalized_url": "https://www.elastic.co/",
  "valid": true,
  "results": [
    {
      "result": "ok",
      "name": "url",
      "details": {},
      "comment": "URL structure looks valid"
    },
    {
      "result": "ok",
      "name": "domain_access",
      "details": {
        "domain": "https://www.elastic.co"
      },
      "comment": "The URL matches one of the domains configured for the engine"
    },
    {
      "result": "ok",
      "name": "crawl_rules",
      "details": {
        "rule": "domain=https://www.elastic.co, default allow all rule"
      },
      "comment": "The URL is allowed by one of the crawl rules"
    }
  ]
}

Extract content from a URL

Use the web crawler to extract content from the given URL.

Use this operation to troubleshoot issues with content extraction at a specific URL. For general URL validation, see Validate a URL.

This operation visits the URL as the web crawler, without starting a full crawl, and extracts content. The results reflect the current crawl configuration for the engine, and reveal how the web crawler sees the content at that URL right now. However, if the content is already indexed, the response includes that information as well.

This operation does not make changes to any documents in the engine. Use this operation to identify and fix content extraction issues at a specific URL. The content owner can fix the issues, and you can verify the fixes. Repeat this process until the desired content is extracted. After fixing all issues, start a new crawl to discover and extract the content at the URL. See Create a new crawl request.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/extract_url

url (required)
The URL to crawl to extract content.

Examples

A POST request to use the web crawler to extract content from a URL:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/extract_url" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{ "url": "https://www.elastic.co" }'

A successful response:

# 200 OK
{
  "url": "https://www.example.com",
  "normalized_url": "https://www.example.com/",
  "results": {
    "download": {
      "status_code": 200
    },
    "extraction": {
      "content_hash": "fb38982491c4a9377f8cf0c57e75e067bca65daf",
      "content_hash_fields": [
        "title",
        "body_content",
        "meta_keywords",
        "meta_description",
        "links",
        "headings"
      ],
      "content_fields": {
        "title": "Example Domain",
        "body_content": "Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information...",
        "links": [
          "https://www.iana.org/domains/example"
        ],
        "headings": [
          "Example Domain"
        ]
      }
    },
    "indexing": {
      "document_id": null,
      "document_fields": null
    },
    "deduplication": {
      "urls_count": 0,
      "urls_sample": []
    }
  }
}

Details for selected response fields:

results.deduplication
The total count of URLs with this same content and a sample of those URLs. See Duplicate document handling.
results.extraction.content_fields
The content fields extracted by the crawler, that is, what the crawler will extract during the next crawl.
results.extraction.content_hash
The hash used to uniquely identify this content. See Duplicate document handling.
results.extraction.content_hash_fields
The document schema fields used to create the content hash. See Duplicate document handling.
results.indexing
The fields that are currently indexed, that is, what the crawler extracted during a previous crawl (if any).
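
Putting these fields together, a hypothetical extraction_summary helper can distinguish what is indexed now from what the next crawl would extract:

```python
def extraction_summary(response: dict) -> dict:
    """Summarize an extract_url response: whether the page is currently
    indexed, which fields the crawler would extract on the next crawl,
    and how many other URLs share the same content."""
    results = response["results"]
    return {
        "indexed": results["indexing"]["document_id"] is not None,
        "next_crawl_fields": sorted(results["extraction"]["content_fields"]),
        "duplicates": results["deduplication"]["urls_count"],
    }
```
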

Trace a URL

Trace the recent history of the given URL from the perspective of the web crawler. Determine if the web crawler saw the URL, how it discovered it, and other events specific to that URL. Use the response to see the history of a specific URL from the perspective of the web crawler, and debug any issues crawling the URL.

This operation provides a view of the web crawler events logs for a specific engine and URL. The response returns an array of recent crawl requests, where each indicates if the URL was found. This is followed by specific crawler events, grouped by type. For details on each crawler event, see Web crawler events logs reference.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/trace_url

url (required)
The URL to trace.

Examples

A POST request to use the web crawler to trace a URL:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/trace_url" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data '{ "url": "https://www.elastic.co/blog" }'

A successful response:

# 200 OK
{
  "url": "https://www.elastic.co/blog",
  "normalized_url": "https://www.elastic.co/blog",
  "crawl_requests": [
    {
      "crawl_request": {
        "id": "61240b9d2a02b14c1df12f98",
        "status": "running",
        "created_at": "2021-08-23T20:57:01Z",
        "begun_at": "2021-08-23T20:57:02Z",
        "completed_at": null
      },
      "found": true,
      "discover": [
        {
          "timestamp": "Mon, 23 Aug 2021 20:57:02 +0000",
          "event_id": "61240b9e2a02b11895f12f9e",
          "message": null,
          "event_type": "allowed",
          "deny_reason": null,
          "crawl_depth": 1,
          "source_url": null
        }
      ],
      "seed": {
        "timestamp": "Mon, 23 Aug 2021 20:57:02 +0000",
        "event_id": "61240b9e2a02b11895f12f9f",
        "message": null,
        "url_type": "content",
        "source_type": "seed-list",
        "source_url": null
      },
      "fetch": {
        "timestamp": "Mon, 23 Aug 2021 20:57:03 +0000",
        "event_id": "61240b9f2a02b1f7caf12fab",
        "message": null,
        "event_outcome": "failure",
        "duration_msec": 152.537107,
        "http_response": {
          "status_code": "301",
          "body_bytes": null
        },
        "redirect": {
          "location": "https://www.elastic.co/blog/",
          "chain": [],
          "count": 1
        }
      },
      "output": {
        "timestamp": "Mon, 23 Aug 2021 20:57:03 +0000",
        "event_id": "61240b9f2a02b1f7caf12fae",
        "message": "Successfully processed the 301 response",
        "event_outcome": "success",
        "event_module": "app_search",
        "duration_msec": 36.45282,
        "engine": {
          "id": "61240b592a02b19de7f12f8e",
          "name": "elastic-blog"
        },
        "document": {
          "id": null
        }
      }
    }
  ]
}

If the given URL was not seen during the recent crawls, the response will look like this:

# 200 OK
{
  "url": "https://github.com",
  "normalized_url": "https://github.com/",
  "crawl_requests": [
    {
      "crawl_request": {
        "id": "61240b9d2a02b14c1df12f98",
        "status": "canceled",
        "created_at": "2021-08-23T20:57:01Z",
        "begun_at": "2021-08-23T20:57:02Z",
        "completed_at": "2021-08-23T21:04:39Z"
      },
      "found": false,
      "discover": [],
      "seed": null,
      "fetch": null,
      "output": null
    }
  ]
}
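The `found` field on each crawl request indicates whether the URL was encountered during that crawl. As an illustration only (this helper is hypothetical, not part of any App Search client library), a few lines of Python can filter a parsed tracing response down to the crawls that actually saw the URL:

```python
def crawls_that_found_url(tracing_response):
    """Return the IDs of crawl requests in which the URL was encountered.

    Hypothetical convenience helper operating on a parsed (dict) tracing
    response shaped like the examples above.
    """
    return [
        entry["crawl_request"]["id"]
        for entry in tracing_response.get("crawl_requests", [])
        if entry.get("found")
    ]

# Sample shaped like the responses above.
sample = {
    "url": "https://github.com",
    "crawl_requests": [
        {"crawl_request": {"id": "61240b9d2a02b14c1df12f98"}, "found": False},
        {"crawl_request": {"id": "61240b9e2a02b11895f12f9e"}, "found": True},
    ],
}

print(crawls_that_found_url(sample))  # ['61240b9e2a02b11895f12f9e']
```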

User agentedit

Responds with the User-Agent header used by the crawler.

GET <enterprise_search_base_url>/api/as/v1/crawler/user_agent

Examplesedit

A GET request to return the User-Agent header used by the crawler:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/user_agent" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response will look like this:

# 200 OK
{
  "user_agent": "Elastic Crawler (0.0.1)"
}

Domainsedit

List domainsedit

Returns a list of domains for the given engine.

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains

page[current] (optional)
Current page number (default: 1).
page[size] (optional)
Page size (default: 25). The maximum is 100; larger values will be truncated to 100.
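Pagination parameters are passed in the query string, with the brackets percent-encoded. A minimal Python sketch of building such a URL (the `domains_url` helper and placeholder values are illustrative, not part of any client library; the size cap is enforced server-side, the clamp below merely mirrors it):

```python
from urllib.parse import urlencode

MAX_PAGE_SIZE = 100  # sizes above this are truncated by the API


def domains_url(base_url, engine_name, current=1, size=25):
    """Build a List domains URL with pagination parameters.

    base_url and engine_name are placeholders for your own deployment.
    """
    size = min(size, MAX_PAGE_SIZE)  # mirror the server-side cap
    query = urlencode({"page[current]": current, "page[size]": size})
    return f"{base_url}/api/as/v1/engines/{engine_name}/crawler/domains?{query}"


print(domains_url("http://localhost:3002", "my-engine", current=2, size=150))
```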

Examplesedit

A GET request to return the domains for the given engine:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response will look like this:

# 200 OK
{
  "meta": {
    "page": {
      "current": 1,
      "total_pages": 1,
      "total_results": 2,
      "size": 25
    }
  },
  "results": [
    {
      "id": "61a8eb863932dd78141fb0df",
      "name": "https://www.elastic.co",
      "document_count": 0,
      "entry_points": [
        {
          "id": "61a8a1ed3932ddd80cc4b719",
          "value": "/"
        }
      ],
      "crawl_rules": [],
      "default_crawl_rule": {
        "id": "-",
        "order": 0,
        "policy": "allow",
        "rule": "regex",
        "pattern": ".*"
      },
      "sitemaps": []
    },
    {
      "id": "61a8a1ed3932ddd80cc4b718",
      "name": "https://swiftype.com",
      "document_count": 0,
      "entry_points": [
        {
          "id": "61a8eb863932dd78141fb0e0",
          "value": "/"
        }
      ],
      "crawl_rules": [],
      "default_crawl_rule": {
        "id": "-",
        "order": 0,
        "policy": "allow",
        "rule": "regex",
        "pattern": ".*"
      },
      "sitemaps": []
    }
  ]
}

Create a new domainedit

Create a crawler domain for an engine:

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains

name (required)
The domain URL.
auth (optional)

The information required to crawl a domain protected by HTTP authentication.

For Basic auth, set type to basic, and provide the username and password:

{
  "auth": {
    "type": "basic",
    "username": "kimchy",
    "password": ":)"
  }
}

Alternatively, set the value of the Authorization header directly. Set type to raw, and provide the value:

{
  "auth": {
    "type": "raw",
    "value": "Bearer some-token"
  }
}
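If you build these payloads programmatically, a small helper can keep the two shapes straight. This function is a hypothetical convenience, not part of the API or any client library; the API only sees the resulting JSON:

```python
def auth_payload(auth_type, **kwargs):
    """Build the `auth` object for a domain create/update request.

    For type "basic", pass username and password; for type "raw",
    pass the full Authorization header value.
    """
    if auth_type == "basic":
        return {"auth": {"type": "basic",
                         "username": kwargs["username"],
                         "password": kwargs["password"]}}
    if auth_type == "raw":
        return {"auth": {"type": "raw", "value": kwargs["value"]}}
    raise ValueError(f"unsupported auth type: {auth_type}")


print(auth_payload("basic", username="kimchy", password=":)"))
```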

Examplesedit

A POST request to create a new crawler domain for a given engine, using Basic auth, providing a username and password:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "name": "https://www.elastic.co",
  "auth": "{
    "type": "basic",
    "username": "username",
    "password": "password"
}"

A successful response:

# 200 OK
{
  "meta": {
    "page": {
      "current": 1,
      "total_pages": 1,
      "total_results": 1,
      "size": 25
    }
  },
  "results": [
    {
      "id": "rjjk535wfjp14kbwlhzpf22d",
      "name": "https://www.elastic.co",
      "document_count": 86745,
      "deduplication_enabled": true,
      "deduplication_fields": [
        "title",
        "body_content",
        "meta_keywords",
        "meta_description",
        "links",
        "headings"
      ],
      "available_deduplication_fields": [
        "body_content",
        "headings",
        "id",
        "links",
        "meta_description",
        "meta_keywords",
        "title"
      ],
      "auth": null,
      "created_at": "2022-04-29T09:49:59Z",
      "last_visited_at": "2022-05-05T11:10:09Z",
      "entry_points": [
        {
          "id": "626bb4fe23bd29136e54f2e3",
          "value": "/guide/en/enterprise-search/current/",
          "created_at": "2022-04-29T09:50:54Z"
        }
      ],
      "crawl_rules": [],
      "default_crawl_rule": {
        "id": "-",
        "order": 0,
        "policy": "allow",
        "rule": "regex",
        "pattern": ".*",
        "created_at": "2022-05-05T12:32:40Z"
      },
      "sitemaps": []
    }
  ]
}

View details for a domainedit

Get domain object for an engine:

GET <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>

Examplesedit

A GET request to view the domain object for an engine:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \

A successful response:

# 200 OK
{
  "id": "{DOMAIN_ID}",
  "name": "{DOMAIN_NAME}",
  "document_count": 0,
  "entry_points": [
    {
      "id": "6087cec06dda9bdfb4a49e39",
      "value": "/"
    }
  ],
  "crawl_rules": [],
  "default_crawl_rule": {
    "id": "-",
    "order": 0,
    "policy": "allow",
    "rule": "regex",
    "pattern": ".*"
  },
  "sitemaps": []
}

Update a domainedit

Updates a domain for an engine:

PUT <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>

name
The domain URL.
auth (optional)

The information required to crawl a domain protected by HTTP authentication.

For Basic auth, set type to basic, and provide the username and password:

{
  "auth": {
    "type": "basic",
    "username": "kimchy",
    "password": ":)"
  }
}

Alternatively, set the value of the Authorization header directly. Set type to raw, and provide the value:

{
  "auth": {
    "type": "raw",
    "value": "Bearer some-token"
  }
}

Examplesedit

A PUT request to update a domain for an engine, using Basic auth, providing a username and password:

curl \
--request "PUT" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "name": "https://www.elastic.co",
  "auth": "{
    "type": "basic",
    "username": "username",
    "password": "password"
}"

A successful response:

# 200 OK
{
  "id": "{DOMAIN_ID}",
  "name": "{DOMAIN_NAME}",
  "document_count": 0,
  "entry_points": [
    {
      "id": "6087cec06dda9bdfb4a49e39",
      "value": "/"
    }
  ],
  "crawl_rules": [],
  "default_crawl_rule": {
    "id": "-",
    "order": 0,
    "policy": "allow",
    "rule": "regex",
    "pattern": ".*"
  },
  "sitemaps": []
}

Delete a domainedit

Deletes a domain for an engine:

DELETE <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>

Examplesedit

A DELETE request to delete a domain for an engine:

curl \
--request "DELETE" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \

A successful response contains an object showing the domain has been deleted:

# 200 OK
{
  "deleted": true
}

Entry pointsedit

Create a new entry pointedit

Create an entry point for a domain.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/entry_points

value (required)
The entry point path.

Examplesedit

A POST request to create an entry point for a domain for an engine:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/entry_points" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "name": "https://www.elastic.co/blog"
}"

A successful response:

# 200 OK
{
  "id": "{ENTRY_POINT_ID}",
  "value": "/blog"
}

Update an entry pointedit

Updates an entry point for a domain.

PUT <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/entry_points/<entry_point_id>

value
The entry point path.

Examplesedit

A PUT request to update an entry point for a domain for an engine:

curl \
--request "PUT" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/entry_points/{ENTRY_POINT_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "value": "https://www.elastic.co/docs"
}"

A successful response:

# 200 OK
{
  "id": "{ENTRY_POINT_ID}",
  "value": "/docs"
}

Delete an entry pointedit

Deletes an entry point for a domain.

DELETE <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/entry_points/<entry_point_id>

Examplesedit

A DELETE request to delete an entry point for a domain for an engine:

curl \
--request "DELETE" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/entry_points/{ENTRY_POINT_ID}"\
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \

A successful response contains an object showing the entry point has been deleted:

# 200 OK
{
  "deleted": true
}

Crawl rulesedit

Create a new crawl ruleedit

Create a crawl rule for a domain.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/crawl_rules

policy (required)
Accepted values are allow and deny.
rule (required)
Accepted values are begins, ends, contains and regex.
pattern (required)
The path pattern to match against.
order (optional)
An integer representing this crawl rule's position within the list of crawl rules for the domain. The order of crawl rules is significant.
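The parameters above can be read as an ordered allow/deny list over URL paths. A sketch of that evaluation, assuming the first matching rule decides and the default crawl rule (allow `.*`) applies when nothing matches; the function names here are illustrative, not part of the crawler:

```python
import re


def rule_matches(rule, pattern, path):
    """Check one crawl rule against a URL path."""
    if rule == "begins":
        return path.startswith(pattern)
    if rule == "ends":
        return path.endswith(pattern)
    if rule == "contains":
        return pattern in path
    if rule == "regex":
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown rule: {rule}")


def is_allowed(crawl_rules, path):
    """Apply rules in order; the first match decides. Default: allow."""
    for r in sorted(crawl_rules, key=lambda r: r["order"]):
        if rule_matches(r["rule"], r["pattern"], path):
            return r["policy"] == "allow"
    return True  # default crawl rule: allow .*


rules = [{"order": 0, "policy": "deny", "rule": "begins", "pattern": "/ignore"}]
print(is_allowed(rules, "/ignore/this"))  # False
print(is_allowed(rules, "/blog"))         # True
```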

Examplesedit

A POST request to create a crawl rule for a domain:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/crawl_rules" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "policy": allow,
  "rule": ends,
  "pattern": "/ignore",
  "order": 0
}"

A successful response:

# 200 OK
{
  "id": "{CRAWL_RULE_ID}",
  "order": 0,
  "policy": "allow",
  "rule": "begins",
  "pattern": "/ignore"
}

Update a crawl ruleedit

Updates a crawl rule for a domain.

PUT <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/crawl_rules/<crawl_rule_id>

policy
Accepted values are allow and deny.
rule
Accepted values are begins, ends, contains and regex.
pattern
The path pattern to match against.
order
An integer representing this crawl rule's position within the list of crawl rules for the domain. The order of crawl rules is significant.

Examplesedit

A PUT request to update a crawl rule for a domain:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/crawl_rules/{CRAWL_RULE_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "policy": deny,
  "rule": ends,
  "pattern": "/ignore",
  "order": 0
}"

A successful response:

# 200 OK
{
  "id": "{CRAWL_RULE_ID}",
  "order": 0,
  "policy": "deny",
  "rule": "begins",
  "pattern": "/ignore"
}

Delete a crawl ruleedit

Deletes a crawl rule for a domain.

DELETE <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/crawl_rules/<crawl_rule_id>

Examplesedit

A DELETE request to delete a crawl rule for a domain:

curl \
--request "DELETE" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/crawl_rules/{CRAWL_RULE_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}"

A successful response contains an object showing the crawl rule has been deleted:

# 200 OK
{
  "deleted": true
}

Sitemapsedit

Create a new sitemapedit

Create a sitemap for a domain.

POST <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/sitemaps

url (required)
The sitemap URL.

Examplesedit

A POST request to create a sitemap for a domain:

curl \
--request "POST" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/sitemaps" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "url": "https://www.elastic.co/sitemap2.xml"
}"

A successful response:

# 200 OK
{
  "id": "{SITEMAP_ID}",
  "url": "https://elastic.co/sitemap2.xml"
}

Update a sitemapedit

Updates a sitemap for a domain.

PUT <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/sitemaps/<sitemap_id>

url
The sitemap URL.

Examplesedit

A PUT request to update a sitemap for a domain:

curl \
--request "PUT" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/sitemaps/{SITEMAP_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "url": "https://www.elastic.co/sitemap3.xml"
}"

A successful response:

# 200 OK
{
  "id": "{SITEMAP_ID}",
  "url": "https://elastic.co/sitemap3.xml"
}

Delete a sitemapedit

Deletes a sitemap for a domain.

DELETE <enterprise_search_base_url>/api/as/v1/engines/<engine_name>/crawler/domains/<domain_id>/sitemaps/<sitemap_id>

Examplesedit

A DELETE request to delete a sitemap for a domain:

curl \
--request "DELETE" \
--url "${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/${ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/sitemaps/{SITEMAP_ID}" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${TOKEN}" \
--data "{
  "url": "https://www.elastic.co/sitemap3.xml"
}"

A successful response contains an object showing the sitemap has been deleted:

# 200 OK
{
  "deleted": true
}