Customize indexing rules for a content source

Each content source has indexing rules that determine what data is synchronized to Workplace Search. Many content sources allow you to customize these rules. You may be able to use custom indexing rules to speed up content source synchronization and decrease storage costs. You can also exclude content that shouldn’t be processed, such as binary files.

Indexing rules apply at the time of full source synchronization. This means that they will only take effect when the content from a source is re-synchronized. Changing indexing rules automatically triggers a synchronization process, but this process might take some time, especially when dealing with content sources that have large sets of documents.

This guide shows how to change the indexing rules on a content source. Custom indexing rules are available for standard sources, but not for custom content sources.

The examples in this guide use a Dropbox content source.

Customizing indexing rules currently relies on the Workplace Search API. The steps in this guide assume that you are familiar with authenticating to the API. The examples use admin user access tokens.

A more concise and technical API reference can be found at Content sources API reference.

Modify the indexing rules on a content source

These steps assume you have already connected a source. Here we will modify the indexing rules on a Dropbox source to exclude content from a top-level folder named Legal.


Step 1. First, we need to retrieve the source’s ID and definition. We’ll retrieve it from the API with the following cURL command:

curl \
--request "GET" \
--url "${ENTERPRISE_SEARCH_URL}/api/ws/v1/sources" \
--header "Authorization: Bearer ${ACCESS_TOKEN}"

This returns a response like the following:

{
  "meta": { ... },
  "results":
  [
    {
      "id": "60de02d9a1c4934b6efe24dc",
      "service_type": "dropbox",
      "name": "Dropbox",
      ...
    }
  ]
}

In the above response data, the source we’re looking for has an ID of 60de02d9a1c4934b6efe24dc. You can find name values in the source definitions to help identify which source you’re looking for.
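
When several sources are connected, picking the ID out by hand gets tedious. The following is a minimal Python sketch of filtering the Step 1 response by service_type. The response_body string is a trimmed, hypothetical copy of the response above (elided fields are left out), and find_source_ids is an illustrative helper, not part of the Workplace Search API.

```python
import json

# Trimmed, hypothetical copy of the Step 1 response; elided fields are omitted.
response_body = """
{
  "meta": {},
  "results": [
    {"id": "60de02d9a1c4934b6efe24dc", "service_type": "dropbox", "name": "Dropbox"}
  ]
}
"""


def find_source_ids(body, service_type):
    """Return the ID of every source in the response with the given service_type."""
    results = json.loads(body)["results"]
    return [source["id"] for source in results if source["service_type"] == service_type]


dropbox_ids = find_source_ids(response_body, "dropbox")
print(dropbox_ids)
```

If more than one ID comes back, the name value of each source can be used to tell them apart.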


Step 2. Next, we need to identify the filter rule we wish to use when indexing documents. Assume our Dropbox documents exist at the following paths:

/employee-directory.pdf
/Documentation/enabling-saml-auth.pdf
/Documentation/customizing-content-sources.pdf
/Legal/clients-list.xls
/Legal/Contracts/business.docx

If we want to exclude every document contained in the top-level Legal folder, we can use a pattern like:

/Legal/**/*

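This pattern uses glob syntax. In common glob syntax, ** matches any number of nested folders and * matches any file name, so the pattern covers files at any depth under /Legal.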
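
To check a pattern before applying it, we can test it locally against sample paths. The Python sketch below assumes common globstar semantics (a "**/" segment spans zero or more folders, and "*" stays within a single path segment). Workplace Search may evaluate patterns differently, and glob_to_regex is a hypothetical helper written for this illustration.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob to a regex (assumed semantics: '**/' spans zero or
    more folders, '*' stays within a single path segment)."""
    pieces = {"**/": r"(?:[^/]+/)*", "*": r"[^/]*"}
    tokens = re.split(r"(\*\*/|\*)", pattern)
    return re.compile("".join(pieces.get(t, re.escape(t)) for t in tokens))


paths = [
    "/employee-directory.pdf",
    "/Documentation/enabling-saml-auth.pdf",
    "/Documentation/customizing-content-sources.pdf",
    "/Legal/clients-list.xls",
    "/Legal/Contracts/business.docx",
]

legal = glob_to_regex("/Legal/**/*")
excluded = [path for path in paths if legal.fullmatch(path)]
print(excluded)
```

Under these assumptions, only the two paths under /Legal match, including the one nested in the Contracts subfolder.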


Step 3. We can then use the source definition from Step 1 to build an indexing configuration request that filters out our Legal documents. The --data content is based on the source’s definition from the previous response, but includes only the name and indexing attributes. Here, CONTENT_SOURCE_ID is assumed to hold the ID retrieved in Step 1:

curl \
--request "PUT" \
--url "${ENTERPRISE_SEARCH_URL}/api/ws/v1/sources/${CONTENT_SOURCE_ID}" \
--header "Authorization: Bearer ${ACCESS_TOKEN}" \
--header "Content-Type: application/json" \
--data '
{
  "name": "Dropbox",
  "indexing":
  {
    "default_action": "include",
    "rules":
    [
      {
        "filter_type": "path_template",
        "exclude": "/Legal/**/*"
      }
    ]
  }
}
'
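
The same request can be assembled from a script. This is a minimal sketch using Python's standard urllib. It builds the request without sending it, so the payload can be inspected first. The base URL and token are placeholder values, and build_indexing_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_indexing_request(base_url, access_token, source_id, name, indexing):
    """Build (but do not send) the PUT request that updates a source's indexing rules."""
    body = json.dumps({"name": name, "indexing": indexing}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ws/v1/sources/{source_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


request = build_indexing_request(
    base_url="https://enterprise-search.example.com",  # placeholder, not a real deployment
    access_token="EXAMPLE_TOKEN",  # placeholder admin user access token
    source_id="60de02d9a1c4934b6efe24dc",
    name="Dropbox",
    indexing={
        "default_action": "include",
        "rules": [{"filter_type": "path_template", "exclude": "/Legal/**/*"}],
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it. Keeping construction separate from sending makes it easy to review the body before triggering a synchronization.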

Step 4. Modifying a content source’s indexing configuration will trigger a full synchronization of the source’s data. When complete, we should see that no documents from the top-level Legal folder were indexed into Workplace Search.

Additional examples

Using the above steps, here are some additional examples of indexing configurations for other use cases.


Revert indexing configuration to default

To revert a source’s indexing configuration back to the default state, you can use the default indexing configuration:

{
  "default_action": "include",
  "rules": []
}

Include only specific sub-directories

For file system based sources, you may want to index only the content within a specific set of directories rather than excluding content. Note that rules are evaluated in order, which allows a complex mixture of include and exclude rules. This example indexes all content in the Engineering and Design top-level folders, except for the Templates folder within Design, which is excluded:

{
  "default_action": "exclude",
  "rules": [
    {
      "filter_type": "path_template",
      "exclude": "/Design/Templates/**/*"
    },
    {
      "filter_type": "path_template",
      "include": "/Engineering/**/*"
    },
    {
      "filter_type": "path_template",
      "include": "/Design/**/*"
    }
  ]
}
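
To see why rule order matters, here is a small Python simulation of the configuration above. It assumes that the first matching rule decides the outcome and that default_action applies when no rule matches. The source only states that rules are evaluated in order, so treat this as an approximation of the behavior. The sample paths, glob_to_regex, and should_index are all hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob to a regex (assumed semantics: '**/' spans zero or
    more folders, '*' stays within a single path segment)."""
    pieces = {"**/": r"(?:[^/]+/)*", "*": r"[^/]*"}
    tokens = re.split(r"(\*\*/|\*)", pattern)
    return re.compile("".join(pieces.get(t, re.escape(t)) for t in tokens))


def should_index(path, indexing):
    """Walk path_template rules in order; the first match decides (an assumption)."""
    for rule in indexing["rules"]:
        if rule["filter_type"] != "path_template":
            continue
        for action in ("include", "exclude"):
            if action in rule and glob_to_regex(rule[action]).fullmatch(path):
                return action == "include"
    # No rule matched, so fall back to the default action.
    return indexing["default_action"] == "include"


indexing = {
    "default_action": "exclude",
    "rules": [
        {"filter_type": "path_template", "exclude": "/Design/Templates/**/*"},
        {"filter_type": "path_template", "include": "/Engineering/**/*"},
        {"filter_type": "path_template", "include": "/Design/**/*"},
    ],
}

for path in ["/Design/logo.svg", "/Design/Templates/banner.psd",
             "/Engineering/specs/api.md", "/Legal/contract.docx"]:
    print(path, should_index(path, indexing))
```

In this model, moving the Templates exclude rule below the Design include rule would let the broader include match first, and the Templates content would be indexed.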

Exclude specific file extensions

This configuration will exclude some common executable and script file types:

{
  "default_action": "include",
  "rules": [
    {
      "filter_type": "file_extension",
      "exclude": "exe"
    },
    {
      "filter_type": "file_extension",
      "exclude": "bat"
    },
    {
      "filter_type": "file_extension",
      "exclude": "sh"
    }
  ]
}

Exclude specific object types

This configuration will exclude some specific object types. This is particularly useful for content sources that sync in a variety of document types, like Salesforce:

{
  "default_action": "include",
  "rules": [
    {
      "filter_type": "object_type",
      "exclude": "opportunity"
    },
    {
      "filter_type": "object_type",
      "exclude": "attachment"
    }
  ]
}
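
The file extension and object type examples share the same shape: one exclude rule per value. Long exclusion lists can therefore be generated rather than typed out. A minimal Python sketch, where build_exclude_config is a hypothetical helper and the output is meant to be used as the indexing attribute of the request body:

```python
import json


def build_exclude_config(filter_type, values):
    """Build an indexing configuration with one exclude rule per value."""
    return {
        "default_action": "include",
        "rules": [{"filter_type": filter_type, "exclude": value} for value in values],
    }


extension_config = build_exclude_config("file_extension", ["exe", "bat", "sh"])
object_config = build_exclude_config("object_type", ["opportunity", "attachment"])

print(json.dumps(extension_config, indent=2))
print(json.dumps(object_config, indent=2))
```

Calling the helper with an empty list produces the default configuration shown under "Revert indexing configuration to default".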