Site Search Engine Schema Design

This guide will teach you about the two types of search queries and the available field types, then walk through the construction of an Engine schema.

This guide covers both:

  1. Crawler based Engines
  2. API based Engines

Suggest v. Search Queries

There are two types of search queries:

  • Search queries: Match complete terms.

  • Suggest queries: Match on prefixes of terms to perform autocompletion of search terms.

    • eg. If you have a document with a field value of "Autocomplete Example", a suggest query for "aut", "auto", "autoc" and so forth will match on "Autocomplete".
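
The difference between the two query types can be sketched in a few lines of Python. This is only an illustration of complete-term versus prefix matching; the function names and the whitespace tokenizer are assumptions made for the example and do not describe how Site Search analyzes text internally.

```python
def tokenize(value: str) -> list[str]:
    """Lower-case a field value and split it on whitespace (a simplification)."""
    return value.lower().split()


def matches_search(query: str, field_value: str) -> bool:
    """Search-style match: the query must equal a complete term."""
    return query.lower() in tokenize(field_value)


def matches_suggest(query: str, field_value: str) -> bool:
    """Suggest-style match: the query only needs to be a prefix of a term."""
    q = query.lower()
    return any(term.startswith(q) for term in tokenize(field_value))


title = "Autocomplete Example"
prefix_hits = [p for p in ("aut", "auto", "autoc") if matches_suggest(p, title)]
```

Here every prefix in the tuple matches as a suggest query, while `matches_search("aut", title)` does not match because "aut" is not a complete term.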

It is important to consider whether a field will be used for suggest queries or for search queries.

eg. The title of an article is a good candidate for suggest queries, but the body text would not be.

Field type overview

The key distinguishing feature between field types is whether they are used for searching or not.

Textual fields - string and text - can be searched. But only string fields leverage both suggest and search queries.

The other field types are used to:

  • filter results
  • change the relevance of results, such as with functional boosts
  • sort
  • provide faceted counts of results

Type     | Search Queries | Suggest Queries | Functional Boosts | Filtering | Sorting | Facets
string   | Yes            | Yes             | No                | Yes       | Yes     | Yes
text     | Yes            | No              | No                | No        | No      | No
enum     | Yes            | No              | No                | Yes       | Yes     | Yes
integer  | No             | No              | Yes               | Yes       | Yes     | Yes
float    | No             | No              | Yes               | Yes       | Yes     | Yes
date     | No             | No              | No                | Yes       | Yes     | Yes
location | No             | No              | No                | Yes       | No      | No

string fields

Short textual fields like title or headers use the string field type.

  • A string is used for both suggest and search queries.

  • A match in a string field for a full-text search will be returned in the highlights.

  • string fields cannot be used for functional boosts.

  • A string type field may contain up to 300 characters.

Textual fields longer than a few hundred characters should use the text type.

Structured data like a database ID or a URL should use the enum type.

text fields

Longer fields like the body text of an article use the text field type.

  • A text field will not match suggest queries but will match search queries.

  • A match in a text field will return in the highlights.

  • A text field cannot be used for filtering, sorting, functional boosts, or faceting.

  • A text field may contain up to 100,000 characters.

enum fields

Cryptic bits of text like URLs and email addresses use the enum field type.

An enum field is considered a single piece of data. The values are not tokenized or analyzed.

eg. An enum value of "AppleCart" will not be lower-cased or split on case changes, as with text search.

  • An enum field is used for suggest and search queries, if the values match exactly.

    • For fuzzy matches, use a string field instead.
  • You can use enum fields to filter data and for faceting.

  • An enum field can be used to sort. But be aware that the sort is by string comparison.

    • eg. The value "apple" will sort before "bear", but "100" will sort before "99" because the first character of "100" is less than the first character of "99". If you need numerical sorting, use an integer or float field instead.
  • enum fields may contain up to 2,000 characters.

  • A special enum field is the external_id which ties an Engine's document to your external website or application. All Site Search documents have an external_id. You do not have to define it in your schema.
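
The sorting caveat for enum fields is ordinary string comparison, which is easy to reproduce in Python. This is only a demonstration of the comparison rule, not of Site Search itself; the variable names are illustrative.

```python
enum_values = ["99", "100"]

# String comparison looks at the first character first: "1" sorts before "9".
as_strings = sorted(enum_values)

# Converting to integers gives the numerical order you would get from an
# integer field.
as_integers = sorted(int(value) for value in enum_values)
```

`as_strings` places "100" first, whereas `as_integers` places 99 first.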

Numeric fields: integer and float

Numbers use the integer or float field type.

  • Numeric fields - integer and float - are not used in suggest or search queries.

  • Numeric fields can be used in scoring, filtering (including by range), functional boosts, sorting, and faceting.

    • eg. The number of "Likes" on a post, or the average review score for a product.

date fields

Dates use the date field.

eg. You could store an article's publication date and search for articles published in the last 30 days using a range filter.

  • A date field is not used for suggest or search queries.

  • A date field can be used for filtering (including by range), sorting, and faceting.

  • When sent to the API, dates must be in ISO 8601 format, eg: "2013-02-27T18:09:19".

    • We recommend using UTC representations for dates.
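
A minimal sketch of producing such a date string in Python follows, along with a "last 30 days" range filter of the same shape as the range filters used in the example queries later in this guide. The helper names `to_iso8601_utc` and `last_n_days_filter` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601_utc(moment: datetime) -> str:
    """Convert a timezone-aware datetime to UTC, formatted like 2013-02-27T18:09:19."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


def last_n_days_filter(now: datetime, days: int = 30) -> dict:
    """Build a range filter value covering the last `days` days before `now`."""
    return {"type": "range", "from": to_iso8601_utc(now - timedelta(days=days))}


published = to_iso8601_utc(datetime(2013, 2, 27, 18, 9, 19, tzinfo=timezone.utc))
```

Rejecting naive datetimes is a design choice for the sketch: it avoids silently interpreting local times as UTC.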

location fields

Geographic locations use the location field.

The location field allows filtering by distance from a specified point.

eg. A store could have a location field and users could search for stores near their location.

  • The location field type can be used only for filtering by location.

  • The location field is not used in suggest or search queries.

  • A location is specified using a JSON object containing the longitude and latitude, eg: {"lat": 56.2,"lon": 44.7}.
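
As a small illustration, the location object can be built and range-checked before it is sent. The helper `make_location` is hypothetical, and the bounds check reflects ordinary latitude and longitude limits rather than any documented Site Search validation.

```python
import json


def make_location(lat: float, lon: float) -> dict:
    """Build a location value, rejecting coordinates outside the valid ranges."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return {"lat": lat, "lon": lon}


store_location = json.dumps(make_location(56.2, 44.7))
```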

Multi-valued fields

Multi-valued fields are used for storing fields like tags or categories, with more than one distinct value.

  • You cannot mix multiple types in the same field.

    • For example, an integer and a string cannot be stored in the same field.
  • Multi-valued fields are transparent in the search and suggest API calls. If the field type is searchable (string and text), multi-valued fields can be searched. If the field type is sortable, they can be sorted on, and so on.

  • To specify multiple values for a tag, pass a JSON array of the values, for example ["ruby", "rails", "json", "programming"].
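
The following sketch builds a multi-valued field and enforces the rule that types cannot be mixed. The field layout (name, value, type) follows the document creation example later in this guide; the helper `multi_valued_field` is an assumption made for illustration.

```python
def multi_valued_field(name: str, values: list, field_type: str) -> dict:
    """Build a field whose value is a list, refusing lists of mixed Python types."""
    if not values:
        raise ValueError("a multi-valued field needs at least one value")
    kinds = {type(value) for value in values}
    if len(kinds) != 1:
        raise TypeError(f"field {name!r} mixes types: {sorted(k.__name__ for k in kinds)}")
    return {"name": name, "value": list(values), "type": field_type}


tags = multi_valued_field("tags", ["ruby", "rails", "json", "programming"], "string")
```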

Crawler Based Engine Schema Design

By default, a page that is crawled is turned into a document.

A crawled document belongs to a DocumentType called page.

Crawled pages are built into documents according to the following Engine schema:

Field | Data Type | Suggest/Autocomplete? | Description
external_id | enum | No | For crawler based search engines, the hexadecimal MD5 digest of the normalized URL of the page.
updated_at | date | No | The date when the page was last indexed.
title | string | Yes | The title of the page taken from the <title> tag or the title meta tag.
url | enum | No | The URL of the page.
sections | string | Yes | Sections of the page determined by <h1>, <h2>, and <h3> tags or the sections meta tag.
body | text | No | The text of the page.
type | enum | No | The page type, set by the type meta tag.
image | enum | No | A URL for an image associated with the page (set by the image meta tag), used as a thumbnail in your search result listing if present.
published_at | date | No | The date the page was published. It can be set with the published_at meta tag. If not defined via a meta tag, the value will be the last time an update to the page was detected during a crawl.
popularity | integer | No | The popularity score for a page. Specialized crawlers for content management systems like Tumblr may use this field, or it can be set with the popularity meta tag and used to change search result rankings with functional boosts. If not specified, the default value is 1.
info | string | Yes | Additional information about the page returned with the results, set by the info meta tag.

The descriptions of the default fields often reference meta tags.

Meta tags are either created by you, or inferred by the crawler.

You can allow the crawler to make its assumptions, or create and assert your own meta tag values.

A meta tag with class="swiftype" is also how you add a new field to your Engine schema:

<head>
  <meta class="swiftype" name="new-field" data-type="integer" content="12" />
</head>

Broken down, within the meta tag we have:

  1. class="swiftype": Required to communicate with the crawler.
  2. name="new-field": The name of your field.
  3. data-type="integer": Any data type, all of which are laid out above.
  4. content="12": The content of the field must match the data type. Integers for integers, coordinates ("10, -10") for location, etc.

There are a few crucial things to note:

  • Adding a new tag to one or more pages will add the new field to your Engine schema after the next crawl.
  • Multiple tags with the same name but different content will add the content as an array to the field.
  • Fields cannot be deleted! Be careful naming and structuring your tags. Look out for odd characters and spelling issues.
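
If your pages are rendered from templates, the meta tags can be generated rather than typed by hand, which reduces the risk of the naming and spelling mistakes mentioned above. The sketch below is one way to do that in Python; the helper names are hypothetical, and the output format simply copies the meta tag example shown earlier.

```python
from html import escape


def swiftype_meta_tag(name: str, data_type: str, content) -> str:
    """Render one crawler meta tag, escaping attribute values for HTML."""
    return '<meta class="swiftype" name="{}" data-type="{}" content="{}" />'.format(
        escape(name, quote=True),
        escape(data_type, quote=True),
        escape(str(content), quote=True),
    )


def swiftype_meta_tags(name: str, data_type: str, contents: list) -> list[str]:
    """Render one tag per value; repeated names are indexed as an array."""
    return [swiftype_meta_tag(name, data_type, content) for content in contents]


single = swiftype_meta_tag("new-field", "integer", 12)
several = swiftype_meta_tags("tags", "string", ["cats", "dogs"])
```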

Crawler Based Engine, Example queries

Now that we have a schema and a set of documents that conform to it, we can launch some test queries.

We can find documents about cats and boost the score by popularity:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
  -H 'Content-Type: application/json' \
  -d '{
        "q": "cats",
        "document_types": ["page"],
        "functional_boosts": {
          "page": {
            "popularity": "linear"
          }
        }
      }'

We can find documents sorted by title in descending order:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["page"],
        "sort_field": {"page": "title"},
        "sort_direction": {"page": "desc"}
      }'

We can find documents published since the start of 2019:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["page"],
        "filters": {
          "page": {
            "published_at": {"type": "range", "from": "2019-01-01"}
          }
        }
      }'
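
The same requests can be assembled from application code. The following Python sketch builds, but deliberately does not send, a request with the same URL, header, and JSON body as the first curl call above, using only the standard library. The function name `build_search_request` is an assumption, and `example-key` is the placeholder engine key from the curl examples.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://api.swiftype.com/api/v1/public/engines/search.json"


def build_search_request(engine_key: str, payload: dict) -> Request:
    """Build (but do not send) a search request shaped like the curl calls."""
    url = SEARCH_URL + "?" + urlencode({"engine_key": engine_key})
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


boosted = build_search_request(
    "example-key",
    {
        "q": "cats",
        "document_types": ["page"],
        "functional_boosts": {"page": {"popularity": "linear"}},
    },
)
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any network call is made.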

API Based Engine Schema Design

Let's say you were designing a schema for YouTube, and you want to search over the videos.

Videos have properties like title, caption, length, and so on.

You can view a complete list of attributes that a YouTube video has in the developer documentation.

The first step in schema design is determining which attributes you want to search, sort, and filter.

You only need to store data in Site Search that you want to search, sort, or filter. Site Search is not a database, but a search engine.

For a YouTube video, we might want to store data according to this search optimized schema:

Attribute | Purpose | Recommended Data Type
ID | Identifies a unique video; links a record in your database to a Site Search document | external_id
URL | Search results link | enum
thumbnail URL | Display image with search results | enum
channel ID | Filtering | enum
title | Suggest and search queries | string
caption | Search queries | text
tags | Suggest and search queries | string (multi-value)
* category name | Suggest and search queries | string
* category ID | Filtering by category | enum
published at date | Filtering by date range | date
duration (in seconds) | Filtering | integer
number of views | Filtering, functional boosts | integer
number of likes | Functional boosts | integer

  • Note that the schema contains both:
    • The category name as a string, for searching.
    • The category ID as an enum, for filtering.

Creating an API Based Engine Schema

Great, we have mapped out the schema.

We will now use the API to index documents.

We can't get too far without an API based Engine.

Let's create one called youtube:

curl -X POST 'https://api.swiftype.com/api/v1/engines.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "engine": {"name": "youtube"}
      }'

After that, we create the videos DocumentType to hold the documents:

curl -X POST 'https://api.swiftype.com/api/v1/engines/youtube/document_types.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "document_type": {"name": "videos"}
      }'

Next, we create a document in the videos DocumentType that matches the schema:

curl -X POST 'https://api.swiftype.com/api/v1/engines/youtube/document_types/videos/documents.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "document": {
          "external_id": "v1uyQZNg2vE",
          "fields": [
            {"name": "url", "value": "http://www.youtube.com/watch?v=v1uyQZNg2vE", "type":  "enum"},
            {"name": "thumbnail_url", "value": "https://i.ytimg.com/vi/v1uyQZNg2vE/mqdefault.jpg", "type": "enum"},
            {"name": "channel_id", "value": "UCK8sQmJBp8GCxrOtXWBpyEA", "type": "enum"},
            {"name": "title", "value": "How It Feels [through Glass]", "type": "string"},
            {"name": "caption", "value": "Want to see how Glass actually feels?...", "type": "text"},
            {"name": "tags", "value": ["glass", "wearable computing", "google"], "type": "string"},
            {"name": "category_name", "value": "Science & Technology", "type": "string"},
            {"name": "category_id", "value": 28, "type": "enum"},
            {"name": "published_at", "value": "2013-02-20T10:47:18", "type": "date"},
            {"name": "duration", "value": 136, "type": "integer"},
            {"name": "view_count", "value": 14599202, "type": "integer"},
            {"name": "like_count", "value": 75952, "type": "integer"}
          ]
        }
     }'
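
In practice the document body is usually generated from a record in your own database. The sketch below shows one way to do that in Python: the schema is kept as a dictionary of field name to type, and a helper turns a plain record into the document structure used in the curl call above. `VIDEO_SCHEMA` and `to_document` are names invented for this example.

```python
# Field name -> type, matching the fields in the document example above.
VIDEO_SCHEMA = {
    "url": "enum",
    "thumbnail_url": "enum",
    "channel_id": "enum",
    "title": "string",
    "caption": "text",
    "tags": "string",
    "category_name": "string",
    "category_id": "enum",
    "published_at": "date",
    "duration": "integer",
    "view_count": "integer",
    "like_count": "integer",
}


def to_document(external_id: str, record: dict) -> dict:
    """Turn a plain record into a document body, typed according to the schema."""
    unknown = set(record) - set(VIDEO_SCHEMA)
    if unknown:
        raise KeyError(f"fields not in schema: {sorted(unknown)}")
    fields = [
        {"name": name, "value": value, "type": VIDEO_SCHEMA[name]}
        for name, value in record.items()
    ]
    return {"external_id": external_id, "fields": fields}


document = to_document(
    "v1uyQZNg2vE",
    {"title": "How It Feels [through Glass]", "duration": 136},
)
```

Rejecting unknown field names up front is a deliberate choice here, since a misspelled field would otherwise become a permanent part of the schema.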

It may seem strange that we define the fields in the document instead of the DocumentType. But Site Search schemas are flexible.

Individual documents in a DocumentType do not need to share all the same fields.

You can add new fields over time simply by creating documents that contain them.

Going forward, there are two key things to be mindful of:

  • Index new fields as the same type as existing documents.
    • An existing string field should not become an enum field, and so forth.
  • Fields cannot be deleted once they have been created.
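
Because field types should stay stable and fields cannot be removed, it can be worth checking a document against the types you have already indexed before sending it. A minimal sketch of such a check follows; `find_type_conflicts` and the shape of `existing_schema` (a dictionary of field name to type that you maintain yourself) are assumptions for the example.

```python
def find_type_conflicts(existing_schema: dict, document_fields: list) -> list:
    """Return (name, existing_type, new_type) for fields whose type would change."""
    conflicts = []
    for field in document_fields:
        known_type = existing_schema.get(field["name"])
        if known_type is not None and known_type != field["type"]:
            conflicts.append((field["name"], known_type, field["type"]))
    return conflicts


existing = {"title": "string", "category_id": "enum"}
incoming = [
    {"name": "title", "value": "New video", "type": "enum"},
    {"name": "category_id", "value": "15", "type": "enum"},
    {"name": "dislike_count", "value": 3, "type": "integer"},
]
conflicts = find_type_conflicts(existing, incoming)
```

In this example only title is reported, because it would change from string to enum; dislike_count is a new field and is therefore not a conflict.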

API Based Engine, Example queries

Now that we have a schema and a set of documents that conform to it, we can launch some test queries.

The queries below really work -- try them on your own workstation.

We can find videos about cats and boost the score by the number of likes:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
  -H 'Content-Type: application/json' \
  -d '{
        "q": "cats",
        "document_types": ["videos"],
        "functional_boosts": {
          "videos": {
            "like_count": "linear"
          }
        }
      }'

We can find videos in the Pets & Animals category sorted by number of views:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["videos"],
        "filters": {"videos": {"category_id": "15"}},
        "sort_field": {"videos": "view_count"},
        "sort_direction": {"videos": "desc"}
      }'

We can find recent videos over a minute in length with more than 1,000,000 views:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["videos"],
        "filters": {
          "videos": {
            "published_at": {"type": "range", "from": "2013-02-01"},
            "view_count": {"type": "range", "from": 1000000},
            "duration": {"type": "range", "from": 60}
          }
        }
      }'
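
The body of the last query can also be built programmatically, which is convenient when the thresholds come from user input. The Python sketch below reproduces the same JSON structure; the helper names `at_least` and `video_search_payload` are hypothetical.

```python
import json


def at_least(value) -> dict:
    """Range filter with only a lower bound, as used in the query above."""
    return {"type": "range", "from": value}


def video_search_payload(published_since: str, min_views: int, min_duration: int) -> dict:
    """Build the search body for recent, popular videos above a minimum length."""
    return {
        "document_types": ["videos"],
        "filters": {
            "videos": {
                "published_at": at_least(published_since),
                "view_count": at_least(min_views),
                "duration": at_least(min_duration),
            }
        },
    }


payload = video_search_payload("2013-02-01", 1000000, 60)
body = json.dumps(payload)
```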

Try it yourself!

You should now feel well prepared to create your own Engine schema.


Stuck? Looking for help? Contact support or check out the Site Search community forum!