Site Search Engine Schema Design Guide

This guide will teach you about the two types of search queries and the available field types, then walk through the construction of an Engine schema.

This guide covers both:

Crawler based Engines
API based Engines

Suggest v. Search Queries

There are two types of search queries:

Search queries: Match complete terms.
Suggest queries: Match on prefixes of terms to perform autocompletion of search terms.
- eg. If you have a document with a field value of "Autocomplete Example", a suggest query for "aut", "auto", "autoc" and so forth will match on "Autocomplete".

It is important to consider whether a field will be used for suggest queries or for search queries.

eg. The title of an article is a good candidate for suggest queries, but the body text would not be.
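The difference between the two query types can be illustrated with a toy matcher. This is only a sketch of the matching behavior described above, not how Site Search implements it; the function names and the whitespace tokenizer are assumptions made for illustration.

```python
def terms(value):
    """Split a field value into lower-cased terms (simplistic whitespace tokenizer)."""
    return value.lower().split()


def search_matches(query, value):
    """Search query: the query must equal a complete term."""
    return query.lower() in terms(value)


def suggest_matches(query, value):
    """Suggest query: the query only needs to be a prefix of a term."""
    return any(term.startswith(query.lower()) for term in terms(value))


title = "Autocomplete Example"
print(suggest_matches("autoc", title))        # True
print(search_matches("autoc", title))         # False
print(search_matches("autocomplete", title))  # True
```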

Field type overview

The key distinguishing feature between field types is whether they are used for searching or not.

Textual fields - string and text - can be searched. But only string fields leverage both suggest and search queries.

The other field types are used to:

filter results
change the relevance of results, such as with functional boosts
sort
provide faceted counts of results

Type	Search Queries	Suggest Queries	Functional Boosts	Filtering	Sorting	Facets
`string`	Yes	Yes	No	Yes	Yes	Yes
`text`	Yes	No	No	No	No	No
`enum`	Yes	No	No	Yes	Yes	Yes
`integer`	No	No	Yes	Yes	Yes	Yes
`float`	No	No	Yes	Yes	Yes	Yes
`date`	No	No	No	Yes	Yes	Yes
`location`	No	No	No	Yes	No	No
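When planning a schema it can help to keep the table above in a form you can query. The sketch below transcribes the table into a Python dictionary; the capability labels ("search", "boost", and so on) and the helper name are hypothetical shorthand, not part of the Site Search API.

```python
# Capabilities per field type, transcribed from the table above.
CAPABILITIES = {
    "string":   {"search", "suggest", "filter", "sort", "facet"},
    "text":     {"search"},
    "enum":     {"search", "filter", "sort", "facet"},
    "integer":  {"boost", "filter", "sort", "facet"},
    "float":    {"boost", "filter", "sort", "facet"},
    "date":     {"filter", "sort", "facet"},
    "location": {"filter"},
}


def types_supporting(*needs):
    """Return the field types that support every requested capability."""
    wanted = set(needs)
    return [name for name, caps in CAPABILITIES.items() if wanted <= caps]


print(types_supporting("search", "filter"))  # ['string', 'enum']
print(types_supporting("boost"))             # ['integer', 'float']
```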

`string` fields

Short textual fields like title or headers use the string field type.

A string is used for both suggest and search queries.
A match in a string field for a full-text search will be returned in the highlights.
string fields cannot be used for functional boosts.
A string type field may contain up to 300 characters.

Textual fields longer than a few hundred characters should use the text type.

Structured data like a database ID or a URL should use the enum type.

`text` fields

Longer fields like the body text of an article use the text field type.

A text field will not match suggest queries but will match search queries.
A match in a text field will return in the highlights.
A text field cannot be used for filtering, sorting, functional boosts, or faceting.
A text field may contain up to 100,000 characters.
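The character limits above suggest a simple first-pass rule for picking a textual type. The helper below is a hypothetical sketch that looks only at length; in practice you would also weigh whether the field needs suggest queries, filtering, or sorting, which text fields do not support.

```python
STRING_MAX = 300      # character limit stated for string fields
TEXT_MAX = 100_000    # character limit stated for text fields


def textual_type_for(value):
    """Suggest a textual field type based only on the length of the value."""
    length = len(value)
    if length <= STRING_MAX:
        return "string"
    if length <= TEXT_MAX:
        return "text"
    raise ValueError(f"value has {length} characters; the text limit is {TEXT_MAX}")


print(textual_type_for("How It Feels [through Glass]"))  # string
print(textual_type_for("x" * 5000))                      # text
```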

`enum` fields

Cryptic bits of text like URLs and email addresses use the enum field type.

An enum field is considered a single piece of data. The values are not tokenized or analyzed.

eg. An enum value of "AppleCart" will not be lower-cased or split on case changes, as it would be with text search.

An enum field is used for suggest and search queries, if the values match exactly.
- For fuzzy matches, use a string field instead.
You can use enum fields to filter data and for faceting.
An enum field can be used to sort. But be aware that the sort is by string comparison.
- eg. The value "apple" will sort before "bear", but "100" will sort before "99" because the first character of "100" is less than the first character of "99". If you need numerical sorting, use an integer or float field instead.
enum fields may contain up to 2,000 characters.
A special enum field is the external_id which ties an Engine's document to your external website or application. All Site Search documents have an external_id. You do not have to define it in your schema.
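The sorting caveat is easy to reproduce outside of Site Search. The snippet below uses Python's built-in string comparison as a stand-in; the exact collation Site Search applies is not specified in this guide, so treat it as an illustration of the general effect only.

```python
# Values stored as strings compare character by character.
values = ["99", "100", "apple", "bear"]
print(sorted(values))   # ['100', '99', 'apple', 'bear']

# The same numbers stored as integers sort numerically.
numbers = [99, 100]
print(sorted(numbers))  # [99, 100]
```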

Numeric fields: `integer` and `float`

Numbers use the integer or float field type.

Numeric fields - integer and float - are not used in suggest or search queries.
Numeric fields can be used in scoring, filtering (including by range), functional boosts, sorting, and faceting.
- eg. The number of "Likes" on a post, or the average review score for a product.

`date` fields

Dates use the date field.

eg. You could store an article's publication date and search for articles published in the last 30 days using a range filter.

A date field is not used for suggest or search queries.
A date field can be used for filtering (including by range), sorting, and faceting.
When sent to the API, dates must be in ISO 8601 format, eg: "2013-02-27T18:09:19".
- We recommend using UTC representations for dates.
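If your application stores timestamps with a time zone, one way to follow both points above is to convert to UTC before formatting. This is a minimal sketch using Python's standard library; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_api_date(moment):
    """Convert an aware datetime to UTC and format it like "2013-02-27T18:09:19"."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# 13:09:19 at UTC-5 is 18:09:19 in UTC.
local = datetime(2013, 2, 27, 13, 9, 19, tzinfo=timezone(timedelta(hours=-5)))
print(to_api_date(local))  # 2013-02-27T18:09:19
```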

`location` fields

Geographic locations use the location field.

The location field allows filtering by distance from a specified point.

eg. A store could have a location field and users could search for stores near their location.

The location field type can be used only for filtering by location.
The location field is not used in suggest or search queries.
A location is specified using a JSON object containing the longitude and latitude, eg: {"lat": 56.2,"lon": 44.7}.
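A small helper can catch swapped or out-of-range coordinates before a document is indexed. The sketch below builds the JSON object shown above; the function name and the range checks are additions for illustration, not something the guide prescribes.

```python
import json


def location_value(lat, lon):
    """Build a location value, rejecting coordinates outside the valid ranges."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return {"lat": lat, "lon": lon}


print(json.dumps(location_value(56.2, 44.7)))  # {"lat": 56.2, "lon": 44.7}
```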

Multi-valued fields

Multi-valued fields are used for storing fields like tags or categories, with more than one distinct value.

You cannot mix multiple types in the same field.
- For example, an integer and a string cannot be stored in the same field.
Multi-valued fields are transparent in the search and suggest API calls. If the field type is searchable (string and text), multi-valued fields can be searched. If the field type is sortable, they can be sorted on, and so on.
To specify multiple values for a tag, pass a JSON array of the values, for example ["ruby", "rails", "json", "programming"].
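Because types cannot be mixed within a field, it can be worth checking arrays before indexing them. The following sketch uses the name/value/type field layout from the document example later in this guide; the helper itself is hypothetical.

```python
def multi_valued_field(name, values, field_type):
    """Build a field entry, refusing arrays that mix Python types."""
    kinds = {type(v) for v in values}
    if len(kinds) > 1:
        raise TypeError(f"{name}: mixed value types {sorted(k.__name__ for k in kinds)}")
    return {"name": name, "value": list(values), "type": field_type}


tags = multi_valued_field("tags", ["ruby", "rails", "json", "programming"], "string")
print(tags["value"])  # ['ruby', 'rails', 'json', 'programming']
```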

Crawler Based Engine Schema Design

By default, a page that is crawled is turned into a document.

Crawled documents belong to a DocumentType called page.

Crawled pages are built into documents according to the following Engine schema:

Field	Data Type	Suggest/Autocomplete?	Description
`external_id`	`enum`	No.	For crawler based search engines, the hexadecimal MD5 digest of the normalized URL of the page.
`updated_at`	`date`	No.	The date when the page was last indexed.
`title`	`string`	Yes.	The title of the page taken from the `<title>` tag or the `title` meta tag.
`url`	`enum`	No.	The URL of the page.
`sections`	`string`	Yes.	Sections of the page determined by `<h1>`, `<h2>`, and `<h3>` tags or the `sections` meta tag.
`body`	`text`	No.	The text of the page.
`type`	`enum`	No.	The page type, set by the `type` meta tag.
`image`	`enum`	No.	A URL for an image associated with the page (set by the `image` meta tag), used as a thumbnail in your search result listing if present.
`published_at`	`date`	No.	The date the page was published. It can be set with the `published_at` meta tag. If not defined via a meta tag, the value will be the last time an update to the page was detected during a crawl.
`popularity`	`integer`	No.	The popularity score for a page. Specialized crawlers for content management systems like Tumblr may use this field, or it can be set with the `popularity` meta tag and used to change search result rankings with functional boosts. If not specified, the default value is 1.
`info`	`string`	Yes.	Additional information about the page returned with the results, set by the `info` meta tag.

The descriptions of the default fields often reference meta tags.

Meta tags are either created by you, or inferred by the crawler.

You can allow the crawler to make its assumptions, or create and assert your own meta tag values.

A meta tag with class="swiftype" is also how you add a new field to your Engine schema:

<head>
  <meta class="swiftype" name="new-field" data-type="integer" content="12" />
</head>

Broken down, within the meta tag we have:

class="swiftype": Required to communicate with the crawler.
name="new-field": The name of your field.
data-type="integer": Any data type, all of which are laid out above.
content="12": The content of the field must match the data type. Integers for integers, coordinates ("10, -10") for location, etc.

There are a few crucial things to note:

Adding a new tag to one or more pages will add the new field to your Engine schema after the next crawl.
Multiple tags with the same name but different content will add the content as an array to the field.
Fields cannot be deleted! Be careful naming and structuring your tags. Look out for odd characters and spelling issues.
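If your pages are rendered from templates, you may want to generate these tags rather than write them by hand. The sketch below emits one tag per value, so a list of values produces the repeated tags that become an array field; the function name is hypothetical, and the attribute layout simply mirrors the example above.

```python
from html import escape


def swiftype_meta_tags(name, data_type, values):
    """Return one meta tag per value; repeated names become an array field."""
    return [
        f'<meta class="swiftype" name="{escape(name)}" '
        f'data-type="{escape(data_type)}" content="{escape(str(value))}" />'
        for value in values
    ]


for tag in swiftype_meta_tags("tags", "string", ["cats", "dogs"]):
    print(tag)
# <meta class="swiftype" name="tags" data-type="string" content="cats" />
# <meta class="swiftype" name="tags" data-type="string" content="dogs" />
```

Escaping the values guards against quotes or angle brackets in your content breaking the tag, which matters here because a malformed field name cannot be deleted afterwards.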

Crawler Based Engine, Example queries

Now that we have a schema and a set of documents that conform to it, we can launch some test queries.

We can find documents about cats and boost the score by popularity:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
  -H 'Content-Type: application/json' \
  -d '{
        "q": "cats",
        "document_types": ["page"],
        "functional_boosts": {
          "page": {
            "popularity": "linear"
          }
        }
      }'

We can find documents sorted alphanumerically by title:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["page"],
        "sort_field": {"page": "title"},
        "sort_direction": {"page": "desc"}
      }'

We can find recently published documents:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["page"],
        "filters": {
          "page": {
            "published_at": {"type": "range", "from": "2019-01-01"}
          }
        }
      }'
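If you build these request bodies in application code, a small helper keeps the nesting by DocumentType consistent. This is a sketch only: the function and parameter names are hypothetical, it does not send any request, and it places sort_field and sort_direction at the top level of the body, as the API based examples in this guide do.

```python
import json


def search_body(document_type, q=None, boosts=None, filters=None, sort=None):
    """Assemble a search request body shaped like the curl examples in this guide."""
    body = {"document_types": [document_type]}
    if q is not None:
        body["q"] = q
    if boosts:
        body["functional_boosts"] = {document_type: boosts}
    if filters:
        body["filters"] = {document_type: filters}
    if sort:
        field, direction = sort  # eg. ("title", "desc")
        body["sort_field"] = {document_type: field}
        body["sort_direction"] = {document_type: direction}
    return body


body = search_body("page", q="cats", boosts={"popularity": "linear"})
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d, in the same way as the hand-written bodies above.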

API Based Engine Schema Design

Let's say you are designing a schema for YouTube and want to search over its videos.

Videos have properties like title, caption, length, and so on.

You can view a complete list of attributes that a YouTube video has in the developer documentation.

The first step in schema design is determining which attributes you want to search, sort, and filter.

You only need to store data in Site Search that you want to search, sort, or filter. Site Search is not a database, but a search engine.

For a YouTube video, we might want to store data according to this search optimized schema:

Attribute	Purpose	Recommended Data Type
ID	Identifies a unique video; links a record in your database to a Site Search document	`external_id`
URL	Search results link	`enum`
thumbnail URL	Display image with search results	`enum`
channel ID	Filtering	`enum`
title	Suggest and search queries	`string`
caption	Search queries	`text`
tags	Suggest and search queries	`string` (multi-value)
* category name	Suggest and search queries	`string`
* category ID	Filtering by category	`enum`
published at date	Filtering by date range	`date`
duration (in seconds)	Filtering	`integer`
number of views	Filtering, functional boosts	`integer`
number of likes	Functional boosts	`integer`

Note that the schema contains both:
- The category name as a string, for searching.
- The category ID as an enum, for filtering.

Creating an API Based Engine Schema

Great ~ we mapped out the schema...

We will now use the API to index documents.

We can't get too far without an API based Engine.

Let's create one called youtube:

curl -X POST 'https://api.swiftype.com/api/v1/engines.json' \
-H 'Content-Type: application/json' \
-d '{
      "auth_token": "YOUR_API_KEY",
      "engine": {"name": "youtube"}
    }'

After that, we create the videos DocumentType to hold the documents:

curl -X POST 'https://api.swiftype.com/api/v1/engines/youtube/document_types.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "document_type": {"name": "videos"}
      }'

Next, we create a document in the videos DocumentType that matches the schema:

curl -X POST 'https://api.swiftype.com/api/v1/engines/youtube/document_types/videos/documents.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "document": {
          "external_id": "v1uyQZNg2vE",
          "fields": [
            {"name": "url", "value": "http://www.youtube.com/watch?v=v1uyQZNg2vE", "type":  "enum"},
            {"name": "thumbnail_url", "value": "https://i.ytimg.com/vi/v1uyQZNg2vE/mqdefault.jpg", "type": "enum"},
            {"name": "channel_id", "value": "UCK8sQmJBp8GCxrOtXWBpyEA", "type": "enum"},
            {"name": "title", "value": "How It Feels [through Glass]", "type": "string"},
            {"name": "caption", "value": "Want to see how Glass actually feels?...", "type": "text"},
            {"name": "tags", "value": ["glass", "wearable computing", "google"], "type": "string"},
            {"name": "category_name", "value": "Science & Technology", "type": "string"},
            {"name": "category_id", "value": 28, "type": "enum"},
            {"name": "published_at", "value": "2013-02-20T10:47:18", "type": "date"},
            {"name": "duration", "value": 136, "type": "integer"},
            {"name": "view_count", "value": 14599202, "type": "integer"},
            {"name": "like_count", "value": 75952, "type": "integer"}
          ]
        }
     }'
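In practice you would likely generate this payload from records in your own database. The sketch below maps a flat record to the document structure used in the request above; VIDEO_SCHEMA lists only a few of the fields, and both it and to_document are hypothetical names, not part of any Site Search client.

```python
import json

# Field name -> field type, following the schema table above (abbreviated).
VIDEO_SCHEMA = {
    "url": "enum",
    "title": "string",
    "caption": "text",
    "tags": "string",
    "duration": "integer",
}


def to_document(external_id, record, schema=VIDEO_SCHEMA):
    """Turn a flat record into the external_id/fields structure shown above."""
    fields = [
        {"name": name, "value": record[name], "type": field_type}
        for name, field_type in schema.items()
        if name in record
    ]
    return {"external_id": external_id, "fields": fields}


record = {"title": "How It Feels [through Glass]", "duration": 136}
print(json.dumps({"document": to_document("v1uyQZNg2vE", record)}, indent=2))
```

Keeping the types in one mapping means every document is indexed with the same type for a given field.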

It may seem strange that we define the fields in the document instead of the DocumentType. But Site Search schemas are flexible.

Individual documents in a DocumentType do not need to share all the same fields.

To add new fields over time, create documents that contain them.

Going forward, there are two key things to keep in mind:

Index new fields as the same type as existing documents.
- An existing string field should not become an enum field, and so forth.
Fields cannot be deleted once they have been created.
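Since a mistake here cannot be undone by deleting the field, you may want to check a document against the types you have already used before indexing it. The sketch below assumes you keep your own record of field types (known_types); the guide does not describe an API for this, and the function name is hypothetical.

```python
def type_conflicts(known_types, document):
    """Return (field, existing type, new type) for every field whose type changed."""
    return [
        (field["name"], known_types[field["name"]], field["type"])
        for field in document["fields"]
        if field["name"] in known_types and known_types[field["name"]] != field["type"]
    ]


known = {"title": "string", "category_id": "enum"}
doc = {"external_id": "abc", "fields": [
    {"name": "title", "value": "Cats", "type": "enum"},   # conflicts with string
    {"name": "rating", "value": 4.5, "type": "float"},    # new field, allowed
]}
print(type_conflicts(known, doc))  # [('title', 'string', 'enum')]
```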

API Based Engine, Example queries

Now that we have a schema and a set of documents that conform to it, we can launch some test queries.

The queries below really work -- try them on your own workstation.

We can find videos about cats and boost the score by the number of likes:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
  -H 'Content-Type: application/json' \
  -d '{
        "q": "cats",
        "document_types": ["videos"],
        "functional_boosts": {
          "videos": {
            "like_count": "linear"
          }
        }
      }'

We can find videos in the Pets & Animals category sorted by number of views:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["videos"],
        "filters": {"videos": {"category_id": "15"}},
        "sort_field": {"videos": "view_count"},
        "sort_direction": {"videos": "desc"}
      }'

We can find recent videos over a minute in length with more than 1,000,000 views:

curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
  -H 'Content-Type: application/json' \
  -d '{
        "document_types": ["videos"],
        "filters": {
          "videos": {
            "published_at": {"type": "range", "from": "2013-02-01"},
            "view_count": {"type": "range", "from": 1000000},
            "duration": {"type": "range", "from": 60}
          }
        }
      }'
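Range filters with a fixed start date go stale, so for a query like "published in the last 30 days" you would compute the lower bound at request time. The helpers below are a hypothetical sketch that only uses the "type" and "from" keys shown above; fixed_now is pinned so the example is reproducible.

```python
from datetime import datetime, timedelta, timezone


def since(days, now=None):
    """Range filter matching dates from `days` days before `now` onwards."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    return {"type": "range", "from": start.strftime("%Y-%m-%dT%H:%M:%S")}


def at_least(minimum):
    """Range filter with only a lower bound, as in the examples above."""
    return {"type": "range", "from": minimum}


fixed_now = datetime(2013, 3, 3, 0, 0, 0, tzinfo=timezone.utc)
filters = {"videos": {
    "published_at": since(30, now=fixed_now),
    "view_count": at_least(1_000_000),
    "duration": at_least(60),
}}
print(filters["videos"]["published_at"]["from"])  # 2013-02-01T00:00:00
```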

Try it yourself!

You should now feel well prepared to create your own Engine schema.

Stuck? Looking for help? Contact support or check out the Site Search community forum!
