Site Search Engine Schema Design Guide
This guide will teach you about the two types of search queries and the available field types, then walk through the construction of an Engine schema.
This guide covers both Crawler based and API based Engines.
Suggest v. Search Queries
There are two types of search queries:
Search queries: Match complete terms.
Suggest queries: Match on prefixes of terms to perform autocompletion of search terms.
- eg. If you have a document with a field value of "Autocomplete Example", a suggest query for "aut", "auto", "autoc" and so forth will match on "Autocomplete".
It is important to consider whether a field will be used for suggest queries or for search queries.
eg. The title of an article is a good candidate for suggest queries, but the body text would not be.
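The difference between the two query types can be sketched in a few lines of shell. This is only an illustration of the matching rules described above, applied to a single field value. The function names search_match and suggest_match are hypothetical and not part of Site Search, and the real engine does more analysis than lower-casing and splitting on whitespace.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical helpers, not part of the Site Search API.

# Succeeds when the query equals one complete term of the field value.
search_match() {
  local query term
  query=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for term in $(printf '%s' "$2" | tr '[:upper:]' '[:lower:]'); do
    [ "$term" = "$query" ] && return 0
  done
  return 1
}

# Succeeds when the query is a prefix of any term of the field value.
suggest_match() {
  local query term
  query=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for term in $(printf '%s' "$2" | tr '[:upper:]' '[:lower:]'); do
    case "$term" in
      "$query"*) return 0 ;;
    esac
  done
  return 1
}

suggest_match "aut" "Autocomplete Example" && echo "suggest: aut matches"
search_match "aut" "Autocomplete Example" || echo "search: aut does not match"
search_match "autocomplete" "Autocomplete Example" && echo "search: autocomplete matches"
```

Under these simplified rules, the prefix "aut" only matches as a suggest query, while the complete term "autocomplete" matches as a search query.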
Field type overview
The key distinguishing feature between field types is whether they are used for searching or not.
Textual fields - string and text - can be searched. But only string fields leverage both suggest and search queries.
The other field types are used to:
- filter results
- change the relevance of results, such as with functional boosts
- sort
- provide faceted counts of results
Type | Search Queries | Suggest Queries | Functional Boosts | Filtering | Sorting | Facets |
---|---|---|---|---|---|---|
string | Yes | Yes | No | Yes | Yes | Yes |
text | Yes | No | No | No | No | No |
enum | Yes | No | No | Yes | Yes | Yes |
integer | No | No | Yes | Yes | Yes | Yes |
float | No | No | Yes | Yes | Yes | Yes |
date | No | No | No | Yes | Yes | Yes |
location | No | No | No | Yes | No | No |
string fields
Short textual fields like title or headers use the string field type.
- A string is used for both suggest and search queries.
- A match in a string field for a full-text search will be returned in the highlights.
- string fields cannot be used for functional boosts.
- A string type field may contain up to 300 characters.
Textual fields longer than a few hundred characters should use the text type.
Structured data like a database ID or a URL should use the enum type.
text fields
Longer fields like the body text of an article use the text field type.
- A text field will not match suggest queries but will match search queries.
- A match in a text field will be returned in the highlights.
- A text field cannot be used for filtering, sorting, functional boosts, or faceting.
- A text field may contain up to 100,000 characters.
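The length limits for the two textual types can be checked before indexing. The following is a minimal sketch, assuming you want to validate values client-side. The helper names max_length_for and fits_field_type are hypothetical, and the limits are the ones stated above (300 characters for string, 100,000 for text).

```shell
#!/usr/bin/env bash
# Hypothetical client-side check; Site Search itself does not provide these helpers.

# Prints the documented maximum length for a textual field type.
max_length_for() {
  case "$1" in
    string) echo 300 ;;
    text)   echo 100000 ;;
    *)      return 1 ;;
  esac
}

# Succeeds when the value ($2) fits within the limit of the field type ($1).
# Returns 2 for a type this sketch does not know about.
fits_field_type() {
  local max
  max=$(max_length_for "$1") || return 2
  [ "${#2}" -le "$max" ]
}

title="How It Feels [through Glass]"
if fits_field_type string "$title"; then
  echo "title fits in a string field"
else
  echo "title is too long for a string field, consider text"
fi
```

Note that the character count is taken by the shell and depends on the current locale for multi-byte text.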
enum fields
Cryptic bits of text like URLs and email addresses use the enum field type.
An enum field is considered a single piece of data. The values are not tokenized or analyzed.
eg. An enum value of "AppleCart" will not be lower-cased or split on case changes, as with text search.
- An enum field is used for suggest and search queries, if the values match exactly.
  - For fuzzy matches, use a string field instead.
- You can use enum fields to filter data and for faceting.
- An enum field can be used to sort. But be aware that the sort is by string comparison.
  - eg. The value "apple" will sort before "bear" but "100" will sort before "99" because the first character of "100" is less than the first character of "99". If you need numerical sorting use an integer or float field instead.
- enum fields may contain up to 2,000 characters.
- A special enum field is the external_id which ties an Engine's document to your external website or application. All Site Search documents have an external_id. You do not have to define it in your schema.
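The string-comparison behavior of enum sorting can be reproduced locally with the standard sort utility. This sketch only illustrates the ordering rule. It does not call Site Search, and the function names are made up for the example.

```shell
#!/usr/bin/env bash
# Illustrative only: compares string ordering with numeric ordering.

# Byte-wise string comparison, similar to how enum values are sorted.
sort_as_strings() { LC_ALL=C sort; }

# Numeric comparison, the ordering you would expect from an integer or float field.
sort_as_numbers() { sort -n; }

echo "As strings:"
printf '99\n100\n' | sort_as_strings   # 100 comes first, because "1" sorts before "9"

echo "As numbers:"
printf '99\n100\n' | sort_as_numbers   # 99 comes first
```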
Numeric fields: integer and float
Numbers use the integer or float field type.
- Numeric fields - integer and float - are not used in suggest or search queries.
- Numeric fields can be used in scoring, filtering (including by range), functional boosts, sorting, and faceting.
  - eg. The number of "Likes" on a post, or the average review score for a product.
date fields
Dates use the date field.
eg. You could store an article's publication date and search for articles published in the last 30 days using a range filter.
- A date field is not used for suggest or search queries.
- A date field can be used for filtering (including by range), sorting, and faceting.
- When sent to the API, dates must be in ISO 8601 format, eg: "2013-02-27T18:09:19".
  - We recommend using UTC representations for dates.
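A UTC timestamp in the same shape as the example above can be produced with the standard date utility. This is a small sketch. The function name iso8601_utc_now is hypothetical.

```shell
#!/usr/bin/env bash
# Prints the current time in UTC as YYYY-MM-DDTHH:MM:SS,
# matching the shape of the "2013-02-27T18:09:19" example.
iso8601_utc_now() {
  date -u +%Y-%m-%dT%H:%M:%S
}

iso8601_utc_now
```

The -u flag makes date report UTC regardless of the machine's local time zone, which keeps indexed values consistent across servers.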
location fields
Geographic locations use the location field.
The location field allows filtering by distance from a specified point.
eg. A store could have a location field and users could search for stores near their location.
- The location field type can be used only for filtering by location.
- The location field is not used in suggest or search queries.
- A location is specified using a JSON object containing the latitude and longitude, eg: {"lat": 56.2,"lon": 44.7}.
Multi-valued fields
Multi-valued fields are used for storing fields like tags or categories, with more than one distinct value.
- You cannot mix multiple types in the same field.
  - For example, an integer and a string cannot be stored in the same field.
- Multi-valued fields are transparent in the search and suggest API calls. If the field type is searchable (string and text), multi-valued fields can be searched. If the field type is sortable, they can be sorted on, and so on.
- To specify multiple values for a tag, pass a JSON array of the values, for example ["ruby", "rails", "json", "programming"].
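If your values live in a shell script, the JSON array for a multi-valued string field can be assembled as follows. This is a minimal sketch with a hypothetical helper named json_array. It escapes backslashes and double quotes only, so values containing newlines or control characters would need a real JSON encoder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints its arguments as a JSON array of strings.
json_array() {
  local out="" item
  for item in "$@"; do
    # Escape backslashes first, then double quotes.
    item=$(printf '%s' "$item" | sed 's/\\/\\\\/g; s/"/\\"/g')
    out="${out:+$out, }\"$item\""
  done
  printf '[%s]\n' "$out"
}

json_array ruby rails json programming
# prints ["ruby", "rails", "json", "programming"]
```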
Crawler Based Engine Schema Design
By default, a page that is crawled is turned into a document.
Each crawled document belongs to a DocumentType called page.
Crawled pages are built into documents according to the following Engine schema:
Field | Data Type | Suggest/Autocomplete? | Description |
---|---|---|---|
external_id | enum | No. | For crawler based search engines, the hexadecimal MD5 digest of the normalized URL of the page. |
updated_at | date | No. | The date when the page was last indexed. |
title | string | Yes. | The title of the page taken from the <title> tag or the title meta tag. |
url | enum | No. | The URL of the page. |
sections | string | Yes. | Sections of the page determined by <h1>, <h2>, and <h3> tags or the sections meta tag. |
body | text | No. | The text of the page. |
type | enum | No. | The page type, set by the type meta tag. |
image | enum | No. | A URL for an image associated with the page (set by the image meta tag), used as a thumbnail in your search result listing if present. |
published_at | date | No. | The date the page was published. It can be set with the published_at meta tag. If not defined via a meta tag, the value will be the last time an update to the page was detected during a crawl. |
popularity | integer | No. | The popularity score for a page. Specialized crawlers for content management systems like Tumblr may use this field, or it can be set with the popularity meta tag and used to change search result rankings with functional boosts. If not specified, the default value is 1. |
info | string | Yes. | Additional information about the page returned with the results, set by the info meta tag. |
The descriptions of the default fields often reference meta tags.
Meta tags are either created by you, or inferred by the crawler.
You can allow the crawler to make its assumptions, or create and assert your own meta tag values.
A meta tag with class="swiftype" is also how you add a new field to your Engine schema:
<head>
<meta class="swiftype" name="new-field" data-type="integer" content="12" />
</head>
Broken down, within the meta tag we have:
- class="swiftype": Required to communicate with the crawler.
- name="new-field": The name of your field.
- data-type="integer": Any data type, all of which are laid out above.
- content="12": The content of the field must match the data type. Integers for integers, coordinates ("10, -10") for location, etc.
There are a few crucial things to note:
- Adding a new tag to one or more pages will add the new field to your Engine schema after the next crawl.
- Multiple tags with the same name but different content will add the content as an array to the field.
- Fields cannot be deleted! Be careful naming and structuring your tags. Look out for odd characters and spelling issues.
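If pages are generated by a script, the meta tags can be emitted from a small helper so that names and data types stay consistent from page to page. This is a sketch under that assumption. The function name swiftype_meta and the tags field are hypothetical, and the helper does not HTML-escape its arguments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one crawler meta tag.
# Arguments: field name, data type, content.
swiftype_meta() {
  printf '<meta class="swiftype" name="%s" data-type="%s" content="%s" />\n' "$1" "$2" "$3"
}

# A single-valued integer field, as in the example above.
swiftype_meta new-field integer 12

# Repeating the same name with different content yields a multi-valued field.
swiftype_meta tags string ruby
swiftype_meta tags string rails
```

Keeping the field name and data type in one place helps avoid the spelling mistakes warned about above, since a misspelled field cannot be deleted later.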
Crawler Based Engine, Example queries
Now that we have a schema and a set of documents that conform to it, we can launch some test queries.
We can find documents about cats and boost the score by popularity:
curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
-H 'Content-Type: application/json' \
-d '{
"q": "cats",
"document_types": ["page"],
"functional_boosts": {
"page": {
"popularity": "linear"
}
}
}'
We can find documents sorted alphanumerically by title:
curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
-H 'Content-Type: application/json' \
-d '{
"document_types": ["page"],
"sort_field": {"page": "title"},
"sort_direction": {"page": "desc"}
}'
We can find recently updated documents:
curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=example-key' \
-H 'Content-Type: application/json' \
-d '{
"document_types": ["page"],
"filters": {
"page": {
"published_at": {"type": "range", "from": "2019-01-01"}
}
}
}'
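The "from" date in a range filter does not have to be hard-coded. The sketch below builds the same request body with a date computed relative to today, for example the last 30 days. The helper names days_ago and recent_pages_payload are hypothetical, and the date arithmetic assumes either GNU date (-d) or BSD/macOS date (-v). Other implementations may support neither.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for building a relative date range filter.

# Prints the UTC date N days ago as YYYY-MM-DD.
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
days_ago() {
  date -u -d "$1 days ago" +%Y-%m-%d 2>/dev/null || date -u -v-"$1"d +%Y-%m-%d
}

# Prints a search request body for pages published in the last N days.
recent_pages_payload() {
  printf '{"document_types": ["page"], "filters": {"page": {"published_at": {"type": "range", "from": "%s"}}}}\n' "$(days_ago "$1")"
}

recent_pages_payload 30
```

The printed JSON can be passed to curl with -d in place of the literal body shown above.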
API Based Engine Schema Design
Let's say you were designing a schema for YouTube, and you want to search over the videos.
Videos have properties like title, caption, length, and so on.
You can view a complete list of attributes that a YouTube video has in the developer documentation.
The first step in schema design is determining which attributes you want to search, sort, and filter.
You only need to store data in Site Search that you want to search, sort, or filter. Site Search is not a database, but a search engine.
For a YouTube video, we might want to store data according to this search optimized schema:
Attribute | Purpose | Recommended Data Type |
---|---|---|
ID | Identifies a unique video; links a record in your database to a Site Search document | external_id |
URL | Search results link | enum |
thumbnail URL | Display image with search results | enum |
channel ID | Filtering | enum |
title | Suggest and search queries | string |
caption | Search queries | text |
tags | Suggest and search queries | string (multi-value) |
* category name | Suggest and search queries | string |
* category ID | Filtering by category | enum |
published at date | Filtering by date range | date |
duration (in seconds) | Filtering | integer |
number of views | Filtering, functional boosts | integer |
number of likes | Functional boosts | integer |
- Note that the schema contains both:
  - The category name as a string, for searching.
  - The category ID as an enum, for filtering.
Creating an API Based Engine Schema
Great ~ we mapped out the schema...
We will now use the API to index documents.
We can't get too far without an API based Engine.
Let's create one called youtube:
curl -X POST 'https://api.swiftype.com/api/v1/engines.json' \
-H 'Content-Type: application/json' \
-d '{
"auth_token": "YOUR_API_KEY",
"engine": {"name": "youtube"}
}'
After that, we create the videos DocumentType to hold the documents:
curl -X POST 'https://api.swiftype.com/api/v1/engines/youtube/document_types.json' \
-H 'Content-Type: application/json' \
-d '{
"auth_token": "YOUR_API_KEY",
"document_type": {"name": "videos"}
}'
Next, we create a document in the videos DocumentType that matches the schema:
curl -X POST 'https://api.swiftype.com/api/v1/engines/youtube/document_types/videos/documents.json' \
-H 'Content-Type: application/json' \
-d '{
"auth_token": "YOUR_API_KEY",
"document": {
"external_id": "v1uyQZNg2vE",
"fields": [
{"name": "url", "value": "http://www.youtube.com/watch?v=v1uyQZNg2vE", "type": "enum"},
{"name": "thumbnail_url", "value": "https://i.ytimg.com/vi/v1uyQZNg2vE/mqdefault.jpg", "type": "enum"},
{"name": "channel_id", "value": "UCK8sQmJBp8GCxrOtXWBpyEA", "type": "enum"},
{"name": "title", "value": "How It Feels [through Glass]", "type": "string"},
{"name": "caption", "value": "Want to see how Glass actually feels?...", "type": "text"},
{"name": "tags", "value": ["glass", "wearable computing", "google"], "type": "string"},
{"name": "category_name", "value": "Science & Technology", "type": "string"},
{"name": "category_id", "value": 28, "type": "enum"},
{"name": "published_at", "value": "2013-02-20T10:47:18", "type": "date"},
{"name": "duration", "value": 136, "type": "integer"},
{"name": "view_count", "value": 14599202, "type": "integer"},
{"name": "like_count", "value": 75952, "type": "integer"}
]
}
}'
It may seem strange that we define the fields in the document instead of the DocumentType. But Site Search schemas are flexible.
Individual documents in a DocumentType do not need to share all the same fields.
Create documents that contain new fields to add them over time.
Going forward, though, there are two key things to be mindful of:
- Index new fields as the same type as existing documents.
  - An existing string field should not become an enum field, and so forth.
- Fields cannot be deleted once they have been created.
API Based Engine, Example queries
Now that we have a schema and a set of documents that conform to it, we can launch some test queries.
The queries below really work -- try them on your own workstation.
We can find videos about cats and boost the score by the number of likes:
curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
-H 'Content-Type: application/json' \
-d '{
"q": "cats",
"document_types": ["videos"],
"functional_boosts": {
"videos": {
"like_count": "linear"
}
}
}'
We can find videos in the Pets & Animals category sorted by number of views:
curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
-H 'Content-Type: application/json' \
-d '{
"document_types": ["videos"],
"filters": {"videos": {"category_id": "15"}},
"sort_field": {"videos": "view_count"},
"sort_direction": {"videos": "desc"}
}'
We can find recent videos over a minute in length with more than 1,000,000 views:
curl -X GET 'https://api.swiftype.com/api/v1/public/engines/search.json?engine_key=swiftype-api-example' \
-H 'Content-Type: application/json' \
-d '{
"document_types": ["videos"],
"filters": {
"videos": {
"published_at": {"type": "range", "from": "2013-02-01"},
"view_count": {"type": "range", "from": 1000000},
"duration": {"type": "range", "from": 60}
}
}
}'
Try it yourself!
You should now feel well prepared to create your own Engine schema.
Stuck? Looking for help? Contact support or check out the Site Search community forum!