Site Search Crawler Overview
The easiest way to get started is to let the Site Search Crawler 'crawl' your content.
A crawler, or web crawler, is an application that scans the content of webpages.
Our Swiftbot is a high-performance web crawler that will quickly crawl and index your webpages.
This overview contains information on...
- What is a Crawler-based Engine?
- Crawler Configuration
- Crawling Multiple Domains
- Crawler Frequency
- Crawler API Control
- Default Page Schema
What is a Crawler-based Engine?
There are two possible Engine types within Site Search:
- API-based
- Crawler-based
If you use the crawler, then you have a Crawler-based Engine.
If your Engine was created as an API-based Engine during your initial signup, or was created via the Engines API, then it is an API-based Engine. You will know if that was the case!
If you are unsure, assume you are using a Crawler-based Engine.
Crawler Configuration
By default, the crawler will crawl each available page within your domain.
In many cases, you will want to specify what will be crawled, what will be indexed, and more.
To control how your content is indexed, you might apply:
- Dashboard: Manage Crawl Rules
- Meta Tags
- Content Curation Rules
- robots.txt Files
- sitemap.xml Files
- RSS or Atom Feeds
For the full list of tools at your fingertips, see Crawler Configuration.
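As a rough illustration of how robots.txt rules can limit what gets crawled, the sketch below evaluates a sample robots.txt with Python's standard `urllib.robotparser`. It assumes the crawler identifies itself with the user agent `Swiftbot`. The rules, paths, and the `is_allowed` helper are hypothetical, and this is not the crawler's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are placeholders.
ROBOTS_TXT = """\
User-agent: Swiftbot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Assumes the crawler announces itself as "Swiftbot".
    print(is_allowed(ROBOTS_TXT, "Swiftbot", "https://example.com/private/report.html"))
    print(is_allowed(ROBOTS_TXT, "Swiftbot", "https://example.com/blog/post.html"))
```

Checking rules locally like this can help catch an overly broad `Disallow` before the next crawl.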
Crawling Multiple Domains
The crawler will not cross domains while indexing content, and that includes subdomains!
To index multiple domains...
- Add them when first creating the Engine
- Add them via the Manage Domains page on the Dashboard
- (Optional) Use the Crawler Operations API endpoint.
For more information on multiple domains, see the multiple domains guide.
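To make the "no crossing domains" rule concrete, here is a minimal sketch that decides whether a discovered link falls inside the set of domains added to an Engine. It assumes an exact hostname match, so a subdomain counts as a separate domain. The `in_crawl_scope` helper and the example hostnames are hypothetical and only model the rule described above.

```python
from urllib.parse import urlsplit


def in_crawl_scope(url: str, engine_domains: set) -> bool:
    """Return True if the URL's hostname exactly matches a domain on the Engine."""
    host = (urlsplit(url).hostname or "").lower()
    return host in engine_domains


if __name__ == "__main__":
    # Hypothetical Engine with a single domain added.
    domains = {"example.com"}
    print(in_crawl_scope("https://example.com/pricing", domains))
    # A subdomain is out of scope until it is added as its own domain.
    print(in_crawl_scope("https://blog.example.com/post", domains))
```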
Crawler Frequency
Site Search will re-index your content periodically.
You can force a re-crawl by clicking the Recrawl button on the Domains page of the dashboard.
Or, you can use the Crawler Operations API endpoint.
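A recrawl request through the API might look something like the sketch below. The base URL, the `recrawl.json` path, the `PUT` method, and the `auth_token` parameter are all assumptions here, as are the placeholder identifiers. Confirm the exact request shape in the Crawler Operations API reference before relying on it. The function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

# Assumed API base URL; verify against the API reference.
API_BASE = "https://api.swiftype.com/api/v1"


def build_recrawl_request(engine_slug: str, domain_id: str, auth_token: str) -> urllib.request.Request:
    """Build (but do not send) a request asking for a domain to be recrawled."""
    path = "/engines/{}/domains/{}/recrawl.json".format(
        urllib.parse.quote(engine_slug, safe=""),
        urllib.parse.quote(domain_id, safe=""),
    )
    # Assumed: the API key is passed as a form-encoded auth_token parameter.
    body = urllib.parse.urlencode({"auth_token": auth_token}).encode("ascii")
    return urllib.request.Request(API_BASE + path, data=body, method="PUT")


if __name__ == "__main__":
    # Placeholder values, not real identifiers.
    request = build_recrawl_request("my-engine", "DOMAIN_ID", "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```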
Crawler API Control
Crawler-based Engines can be controlled through several API endpoints.
There is one key difference...
A Crawler-based Engine indexes documents via the crawler, while an API-based Engine indexes documents via the API. As a result, the Document Indexing endpoints are not available to Crawler-based Engines.
Endpoint | Supported? |
---|---|
Document Indexing | No. |
Crawler Operations | Yes! |
Engines | Engines are account level. |
Search | Yes! |
Autocomplete | Yes! |
Analytics | Yes! |
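Since Search is supported for Crawler-based Engines, a query against crawled pages could be built roughly as follows. The endpoint URL and the `engine_key`, `q`, and `per_page` parameter names are assumptions for this sketch, and the key shown is a placeholder. Check the Search API reference for the authoritative request format.

```python
from urllib.parse import urlencode

# Assumed public search endpoint; verify against the Search API reference.
SEARCH_URL = "https://api.swiftype.com/api/v1/public/engines/search.json"


def build_search_url(engine_key: str, query: str, per_page: int = 10) -> str:
    """Build a search URL for an Engine; parameter names are assumptions."""
    params = {"engine_key": engine_key, "q": query, "per_page": per_page}
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder engine key, not a real credential.
    print(build_search_url("YOUR_ENGINE_KEY", "crawler frequency"))
```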
Default Page Schema
When your pages are crawled, the content is indexed according to a schema.
When a page is indexed, a default DocumentType called page is created:
Field | Data Type | Suggest/Autocomplete? | Description |
---|---|---|---|
external_id | enum | No. | For crawler based search engines, the hexadecimal MD5 digest of the normalized URL of the page. |
updated_at | date | No. | The date when the page was last indexed. |
title | string | Yes. | The title of the page taken from the <title> tag or the title meta tag. |
url | enum | No. | The URL of the page. |
sections | string | Yes. | Sections of the page determined by <h1>, <h2>, and <h3> tags or the sections meta tags. |
body | text | No. | The text of the page. |
type | enum | No. | The page type, set by the type meta tag. |
image | enum | No. | A URL for an image associated with the page (set by the image meta tag), used as a thumbnail in your search result listing if present. |
published_at | date | No. | The date the page was published. It can be set with the published_at meta tag. If not defined via a meta tag, it will default to the last time an update to the page was detected during a crawl. |
popularity | integer | No. | The popularity score for a page. Specialized crawlers for content management systems like Tumblr may use this field, or it can be set with the popularity meta tag and used to change search result rankings with functional boosts. If not specified, the default value is 1. |
info | string | Yes. | Additional information about the page returned with the results, set by the info meta tag. |
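The external_id field is described as the hexadecimal MD5 digest of the normalized URL. The sketch below shows what such a digest looks like. The normalization steps (lowercasing the scheme and host, dropping the fragment, defaulting an empty path to `/`) are assumptions made for illustration, so IDs computed this way may not match the ones the crawler generates. The helper names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme and host, drop fragment, default path to '/'."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def external_id_for(url: str) -> str:
    """Return the hexadecimal MD5 digest of the (assumed) normalized URL."""
    return hashlib.md5(normalize_url(url).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Both inputs normalize to "http://example.com/" under the assumptions above.
    print(external_id_for("HTTP://Example.com"))
    print(external_id_for("http://example.com/#top"))
```

Because the ID is derived from the URL, two URLs that normalize to the same string map to the same document.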
Read more about Field Types and DocumentTypes.
Or learn how to design your Engine schema within our Engine Schema design guide.
Stuck? Looking for help? Contact support or check out the Site Search community forum!