
Crawler Operations

When you sign up for Site Search, the default flow has you enter a URL to begin a crawl of your website, creating a Crawler-based Engine. You also have the option of creating an API-based Engine instead, either via the account creation flow or through the Site Search API.

Both Engine types accomplish the same thing: they ingest and index documents to enable relevant, reliable search. However, the following functionality applies only to Crawler-based Engines. Read more about the differences in the API Overview.

These commands provide you with greater control over how the Site Search Crawler will interact with your webpages.

Crawler-based Engines

If you do not wish to create documents by hand, you can create a domain object within an Engine, and the Site Search Crawler will automatically crawl the domain and create a document for each page it finds.

This section contains instructions on how to create, retrieve, delete, and recrawl domains, and how to crawl individual URLs.

Note The following crawler-specific endpoints apply only to Crawler-based Engines. They cannot be used with API-based Engines.
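All of the domain endpoints below share the same path structure. As an illustrative sketch, the paths can be built programmatically; the base URL (`https://api.swiftype.com`) and the helper names here are assumptions for illustration, not part of an official client:

```python
# Sketch of URL builders for the domain endpoints documented below.
# BASE_URL and the function names are assumptions, not official client API.
BASE_URL = "https://api.swiftype.com/api/v1"

def domains_url(engine_id: str) -> str:
    """Endpoint for listing or creating Domains in an Engine."""
    return f"{BASE_URL}/engines/{engine_id}/domains.json"

def domain_url(engine_id: str, domain_id: str) -> str:
    """Endpoint for fetching or deleting a single Domain."""
    return f"{BASE_URL}/engines/{engine_id}/domains/{domain_id}.json"

print(domains_url("bookstore"))
# https://api.swiftype.com/api/v1/engines/bookstore/domains.json
```

The curl examples in the sections that follow hit exactly these paths, substituting your own `engine_id` and `domain_id`.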

Creating a domain

POST /api/v1/engines/{engine_id}/domains.json
Example - Create a Domain in the bookstore Engine. The crawler begins indexing immediately.
curl -X POST 'https://api.swiftype.com/api/v1/engines/bookstore/domains.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "domain": {"url": "https://example.com"}
      }'

Get a domain

GET /api/v1/engines/{engine_id}/domains.json
GET /api/v1/engines/{engine_id}/domains/{domain_id}.json
Example - Get every Domain in the bookstore Engine
curl -X GET 'https://api.swiftype.com/api/v1/engines/bookstore/domains.json?auth_token=YOUR_API_KEY'

Example - Get the Domain with ID 4fcec5182f527673a0000006 in the bookstore Engine
curl -X GET 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006.json?auth_token=YOUR_API_KEY'

Delete a domain

DELETE /api/v1/engines/{engine_id}/domains/{domain_id}.json
Example - Delete the Domain with ID 4fcec5182f527673a0000006 in the bookstore Engine
curl -X DELETE 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006.json?auth_token=YOUR_API_KEY'

Recrawl a domain

You may trigger a recrawl of a domain via the API. The frequency of these requests is dictated by your specific plan limitations.

PUT /api/v1/engines/{engine_id}/domains/{domain_id}/recrawl.json
Example - Recrawl a Domain with ID 4fcec5182f527673a0000006 in the bookstore Engine
curl -X PUT 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006/recrawl.json?auth_token=YOUR_API_KEY' \
  -H 'Content-Length: 0'
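The recrawl endpoint takes no request body, which is why the curl example sends an explicit `Content-Length: 0`. A minimal Python sketch of the same request, using only the standard library (the `https://api.swiftype.com` base URL is an assumption, and this is not an official client):

```python
import urllib.request

def recrawl_request(engine_id: str, domain_id: str, api_key: str) -> urllib.request.Request:
    """Build the empty-body PUT that triggers a domain recrawl."""
    url = (f"https://api.swiftype.com/api/v1/engines/{engine_id}"
           f"/domains/{domain_id}/recrawl.json?auth_token={api_key}")
    # data=b"" produces an empty body with Content-Length: 0,
    # matching the curl example above.
    return urllib.request.Request(url, data=b"", method="PUT")

req = recrawl_request("bookstore", "4fcec5182f527673a0000006", "YOUR_API_KEY")
# Sending counts against your plan's recrawl limits, so only send
# when you actually want a recrawl:
# urllib.request.urlopen(req)
```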

Crawl a single URL

PUT /api/v1/engines/{engine_id}/domains/{domain_id}/crawl_url.json

You may trigger the crawl of a single URL by using the domain's crawl_url endpoint with the URL as a parameter. If the URL belongs to the parent domain and already has a corresponding document, that document will be updated. If the URL validates against the parent domain but has not yet been indexed, it will be added as a new document.

Example - Crawl a URL in the Domain with ID 4fcec5182f527673a0000006 in the bookstore Engine.
curl -X PUT 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006/crawl_url.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "url": "https://example.com/new-page"
      }'
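Unlike the recrawl call, crawl_url is a PUT with a JSON body carrying the `auth_token` and the target `url`. A standard-library Python sketch of the same request (the `https://api.swiftype.com` base URL and the example page URL are assumptions for illustration):

```python
import json
import urllib.request

def crawl_url_request(engine_id: str, domain_id: str,
                      api_key: str, url: str) -> urllib.request.Request:
    """Build the PUT that asks the Crawler to (re)index a single URL."""
    endpoint = (f"https://api.swiftype.com/api/v1/engines/{engine_id}"
                f"/domains/{domain_id}/crawl_url.json")
    body = json.dumps({"auth_token": api_key, "url": url}).encode("utf-8")
    return urllib.request.Request(endpoint, data=body,
                                  headers={"Content-Type": "application/json"},
                                  method="PUT")

req = crawl_url_request("bookstore", "4fcec5182f527673a0000006",
                        "YOUR_API_KEY", "https://example.com/new-page")
# urllib.request.urlopen(req)  # send when ready
```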

Error Cases

Crawling the URL happens asynchronously. As a result, there is limited error checking. It is helpful to be aware of scenarios that will prevent the URL from being added as a document:

  • The URL does not respond successfully (for example, if it responds HTTP 404 or HTTP 503).
  • The URL does not belong to the domain (for example, you try to crawl a URL whose host does not match the domain).
  • The URL is a duplicate of an existing document.
  • The URL is excluded by robots.txt, a robots meta tag, or a path whitelist or blacklist.
  • The URL has a canonical link that belongs to a different page or domain.
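Because the crawl is asynchronous, most of these failures surface only after the fact. A hedged pre-flight sketch that catches the first two scenarios locally before you spend an API call (the helper name is hypothetical; robots rules, duplicates, and canonical links can only be evaluated server-side):

```python
import urllib.parse
import urllib.request
from typing import Optional

def preflight(url: str, domain_host: str) -> Optional[str]:
    """Return a likely rejection reason for this URL, or None.

    Only covers the first two error cases above; robots.txt rules,
    duplicate detection, and canonical links are checked by the Crawler.
    """
    if urllib.parse.urlparse(url).hostname != domain_host:
        return "URL does not belong to the domain"
    try:
        status = urllib.request.urlopen(url, timeout=10).status
    except Exception as exc:  # urlopen raises on HTTP 404 / 503, DNS errors, etc.
        return f"URL did not respond successfully: {exc}"
    if status >= 400:
        return f"URL responded with HTTP {status}"
    return None
```

For example, `preflight("https://other.example/page", "example.com")` reports a domain mismatch without any network round trip, since the host check runs first.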

Stuck? Looking for help? Contact Support!