Crawler Operations

These commands provide you with greater control over the Site Search Crawler.

Engine Type	Supported?
Crawler-based Engine	YES
API-based Engine	NO

Read more about Crawler-based Engines and API-based Engines in the API Overview.

Crawler-based Engines

This section explains how to:

Crawl a single URL
Recrawl a domain
Add a domain
GET a domain
Delete a domain
DELETE a document by URL
Error Cases

Crawl a single URL

Trigger a crawl of a single URL.

The URL must belong to the parent domain.

The document will be indexed if the URL validates with the parent domain and is new.

The document will be updated if the URL validates and already has a matching document.

The document will be deleted if the URL does not validate and has been indexed.
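
The three rules above amount to a small decision table. As a reading aid only, here is a minimal shell sketch of that table. `crawl_outcome` is a hypothetical helper, not part of the API, and the `unchanged` result for a URL that neither validates nor is indexed is an assumption, since this page does not describe that case.

```shell
# Hypothetical helper, for illustration only: mirrors the documented rules.
# $1: "yes" if the URL validates against the parent domain, else "no"
# $2: "yes" if a document for the URL is already indexed, else "no"
crawl_outcome() {
  case "$1:$2" in
    yes:no)  echo "indexed" ;;    # new, valid URL
    yes:yes) echo "updated" ;;    # valid URL with a matching document
    no:yes)  echo "deleted" ;;    # invalid URL that was indexed before
    no:no)   echo "unchanged" ;;  # assumption: not covered by the rules
    *)       echo "usage: crawl_outcome yes|no yes|no" >&2; return 2 ;;
  esac
}

crawl_outcome yes no   # prints "indexed"
```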

PUT /api/v1/engines/{engine_id}/domains/{domain_id}/crawl_url.json

Example - Crawl a URL in the domain with ID 4fcec5182f527673a0000006 in the bookstore Engine.

curl -X PUT 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006/crawl_url.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "url": "http://example.com/new-page"
      }'

Recrawl a domain

You may trigger a re-crawl of a domain via the API.

How often you may request a recrawl is limited by your plan.

See Plan & API Limitations for more details.

PUT /api/v1/engines/{engine_id}/domains/{domain_id}/recrawl.json

Example - Recrawl a domain with ID 4fcec5182f527673a0000006 in the bookstore Engine.

curl -X PUT -H 'Content-Length: 0' 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006/recrawl.json?auth_token=YOUR_API_KEY'
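
Because recrawl requests are limited by plan, a script that triggers recrawls may want to space them out locally. The sketch below shows one possible guard. `recrawl_due` and the one-day interval are hypothetical; the actual limit for your plan is the one listed under Plan & API Limitations.

```shell
# Hypothetical guard: succeeds when enough time has passed since the last
# recrawl request. All values are in seconds since the Unix epoch.
# $1: time of the last request, $2: current time, $3: minimum interval
recrawl_due() {
  [ $(( $2 - $1 )) -ge "$3" ]
}

# Example with made-up numbers; 86400 seconds (one day) is an assumed
# interval, not a documented limit.
if recrawl_due 1700000000 1700090000 86400; then
  echo "ok to request a recrawl"
else
  echo "too soon, skipping"
fi
```

In a real script the first argument would come from wherever you record the previous request, and the second from `date +%s`.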

Add a domain

Add a new domain that you own to your Engine.

Note that each additional domain incurs an added monthly cost.

See pricing for more information.

POST /api/v1/engines/{engine_id}/domains.json

Example - Create a domain for example.com in the bookstore Engine. After the call, a crawl will begin to index the domain.

curl -X POST 'https://api.swiftype.com/api/v1/engines/bookstore/domains.json' \
  -H 'Content-Type: application/json' \
  -d '{
        "auth_token": "YOUR_API_KEY",
        "domain": {"url": "http://example.com"}
      }'

GET a domain

Retrieve all domains, or a single domain by ID.

All domains

List all domains.

GET /api/v1/engines/{engine_id}/domains.json

Example - List every domain in the bookstore Engine

curl -X GET 'https://api.swiftype.com/api/v1/engines/bookstore/domains.json?auth_token=YOUR_API_KEY'

Single domain

Retrieve a single domain by ID.

GET /api/v1/engines/{engine_id}/domains/{domain_id}.json

Example - Retrieve a single domain in the bookstore Engine by domain ID.

curl -X GET 'https://api.swiftype.com/api/v1/engines/bookstore/domains/590cad86.json?auth_token=YOUR_API_KEY'

Delete a domain

Delete a domain from your Engine.

Once deleted, the domain will no longer be crawled.

DELETE /api/v1/engines/{engine_id}/domains/{domain_id}.json

Example - Delete the domain with ID 4fcec5182f527673a0000006 in the bookstore Engine.

curl -X DELETE 'https://api.swiftype.com/api/v1/engines/bookstore/domains/4fcec5182f527673a0000006.json?auth_token=YOUR_API_KEY'
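
The single-domain endpoint is the same path for retrieval and deletion; only the HTTP method differs. A small sketch of a URL builder makes that explicit. `domain_endpoint` is a hypothetical helper introduced here for illustration, and the host is the one used in this page's examples.

```shell
# Hypothetical helper: builds the endpoint URL for one domain, or for the
# domain collection when no domain ID is given.
# $1: engine slug, $2: domain ID (optional)
domain_endpoint() {
  base="https://api.swiftype.com/api/v1/engines/$1/domains"
  if [ -n "${2:-}" ]; then
    printf '%s/%s.json\n' "$base" "$2"
  else
    printf '%s.json\n' "$base"
  fi
}

domain_endpoint bookstore 4fcec5182f527673a0000006

# Usage (not run here): the same URL serves both methods.
#   curl -X GET    "$(domain_endpoint bookstore 4fcec5182f527673a0000006)?auth_token=YOUR_API_KEY"
#   curl -X DELETE "$(domain_endpoint bookstore 4fcec5182f527673a0000006)?auth_token=YOUR_API_KEY"
```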

DELETE a document by URL

Delete the document in your Engine whose crawled and indexed url field matches the given URL.

This performs a "soft" delete.

It can be undone using the Undelete This Page button within the page's entry in the Content section.

To remove the page permanently, exclude it from crawls using the robots.txt file or Content Inclusion & Exclusion rules.

DELETE /api/v1/engines/{engine_id}/document_types/{document_type}/documents/destroy_url?url={url}

Example - Delete a document which contains url=http://example.com/section/page.html

curl -X DELETE 'https://api.swiftype.com/api/v1/engines/bookstore/document_types/page/documents/destroy_url?url=http://example.com/section/page.html&auth_token=YOUR_API_KEY'
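
The page URL in the example above contains no characters that clash with the query string. A URL that carries its own `?` or `&` would be split by the query parser unless it is percent-encoded first. The sketch below shows one way to build the request URL safely. `urlencode` and `destroy_url_endpoint` are hypothetical helpers, the input is assumed to be ASCII, and it is an assumption that the API decodes the `url` parameter in the standard way.

```shell
# Hypothetical helper: percent-encodes every character outside the
# RFC 3986 unreserved set. Assumes ASCII input.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

# Hypothetical helper: builds the destroy_url request URL.
# $1: engine slug, $2: document type, $3: page URL, $4: API key
destroy_url_endpoint() {
  printf 'https://api.swiftype.com/api/v1/engines/%s/document_types/%s/documents/destroy_url?url=%s&auth_token=%s\n' \
    "$1" "$2" "$(urlencode "$3")" "$4"
}

destroy_url_endpoint bookstore page 'http://example.com/search?q=a&page=2' YOUR_API_KEY

# Usage (not run here):
#   curl -X DELETE "$(destroy_url_endpoint bookstore page 'http://example.com/search?q=a&page=2' YOUR_API_KEY)"
```

With curl alone, `curl -G -X DELETE --data-urlencode "url=..."` achieves the same encoding without a helper function.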

Error Cases

Crawling the URL happens asynchronously. As a result, there is limited error checking. It is helpful to be aware of scenarios that will prevent the URL from being added as a document:

The URL does not respond successfully (for example, if it responds HTTP 404 or HTTP 503).
The URL does not belong to the domain. For example, you try to crawl http://anotherdomain.net/contact on the example.com domain.
The URL is a duplicate of an existing document.
The URL is excluded by robots.txt, a robots meta tag, or a path whitelist or blacklist.
The URL has a canonical link that belongs to a different page or domain.
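
Since the API reports few of these problems, some can be caught on the client before sending the request. The sketch below covers only the second scenario, a URL outside the domain. `url_in_domain` is a hypothetical helper; it assumes an ASCII http or https URL without userinfo and treats subdomains as different hosts, which may not match the crawler's own validation.

```shell
# Hypothetical pre-check: succeeds when the URL's host equals the domain.
# $1: URL to crawl, $2: domain host (for example, example.com)
url_in_domain() {
  host=${1#*://}      # drop the scheme
  host=${host%%/*}    # drop the path
  host=${host%%:*}    # drop the port
  [ "$host" = "$2" ]
}

if url_in_domain "http://anotherdomain.net/contact" "example.com"; then
  echo "belongs to domain"
else
  echo "outside domain, crawl request would not add a document"
fi
```

The other scenarios (HTTP status, duplicates, robots rules, canonical links) depend on the live page and are not checked here.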

Stuck? Looking for help? Contact support or check out the Site Search community forum!
