To best ingest and index your webpages, the Site Search Crawler supports and abides by many different features of the modern web:
- Content Inclusion & Exclusion
- Content Inclusion & Exclusion, URL Paths
- Meta Tags
- Robots Meta-Tag
- Sitemap XML
- RSS and Atom
- Password Protected Content
- Constant Crawl
- URL Inspector
- Canonical Crawling
Content Inclusion & Exclusion
We have instructions on how you can include or exclude crawled content from your Search Engine:
Whitelist: Content Inclusion
Your webpages may include content that you do want ingested and content that you do not want ingested. In these scenarios you can use the content inclusion feature to whitelist elements of your pages for indexing.
For example, if you would like to index a single content section of every page you can set
data-swiftype-index=true on its element and the crawler will only extract text from that element.
<body>
  This is content that will not be indexed by the Swiftype crawler.
  <div data-swiftype-index='true'>
    <p>
      All content under the above div tag will be indexed.
    </p>
    <p>
      Content in this paragraph tag will be included in the search index!
    </p>
  </div>
  This content will not be indexed, since it isn't surrounded by an include tag.
</body>
Blacklist: Content Exclusion
You may want to keep certain content from being indexed in your site search engine -- for example, your site header, footer, or menu bar. You can tell the Swiftype crawler to exclude these elements by adding the
data-swiftype-index=false attribute to any element, as illustrated below.
<body>
  This is your page content, which will be indexed by the Swiftype crawler.
  <p data-swiftype-index='false'>
    Content in this paragraph tag will be excluded from the search index!
  </p>
  This content will be indexed, since it isn't surrounded by an excluded tag.
  <div id='footer' data-swiftype-index='false'>
    This footer content will be excluded as well.
  </div>
</body>
Whitelist and Blacklist rules, when nested, will work as you might expect. If there are multiple rules present on the page, all text will inherit behavior from the nearest parent element that contains a rule. This way, you will be able to include and exclude elements within each other.
For example, if the first rule is data-swiftype-index=false applied to a child element, any text outside that element, as well as any element with an inclusion rule, will be indexed into the page's document record.
<body>
  This is content that will not be indexed, since the first rule is true.
  <div data-swiftype-index='true'>
    <p>
      All content under the above div tag will be indexed.
    </p>
    <p>
      Content in this paragraph tag will be included in the search index!
    </p>
    <p data-swiftype-index='false'>
      Content in this paragraph will be excluded because of the nested rule.
    </p>
    <span data-swiftype-index="false">
      <p>
        Content in this paragraph will be excluded because the parent span is false.
      </p>
      <p data-swiftype-index="true">
        Content in this paragraph will be INCLUDED because the parent container is true.
      </p>
    </span>
  </div>
</body>
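The nearest-parent inheritance described above can be sketched with Python's standard html.parser. This is an illustrative model of the stated rules, not the crawler's implementation; note that the page default is passed in explicitly (False here, since the page contains an inclusion rule):

```python
from html.parser import HTMLParser

class IndexabilityParser(HTMLParser):
    """Collects text, honoring the nearest ancestor's data-swiftype-index rule."""

    def __init__(self, default=True):
        super().__init__()
        self.rule_stack = [default]
        self.indexed_text = []

    def handle_starttag(self, tag, attrs):
        rule = dict(attrs).get("data-swiftype-index")
        if rule is not None:
            self.rule_stack.append(rule == "true")
        else:
            # No rule on this element: inherit from the nearest parent rule.
            self.rule_stack.append(self.rule_stack[-1])

    def handle_endtag(self, tag):
        if len(self.rule_stack) > 1:
            self.rule_stack.pop()

    def handle_data(self, data):
        if self.rule_stack[-1] and data.strip():
            self.indexed_text.append(data.strip())

# The page contains an inclusion rule, so the default flips to "exclude".
parser = IndexabilityParser(default=False)
parser.feed(
    '<body>Outside text.'
    '<div data-swiftype-index="true"><p>Included.</p>'
    '<p data-swiftype-index="false">Nested excluded.</p></div></body>'
)
print(parser.indexed_text)  # ['Included.']
```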
Content Inclusion & Exclusion, URL Paths
Whitelist and blacklist rules allow you to tell the Site Search Crawler to include or exclude parts of your domain when crawling. To configure these rules, visit the Manage Domain page of your Site Search dashboard.
As you type a whitelist or blacklist rule, you will see a sample of URLs that will be affected.
Whitelist - Including only certain paths
Whitelist rules allow you to specify which parts of your domain you want the Site Search Crawler to ingest. If you add rules to the whitelist, the Site Search Crawler will only include the parts of your domain that match these rules. Otherwise, the crawler will include every page on your domain that is not excluded by a blacklist rule or your robots.txt file:
| Rule | Description |
|------|-------------|
| begin with | Include URLs that begin with this text. |
| contain | Include URLs that contain this text. |
| end with | Include URLs that end with this text. |
| match regex | Include URLs that match a regular expression. Advanced users only. |
Blacklist - Excluding certain paths
Blacklist rules allow you to tell the Site Search Crawler not to index parts of your domain. Blacklist rules are applied to the paths that your whitelist rules have allowed; if the whitelist is empty, they are applied to the whole domain. Any URL that matches a blacklist rule will not be crawled.
| Rule | Description |
|------|-------------|
| begin with | Exclude URLs that begin with this text. |
| contain | Exclude URLs that contain this text. |
| end with | Exclude URLs that end with this text. |
| match regex | Exclude URLs that match a regular expression. Advanced users only. |
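The interaction between the two rule sets can be sketched in Python. This is an illustration of the stated semantics (whitelist first, then blacklist), not Site Search's implementation; the rule kinds mirror the dashboard options above:

```python
import re

# Each rule is a (kind, value) pair mirroring the dashboard options.
def rule_matches(path: str, kind: str, value: str) -> bool:
    if kind == "begin with":
        return path.startswith(value)
    if kind == "contain":
        return value in path
    if kind == "end with":
        return path.endswith(value)
    if kind == "match regex":
        return re.search(value, path) is not None
    raise ValueError(f"unknown rule kind: {kind}")

def should_crawl(path: str, whitelist, blacklist) -> bool:
    # With a non-empty whitelist, only matching paths are considered at all.
    if whitelist and not any(rule_matches(path, k, v) for k, v in whitelist):
        return False
    # Blacklist rules are then applied to the paths the whitelist allowed.
    return not any(rule_matches(path, k, v) for k, v in blacklist)

whitelist = [("begin with", "/blog")]
blacklist = [("end with", ".pdf")]
print(should_crawl("/blog/post-1", whitelist, blacklist))      # True
print(should_crawl("/blog/archive.pdf", whitelist, blacklist)) # False
print(should_crawl("/about", whitelist, blacklist))            # False
```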
Site Search Meta Tags
The Site Search Crawler supports a very flexible set of <meta>-tags that allow you to deliver structured information to our web crawler. When our crawler visits your webpage, by default, we extract a standard set of fields (e.g. title, body) that are then indexed for searching in your search engine.
With these <meta>-tags, you can augment - or completely alter - the set of fields our crawler extracts in order to better fit the data you wish to be indexed on your website.
We have documentation around:
- Site Search Meta Tags
- Body-embedded Data Attribute Tags
- Upgrade your old Meta Tags
Each field we extract must meet specific structure guidelines, with defined name, type, and content values. The field type, specified in the data-type attribute, must be a valid, Site Search-supported field type, which you may read about here.
The template for a Site Search Meta-Tag is as follows, and should be placed within the <head> of your webpage:
<meta class="swiftype" name="[field name]" data-type="[field type]" content="[field content]" />
Choose the data-type carefully! Once a new meta-tag has been indexed, and the custom field created, its data-type cannot be changed.
This is a more complex example, showing the creation of multiple fields. As you can see, the tag field is repeated, and as a result our crawler extracts an array of tags for this URL. All field types can be extracted as arrays.
<head>
  <title>page title | website name</title>
  <meta class="swiftype" name="title" data-type="string" content="page title" />
  <meta class="swiftype" name="body" data-type="text" content="this is the body content" />
  <meta class="swiftype" name="url" data-type="enum" content="http://www.swiftype.com" />
  <meta class="swiftype" name="price" data-type="float" content="3.99" />
  <meta class="swiftype" name="quantity" data-type="integer" content="12" />
  <meta class="swiftype" name="published_at" data-type="date" content="2013-10-31" />
  <meta class="swiftype" name="store_location" data-type="location" content="20,-10" />
  <meta class="swiftype" name="tags" data-type="string" content="tag1" />
  <meta class="swiftype" name="tags" data-type="string" content="tag2" />
</head>
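As a sketch of how such tags might be consumed, here is a small standard-library parser that collects class="swiftype" meta tags and accumulates repeated names into arrays. This is an illustration of the behavior described above, not the crawler's actual extractor:

```python
from html.parser import HTMLParser

class SwiftypeMetaParser(HTMLParser):
    """Collects <meta class="swiftype"> tags into a field -> values mapping.

    Repeated field names (like "tags" above) accumulate into an array.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("class") == "swiftype":
            self.fields.setdefault(a.get("name"), []).append(a.get("content"))

parser = SwiftypeMetaParser()
parser.feed('''
<head>
  <meta class="swiftype" name="title" data-type="string" content="page title" />
  <meta class="swiftype" name="tags" data-type="string" content="tag1" />
  <meta class="swiftype" name="tags" data-type="string" content="tag2" />
</head>
''')
print(parser.fields)
# {'title': ['page title'], 'tags': ['tag1', 'tag2']}
```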
Body-embedded Data Attribute Tags
If you do not want to repeat large amounts of text in the <head> of your page, you can instead add data attributes to existing elements:
<body>
  <h1 data-swiftype-name="title" data-swiftype-type="string">title here</h1>
  <div data-swiftype-name="body" data-swiftype-type="text">
    Lots of body content goes here...
    Other content goes here too, and can be of any type, like a price:
    $<span data-swiftype-name="price" data-swiftype-type="float">3.99</span>
  </div>
</body>
Upgrade your old Meta Tags
When Site Search originally launched, we supported a small set of Meta Tags that were intended for very specific use-cases. We will continue to support those tags if you already have them on your website, but we do not recommend using them for new projects. It would be a wise decision to upgrade.
For each original-style meta tag, simply replace the tag with an equivalent tag using the new format. In each example below, the first line shows an example original-style tag and the second line shows the correct replacement.
<!-- old, deprecated style -->
<meta property='st:title' content='[title value]' />
<!-- new, correct style -->
<meta class='swiftype' name='title' data-type='string' content='[title value]' />

<!-- old, deprecated style -->
<meta property='st:section' content='[section value]' />
<!-- new, correct style -->
<meta class='swiftype' name='sections' data-type='string' content='[sections field value]' />

<!-- old, deprecated style -->
<meta property='st:image' content='[image url]' />
<!-- new, correct style -->
<meta class='swiftype' name='image' data-type='enum' content='[image url]' />

<!-- old, deprecated style -->
<meta property='st:type' content='[type value]' />
<!-- new, correct style -->
<meta class='swiftype' name='type' data-type='enum' content='[type value]' />

<!-- old, deprecated style -->
<meta property='st:info' content='[info value]' />
<!-- new, correct style -->
<meta class='swiftype' name='info' data-type='string' content='[info value]' />

<!-- old, deprecated style -->
<meta property='st:published_at' content='[published_at date]' />
<!-- new, correct style -->
<meta class='swiftype' name='published_at' data-type='date' content='[published_at date]' />

<!-- old, deprecated style -->
<meta property='st:popularity' content='[popularity value]' />
<!-- new, correct style -->
<meta class='swiftype' name='popularity' data-type='integer' content='[popularity value]' />
Robots.txt Support
The Site Search Crawler supports the robots.txt file standard and will respect all rules issued to our User-agent. Among other uses, the robots.txt file is a good way to exclude certain portions of your site from your Site Search Engine.
If you would like your
robots.txt file rules to apply only to the Site Search Crawler, you should specify the Swiftbot User-agent in your file, as shown in the
Disallow example below. We will also respect rules specified under the wildcard User-agent.
A robots.txt file disallowing the Site Search Crawler from indexing any content under the /mobile path:
User-agent: Swiftbot
Disallow: /mobile/
If you have a wildcard Disallow, your site will not be crawled. If you would like only Swiftbot to crawl your site, give it a blank Disallow rule while disallowing all other User-agents, as shown in the example below.
A robots.txt file allowing Swiftbot while disallowing all other User-agents:
User-agent: Swiftbot
Disallow:

User-agent: *
Disallow: /
You can also control the rate at which the crawler accesses your website by using the Crawl-delay directive, which expects a number of seconds. A delay of 5 seconds allows at most 17,280 crawls per day. Each crawl is web traffic, so limiting the rate can reduce bandwidth. Limiting it too much, however, can slow the uptake of new documents!
A robots.txt file with a Crawl-delay of 5 seconds:
User-agent: Swiftbot
Crawl-delay: 5
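The arithmetic behind the 17,280 figure is simply one request every delay seconds, around the clock:

```python
# One crawl request every `delay_seconds` seconds, around the clock.
def max_crawls_per_day(delay_seconds: int) -> int:
    return 24 * 60 * 60 // delay_seconds

print(max_crawls_per_day(5))   # 17280 -- the figure quoted above
print(max_crawls_per_day(60))  # 1440
```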
For fine-grained control over how your pages are indexed, you may use robots Meta Tags. Read more about that below!
Robots Meta Tag Support
We have instructions on:
- Using the Robots Meta-Tag
- Robots Meta-Tag Content Values
- Directing Instructions at Site Search Crawler Only
- Repeating Content Values
- Casing, Spacing and Ordering
Using the "robots" meta-tag
Place the robots meta-tag in the
<head> section of your page:
<!doctype html>
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    Page content here
  </body>
</html>
Robots meta tag content values
Site Search supports the INDEX, NOINDEX, FOLLOW, NOFOLLOW, and NONE values for the robots tag. FOLLOW and INDEX are the defaults and are not necessary unless you are overriding a robots meta tag for Swiftype (see below). Other values, such as NOARCHIVE, are ignored.
To tell the Site Search Crawler not to index a page, use
<meta name="robots" content="noindex">
Links from an unindexed page will still be followed.
To tell the Site Search Crawler not to follow links from a page, use
<meta name="robots" content="nofollow">
Content from a page that has
NOFOLLOW will still be indexed.
To neither follow links nor index content from a page, use NOINDEX, NOFOLLOW:
<meta name="robots" content="noindex, nofollow">
NONE is a synonym for the above:
<meta name="robots" content="none">
We recommend specifying the robots directives in a single tag, but multiple tags will be combined if present.
Directing instructions at the Site Search Crawler only
Using meta name="robots" applies your instructions to all web crawlers, including Swiftbot, the Site Search Crawler. If you would like to direct special instructions at the Site Search Crawler only, use st:robots as the name instead of robots:
<meta name="robots" content="noindex, nofollow"> <meta name="st:robots" content="follow, index">
This example tells other crawlers not to index or follow links from the page, but allows the Site Search Crawler to index and follow links.
When any meta tag with the name st:robots is present on the page, all other robots meta rules will be ignored in favor of the st:robots rules.
Repeated content values
If robots directives are repeated, the Site Search Crawler will use the most restrictive.
<meta name="robots" content="noindex">
<meta name="robots" content="index">
The above is equivalent to noindex alone, since the most restrictive value wins.
Casing, spacing, and ordering
Tags, attribute names, and attribute values are all case-insensitive.
Multiple attribute values must be separated by a comma, but whitespace is ignored. Order is not important (
NOINDEX, NOFOLLOW is the same as
NOFOLLOW, NOINDEX). The following are considered the same:
<meta name="robots" content="noindex, nofollow">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<META name="rOBOTs" content=" noIndex , NOfollow ">
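Putting the casing, whitespace, and most-restrictive rules together, a directive combiner might be sketched like this (an illustration of the stated rules, not the crawler's code):

```python
def robots_flags(contents):
    """Combine robots meta content values case-insensitively,
    keeping the most restrictive result.

    Returns (index, follow) booleans.
    """
    index = follow = True
    for content in contents:
        values = {v.strip().lower() for v in content.split(",")}
        if "noindex" in values or "none" in values:
            index = False
        if "nofollow" in values or "none" in values:
            follow = False
    return index, follow

# Repeated tags: the most restrictive value wins.
print(robots_flags(["noindex", "index"]))      # (False, True)
# Casing and whitespace are ignored.
print(robots_flags([" noIndex , NOfollow "]))  # (False, False)
# NONE is a synonym for NOINDEX, NOFOLLOW.
print(robots_flags(["none"]))                  # (False, False)
```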
The crawler also respects the rel="nofollow" attribute on individual links:
<a href="/some/page" rel="nofollow">Do not follow this link</a>
Since the Site Search Crawler only follows links on the same domain, this is useful for crawl prioritization. For example, you could put
rel="nofollow" on your sign in and create account links, and the Site Search Crawler will not try to follow them.
Sitemap XML Support
The Site Search Crawler supports the Sitemap XML format. Using a Sitemap can provide a significant speed boost to the crawl: instead of examining each page for new links to follow, the Site Search Crawler will use your Sitemap file(s) to download the URLs directly.
We have instructions on:
- The Sitemap Format
- Installing Your Sitemap
The Sitemap Format
The Sitemap XML format specifies a list of URLs to index.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.com/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/faq/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/about/</loc>
  </url>
</urlset>
A sitemap file can also link to a list of other sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yoursite.com/sitemap1.xml</loc>
    <lastmod>2012-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yoursite.com/sitemap2.xml</loc>
    <lastmod>2012-01-01</lastmod>
  </sitemap>
</sitemapindex>
For full details, review the Sitemaps documentation.
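A sketch of extracting the <loc> URLs from either flavor of Sitemap file, using only the standard library (illustrative, not the crawler's code):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_urls(xml_text: str) -> list:
    """Extract every <loc> from a urlset or sitemapindex document."""
    # Encode to bytes so the XML declaration's encoding attribute is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]

xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.yourdomain.com/</loc></url>
  <url><loc>http://www.yourdomain.com/faq/</loc></url>
</urlset>"""
print(sitemap_urls(xml_text))
# ['http://www.yourdomain.com/', 'http://www.yourdomain.com/faq/']
```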
Installing Your Sitemap
A /robots.txt file with multiple Sitemap URLs:
User-agent: *
Sitemap: http://www.yourdomain.com/sitemap1.xml
Sitemap: http://www.yourdomain.com/sitemap2.xml
If no Sitemap files are found in the robots.txt file, the crawler will try to find one at the standard location, /sitemap.xml.
Site Search does not currently support pinging to notify the crawler of Sitemap existence, page priority, last modification date, or refresh frequency.
RSS and Atom Feed Support
RSS and Atom are standards for content syndication. In short, they provide a machine-readable way of describing updated content on a website. Many blogging platforms and content management systems support either RSS or Atom (or both) to help readers stay up-to-date with their content.
The Site Search Crawler supports RSS and Atom feeds. If your website provides an RSS or Atom feed, our crawler will download it to find new links on your site to index first. This is particularly useful when Site Search is doing an incremental update of your website, as it gives us a good hint about where to find the most recently updated pages.
For example, if your website has an RSS feed at http://www.yoursite.com/index.rss, the auto-discovery code would look like this:
<html>
  <head>
    <title>Your Site Title</title>
    <link rel="alternate" type="application/rss+xml" title="RSS feed" href="http://www.yoursite.com/index.rss">
  </head>
  <body>
    ...
  </body>
</html>
If your website has an Atom feed, you would use the application/atom+xml type for the <link> element:
<link rel="alternate" type="application/atom+xml" title="Atom feed" href="http://www.yoursite.com/index.atom">
Password Protected Content
The Site Search Crawler does not automatically index content protected by HTTP authentication. You can work around this limitation by configuring your site to allow access to the Site Search Crawler agent, Swiftbot. For a more secure solution, our support team can also supply you with your account-specific user agent string.
Constant Crawl
For our Business and Premium customers, we offer our Constant Crawl feature, which detects new and updated pages on your site and indexes them in near real time. This feature is most beneficial for ingesting new content that is added between crawls of your entire website.
If you think the Constant Crawl feature would be a good option for your site, but are not currently a Business or Premium customer, you can reach out to email@example.com to learn more.
The Site Search URL Inspector: FAQ
The URL Inspector is an invaluable tool! We have written this guide in FAQ format. Review these questions to get up and running in no time.
What does the URL Inspector do?
If you are using the Site Search Crawler, you can use the URL Inspector to dive deeper into any URLs to learn more details. The Inspector will evaluate a URL and give you feedback on its findings.
Where is the URL Inspector?
There are two ways to get to the URL Inspector:
The first option is to go to the Content section of the Site Search Dashboard and paste a URL in the search bar. If the URL has no matches there will be an option to 'Inspect this URL.'
The second option is to go to the Content section of the Swiftype Dashboard and click on a content link. Once on the Content's Properties page, you will find an Options drop-down at the top right. Click the drop-down and select Inspect.
Both of these options take you to the URL Inspector landing page.
How do I use the URL Inspector?
When you get to the URL Inspector page, the URL previously selected will appear in the search bar and the Inspector's evaluation of the URL will appear below it.
You can use the URL Inspector landing page to search other URLs as well. Just make sure that when typing the URL into the search bar you include the scheme, such as https://. Then click Inspect and see what Site Search can find!
What will it be able to tell me about my URLs?
As mentioned earlier, the URL Inspector gives you more details about your URLs, reporting conditions such as:
- The URL is not valid.
- The URL is blocked by domain rules. The Inspector even gives you specific details about which whitelist or blacklist rules are impacting the URL and guides you through resolution steps.
- The URL does not have a confirmed domain for the engine.
- The URL can't be reached, but the Site Search Crawler has crawled it before.
- The page can't be reached.
- The URL has not been visited by the current crawl yet.
- The page can't be indexed because it duplicates another page.
- The page can't be indexed due to mismatched field types.
- Indexing is in progress.
- Everything looks great for that URL: you will get confirmation that the latest version of the page is available for search, as well as details on when it was last updated.
Once you've completed these steps, you should be all set! The URL Inspector promises to take your Site Search indexing to the next level.
Canonical Crawling
Webpages may contain duplicate content. Search engines want to know the source of truth and reward pages that represent the original content in a clear manner. To ensure good standing within search engine results, developers can include canonical link elements within the <head> of duplicate pages:
<link rel="canonical" href="https://example.com/tea/peppermint" />
These tags tell search engine crawlers that: "This is a duplicate. The original is found at this link:
https://example.com/tea/peppermint." Thus, they preserve quality rankings.
The Site Search Crawler will obey these tags, too. However, an incorrect implementation can prevent your pages from being indexed.
Non-specific canonical link elements
When canonical link elements are attached to pages, each should include the precise URI of that page's original content (such as https://example.com/tea/peppermint), or the page should carry no canonical link element at all.
Suppose, for example, that the homepage URI, https://example.com, is used as the canonical link element on every page. Each page should have its own URI; instead, the crawler will assume there is only one page, https://example.com, and the other pages will not be indexed.
Developers may choose to put a canonical link element within the original page, too, denoting that it is the source of truth. Consider a case where the element does not exactly match the actual page URI: https://example.com/tea/peppermint/ is included within the canonical link element -- note the trailing slash! When the crawler visits https://example.com/tea/peppermint, where the content is located, the canonical link element directs it to https://example.com/tea/peppermint/. But hitting that URL takes the session back to https://example.com/tea/peppermint -- an infinite loop! This can also occur with other misconfigured redirects.
The crawler will give up after a while, and the page will not be indexed.
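A minimal sketch of such loop detection, using a hypothetical canonical_of mapping in place of real page fetches:

```python
def follow_canonicals(start_url, canonical_of, max_hops=5):
    """Follow canonical link elements, stopping if a loop is detected.

    `canonical_of` maps a URL to the canonical URL its page declares
    (or None). A real crawler would fetch pages; this is an illustration.
    """
    seen = {start_url}
    url = start_url
    for _ in range(max_hops):
        target = canonical_of.get(url)
        if target is None or target == url:
            return url   # settled on a canonical page
        if target in seen:
            return None  # loop detected: give up, page not indexed
        seen.add(target)
        url = target
    return None          # too many hops: also give up

# The trailing-slash mixup from the example above:
canonical_of = {
    "https://example.com/tea/peppermint": "https://example.com/tea/peppermint/",
    "https://example.com/tea/peppermint/": "https://example.com/tea/peppermint",
}
print(follow_canonicals("https://example.com/tea/peppermint", canonical_of))  # None
```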
Stuck? Looking for help? Contact Support!