To best crawl your pages, the Site Search Crawler supports and abides by many different features of the modern web:
- Content Inclusion & Exclusion
- Content Inclusion & Exclusion, URL Paths
- Robots Meta-Tag
- Sitemap XML
- RSS and Atom
- Password Protected Content
- Constant Crawl
- URL Inspector
Content Inclusion & Exclusion
We have instructions on how you can include or exclude crawled content from your Search Engine:
Whitelist: Content Inclusion
In some cases you may have many elements you want excluded from a page and only a few you would like included. In these scenarios you can use the content inclusion feature to whitelist elements of your pages for indexing. For example, if every page has a single content section, you can set data-swiftype-index=true on that element and our crawler will extract text only from there.
<body>
  This is content that will not be indexed by the Swiftype crawler.
  <div data-swiftype-index='true'>
    <p> All content under the above div tag will be indexed. </p>
    <p> Content in this paragraph tag will be included in the search index! </p>
  </div>
  This content will not be indexed, since it isn't surrounded by an include tag.
</body>
Blacklist: Content Exclusion
You may want to keep certain content from being indexed in your site search engine -- for example, your site header, footer, or menu bar. You can tell the Swiftype crawler to exclude these elements by adding the data-swiftype-index=false attribute to any element, as illustrated below.
<body>
  This is your page content, which will be indexed by the Swiftype crawler.
  <p data-swiftype-index='false'> Content in this paragraph tag will be excluded from the search index! </p>
  This content will be indexed, since it isn't surrounded by an excluded tag.
  <div id='footer' data-swiftype-index='false'>
    This footer content will be excluded as well.
  </div>
</body>
Nesting the above rules works as you might expect. If multiple rules are present on a page, text inherits its behavior from the nearest parent element that has a rule, so you can include and exclude elements within each other. Text outside any element with an applied rule is treated as the opposite of the first-appearing rule on the page. For example, if the first rule is data-swiftype-index=false applied to a child element, any text outside that element (and any element with an inclusion rule) will be indexed in the page's document record.
<body>
  This is content that will not be indexed since the first rule is true.
  <div data-swiftype-index='true'>
    <p> All content under the above div tag will be indexed. </p>
    <p> Content in this paragraph tag will be included in the search index! </p>
    <p data-swiftype-index='false'> Content in this paragraph will be excluded because of the nested rule. </p>
    <span data-swiftype-index="false">
      <p> Content in this paragraph will be excluded because the parent span is false. </p>
      <p data-swiftype-index="true"> Content in this paragraph will be INCLUDED because the parent container is true. </p>
    </span>
  </div>
</body>
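The nearest-parent inheritance described above can be modeled in a few lines. This is a minimal sketch of the stated behavior, not the crawler's actual implementation: it walks the document with Python's built-in html.parser, tracks the nearest data-swiftype-index rule on the element stack, and treats text outside any rule as the opposite of the first rule on the page.

```python
from html.parser import HTMLParser

class IndexRuleParser(HTMLParser):
    """Collect text fragments with the nearest-ancestor indexing rule.

    A simplified model of the inclusion/exclusion behavior described
    above; not the actual Swiftype crawler implementation.
    """

    def __init__(self):
        super().__init__()
        self.stack = []         # one True/False/None rule per open tag
        self.first_rule = None  # first explicit rule seen on the page
        self.fragments = []     # (text, effective rule or None)

    def handle_starttag(self, tag, attrs):
        rule = dict(attrs).get("data-swiftype-index")
        if rule is not None:
            rule = (rule == "true")
            if self.first_rule is None:
                self.first_rule = rule
        else:
            # untagged elements inherit from the nearest tagged ancestor
            rule = self.stack[-1] if self.stack else None
        self.stack.append(rule)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(
                (text, self.stack[-1] if self.stack else None))

def indexed_text(html):
    """Return the fragments an indexer following these rules would keep."""
    parser = IndexRuleParser()
    parser.feed(html)
    # text outside any rule gets the opposite of the first-appearing rule
    default = not parser.first_rule if parser.first_rule is not None else True
    return [t for t, rule in parser.fragments
            if (rule if rule is not None else default)]
```

Running indexed_text over the first example above would keep only the two paragraphs inside the data-swiftype-index='true' div.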
Content Inclusion & Exclusion, URL Paths
Whitelist and blacklist rules allow you to tell the Swiftype Crawler to include or exclude parts of your domain. To configure these rules, visit the Manage Domain page of your Swiftype dashboard.
As you type a whitelist or blacklist rule, you will see a sample of URLs that will be affected.
Whitelist - Including only certain paths
Whitelist rules allow you to specify which parts of your domain you want the Swiftype Crawler to index. If you add rules to the whitelist, the Swiftype Crawler will include only the parts of your domain that match these rules. Otherwise, Swiftype will include every page on your domain that is not excluded by a blacklist rule or your robots.txt file:
- begin with: Include URLs that begin with this text.
- contain: Include URLs that contain this text.
- end with: Include URLs that end with this text.
- match regex: Include URLs that match a regular expression. Advanced users only.
Blacklist - Excluding certain paths
Blacklist rules allow you to tell the Site Search Crawler not to index parts of your domain. The rules you create in the blacklist will be applied to everything allowed by the whitelist rules. If there is no whitelist, everything on your domain is assumed to be allowed.
- begin with: Exclude URLs that begin with this text.
- contain: Exclude URLs that contain this text.
- end with: Exclude URLs that end with this text.
- match regex: Exclude URLs that match a regular expression. Advanced users only.
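The four rule types behave like simple string and regular-expression predicates over the URL path. As an illustrative sketch (the dashboard applies these rules server-side, so the exact matching details here are an assumption), whitelist and blacklist evaluation might look like:

```python
import re

# The four rule types from the lists above, modeled as predicates.
RULE_TYPES = {
    "begin with":  lambda pattern, path: path.startswith(pattern),
    "contain":     lambda pattern, path: pattern in path,
    "end with":    lambda pattern, path: path.endswith(pattern),
    "match regex": lambda pattern, path: re.search(pattern, path) is not None,
}

def allowed(path, whitelist=(), blacklist=()):
    """Apply whitelist rules first (if any), then blacklist rules,
    mirroring the precedence described above."""
    # with a whitelist present, a path must match at least one rule
    if whitelist and not any(RULE_TYPES[k](p, path) for k, p in whitelist):
        return False
    # blacklist rules exclude from whatever the whitelist allowed
    if any(RULE_TYPES[k](p, path) for k, p in blacklist):
        return False
    return True
```

For example, allowed("/blog/post", whitelist=[("begin with", "/blog")]) passes, while adding blacklist=[("contain", "draft")] would reject /blog/draft.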
Swiftype-specific Meta Tags
The Swiftype web crawler supports a very flexible set of <meta> tags that allow you to deliver structured information to our web crawler. When our crawler visits your webpage, by default, we extract a standard set of fields (e.g. title, body) that are then indexed for searching in your search engine. With these <meta> tags you can augment -- or completely alter -- the set of fields our crawler extracts in order to better fit the data you wish to be indexed on your website.
Each field we extract must meet specific structure guidelines, with defined name, type, and content values. The field type, which is specified in the data-type attribute, must be a valid, Swiftype-supported field type, which you may read about here.
The template for a Swiftype-specific meta tag is as follows, and should be placed within the <head> of your webpage:
<meta class="swiftype" name="[field name]" data-type="[field type]" content="[field content]" />
Choose your data-type carefully! Once a new meta tag has been indexed and the custom field created, its data-type cannot be changed.
This is a more complex example, showing the creation of multiple fields. As you can see, the tag field is repeated, and as a result our crawler extracts an array of tags for this URL. All field types can be extracted as arrays.
<head>
  <title>page title | website name</title>
  <meta class="swiftype" name="title" data-type="string" content="page title" />
  <meta class="swiftype" name="body" data-type="text" content="this is the body content" />
  <meta class="swiftype" name="url" data-type="enum" content="http://www.swiftype.com" />
  <meta class="swiftype" name="price" data-type="float" content="3.99" />
  <meta class="swiftype" name="quantity" data-type="integer" content="12" />
  <meta class="swiftype" name="published_at" data-type="date" content="2013-10-31" />
  <meta class="swiftype" name="store_location" data-type="location" content="20,-10" />
  <meta class="swiftype" name="tags" data-type="string" content="tag1" />
  <meta class="swiftype" name="tags" data-type="string" content="tag2" />
</head>
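To see how repeated tags become arrays, here is a small sketch using Python's built-in html.parser that collects swiftype meta tags into a field map. Every field is gathered as a list, mirroring the tags example; this is an illustration, not the crawler's actual code.

```python
from html.parser import HTMLParser

class SwiftypeMetaParser(HTMLParser):
    """Collect <meta class="swiftype"> tags into {name: [values]}."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("class") == "swiftype":
            # repeated names accumulate, so "tags" becomes an array
            self.fields.setdefault(a.get("name"), []).append(a.get("content"))
```

Feeding it the head snippet above yields fields like {"price": ["3.99"], "tags": ["tag1", "tag2"], ...}.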
Body-embedded Data Attribute Tags
If you don't want to repeat large amounts of text in the <head> of your page, you can instead add data attributes to existing elements in the body.
<body>
  <h1 data-swiftype-name="title" data-swiftype-type="string">title here</h1>
  <div data-swiftype-name="body" data-swiftype-type="text">
    Lots of body content goes here...
    Other content goes here too, and can be of any type, like a price:
    $<span data-swiftype-name="price" data-swiftype-type="float">3.99</span>
  </div>
</body>
Upgrade your old Meta Tags
When Site Search originally launched, we supported a small set of meta tags intended for very specific use cases. We will continue to support those tags if you already have them on your website, but we do not recommend them for new projects, and encourage you to upgrade them (instructions below) to the new style described above. You can read the original documentation here.
For each original-style meta tag, simply replace the tag with an equivalent tag using the new format. In each example below the first line shows an example original-style tag, and the second line shows the correct replacement.
<!-- old, deprecated style -->
<meta property='st:title' content='[title value]' />
<!-- new, correct style -->
<meta class='swiftype' name='title' data-type='string' content='[title value]' />

<!-- old, deprecated style -->
<meta property='st:section' content='[section value]' />
<!-- new, correct style -->
<meta class='swiftype' name='sections' data-type='string' content='[sections field value]' />

<!-- old, deprecated style -->
<meta property='st:image' content='[image url]' />
<!-- new, correct style -->
<meta class='swiftype' name='image' data-type='enum' content='[image url]' />

<!-- old, deprecated style -->
<meta property='st:type' content='[type value]' />
<!-- new, correct style -->
<meta class='swiftype' name='type' data-type='enum' content='[type value]' />

<!-- old, deprecated style -->
<meta property='st:info' content='[info value]' />
<!-- new, correct style -->
<meta class='swiftype' name='info' data-type='string' content='[info value]' />

<!-- old, deprecated style -->
<meta property='st:published_at' content='[published_at date]' />
<!-- new, correct style -->
<meta class='swiftype' name='published_at' data-type='date' content='[published_at date]' />

<!-- old, deprecated style -->
<meta property='st:popularity' content='[popularity value]' />
<!-- new, correct style -->
<meta class='swiftype' name='popularity' data-type='integer' content='[popularity value]' />
The Site Search Crawler supports the standard features of the Robots.txt file standard, and will respect all rules issued to our User-agent. Among other uses, the Robots.txt file is a good way to exclude certain portions of your site from your Swiftype site-search engine.
If you would like your robots.txt rules to apply only to our crawler, specify the Swiftbot User-agent in your file, as shown in the Disallow example below. We will also respect rules specified under the wildcard User-agent.
User-agent: Swiftbot
Disallow: /mobile/
If you have a wildcard Disallow, we simply will not touch your site at all. If you would like to specifically Allow only the Swiftype bot to index your site, you can Allow it (using a blank Disallow rule) as shown in the example below.
User-agent: Swiftbot
Disallow:

User-agent: *
Disallow: /
You can also control the rate at which the crawler accesses your website by using the Crawl-delay directive:

User-agent: Swiftbot
Crawl-delay: 5
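You can check how a crawler will interpret these directives with Python's standard urllib.robotparser, which understands per-agent groups and Crawl-delay. This is a quick verification sketch, not part of Swiftype itself.

```python
import urllib.robotparser

# a robots.txt with a Swiftbot-specific group, as described above
ROBOTS_TXT = """\
User-agent: Swiftbot
Crawl-delay: 5
Disallow: /mobile/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Swiftbot", "/mobile/page"))  # False: /mobile/ is disallowed
print(rp.can_fetch("Swiftbot", "/blog/"))        # True: everything else is allowed
print(rp.crawl_delay("Swiftbot"))                # 5
```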
For fine-grained control over how your pages are indexed, you may use robots meta tags. Read more about that below!
Robots Meta Tag Support
The Swiftype web crawler supports the robots meta tag standard. This allows you to control how Swiftype indexes pages on your site.
We have instructions on:
- Using the Robots Meta Tag
- Robots Meta Tag Content Values
- Directing Instructions at Site Search Crawler Only
- Repeating Content Values
- Casing, Spacing and Ordering
Using the "robots" meta-tag
Place the robots meta tag in the <head> section of your page:
<!doctype html>
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    Page content here
  </body>
</html>
Robots meta tag content values
Swiftype supports the INDEX, NOINDEX, FOLLOW, NOFOLLOW, and NONE values for the robots tag. FOLLOW and INDEX are the defaults and are not necessary (unless you are overriding a robots meta tag for Swiftype; see below). Other values (such as NOARCHIVE) are ignored.
To tell Swiftype not to index a page, use
<meta name="robots" content="noindex">
Links from an unindexed page will still be followed.
To tell Swiftype not to follow links from a page, use
<meta name="robots" content="nofollow">
Content from a page that has NOFOLLOW will still be indexed.
To not follow links and not index content from a page, use NOINDEX, NOFOLLOW:
<meta name="robots" content="noindex, nofollow">
NONE is a synonym for the above:
<meta name="robots" content="none">
We recommend specifying the robots directives in a single tag, but multiple tags will be combined if present.
Directing instructions at Swiftype only
Using meta name="robots" applies your instructions to all web crawlers, including Swiftype's. If you would like to direct special instructions at Swiftype's crawler, use st:robots as the name instead of robots:
<meta name="robots" content="noindex, nofollow">
<meta name="st:robots" content="follow, index">
This example tells other crawlers not to index or follow links from the page, but allows Swiftype's crawler to index and follow links.
When any meta tag with the name st:robots is present on the page, all other robots meta rules will be ignored in favor of the st:robots rules.
Repeated content values
If robots directives are repeated, Swiftype will use the most restrictive.
<meta name="robots" content="noindex">
<meta name="robots" content="index">
The above is equivalent to specifying noindex.
Casing, spacing, and ordering
Tags, attribute names, and attribute values are all case-insensitive.
Multiple attribute values must be separated by a comma, but whitespace is ignored. Order is not important (NOINDEX, NOFOLLOW is the same as NOFOLLOW, NOINDEX). The following are considered the same:
<meta name="robots" content="noindex, nofollow">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<META name="rOBOTs" content=" noIndex , NOfollow ">
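The parsing rules above (case-insensitivity, whitespace tolerance, NONE as a synonym, and most-restrictive wins when directives are repeated) can be summarized in a short sketch. This illustrates the stated behavior; it is not Swiftype's parser.

```python
def parse_robots_content(*contents):
    """Combine one or more robots meta content strings.

    Case and whitespace are ignored; NONE expands to NOINDEX, NOFOLLOW;
    repeated directives resolve to the most restrictive value.
    Returns the effective (index, follow) pair of booleans.
    """
    tokens = set()
    for content in contents:
        for token in content.split(","):
            tokens.add(token.strip().lower())
    if "none" in tokens:
        tokens |= {"noindex", "nofollow"}
    # any restrictive token wins over its permissive counterpart
    index = "noindex" not in tokens
    follow = "nofollow" not in tokens
    return index, follow
```

For example, combining "noindex" and "index" yields (False, True), matching the repeated-directives rule above.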
The Swiftype crawler supports the rel="nofollow" standard.
<a href="/some/page" rel="nofollow">Do not follow this link</a>
Since Swiftype only follows links on the same domain, this is mostly useful for crawl prioritization. For example, you could put rel="nofollow" on your sign-in and account-creation links, and Swiftype will not waste time trying to follow them.
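A crawler honoring this standard simply skips anchors whose rel attribute contains nofollow. A minimal sketch with Python's built-in html.parser (illustrative, not Swiftype's code):

```python
from html.parser import HTMLParser

class FollowableLinks(HTMLParser):
    """Collect hrefs from <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            # rel may hold several space-separated tokens
            rels = (a.get("rel") or "").lower().split()
            if "nofollow" not in rels:
                self.links.append(a["href"])
```

Feeding it a page with a nofollow sign-in link would collect every href except that one.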
Our crawler supports the Sitemap XML format. Using Sitemap can speed up crawling your website significantly. Instead of examining each page for new links to follow, the Site Search Crawler will use your Sitemap file(s) to download the URLs directly.
We have instructions on:
- The Sitemap Format
- Installing Your Sitemap
The Sitemap Format
The Sitemap XML format specifies a list of URLs to index.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.com/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/faq/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/about/</loc>
  </url>
</urlset>
A sitemap file can also link to a list of other sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yoursite.com/sitemap1.xml</loc>
    <lastmod>2012-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yoursite.com/sitemap2.xml</loc>
    <lastmod>2012-01-01</lastmod>
  </sitemap>
</sitemapindex>
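Both document types above can be read with any namespace-aware XML parser. A sketch using Python's xml.etree.ElementTree that pulls every <loc> from a urlset or sitemapindex:

```python
import xml.etree.ElementTree as ET

# the namespace declared by the sitemap examples above
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_locs(xml_text):
    """Return all <loc> URL strings from a urlset or sitemapindex."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]
```

Applied to a sitemap index, the same function returns the child sitemap URLs, which a crawler would then fetch and parse in turn.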
For full details, review the Sitemaps documentation.
Installing Your Sitemap
The Site Search Crawler supports specifying Sitemap files in your /robots.txt file, and multiple Sitemap URLs may be listed:
User-agent: *
Sitemap: http://www.yourdomain.com/sitemap1.xml
Sitemap: http://www.yourdomain.com/sitemap2.xml
If no Sitemap files are found in the robots.txt file, the crawler will try to find one at
Site Search does not currently support pinging to notify the crawler of Sitemap existence, page priority, last modification date, or refresh frequency.
RSS and Atom Feed Support
RSS and Atom are standards for content syndication. In short, they provide a machine-readable way of describing updated content on a website. Many blogging platforms and content management systems support RSS or Atom (or both) to help readers stay up to date with their content.
The Swiftype crawler supports RSS and Atom feeds. If your website provides an RSS or Atom feed, our crawler will download it to find new links on your site to index first. This is particularly useful when Swiftype is doing an incremental update of your website, as it gives us a good hint about where to find the most recently updated pages.
To tell the Swiftype crawler about your RSS or Atom feed, use auto-discovery in the <head> section of your template.
For example, if your website has an RSS feed at http://www.yoursite.com/index.rss, the auto-discovery code would look like this:
<html>
  <head>
    <title>Your Site Title</title>
    <link rel="alternate" type="application/rss+xml" title="RSS feed" href="http://www.yoursite.com/index.rss">
  </head>
  <body>
    ...
  </body>
</html>
If your website has an Atom feed, you would use the application/atom+xml type for the link tag:

<link rel="alternate" type="application/atom+xml" title="Atom feed" href="http://www.yoursite.com/index.atom">
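Auto-discovery works because these feed links are machine-readable. A sketch of how a crawler might find them with Python's built-in html.parser (illustrative, not Swiftype's code):

```python
from html.parser import HTMLParser

# the two feed MIME types used in the examples above
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedDiscovery(HTMLParser):
    """Find <link rel="alternate"> feed URLs in a page's <head>."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and (a.get("rel") or "").lower() == "alternate"
                and a.get("type") in FEED_TYPES
                and a.get("href")):
            self.feeds.append(a["href"])
```

Feeding it the RSS example above would discover http://www.yoursite.com/index.rss.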
Password Protected Content
Our crawler does not automatically support indexing content protected by HTTP authentication. You can work around this limitation by configuring your site to allow access to our crawler agent, Swiftbot. For a more secure solution, our support team can also supply you with an account-specific user agent string.
For our Business and Premium customers, we offer our Constant Crawl feature, which is capable of detecting new and updated pages on your site and indexing them in near real time. This feature is most beneficial for indexing new content that is added in-between crawls of your entire website.
If you think the Constant Crawl feature would be a good option for your site, but are not currently a Business or Premium customer, you can reach out to firstname.lastname@example.org to learn more.
The Swiftype URL Inspector Step-by-Step Guide
The Swiftype Customer Care team wants to ensure you're able to use the new URL Inspector seamlessly and immediately. We have written this in FAQ format. Review these steps to get up and running in no time!
What does the URL Inspector do?
If you are using the Swiftype Crawler, you can use the URL Inspector to dive deeper into any URLs to learn more details. The Inspector will evaluate a URL and give you feedback on its findings.
Where is the URL Inspector?
There are two ways to get to the URL Inspector:
The first option is to go to the Content section of the Swiftype Dashboard and paste a URL in the search bar. If the URL has no matches there will be an option to 'Inspect this URL.'
The second option is to go to the Content section of the Swiftype Dashboard and click on a content link. Once on the Content's Properties page you will find an Options drop down at the top right—click the drop down and select Inspect.
Both of these options take you to the URL Inspector landing page.
How do I use the URL Inspector?
When you get to the URL Inspector page the URL previously selected will appear in the search bar and the Inspector's evaluation of the URL will appear below it.
You can use the URL Inspector landing page to search other URLs as well. Just make sure when typing the URL into the search bar you include http:// or https://. Then click inspect and see what Swiftype finds!
What will it be able to tell me about my URLs?
As mentioned earlier, the URL Inspector gives you more details about your URLs, reporting findings such as:
- The URL is not valid.
- The URL is blocked by domain rules. The Inspector even gives you specific details about which whitelist or blacklist rules are impacting the URL and guides you through resolution steps.
- The URL doesn't have a confirmed domain for the engine.
- The URL can't be reached, but Swiftype has crawled it before.
- The page can't be reached.
- The URL has not been visited by the current crawl yet.
- The page can't be indexed due to a duplicate page.
- The page can't be indexed due to mismatched field types.
- Indexing is in progress.
- Everything looks great for that URL: you will get a confirmation that the latest version of the page is available for search, as well as details on when it was last updated.
Once you've completed these steps, you should be all set! The URL Inspector promises to take your site search indexing to the next level.
If you have any additional questions or feedback, please reach out to the Swiftype Customer Care team and we'll gladly assist. Happy searching!
Stuck? Looking for help? Contact Support!