
Crawler Troubleshooting

The Site Search Crawler will crawl and index pages if it can access and read them. There are a few common cases that can lead to your site not being crawled successfully. In some cases, there are things you can tune within the Site Search dashboard. In others, you might need to make alterations to your website code.

Dashboard

If your documents are not being indexed, you may have restrictive path rules.

You can confirm whether or not this is true by clicking on the Domains tab to bring up your list of crawled websites:

Troubleshooting - The Domains section of the Site Search dashboard, listing the domains that have been added to an Engine: four Swiftype domains, for the documentation, main site, YouTube, and community.

If you notice that you are not indexing pages from your main website, or a specific website, click on the Manage dropdown next to the website, and then select Manage Crawl Rules:

Troubleshooting - The Manage dropdown next to a website, opened to reveal Manage Crawl Rules among other options.

There are two lists that contain rules: a whitelist and a blacklist. If you have paths within your blacklist, those pages will not be indexed:

Troubleshooting - The crawl rules for a domain, with some items found within the blacklist.

When adding a rule, you can choose how to match on the path:

Troubleshooting - The window for adding a new rule, which accepts a path string along with a matching pattern; these oft-misconfigured options are described below.

You can include paths that will trigger blacklisting in accordance with a matching pattern: Begins with, Ends with, Contains, or a custom regular expression.

A common case is an accidental misconfiguration of the path rules...

For example, instead of preventing indexing for URLs that Begin with: /hidden, one might accidentally choose URLs that Contain: /hidden. In this way, results like /tourism/hidden-cove will not be indexed, even though they are not in the /hidden directory.

Another frequent culprit is an over-zealous regular expression. If you were to blacklist / or *, none of your pages would be indexed!
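
If it helps to reason about these patterns concretely, here is a small Python sketch -- purely illustrative, and not how the dashboard evaluates rules -- showing how a Begins with rule, a Contains rule, and an over-broad regular expression each treat the example paths above:

import re

paths = ["/hidden/report.html", "/tourism/hidden-cove", "/menu"]

rules = {
    "Begins with /hidden": lambda path: path.startswith("/hidden"),
    "Contains /hidden":    lambda path: "/hidden" in path,
    "Regex /":             lambda path: re.search("/", path) is not None,
}

for name, rule in rules.items():
    blacklisted = [path for path in paths if rule(path)]
    print(name, "->", blacklisted)

# Begins with /hidden -> ['/hidden/report.html']
# Contains /hidden    -> ['/hidden/report.html', '/tourism/hidden-cove']
# Regex /             -> ['/hidden/report.html', '/tourism/hidden-cove', '/menu']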

Double check that your path rules are not responsible for any pages that have not been indexed.

Common Website Issues

Your website code may be written in a way that obscures or prevents crawling.

Ensure that the following are in good order:

Robots.txt

It is common for a website to include a robots.txt file.

It can be accessed by entering /robots.txt after your website URL, like so: https://example.com/robots.txt.

The Site Search Crawler, like all well-behaved crawlers, will obey the instructions within a robots.txt file. If your website is not being crawled, consider whether you have a Disallow rule set and, if you do, whether it is too aggressive.

A common case is a misplaced slash (/).

The following example will make a case-sensitive prefix match on the pattern that you have provided:

User-agent: *
Disallow: /sushi

Any file or directory whose path begins with /sushi, matching that exact casing, will be disallowed and not crawled: /sushi.html, /sushicats, /sushicats/wallawalla.html, and /sushi.html?id=123 will all be blocked.

In contrast, a trailing slash will ensure that matches exist only within the defined folder:

User-agent: *
Disallow: /sushi/

This will match on /sushi/wallawalla.html and /sushi/milkbuns/fresh.html, but not on /sushi.html?id=123, which lives outside the /sushi/ folder.
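
You can sanity check these prefix rules with Python's standard-library robots.txt parser. This is only an approximation of how the Site Search Crawler reads the file, but the prefix behaviour is the same:

from urllib import robotparser

def allowed(rules: str, path: str) -> bool:
    parser = robotparser.RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)

without_slash = "User-agent: *\nDisallow: /sushi"
with_slash    = "User-agent: *\nDisallow: /sushi/"

allowed(without_slash, "/sushi.html?id=123")      # False: /sushi matches as a prefix
allowed(with_slash,    "/sushi.html?id=123")      # True: the file sits outside the /sushi/ folder
allowed(with_slash,    "/sushi/wallawalla.html")  # False: inside the /sushi/ folder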

Another common case is uncapped wildcard characters:

User-agent: *
Disallow: /*.html

This will match every URL whose path contains .html, no matter how deeply it is nested in directories, and regardless of whether anything follows the .html: /sushi/milkbuns/fresh.html and /sushi/milkbuns/fresh.html?id=132 will both match.

If this is not the intended behaviour, one can cap the pattern by putting a $ at the end:

User-agent: *
Disallow: /*.html$

Now, only URLs that end in .html will be disallowed, like /sushi/milkbuns.html and /sushi/milkbuns/fresh.html; /sushi/milkbuns/fresh.html?id=132 will once again be crawled.
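
Python's built-in robots.txt parser does not understand * or $, so if you want to experiment with wildcard rules, a rough translation into a regular expression works. This sketch follows the Googlebot-style semantics described above and is not the crawler's actual implementation:

import re

def wildcard_rule(pattern: str):
    # * matches any run of characters, a trailing $ anchors the end,
    # and everything else is matched literally as a prefix.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))

uncapped = wildcard_rule("/*.html")
capped   = wildcard_rule("/*.html$")

bool(uncapped.match("/sushi/milkbuns/fresh.html?id=132"))  # True: .html appears mid-URL
bool(capped.match("/sushi/milkbuns/fresh.html?id=132"))    # False: the URL does not end in .html
bool(capped.match("/sushi/milkbuns/fresh.html"))           # True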

Robots meta tags

The Site Search Crawler can understand Robots directives from your meta tags. Meta tags are placed within the <head></head> tags of your website:

<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    ...
  </body>
</html>

A common issue is when folks mix up the two restrictive directives: noindex and nofollow. When the crawler crawls a page, it indexes that page and follows each link within it. Adding nofollow means the page itself will still be indexed, but none of its links will be followed. Adding noindex means the page will not be indexed, but its links will still be followed and indexed. Using both ensures that the page is not indexed and that its links are not followed.
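
As a mental model, a crawler that honours these directives behaves roughly like this Python sketch (an illustration of the semantics above, not the Site Search Crawler's actual code):

def directives(content: str) -> set:
    # Parse the content attribute of a robots meta tag.
    return {value.strip().lower() for value in content.split(",")}

rules = directives("noindex, nofollow")

index_page   = "noindex" not in rules   # noindex: leave this page out of the index
follow_links = "nofollow" not in rules  # nofollow: do not crawl this page's links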

Canonical URLs

Canonical URLs can be useful for SEO purposes. But when misconfigured, they can cause troubles for the Site Search Crawler.

There are two common cases:

When canonical link elements are created, they should include a precise URL.

The following tag is acceptable if the page is actually https://example.com.

<link rel="canonical" href="https://example.com">

The URL seen when you browse the page must match the one within the canonical link element.

If the page you are visiting is https://example.com/sushi, the canonical link element should look like this:

<link rel="canonical" href="https://example.com/sushi">

If the current page and the canonical element are off by just one character, there can be significant issues.
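
One way to catch these near misses is to compare the browsed URL against the canonical href programmatically. The following sketch is a simple illustration, assuming an exact match is wanted on every URL component:

from urllib.parse import urlsplit

def canonical_matches(page_url: str, canonical_href: str) -> bool:
    # True only if the browsed URL and the canonical link element agree exactly.
    return urlsplit(page_url) == urlsplit(canonical_href)

canonical_matches("https://example.com/sushi", "https://example.com/sushi")   # True
canonical_matches("https://example.com/sushi", "https://example.com/sushi/")  # False: one character off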

Redirect loops

Sometimes redirect settings can put the crawler into an infinite loop. The crawler will follow the canonical URL. If that canonical URL redirects back to the original page, then the loop begins.

For example, if your pages had:

<link rel="canonical" href="https://example.com/tea/peppermint">

And the URL had a redirect so that https://example.com/tea/peppermint redirected to https://example.com/tea/peppermint/, then the crawler will go back and forth, back and forth, back and forth... and eventually fail.
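
The loop is easy to see in miniature. In this hypothetical sketch, following the canonical URL and then the redirect keeps sending the crawler back to a page it has already seen, which is exactly the cycle a real crawler must detect and abandon:

# Hypothetical pages: the canonical element points at the non-slashed URL,
# while the server redirects the non-slashed URL back to the slashed one.
canonical = {"https://example.com/tea/peppermint/": "https://example.com/tea/peppermint"}
redirect  = {"https://example.com/tea/peppermint": "https://example.com/tea/peppermint/"}

url, seen = "https://example.com/tea/peppermint/", set()
while url not in seen:
    seen.add(url)
    url = canonical.get(url) or redirect.get(url) or url

# The loop exits only because we track the URLs already visited; without that
# bookkeeping, the crawl would bounce between the two pages forever.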

Server-side redirects

Oftentimes, a server-side redirect will be set up. This is fine in most cases, as the crawler will follow the redirect to the new location. However, if there is a redirect to a location that requires authentication, the crawler will be unable to proceed. Read more about password protected crawling. In addition, if the redirect goes to another website, those pages will not be indexed. Only redirects within the domain you have added that are not password protected will be crawled.
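
If you want to check a redirect yourself, a rough probe like the one below can help. It uses the third-party requests library, and the status-code handling is an assumption rather than the crawler's exact rules; it simply reports whether a redirect stays on your added domain and whether the destination demands authentication:

import requests
from urllib.parse import urlparse

def redirect_looks_crawlable(url: str, added_domain: str) -> bool:
    response = requests.get(url, allow_redirects=False, timeout=10)
    if response.status_code in (301, 302, 303, 307, 308):
        target = response.headers.get("Location", "")
        # A redirect that leaves the added domain will not be indexed.
        if urlparse(target).netloc not in ("", added_domain):
            return False
        response = requests.get(target, allow_redirects=False, timeout=10)
    # 401 means the destination is asking for authentication.
    return response.status_code != 401

redirect_looks_crawlable("https://example.com/old-page", "example.com")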


Stuck? Looking for help? Contact support or check out the Site Search forum!