The Site Search Crawler will crawl and index pages as long as it can access and read them. There are a few common cases that can prevent your site from being crawled successfully. In some cases, there are settings you can tune within the Site Search dashboard. In others, you might need to make alterations to your website code.
If your documents are not being indexed, you may have restrictive path rules.
You can confirm whether or not this is true by clicking on the Domains tab to bring up your list of crawled websites:
If you notice that you are not indexing pages from your main website, or a specific website, click on the Manage dropdown next to the website, and then select Manage Crawl Rules:
There are two lists that contain rules: a whitelist and a blacklist. If you have paths within your blacklist, those pages will not be indexed:
When you add a rule, you can choose how to match on the path. Paths can trigger blacklisting according to a matching pattern: Begins with, Ends with, Contains, or a custom regular expression.
A common case is an accidental misconfiguration of the path rules. For example, instead of preventing indexing for URLs that Begin with: /hidden, one might blacklist URLs that Contain: /hidden. In this way, pages like /tourism/hidden-cove will not be indexed, even though they are not within the /hidden path.
Another frequent pattern is seen with over-zealous regular expressions. If you were to blacklist: *, none of your pages would be indexed!
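The matching behaviours above can be sketched in Python. Note that `is_blacklisted` is a hypothetical helper that emulates the dashboard's match types for illustration; it is not Site Search's actual API:

```python
import re

def is_blacklisted(path, pattern, match_type):
    """Hypothetical helper emulating the dashboard's match types
    (names are illustrative, not Site Search's actual API)."""
    if match_type == "begins_with":
        return path.startswith(pattern)
    if match_type == "ends_with":
        return path.endswith(pattern)
    if match_type == "contains":
        return pattern in path
    if match_type == "regex":
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown match type: {match_type}")

# "Contains" is broader than intended: it also blacklists this page.
print(is_blacklisted("/tourism/hidden-cove", "/hidden", "contains"))     # True
print(is_blacklisted("/tourism/hidden-cove", "/hidden", "begins_with"))  # False
# An over-zealous regular expression like ".*" matches every path.
print(is_blacklisted("/any/page", ".*", "regex"))                        # True
```

Running the helper against a few sample paths like this is a quick way to sanity-check a rule before saving it in the dashboard.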
Double check that your path rules are not responsible for any pages that have not been indexed.
Common Website Issues
Your website code may be written in a way that obscures or prevents crawling.
Ensure that the following are in good order:
It is common for a website to include a robots.txt file. It can be accessed by entering /robots.txt after your website URL, like so: https://example.com/robots.txt
The Site Search Crawler, like all crawlers, will obey the instructions within a
robots.txt file. If your website is not being crawled, consider whether you have Disallow set. And if you do have something set to disallow, make sure it is not too aggressive.
A common case is a misplaced slash (/).
The following example will make a case-sensitive match on the pattern that you have provided:
User-agent: *
Disallow: /sushi
Any file or directory that matches sushi and its casing, like
/sushi.html?id=123, will be disallowed and not crawled.
In contrast, a trailing slash will ensure that matches exist only within the defined folder:
User-agent: *
Disallow: /sushi/
This will match only on paths within the /sushi/ folder, such as /sushi/milkbuns/fresh.html, but not on /sushi.html.
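Python's standard-library robots.txt parser can illustrate the difference the trailing slash makes. This is a sketch using `urllib.robotparser`; the Site Search Crawler's own matching may differ in edge cases:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url):
    """Check a URL against robots.txt rules using Python's
    standard-library parser (a sketch; real crawlers may differ)."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("*", url)

no_slash = ["User-agent: *", "Disallow: /sushi"]
with_slash = ["User-agent: *", "Disallow: /sushi/"]

# Without the trailing slash, any path starting with /sushi is blocked.
print(allowed(no_slash, "https://example.com/sushi.html?id=123"))   # False
# With the trailing slash, only paths inside the /sushi/ folder are blocked.
print(allowed(with_slash, "https://example.com/sushi.html"))        # True
print(allowed(with_slash, "https://example.com/sushi/fresh.html"))  # False
```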
Another common case is uncapped wildcard characters:
User-agent: *
Disallow: /*.html
This will match on every file that ends in
.html, no matter how deeply it is buried within multiple directories, or whether it appears in the middle or at the end of the path:
/sushi.html and /sushi/milkbuns/fresh.html?id=132 will both match.
If this is not the intended behaviour, one can cap a wildcard character by putting a
$ at the end of the pattern:
User-agent: *
Disallow: /*.html$
Now, only URLs that end in
.html will be disallowed, like /sushi/fresh.html. A URL such as /sushi/milkbuns/fresh.html?id=132 will no longer match, since the query string means it does not end in .html.
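Since the wildcard and $ extensions go beyond basic robots.txt prefix matching, here is a sketch of how such patterns translate into regular expressions. The `pattern_to_regex` helper is illustrative, not part of any crawler's API:

```python
import re

def pattern_to_regex(disallow_pattern):
    """Translate a robots.txt Disallow pattern into a regular
    expression (a sketch of the wildcard extension: '*' matches any
    run of characters, a trailing '$' anchors the match at the end)."""
    anchored = disallow_pattern.endswith("$")
    body = disallow_pattern.rstrip("$")
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

uncapped = pattern_to_regex("/*.html")
capped = pattern_to_regex("/*.html$")

# Uncapped: matches .html anywhere in the URL, query string or not.
print(bool(uncapped.match("/sushi/milkbuns/fresh.html?id=132")))  # True
# Capped: the URL must end in .html, so the query string saves this one.
print(bool(capped.match("/sushi/milkbuns/fresh.html?id=132")))    # False
print(bool(capped.match("/sushi/fresh.html")))                    # True
```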
Robots meta tags
The Site Search Crawler can understand Robots directives from your meta tags. Meta tags are placed within the
<head></head> tags of your website:
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    ...
  </body>
</html>
A common issue is when folks mix up the two restrictive directives: noindex and nofollow. When the crawler crawls a page, it follows and indexes each link within that page. To prevent this, one can add nofollow: that way, only the page itself will be indexed, and nothing to which it links. Using noindex will prevent the page from being indexed; however, it will still be followed and its links indexed. Using both will ensure the page is not indexed and its links are not followed.
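The effect of these directives can be sketched with Python's standard-library HTML parser. `RobotsMetaParser` below is a minimal illustrative reader of the robots meta tag, not the crawler's actual implementation:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Extract index/follow directives from a page's robots meta tag
    (a minimal sketch; a real crawler handles many more edge cases)."""
    def __init__(self):
        super().__init__()
        self.index = True   # default: index the page
        self.follow = True  # default: follow its links

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "robots":
            directives = [d.strip() for d in attrs.get("content", "").split(",")]
            if "noindex" in directives:
                self.index = False
            if "nofollow" in directives:
                self.follow = False

page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.index, parser.follow)  # False False
```

With both directives present, the sketch reports that the page should be neither indexed nor followed, matching the behaviour described above.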
Canonical URLs can be useful for SEO purposes. But when misconfigured, they can cause trouble for the Site Search Crawler.
There are two common cases:
Incorrect canonical URLs
When canonical link elements are created, they should include a precise URL.
The following tag is acceptable only if the page is actually https://example.com:
<link rel="canonical" href="https://example.com">
The URL seen when you browse the page must match the one within the canonical URL.
If the page you are visiting is https://example.com/sushi, the canonical link element should look like so:
<link rel="canonical" href="https://example.com/sushi">
If the current page and the canonical element are off by just one character, there can be significant issues.
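A quick way to sanity-check your pages is to compare the browsed URL against the canonical href. `canonical_matches` below is an illustrative sketch using an exact scheme/host/path comparison:

```python
from urllib.parse import urlparse

def canonical_matches(current_url, canonical_href):
    """Compare the browsed URL with the canonical link element's href
    (a sketch: exact scheme/host/path comparison, since even a
    one-character difference can confuse a crawler)."""
    a, b = urlparse(current_url), urlparse(canonical_href)
    return (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)

print(canonical_matches("https://example.com/sushi",
                        "https://example.com/sushi"))   # True
# Off by one trailing slash: a mismatch a crawler may treat as a
# different page entirely.
print(canonical_matches("https://example.com/sushi",
                        "https://example.com/sushi/"))  # False
```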
Sometimes redirect settings can put the crawler within an infinite loop. The crawler will follow the canonical URL. If that canonical URL redirects back to the original page, then the loop begins.
For example, if your pages had:
<link rel="canonical" href="https://example.com/tea/peppermint">
And the server had a redirect so that
https://example.com/tea/peppermint redirected to
https://example.com/tea/peppermint/, then the crawler would go back and forth, back and forth, back and forth... and eventually fail.
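The loop can be sketched with hypothetical canonical and redirect maps standing in for real HTTP fetches. `follow_canonicals` is illustrative, not how the crawler is actually implemented:

```python
def follow_canonicals(start, canonical_of, redirect_of):
    """Trace canonical links and redirects from a page, flagging loops
    (the two dicts are hypothetical stand-ins for real HTTP fetches)."""
    seen = set()
    url = start
    while url not in seen:
        seen.add(url)
        # A canonical link element sends the crawler to another URL...
        nxt = canonical_of.get(url)
        # ...and a server redirect may send it somewhere else again.
        if nxt is None:
            nxt = redirect_of.get(url)
        if nxt is None:
            return url  # settled on a final page
        url = nxt
    return None  # loop detected: the crawler would bounce forever

canonical_of = {"https://example.com/tea/peppermint/":
                "https://example.com/tea/peppermint"}
redirect_of = {"https://example.com/tea/peppermint":
               "https://example.com/tea/peppermint/"}

print(follow_canonicals("https://example.com/tea/peppermint/",
                        canonical_of, redirect_of))  # None (infinite loop)
```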
Oftentimes, a server-side redirect will be set up. This is fine in most cases, as the crawler will follow the redirect to the new location. However, if there is a redirect to a location that requires authentication, the crawler will be unable to proceed. Read more about password protected crawling. In addition, if the redirect goes to another website, those pages will not be indexed. Only redirects within the domain you have added that are not password protected will be crawled.