Removing documents from your crawler-based search engine
We often receive questions on how to best remove documents from an active search engine. Below, we’ll outline a few of the most common scenarios and provide you with some suggestions.
Note: The following document is geared towards our crawler-based engines. If you created your engine with our developer API, you can find information on removing documents here.
How do I include only a certain part of my site in my search engine?
Sometimes it may not be important for your entire site to be indexed. Some examples would be e-commerce shops that only want to include products in their search, or a blog/news portal site that only wants to include published articles.
You can use path rules to create specific instructions for our crawler that indicate which sections of your site to include in the engine.
Why do I have duplicate results in my search engine?
The Crawler is able to detect duplicate page content and merge it into a single result. By default, Swiftbot will look at the title and body content it was able to extract from each page. If the extracted information is an exact match for content that is already indexed in your engine, we will select the page with the shortest URL.
If you are still seeing duplicate page results in your engine, here are some considerations:
Out of all the duplicates, is there one you would prefer to be indexed, such as a parent product or page? If so, you can add canonical tags to all of the variant pages so that they point to the parent page (the page you would like Swiftype to index). Our crawler respects canonical tags, and we will not index variants that carry one.
Are all the duplicates under a different URL path? You can add a blacklist rule that excludes all the variants within that URL path.
When I remove pages from my sitemap, why are they still in my engine?
Have you removed pages from your sitemap, but are still seeing them in your search results? Using a sitemap is a great way to help ensure our crawler finds all of your site content, but it unfortunately does not work as a means of telling the crawler what content to delete from your engine.
To remove (or exclude) specific pages from your search engine you can:
Use robots meta tags to tell the Swiftbot crawler to not index those specific pages of your site.
Use path rules to define specific sections of your site for the crawler to allow or ignore.
How do I make sure products that are “Out of Stock” do not show up in my search results?
Here’s another instance where robots meta tags would come in handy. If you apply ‘noindex’ tags to all “out of stock” pages, Swiftbot will remove the page document from your engine when recrawled. If you have any questions about keeping your engine updated with “in-stock” products please contact Support.
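As a minimal sketch (the page markup and product name below are hypothetical), an out-of-stock product page would include the tag in its head:

```html
<!-- Hypothetical out-of-stock product page. The noindex directive
     tells Swiftbot to remove this document on its next crawl. -->
<head>
  <meta name="st:robots" content="noindex">
  <title>Blue Widget (Out of Stock)</title>
</head>
```

Once the item is back in stock, remove the tag and the page will be re-indexed the next time it is crawled.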
General Strategies for Removing Content
Robots Meta Tags
The Swiftype web crawler supports the robots meta tag standard. This gives you definitive control over whether or not Swiftype (or any other search agent) indexes specific pages on your site. To tell Swiftype specifically not to index a page, use a meta tag with the name st:robots and a value of noindex:
<meta name="st:robots" content="noindex">
If you’d like to read more about how Swiftype and robots meta tags work together, you can refer to our documentation here.
Note: Your engine will not reflect the changes made to pages where a robots meta tag is added or removed until that page is recrawled.
Canonical Meta Tags
Canonical meta tags help you control how duplicate content is processed during a web crawl. If identical content can be found at several unique URLs on your site, you will likely want to avoid indexing that material more than once. Canonical tags let web crawlers recognize that a site has duplicate content and point the crawler to one definitive URL.
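For example, if the same product is reachable at several URLs (the URL below is hypothetical), each duplicate page would declare the preferred URL in its head:

```html
<!-- Placed on every duplicate/variant page; the href points to the
     one definitive URL you want indexed. (URL is hypothetical.) -->
<link rel="canonical" href="https://example.com/products/widget">
```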
To learn more about canonical meta tags and how to implement them on your site you can look at the following resource from Moz.
Note: Your engine will not reflect the changes made by adding canonical tags until that engine is recrawled.
URL Path Rules
Swiftype’s path rules allow you to tell the Swiftype Crawler to include or exclude specific parts of your domain. To configure these rules, visit the Manage > Domains page of your Swiftype dashboard.
Whitelist rules allow you to include only certain paths. Only paths that match your whitelist rules will be indexed in your engine.
Blacklist rules allow you to exclude parts of your domain. Any path that matches a blacklist rule will not be added to your engine.
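As an illustration (the paths below are hypothetical), a blog that only wants published articles in its engine might configure rules along these lines in the dashboard:

```
Whitelist (only matching paths are indexed):
  /blog/

Blacklist (matching paths are never indexed):
  /blog/drafts/
  /tag/
```

A page must survive both sets of rules to be indexed: it has to match a whitelist rule (if any exist) and must not match any blacklist rule.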
In our documentation we have additional tips on Including and Excluding Content by URL.
Note: Your engine will not reflect the changes made to your whitelist and blacklist rules until you recrawl your engine.
Recrawling Your Engine
Several times throughout this guide we’ve noted that your engine needs to be recrawled before content updates are recognized. We run both partial and full recrawls of your engine automatically, but you can also manually request a full recrawl from your Swiftype Dashboard through the Manage > Domains section. Your plan level determines how often you can request a full recrawl of your site.
If you are unable to trigger a recrawl through your dashboard, want to learn about the different plans Swiftype offers, or have some lingering questions about the topics covered in this article, please reach out to us via firstname.lastname@example.org.