How to crawl an entire website without using programs
- Go to the 'Crawlings' page from the main page of your resource.
- In the top right corner, click on 'Create Crawling'.
- On the page that opens, you will see a notification showing how many URLs are available for crawling under your subscription plan.
- In the 'Name' field, enter a name for the task. It will help you find the task among your reports. You can also add a comment as a note for later. Comments are visible only to you.
- In the "Concurrency" field, specify how many concurrent threads the crawler may use when requesting pages from your website. Choose a value your server can handle, to avoid excessive load on your resource.
- The value in the "Delay between requests" field sets the interval, in milliseconds, between requests to your resource. For example, if you set "Concurrency" = 1 and "Delay between requests" = 200, the crawler will wait 0.2 seconds after each response from your website before sending the next request.
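As a rough illustration of how these two settings affect crawl time, the sketch below estimates the total duration of a crawl. It assumes that every thread waits for the delay after each response independently, which this guide only confirms for "Concurrency" = 1. The function name and the average response time are hypothetical values chosen for the example.

```python
def estimate_crawl_seconds(url_count, concurrency, delay_ms, avg_response_ms):
    """Rough crawl duration in seconds.

    Assumption: each thread waits delay_ms after every response,
    so one thread completes a request every (avg_response_ms + delay_ms).
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    per_request_ms = avg_response_ms + delay_ms
    return url_count * per_request_ms / concurrency / 1000


# 1,000 URLs, Concurrency = 1, Delay between requests = 200,
# and an assumed average response time of 300 ms.
print(estimate_crawl_seconds(1000, 1, 200, 300))  # 500.0
```

Raising "Concurrency" shortens the estimate proportionally, but it also multiplies the number of simultaneous requests your server has to answer.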
- Spider. Leave this value as default if you want the crawler to navigate your site like a typical search engine bot. It will start from the main page of your website.
- List. Enter the URLs you want to crawl in the field, or upload a file containing them. Notice! The crawler will visit only the URLs specified in the list.
- Sitemap. Provide the link to the sitemap file located on your site, or upload the file. Notice! The crawler will visit only those URLs on your site that are specified in the sitemap file.
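To preview which URLs a sitemap-based crawl would cover, you can list the `<loc>` entries of the file yourself. This is a minimal sketch using Python's standard library. It assumes a plain sitemap in the standard sitemaps.org format (not a sitemap index), and the sample XML and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the URLs listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```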
- Leave the "Specific Robots.txt content" field empty if you want the crawler to use the Robots.txt rules from your website.
To override the Robots.txt rules from your site, enter your own rules in this field.
Notice! In this case, the Robots.txt rules from your site will be completely ignored, not supplemented.
- If you enable "Ignore robots.txt rules," all Robots.txt rules from your site will be completely ignored.
- Enabling "Accept rel nofollow links" allows the crawler to follow links marked rel="nofollow", provided they are not prohibited in Robots.txt.
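The three Robots.txt options above can be modeled with Python's standard `urllib.robotparser`. This is only a sketch of the behavior described in this guide, not the crawler's actual implementation. The function name, its parameters, and the `ExampleBot` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(url, site_robots, override_robots="", ignore_robots=False,
               agent="ExampleBot"):
    """Decide whether a URL may be crawled under the settings described above."""
    if ignore_robots:
        # "Ignore robots.txt rules" enabled: nothing is blocked.
        return True
    # A non-empty override replaces the site's rules entirely (no merging).
    text = override_robots if override_robots.strip() else site_robots
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser.can_fetch(agent, url)


site = "User-agent: *\nDisallow: /private/\n"
custom = "User-agent: *\nDisallow: /drafts/\n"
url = "https://example.com/private/page.html"

print(is_allowed(url, site))                      # False: the site's rules apply
print(is_allowed(url, site, custom))              # True: the override replaces them
print(is_allowed(url, site, ignore_robots=True))  # True: rules are ignored
```

Note that with the override in place, `/private/` is no longer protected, because the site's own rules are replaced rather than supplemented.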
Subdomains as part of the website
- By default, the crawler is limited to the host of your resource and does not access subdomains. If you enable "Subdomains as part of the website," the crawler will treat them as a single site and continue scanning the subdomains.
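The host restriction can be pictured as a simple check on each discovered URL. The sketch below is an illustration under the assumption that a subdomain is any host ending in `.` plus the site's host. The function name and parameters are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, site_host, include_subdomains=False):
    """Return True if the URL belongs to the site being crawled."""
    host = (urlsplit(url).hostname or "").lower()
    site_host = site_host.lower()
    if host == site_host:
        return True
    # Subdomains count only when the option is enabled.
    return include_subdomains and host.endswith("." + site_host)


print(in_scope("https://example.com/page", "example.com"))             # True
print(in_scope("https://blog.example.com/post", "example.com"))        # False
print(in_scope("https://blog.example.com/post", "example.com", True))  # True
```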
In the "Response headers" field, you can specify additional HTTP headers that the crawler will use when making requests to your site. These HTTP headers will be combined with the headers set in the resource settings.
Notice! Specifying a User-Agent header here (for example, User-Agent: any value you like) will not affect how the crawler interprets the robots.txt rules.
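The way task-level headers combine with the headers from the resource settings can be sketched as a dictionary merge. Two points here are assumptions, not documented behavior: that headers are entered one per line as `Name: value`, and that a task-level header wins when the same name appears in both places. All header values and function names below are made up for the example.

```python
def parse_headers(text):
    """Parse 'Name: value' lines into a dict with lowercase header names."""
    headers = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        headers[name.strip().lower()] = value.strip()
    return headers


def merge_headers(resource_headers, task_headers):
    """Combine both sets; assumed precedence: the task-level value wins."""
    merged = dict(resource_headers)
    merged.update(task_headers)
    return merged


resource = parse_headers("Accept-Language: en\nUser-Agent: SiteDefault/1.0")
task = parse_headers("User-Agent: MyAuditBot/1.0\nX-Crawl-Task: audit")
print(merge_headers(resource, task))
```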
- In the "HTML body parseable mimetypes|contentTypes" field, specify the content types (MIME types) whose response bodies the crawler should parse as HTML.
- In the "Ignore URLs ending with" field, enter the URL endings that should not be scanned, such as image file extensions. Such files are rarely of interest for analysis, and skipping them helps you avoid exceeding your subscription plan limits.
- You can specify URLs containing certain strings to be ignored in the "Ignore URLs containing" field.
- You can prevent entire sections of the website from being scanned by specifying them in the "Ignore path" field.
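The three ignore fields act as filters applied to each URL before it is requested. The sketch below shows one plausible way to combine them. The exact matching rules (case-insensitive endings, plain substring match, path prefix match) are assumptions, and the function and parameter names are hypothetical.

```python
from urllib.parse import urlsplit


def should_skip(url, ignore_endings=(), ignore_containing=(), ignore_paths=()):
    """Return True if any of the three ignore filters matches the URL."""
    path = urlsplit(url).path or "/"
    # "Ignore URLs ending with": compare endings case-insensitively.
    if any(url.lower().endswith(e.lower()) for e in ignore_endings):
        return True
    # "Ignore URLs containing": plain substring match on the full URL.
    if any(s in url for s in ignore_containing):
        return True
    # "Ignore path": skip whole sections by path prefix.
    return any(path.startswith(p) for p in ignore_paths)


print(should_skip("https://example.com/img/logo.PNG", ignore_endings=[".png"]))
print(should_skip("https://example.com/shop?sessionid=1",
                  ignore_containing=["sessionid="]))
print(should_skip("https://example.com/admin/users", ignore_paths=["/admin/"]))
print(should_skip("https://example.com/about", ignore_endings=[".png"]))
```

The first three calls match a filter and the last one does not, so only `/about` would be crawled.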
Notice! After you click "Save," the crawling process will not start automatically. You can initiate the crawling process from the main page of the crawling section on your resource.
Notice! After starting the crawling process, you will only be able to edit the name, comment, concurrency, and delay between requests.