Memex Explorer Crawler Guide
Memex Explorer supports two crawlers, Nutch and Ache. Each has its own design and uses the data it collects in its own way.
The two do have some things in common, however. Both require a list of URLs to crawl, called a seeds list, and both are operated in a similar way through the Crawler Control Buttons.
This section will go over the common elements of the two crawlers.
Creating a Seeds List
Both crawlers use the same kind of seeds list. The seeds list consists of URLs separated by line breaks, one URL per line. Nutch and Ache use the list in different ways, and the results you get directly from each crawler differ. Here is a sample seeds list:
http://www.reddit.com/r/aww
http://gizmodo.com/of-course-japan-has-an-island-where-cats-outnumber-peop-1695365964
http://en.wikipedia.org/wiki/Cat
http://www.catchannel.com/
http://mashable.com/category/cats/
http://www.huffingtonpost.com/news/cats/
http://www.lolcats.com/
Simply put, the seeds list should contain pages that are relevant to the topics you are searching. Both Nutch and Ache provide insight into the relevance of your seeds list, but in different ways.
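Because a seeds list is just one URL per line, it can be checked before it is uploaded. The following is a minimal sketch, not part of Memex Explorer: the function name `parse_seeds` and the rule of keeping only absolute http(s) URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse


def parse_seeds(text):
    """Return the unique, well-formed URLs in a seeds list, in order.

    Hypothetical helper: Memex Explorer's own handling may differ.
    """
    seeds = []
    seen = set()
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url)
        # Assumption: keep only absolute http(s) URLs that name a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


sample = """http://www.reddit.com/r/aww
http://en.wikipedia.org/wiki/Cat

not a url
http://www.catchannel.com/
http://en.wikipedia.org/wiki/Cat
"""

for seed in parse_seeds(sample):
    print(seed)
```

Blank lines, malformed entries, and repeated URLs are dropped, so the sample above yields three seeds.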
For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler.
Seeds lists can be created on the seeds page or from the add crawl page.
The crawl settings page is accessed by clicking the “pencil” icon next to the name of the crawl. Here you can change the name or description of the crawl. You can also delete the crawl.
Nutch is developed by Apache, and has an interface with Elasticsearch. All Nutch crawls create Elasticsearch indices by default.
With Nutch, you can define how long you want to crawl by setting the number of rounds to crawl. You can keep track of the overall crawl time and the sites currently being crawled by looking at the Nutch crawl visualizations.
The number of pages left to crawl in a Nutch round increases significantly after each round. You might pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running.
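The way the page count grows from round to round can be illustrated with a toy model. This is a sketch of round-based crawling in general, not Nutch's implementation: the function `crawl_rounds` and the in-memory link graph are hypothetical, and each round simply fetches the current frontier and queues any newly discovered links for the next round.

```python
def crawl_rounds(seeds, links, rounds):
    """Simulate round-based crawling over a toy link graph.

    `links` maps a page to the pages it links to. Returns the number of
    pages fetched in each round.
    """
    seen = set(seeds)
    frontier = list(seeds)
    fetched_per_round = []
    for _ in range(rounds):
        if not frontier:
            break  # nothing left to crawl
        fetched_per_round.append(len(frontier))
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                if target not in seen:  # queue each page only once
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return fetched_per_round


# Hypothetical link graph: each page links to up to three others.
links = {
    "seed": ["a", "b", "c"],
    "a": ["a1", "a2", "a3"],
    "b": ["b1", "b2", "b3"],
    "c": ["c1", "c2", "a"],
}
print(crawl_rounds(["seed"], links, rounds=3))  # [1, 3, 8]
```

Even in this small graph, the number of pages to fetch grows with every round, which is why the number of rounds is the main control over crawl length.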
Memex Explorer currently uses the Nutch REST API for running all crawls.
Memex Explorer recently added features for monitoring the status of Nutch crawls. You can now get real-time information about which pages Nutch is currently crawling, as well as the duration of the crawl.
Nutch will tell you how many pages have been crawled after the current round has finished.
Memex Explorer uses Bokeh for its plots. Two plots are available for analyzing Ache crawls: Domain Relevance and Harvest Rate.
The Domain Relevance plot sorts domains by the number of pages crawled, and adds information for relevancy of that domain to your crawl model. This plot helps you understand how well your model fits.
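The aggregation behind a plot like this can be sketched as follows. This is an illustrative assumption, not the code Memex Explorer uses: the function `domain_relevance`, the `(url, is_relevant)` input format, and the example URLs are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_relevance(pages):
    """Summarise crawled pages per domain.

    `pages` is an iterable of (url, is_relevant) pairs. Returns a list of
    (domain, pages_crawled, relevant_pages) tuples, most-crawled first.
    """
    crawled = Counter()
    relevant = Counter()
    for url, is_relevant in pages:
        domain = urlparse(url).netloc
        crawled[domain] += 1
        if is_relevant:
            relevant[domain] += 1
    return [(d, n, relevant[d]) for d, n in crawled.most_common()]


# Hypothetical crawl results on placeholder domains.
pages = [
    ("http://cats.example.org/a", True),
    ("http://cats.example.org/b", False),
    ("http://cats.example.org/c", True),
    ("http://pets.example.com/1", True),
    ("http://pets.example.com/2", True),
    ("http://news.example.net/x", False),
]
for domain, crawled, relevant in domain_relevance(pages):
    print(domain, crawled, relevant)
```

A domain with many crawled pages but few relevant ones suggests that the model and the crawl are pulling in different directions.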
The Harvest Rate plot shows the overall performance of the crawl in terms of how many pages were relevant out of the total pages crawled.
Like Nutch, Ache also collects statistics for its crawls, and allows you to see the head of the seeds list.
Harvest rate reflects the relevance to the model of the pages crawled. In this case, 58% of the pages crawled were relevant according to the model.
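As a ratio, harvest rate is the number of relevant pages divided by the total number of pages crawled. The sketch below shows that arithmetic; the function name `harvest_rate` and the page counts are hypothetical, chosen only so that the result matches the 58% figure.

```python
def harvest_rate(relevant_pages, total_pages):
    """Fraction of crawled pages that the model judged relevant."""
    if total_pages == 0:
        return 0.0  # no pages crawled yet
    return relevant_pages / total_pages


# Hypothetical counts: 58 relevant pages out of 100 crawled.
rate = harvest_rate(58, 100)
print(f"{rate:.0%}")  # 58%
```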