In this tutorial, we’ll show you how to build a search engine and web crawler using Node.js, MySQL, Redis, and Elasticsearch. We’ll start by creating a web crawler that will endlessly crawl the web and add new URLs to a MySQL database and a Redis cache. We’ll then set up Elasticsearch to index the pages we’ve crawled and create a search engine that can be used to search through the indexed pages.
Prerequisites
To follow along with this tutorial, you’ll need to have the following installed on your computer:
- Node.js
- MySQL
- Redis
- Elasticsearch
Section 1: Setting Up the Database and Redis Cache
The first step is to create a MySQL database and a Redis cache to store the URLs we crawl. Follow these steps:
- Create a new MySQL database called crawler with the following schema:
CREATE DATABASE crawler;
USE crawler;

CREATE TABLE urls (
  id INT AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(255) UNIQUE KEY
);
This creates a table called urls with two columns, id and url. The url column is unique, which means duplicate URLs cannot be stored in the table.
- Start a Redis server, which will act as our cache, by running the following command:
redis-server
This will start the Redis server on the default port (6379). Before moving on to the crawler, you can sanity-check both connections with the short script sketched below.
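This is a minimal sketch, not part of the tutorial's files; it assumes the mysql and redis Node packages that are installed in the next section, and it also shows how INSERT IGNORE skips a duplicate URL thanks to the UNIQUE constraint:
const mysql = require('mysql');
const redis = require('redis');

const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

// Assumes the callback-based redis client (version 3), as used later in crawler.js
const redisClient = redis.createClient();

// Insert the same URL twice; the second insert is ignored because url is UNIQUE
const testUrl = 'http://www.example.com/';
mysqlConnection.query('INSERT IGNORE INTO urls (url) VALUES (?)', [testUrl], (error) => {
  if (error) throw error;
  mysqlConnection.query('INSERT IGNORE INTO urls (url) VALUES (?)', [testUrl], (error, result) => {
    if (error) throw error;
    console.log('Rows affected by the duplicate insert:', result.affectedRows); // 0
    mysqlConnection.end();
  });
});

// PING should answer with PONG if the Redis server from this section is running
redisClient.ping((error, reply) => {
  if (error) throw error;
  console.log('Redis replied:', reply);
  redisClient.quit();
});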
Section 2: Creating the Web Crawler
The next step is to create the web crawler that will endlessly crawl the web and add new URLs to the MySQL database and Redis cache. Follow these steps:
- Install the crawler's dependencies by running the following command (the code below uses the callback-style API of the redis package, so install version 3):
npm install request cheerio mysql redis@3
- Create a new file called crawler.js and add the following code:
const request = require('request');
const cheerio = require('cheerio');
const mysql = require('mysql');
const redis = require('redis');

// MySQL connection used to persist every URL we discover
const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

// Redis client used as a queue of URLs waiting to be crawled
const redisClient = redis.createClient();

redisClient.on('connect', () => {
  console.log('Connected to Redis');
});

redisClient.on('error', (err) => {
  console.log('Error:', err);
});

const crawl = (url) => {
  request(url, (error, response, body) => {
    if (error) throw error;
    // Parse the page and collect every absolute link
    const $ = cheerio.load(body);
    const urls = $('a')
      .map((i, el) => $(el).attr('href'))
      .get()
      .filter((link) => link.startsWith('http'));
    urls.forEach((link) => {
      // Queue the link in Redis and record it in MySQL
      // (duplicates are ignored thanks to the UNIQUE constraint and INSERT IGNORE)
      redisClient.lpush('urls', link);
      mysqlConnection.query('INSERT IGNORE INTO urls (url) VALUES (?)', [link], (error) => {
        if (error) throw error;
      });
    });
    // Pop the next URL from the queue and keep crawling; the lpush commands above
    // are sent on the same Redis connection first, so the queue is not empty here
    redisClient.rpop('urls', (error, result) => {
      if (error) throw error;
      if (result) {
        crawl(result);
      } else {
        console.log('No more URLs to crawl');
      }
    });
  });
};

crawl('http://www.example.com');
This code sets up the MySQL and Redis connections and defines a crawl function that takes a URL and crawls it. The function uses the request module to fetch the HTML of the page, then uses cheerio to extract all of its links. It filters out any links that don't start with http, queues the remaining links in Redis, and records them in the MySQL database. Finally, the crawl function pops the next URL from the Redis queue and calls itself with it, creating an endless loop that keeps crawling the web. (A standalone sketch of the link-extraction step follows below.)
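To see what the link-extraction step produces on its own, here is a small standalone sketch; the HTML string is made up for illustration:
const cheerio = require('cheerio');

// A tiny, made-up page with two absolute links and one relative link
const html = '<a href="http://example.com/a">A</a><a href="/relative">B</a><a href="https://example.org/c">C</a>';

const $ = cheerio.load(html);
const links = $('a')
  .map((i, el) => $(el).attr('href'))
  .get()
  .filter((link) => link.startsWith('http'));

console.log(links); // [ 'http://example.com/a', 'https://example.org/c' ]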
- Run the crawler by running the following command:
node crawler.js
This will start the crawler and begin crawling the web, starting with the URL specified in the call to the crawl function.
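While the crawler is running, you can watch the backlog of URLs waiting in Redis with a small helper script like this (a sketch, assuming the same callback-based redis client as crawler.js):
const redis = require('redis');
const redisClient = redis.createClient();

// Print the length of the crawl queue every five seconds
setInterval(() => {
  redisClient.llen('urls', (error, length) => {
    if (error) throw error;
    console.log(`URLs waiting in the queue: ${length}`);
  });
}, 5000);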
Section 3: Indexing Pages with Elasticsearch
The next step is to set up Elasticsearch to index the pages we’ve crawled. Follow these steps:
- Install the Elasticsearch module by running the following command:
npm install elasticsearch
- Create a new file called elasticsearch.js and add the following code:
const elasticsearch = require('elasticsearch');
const mysql = require('mysql');

// MySQL connection used to read back the crawled URLs
const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

// Elasticsearch client (assumes a local node on the default port)
const esClient = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

mysqlConnection.query('SELECT url FROM urls', (error, results) => {
  if (error) throw error;

  // Index an array of documents into Elasticsearch with a single bulk request
  const bulkIndex = (index, type, data) => {
    const bulkBody = [];
    data.forEach((item) => {
      // Each document is preceded by an action line describing where it goes
      bulkBody.push({
        index: {
          _index: index,
          _type: type
        }
      });
      bulkBody.push(item);
    });
    esClient.bulk({body: bulkBody})
      .then((response) => {
        // Count any per-item errors reported in the bulk response
        const errorCount = response.items.reduce((acc, item) => {
          if (item.index && item.index.error) {
            console.log(item.index.error);
            return acc + 1;
          }
          return acc;
        }, 0);
        console.log(`Successfully indexed ${data.length - errorCount} out of ${data.length} items`);
      })
      .catch(console.error);
  };

  // Helper for querying an index (not called during indexing, but handy for testing)
  const searchData = (index, body) => {
    return esClient.search({index: index, body: body});
  };

  const index = 'pages';
  const type = 'page';

  // Turn each MySQL row into a document to index
  const documents = results.map((result) => {
    return {
      url: result.url
    };
  });

  bulkIndex(index, type, documents);

  // Close the MySQL connection so the script can exit once indexing finishes
  mysqlConnection.end();
});
This code sets up the Elasticsearch client and defines a bulkIndex function that takes an index, a type, and an array of data to index. The function builds a single bulk request that indexes all of the data in Elasticsearch. The code also defines a searchData function that takes an index and a query body and returns the search results; it isn't called during indexing, but the sketch below shows how it could be used. Finally, the code queries the MySQL database for all the URLs we've crawled, turns them into an array of documents, and passes that array to the bulkIndex function.
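As a rough sketch, you could add something like this inside the same callback, after the bulkIndex call, to check what has been indexed; the search term here is just an example:
// Hypothetical check: look for indexed pages whose URL contains "example"
searchData('pages', {
  query: {
    match: {
      url: 'example'
    }
  }
})
  .then((results) => {
    results.hits.hits.forEach((hit) => console.log(hit._source.url));
  })
  .catch(console.error);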
- Run the Elasticsearch indexer by running the following command:
node elasticsearch.js
This will index the URLs of all the pages we've crawled in Elasticsearch.
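If you want to confirm the documents made it into the index, a quick sketch like the following (a separate, hypothetical check script) asks Elasticsearch how many documents are in the pages index:
const elasticsearch = require('elasticsearch');

const esClient = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

// Ask Elasticsearch how many documents are in the "pages" index
esClient.count({ index: 'pages' })
  .then((result) => {
    console.log(`Documents in the pages index: ${result.count}`);
  })
  .catch(console.error);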
Section 4: Building the Search Engine
The final step is to build the search engine. Follow these steps:
- Install Express by running the following command:
npm install express
- Create a new file called server.js and add the following code:
const express = require('express');
const mysql = require('mysql');
const elasticsearch = require('elasticsearch');

const app = express();

// MySQL connection (not used by the search route below, but available for extensions)
const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

// Elasticsearch client used to run the search queries
const esClient = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

// Search route: /search?q=term returns the URLs of matching indexed pages
app.get('/search', (req, res) => {
  const query = req.query.q;
  if (!query) {
    return res.json([]);
  }
  // The indexer only stores a url field, so match the query against it
  const body = {
    query: {
      match: {
        url: query
      }
    }
  };
  esClient.search({
    index: 'pages',
    body: body
  })
    .then((results) => {
      const hits = results.hits.hits.map((hit) => {
        return {
          url: hit._source.url
        };
      });
      res.json(hits);
    })
    .catch((error) => {
      console.error(error);
      res.json([]);
    });
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});
This code sets up an Express web server that listens on port 3000, along with connections to the MySQL database and Elasticsearch. It defines a search route that accepts a query parameter q. If the parameter is missing, the route returns an empty array. Otherwise it builds an Elasticsearch query that matches the query against the url field of the indexed documents (the only field we indexed in the previous section), calls the search method of the Elasticsearch client with the index and the query body, maps the search results to an array of URLs, and returns them as a JSON response.
- Start the web server by running the following command:
node server.js
This will start the web server. You can now open your web browser and visit http://localhost:3000/search?q=search-term, where search-term is the term you want to search for; the server will return a JSON array of URLs that match the search term.
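You can also query the endpoint from another Node script instead of the browser; this sketch assumes Node 18 or newer, where fetch is available globally, and uses a made-up search term:
// Query the search endpoint and print the matching URLs
const searchTerm = 'example'; // placeholder term

fetch(`http://localhost:3000/search?q=${encodeURIComponent(searchTerm)}`)
  .then((response) => response.json())
  .then((hits) => {
    hits.forEach((hit) => console.log(hit.url));
  })
  .catch(console.error);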
You now have a fully functioning search engine that can crawl the web and index pages with Elasticsearch. The /search endpoint lets users look up pages that match a query.
Note that this is just a basic example and you can customize the crawler and search engine to suit your needs. You can modify the crawler to crawl specific domains or websites, and you can customize the search engine to use more advanced search techniques.
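For example, one way to keep the crawler on a single domain is to filter the extracted links by hostname before queueing them. This is only a sketch, and the allowed hostname below is a placeholder:
// Hypothetical helper for crawler.js: keep only links on one domain
const allowedHost = 'www.example.com'; // placeholder domain

const sameDomainOnly = (links) =>
  links.filter((link) => {
    try {
      return new URL(link).hostname === allowedHost;
    } catch (e) {
      return false; // skip malformed URLs
    }
  });

// In crawler.js you would wrap the extracted links, e.g.:
// const urls = sameDomainOnly($('a').map((i, el) => $(el).attr('href')).get());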
Enjoy building your own search engine with Node.js, MySQL, Redis, and Elasticsearch! If you wish to take this further, consider reading our next tutorial, which explains how to add PageRank functionality to rank pages higher based on the number of links to the domain.