Build Your Own Search Engine and Web Crawler in 5 Minutes with Node.js, MySQL, and Elasticsearch

In this tutorial, we’ll show you how to build a search engine and web crawler using Node.js, MySQL, Redis, and Elasticsearch. We’ll start by creating a web crawler that endlessly crawls the web, adding each new URL it finds to a MySQL database and a Redis cache. We’ll then set up Elasticsearch to index the pages we’ve crawled and build a search endpoint for querying the indexed pages.

Prerequisites

To follow along with this tutorial, you’ll need to have the following installed on your computer:

  • Node.js
  • MySQL
  • Redis
  • Elasticsearch

Section 1: Setting Up the Database and Redis Cache

The first step is to create a MySQL database and a Redis cache to store the URLs we crawl. Follow these steps:

  1. Create a new MySQL database called crawler with the following schema:
CREATE DATABASE crawler;
USE crawler;

CREATE TABLE urls (
  id INT AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(255) UNIQUE KEY
);

This creates a table called urls with two columns, id and url. The url column is unique, which means that we won’t be able to store duplicate URLs in the table.
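
To see the UNIQUE constraint in action from Node.js, here is a minimal sketch (assuming the same root user with an empty password used later in this tutorial) that inserts the same URL twice with INSERT IGNORE and logs how many rows each insert actually affects:

const mysql = require('mysql');

const connection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

const url = 'http://www.example.com';

// The first insert stores the row; the second is silently skipped because url is UNIQUE.
connection.query('INSERT IGNORE INTO urls (url) VALUES (?)', [url], (err, result) => {
  if (err) throw err;
  console.log('First insert, affectedRows:', result.affectedRows);  // 1

  connection.query('INSERT IGNORE INTO urls (url) VALUES (?)', [url], (err, result) => {
    if (err) throw err;
    console.log('Second insert, affectedRows:', result.affectedRows); // 0 (duplicate skipped)
    connection.end();
  });
});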

  2. Start the Redis server by running the following command:
redis-server

This will start the Redis server on the default port (6379).
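
If you want to confirm that Node.js can reach Redis before moving on, here is a quick sketch using the same callback-based redis client (v3) that the crawler below relies on:

const redis = require('redis');

// Connects to localhost:6379 by default.
const client = redis.createClient();

client.on('error', (err) => {
  console.log('Redis error:', err);
});

// PING should reply with "PONG" if the server from the previous step is running.
client.ping((err, reply) => {
  if (err) throw err;
  console.log('Redis replied:', reply);
  client.quit();
});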

Section 2: Creating the Web Crawler

The next step is to create the web crawler that will endlessly crawl the web and add new URLs to the MySQL database and Redis cache. Follow these steps:

  1. Install the modules the crawler needs by running npm install request cheerio mysql redis@3 (the code below uses the callback-based redis v3 API), then create a new file called crawler.js and add the following code:
const request = require('request');
const cheerio = require('cheerio');
const mysql = require('mysql');
const redis = require('redis'); // callback-based redis v3 client

const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

const redisClient = redis.createClient();

redisClient.on('connect', () => {
  console.log('Connected to Redis');
});

redisClient.on('error', (err) => {
  console.log('Error:', err);
});

// Pop the next URL from the Redis queue and crawl it.
const crawlNext = () => {
  redisClient.rpop('urls', (error, result) => {
    if (error) throw error;

    if (result) {
      crawl(result);
    } else {
      console.log('No more URLs to crawl');
    }
  });
};

const crawl = (url) => {
  request(url, (error, response, body) => {
    if (error || response.statusCode !== 200) {
      // Skip pages that fail to load instead of crashing the crawler.
      console.log('Failed to fetch', url);
      return crawlNext();
    }

    const $ = cheerio.load(body);

    // Extract all absolute links from the page.
    const urls = $('a')
      .map((i, el) => $(el).attr('href'))
      .get()
      .filter((link) => link && link.startsWith('http'));

    urls.forEach((link) => {
      // INSERT IGNORE skips URLs we have already stored.
      mysqlConnection.query('INSERT IGNORE INTO urls (url) VALUES (?)', [link], (error) => {
        if (error) console.log('MySQL error:', error);
      });

      // Queue the URL for crawling. Pushing here (rather than inside the MySQL
      // callback) guarantees the queue is populated before we pop from it below.
      redisClient.lpush('urls', link);
    });

    crawlNext();
  });
};

crawl('http://www.example.com');

This code sets up the MySQL and Redis connections and defines a crawl function that takes a URL and crawls it. The function uses the request module to fetch the HTML of the page (logging and skipping any URL that fails to load), then uses cheerio to extract all the links from the page. It filters out any links that don’t start with http, inserts the remaining links into the MySQL database with INSERT IGNORE so duplicates are skipped, and pushes them onto the urls list in Redis.

The crawlNext helper then pops the next URL from the Redis list and calls crawl with it, creating an endless loop that keeps crawling the web.
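
If you want the crawler to start from more than one page, you can push extra seed URLs onto the same Redis list before (or while) the crawler runs. Here is a minimal sketch, assuming a hypothetical seed.js file next to crawler.js:

const redis = require('redis');

const redisClient = redis.createClient();

// Extra starting points for the crawler; replace these with whatever sites you like.
const seeds = [
  'http://www.example.org',
  'http://www.example.net'
];

// The crawler pops from the right of the 'urls' list (rpop), so lpush queues
// these in FIFO order behind any URLs already waiting.
seeds.forEach((url) => {
  redisClient.lpush('urls', url);
});

// quit() sends QUIT after the queued commands, so the pushes complete first.
redisClient.quit();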

  2. Start the crawler by running the following command:
node crawler.js

This will start the crawler, beginning with the URL passed to the crawl function at the bottom of crawler.js.

Section 3: Indexing Pages with Elasticsearch

The next step is to set up Elasticsearch to index the pages we’ve crawled. Follow these steps:

  1. Install the elasticsearch module (the legacy Elasticsearch JavaScript client) by running the following command:
npm install elasticsearch
  2. Create a new file called elasticsearch.js and add the following code:
const elasticsearch = require('elasticsearch');
const mysql = require('mysql');

const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

const esClient = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

mysqlConnection.query('SELECT url FROM urls', (error, results) => {
  if (error) throw error;

  // Build a bulk request body: one action line followed by one document line
  // for every item to be indexed.
  const bulkIndex = (index, type, data) => {
    const bulkBody = [];

    data.forEach((item) => {
      bulkBody.push({
        index: {
          _index: index,
          _type: type // custom types are deprecated in Elasticsearch 7+; use '_doc' there
        }
      });

      bulkBody.push(item);
    });

    esClient.bulk({body: bulkBody})
      .then((response) => {
        // Count any per-document failures reported in the bulk response.
        const errorCount = response.items.reduce((acc, item) => {
          if (item.index && item.index.error) {
            console.log(item.index.error);
            return acc + 1;
          }

          return acc;
        }, 0);

        console.log(`Successfully indexed ${data.length - errorCount} out of ${data.length} items`);

        // Close the Elasticsearch client so the script can exit.
        esClient.close();
      })
      .catch(console.error);
  };

  // Helper for running ad-hoc queries against an index (not used below, but
  // handy for testing from this script).
  const searchData = (index, body) => {
    return esClient.search({index: index, body: body});
  };

  const index = 'pages';
  const type = 'page';

  // Turn each MySQL row into a document to index.
  const documents = results.map((result) => {
    return {
      url: result.url
    };
  });

  bulkIndex(index, type, documents);

  // Close the MySQL connection once the query has been handled.
  mysqlConnection.end();
});

This code sets up the Elasticsearch client and defines a bulkIndex function that takes an index, a type, and an array of data to index. The function creates a bulk request to Elasticsearch that indexes all the data.

The code also defines a searchData function that takes an index and a body, and returns the search results.

The code then queries the MySQL database for all the URLs we’ve crawled and creates an array of documents that will be indexed by Elasticsearch. It then calls the bulkIndex function to index the documents.

  3. Run the indexer with the following command:
node elasticsearch.js

This will index all the pages we’ve crawled with Elasticsearch.
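
To double-check that the documents actually made it into the pages index, here is a small sketch that uses the count API of the same legacy elasticsearch client:

const elasticsearch = require('elasticsearch');

const esClient = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

// Count how many documents are in the 'pages' index.
esClient.count({index: 'pages'})
  .then((response) => {
    console.log(`The pages index contains ${response.count} documents`);
  })
  .catch(console.error)
  .then(() => esClient.close());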

Section 4: Building the Search Engine

The final step is to build the search engine. Follow these steps:

  1. Install Express by running npm install express, then create a new file called server.js and add the following code:
const express = require('express');
const mysql = require('mysql');
const elasticsearch = require('elasticsearch');

const app = express();

// MySQL connection (not used by the search route itself, but available if you
// want to pull extra data from the urls table later).
const mysqlConnection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: '',
  database: 'crawler'
});

const esClient = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

app.get('/search', (req, res) => {
  const query = req.query.q;

  if (!query) {
    return res.json([]);
  }

  // Match the search term against the url field, which is the only field we
  // indexed in the previous section.
  const body = {
    query: {
      match: {
        url: query
      }
    }
  };

  esClient.search({
    index: 'pages',
    body: body
  })
    .then((results) => {
      const hits = results.hits.hits.map((hit) => {
        return {
          url: hit._source.url
        };
      });

      res.json(hits);
    })
    .catch((error) => {
      console.error(error);
      res.json([]);
    });
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

This code sets up an Express web server that listens on port 3000. It also sets up connections to the MySQL database and Elasticsearch.

The code defines a route for searching, which accepts a query parameter q. If the parameter is not provided, the function returns an empty array. If the parameter is provided, the function builds an Elasticsearch query that matches the term against the url field of the indexed documents (the only field we indexed in the previous section).

The function then calls the search method of the Elasticsearch client, passing in the index and query body. It then maps the search results to an array of URLs and returns it as a JSON response.

  2. Start the web server by running the following command:
node server.js

This will start the web server on port 3000.

3. Open up your web browser and visit http://localhost:3000/search?q=search-term, where search-term is the term you want to search for. This will return an array of URLs that match the search term.
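
You can also query the endpoint from a script instead of the browser. Here is a minimal sketch using Node’s built-in http module (the search term nodejs is just a placeholder):

const http = require('http');

// Query the search endpoint and print the matching URLs.
http.get('http://localhost:3000/search?q=nodejs', (res) => {
  let data = '';

  res.on('data', (chunk) => {
    data += chunk;
  });

  res.on('end', () => {
    const hits = JSON.parse(data);
    console.log(`Found ${hits.length} matching pages:`);
    hits.forEach((hit) => console.log(hit.url));
  });
}).on('error', console.error);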

You now have a fully functioning search engine that can crawl the web and index pages with Elasticsearch. The /search endpoint lets users find pages that match a query.

Note that this is just a basic example and you can customize the crawler and search engine to suit your needs. You can modify the crawler to crawl specific domains or websites, and you can customize the search engine to use more advanced search techniques.
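
For example, to restrict the crawler to a single domain, you could tighten the link filter inside crawl in crawler.js. A sketch, assuming example.com is the only domain you want to follow:

// Inside crawl(), replace the existing filter with a domain check.
const allowedDomain = 'example.com';

const urls = $('a')
  .map((i, el) => $(el).attr('href'))
  .get()
  .filter((link) => {
    if (!link || !link.startsWith('http')) return false;

    try {
      // Only keep links whose hostname matches (or is a subdomain of) the allowed domain.
      const { hostname } = new URL(link);
      return hostname === allowedDomain || hostname.endsWith('.' + allowedDomain);
    } catch (err) {
      return false; // skip malformed URLs
    }
  });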

Enjoy building your own search engine with Node.js, MySQL, and Elasticsearch! If you wish to take this further, consider reading our next tutorial, which explains how to add PageRank-style scoring so that pages with more inbound links to their domain rank higher.
