Mastering Web Crawling and Indexing: Advanced Techniques for Building Your Own Search Engine with Node.js, MySQL, and Elasticsearch

In this tutorial, we will take the basic search engine and web crawler we built in our previous tutorial and extend it: we will customize the crawler, apply more advanced indexing techniques with Elasticsearch, and implement a PageRank algorithm that ranks pages based on the links between domains. If you haven’t read our previous article, “Build Your Own Search Engine and Web Crawler in 5 Minutes with Node.js, MySQL, and Elasticsearch”, we suggest starting there.

Section 1: Customizing the Crawler

In the previous tutorial, we created a basic web crawler that could crawl a list of seed URLs and store the results in a MySQL database. In this section, we will customize the crawler by adding the ability to crawl specific domains, limiting the crawl depth, and filtering unwanted pages.

To restrict the crawl to specific domains, we can define an array of allowed domains and a shouldVisit() function that only accepts URLs belonging to one of them, then pass that function to the crawler instead of relying on the seeds array alone.

const domains = ['https://example.com', 'https://example.net'];

// Only visit URLs that belong to one of the allowed domains
const shouldVisit = (url) => {
  return domains.some((domain) => url.startsWith(domain));
};

const crawler = new Crawler({
  shouldVisit: shouldVisit,
  // ...
});

To limit the crawl depth, we can attach a depth value to each item we push onto the queue, so the crawler knows how many more levels of links it may follow from that page. The seed URLs carry the maximum depth:

queue.push({ url: seedUrl, depth: 2 });
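
Exactly how this value is enforced depends on the crawl loop from the previous tutorial. Here is a minimal sketch, assuming each queued item carries the number of levels still to be crawled and that fetchPage(), index(), and queue are the fetch-and-parse step, indexing function, and queue from the existing crawler:

const crawl = async ({ url, depth }) => {
  if (!shouldVisit(url)) return;

  // fetchPage() is a stand-in for the fetch-and-parse step from the previous tutorial
  const { title, content, links } = await fetchPage(url);
  index(url, title, content);

  // Only follow outgoing links while there are levels of depth remaining
  if (depth > 0) {
    links.forEach((link) => queue.push({ url: link, depth: depth - 1 }));
  }
};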

To filter out unwanted pages, we can extend the shouldVisit() function, for example to skip binary file extensions or to exclude specific URLs.

const shouldVisit = (url) => {
  const extensions = ['.jpg', '.png', '.pdf'];
  const exclude = ['https://example.com/login'];

  return !extensions.some((ext) => url.endsWith(ext)) &&
         !exclude.includes(url);
};

const crawler = new Crawler({
  shouldVisit: shouldVisit,
  // ...
});

Section 2: Advanced Indexing with Elasticsearch

In the previous tutorial, we used MySQL to store the search index. In this section, we will use Elasticsearch to store the search index and show you how to use more advanced indexing techniques such as multi-index search and faceted search.

To use Elasticsearch, we first need to install and configure it. Once it is running, we can use the elasticsearch client module for Node.js to talk to it. (Newer Elasticsearch releases ship the @elastic/elasticsearch package with a slightly different API; the examples below use the older elasticsearch client.)

const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({
  hosts: ['http://localhost:9200']
});
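
Before indexing anything, it is worth checking that the client can actually reach the cluster. The legacy client exposes a ping() call for this:

client.ping({ requestTimeout: 3000 }, (err) => {
  if (err) {
    console.error('Elasticsearch cluster is unreachable');
  } else {
    console.log('Connected to Elasticsearch');
  }
});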

To create an index, we use the indices.create() function.

client.indices.create({
  index: 'pages'
});
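
Elasticsearch will infer a mapping automatically the first time we index a document, but instead of the bare create() call above you can also define the mapping up front. Below is one possible explicit mapping, assuming Elasticsearch 7.x (no mapping types) and the fields we index in this section plus the rank we compute in Section 3:

client.indices.create({
  index: 'pages',
  body: {
    mappings: {
      properties: {
        url: { type: 'keyword' },
        title: { type: 'text' },
        content: { type: 'text' },
        // text field with a keyword sub-field, so we can aggregate on domain.keyword later
        domain: { type: 'text', fields: { keyword: { type: 'keyword' } } },
        rank: { type: 'float' }
      }
    }
  }
});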

To index a document, we use the index() function.

client.index({
  index: 'pages',
  body: {
    url: 'https://example.com',
    // store the domain separately so we can facet on it later
    domain: 'example.com',
    title: 'Example',
    content: 'This is an example page.'
  }
});

To search for documents, we use the search() function.

client.search({
  index: 'pages',
  q: 'example'
});
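
Multi-index search simply means querying several indices in one request: the client accepts an array (or a comma-separated list) of index names. For example, if you later kept a separate index for articles (a hypothetical second index here), one query could cover both:

client.search({
  index: ['pages', 'articles'], // 'articles' is a hypothetical second index
  body: {
    query: {
      multi_match: {
        query: 'example',
        fields: ['title', 'content']
      }
    }
  }
});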

To implement faceted search, we add an aggs section to the query body and specify the field to group on, in this case the domain each page belongs to.

client.search({
  index: 'pages',
  body: {
    query: {
      match: { content: 'example' }
    },
    aggs: {
      domains: {
        terms: {
          field: 'domain.keyword'
        }
      }
    }
  }
});
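
The response then contains one bucket per domain alongside the regular hits. With the callback-style legacy client, reading the facets looks roughly like this:

client.search({
  index: 'pages',
  body: {
    query: { match: { content: 'example' } },
    aggs: {
      domains: { terms: { field: 'domain.keyword' } }
    }
  }
}, (err, results) => {
  if (err) throw err;

  // Each bucket is one facet value plus the number of matching pages
  results.aggregations.domains.buckets.forEach((bucket) => {
    console.log(`${bucket.key}: ${bucket.doc_count} pages`);
  });
});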

Section 3: Implementing a PageRank Algorithm

In this section, we will implement a PageRank algorithm to rank pages higher based on the number of links between domains. PageRank was developed by Google’s founders and determines the importance of a web page from the number and quality of the pages that link to it.

To implement the PageRank algorithm, we will first create a new MySQL table to store the links between domains.

CREATE TABLE links (
  id INT NOT NULL AUTO_INCREMENT,
  source VARCHAR(255) NOT NULL,
  target VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
);

We will then modify the index() function to store the links between domains in the links table.

const domains = ['example.com', 'example.net']; // bare hostnames this time, so they can be compared with getDomain()

const index = (url, title, content) => {
  const domain = getDomain(url);

  // Index the page
  client.index({
    index: 'pages',
    body: {
      url: url,
      domain: domain,
      title: title,
      content: content
    }
  });

  // Store links that point from this domain to another domain we crawl
  const links = extractLinks(content).filter((link) => {
    const linkDomain = getDomain(link);
    return linkDomain && domains.includes(linkDomain) && domain !== linkDomain;
  });

  links.forEach((link) => {
    db.query('INSERT INTO links (source, target) VALUES (?, ?)', [domain, getDomain(link)]);
  });
};
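
getDomain() and extractLinks() are small helpers the snippet above assumes. One possible implementation, assuming content is the raw HTML of the page, using Node’s built-in URL class and the cheerio package (any HTML parser would do):

const cheerio = require('cheerio');

// Return the hostname of a URL without a leading "www.", or null if it cannot be parsed
const getDomain = (url) => {
  try {
    return new URL(url).hostname.replace(/^www\./, '');
  } catch (e) {
    return null;
  }
};

// Extract absolute href values from the raw HTML of a page
const extractLinks = (html) => {
  const $ = cheerio.load(html);
  return $('a[href]')
    .map((i, el) => $(el).attr('href'))
    .get()
    .filter((href) => href.startsWith('http'));
};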

To calculate the PageRank, we will use the power iteration method. We start by initializing all pages with a rank of 1, then we repeatedly update each page’s rank based on the ranks of the pages that link to it. The process is repeated until the ranks converge.

const calculatePageRank = () => {
  db.query('SELECT DISTINCT source, target FROM links', (err, results) => {
    if (err) throw err;

    const nodes = {};

    // Build the graph: record each node's incoming links and out-degree
    results.forEach((result) => {
      const source = result.source;
      const target = result.target;

      if (!nodes[source]) {
        nodes[source] = { inbound: [], outDegree: 0, rank: 1 };
      }

      if (!nodes[target]) {
        nodes[target] = { inbound: [], outDegree: 0, rank: 1 };
      }

      nodes[target].inbound.push(source);
      nodes[source].outDegree += 1;
    });

    // Calculate the PageRank using the power iteration method
    const dampingFactor = 0.85;
    let maxError = 0;

    do {
      maxError = 0;

      Object.keys(nodes).forEach((node) => {
        let rank = (1 - dampingFactor);

        // A node's rank comes from the nodes that link TO it, each
        // contributing its own rank divided by its out-degree
        nodes[node].inbound.forEach((source) => {
          rank += dampingFactor * (nodes[source].rank / nodes[source].outDegree);
        });

        const error = Math.abs(rank - nodes[node].rank);
        maxError = Math.max(maxError, error);
        nodes[node].rank = rank;
      });
    } while (maxError > 0.0001);

    // Write the domain-level rank back to every page of that domain
    Object.keys(nodes).forEach((node) => {
      db.query('UPDATE pages SET `rank` = ? WHERE url LIKE ?', [nodes[node].rank, '%' + node + '%']);
    });
  });
};
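
The UPDATE above assumes the pages table from the previous tutorial has a rank column; if it does not, a single statement adds it (rank is backticked because it is a reserved word in MySQL 8+):

ALTER TABLE pages ADD COLUMN `rank` FLOAT NOT NULL DEFAULT 1;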

With these changes, our search engine now ranks pages based on the number of links between domains and the quality of the pages that link to them.

Section 4: Adding a Web Interface

In this section, we will add a web interface to our search engine, so that users can search for pages from their web browser. We will use Express.js to handle the HTTP requests and display the search results.

First, let’s install the express and body-parser modules:

npm install express body-parser

We will then create a new file named app.js to handle the HTTP requests.

const express = require('express');
const bodyParser = require('body-parser');
const db = require('./db');

const app = express();
app.use(bodyParser.urlencoded({ extended: true }));

app.get('/', (req, res) => {
  res.send(`
    <form method="POST" action="/search">
      <input type="text" name="query">
      <button type="submit">Search</button>
    </form>
  `);
});

app.post('/search', (req, res) => {
  const query = req.body.query;

  db.query(
    'SELECT * FROM pages WHERE MATCH (title, content) AGAINST (? IN NATURAL LANGUAGE MODE) ORDER BY `rank` DESC',
    [query],
    (err, results) => {
      if (err) throw err;

      const html = results.map((result) => {
        return `
          <div>
            <h2><a href="${result.url}">${result.title}</a></h2>
            <p>${result.content}</p>
          </div>
        `;
      }).join('');

      res.send(html);
    }
  );
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

In the app.js file, we define two routes: one for the home page and one for the search results. The home page displays a search form, while the search results page queries the database for pages that match the search query and displays the results.

We will also need to modify the search() function to return the page rank along with the search results.

const search = (query, callback) => {
  client.search({
    index: 'pages',
    body: {
      query: {
        match: {
          title: query
        }
      }
    }
  }, (err, results) => {
    if (err) throw err;

    const pages = results.hits.hits.map((hit) => {
      return {
        url: hit._source.url,
        title: hit._source.title,
        content: hit._source.content,
        rank: hit._source.rank
      };
    });

    callback(pages);
  });
};
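
If you would rather have the /search route use this Elasticsearch-backed search() function instead of the MySQL full-text query, the wiring is straightforward. This is only a sketch, and it assumes the computed rank has also been written into the Elasticsearch documents (otherwise hit._source.rank will be undefined and every page falls back to a rank of 0):

app.post('/search', (req, res) => {
  search(req.body.query, (pages) => {
    // Sort by the precomputed PageRank, highest first
    pages.sort((a, b) => (b.rank || 0) - (a.rank || 0));

    const html = pages.map((page) => `
      <div>
        <h2><a href="${page.url}">${page.title}</a></h2>
        <p>${page.content}</p>
      </div>
    `).join('');

    res.send(html);
  });
});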

With these changes, our search engine now has a web interface that users can use to search for pages.

Section 5: Conclusion

In this tutorial, we extended our search engine built with Node.js, MySQL, and Elasticsearch. We customized the crawler, moved the search index to Elasticsearch, implemented a PageRank algorithm to rank pages based on the links between domains, and added a web interface using Express.js so that users can search for pages.

With some additional tweaks and optimizations, this search engine can be made more powerful and efficient. We hope this tutorial has given you an idea of how to build your own search engine and inspired you to explore more of the possibilities of Node.js and web crawling.
