Build a Web Scraper Using Node.js


Hello, awesome 😎 reader! In this article I am going to teach you how to build a simple web scraper using Node.js. So let's get going.

First, set up your Node.js development environment if you have not done so yet. If you already have one, congratulations 🎉, you are ready to go to the next step...

Let's get into the technical 👨‍💻 stuff...

First, we need two packages: Axios and Cheerio.

After reading these package names you are probably wondering what they are used for, so here are simple, exact definitions.

Axios - Promise based HTTP client for the browser and node.js

Cheerio - Fast, flexible & lean implementation of core jQuery designed specifically for the server.

So let's start with the real deal...

Open your preferred editor/IDE. I use Visual Studio Code because it is easy to use.

Open a terminal in the editor and enter this command:

npm init

After that, you will see that a package.json file has been created. It holds the configuration for your project, including the packages it uses and their versions. I am not going deeper into this file here; that's for another article.

Open the package.json file. There you will see a scripts object in curly braces, something like this:

"scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  }

In the scripts object, replace the test entry with a start entry, like this:

"scripts": {
    "start": "nodemon index.js"
  }

After reading this you might ask why I added this and what it does.

With this entry in place, you can run your index.js file with npm start, and nodemon will restart it whenever you save a change.

If you get an error saying nodemon cannot be found, install it first and then try running again:

npm i nodemon

After that, create a JavaScript file named index.js and let's start the coding stuff.

Install the packages first. Install Axios and Cheerio using the following commands:

npm i axios

npm i cheerio

After installing the packages, import them at the top of index.js using const and require:

const axios = require('axios')
const cheerio = require('cheerio')

After that, create a variable named url that contains the link to the website you want to scrape. I am using The Guardian news site for this tutorial.

Using Axios, we fetch the page's HTML from that URL.

const url = 'https://www.theguardian.com/uk'
axios(url)
    .then(response => {
        const html = response.data
        const ele = cheerio.load(html)
        const article = []

Using Cheerio, we pick out the data we want. We have created the ele variable, which holds the parsed HTML and lets us query it, and an article array, which will hold the list of items we extract from the page.

We then use the ele variable to select the elements we are interested in:

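        // Select every headline element and loop over the matches
        ele('.fc-item__title').each(function () {
            // Inside each(), `this` is the current element
            const title = ele(this).text()
            const url = ele(this).find('a').attr('href')
            article.push({
                title,
                url
            })
        })

        console.log(article)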

First, visit the website, because you need to know what kind of data you want. There you will see a list of articles. Select any article, inspect the element, and find its class name in the developer tools.

I found the class name, which you can already see in the code: .fc-item__title. We use the each function to iterate over the collection of matched elements.
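The snippet passes a regular function to each rather than an arrow function because Cheerio calls the callback with this set to the current element, and also passes the index and element as arguments. The sketch below shows that behaviour in plain JavaScript. makeCollection is a hypothetical stand-in written for this illustration, not part of Cheerio, and the headline objects are made up.

```javascript
// Hypothetical stand-in for a Cheerio collection (illustration only).
// Like Cheerio's each(), it calls the callback with `this` set to the
// current element and passes (index, element) as arguments.
function makeCollection(elements) {
  return {
    each(callback) {
      elements.forEach((element, index) => callback.call(element, index, element));
      return this;
    },
  };
}

// Made-up data standing in for matched elements.
const headlines = makeCollection([
  { text: 'First headline' },
  { text: 'Second headline' },
]);

// Regular function: `this` is the current element.
const viaThis = [];
headlines.each(function () {
  viaThis.push(this.text);
});

// Arrow function: `this` is not rebound, so read the second parameter instead.
const viaParam = [];
headlines.each((index, element) => {
  viaParam.push(element.text);
});

console.log(viaThis);
console.log(viaParam);
```

If you prefer arrow functions in your own scraper, the second form is the one to copy: take the element from the callback's parameters instead of this.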

There you see two variables, which contain the title and the URL. After those lines there is a call to push on article. Remember from the earlier code that we created an article array to hold the list of results. So we push an object with the title and URL of each item into that array, and at the bottom you see console.log() to output the data.
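The objects pushed into article can be a little rough: titles may carry extra whitespace, a link may be missing or relative, and the same link may show up twice. Here is a minimal sketch of one way to tidy the array before printing it. normalizeArticles is a hypothetical helper, not part of Axios or Cheerio, and the sample data is made up rather than scraped.

```javascript
// Hypothetical helper: tidy up the { title, url } objects collected by the scraper.
function normalizeArticles(items, baseUrl) {
  const seen = new Set();
  const result = [];
  for (const item of items) {
    // Skip entries where no link was found (url is undefined or empty).
    if (!item.url) continue;
    // Collapse runs of spaces and newlines in the title.
    const title = item.title.replace(/\s+/g, ' ').trim();
    // Resolve relative links such as "/uk-news/example" against the page URL.
    const url = new URL(item.url, baseUrl).href;
    // Keep only the first entry for each link.
    if (seen.has(url)) continue;
    seen.add(url);
    result.push({ title, url });
  }
  return result;
}

// Made-up sample data standing in for the scraped array.
const sample = [
  { title: '\n  Example   headline \n', url: '/uk-news/example' },
  { title: 'Example headline', url: 'https://www.theguardian.com/uk-news/example' },
  { title: 'No link here', url: undefined },
];

const cleaned = normalizeArticles(sample, 'https://www.theguardian.com/uk');
console.log(cleaned);
```

In the scraper you could call it as console.log(normalizeArticles(article, url)), using the url variable that holds the page address.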

Full code if you get stuck ❤

const axios = require('axios')
const cheerio = require('cheerio')

const url = 'https://www.theguardian.com/uk'

// Fetch the page, then parse the returned HTML with Cheerio
axios(url)
    .then(response => {
        const html = response.data
        const ele = cheerio.load(html)
        const article = []

        // Collect the title and link of every matched headline
        ele('.fc-item__title').each(function () {
            const title = ele(this).text()
            const url = ele(this).find('a').attr('href')
            article.push({
                title,
                url
            })
        })

        console.log(article)
    })
    .catch(err => console.error(err))

I hope you found this article helpful. If it helped you, please feel free to share your opinion in the comment box, and also follow and like 👍

Did you find this article valuable?

Support Abhishek Shukla by becoming a sponsor. Any amount is appreciated!
