Let's Build a Web Scraper in PHP and Python

How many websites do you think there are on the internet? According to recent estimates, there are around 1.10 billion, and that number shifts every day as new sites go up and old ones come down. It is hard to picture how much data that represents: more than any one person could digest in a single lifetime, let alone 100 lifetimes. To put that data to use, access alone is not enough. You also need a way to collect and organize it so you can analyze it later. This is where web scraping comes in handy.

Web scraping – also known as web harvesting or web data extraction, and sometimes loosely called data mining – is a technique for extracting large amounts of data from websites and saving it to a local file, a database, or a spreadsheet, whichever format works best for your analysis. It saves countless hours because it automates the copying and pasting of the information you want to keep from a single webpage or an entire website.

Mastering data scraping can open up a new world of possibilities for content analysis. Informative content and news are critical for increasing website traffic – those of us who run CoderOasis and are growing our brand can vouch for that – so monitoring news and other popular publications daily with a web scraping tool can be very helpful.

Using Python and BeautifulSoup

The next question is how to get content from all of the different websites we want to scrape. Many people write their own scraping application so they can organize the data the way they want. In this section, I am going to build a web scraper using Python and the BeautifulSoup library.

The first step is importing the libraries we are going to use: requests and BeautifulSoup (the bs4 package).

# Import libraries
import requests
from bs4 import BeautifulSoup

Next, I am going to store the URL in a variable, fetch the page with the requests.get method, and print the HTML content of the response.

import requests
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
print(r.content)
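
A real request can time out or fail, so it helps to wrap the fetch step. The following is a minimal sketch that uses the standard library's urllib as a stand-in for requests so it runs without extra packages; the helper name fetch_html and the data: demo URL are assumptions for illustration, not part of the original example. With requests, the comparable calls would be requests.get(URL, timeout=10) followed by r.raise_for_status().

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper: a stdlib stand-in for requests.get().
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Use the charset announced by the server, falling back to UTF-8
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except (urllib.error.URLError, TimeoutError) as exc:
        # URLError also covers HTTP error statuses (HTTPError is a subclass)
        print(f"Request for {url} failed: {exc}")
        return None


# Offline demo: a data: URL carries its own content, so no network is needed
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello, scraper</h1>")
html = fetch_html(demo_url)
print(html)
```

Returning None on failure keeps the calling code simple: it can skip a page that did not load instead of crashing halfway through a run.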

The next step is to parse the page, so I need to create a BeautifulSoup object to do this for us. The 'html5lib' parser used here is a separate package that has to be installed alongside BeautifulSoup.

import requests 
from bs4 import BeautifulSoup
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Create a BeautifulSoup object
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Now comes the fun part! I can move on to extracting some useful data from the HTML content. As an example, I am using a webpage that lists quotes from famous people, and the goal is a program that saves the quotes we want.

First, look through the HTML content of the page as printed by soup.prettify(). This helps us identify a pattern we can use to navigate to the quotes. In this example, all of the quotes sit inside a div container with the ID "container", which we can locate using the find() method.

table = soup.find('div', attrs = {'id':'container'})

Each quote sits inside its own div container with the class "quote", so the same extraction has to be repeated for every such div. To do that, I will use the findAll() method (the older spelling of find_all(), which BeautifulSoup 4 still accepts) and iterate over the results using a variable named row.

For each quote, I will create a dictionary that holds all of the data about that quote and append it to a list called "quotes".

quotes = []  # a list to store quotes
table = soup.find('div', attrs = {'id':'container'})
for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)
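
The find-then-findAll pattern above can be tried without fetching a live page. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup, run against a made-up snippet (the sample markup, paths, and names are all assumptions for illustration). ElementTree only accepts well-formed markup, so BeautifulSoup remains the better fit for real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that mimics the structure described above
SAMPLE = """
<html><body>
<div id="container">
  <div class="quote">
    <h5>Courage</h5>
    <a href="/quotes/1"><img src="/img/1.jpg" alt="" /></a>
    <h6>Be brave.</h6>
    <p>Jane Doe</p>
  </div>
  <div class="quote">
    <h5>Kindness</h5>
    <a href="/quotes/2"><img src="/img/2.jpg" alt="" /></a>
    <h6>Be kind.</h6>
    <p>John Roe</p>
  </div>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Step 1: locate the outer container by its id (like soup.find)
container = root.find(".//div[@id='container']")

# Step 2: loop over every child div with class "quote" (like findAll)
quotes = []
for row in container.findall("div[@class='quote']"):
    quotes.append({
        "theme": row.findtext("h5"),
        "url": row.find("a").get("href"),
        "img": row.find(".//img").get("src"),
        "lines": row.findtext("h6"),
        "author": row.findtext("p"),
    })

for quote in quotes:
    print(quote)
```

The result has the same shape as in the BeautifulSoup version: a list with one dictionary per quote, ready to be written out.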

The final step of the Python code is writing the data to a .csv file, a very common format for exchanging data with databases and spreadsheets.

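import csv

filename = 'inspirational_quotes.csv'
# Python 3: open in text mode; newline='' lets the csv module control line endings
with open(filename, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, ['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)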
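
Scraped text often contains commas and quotation marks, and some rows may be missing a field. The sketch below shows how csv.DictWriter copes with both. It writes to an in-memory buffer rather than a file, and the sample rows are invented for illustration.

```python
import csv
import io

# Invented sample rows; the second one has no 'url' or 'img'
quotes = [
    {"theme": "Courage", "url": "/quotes/1", "img": "/img/1.jpg",
     "lines": "Be brave, even when it is hard.", "author": "Jane Doe"},
    {"theme": "Kindness", "lines": 'She said, "be kind."', "author": "John Roe"},
]

fieldnames = ["theme", "url", "img", "lines", "author"]

buffer = io.StringIO()
# restval fills in any column that a row dictionary does not provide
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(quotes)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that the quoting round-trips cleanly
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[1]["lines"])
```

Because the writer handles the quoting, values with embedded commas or quotation marks come back unchanged when the file is read again.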

This is a very simple example of web scraping with Python and the BeautifulSoup library, and it works well for small-scale jobs. If you want to scrape data at a large scale, you should consider tools designed for that workload.

Web Scraping with PHP and cURL

A lot of website content is not just words on a page. It can also include graphs, pictures, videos, and other formats. A good option for collecting this data along with the text is PHP with the cURL library, which can connect to many kinds of servers over different protocols. The cURL functions can transfer files using an extensive list of protocols – not just http but also ftp – which is useful for building a web spider that automatically downloads virtually anything from the web to a server.

<?php

function curl_download($Url){

    // Stop early if the cURL extension is not available
    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Return the response as a string instead of printing it directly
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

print curl_download('https://coderoasis.org/hosting/plans');
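
For comparison with the Python section earlier, downloading non-text content such as an image can be sketched in Python as well. This is a rough illustration using only the standard library. The helper name download_to_file and the data: demo URL are assumptions, and the demo payload is a few made-up bytes standing in for a real file.

```python
import base64
import os
import shutil
import tempfile
import urllib.request


def download_to_file(url, dest_path, timeout=10):
    """Stream the response body for `url` into `dest_path`; return its size.

    Hypothetical helper for illustration only.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp, \
            open(dest_path, "wb") as out:
        # Copy in chunks so large files are never held fully in memory
        shutil.copyfileobj(resp, out)
    return os.path.getsize(dest_path)


# Offline demo: a base64 data: URL stands in for a remote binary file
payload = bytes(range(8))
demo_url = ("data:application/octet-stream;base64,"
            + base64.b64encode(payload).decode("ascii"))

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "download.bin")
    size = download_to_file(demo_url, target)
    with open(target, "rb") as f:
        saved = f.read()

print(size, saved == payload)
```

Opening the destination in binary mode ("wb") is what keeps images and other non-text files intact.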
