Python web scraping tutorial — Getting a website’s title.

Utibeabasi Umanah
Sep 19, 2021

Hello guys, been a while🙃. In this article, we will be creating a Python web app that scrapes the title of a website. I know it’s not very exciting, but this tutorial is just meant to get you familiar with the basics of Flask and serve as an intro to web scraping. Let’s get to it!🚀

Note: This tutorial assumes you have at least an intermediate knowledge of Python

Requirements

  1. Python — Obviously
  2. Python packages: Flask, Requests, Beautiful Soup
  3. An urge to learn

So above are the requirements to follow this tutorial. You need Python installed; if you don’t have it yet, you can download it from python.org. Next, you need to install the required modules: Flask (for our web server), Requests (for fetching web pages), and Beautiful Soup (for parsing HTML). With those installed, let’s get into the tutorial.
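All three can be installed with pip (note that Beautiful Soup is published on PyPI as beautifulsoup4):

pip install flask requests beautifulsoup4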

Server setup

Flask is a Python micro web framework that allows you to build web apps fast and without the unnecessary baggage that comes with larger frameworks like Django.

So the first thing we want to do is get our web server set up. Here’s how to do that in Flask.

First, we import Flask from flask.

Next, we create a server variable and set it to an instance of the Flask class we imported above.
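Since the original code embeds aren’t shown here, here’s a minimal sketch of those first two steps (I’m assuming the file is called app.py; the variable name server matches what we use below):

from flask import Flask

server = Flask(__name__)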

Note that we pass in the “__name__” variable, which is a special variable that holds the name of the current module. We pass this in to tell Flask where to find resources such as templates, static files, etc. The Python docs have more detail on the “__name__” variable.

Next, we create a function called home which just returns “Hello world” for now.

Above the home function, there is a decorator which tells Flask to call the home function whenever a request is made to the path “/“. The syntax for this is “@<server_name>.route(“<url_path>”, methods=<list_of_allowed_http_methods>)”. In our case, we called our Flask instance “server”, so our server name is server; set it to whatever you called your Flask instance. The methods list is a list of the HTTP methods that are allowed on this path. Here we are only accepting GET and POST requests; other options are “PUT” and “DELETE”, and MDN has a good reference if you want to read more on HTTP methods. The URL path is the path/route the user enters in the browser. Here it is the default or home path, which is “/“.
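Putting the decorator and the home function together, a sketch of this step looks like:

@server.route("/", methods=["GET", "POST"])
def home():
    return "Hello world"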

Next, we check if the __name__ variable is equal to “__main__”, which means we are running this file directly rather than importing it as a module, and then call the server.run function, which starts our web server. Note that we set debug to True; this gives us some sweet debug-mode-only features like reloading the server whenever we make a change to the code.
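In code, that last step is just:

if __name__ == "__main__":
    server.run(debug=True)

With the file saved (as app.py in my sketch), running python app.py starts everything up.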

The server starts up, and we can view it in a browser by going to localhost:5000.

Web scraping with requests

Web scraping is the use of bots or scripts to extract content and data from a website. Usually it is used to pull out content like HTML data; in our case, we are using it to get a website’s title.

Requests is a Python library that allows you to send HTTP requests. Think of it as curl in Linux. We can get a website’s HTML data by sending a GET request to the website’s URL. However, this returns the entire HTML content of the website, so we use a module called Beautiful Soup to parse this data and extract meaningful content.

Ok, our server is all set up. Let’s create a function to scrape a website and get its title.

Here we create a function that makes a GET request to the supplied URL and returns its HTML content. Now let’s use beautiful soup to parse this data.
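As a sketch, a first version of that function (we call it scrape later from the home view) might look like this:

import requests

def scrape(url):
    # Send a GET request and return the raw HTML as a string
    response = requests.get(url)
    return response.text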

First, we import Beautiful Soup and create a new soup object, passing in the raw HTML we got from Requests as well as the parser we want to use, which in our case is the default “html.parser”.

Then we use the soup.find method and pass in the name of the HTML tag we want to find, which in this case is “title”.

Note: since there is only ever a single title tag in a webpage, we use the find method. For tags that appear multiple times, like p, div, etc., we use the soup.find_all method.

Then, we return the text content of the result, which is the title of the website. We wrap all of this in a try/except block to prevent errors in case the server we are trying to scrape isn’t online or we don’t have a network connection.
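Putting those steps together, here’s a sketch of the finished scrape function (the error message string is my own placeholder):

import requests
from bs4 import BeautifulSoup

def scrape(url):
    try:
        # Get the raw HTML from the website
        html = requests.get(url).text
        # Parse it with the default "html.parser"
        soup = BeautifulSoup(html, "html.parser")
        # A page has a single title tag, so find is enough
        return soup.find("title").text
    except Exception:
        # Covers an offline server, a missing network connection,
        # or a page without a title tag
        return "Could not scrape the website"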

Ok, let’s go back to our home method and make some modifications.

Here, first of all, I’m checking if a POST request is being sent to the server. If it is, I get the URL from the request’s form object. The request object is given to us by Flask and contains info on the current request being made to the server. The form attribute is a dictionary-like object of the key-value pairs sent to the server from the frontend. Then I call the scrape function and pass in the URL. I then render the index.html file and pass in the scraped result; this is possible because Flask makes use of the Jinja templating language. If it isn’t a POST request, I just return the index.html page.
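Here’s what that modified home view might look like (I’m passing the result to the template as a variable I’ve called title; the original may name it differently):

from flask import render_template, request

@server.route("/", methods=["GET", "POST"])
def home():
    if request.method == "POST":
        # "url" is the name attribute of the input field in our form
        url = request.form["url"]
        return render_template("index.html", title=scrape(url))
    return render_template("index.html")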

Here's a look at our HTML page.

As you can see, it’s just a form with an input tag of type text. Note that I set the name property to “url”, which is the key we are querying for in the backend home method.
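As a sketch (the file names under static/ are my assumption, and the title variable matches the backend sketch above):

<!-- templates/index.html -->
<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}" />
  </head>
  <body>
    <form method="POST" action="/">
      <input type="text" name="url" placeholder="Enter a URL" />
      <button type="submit">Get title</button>
    </form>
    {% if title %}
      <p>{{ title }}</p>
    {% endif %}
    <script src="{{ url_for('static', filename='script.js') }}"></script>
  </body>
</html>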

I’ve added some basic CSS styling too, and some JavaScript. Note the use of the url_for function in the template, again made possible by Jinja.

Here’s a look at our CSS
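The original styles aren’t shown here; purely as an illustrative placeholder, basic styling along those lines might look like:

/* static/style.css */
body {
  font-family: sans-serif;
  display: flex;
  flex-direction: column;
  align-items: center;
}

input,
button {
  padding: 0.5rem;
}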

and our JavaScript

All I’m doing here is making sure that the user can’t submit the form without typing in a valid URL. This is an alternative to handling it on the backend.
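A sketch of that validation (the file name and the exact check are my assumptions):

// static/script.js
const form = document.querySelector("form");
const input = document.querySelector("input[name='url']");

form.addEventListener("submit", (event) => {
  try {
    // new URL() throws if the value isn't a valid URL
    new URL(input.value);
  } catch {
    // Block the submission and let the user know
    event.preventDefault();
    alert("Please enter a valid URL");
  }
});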

Ok, that was a lot to take in. Here’s our complete backend code for clarity. Also, I’ll be dropping the link to the GitHub repo down below just in case.
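The following is the full app.py, reconstructed from the sketches above (the repo linked below has the authoritative version):

from flask import Flask, render_template, request
import requests
from bs4 import BeautifulSoup

server = Flask(__name__)

def scrape(url):
    try:
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        return soup.find("title").text
    except Exception:
        return "Could not scrape the website"

@server.route("/", methods=["GET", "POST"])
def home():
    if request.method == "POST":
        url = request.form["url"]
        return render_template("index.html", title=scrape(url))
    return render_template("index.html")

if __name__ == "__main__":
    server.run(debug=True)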

And finally, here’s the working app.

Ok, so I guess that will be all for now, but watch this space for more DevOps-related posts. You can reach out to me on GitHub or send me an email at utibeabasiumanah6@gmail.com. Thanks for reading✌️

Please leave a comment if you have any thoughts about the topic — I am open to learning and knowledge explorations.

Additional links

GitHub repo
