Chapter 4: Query a Web Resource#
Now that we know the basics about functions and modules, we can use them as our ticket into the worldwide community of software developers. While we might begin by building our own reusable components (i.e., functions) and sharing them with the Python community, in this chapter we’ll learn how to write a script using functions others have built and shared. In particular, we’ll use functions to access information available on the Internet. It’s time to use what we know to interact with the world!
Learning Outcomes
In this chapter, you will learn how to query networked resources through their application programming interface and the use libraries built by others. The scripts we’ll write will give us access to the collective knowledge of the world. By the chapter’s end, you will be able to:
Find and use powerful Python packages and libraries [design];
Describe the purpose and importance of an application programming interface (API) and know where to go to find published APIs [CS concepts and programming skills];
Write a script that requests information from an online resource like Wikipedia [programming skills];
Split apart a URL into its main components and explain purpose of each component [CS concepts];
Sketch the major components of the client-server programming model and interprocess communication [CS concepts];
Understand a basic protocol between two communicating processes, that network messages often consist of a header and a body, and what you can find in each [CS concepts];
Discuss the different ways a network request might fail and what HTTP status codes you’d receive [CS concepts and programming skills];
Create and use Python dictionaries [programming skills];
Use Python’s multiple-assignment, the enumerate function, and the for-else statement to make your scripts easier to understand [design and programming skills];
Manipulate messages and files that use the JSON data-interchange standard [programming skills];
Explain the difference between blocking and non-blocking function calls [CS concepts].
Packages and libraries. To access information out on the Internet, we will use the requests
package. A package is simply a set of modules that a Python programmer distributes all together. The requests
package is one of many available through the Python Package Index (https://pypi.org/
), which is an online repository of software for the Python programming language.
Another name you’ll often hear programmers use for a collection of modules is library. Programming languages like Python are typically distributed with what is called a standard library, which you can think of as the modules and packages available by default wherever the programming language is installed. The Python standard library[1] includes built-in constants (e.g., True
), data types (e.g., str
), and functions (e.g., the file I/O facilities) that we have repeatedly used in our scripts.
APIs. How do we use the functions and data structures in published packages and libraries to accelerate the development of our own scripts? We use their application programming interface (API).
An API begins with the functions and data structures defined in the package or library, but it doesn’t stop there. As we discussed in the previous chapter, the public interface for a function is often not enough to fully understand how to use it, and so an API isn’t complete without well-written documentation, like the information we discussed putting in the docstring of the functions we write. To emphasize, an API refers to both the function interfaces and data structure definitions we can import
into our scripts as well as the documentation for these interface and definitions that tell us how we can properly use them. You have already been using an API when you used Python’s built-in types (like strings), their methods (like string-find), and Python’s documentation (like the documentation on Python’s sequence types).
As you’ve experienced, a module and its API simplifies the work we need to do in building solutions to the problems that interest us because we can use a module’s functions and data structures without having to write this functionality ourselves or even understand exactly how the functions and data structures operate. Because Python has a treasure trove of existing libraries, our scripts are now no longer limited by what’s built in the Python language or what we know how to build ourselves. This is incredibly powerful for it expands the limits of our dramatic stage and permits us to reach beyond our own laptops to take advantage of everything on the Internet. Let’s start doing that now.
A new problem. To date, we’ve affirmatively answered some simple questions: Can I write a program that converts a book into a theatrical script? Can I write a function that replaces a subsequence of a string with a different subsequence? Here’s another seemingly simple question: Can I write a program to check if a book exists in my local library? Maybe I want to know if there’s a copy of Dr. Seuss’s The Cat in the Hat somewhere in Harvard Library’s miles of bookshelves.
We all know how to do this work ourselves. We’d simply look up the book’s title in the library’s card catalog. At Harvard, what used to be a physically imposing room full of cabinets with drawers of index cards[2] is now an online resource called HOLLIS[3]. So, my question really is: Can I write a script that sends a query from my computer to some computer running HOLLIS? And I had better know how to craft a query that the computer running HOLLIS will understand, and my program had better be able to interpret whatever response it gets back.
Notice that we have started to decompose the question into smaller problems that we need to accomplish. Here’s some pseudocode that enumerates the steps we’ve discussed so far:
1# Craft a request HOLLIS will understand about _The Cat in the Hat_
2# Send that request from our computer to the one running HOLLIS
3# Read the response from the HOLLIS computer
4# Print the answer to our question
This looks simple enough, as long as we can find libraries with APIs that make, for example, sending a request from one computer to another as simple as a function call. Think about this. There is almost certainly a boatload of work that needs to be done to send a message from our computer to some other computer, and we’d love it if someone else had figured out how that exactly works and has hidden that work behind the interface of a simple function call. Luckily, there is such an API, and we’ll use it to do something pretty amazing.[4]
Searching Wikipedia. HOLLIS is programmed to provide its users with lots of information in response to their queries. You’d learn a book’s catalogue number, publisher, and publication date. You’d find a short description of its subject matter and any notes that a librarian attached to the entry. You’d also know exactly where you could find the book on Harvard’s shelves.
Since we’ve learned that it’s best to start problem solving with a simple instance, let’s write most of what we need in our script while targeting a simple web resource. In particular, let’s ask whether Wikipedia has an entry for The Cat in the Hat. The pseudocode for this problem is the same as above, except we replace HOLLIS with wikipedia.org.
Here is one translation of that pseudocode into Python code:
1### chap04/qweb1.py
2import requests
3
4def main():
5 print('Searching wikipedia for "The Cat in the Hat"')
6
7 # Craft a request wikipedia will understand about The Cat in the Hat
8 s = 'The Cat in the Hat'.replace(' ', '+')
9 url = f"https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch={s}&srlimit=1&format=json"
10
11 # Send that request from our computer to the one running wikipedia
12 # Read the response from the wikipedia computer
13 response = requests.get(url)
14
15 # Print the answer to our question ... sorta
16 if response.status_code == 200:
17 print(f"Our request to Wikipedia succeeded!")
18 else:
19 print(f"Hmmm, something might have gone wrong.")
20
21if __name__ == '__main__':
22 main()
This script uses Python’s requests
library, which is described as “an elegant and simple HTTP library for Python, built for human beings.”[5] This library is not part of Python’s standard library, and you may have to follow a few steps explained in the library’s documentation to get it installed for your use. Once installed, the API provided by this library makes it very simple to complete the middle two steps of our pseudocode, as illustrated by the single requests.get
call on line 13.
This call and its returned result hide all the hard and confusing parts of communicating across whatever network is connected to your computer. Lines 8-9 do a small amount of string-processing work to set up the parameter to this call.[6] Lines 16-19 inspect the object returned by requests.get
to determine whether Wikipedia understood our question.
You Try It
Run the qweb1.py
and it will respond that it successfully talked to a Wikipedia server. You may need to install the requests
package.
The client-server programming model. Congratulations! You have successfully executed and generally understood your first network program. We all use network programs every day, from the web browsers running on our laptops to the bank apps running on our cell phones. Nearly every one of these network applications follows the same, straightforward programming model illustrated in qweb1.py
. Before we dive deeper into its code, let’s take a moment and understand the basics of this client-server programming model.
We have written a script that acts as a client in this client-server model. To complete its task, it makes a request of some other computational entity called a server. Typically, this server runs on some other computer, and the machines on which the client and server run are connected by a computer network. It is, however, possible that the client and the server are in fact running on the same machine, and we will take advantage of this ability in some of our upcoming examples.
Figure 8 illustrates both of these cases. It is important to keep in mind that we’re talking about a script that becomes a running program when we talk about clients and (in next chapter) servers. While you might hear someone point to a computer and say that it is their server, they probably mean that they have reserved this computer to run one or more server programs that respond to the requests of one or more client programs.
The computer science term for a running program (or more precisely a running instance of a script) is a process, and when the client and server processes exchange information, it is called interprocess communication. Two processes sending data across a computer network is one form of interprocess communication.
Whether the client and server are running on two separate, networked computers or both on the same computer is a detail hidden by the abstraction of the client-server programming model. And when networks are involved, the model hides the details of whether it is a wired network, a wireless network, or some combination of networks using lots of different transport mechanisms and data exchange protocols!
Resources, transactions, and protocols. Getting back to the structure of the client-server programming model, a server program is responsible for managing some resource that client programs find useful. A web server, like wikipedia.org, manages access to the files that comprise the articles in that online encyclopedia. An email server manages the transfer of email messages between email clients. A gaming server manages the shared state of the game and keeps each clients’ view of the game globally consistent.
The signature operation that takes place repeatedly in the client-server programming model is a transaction. The Python requests
library hides the four fundamental steps of a transaction between a client and a server behind the requests.request
function interface, of which we have used a specific type of request in our network program. As illustrated in Figure 9, the five steps are: (1) the client sends a request to the server; (2) the server receives the request and interprets it; (3) the server manipulates the resource it controls in the requested manner (assuming it is a valid request); (4) the server sends a response to the client; and (5) the client receives the response and interprets it.
When we make a requests.get
function call, our script (the client) sends a request (part of the value of url
) to the server (another part of the value of url
). Steps (2-4) are hidden from our client in the execution of the requests.get
function. Step (5) is the value that’s returned by the requests.get
function.
It’s now time to own up to something I mentioned when I introduced the Python requests
library but have ignored throughout our discussion of the client-server programming model. The requests
API does not allow us to do any type of client-server communication, but client-server communication using one specific protocol (i.e., the HTTP protocol).
What is a protocol? It is a fancy way of saying that two communicating processes can’t communicate unless they use the same language. Sound familiar? Encoding standards and communication protocols are related concepts. Clients and servers cannot communicate unless they agree to use the same method and meanings for the packing and unpacking of the data that they exchange.
In the Worldwide Web, browsers (i.e., clients) and web servers interact using a text-based protocol known as the hypertext transfer protocol (HTTP). We see pieces of this protocol in the url
string. In fact, HTTP defines a handful of methods that are the only ways that clients can ask servers to perform actions on their resources. One of these methods is GET, which simply asks the server to return a representation of the specified resource. The other popular HTTP method is POST, which essentially asks the server to accept the data accompanying the request as a new piece of the specified resource. Our script invokes the simpler GET method. That’s why the function name we use from the requests
library is get
!
The URL. When we “get” content from a web server, the data we receive comes in part from some file that the server manages, and as you might expect, each file on the server has a unique name. When we type a URL (a universal resource locator) in the address bar of a browser, this URL includes a path in the server’s file system to the file (i.e., resource) that interests us. The script qweb2.py
rewrites the setup work we did in qweb1.py
so that we can see the four main components of the URLs used in the HTTP protocol.
1### chap04/qweb2.py
2import requests
3
4def main():
5 print('Searching wikipedia for "The Cat in the Hat"')
6
7 # Highlight the 4 components of a URL for HTTP
8 protocol = 'https'
9 hostname = 'en.wikipedia.org'
10 path = '/w/api.php'
11 query = '?action=query&list=search&srsearch=The+Cat+in+the+Hat&srlimit=1&format=json'
12
13 # Build the URL and launch a `get` request
14 url = f"{protocol}://{hostname}{path}{query}"
15 response = requests.get(url)
16
17 if response.status_code == 200:
18 print(f"Our request to Wikipedia succeeded!")
19 else:
20 print(f"Hmmm, something might have gone wrong.")
21
22if __name__ == '__main__':
23 main()
The first piece of a URL indicates the protocol we are using, which in our case is HTTPS. HTTPS is an extension of the HTTP protocol that encrypts the request and response data that’s sent back and forth across the network.
Tip
Whenever possible, you should use HTTPS rather than HTTP to thwart someone snooping on your network traffic.
The next major piece of a URL is the hostname specified as an Internet Domain Name. If we want to communicate with a process on another machine, we have to be able to name that machine, and Internet Domain Names are a standardized way of doing just that. There’s a whole interesting question of, given a valid hostname, how do you actually route messages through computer networks to the computer with that hostname, especially when that computer might be a mobile device that connects to different parts of a computer network at different times. We don’t have to concern ourselves with this problem because it is hidden from us below the abstraction barrier, but you should realize that this is more complicated than putting a home address on an envelope, since most houses aren’t mobile homes.
The third piece of a URL for the HTTP protocol is the path to the specific resource on the host computer that we’d like. This path is specified in a form very similar to the filesystem paths we’ll discuss in detail in Chapter 13. This is not surprising since we are, in fact, specifying a file in the host’s filesystem.
Finally, HTTP allows us to specify an optional query at the end of our URL. The question mark (?
) separates the path from the query. The query itself is a string comprised of name and value pairs, with each pair separated by an ampersand (&
) character.
The Programmable Web. You might wonder, how did I know to create this specific URL? If we point our browser to en.wikipedia.org and type “The Cat in the Hat” in the search box there, the URL to which we’re directed is https://en.wikipedia.org/wiki/The_Cat_in_the_Hat
and not the one in qweb2.py
. That’s because we want to use the Web as a programmable platform, and to do that, we need to access the parts of the Web with published APIs.
To learn what Internet-based APIs exist, I visited the ProgrammableWeb, which maintained a directory of these things until it shut down in early 2023.[7] That site pointed me to MediaWiki’s API, on which Wikipedia’s API is built.[8] There I learned that the API endpoint for Wikipedia in English is https://en.wikipedia.org/w/api.php
.
To converse with the server at this endpoint, we have to craft a query string. In this string, the MediaWiki site explained that we should set the parameter named action
to the value query
when we want to fetch data. If then we set the list
parameter to search
, we can perform a full text search. Since we’re doing a search, we set the parameter srsearch
with the value of our search string and the parameter srlimt
to the number of page matches that we’d like returned. Finally, we set the parameter format
to json
so that the Wikipedia API knew that we want our response in JSON format, a text-based format that we’ll discuss in a moment.
Python dictionaries. It’s ugly and error prone to write out these cryptic query strings ourselves. You’ve undoubtedly noticed that we can’t use spaces in a query string and must replace each space character with a plus sign (+
), as we did on line 11 in qweb2.py
. But what if you wanted a search for a string containing a +
character? We probably don’t want to become an expert in URL encoding to simply ask our question to Wikipedia. Luckily, we don’t have to because the author of the requests
library realized this too!
1### chap04/qweb3.py
2import requests
3
4def main():
5 print('Searching wikipedia for "The Cat in the Hat"')
6
7 # Concatenate the first 3 components of a URL for HTTP
8 protocol = 'https'
9 hostname = 'en.wikipedia.org'
10 path = '/w/api.php'
11 url = protocol + '://' + hostname + path
12
13 # Describe the query string as a Python dictionary
14 query = {'action': 'query',
15 'list': 'search',
16 'srsearch': 'The Cat in the Hat',
17 'srlimit': 1,
18 'format': 'json'
19 }
20
21 response = requests.get(url, params=query)
22
23 if response.status_code == 200:
24 print(f"Our request to Wikipedia succeeded!")
25 else:
26 print(f"Hmmm, something might have gone wrong.")
27
28if __name__ == '__main__':
29 main()
This third version of our script employs a simpler way to specify the query. It pulls the query
string out of url
(line 11 of qweb3.py
) and passes it in another form to the params
parameter of the requests.get
function (see line 21 of qweb3.py
). This parameter expects a specific kind of data structure that we haven’t yet encountered. It expects a Python dictionary.[9]
Don’t confuse Python dictionaries with Python’s sequence data types. Both use the index operator (i.e., []
) to access a particular entry, but the dictionary abstraction doesn’t enforce an ordering on its elements as the sequence abstraction does.[10] Since we cannot specify which entry we want in a dictionary by specifying its position in the data structure, dictionaries associate each entry with a key, which Python requires to be of an immutable type. Recall that strings are immutable, and so we can use them as keys in our dictionary. We can also use numbers as keys, since they too are immutable objects in Python, which means you could make a dictionary object look like a list object, but let’s get back to our example.
Python dictionaries create a mapping from keys to values, and we initialize a dictionary with a set of key:value
pairs such as 'action':'query'
and 'srlimit':1
.
13# Describe the query string as a Python dictionary
14query = {'action': 'query',
15 'list': 'search',
16 'srsearch': 'The Cat in the Hat',
17 'srlimit': 1,
18 'format': 'json'
19}
20query # Not in qweb3.py
Run the code block above in the interactive interpreter. Once we’ve initialized query
, we can ask for the value associated with the key action
by typing query['action']
, which would return 'query'
. Make sure you recognize the difference between query
, which is the name of our dictionary object, and 'query'
, which is a string literal!
query['action']
As another example, query['srlimit']
would return 1
.
query['srlimit']
In initializing a dictionary, we enclose all the key:value
pairs in a pair of curly braces ({}
) and separate the key:value
pairs with commas. A pair of curly braces without any key:value
pairs creates an empty dictionary.
an_empty_dictionary = {}
Many of Python’s built-in functions work as well on dictionaries as they did with Python’s other data structures. For example, len(query)
would return 5
for our example, since there are 5 key-value pairs in query
.
len(query), len(an_empty_dictionary)
Typing query['srsearch'] = "Oh, the Places You'll Go!"
replaces one Dr. Seuss book title with another, and this functionality means dictionaries are mutable objects in Python.
print(query)
query['srsearch'] = "Oh, the Places You'll Go!"
print(query)
And 'rlimit' in query
returns False
because our dictionary doesn’t contain that misspelled key.
'rlimit' in query
If we want to remove a key and its value from a Python dictionary, we type del d[key]
where d
is our dictionary and key
is the key to be removed. If you specify a key that isn’t in the specified dictionary when indexing or using del
, the Python interpreter will raise a KeyError
exception.
You Try It
Go ahead and create a code block in which you can experiment with these Python dictionary commands for yourself. Feel free to look at the Python tutorial on dictionaries (https://docs.python.org/3/tutorial/datastructures.html#dictionaries) for more examples.
Another thing you might try is to print the actual URL used in our requests.get
call. You can see this string if you print response.url
, which is the URL that produced the response we received.
An HTTP response. So far, we have spoken to a Web API, but we haven’t done much with the HTTP response. In fact, our current script simply looks at something called the status code on the response and checks to see if this code is equal to 200
. While surfing around the Internet, you may have seen the phrase 200 OK
or 404 Not Found
. In general, the first is something good and the second isn’t. Unfortunately, the first, while good in some ways, may not mean we were completely successful in accomplishing what we hoped to do.
Instead of acting upon the value of the status_code
in our received response, let’s change our script for a moment and have it simply print the value of status_code
. We will then manipulate the parameters to requests.get
and see which affect the status code. We’ll modify qweb4.py
in our experiments.
1### chap04/qweb4.py
2import requests
3
4def main():
5 print('Searching wikipedia: changed "<nothing>"')
6 # print('Searching wikipedia: changed "path"')
7 # print('Searching wikipedia: changed "hostname"')
8 # print('Searching wikipedia: changed "srsearch"')
9
10 # Concatenate the first 3 components of a URL for HTTP
11 protocol = 'https'
12 hostname = 'en.wikipedia.org'
13 # hostname = 'en.wikipediaasdfghjasdfghj.org'
14 path = '/w/api.php'
15 # path = '/asdfghjasdfghj/api.php'
16 url = protocol + '://' + hostname + path
17
18 # Describe the query string as a Python dictionary
19 query = {'action': 'query',
20 'list': 'search',
21 'srsearch': 'The Cat in the Hat',
22 # 'srsearch': 'asdfghjasdfghj',
23 'srlimit': 1,
24 'format': 'json'
25 }
26
27 response = requests.get(url, params=query)
28
29 print("response.status_code =", response.status_code)
30
31if __name__ == '__main__':
32 main()
If you run the script as is, it reports a status code of 200. How can we make it generate a 404 status code?
If we search the Internet for “HTTP status 404 Not Found” we learn that it is a shorthand for “the server cannot find the requested resource.”[11] Recall that “requested resource” in this context refers to the path portion of our URL, which was supposed to take us to a resource controlled by this server. Ok, let’s test that. What happens if we put garbage in the path? In particular, let’s change this value from '/w/api.php'
to '/asdfghjasdfghj/api.php'
and rerun qweb4.py
.
You Try It
Switch which assignment to path
is commented out in qweb4.py
and run it.
Bingo! We got a 404 status code. This means that we were able to talk to a server with the hostname we specified and the server knew how to interpret an HTTP GET request, but this server didn’t have any resource at the path '/asdfghjasdfghj/api.php'
.
What if we mistype the hostname of the server?[12]
You Try It
Restore the good assignment to path
. Now change the assignment to hostname
from 'en.wikipedia.org'
to 'en.wikipediaasdfghjasdfghj.org'
. Run this version of qweb4.py
.
Our modified script dies with a large number of error messages. In this mess, you’ll find “Failed to establish a new connection” meaning that the requests
library tried several times to connect to the server but failed each time. Ideally, we would write our script so that it caught this error condition and handled it gracefully. Obviously, we cannot check the status code of a response if we never make a connection in the first place!
As a last experiment, what do you think the status code will be if we use a good hostname, a valid path, but ask to search for a term that doesn’t exist in Wikipedia?
You Try It
Restore the good assignment to hostname
. Now change the value for key 'srsearch'
in the definition for query
from 'The Cat in the Hat'
to 'asdfghjasdfghj'
. Run this final version of qweb4.py
.
It’s 200
. The same status code we got when we searched for 'The Cat in the Hat'
. Why is that? You can verify for yourself that Wikipedia will tell you that it doesn’t contain a page with this particular string.
The response header. To figure this out, we need to look at more than the response’s status code. Fortunately, HTTP request and response messages were designed to be human readable. Printing them can be instructional.
Roughly speaking, these messages consist of a header and a body. We can think of the header as containing the information we might see on the outside of an envelope (e.g., who the letter is meant for and the date it was posted at the post office). The body is the contents of the letter sealed in the envelope. Or if you prefer email, the header contains information like the “from” and “subject” fields, while the body is what is written below the subject line in most email clients.
The status code that we’ve been printing is just one part of the response header. We can inspect the other parts by printing response.headers
, which consists of what looks like a dictionary of key-value pairs. The qweb5.py
script uses a for-loop to print each key-value pair in response.headers
. When we write this for-loop statement, we specify two iteration variables (i.e., key
and value
) separated by a comma so that the interpreter knows that we want each pair pulled apart. This is called multiple assignment.[13]
1### chap04/qweb5.py
2import requests
3
4def main():
5 print('Searching wikipedia for "The Cat in the Hat"')
6
7 # Concatenate the first 3 components of a URL for HTTP
8 protocol = 'https'
9 hostname = 'en.wikipedia.org'
10 path = '/w/api.php'
11 url = protocol + '://' + hostname + path
12
13 # Describe the query string as a Python dictionary
14 query = {'action': 'query',
15 'list': 'search',
16 'srsearch': 'The Cat in the Hat',
17 'srlimit': 1,
18 'format': 'json'
19 }
20
21 response = requests.get(url, params=query)
22
23 # Print the response headers with a line per key
24 print("response.headers =")
25 for key, value in response.headers.items():
26 print(f" {key}: {value}")
27
28 # This is how we can access one of the keys
29 print("Content-Type:", response.headers['Content-Type'])
30
31if __name__ == '__main__':
32 main()
We’re not too concerned with most of the header, but we are interested in the value of the key 'Content-Type'
for this indicates how the server has encoded the body of the response. The value of this key is 'application/json; charset=utf-8'
. As we learned in the last chapter, the Python interpreter we’re using defaults to utf-8
for character encodings and having the Wikipedia server return a response in this encoding format makes our lives easy. Beyond the encoding of the individual characters in the response’s body, this value tells us that the body is organized using the JSON data-interchange standard, which is what we asked it to do when we specified 'format': 'json'
as one of our query
parameters.
JSON and the response body. JSON stands for JavaScript Object Notation, and it is meant to be a standard for exchanging structured data that is easy for machines to generate and parse, while also being relatively easy for humans to read and write.[14]
As we saw when we looked at the body of our response
, JSON-structured data resembles a dictionary of key-value pairs. Some of the values might be singleton values like a string or an integer, and other values might themselves be lists (i.e., enclosed in square brackets) or dictionaries (i.e., enclosed in curly braces).
By importing the json
module, we can take advantage of someone else’s work to make it straightforward for us to read and write message bodies in JSON format.
1### chap04/qweb6.py
2import requests
3import json
4
5def main():
6 print('Searching wikipedia for "The Cat in the Hat"')
7
8 # Concatenate the first 3 components of a URL for HTTP
9 protocol = 'https'
10 hostname = 'en.wikipedia.org'
11 path = '/w/api.php'
12 url = protocol + '://' + hostname + path
13
14 # Describe the query string as a Python dictionary
15 query = {'action': 'query',
16 'list': 'search',
17 'srsearch': 'The Cat in the Hat',
18 'srlimit': 1,
19 'format': 'json'
20 }
21
22 response = requests.get(url, params=query)
23
24 # Print the response body directly
25 print("response.text =", response.text)
26
27 print()
28
29 # Read the response body in JSON format and print it
30 j = response.json()
31 print("response.json() =", json.dumps(j, indent=4))
32
33 print()
34
35 # Print just the title from the JSON-structured response
36 print("Title =", j['query']['search'][0]['title'])
37
38if __name__ == '__main__':
39 main()
The json
module defines a function called dumps
that we use to convert JSON data into a string. Printing this string, we can see the structure of the data. We see that the value for the key 'query'
in our JSON-structured data is itself a dictionary. This dictionary tells us things about the search (i.e., the total number of hits that the server got when performing the search), and a list of 'search'
results. Since we limited the server to returning only the top hit (i.e., 'srlimit': 1
in our query
parameters), this list contains only a single element, which itself is a dictionary describing some characteristics of the article it found. Reading the JSON, we can see that this is the book we were hoping to find in Wikipedia!
How would we access the title of the article returned in the JSON body? Well, following our description in the previous paragraph, we’d simply type:
response.json()['query']['search'][0]['title']
And how would we change this to access the fifth character of the title string? We would add a [4]
to the end of what we just typed, because the result of the previous expression is a string and a string is a sequence!
response.json()['query']['search'][0]['title'][4]
While we will continue to play with JSON-structured response bodies, you may see (or desire) other types of responses. For instance, the default response format for the Harvard Library API is something that looks like JSON, but is actually another standard called XML. As we’ll see in our next script (qweb7.py
), we can indicate that we’d like a particular response encoding by changing the Accept
value in the request header.
Enumerating answers from HOLLIS. The qweb7.py
script below communicates not with Wikipedia, but (finally!) Harvard Library.[15] You should recognize and understand every statement except for the details of the for-loop that processes the returned response.
1### chap04/qweb7.py
2import requests
3import json
4
5def main():
6 print('Searching HOLLIS for "The Cat in the Hat"')
7
8 # Concatenate the first 3 components of a URL for HTTP
9 protocol = 'https'
10 hostname = 'api.lib.harvard.edu'
11 path = '/v2/items.json'
12 url = protocol + '://' + hostname + path
13
14 # Describe the query string as a Python dictionary
15 query = {'q': 'The Cat in the Hat',
16 'limit': 2
17 }
18
19 # Add a field to the request header saying what we accept
20 accept = {'Accept': 'application/json'}
21
22 response = requests.get(url, params=query, headers=accept)
23
24 # Read the response body in JSON format and print it
25 j = response.json()
26 print("response.json() =", json.dumps(j, indent=4))
27
28 print()
29
30 if j['pagination']['numFound'] == 0:
31 print('Zero results')
32 else:
33 # Process each returned response
34 for i, item in enumerate(j['items']['mods']):
35 # Print title info
36 ti = item['titleInfo']
37 if type(ti) == list:
38 # Lots of title info; just print the first
39 ti = ti[0]
40 print(f"Title #{i}: ", end='')
41 if 'nonSort' in ti:
42 print(ti['nonSort'], end='')
43 print(ti['title'])
44
45if __name__ == '__main__':
46 main()
In this script, we have asked HOLLIS to return not just the top search result, but the top two results. The script’s final lines dig down into the JSON-structured data to grab the Python list of results, and they pass that list to the Python built-in function enumerate
. This is a very useful little function when what you need is not just each value in a list, but each value paired with its list index. We pull this returned tuple[16] apart (using multiple assignment) in the for-loop, placing the index in i
and the dictionary that is the next value from the list in item
. From this dictionary, we grab the information associated with the key 'titleInfo'
and print it.
Beyond printing. Now that we understand the basics of grabbing data off the Internet, we can start having some real fun. Printing pieces of the data we’ve collected has helped us see what we’ve been getting back from these Web APIs, but why should we recreate what the software engineers at Wikipedia and Harvard Library have already spent years thinking about how to display? Imagine that we wanted to develop a script that searched multiple different libraries for a particular book. When it finds the book we’re interested in, we instruct this script to open the result in a browser window. We can see how this might be done using the HOLLIS URL in the response we just received.
1### chap04/qweb8.py
2import requests
3import webbrowser
4
5def h_lib(book):
6 # Concatenate the first 3 components of a URL for HTTP
7 protocol = 'http'
8 hostname = 'api.lib.harvard.edu'
9 path = '/v2/items.json'
10 url = protocol + '://' + hostname + path
11
12 # Describe the query string as a Python dictionary
13 query = {'q': book, 'limit': 5}
14
15 # Add a field to the request header saying what we accept
16 accept = {'Accept': 'application/json'}
17
18 response = requests.get(url, params=query, headers=accept)
19
20 # Return a list of matching items from the received response
21 if response.json()['pagination']['numFound'] == 0:
22 return []
23 else:
24 return response.json()['items']['mods']
25
26def match(item, desired_book):
27 if type(item['titleInfo']) == list:
28 ti = item['titleInfo'][0]
29 else:
30 ti = item['titleInfo']
31 return ti['title'].lower() == desired_book.lower()
32
33def get_url(items):
34 if type(items) == list:
35 for item in items:
36 if '@otherType' not in item or item['@otherType'] != 'HOLLIS record':
37 continue
38 return item['location']['url']
39 else:
40 return items['location']['url']
41
42def main():
43 desired_book = input("What's the title of your desired book? ")
44
45 print(f'Searching HOLLIS for "{desired_book}"')
46 items = h_lib(desired_book)
47
48 # Launch a browser window if we find the desired book
49 for item in items:
50 if match(item, desired_book):
51 # Grab this book's HOLLIS url
52 hollis_url = get_url(item['relatedItem'])
53
54 # Open a browser window to the appropriate HOLLIS page
55 browser = webbrowser.get('safari')
56 browser.open(hollis_url)
57 break
58 else:
59 print("Your desired book isn't in HOLLIS.")
60
61if __name__ == '__main__':
62 main()
You Try It
To run qweb8.py
, you’ll need an IDE that is running locally on your machine (i.e., not in the cloud) so that you can launch a window running a web browser. To run qweb8.py
, you should replace safari
on line 55 with the appropriate name for the web browser on your machine. Also, be aware that the script only checks the title you input against what HOLLIS considers the interesting part of the title. To match “The Cat in the Hat” you input “cat in the hat”. You could extend qweb8.py
with the title-printing logic in qweb7.py
.
The script qweb8.py
creates a function h_lib
to hide the details of making a search query on HOLLIS. This should all look familiar except for the very end of the function, where we check for a response that contains no matches.
The more interesting new code is in main
, which uses the for-else-statement introduced in Chapter 3’s active-learning exercises. In this script, we receive back a list from h_lib
in which we hope to find a matching item. If we iterate through this entire list and never find our desired item, we’d like to note this for the script’s user. This is the purpose the else-clause indented at the same level as the for-statement.[17]
Think about how we’d do that using just a for-statement. We’d probably set a boolean flag (e.g., found = False
) before we began the loop. In the loop, we’d set that flag to True
if we found our desired item. And then check the flag at the completion of the loop to see if we checked the entire list and didn’t find our desired item. If you found that description confusing, you’re not alone. If you write out that code, you’ll see it isn’t any easier to understand.
The for-else-statement does away with this flag in this exact use case. The if-statement in the for-loop contains the condition we’d like to find True
, and when the if-statement’s condition is True
, we break out of the for-loop. This is exactly what we’d like to think about this loop doing. But what if this loop iterates all the way through because we never found our desired item. Here’s where the else-clause of the for-statement executes, alerting us that we failed in our search. Now that, my friends, how you create a programming language feature for a common construct.
And what did the script do when it found our desired book? It grabbed the HOLLIS URL for the book out of the matching record and launched a web browser to display the resource at that URL. Did I have to know anything specific about the operation of web browsers? No! A wonderfully helpful individual created the webbrowser
library; I simply imported and used this library after spending only a few minutes with its API description.
Blocking and non-blocking function calls. Let’s pause for a moment and look at two of the powerful function calls we made in qweb8.py
. The requests.get
call on line 18 and the browser.open
call on line 56 are different in a fundamentally interesting way. I don’t mean that one returns a Python list and the other returns the special Python object None
. No, I want you to think about when lines 21 and 57 execute.
Line 21 can’t execute until we have completed the requests.get
call and have a response
object. This call blocks the further execution of our script until it completes. In particular, our script has to wait for some server process to do some work before the process running our script can continue. On the other hand, our script doesn’t care how long a process that runs the web browser takes to display the URL we asked it to render. Our script can merrily move on to the work in line 57 and beyond. The call to webbrowser.open
is an example of a non-blocking call.[18]
This is your first taste of concurrency (also called parallelism), and it is a fascinating and tricky subject. For now, be aware whether the library call you’re making will block and think about the implications of that action on the rest of what you write in your script.