Chapter 10#

ALE 10.1: Are these data still good? Whenever we send a message across a network or leave our data on a storage device, there’s a non-zero chance that a real-world event (e.g., an electrical spike from a lightning strike) will cause one or more of data bits to flip their value (i.e., from a 1 to a 0 or a 0 to a 1). A common method for noticing when such data has been corrupted is to pair it with a hash of all the bits in a block of data. Computer scientists often refer to such a hash as a checksum, and in this ALE, you’ll learn about and implement a historically popular one that requires just one extra bit per data block: CRC-1.

Step 1. CRC stands for Cyclic Redundancy Check, and it is a non-cryptographic hash function, which means that you can use it for error detection but not for any meaningful security purpose. CRC-1 is a fancy name for adding a single parity bit to the end of a short block of data. Because it is easy and space efficient to compute, the generation of a parity bit is often implemented directly in hardware. If a computer’s memory includes parity bits, the hardware can alert the operating system when a running program tries to read a piece of memory that has been corrupted.

CRC-1 is called a parity bit because you are just recording a fact about the number of bits in the data that are a 1. Obviously with one parity bit, we can’t record the number of 1-bits in a data block of length greater than 1, and so the parity bit simply records whether the number of 1-bits is an even or odd number. Because we can encode this even/odd fact in one of two ways (e.g., the parity bit being equal to 0 means either that the number of 1-bits is even or that the number of 1-bits is odd), the world defines CRC-1 either as an even-parity implementation or an odd-parity one.

In both implementations, you count the number of 1-bits in both the data block and the parity bit. Even parity sets the parity bit to 1 when that makes the total number of 1-bits even, and odd parity sets the parity bit to 1 when that makes the total into an odd number.

Now let’s test your understanding: What value would you give the parity bit when your data block contains no 1-bits and you’ve been told to implement even parity?

Step 2. Let’s implement a function that takes one parameter, an integer, and computes and returns the even parity bit (i.e., the function returns 0 or 1).

The function crc1_cnt in the following code block is one possible implementation. Make sure you understand the method it uses to count the number of 1-bits in the input number.

 1### chap10/ale01.py
 2
 3def crc1_cnt(n):
 4    cnt_of_1s = 0
 5
 6    # Add 1 to the count of 1-bits when `n`
 7    # is odd. Use integer-division to shift
 8    # down a bit, continuing until `n` is 0.
 9    while n != 0:
10        if n % 2 == 1:
11            cnt_of_1s += 1
12        n = n // 2
13
14    # Return the right value for even parity
15    if cnt_of_1s % 2 == 1:
16        return 1
17    else:
18        return 0

How would you change this code to produce the parity bit for odd parity?

You can test the original implementation of crc1_cnt using the following test function. After you edit crc1_cnt to implement odd parity, change the second parameter in the call to test to 'odd' to verify your changed function.

29### chap10/ale01.py
30
31def test(fun, parity='even'):
32    if parity == 'even':
33        p, q = 0, 1
34    else:
35        p, q = 1, 0
36
37    # Test the function sent in as the parameter
38    if fun(0) != p:
39        return 'Failed for n = 0'
40    if fun(1) != q:
41        return 'Failed for n = 1'
42    if fun(2) != q:
43        return 'Failed for n = 2'
44    if fun(3) != p:
45        return 'Failed for n = 3'
46    if fun(16) != q:
47        return 'Failed for n = 16'
48    return 'PASSED our unit tests!'
49
50
51print('Testing crc1_cnt ...')
52print(test(crc1_cnt, 'even'))

Step 3. Think back for a moment to ALE 9.1 where we learned how computers implement addition using nothing but bitwise operations. There we learned how the sum and carry parts of long-hand addition work with binary numbers, and now we can see that sum is just a parity computation!

Figure out what type of parity the XOR operator computes (HINT: write out its truth table). And then complete the body of the function crc1_xor so that it implements even parity.

18### chap10/ale01.py
19
20def crc1_xor(n):
21    even_parity = 0
22    n_binstr = bin(n)[2:]  # removes the leading '0b'
23    #print('DEBUG: n_binstr', n_binstr)
24
25    # INSERT some kind of loop over n_binstr and
26    # the body of the loop can be done with one
27    # update to `even_parity` using a bitwise operator.
28
29    return even_parity

Test your implementation to verify that you’ve done your work correctly.

53### chap10/ale01.py
54
55print('Testing crc1_xor ...')
56print(test(crc1_xor, 'even'))

ALE 10.2: The BSD checksum. When we type commands in the Shell on Replit, we are interacting with a program running on a Unix operating system. For those of you running on a MacBook, MacOS is basically a Unix operating system under its hood. In fact, MacOS is a descendent of BSD Unix, which was developed at the University of California, Berkeley in the 1970s. It was an extension of the original UNIX operating system developed just a few years earlier at Bell Labs, which started in 1925 as the research arm of Western Electric and AT&T.

One of the extensions that the Berkeley researchers added to their version of Unix was the BSD checksum utility called sum, which can still be found in MacOS. It was meant as mechanism to create a fingerprint (i.e., a checksum or hash) of a file so that you could later tell if the file had been corrupted (i.e., had one of its bytes changed).

The BSD checksum employed a very simple algorithm, which we describe in pseudocode below. Instead of taking a file of bytes as input, our function will take a string of ASCII (8-bit) characters.

Notice that this routine creates a 16-bit result, which means an integer between 0 and 65535. It does a little bit of bit manipulation, which is the hardest part of creating the Python code corresponding to this pseudocode. Otherwise, it is just adding the numerical value of each character together and keeping the lowest 16 bits of the resulting sum.

Complete the implementation of bsd_checksum and verify it using the following test-code block.

 1### chap10/ale02.py
 2
 3def bsd_checksum(s):
 4    '''Given a string of 8-bit characters
 5       compute a 16-bit, BSD-style checksum.'''
 6
 7    # Starting value for our 16-bit checksum
 8    checksum = 0
 9
10    # foreach character c in the input string s
11        # Rotate the current checksum value right by 1 bit, where
12        # rotate takes the existing least significant bit and
13        # makes it the new most significant bit.
14
15        # To this rotated value, add numerical value of
16        # the current character c.
17
18        # Store only the least-significant 16 bits of the previous
19        # result as your new checksum.
20
21    #print('DEBUG:', checksum)
22    return checksum
24### chap10/ale02.py
25
26# Test code
27print('Testing bsd_checksum ...')
28assert bsd_checksum('a') == 97, "Failed for s = 'a'"
29assert bsd_checksum('aa') == 32913, "Failed for s = 'aa'"
30assert bsd_checksum('bb') == 147, "Failed for s = 'bb'"
31assert bsd_checksum('test') == 16597, "Failed for s = 'test'"
32assert bsd_checksum('This is a test\n') == 11640, \
33                    "Failed for s = 'This is a test\n'"
34print('PASSED our unit tests!')

Congratulations! You now know several kinds of hash functions, all of which convert inputs of unbounded size into an output that is a small integer, with some hashes as small as a single bit!

ALE 10.3: Find colliding words. The script rk_strmatch.py currently finds substrings in the text that match a given pattern. If the pattern fed is a word, can you change this script so that it prints not substring matches but the words that collide with the pattern word (i.e., find two words that produce the same hash value)? Let’s try!

In doing this work, you’ll probably want to:

  1. Pull out the code that computes a hash in rk_strmatch into its own function and make the hashing function constants into globals.

  2. Make sure that the substring that matches the pattern word is itself a word. You can assume that words contain only the letters a-z.

  3. Check your work by printing both the hash of the pattern word and the hash of the colliding words.

  4. Print a colliding word only once, even if it occurs many times in the input text.

Start by copying rk_strmatch.py into a file called ale03.py, and then edit that file so that you don’t destroy your good copy of the original script.

To test your script, you can try the following two commands:

  • python3 ale03.py JustDavid-chaps.txt book

  • python3 ale03.py JustDavid-chaps.txt pale

The test with “book” should print:

chap10$ python3 ale03.py JustDavid-chaps.txt book
pattern hash("book") = 3324
No collisions found

The test with “pale” should print:

chap10$ python3 ale03.py JustDavid-chaps.txt pale
pattern hash("pale") = 64517
COLLISION: hash("spot") = 64517
COLLISION: hash("tops") = 64517

[Version 20240822]