Backpropagation and the N-Bit Parity Problem

First project for CSE 5526: Introduction to Neural Networks (Autumn 2018). After the homeworks covered M-P neurons and the perceptron learning rule, this project had us implement a full multi-layer perceptron (MLP) from scratch and train it with backpropagation.

The N-Bit Parity Problem#

The N-bit parity function outputs 1 if the number of 1-bits in the input is odd, and 0 otherwise. For example, with $N = 4$ :

$0101$ (two 1-bits) $\rightarrow$ 0 (even parity)
$0111$ (three 1-bits) $\rightarrow$ 1 (odd parity)

What makes this hard for a neural network is that every input bit matters, and flipping any single bit changes the output. There’s no local structure to exploit. The network has to learn a truly global function over all inputs, which is why parity is a standard benchmark for training algorithms.

For $N = 4$ , the training set is all $2^4 = 16$ binary numbers (0 through 15).

Network Architecture#

The MLP uses a two-layer feedforward architecture with sigmoid activation:

Layer 1 (hidden): Takes the $N$ -bit input (plus a bias term) and produces $N$ hidden units.

Weight matrix: $(N+1) \times N$

Layer 2 (output): Takes the hidden layer output (plus a bias term) and produces a single output.

Weight matrix: $(N+1) \times 1$

Forward propagation:

$h = \sigma(W_0^T \cdot x) \quad \text{(hidden layer)}$ $y = \sigma(W_1^T \cdot [1; h]) \quad \text{(output layer)}$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid activation function, and $[1; h]$ denotes prepending a bias term.

Backward propagation computes the error gradients using the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ , which simplifies nicely when expressed in terms of the output: $y' = y(1 - y)$ .

For the output layer:

$\delta_k = y_k(1 - y_k)(d_k - y_k)$

For the hidden layer:

$\delta_j = h_j(1 - h_j) \cdot W_1[1:] \cdot \delta_k$

Weight updates incorporate optional momentum ( $\alpha$ ):

$\Delta W(t) = \eta \cdot \delta \cdot x^T + \alpha \cdot \Delta W(t-1)$

Convergence criterion: All 16 outputs must be within 0.05 of their desired values.

Implementation#

The implementation is a self-contained MultiLayerPerceptron class in Python/NumPy:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class MultiLayerPerceptron:
    def __init__(self, parity=4, eta=0.2, alpha=0):
        self.parity = parity
        self.eta = eta
        self.alpha = alpha
        self.W = [
            np.random.uniform(-1, 1, (parity+1, parity)),
            np.random.uniform(-1, 1, (parity+1, 1))
        ]
        self.delta_W = [np.zeros_like(w) for w in self.W]
        self.X, self.D = self._get_train_data()

    def forwardProp(self, x):
        h = sigmoid(np.dot(self.W[0].T, x))
        y = sigmoid(np.dot(self.W[1].T, np.insert(h, 0, 1, axis=0)))
        return [h, y]

    def backwardProp(self, x, Y, d_k):
        y_k = Y[1]
        delta_k = y_k * (1 - y_k) * (d_k - y_k)

        self.delta_W[1] = (self.eta * delta_k.T
                           * np.insert(Y[0], 0, 1, axis=0)
                           + self.alpha * self.delta_W[1])
        delta_j = Y[0] * (1 - Y[0]) * self.W[1][1:] * delta_k

        self.delta_W[0] = (self.eta * delta_j.T * x
                           + self.alpha * self.delta_W[0])
        self.W[0] += self.delta_W[0]
        self.W[1] += self.delta_W[1]

python

I also parallelized the training using Python’s multiprocessing module, since each configuration (learning rate + momentum combination) is independent and can run on its own core.

Experimental Results#

I swept learning rates from 0.05 to 0.50 (in steps of 0.05), both without momentum ( $\alpha = 0$ ) and with momentum ( $\alpha = 0.9$ ).

Without Momentum ( $\alpha = 0$ )#

$\eta$	Epochs
0.05	35,869
0.10	16,477
0.15	12,681
0.20	9,541
0.25	8,719
0.30	6,180
0.35	5,319
0.40	5,121
0.45	6,683
0.50	120,389

Epochs vs learning rate without momentum — Epochs to convergence as a function of learning rate (no momentum). The optimal learning rate is 0.40, with a sharp blowup at 0.50.

There’s a clear sweet spot. Increasing $\eta$ from 0.05 to 0.40 steadily reduces the number of epochs needed. But past 0.40, the step size gets too large and the network overcompensates for each error, causing weight oscillations that slow convergence way down. At $\eta = 0.50$ , it takes over 120,000 epochs, which is a 23x regression from the optimal.

With Momentum ( $\alpha = 0.9$ )#

$\eta$	Epochs
0.05	3,206
0.10	1,791
0.15	1,139
0.20	975
0.25	878
0.30	623
0.35	547
0.40	514
0.45	5,197
0.50	9,878

Epochs vs learning rate with momentum — Epochs to convergence with momentum (alpha = 0.9). Same optimal learning rate, but roughly 10x fewer epochs across the board.

Momentum gives a consistent and big speedup without changing the overall shape of the curve. The optimal $\eta$ is still 0.40, but the network converges in 514 epochs instead of 5,121.

Momentum Speedup Analysis#

$\eta$	Without $\alpha$	With $\alpha = 0.9$	Speedup
0.05	35,869	3,206	11.2x
0.10	16,477	1,791	9.2x
0.15	12,681	1,139	11.1x
0.20	9,541	975	9.8x
0.25	8,719	878	9.9x
0.30	6,180	623	9.9x
0.35	5,319	547	9.7x
0.40	5,121	514	10.0x
0.45	6,683	5,197	1.3x
0.50	120,389	9,878	12.2x

Speedup from momentum — Speedup factor from momentum as a function of learning rate. Consistent ~10x improvement, with a weird dip at eta = 0.45.

The speedup is pretty consistent at around 10x across most learning rates. The weird outlier is $\eta = 0.45$ , where momentum only gives a 1.3x speedup. My guess is that this is a transitional zone: the learning rate is just large enough to cause instability on its own, and momentum (which increases the effective step size) tips it over into oscillatory behavior. What’s interesting is that at the even higher $\eta = 0.50$ , momentum gets its 12x speedup back, which suggests a different dynamical regime.

What I Took Away#

Learning rate turned out to be the most important hyperparameter by far. The difference between a good and bad choice was orders of magnitude in convergence time, and the transition from “working fine” to “taking forever” was pretty abrupt (0.40 to 0.50).

Momentum gave a solid ~10x speedup most of the time, but it’s not a free parameter to crank up. The $\eta = 0.45$ case showed that momentum and learning rate can interact badly: the combined effective step size is what actually matters, not either parameter in isolation.

The parity problem itself was a good choice for this project. Because every input bit affects the output, every neuron has to participate in every decision. You can’t get away with some neurons being “lazy.” I also appreciated having to implement backpropagation by hand, computing $\delta_k$ for the output layer, propagating back through the hidden layer, wiring in momentum. It’s not hard to use a framework for this stuff, but doing it manually once made the algorithm feel a lot more concrete.