Neural Networks in TheAlgorithms/Python

Five files tracing how a network learns: from a single-neuron forward pass to backpropagation, multi-layer weight matrices, and the activation functions that make it possible.

5 stops ~12 min Verified 2026-05-05

What you will learn

How a single-neuron forward pass iterates a weight update using the sigmoid derivative as an error scaling factor
How the DenseLayer class stores weights and bias as NumPy matrices and applies them during forward propagation
How back_propagation computes gradient with respect to weights and bias and updates them in-place
How TwoHiddenLayerNeuralNetwork uses three weight matrices to connect an input layer, two hidden layers, and one output layer
How ReLU clips negative activations to zero using a single NumPy maximum call
How the Swish activation multiplies the input by its own sigmoid, producing a smooth non-monotonic function

Prerequisites

Comfort with NumPy matrix operations and Python classes
Basic understanding of what a loss function and a learning rate are

1 / 5

simple_neural_network.py: one neuron, one weight, forward-propagate until convergence

neural_network/simple_neural_network.py:28

A single-neuron network updates its weight by multiplying the prediction error by the sigmoid derivative, then stepping the weight in the corrective direction.

This file reduces a neural network to its smallest readable form: one neuron, one weight, no hidden layers. The weight starts as a random odd number between 1 and 199, stored as a float. Each iteration applies three sub-steps before the update. First, layer_1 is the sigmoid of the current weight scaled by INITIAL_VALUE; this is the forward pass. Second, layer_1_error measures how far the output is from the target, which is expected / 100 so that it falls inside the sigmoid's (0, 1) output range. Third, layer_1_delta multiplies that error by the sigmoid derivative, which shrinks the correction step where the sigmoid curve is flat, near its tails. The derivative is evaluated on layer_1, the sigmoid's output, not on its input. The weight then updates by that delta scaled again by INITIAL_VALUE. After 450,000 iterations, the first doctest expects the returned value, layer_1 * 100, to fall in the range (31, 33) when the expected value is 32. After only 1,000 iterations it has not yet converged, as the second doctest confirms.

Key takeaway

A single-neuron network needs four lines per iteration: forward pass through sigmoid, error computation, delta scaled by sigmoid derivative, and weight update.

def forward_propagation(expected: int, number_propagations: int) -> float:
    """Return the value found after the forward propagation training.

    >>> res = forward_propagation(32, 450_000)  # Was 10_000_000
    >>> res > 31 and res < 33
    True

    >>> res = forward_propagation(32, 1000)
    >>> res > 31 and res < 33
    False
    """

    # Random weight
    weight = float(2 * (random.randint(1, 100)) - 1)

    for _ in range(number_propagations):
        # Forward propagation
        layer_1 = sigmoid_function(INITIAL_VALUE * weight)
        # How much did we miss?
        layer_1_error = (expected / 100) - layer_1
        # Error delta
        layer_1_delta = layer_1_error * sigmoid_function(layer_1, True)
        # Update weight
        weight += INITIAL_VALUE * layer_1_delta

    return layer_1 * 100
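
The excerpt above depends on two names defined elsewhere in the file. The following is a minimal self-contained sketch of the same loop, not the file's code. It assumes INITIAL_VALUE is 0.02 and that the derivative branch computes s * (1 - s) from the sigmoid's output s. The helper names and the start_weight parameter are hypothetical; start_weight replaces the random initial weight so the run is repeatable.

```python
import math

# Assumption: stands in for the module-level constant used by the excerpt.
INITIAL_VALUE = 0.02


def sigmoid(value: float) -> float:
    """Logistic sigmoid of a single float."""
    return 1 / (1 + math.exp(-value))


def sigmoid_derivative_from_output(output: float) -> float:
    """Sigmoid derivative expressed through the sigmoid's output s: s * (1 - s)."""
    return output * (1 - output)


def train_single_neuron(expected: int, iterations: int, start_weight: float) -> float:
    """Run the four-line update loop from a fixed (hypothetical) starting weight."""
    weight = start_weight
    output = sigmoid(INITIAL_VALUE * weight)
    for _ in range(iterations):
        # Forward pass
        output = sigmoid(INITIAL_VALUE * weight)
        # Distance from the target, rescaled into the sigmoid's (0, 1) range
        error = (expected / 100) - output
        # Error scaled by the local slope of the sigmoid
        delta = error * sigmoid_derivative_from_output(output)
        # Weight update
        weight += INITIAL_VALUE * delta
    return output * 100


long_run = train_single_neuron(32, 450_000, start_weight=1.0)
short_run = train_single_neuron(32, 1_000, start_weight=1.0)
print(f"450,000 iterations: {long_run:.3f}")
print(f"1,000 iterations:   {short_run:.3f}")
```

Starting from a weight of 1.0, the long run should land inside (31, 33) while the short run stays near 50, which mirrors the contrast between the two doctests.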

2 / 5

DenseLayer: forward propagation applies the stored weight matrix and bias

neural_network/back_propagation_neural_network.py:69

A DenseLayer's forward pass is a matrix multiply of the weight matrix with the input, minus bias, passed through the activation function.

The back-propagation framework in this file organizes a neural network as a list of DenseLayer objects. Each layer owns a weight matrix initialized with small random values and a bias vector with one entry per unit in the layer. The input layer is a special case: it passes data through unchanged. Every other layer computes np.dot(self.weight, self.xdata) - self.bias, the linear transformation that maps the previous layer's output to this layer's pre-activation values. Note that the bias is subtracted rather than added, so it acts as a threshold. The result is passed to self.activation, which defaults to sigmoid if none is specified during construction. Both self.wx_plus_b and self.output are saved because the backward pass needs them to compute the gradient. The BPNN class's train method chains calls to forward_propagation across all layers before computing loss.

Key takeaway

A DenseLayer's forward pass is one line of math: np.dot(weight, input) - bias, then apply the activation function. Both intermediate values are cached for the backward pass.

    def forward_propagation(self, xdata):
        self.xdata = xdata
        if self.is_input_layer:
            # input layer
            self.wx_plus_b = xdata
            self.output = xdata
            return xdata
        else:
            self.wx_plus_b = np.dot(self.weight, self.xdata) - self.bias
            self.output = self.activation(self.wx_plus_b)
            return self.output
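
To make the shapes concrete, here is a standalone sketch of the non-input branch, written as a plain function rather than the class method. The function name dense_forward, the layer sizes, and the normal(0, 0.5) initialization are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Element-wise logistic sigmoid."""
    return 1 / (1 + np.exp(-x))


def dense_forward(weight, bias, xdata, activation=sigmoid):
    """Hypothetical helper: return (pre-activation, output) for one dense layer."""
    # Same expression as the excerpt: the bias is subtracted, not added.
    wx_plus_b = np.dot(weight, xdata) - bias
    output = activation(wx_plus_b)
    return wx_plus_b, output


rng = np.random.default_rng(0)
units, back_units = 3, 2                           # illustrative layer sizes
weight = rng.normal(0, 0.5, (units, back_units))   # one row per unit
bias = rng.normal(0, 0.5, (units, 1))              # one entry per unit
xdata = np.array([[1.0], [2.0]])                   # a single sample as a column vector

wx_plus_b, output = dense_forward(weight, bias, xdata)
print(wx_plus_b.shape, output.shape)  # (3, 1) (3, 1)
```

Returning both values mirrors the caching in the method: the backward pass needs the pre-activation and the activated output.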

3 / 5

TwoHiddenLayerNeuralNetwork: three weight matrices connect four layers

neural_network/two_hidden_layers_neural_network.py:11

Three weight matrices are initialized with shapes that encode the node counts at each layer transition: input-to-hidden-1, hidden-1-to-hidden-2, hidden-2-to-output.

The weight matrix shapes in this constructor encode the entire network topology. input_layer_and_first_hidden_layer_weights is shaped (input_features, 4), meaning the first hidden layer has 4 nodes regardless of how many input features the data has. first_hidden_layer_and_second_hidden_layer_weights is (4, 3), connecting 4 nodes to 3. second_hidden_layer_and_output_layer_weights is (3, 1), collapsing 3 hidden nodes to a single output. Reading these three shapes in order tells you the full architecture without running the code. The generator returned by np.random.default_rng() draws the initial weights uniformly from [0, 1) through rng.random. Random starting values break symmetry: if every weight started at the same value, all neurons in a layer would receive identical updates and learn the same feature. The predicted_output field starts as an array of zeros with the same shape as output_array.

Key takeaway

The three weight matrix shapes directly encode the network topology; reading their dimensions tells you the node counts at every layer without running the code.

class TwoHiddenLayerNeuralNetwork:
    def __init__(self, input_array: np.ndarray, output_array: np.ndarray) -> None:
        """
        This function initializes the TwoHiddenLayerNeuralNetwork class with random
        weights for every layer and initializes predicted output with zeroes.

        input_array : input values for training the neural network (i.e training data) .
        output_array : expected output values of the given inputs.
        """

        # Input values provided for training the model.
        self.input_array = input_array

        # Random initial weights are assigned where first argument is the
        # number of nodes in previous layer and second argument is the
        # number of nodes in the next layer.

        # Random initial weights are assigned.
        # self.input_array.shape[1] is used to represent number of nodes in input layer.
        # First hidden layer consists of 4 nodes.
        rng = np.random.default_rng()
        self.input_layer_and_first_hidden_layer_weights = rng.random(
            (self.input_array.shape[1], 4)
        )

        # Random initial values for the first hidden layer.
        # First hidden layer has 4 nodes.
        # Second hidden layer has 3 nodes.
        self.first_hidden_layer_and_second_hidden_layer_weights = rng.random((4, 3))

        # Random initial values for the second hidden layer.
        # Second hidden layer has 3 nodes.
        # Output layer has 1 node.
        self.second_hidden_layer_and_output_layer_weights = rng.random((3, 1))
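
The following sketch traces how those three shapes chain together in a forward pass. It is an illustration under stated assumptions, not the class's own method: it applies a sigmoid after each matrix product, uses no bias terms, and invents a batch of 8 samples with 3 features.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Element-wise logistic sigmoid."""
    return 1 / (1 + np.exp(-x))


rng = np.random.default_rng(0)

# Hypothetical training batch: 8 samples, 3 input features.
input_array = rng.random((8, 3))

# Same shapes as the constructor: (features, 4), (4, 3), (3, 1).
w_input_hidden1 = rng.random((input_array.shape[1], 4))
w_hidden1_hidden2 = rng.random((4, 3))
w_hidden2_output = rng.random((3, 1))

# Each product consumes the previous layer's width and emits the next one's.
hidden_1 = sigmoid(input_array @ w_input_hidden1)    # (8, 3) @ (3, 4) -> (8, 4)
hidden_2 = sigmoid(hidden_1 @ w_hidden1_hidden2)     # (8, 4) @ (4, 3) -> (8, 3)
predicted = sigmoid(hidden_2 @ w_hidden2_output)     # (8, 3) @ (3, 1) -> (8, 1)

print(hidden_1.shape, hidden_2.shape, predicted.shape)  # (8, 4) (8, 3) (8, 1)
```

The inner dimensions must match at every step, which is why the second axis of one matrix equals the first axis of the next.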

4 / 5

ReLU: clip negative activations to zero with a single NumPy call

neural_network/activation_functions/rectified_linear_unit.py:18

ReLU maps negative inputs to zero and leaves positive inputs unchanged, introducing nonlinearity without saturating for positive inputs.

ReLU (Rectified Linear Unit) became the dominant activation function in deep networks because it mitigates the vanishing gradient problem that affects sigmoid and tanh. When the sigmoid output is near 0 or near 1, its derivative is nearly zero, and even at its peak the derivative is only 0.25, so gradients multiplied through many sigmoid layers shrink toward nothing. ReLU's derivative for positive inputs is exactly 1, so gradients pass through without shrinking. The tradeoff is that a neuron whose input is always negative always has a zero derivative and stops learning, the so-called dead neuron problem. The implementation is one line: np.maximum(0, vector) takes the element-wise maximum of 0 and each input, zeroing negatives and leaving positives intact. The doctest confirms: input [-1, 0, 5] returns [0, 0, 5].

Key takeaway

ReLU is np.maximum(0, vector): one line that zeroes negative inputs and leaves positives unchanged, so gradients for positive inputs flow without saturation.

def relu(vector: list[float]):
    """
    Implements the relu function

    Parameters:
        vector (np.array,list,tuple): A  numpy array of shape (1,n)
        consisting of real values or a similar list,tuple


    Returns:
        relu_vec (np.array): The input numpy array, after applying
        relu.

    >>> vec = np.array([-1, 0, 5])
    >>> relu(vec)
    array([0, 0, 5])
    """

    # compare two arrays and then return element-wise maxima.
    return np.maximum(0, vector)
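
The gradient argument can be checked numerically. The sketch below is illustrative only: relu_derivative and sigmoid_derivative are hypothetical helpers that are not shown in the excerpt, and treating the derivative at exactly 0 as 0 is a convention assumed here. Raising a derivative to the power of the depth stands in for multiplying the same local gradient through that many layers.

```python
import numpy as np


def relu(vector):
    """Element-wise max(0, x), as in the excerpt."""
    return np.maximum(0, vector)


def relu_derivative(vector):
    """Hypothetical helper: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return (np.asarray(vector) > 0).astype(float)


def sigmoid_derivative(vector):
    """Hypothetical helper: derivative of the logistic sigmoid, s * (1 - s)."""
    s = 1 / (1 + np.exp(-np.asarray(vector, dtype=float)))
    return s * (1 - s)


x = np.array([-1.0, 0.0, 5.0])
depth = 10  # illustrative number of stacked layers

relu_chain = relu_derivative(x) ** depth
sigmoid_chain = sigmoid_derivative(x) ** depth

print(relu(x))
print(relu_chain)
print(sigmoid_chain)
```

For the positive input the ReLU product stays at 1, while no sigmoid product can exceed 0.25 ** 10, roughly one in a million. The two zero entries in the ReLU product are the dead-neuron side of the tradeoff.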

5 / 5

Swish: multiply input by its own sigmoid to get a smooth non-monotonic activation

neural_network/activation_functions/swish.py:33

Swish multiplies each input by its own sigmoid value, producing a smooth function that allows small negative outputs for negative inputs.

ReLU is piecewise linear and non-differentiable at zero; that kink can cause optimization instabilities in some architectures. Swish, introduced in a 2017 Google Brain paper linked in the module docstring, addresses this by returning vector times sigmoid(vector). For large positive inputs, sigmoid approaches 1 so Swish is approximately linear. For large negative inputs, sigmoid approaches 0 fast enough that Swish approaches 0. In between, Swish dips slightly negative (about -0.27 at input -1, with a minimum of roughly -0.28 near input -1.28) before returning toward zero, giving the network a small nonzero signal where ReLU clamps hard to zero. The file also defines a parameterized swish function that multiplies the input by sigmoid(beta times vector), where beta is a trainable scaling parameter. This generalization was shown to further improve performance on some tasks. Both variants are implemented and doctested in the file.

Key takeaway

Swish is vector times sigmoid(vector): one multiplication that produces a smooth, shallow negative trough for moderately negative inputs instead of ReLU's hard clamp.

def sigmoid_linear_unit(vector: np.ndarray) -> np.ndarray:
    """
    Implements the Sigmoid Linear Unit (SiLU) or swish function

    Parameters:
        vector (np.ndarray): A  numpy array consisting of real values

    Returns:
        swish_vec (np.ndarray): The input numpy array, after applying swish

    Examples:
    >>> sigmoid_linear_unit(np.array([-1.0, 1.0, 2.0]))
    array([-0.26894142,  0.73105858,  1.76159416])

    >>> sigmoid_linear_unit(np.array([-2]))
    array([-0.23840584])
    """
    return vector * sigmoid(vector)
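
As a rough illustration of the parameterized form and of where the negative dip sits, the sketch below evaluates the function on a grid. The name swish_with_beta and its signature are hypothetical and need not match the file's own swish function; the grid bounds are arbitrary.

```python
import numpy as np


def sigmoid(vector: np.ndarray) -> np.ndarray:
    """Element-wise logistic sigmoid."""
    return 1 / (1 + np.exp(-vector))


def swish_with_beta(vector: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Hypothetical helper: x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return vector * sigmoid(beta * vector)


# Locate the bottom of the trough on a fine grid (step 0.001).
x = np.linspace(-5, 5, 10001)
y = swish_with_beta(x)
lowest = np.argmin(y)
print(f"minimum {y[lowest]:.4f} at x = {x[lowest]:.3f}")

# beta = 0 turns the sigmoid into a constant 0.5, so the function becomes x / 2.
halved = swish_with_beta(x, beta=0.0)
```

Larger values of beta sharpen the sigmoid and push the curve toward ReLU, while beta = 0 collapses it to a straight line, so the parameter interpolates between a linear function and a ReLU-like one.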
