The Algorithms - Python Python TheAlgorithms/Python

Machine Learning Algorithms in TheAlgorithms/Python

Seven classical ML algorithms from linear models to kernel machines, each implemented from scratch in plain NumPy.

7 stops ~14 min Verified 2026-05-05
What you will learn
  • How linear regression applies gradient descent to minimize squared error against a real dataset
  • How the sigmoid function maps any real number to a probability between 0 and 1 for logistic regression
  • How k-means alternates between assigning points to the nearest centroid and recomputing centroids
  • How kNN classifies a point by measuring Euclidean distance to its k nearest labeled neighbors
  • How a regression decision tree finds the best split by comparing mean squared error across candidate boundaries
  • How gradient descent iteratively adjusts a parameter vector by scaling each parameter by the cost derivative
  • How an SVM maximizes margin by solving a constrained quadratic program via Wolfe's dual formulation
Prerequisites
  • Familiarity with NumPy arrays and basic linear algebra (dot products, matrix transpose)
  • Understanding of what a training set and a loss function are
  • No prior knowledge of any specific ML algorithm required
1 / 7

Linear regression: gradient descent updates the weight vector each iteration

machine_learning/linear_regression.py:43

Each iteration computes the gradient of squared error with respect to the weight vector and steps the weights in the negative gradient direction.

Linear regression finds the line of best fit by minimizing the sum of squared differences between predicted and actual outputs. The algorithm that does this minimization here is gradient descent, exposed as a single function so the math stays visible. prod on line 63 computes the vector of prediction errors: the model's current predictions minus the true labels. sum_grad on line 65 is the gradient of the squared-error loss with respect to the weight vector, formed by dotting those errors back against the input features. The final assignment subtracts a fraction of that gradient, controlled by alpha / n, from the current weights. The caller run_linear_regression repeats this 100,000 times. Each call nudges the weights slightly toward lower error. The doctest confirms the mechanics with a 2-by-2 input.

Key takeaway

Gradient descent for linear regression is three lines: compute prediction error, compute the gradient by dotting error against inputs, subtract a learning-rate-scaled gradient from the weights.

def run_steep_gradient_descent(data_x, data_y, len_data, alpha, theta):
    """Run steep gradient descent and updates the Feature vector accordingly_
    :param data_x   : contains the dataset
    :param data_y   : contains the output associated with each data-entry
    :param len_data : length of the data_
    :param alpha    : Learning rate of the model
    :param theta    : Feature vector (weight's for our model)
    ;param return    : Updated Feature's, using
                       curr_features - alpha_ * gradient(w.r.t. feature)
    >>> import numpy as np
    >>> data_x = np.array([[1, 2], [3, 4]])
    >>> data_y = np.array([5, 6])
    >>> len_data = len(data_x)
    >>> alpha = 0.01
    >>> theta = np.array([0.1, 0.2])
    >>> run_steep_gradient_descent(data_x, data_y, len_data, alpha, theta)
    array([0.196, 0.343])
    """
    n = len_data

    prod = np.dot(theta, data_x.transpose())
    prod -= data_y.transpose()
    sum_grad = np.dot(prod, data_x)
    theta = theta - (alpha / n) * sum_grad
    return theta
2 / 7

Logistic regression: sigmoid maps predictions to probabilities

machine_learning/logistic_regression.py:31

The sigmoid function maps any real number to a probability in (0, 1), forming the prediction step of logistic regression.

Logistic regression solves binary classification by passing a linear score through a nonlinear gate. That gate is the sigmoid function, defined as 1 / (1 + e^-x). Large positive inputs push the output toward 1; large negative inputs push it toward 0; an input of 0 gives exactly 0.5. This shape is why logistic regression can produce calibrated probabilities instead of raw scores. The Python type hint float | np.ndarray shows the function handles both a single float and an array in one expression because NumPy broadcasts the division and exponentiation element-wise. The six doctests confirm the boundary behavior: input 4 gives a probability of 0.98, input -3 gives 0.047. The logistic_reg function later in the file calls sigmoid_function at each training step and passes the result to cost_function to compute cross-entropy loss.

Key takeaway

The sigmoid function is the one line that turns logistic regression from a linear model into a classifier: it maps any real score to a probability in (0, 1).

def sigmoid_function(z: float | np.ndarray) -> float | np.ndarray:
    """
    Also known as Logistic Function.

                1
    f(x) =   -------
              1 + e⁻ˣ

    The sigmoid function approaches a value of 1 as its input 'x' becomes
    increasing positive. Opposite for negative values.

    Reference: https://en.wikipedia.org/wiki/Sigmoid_function

    @param z:  input to the function
    @returns: returns value in the range 0 to 1

    Examples:
    >>> float(sigmoid_function(4))
    0.9820137900379085
    >>> sigmoid_function(np.array([-3, 3]))
    array([0.04742587, 0.95257413])
    >>> sigmoid_function(np.array([-3, 3, 1]))
    array([0.04742587, 0.95257413, 0.73105858])
    >>> sigmoid_function(np.array([-0.01, -2, -1.9]))
    array([0.49750002, 0.11920292, 0.13010847])
    >>> sigmoid_function(np.array([-1.3, 5.3, 12]))
    array([0.21416502, 0.9950332 , 0.99999386])
    >>> sigmoid_function(np.array([0.01, 0.02, 4.1]))
    array([0.50249998, 0.50499983, 0.9836975 ])
    >>> sigmoid_function(np.array([0.8]))
    array([0.68997448])
    """
    return 1 / (1 + np.exp(-z))
3 / 7

K-means: assign clusters, then recompute centroids

machine_learning/k_means_clust.py:148

K-means alternates between two steps: assign each point to the nearest centroid, then replace each centroid with the mean of its assigned points.

K-means is an iterative algorithm that clusters unlabeled data by alternating between two steps. The loop starting on line 161 runs up to maxiter times. Step 1 calls assign_clusters, which computes Euclidean distances from every point to every centroid via pairwise_distances and then takes the argmin across centroid columns to assign each point its nearest centroid index. Step 2 calls revise_centroids, which groups points by their cluster label and replaces each centroid with the arithmetic mean of its group. The convergence check not shown in this slice compares the current and previous cluster assignments element-wise; when no assignment changes, the loop breaks. Initial centroids are chosen randomly by get_initial_centroids, so different seeds can produce different final clusters. Heterogeneity, the sum of squared distances from each point to its centroid, is optionally recorded for convergence diagnostics.

Key takeaway

K-means converges when no point changes cluster between steps; the two-step assign-then-revise loop is the complete algorithm.

def kmeans(
    data, k, initial_centroids, maxiter=500, record_heterogeneity=None, verbose=False
):
    """Runs k-means on given data and initial set of centroids.
    maxiter: maximum number of iterations to run.(default=500)
    record_heterogeneity: (optional) a list, to store the history of heterogeneity
                          as function of iterations
                          if None, do not store the history.
    verbose: if True, print how many data points changed their cluster labels in
                          each iteration"""
    centroids = initial_centroids[:]
    prev_cluster_assignment = None

    for itr in range(maxiter):
        if verbose:
            print(itr, end="")

        # 1. Make cluster assignments using nearest centroids
        cluster_assignment = assign_clusters(data, centroids)

        # 2. Compute a new centroid for each of the k clusters, averaging all data
        #    points assigned to that cluster.
        centroids = revise_centroids(data, k, cluster_assignment)
4 / 7

kNN: Euclidean distance to k neighbors, majority vote decides the class

machine_learning/k_nearest_neighbours.py:47

kNN classification computes Euclidean distance from the query point to every training point, picks the k nearest, and returns the majority class label.

kNN requires no training phase; all the work happens at prediction time. The classify method generates a lazy sequence of (distance, label) pairs by calling _euclidean_distance against every point stored in self.data. Rather than sorting the full distance list, it calls nsmallest(k, distances) from the heapq module, which returns the k smallest in O(n log k) time. The label of each selected neighbor is extracted and collected into votes. Counter(votes).most_common(1) returns the label that appears most often. The doctest sets up seven training points in two clusters and asks for the label of a point at coordinates (1.2, 1.2); because four of its five nearest neighbors belong to class A, the result is 'A'. The _euclidean_distance static method calls np.linalg.norm(a - b).

Key takeaway

kNN's prediction cost is O(n log k) per query using nsmallest; the entire algorithm is the distance computation plus a majority vote.

    def classify(self, pred_point: np.ndarray[float], k: int = 5) -> str:
        """
        Classify a given point using the kNN algorithm
        >>> train_X = np.array(
        ...     [[0, 0], [1, 0], [0, 1], [0.5, 0.5], [3, 3], [2, 3], [3, 2]]
        ... )
        >>> train_y = np.array([0, 0, 0, 0, 1, 1, 1])
        >>> classes = ['A', 'B']
        >>> knn = KNN(train_X, train_y, classes)
        >>> point = np.array([1.2, 1.2])
        >>> knn.classify(point)
        'A'
        """
        # Distances of all points from the point to be classified
        distances = (
            (self._euclidean_distance(data_point[0], pred_point), data_point[1])
            for data_point in self.data
        )

        # Choosing k points with the shortest distances
        votes = (i[1] for i in nsmallest(k, distances))

        # Most commonly occurring class is the one into which the point is classified
        result = Counter(votes).most_common(1)[0][0]
        return self.labels[result]
5 / 7

Decision tree: find the split that minimizes mean squared error across both child regions

machine_learning/decision_tree.py:107

The tree finds the best split by evaluating MSE for every candidate boundary and choosing the one that minimizes total error across both child regions.

A regression decision tree predicts a continuous output by recursively partitioning the input space. The training loop here is the split-finding step. For each candidate split index i, the data is divided into a left subset x[:i] and a right subset x[i:]. The min_leaf_size guard on lines 108 and 110 skips splits that would produce regions too small to be meaningful. For every valid split, mean_squared_error measures how well the mean of each child region predicts the labels in that region. The combined error is their sum; the split with the smallest sum becomes the decision boundary. After the loop, if a valid split was found, two child DecisionTree nodes are created and trained recursively on the left and right subsets. When depth or minimum leaf size is reached, the node stores np.mean(y) as its leaf prediction.

Key takeaway

The decision tree's split search is a linear scan over all candidate boundaries: the one with the smallest combined child MSE becomes the node's decision boundary.

        for i in range(len(x)):
            if len(x[:i]) < self.min_leaf_size:  # noqa: SIM114
                continue
            elif len(x[i:]) < self.min_leaf_size:
                continue
            else:
                error_left = self.mean_squared_error(x[:i], np.mean(y[:i]))
                error_right = self.mean_squared_error(x[i:], np.mean(y[i:]))
                error = error_left + error_right
                if error < min_error:
                    best_split = i
                    min_error = error

        if best_split != 0:
            left_x = x[:best_split]
            left_y = y[:best_split]
            right_x = x[best_split:]
            right_y = y[best_split:]
6 / 7

Gradient descent: subtract a learning-rate-scaled cost derivative from each parameter

machine_learning/gradient_descent.py:105

The main loop updates each parameter by subtracting the cost derivative scaled by the learning rate, stopping when adjacent iterations are numerically close.

This file strips gradient descent to its skeleton: a module-level parameter_vector, a module-level LEARNING_RATE of 0.009, and a loop that updates every parameter in lock-step. For each iteration, a temporary vector is filled by subtracting LEARNING_RATE * get_cost_derivative(i - 1) from each parameter. The cost derivative is computed by summation_of_cost_derivative, which accumulates the signed prediction error across every training example. Passing i - 1 means index -1 triggers the bias update, a common convention for the intercept term. After building the full temporary vector, np.allclose checks whether the update is negligibly small. When consecutive vectors are within absolute_error_limit of each other, the algorithm has converged and breaks. The parameters are only committed after each full pass, which is batch gradient descent rather than stochastic.

Key takeaway

Convergence is checked with np.allclose comparing the full parameter vector before and after an update; the loop stops when the change is smaller than the tolerance.

def run_gradient_descent():
    global parameter_vector
    # Tune these values to set a tolerance value for predicted output
    absolute_error_limit = 0.000002
    relative_error_limit = 0
    j = 0
    while True:
        j += 1
        temp_parameter_vector = [0, 0, 0, 0]
        for i in range(len(parameter_vector)):
            cost_derivative = get_cost_derivative(i - 1)
            temp_parameter_vector[i] = (
                parameter_vector[i] - LEARNING_RATE * cost_derivative
            )
        if np.allclose(
            parameter_vector,
            temp_parameter_vector,
            atol=absolute_error_limit,
            rtol=relative_error_limit,
        ):
            break
        parameter_vector = temp_parameter_vector
    print(("Number of iterations:", j))
7 / 7

SVM: predict by summing kernel-weighted support vector contributions

machine_learning/support_vector_machines.py:170

The SVM's predict function scores a new point by summing kernel products with each support vector, weighted by the optimized dual coefficients.

A trained SVM does not store a simple weight vector; it stores the full set of training observations and their optimal dual coefficients self.optimum. At prediction time, the score for a new observation is the sum over every training point of three factors multiplied together: the dual coefficient self.optimum[n], the training label self.classes[n] (either 1 or -1), and the kernel value between the stored observation and the query. For the default linear kernel, that kernel value is a dot product. Adding self.offset shifts the decision boundary. If the total is non-negative the class is 1; otherwise it is -1. The fit method (earlier in the file) uses scipy.optimize.minimize to solve Wolfe's dual formulation and find the optimum coefficients that maximize the margin between the two classes.

Key takeaway

The SVM prediction is a weighted sum of kernel evaluations against every training point; the dual coefficients found during fit determine which training points act as support vectors.

    def predict(self, observation: ndarray) -> int:
        """
        Get the expected class of an observation

        Args:
            observation (Vector): observation

        Returns:
            int {1, -1}: expected class

        >>> xs = [
        ...     np.asarray([0, 1]), np.asarray([0, 2]),
        ...     np.asarray([1, 1]), np.asarray([1, 2])
        ... ]
        >>> y = np.asarray([1, 1, -1, -1])
        >>> s = SVC()
        >>> s.fit(xs, y)
        >>> s.predict(np.asarray([0, 1]))
        1
        >>> s.predict(np.asarray([1, 1]))
        -1
        >>> s.predict(np.asarray([2, 2]))
        -1
        """
        s = sum(
            self.optimum[n]
            * self.classes[n]
            * self.kernel(self.observations[n], observation)
            for n in range(len(self.classes))
        )
        return 1 if s + self.offset >= 0 else -1
Your codebase next

Create code tours for your project

Intraview lets AI create interactive walkthroughs of any codebase. Install the free VS Code extension and generate your first tour in minutes.

Install Intraview Free