The results of the research have been published in the proceedings of the NeurIPS 2024 conference. In today's world, numerous tasks are solved in distributed systems where multiple computers (agents) collaborate. Traditionally, such systems utilize a central server to coordinate computations, which creates bottlenecks and scalability issues.
Existing decentralized optimization algorithms suffer from a significant drawback: to work effectively, they need to know the parameters of both the optimization problem itself (for instance, the "Lipschitz constant" of the gradient, which bounds how quickly the gradient of the loss function can change) and the topology of the network through which the agents communicate (the degree of their interconnection). In real distributed systems, agents typically lack access to this global information and have to guess: overly conservative parameter settings lead to slow convergence, while overly aggressive ones can make the algorithm diverge. This is akin to builders trying to construct a house without knowing either the building plan or the location of the construction materials.
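To see why knowing these parameters matters, note that a classic non-adaptive gradient method picks its step size as 1/L, where L is that Lipschitz constant; the sketch below is purely illustrative (the function name and interface are assumptions, not part of the paper).

```python
def gradient_step(x, grad, L_estimate):
    """Classic non-adaptive gradient step with the 'safe' step size 1/L.
    Overestimating L gives tiny steps and slow convergence; underestimating
    it (i.e., taking too large a step) can make the iterates diverge."""
    return x - (1.0 / L_estimate) * grad
```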
The new approach presented in the research addresses this issue: the authors propose a fully decentralized optimization algorithm that operates without a central server and tunes itself automatically, with no need to set parameters in advance.
This algorithm is based on the "operator splitting" method combined with a new metric variable, which enables each agent to independently determine a suitable step size during the learning process using only local information. It is similar to how an experienced builder assesses the situation on-site to decide which tool to use and how.
Instead of relying on predefined parameters, the algorithm continuously adapts to the local characteristics of the loss function. Each agent performs a local line search for its step size without exchanging information with all the other agents in the network. This "local" approach significantly accelerates computations and makes the algorithm more scalable.
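As a rough illustration of what such a per-agent step-size search can look like, here is a minimal sketch of an Armijo-style backtracking line search run by one agent on its own local loss; the constants, acceptance rule, and function names are assumptions for illustration and not the authors' exact procedure.

```python
import numpy as np

def local_backtracking_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4, max_trials=50):
    """Backtracking line search executed by a single agent on its local loss f;
    no global constants and no communication with other agents are required."""
    g = grad_f(x)
    fx = f(x)
    eta = eta0
    for _ in range(max_trials):
        x_new = x - eta * g
        # Accept the step once it gives a sufficient decrease of the local loss.
        if f(x_new) <= fx - c * eta * np.dot(g, g):
            return eta, x_new
        eta *= beta  # otherwise shrink the trial step size and try again
    return eta, x - eta * g
```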
The theoretical analysis showed that the new algorithm guarantees linear convergence: the error shrinks by a roughly constant factor at every iteration, so progress toward the solution does not stall even in the later stages of the computation. The convergence rate depends on two factors: how well-conditioned the optimization problem itself is and the "connectivity" of the network, which reflects how easily agents can exchange information with one another. In well-connected networks, the convergence speed approaches that of a centralized algorithm. It's like all builders working on the same site rather than scattered throughout the city.
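In more formal terms, linear convergence means that the distance to the optimum decreases geometrically with the iteration number; a generic statement of this kind (illustrative, not the paper's exact theorem) reads:

```latex
% Generic form of a linear convergence guarantee (illustrative, not the paper's exact bound):
\[
  \|x_k - x^\ast\|^2 \;\le\; C\,(1-\rho)^k\,\|x_0 - x^\ast\|^2, \qquad 0 < \rho < 1,
\]
% where the contraction factor \rho typically degrades as the problem becomes more
% ill-conditioned and as the communication network becomes less connected.
```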
The authors proposed two modifications of their algorithm. Both are decentralized and use a local line search so that each agent adaptively selects its own step size; they differ in how these local step sizes are coordinated. The first algorithm uses a global minimum: each agent computes its local step size, all agents then exchange these values, and the smallest one across the whole network is adopted by everyone, which requires communication among all agents. The second algorithm relies only on local minima: each agent computes its local step size and adopts the smallest value among itself and its immediate neighbors, which requires communication only with direct neighbors in the network.
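A schematic way to express the two coordination rules, assuming each agent has already run its local line search, might look as follows; the data layout and function signature here are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def coordinate_step_sizes(local_etas, adjacency, rule="neighbors"):
    """Illustrative coordination of locally chosen step sizes.

    local_etas : NumPy array with one locally computed step size per agent
    adjacency  : boolean NumPy matrix; adjacency[i, j] is True if agents i and j
                 are neighbors (self-loops included)
    rule       : "global"    -> every agent adopts the network-wide minimum
                 "neighbors" -> each agent adopts the minimum over itself
                                and its immediate neighbors
    """
    local_etas = np.asarray(local_etas, dtype=float)
    if rule == "global":
        # Requires a network-wide exchange (e.g., a min-consensus round).
        return np.full_like(local_etas, local_etas.min())
    # Requires only one exchange with direct neighbors.
    return np.array([local_etas[adjacency[i]].min() for i in range(len(local_etas))])
```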
As a result, the first algorithm provides faster convergence by utilizing global step size information but demands more communication among agents. Conversely, the second algorithm is less demanding in terms of communication, exchanging information only with immediate neighbors, but may demonstrate slightly slower convergence (although the authors show that the difference is not significant). The choice between the algorithms depends on the trade-off between convergence speed and communication volume in a specific network. The second algorithm is particularly useful for networks with limited bandwidth or high communication costs.
Numerical experiments confirmed the theoretical findings: the new algorithms significantly outperform existing decentralized algorithms in convergence speed. The difference is especially pronounced on difficult problems with large datasets and in weakly connected networks. The algorithm was successfully tested on ridge regression, a common machine learning problem.
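For concreteness, in a decentralized ridge regression setup each agent holds its own slice of the data and contributes a local regularized least-squares term to the shared objective; a minimal sketch of one agent's local loss and gradient (the data split and regularization weight are assumed for illustration):

```python
import numpy as np

def local_ridge_loss(x, A_i, b_i, lam=0.1):
    """Local ridge regression objective held by one agent:
    f_i(x) = 0.5 * ||A_i x - b_i||^2 + 0.5 * lam * ||x||^2."""
    r = A_i @ x - b_i
    return 0.5 * (r @ r) + 0.5 * lam * (x @ x)

def local_ridge_grad(x, A_i, b_i, lam=0.1):
    """Gradient of the local objective, computable from local data alone."""
    return A_i.T @ (A_i @ x - b_i) + lam * x
```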
“Our approach utilizes the operator splitting method with a new metric variable, allowing the use of local backtracking line search for adaptive step size selection without global information or extensive communication,” said Alexander Gasnikov, head of the laboratory of mathematical optimization methods at MIPT. “This leads to favorable convergence guarantees and dependencies on optimization parameters and network conditions compared to existing non-adaptive methods. Notably, the new method is the first adaptive decentralized algorithm that achieves linear convergence for strongly convex and smooth functions.”
Future research by the scientists may focus on adapting the proposed methods to stochastic problems, extending them to more complex types of network topologies and data exchange, and exploring the potential for using more sophisticated optimization methods within the proposed approach. The development and enhancement of new decentralized machine learning algorithms is a crucial step toward creating more efficient and scalable machine learning systems in distributed environments.