Journal of Machine Learning Research Papers: Volume 26の論文一覧

Journal of Machine Learning Research Papers Volume 26に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
DRM Revisited: A Complete Error Analysis

DRM再考：完全なエラー分析

It is widely known that the error analysis for deep learning involves approximation, statistical, and optimization errors. However, it is challenging to combine them together due to overparameterization. In this paper, we address this gap by providing a comprehensive error analysis of the Deep Ritz Method (DRM). Specifically, we investigate a foundational question in the theoretical analysis of DRM under the overparameterized regime: given a target precision level, how can one determine the appropriate number of training samples, the key architectural parameters of the neural networks, the step size for the projected gradient descent optimization procedure, and the requisite number of iterations, such that the output of the gradient descent process closely approximates the true solution of the underlying partial differential equation to the specified precision?

ディープラーニングのエラー分析には、近似エラー、統計エラー、最適化エラーが含まれることが広く知られています。しかし、過剰パラメータ化のために、それらを組み合わせることは困難です。本論文では、ディープ・リッツ法（DRM）の包括的な誤差分析を提供することで、このギャップを埋めます。具体的には、過剰パラメータ化環境下におけるDRMの理論分析における基本的な問い、すなわち、目標精度レベルが与えられた場合、勾配降下法の出力が、基礎となる偏微分方程式の真の解を指定された精度で近似するように、適切な学習サンプル数、ニューラルネットワークの主要なアーキテクチャパラメータ、投影勾配降下法の最適化手順のステップサイズ、そして必要な反復回数をどのように決定するか、という問いを検証します。

Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF

二階層のための原理的なペナルティベース法強化学習とRLHF

Bilevel optimization has been recently applied to many machine learning tasks. However, their applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. But bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF) are often modeled as dynamic objective functions that go beyond the simple static objective structures, which pose significant challenges of using existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback and incentive design.

近年、二階層最適化は多くの機械学習タスクに適用されています。しかし、その適用は、無害な構造を持つ静的目的関数が考慮される教師あり学習の設定に限定されています。しかし、インセンティブ設計、逆強化学習（RL）、人間からのフィードバックからの強化学習（RLHF）などの二階層問題は、単純な静的目的構造を超える動的目的関数としてモデル化されることが多く、既存の二階層ソリューションの使用に大きな課題をもたらします。この新しいクラスの二階層問題に取り組むために、ペナルティ定式化の観点から二階層強化学習問題を解決するための最初の原理的なアルゴリズムフレームワークを導入します。問題のランドスケープと、ペナルティベース（ポリシー）勾配アルゴリズムの理論的研究を提供します。スタックベルグマルコフゲーム、人間からのフィードバックからの強化学習、インセンティブ設計でのシミュレーションにより、アルゴリズムの有効性を実証します。

Precise High-Dimensional Asymptotics for Quantifying Heterogeneous Transfers

異種転送を定量化するための高精度な高次元漸近解析

The problem of learning one task using samples from another task is central to transfer learning. In this paper, we focus on answering the following question: when does combining the samples from two related tasks perform better than learning with one target task alone? This question is motivated by an empirical phenomenon known as negative transfer often observed in transfer learning practice. While the transfer effect from one task to another depends on factors such as their sample sizes and the spectrum of their covariance matrices, precisely quantifying this dependence has remained a challenging problem. In order to compare a transfer learning estimator to single-task learning, one needs to compare the risks between the two estimators precisely. Further, the comparison depends on the distribution shifts between the two tasks. This paper applies recent developments of random matrix theory to tackle this challenge in a high-dimensional linear regression setting with two tasks. We provide precise high-dimensional asymptotics for the bias and variance of a classical hard parameter sharing (HPS) estimator in the proportional limit, when the sample sizes of both tasks increase proportionally with dimension at fixed ratios. The precise asymptotics apply to various types of distribution shifts, including covariate shifts, model shifts, and combinations of both. We illustrate these results in a random-effects model to mathematically prove a phase transition from positive to negative transfer as the number of source task samples increases. One insight from the analysis is that a rebalanced HPS estimator, which downsizes the source task when the model shift is high, achieves the minimax optimal rate. The finding regarding phase transition also applies to multiple tasks when feature covariates are shared across all tasks. Simulations validate the accuracy of the high-dimensional asymptotics for finite dimensions.

1つのタスクを別のタスクのサンプルを使用して学習するという問題は、転移学習の中心です。本稿では、次の質問に答えることに焦点を当てます。2つの関連タスクのサンプルを組み合わせると、1つのターゲットタスクだけで学習するよりもパフォーマンスが向上するのはいつですか？この質問は、転移学習の実践でよく見られる負の転移と呼ばれる経験的現象に起因しています。あるタスクから別のタスクへの転移効果は、サンプルサイズや共分散行列のスペクトルといった要因に依存するものの、この依存性を正確に定量化することは依然として困難な問題です。転移学習推定量を単一タスク学習と比較するには、2つの推定量間のリスクを正確に比較する必要があります。さらに、この比較は2つのタスク間の分布シフトにも依存します。本論文では、ランダム行列理論の最近の発展を適用し、2つのタスクを用いた高次元線形回帰においてこの課題に取り組む。本論文では、比例極限において、両タスクのサンプルサイズが一定比率で次元に比例して増加する場合、古典的なハードパラメータ共有（HPS）推定量のバイアスと分散について、高精度な高次元漸近解析を提供します。この高精度漸近解析は、共変量シフト、モデルシフト、そして両者の組み合わせなど、様々な種類の分布シフトに適用できます。これらの結果をランダム効果モデルで示し、ソースタスクサンプル数の増加に伴う転移の正から負への相転移を数学的に証明します。分析から得られた一つの知見は、モデルシフトが大きい場合にソースタスクを縮小するリバランスされたHPS推定値が、ミニマックス最適レートを達成するという点です。位相遷移に関する知見は、特徴共変量がすべてのタスクで共有されている場合、複数のタスクにも当てはまります。シミュレーションにより、有限次元における高次元漸近性の精度が検証されました。

Score-based Causal Representation Learning: Linear and General Transformations

スコアベースの因果表現学習：線形および一般変換

This paper addresses intervention-based causal representation learning (CRL) under a general nonparametric latent causal model and an unknown transformation that maps the latent variables to the observed variables. Linear and general transformations are investigated. The paper addresses both the identifiability and achievability aspects. Identifiability refers to determining algorithm-agnostic conditions that ensure the recovery of the true latent causal variables and the underlying latent causal graph. Achievability refers to the algorithmic aspects and addresses designing algorithms that achieve identifiability guarantees. By drawing novel connections between score functions (i.e., the gradients of the logarithm of density functions) and CRL, this paper designs a score-based class of algorithms that ensures both identifiability and achievability. First, the paper focuses on linear transformations and shows that one stochastic hard intervention per node suffices to guarantee identifiability. It also provides partial identifiability guarantees for soft interventions, including identifiability up to mixing with parents for general causal models and perfect recovery of the latent graph for sufficiently nonlinear causal models. Secondly, it focuses on general transformations and demonstrates that two stochastic hard interventions per node are sufficient for identifiability. This is achieved by defining a differentiable loss function whose global optima ensure identifiability for general CRL. Notably, one does not need to know which pair of interventional environments has the same node intervened. Finally, the theoretical results are empirically validated via experiments on structured synthetic data and image data.

この論文では、一般的なノンパラメトリック潜在因果モデルと、潜在変数を観測変数にマッピングする未知の変換に基づく介入ベースの因果表現学習(CRL)について説明します。線形変換と一般的な変換が調査されます。この論文では、識別可能性と達成可能性の両方の側面について説明します。識別可能性とは、真の潜在因果変数と基礎となる潜在因果グラフの回復を保証する、アルゴリズムに依存しない条件を決定することを指します。達成可能性とはアルゴリズムの側面を指し、識別可能性の保証を実現するアルゴリズムの設計について説明します。スコア関数(つまり、密度関数の対数の勾配)とCRLの間に新しい関係を描くことで、この論文では、識別可能性と達成可能性の両方を保証するスコアベースのアルゴリズムのクラスを設計します。まず、この論文では線形変換に焦点を当て、ノードごとに1つの確率的ハード介入で識別可能性を保証できることを示します。また、ソフト介入の部分的な識別可能性保証も提供します。これには、一般的な因果モデルに対する親との混合までの識別可能性と、十分に非線形な因果モデルに対する潜在グラフの完全な回復が含まれます。次に、一般的な変換に焦点を当て、ノードごとに2つの確率的ハード介入で識別可能性を実現できることを示します。これは、グローバル最適値が一般的なCRLの識別可能性を保証する微分可能な損失関数を定義することで実現されます。特に、どの介入環境のペアに同じノードが介入したかを知る必要はありません。最後に、理論的結果は、構造化された合成データと画像データを使用した実験により経験的に検証されます。

On the Statistical Properties of Generative Adversarial Models for Low Intrinsic Data Dimension

低次元データのための生成的敵対的モデルの統計的特性について

Despite the remarkable empirical successes of Generative Adversarial Networks (GANs), the theoretical guarantees for their statistical accuracy remain rather pessimistic. In particular, the data distributions on which GANs are applied, such as natural images, are often hypothesized to have an intrinsic low-dimensional structure in a typically high-dimensional feature space, but this is often not reflected in the derived rates in the state-of-the-art analyses. In this paper, we attempt to bridge the gap between the theory and practice of GANs and their bidirectional variant, Bi-directional GANs (BiGANs), by deriving statistical guarantees on the estimated densities in terms of the intrinsic dimension of the data and the latent space. We analytically show that if one has access to $n$ samples from the unknown target distribution and the network architectures are properly chosen, the expected Wasserstein-1 distance of the estimates from the target scales as $O\left( n^{-1/d_\mu } \right)$ for GANs and $\tilde{O}\left( n^{-1/(d_\mu+\ell)} \right)$ for BiGANs, where $d_\mu$ and $\ell$ are the upper Wasserstein-1 dimension of the data-distribution and latent-space dimension, respectively. The theoretical analyses not only suggest that these methods successfully avoid the curse of dimensionality, in the sense that the exponent of $n$ in the error rates does not depend on the data dimension but also serve to bridge the gap between the theoretical analyses of GANs and the known sharp rates from optimal transport literature. Additionally, we demonstrate that GANs can effectively achieve the minimax optimal rate even for non-smooth underlying distributions, with the use of interpolating generator networks.

敵対的生成ネットワーク(GAN)は目覚ましい実験的成功を収めていますが、その統計的正確性に対する理論的保証は依然としてかなり悲観的です。特に、自然画像など、GANが適用されるデータ分布は、典型的には高次元の特徴空間において固有の低次元構造を持つと仮定されることがよくありますが、これは最先端の分析で導出されるレートに反映されていないことがよくあります。本稿では、GANとその双方向変種である双方向GAN（BiGAN）の理論と実践の間のギャップを埋めることを試み、データの固有次元と潜在空間の観点から推定密度の統計的保証を導出します。未知のターゲット分布から$n$個のサンプルにアクセスでき、ネットワークアーキテクチャが適切に選択された場合、ターゲットからの推定値の期待Wasserstein-1距離は、GANの場合は$O\left( n^{-1/d_\mu } \right)$、BiGANの場合は$\tilde{O}\left( n^{-1/(d_\mu+\ell)} \right)$にスケールすることを解析的に示します。ここで、$d_\mu$と$\ell$は、それぞれデータ分布と潜在空間の次元の上限Wasserstein-1次元です。理論分析は、これらの手法が次元の呪い（エラー率における$n$の指数がデータ次元に依存しないという意味で）をうまく回避していることを示唆するだけでなく、GANの理論分析と最適輸送文献で知られているシャープレートとの間のギャップを埋める役割も果たしています。さらに、補間生成器ネットワークを用いることで、GANが非滑らかな基礎分布に対しても、ミニマックス最適レートを効果的に達成できることを実証しています。

Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

領域適応における条件付き不変成分の顕著な役割：理論とアルゴリズム

Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, Camelyon17, and DomainNet datasets.

ドメイン適応（DA）は、モデルの学習に使用されるソースデータの分布が、モデルの評価に使用されるターゲットデータの分布と異なる場合に生じる統計学習問題です。多くのDAアルゴリズムは実証的にかなりの成功を収めていますが、これらのアルゴリズムを盲目的に適用すると、新しいデータセットでのパフォーマンスが低下することがよくあります。これに対処するには、DAアルゴリズムが優れたターゲットパフォーマンスを発揮するための仮定を明確にすることが重要です。本研究では、予測に関連し、ソースデータとターゲットデータ全体で条件付き不変である条件付き不変成分（CIC）の存在という仮定に焦点を当てます。条件付き不変ペナルティ（CIP）によって推定できるCICは、DAにおけるターゲットリスク保証の提供において3つの重要な役割を果たすことを示します。まず、共変量シフトやラベルシフトなどの単純な設定を超えたターゲットリスク保証を持つ、CICに基づく新しいアルゴリズム、重要度重み付け条件付き不変ペナルティ（IW-CIP）を提案します。第二に、CICが他のDAアルゴリズムのソースリスクとターゲットリスク間の大きな矛盾を識別するのに役立つことを示します。最後に、CICをドメイン不変射影(DIP)アルゴリズムに組み込むことで、ラベルフリッピング特徴によって引き起こされる失敗シナリオに対処できることを示します。新しいアルゴリズムと理論的発見は、合成データ、MNIST、CelebA、Camelyon17、およびDomainNetデータセットを用いた数値実験によって裏付けられています。

Near-Optimal Nonconvex-Strongly-Convex Bilevel Optimization with Fully First-Order Oracles

完全一次オラクルを用いた準最適非凸-強凸二階層最適化

In this work, we consider bilevel optimization when the lower-level problem is strongly convex. Recent works show that with a Hessian-vector product (HVP) oracle, one can provably find an $\epsilon$-stationary point within ${O}(\epsilon^{-2})$ oracle calls. However, the HVP oracle may be inaccessible or expensive in practice. Kwon et al. (ICML 2023) addressed this issue by proposing a first-order method that can achieve the same goal at a slower rate of $\tilde{O}(\epsilon^{-3})$. In this paper, we incorporate a two-time-scale update to improve their method to achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ first-order oracle complexity. Our analysis is highly extensible. In the stochastic setting, our algorithm can achieve the stochastic first-order oracle complexity of $\tilde {O}(\epsilon^{-4})$ and $\tilde {O}(\epsilon^{-6})$ when the stochastic noises are only in the upper-level objective and in both level objectives, respectively. When the objectives have higher-order smoothness conditions, our deterministic method can escape saddle points by injecting noise, and can be accelerated to achieve a faster rate of $\tilde {O}(\epsilon^{-1.75})$ using Nesterov’s momentum.

本研究では、下位レベルの問題が強凸である場合の2レベル最適化を検討します。最近の研究では、ヘッセベクトル積(HVP)オラクルを使用すると、${O}(\epsilon^{-2})$回のオラクル呼び出しで$\epsilon$定常点を見つけることができることが示されています。ただし、HVPオラクルは実際にはアクセスできないか、高価になる可能性があります。Kwonら。(ICML 2023)はこの問題に対処するため、同じ目標をより遅い速度$\tilde{O}(\epsilon^{-3})$で達成できる一次オラクル複雑度を提案した。本論文では、2つのタイムスケール更新を組み込むことで、彼らの手法を改良し、ほぼ最適な$\tilde{O}(\epsilon^{-2})$の一次オラクル複雑度を達成した。我々の分析は非常に拡張性が高い。確率的設定において、我々のアルゴリズムは、確率的ノイズが上位レベルの目的関数にのみ存在する場合、および両レベルの目的関数に存在している場合、それぞれ$\tilde{O}(\epsilon^{-4})$および$\tilde{O}(\epsilon^{-6})$の確率的一次オラクル複雑度を達成できます。目的関数に高次の平滑性条件がある場合、私たちの決定論的手法はノイズを注入することで鞍点を回避でき、ネステロフの運動量を使用して$\tilde {O}(\epsilon^{-1.75})$というより速い速度を達成できます。

Adaptive Distributed Kernel Ridge Regression: A Feasible Distributed Learning Scheme for Data Silos

適応型分散カーネルリッジ回帰：データサイロのための実行可能な分散学習スキーム

Data silos, mainly caused by privacy and interoperability, significantly constrain collaborations among different organizations with similar data for the same purpose. Distributed learning based on divide-and-conquer provides a promising way to settle the data silos, but it suffers from several challenges, including autonomy, privacy guarantees, and the necessity of collaborations. This paper focuses on developing an adaptive distributed kernel ridge regression (AdaDKRR) by taking autonomy in parameter selection, privacy in communicating non-sensitive information, and the necessity of collaborations for performance improvement into account. We provide both solid theoretical verifications and comprehensive experiments for AdaDKRR to demonstrate its feasibility and effectiveness. Theoretically, we prove that under some mild conditions, AdaDKRR performs similarly to running the optimal learning algorithms on the whole data, verifying the necessity of collaborations and showing that no other distributed learning scheme can essentially beat AdaDKRR under the same conditions. Numerically, we test AdaDKRR on both toy simulations and two real-world applications to show that AdaDKRR is superior to other existing distributed learning schemes. All these results show that AdaDKRR is a feasible scheme to overcome data silos, which are highly desired in numerous application regions such as intelligent decision-making, pricing forecasting, and performance prediction for products.

主にプライバシーと相互運用性によって引き起こされるデータサイロは、同じ目的で類似データを持つ異なる組織間のコラボレーションを著しく制限します。分割統治に基づく分散学習は、データサイロを解決する有望な方法を提供しますが、自律性、プライバシーの保証、コラボレーションの必要性など、いくつかの課題に悩まされています。本論文では、パラメータ選択の自律性、非機密情報の通信におけるプライバシー、パフォーマンス向上のためのコラボレーションの必要性を考慮して、適応型分散カーネルリッジ回帰（AdaDKRR）の開発に焦点を当てています。AdaDKRRの実現可能性と有効性を実証するために、堅実な理論的検証と包括的な実験の両方を提供します。理論的には、いくつかの穏やかな条件下では、AdaDKRRがデータ全体に対して最適な学習アルゴリズムを実行するのと同様に機能することを証明し、コラボレーションの必要性を検証し、同じ条件下では他の分散学習スキームが本質的にAdaDKRRに勝てないことを示しています。数値的には、AdaDKRRをトイシミュレーションと2つの実際のアプリケーションの両方でテストし、AdaDKRRが他の既存の分散学習スキームよりも優れていることを示します。これらの結果はすべて、AdaDKRRがデータサイロを克服するための実行可能なスキームであることを示しています。データサイロは、インテリジェントな意思決定、価格予測、製品のパフォーマンス予測など、多くのアプリケーション領域で非常に求められています。

On Global and Local Convergence of Iterative Linear Quadratic Optimization Algorithms for Discrete Time Nonlinear Control

離散時間非線形制御のための反復線形二次最適化アルゴリズムの大域的および局所的収束について

A classical approach for solving discrete time nonlinear control on a finite horizon consists in repeatedly minimizing linear quadratic approximations of the original problem around current candidate solutions. While widely popular in many domains, such an approach has mainly been analyzed locally. We provide detailed convergence guarantees to stationary points as well as local linear convergence rates for the Iterative Linear Quadratic Regulator (ILQR) algorithm and its Differential Dynamic Programming (DDP) variant. For problems without costs on control variables, we observe that global convergence to minima can be ensured provided that the linearized discrete time dynamics are surjective, costs on the state variables are gradient dominated. We further detail quadratic local convergence when the costs are self-concordant. We show that surjectivity of the linearized dynamics hold for appropriate discretization schemes given the existence of a feedback linearization scheme. We present complexity bounds of algorithms based on linear quadratic approximations through the lens of generalized Gauss-Newton methods. Our analysis uncovers several convergence phases for regularized generalized Gauss-Newton algorithms.

離散時間非線形制御を有限時間領域で解くための古典的なアプローチは、現在の候補解の周囲で元の問題の線形二次近似を繰り返し最小化することです。このアプローチは多くの分野で広く普及しているものの、これまでは主に局所的に解析されてきた。本稿では、反復線形二次レギュレータ（ILQR）アルゴリズムとその微分動的計画法（DDP）の変種について、停留点への詳細な収束保証と局所線形収束率を示す。制御変数にコストのない問題の場合、線形化された離散時間ダイナミクスが射影的であり、状態変数のコストが勾配支配的であれば、最小値への大域収束が保証されることがわかる。さらに、コストが自己整合的である場合の二次局所収束についても詳述します。フィードバック線形化スキームが存在する場合、適切な離散化スキームにおいて線形化ダイナミクスの射影性が成立することを示す。一般化ガウス・ニュートン法の観点から、線形二次近似に基づくアルゴリズムの計算量限界を示す。私たちの分析により、正規化された一般化ガウス・ニュートンアルゴリズムのいくつかの収束段階が明らかになりました。

A Decentralized Proximal Gradient Tracking Algorithm for Composite Optimization on Riemannian Manifolds

リーマン多様体上の複合最適化のための分散型近似勾配追跡アルゴリズム

This paper focuses on minimizing a smooth function combined with a nonsmooth regularization term on a compact Riemannian submanifold embedded in the Euclidean space under a decentralized setting. Typically, there are two types of approaches at present for tackling such composite optimization problems. The first, subgradient-based approaches, rely on subgradient information of the objective function to update variables, achieving an iteration complexity of $O(\epsilon^{-4}\log^2(\epsilon^{-2}))$. The second, smoothing approaches, involve constructing a smooth approximation of the nonsmooth regularization term, resulting in an iteration complexity of $O(\epsilon^{-4})$. This paper proposes a proximal gradient type algorithm that fully exploits the composite structure. The global convergence to a stationary point is established with a significantly improved iteration complexity of $O(\epsilon^{-2})$. To validate the effectiveness and efficiency of our proposed method, we present numerical results from real-world applications, showcasing its superior performance compared to existing approaches.

本論文では、分散設定のもとでユークリッド空間に埋め込まれたコンパクトリーマン部分多様体上の滑らかな関数と非滑らかな正則化項を組み合わせたものを最小化することに焦点を当てています。通常、このような複合最適化問題に取り組むためのアプローチは現在2種類あります。1つ目は、劣勾配ベースのアプローチで、変数を更新するために目的関数の劣勾配情報に依存し、反復計算量は$O(\epsilon^{-4}\log^2(\epsilon^{-2}))$です。2つ目は、平滑化アプローチで、非滑らかな正則化項の滑らかな近似を構築する必要があり、反復計算量は$O(\epsilon^{-4})$です。本論文では、この複合構造を最大限に活用する近似勾配型アルゴリズムを提案します。停留点への大域収束は、反復計算量が$O(\epsilon^{-2})$と大幅に改善され、確立されます。提案手法の有効性と効率性を検証するため、実世界アプリケーションにおける数値結果を提示し、既存の手法と比較して優れた性能を示す。

Learning conditional distributions on continuous spaces

連続空間における条件付き分布の学習

We investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for different dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a fixed-radius ball and the other on nearest neighbors. We establish upper bounds for the convergence rates of both methods and, from these bounds, deduce optimal configurations for the radius and the number of neighbors. We propose to incorporate the nearest neighbors method into neural network training, as our empirical analysis indicates it has better performance in practice. For efficiency, our training process utilizes approximate nearest neighbors search with random binary space partitioning. Additionally, we employ the Sinkhorn algorithm and a sparsity-enforced transport plan. Our empirical findings demonstrate that, with a suitably designed structure, the neural network has the ability to adapt to a suitable level of Lipschitz continuity locally.

我々は、特徴空間とターゲット空間の異なる次元を許容する多次元単位ボックス上の条件付き分布のサンプルベース学習を調査します。本手法では、特徴空間内の様々なクエリポイント付近のデータをクラスタリングし、ターゲット空間における経験的尺度を作成します。本手法では、固定半径球に基づくクラスタリング手法と最近傍点に基づくクラスタリング手法の2つの異なるクラスタリング手法を採用します。両手法の収束率の上限を設定し、これらの上限から半径と近傍点数の最適な構成を導出します。経験的分析により最近傍点法の方が実用上優れた性能を示すことから、ニューラルネットワークの学習に最近傍点法を組み込むことを提案します。効率性を高めるため、学習プロセスでは、ランダムバイナリ空間分割を用いた近似最近傍点探索を利用します。さらに、シンクホーンアルゴリズムとスパース性を考慮したトランスポートプランを採用します。本実験結果は、適切に設計された構造を用いることで、ニューラルネットワークが適切なレベルのリプシッツ連続性に局所的に適応できる能力を持つことを示しています。

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

組み合わせセミバンディット、線形バンディット、およびMDPに対する非確率的遅延フィードバックの統一的解析

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in four important settings. We derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov Decision Processes with delay (both known and unknown transition functions). Furthermore, we use our analysis to develop an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. In order to derive these results we show that FTRL remains stable across multiple rounds under mild assumptions on the regularizer.

遅延バンディットフィードバックを用いたオンライン学習におけるFollow The Regularized Leader (FTRL)の新しい解析を導出します。遅延フィードバックのコストをバンディットフィードバックのコストから分離することで、本解析は4つの重要な設定において新しい結果を得ることを可能にします。遅延付き組合せセミバンディットと遅延付き敵対的マルコフ決定過程（遷移関数は既知と未知の両方）について、初めて最適な（対数係数まで）後悔境界を導出します。さらに、本解析を用いて、遅延付き線形バンディットに対してほぼ最適な後悔境界を達成する効率的なアルゴリズムを開発します。これらの結果を導出するために、正則化子に関する弱い仮定の下で、FTRLが複数ラウンドにわたって安定であることを示す。

Error bounds for particle gradient descent, and extensions of the log-Sobolev and Talagrand inequalities

粒子勾配降下法の誤差境界、および対数ソボレフ不等式とタラグランド不等式の拡張

We derive non-asymptotic error bounds for particle gradient descent (PGD, Kuntz et al. (2023)), a recently introduced algorithm for maximum likelihood estimation of large latent variable models obtained by discretizing a gradient flow of the free energy. We begin by showing that the flow converges exponentially fast to the free energy’s minimizers for models satisfying a condition that generalizes both the log-Sobolev and the Polyak–Łojasiewicz inequalities (LSI and PŁI, respectively). We achieve this by extending a result well-known in the optimal transport literature (that the LSI implies the Talagrand inequality) and its counterpart in the optimization literature (that the PŁI implies the so-called quadratic growth condition), and applying the extension to our new setting. We also generalize the Bakry–Émery Theorem and show that the LSI/PŁI extension holds for models with strongly concave log-likelihoods. For such models, we further control PGD’s discretization error and obtain the non-asymptotic error bounds. While we are motivated by the study of PGD, we believe that the inequalities and results we extend may be of independent interest.

我々は、粒子勾配降下法（PGD、Kuntzら(2023)）の非漸近的誤差界を導出します。これは、自由エネルギーの勾配フローを離散化することで得られる大規模潜在変数モデルの最大尤度推定のための最近導入されたアルゴリズムです。まず、対数ソボレフ不等式とポリアク-ウォヤシェヴィチ不等式（それぞれLSIとPŁI）の両方を一般化する条件を満たすモデルについて、フローが自由エネルギー最小化点に指数的に速く収束することを示す。これは、最適輸送の文献でよく知られている結果（LSIはタラグランド不等式を意味する）と最適化の文献における対応する結果（PŁIはいわゆる二次成長条件を意味する）を拡張し、その拡張を我々の新しい設定に適用することによって達成します。また、バクリ-エメリー定理を一般化し、LSI/PŁI拡張が強く凹状の対数尤度を持つモデルに対して成り立つことを示す。このようなモデルに対して、PGDの離散化誤差をさらに制御し、非漸近的な誤差境界を得る。PGDの研究が動機ではあるものの、拡張した不等式と結果は独立した関心事となる可能性があると考えています。

Linear Hypothesis Testing in High-Dimensional Expected Shortfall Regression with Heavy-Tailed Errors

裾野の重い誤差を伴う高次元期待ショートフォール回帰における線形仮説検定

Expected shortfall (ES) is widely used for characterizing the tail of a distribution across various fields, particularly in financial risk management. In this paper, we explore a two-step procedure that leverages an orthogonality property to reduce sensitivity to nuisance parameters when estimating within a joint quantile and expected shortfall regression framework. For high-dimensional sparse models, we propose a robust $\ell_1$-penalized two-step approach capable of handling heavy-tailed data distributions. We establish non-asymptotic estimation error bounds and propose an appropriate growth rate for the diverging robustification parameter. To facilitate statistical inference for certain linear combinations of the ES regression coefficients, we construct debiased estimators and develop their asymptotic distributions, which form the basis for constructing valid confidence intervals. We validate the proposed method through simulation studies, demonstrating its effectiveness in high-dimensional linear models with heavy-tailed errors.

期待ショートフォール（ES）は、様々な分野、特に金融リスク管理において、分布の裾を特徴付けるために広く用いられています。本稿では、結合分位点回帰と期待ショートフォール回帰の枠組みを用いて推定を行う際に、直交性を利用して不要なパラメータに対する感度を低減する2段階の手順を検討します。高次元スパースモデルに対して、裾の重いデータ分布に対応可能なロバストな$\ell_1$ペナルティ付き2段階アプローチを提案します。非漸近的な推定誤差の境界を設定し、発散ロバスト化パラメータの適切な成長率を提案します。ES回帰係数の特定の線形結合に対する統計的推論を容易にするため、バイアス除去された推定量を構築し、その漸近分布を開発します。これは、有効な信頼区間を構築するための基礎となります。提案手法はシミュレーション研究を通じて検証し、裾の重い誤差を持つ高次元線形モデルにおける有効性を実証します。

Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling

てこ比スコアサンプリングによるカーネルヒルベルト空間の再現における効率的な数値積分

In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of $n$ i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space. We propose an efficient procedure which exploits a small i.i.d. random subset of $m[abs][pdf][bib] [code] &copy JMLR 2025. (edit, beta)

本研究では、数値積分、すなわち、積分対象に対する点ごとの評価のみを用いて、目標確率測度に関する積分を近似する問題を考察します。目標分布が$n$個の独立同値観測値の集合を通してのみアクセス可能であり、かつ積分対象が再生核ヒルベルト空間に属するという設定に焦点を当てる。$m[abs][pdf][bib] [code] &copy JMLR 2025. (edit, beta)

Distribution Free Tests for Model Selection Based on Maximum Mean Discrepancy with Estimated Parameters

推定パラメータを用いた最大平均乖離度に基づくモデル選択のための分布フリー検定

There exist several testing procedures based on the maximum mean discrepancy (MMD) to address the challenge of model specification. However, these testing procedures ignore the presence of estimated parameters in the case of composite null hypotheses. In this paper, we first illustrate the effect of parameter estimation in model specification tests based on the MMD. Second, we propose simple model specification and model selection tests in the case of models with estimated parameters. All our tests are asymptotically standard normal under the null, even when the true underlying distribution belongs to the competing parametric families. A simulation study and a real data analysis illustrate the performance of our tests in terms of power and level.

モデル仕様記述の課題に対処するために、最大平均乖離度（MMD）に基づく検定手順がいくつか存在します。しかし、これらの検定手順は、複合帰無仮説の場合に推定パラメータの存在を無視します。本稿では、まずMMDに基づくモデル仕様記述検定におけるパラメータ推定の効果を示す。第二に、推定パラメータを持つモデルの場合の、簡略なモデル仕様記述検定およびモデル選択検定を提案します。真の分布が競合するパラメトリック族に属する場合でも、本検定はすべて帰無仮説の下で漸近的に標準正規分布となります。シミュレーション研究と実際のデータ分析により、パワーとレベルの観点からテストのパフォーマンスが示されます。

Statistical field theory for Markov decision processes under uncertainty

不確実性下におけるマルコフ決定過程の統計場理論

A statistical field theory is introduced for finite state and action Markov decision processes with unknown parameters, in a Bayesian setting. The Bellman equation, for policy evaluation and the optimal value function in finite and discounted infinite horizon problems, is studied as a disordered interacting dynamical system. The Markov decision process transition probabilities and mean-rewards are interpreted as quenched random variables and the value functions, or the iterates of the Bellman equation, are deterministic variables that evolve dynamically. The posterior over value functions is then equivalent to the quenched average of Fourier inverse of the Martin-Siggia-Rose-De Dominicis-Janssen generating function. The formalism enables the use of methods from field theory to compute posterior moments of value functions. The paper presents two such methods, corresponding to two distinct asymptotic limits. First, the classical approximation is applied, corresponding to the asymptotic data limit. This approximation recovers so-called plug-in estimators for the mean of the value functions. Second, a dynamic mean field theory is derived, showing that under certain assumptions the state-action values are statistically independent across state-action pairs in the asymptotic state space limit. The state-action value statistics can be computed from a set of self-consistent mean field equations, which we call dynamic mean field programming (DMFP). Collectively, the results provide analytic insight into the structure of model uncertainty in Markov decision processes, and pave the way toward more advanced field theoretic techniques and applications to planning and reinforcement learning problems.

ベイズ統計学的な設定において、未知のパラメータを持つ有限状態および行動マルコフ決定過程のための統計場理論を導入します。方策評価と有限および割引無限期間問題における最適価値関数のためのベルマン方程式を、無秩序な相互作用力学系として考察します。マルコフ決定過程の遷移確率と平均報酬は、クエンチされた確率変数として解釈され、価値関数、すなわちベルマン方程式の反復は、動的に変化する決定論的変数です。価値関数の事後分布は、Martin-Siggia-Rose-De Dominicis-Janssen生成関数のフーリエ逆関数のクエンチされた平均と等価です。この形式化により、場の理論の手法を用いて価値関数の事後モーメントを計算することが可能となります。本論文では、2つの異なる漸近極限に対応する2つの手法を提示します。まず、漸近データ極限に対応する古典的近似を適用します。この近似は、価値関数の平均に対するいわゆるプラグイン推定値を回復します。第二に、動的平均場理論を導出し、一定の仮定の下で、漸近状態空間極限において状態行動値が状態行動ペア間で統計的に独立であることを示す。状態行動値統計量は、自己無撞着な平均場方程式の集合から計算することができ、これを動的平均場計画法（DMFP）と呼ぶ。これらの結果は総合的に、マルコフ決定過程におけるモデルの不確実性の構造に関する解析的洞察を提供し、より高度な場の理論的手法や、計画および強化学習問題への応用への道を開く。

Bayesian Data Sketching for Varying Coefficient Regression Models

変動係数回帰モデルのためのベイズデータスケッチ

Varying coefficient models are popular for estimating nonlinear regression functions in functional data models. Their Bayesian variants have received limited attention in large data applications, primarily due to prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We introduce Bayesian data sketching for varying coefficient models to obviate computational challenges presented by large sample sizes. To address the challenges of analyzing large data, we compress the functional response vector and predictor matrix by a random linear transformation to achieve dimension reduction and conduct inference on the compressed data. Our approach distinguishes itself from several existing methods for analyzing large functional data in that it requires neither the development of new models or algorithms nor any specialized computational hardware while delivering fully model-based Bayesian inference. Well-established methods and algorithms for varying-coefficient regression models can be applied to the compressed data. We establish posterior contraction rates for estimating the varying coefficients and predicting the outcome at new locations with the randomly compressed data model. We use simulation experiments and analyze remote sensed vegetation data to empirically illustrate the inferential and computational efficiency of our approach.

変動係数モデルは、機能データモデルにおける非線形回帰関数の推定によく用いられます。ベイズ変種は、主にマルコフ連鎖モンテカルロ（MCMC）アルゴリズムを用いた事後計算が非常に遅いため、大規模データアプリケーションではあまり注目されていませんでした。大規模なサンプルサイズに伴う計算上の課題を回避するために、変動係数モデルのためのベイズデータスケッチを導入します。大規模データの分析の課題に対処するため、機能応答ベクトルと予測行列をランダム線形変換によって圧縮して次元削減を実現し、圧縮されたデータに対して推論を行います。我々のアプローチは、大規模な機能データを分析するための既存の手法とは一線を画しており、新しいモデルやアルゴリズムの開発、あるいは特殊な計算ハードウェアを必要とせず、完全なモデルベースのベイズ推論を実現します。圧縮データには、変動係数回帰モデルのための確立された手法とアルゴリズムを適用できます。我々は、変動係数を推定し、ランダムに圧縮されたデータモデルを用いて新しい場所での結果を予測するための事後収縮率を確立します。シミュレーション実験とリモートセンシングによる植生データの分析により、我々のアプローチの推論効率と計算効率を実証的に示します。

Bagged k-Distance for Mode-Based Clustering Using the Probability of Localized Level Sets

局所レベルセットの確率を用いたモードベースクラスタリングのためのバギングk距離

In this paper, we propose an ensemble learning algorithm named bagged $k$-distance for mode-based clustering (BDMBC) by putting forward a new measure called the probability of localized level sets (PLLS), which enables us to find all clusters for varying densities with a global threshold. On the theoretical side, we show that with a properly chosen number of nearest neighbors $k_D$ in the bagged $k$-distance, the sub-sample size $s$, the bagging rounds $B$, and the number of nearest neighbors $k_L$ for the localized level sets, BDMBC can achieve optimal convergence rates for mode estimation. It turns out that with a relatively small $B$, the sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k_D$ can be reduced simultaneously. Moreover, we establish fast convergence rates for the level set estimation of the PLLS in terms of Hausdorff distance, which reveals that BDMBC can find localized level sets for varying densities and thus enjoys local adaptivity. On the practical side, we conduct numerical experiments to empirically verify the effectiveness of BDMBC for mode estimation and level set estimation, which demonstrates the promising accuracy and efficiency of our proposed algorithm.

本稿では、局所レベルセットの確率(PLLS)と呼ばれる新しい尺度を提案し、モードベースクラスタリングのためのbagged $k$-距離(BDMBC)というアンサンブル学習アルゴリズムを提案します。この尺度は、グローバルしきい値を使用してさまざまな密度のすべてのクラスターを見つけることを可能にします。理論面では、bagged $k$-距離における最近傍点の数$k_D$、サブサンプルサイズ$s$、バギングラウンド$B$、および局所レベルセットの最近傍点の数$k_L$を適切に選択することで、BDMBCがモード推定の最適な収束率を達成できることを示します。比較的小さな$B$を使用すると、サブサンプルサイズ$s$は各バギングラウンドでのトレーニングデータ数$n$よりもはるかに小さくすることができ、同時に最近傍点の数$k_D$を削減できることがわかります。さらに、ハウスドルフ距離を用いてPLLSのレベルセット推定の収束速度が速いことを明らかにしました。これは、BDMBCが変化する密度に対して局所的なレベルセットを見つけることができ、したがって局所適応性を持つことを示しています。実践面では、モード推定とレベルセット推定におけるBDMBCの有効性を経験的に検証するための数値実験を行い、提案アルゴリズムの有望な精度と効率性を実証しました。

Linear cost and exponentially convergent approximation of Gaussian Matérn processes on intervals

区間上のガウスMatérn過程の線形コストと指数収束近似

The computational cost for inference and prediction of statistical models based on Gaussian processes with Matérn covariance functions scales cubically with the number of observations, limiting their applicability to large data sets. The cost can be reduced in certain special cases, but there are no generally applicable exact methods with linear cost. Several approximate methods have been introduced to reduce the cost, but most lack theoretical guarantees for accuracy. We consider Gaussian processes on bounded intervals with Matérn covariance functions and, for the first time, develop a generally applicable method with linear cost and a covariance error that decreases exponentially fast in the order $m$ of the proposed approximation. The method is based on an optimal rational approximation of the spectral density and results in an approximation that can be represented as a sum of $m$ independent Gaussian Markov processes, facilitating usage in general software for statistical inference. Besides theoretical justifications, we demonstrate accuracy empirically through carefully designed simulation studies, which show that the method outperforms state-of-the-art alternatives in accuracy for fixed computational cost in tasks like Gaussian process regression.

マテルン共分散関数を持つガウス過程に基づく統計モデルの推論および予測の計算コストは観測数に比例して増大し、大規模データセットへの適用が制限されます。このコストは特定の特殊なケースでは削減可能であるが、線形コストで一般に適用可能な厳密な手法は存在しない。コスト削減のためにいくつかの近似手法が導入されているが、そのほとんどは理論的な精度保証を欠いています。本稿では、マテルン共分散関数を持つ有界区間上のガウス過程を考察し、提案された近似のオーダー$m$で共分散誤差が指数関数的に減少する、線形コストで一般に適用可能な手法を初めて開発します。この手法は、スペクトル密度の最適な有理近似に基づいており、$m$個の独立したガウスマルコフ過程の和として表せる近似値が得られるため、統計的推論のための一般的なソフトウェアでの利用が容易になります。理論的な根拠に加え、慎重に設計されたシミュレーション研究を通じて経験的に精度を実証し、ガウス過程回帰などのタスクにおいて、固定計算コストでの精度において、この手法が最先端の代替手法を上回ることを示しました。

Invariant Subspace Decomposition

不変部分空間分解

We consider the task of predicting a response $Y$ from a set of covariates $X$ in settings where the conditional distribution of $Y$ given $X$ changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. To additionally exploit observations further in the past, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component. As we show, this decomposition can be employed both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict $Y$ at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure.

我々は、$X$を与えられた$Y$の条件付き分布が時間とともに変化する状況において、共変量$X$の集合から応答$Y$を予測するタスクを考える。これを実現するには、条件付き分布が時間とともにどのように変化するかに関する仮定が必要となります。既存のアプローチでは、例えば、変化は時間とともに滑らかに起こると仮定しているため、直近の過去のみを用いた短期予測は可能となります。さらに過去の観測値も活用するために、我々は不変量に基づく線形条件文のための新しいフレームワーク、不変部分空間分解（ISD）を提案します。これは、条件付き分布を時間不変成分と残差時間依存成分に分割します。我々が示すように、この分解はゼロショット予測タスクと時間適応予測タスクの両方、つまり、予測したい時点において訓練データが全く存在しないか、あるいは少量しか存在しない状況の両方に適用可能です。我々は、近似結合行列対角化などのツールを使用して分解を自動的に推論する実用的な推定手順を提案します。さらに、提案推定量には有限サンプル保証が与えられ、追加の不変構造を用いない手法よりも実際に改善されることを経験的に実証します。

Posterior Concentrations of Fully-Connected Bayesian Neural Networks with General Priors on the Weights

重みに一般的な事前分布を持つ全結合ベイズニューラルネットワークの事後集中

Bayesian approaches for training deep neural networks (BNNs) have received significant interest and have been effectively utilized in a wide range of applications. Several studies have examined the properties of posterior concentrations in BNNs. However, most of these studies focus solely on BNN models with sparse or heavy-tailed priors. Surprisingly, there are currently no theoretical results for BNNs using Gaussian priors, which are the most commonly used in practice. The lack of theory arises from the absence of approximation results of Deep Neural Networks (DNNs) that are non-sparse and have bounded parameters. In this paper, we present a new approximation theory for non-sparse DNNs with bounded parameters. Additionally, based on the approximation theory, we show that BNNs with non-sparse general priors can achieve near-minimax optimal posterior concentration rates around the true model.

深層ニューラルネットワーク（BNN）の訓練におけるベイズ的アプローチは大きな関心を集めており、幅広い用途で効果的に利用されてきた。BNNにおける事後集中の特性を検討した研究はいくつかあります。しかし、これらの研究のほとんどは、スパースまたは裾の重い事前分布を持つBNNモデルのみに焦点を当てています。驚くべきことに、実用上最も一般的に使用されているガウス事前分布を用いたBNNについては、現在、理論的な結果が存在しない。この理論の欠如は、非スパースかつ有界パラメータを持つディープニューラルネットワーク（DNN）の近似結果が存在しないことに起因しています。本稿では、有界パラメータを持つ非スパースDNNに対する新たな近似理論を提示します。さらに、この近似理論に基づき、非スパースな一般事前分布を持つBNNが、真のモデル周辺で近似ミニマックス最適事後分布密度を達成できることを示す。

Outlier Robust and Sparse Estimation of Linear Regression Coefficients

線形回帰係数の外れ値ロバストかつスパースな推定

We consider outlier-robust and sparse estimation of linear regression coefficients, when the covariates and the noises are contaminated by adversarial outliers and noises are sampled from a heavy-tailed distribution. Our results present sharper error bounds under weaker assumptions than prior studies that share similar interests with this study. Our analysis relies on some sharp concentration inequalities resulting from generic chaining.

共変量とノイズが敵対的外れ値によって汚染され、ノイズが裾の重い分布からサンプリングされる場合、外れ値ロバストかつスパースな線形回帰係数の推定を検討します。我々の結果は、本研究と同様の関心を持つ先行研究よりも弱い仮定の下で、より鋭い誤差境界を示しています。我々の分析は、ジェネリック連鎖に起因するいくつかの鋭い集中不等式に依存しています。

Affine Rank Minimization via Asymptotic Log-Det Iteratively Reweighted Least Squares

漸近的Log-Det反復重み付け最小二乗法によるアフィンランク最小化

The affine rank minimization problem is a well-known approach to matrix recovery. While there are various surrogates to this NP-hard problem, we prove that the asymptotic minimization of log-det objective functions indeed always reveals the desired, lowest-rank matrices—whereas such may or may not recover a sought-after ground truth. Concerning commonly applied methods such as iteratively reweighted least squares, one thus remains with two difficult to distinguish concerns: how problematic are local minima inherent to the approach truly; and opposingly, how influential instead is the numerical realization. We first show that comparable solution statements do not hold true for Schatten-$p$ functions, including the nuclear norm, and discuss the role of divergent minimizers. Subsequently, we outline corresponding implications for general optimization approaches as well as the more specific IRLS-$0$ algorithm, emphasizing through examples that the transition of the involved smoothing parameter to zero is frequently a more substantial issue than non-convexity. Lastly, we analyze several presented aspects empirically in a series of numerical experiments. In particular, allowing for instance sufficiently many iterations, one may even observe a phase transition for generic recoverability at the absolute theoretical minimum.

アフィンランク最小化問題は、行列復元へのよく知られたアプローチです。このNP困難問題には様々な代替手法がありますが、本研究では、log-det目的関数の漸近最小化が、実際には常に目的の最低ランク行列を明らかにすることを証明します。ただし、この最小化は、求める真の値を復元できる場合もあれば、そうでない場合もあります。反復重み付け最小二乗法などの一般的に用いられる手法に関しては、区別が難しい2つの懸念が残ります。1つは、このアプローチに内在する局所最小値が真にどれほど問題となるか、もう1つは、数値的実現がどれほど影響力を持つかです。まず、核ノルムを含むシャッテン-$p$関数では、同等の解法が成立しないことを示し、発散最小化関数の役割について考察します。次に、一般的な最適化アプローチと、より具体的なIRLS-$0$アルゴリズムについて、関連する影響を概説します。特に、関係する平滑化パラメータのゼロへの遷移は、非凸性よりもしばしば重要な問題となることを、例を用いて強調します。最後に、一連の数値実験において、提示したいくつかの側面を経験的に分析します。特に、例えば十分な回数の反復を許容すると、理論上の絶対最小値において、ジェネリックな回復可能性の相転移を観察することさえ可能です。

Causal Effect of Functional Treatment

機能的処理の因果効果

We study the causal effect with a functional treatment variable, where practical applications often arise in neuroscience, biomedical sciences, etc. Previous research concerning the effect of a functional variable on an outcome is typically restricted to exploring correlation rather than causality. The generalized propensity score, which is often used to calibrate the selection bias, is not directly applicable to a functional treatment variable due to a lack of definition of probability density function for functional data. We propose three estimators for the average dose-response functional based on the functional linear model, namely, the functional stabilized weight estimator, the outcome regression estimator and the doubly robust estimator, each of which has its own merits. We study their theoretical properties, which are corroborated through extensive numerical experiments. A real data application on electroencephalography data and disease severity demonstrates the practical value of our methods.

神経科学、生物医学などで実用的な応用がしばしば見られる機能的治療変数を用いて因果効果を調査します。機能変数が結果に与える影響に関するこれまでの研究は、通常、因果関係ではなく相関関係の調査に限定されています。選択バイアスの較正によく用いられる一般化傾向スコアは、関数データに対する確率密度関数の定義が欠如しているため、関数的治療変数に直接適用することができません。本研究では、関数線形モデルに基づく平均用量反応関数の推定量として、それぞれ独自の長所を持つ3つの推定量を提案します。すなわち、関数安定化重み推定量、アウトカム回帰推定量、および二重ロバスト推定量です。これらの推定量の理論的特性を解析し、広範な数値実験によって裏付けます。脳波データと疾患重症度への実データ適用により、本手法の実用的価値が実証されます。

Uplift Model Evaluation with Ordinal Dominance Graphs

順序優位グラフを用いたアップリフトモデルの評価

Uplift modelling is a subfield of causal learning that focuses on ranking entities by individual treatment effects. Uplift models are typically evaluated using Qini curves or Qini scores. While intuitive, the theoretical grounding for Qini in the literature is limited, and the mathematical connection to the well-understood Receiver Operating Characteristic (ROC) curve is unclear. In this paper, we introduce pROCini, a novel uplift evaluation metric that improves upon Qini in two important ways. First, it explicitly incorporates more information by taking into account negative outcomes. Second, it leverages this additional information within the Ordinal Dominance Graph framework, which is the basis behind the well known ROC curve, resulting in a mathematically well-behaved metric that facilitates theoretical analysis. We derive confidence bounds for pROCini, exploiting its theoretical properties. Finally, we empirically validate the improved discriminative power of ROCini and pROCini in a simulation study as well as via experiments on real data.

アップリフトモデリングは、因果学習のサブフィールドであり、個々の処理効果によってエンティティをランク付けすることに焦点を当てています。アップリフトモデルは通常、Qini曲線またはQiniスコアを使用して評価されます。直感的ではありますが、文献におけるQiniの理論的根拠は限られており、よく理解されている受信者操作特性（ROC）曲線との数学的関連性は明確ではありません。本稿では、2つの重要な点でQiniを改良した、新たな向上評価指標であるpROCiniを紹介します。第一に、否定的な結果を考慮することで、より多くの情報を明示的に組み込んでいます。第二に、この追加情報を、よく知られているROC曲線の基礎となる序数優位グラフフレームワーク内で活用することで、理論分析を容易にする数学的に良好な指標を実現します。pROCiniの理論的特性を利用して、信頼限界を導出します。最後に、シミュレーション研究と実データを用いた実験により、ROCiniとpROCiniの改善された識別力を実証的に検証します。

High-Dimensional L2-Boosting: Rate of Convergence

高次元L2ブースティング：収束速度

Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of L2-Boosting in a high-dimensional setting under early stopping. We close a gap in the literature and provide the rate of convergence of L2-Boosting in a high-dimensional setting under approximate sparsity and without beta-min condition. We also show that the rate of convergence of the classical L2-Boosting depends on the design matrix described by a sparse eigenvalue condition. To show the latter results, we derive new, improved approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of L2-Boosting. These results might be of independent interest. Moreover, we introduce so-called “restricted” L2-Boosting. The restricted L2-Boosting algorithm sticks to the set of the previously chosen variables, exploits the information contained in these variables first and then only occasionally allows to add new variables to this set. We derive the rate of convergence for restricted L2-Boosting under early stopping which is close to the convergence rate of Lasso in an approximate sparse, high-dimensional setting without beta-min condition. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Finally, we present simulation studies to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, L2-Boosting clearly outperforms Lasso. An empirical illustration and the proofs are contained in the Appendix.

ブースティングは機械学習における最も重要な発展の一つです。本論文では、早期停止を伴う高次元設定におけるL2ブースティングの収束速度について考察します。文献におけるギャップを埋め、近似スパース性およびベータ最小条件なしの高次元設定におけるL2ブースティングの収束速度を示します。また、従来のL2ブースティングの収束速度は、スパース固有値条件によって記述される計画行列に依存することを示します。後者の結果を示すために、L2ブースティングの再訪挙動の解析に基づき、純粋貪欲アルゴリズムの近似結果を新たに改良して導きます。これらの結果は独立して興味深いものとなる可能性があります。さらに、いわゆる「制限付き」L2ブースティングを導入します。制限付きL2ブースティングアルゴリズムは、以前に選択された変数の集合に固執し、まずこれらの変数に含まれる情報を活用し、その後、この集合に新しい変数を追加することを稀にしか許可しません。早期停止下での制限付きL2ブースティングの収束率を導出します。これは、β最小条件のない近似スパース、高次元設定におけるLassoの収束率に近い値です。また、応用研究で容易に実装して使用できる、早期停止の実行可能な規則も紹介します。最後に、理論的結果の妥当性を示し、ブースティングの実際的な側面についての洞察を提供するために、シミュレーション研究を紹介します。これらのシミュレーション研究では、L2ブースティングはLassoを明らかに上回ります。実証的な図解と証明は付録に記載されています。

Feature Learning in Finite-Width Bayesian Deep Linear Networks with Multiple Outputs and Convolutional Layers

複数の出力と畳み込み層を持つ有限幅ベイズ深層線形ネットワークにおける特徴学習

Deep linear networks have been extensively studied, as they provide simplified models of deep learning. However, little is known in the case of finite-width architectures with multiple outputs and convolutional layers. In this manuscript, we provide rigorous results for the statistics of functions implemented by the aforementioned class of networks, thus moving closer to a complete characterization of feature learning in the Bayesian setting. Our results include: (i) an exact and elementary non-asymptotic integral representation for the joint prior distribution over the outputs, given in terms of a mixture of Gaussians; (ii) an analytical formula for the posterior distribution in the case of squared error loss function (Gaussian likelihood); (iii) a quantitative description of the feature learning infinite-width regime, using large deviation theory. From a physical perspective, deep architectures with multiple outputs or convolutional layers represent different manifestations of kernel shape renormalization, and our work provides a dictionary that translates this physics intuition and terminology into rigorous Bayesian statistics.

深層線形ネットワークは、深層学習の簡略化されたモデルを提供するため、広く研究されてきた。しかし、複数の出力と畳み込み層を持つ有限幅アーキテクチャの場合についてはほとんど知られていない。本稿では、前述のネットワーククラスによって実装される関数の統計について厳密な結果を提供し、ベイズ設定における特徴学習の完全な特性評価に近づく。我々の結果は以下の通りです。(i)出力上の結合事前分布の正確かつ基本的な非漸近積分表現（ガウス分布の混合で与えられる）、(ii)二乗誤差損失関数（ガウス尤度）の場合の事後分布の解析的公式、(iii)大偏差理論を用いた特徴学習無限幅領域の定量的記述。物理的な観点から見ると、複数の出力または畳み込み層を持つディープアーキテクチャは、カーネル形状の再正規化のさまざまな現れを表しており、私たちの研究は、この物理的な直感と用語を厳密なベイズ統計に変換する辞書を提供します。

How good is your Laplace approximation of the Bayesian posterior? Finite-sample computable error bounds for a variety of useful divergences

ベイズ事後分布のラプラス近似はどの程度正確ですか？様々な有用なダイバージェンスに対する有限サンプル計算可能誤差境界

The Laplace approximation is a popular method for constructing a Gaussian approximation to the Bayesian posterior and thereby approximating the posterior mean and variance. But approximation quality is a concern. One might consider using rate-of-convergence bounds from certain versions of the Bayesian Central Limit Theorem (BCLT) to provide quality guarantees. But existing bounds require assumptions that are unrealistic even for relatively simple real-life Bayesian analyses; more specifically, existing bounds either (1) require knowing the true data-generating parameter, (2) are asymptotic in the number of samples, (3) do not control the Bayesian posterior mean, or (4) require strongly log concave models to compute. In this work, we provide the first computable bounds on quality that simultaneously (1) do not require knowing the true parameter, (2) apply to finite samples, (3) control posterior means and variances, and (4) apply generally to models that satisfy the conditions of the asymptotic BCLT. Moreover, we substantially improve the dimension dependence of existing bounds; in fact, we achieve the lowest-order dimension dependence possible in the general case. We compute exact constants in our bounds for a variety of standard models, including logistic regression, and numerically demonstrate their utility. We provide a framework for analysis of more complex models.

ラプラス近似は、ベイズ事後分布のガウス近似を構築し、それによって事後平均と分散を近似するための一般的な方法です。しかし、近似の品質が懸念されます。品質保証を提供するために、ベイズ中心極限定理(BCLT)の特定のバージョンから得られる収束速度の境界値を使用することが考えられます。しかし、既存の境界値は、比較的単純な現実のベイズ分析でさえ非現実的な仮定を必要とします。より具体的には、既存の境界値は、(1)真のデータ生成パラメータを知ることを必要とする、(2)サンプル数に対して漸近的である、(3)ベイズ事後平均を制御しない、(4)計算に強い対数凹モデルを必要とする、のいずれかです。本研究では、(1)真のパラメータを知る必要がない、(2)有限サンプルに適用できる、(3)事後平均と分散を制御し、(4)漸近的BCLTの条件を満たすモデルに一般的に適用できる、初めての計算可能な品質の境界値を提供します。さらに、既存の境界値の次元依存性を大幅に改善し、実際、一般的なケースで可能な最低次の次元依存性を実現した。ロジスティック回帰を含む様々な標準モデルについて、我々の境界内の正確な定数を計算し、その有用性を数値的に実証します。より複雑なモデルを解析するための枠組みを提供します。

Integral Probability Metrics Meet Neural Networks: The Radon-Kolmogorov-Smirnov Test

積分確率メトリクスとニューラルネットワークの出会い：ラドン・コルモゴロフ・スミルノフ検定

Integral probability metrics (IPMs) constitute a general class of nonparametric two-sample tests that are based on maximizing the mean difference between samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of Radon bounded variation (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the IPM defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness degree $k \geq 0$. This test, which we refer to as the Radon-Kolmogorov-Smirnov (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test—the function $f$ achieving the maximum mean difference—is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. We can thus leverage the power of modern neural network optimization toolkits to (approximately) maximize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.

積分確率計量法(IPM)は、ある関数空間$\mathcal{F}$に存在するデータ変換$f$のすべての選択肢にわたって、ある分布$P$と別の分布$Q$からのサンプル間の平均差を最大化することに基づく、ノンパラメトリックな2サンプル検定の一般的なクラスです。Radon bounded variation (RBV)の関数とニューラルネットワークとして知られるものを関連付ける最近の研究(Parhi and Nowak, 2021, 2023)に触発され、我々は$\mathcal{F}$を特定の滑らかさの次数$k \geq 0$のRBV空間内の単位球とすることで定義されるIPMを調べます。Radon-Kolmogorov-Smirnov (RKS)検定と呼ぶこの検定は、よく知られた古典的なKolmogorov-Smirnov (KS)検定を多次元および高次の滑らかさに一般化したものと考えることができます。これはニューラルネットワークとも密接に関係しています。RKSテストの証人（平均差が最大となる関数$f$）は常に次数$k$のリッジスプライン、つまりニューラルネットワーク内の単一ニューロンであることを証明します。これにより、最新のニューラルネットワーク最適化ツールキットの力を活用して、RKSテストの基礎となる基準を（近似的に）最大化することができます。RKSテストは、分布の任意の異なるペア$P \not= Q$を漸近的に完全に区別できることを証明し、その漸近的な帰無分布を導出し、より伝統的なカーネルMMDテストと比較したRKSテストの長所と短所を明らかにする実験を行います。

On Inference for the Support Vector Machine

サポートベクターマシンの推論について

The linear support vector machine has a parametrised decision boundary. The paper considers inference for the corresponding parameters, which indicate the effects of individual variables on the decision boundary. The proposed inference is via a convolution-smoothed version of the SVM loss function, this having several inferential advantages over the original SVM, whose associated loss function is not everywhere differentiable. Notably, convolution-smoothing comes with non-asymptotic theoretical guarantees, including a distributional approximation to the parameter estimator that scales more favourably with the dimension of the feature vector. The differentiability of the loss function produces other advantages in some settings; for instance, by facilitating the inclusion of penalties or the synthesis of information from a large number of small samples. The paper closes by relating the linear SVM parameters to those of some probability models for binary outcomes.

線形サポートベクターマシンは、パラメータ化された決定境界を持つ。本論文では、個々の変数が決定境界に及ぼす影響を示す、対応するパラメータの推論について考察します。提案される推論は、畳み込み平滑化SVM損失関数によるもので、これは、関連する損失関数がどこでも微分可能ではない元のSVMに比べて、いくつかの推論上の利点を持つ。特に、畳み込み平滑化は、特徴ベクトルの次元に応じてより有利にスケーリングするパラメータ推定値の分布近似など、非漸近的な理論的保証を伴う。損失関数の微分可能性は、いくつかの設定において他の利点も生み出す。例えば、ペナルティの組み込みや、多数の小さなサンプルからの情報の統合を容易にするなどです。この論文は、線形SVMパラメータをバイナリ結果のいくつかの確率モデルのパラメータと関連付けて締めくくっています。

Random Pruning Over-parameterized Neural Networks Can Improve Generalization: A Training Dynamics Analysis

過剰パラメータ化されたニューラルネットワークのランダムプルーニングは汎化能力を向上させる：学習ダイナミクスの分析

It has been observed that applying pruning-at-initialization methods and training the sparse networks can sometimes yield slightly better test performance than training the original dense network. Such experimental observations are yet to be understood theoretically. This work makes the first attempt to study this phenomenon. Specifically, we identify a theoretical minimal setting and study a classification task with a one-hidden-layer neural network, which is randomly pruned according to different rates at the initialization. We show that as long as the pruning rate is below a certain threshold, the network provably exhibits good generalization performance after training.More surprisingly, the generalization bound gets better as the pruning rate mildly gets larger. To complement this positive result, we also show a negative result: there exists a large pruning rate such that while gradient descent is still able to drive the training loss toward zero, the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. To our knowledge, this is the first theory work studying how different pruning rates affect neural networks’ performance, suggesting that an appropriate pruning rate might improve the neural network’s generalization.

初期化時に枝刈りを行う手法を適用し、スパースネットワークを学習させると、元の稠密ネットワークを学習させるよりもテスト性能がわずかに向上することが観察されています。このような実験的観察は、理論的にはまだ解明されていない。本研究は、この現象を初めて検証するものです。具体的には、理論的な最小設定を特定し、初期化時に異なる率でランダムに枝刈りされた1層隠れ層ニューラルネットワークを用いた分類タスクを検証します。枝刈り率が特定の閾値以下である限り、ネットワークは学習後に良好な汎化性能を示すことが証明されます。さらに驚くべきことに、枝刈り率がわずかに大きくなるにつれて、汎化限界は改善されます。この肯定的な結果を補完するために、否定的な結果も示す。すなわち、勾配降下法によって学習損失をゼロに近づけることができる一方で、汎化性能がランダム推測と同等になるような大きな枝刈り率が存在します。これはさらに、枝刈りによって特徴学習プロセスが変化し、枝刈りされたニューラルネットワークの性能低下につながる可能性があることを示唆しています。我々の知る限り、本研究は、異なる枝刈り率がニューラルネットワークの性能にどのように影響するかを研究した初の理論研究であり、適切な枝刈り率がニューラルネットワークの汎化を向上させる可能性を示唆しています。

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

因果抽象化：メカニズム的解釈可能性の理論的基礎

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.

因果抽象化は、ブラックボックスAIモデルの既知だが不透明な低レベルの詳細を忠実に単純化した分かりやすいアルゴリズムを提供することを目的とする分野である、メカニズム解釈可能性の理論的基礎を提供します。私たちの貢献は、(1)因果抽象化理論をメカニズムの置き換え（すなわち、ハード介入とソフト介入）から任意のメカニズム変換（すなわち、古いメカニズムから新しいメカニズムへの関数）へと一般化すること、(2)多義的ニューロンの中核概念、線形表現仮説、モジュール特徴、段階的忠実性について、柔軟かつ正確な形式化を提供すること、(3)活性化およびパスパッチング、因果媒介分析、因果スクラビング、因果トレーシング、回路分析、概念消去、スパースオートエンコーダ、差分バイナリマスキング、分散アライメントサーチ、ステアリングといった様々なメカニズム解釈可能性手法を因果抽象化という共通言語に統合することです。

Implicit vs Unfolded Graph Neural Networks

暗黙的グラフニューラルネットワークと展開型グラフニューラルネットワーク

It has been observed that message-passing graph neural networks (GNN) sometimes struggle to maintain a healthy balance between the efficient / scalable modeling of long-range dependencies across nodes while avoiding unintended consequences such oversmoothed node representations, sensitivity to spurious edges, or inadequate model interpretability. To address these and other issues, two separate strategies have recently been proposed, namely implicit and unfolded GNNs (that we abbreviate to IGNN and UGNN respectively). The former treats node representations as the fixed points of a deep equilibrium model that can efficiently facilitate arbitrary implicit propagation across the graph with a fixed memory footprint. In contrast, the latter involves treating graph propagation as unfolded descent iterations as applied to some graph-regularized energy function. While motivated differently, in this paper we carefully quantify explicit situations where the solutions they produce are equivalent and others where their properties sharply diverge. This includes the analysis of convergence, representational capacity, and interpretability. In support of this analysis, we also provide empirical head-to-head comparisons across multiple synthetic and public real-world node classification benchmarks. These results indicate that while IGNN is substantially more memory-efficient, UGNN models support unique, integrated graph attention mechanisms and propagation rules that can achieve strong node classification accuracy across disparate regimes such as adversarially-perturbed graphs, graphs with heterophily, and graphs involving long-range dependencies.

メッセージパッシンググラフニューラルネットワーク(GNN)では、ノード間の長距離依存関係の効率的でスケーラブルなモデリングと、過度に平滑化されたノード表現、偽のエッジへの敏感さ、不適切なモデル解釈可能性などの意図しない結果を回避との間の健全なバランスを維持するのに苦労することが時々あることが観察されています。これらの問題やその他の問題に対処するため、最近、暗黙的GNNと展開GNN (それぞれIGNNとUGNNと略す)という2つの異なる戦略が提案されました。前者は、ノード表現を、固定メモリフットプリントでグラフ全体の任意の暗黙的伝播を効率的に促進できる深い平衡モデルの固定点として扱います。対照的に、後者は、グラフ伝播を、何らかのグラフ正規化エネルギー関数に適用された展開降下反復として扱います。動機は異なりますが、本稿では、生成される解が同等である明示的な状況と、特性が大きく異なる状況を注意深く定量化します。これには、収束、表現能力、および解釈可能性の分析が含まれます。この分析をサポートするために、複数の合成および公開されている現実世界のノード分類ベンチマーク間での経験的な直接比較も提供します。これらの結果は、IGNNの方がメモリ効率が大幅に優れている一方で、UGNNモデルは、敵対的に摂動されたグラフ、異好性を持つグラフ、長距離依存性を含むグラフなどの異なる領域にわたって強力なノード分類精度を達成できる、独自の統合グラフアテンションメカニズムと伝播ルールをサポートしていることを示しています。

Towards Optimal Branching of Linear and Semidefinite Relaxations for Neural Network Robustness Certification

ニューラルネットワークの堅牢性認証のための線形緩和法と半正定値緩和法の最適分岐に向けて

In this paper, we study certifying the robustness of ReLU neural networks against adversarial input perturbations. To diminish the relaxation error suffered by the popular linear programming (LP) and semidefinite programming (SDP) certification methods, we take a branch-and-bound approach to propose partitioning the input uncertainty set and solving the relaxations on each part separately. We show that this approach reduces relaxation error, and that the error is eliminated entirely upon performing an LP relaxation with a partition intelligently designed to exploit the nature of the ReLU activations. To scale this approach to large networks, we consider using a coarser partition whereby the number of parts in the partition is reduced. We prove that computing such a coarse partition that directly minimizes the LP relaxation error is NP-hard. By instead minimizing the worst-case LP relaxation error, we develop a closed-form branching scheme in the single-hidden layer case. We extend the analysis to the SDP, where the feasible set geometry is exploited to design a branching scheme that minimizes the worst-case SDP relaxation error. Experiments on MNIST, CIFAR-10, and Wisconsin breast cancer diagnosis classifiers demonstrate significant increases in the percentages of test samples certified. By independently increasing the input size and the number of layers, we empirically illustrate under which regimes the branched LP and branched SDP are best applied. Finally, we extend our LP branching method into a multi-layer branching heuristic, which attains comparable performance to prior state-of-the-art heuristics on large-scale, deep neural network certification benchmarks.

本稿では、ReLUニューラルネットワークの敵対的入力摂動に対する堅牢性の証明について検討します。一般的な線形計画法（LP）および半正定値計画法（SDP）の証明法で生じる緩和誤差を低減するため、分岐限定法を用いて入力不確実性集合を分割し、各部分について個別に緩和を解くことを提案します。このアプローチにより緩和誤差が低減し、ReLU活性化の性質を利用するようにインテリジェントに設計された分割を用いてLP緩和を実行することで誤差が完全に除去されることを示す。このアプローチを大規模ネットワークに拡張するため、分割部分の数を減らす粗い分割の使用を検討します。LP緩和誤差を直接最小化する粗い分割を計算することはNP困難であることを証明した。代わりに、最悪ケースのLP緩和誤差を最小化することで、単一隠れ層の場合における閉形式の分岐スキームを開発します。我々は分析をSDPに拡張し、実行可能セット形状を利用して、最悪のSDP緩和誤差を最小化する分岐スキームを設計します。MNIST、CIFAR-10、およびWisconsin乳がん診断分類器での実験では、認証されたテストサンプルの割合が大幅に増加することが示されています。入力サイズとレイヤー数を個別に増加することにより、分岐LPと分岐SDPがどのレジームで最も適切に適用されるかを経験的に示します。最後に、LP分岐法を多層分岐ヒューリスティックに拡張し、大規模なディープニューラルネットワーク認証ベンチマークで、従来の最先端のヒューリスティックに匹敵するパフォーマンスを実現します。

GraphNeuralNetworks.jl: Deep Learning on Graphs with Julia

GraphNeuralNetworks.jl: Juliaを用いたグラフ上の深層学習

GraphNeuralNetworks.jl is an open-source framework for deep learning on graphs, written in the Julia programming language. It supports multiple GPU backends, generic sparse or dense graph representations, and offers convenient interfaces for manipulating standard, heterogeneous, and temporal graphs with attributes at the node, edge, and graph levels. The framework allows users to define custom graph convolutional layers using gather/scatter message-passing primitives or optimized fused operations. It also includes several popular layers, enabling efficient experimentation with complex deep architectures. The package is available on GitHub: https://github.com/JuliaGraphs/GraphNeuralNetworks.jl.

GraphNeuralNetworks.jlは、Juliaプログラミング言語で記述された、グラフ上のディープラーニングのためのオープンソースフレームワークです。複数のGPUバックエンド、汎用的なスパースおよびデンスグラフ表現をサポートし、ノード、エッジ、グラフレベルの属性を持つ標準グラフ、異種グラフ、および時系列グラフを操作するための便利なインターフェースを提供します。このフレームワークでは、ギャザー/スキャッターメッセージパッシングプリミティブまたは最適化された融合演算を使用して、カスタムグラフ畳み込み層を定義できます。また、いくつかの一般的な層も含まれているため、複雑なディープラーニングアーキテクチャを効率的に実験できます。パッケージはGitHub (https://github.com/JuliaGraphs/GraphNeuralNetworks.jl)で入手できます。

Dynamic angular synchronization under smoothness constraints

滑らかさ制約下での動的角度同期

Given an undirected measurement graph $\mathcal{H} = ([n], \mathcal{E})$, the classical angular synchronization problem consists of recovering unknown angles $\theta_1^*,\dots,\theta_n^*$ from a collection of noisy pairwise measurements of the form $(\theta_i^* – \theta_j^*) \mod 2\pi$, for all $\{i,j\} \in \mathcal{E}$. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from pairwise comparisons. In this paper, we consider a dynamic version of this problem where the angles, and also the measurement graphs evolve over $T$ time points. Assuming a smoothness condition on the evolution of thelatent angles, we derive three algorithms for joint estimation of the angles over all time points. Moreover, for one of the algorithms, we establish non-asymptotic recovery guarantees for the mean-squared error (MSE) under different statistical models. In particular, we show that the MSE converges to zero as $T$ increases under milder conditions than in the static setting. This includes the setting where the measurement graphs are highly sparse and disconnected, and also when the measurement noise is large and can potentially increase with $T$. We complement our theoretical results with experiments on synthetic data.

無向測定グラフ$\mathcal{H} = ([n], \mathcal{E})$が与えられた場合、古典的な角度同期問題は、すべての$\{i,j\} \in \mathcal{E}$について、形式$(\theta_i^* – \theta_j^*) \mod 2\pi$のノイズを含むペアワイズ測定のコレクションから未知の角度$\theta_1^*,\dots,\theta_n^*$を回復することから構成されます。この問題は、コンピュータービジョン、分散ネットワークの時間同期、ペアワイズ比較によるランキングなど、さまざまなアプリケーションで発生します。本稿では、この問題の動的バージョン、つまり角度と測定グラフが$T$時点にわたって変化するバージョンを検討します。潜在的な角度の変化に関する平滑性条件を仮定して、すべての時点にわたって角度を共同で推定する3つのアルゴリズムを導出します。さらに、アルゴリズムの1つについて、異なる統計モデル下における平均二乗誤差（MSE）の非漸近的回復保証を確立しました。特に、静的設定よりも緩やかな条件下では、$T$の増加とともにMSEがゼロに収束することを示します。これには、測定グラフが非常にスパースで切断されている場合や、測定ノイズが大きく、$T$とともに増加する可能性がある場合も含まれます。合成データを用いた実験により、理論結果を補完します。

Derivative-Informed Neural Operator Acceleration of Geometric MCMC for Infinite-Dimensional Bayesian Inverse Problems

無限次元ベイズ逆問題に対する幾何MCMCの微分情報に基づくニューラル演算子による高速化

We propose an operator learning approach to accelerate geometric Markov chain Monte Carlo (MCMC) for solving infinite-dimensional Bayesian inverse problems (BIPs). While geometric MCMC employs high-quality proposals that adapt to posterior local geometry, it requires repeated computations of gradients and Hessians of the log-likelihood, which becomes prohibitive when the parameter-to-observable (PtO) map is defined through expensive-to-solve parametric partial differential equations (PDEs). We consider a delayed-acceptance geometric MCMC method driven by a neural operator surrogate of the PtO map, where the proposal exploits fast surrogate predictions of the log-likelihood and, simultaneously, its gradient and Hessian. To achieve a substantial speedup, the surrogate must accurately approximate the PtO map and its Jacobian, which often demands a prohibitively large number of PtO map samples via conventional operator learning methods. In this work, we present an extension of derivative-informed operator learning [O’Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] that uses joint samples of the PtO map and its Jacobian. This leads to derivative-informed neural operator (DINO) surrogates that accurately predict the observables and posterior local geometry at a significantly lower training cost than conventional methods. Cost and error analysis for reduced basis DINO surrogates are provided. Numerical studies demonstrate that DINO-driven MCMC generates effective posterior samples 3–9 times faster than geometric MCMC and 60–97 times faster than prior geometry-based MCMC. Furthermore, the training cost of DINO surrogates breaks even compared to geometric MCMC after just 10–25 effective posterior samples.

我々は、無限次元ベイズ逆問題（BIP）を解くための幾何マルコフ連鎖モンテカルロ（MCMC）を高速化する演算子学習アプローチを提案します。幾何MCMCは事後局所幾何学に適応する高品質の提案を採用するが、対数尤度の勾配とヘッシアンを繰り返し計算する必要があり、パラメータから観測可能なもの（PtO）マップが、解くのにコストのかかるパラメトリック偏微分方程式（PDE）によって定義されている場合には、計算が困難となります。我々は、PtOマップのニューラル演算子代理によって駆動される遅延受容幾何MCMC法を検討します。この提案では、対数尤度、およびその勾配とヘッシアンの高速代理予測を活用します。大幅な高速化を実現するには、代理はPtOマップとそのヤコビアンを正確に近似する必要があり、従来の演算子学習法では、多くの場合、膨大な数のPtOマップサンプルが必要となります。本研究では、PtOマップとそのヤコビアンの結合サンプルを使用する、微分情報に基づく演算子学習[O’Leary-Roseberryら, J. Comput. Phys., 496 (2024)]の拡張を提示します。これにより、従来の方法よりも大幅に低いトレーニングコストで、観測量と事後局所形状を正確に予測する微分情報に基づくニューラル演算子（DINO）サロゲートが実現します。縮減基底DINOサロゲートのコストと誤差の分析も提供します。数値研究により、DINO駆動型MCMCは、幾何学MCMCよりも3～9倍、事前幾何学ベースMCMCよりも60～97倍高速に有効事後サンプルを生成することが実証されています。さらに、DINOサロゲートのトレーニングコストは、わずか10～25の有効事後サンプルで幾何学MCMCと同等になります。

Wasserstein F-tests for Frechet regression on Bures-Wasserstein manifolds

ビュール-ワッサーシュタイン多様体上のフレシェ回帰に対するワッサーシュタインF検定

This paper addresses regression analysis for covariance matrix-valued outcomes with Euclidean covariates, motivated by applications in single-cell genomics and neuroscience where covariance matrices are observed across many samples. Our analysis leverages Fr\’echet regression on the Bures-Wasserstein manifold to estimate the conditional Fr\’echet mean given covariates $x$. We establish a non-asymptotic uniform $\sqrt{n}$-rate of convergence (up to logarithmic factors) over covariates with $\|x\| \lesssim \sqrt{\log n}$ and derive a pointwise central limit theorem to enable statistical inference. For testing covariate effects, we devise a novel test whose null distribution converges to a weighted sum of independent chi-square distributions, with power guarantees against a sequence of contiguous alternatives. Simulations validate the accuracy of the asymptotic theory. Finally, we apply our methods to a single-cell gene expression dataset, revealing age-related changes in gene co-expression networks.

本論文では、ユークリッド共変量を持つ共分散行列値の結果に対する回帰分析について取り上げます。これは、共分散行列が多くのサンプルにわたって観測される単一細胞ゲノミクスと神経科学への応用を動機としています。私たちの分析では、共変量$x$が与えられた場合の条件付きFr\’echet平均を推定するために、Bures-Wasserstein多様体上のFr\’echet回帰を利用します。$\|x\|を持つ共変量に対して、非漸近的な一様$\sqrt{n}$収束率（対数係数まで）を確立します。\lesssim \sqrt{\log n}$を仮定し、統計的推論を可能にする点ごとの中心極限定理を導出します。共変量効果の検定には、帰無分布が独立したカイ二乗分布の重み付き和に収束し、連続する対立仮説の列に対する検出力を保証するような、新たな検定法を考案します。シミュレーションにより漸近理論の精度を検証します。最後に、本手法を単一細胞遺伝子発現データセットに適用し、遺伝子共発現ネットワークにおける加齢に伴う変化を明らかにします。

Distributed Stochastic Bilevel Optimization: Improved Complexity and Heterogeneity Analysis

分散確率的二階層最適化：複雑性と異質性分析の改善

This paper considers solving a class of nonconvex-strongly-convex distributed stochastic bilevel optimization (DSBO) problems with personalized inner-level objectives. Most existing algorithms require computational loops for hypergradient estimation, leading to computational inefficiency. Moreover, the impact of data heterogeneity on convergence in bilevel problems is not explicitly characterized yet. To address these issues, we propose LoPA, a loopless personalized distributed algorithm that leverages a tracking mechanism for iterative approximation of inner-level solutions and Hessian-inverse matrices without relying on extra computation loops. Our theoretical analysis explicitly characterizes the heterogeneity across nodes (denoted by $b$), and establishes a sublinear rate of $\mathcal{O}( {\frac{1}{{{{\left( {1 – \rho } \right)}}K}}\!+ \!\frac{{(\frac{b}{\sqrt{m}})^{\frac{2}{3}} }}{{\left( {1 – \rho } \right)^{\frac{2}{3}} K^{\frac{2}{3}} }} \!+ \!\frac{1}{\sqrt{ K }}( {\sigma _{\operatorname{p} }} + \frac{1}{\sqrt{m}}{\sigma _{\operatorname{c} }} ) } )$ without the boundedness of local hypergradients, where ${\sigma _{\operatorname{p} }}$ and ${\sigma _{\operatorname{c} }}$ represent the gradient sampling variances associated with the inner- and outer-level variables, respectively. We also integrate LoPA with a gradient tracking scheme to eliminate the impact of data heterogeneity, yielding an improved rate of ${{\mathcal{O}}}(\frac{{1}}{{ (1-\rho)^2K }} \!+\! \frac{1}{{\sqrt{K}}}( \sigma_{\rm{p}} \!+\! \frac{1}{\sqrt{m}}\sigma_{\rm{c}} ) )$. The computational complexity of LoPA is of ${{\mathcal{O}}}({\epsilon^{-2}})$ to an $\epsilon$-stationary point, matching the communication complexity due to the loopless structure, which outperforms existing counterparts for DSBO. Numerical experiments validate the effectiveness of the proposed algorithm.

本論文では、パーソナライズされた内部レベル目的関数を持つ、非凸強凸分散確率的二階層最適化（DSBO）問題の解法について考察します。既存のアルゴリズムの多くは、超勾配推定に計算ループを必要とし、計算効率の低下につながる。さらに、二階層問題におけるデータの異質性が収束性に与える影響は、まだ明確には解明されていない。これらの問題に対処するため、本論文では、ループレスなパーソナライズ分散アルゴリズムであるLoPAを提案します。LoPAは、追加の計算ループに依存せずに、内部レベル解とヘッセ逆行列の反復近似のための追跡メカニズムを活用します。我々の理論分析は、ノード間の不均一性（$b$で示される）を明示的に特徴付け、局所的超勾配の有界性なしに、$\mathcal{O}( {\frac{1}{{{{\left( {1 – \rho } \right)}}K}}\!+ \!\frac{{(\frac{b}{\sqrt{m}})^{\frac{2}{3}} }}{{\left( {1 – \rho } \right)^{\frac{2}{3}} K^{\frac{2}{3}} }} \!+ \!\frac{1}{\sqrt{ K }}( {\sigma _{\operatorname{p} }} + \frac{1}{\sqrt{m}}{\sigma _{\operatorname{c} }} ) } )$の線形以下の速度を確立します。ここで${\sigma _{\operatorname{p} }}$と${\sigma _{\operatorname{c} }}$は、それぞれ内部レベル変数と外部レベル変数に関連付けられた勾配サンプリング分散を表します。また、LoPAを勾配追跡スキームと統合することで、データの異質性の影響を排除し、${{\mathcal{O}}}(\frac{{1}}{{ (1-\rho)^2K }} \!+\! \frac{1}{{\sqrt{K}}}( \sigma_{\rm{p}} \!+\! \frac{1}{\sqrt{m}}\sigma_{\rm{c}} ) )$という改善されたレートを実現しました。LoPAの計算複雑度は、$\epsilon$定常点まで${{\mathcal{O}}}({\epsilon^{-2}})$であり、ループレス構造による通信複雑度と一致し、DSBOの既存の同等の手法よりも優れています。数値実験により、提案アルゴリズムの有効性が検証されています。

Learning causal graphs via nonlinear sufficient dimension reduction

非線形十分次元削減による因果グラフの学習

We introduce a new nonparametric methodology for estimating a directed acyclic graph (DAG) from observational data. Our method is nonparametric in nature: it does not impose any specific form on the joint distribution of the underlying DAG. Instead, it relies on a linear operator on reproducing kernel Hilbert spaces to evaluate conditional independence. However, a fully nonparametric approach would involve conditioning on a large number of random variables, subjecting it to the curse of dimensionality. To solve this problem, we apply nonlinear sufficient dimension reduction to reduce the number of variables before evaluating the conditional independence. We develop an estimator for the DAG, based on a linear operator that characterizes conditional independence, and establish the consistency and convergence rates of this estimator, as well as the uniform consistency of the estimated Markov equivalence class. We introduce a modified PC-algorithm to implement the estimating procedure efficiently such that the complexity depends on the sparseness of the underlying true DAG. We demonstrate the effectiveness of our methodology through simulations and a real data analysis.

観測データから有向非巡回グラフ（DAG）を推定するための、新たなノンパラメトリック手法を導入します。本手法は本質的にノンパラメトリックであり、基となるDAGの結合分布に特定の形式を課すものではない。代わりに、再生核ヒルベルト空間の線形演算子を用いて条件付き独立性を評価します。しかし、完全なノンパラメトリック手法は、多数のランダム変数を条件付けすることになり、次元の呪いに晒されます。この問題を解決するため、条件付き独立性を評価する前に、非線形の十分な次元削減を適用して変数の数を削減します。条件付き独立性を特徴付ける線形演算子に基づくDAGの推定量を開発し、この推定量の一貫性と収束率、および推定されたマルコフ同値類の一様一貫性を確立します。推定手順を効率的に実装するために、修正されたPCアルゴリズムを導入します。このアルゴリズムでは、複雑さは基となる真のDAGのスパース性に依存します。シミュレーションと実データ解析を通して、本手法の有効性を実証します。

On Consistent Bayesian Inference from Synthetic Data

合成データからの一貫性のあるベイズ推論について

Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available, and the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation. There are very few methods of doing so, most of them for frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic data sets, that are sampled from a posterior predictive, converges to the posterior of the downstream analysis under standard regularity conditions when the analyst’s model is compatible with the data provider’s model. We also present several examples showing how the theory works in practice, and showing how Bayesian inference can fail when the compatibility assumption is not met, or the synthetic data set is not significantly larger than the original.

差分プライバシーの有無にかかわらず、合成データを生成することは、データの容易な利用可能性とデータ主体のプライバシーとの間のジレンマに対する潜在的な解決策として大きな注目を集めています。いくつかの研究は、正確な不確実性推定を含む、合成データからの下流分析の一貫性には、合成データの生成を考慮する必要があることを示しています。そのための方法はごくわずかで、ほとんどが頻度主義分析です。本稿では、合成データから一貫性のあるベイズ推論を実行する方法について検討します。事後予測からサンプリングされた複数の大規模合成データセットから個別に得られた事後分布サンプルを混合すると、分析者のモデルがデータ提供者のモデルと互換性がある場合、標準的な正則性条件下で下流分析の事後分布に収束することを証明します。また、この理論が実際にどのように機能するかを示すいくつかの例を示し、互換性の仮定が満たされない場合、または合成データセットが元のデータセットよりも有意に大きくない場合にベイズ推論が失敗する可能性があることを示します。

Optimization Over a Probability Simplex

確率単体上の最適化

We propose a new iteration scheme, the Cauchy-Simplex, to optimize convex problems over the probability simplex $\{w\in\mathbb{R}^n\ |\ \sum_i w_i=1\ \textrm{and}\ w_i\geq0\}$.Specifically, we map the simplex to the positive quadrant of a unit sphere, envisage gradient descent in latent variables, and map the result back in a way that only depends on the simplex variable. Moreover, proving rigorous convergence results in this formulation leads inherently to tools from information theory (e.g., cross-entropy and KL divergence). Each iteration of the Cauchy-Simplex consists of simple operations, making it well-suited for high-dimensional problems. In continuous time, we prove that $f(x_T)-f(x^*) = O(1/T)$ for differentiable real-valued convex functions, where $T$ is the number of time steps and $w^*$ is the optimal solution. Numerical experiments of projection onto convex hulls show faster convergence than similar algorithms. Finally, we apply our algorithm to online learning problems and prove the convergence of the average regret for (1) Prediction with expert advice and (2) Universal Portfolios.

確率単体$\{w\in\mathbb{R}^n\ |\ \sum_i w_i=1\ \textrm{and}\ w_i\geq0\}$上の凸問題を最適化するための新しい反復スキーム、コーシー単体を提案します。具体的には、単体を単位球面の正象限に写像し、潜在変数における勾配降下法を想定し、その結果を単体変数のみに依存する方法で写像します。さらに、この定式化における厳密な収束結果の証明は、情報理論のツール（例えば、クロスエントロピーやKLダイバージェンス）に本質的につながる。コーシー単体の各反復は単純な演算で構成されているため、高次元問題に適しています。連続時間において、微分可能な実数値凸関数に対して$f(x_T)-f(x^*) = O(1/T)$であることを証明します。ここで、$T$は時間ステップ数、$w^*$は最適解です。凸包への射影の数値実験は、同様のアルゴリズムよりも高速な収束を示しました。最後に、このアルゴリズムをオンライン学習問題に適用し、(1)専門家のアドバイスによる予測と(2)ユニバーサルポートフォリオの平均後悔の収束を証明します。

Laplace Meets Moreau: Smooth Approximation to Infimal Convolutions Using Laplace’s Method

ラプラスとモローの出会い：ラプラス法を用いた最小畳み込みの滑らかな近似

We study approximations to the Moreau envelope—and infimal convolutions more broadly—based on Laplace’s method, a classical tool in analysis which ties certain integrals to suprema of their integrands. We believe the connection between Laplace’s method and infimal convolutions is generally deserving of more attention in the study of optimization and partial differential equations, since it bears numerous potentially important applications, from proximal-type algorithms to Hamilton-Jacobi equations.

本研究では、ラプラス法に基づいて、モロー包絡線の近似、そしてより広義には最小畳み込みを研究します。ラプラス法は、特定の積分をその被積分関数の最大値に結び付ける解析学における古典的な手法です。ラプラス法と最小畳み込みの関連性は、近似型アルゴリズムからハミルトン・ヤコビ方程式に至るまで、潜在的に重要な応用が数多く存在するため、最適化と偏微分方程式の研究において、より一層の注目に値すると我々は考えています。

Sampling and Estimation on Manifolds using the Langevin Diffusion

ランジュバン拡散を用いた多様体上のサンプリングと推定

Error bounds are derived for sampling and estimation using a discretization of an intrinsically defined Langevin diffusion with invariant measure $\text{d}\mu_\phi \propto e^{-\phi} \mathrm{dvol}_g $ on a compact Riemannian manifold. Two estimators of linear functionals of $\mu_\phi $ based on the discretized Markov process are considered: a time-averaging estimator based on a single trajectory and an ensemble-averaging estimator based on multiple independent trajectories. Imposing no restrictions beyond a nominal level of smoothness on $\phi$, first-order error bounds, in discretization step size, on the bias and variance/mean-square error of both estimators are derived. The order of error matches the optimal rate in Euclidean and flat spaces, and leads to a first-order bound on distance between the invariant measure $\mu_\phi$ and a stationary measure of the discretized Markov process. This order is preserved even upon using retractions when exponential maps are unavailable in closed form, thus enhancing practicality of the proposed algorithms. Generality of the proof techniques, which exploit links between two partial differential equations and the semigroup of operators corresponding to the Langevin diffusion, renders them amenable for the study of a more general class of sampling algorithms related to the Langevin diffusion. Conditions for extending analysis to the case of non-compact manifolds are discussed. Numerical illustrations with distributions, log-concave and otherwise, on the manifolds of positive and negative curvature elucidate on the derived bounds and demonstrate practical utility of the sampling algorithm.

コンパクトリーマン多様体上の不変測度$\text{d}\mu_\phi \propto e^{-\phi} \mathrm{dvol}_g $を持つ本質的に定義されたランジュバン拡散の離散化を用いて、サンプリングと推定の誤差境界を導出します。離散化マルコフ過程に基づく$\mu_\phi $の線形汎関数の2つの推定量、すなわち単一の軌跡に基づく時間平均推定量と複数の独立した軌跡に基づくアンサンブル平均推定量が検討されます。$\phi$に名目上の平滑度を超える制約を課さずに、離散化ステップサイズにおける両推定値のバイアスおよび分散/平均二乗誤差の一次誤差境界を導出します。誤差の順序はユークリッド空間および平坦空間における最適速度と一致し、不変測度$\mu_\phi$と離散化マルコフ過程の定常測度との間の距離の一次境界につながる。この順序は、指数写像が閉じた形式で利用できない場合に後退法を使用しても維持されるため、提案アルゴリズムの実用性が向上します。2つの偏微分方程式とランジュバン拡散に対応する作用素の半群との間のリンクを利用する証明手法の一般性により、ランジュバン拡散に関連するより一般的なクラスのサンプリングアルゴリズムの研究が容易になります。非コンパクト多様体の場合に解析を拡張するための条件について議論します。正曲率および負曲率の多様体上の分布、対数凹分布およびその他の分布の数値図解により、導出された境界が明らかにされ、サンプリングアルゴリズムの実用性が実証されます。

Sharp Bounds for Sequential Federated Learning on Heterogeneous Data

異種データにおける逐次連合学習のシャープな境界

There are two paradigms in Federated Learning (FL): parallel FL (PFL), where models are trained in a parallel manner across clients, and sequential FL (SFL), where models are trained in a sequential manner across clients. Specifically, in PFL, clients perform local updates independently and send the updated model parameters to a global server for aggregation; in SFL, one client starts its local updates only after receiving the model parameters from the previous client in the sequence. In contrast to that of PFL, the convergence theory of SFL on heterogeneous data is still lacking. To resolve the theoretical dilemma of SFL, we establish sharp convergence guarantees for SFL on heterogeneous data with both upper and lower bounds. Specifically, we derive the upper bounds for the strongly convex, general convex and non-convex objective functions, and construct the matching lower bounds for the strongly convex and general convex objective functions. Then, we compare the upper bounds of SFL with those of PFL, showing that SFL outperforms PFL on heterogeneous data (at least, when the level of heterogeneity is relatively high). Experimental results validate the counterintuitive theoretical finding.

Federated Learning (FL)には、モデルがクライアント間で並列にトレーニングされる並列FL (PFL)と、モデルがクライアント間で順次にトレーニングされる順次FL (SFL)の2つのパラダイムがあります。具体的には、PFLではクライアントが独立してローカル更新を実行し、更新されたモデルパラメータを集約のためにグローバルサーバーに送信します。SFLでは、1つのクライアントが、シーケンス内の前のクライアントからモデルパラメータを受け取った後にのみ、ローカル更新を開始します。PFLとは対照的に、異種データに対するSFLの収束理論はまだ欠如しています。SFLの理論的ジレンマを解決するために、我々は、上限と下限の両方を持つ異種データに対するSFLの明確な収束保証を確立します。具体的には、強凸、一般凸、非凸目的関数の上限を導出し、強凸および一般凸目的関数の対応する下限を構築します。次に、SFLの上限をPFLの上限と比較し、SFLは異質データ（少なくとも異質性のレベルが比較的高い場合）においてPFLよりも優れていることを示します。実験結果は、直感に反する理論的発見を検証します。

Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

オーバーパラメータ化におけるディープニューラルネットワークの局所線形回復保証

Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term “local linear recovery” (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios.

ディープニューラルネットワーク（DNN）モデルが過剰パラメータ化においてターゲット関数を確実に回復できるかどうかを判断することは、ディープラーニング理論において重要かつ複雑な問題です。この分野の理解を深めるため、我々は「局所線形回復」（LLR）と呼ぶ概念を導入します。これはターゲット関数回復のより弱い形で、問題を理論分析により容易にします。LLRの意味において、より狭いDNNで表現可能な関数は、モデルパラメータよりも少ないサンプル数から回復可能であることが保証されることを証明します。具体的には、与えられたDNNの空間における関数について、LLRを保証するために必要な最小サンプル数として定義される楽観的サンプル数の上限を確立します。さらに、これらの上限が2層tanhニューラルネットワークの場合に達成されることを証明します。我々の研究は、過剰パラメータ化されたシナリオにおけるDNNの回復能力に関する将来の研究のための確固たる基礎を築くものです。

Stabilizing Sharpness-Aware Minimization Through A Simple Renormalization Strategy

単純な再正規化戦略によるシャープネスを考慮した最小化の安定化

Recently, sharpness-aware minimization (SAM) has attracted much attention because of its surprising effectiveness in improving generalization performance. However, compared to stochastic gradient descent (SGD), it is more prone to getting stuck at the saddle points, which as a result may lead to performance degradation. To address this issue, we propose a simple renormalization strategy, dubbed Stable SAM (SSAM), so that the gradient norm of the descent step maintains the same as that of the ascent step. Our strategy is easy to implement and flexible enough to integrate with SAM and its variants, almost at no computational cost. With elementary tools from convex optimization and learning theory, we also conduct a theoretical analysis of sharpness-aware training, revealing that compared to SGD, the effectiveness of SAM is only assured in a limited regime of learning rate. In contrast, we show how SSAM extends this regime of learning rate and then it can consistently perform better than SAM with the minor modification. Finally, we demonstrate the improved performance of SSAM on several representative data sets and tasks.

近年、シャープネスを考慮した最小化（SAM）は、汎化性能の向上において驚くべき有効性を示すことから、大きな注目を集めています。しかし、確率的勾配降下法（SGD）と比較すると、SAMは鞍点に陥りやすく、結果として性能低下につながる可能性があります。この問題に対処するため、我々は安定SAM（SSAM）と呼ばれる単純な正規化戦略を提案します。この戦略は、下降ステップの勾配ノルムが上昇ステップの勾配ノルムと同じになるようにします。この戦略は実装が容易で、SAMおよびその派生モデルとほぼ計算コストをかけずに統合できるほど柔軟です。凸最適化と学習理論の基本的なツールを用いて、シャープネスを考慮した学習の理論分析を行い、SGDと比較して、SAMの有効性は限られた学習率の範囲でのみ保証されることを明らかにしました。これに対し、SSAMがこの学習率の範囲をどのように拡張し、わずかな変更を加えることでSAMよりも一貫して優れた性能を発揮できるかを示します。最後に、いくつかの代表的なデータセットとタスクにおいて、SSAMの性能向上を示します。

Fine-Grained Change Point Detection for Topic Modeling with Pitman-Yor Process

ピットマン・ヨー過程を用いたトピックモデリングのための細粒度変化点検出

Identifying change points in dynamic text data is crucial for understanding the evolving nature of topics across various sources, such as news articles, scientific papers, and social media posts. While topic modeling has become a widely used technique for this purpose, capturing fine-grained shifts in individual topics over time remains a significant challenge. Traditional approaches typically use a two-stage process, separating topic modeling and change point detection. However, this separation can lead to information loss and inconsistency in capturing subtle changes in topic evolution. To address this issue, we propose TOPIC-PYP, a change point detection model specifically designed for fine-grained topic-level analysis, i.e., detecting change points for each individual topic. By leveraging the Pitman-Yor process, TOPIC-PYP effectively captures the dynamic evolution of topic meanings over time. Unlike traditional methods, TOPIC-PYP integrates topic modeling and change point detection into a unified framework, facilitating a more comprehensive understanding of the relationship between topic evolution and change points. Experimental evaluations on both synthetic and real-world datasets demonstrate the effectiveness of TOPIC-PYP in accurately detecting change points and generating high-quality topics.

動的なテキストデータにおける変化点の特定は、ニュース記事、科学論文、ソーシャルメディアの投稿など、さまざまなソースにわたるトピックの進化の性質を理解するために不可欠です。トピックモデリングはこの目的で広く使用される手法となっていますが、個々のトピックの経時的な変化をきめ細かく捉えることは依然として大きな課題です。従来のアプローチでは、通常、トピックモデリングと変化点検出を分離する2段階のプロセスが使用されます。しかし、この分離は、トピックの進化における微妙な変化を捉える際に情報の損失や不整合につながる可能性があります。この問題に対処するため、我々はTOPIC-PYPを提案します。これは、きめ細かなトピックレベルの分析、つまり個々のトピックの変化点を検出するために特別に設計された変化点検出モデルです。Pitman-Yorプロセスを活用することで、TOPIC-PYPはトピックの意味の経時的な動的な変化を効果的に捉えます。TOPIC-PYPは従来の手法とは異なり、トピックモデリングと変化点検出を統一されたフレームワークに統合することで、トピックの進化と変化点の関係をより包括的に理解することを可能にします。合成データセットと実世界データセットの両方を用いた実験評価により、TOPIC-PYPが変化点を正確に検出し、高品質なトピックを生成する有効性が実証されています。

Deletion Robust Non-Monotone Submodular Maximization over Matroids

マトロイド上の削除ロバストな非単調劣モジュラ最大化

We study the deletion robust version of submodular maximization under matroid constraints. The goal is to extract a small-size summary of the data set that contains a high-value independent set even after an adversary deletes some elements. We present constant-factor approximation algorithms, whose space complexity depends on the rank $k$ of the matroid, the number $d$ of deleted elements, and the input precision $\varepsilon$. In the centralized setting we present a $(4.494+O(\varepsilon))$-approximation algorithm with summary size $O( \frac{k+d}{\varepsilon^2}\log \frac{k}{\varepsilon})$ that improves to a $(3.582+O(\varepsilon))$-approximation with $O(k + \frac{d}{\varepsilon^2}\log \frac{k}{\varepsilon})$ summary size when the objective is monotone. In the streaming setting we provide a $(9.294 + O(\varepsilon))$-approximation algorithm with summary size and memory $O(k + \frac{d}{\varepsilon^2}\log \frac{k}{\varepsilon})$; the approximation factor is then improved to $(5.582+O(\varepsilon))$ in the monotone case.

マトロイド制約下における劣モジュラ最大化の削除ロバスト版を研究します。目標は、攻撃者がいくつかの要素を削除した後でも、高い値の独立集合を含むデータセットの小規模な要約を抽出することです。マトロイドのランク$k$、削除された要素の数$d$、および入力精度$\varepsilon$に依存する定数因子近似アルゴリズムを示す。集中設定では、要約サイズが$O( \frac{k+d}{\varepsilon^2}\log \frac{k}{\varepsilon})$である$(4.494+O(\varepsilon))$近似アルゴリズムを提示します。これは、目的関数が単調な場合、要約サイズが$O(k + \frac{d}{\varepsilon^2}\log \frac{k}{\varepsilon})$である$(3.582+O(\varepsilon))$近似に改善されます。ストリーミング設定では、要約サイズとメモリが$O(k + \frac{d}{\varepsilon^2}\log \frac{k}{\varepsilon})$である$(9.294 + O(\varepsilon))$近似アルゴリズムを提供します。単調なケースでは、近似係数は$(5.582+O(\varepsilon))$に改善されます。

Instability, Computational Efficiency and Statistical Accuracy

不安定性、計算効率、および統計精度

Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on $n$ samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton’s method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps—namely, with the number of iterations being reduced from polynomial to logarithmic in sample size $n$.

多くの統計的推定量はデータ依存演算子の不動点として定義され、コスト関数の最小化に基づく推定量は重要な特殊ケースです。このような推定量の限界性能は、無限個のサンプルの理想化された極限における集団レベル演算子の特性に依存します。我々は、集団レベルにおけるアルゴリズムの決定論的収束速度と、$n$個のサンプルに基づく経験的対象に適用した場合の（不）安定性の度合いとの相互作用に基づいて、統計的精度の限界を導く一般的な枠組みを開発します。この枠組みを用いて、安定的な勾配降下法と、ニュートン法とその3次正則化変種、EMアルゴリズムなどの高階および不安定なアルゴリズムの両方を解析します。我々は、ガウス混合推定、非線形回帰モデル、情報的無回答モデルなど、いくつかの具体的なモデルクラスへの我々の一般的な結果の応用を示す。不安定なアルゴリズムが、指数関数的に少ないステップ数、すなわちサンプルサイズ$n$における反復回数が多項式から対数へと減少することで、安定なアルゴリズムと同等の統計精度を達成できる事例を示す。

Estimation of Local Geometric Structure on Manifolds from Noisy Data

ノイズを含むデータからの多様体上の局所幾何構造の推定

A common observation in data-driven applications is that high-dimensional data have a low intrinsic dimension, at least locally. In this work, we consider the problem of point estimation for manifold-valued data. Namely, given a finite set of noisy samples of $\mathcal{M}$, a $d$ dimensional submanifold of $\mathbb{R}^D$, and a point $r$ near the manifold we aim to project $r$ onto the manifold. Assuming that the data was sampled uniformly from a tubular neighborhood of a $k$-times smooth boundaryless and compact manifold, we present an algorithm that takes $r$ from this neighborhood and outputs $\hat p_n\in \mathbb{R}^D$, and $\widehat{T_{\hat p_n}\mathcal{M}}$ an element in the Grassmannian $Gr(d, D)$. We prove that as the number of samples $n\to\infty$, the point $\hat p_n$ converges to $\mathbf{p}\in \mathcal{M}$, the projection of $r$ onto $\mathcal{M}$, and $\widehat{T_{\hat p_n}\mathcal{M}}$ converges to $T_{\mathbf{p}}\mathcal{M}$ (the tangent space at that point) with high probability. Furthermore, we show that $\hat p_n$ approaches the manifold with an asymptotic rate of $n^{-\frac{k}{2k + d}}$, and that $\hat p_n, \widehat{T_{\hat p_n}\mathcal{M}}$ approach $\mathbf{p}$ and $T_{\mathbf{p}}\mathcal{M}$ correspondingly with asymptotic rates of $n^{-\frac{k-1}{2k + d}}$. %While we These rates coincide with the optimal rates for the estimation of function derivatives.

データ駆動型アプリケーションでよく見られる現象として、高次元データは、少なくとも局所的には、内在次元が低いことが挙げられます。本研究では、多様体値データの点推定の問題について考察します。すなわち、$\mathcal{M}$のノイズ付きサンプルの有限集合、$\mathbb{R}^D$の$d$次元部分多様体、および多様体の近傍の点$r$が与えられた場合、$r$を多様体へ射影することを目指す。データが$k$倍の滑らかな境界なしコンパクト多様体の管状近傍から一様にサンプリングされたと仮定し、この近傍から$r$を取り、$\hat p_n\in \mathbb{R}^D$とグラスマン多様体$Gr(d, D)$の元$\widehat{T_{\hat p_n}\mathcal{M}}$を出力するアルゴリズムを提示します。サンプル数が$n\to\infty$になるにつれて、点$\hat p_n$が$r$の$\mathcal{M}$への射影である$\mathbf{p}\in \mathcal{M}$に収束し、$\widehat{T_{\hat p_n}\mathcal{M}}$が$T_{\mathbf{p}}\mathcal{M}$（その点における接空間）に高い確率で収束することを証明します。さらに、$\hat p_n$が$n^{-\frac{k}{2k + d}}$の漸近速度で多様体に近づくこと、また、$\hat p_n、\widehat{T_{\hat p_n}\mathcal{M}}$がそれに応じて$n^{-\frac{k-1}{2k + d}}$の漸近速度で$\mathbf{p}$および$T_{\mathbf{p}}\mathcal{M}$に近づくことを示します。%これらの速度は、関数の導関数の推定に最適な速度と一致します。

Ontolearn—A Framework for Large-scale OWL Class Expression Learning in Python

Ontolearn – Pythonによる大規模OWLクラス表現学習フレームワーク

In this paper, we present Ontolearn—a framework for learning OWL class expressions over large knowledge graphs.Ontolearn contains efficient implementations of recent state-of-the-art symbolic and neuro-symbolic class expression learners including EvoLearner and DRILL.A learned OWL class expression can be used to classify instances in the knowledge graph.Furthermore, Ontolearn integrates a verbalization module based on an LLM to translate complex OWL class expressions into natural language sentences.By mapping OWL class expressions into respective SPARQL queries, Ontolearn can be easily used to operate over a remote triplestore.The source code of Ontolearn is available at https://github.com/dice-group/Ontolearn.

本稿では、大規模な知識グラフ上でOWLクラス表現を学習するためのフレームワークであるOntolearnを紹介します。Ontolearnには、EvoLearnerやDRILLなど、最新のシンボリックおよびニューロシンボリッククラス表現学習器の効率的な実装が含まれています。学習済みのOWLクラス表現は、知識グラフ内のインスタンスを分類するために使用できます。さらに、OntolearnはLLMに基づく言語化モジュールを統合し、複雑なOWLクラス表現を自然言語の文に変換します。OWLクラス表現をそれぞれのSPARQLクエリにマッピングすることにより、Ontolearnをリモートトリプルストア上で簡単に操作するために使用できます。Ontolearnのソースコードはhttps://github.com/dice-group/Ontolearnで入手できます。

Continuously evolving rewards in an open-ended environment

オープンエンド環境における連続的に進化する報酬

Unambiguous identification of the rewards driving behaviours of entities operating in complex open-ended real-world environments is difficult, in part because goals and associated behaviours emerge endogenously and are dynamically updated as environments change. Reproducing such dynamics in models would be useful in many domains, particularly where fixed reward functions limit the adaptive capabilities of agents. Simulation experiments described here assess a candidate algorithm for the dynamic updating of the reward function, RULE: Reward Updating through Learning and Expectation. The approach is tested in a simplified ecosystem-like setting where experiments challenge entities’ survival, calling for significant behavioural change. The population of entities successfully demonstrate the abandonment of an initially rewarded but ultimately detrimental behaviour, amplification of beneficial behaviour, and appropriate responses to novel items added to their environment. These adjustments happen through endogenous modification of the entities’ reward function, during continuous learning, without external intervention.

複雑でオープンエンドな現実世界環境で活動するエンティティの行動を駆動する報酬を明確に特定することは困難です。その理由の一つは、目標とそれに関連する行動が内生的に出現し、環境の変化に応じて動的に更新されるためです。このようなダイナミクスをモデルで再現することは、多くの領域、特に固定された報酬関数がエージェントの適応能力を制限するような領域で有用です。ここで説明するシミュレーション実験は、報酬関数の動的更新のための候補アルゴリズムである「RULE:学習と期待による報酬更新」を評価します。このアプローチは、エンティティの生存に挑戦し、大幅な行動変化を要求する、単純化された生態系のような設定でテストされます。エンティティの集団は、当初は報酬を与えられたものの最終的には有害な行動の放棄、有益な行動の増幅、そして環境に追加された新しいアイテムへの適切な反応をうまく示します。これらの調整は、外部からの介入なしに、継続的な学習中にエンティティの報酬関数の内生的修正を通じて行われます。

Recursive Causal Discovery

再帰的因果発見

Causal discovery from observational data, i.e., learning the causal graph from a finite set of samples from the joint distribution of the variables, is often the first step toward the identification and estimation of causal effects, a key requirement in numerous scientific domains. Causal discovery is hampered by two main challenges: limited data results in errors in statistical testing and the computational complexity of the learning task is daunting. This paper builds upon and extends four of our prior publications (Mokhtarian et al., 2021; Akbari et al., 2021; Mokhtarian et al., 2022, 2023a). These works introduced the concept of removable variables, which are the only variables that can be removed recursively for the purpose of causal discovery. Presence and identification of removable variables allow recursive approaches for causal discovery, a promising solution that helps to address the aforementioned challenges by reducing the problem size successively. This reduction not only minimizes conditioning sets in each conditional independence (CI) test, leading to fewer errors but also significantly decreases the number of required CI tests. The worst-case performances of these methods nearly match the lower bound. In this paper, we present a unified framework for the proposed algorithms, refined with additional details and enhancements for a coherent presentation. A comprehensive literature review is also included, comparing the computational complexity of our methods with existing approaches, showcasing their state-of-the-art efficiency. Another contribution of this paper is the release of RCD, a Python package that efficiently implements these algorithms. This package is designed for practitioners and researchers interested in applying these methods in practical scenarios. The package is available at github.com/ban-epfl/rcd, with comprehensive documentation provided at rcdpackage.com.

観測データからの因果発見、すなわち変数の共分布から有限のサンプル集合から因果グラフを学習することは、多くの科学分野における重要な要件である因果効果の特定と推定に向けた最初のステップとなることが多い。因果発見は、主に2つの課題によって妨げられています。1つはデータが限られているため統計的検定でエラーが発生すること、もう1つは学習タスクの計算量が膨大であることだ。本論文は、私たちの過去の4つの論文（Mokhtarianら, 2021; Akbariら, 2021; Mokhtarianら, 2022, 2023a）を基に、さらに拡張したものです。これらの研究は、因果発見のために再帰的に除去できる唯一の変数である除去可能変数の概念を導入しました。除去可能変数の存在と特定により、因果発見のための再帰的アプローチが可能になり、問題規模を段階的に縮小することで前述の課題に対処する有望な解決策となります。この縮小は、各条件付き独立性（CI）テストにおける条件集合を最小化し、エラーを減らすだけでなく、必要なCIテストの数も大幅に削減します。これらの手法の最悪ケースのパフォーマンスは、下限にほぼ一致します。本論文では、提案アルゴリズムの統一フレームワークを提示し、一貫性のあるプレゼンテーションのために詳細と機能強化を追加して洗練させます。包括的な文献レビューも含まれており、本手法の計算複雑度を既存の手法と比較し、最先端の効率性を示しています。本論文のもう一つの貢献は、これらのアルゴリズムを効率的に実装するPythonパッケージであるRCDのリリースです。このパッケージは、これらの手法を実際のシナリオに適用することに関心のある実務家や研究者向けに設計されています。パッケージはgithub.com/ban-epfl/rcdで入手でき、包括的なドキュメントはrcdpackage.comで提供されています。

Evaluation of Active Feature Acquisition Methods for Time-varying Feature Settings

時変特徴設定に対する能動的な特徴獲得手法の評価

Machine learning methods often assume that input features are available at no cost. However, in domains like healthcare, where acquiring features could be expensive or harmful, it is necessary to balance a feature’s acquisition cost against its predictive value. The task of training an AI agent to decide which features to acquire is called active feature acquisition (AFA). By deploying an AFA agent, we effectively alter the acquisition strategy and trigger a distribution shift. To safely deploy AFA agents under this distribution shift, we present the problem of active feature acquisition performance evaluation (AFAPE). We examine AFAPE under i) a no direct effect (NDE) assumption, stating that acquisitions do not affect the underlying feature values; and ii) a no unobserved confounding (NUC) assumption, stating that retrospective feature acquisition decisions were only based on observed features. We show that one can apply missing data methods under the NDE assumption and offline reinforcement learning under the NUC assumption. When NUC and NDE hold, we propose a novel semi-offline reinforcement learning framework. This framework requires a weaker positivity assumption and introduces three new estimators: A direct method (DM), an inverse probability weighting (IPW), and a double reinforcement learning (DRL) estimator.

機械学習手法では、入力特徴量はコストなしで利用できると想定されることが多い。しかし、医療のように特徴量の取得にコストがかかったり、有害となる可能性のある分野では、特徴量の取得コストと予測値のバランスを取る必要があります。AIエージェントにどの特徴量を取得するかを決定するよう訓練するタスクは、能動的特徴量取得（AFA）と呼ばれます。AFAエージェントを配備することで、獲得戦略を効果的に変更し、分布シフトを引き起こす。この分布シフト下でAFAエージェントを安全に配備するために、能動的特徴量取得性能評価（AFAPE）の問題を提示します。AFAPEは、i)獲得が基礎となる特徴量値に影響を与えないという直接効果なし（NDE）仮定、およびii)観測されない交絡なし（NUC）仮定、つまり遡及的な特徴量取得の決定は観測された特徴量のみに基づいているという仮定の下で検証されます。NDE仮定の下では欠損データ法、NUC仮定の下ではオフライン強化学習を適用できることを示す。NUCとNDEが成立する場合、新しい半オフライン強化学習フレームワークを提案します。この枠組みでは、より弱い陽性仮定を必要とし、3つの新しい推定量、すなわち直接法（DM）、逆確率重み付け（IPW）、二重強化学習（DRL）推定量を導入します。

On Adaptive Stochastic Optimization for Streaming Data: A Newton’s Method with O(dN) Operations

ストリーミングデータのための適応型確率最適化について：O(dN)演算のニュートン法

Stochastic optimization methods face new challenges in the realm of streaming data, characterized by a continuous flow of large, high-dimensional data. While first-order methods, like stochastic gradient descent, are the natural choice for such data, they often struggle with ill-conditioned problems. In contrast, second-order methods, such as Newton’s method, offer a potential solution but are computationally impractical for large-scale streaming applications. This paper introduces adaptive stochastic optimization methods that effectively address ill-conditioned problems while functioning in a streaming context. Specifically, we present adaptive inversion-free stochastic quasi-Newton methods with computational complexity matching that of first-order methods, $\mathcal{O}(dN)$, where $d$ represents the number of dimensions/features and $N$ the number of data points. Theoretical analysis establishes their asymptotic efficiency, and empirical studies demonstrate their effectiveness in scenarios with complex covariance structures and poor initializations. In particular, we demonstrate that our adaptive quasi-Newton methods can outperform or match existing first- and second-order methods.

機械学習手法では、入力特徴量はコストなしで利用できると想定されることがよくあります。しかし、医療のように特徴量の取得にコストがかかったり、有害となる可能性のある分野では、特徴量の取得コストと予測値のバランスを取る必要があります。AIエージェントにどの特徴量を取得するかを決定するよう訓練するタスクは、能動的特徴量取得（AFA）と呼ばれます。AFAエージェントを配備することで、獲得戦略を効果的に変更し、分布シフトを引き起こします。この分布シフト下でAFAエージェントを安全に配備するために、能動的特徴量取得性能評価（AFAPE）の問題を提示します。AFAPEは、i)獲得が基礎となる特徴量に影響を与えないという直接効果なし（NDE）仮定、およびii)観測されない交絡なし（NUC）仮定、つまり遡及的な特徴量取得の決定は観測された特徴量のみに基づいているという仮定の下で検証されます。NDE仮定の下では欠損データ法、NUC仮定の下ではオフライン強化学習を適用できることを示します。NUCとNDEが成立する場合、我々は新たな半オフライン強化学習フレームワークを提案します。このフレームワークは、より弱い正値仮定を必要とし、3つの新しい推定量を導入します。すなわち、直接法（DM）、逆確率重み付け（IPW）、そして二重強化学習（DRL）推定量です。

Determine the Number of States in Hidden Markov Models via Marginal Likelihood

周辺尤度を用いた隠れマルコフモデルの状態数の決定

Hidden Markov models (HMM) have been widely used by scientists to model stochastic systems: the underlying process is a discrete Markov chain, and the observations are noisy realizations of the underlying process. Determining the number of hidden states for an HMM is a model selection problem which is yet to be satisfactorily solved, especially for the popular Gaussian HMM with heterogeneous covariance. In this paper, we propose a consistent method for determining the number of hidden states of HMM based on the marginal likelihood, which is obtained by integrating out both the parameters and hidden states. Moreover, we show that the model selection problem of HMM includes the order selection problem of finite mixture models as a special case. We give rigorous proof of the consistency of the proposed marginal likelihood method and provide an efficient computation method for practical implementation. We numerically compare the proposed method with the Bayesian information criterion (BIC), demonstrating the effectiveness of the proposed marginal likelihood method.

隠れマルコフモデル（HMM）は、確率システムをモデル化するために科学者に広く利用されています。基礎となるプロセスは離散マルコフ連鎖であり、観測値は基礎となるプロセスのノイズを含む実現値です。HMMの隠れ状態の数を決定することは、特に異種共分散を持つ一般的なガウスHMMの場合、未だ十分に解決されていないモデル選択問題です。本稿では、パラメータと隠れ状態の両方を積分することで得られる周辺尤度に基づいて、HMMの隠れ状態の数を決定する一貫した方法を提案します。さらに、HMMのモデル選択問題には、有限混合モデルの順序選択問題が特別なケースとして含まれることを示す。提案する周辺尤度法の一貫性を厳密に証明し、実用化のための効率的な計算方法を提供します。提案方法をベイズ情報量基準（BIC）と数値的に比較し、提案する周辺尤度法の有効性を示す。大規模で高次元のデータが連続的に流れるストリーミングデータ領域において、確率的最適化手法は新たな課題に直面しています。このようなデータには確率的勾配降下法のような一次法が自然な選択肢ですが、悪条件問題への対応が困難な場合が多くあります。一方、ニュートン法のような二次法は潜在的な解決策となりますが、大規模ストリーミングアプリケーションでは計算量的に現実的ではありません。本稿では、ストリーミング環境下で悪条件問題に効果的に対処できる適応型確率的最適化手法を紹介します。具体的には、一次法と同等の計算量$\mathcal{O}(dN)$を持つ適応型逆行列不要確率的準ニュートン法を提示します。ここで、$d$は次元数/特徴数、$N$はデータ点数です。理論分析によって漸近的効率性が確立され、実証研究によって複雑な共分散構造と不適切な初期化条件を持つシナリオにおける有効性が実証されています。特に、適応型準ニュートン法が既存の一次および二次手法を上回る、あるいは同等の性能を発揮できることを実証します。

Variance-Aware Estimation of Kernel Mean Embedding

カーネル平均の分散を考慮した推定埋め込み

An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed-up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.

隠れマルコフモデル（HMM）は、科学者によって確率システムのモデル化に広く用いられてきました。基礎となるプロセスは離散マルコフ連鎖であり、観測値は基礎となるプロセスのノイズを含む実現値です。HMMの隠れ状態の数を決定することは、特に異種共分散を持つ一般的なガウスHMMの場合、まだ十分に解決されていないモデル選択問題です。本稿では、パラメータと隠れ状態の両方を統合して得られる周辺尤度に基づいて、HMMの隠れ状態の数を決定するための一貫した方法を提案します。さらに、HMMのモデル選択問題には、有限混合モデルの順序選択問題が特別なケースとして含まれることを示します。提案する周辺尤度法の一貫性を厳密に証明し、実際の実装のための効率的な計算方法を提供します。提案方法をベイズ情報量基準（BIC）と数値的に比較し、提案する周辺尤度法の有効性を実証します。

Scaling ResNets in the Large-depth Regime

大規模深度レジームにおけるResNetのスケーリング

Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$—other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrarily to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.

Deep ResNetは、複雑な機械学習タスクにおいて最先端の結果を達成することで知られています。しかし、これらのアーキテクチャの驚異的なパフォーマンスは、特に深度$L$が増加するにつれて、勾配の消失や爆発を回避するために慎重に構築する必要があるトレーニング手順に依存しています。この問題の軽減方法については合意が得られていませんが、広く議論されている戦略の一つは、各層の出力を係数$\alpha_L$でスケーリングすることです。確率的設定において、標準的なi.i.d.初期化において、$\alpha_L = \frac{1}{\sqrt{L}}$の場合のみ非自明なダイナミクスとなることを示します。他の選択肢は爆発または恒等写像のいずれかにつながります。このスケーリング係数は、連続時間極限においてニューラル確率微分方程式に対応しており、これは深層ResNetがニューラル常微分方程式の離散化であるという広く解釈されている解釈に反しています。対照的に、後者の領域では、特定の相関初期化と$\alpha_L = \frac{1}{L}$で安定性が得られます。私たちの分析は、層インデックスの関数としての重みのスケーリングと規則性の間に強い相互作用があることを示唆しています。最後に、一連の実験において、これら2つのパラメータによって駆動される連続的な領域の範囲を示し、これらはトレーニング前後のパフォーマンスに共同で影響を与えます。カーネル平均埋め込み（KME）の重要な特徴は、経験的KMEから真の分布KMEへの収束速度が、空間の次元、分布の特性、カーネルの滑らかさの特性とは独立に、有界化できることです。本稿では、再生カーネルヒルベルト空間における分散情報を活用することで収束を高速化する方法を示します。さらに、そのような情報が事前に不明な場合でも、データから効率的に推定できることを示し、偶然の設定において加速される分布非依存の境界という要件を回復します。さらに、独立データから定常混合シーケンスへと結果を拡張し、仮説検定とロバストなパラメトリック推定の文脈で本手法を示します。

A Comparative Evaluation of Quantification Methods

定量化手法の比較評価

Quantification represents the problem of estimating the distribution of class labels on unseen data. It also represents a growing research field in supervised machine learning, for which a large variety of different algorithms has been proposed in recent years. However, a comprehensive empirical comparison of quantification methods that supports algorithm selection is not available yet. In this work, we close this research gap by conducting a thorough empirical performance comparison of 24 different quantification methods on in total more than 40 datasets, considering binary as well as multiclass quantification settings. We observe that no single algorithm generally outperforms all competitors, but identify a group of methods that perform best in the binary setting, including the threshold selection-based median sweep and TSMax methods, the DyS framework including the HDy method, Forman’s mixture model, and Friedman’s method. For the multiclass setting, we observe that a different, broad group of algorithms yields good performance, including the HDx method, the generalized probabilistic adjusted count, the readme method, the energy distance minimization method, the EM algorithm for quantification, and Friedman’s method. We also find that tuning the underlying classifiers has in most cases only a limited impact on the quantification performance. More generally, we find that the performance on multiclass quantification is inferior to the results obtained in the binary setting. Our results can guide practitioners who intend to apply quantification algorithms and help researchers identify opportunities for future research.

数量化は、未知のデータにおけるクラスラベルの分布を推定する問題です。また、教師あり機械学習の研究分野として成長しており、近年、多種多様なアルゴリズムが提案されています。しかし、アルゴリズムの選択をサポートする包括的な数量化手法の実証的比較はまだありません。本研究では、合計40以上のデータセットで24の異なる数量化手法の徹底的な実証的パフォーマンス比較を実施し、バイナリおよびマルチクラス数量化設定を考慮することで、この研究ギャップを埋めます。一般的にすべての競合よりも優れたパフォーマンスを発揮する単一のアルゴリズムはありませんが、閾値選択ベースのメディアンスイープ法とTSMax法、HDy法を含むDySフレームワーク、Formanの混合モデル、およびFriedman法など、バイナリ設定で最高のパフォーマンスを発揮する一連の手法を特定しました。多クラス設定では、HDx法、一般化確率調整カウント、readme法、エネルギー距離最小化法、定量化のためのEMアルゴリズム、フリードマン法など、様々なアルゴリズムが良好なパフォーマンスを発揮することがわかりました。また、基盤となる分類器の調整は、ほとんどの場合、定量化のパフォーマンスに限られた影響しか与えないこともわかりました。より一般的には、多クラス定量化のパフォーマンスは、2値設定で得られた結果よりも劣ることがわかりました。私たちの結果は、定量化アルゴリズムを適用しようとする実務家にとって指針となり、研究者が将来の研究の機会を特定するのに役立ちます。

Lightning UQ Box: Uncertainty Quantification for Neural Networks

Lightning UQ Box：ニューラルネットワークの不確実性定量化

Although neural networks have shown impressive results in a multitude of application domains, the “black box” nature of deep learning and lack of confidence estimates have led to scepticism, especially in domains like medicine and physics where such estimates are critical. Research on uncertainty quantification (UQ) has helped elucidate the reliability of these models, but existing implementations of these UQ methods are sparse and difficult to reuse. To this end, we introduce Lightning UQ Box, a PyTorch-based Python library for deep learning-based UQ methods powered by PyTorch Lightning. Lightning UQ Box supports classification, regression, semantic segmentation, and pixelwise regression applications, and UQ methods from a variety of theoretical motivations. With this library, we provide an entry point for practitioners new to UQ, as well as easy-to-use components and tools for scalable deep learning applications.

ニューラルネットワークは多くの応用分野で素晴らしい成果を上げていますが、深層学習の「ブラックボックス」性や信頼性のある推定値の欠如は、特に推定値が重要な医療や物理学などの分野では懐疑的な見方につながっています。不確実性定量化（UQ）の研究はこれらのモデルの信頼性の解明に役立ってきましたが、これらのUQ手法の既存の実装はまばらで、再利用が困難です。そこで、PyTorch Lightningを搭載した深層学習ベースのUQ手法のためのPyTorchベースのPythonライブラリ、Lightning UQ Boxを紹介します。Lightning UQ Boxは、分類、回帰、セマンティックセグメンテーション、ピクセル単位の回帰アプリケーション、そして様々な理論的動機に基づくUQ手法をサポートしています。このライブラリは、UQを初めて使用する実践者にとってのエントリーポイントとなるだけでなく、スケーラブルな深層学習アプリケーションのための使いやすいコンポーネントとツールも提供します。数量化とは、未知のデータにおけるクラスラベルの分布を推定する問題です。また、教師あり機械学習においても成長を続ける研究分野であり、近年、多種多様なアルゴリズムが提案されています。しかし、アルゴリズムの選択を支援する包括的な数量化手法の実証的比較はまだ行われていません。本研究では、2クラスおよび多クラスの数量化設定を考慮し、合計40以上のデータセットで24種類の異なる数量化手法の徹底的な実証的性能比較を実施することで、この研究ギャップを埋めます。単一のアルゴリズムがすべての競合手法よりも一般的に優れているわけではありませんが、閾値選択ベースのメディアンスイープ法とTSMax法、HDy法を含むDySフレームワーク、Formanの混合モデル、Friedman法など、2クラス設定で最も優れた性能を発揮する手法群を特定しました。多クラス設定では、HDx法、一般化確率調整カウント法、readme法、エネルギー距離最小化法、定量化のためのEMアルゴリズム、フリードマン法など、様々なアルゴリズムが良好なパフォーマンスを示すことが確認されました。また、基盤となる分類器の調整は、ほとんどの場合、定量化のパフォーマンスに限られた影響しか与えないこともわかりました。より一般的には、多クラス定量化のパフォーマンスは、2クラス設定で得られた結果よりも劣ることがわかりました。私たちの結果は、定量化アルゴリズムを適用しようとする実務家にとっての指針となり、研究者が将来の研究の機会を見つけるのに役立つでしょう。

Scaling Data-Constrained Language Models

データ制約付き言語モデルのスケーリング

The current trend of scaling language models involves increasing both parameter count and training data set size. Extrapolating this trend suggests that training data set size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training data set with code data or removing commonly used filters. Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

言語モデルのスケーリングにおける現在の傾向は、パラメータ数と学習データセットのサイズの両方の増加を伴います。この傾向を外挿すると、学習データセットのサイズは近い将来、インターネット上で利用可能なテキストデータの量によって制限される可能性があることが示唆されます。この制限を踏まえ、データ制約のある環境における言語モデルのスケーリングを調査します。具体的には、データの繰り返しと計算予算を最大9,000億個の学習トークンと90億個のパラメータモデルまで変化させた大規模な実験を実施します。計算予算が固定された制約データの場合、最大4エポックの繰り返しデータを用いた学習では、一意のデータを用いた場合と比較して、損失への影響は無視できることがわかりました。しかし、繰り返し回数が増えると、計算量の追加による価値は最終的にゼロに減少します。繰り返しトークンと過剰なパラメータの価値の減少を考慮した、計算量の最適化に関するスケーリング則を提案し、実験的に検証します。最後に、学習データセットにコードデータを追加することや、一般的に使用されるフィルターを削除することなど、データ不足を緩和する手法を実験します。400回の訓練実行から得られたモデルとデータセットは、https://github.com/huggingface/datablationsで無料で入手できます。

Curvature-based Clustering on Graphs

グラフ上の曲率ベースクラスタリング

Unsupervised node clustering (or community detection) is a classical graph learning task. In this paper, we study algorithms that exploit the geometry of the graph to identify densely connected substructures, which form clusters or communities. Our method implements discrete Ricci curvatures and their associated geometric flows, under which the edge weights of the graph evolve to reveal its community structure. We consider several discrete curvature notions and analyze the utility of the resulting algorithms. In contrast to prior literature, we study not only single-membership community detection, where each node belongs to exactly one community, but also mixed-membership community detection, where communities may overlap. For the latter, we argue that it is beneficial to perform community detection on the line graph, i.e., the graph’s dual. We provide both theoretical and empirical evidence for the utility of our curvature-based clustering algorithms. In addition, we give several results on the relationship between the curvature of a graph and that of its dual, which enable the efficient implementation of our proposed mixed-membership community detection approach and which may be of independent interest for curvature-based network analysis.

教師なしノードクラスタリング（またはコミュニティ検出）は、古典的なグラフ学習タスクです。本稿では、グラフの幾何学的特徴を利用して、クラスターまたはコミュニティを形成する密に接続された部分構造を識別するアルゴリズムを検討します。我々の手法は、離散リッチ曲率とそれに関連する幾何学的フローを実装し、その下でグラフのエッジの重みが進化することでコミュニティ構造を明らかにします。我々はいくつかの離散曲率の概念を検討し、結果として得られるアルゴリズムの有用性を分析します。先行研究とは対照的に、我々は各ノードが正確に1つのコミュニティに属する単一メンバーシップコミュニティ検出だけでなく、コミュニティが重複する可能性がある混合メンバーシップコミュニティ検出も検討します。後者の場合、グラフの双対である線グラフ上でコミュニティ検出を行うことが有益であると主張します。曲率に基づくクラスタリングアルゴリズムの有用性について、理論的および経験的証拠を提示します。さらに、グラフの曲率とその双対グラフの曲率の関係に関するいくつかの結果を示す。これらの結果は、提案する混合メンバーシップコミュニティ検出アプローチの効率的な実装を可能にし、曲率に基づくネットワーク分析においても独立した関心事となる可能性があります。

Composite Goodness-of-fit Tests with Kernels

カーネルを用いた複合適合度検定

We propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on kernel-based distances such as the maximum mean discrepancy. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. We also prove that the popular wild bootstrap will lead to an overly conservative test, and show that the parametric bootstrap is consistent and can lead to significantly improved performance in practice. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of a non-parametric density model, and an intractable generative model of a biological cellular network.

データが何らかのパラメトリックファミリーの分布から来ているかどうかに関心を持つ、困難な複合検定問題に対するカーネルベースの仮説検定を提案します。検定では、最大平均乖離度などのカーネルベースの距離に基づく最小距離推定量を使用します。主な結果として、正しい検定レベルを維持しながら、同じデータ（データ分割なし）でパラメータを推定し、検定を実行できることを示します。また、一般的なワイルドブートストラップは過度に保守的な検定につながることを証明し、パラメトリックブートストラップは一貫性があり、実際には大幅に向上するパフォーマンスにつながることを示します。我々のアプローチは、ノンパラメトリック密度モデルの適合度検定や、生物細胞ネットワークの扱いにくい生成モデルなど、様々な問題で実証されています。

PFLlib: A Beginner-Friendly and Comprehensive Personalized Federated Learning Library and Benchmark

PFLlib：初心者向けで包括的なパーソナライズされた連合学習ライブラリおよびベンチマーク

Amid the ongoing advancements in Federated Learning (FL), a machine learning paradigm that allows collaborative learning with data privacy protection, personalized FL (pFL) has gained significant prominence as a research direction within the FL domain. Whereas traditional FL (tFL) focuses on jointly learning a global model, pFL aims to balance each client’s global and personalized goals in FL settings. To foster the pFL research community, we started and built PFLlib, a comprehensive pFL library with an integrated benchmark platform. In PFLlib, we implemented 37 state-of-the-art FL algorithms (8 tFL algorithms and 29 pFL algorithms) and provided various evaluation environments with three statistically heterogeneous scenarios and 24 datasets. At present, PFLlib has gained more than 1600 stars and 300 forks on GitHub.

データプライバシーを保護しながら協調学習を可能にする機械学習パラダイムであるFederated Learning（FL）の継続的な進歩の中で、パーソナライズされたFL（pFL）は、FL分野における研究方向として大きな注目を集めています。従来のFL（tFL）がグローバルモデルの共同学習に焦点を当てているのに対し、pFLはFL設定において各クライアントのグローバル目標とパーソナライズされた目標のバランスをとることを目指しています。pFL研究コミュニティを育成するために、私たちは統合ベンチマークプラットフォームを備えた包括的なpFLライブラリであるPFLlibを立ち上げ、構築した。PFLlibでは、37の最先端FLアルゴリズム（8つのtFLアルゴリズムと29のpFLアルゴリズム）を実装し、3つの統計的に異質なシナリオと24のデータセットを備えた様々な評価環境を提供しました。現在、PFLlibはGitHubで1600以上のスターと300以上のフォークを獲得しています。

The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

SGDバッチサイズがオートエンコーダ学習に与える影響：スパース性、シャープネス、および特徴学習

In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and is highly aligned with its initialized direction, showing that relatively little feature learning occurs. On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum that is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of “feature selection” in this setting. Moreover, if we measure the sharpness of the minimum by the trace of the Hessian, the minima found with full-batch gradient descent are flatter than those found with strictly smaller batch sizes, in contrast to previous works which suggest that large batches lead to sharper minima. To prove convergence of SGD with a constant step size, we introduce a powerful tool from the theory of non-homogeneous random walks which may be of independent interest.

本研究では、線形またはReLU活性化を用いた単一ニューロンオートエンコーダを直交データで学習する際の確率的勾配降下法（SGD）のダイナミクスを調査します。この非凸問題において、一定のステップサイズでランダムに初期化されたSGDは、任意のバッチサイズ選択において大域的最小値を発見することを示す。しかし、発見される特定の大域的最小値はバッチサイズに依存します。フルバッチ設定では、解は稠密（つまり、スパースではない）であり、初期化された方向と非常によく一致していることを示し、特徴学習が比較的少ないことを示しています。一方、サンプル数より厳密に小さいバッチサイズでは、SGDはスパースであり、初期化とほぼ直交する大域的最小値を発見します。これは、確率的勾配のランダム性が、この設定において質的に異なるタイプの「特徴選択」を誘導することを示す。さらに、ヘッセ行列のトレースで最小値の鋭さを測定すると、フルバッチ勾配降下法で発見された最小値は、バッチサイズが小さいほど最小値が鋭くなることを示唆する以前の研究とは対照的に、より平坦です。一定のステップサイズでのSGDの収束を証明するために、非同次ランダムウォーク理論からの強力なツールを導入します。これは独立した関心事となる可能性があります。

Efficient and Robust Transfer Learning of Optimal Individualized Treatment Regimes with Right-Censored Survival Data

右打ち切り生存データを用いた最適個別治療レジームの効率的かつロバストな転移学習

An individualized treatment regime (ITR) is a decision rule that assigns treatments based on patients’ characteristics. The value function of an ITR is the expected outcome in a counterfactual world had this ITR been implemented. Recently, there has been increasing interest in combining heterogeneous data sources, such as leveraging the complementary features of randomized controlled trial (RCT) data and a large observational study (OS). Usually, a covariate shift exists between the source and target population, rendering the source-optimal ITR not optimal for the target population. We present an efficient and robust transfer learning framework for estimating the optimal ITR with right-censored survival data that generalizes well to the target population. The value function accommodates a broad class of functionals of survival distributions, including survival probabilities and restrictive mean survival times (RMSTs). We propose a doubly robust estimator of the value function, and the optimal ITR is learned by maximizing the value function within a pre-specified class of ITRs. We establish the cubic rate of convergence for the estimated parameter indexing the optimal ITR, and show that the proposed optimal value estimator is consistent and asymptotically normal even with flexible machine learning methods for nuisance parameter estimation. We evaluate the empirical performance of the proposed method by simulation studies and a real data application of sodium bicarbonate therapy for patients with severe metabolic acidaemia in the intensive care unit (ICU), combining a RCT and an observational study with heterogeneity.

個別化治療レジーム(ITR)は、患者の特性に基づいて治療を割り当てる決定ルールです。ITRの価値関数とは、反事実的世界でこのITRが実装されていた場合に期待される結果です。最近では、ランダム化比較試験(RCT)データと大規模観察研究(OS)の相補的特徴を活用するなど、異種のデータソースを組み合わせることに関心が高まっています。通常、ソース集団とターゲット集団の間には共変量シフトが存在するため、ソース最適ITRはターゲット集団にとって最適ではありません。本稿では、ターゲット集団に適切に一般化される右側打ち切り生存データを使用して最適ITRを推定するための、効率的で堅牢な転移学習フレームワークを紹介します。価値関数は、生存確率や制限平均生存時間(RMST)など、生存分布の広範な関数に対応します。本稿では価値関数の二重に堅牢な推定量を提案し、事前に指定されたITRのクラス内で価値関数を最大化することで最適なITRを学習します。最適なITRを指標とする推定パラメータの3次収束率を確立し、提案する最適値推定値は、柔軟な機械学習手法を用いて擬似パラメータを推定した場合でも、一貫性があり、漸近的に正規分布に従うことを示す。提案手法の実証的性能は、RCTと異質性を考慮した観察研究を組み合わせたシミュレーション研究と、集中治療室（ICU）における重症代謝性アシデミア患者に対する重炭酸ナトリウム療法の実データ適用によって評価します。

DAGs as Minimal I-maps for the Induced Models of Causal Bayesian Networks under Conditioning

条件付け下の因果ベイズネットワークの誘導モデルのための最小IマップとしてのDAG

Bayesian networks (BNs) are a powerful tool for knowledge representation and reasoning, especially for complex systems. A critical task in the applications of BNs is conditional inference or inference in the presence of selection bias. However, post-conditioning, the conditional distribution family of a BN can become complex for analysis, and the corresponding induced subgraph may not accurately encode the conditional independencies for the remaining variables. In this work, we first investigate the conditions under which a BN remains closed under conditioning, meaning that the induced subgraph is consistent with the structural information of conditional distributions. Conversely, when a BN is not closed, we aim to construct a new directed acyclic graph (DAG) as a minimal $\mathcal{I}$-map for the conditional model by incorporating directed edges into the original induced graph. We present an equivalent characterization of this minimal $\mathcal{I}$-map and develop an efficient algorithm for its identification. The proposed framework improves the efficiency of conditional inference of a BN. Additionally, the DAG minimal $\mathcal{I}$-map offers graphical criteria for the safe integration of knowledge from diverse sources (subpopulations/conditional distributions), facilitating correct parameter estimation. Both theoretical analysis and simulation studies demonstrate that using a DAG minimal $\mathcal{I}$-map for conditional inference is more effective than traditional methods based on the joint distribution of the original BN.

ベイジアンネットワーク（BN）は、特に複雑なシステムにおける知識表現と推論のための強力なツールです。BNの応用において重要なタスクは、条件付き推論、または選択バイアスが存在する場合の推論です。しかし、事後条件付けにより、BNの条件付き分布族は解析が複雑になる可能性があり、対応する誘導サブグラフは残りの変数の条件付き独立性を正確にエンコードしない可能性があります。本研究では、まず、BNが条件付け下で閉じた状態を維持する条件、つまり誘導サブグラフが条件付き分布の構造情報と整合する条件を調べる。逆に、BNが閉じていない場合、元の誘導グラフに有向エッジを組み込むことで、条件付きモデルの最小$\mathcal{I}$マップとして新しい有向非巡回グラフ(DAG)を構築することを目指します。この最小$\mathcal{I}$マップの同等の特性評価を提示し、その識別のための効率的なアルゴリズムを開発します。提案されたフレームワークは、BNの条件付き推論の効率を改善します。さらに、DAG最小$\mathcal{I}$マップは、多様な情報源(サブポピュレーション/条件付き分布)からの知識を安全に統合するためのグラフィカルな基準を提供し、正しいパラメータ推定を容易にします。理論分析とシミュレーション研究の両方により、条件付き推論にDAG最小$\mathcal{I}$マップを使用することは、元のBNの結合分布に基づく従来の方法よりも効果的であることが実証されています。

Adjusted Expected Improvement for Cumulative Regret Minimization in Noisy Bayesian Optimization

調整済み期待改善率ノイズベイズ最適化における累積リグレット最小化

The expected improvement (EI) is one of the most popular acquisition functions for Bayesian optimization (BO) and has demonstrated good empirical performances in many applications for the minimization of simple regret. However, under the evaluation metric of cumulative regret, the performance of EI may not be competitive, and its existing theoretical regret upper bound still has room for improvement. To adapt the EI for better performance under cumulative regret, we introduce a novel quantity called the evaluation cost which is compared against the acquisition function, and with this, develop the expected improvement-cost (EIC) algorithm. In each iteration of EIC, a new point with the largest acquisition function value is sampled, only if that value exceeds its evaluation cost. If none meets this criteria, the current best point is resampled. This evaluation cost quantifies the potential downside of sampling a point, which is important under the cumulative regret metric as the objective function value in every iteration affects the performance measure. We establish in theory a high-probability regret upper bound of EIC based on the maximum information gain, which is tighter than the bound of existing EI-based algorithms. It is also comparable to the regret bound of other popular BO algorithms such as Thompson sampling (GP-TS) and upper confidence bound (GP-UCB). We further perform experiments to illustrate the improvement of EIC over several popular BO algorithms.

期待改善度(EI)は、ベイズ最適化(BO)における最も一般的な獲得関数の1つであり、単純リグレットの最小化に対する多くのアプリケーションで良好な経験的パフォーマンスが実証されています。ただし、累積リグレットの評価基準では、EIのパフォーマンスは競争力がない可能性があり、既存の理論的なリグレットの上限にはまだ改善の余地があります。累積リグレットでEIのパフォーマンスを向上させるために、獲得関数と比較される評価コストと呼ばれる新しい量を導入し、これを使用して期待改善コスト(EIC)アルゴリズムを開発します。EICの各反復では、獲得関数の値が評価コストを超える場合にのみ、その値が最大になる新しいポイントがサンプリングされます。この基準を満たすポイントがない場合、現在の最良のポイントが再サンプリングされます。この評価コストは、ポイントをサンプリングすることの潜在的なマイナス面を定量化します。これは、各反復における目的関数の値がパフォーマンス指標に影響を与えるため、累積リグレット基準では重要です。我々は、最大情報利得に基づくEICの高確率リグレット上限を理論的に確立した。これは、既存のEIベースアルゴリズムの上限よりも厳密です。また、これは、トンプソンサンプリング（GP-TS）や信頼度上限（GP-UCB）といった他の一般的なBOアルゴリズムのリグレット上限にも匹敵します。さらに、いくつかの一般的なBOアルゴリズムに対するEICの改善を示す実験も行う。

Manifold Fitting under Unbounded Noise

無制限ノイズ下での多様体フィッティング

In the field of non-Euclidean statistical analysis, a trend has emerged in recent times, of attempts to recover a low dimensional structure, namely a manifold, underlying the high dimensional data. Recovering the manifold requires the noise to be of a certain concentration and prevailing methods address this requirement by constructing an approximated manifold that is based on the tangent space estimation at each sample point. Although theoretical convergence for these methods is guaranteed, the samples are either noiseless or the noise is bounded. However, if the noise is unbounded, as is commonplace, the tangent space estimation at the noisy samples will be blurred – an undesirable outcome since fitting a manifold from the blurred tangent space might be more greatly compromised in terms of its accuracy. In this paper, we introduce a new manifold-fitting method, whereby the output manifold is constructed by directly estimating the tangent spaces at the projected points on the latent manifold, rather than at the sample points, thus reducing the error caused by the noise. Assuming the noise is unbounded, our new method has a high probability of achieving theoretical convergence, in terms of the upper bound of the distance between the estimated and latent manifold. The smoothness of the estimated manifold is also evaluated by bounding the supremum of twice difference above. Numerical simulations are conducted as part of this new method to help validate our theoretical findings and demonstrate the advantages of our method over other relevant manifold fitting methods. Finally, our method is applied to real data examples.

非ユークリッド統計解析の分野では、近年、高次元データの基礎となる低次元構造、すなわち多様体を復元しようとする傾向が見られます。多様体を復元するには、ノイズが一定の濃度である必要があり、一般的な手法では、各サンプル点における接空間推定に基づく近似多様体を構築することでこの要件に対処しています。これらの手法は理論的収束が保証されていますが、サンプルはノイズレスであるか、ノイズが有界です。しかし、一般的にノイズが無界の場合、ノイズを含むサンプルにおける接空間推定は不鮮明になります。これは望ましくない結果です。なぜなら、不鮮明な接空間から多様体をフィッティングすると、精度の面でより大きく損なわれる可能性があるからです。本稿では、出力多様体の構築にあたり、サンプル点ではなく潜在多様体上の投影点における接空間を直接推定することでノイズによる誤差を低減する、新たな多様体フィッティング手法を提案します。ノイズが無制限であると仮定すると、この新手法は、推定多様体と潜在多様体間の距離の上限に関して、高い理論的収束確率を示す。推定多様体の滑らかさは、上記の二倍差の上限を定めることで評価されます。この新手法の一環として数値シミュレーションを実施し、理論的知見を検証し、他の関連する多様体フィッティング手法に対する本手法の優位性を実証します。最後に、本手法を実際のデータ例に適用します。

Learning Global Nash Equilibrium in Team Competitive Games with Generalized Fictitious Cross-Play

一般化架空クロスプレイを用いたチーム競争ゲームにおける大域ナッシュ均衡の学習

Self-play (SP) is a popular multi-agent reinforcement learning framework for competitive games. Despite the empirical success, the theoretical properties of SP are limited to two-player settings. For team competitive games where two teams of cooperative agents compete with each other, we show a counter-example where SP cannot converge to a global Nash equilibrium (NE) with high probability. Policy-Space Response Oracles (PSRO) is an alternative framework that finds NEs by iteratively learning the best response (BR) to previous policies. PSRO can be directly extended to team competitive games with unchanged convergence properties by learning team BRs, but its repeated training from scratch makes it hard to scale to complex games. In this work, we propose Generalized Fictitious Cross-Play (GFXP), a novel algorithm that inherits benefits from both frameworks. GFXP simultaneously trains an SP-based main policy and a counter population. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the BRs to the main policy’s checkpoints. We evaluate GFXP in matrix games and gridworld domains where GFXP achieves the lowest exploitabilities. We further conduct experiments in a challenging football game where GFXP defeats SOTA models with over 94% win rate.

セルフプレイ（SP）は、競争ゲームのための人気のあるマルチエージェント強化学習フレームワークです。実験的な成功にもかかわらず、SPの理論的特性は2人のプレイヤーの設定に限定されています。協力エージェントの2つのチームが互いに競争するチーム競争ゲームの場合、SPが高確率でグローバルナッシュ均衡(NE)に収束できない反例を示します。ポリシー空間応答オラクル(PSRO)は、以前のポリシーに対する最善の応答(BR)を反復的に学習することでNEを見つける代替フレームワークです。PSROは、チームBRを学習することで収束特性を変更せずにチーム競争ゲームに直接拡張できますが、最初から繰り返しトレーニングするため、複雑なゲームに拡張することが困難です。本研究では、両方のフレームワークの利点を継承した新しいアルゴリズムである一般化架空クロスプレイ(GFXP)を提案します。GFXPは、SPベースのメインポリシーとカウンターポピュレーションを同時にトレーニングします。メインポリシーは、架空のセルフプレイとカウンターポピュレーションに対するクロスプレイによってトレーニングされ、カウンターポリシーは、メインポリシーのチェックポイントに対するBRとしてトレーニングされます。GFXPを、マトリックスゲームとグリッドワールド領域で評価します。これらの領域では、GFXPが最も低いエクスプロイト可能性を達成しています。さらに、難易度の高いフットボールゲームで実験を行い、GFXPがSOTAモデルを94%以上の勝率で破りました。

Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models

一般的なスコアベース生成モデルに対するワッサーシュタイン収束保証

Score-based generative models are a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of score-based generative models in the 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our results to several concrete score-based generative models with specific choices of forward processes modeled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment with score-based generative models with different forward processes for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity.

スコアベースの生成モデルは、多くのアプリケーションで最先端のパフォーマンスを備えた最近のクラスの深層生成モデルです。本稿では、正確なスコア推定と滑らかな対数凹データ分布を仮定し、2-ワッサーシュタイン距離における一般的なクラスのスコアベースの生成モデルの収束保証を確立します。私たちは、確率微分方程式でモデル化された順方向プロセスの特定の選択を持ついくつかの具体的なスコアベースの生成モデルに結果を特化し、各モデルの反復複雑度の上限を取得します。これは、異なる順方向プロセスの選択の影響を示しています。また、データ分布がガウス分布の場合の下限も提供します。数値的には、CIFAR-10での無条件画像生成に対して、異なる順方向プロセスを持つスコアベースの生成モデルを実験します。実験結果は、反復複雑度に関する理論予測とよく一致することがわかりました。

Extremal graphical modeling with latent variables via convex optimization

凸最適化による潜在変数を用いた極値グラフィカルモデリング

Extremal graphical models encode the conditional independence structure of multivariate extremes and provide a powerful tool for quantifying the risk of rare events. Prior work on learning these graphs from data has focused on the setting where all relevant variables are observed. For the popular class of Husler-Reiss models, we propose the eglatent method, a tractable convex program for learning extremal graphical models in the presence of latent variables. Our approach decomposes the Husler-Reiss precision matrix into a sparse component encoding the graphical structure among the observed variables after conditioning on the latent variables, and a low-rank component encoding the effect of a few latent variables on the observed variables. We provide finite-sample guarantees of eglatent and show that it consistently recovers the conditional graph as well as the number of latent variables. We highlight the improved performances of our approach on synthetic and real data.

極値グラフィカルモデルは、多変量極値の条件付き独立性構造を符号化し、稀なイベントのリスクを定量化するための強力なツールを提供します。データからこれらのグラフを学習するこれまでの研究は、すべての関連変数が観測される設定に焦点を当ててきた。広く普及しているHusler-Reissモデルに対し、潜在変数が存在する極値グラフィカルモデルを学習するための扱いやすい凸プログラムであるeglatent法を提案します。本手法は、Husler-Reissの精度行列を、潜在変数を条件付けた後の観測変数間のグラフィカル構造を符号化するスパース成分と、少数の潜在変数が観測変数に与える影響を符号化する低ランク成分に分解します。eglatentの有限サンプル保証を提供し、条件付きグラフと潜在変数の数を一貫して復元できることを示す。合成データと実データにおける本手法の性能向上についても強調します。

On the Approximation of Kernel functions

カーネル関数の近似について

Various methods in statistical learning build on kernels considered in reproducing kernel Hilbert spaces. In applications, the kernel is often selected based on characteristics of the problem and the data. This kernel is then employed to infer response variables at points, where no explanatory data were observed. The data considered here are located in compact sets in higher dimensions and the paper addresses approximations of the kernel itself. The new approach considers Taylor series approximations of radial kernel functions. For the Gauss kernel on the unit cube, the paper establishes an upper bound of the associated eigenfunctions, which grows only polynomially with respect to the index. The novel approach substantiates smaller regularization parameters than considered in the literature, overall leading to better approximations. This improvement confirms low rank approximation methods such as the Nyström method.

統計学習のさまざまな手法は、再現カーネルヒルベルト空間で考慮されるカーネルに基づいています。応用分野においては、カーネルは問題とデータの特性に基づいて選択されることが多い。このカーネルは、説明データが観測されなかった点における応答変数の推定に用いられます。ここで検討するデータは高次元のコンパクトな集合に配置されており、本論文ではカーネル自体の近似について論じる。新しいアプローチでは、ラジアルカーネル関数のテイラー級数近似を考慮します。単位立方体上のガウスカーネルについては、本論文では関連する固有関数の上限を確立し、これはインデックスに関して多項式的にのみ増加します。この新しいアプローチは、文献で検討されているよりも小さい正則化パラメータを実証し、全体としてより優れた近似値をもたらす。この改善は、ニストローム法などの低ランク近似法の有効性を裏付けるものです。

Efficient and Robust Semi-supervised Estimation of Average Treatment Effect with Partially Annotated Treatment and Response

部分的に注釈が付けられた治療と反応を用いた平均治療効果の効率的かつロバストな半教師あり推定

A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Both treatment information and response cannot be accurately captured by readily available EHR features in many studies and require labor-intensive manual chart review to precisely annotate, which limits the number of available gold standard labels on these key variables. We considered average treatment effect (ATE) estimation when 1) exact treatment and outcome variables are only observed together in a small labeled subset and 2) noisy surrogates of treatment and outcome, such as relevant prescription and diagnosis codes, along with potential confounders are observed for all subjects. We derived the efficient influence function for ATE and used it to construct a semi-supervised multiple machine learning (SMMAL) estimator. We justified that our SMMAL ATE estimator is semi-parametric efficient with B-spline regression under low-dimensional smooth models. We developed the adaptive sparsity/model doubly robust estimation under high-dimensional logistic propensity score and outcome regression models. Results from simulation studies demonstrated the validity of our SMMAL method and its superiority over supervised and unsupervised benchmarks. We applied SMMAL to the assessment of targeted therapies for metastatic colorectal cancer in comparison to chemotherapy.

治療効果評価に電子健康記録(EHR)を活用する際の顕著な課題は、受けた治療や反応などの重要な臨床変数に関する正確な情報が不足していることです。多くの研究では、治療情報と反応の両方をすぐに利用できるEHR機能で正確に取得することができず、正確に注釈を付けるには労働集約的な手作業によるカルテレビューが必要であり、そのため、これらの主要変数に利用可能なゴールドスタンダードラベルの数が制限されます。我々は、1)正確な治療変数と結果変数が小さなラベル付きサブセットでのみ一緒に観察され、2)治療と結果のノイズの多い代替変数(関連する処方箋や診断コードなど)が潜在的な交絡因子とともにすべての被験者で観察される場合に、平均治療効果(ATE)の推定を考慮しました。ATEの効率的な影響関数を導出し、それを使用して半教師あり多重機械学習(SMMAL)推定量を構築しました。低次元の滑らかなモデルでは、Bスプライン回帰を用いたSMMAL ATE推定量が半パラメトリックに効率的であることを実証しました。高次元のロジスティック傾向スコアと結果回帰モデルでは、適応型スパース性/モデル二重ロバスト推定を開発しました。シミュレーション研究の結果は、我々のSMMAL法の妥当性と、教師ありおよび教師なしベンチマークに対する優位性を実証しました。我々は、化学療法と比較して転移性大腸がんに対する標的療法の評価にSMMALを適用しました。

Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning

非凸確率的ブレグマン近似勾配法と深層学習への応用

Stochastic gradient methods for minimizing nonconvex composite objective functions typically rely on the Lipschitz smoothness of the differentiable part, but this assumption fails in many important problem classes like quadratic inverse problems and neural network training, leading to instability of the algorithms in both theory and practice. To address this, we propose a family of stochastic Bregman proximal gradient (SBPG) methods that only require smooth adaptivity. SBPG replaces the quadratic approximation in SGD with a Bregman proximity measure, offering a better approximation model that handles non-Lipschitz gradients in nonconvex objectives. We establish the convergence properties of vanilla SBPG and show it achieves optimal sample complexity in the nonconvex setting. Experimental results on quadratic inverse problems demonstrate SBPG’s robustness in terms of stepsize selection and sensitivity to the initial point. Furthermore, we introduce a momentum-based variant, MSBPG, which enhances convergence by relaxing the mini-batch size requirement while preserving the optimal oracle complexity. We apply MSBPG to the training of deep neural networks, utilizing a polynomial kernel function to ensure smooth adaptivity of the loss function. Experimental results on benchmark datasets confirm the effectiveness and robustness of MSBPG in training neural networks. Given its negligible additional computational cost compared to SGD in large-scale optimization, MSBPG shows promise as a universal open-source optimizer for future applications.

非凸複合目的関数を最小化する確率的勾配法は、通常、微分可能部分のLipschitz滑らかさに依存しますが、この仮定は、二次逆問題やニューラルネットワークのトレーニングなど、多くの重要な問題クラスで当てはまらず、理論と実践の両方でアルゴリズムの不安定性につながります。これに対処するために、滑らかな適応性のみを必要とする確率的Bregman近似勾配(SBPG)法のファミリーを提案します。SBPGは、SGDの二次近似をBregman近接測度に置き換え、非凸目的関数の非Lipschitz勾配を処理するより優れた近似モデルを提供します。バニラSBPGの収束特性を確立し、非凸設定で最適なサンプル複雑度を実現することを示します。二次逆問題に関する実験結果により、ステップサイズの選択と初期点に対する感度に関してSBPGが堅牢であることが実証されています。さらに、我々はモメンタムベースのバリアントであるMSBPGを導入します。これは、最適なオラクルの複雑性を維持しながらミニバッチサイズ要件を緩和することで収束を強化します。我々は、損失関数の滑らかな適応性を確保するために多項式カーネル関数を利用して、ディープニューラルネットワークのトレーニングにMSBPGを適用します。ベンチマークデータセットでの実験結果は、ニューラルネットワークのトレーニングにおけるMSBPGの有効性と堅牢性を確認しました。大規模最適化においてSGDと比較して追加の計算コストが無視できることを考えると、MSBPGは将来のアプリケーションのための汎用オープンソース最適化ツールとして有望です。

Optimizing Data Collection for Machine Learning

機械学習のためのデータ収集の最適化

Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm to model the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. This formulation generalizes to tasks with multiple data sources, such as labeled and unlabeled data used in semi-supervised learning, and can be easily modified to customized analyses such as how to introduce data from new classes to an existing model. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.

現代の深層学習システムは、優れた性能を達成するために膨大なデータセットを必要としますが、どの程度の量やどのような種類のデータを収集すべきかについての指針はほとんどありません。データの収集が多すぎると不必要な現在のコストが発生し、収集が少なすぎると将来のコストが発生し、ワークフローが遅延する可能性があります。本研究では、データ収集ワークフローを形式的な最適データ収集問題としてモデル化する新しいパラダイムを提案します。これにより、設計者は性能目標、収集コスト、対象期間、目標達成に失敗した際のペナルティを指定できます。この定式化は、半教師あり学習で使用されるラベル付きデータとラベルなしデータなど、複数のデータソースを持つタスクに一般化され、新しいクラスのデータを既存のモデルに導入する方法などのカスタマイズされた分析にも簡単に変更できます。この問題を解決するために、将来の収集コストを最小化する学習・最適化・収集（LOC）手法を開発しました。最後に、ニューラルネットワークのスケーリング則から外挿してデータ要件を推定する従来のベースラインと、このフレームワークを数値的に比較しました。これにより、総収集コストを低く抑えながら、複数の分類、セグメンテーション、検出タスクにおいて、望ましいパフォーマンス目標を達成できないリスクを大幅に低減します。

Unbalanced Kantorovich-Rubinstein distance, plan, and barycenter on nite spaces: A statistical perspective

有限空間における不均衡カントロビッチ-ルビンシュタイン距離、平面、重心：統計的観点

We analyze statistical properties of plug-in estimators for unbalanced optimal transport quantities between finitely supported measures in different prototypical sampling models. Specifically, our main results provide non-asymptotic bounds on the expected error of empirical Kantorovich-Rubinstein (KR) distance, plans, and barycenters for mass penalty parameter $C>0$. The impact of the mass penalty parameter $C$ is studied in detail. Based on this analysis, we mathematically justify randomized computational schemes for KR quantities which can be used for fast approximate computations in combination with any exact solver. Using synthetic and real datasets, we empirically analyze the behavior of the expected errors in simulation studies and illustrate the validity of our theoretical bounds.

我々は、異なるプロトタイプサンプリングモデルにおける有限支持測度間の不均衡な最適輸送量のプラグイン推定値の統計的特性を分析します。具体的には、我々の主な結果は、質量ペナルティパラメータ$C>0$の場合の経験的カントロビッチ-ルビンスタイン（KR）距離、プラン、および重心の期待誤差に関する非漸近的な境界を提供します。質量ペナルティパラメータ$C$の影響を詳細に検討します。この分析に基づき、任意の厳密解法と組み合わせて高速近似計算に使用できるKR量のランダム化計算スキームを数学的に正当化します。合成データセットと実データセットを用いて、シミュレーション研究における期待誤差の挙動を経験的に分析し、理論的限界の妥当性を示します。

Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding

観測されない交絡因子を含む多重治療因果推論のためのコピュラに基づく感度分析

Recent work has focused on the potential and pitfalls of causal identification in observational studies with multiple simultaneous treatments. Building on previous work, we show that even if the conditional distribution of unmeasured confounders given treatments were known exactly, the causal effects would not in general be identifiable, although they may be partially identified. Given these results, we propose a sensitivity analysis method for characterizing the effects of potential unmeasured confounding, tailored to the multiple treatment setting, that can be used to characterize a range of causal effects that are compatible with the observed data. Our method is based on a copula factorization of the joint distribution of outcomes, treatments, and confounders, and can be layered on top of arbitrary observed data models. We propose a practical implementation of this approach making use of the Gaussian copula, and establish conditions under which causal effects can be bounded. We also describe approaches for reasoning about effects, including calibrating sensitivity parameters, quantifying robustness of effect estimates, and selecting models that are most consistent with prior hypotheses.

最近の研究は、複数の同時治療を伴う観察研究における因果同定の可能性と落とし穴に焦点を当てています。先行研究に基づき、たとえ治療を与えられた際の測定されていない交絡因子の条件付き分布が正確に分かっていたとしても、因果効果は部分的には同定できるものの、一般的には同定できないことを示す。これらの結果に基づき、複数の治療設定に合わせて調整された、潜在的な測定されていない交絡因子の影響を特徴付けるための感度分析手法を提案します。この手法は、観測データと整合する様々な因果効果を特徴付けるために使用できます。我々の手法は、結果、治療、および交絡因子の同時分布のコピュラ分解に基づいており、任意の観測データモデルの上に重ねることができます。我々は、ガウスコピュラを用いたこのアプローチの実際的な実装を提案し、因果効果を有界化できる条件を確立します。また、感度パラメータの較正、効果推定値の堅牢性の定量化、事前仮説と最も整合するモデルの選択など、効果について推論するためのアプローチについても説明します。

Rank-one Convexification for Sparse Regression

スパース回帰におけるランク1凸化

Sparse regression models are increasingly prevalent due to their ease of interpretability and superior out-of-sample performance. However, the exact model of sparse regression with an $\ell_0$-constraint restricting the support of the estimators is a challenging (\NP-hard) non-convex optimization problem. In this paper, we derive new strong convex relaxations for sparse regression. These relaxations are based on the convex-hull formulations for rank-one quadratic terms with indicator variables. The new relaxations can be formulated as semidefinite optimization problems in an extended space and are stronger and more general than the state-of-the-art formulations, including the perspective reformulation and formulations with the reverse Huber penalty and the minimax concave penalty functions. Furthermore, the proposed rank-one strengthening can be interpreted as a non-separable, non-convex, unbiased sparsity-inducing regularizer, which dynamically adjusts its penalty according to the shape of the error function without inducing bias for the sparse solutions. In our computational experiments with benchmark datasets, the proposed conic formulations are solved within seconds and result in near-optimal solutions (with 0.4\% optimality gap on average) for non-convex $\ell_0$-problems. Moreover, the resulting estimators also outperform alternative convex approaches, such as lasso and elastic net regression, from a statistical perspective, achieving high prediction accuracy and good interpretability.

スパース回帰モデルは、その解釈の容易さと優れたアウトオブサンプル性能から、ますます普及しています。しかし、推定値のサポートを制限する$\ell_0$制約を伴うスパース回帰の正確なモデルは、困難な（\NP困難）非凸最適化問題です。本稿では、スパース回帰のための新しい強力な凸緩和を導出します。これらの緩和は、指標変数を含むランク1の2次項の凸包定式に基づいています。新しい緩和は、拡張空間における半正定値最適化問題として定式化でき、パースペクティブ再定式化、逆Huberペナルティ関数、ミニマックス凹ペナルティ関数を含む最新の定式化よりも強力で汎用性があります。さらに、提案されたランク1強化は、非分離、非凸、不偏スパース性誘導正則化として解釈でき、スパース解にバイアスを誘導することなく、誤差関数の形状に応じてペナルティを動的に調整します。ベンチマークデータセットを用いた計算実験において、提案された円錐定式化は数秒以内に解かれ、非凸$\ell_0$問題に対してほぼ最適な解（平均0.4%の最適性ギャップ）が得られます。さらに、得られた推定値は統計的観点からLasso回帰やElastic Net Regressionなどの代替凸アプローチよりも優れており、高い予測精度と良好な解釈可能性を実現しています。

gsplat: An Open-Source Library for Gaussian Splatting

gsplat：ガウシアンスポッティングのオープンソースライブラリ

gsplat is an open-source library designed for training and developing Gaussian Splatting methods. It features a front-end with Python bindings compatible with the PyTorch library and a back-end with highly optimized CUDA kernels. gsplat offers numerous features that enhance the optimization of Gaussian Splatting models, which include optimization improvements for speed, memory, and convergence times. Experimental results demonstrate that gsplat achieves up to 10% less training time and 4x less memory than the original implementation. Utilized in several research projects, gsplat is actively maintained on GitHub. Source code is available at https://github.com/nerfstudio-project/gsplat under Apache License 2.0. We welcome contributions from the open-source community.

gsplatは、ガウススプラッティング手法のトレーニングと開発のために設計されたオープンソースライブラリです。PyTorchライブラリと互換性のあるPythonバインディングを備えたフロントエンドと、高度に最適化されたCUDAカーネルを持つバックエンドを特徴としています。gsplatは、速度、メモリ、および収束時間の最適化改善を含む、ガウススプラッティングモデルの最適化を強化する多数の機能を提供します。実験結果は、gsplatが元の実装に比べて最大10％のトレーニング時間の短縮と4倍のメモリ削減を達成することを示しています。いくつかの研究プロジェクトで利用されているgsplatは、GitHubで積極的にメンテナンスされています。ソースコードは、Apache License 2.0の下でhttps://github.com/nerfstudio-project/gsplatで入手可能です。私たちはオープンソースコミュニティからの貢献を歓迎します。

Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming

スケッチされた逐次二次計画法による制約付き確率最適化の統計的推論

We consider online statistical inference of constrained stochastic nonlinear optimization problems. We apply the Stochastic Sequential Quadratic Programming (StoSQP) method to solve these problems, which can be regarded as applying second-order Newton’s method to the Karush-Kuhn-Tucker (KKT) conditions. In each iteration, the StoSQP method computes the Newton direction by solving a quadratic program, and then selects a proper adaptive stepsize $\bar{\alpha}_t$ to update the primal-dual iterate. To reduce dominant computational cost of the method, we inexactly solve the quadratic program in each iteration by employing an iterative sketching solver. Notably, the approximation error of the sketching solver need not vanish as iterations proceed, meaning that the per-iteration computational cost does not blow up. For the above StoSQP method, we show that under mild assumptions, the rescaled primal-dual sequence $1/\sqrt{\bar{\alpha}_t}\cdot (x_t -x^\star, \lambda_t – \lambda^\star)$ converges to a mean-zero Gaussian distribution with a nontrivial covariance matrix depending on the underlying sketching distribution. To perform inference in practice, we also analyze a plug-in covariance matrix estimator. We illustrate the asymptotic normality result of the method both on benchmark nonlinear problems in CUTEst test set and on linearly/nonlinearly constrained regression problems.

制約付き確率的非線形最適化問題のオンライン統計的推論について考察します。これらの問題を解決するために、Stochastic Sequential Quadratic Programming (StoSQP) 法を適用し、これは Karush-Kuhn-Tucker (KKT) 条件に 2 次ニュートン法を適用したと見なすことができます。各反復で、StoSQP法は二次計画法を解いてニュートン方向を計算し、適切な適応ステップサイズ$bar{alpha}_t$を選択して主双対反復を更新します。この手法の主要な計算コストを削減するために、反復スケッチソルバーを使用して、各反復で二次計画を不正確に解きます。特に、スケッチソルバーの近似誤差は、反復が進行しても消える必要がないため、反復ごとの計算コストが爆発的に増加することはありません。上記のStoSQP法では、穏やかな仮定の下で、再スケーリングされた主双対列$1/sqrt{bar{alpha}_t}cdot（x_t -x^star, lambda_t – lambda^star)$が、基礎となるスケッチ分布に応じて非自明な共分散行列を持つ平均ゼロのガウス分布に収束することを示します。実際に推論を実行するために、プラグイン共分散行列推定器も分析します。この手法の漸近正規性の結果を、CUTEst テストセットのベンチマーク非線形問題と線形/非線形制約付き回帰問題の両方で説明します。

Sliced-Wasserstein Distances and Flows on Cartan-Hadamard Manifolds

スライス・ワッサースタイン距離とカルタン・ハダマール多様体上のフロー

While many Machine Learning methods have been developed or transposed on Riemannian manifolds to tackle data with known non-Euclidean geometry, Optimal Transport (OT) methods on such spaces have not received much attention. The main OT tool on these spaces is the Wasserstein distance, which suffers from a heavy computational burden. On Euclidean spaces, a popular alternative is the Sliced-Wasserstein distance, which leverages a closed-form solution of the Wasserstein distance in one dimension, but which is not readily available on manifolds. In this work, we derive general constructions of Sliced-Wasserstein distances on Cartan-Hadamard manifolds, Riemannian manifolds with non-positive curvature, which include among others Hyperbolic spaces or the space of Symmetric Positive Definite matrices. Then, we propose different applications such as classification of documents with a suitably learned ground cost on a manifold, and data set comparison on a product manifold. Additionally, we derive non-parametric schemes to minimize these new distances by approximating their Wasserstein gradient flows.

多くの機械学習手法が既知の非ユークリッド幾何を持つデータに対処するためにリーマン多様体に発展または転送されてきた一方で、そのような空間における最適輸送（OT）手法はあまり注目されていません。これらの空間における主なOTツールはワッサースタイン距離であり、重い計算負担に悩まされています。ユークリッド空間では、人気のある代替案はスライスワッサースタイン距離であり、これは1次元におけるワッサースタイン距離の閉形式解を利用しますが、多様体上では容易に利用できるわけではありません。この研究では、双曲空間や対称正定値行列の空間を含む、非正曲率のリーマン多様体であるカルタン-ハダマール多様体上のスライス-ワッサースタイン距離の一般的な構成を導出します。その後、適切に学習された基準コストを多様体上で用いた文書の分類や、積多様体上でのデータセットの比較など、さまざまな応用を提案します。さらに、これらの新しい距離をワッサースタイン勾配流を近似することで最小化する非パラメトリック手法を導出します。

Accelerating optimization over the space of probability measures

確率測度の空間における最適化の加速

The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates the exploration of accelerated gradient methods in this context, too. To this end, we introduce a Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space. We demonstrate that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order. We complement our findings with numerical examples.

勾配ベースの最適化手法の加速は、特に機械学習の応用において、実用的および理論的に重要なテーマです。ユークリッド空間内の最適化に向けられた多くの関心がある一方で、機械学習において確率測度の空間に対して最適化する必要性が、この文脈における加速勾配法の探求を促しています。そのために、ユークリッド空間におけるモーメントベースのアプローチに類似したハミルトン流アプローチを導入します。このアプローチに基づくアルゴリズムは、連続時間設定において、任意の高次の収束率を達成できることを示します。また、数値例をもって結果を補完します。

Bayesian Multi-Group Gaussian Process Models for Heterogeneous Group-Structured Data

異種グループ構造データのためのベイズ多群ガウス過程モデル

Gaussian processes are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Scientific data are often heterogeneous in their inputs and contain multiple known discrete groups of samples; thus, it is desirable to leverage the similarity among groups while accounting for heterogeneity across groups. We propose multi-group Gaussian processes (MGGPs) defined over $\mathbb{R}^p\times \mathscr{C}$, where $\mathscr{C}$ is a finite set representing the group label, by developing general classes of valid (positive definite) covariance functions on such domains. MGGPs are able to accurately recover relationships between the groups and efficiently share strength across samples from all groups during inference, while capturing distinct group-specific behaviors in the conditional posterior distributions. We demonstrate inference in MGGPs through simulation experiments, and we apply our proposed MGGP regression framework to gene expression data to illustrate the behavior and enhanced inferential capabilities of multi-group Gaussian processes by jointly modeling continuous and categorical variables.

ガウス過程は、機能データ分析、機械学習、空間統計において、複雑な依存関係をモデル化するために広く用いられています。科学データは、その入力がしばしば異質であり、複数の既知の離散サンプル群を含むため、群間の異質性を考慮しながら群間の類似性を活用することが望ましいです。我々は、群ラベルを表す有限集合である(mathscr{C}) の上で定義された多群ガウス過程（MGGP）を提案し、そのようなドメイン上で一般的な有効（正定値）共分散関数のクラスを開発します。MGGPは、推論中にすべてのグループからのサンプル間で関係を正確に回復し、強度を効率的に共有しながら、条件付き後分布における特定のグループの特性を捉えることができます。私たちはシミュレーション実験を通じてMGGPでの推論を示し、遺伝子発現データに対して提案したMGGP回帰フレームワークを適用して、連続変数とカテゴリカル変数を共同モデル化することにより、多グループガウス過程の挙動と強化された推論能力を示します。

Orthogonal Bases for Equivariant Graph Learning with Provable k-WL Expressive Power

証明可能なk-WL表現力を持つ対称グラフ学習のための直交基底

Graph neural network (GNN) models have been widely used for learning graph-structured data. Due to the permutation-invariant requirement of graph learning tasks, a basic element in graph neural networks is the invariant and equivariant linear layers. Previous work (Maron et al., 2019b) provided a maximal collection of invariant and equivariant linear layers and a simple deep neural network model, called k-IGN, for graph data defined on k-tuples of nodes. It is shown that the expressive power of k-IGN is at least as good as the k-Weisfeiler-Leman (WL) algorithm in graph isomorphism tests. However, the dimension of the invariant layer and equivariant layer is the k-th and 2k-th bell numbers, respectively. Such high complexity makes it computationally infeasible for k-IGNs with k >= 3. In this paper, we show that a much smaller dimension for the linear layers is sufficient to achieve the same expressive power. We provide two sets of orthogonal bases for the linear layers, each with only 3(2^k-1)-k basis elements. Based on these linear layers, we develop neural network models GNN-a and GNN-b and show that for the graph data defined on k-tuples of data, GNN-a and GNN-b achieve the expressive power of the k-WL algorithm and the (k+1)-WL algorithm in graph isomorphism tests, respectively. In molecular prediction tasks on benchmark datasets, we demonstrate that low-order neural network models consisting of the proposed linear layers achieve better performance than other neural network models. In particular, order-2 GNN-b and order-3 GNN-a both have 3-WL expressive power, but use a much smaller basis and hence much less computation time than known neural network models.

グラフニューラルネットワーク（GNN）モデルは、グラフ構造データの学習に広く使用されています。グラフ学習タスクの置換不変要件により、グラフニューラルネットワークの基本的な要素は不変かつ共変な線形層です。以前の研究（Maron et al., 2019b）では、不変かつ共変の線形層の最大コレクションと、k-タプルノード上で定義されたグラフデータ向けのシンプルな深層ニューラルネットワークモデルであるk-IGNが提供されました。k-IGNの表現力は、グラフ同型性テストにおけるk-Weisfeiler-Leman（WL）アルゴリズムと同等以上であることが示されました。しかし、不変層と共変層の次元はそれぞれk番目と2k番目のベル数です。このような高い複雑さは、k >= 3のk-IGNには計算上不可能です。本論文では、線形層においてはるかに小さい次元が同じ表現力を達成するのに十分であることを示します。我々は、線形層のために直交基底の2セットを提供し、それぞれに3(2^k-1)-kの基底要素のみを含んでいます。これらの線形層に基づいて、ニューラルネットワークモデルGNN-aとGNN-bを開発し、データのkタプル上で定義されたグラフデータに対して、GNN-aとGNN-bがグラフ同型性テストにおけるk-WLアルゴリズムと(k+1)-WLアルゴリズムの表現力をそれぞれ達成することを示します。ベンチマークデータセットにおける分子予測タスクでは、提案された線形層から成る低次のニューラルネットワークモデルが、他のニューラルネットワークモデルよりも優れた性能を達成することを示します。特に、順序2のGNN-bと順序3のGNN-aの両方が3-WLの表現力を持っていますが、知られているニューラルネットワークモデルよりもはるかに小さな基底を使用し、したがってはるかに少ない計算時間を要します。

Optimal Experiment Design for Causal Effect Identification

因果効果の特定のための最適実験デザイン

Pearl’s do calculus is a complete axiomatic approach to learn the identifiable causal effects from observational data. When such an effect is not identifiable, it is necessary to perform a collection of often costly interventions in the system to learn the causal effect. In this work, we consider the problem of designing a collection of interventions with the minimum cost to identify the desired effect. First, we prove that this problem is NP-complete and subsequently propose an algorithm that can either find the optimal solution or a logarithmic-factor approximation of it. This is done by establishing a connection between our problem and the minimum hitting set problem. Additionally, we propose several polynomial time heuristic algorithms to tackle the computational complexity of the problem. Although these algorithms could potentially stumble on sub-optimal solutions, our simulations show that they achieve small regrets on random graphs.

パールの因果計算は、観察データから因果効果を学ぶための完全な公理的アプローチです。このような効果が特定できない場合、因果効果を学ぶために、しばしば高価な介入の集合をシステムに対して行う必要があります。本研究では、望ましい効果を特定するために、最小コストで介入の集合を設計する問題を考えます。まず、この問題がNP完全であることを証明し、その後、最適解を見つけるか、対数的近似を提供するアルゴリズムを提案します。これは、私たちの問題と最小ヒット集合問題との間に関係を確立することで実現されます。さらに、問題の計算の複雑さに対処するために、いくつかの多項式時間ヒューリスティックアルゴリズムを提案します。これらのアルゴリズムは、最適でない解に出くわす可能性がありますが、シミュレーションではランダムグラフに対して小さな後悔を達成することが示されています。

Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data

平均集約器は、分散した異種データに対するラベルポイズニング攻撃において、堅牢な集約器よりもより堅牢です。

Robustness to malicious attacks is of paramount importance for distributed learning. Existing works usually consider the classical Byzantine attacks model, which assumes that some workers can send arbitrarily malicious messages to the server and disturb the aggregation steps of the distributed learning process. To defend against such worst-case Byzantine attacks, various robust aggregators have been proposed. They are proven to be effective and much superior to the often-used mean aggregator. In this paper, however, we demonstrate that the robust aggregators are too conservative for a class of weak but practical malicious attacks, known as label poisoning attacks, where the sample labels of some workers are poisoned. Surprisingly, we are able to show that the mean aggregator is more robust than the state-of-the-art robust aggregators in theory, given that the distributed data are sufficiently heterogeneous. In fact, the learning error of the mean aggregator is proven to be order-optimal in this case. Experimental results corroborate our theoretical findings, showing the superiority of the mean aggregator under label poisoning attacks.

悪意のある攻撃に対する堅牢性は、分散学習において極めて重要です。既存の研究は通常、いくつかの作業者がサーバーに恣意的な悪意のあるメッセージを送信し、分散学習プロセスの集約ステップを妨害できるという、古典的なビザンチン攻撃モデルを考慮しています。このような最悪のケースのビザンチン攻撃に対抗するために、さまざまな堅牢な集約器が提案されています。これらは効果的であり、しばしば使用される平均集約器よりも優れていることが証明されています。しかし本論文では、堅牢な集約器が、いくつかの作業者のサンプルラベルが汚染される「ラベル汚染攻撃」として知られる弱いが実用的な悪意のある攻撃のクラスに対して過度に保守的であることを示します。驚くべきことに、分散データが十分に異種である場合、平均集約器が最新のロバスト集約器よりも理論的にロバストであることを示すことができます。実際、この場合の平均集約器の学習誤差は順序最適であることが証明されています。実験結果は私たちの理論的結果を裏付けており、ラベルポイズニング攻撃の下での平均集約器の優位性を示しています。

The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond

連合Q学習における異質性の祝福：線形スピードアップとその先

In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focusing on infinite-horizon tabular Markov decision processes, we provide sample complexity guarantees for both the synchronous and asynchronous variants of federated Q-learning, which exhibit a linear speedup with respect to the number of agents and near-optimal dependencies on other salient problem parameters. In the asynchronous setting, existing analyses of federated Q-learning, which adopt an equally weighted averaging of local Q-estimates, require that every agent covers the entire state-action space. In contrast, our improved sample complexity scales inverse proportionally to the minimum entry of the average stationary state-action occupancy distribution of all agents, thus only requiring the agents to collectively cover the entire state-action space, unveiling the blessing of heterogeneity. However, its sample complexity still suffers when the local trajectories are highly heterogeneous. In response, we propose a novel federated Q-learning algorithm with importance averaging, giving larger weights to more frequently visited state-action pairs, which achieves a robust linear speedup as if all trajectories are centrally processed, regardless of the heterogeneity of local behavior policies.

本論文では、ローカルデータのみに基づいて訓練されたローカルQ推定値を定期的に集約することで最適なQ関数を学習することを目的とした連合Q学習を考察します。無限ホライズンのタブラー形式のマルコフ決定プロセスに焦点を当て、エージェントの数に対して線形のスピードアップを示し、他の重要な問題パラメータに対してほぼ最適な依存関係を持つ連合Q学習の同期および非同期のバリアントについてサンプルの複雑性保証を提供します。非同期設定では、ローカルQ推定値の等重み平均を採用した既存の連合Q学習の分析は、すべてのエージェントが完全な状態-行動空間をカバーする必要があります。対照的に、我々の改善されたサンプル複雑性は、全てのエージェントの平均定常状態-行動占有分布の最小エントリーに対して逆比例的にスケールし、したがってエージェントが全体の状態-行動空間を集団でカバーすることだけを要求し、異質性の祝福を明らかにします。しかし、ローカルの軌道が非常に異質である場合、そのサンプル複雑性は依然として影響を受けます。それに応えて、我々は重要度平均を用いた新しい連合Q学習アルゴリズムを提案し、より頻繁に訪問される状態-行動ペアにより大きな重みを与え、ローカル行動ポリシーの異質性にかかわらず、全ての軌道が中央で処理されたかのように堅牢な線形スピードアップを実現します。

depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

depyf：機械学習研究者のためのPyTorchコンパイラーの不透明なボックスを開ける

PyTorch 2.x introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, fully leveraging the PyTorch compiler can be challenging due to its operation at the Python bytecode level, making it appear as an opaque box. To address this, we introduce depyf, a tool designed to demystify the inner workings of the PyTorch compiler. depyf decompiles the bytecode generated by PyTorch back into equivalent source code and establishes connections between the code objects in the memory and their counterparts in source code format on the disk. This feature enables users to step through the source code line by line using debuggers, thus enhancing their understanding of the underlying processes. Notably, depyf is non-intrusive and user-friendly, primarily relying on two convenient context managers for its core functionality. The project is openly available at https://github.com/thuml/depyf and is recognized as a PyTorch ecosystem project at https://pytorch.org/blog/introducing-depyf.

PyTorch 2.xは、深層学習プログラムを加速するためのコンパイラを導入しました。しかし、機械学習研究者にとって、PyTorchコンパイラを完全に活用することは、Pythonバイトコードレベルでの動作のために難しいことがあります。それはあたかも不透明な箱のように見えるためです。これを解決するために、PyTorchコンパイラの内部動作を明らかにすることを目的としたツールであるdepyfを紹介します。depyfは、PyTorchによって生成されたバイトコードを同等のソースコードにデコンパイルし、メモリ内のコードオブジェクトとディスク上のソースコードフォーマットの対応するオブジェクトの間の接続を確立します。この機能は、ユーザーがデバッガを使用してソースコードを1行ずつステップ実行できるようにし、基本的なプロセスの理解を深めます。特に、depyfは非侵襲的で使いやすく、主に2つの便利なコンテキストマネージャに依存してそのコア機能を提供します。このプロジェクトは https://github.com/thuml/depyf で公開されており、 https://pytorch.org/blog/introducing-depyf ではPyTorchエコシステムプロジェクトとして認識されています。

The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise

マルコフノイズを伴う確率近似と強化学習のためのODE法

Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning. One fundamental challenge in analyzing a stochastic approximation algorithm is to establish its stability, i.e., to show that the stochastic vector iterates are bounded almost surely. In this paper, we extend the celebrated Borkar-Meyn theorem for stability from the Martingale difference noise setting to the Markovian noise setting, which greatly improves its applicability in reinforcement learning, especially in those off-policy reinforcement learning algorithms with linear function approximation and eligibility traces. Central to our analysis is the diminishing asymptotic rate of change of a few functions, which is implied by both a form of the strong law of large numbers and a form of the law of the iterated logarithm.

確率近似は、ベクトルを反復的かつ漸次的に確率的に更新するアルゴリズムのクラスであり、例えば、確率的勾配降下法や時間差学習を含みます。確率近似アルゴリズムを分析する際の基本的な課題は、その安定性を確立すること、すなわち確率ベクトル反復がほぼ確実に有界であることを示すことです。本論文では、マーチンゲール差ノイズ設定からマルコフノイズ設定への安定性に関する著名なボルカー・メイン定理を拡張し、特に線形関数近似と適格トレースを用いたオフポリシー強化学習アルゴリズムにおける適用可能性を大幅に向上させます。私たちの分析の中心には、強い大数の法則の一形態と反復対数の法則の一形態の両方から示されるいくつかの関数の漸近的変化率の減少があります。

Improving Graph Neural Networks on Multi-node Tasks with the Labeling Trick

ラベリングトリックを用いたマルチノードタスクにおけるグラフニューラルネットワークの改善

In this paper, we study using graph neural networks (GNNs) for multi-node representation learning, where a representation for a set of more than one node (such as a link) is to be learned. Existing GNNs are mainly designed to learn single-node representations. When used for multi-node representation learning, a common practice is to directly aggregate the single-node representations obtained by a GNN. In this paper, we show a fundamental limitation of such an approach, namely the inability to capture the dependence among multiple nodes in the node set. A straightforward solution is to distinguish target nodes from others. Formalizing this idea, we propose \text{labeling trick}, which first labels nodes in the graph according to their relationships with the target node set before applying a GNN and then aggregates node representations obtained in the labeled graph for multi-node representations. Besides node sets in graphs, we also extend labeling tricks to posets, subsets and hypergraphs. Experiments verify that the labeling trick technique can boost GNNs on various tasks, including undirected link prediction, directed link prediction, hyperedge prediction, and subgraph prediction. Our work explains the superior performance of previous node-labeling-based methods and establishes a theoretical foundation for using GNNs for multi-node representation learning.

本論文では、複数のノード（リンクなど）を対象とした表現学習のためにグラフニューラルネットワーク（GNN）を使用する研究を行います。既存のGNNは主に単一ノードの表現を学習するように設計されています。複数ノードの表現学習に使用される際の一般的な手法は、GNNによって得られた単一ノードの表現を直接集約することです。本論文では、そのようなアプローチの根本的な制限、つまりノードセット内の複数ノード間の依存関係を捉えることができないことを示します。直接的な解決策は、ターゲットノードと他のノードを区別することです。このアイデアを形式化し、

Directed Cyclic Graphs for Simultaneous Discovery of Time-Lagged and Instantaneous Causality from Longitudinal Data Using Instrumental Variables

長期データからの器具変数を使用した時間遅延および瞬時因果関係の同時発見のための有向サイクリックグラフ

We consider the problem of causal discovery from longitudinal observational data. We develop a novel framework that simultaneously discovers the time-lagged causality and the possibly cyclic instantaneous causality. Under common causal discovery assumptions, combined with additional instrumental information typically available in longitudinal data, we prove the proposed model is generally identifiable. To the best of our knowledge, this is the first causal identification theory for directed graphs with general cyclic patterns that achieves unique causal identifiability. Structural learning is carried out in a fully Bayesian fashion. Through extensive simulations and an application to the Women’s Interagency HIV Study, we demonstrate the identifiability, utility, and superiority of the proposed model against state-of-the-art alternative methods.

我々は、縦断的観察データからの因果発見の問題を考察します。時間的遅延因果関係と、場合によっては循環的瞬間因果関係を同時に発見する新しいフレームワークを開発します。一般的な因果発見の仮定に基づき、縦断的データで通常利用可能な追加の計器情報と結び付けて、提案されたモデルが一般に同定可能であることを証明します。我々の知る限り、これは一般的な循環パターンを持つ有向グラフに対する最初の因果同定理論で、ユニークな因果同定性を達成します。構造学習は完全にベイズ的な方法で行われます。広範なシミュレーションと女性間HIV研究への応用を通じて、提案されたモデルの同定可能性、実用性、そして最新の代替手法に対する優位性を示します。

Bayesian Sparse Gaussian Mixture Model for Clustering in High Dimensions

高次元でのクラスタリングのためのベイジアン・スパース・ガウス混合モデル

We study the sparse high-dimensional Gaussian mixture model when the number of clusters is allowed to grow with the sample size. A minimax lower bound for parameter estimation is established, and we show that a constrained maximum likelihood estimator achieves the minimax lower bound. However, this optimization-based estimator is computationally intractable because the objective function is highly nonconvex and the feasible set involves discrete structures. To address the computational challenge, we propose a computationally tractable Bayesian approach to estimate high-dimensional Gaussian mixtures whose cluster centers exhibit sparsity using a continuous spike-and-slab prior. We further prove that the posterior contraction rate of the proposed Bayesian method is minimax optimal. The mis- clustering rate is obtained as a by-product using tools from matrix perturbation theory. The proposed Bayesian sparse Gaussian mixture model does not require pre-specifying the number of clusters, which can be adaptively estimated. The validity and usefulness of the proposed method is demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing data set.

サンプルサイズに応じてクラスタ数が増加することを許可したときのスパース高次元ガウス混合モデルについて研究します。パラメータ推定のためのミニマックス下限が確立され、制約付き最尤推定量がミニマックス下限を達成することを示します。しかし、最適化ベースの推定量は、目的関数が非常に非凸であり、許可される集合が離散構造を含むため、計算上の処理が困難です。この計算上の課題に対処するために、連続スパイク・スラブ事前分布を使用してスパース性を示すクラスタ中心を持つ高次元ガウス混合を推定するための計算上実行可能なベイジアンアプローチを提案します。我々はさらに、提案されたベイズ手法の事後収束率がミニマックス最適であることを証明します。ミスクラスタリング率は、行列摂動理論のツールを使用して副産物として得られます。提案されたベイズスパースガウス混合モデルは、クラスタ数を事前に指定する必要がなく、適応的に推定可能です。提案された手法の妥当性と有用性は、シミュレーション研究および実世界の単細胞RNAシーケンシングデータセットの分析を通じて示されています。

Regularizing Hard Examples Improves Adversarial Robustness

難例の正則化が敵対的ロバスト性を向上させる

Recent studies have validated that pruning hard-to-learn examples from training improves the generalization performance of neural networks (NNs). In this study, we investigate this intriguing phenomenon—the negative effect of hard examples on generalization—in adversarial training. Particularly, we theoretically demonstrate that the increase in the difficulty of hard examples in adversarial training is significantly greater than the increase in the difficulty of easy examples. Furthermore, we verify that hard examples are only fitted through memorization of the label in adversarial training. We conduct both theoretical and empirical analyses of this memorization phenomenon, showing that pruning hard examples in adversarial training can enhance the model’s robustness. However, the challenge remains in finding the optimal threshold for removing hard examples that degrade robustness performance. Based upon these observations, we propose a new approach, difficulty proportional label smoothing (DPLS), to adaptively mitigate the negative effect of hard examples, thereby improving the adversarial robustness of NNs. Notably, our experimental result indicates that our method can successfully leverage hard examples while circumventing the negative effect.

最近の研究では、学習が難しい例をトレーニングから剪定することで、ニューラルネットワーク（NN）の一般化性能が向上することが確認されています。本研究では、この興味深い現象、すなわち逆襲トレーニングにおける一般化への難しい例の負の影響を調査します。特に、逆襲トレーニングにおける難しい例の難易度の増加は、簡単な例の難易度の増加よりも著しく大きいことを理論的に示します。さらに、逆襲トレーニングにおいて、難しい例はラベルの記憶を通じてのみ適合することを確認します。この記憶現象に関する理論的および実証的分析を行い、逆襲トレーニングで難しい例を剪定することがモデルの堅牢性を向上させることを示します。しかし、ロバスト性のパフォーマンスを低下させるハードな例を除去するための最適なしきい値を見つけることは依然として課題です。これらの観察に基づいて、我々はハードな例の悪影響を自動的に緩和し、ニューラルネットワークの対敵ロバスト性を向上させるための新しいアプローチ、難易度比例ラベルスムージング（DPLS）を提案します。特に、我々の実験結果は、我々の方法がハードな例を効果的に活用しながら、その悪影響を回避できることを示しています。

Random ReLU Neural Networks as Non-Gaussian Processes

非ガウス過程としてのランダムReLUニューラルネットワーク

We consider a large class of shallow neural networks with randomly initialized parameters and rectified linear unit activation functions. We prove that these random neural networks are well-defined non-Gaussian processes. As a by-product, we demonstrate that these networks are solutions to stochastic differential equations driven by impulsive white noise (combinations of random Dirac measures). These processes are parameterized by the law of the weights and biases as well as the density of activation thresholds in each bounded region of the input domain. We prove that these processes are isotropic and wide-sense self-similar with Hurst exponent 3/2. We also derive a remarkably simple closed-form expression for their autocovariance function. Our results are fundamentally different from prior work in that we consider a non-asymptotic viewpoint: The number of neurons in each bounded region of the input domain (i.e., the width) is itself a random variable with a Poisson law with mean proportional to the density parameter. Finally, we show that, under suitable hypotheses, as the expected width tends to infinity, these processes can converge in law not only to Gaussian processes, but also to non-Gaussian processes depending on the law of the weights. Our asymptotic results provide a new take on several classical results (wide networks converge to Gaussian processes) as well as some new ones (wide networks can converge to non-Gaussian processes).

我々は、ランダムに初期化されたパラメータと整流線形ユニット活性化関数を持つ大規模な浅いニューラルネットワークのクラスを考察します。これらのランダムニューラルネットワークが良く定義された非ガウス過程であることを証明します。その副産物として、これらのネットワークが衝撃的なホワイトノイズ（ランダムなディラック測度の組み合わせ）によって駆動される確率微分方程式の解であることを示します。これらの過程は、重みとバイアスの法則、および入力領域の各有界領域における活性化閾値の密度によってパラメータ化されます。これらの過程が各方向に等方的であり、広義の自己相似性を持ち、ハースト指数が3/2であることを証明します。また、自己共分散関数の驚くほど単純な閉形式の表現を導出します。私たちの結果は、非漸近的な視点を考慮するという点で、以前の研究とは根本的に異なります。入力領域の各有界領域におけるニューロンの数（すなわち幅）は、密度パラメータに比例する平均を持つポアソン法則に従うランダム変数です。最後に、適切な仮定の下で、期待される幅が無限大に近づくとき、これらのプロセスは法則においてガウス過程だけでなく、重みの法則に依存する非ガウス過程にも収束することを示します。私たちの漸近的な結果は、いくつかの古典的な結果（広いネットワークはガウス過程に収束する）に新たな視点を提供するだけでなく、いくつかの新しい結果（広いネットワークは非ガウス過程に収束する可能性がある）も示しています。

Riemannian Bilevel Optimization

リーマン二層最適化

In this work, we consider the bilevel optimization problem on Riemannian manifolds. We inspect the calculation of the hypergradient of such problems on general manifolds and thus enable the utilization of gradient-based algorithms to solve such problems. The calculation of the hypergradient requires utilizing the notion of Riemannian cross-derivative and we inspect the properties and the numerical calculations of Riemannian cross-derivatives. Algorithms in both deterministic and stochastic settings, named respectively RieBO and RieSBO, are proposed that include the existing Euclidean bilevel optimization algorithms as special cases. Numerical experiments on robust optimization on Riemannian manifolds are presented to show the applicability and efficiency of the proposed methods.

本研究では、リーマン多様体上の二層最適化問題を考察します。このような問題の一般的な多様体におけるハイパー勾配の計算を検討し、勾配ベースのアルゴリズムを用いてこれらの問題を解決できるようにします。ハイパー勾配の計算にはリーマン交差微分の概念を利用する必要があり、リーマン交差微分の特性と数値計算を検討します。決定論的および確率的な設定において、それぞれRieBOおよびRieSBOと名付けられたアルゴリズムが提案され、既存のユークリッド二層最適化アルゴリズムが特別なケースとして含まれています。リーマン多様体上のロバスト最適化に関する数値実験が提示され、提案された手法の適用性と効率性を示します。

Supervised Learning with Evolving Tasks and Performance Guarantees

進化するタスクとパフォーマンス保証を伴う教師あり学習

Multiple supervised learning scenarios are composed by a sequence of classification tasks. For instance, multi-task learning and continual learning aim to learn a sequence of tasks that is either fixed or grows over time. Existing techniques for learning tasks that are in a sequence are tailored to specific scenarios, lacking adaptability to others. In addition, most of existing techniques consider situations in which the order of the tasks in the sequence is not relevant. However, it is common that tasks in a sequence are evolving in the sense that consecutive tasks often have a higher similarity. This paper presents a learning methodology that is applicable to multiple supervised learning scenarios and adapts to evolving tasks. Differently from existing techniques, we provide computable tight performance guarantees and analytically characterize the increase in the effective sample size. Experiments on benchmark datasets show the performance improvement of the proposed methodology in multiple scenarios and the reliability of the presented performance guarantees.

複数の教師あり学習シナリオは、一連の分類タスクで構成されています。例えば、マルチタスク学習や継続学習は、固定されたタスクのシーケンスまたは時間とともに成長するタスクのシーケンスを学習することを目的としています。シーケンス内のタスクを学習するための既存の技術は特定のシナリオに特化しており、他のシナリオへの適応性に欠けています。さらに、既存の技術のほとんどは、シーケンス内のタスクの順序が重要でない状況を考慮しています。しかし、シーケンス内のタスクは進化していることが一般的であり、連続するタスクはしばしば高い類似性を持っています。本論文では、複数の教師あり学習シナリオに適用可能で、進化するタスクに適応する学習方法論を提示します。既存の技術とは異なり、計算可能な厳密な性能保証を提供し、有効サンプルサイズの増加を分析的に特徴付けます。ベンチマークデータセットに関する実験は、提案された方法論の複数のシナリオにおける性能向上と、提示された性能保証の信頼性を示しています。

Error estimation and adaptive tuning for unregularized robust M-estimator

非正則化ロバストM推定量の誤差推定と適応調整

We consider unregularized robust M-estimators for linear models under Gaussian design and heavy-tailed noise, in the proportional asymptotics regime where the sample size n and the number of features p are both increasing such that $p/n \to \gamma\in (0,1)$. An estimator of the out-of-sample error of a robust M-estimator is analyzed and proved to be consistent for a large family of loss functions that includes the Huber loss. As an application of this result, we propose an adaptive tuning procedure of the scale parameter $\lambda>0$ of a given loss function $\rho$: choosing $\hat \lambda$ in a given interval $I$ that minimizes the out-of-sample error estimate of the M-estimator constructed with loss $\rho_\lambda(\cdot) = \lambda^2 \rho(\cdot/\lambda)$ leads to the optimal out-of-sample error over $I$. The proof relies on a smoothing argument: the unregularized M-estimation objective function is perturbed, or smoothed, with a Ridge penalty that vanishes as $n\to+\infty$, and shows that the unregularized M-estimator of interest inherits properties of its smoothed version.

ガウス設計とヘビーテールノイズの下で、サンプルサイズnと特徴pの数の両方が$p/n to gammain (0,1)$のように増加する比例漸近領域で、線形モデルの非正則化されたロバストなM推定量を考慮します。ロバストな M 推定量のサンプル外誤差の推定量が分析され、Huber 損失を含む損失関数の大規模なファミリに対して一貫していることが証明されます。この結果の応用として、与えられた損失関数ρのスケールパラメータλ>0の適応的調整手順を提案します。損失ρλ(・)=λ2ρ(・/λ)を用いて構築されたM推定量の外部サンプル誤差推定を最小化するように、与えられた区間I内でλを選択することは、Iにおける最適な外部サンプル誤差につながります。証明はスムージングの議論に依存しています：正則化されていないM推定の目的関数は、n→+∞のときに消失するリッジペナルティで摂動またはスムージングされ、関心のある正則化されていないM推定量がそのスムージングされたバージョンの特性を引き継ぐことを示します。

From Sparse to Dense Functional Data in High Dimensions: Revisiting Phase Transitions from a Non-Asymptotic Perspective

高次元におけるスパースからデンス機能データへの移行：非漸近的視点からの位相遷移の再考

Nonparametric estimation of the mean and covariance functions is ubiquitous in functional data analysis and local linear smoothing techniques are most frequently used. Zhang and Wang (2016) explored different types of asymptotic properties of the estimation, which reveal interesting phase transition phenomena based on the relative order of the average sampling frequency per subject $T$ to the number of subjects $n$, partitioning the data into three categories: “sparse”, “semi-dense”, and “ultra-dense”. In an increasingly available high-dimensional scenario, where the number of functional variables $p$ is large in relation to $n$, we revisit this open problem from a non-asymptotic perspective by deriving comprehensive concentration inequalities for the local linear smoothers. Besides being of interest by themselves, our non-asymptotic results lead to elementwise maximum rates of $L_2$ convergence and uniform convergence serving as a fundamentally important tool for further convergence analysis when $p$ grows exponentially with $n$ and possibly $T$. With the presence of extra $\log p$ terms to account for the high-dimensional effect, we then investigate the scaled phase transitions and the corresponding elementwise maximum rates from sparse to semi-dense to ultra-dense functional data in high dimensions. We also discuss a couple of applications of our theoretical results. Finally, numerical studies are carried out to confirm the established theoretical properties.

平均と共分散関数の非パラメトリック推定は、機能データ分析において広く行われており、局所線形スムージング技術が最も頻繁に使用されます。ZhangとWang（2016）は、推定のさまざまなタイプの漸近特性を探求し、被験者ごとの平均サンプリング頻度$T$と被験者数$n$の相対的な順序に基づいて興味深い位相転移現象を明らかにし、データを「スパース」、「セミデンス」、「ウルトラデンス」の3つのカテゴリに分けました。機能変数の数$p$が$n$に対して大きい高次元のシナリオにおいて、局所線形スムーザーのための包括的な集中不等式を導出することによって、この未解決の問題を非漸近的な視点から再考します。それ自体が興味深いだけでなく、非漸近的な結果は、$L_2$の収束と一様収束の要素ごとの最大レートにつながり、$p$が$n$、場合によっては$T$と指数関数的に成長するときに、さらなる収束解析のための基本的に重要なツールとして機能します。高次元効果を説明するために追加の $log p$ 項が存在するため、スケーリングされた相転移と、高次元でのスパースから半高密度、超高密度の関数データに対応する要素ごとの最大レートを調査します。また、理論的な結果のいくつかの応用についても説明します。最後に、確立された理論的特性を確認するために数値研究が行われます。

Locally Private Causal Inference for Randomized Experiments

ランダム化実験のためのローカルプライベート因果推論

Local differential privacy is a differential privacy paradigm in which individuals first apply a privacy mechanism to their data (often by adding noise) before transmitting the result to a curator. The noise for privacy results in additional bias and variance in their analyses. Thus it is of great importance for analysts to incorporate the privacy noise into valid inference. In this article, we develop methodologies to infer causal effects from locally privatized data under randomized experiments. First, we present frequentist estimators under various privacy scenarios with their variance estimators and plug-in confidence intervals. We show a na\”ive debiased estimator results in inferior mean-squared error (MSE) compared to minimax lower bounds. In contrast, we show that using a customized privacy mechanism, we can match the lower bound, giving minimax optimal inference. We also develop a Bayesian nonparametric methodology along with a blocked Gibbs sampling algorithm, which can be applied to any of our proposed privacy mechanisms, and which performs especially well in terms of MSE for tight privacy budgets. Finally, we present simulation studies to evaluate the performance of our proposed frequentist and Bayesian methodologies for various privacy budgets, resulting in useful suggestions for performing causal inference for privatized data.

ローカル差分プライバシーは、個人がデータにプライバシー機構を適用した後（通常はノイズを追加することによって）、その結果をキュレーターに送信する差分プライバシーのパラダイムです。プライバシーのためのノイズは、分析において追加のバイアスと分散をもたらします。したがって、分析者がプライバシーノイズを有効な推論に組み込むことが非常に重要です。本記事では、ランダム化実験の下でローカルにプライバシー化されたデータから因果効果を推測する方法論を開発します。まず、さまざまなプライバシーシナリオにおける頻度主義推定量とその分散推定量、プラグイン信頼区間を提示します。私たちは、ナイーブなデバイアス推定量がミニマックス下限と比較して劣る平均二乗誤差（MSE）をもたらすことを示します。対照的に、カスタマイズされたプライバシー機構を使用することで、下限に一致させることができ、ミニマックス最適推論を実現できることを示します。また、提案したプライバシー機構のいずれにも適用できるブロックギブスサンプリングアルゴリズムとともに、ベイズ非パラメトリック手法を開発し、特に厳しいプライバシーバジェットに対してMSEの観点で優れた性能を発揮します。最後に、提案した頻度主義およびベイズ手法の性能をさまざまなプライバシーバジェットで評価するためのシミュレーション研究を提示し、プライバシー化されたデータに対する因果推論を行うための有用な提案を得ました。

Estimating Network-Mediated Causal Effects via Principal Components Network Regression

主成分ネットワーク回帰を通じたネットワーク媒介因果効果の推定

We develop a method to decompose causal effects on a social network into an indirect effect mediated by the network, and a direct effect independent of the social network. To handle the complexity of network structures, we assume that latent social groups act as causal mediators. We develop principal components network regression models to differentiate the social effect from the non-social effect. Fitting the regression models is as simple as principal components analysis followed by ordinary least squares estimation. We prove asymptotic theory for regression coefficients from this procedure and show that it is widely applicable, allowing for a variety of distributions on the regression errors and network edges. We carefully characterize the counterfactual assumptions necessary to use the regression models for causal inference, and show that current approaches to causal network regression may result in over-control bias. The method is very general, so that it is applicable to many types of structured data beyond social networks, such as text, areal data, psychometrics, images and omics.

私たちは、社会ネットワークにおける因果効果を、ネットワークによって媒介される間接効果と、社会ネットワークに依存しない直接効果に分解する方法を開発します。ネットワーク構造の複雑さに対処するために、潜在的な社会グループが因果媒介者として機能することを仮定します。私たちは、社会的効果と非社会的効果を区別するために主成分ネットワーク回帰モデルを開発します。回帰モデルの適合は、主成分分析に続く通常の最小二乗推定と同じくらい簡単です。この手法から得られる回帰係数の漸近理論を証明し、回帰誤差やネットワークエッジにさまざまな分布を許容する広範な適用性を示します。私たちは、因果推論のために回帰モデルを使用するために必要な反事実的仮定を慎重に特定し、因果ネットワーク回帰に対する現在のアプローチが過剰制御バイアスを引き起こす可能性があることを示します。この方法は非常に一般的であり、ソーシャルネットワークを超えた多くの種類の構造化データ、例えばテキスト、地域データ、心理測定、画像、オミクスに適用可能です。

Selective Inference with Distributed Data

分散データを用いた選択的推論

When data are distributed across multiple sites or machines rather than centralized in one location, researchers face the challenge of extracting meaningful information without directly sharing individual data points. While there are many distributed methods for point estimation using sparse regression, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. In this paper, we introduce a procedure for performing selective inference with distributed data. We consider a scenario where each local machine solves a lasso problem and communicates the selected predictors to a central machine. The central machine then aggregates these selected predictors to form a generalized linear model (GLM). Our goal is to provide valid inference for the selected GLM while reusing data that have been used in the model selection process. Our proposed procedure only requires low-dimensional summary statistics from local machines, thus keeping communication costs low and preserving the privacy of individual data sets. Furthermore, this procedure can be applied in scenarios where model selection is repeatedly conducted on randomly subsampled data sets, addressing the p-value lottery problem linked with model selection. We demonstrate the effectiveness of our approach through simulations and an analysis of a medical data set on ICU admissions.

データが一箇所に集中するのではなく、複数のサイトやマシンに分散されている場合、研究者は個々のデータポイントを直接共有することなく、有意義な情報を抽出するという課題に直面します。スパース回帰を用いた点推定のための多くの分散手法が存在する一方で、推定されたスパース性に基づいて不確実性を推定したり仮説検定を行ったりするための選択肢はほとんどありません。本論文では、分散データを用いた選択的推論を行う手順を紹介します。各ローカルマシンがラッソ問題を解決し、選択された予測因子を中央マシンに伝達するシナリオを考えます。中央マシンは、これらの選択された予測因子を集約して一般化線形モデル（GLM）を形成します。私たちの目標は、モデル選択プロセスで使用されたデータを再利用しながら、選択されたGLMに対して有効な推論を提供することです。提案された手順は、ローカルマシンからの低次元の要約統計のみを必要とし、通信コストを低く抑え、個々のデータセットのプライバシーを保護します。さらに、この手順は、ランダムにサブサンプリングされたデータセットでモデル選択が繰り返し行われるシナリオにも適用でき、モデル選択に関連するp値のロッタリー問題に対処します。シミュレーションとICU入院に関する医療データセットの分析を通じて、私たちのアプローチの有効性を示します。

Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization

非凸ミニマックス最適化のための二重時間スケール勾配降下上昇アルゴリズム

We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems in the form of $\min_x \max_{y \in Y} f(x, y)$, where the objective function $f(x, y)$ is nonconvex in $x$ and concave in $y$, and the constraint set $Y \subseteq \mathbb{R}^n$ is convex and bounded. In the convex-concave setting, the single-timescale gradient descent ascent (GDA) algorithm is widely used in applications and has been shown to have strong convergence guarantees. In more general settings, however, it can fail to converge. Our contribution is to design TTGDA algorithms that are effective beyond the convex-concave setting, efficiently finding a stationary point of the function $\Phi(\cdot) := \max_{y \in Y} f(\cdot, y)$. We also establish theoretical bounds on the complexity of solving both smooth and nonsmooth nonconvex-concave minimax optimization problems. To the best of our knowledge, this is the first systematic analysis of TTGDA for nonconvex minimax optimization, shedding light on its superior performance in training generative adversarial networks (GANs) and in other real-world application problems.

構造化された非凸ミニマックス最適化問題を $min_x max_{y in Y} f(x, y)$ の形で解くための 2 時間スケール勾配降下上昇 (TTGDA) の統一解析を提供します。ここで、目的関数 $f(x, y)$ は $x$ で非凸型、$y$ では凹型であり、subseteq mathbb{R}^n$ $Y設定された制約は凸型で有界です。凸凹型設定では、シングルタイムスケールの勾配降下上昇(GDA)アルゴリズムがアプリケーションで広く使用されており、強力な収束保証があることが示されています。ただし、より一般的な設定では、収束に失敗することがあります。私たちの貢献は、凸凹の設定を超えて効果的なTTGDAアルゴリズムを設計し、関数$Phi(cdot) := max_{y in Y} f(cdot, y)$の静止点を効率的に見つけることです。我々は、滑らかな非滑らか非凸凹最小最大最適化問題の解決における複雑さの理論的限界も確立します。我々の知る限り、これは非凸最小最大最適化のためのTTGDAの初の体系的分析であり、生成的敵対ネットワーク（GAN）のトレーニングや他の実世界の応用問題におけるその優れた性能に光を当てています。

An Axiomatic Definition of Hierarchical Clustering

階層的クラスタリングの公理的定義

In this paper, we take an axiomatic approach to defining a population hierarchical clustering for piecewise constant densities, and in a similar manner to Lebesgue integration, extend this definition to more general densities. When the density satisfies some mild conditions, e.g., when it has connected support, is continuous, and vanishes only at infinity, or when the connected components of the density satisfy these conditions, our axiomatic definition results in Hartigan’s definition of cluster tree.

本論文では、区分定数密度のための母集団階層クラスタリングを定義するために公理的アプローチを採用し、ルベーグ積分と同様の方法でこの定義をより一般的な密度に拡張します。密度がいくつかの緩やかな条件、例えば、連結サポートを持ち、連続であり、無限大でのみ消失する場合、または密度の連結成分がこれらの条件を満たす場合、私たちの公理的定義はハーティガンのクラスターツリーの定義に至ります。

Test-Time Training on Video Streams

ビデオストリームにおけるテスト時トレーニング

Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances – video frames in our case – arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on bias-variance trade-off.

以前の研究により、テスト時トレーニング（TTT）がトレーニングされたモデルをテスト時にさらに改善するための一般的なフレームワークとして確立されました。各テストインスタンスに対して予測を行う前に、モデルは再構成などの自己教師ありタスクを使用して同じインスタンスで最初にトレーニングされます。私たちはTTTをストリーミング設定に拡張します。ここでは、複数のテストインスタンス（私たちの場合はビデオフレーム）が時間的順序で到着します。私たちの拡張はオンラインTTTです：現在のモデルは前のモデルから初期化され、次に現在のフレームとその直前の小さなフレームウィンドウでトレーニングされます。オンラインTTTは、3つの実世界データセットにおいて4つのタスクで固定モデルのベースラインを大幅に上回ります。改善は、インスタンスセグメンテーションで2.2倍、パンオプティックセグメンテーションで1.5倍以上です。驚くべきことに、オンラインTTTは、時間的順序に関係なく、テストビデオ全体のすべてのフレームでトレーニングを行うオフラインバリアントよりも優れています。この発見は、合成ビデオを使用した以前の研究に挑戦します。私たちは、オンラインTTTがオフラインTTTに対して持つ利点としての局所性の概念を形式化し、アブレーションとバイアス-バリアンストレードオフに基づく理論を用いてその役割を分析します。

Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback

バンディットフィードバックを用いたオンライン学習によるフェデレーテッドラーニングにおける適応的クライアントサンプリング

Due to the high cost of communication, federated learning (FL) systems need to sample a subset of clients that are involved in each round of training. As a result, client sampling plays an important role in FL systems as it affects the convergence rate of optimization algorithms used to train machine learning models. Despite its importance, there is limited work on how to sample clients effectively. In this paper, we cast client sampling as an online learning task with bandit feedback, which we solve with an online stochastic mirror descent (OSMD) algorithm designed to minimize the sampling variance. We then theoretically show how our sampling method can improve the convergence speed of federated optimization algorithms over the widely used uniform sampling. Through both simulated and real data experiments, we empirically illustrate the advantages of the proposed client sampling algorithm over uniform sampling and existing online learning-based sampling strategies. The proposed adaptive sampling procedure is applicable beyond the FL problem studied here and can be used to improve the performance of stochastic optimization procedures such as stochastic gradient descent and stochastic coordinate descent.

通信コストが高いため、連合学習（FL）システムは、各トレーニングラウンドに関与するクライアントのサブセットをサンプリングする必要があります。その結果、クライアントのサンプリングはFLシステムにおいて重要な役割を果たし、機械学習モデルをトレーニングするために使用される最適化アルゴリズムの収束速度に影響を与えます。その重要性にもかかわらず、クライアントを効果的にサンプリングする方法に関する研究は限られています。本論文では、クライアントのサンプリングをバンディットフィードバックを伴うオンライン学習タスクとして定式化し、サンプリングの分散を最小化するように設計されたオンライン確率的ミラー降下（OSMD）アルゴリズムを用いて解決します。次に、理論的に我々のサンプリング手法が広く使用されている均一サンプリングに対して連合最適化アルゴリズムの収束速度を改善できることを示します。シミュレーションデータと実データの実験を通じて、提案されたクライアントサンプリングアルゴリズムが均一サンプリングおよび既存のオンライン学習ベースのサンプリング戦略に対して持つ利点を実証的に示します。提案された適応型サンプリング手法は、ここで研究されたFL問題を超えて適用可能であり、確率的勾配降下法や確率的座標降下法などの確率的最適化手法の性能を向上させるために使用できます。

A Random Matrix Approach to Low-Multilinear-Rank Tensor Approximation

低多重線形ランクテンソル近似へのランダム行列アプローチ

This work presents a comprehensive understanding of the estimation of a planted low-rank signal from a general spiked tensor model near the computational threshold. Relying on standard tools from the theory of large random matrices, we characterize the large-dimensional spectral behavior of the unfoldings of the data tensor and exhibit relevant signal-to-noise ratios governing the detectability of the principal directions of the signal. These results allow to accurately predict the reconstruction performance of truncated multilinear SVD (MLSVD) in the non-trivial regime. This is particularly important since it serves as an initialization of the higher-order orthogonal iteration (HOOI) scheme, whose convergence to the best low-multilinear-rank approximation depends entirely on its initialization. We give a sufficient condition for the convergence of HOOI and show that the number of iterations before convergence tends to $1$ in the large-dimensional limit.

この研究は、計算の閾値近くにある一般的なスパイクテンソルモデルから植えられた低ランク信号の推定に関する包括的な理解を提供します。大規模ランダム行列の理論からの標準的なツールに依存して、データテンソルの展開の大次元スペクトル挙動を特徴付け、信号の主方向の検出可能性を支配する関連する信号対雑音比を示します。これらの結果は、非自明な領域における切断多線形SVD（MLSVD）の再構成性能を正確に予測することを可能にします。これは、最良の低多線形ランク近似への収束がその初期化に完全に依存する高次直交反復（HOOI）スキームの初期化として重要です。HOOIの収束のための十分条件を示し、収束前の反復回数が大次元の限界で$1$に近づくことを示します。

Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents

メモリジム：エージェントのメモリ能力をベンチマークするための無限のタスクに向けて

Memory Gym presents a suite of 2D partially observable environments, namely Mortar Mayhem, Mystery Path, and Searing Spotlights, designed to benchmark memory capabilities in decision-making agents. These environments, originally with finite tasks, are expanded into innovative, endless formats, mirroring the escalating challenges of cumulative memory games such as “I packed my bag”. This progression in task design shifts the focus from merely assessing sample efficiency to also probing the levels of memory effectiveness in dynamic, prolonged scenarios. To address the gap in available memory-based Deep Reinforcement Learning baselines, we introduce an implementation within the open-source CleanRL library that integrates Transformer-XL (TrXL) with Proximal Policy Optimization. This approach utilizes TrXL as a form of episodic memory, employing a sliding window technique. Our comparative study between the Gated Recurrent Unit (GRU) and TrXL reveals varied performances across our finite and endless tasks. TrXL, on the finite environments, demonstrates superior effectiveness over GRU, but only when utilizing an auxiliary loss to reconstruct observations. Notably, GRU makes a remarkable resurgence in all endless tasks, consistently outperforming TrXL by significant margins. Website and Source Code: https://marcometer.github.io/jmlr_2024.github.io/

Memory Gymは、意思決定エージェントの記憶能力をベンチマークするために設計された、モルターメイヘム、ミステリーパス、シアリングスポットライトという2D部分観測環境のスイートを提供します。これらの環境は、元々有限のタスクを持っていましたが、累積記憶ゲーム「私はバッグを詰めました」のような挑戦の高まりを反映した革新的で無限の形式に拡張されています。このタスクデザインの進展は、サンプル効率の評価から、動的で長期的なシナリオにおける記憶の効果のレベルを探ることにも焦点を移します。利用可能な記憶ベースの深層強化学習のベースラインのギャップに対処するために、Transformer-XL（TrXL）を近接ポリシー最適化と統合した実装をオープンソースのCleanRLライブラリ内に導入します。このアプローチは、スライディングウィンドウ技術を用いて、エピソード記憶の一形態としてTrXLを利用します。Gated Recurrent Unit（GRU）とTrXLの比較研究は、有限タスクと無限タスクの間で異なるパフォーマンスを示しています。TrXLは有限環境において、観察を再構築するために補助損失を利用する場合に限り、GRUよりも優れた効果を示します。しかし、GRUはすべての無限タスクで顕著な復活を遂げ、TrXLを大きな差で一貫して上回っています。ウェブサイトとソースコード: https://marcometer.github.io/jmlr_2024.github.io/

Enhancing Graph Representation Learning with Localized Topological Features

局所的トポロジー特徴を用いたグラフ表現学習の強化

Representation learning on graphs is a fundamental problem that can be crucial in various tasks. Graph neural networks, the dominant approach for graph representation learning, are limited in their representation power. Therefore, it can be beneficial to explicitly extract and incorporate high-order topological and geometric information into these models. In this paper, we propose a principled approach to extract the rich connectivity information of graphs based on the theory of persistent homology. Our method utilizes the topological features to enhance the representation learning of graph neural networks and achieve state-of-the-art performance on various node classification and link prediction benchmarks. We also explore the option of end-to-end learning of the topological features, i.e., treating topological computation as a differentiable operator during learning. Our theoretical analysis and empirical study provide insights and potential guidelines for employing topological features in graph learning tasks.

グラフ上の表現学習は、さまざまなタスクにおいて重要な問題です。グラフ表現学習の主要なアプローチであるグラフニューラルネットワークは、その表現力に限界があります。したがって、これらのモデルに高次のトポロジーおよび幾何学的情報を明示的に抽出して組み込むことが有益です。本論文では、持続的ホモロジーの理論に基づいてグラフの豊富な接続情報を抽出するための原則的アプローチを提案します。我々の方法は、トポロジー的特徴を利用してグラフニューラルネットワークの表現学習を強化し、さまざまなノード分類およびリンク予測ベンチマークで最先端の性能を達成します。また、トポロジー的計算を学習中の微分可能な演算子として扱う、トポロジー的特徴のエンドツーエンド学習のオプションも探ります。我々の理論的分析と実証研究は、グラフ学習タスクにおけるトポロジー的特徴の利用に関する洞察と潜在的なガイドラインを提供します。

Deep Out-of-Distribution Uncertainty Quantification via Weight Entropy Maximization

重みエントロピー最大化による分布外不確実性の深層定量化

This paper deals with uncertainty quantification and out-of-distribution detection in deep learning using Bayesian and ensemble methods. It proposes a practical solution to the lack of prediction diversity observed recently for standard approaches when used out-of-distribution (Ovadia et al., 2019; Liu et al., 2021). Considering that this issue is mainly related to a lack of weight diversity, we claim that standard methods sample in “over-restricted” regions of the weight space due to the use of “over-regularization” processes, such as weight decay and zero-mean centered Gaussian priors. We propose to solve the problem by adopting the maximum entropy principle for the weight distribution, with the underlying idea to maximize the weight diversity. Under this paradigm, the epistemic uncertainty is described by the weight distribution of maximal entropy that produces neural networks “consistent” with the training observations. Considering stochastic neural networks, a practical optimization is derived to build such a distribution, defined as a trade-off between the average empirical risk and the weight distribution entropy. We provide both theoretical and numerical results to assess the efficiency of the approach. In particular, the proposed algorithm appears in the top three best methods in all configurations of an extensive out-of-distribution detection benchmark including more than thirty competitors.

この論文は、ベイズ法とアンサンブル法を用いた深層学習における不確実性定量化と分布外検出について扱っています。最近、分布外で使用された標準的アプローチにおいて観察された予測の多様性の欠如に対する実用的な解決策を提案します（Ovadia et al., 2019; Liu et al., 2021）。この問題は主に重みの多様性の欠如に関連していると考えられるため、標準的な手法は「過度に制限された」重み空間の領域でサンプリングを行っていると主張します。これは、重み減衰やゼロ平均中心のガウス事前分布などの「過度の正則化」プロセスの使用によるものです。重みの多様性を最大化するという基本的なアイデアを持って、重み分布に対して最大エントロピー原理を採用することでこの問題を解決することを提案します。このパラダイムの下で、認識的不確実性は、トレーニング観察と「一貫性」のあるニューラルネットワークを生成する最大エントロピーの重み分布によって説明されます。確率的ニューラルネットワークを考慮すると、平均経験リスクと重み分布エントロピーのトレードオフとして定義されるそのような分布を構築するための実用的な最適化が導出されます。アプローチの効率を評価するために、理論的および数値的な結果を提供します。特に、提案されたアルゴリズムは、30以上の競合を含む広範な分布外検出ベンチマークのすべての構成において、トップ3の最良の方法に現れます。

DisC2o-HD: Distributed causal inference with covariates shift for analyzing real-world high-dimensional data

DisC2o-HD：実世界の高次元データを分析するための共変量シフトを伴う分散因果推論

High-dimensional healthcare data, such as electronic health records (EHR) data and claims data, present two primary challenges due to the large number of variables and the need to consolidate data from multiple clinical sites. The third key challenge is the potential existence of heterogeneity in terms of covariate shift. In this paper, we propose a distributed learning algorithm accounting for covariate shift to estimate the average treatment effect (ATE) for high-dimensional data, named DisC2o-HD. Leveraging the surrogate likelihood method, our method calibrates the estimates of the propensity score and outcome models to approximately attain the desired covariate balancing property, while accounting for the covariate shift across multiple clinical sites. We show that our distributed covariate balancing propensity score estimator can approximate the pooled estimator, which is obtained by pooling the data from multiple sites together. The proposed estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified. The semiparametric efficiency bound is achieved when both the propensity score and the outcome models are correctly specified. We conduct simulation studies to demonstrate the performance of the proposed algorithm; additionally, we conduct an empirical study to present the readiness of implementation and validity.

電子健康記録（EHR）データや請求データなどの高次元医療データは、多数の変数と複数の臨床サイトからデータを統合する必要性により、主に二つの課題を提示します。第三の重要な課題は、共変量シフトに関する異質性の潜在的な存在です。本論文では、高次元データの平均治療効果（ATE）を推定するために、共変量シフトを考慮した分散学習アルゴリズムDisC2o-HDを提案します。代理尤度法を活用し、我々の方法は、複数の臨床サイトにおける共変量シフトを考慮しながら、傾向スコアと結果モデルの推定値を調整し、望ましい共変量バランス特性を概ね達成します。私たちは、分散共変量バランシング傾向スコア推定量が、複数のサイトからデータをプールして得られるプール推定量を近似できることを示します。提案された推定量は、傾向スコアモデルまたは結果回帰モデルのいずれかが正しく指定されている場合に一貫性を保ちます。傾向スコアと結果モデルの両方が正しく指定されている場合、半パラメトリック効率境界が達成されます。提案されたアルゴリズムの性能を示すためにシミュレーション研究を行い、さらに実装の準備状況と妥当性を示すために実証研究を行います。

Bayes Meets Bernstein at the Meta Level: an Analysis of Fast Rates in Meta-Learning with PAC-Bayes

メタレベルでのベイズとバーンスタインの出会い：PACベイズを用いたメタ学習における高速率の分析

Bernstein’s condition is a key assumption that guarantees fast rates in machine learning. For example, under this condition, the Gibbs posterior with prior $\pi$ has an excess risk in $O(d_{\pi}/n)$, as opposed to $O(\sqrt{d_{\pi}/n})$ in the general case, where $n$ denotes the number of observations and $d_{\pi}$ is a complexity parameter which depends on the prior $\pi$. In this paper, we examine the Gibbs posterior in the context of meta-learning, i.e., when learning the prior $\pi$ from $T$ previous tasks. Our main result is that Bernstein’s condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $\pi$, which will reduce the term $d_\pi$ across tasks, is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$. We further illustrate how this result improves on the standard rates in three different settings: discrete priors, Gaussian priors and mixture of Gaussian priors.

バーンスタインの状態は、機械学習の高速速度を保証する重要な仮定です。例えば、この条件下では、事前の $pi$ を持つギブス事後波は $O(d_{pi}/n)$ に過剰なリスクを持ちますが、一般的なケースでは $O(sqrt{d_{pi}/n})$ では、$n$ は観測数を示し、$d_{pi}$ は事前の $pi$ に依存する複雑さのパラメータです。この論文では、メタ学習の文脈で、つまり、$T$個の前のタスクから前の$pi$を学習するときに、ギブス事後分布を調べます。私たちの主な結果は、バーンスタインの状態は、観測レベルでの有効性に関係なく、常にメタレベルで保持されるということです。これは、ギブスの事前 $pi$ を学習するための追加コスト (タスク間での項 $d_pi$ を減らす) が、予想される $O(1/sqrt{T})$ ではなく $O(1/T)$ であることを意味します。さらに、この結果が3つの異なる設定(離散事前確率、ガウス事前分布、およびガウス事前確率の混合)で標準レートをどのように改善するかを示します。

Efficiently Escaping Saddle Points in Bilevel Optimization

バイレベル最適化における鞍点からの効率的な脱出

Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding the first-order stationary points for nonconvex-strongly-convex cases. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that the perturbed approximate implicit differentiation (AID) with a warm start strategy finds an $\epsilon$-approximate local minimum of bilevel optimization in $\tilde{O}(\epsilon^{-2})$ iterations with high probability. Moreover, we propose an inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), an algorithm that can escape saddle point and find local minimum of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of perturbed multi-step gradient descent ascent (GDmax) algorithm that converges to local minimax point for minimax problems.

バイレベル最適化は、機械学習と最適化における基本的な問題の1つです。バイレベル最適化の最近の理論的発展は、非凸-強-凸の場合の 1 次定常点を見つけることに焦点を当てています。この論文では、非凸-強-凸二水準最適化で鞍点をエスケープできるアルゴリズムを解析します。具体的には、ウォームスタート戦略を用いた摂動近似陰的微分(AID)が、$tilde{O}(epsilon^{-2})$反復でバイレベル最適化の$epsilon$-approximate局所最小値を高い確率で見つけることを示しています。さらに、サドルポイントをエスケープして確率的二値最適化の局所的な最小値を見つけることができるアルゴリズムである、不正確なNEgative-curvature-Originated-from-Noise Algorithm(iNEON)を提案します。副産物として、ミニマックス問題の局所的なミニマックスポイントに収束する摂動多段階勾配降下上昇(GDmax)アルゴリズムの最初の非漸近解析を提供します。

関連記事