Journal of Machine Learning Research Papers: Volume 27の論文一覧

Journal of Machine Learning Research Papers Volume 27に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。

1 論文
2 参考文献
3 関連情報

論文

The surrogate Gibbs-posterior of a corrected stochastic MALA: Towards uncertainty quantification for neural networks

The surrogate Gibbs-posterior of a corrected stochastic MALA: Towards uncertainty quantification for neural networks / 補正された確率的MALAの代理ギブス事後分布：ニューラルネットワークの不確実性定量化に向けて

MALA is a popular gradient-based Markov chain Monte Carlo method to access the Gibbs-posterior distribution. Stochastic MALA (sMALA) scales to large data sets, but changes the target distribution from the Gibbs-posterior to a surrogate posterior which only exploits a reduced sample size. We introduce a corrected stochastic MALA (csMALA) with a simple correction term for which distance between the resulting surrogate posterior and the original Gibbs-posterior decreases in the full sample size while retaining scalability. In a nonparametric regression model, we prove a PAC-Bayes oracle inequality for the surrogate posterior. Uncertainties can be quantified by sampling from the surrogate posterior. Focusing on Bayesian neural networks, we analyze the diameter and coverage of credible balls for shallow neural networks and we show optimal contraction rates for deep neural networks. Our credibility result is independent of the correction and can also be applied to the standard Gibbs-posterior. A simulation study in a high-dimensional parameter space demonstrates that an estimator drawn from csMALA based on its surrogate Gibbs-posterior indeed exhibits these advantages in practice.

MALAは、ギブス事後分布にアクセスするための一般的な勾配ベースのマルコフ連鎖モンテカルロ法です。確率的MALA（sMALA）は大規模なデータセットに拡張できますが、ターゲット分布をギブス事後分布から、縮小されたサンプルサイズのみを利用する代理事後分布に変更します。私たちは、スケーラビリティを維持しながら、結果として得られる代理事後分布と元のギブス事後分布の間の距離が完全なサンプルサイズで減少する単純な補正項を持つ補正確率的MALA（csMALA）を紹介します。ノンパラメトリック回帰モデルでは、代理事後分布のPAC-ベイズオラクル不等式を証明します。不確実性は、代理事後分布からサンプリングすることで定量化できます。ベイジアンニューラルネットワークに焦点を当て、浅いニューラルネットワークの信頼できるボールの直径と範囲を分析し、深いニューラルネットワークの最適な収縮率を示します。我々の信頼性結果は補正とは独立しており、標準的なギブス事後分布にも適用できます。高次元パラメータ空間におけるシミュレーション研究は、csMALAからその代理ギブス事後分布に基づいて導出された推定値が、実際にこれらの利点を示すことを実証しています。

Online Detection of Changes in Moment–Based Projections: When to Retrain Deep Learners or Update Portfolios?

Online Detection of Changes in Moment–Based Projections: When to Retrain Deep Learners or Update Portfolios? / モーメントベース射影における変化のオンライン検出：ディープラーナーの再学習とポートフォリオの更新はいつ行うべきか？

Training deep learning neural networks often requires massive amounts of computational ressources. We proposeto sequentially monitor network predictions to trigger retraining only if the predictions are no longer valid. This can reduce drastically computational costs and opens a door to green deep learning. Our approach is based on the relationship to projected second moments monitoring, a problem also arising in other areas such as computational finance. Various open-end as well as closed-end monitoring rules are studied under mild assumptions on the training sample and the observations of the monitoring period. The results allow for high-dimensional non-stationary time series data and thus, especially, non-i.i.d. training data. Asymptotics is based on Gaussian approximations of projected partial sums allowing for an estimated projection vector. Estimation of projection vectors is studied both for classical non-$\ell_0$-sparsity as well as under sparsity. For the case that the optimal projection depends on the unknown covariance matrix, hard- and soft-thresholded estimators are studied. The method is analyzed by simulations and supported by synthetic data experiments.

深層学習ニューラルネットワークの学習には、しばしば膨大な計算資源が必要です。我々は、ネットワーク予測を逐次監視し、予測がもはや有効でなくなった場合にのみ再学習をトリガーすることを提案します。これにより計算コストを大幅に削減し、グリーンな深層学習への道が開かれます。我々のアプローチは、計算金融などの他の分野でも発生する問題である、投影された秒モーメント監視との関係に基づいています。学習サンプルと監視期間の観測値に関する緩やかな仮定の下で、様々なオープンエンドおよびクローズドエンドの監視規則が研究されます。その結果、高次元の非定常時系列データ、特に非独立同値学習データが可能になります。漸近解析は、推定射影ベクトルを許容する射影部分和のガウス近似に基づいています。射影ベクトルの推定は、古典的な非$\ell_0$-スパース性とスパース性の両方について研究されています。最適な射影が未知の共分散行列に依存する場合については、ハード閾値推定値とソフト閾値推定値が研究されています。この手法はシミュレーションによって解析され、合成データ実験によって裏付けられています。

Efficient frequent directions algorithms for approximate decomposition of matrices and higher-order tensors

Efficient frequent directions algorithms for approximate decomposition of matrices and higher-order tensors / 行列と高階テンソルの近似分解のための効率的な頻繁方向アルゴリズム

In the framework of the FD (frequent directions) algorithm, we first develop two efficient algorithms for low-rank matrix approximations under the embedding matrices composed of the product of any SpEmb (sparse embedding) matrix and any standard Gaussian matrix, or any SpEmb matrix and any SRHT (subsampled randomized Hadamard transform) matrix. The theoretical results are also achieved based on the bounds of singular values of standard Gaussian matrices and the theoretical results for SpEmb and SRHT matrices. With a given Tucker-rank, we then obtain several efficient FD-based randomized variants of T-HOSVD (the truncated high-order singular value decomposition) and ST-HOSVD (sequentially T-HOSVD), which are two common algorithms for computing the approximate Tucker decomposition of any tensor with a given Tucker-rank. We also consider efficient FD-based randomized algorithms for computing the approximate TT (tensor-train) decomposition of any tensor with a given TT-rank. Finally, we illustrate the efficiency and accuracy of these algorithms using synthetic and real-world matrix (and tensor) data.

FD（頻繁方向）アルゴリズムの枠組みにおいて、まず、任意のSpEmb（スパース埋め込み）行列と任意の標準ガウス行列、または任意のSpEmb行列と任意のSRHT（部分標本化ランダム化アダマール変換）行列の積で構成される埋め込み行列の下での低ランク行列近似のための2つの効率的なアルゴリズムを開発します。また、標準ガウス行列の特異値の境界と、SpEmb行列およびSRHT行列の理論的結果に基づいて理論的結果も得られます。与えられたタッカーランクを用いて、T-HOSVD（切り捨て高階特異値分解）とST-HOSVD（逐次T-HOSVD）の効率的なFDベースのランダム化変種をいくつか得ることができます。これらは、与えられたタッカーランクを持つ任意のテンソルの近似タッカー分解を計算するための一般的な2つのアルゴリズムです。また、与えられたTTランクを持つ任意のテンソルの近似TT（テンソル列）分解を計算するための効率的なFDベースのランダム化アルゴリズムについても検討します。最後に、合成データと実世界の行列（およびテンソル）データを用いて、これらのアルゴリズムの効率と精度を示します。

Identifying Weight-Variant Latent Causal Models

Identifying Weight-Variant Latent Causal Models / 重み変数潜在因果モデルの識別

The task of causal representation learning aims to uncover latent higher-level causal variables that affect lower-level observations. Identifying the true latent causal variables from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start with the analysis of three intrinsic indeterminacies in identifying latent variables from observations: transitivity, permutation indeterminacy, and scaling indeterminacy. We find that transitivity acts as a key role in impeding the identifiability of latent causal variables. To address the unidentifiable issue due to transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under certain assumptions, including the existence of a reference condition under which latent causal influences vanish, we can show that the latent causal variables can be identified up to trivial permutation and scaling, and that partial identifiability results can still be obtained when this reference condition is violated for a subset of latent variables. Furthermore, based on these theoretical results, we propose a novel method, termed Structural caUsAl Variational autoEncoder (SuaVE), which directly learns causal representations and causal relationships among them, together with the mapping from the latent causal variables to the observed ones. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of SuaVE in learning causal representations.

因果表現学習の課題は、低レベルの観測に影響を与える潜在的な高レベルの因果変数を発見することを目指しています。しかしながら、潜在変数間の瞬間的な因果関係を許容しながら、観測データから真の潜在的因果変数を特定することは依然として課題です。この目的のために、我々は観測から潜在変数を特定する際の3つの本質的な不確定性、すなわち推移性、順列不確定性、およびスケーリング不確定性の分析から始めます。その結果、推移性が潜在的因果変数の識別可能性を阻害する上で重要な役割を果たしていることが分かりました。推移性による識別不能な問題に対処するため、我々は、基礎となる潜在的因果モデルが線形ガウスモデルを満たすという新たな識別可能性条件を導入します。この線形ガウスモデルでは、因果係数とガウスノイズの分布が追加の観測変数によって変調されます。潜在的因果的影響が消失する参照条件の存在を含む特定の仮定の下で、潜在的因果変数は自明な順列とスケーリングまで識別可能であり、潜在変数のサブセットでこの参照条件が破られた場合でも部分識別可能性の結果が得られることが示せます。さらに、これらの理論的結果に基づいて、構造的因果変分オートエンコーダ（SuaVE）と呼ばれる新しい手法を提案します。この手法は、因果表現とそれらの間の因果関係、および潜在的因果変数から観測された因果変数へのマッピングを直接学習します。合成データと実データを用いた実験結果は、識別可能性と一貫性の結果、そして因果表現の学習におけるSuaVEの有効性を実証しています。

Classification Under Local Differential Privacy with Model Reversal and Model Averaging

Classification Under Local Differential Privacy with Model Reversal and Model Averaging / モデル反転とモデル平均化を用いた局所差分プライバシー下での分類

Local differential privacy has become a central topic in data privacy research, offering strong privacy guarantees by perturbing user data at the source and removing the need for a trusted curator. However, the noise introduced by local differential privacy often significantly reduces data utility. To address this issue, we reinterpret private learning under local differential privacy as a transfer learning problem, where the noisy data serve as the source domain and the unobserved clean data as the target. We propose novel techniques specifically designed for local differential privacy to improve classification performance without compromising privacy: (1) a noised binary feedback-based evaluation mechanism for estimating dataset utility; (2) model reversal, which salvages underperforming classifiers by inverting their decision boundaries; and (3) model averaging, which assigns weights to multiple reversed classifiers based on their estimated utility. We provide theoretical excess risk bounds under local differential privacy and demonstrate how our methods reduce this risk. Empirical results on both simulated and real-world datasets show substantial improvements in classification accuracy.

ローカル差分プライバシーは、データプライバシー研究の中心的なトピックとなっており、ユーザーデータをソースで撹乱し、信頼できるキュレーターの必要性を排除することで強力なプライバシー保証を提供します。しかし、ローカル差分プライバシーによって導入されるノイズは、多くの場合、データの有用性を大幅に低下させます。この問題に対処するため、我々はローカル差分プライバシー下でのプライバシー学習を転移学習問題として再解釈します。転移学習問題では、ノイズの多いデータがソースドメイン、観測されていないクリーンなデータがターゲットとなります。我々は、プライバシーを損なうことなく分類性能を向上させるため、ローカル差分プライバシー向けに特別に設計された以下の新しい手法を提案します。(1)データセットの有用性を推定するためのノイズ付きバイナリフィードバックベースの評価メカニズム、(2)決定境界を反転することで性能の低い分類器を救済するモデル反転、(3)推定された有用性に基づいて複数の反転分類器に重みを割り当てるモデル平均化。我々は、ローカル差分プライバシー下での理論的な過剰リスク境界を示し、我々の手法がどのようにこのリスクを軽減するかを示す。シミュレートされたデータセットと実際のデータセットの両方で実験した結果、分類精度が大幅に向上した。

Stochastic Gradient Methods: Bias, Stability and Generalization

Stochastic Gradient Methods: Bias, Stability and Generalization / 確率的勾配法：バイアス、安定性、汎化

Recent developments of stochastic optimization often suggest biased gradient estimators to improve either the robustness, communication efficiency or computational speed. Representative biased stochastic gradient methods (BSGMs) include Zeroth-order stochastic gradient descent (SGD), Clipped-SGD and SGD with delayed gradients. The practical success of BSGMs motivates a lot of convergence analysis to explain their impressive training behaviour. As a comparison, there is far less work on their generalization analysis, which is a central topic in modern machine learning. In this paper, we present the first framework to study the stability and generalization of BSGMs for convex and smooth problems. We introduce a generalized Lipschitz-type condition on gradient estimators and bias, under which we develop a rather general stability bound to show how the bias and the gradient estimators affect the stability. We apply our general result to develop the first stability bound for Zeroth-order SGD with reasonable step size sequences, and the first stability bound for Clipped-SGD. While our stability analysis is developed for general BSGMs, the resulting stability bounds for both Zeroth-order SGD and Clipped-SGD match those of SGD under appropriate smoothing/clipping parameters. We combine the stability and convergence analysis together, and derive excess risk bounds of order $O(1/\sqrt{n})$ for both Zeroth-order SGD and Clipped-SGD, where $n$ is the sample size.

確率的最適化の最近の発展は、堅牢性、通信効率、または計算速度のいずれかを向上させるために、バイアス勾配推定器を提案することが多い。代表的なバイアス付き確率的勾配法(BSGM)には、ゼロ次確率的勾配降下法(SGD)、Clipped-SGD、遅延勾配付きSGDなどがあります。BSGMの実用的な成功により、その優れたトレーニング動作を説明する収束解析が盛んに行われています。一方、現代の機械学習の中心的なトピックである一般化解析に関する研究ははるかに少ないです。本稿では、凸問題と滑らかな問題に対するBSGMの安定性と一般化を研究する初のフレームワークを紹介します。勾配推定値とバイアスに関する一般化Lipschitz型条件を導入し、その下で、バイアスと勾配推定値が安定性にどのように影響するかを示すためのかなり一般的な安定性境界を展開します。この一般的な結果を適用して、妥当なステップサイズシーケンスによるゼロ次SGDの最初の安定性境界と、Clipped-SGDの最初の安定性境界を展開します。我々の安定性解析は一般的なBSGMを対象に開発されているが、結果として得られるゼロ次SGDとClipped-SGDの安定性境界は、適切な平滑化/クリッピングパラメータを適用したSGDの安定性境界と一致しています。安定性解析と収束解析を組み合わせ、ゼロ次SGDとClipped-SGDの両方について、$O(1/\sqrt{n})$のオーダーの過剰リスク境界を導出します。ここで、$n$はサンプルサイズです。

Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation

Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation / エントロピー正則化による平均場変分推論の拡張：理論と計算

Variational inference (VI) has emerged as a popular method for approximate inference for high-dimensional Bayesian models. In this paper, we propose a novel VI method that extends the naive mean field via entropic regularization, referred to as $\Xi$-variational inference ($\Xi$-VI). $\Xi$-VI has a close connection to the entropic optimal transport problem and benefits from the computationally efficient Sinkhorn algorithm. We show that $\Xi$-variational posteriors effectively recover the true posterior dependency, where the likelihood function is downweighted by a regularization parameter. We analyze the role of dimensionality of the parameter space on the accuracy of $\Xi$-variational approximation and the computational complexity of computing the approximate distribution, providing a rough characterization of the statistical-computational trade-off in $\Xi$-VI, where higher statistical accuracy requires greater computational effort. We also investigate the frequentist properties of $\Xi$-VI and establish results on consistency, asymptotic normality, high-dimensional asymptotics, and algorithmic stability. We provide sufficient criteria for our algorithm to achieve polynomial-time convergence. Finally, we show the inferential benefits of using $\Xi$-VI over mean-field VI and other competing methods, such as normalizing flow, on simulated and real datasets.

変分推論（VI）は、高次元ベイズモデルの近似推論手法として広く普及しています。本稿では、エントロピー正則化を介してナイーブ平均場を拡張する新しいVI手法、$\Xi$変分推論（$\Xi$-VI）を提案します。$\Xi$-VIはエントロピー最適輸送問題と密接な関係があり、計算効率の高いシンクホーンアルゴリズムの恩恵を受ける。$\Xi$変分事後分布は、尤度関数が正則化パラメータによって重み付けを下げられることで、真の事後依存性を効果的に回復することを示す。パラメータ空間の次元が$\Xi$変分近似の精度と近似分布の計算量に及ぼす影響を分析し、統計精度の向上にはより大きな計算量が必要となる$\Xi$-VIにおける統計的・計算的トレードオフの大まかな特徴づけを行う。また、$\Xi$-VIの頻度論的特性を調査し、一貫性、漸近正規性、高次元漸近性、アルゴリズムの安定性に関する結果を確立します。本アルゴリズムが多項式時間収束を達成するための十分な基準を提供します。最後に、シミュレーションおよび実データセットにおいて、平均場VIや正規化フローなどの他の競合手法よりも$\Xi$-VIを使用することの推論上の利点を示す。

skwdro: a library for Wasserstein distributionally robust machine learning

skwdro: a library for Wasserstein distributionally robust machine learning / skwdro：分布的にロバストなWasserstein機械学習用ライブラリ

We present skwdro, a Python library for training robust machine learning models.The library is based on distributionally robust optimization using Wasserstein distances, popular in optimal transport and machine learnings. The goal of the library is to make the training of robust models easier for a wide audience by proposing a wrapper for PyTorch modules, enabling model loss’ robustification with minimal code changes. It comes along with scikit-learn compatible estimators for some popular objectives. The core of the implementation relies on an entropic smoothing of the original robust objective, in order to ensure maximal model flexibility. The library is available at https://github.com/iutzeler/skwdro and the documentation at https://skwdro.readthedocs.io

堅牢な機械学習モデルのトレーニングのためのPythonライブラリであるskwdroを紹介します。このライブラリは、最適輸送および機械学習で人気のあるワッサースタイン距離を用いた分布的に堅牢な最適化に基づいています。このライブラリの目標は、PyTorchモジュールのラッパーを提案し、最小限のコード変更でモデル損失の堅牢化を可能にすることで、幅広いユーザーが堅牢なモデルのトレーニングを容易にすることです。いくつかの一般的な目的関数に対するscikit-learn互換の推定量が付属しています。実装の中核は、モデルの柔軟性を最大限に高めるために、元の堅牢な目的関数のエントロピー平滑化に依存しています。ライブラリはhttps://github.com/iutzeler/skwdroで、ドキュメントはhttps://skwdro.readthedocs.ioで入手できます。

Guaranteed Nonconvex Low-Rank Tensor Estimation via Scaled Gradient Descent

Guaranteed Nonconvex Low-Rank Tensor Estimation via Scaled Gradient Descent / スケール勾配降下法による保証付き非凸低ランクテンソル推定

Tensors, which give a faithful and effective representation to deliver the intrinsic structure of multi-dimensional data, play a crucial role in an increasing number of signal processing and machine learning problems. However, tensor data are often accompanied by arbitrary signal corruptions, including missing entries and sparse noise. A fundamental challenge is to reliably extract the meaningful information from corrupted tensor data in a statistically and computationally efficient manner. This paper develops a scaled gradient descent (ScaledGD) algorithm to directly estimate the tensor factors with tailored spectral initializations under the tensor-tensor product (t-product) and tensor singular value decomposition (t-SVD) framework. With tailored variants for tensor robust principal component analysis, (robust) tensor completion and tensor regression, we theoretically show that ScaledGD achieves linear convergence at a constant rate that is independent of the condition number of the ground truth low-rank tensor, while maintaining the low per-iteration cost of gradient descent. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties for low-rank tensor estimation with the t-SVD. Finally, numerical examples are provided to demonstrate the efficacy of ScaledGD in accelerating the convergence rate of ill-conditioned low-rank tensor estimation in a number of applications.

テンソルは、多次元データの本質的な構造を忠実かつ効果的に表現するものであり、信号処理や機械学習における様々な問題において重要な役割を果たしています。しかし、テンソルデータには、欠落エントリやスパースノイズといった、様々な信号破損が伴うことがよくあります。根本的な課題は、破損したテンソルデータから統計的かつ計算効率の高い方法で、意味のある情報を確実に抽出することです。本論文では、テンソル-テンソル積（t-product）およびテンソル特異値分解（t-SVD）フレームワークに基づき、調整されたスペクトル初期化を用いてテンソル因子を直接推定する、スケールド勾配降下法（ScaledGD）アルゴリズムを開発します。テンソルロバスト主成分分析、（ロバスト）テンソル補完、テンソル回帰といった様々な手法を用いて、ScaledGDが勾配降下法の反復ごとのコストを低く抑えつつ、真の低ランクテンソルの条件数に依存しない一定速度で線形収束を達成することを理論的に示す。我々の知る限り、ScaledGDはt-SVDを用いた低ランクテンソル推定においてこのような特性を証明できる最初のアルゴリズムです。最後に、様々なアプリケーションにおいて、悪条件の低ランクテンソル推定の収束速度を加速するScaledGDの有効性を示す数値例を提示します。

A Data-Augmented Contrastive Learning Approach to Nonparametric Density Estimation

A Data-Augmented Contrastive Learning Approach to Nonparametric Density Estimation / ノンパラメトリック密度推定へのデータ拡張対照学習アプローチ

In this paper, we introduce a data-augmented nonparametric noise contrastive estimation method to density estimation using deep neural networks. By leveraging the idea of contrastive learning, our density estimator exhibits efficiency with a one-step and simulation-free evaluation process, imposes no constraints on the neural network, and is shown to be consistent and asymptotically automatically normalized. A novel data augmentation procedure allows us to mitigate the influence of the choice of reference distribution on our method. Non-asymptotic upper bounds for the expected $L_{2}$-risk and the expected total variation distance have been established, which achieve minimax optimal rates. Moreover, our new method exhibits inherent adaptivity to low dimensional structures of data with a faster convergence rate under a compositional structure assumption. Numerical experiments show the competitiveness of our new method compared with the state-of-the-art nonparametric density estimation methods.

本稿では、ディープニューラルネットワークを用いた密度推定に、データ拡張型ノンパラメトリックノイズ対照推定法を導入します。対照学習の考え方を活用することで、本密度推定法は、ワンステップかつシミュレーション不要の評価プロセスで効率性を示し、ニューラルネットワークに制約を課さず、一貫性があり漸近的に自動的に正規化されることが示されます。新たなデータ拡張手順により、参照分布の選択が本手法に与える影響を軽減することができます。期待$L_{2}$-リスクと期待総変動距離の非漸近的上限が確立されており、これによりミニマックス最適速度が達成されます。さらに、本新手法は、構成構造仮定の下でより高速な収束速度を有し、データの低次元構造に固有の適応性を示す。数値実験は、最先端のノンパラメトリック密度推定法と比較して、本新手法の競争力を示しています。

Nonlocal Techniques for the Analysis of Deep ReLU Neural Network Approximations

Nonlocal Techniques for the Analysis of Deep ReLU Neural Network Approximations / Deep ReLUニューラルネットワーク近似の解析のための非局所的手法

In recent work concerned with the approximation and expressive powers of deep neural networks, Daubechies, DeVore, Foucart, Hanin, and Petrova introduced a system of piecewise linear functions, which can be easily reproduced by artificial neural networks with the ReLU activation function, and showed that it forms a Riesz basis of $L_2([0, 1])$. Their work was subsequently generalized to the multivariate setting by Schneider and Vybíral. In the work at hand, we show that this system serves as a Riesz basis also for Sobolev spaces $W^s([0,1]^d)$ and Barron classes ${\mathbb B}^s([0,1]^d)$ with smoothness $0\lt s\lt 1$. We apply this fact to re-prove some recent results on the approximation of functions from these classes by deep neural networks. Our proof method avoids using local approximations and also allows us to track the implicit constants as well as to show that we can avoid the curse of dimension. Moreover, we also study how well one can approximate Sobolev and Barron functions by neural networks if only function values are known.

深層ニューラルネットワークの近似と表現力に関する最近の研究において、Daubechies、DeVore、Foucart、Hanin、およびPetrovaは、ReLU活性化関数を用いた人工ニューラルネットワークによって容易に再現可能な区分線形関数のシステムを導入し、それが$L_2([0, 1])$のリース基底を形成することを示した。彼らの研究はその後、SchneiderとVybíralによって多変数設定に一般化されました。本研究では、このシステムがソボレフ空間$W^s([0,1]^d)$と滑らかさ$0\lt s\lt 1$を持つバロン類${\mathbb B}^s([0,1]^d)$に対してもリース基底として機能することを示す。この事実を適用して、深層ニューラルネットワークによるこれらのクラスの関数の近似に関する最近の結果をいくつか再証明します。我々の証明法は局所近似の使用を避け、暗黙の定数を追跡するとともに次元の呪いを回避できることを示す。さらに、関数値のみが既知である場合に、ニューラルネットワークによってソボレフ関数とバロン関数をどの程度正確に近似できるかについても検証します。

Nonlinear function-on-function regression by RKHS

Nonlinear function-on-function regression by RKHS / RKHSによる非線形関数対関数回帰

We propose a nonlinear function-on-function regression model where both the covariate and the response are random functions. The nonlinear regression is carried out in two steps: we first construct Hilbert spaces to accommodate the functional covariate and the functional response, and then build a second-layer Hilbert space for the covariate to capture nonlinearity. The second-layer space is assumed to be a reproducing kernel Hilbert space, which is generated by a positive definite kernel determined by the inner product of the first-layer Hilbert space for $X$–this structure is known as the nested Hilbert spaces. We develop estimation procedures to implement the proposed method, which allows the functional data to be observed at different time points for different subjects. Furthermore, we establish the convergence rate of our estimator as well as theweak convergence of the predicted response in the Hilbert space. Numerical studies including both simulations and a data application are conducted to investigate the performance of our estimator in finite sample.

我々は、共変量と応答がともにランダム関数である非線形関数対関数回帰モデルを提案します。この非線形回帰は2段階で実行されます。まず、関数共変量と関数応答を収容するヒルベルト空間を構築し、次に非線形性を捉えるために共変量のための第2層ヒルベルト空間を構築します。第2層空間は、$X$の第1層ヒルベルト空間の内積によって決定される正定値カーネルによって生成される再生カーネルヒルベルト空間であると仮定します。この構造はネストヒルベルト空間として知られています。提案手法を実装するための推定手順を開発し、これにより、異なる被験者について異なる時点で関数データを観測することができます。さらに、ヒルベルト空間における推定値の収束率と予測応答の弱収束性を確立します。有限サンプルにおける推定値の性能を調査するため、シミュレーションとデータ適用の両方を含む数値研究を実施します。

UQLM: A Python Package for Uncertainty Quantification in Large Language Models

UQLM: A Python Package for Uncertainty Quantification in Large Language Models / UQLM: 大規模言語モデルにおける不確実性定量化のためのPythonパッケージ

Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.

幻覚は、大規模言語モデル（LLM）が虚偽または誤解を招くコンテンツを生成するインスタンスとして定義され、下流のアプリケーションの安全性と信頼性に影響を与える重大な課題となります。最先端の不確実性定量化（UQ）技術を用いたLLM幻覚検出用のPythonパッケージであるUQLMを紹介します。このツールキットは、0から1の範囲の応答レベルの信頼度スコアを計算するUQベースのスコアラースイートを提供します。このライブラリは、UQベースの幻覚検出のための既製のソリューションを提供し、簡単に統合してLLM出力の信頼性を高めることができます。

A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design

A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design / 多段階セカンドプライスオークション設計における強化学習アプローチ

We study reserve price optimization in multi-phase second price auctions, where the seller’s prior actions affect the bidders’ later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, the setting in ours involves three challenges.First, from the seller’s perspective, we need to efficiently explore the environment in the presence of potentially untruthful bidders who aim to manipulate the seller’s policy.Second, we want to minimize the seller’s revenue regret when the market noise distribution is unknown. Third, the seller’s per-step revenue is an unknown, nonlinear random variable, and cannot even be directly observed from the environment but realized values.We propose a mechanism addressing all three challenges. To address the first challenge, we use a combination of a new technique named “buffer periods” and inspirations from Reinforcement Learning (RL) with low switching cost to limit bidders’ surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second one is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction’s underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the \underline{C}ontextual-\underline{L}SVI-\underline{U}CB-\underline{B}uffer (CLUB) algorithm which achieves $\tilde{\mathcal{O}}(H^{5/2}\sqrt{K})$ revenue regret, where $K$ is the number of episodes and $H$ is the length of each episode, when the market noise is known and $\tilde{\mathcal{O}}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown with no assumptions on bidders’ truthfulness.

売り手の以前の行動がマルコフ決定過程（MDP）を通じて入札者のその後の評価に影響を与える、多段階のセカンドプライスオークションにおける最低落札価格の最適化を研究します。既存研究のバンディット設定と比較して、私たちの設定には3つの課題があります。まず、売り手の視点から、売り手のポリシーを操作しようとする潜在的に不正な入札者が存在する環境で、環境を効率的に探索する必要があります。次に、市場ノイズの分布が不明な場合に、売り手の収益後悔を最小限に抑えたいと考えています。3番目に、売り手のステップごとの収益は未知の非線形ランダム変数であり、環境から直接観察することさえできず、実現値となります。私たちは、3つの課題すべてに対処するメカニズムを提案します。最初の課題に対処するために、「バッファ期間」と呼ばれる新しい手法と、スイッチングコストの低い強化学習(RL)からのインスピレーションを組み合わせて、不正な入札による入札者の余剰を制限し、それによってほぼ正直な入札を奨励します。2番目の課題については、市場ノイズの分布が不明な場合に純粋な探索の必要性を排除する新しいアルゴリズムで対処します。3つ目の課題は、LSVI-UCBの拡張によって解決されます。この拡張では、オークションの基礎構造を使用して収益関数の不確実性を制御します。3つの手法は、\underline{C}ontextual-\underline{L}SVI-\underline{U}CB-\underline{B}uffer (CLUB)アルゴリズムに集約されます。このアルゴリズムは、市場ノイズが既知の場合は$\tilde{\mathcal{O}}(H^{5/2}\sqrt{K})$の収益後悔を実現します。ここで、$K$はエピソード数、$H$は各エピソードの長さです。また、入札者の誠実さに関する仮定がなく、ノイズが未知である場合は、$\tilde{\mathcal{O}}(H^{3}\sqrt{K})$の収益後悔を実現します。

Error Analysis for Deep ReLU Feedforward Density-Ratio Estimation with Bregman Divergence

Error Analysis for Deep ReLU Feedforward Density-Ratio Estimation with Bregman Divergence / ブレグマンダイバージェンスを用いたDeep ReLUフィードフォワード密度比推定の誤差分析

We consider the problem of density-ratio estimation using Bregman Divergence with Deep ReLU feedforward neural networks (BDD). We establish non-asymptotic error bounds for BDD density-ratio estimators, which are minimax optimal up to a logarithmic factor when the data distribution has finite support. As an application of our theoretical findings, we propose an estimator for the KL-divergence that is asymptotically normal, leveraging our convergence results for the deep density-ratio estimator and a data-splitting method. We also extend our results to cases with unbounded support and unbounded density ratios. Furthermore, we show that the BDD density-ratio estimator can mitigate the curse of dimensionality when data distributions are supported on an approximately low-dimensional manifold. Our results are applied to investigate the convergence properties of the telescoping density-ratio estimator proposed by Rhodes (2020). We provide sufficient conditions under which it achieves a lower error bound than a single-ratio estimator. Moreover, we conduct simulation studies to validate our main theoretical results and assess the performance of the BDD density-ratio estimator.

Deep ReLUフィードフォワードニューラルネットワーク（BDD）を使用したBregman Divergenceを用いた密度比推定の問題を検討します。データ分布が有限のサポートを持つ場合、対数係数までミニマックス最適となるBDD密度比推定量の非漸近的誤差境界を確立します。理論的発見の応用として、深層密度比推定量とデータ分割法の収束結果を利用して、漸近的に正規なKL情報量の推定量を提案します。また、結果を無制限のサポートと無制限の密度比を持つケースに拡張します。さらに、データ分布が近似的に低次元の多様体でサポートされている場合、BDD密度比推定量は次元の呪いを軽減できることを示す。私たちの結果は、Rhodes (2020)によって提案されたテレスコーピング密度比推定量の収束特性を調査するために適用されます。私たちは、単一比率推定量よりも低い誤差境界を達成するための十分な条件を提供します。さらに、主要な理論的結果を検証し、BDD密度比推定器のパフォーマンスを評価するためのシミュレーション研究を実施します。

Flexible Functional Treatment Effect Estimation

Flexible Functional Treatment Effect Estimation / 柔軟な機能的治療効果推定

We study treatment effect estimation with functional treatments where the average potential outcome functional is a function of functions, in contrast to continuous treatment effect estimation where the target is a function of real numbers. By considering a flexible scalar-on-function marginal structural model, a weight-modified kernel ridge regression (WMKRR) is adopted for estimation. The weights are constructed by directly minimizing the uniform balancing error resulting from a decomposition of the WMKRR estimator, instead of being estimated under a particular treatment selection model. Despite the complex structure of the uniform balancing error derived under WMKRR, finite-dimensional convex algorithms can be applied to efficiently solve for the proposed weights thanks to a representer theorem. The optimal convergence rate is shown to be attainable by the proposed WMKRR estimator without any smoothness assumption on the true weight function. Corresponding empirical performance is demonstrated by a simulation study and a real data application.

我々は、目標が実数の関数である連続的な治療効果推定とは対照的に、平均潜在的結果関数が関数の関数である関数治療による治療効果推定を研究します。柔軟なスカラー・オン・ファンクション周辺構造モデルを考慮することにより、重み修正カーネルリッジ回帰（WMKRR）が推定に採用されます。重みは、特定の治療選択モデルの下で推定されるのではなく、WMKRR推定量の分解から生じる均一バランシング誤差を直接最小化することによって構築されます。WMKRRの下で導出される均一バランシング誤差の複雑な構造にもかかわらず、代表者定理のおかげで、有限次元凸アルゴリズムを適用して提案された重みを効率的に解くことができます。最適な収束率は、真の重み関数にいかなる滑らかさの仮定もなしに、提案されたWMKRR推定量によって達成可能であることが示されています。対応する経験的性能は、シミュレーション研究と実際のデータ適用によって実証されています。

Neural Network Parameter-optimization of Gaussian Pre-marginalized Directed Acyclic Graphs

Neural Network Parameter-optimization of Gaussian Pre-marginalized Directed Acyclic Graphs / ガウス事前周辺化有向非巡回グラフのニューラルネットワークパラメータ最適化

Finding the parameters of a latent variable causal model is central to causal inference and causal identification. In this article, we show that existing graphical structures that are used in causal inference are not stable under marginalization of Gaussian Bayesian networks, and present a graphical structure that faithfully represents margins of Gaussian Bayesian networks. We present the first duality between parameter optimization of a latent variable model and training a feed-forward neural network in the parameter space of the assumed family of distributions. Based on this observation, we develop an algorithm for parameter optimization of these graphical structures using the observational distribution. Then, we provide conditions for causal effect identifiability in the Gaussian setting. We propose a meta-algorithm that checks whether a causal effect is identifiable or not. Moreover, we lay a grounding for generalizing the duality between a neural network and a causal model from the Gaussian to other distributions.

潜在変数因果モデルのパラメータを見つけることは、因果推論と因果同定において中心的な役割を果たします。本稿では、因果推論で使用される既存のグラフィカル構造が、ガウスベイジアンネットワークの周辺化の下では安定しないことを示し、ガウスベイジアンネットワークの周辺を忠実に表現するグラフィカル構造を提示します。潜在変数モデルのパラメータ最適化と、想定される分布族のパラメータ空間におけるフィードフォワードニューラルネットワークのトレーニングとの間に、最初の双対性を示します。この観察に基づいて、観測分布を用いてこれらのグラフィカル構造のパラメータ最適化アルゴリズムを開発します。次に、ガウス分布の設定において因果効果を識別可能であるための条件を示します。因果効果が識別可能かどうかを確認するメタアルゴリズムを提案します。さらに、ニューラルネットワークと因果モデル間の双対性をガウス分布から他の分布に一般化するための基礎を築きます。

Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification

Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification / 不均衡テキスト分類のための外挿マルコフ連鎖オーバーサンプリング法

Text classification is the task of automatically assigning text documents correct labels from a predefined set of categories. In real-life (text) classification tasks, observations and misclassification costs are often unevenly distributed between the classes – known as the problem of imbalanced data. Synthetic oversampling is a popular approach to imbalanced classification. The idea is to generate synthetic observations in the minority class to balance the classes in the training set. Many general-purpose oversampling methods can be applied to text data; however, imbalanced text data poses a number of distinctive difficulties that stem from the unique nature of text compared to other domains. One such factor is that when the sample size of text increases, the sample vocabulary (i.e., feature space) is likely to grow as well. We introduce a novel Markov chain based text oversampling method. The transition probabilities are estimated from the minority class but also partly from the majority class, thus allowing the minority feature space to expand in oversampling. We evaluate our approach against prominent oversampling methods and show that our approach is able to produce highly competitive results against the other methods in several real data examples, especially when the imbalance is severe.

テキスト分類とは、事前定義されたカテゴリセットからテキスト文書に正しいラベルを自動的に割り当てるタスクです。実際の（テキスト）分類タスクでは、観測値と誤分類コストがクラス間で不均等に分散していることが多く、これは不均衡データ問題として知られています。合成オーバーサンプリングは、不均衡分類に対する一般的なアプローチです。その考え方は、少数クラスで合成観測値を生成し、トレーニングセット内のクラス間のバランスをとることです。多くの汎用オーバーサンプリング手法をテキストデータに適用できますが、不均衡なテキストデータは、他の領域と比較したテキストの特異な性質に起因する多くの特有の問題を引き起こします。そのような要因の1つは、テキストのサンプルサイズが増加すると、サンプル語彙（すなわち、特徴空間）も増加する可能性があることです。本稿では、マルコフ連鎖に基づく新しいテキストオーバーサンプリング手法を紹介します。遷移確率は少数クラスから推定されますが、一部は多数クラスからも推定されるため、オーバーサンプリングにおいて少数クラスの特徴空間を拡張することができます。我々は、提案手法を主要なオーバーサンプリング手法と比較して評価し、特に不均衡が深刻な場合、いくつかの実データ例において他の手法に対して非常に競争力のある結果を生み出すことができることを示す。

An Anytime Algorithm for Good Arm Identification

An Anytime Algorithm for Good Arm Identification / 良好なアーム識別のためのいつでもアルゴリズム

In good arm identification (GAI), the goal is to identify one arm whose average performance exceeds a given threshold, referred to as a good arm, if it exists. Few works have studied GAI in the fixed-budget setting when the sampling budget is fixed beforehand, or in the anytime setting, when a recommendation can be asked at any time. We propose APGAI, an anytime and parameter-free sampling rule for GAI in stochastic bandits. APGAI can be straightforwardly used in fixed-confidence and fixed-budget settings. First, we derive upper bounds on its probability of error at any time. They show that adaptive strategies can be more efficient in detecting the absence of good arms than uniform sampling in several diverse instances. Second, when APGAI is combined with a stopping rule, we prove upper bounds on the expected sampling complexity, holding at any confidence level. Finally, we show the good empirical performance of APGAI on synthetic and real-world data. Our work offers an extensive overview of the GAI problem in all settings.

良好アーム識別（GAI）における目標は、平均パフォーマンスが所定の閾値を超える1つのアーム（良好アームと呼ばれる）を特定することです。サンプリング予算が事前に固定されている固定予算設定、またはいつでも推奨を尋ねられるいつでも設定でのGAIを研究した研究はほとんどない。我々は、確率的バンディットにおけるGAIのための、いつでもパラメータフリーのサンプリング規則であるAPGAIを提案します。APGAIは、固定信頼度および固定予算設定で直接使用できます。まず、任意の時点における誤り確率の上限を導出します。その結果、適応戦略は、複数の多様なインスタンスにおいて均一サンプリングよりも良好アームの不在を検出するのに効率的であることを示す。次に、APGAIを停止規則と組み合わせた場合、あらゆる信頼度レベルで成立する、予想されるサンプリング複雑度の上限を証明します。最後に、合成データおよび実世界データにおけるAPGAIの良好な経験的性能を示す。本研究は、あらゆる設定におけるGAI問題の広範な概要を提供します。

Simulation-based Calibration of Uncertainty Intervals under Approximate Bayesian Estimation

Simulation-based Calibration of Uncertainty Intervals under Approximate Bayesian Estimation / 近似ベイズ推定における不確実性区間のシミュレーションベースのキャリブレーション

The mean field variational Bayes (VB) algorithm implemented in Stan is relatively fast and efficient, making it feasible to produce model-estimated official statistics on a rapid timeline. Yet, while consistent point estimates of parameters are achieved for continuous data models, the mean field approximation often produces inaccurate uncertainty quantification to the extent that parameters are correlated a posteriori. In this paper, we propose a simulation procedure that calibrates uncertainty intervals for model parameters estimated under approximate algorithms to achieve nominal coverages. Our procedure detects and corrects biased estimation of both first and second moments of approximate marginal posterior distributions induced by any estimation algorithm that produces consistent first moments under specification of the correct model. The method generates replicate data sets using parameters estimated in an initial model run. The model is subsequently re-estimated on each replicate data set, and we use the empirical distribution over the re-samples to formulate calibrated confidence intervals of parameter estimates of the initial model run that are guaranteed to asymptotically achieve nominal coverage. We demonstrate the performance of our procedure in Monte Carlo simulation study and apply it to real data from the Current Employment Statistics survey.

Stanに実装された平均場変分ベイズ（VB）アルゴリズムは比較的高速かつ効率的であり、モデル推定に基づく公式統計量を迅速に生成することが可能になります。しかしながら、連続データモデルではパラメータの一貫した点推定値が得られる一方で、平均場近似は、パラメータが事後的に相関する程度まで不正確な不確実性定量化をもたらすことが多い。本論文では、近似アルゴリズムを用いて推定されたモデルパラメータの不確実性区間を較正し、名義カバレッジを達成するシミュレーション手順を提案します。本手順は、正しいモデルの仕様に基づいて一貫した第一モーメントを生成する推定アルゴリズムによって引き起こされる、近似周辺事後分布の第一モーメントと第二モーメントの両方の偏った推定値を検出し、修正します。この手法は、初期モデル実行で推定されたパラメータを用いて複製データセットを生成します。その後、各複製データセットでモデルを再推定し、再標本における経験分布を用いて、初期モデル実行時のパラメータ推定値の較正済み信頼区間を定式化します。この信頼区間は、漸近的に名目カバレッジを達成することが保証されています。本手法の性能をモンテカルロシミュレーション研究で実証し、これを「雇用統計調査」の実データに適用します。

Learning Bayesian Network Classifiers to Minimize Class Variable Parameters

Learning Bayesian Network Classifiers to Minimize Class Variable Parameters / クラス変数パラメータを最小化するベイジアンネットワーク分類器の学習

This study proposes and evaluates a novel Bayesian network classifier which can asymptotically estimate the true probability distribution of the class variable with the fewest class variable parameters among all structures for which the class variable has no parent. Moreover, to search for an optimal structure of the proposed classifier, we propose (1) a depth-first search based method and (2) an integer programming based method. The proposed methods are guaranteed to obtain the true probability distribution asymptotically while minimizing the number of class variable parameters. Comparative experiments using benchmark datasets demonstrate the effectiveness of the proposed method.

本研究では、クラス変数が親を持たないすべての構造の中で、クラス変数パラメータが最も少ないクラス変数の真の確率分布を漸近的に推定できる、新しいベイジアンネットワーク分類器を提案し、評価します。さらに、提案する分類器の最適な構造を探索するために、(1)深さ優先探索に基づく手法と(2)整数計画法に基づく手法を提案します。提案手法は、クラス変数パラメータの数を最小化しながら、真の確率分布を漸近的に得ることを保証します。ベンチマークデータセットを用いた比較実験は、提案手法の有効性を実証します。

Nonparametric Estimation of a Factorizable Density using Diffusion Models

Nonparametric Estimation of a Factorizable Density using Diffusion Models / 拡散モデルを用いた因数分解可能な密度のノンパラメトリック推定

In recent years, diffusion models, and more generally score-based deep generative models, have achieved remarkable success in various applications, including image and audio generation. In this paper, we view diffusion models as an implicit approach to nonparametric density estimation and study them within a statistical framework to analyze their surprising performance. A key challenge in high-dimensional statistical inference is leveraging low-dimensional structures inherent in the data to mitigate the curse of dimensionality. We assume that the underlying density exhibits a low-dimensional structure by factorizing into low-dimensional components, a property common in examples such as Bayesian networks and Markov random fields. Under suitable assumptions, we demonstrate that an implicit density estimator constructed from diffusion models adapts to the factorization structure and achieves the minimax optimal rate with respect to the total variation distance. In constructing the estimator, we design a sparse weight-sharing neural network architecture, where sparsity and weight-sharing are key features of practical architectures such as convolutional neural networks and recurrent neural networks.

近年、拡散モデル、そしてより一般的にはスコアベースの深層生成モデルは、画像や音声の生成を含む様々なアプリケーションで目覚ましい成功を収めています。本稿では、拡散モデルを非パラメトリック密度推定への暗黙的なアプローチと捉え、統計的枠組みの中でその驚くべき性能を分析します。高次元統計的推論における重要な課題は、データに内在する低次元構造を活用して次元の呪いを軽減することです。我々は、ベイジアンネットワークやマルコフ確率場などの例でよく見られる特性である、低次元成分への因数分解によって基礎密度が低次元構造を示すと仮定します。適切な仮定の下で、拡散モデルから構築された暗黙的な密度推定量が因数分解構造に適応し、総変動距離に関してミニマックス最適速度を達成することを示す。推定量の構築にあたり、我々はスパースな重み共有ニューラルネットワークアーキテクチャを設計します。ここで、スパース性と重み共有は、畳み込みニューラルネットワークやリカレントニューラルネットワークなどの実用的なアーキテクチャの重要な特徴です。

The Distribution of Ridgeless Least Squares Interpolators

The Distribution of Ridgeless Least Squares Interpolators / リッジレス最小二乗補間器の分布

The Ridgeless minimum $\ell_2$-norm interpolator in overparametrized linear regression has attracted considerable attention in recent years in both machine learning and statistics communities. While it seems to defy conventional wisdom that overfitting leads to poor prediction, recent theoretical research on its $\ell_2$-type risks reveals that its norm minimizing property induces an `implicit regularization’ that helps prediction in spite of interpolation.This paper takes a further step that aims at understanding its precise stochastic behavior as a statistical estimator. Specifically, we characterize the distribution of the Ridgeless interpolator in high dimensions, in terms of a Ridge estimator in an associated Gaussian sequence model with positive regularization, which provides a precise quantification of the prescribed implicit regularization in the most general distributional sense. Our distributional characterizations hold for general non-Gaussian random designs and extend uniformly to positively regularized Ridge estimators.As a direct application, we obtain a complete characterization for a general class of weighted $\ell_q$ risks of the Ridge(less) estimators that are previously only known for $q=2$ by random matrix methods. These weighted $\ell_q$ risks not only include the standard prediction and estimation errors, but also include the non-standard covariate shift settings. Our uniform characterizations further reveal a surprising feature of the commonly used generalized and $k$-fold cross-validation schemes: tuning the estimated $\ell_2$ prediction risk by these methods alone lead to simultaneous optimal $\ell_2$ in-sample, prediction and estimation risks, as well as the optimal length of debiased confidence intervals.

過パラメータ化線形回帰におけるリッジレス最小$\ell_2$ノルム補間器は、近年、機械学習と統計学の両方のコミュニティで大きな注目を集めています。過剰適合は予測精度の低下につながるという通説に反するように思われるが、$\ell_2$型リスクに関する最近の理論的研究では、そのノルム最小化特性が「暗黙の正則化」を誘導し、補間にもかかわらず予測精度を向上させることが明らかにされています。本論文では、統計的推定量としてのその正確な確率的挙動を理解することを目指し、さらに一歩踏み込む。具体的には、高次元におけるリッジレス補間量の分布を、正の正則化を伴うガウス系列モデルにおけるリッジ推定量を用いて特徴付ける。このリッジ推定量は、最も一般的な分布的意味で、規定された暗黙の正則化を正確に定量化します。我々の分布的特徴付けは、一般的な非ガウスランダム設計に当てはまり、正の正則化されたリッジ推定量に一様に拡張されます。直接的な応用として、これまでランダム行列法によって$q=2$の場合のみ知られていた、リッジ(レス)推定量の重み付き$\ell_q$リスクの一般的なクラスに対する完全な特徴付けを得る。これらの重み付き$\ell_q$リスクには、標準的な予測誤差と推定誤差だけでなく、非標準的な共変量シフト設定も含まれます。我々の均一な特性評価は、一般的に用いられる一般化クロスバリデーションと$k$分割クロスバリデーションの驚くべき特徴をさらに明らかにします。これらの手法のみで推定された$\ell_2$予測リスクを調整すると、標本内リスク、予測リスク、推定リスク、そしてバイアス除去信頼区間の最適な長さが同時に得られます。

LazyDINO: Fast, Scalable, and Efficiently Amortized Bayesian Inversion via Structure-Exploiting and Surrogate-Driven Measure Transport

LazyDINO: Fast, Scalable, and Efficiently Amortized Bayesian Inversion via Structure-Exploiting and Surrogate-Driven Measure Transport / LazyDINO: 構造利用とサロゲート駆動の測度輸送による高速、スケーラブル、かつ効率的な償却ベイジアン逆変換

We present LazyDINO, a transport map variational inference method for fast, scalable, and efficiently amortized solutions of high-dimensional nonlinear Bayesian inverse problems with expensive parameter-to-observable (PtO) maps. Our method consists of an offline phase, in which we construct a derivative-informed neural surrogate of the PtO map using joint samples of the PtO map and its Jacobian as training data. During the online phase, when given observational data, we rapidly approximate the posterior using surrogate-driven training of a lazy map, i.e., a structure-exploiting transport map with low-dimensional nonlinearity. Our surrogate construction is optimized for amortized Bayesian inversion using lazy map variational inference. We show that (i) the derivative-based reduced basis architecture minimizes an upper bound on the expected error in surrogate posterior approximation, and (ii) the derivative-informed surrogate training minimizes the expected error due to surrogate-driven variational inference. Our numerical results demonstrate that LazyDINO is highly efficient in cost amortization for Bayesian inversion. We observe a reduction of one to two orders of magnitude in offline cost for accurate online posterior approximation, compared to amortized simulation-based inference via conditional transport and to conventional surrogate-driven transport. In particular, LazyDINO consistently outperforms Laplace approximation using fewer than 1000 offline PtO map evaluations, while competing methods struggle and sometimes fail at 16,000 evaluations.

我々は、高次元非線形ベイズ逆問題（パラメータ観測値（PtO）マップを扱う）に対し、高速かつスケーラブルで効率的な償却解を求める輸送マップ変分推論手法LazyDINOを提案します。本手法は、PtOマップとそのヤコビアンを結合したサンプルを学習データとして用い、PtOマップの微分情報に基づくニューラルサロゲートを構築するオフラインフェーズから構成されます。オンラインフェーズでは、観測データが与えられた際に、サロゲート駆動型の遅延マップ（低次元非線形性を持つ構造利用輸送マップ）の学習を用いて事後分布を迅速に近似します。このサロゲート構築は、遅延マップ変分推論を用いた償却ベイズ逆問題に最適化されています。(i)微分ベースの縮小基底アーキテクチャが代理事後近似における期待誤差の上限を最小化し、(ii)微分情報に基づく代理学習が代理駆動変分推論による期待誤差を最小化することを示します。数値結果は、LazyDINOがベイズ逆変換のコスト償却において非常に効率的であることを示しています。条件付きトランスポートを介した償却シミュレーションベースの推論や従来の代理駆動トランスポートと比較して、正確なオンライン事後近似のオフラインコストが1～2桁削減されることが確認できました。特に、LazyDINOは1,000回未満のオフラインPtOマップ評価でラプラス近似よりも一貫して優れた性能を発揮しますが、競合手法は16,000回の評価で苦戦し、失敗することもあります。

A Common Interface for Automatic Differentiation

A Common Interface for Automatic Differentiation / 自動微分のための共通インターフェース

For scientific machine learning tasks with a lot of custom code, picking the right Automatic Differentiation (AD) system matters. Our Julia package DifferentiationInterface.jl provides a common frontend to a dozen AD backends, unlocking easy comparison and modular development. In particular, its built-in preparation mechanism leverages the strengths of each backend by amortizing one-time computations. This is key to enabling sophisticated features like sparsity handling without putting additional burdens on the user.

カスタムコードを多く含む科学的機械学習タスクでは、適切な自動微分(AD)システムを選択することが重要です。JuliaパッケージDifferentiationInterface.jlは、12種類のADバックエンドに共通のフロントエンドを提供し、容易な比較とモジュール開発を可能にします。特に、組み込みの準備メカニズムは、1回限りの計算を償却することで各バックエンドの長所を活用します。これは、ユーザーに追加の負担をかけることなく、スパース性処理などの高度な機能を実現するための鍵となります。

Refined Risk Bounds for Unbounded Losses via Transductive Priors

Refined Risk Bounds for Unbounded Losses via Transductive Priors / トランスダクティブ事前分布を用いた無制限損失のリスク境界の精緻化

We revisit the sequential variants of linear regression with the squared loss, classification problems with hinge loss, and logistic regression, all characterized by unbounded losses in the setup where no assumptions are made on the magnitude of design vectors and the norm of the optimal vector of parameters. The key distinction from existing results lies in our assumption that the set of design vectors is known in advance (though their order is not), a setup sometimes referred to as transductive online learning. While this assumption might seem similar to fixed design regression or denoising, we demonstrate that the sequential nature of our algorithms allows us to convert our bounds into statistical ones with random design without making any additional assumptions about the distribution of the design vectors-an impossibility for standard denoising results. Our key tools are based on the exponential weights algorithm with carefully chosen transductive (design-dependent) priors, which exploit the full horizon of the design vectors, as well as additional aggregation tools that address the possibly unbounded norm of the vector of the optimal solution.Our classification regret bounds have a feature that is only attributed to bounded losses in the literature: they depend solely on the dimension of the parameter space and on the number of rounds, independent of the design vectors or the norm of the optimal solution. For linear regression with squared loss, we further extend our analysis to the sparse case, providing sparsity regret bounds that depend additionally only on the magnitude of the response variables. We argue that these improved bounds are specific to the transductive setting and unattainable in the worst-case sequential setup.Our algorithms, in several cases, have polynomial-time approximations and reduce to sampling with respect to log-concave measures instead of aggregating over hard-to-construct epsilon-covers of classes.

二乗損失を伴う線形回帰のシーケンシャルバリアント、ヒンジ損失を伴う分類問題、ロジスティック回帰を再検討します。これらはすべて、設計ベクトルの大きさと最適なパラメータベクトルのノルムについて仮定を設けない設定において、無制限の損失を特徴とします。既存の結果との主な違いは、設計ベクトルのセットが事前にわかっている（ただし、順序はわからない）という仮定にあります。この設定は、トランスダクティブオンライン学習と呼ばれることもあります。この仮定は固定設計回帰やノイズ除去に似ているように見えるかもしれませんが、本アルゴリズムの逐次的な性質により、設計ベクトルの分布に関する追加の仮定を立てることなく、境界をランダム設計による統計的な境界に変換できることを実証します。これは標準的なノイズ除去結果では不可能なことです。主要ツールは、設計ベクトルの全範囲を活用する、慎重に選択されたトランスダクティブ（設計依存）事前分布を用いた指数重みアルゴリズムと、最適解のベクトルのノルムが非有界である可能性に対処する追加の集約ツールに基づいています。本分類リグレット境界には、文献では有界損失にのみ関連付けられている特徴があります。それは、パラメータ空間の次元とラウンド数のみに依存し、設計ベクトルや最適解のノルムとは独立しているということです。損失を2乗した線形回帰については、スパースケースに分析をさらに拡張し、応答変数の大きさのみにも依存するスパースリグレット境界を提供します。私たちは、これらの改善された境界はトランスダクティブ設定に特有のものであり、最悪の場合の順次設定では達成できないと主張します。私たちのアルゴリズムは、いくつかのケースで多項式時間の近似を持ち、構築が難しいクラスのイプシロンカバーにわたって集約するのではなく、対数凹測度に関するサンプリングに簡略化されます。

Decorrelated Local Linear Estimator: Inference for Non-linear Effects in High-dimensional Additive Models

Decorrelated Local Linear Estimator: Inference for Non-linear Effects in High-dimensional Additive Models / 無相関局所線形推定量: 高次元加法モデルにおける非線形効果の推論

Additive models play an essential role in studying non-linear relationships. Despite many recent advances in estimation, there is a lack of methods and theories for inference in high-dimensional additive models, including confidence interval construction and hypothesis testing. Motivated by inference for non-linear treatment effects, we consider the high-dimensional additive model and make inferences for the function derivative. We propose a novel decorrelated local linear estimator and establish its asymptotic normality. The main novelty is the construction of the decorrelation weights, which is instrumental in reducing the error inherited from estimating the nuisance functions in the high-dimensional additive model. We construct the confidence interval for the function derivative and conduct the related hypothesis testing. We demonstrate our proposed method over large-scale simulation studies and apply it to identify non-linear effects in the motif regression problem. Our proposed method is implemented in the R package DLL available from CRAN.

加法モデルは、非線形関係の研究において重要な役割を果たします。近年、推定法は大きく進歩していますが、高次元加法モデルにおける推論、特に信頼区間の構築や仮説検定といった手法や理論は未だ確立されていません。非線形治療効果の推論を目的として、本研究では高次元加法モデルを考察し、関数微分について推論を行います。本研究では、新たな非相関局所線形推定量を提案し、その漸近正規性を確立します。主な新規性は、高次元加法モデルにおけるニューサンス関数の推定から生じる誤差を低減する上で有効な、非相関重みの構築にあります。本研究では、関数微分の信頼区間を構築し、関連する仮説検定を実施します。提案手法を大規模シミュレーション研究で実証し、モチーフ回帰問題における非線形効果の特定に適用します。提案手法は、CRANから入手可能なRパッケージDLLに実装されています。

Communication-efficient Distributed Statistical Inference for Massive Data with Heterogeneous Auxiliary Information

Communication-efficient Distributed Statistical Inference for Massive Data with Heterogeneous Auxiliary Information / 異質な補助情報を持つ大規模データに対する通信効率の高い分散統計推論

Heterogeneous auxiliary information commonly arises in big data due to diverse study settings and privacy constraints. Excluding such indirect evidence often results in a substantial loss of statistical inference efficiency. This article proposes a novel framework for integrating a mixture of individual-level data and multiple external heterogeneous summary statistics by multiplying likelihood functions and confidence densities. Theoretically, we show that the proposed method possesses desirable properties and can achieve statistical efficiency comparable to that of the individual participant data (IPD) estimator, which uses all available individual-level data. Furthermore, we develop a communication-efficient distributed inference procedure for massive datasets with heterogeneous auxiliary information. We demonstrate that the proposed iterative algorithm achieves linear convergence under general conditions or generalized linear models. Finally, extensive simulations and real data applications are conducted to illustrate the performance of the proposed methods.

ビッグデータでは、多様な研究設定やプライバシー制約により、異質な補助情報が一般的に発生します。このような間接的な証拠を除外すると、統計的推論の効率が大幅に低下することがよくあります。本稿では、尤度関数と信頼密度を乗算することにより、個人レベルのデータと複数の外部の異種要約統計量の混合物を統合するための新しいフレームワークを提案します。理論的には、提案手法が望ましい特性を備え、利用可能なすべての個人レベルのデータを使用する個人参加者データ（IPD）推定量に匹敵する統計効率を達成できることを示します。さらに、異種の補助情報を含む大規模データセットに対して、通信効率の高い分散推論手順を開発します。提案する反復アルゴリズムは、一般的な条件または一般化線形モデルの下で線形収束を達成することを実証します。最後に、提案手法の性能を示すために、広範なシミュレーションと実際のデータへの適用を実施します。

Generative Bayesian Inference with GANs

Generative Bayesian Inference with GANs / GANを用いた生成ベイジアン推論

In the absence of explicit or tractable likelihoods, Bayesians often resort to approximate Bayesian computation (ABC) for inference. Our work bridges ABC with deep neural implicit samplers based on generative adversarial networks (GANs) and adversarial variational Bayes. Both ABC and GANs compare aspects of observed and fake data to simulate from posteriors and likelihoods, respectively. We develop a Bayesian GAN (B-GAN) sampler that directly targets the posterior by solving an adversarial optimization problem. B-GAN is driven by a deterministic mapping learned on the ABC reference by conditional GANs.Once the mapping has been trained, iid posterior samples are obtained by filtering noise at a negligible additional cost. We propose two post-processing local refinements using (1) data-driven proposals with importance reweighting, and (2) variational Bayes. We support our findings with frequentist-Bayesian results, showing that the typical total variation distance between the true and approximate posteriors converges to zero for certain neural network generators and discriminators. Our findings on simulated data show highly competitive performance relative to some of the most recent likelihood-free posterior simulators.

明示的または扱いやすい尤度がない場合、ベイズ主義者は推論のために近似ベイズ計算（ABC）に頼ることがよくあります。本研究では、ABCと、生成的敵対ネットワーク（GAN）および敵対的変分ベイズに基づくディープニューラル暗黙的サンプラーを橋渡しします。ABCとGANはどちらも、観測データと偽データの側面を比較し、それぞれ事後分布と尤度からシミュレートします。我々は、敵対的最適化問題を解決することで事後分布を直接ターゲットとするベイジアンGAN（B-GAN）サンプラーを開発しました。B-GANは、条件付きGANによってABC参照で学習された決定論的マッピングによって駆動されます。マッピングがトレーニングされると、わずかな追加コストでノイズをフィルタリングすることで、iid事後分布サンプルが得られます。我々は、（1）重要度の再重み付けを伴うデータ駆動型提案、および（2）変分ベイズを使用した2つの後処理ローカル改良を提案します。我々は、頻度主義ベイジアンの結果によって調査結果を裏付け、特定のニューラルネットワークジェネレーターと識別器について、真の事後分布と近似事後分布間の典型的な総変動距離がゼロに収束することを示しました。シミュレートされたデータに関する調査結果は、最新の尤度フリー事後分布シミュレータのいくつかと比較して、非常に競争力のあるパフォーマンスを示しています。

Exploring Novel Uncertainty Quantification through Forward Intensity Function Modeling

Exploring Novel Uncertainty Quantification through Forward Intensity Function Modeling / 順方向強度関数モデリングによる新たな不確実性定量化の探求

Predicting future time-to-event outcomes is a foundational task in statistical learning. While various methods exist for generating point predictions, quantifying the associated uncertainties poses a more substantial challenge. In this study, we introduce an innovative approach specifically designed to address this challenge, accommodating dynamic predictors that may manifest as stochastic processes. Our investigation harnesses the forward intensity function in a novel way, providing a fresh perspective on this intricate problem. The framework we propose demonstrates remarkable computational efficiency, enabling efficient analyses of large-scale investigations. We validate its soundness with theoretical guarantees, and our in-depth analysis establishes the weak convergence of function-valued parameter estimations. We illustrate the effectiveness of our framework with two comprehensive real examples and extensive simulation studies.

将来の事象発生までの時間を予測することは、統計学習における基礎的な課題です。点予測を生成するための様々な手法が存在しますが、関連する不確実性を定量化することは、より大きな課題となります。本研究では、この課題に対処するために特別に設計された革新的なアプローチを導入し、確率過程として現れる可能性のある動的な予測変数に対応します。本研究では、前向き強度関数を新たな方法で利用することで、この複雑な問題に新たな視点を提供します。提案するフレームワークは、優れた計算効率を示し、大規模な調査の効率的な分析を可能にします。その健全性は理論的な保証によって検証され、詳細な分析によって関数値パラメータ推定の弱収束が確立されます。2つの包括的な実例と広範なシミュレーション研究を用いて、このフレームワークの有効性を示します。

Persistence Diagrams Estimation of Multivariate Piecewise Hölder-continuous Signals

Persistence Diagrams Estimation of Multivariate Piecewise Hölder-continuous Signals / 多変量区分ヘルダー連続信号のパーシスタンス図推定

To our knowledge, the analysis of convergence rates for persistence diagrams estimation from noisy signals has predominantly relied on lifting signal estimation results through sup-norm (or other functional norm) stability theorems. We believe that moving forward from this approach can lead to considerable gains. We illustrate it in the setting of nonparametric regression. From a minimax perspective, we examine the inference of persistence diagrams (for the sublevel sets filtration). We show that for piecewise Hölder-continuous functions, with control over the reach of the set of discontinuities, taking the persistence diagram coming from a simple histogram estimator of the signal permits achieving the minimax rates known for Hölder-continuous functions. The key novelty lies in our use of algebraic stability instead of sup-norm stability, directly targeting the bottleneck distance through the underlying interleaving. This allows us to incorporate deformation retractions of sublevel sets to accommodate boundary discontinuities that cannot be handled by sup-norm based stability analyses.

私たちの知る限り、ノイズの多い信号からのパーシスタンス図推定の収束率の分析は、主に、ノルム超（またはその他の関数ノルム）安定性定理を用いて信号推定結果をリフティングすることに頼ってきました。このアプローチから前進することで、大きな成果が得られると考えています。これをノンパラメトリック回帰の設定で説明します。ミニマックスの観点から、パーシスタンス図（サブレベルセットのフィルタリング用）の推論を検討します。区分ヘルダー連続関数において、不連続点の到達範囲を制御しながら、信号の単純なヒストグラム推定値からパーシスタンス図を得ることで、ヘルダー連続関数で知られているミニマックス率を達成できることを示す。重要な新規性は、超ノルム安定性ではなく代数的安定性を用い、基礎となるインターリーブを通してボトルネック距離を直接ターゲットにしている点にあります。これにより、超ノルムベースの安定性解析では扱えない境界不連続点に対応するために、サブレベルセットの変形後退を組み込むことができます。

CHANI: Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration

CHANI: Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration / CHANI: 生物に着想を得たニューロンの相関に基づくHawkes集約

The present work aims at proving mathematically that a neural network inspired by biology can learn a classification task thanks to local transformations only. In this purpose, we propose a spiking neural network named CHANI (Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration), whose neurons activity is modeled by Hawkes processes. Synaptic weights are updated thanks to an expert aggregation algorithm, providing a local and simple learning rule. We were able to prove that our network can learn on average and asymptotically. Moreover, we demonstrated that it automatically produces neuronal assemblies in the sense that the network can encode several classes and that a same neuron in the intermediate layers might be activated by more than one class, and we provided numerical simulations on synthetic datasets. This theoretical approach contrasts with the traditional empirical validation of biologically inspired networks and paves the way for understanding how local learning rules enable neurons to form assemblies able to represent complex concepts.

本研究の目的は、生物学に着想を得たニューラルネットワークが、局所的な変換のみによって分類タスクを学習できることを数学的に証明することです。この目的のため、我々はCHANI（Correlation-based Hawkes Aggregation of Neurons with bio-Inspiration）という名のスパイキングニューラルネットワークを提案します。このネットワークのニューロン活動はHawkes過程によってモデル化されます。シナプス荷重は、局所的かつシンプルな学習規則を提供するエキスパート集約アルゴリズムによって更新されます。我々は、このネットワークが平均的かつ漸近的に学習できることを証明できた。さらに、ネットワークが複数のクラスをエンコードでき、中間層の同じニューロンが複数のクラスによって活性化される可能性があるという意味で、ニューロンアセンブリを自動的に生成することを実証し、合成データセットを用いた数値シミュレーションを提供した。この理論的アプローチは、生物学的に着想を得たネットワークの従来の経験的検証とは対照的であり、局所的な学習規則によってニューロンが複雑な概念を表現できるアセンブリを形成できるようにする仕組みを理解するための道を開くものです。

Finite Neural Networks as Mixtures of Gaussian Processes: From Provable Error Bounds to Prior Selection

Finite Neural Networks as Mixtures of Gaussian Processes: From Provable Error Bounds to Prior Selection / ガウス過程の混合体としての有限ニューラルネットワーク：証明可能な誤差限界から事前選択へ

Infinitely wide or deep neural networks (NNs) with independent and identically distributed (i.i.d.) parameters have been shown to be equivalent to Gaussian processes. Because of the favorable properties of Gaussian processes, this equivalence is commonly employed to analyze neural networks and has led to various breakthroughs over the years. However, neural networks and Gaussian processes are equivalent only in the limit; in the finite case there are currently no methods available to approximate a trained neural network with a Gaussian model with bounds on the approximation error. In this work, we present an algorithmic framework to approximate a neural network of finite width and depth, and with not necessarily i.i.d. parameters, with a mixture of Gaussian processes with bounds on the approximation error. In particular, we consider the Wasserstein distance to quantify the closeness between probabilistic models and, by relying on tools from optimal transport and Gaussian processes, we iteratively approximate the output distribution of each layer of the neural network as a mixture of Gaussian processes. Crucially, for any NN and $\epsilon >0$ our approach is able to return a mixture of Gaussian processes that is $\epsilon$-close to the NN at a finite set of input points. Furthermore, we rely on the differentiability of the resulting error bound to show how our approach can be employed to tune the parameters of a NN to mimic the functional behavior of a given Gaussian process, e.g., for prior selection in the context of Bayesian inference. We empirically investigate the effectiveness of our results on both regression and classification problems with various neural network architectures. Our experiments highlight how our results can represent an important step towards understanding neural network predictions and formally quantifying their uncertainty.

独立かつ同一に分布する（i.i.d.）パラメータを持つ、無限に広く深いニューラルネットワーク（NN）は、ガウス過程と等価であることが示されています。ガウス過程の好ましい特性のため、この等価性はニューラルネットワークの解析に広く用いられ、長年にわたり様々なブレークスルーをもたらしてきました。しかし、ニューラルネットワークとガウス過程は極限においてのみ等価であり、有限の場合、近似誤差に境界を持つガウスモデルで訓練済みニューラルネットワークを近似する方法は現在のところ存在しません。本研究では、有限の幅と深さを持ち、必ずしもi.i.d.パラメータを持たないニューラルネットワークを、近似誤差に境界を持つガウス過程の混合で近似するアルゴリズムフレームワークを提示します。特に、確率モデル間の近似度を定量化するためにワッサースタイン距離を考慮し、最適輸送過程とガウス過程のツールを用いて、ニューラルネットワークの各層の出力分布をガウス過程の混合として反復的に近似します。重要な点は、任意のニューラルネットワークと$\epsilon >0$に対して、本手法は有限の入力点集合においてニューラルネットワークに$\epsilon$近似するガウス過程の混合を返すことができることです。さらに、得られた誤差境界の微分可能性を用いて、本手法を用いてニューラルネットワークのパラメータを調整し、例えばベイズ推論における事前選択など、特定のガウス過程の機能的挙動を模倣する方法を示す。様々なニューラルネットワークアーキテクチャを用いて、回帰問題と分類問題の両方において、本研究結果の有効性を実証的に検証します。本実験は、本研究結果がニューラルネットワーク予測の理解とその不確実性の形式的定量化に向けた重要な一歩となり得ることを示しています。

Optimization and Generalization of Gradient Descent for Shallow ReLU Networks with Minimal Width

Optimization and Generalization of Gradient Descent for Shallow ReLU Networks with Minimal Width / 最小幅の浅いReLUネットワークに対する勾配降下法の最適化と一般化

Understanding the generalization and optimization of neural networks is a longstanding problem in modern learning theory. The prior analysis often leads to risk bounds of order $1/\sqrt{n}$ for ReLU networks, where $n$ is the sample size. In this paper, we present a general optimization and generalization analysis for gradient descent applied to shallow ReLU networks. We develop convergence rates of the order $1/T$ for gradient descent with $T$ iterations, and show that the gradient descent iterates fall inside local balls around either an initialization point or a reference point. Then we develop improved Rademacher complexity estimates by using the activation pattern of the ReLU function in these local balls. We apply our general result to NTK-separable data with a margin $\gamma$, and develop an almost optimal risk bound of the order $1/(n\gamma^2)$ for the ReLU network with a polylogarithmic width.

ニューラルネットワークの一般化と最適化を理解することは、現代学習理論における長年の課題です。事前分析では、ReLUネットワークのリスク境界はしばしば$1/\sqrt{n}$オーダーに導かれる（ここで$n$はサンプルサイズ）。本論文では、浅いReLUネットワークに適用される勾配降下法の一般的な最適化および一般化分析を提示します。$T$回の反復を伴う勾配降下法の収束速度が$1/T$オーダーであることを明らかにし、勾配降下法の反復が初期化点または参照点の周囲の局所球内に収まることを示す。次に、これらの局所球におけるReLU関数の活性化パターンを用いることで、改良されたRademacher複雑度推定値を開発します。この一般的な結果を、マージン$\gamma$を持つNTK分離可能なデータに適用し、多重対数幅を持つReLUネットワークのほぼ最適なリスク境界を$1/(n\gamma^2)$オーダーで開発します。

Adaptive Forward Stepwise: A Method for High Sparsity Regression

Adaptive Forward Stepwise: A Method for High Sparsity Regression / 適応的前向きステップワイズ法：高スパース回帰のための手法

This paper proposes a sparse regression method that continuously interpolates between Forward Stepwise selection (FS) and the LASSO. When tuned appropriately, our solutions are much sparser than typical LASSO fits but, unlike FS fits, benefit from the stabilizing effect of shrinkage. Our method, Adaptive Forward Stepwise Regression (AFS) addresses the need for sparser models with shrinkage. We show its connection with boosting via a soft-thresholding viewpoint and demonstrate the ease of adapting the method to classification tasks. In both simulations and real data, our method has lower mean squared error and fewer selected features across multiple settings compared to popular sparse modeling procedures.

本論文では、前向きステップワイズ選択法（FS）とLASSOを連続的に補間するスパース回帰法を提案します。適切に調整することで、本手法は典型的なLASSOフィッティングよりもはるかにスパース性が高くなるが、FSフィッティングとは異なり、シュリンクによる安定化効果の恩恵を受ける。本手法である適応型前向きステップワイズ回帰法（AFS）は、シュリンクを伴うよりスパースなモデルの必要性に応える。ソフト閾値の観点からブースティングとの関連性を示し、本手法が分類タスクに容易に適応できることを実証します。シミュレーションと実データの両方において、本手法は一般的なスパースモデリング手法と比較して、平均二乗誤差が低く、複数の設定において選択される特徴量が少ないことが示されました。

Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection / ミラー降下法によるアテンションの最適化：一般化最大マージントークン選択

Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.

注意メカニズムは、モデルが入力データの関連部分に選択的に焦点を絞ることを可能にすることで、自然言語処理やコンピュータビジョンといった人工知能のいくつかの領域に革命をもたらしました。近年の研究では、注意ベースモデルにおける勾配降下法（GD）の最適化ダイナミクスとその推奨解の構造的特性が明らかにされていますが、ミラー降下法（MD）のようなより一般的な最適化アルゴリズムについてはあまり知られていません。本論文では、ポテンシャル関数として$\ell_p$ノルムの$p$乗を選択した、ソフトマックス注意メカニズム向けに調整されたMDアルゴリズム群の収束特性と暗黙的なバイアスを調査します。具体的には、これらのアルゴリズムをソフトマックス注意モデルを用いた分類問題に適用した場合、$\ell_p$ノルムを目的関数とする一般化ハードマージンSVMへと収束することを示す。特に、本問題が高度に非線形かつ非凸であるにもかかわらず、我々の理論的結果は、より単純なモデルにおいて、収束率が従来のGDと同等であることを示しています。さらに、キークエリ行列とデコーダーの同時最適化ダイナミクスを詳細に検討し、この複雑な同時最適化がそれぞれのハードマージンSVM解に収束する条件を確立します。最後に、実データを用いた数値実験により、MDアルゴリズムは標準的なGDよりも汎化性能が向上し、最適なトークン選択に優れていることが実証されました。

Hierarchical Causal Models

Hierarchical Causal Models / 階層的因果モデル

Causal questions often arise in settings where data are hierarchical: subunits are nested within units. Consider students in schools, cells in patients, or cities in states. In these settings, unit-level variables (e.g., a school’s budget) may affect subunit-level outcomes (e.g., student test scores), and subunit-level characteristics may aggregate to influence unit-level outcomes. In this paper, we show how to analyze hierarchical data for causal inference. We introduce hierarchical causal models, which extend structural causal models and graphical models by incorporating inner plates to represent nested data structures. We develop a graphical identification technique for these models that generalizes do-calculus. We show that hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data–for example, when only unit-level summaries are available. We develop estimation strategies, including using hierarchical Bayesian models. We illustrate our results in simulation and through a reanalysis of the classic “eight schools” study.

因果関係に関する疑問は、データが階層的である設定でしばしば生じます。つまり、サブユニットはユニット内にネストされています。学校の生徒、患者の細胞、州の都市などを考えてみましょう。このような設定では、ユニットレベルの変数（例：学校の予算）がサブユニットレベルの結果（例：生徒のテストの点数）に影響を与える可能性があり、サブユニットレベルの特性が集約されてユニットレベルの結果に影響を与える可能性があります。本稿では、階層データを分析して因果推論を行う方法を示します。階層型因果モデルを紹介します。これは、ネストされたデータ構造を表現するために内部プレートを組み込むことで、構造因果モデルとグラフィカルモデルを拡張したものです。我々は、これらのモデルに対し、do計算を一般化するグラフィカル識別手法を開発します。階層型データを用いることで、非階層型データでは不可能な場合（例えば、単位レベルの要約しか利用できない場合）でも因果関係の識別が可能となることを示す。階層型ベイズモデルを用いた推定戦略も開発します。シミュレーションと、古典的な「8つの流派」研究の再分析によって、我々の結果を示す。

Reparameterized Complex-valued Neurons Can Efficiently Learn More than Real-valued Neurons via Gradient Descent

Reparameterized Complex-valued Neurons Can Efficiently Learn More than Real-valued Neurons via Gradient Descent / 再パラメータ化された複素数値ニューロンは、勾配降下法によって実数値ニューロンよりも効率的に学習できる

Complex-valued neural networks potentially possess better representations and performance than real-valued counterparts when dealing with some complicated tasks such as acoustic analysis, radar image classification, etc. Despite empirical successes, it remains unknown theoretically when and to what extent complex-valued neural networks outperform real-valued ones. We take one step in this direction by comparing the learnability of real-valued neurons and complex-valued neurons via gradient descent. We theoretically show that a complex-valued neuron can learn functions expressed by any one real-valued neuron and any one complex-valued neuron with convergence rates $O(t^{-3})$ and $O(t^{-1})$ where $t$ is the iteration index of gradient descent, respectively, whereas a two-layer real-valued neural network with finite width cannot learn a single non-degenerate complex-valued neuron. We prove that a complex-valued neuron learns a real-valued neuron with rate $\Omega (t^{-3})$, exponentially slower than the linear convergence rate of learning one real-valued neuron using a real-valued neuron. We then reparameterize the phase parameter of the complex-valued neuron and prove that a reparameterized complex-valued neuron can efficiently learn a real-valued neuron with a linear convergence rate. We further verify and extend these results via simulation experiments in more general settings.

複素値ニューラルネットワークは、音響解析、レーダー画像分類といった複雑なタスクを扱う際に、実数値ニューラルネットワークよりも優れた表現力と性能を持つ可能性があります。実験的な成功例があるにもかかわらず、複素値ニューラルネットワークが実数値ニューラルネットワークをどの程度、そしていつ、どのように凌駕するのかは理論的には未解明です。我々は、勾配降下法を用いて実数値ニューロンと複素値ニューロンの学習可能性を比較することで、この方向への一歩を踏み出す。我々は、複素ニューロンが、任意の1つの実数ニューロンと任意の1つの複素ニューロンで表される関数を、それぞれ収束速度$O(t^{-3})$と$O(t^{-1})$で学習できることを理論的に示す。ここで、$t$は勾配降下法の反復インデックスです。一方、有限幅の2層実数ニューラルネットワークは、単一の非退化複素ニューロンを学習することはできない。我々は、複素ニューロンが実数ニューロンを、実数ニューロンを用いて1つの実数ニューロンを学習する場合の線形収束速度よりも指数的に遅い速度$\Omega (t^{-3})$で学習することを証明した。次に、複素ニューロンの位相パラメータを再パラメータ化し、再パラメータ化された複素ニューロンが線形収束速度で実数ニューロンを効率的に学習できることを証明します。我々は、より一般的な設定でのシミュレーション実験により、これらの結果をさらに検証し、拡張します。

Unsupervised Feature Selection via Nonnegative Orthogonal Constrained Regularized Minimization

Unsupervised Feature Selection via Nonnegative Orthogonal Constrained Regularized Minimization / 非負直交制約付き正則化最小化による教師なし特徴選択

Unsupervised feature selection has drawn wide attention in the era of big data, since it serves as a fundamental technique for dimensionality reduction. However, many existing unsupervised feature selection models and solution methods are primarily designed for practical applications, and often lack rigorous theoretical support, such as convergence guarantees. In this paper, we first establish a novel unsupervised feature selection model based on regularized minimization with nonnegative orthogonality constraints, which has advantages of embedding feature selection into the nonnegative spectral clustering and preventing overfitting. To solve the proposed model, we develop an effective inexact augmented Lagrangian multiplier method, in which the subproblems are addressed using a proximal alternating minimization approach. We rigorously prove the algorithm’s sequence converges to a stationary point of the model. Extensive numerical experiments on popular datasets demonstrate the stability and robustness of our method. Moreover, comparative results show that our method outperforms some existing state-of-the-art methods in terms of clustering evaluation metrics. The code is available at https://github.com/liyan-amss/NOCRM_code.

教師なし特徴選択は、次元削減の基本的な手法として、ビッグデータ時代に広く注目を集めています。しかし、既存の多くの教師なし特徴選択モデルと解法は、主に実用アプリケーション向けに設計されており、収束保証などの厳密な理論的裏付けが不足していることが多いです。本稿では、まず、非負直交性制約を伴う正則化最小化に基づく新しい教師なし特徴選択モデルを確立します。このモデルは、特徴選択を非負スペクトルクラスタリングに組み込み、過学習を防ぐという利点があります。提案モデルを解くために、効果的な不正確拡張ラグランジュ乗数法を開発し、この方法では、部分問題に近似交代最小化アプローチを用いて対処します。アルゴリズムのシーケンスがモデルの定常点に収束することを厳密に証明します。一般的なデータセットを用いた広範な数値実験により、この方法の安定性と堅牢性が実証されています。さらに、比較結果から、本手法はクラスタリング評価指標の点で既存の最先端手法のいくつかよりも優れていることが示されています。コードはhttps://github.com/liyan-amss/NOCRM_codeで入手できます。

A causal fused lasso for interpretable heterogeneous treatment effects estimation

A causal fused lasso for interpretable heterogeneous treatment effects estimation / 解釈可能な異種治療効果推定のための因果融合Lasso

We propose a novel method for estimating heterogeneous treatment effects based on the fused lasso. By first ordering samples based on the propensity or prognostic score, we match units from the treatment and control groups. We then run the fused lasso to obtain piecewise constant treatment effects with respect to the ordering defined by the score. Similar to the existing methods based on discretizing the score, our methods yield interpretable subgroup effects. However, existing methods fixed the subgroup a priori, but our causal fused lasso forms data-adaptive subgroups. We show that the estimator consistently estimates the treatment effects conditional on the score under very general conditions on the covariates and treatment. We demonstrate the performance of our procedure using extensive experiments that show that it can be interpretable and competitive with state-of-the-art methods.

本手法では、融合Lassoに基づいて異質な治療効果を推定する新しい手法を提案します。まず、傾向スコアまたは予後スコアに基づいてサンプルを順序付けすることで、治療群と対照群のユニットをマッチングさせます。次に、融合Lassoを実行して、スコアによって定義された順序付けに関して区分的に一定の治療効果を取得します。スコアの離散化に基づく既存の手法と同様に、本手法は解釈可能なサブグループ効果をもたらします。ただし、既存の手法はサブグループを事前に固定していましたが、本手法の因果的融合Lassoはデータ適応型サブグループを形成します。本手法は、共変量と治療に関する非常に一般的な条件下で、スコアを条件として治療効果を一貫して推定できることを示します。我々は、広範な実験を用いて本手法の性能を実証し、それが解釈可能であり、最先端の手法と競合可能であることを示す。

Bayesian Inference of Contextual Bandit Policies via Empirical Likelihood

Bayesian Inference of Contextual Bandit Policies via Empirical Likelihood / ベイズ推論経験的尤度による文脈的バンディットポリシー

Policy inference plays an essential role in the contextual bandit problem. In this paper, we use empirical likelihood to develop a Bayesian inference method for the joint analysis of multiple contextual bandit policies in finite sample regimes. The proposed inference method is robust to small sample sizes and is able to provide accurate uncertainty measurements for policy value evaluation. In addition, it allows for flexible inferences on policy comparison with full uncertainty quantification. We demonstrate the effectiveness of the proposed inference method using Monte Carlo simulations and its application to an adolescent body mass index data set.

政策推論は、文脈バンディット問題において重要な役割を果たす。本稿では、経験尤度を用いて、有限サンプルレジームにおける複数の文脈バンディット政策の共同分析のためのベイズ推論手法を開発します。提案する推論手法は、小規模なサンプルサイズに対してロバストであり、政策の価値評価のための正確な不確実性測定を提供することができます。さらに、完全な不確実性定量化を伴う政策比較に関する柔軟な推論を可能にします。提案する推論手法の有効性は、モンテカルロシミュレーションと、青年期のBMIデータセットへの適用によって実証します。

Convergence and complexity of block majorization-minimization for constrained block-Riemannian optimization

Convergence and complexity of block majorization-minimization for constrained block-Riemannian optimization / 制約付きブロックリーマン最適化におけるブロック主要化最小化の収束と計算量

Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex optimization that sequentially minimizes a majorizing surrogate of the objective function in each block coordinate while the other block coordinates are held fixed. We consider a family of BMM algorithms for minimizing nonsmooth nonconvex objectives, where each parameter block is constrained within a subset of a Riemannian manifold. We establish that this algorithm converges asymptotically to the set of stationary points, and attains an $\epsilon$-stationary point within $\widetilde{O}(\epsilon^{-2})$ iterations. In particular, the assumptions for our complexity results are completely Euclidean when the underlying manifold is a product of Euclidean or Stiefel manifolds, although our analysis makes explicit use of the Riemannian geometry. Our general analysis applies to a wide range of algorithms with Riemannian constraints: Riemannian MM, block projected gradient descent, Bures-JKO scheme for Wasserstein variational inference, optimistic likelihood estimation, geodesically constrained subspace tracking, robust PCA, and Riemannian CP-dictionary-learning. We experimentally validate that our algorithm converges faster than standard Euclidean algorithms applied to the Riemannian setting.

ブロック・マジョライゼーション・ミニマライゼーション（BMM）は、非凸最適化のための単純な反復アルゴリズムであり、他のブロック座標を固定したまま、各ブロック座標における目的関数のマジョライゼーション代理関数を順次最小化します。本稿では、各パラメータブロックがリーマン多様体のサブセット内に制約される、非滑らかな非凸目的関数を最小化するBMMアルゴリズム群を検討します。このアルゴリズムは、定常点の集合に漸近収束し、$\widetilde{O}(\epsilon^{-2})$回の反復で$\epsilon$定常点に到達することを確立します。特に、本解析ではリーマン幾何学を明示的に用いているが、基礎となる多様体がユークリッド多様体またはシュティーフェル多様体の積である場合、複雑性の結果に対する仮定は完全にユークリッド的です。我々の一般的な分析は、リーマン制約を持つ幅広いアルゴリズム、すなわちリーマンMM、ブロック射影勾配降下法、ワッサーシュタイン変分推論のためのBures-JKO法、楽観的尤度推定、測地学的制約付き部分空間追跡、ロバストPCA、リーマンCP辞書学習に適用できます。我々のアルゴリズムは、リーマン制約に適用される標準的なユークリッドアルゴリズムよりも収束が速いことを実験的に検証しました。

Two-way Node Popularity Model for Directed and Bipartite Networks

Two-way Node Popularity Model for Directed and Bipartite Networks / 有向ネットワークおよび二部ネットワークにおける双方向ノード人気モデル

There has been increasing research attention on community detection in directed and bipartite networks. However, these studies often fail to consider the popularity of nodes in different communities, which is a common phenomenon in real-world networks. To address this issue, we propose a new probabilistic framework called the Two-Way Node Popularity Model (TNPM). The TNPM also accommodates edges from different distributions within a general sub-Gaussian family. We introduce the Delete-One-Method (DOM) for model fitting and community structure identification, and provide a comprehensive theoretical analysis with novel technical skills dealing with sub-Gaussian generalization. Additionally, we propose the Two-Stage Divided Cosine Algorithm (TSDC) to handle large-scale networks more efficiently. Our proposed methods offer multi-folded advantages in terms of estimation accuracy and computational efficiency, as demonstrated through extensive numerical studies. We apply our methods to two real-world applications, uncovering interesting findings.

有向ネットワークおよび二部ネットワークにおけるコミュニティ検出に関する研究はますます注目されています。しかし、これらの研究では、異なるコミュニティにおけるノードの人気度が考慮されていないことが多く、これは現実世界のネットワークでよく見られる現象です。この問題に対処するために、我々は双方向ノード人気度モデル（TNPM）と呼ばれる新しい確率的フレームワークを提案します。TNPMは、一般的なサブガウス族内の異なる分布からのエッジも受け入れます。モデルフィッティングとコミュニティ構造同定のためのDelete-One-Method（DOM）を導入し、サブガウス汎化を扱う新しい技術を用いた包括的な理論分析を提供します。さらに、大規模ネットワークをより効率的に処理するためのTwo-Stage Divided Cosine Algorithm（TSDC）を提案します。提案手法は、広範な数値研究によって実証されているように、推定精度と計算効率の点で多面的な利点を提供します。我々は、この手法を2つの実世界アプリケーションに適用し、興味深い知見を得た。

A Symplectic Analysis of Alternating Mirror Descent

A Symplectic Analysis of Alternating Mirror Descent / 交代ミラー降下法のシンプレクティック解析

Motivated by understanding the behavior of the Alternating Mirror Descent (AMD) algorithm for bilinear zero-sum games, we study the discretization of continuous-time Hamiltonian flow via the symplectic Euler method. We provide a framework for analysis using results from Hamiltonian dynamics and symplectic numerical integrators, with an emphasis on the existence and properties of a conserved quantity, the modified Hamiltonian (MH), for the symplectic Euler method.We compute the MH in closed-form when the original Hamiltonian is a quadratic function, and show that it generally differs from the other conserved quantity known previously in the literature. We derive new error bounds on the MH when truncated at orders in the stepsize in terms of the number of iterations, $K$, and use these bounds to show an improved $\mathcal{O}(K^{1/5})$ total regret bound and an $\mathcal{O}(K^{-4/5})$ duality gap of the average iterates for AMD. Finally, we propose a conjecture which, if true, would imply that the total regret for AMD scales as $\mathcal{O}\left(K^{\varepsilon}\right)$ and the duality gap of the average iterates as $\mathcal{O}\left(K^{-1+\varepsilon}\right)$ for any $\varepsilon>0$, and we can take $\varepsilon=0$ upon certain convergence conditions for the MH.

双線形零和ゲームにおけるAlternating Mirror Descent（AMD）アルゴリズムの挙動を理解することを目的に、シンプレクティックオイラー法を用いて連続時間ハミルトンフローの離散化を研究します。我々は、ハミルトニアンダイナミクスとシンプレクティック数値積分器の結果を用いた解析の枠組みを提示し、特にシンプレクティックオイラー法における保存量である修正ハミルトニアン（MH）の存在と特性に着目します。元のハミルトニアンが2次関数である場合にMHを閉じた形で計算し、それが文献で以前知られている他の保存量とは一般に異なることを示す。反復回数$K$に関して、ステップサイズのオーダーで切り捨てられた場合のMHの新しい誤差境界を導出し、この境界を用いてAMDの平均反復回数の改良された合計後悔境界$\mathcal{O}(K^{1/5})$と双対性ギャップ$\mathcal{O}(K^{-4/5})$を示す。最後に、我々は、もし真であれば、AMDの総後悔は$\mathcal{O}\left(K^{\varepsilon}\right)$に比例し、平均の双対性ギャップは任意の$\varepsilon>0$に対して$\mathcal{O}\left(K^{-1+\varepsilon}\right)$に比例し、MHの特定の収束条件では$\varepsilon=0$を取ることができるという予想を提案します。

Contrasting Local and Global Modeling with Machine Learning and Satellite Data: A Case Study Estimating Tree Canopy Height in African Savannas

Contrasting Local and Global Modeling with Machine Learning and Satellite Data: A Case Study Estimating Tree Canopy Height in African Savannas / 機械学習と衛星データを用いた局所モデリングとグローバルモデリングの対比：アフリカのサバンナにおける樹冠高推定のケーススタディ

While advances in machine learning with satellite imagery (SatML) are facilitating environmental monitoring at a global scale, developing SatML models that are accurate and useful for local regions remains critical to understanding and acting on an ever-changing planet. As increasing attention and resources are being devoted to training SatML models with global data, it is important to understand when improvements in global models will make it easier to train or fine-tune models that are accurate in specific regions. To explore this question, we design the first study that explicitly contrasts local and global training paradigms for SatML, through a case study of tree canopy height (TCH) mapping in the Karingani Game Reserve, Mozambique. We find that recent advances in global TCH mapping do not necessarily translate to better local modeling abilities in our study region. Specifically, small models trained only with locally-collected data outperform published global TCH maps, and even outperform globally pretrained models that we fine-tune using local data. Analyzing these results further, we identify specific points of conflict and synergy between local and global modeling paradigms that can inform future research toward aligning local and global performance objectives in geospatial machine learning.

衛星画像を用いた機械学習(SatML)の進歩により、地球規模の環境モニタリングが容易になっている一方で、常に変化する地球を理解し、対応していくためには、正確で地域に有用なSatMLモデルの開発が依然として重要です。地球規模のデータを用いたSatMLモデルのトレーニングにますます注目とリソースが投入されるようになるにつれ、地球規模のモデルの改善によって、特定の地域で正確なモデルのトレーニングや微調整が容易になるのはいつなのかを理解することが重要です。この問題を探るため、モザンビークのカリンガニ動物保護区における樹冠高(TCH)マッピングのケーススタディを通して、SatMLのローカルおよびグローバルのトレーニングパラダイムを明示的に対比する初の研究を設計しました。その結果、近年の地球規模のTCHマッピングの進歩が、必ずしも研究対象地域におけるローカルモデリング能力の向上につながるわけではないことがわかりました。具体的には、ローカルで収集されたデータのみでトレーニングされた小規模なモデルは、公開されている地球規模のTCHマップよりも性能が高く、ローカルデータを使用して微調整した地球規模で事前トレーニングされたモデルよりも性能が優れています。これらの結果をさらに分析することで、局所的モデリングパラダイムと全体的モデリングパラダイム間の具体的な対立点と相乗効果を特定し、地理空間機械学習における局所的および全体的パフォーマンス目標の整合に向けた将来の研究に役立てることができます。

Boosted Control Functions: Distribution Generalization and Invariance in Confounded Models

Boosted Control Functions: Distribution Generalization and Invariance in Confounded Models / ブースト制御関数：交絡モデルにおける分布の汎化と不変性

Modern machine learning methods and the availability of large-scale data have significantly advanced our ability to predict target quantities from large sets of covariates. However, these methods often struggle under distributional shifts, particularly in the presence of hidden confounding. While the impact of hidden confounding is well-studied in causal effect estimation, e.g., instrumental variables, its implications for prediction tasks under shifting distributions remain underexplored. This work addresses this gap by introducing a strong notion of invariance that, unlike existing weaker notions, allows for distribution generalization even in the presence of nonlinear, non-identifiable structural functions. Central to this framework is the Boosted Control Function (BCF), a novel, identifiable target of inference that satisfies the proposed strong invariance notion and is provably worst-case optimal under distributional shifts. The theoretical foundation of our work lies in Simultaneous Equation Models for Distribution Generalization (SIMDGs), which bridge machine learning with econometrics by describing data-generating processes under distributional shifts. To put these insights into practice, we propose the ControlTwicing algorithm to estimate the BCF using nonparametric machine-learning techniques and study its generalization performance on synthetic and real-world datasets compared to robust and empirical risk minimization approaches.

現代の機械学習手法と大規模データの利用可能性は、大規模な共変量セットから目標量を予測する能力を飛躍的に向上させた。しかし、これらの手法は分布シフト、特に隠れた交絡が存在する状況下ではしばしば困難をきたす。隠れた交絡の影響は、操作変数などの因果効果推定においては十分に研究されているが、シフトする分布下での予測タスクへの影響については十分に検討されていない。本研究では、既存の弱い概念とは異なり、非線形で識別不可能な構造関数が存在する場合でも分布の一般化を可能にする強い不変性の概念を導入することで、このギャップを解消します。この枠組みの中心となるのは、ブーステッド制御関数（BCF）です。これは、提案された強い不変性の概念を満たし、分布シフト下で最悪ケース最適であることが証明できる、新しい識別可能な推論対象です。本研究の理論的基礎は、分布一般化のための同時方程式モデル（SIMDG）にあります。SIMDGは、分布シフト下におけるデータ生成プロセスを記述することで、機械学習と計量経済学の橋渡しをします。これらの洞察を実践に移すため、ノンパラメトリック機械学習技術を用いてBCFを推定するControlTwicingアルゴリズムを提案し、合成データセットと実世界のデータセットにおけるその一般化性能を、堅牢な経験的リスク最小化アプローチと比較して研究します。

DCatalyst: A Unified Accelerated Framework for Decentralized Optimization

DCatalyst: A Unified Accelerated Framework for Decentralized Optimization / DCatalyst：分散最適化のための統合高速化フレームワーク

We study decentralized optimization over a network of agents, modeled as an undirected graph and operating without a central server. The objective is to minimize a composite function $f+r$, where $f$ is a (strongly) convex function representing the average of the agents’ losses, and $r$ is a convex, extended-value function (regularizer).We introduce DCatalyst, a unified black-box framework that injects Nesterov-type acceleration into decentralized optimization algorithms. At its core, DCatalyst is an inexact, momentum-accelerated proximal scheme (outer loop) that seamlessly wraps around a given decentralized method (inner loop). We show that DCatalyst attains optimal (up to logarithmic factors) communication and computational complexity across a broad family of decentralized algorithms and problem instances. In particular, it delivers accelerated rates for problem classes that previously lacked accelerated decentralized methods, thereby broadening the effectiveness of decentralized methods.On the technical side, our framework introduces inexact estimating sequences–an extension of Nesterov’s classical estimating sequences, tailored to decentralized, composite optimization. This construction systematically accommodates consensus errors and inexact solutions of local subproblems, addressing challenges that existing estimating-sequence-based analyses cannot handle while retaining a black-box, plug-and-play character.

本研究では、無向グラフとしてモデル化され、中央サーバーなしで動作するエージェントネットワーク上の分散最適化を研究します。目的は、合成関数$f+r$を最小化することです。ここで、$f$はエージェントの損失の平均を表す（強い）凸関数、$r$は凸拡張値関数（正則化子）です。分散最適化アルゴリズムにネステロフ型加速を注入する統合ブラックボックスフレームワークであるDCatalystを紹介します。DCatalystの核となるのは、不正確な運動量加速型近似スキーム（外側のループ）で、与えられた分散手法（内側のループ）をシームレスにラップします。DCatalystは、幅広い分散アルゴリズムと問題インスタンスにわたって、最適な（対数係数まで）通信と計算複雑性を実現することを示します。特に、これまで加速分散手法が不足していた問題クラスに対して加速率を実現し、分散手法の有効性を高めます。技術面では、私たちのフレームワークは、ネステロフの古典的な推定シーケンスを分散型の複合最適化に合わせて拡張した、不正確な推定シーケンスを導入します。この構築は、ブラックボックスのプラグアンドプレイの特性を維持しながら、コンセンサスエラーとローカルサブ問題の不正確な解決を体系的に受け入れ、既存の推定シーケンスベースの分析では処理できない課題に対処します。

Covariate-dependent Hierarchical Dirichlet Processes

Covariate-dependent Hierarchical Dirichlet Processes / 共変量依存階層的ディリクレ過程

Bayesian hierarchical modeling is a natural framework to effectively integrate data and borrow information across groups. In this paper, we address problems related to density estimation and identifying clusters across related groups, by proposing a hierarchical Bayesian approach that incorporates additional covariate information. To achieve flexibility, our approach builds on ideas from Bayesian nonparametrics, combining the hierarchical Dirichlet process with dependent Dirichlet processes. The proposed model is widely applicable, accommodating multiple and mixed covariate types through appropriate kernel functions as well as different output types through suitable component-specific likelihoods. This extends our ability to discern the relationship between covariates and clusters, while also effectively borrowing information and quantifying differences across groups. By employing a data augmentation trick, we are able to tackle the intractable normalized weights and construct a Markov chain Monte Carlo algorithm for posterior inference. The proposed method is illustrated on simulated data and two real data sets on single-cell RNA sequencing (scRNA-seq) and calcium imaging. For scRNA-seq data, we show that the incorporation of cell dynamics facilitates the discovery of additional cell subgroups. On calcium imaging data, our method identifies interpretable clusters of time frames with similar neural activity, aligning with the observed behavior of the animal.

ベイズ階層モデリングは、グループ間でデータを効果的に統合し、情報を借用するための自然なフレームワークです。本稿では、追加の共変量情報を組み込んだ階層ベイズアプローチを提案することで、関連グループ間の密度推定とクラスターの特定に関する問題に取り組みます。柔軟性を実現するために、私たちのアプローチはベイズノンパラメトリックスのアイデアに基づき、階層ディリクレ過程と従属ディリクレ過程を組み合わせています。提案モデルは幅広い適用が可能で、適切なカーネル関数によって複数の混合共変量タイプに対応し、適切なコンポーネント固有の尤度によって異なる出力タイプに対応します。これにより、共変量とクラスターの関係を識別する能力が拡張されると同時に、グループ間で効果的に情報を借用し、差異を定量化できます。データ拡張トリックを用いることで、扱いにくい正規化重みに対処し、事後推論のためのマルコフ連鎖モンテカルロアルゴリズムを構築できます。提案手法は、シミュレーションデータと、シングルセルRNAシーケンシング（scRNA-seq）およびカルシウムイメージングに関する2つの実データセットを用いて実証されています。scRNA-seqデータでは、細胞ダイナミクスを組み込むことで、新たな細胞サブグループの発見が容易になることを示します。カルシウムイメージングデータでは、本手法は、動物の観察された行動と一致する、類似した神経活動を示す解釈可能な時間フレームのクラスターを識別します。

Online Bernstein-von Mises theorem

Online Bernstein-von Mises theorem / オンライン・バーンスタイン＝フォン・ミーゼス定理

Online learning is an inferential paradigm in which parameters are updated incrementally from sequentially available data, in contrast to batch learning, where the entire dataset is processed at once. In this paper, we assume that mini-batches from the full dataset become available sequentially. The Bayesian framework, which updates beliefs about unknown parameters after observing each mini-batch, is naturally suited for online learning. At each step, we update the posterior distribution using the current prior and new observations, with the updated posterior serving as the prior for the next step. However, this recursive Bayesian updating is rarely computationally tractable unless the model and prior are conjugate. When the model is regular, the updated posterior can be approximated by a normal distribution, as justified by the Bernstein-von Mises theorem. We adopt a variational approximation at each step and investigate the frequentist properties of the final posterior obtained through this sequential procedure. Under mild assumptions, we show that the accumulated approximation error becomes negligible once the mini-batch size exceeds a threshold depending on the parameter dimension. As a result, the sequentially updated posterior is asymptotically indistinguishable from the full posterior.

オンライン学習は、データセット全体を一度に処理するバッチ学習とは対照的に、順次利用可能なデータからパラメータを段階的に更新する推論パラダイムです。本稿では、データセット全体からのミニバッチが順次利用可能になると仮定します。各ミニバッチを観測した後に未知のパラメータに関する確信を更新するベイズフレームワークは、オンライン学習に自然に適しています。各ステップで、現在の事前分布と新しい観測値を使用して事後分布を更新し、更新された事後分布を次のステップの事前分布として使用します。しかし、この再帰的なベイズ更新は、モデルと事前分布が共役でない限り、計算的に扱いにくいことがほとんどです。モデルが正則である場合、更新された事後分布は、ベルンシュタイン-フォン・ミーゼス定理によって正当化されるように、正規分布で近似できます。各ステップで変分近似を採用し、この逐次的な手順によって得られる最終的な事後分布の頻度主義的特性を調査します。軽い仮定の下で、ミニバッチサイズがパラメータ次元に依存する閾値を超えると、累積近似誤差は無視できることを示します。その結果、逐次更新された事後分布は、完全な事後分布と漸近的に区別できなくなります。

Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective

Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective / トランスフォーマーは次元の呪いを克服できる：近似の観点からの理論的研究

The Transformer model is widely used in various application areas of machine learning, such as natural language processing. This paper investigates the approximation of the Hölder continuous function class $\mathcal{H}_{Q}^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$ by Transformers and constructs several Transformers that can overcome the curse of dimensionality. These Transformers consist of one self-attention layer with one head and the softmax function as the activation function, along with several feedforward layers. For example, to achieve an approximation accuracy of $\epsilon$, if the activation functions of the feedforward layers in the Transformer are ReLU and floor, only $\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$ layers of feedforward layers are needed, with widths of these layers not exceeding $\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$. If other activation functions are allowed in the feedforward layers, the width of the feedforward layers can be further reduced to a constant. These results demonstrate that Transformers have a strong expressive capability. The construction in this paper is based on the Kolmogorov-Arnold Superposition Theorem and does not require the concept of contextual mapping, hence our proof is more intuitively clear compared to previous Transformer approximation works. Additionally, the translation technique proposed in this paper helps to apply the previous approximation results of feedforward neural networks to Transformer research.

Transformerモデルは、自然言語処理など、機械学習のさまざまな応用分野で広く使用されています。本論文では、Hölder連続関数クラス$\mathcal{H}_{Q}^{\beta}\left([0,1]^{d\times n},\mathbb{R}^{d\times n}\right)$のTransformerによる近似を検証し、次元の呪いを克服できる複数のTransformerを構築します。これらのTransformerは、1つのヘッドを持つ1つの自己注意層と、活性化関数としてソフトマックス関数、そして複数のフィードフォワード層から構成されます。例えば、近似精度$\epsilon$を達成するには、Transformerのフィードフォワード層の活性化関数がReLUとfloorである場合、フィードフォワード層は$\mathcal{O}\left(\log\frac{1}{\epsilon}\right)$層のみ必要であり、これらの層の幅は$\mathcal{O}\left(\frac{1}{\epsilon^{2/\beta}}\log\frac{1}{\epsilon}\right)$を超えません。フィードフォワード層で他の活性化関数が許可されている場合、フィードフォワード層の幅はさらに定数まで削減できます。これらの結果は、Transformerが強力な表現能力を持っていることを示しています。本論文の構成はKolmogorov-Arnoldの重ね合わせ定理に基づいており、コンテキストマッピングの概念を必要としないため、これまでのTransformer近似研究と比較して、証明はより直感的に明確です。さらに、本論文で提案された変換技術は、フィードフォワードニューラルネットワークの以前の近似結果をTransformer研究に適用するのに役立ちます。

参考文献

Journal of Machine Learning Research Papers Volume 27

論文