Journal of Machine Learning Research Papers: List of Papers in Volume 25

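This page lists the papers published in Journal of Machine Learning Research Volume 25 and presents each title and abstract together with a Japanese rendering produced with the help of machine translation.
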
Aequitas Flow: Streamlining Fair ML Experimentation
Aequitas Flow：公正なML実験の効率化

Aequitas Flow is an open-source framework and toolkit for end-to-end Fair Machine Learning (ML) experimentation and benchmarking in Python. This package fills integration gaps that exist in other fair ML packages. In addition to the existing audit capabilities in Aequitas, the Aequitas Flow module provides a pipeline for fairness-aware model training, hyperparameter optimization, and evaluation, enabling easy-to-use and rapid experiments and analysis of results. Aimed at ML practitioners and researchers, the framework offers implementations of methods, datasets, metrics, and standard interfaces for these components to improve extensibility. By facilitating the development of fair ML practices, Aequitas Flow hopes to enhance the incorporation of fairness concepts in AI systems, making AI systems more robust and fair.

Aequitas Flowは、Pythonでのエンドツーエンドのフェアマシンラーニング(ML)実験とベンチマークのためのオープンソースフレームワークおよびツールキットです。このパッケージは、他のフェアMLパッケージに存在する統合ギャップを埋めます。Aequitasの既存の監査機能に加えて、Aequitas Flowモジュールは、フェアネスを考慮したモデルトレーニング、ハイパーパラメータの最適化、および評価のためのパイプラインを提供し、使いやすく迅速な実験と結果の分析を可能にします。ML実践者と研究者を対象としたこのフレームワークは、これらのコンポーネントの拡張性を向上させるためのメソッド、データセット、メトリック、および標準インターフェイスの実装を提供します。フェアMLプラクティスの開発を促進することで、Aequitas FlowはAIシステムへのフェアネス概念の組み込みを強化し、AIシステムをより堅牢でフェアなものにしたいと考えています。
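
As a rough illustration of the kind of group-level audit metric the abstract refers to, the sketch below computes the false positive rate per protected group and the resulting disparity ratio. It deliberately does not use the Aequitas or Aequitas Flow API; the function name `false_positive_rate_by_group` and the toy data are assumptions made only for this example.

```python
import numpy as np

def false_positive_rate_by_group(y_true, y_pred, group):
    """Return {group value: FPR}, where FPR = FP / (FP + TN) within each group."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        negatives = (group == g) & (y_true == 0)
        # FPR is undefined when a group has no true negatives
        rates[g.item()] = float(y_pred[negatives].mean()) if negatives.any() else float("nan")
    return rates

# Hypothetical labels, predictions, and group membership
y_true = [0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

fpr = false_positive_rate_by_group(y_true, y_pred, group)
disparity = fpr["b"] / fpr["a"]  # ratio of group b's FPR to group a's FPR
```

A disparity ratio far from 1 would flag the model for closer inspection; a fairness-aware pipeline would then compare such metrics across trained models.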

Information Capacity Regret Bounds for Bandits with Mediator Feedback
仲介者フィードバック付きバンディットに対する情報容量後悔境界

This work addresses the mediator feedback problem, a bandit game where the decision set consists of a number of policies, each associated with a probability distribution over a common space of outcomes. Upon choosing a policy, the learner observes an outcome sampled from its distribution and incurs the loss assigned to this outcome in the present round. We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set. Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity in both the adversarial and the stochastic settings. For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity. We also consider the case when the policies’ distributions can vary between rounds, thus addressing the related bandits with expert advice problem, for which we improve upon prior results. Additionally, we prove a lower bound showing that exploiting the similarity between the policies is not possible in general under linear bandit feedback. Finally, for a full-information variant, we provide a regret bound scaling with the information radius of the policy set.

この研究は、仲介者フィードバック問題を扱います。これはバンディットゲームの一種で、決定セットは多数のポリシーで構成され、各ポリシーは共通の結果空間上の確率分布に関連付けられています。ポリシーを選択すると、学習者はその分布からサンプリングされた結果を観察し、現在のラウンドでこの結果に割り当てられた損失を被ります。ポリシーセットの複雑さの情報理論的尺度として、ポリシーセット容量を導入します。古典的なEXP4アルゴリズムを採用して、敵対的設定と確率的設定の両方でポリシーセット容量に依存する新しい後悔境界を提供します。いくつかのポリシーセットファミリについて、容量に対して同様にスケールする、ほぼ一致する下限を証明します。また、ポリシーの分布がラウンド間で変化しうる場合も考慮し、関連する専門家のアドバイス付きバンディット問題を扱い、その従来の結果を改善します。さらに、線形バンディットフィードバックの下では、ポリシー間の類似性を利用することは一般に不可能であることを示す下限を証明します。最後に、完全情報バリアントについて、ポリシーセットの情報半径に応じてスケールする後悔境界を提供します。
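
The mediator feedback protocol and an EXP4-style update can be sketched as follows. This is a minimal illustration under assumptions, not the paper's exact algorithm or tuning: the function name `exp4_mediator`, the `loss_fn` interface, and the learning rate are hypothetical. Each policy's loss is estimated by importance weighting the single observed outcome.

```python
import numpy as np

def exp4_mediator(policies, loss_fn, horizon, eta, rng):
    """policies: (K, M) array, row i = distribution of policy i over M outcomes.
    loss_fn(t) returns a length-M loss vector in [0, 1] for round t (hypothetical interface)."""
    K, M = policies.shape
    log_w = np.zeros(K)
    for t in range(horizon):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i = rng.choice(K, p=p)               # learner picks a policy
        y = rng.choice(M, p=policies[i])     # an outcome is sampled from that policy
        loss = loss_fn(t)[y]                 # only this outcome's loss is observed
        q_y = p @ policies[:, y]             # probability of observing outcome y this round
        est = policies[:, y] * loss / q_y    # unbiased estimate of every policy's expected loss
        log_w -= eta * est                   # exponential-weights update
    p = np.exp(log_w - log_w.max())
    return p / p.sum()

# Toy instance: outcome 1 always costs 1, so the policy favouring outcome 0 is best.
policies = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
final_weights = exp4_mediator(policies, lambda t: np.array([0.0, 1.0]),
                              horizon=2000, eta=0.05, rng=np.random.default_rng(0))
```

Because one observed outcome updates every policy that puts mass on it, similar policies share information, which is the effect the capacity-based bounds quantify.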

DAG-Informed Structure Learning from Multi-Dimensional Point Processes
DAG情報に基づく多次元点過程からの構造学習

Motivated by inferring causal relationships among neurons using ensemble spike train data, this paper introduces a new technique for learning the structure of a directed acyclic graph (DAG) within a large network of events, applicable to diverse multi-dimensional temporal point process (MuTPP) data. At the core of MuTPP lie the conditional intensity functions, for which we construct a generative model parameterized by the graph parameters of a DAG and develop an equality-constrained estimator, departing from exhaustive search-based methods. We present a novel, flexible augmented Lagrangian (Flex-AL) optimization scheme that ensures provable global convergence and computational efficiency gains over the classical AL algorithm. Additionally, we explore causal structure learning by integrating acyclicity-constraints and sparsity-regularization. We demonstrate: (i) in cases without regularization, the incorporation of the acyclicity constraint is essential for ensuring DAG recovery consistency; (ii) with suitable regularization, the DAG-constrained estimator achieves both parameter estimation and DAG reconstruction consistencies similar to the unconstrained counterpart, but significantly enhances empirical performance. Furthermore, simulation studies indicate that our proposed DAG-constrained estimator, when appropriately penalized, yields more accurate graphs compared to unconstrained or unregularized estimators. Finally, we apply the proposed method to two real MuTPP datasets.

本稿では、アンサンブルスパイクトレインデータを使用してニューロン間の因果関係を推測することに着目し、多様な多次元時相点過程(MuTPP)データに適用可能な、大規模なイベントネットワーク内の有向非巡回グラフ(DAG)の構造を学習する新しい手法を紹介します。MuTPPの中核となるのは条件付き強度関数です。これに対して、DAGのグラフパラメータでパラメータ化された生成モデルを構築し、網羅的な検索ベースの方法から離れて、等式制約付き推定器を開発します。古典的なALアルゴリズムに比べて、証明可能なグローバル収束と計算効率の向上を保証する、新しい柔軟な拡張ラグランジュ(Flex-AL)最適化スキームを紹介します。さらに、非巡回制約とスパース性正則化を統合することにより、因果構造学習を探ります。次のことを示します。(i)正則化がない場合、非巡回制約を組み込むことは、DAG回復の一貫性を保証するために不可欠です。(ii)適切な正則化により、DAG制約付き推定器は、制約なしの推定器と同様のパラメータ推定とDAG再構成の一貫性の両方を実現しますが、実験的なパフォーマンスが大幅に向上します。さらに、シミュレーション研究では、提案されたDAG制約付き推定器は、適切にペナルティを課された場合、制約なしまたは正則化されていない推定器と比較して、より正確なグラフを生成することが示されています。最後に、提案された方法を2つの実際のMuTPPデータセットに適用します。
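
An equality-constrained DAG estimator needs a function that is zero exactly on acyclic graphs. The abstract does not say which characterization the paper uses, so the sketch below assumes one widely used polynomial form, h(W) = tr[(I + W∘W/d)^d] - d, purely for illustration. An augmented Lagrangian method would then penalize h(W) while fitting the model.

```python
import numpy as np

def acyclicity(W):
    """h(W) = tr[(I + W*W / d)^d] - d, which is zero exactly when W encodes no directed cycle."""
    d = W.shape[0]
    A = W * W  # entrywise square gives nonnegative edge weights
    return float(np.trace(np.linalg.matrix_power(np.eye(d) + A / d, d)) - d)

# Hypothetical weighted adjacency matrices: a chain 0 -> 1 -> 2, then with edge 2 -> 0 added
dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, -2.0],
                [0.0, 0.0, 0.0]])
cyclic = dag.copy()
cyclic[2, 0] = 0.7

h_dag = acyclicity(dag)        # no cycle
h_cyclic = acyclicity(cyclic)  # contains the cycle 0 -> 1 -> 2 -> 0
```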

Optimizing Noise for f-Differential Privacy via Anti-Concentration and Stochastic Dominance
反集中と確率的優位性によるf差分プライバシーのためのノイズの最適化

In this paper, we establish anti-concentration inequalities for additive noise mechanisms which achieve $f$-differential privacy ($f$-DP), a notion of privacy phrased in terms of a tradeoff function $f$ which limits the ability of an adversary to determine which individuals were in the database. We show that canonical noise distributions (CNDs), proposed by Awan and Vadhan (2023), match the anti-concentration bounds at half-integer values, indicating that their tail behavior is near-optimal. We also show that all CNDs are sub-exponential, regardless of the $f$-DP guarantee. In the case of log-concave CNDs, we show that they are the stochastically smallest noise compared to any other noise distributions with the same strong privacy guarantee. In terms of integer-valued noise, we propose a new notion of discrete CND and prove that a discrete CND always exists, can be constructed by rounding a continuous CND, and that the discrete CND is unique when designed for a statistic with sensitivity 1. We further show that the discrete CND at sensitivity 1 is stochastically smallest compared to other integer-valued noises. Our theoretical results shed light on the different types of privacy guarantees possible in the $f$-DP framework and can be incorporated in more complex mechanisms to optimize performance.

本稿では、データベースにどの個人が含まれていたかを敵対者が特定する能力を制限するトレードオフ関数fで表現されるプライバシーの概念である$f$差分プライバシー($f$-DP)を実現する加法ノイズメカニズムの反集中不等式を確立します。AwanとVadhan (2023)によって提案された標準ノイズ分布(CND)は、半整数値で反集中境界と一致し、そのテール動作がほぼ最適であることを示します。また、$f$-DP保証に関係なく、すべてのCNDがサブ指数であることも示します。対数凹CNDの場合、同じ強力なプライバシー保証を持つ他のノイズ分布と比較して、確率的に最小のノイズであることを示します。整数値ノイズに関しては、離散CNDという新しい概念を提案し、離散CNDが常に存在し、連続CNDを丸めることで構築でき、感度1の統計用に設計された離散CNDは一意であることを証明します。さらに、感度1の離散CNDは、他の整数値ノイズと比較して確率的に最小であることを示します。私たちの理論的結果は、$f$-DPフレームワークで可能なさまざまな種類のプライバシー保証を明らかにし、より複雑なメカニズムに組み込んでパフォーマンスを最適化することができます。
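
To make the notion of a tradeoff function concrete, the sketch below evaluates the standard Gaussian tradeoff function G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), which maps an adversary's type I error alpha to the smallest achievable type II error. It is a textbook example of an $f$, chosen here as an assumption; it is not the construction of canonical noise distributions studied in the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), for 0 < alpha < 1 and mu >= 0."""
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(1.0 - alpha) - mu)

perfect_privacy = gaussian_tradeoff(0.3, 0.0)  # mu = 0: the adversary can do no better than guessing
weaker_privacy = gaussian_tradeoff(0.3, 1.0)   # larger mu: smaller type II error for the adversary
```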

A Rainbow in Deep Network Black Boxes
ディープネットワークのブラックボックスの中の虹

A central question in deep learning is to understand the functions learned by deep networks. What is their approximation class? Do the learned weights and representations depend on initialization? Previous empirical work has evidenced that kernels defined by network activations are similar across initializations. For shallow networks, this has been theoretically studied with random feature models, but an extension to deep networks has remained elusive. Here, we provide a deep extension of such random feature models, which we call the rainbow model. We prove that rainbow networks define deterministic (hierarchical) kernels in the infinite-width limit. The resulting functions thus belong to a data-dependent RKHS which does not depend on the weight randomness. We also verify numerically our modeling assumptions on deep CNNs trained on image classification tasks, and show that the trained networks approximately satisfy the rainbow hypothesis. In particular, rainbow networks sampled from the corresponding random feature model achieve similar performance as the trained networks. Our results highlight the central role played by the covariances of network weights at each layer, which are observed to be low-rank as a result of feature learning.

ディープラーニングにおける中心的な問題は、ディープネットワークによって学習された関数を理解することです。それらの近似クラスは何ですか?学習された重みと表現は初期化に依存しますか?これまでの実証研究は、ネットワークアクティベーションによって定義されたカーネルは初期化間で類似していることを証明しています。浅いネットワークの場合、これはランダム特徴モデルで理論的に研究されてきましたが、ディープネットワークへの拡張は依然として困難でした。ここでは、そのようなランダム特徴モデルのディープ拡張を提供し、これをレインボーモデルと呼びます。レインボーネットワークは、無限幅の極限で決定論的(階層的)カーネルを定義することを証明します。したがって、結果として得られる関数は、重みのランダム性に依存しないデータ依存のRKHSに属します。また、画像分類タスクでトレーニングされたディープCNNのモデリング仮定を数値的に検証し、トレーニングされたネットワークがレインボー仮説をほぼ満たすことを示します。特に、対応するランダム特徴モデルからサンプリングされたレインボーネットワークは、トレーニングされたネットワークと同様のパフォーマンスを実現します。私たちの結果は、特徴学習の結果として低ランクであることが観察される各層のネットワーク重みの共分散が果たす中心的な役割を強調しています。

How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
2層ニューラルネットワークはどのように学習するか：一度に1つの(巨大な)ステップで

For high-dimensional Gaussian data, we investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to an improvement in the approximation capacity with respect to the initialization. First, we compare the influence of batch size to that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = O(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = O(d^2)$ is essential for neurons to specialize in multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist “hard” directions requiring $n = O(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. Second, we show that the picture drastically improves over multiple gradient steps: a batch size of $n = O(d)$ is indeed sufficient to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allow for a drastic improvement in the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the intertwined role of the structure of the task to learn, the detail of the algorithm (the batch size), and the architecture (i.e., the number of hidden neurons), shedding new light on how neural networks adapt to the feature and learn complex task from data over time.

高次元ガウスデータについて、2層ニューラルネットワークの特徴が少数の大きなバッチ勾配降下ステップを通じてターゲット関数の構造に適応し、初期化時と比べて近似能力が向上する仕組みを理論的に調査します。まず、バッチサイズの影響を複数（ただし有限個）のステップの影響と比較します。単一の勾配ステップの場合、サイズ$n = O(d)$のバッチは、ターゲット関数に整合するために必要かつ十分ですが、学習できる方向は1つだけです。対照的に、$n = O(d^2)$は、ニューロンが単一の勾配ステップでターゲットの複数の関連方向に特化するために不可欠です。この場合でも、学習に$n = O(d^\ell)$個のサンプルを必要とする「ハード」な方向が存在しうることを示します。ここで、$\ell$はターゲットのリープ指数(leap index)として知られています。次に、複数の勾配ステップでは状況が劇的に改善することを示します。バッチサイズ$n = O(d)$は、階段状の性質を満たす複数のターゲット方向を学習するのに十分であり、時間の経過とともにますます多くの方向を学習できます。最後に、これらの方向によって、初期化時と比べて近似能力と汎化誤差が大幅に改善される仕組みについて説明し、ランダム特徴/レイジーレジームと特徴学習レジームの間のスケールの分離を示します。私たちの技術的分析では、集中、射影ベースの条件付け、ガウス等価性に関連する手法の組み合わせを活用しており、これらはそれ自体でも興味深いものと考えています。特殊化と学習に必要な条件を特定することで、学習するタスクの構造、アルゴリズムの詳細(バッチサイズ)、アーキテクチャ(つまり、隠れニューロンの数)が絡み合って果たす役割が明らかになり、ニューラルネットワークが特徴に適応し、時間の経過とともにデータから複雑なタスクを学習する方法に新たな光が当てられます。

Hamiltonian Monte Carlo for efficient Gaussian sampling: long and random steps
効率的なガウスサンプリングのためのハミルトニアンモンテカルロ：長くランダムなステップ

Hamiltonian Monte Carlo (HMC) is a Markov chain algorithm for sampling from a high-dimensional distribution with density $e^{-f(x)}$, given access to the gradient of $f$. A particular case of interest is that of a $d$-dimensional Gaussian distribution with covariance matrix $\Sigma$, in which case $f(x) = x^\top \Sigma^{-1} x$. We show that Metropolis-adjusted HMC can sample from a distribution that is $\varepsilon$-close to a Gaussian in total variation distance using $\widetilde{O}(\sqrt{\kappa} d^{1/4} \log(1/\varepsilon))$ gradient queries, where $\varepsilon>0$ and $\kappa$ is the condition number of $\Sigma$. Our algorithm uses long and random integration times for the Hamiltonian dynamics, and it creates a warm start by first running HMC without a Metropolis adjustment. This contrasts with (and was motivated by) recent results that give an $\widetilde\Omega(\kappa d^{1/2})$ query lower bound for HMC with a fixed integration time or from a cold start, even for the Gaussian case.

ハミルトニアンモンテカルロ(HMC)は、$f$の勾配へのアクセスが与えられたときに、密度$e^{-f(x)}$の高次元分布からサンプリングするマルコフ連鎖アルゴリズムです。特に興味深いのは、共分散行列$\Sigma$を持つ$d$次元ガウス分布の場合で、この場合$f(x) = x^\top \Sigma^{-1} x$となります。我々は、メトロポリス調整HMCが、$\widetilde{O}(\sqrt{\kappa} d^{1/4} \log(1/\varepsilon))$回の勾配クエリを使用して、総変動距離でガウス分布に$\varepsilon$近い分布からサンプリングできることを示します。ここで、$\varepsilon>0$であり、$\kappa$は$\Sigma$の条件数です。我々のアルゴリズムは、ハミルトン力学に長くランダムな積分時間を使用し、最初にメトロポリス調整なしでHMCを実行することでウォームスタートを作成します。これは、ガウス分布の場合でも、固定積分時間またはコールドスタートのHMCに対して$\widetilde\Omega(\kappa d^{1/2})$のクエリ下限を与える最近の結果とは対照的です(そして、その結果が本研究の動機となりました)。
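
A minimal sketch of Metropolis-adjusted HMC with a randomly drawn number of leapfrog steps, for a target with density proportional to $e^{-f(x)}$ and $f(x) = x^\top \Sigma^{-1} x$ as in the abstract. The step size, the range of step counts, and the name `hmc_gaussian` are illustrative assumptions; the warm-start phase and the tuning analysed in the paper are not reproduced.

```python
import numpy as np

def hmc_gaussian(precision, n_samples, step, max_leapfrog, rng):
    """Metropolis-adjusted HMC for density proportional to exp(-x^T precision x).
    `precision` is assumed symmetric positive definite."""
    d = precision.shape[0]
    f = lambda x: x @ precision @ x
    grad = lambda x: 2.0 * precision @ x
    x = np.zeros(d)
    samples, accepted = [], 0
    for _ in range(n_samples):
        p0 = rng.standard_normal(d)                   # fresh Gaussian momentum
        n_steps = rng.integers(1, max_leapfrog + 1)   # random integration time
        x_new, p = x.copy(), p0.copy()
        p -= 0.5 * step * grad(x_new)                 # leapfrog: initial half step in momentum
        for k in range(n_steps):
            x_new += step * p
            if k < n_steps - 1:
                p -= step * grad(x_new)
        p -= 0.5 * step * grad(x_new)                 # final half step in momentum
        h_old = f(x) + 0.5 * p0 @ p0
        h_new = f(x_new) + 0.5 * p @ p
        if rng.random() < np.exp(min(0.0, h_old - h_new)):  # Metropolis accept/reject
            x, accepted = x_new, accepted + 1
        samples.append(x.copy())
    return np.array(samples), accepted / n_samples

samples, accept_rate = hmc_gaussian(np.diag([1.0, 4.0]), n_samples=2000,
                                    step=0.05, max_leapfrog=40,
                                    rng=np.random.default_rng(0))
```

Drawing the number of steps independently of the current state keeps the proposal reversible, so the usual Metropolis correction remains valid.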

Memorization With Neural Nets: Going Beyond the Worst Case
ニューラルネットによる記憶：最悪のケースを超えて

In practice, deep neural networks are often able to easily interpolate their training data. To understand this phenomenon, many works have aimed to quantify the memorization capacity of a neural network architecture: the largest number of points such that the architecture can interpolate any placement of these points with any assignment of labels. For real-world data, however, one intuitively expects the presence of a benign structure so that interpolation already occurs at a smaller network size than suggested by memorization capacity. In this paper, we investigate interpolation by adopting an instance-specific viewpoint. We introduce a simple randomized algorithm that, given a fixed finite data set with two classes, with high probability constructs an interpolating three-layer neural network in polynomial time. The required number of parameters is linked to geometric properties of the two classes and their mutual arrangement. As a result, we obtain guarantees that are independent of the number of samples and hence move beyond worst-case memorization capacity bounds. We verify our theoretical result with numerical experiments and additionally investigate the effectiveness of the algorithm on MNIST and CIFAR-10.

実際には、ディープニューラルネットワークはトレーニングデータを簡単に補間できる場合が多いです。この現象を理解するために、多くの研究がニューラルネットワーク・アーキテクチャの記憶容量を定量化することを目指してきました。記憶容量とは、アーキテクチャが任意のラベル割り当てのもとで点の任意の配置を補間できるような、最大の点の数です。しかし、現実世界のデータの場合、良性の構造が存在し、記憶容量が示唆するよりも小さいネットワークサイズで補間がすでに発生することが直感的に予想されます。この論文では、インスタンス固有の観点を採用して補間を調査します。2つのクラスを含む固定された有限データセットが与えられた場合、高い確率で補間3層ニューラルネットワークを多項式時間で構築する、単純なランダム化アルゴリズムを紹介します。必要なパラメーターの数は、2つのクラスの幾何学的特性とそれらの相互配置に関連しています。その結果、サンプル数に依存しない保証が得られ、最悪の場合の記憶容量の限界を超えることができます。我々は数値実験によって理論的結果を検証し、さらにMNISTとCIFAR-10におけるアルゴリズムの有効性を調査します。

PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates
PROMISE：スケーラブルな曲率推定を組み込むことによる前処理された確率的最適化法

Ill-conditioned problems are ubiquitous in large-scale machine learning: as a data set grows to include more and more features correlated with the labels, the condition number increases. Yet traditional stochastic gradient methods converge slowly on these ill-conditioned problems, even with careful hyperparameter tuning. This paper introduces PROMISE (Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates), a suite of sketching-based preconditioned stochastic gradient algorithms that deliver fast convergence on ill-conditioned large-scale convex optimization problems arising in machine learning. PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha; each algorithm comes with a strong theoretical analysis and effective default hyperparameter values. Empirically, we verify the superiority of the proposed algorithms by showing that, using default hyperparameter values, they outperform or match popular tuned stochastic gradient optimizers on a test bed of 51 ridge and logistic regression problems assembled from benchmark machine learning repositories. On the theoretical side, this paper introduces the notion of quadratic regularity in order to establish linear convergence of all proposed methods even when the preconditioner is updated infrequently. The speed of linear convergence is determined by the quadratic regularity ratio, which often provides a tighter bound on the convergence rate compared to the condition number, both in theory and in practice, and explains the fast global linear convergence of the proposed methods.

大規模な機械学習では、悪条件の問題が至る所で発生します。データセットが大きくなり、ラベルと相関する特徴がますます多く含まれるようになると、条件数が増加します。しかし、従来の確率的勾配法では、ハイパーパラメータを慎重に調整しても、これらの悪条件の問題への収束は遅いです。この論文では、機械学習で発生する悪条件の大規模な凸最適化問題に高速収束を実現するスケッチベースの前処理済み確率的勾配アルゴリズムのスイートであるPROMISE (スケーラブルな曲率推定を組み込んだ前処理付き確率最適化法)を紹介します。PROMISEには、SVRG、SAGA、およびKatyushaの前処理済みバージョンが含まれており、各アルゴリズムには強力な理論的分析と効果的なデフォルトのハイパーパラメータ値が付属しています。実験的に、デフォルトのハイパーパラメータ値を使用して、ベンチマーク機械学習リポジトリから集められた51のリッジおよびロジスティック回帰問題のテストベッドで、一般的な調整済み確率的勾配オプティマイザーよりも優れているか同等であることを示すことで、提案アルゴリズムの優位性を検証します。理論面では、本論文では、前処理がまれにしか更新されない場合でも、提案されたすべての方法の線形収束を確立するために、2次正則性の概念を導入しています。線形収束の速度は、2次正則性比によって決定されます。これは、理論と実践の両方で条件数と比較して、収束率のより厳しい境界を提供することが多く、提案された方法の高速なグローバル線形収束を説明しています。
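
The idea of combining variance reduction with an infrequently updated curvature-based preconditioner can be sketched on ridge regression. This is not the PROMISE implementation: the subsampled-Hessian preconditioner, the damping `rho`, the step size, and all function names are assumptions chosen for a small self-contained example.

```python
import numpy as np

def preconditioned_svrg(X, y, lam, epochs, batch, eta, rng, hess_sample=200, rho=1e-3):
    n, d = X.shape
    w = np.zeros(d)
    full_grad = lambda v: X.T @ (X @ v - y) / n + lam * v
    # Preconditioner from a row subsample of the Hessian, built once and reused
    S = rng.choice(n, size=min(hess_sample, n), replace=False)
    P = X[S].T @ X[S] / len(S) + (lam + rho) * np.eye(d)
    for _ in range(epochs):
        w_ref, g_ref = w.copy(), full_grad(w)        # SVRG snapshot and its full gradient
        for _ in range(n // batch):
            B = rng.choice(n, size=batch, replace=False)
            XB, yB = X[B], y[B]
            g = XB.T @ (XB @ w - yB) / batch + lam * w
            g_snap = XB.T @ (XB @ w_ref - yB) / batch + lam * w_ref
            direction = g - g_snap + g_ref           # variance-reduced gradient estimate
            w = w - eta * np.linalg.solve(P, direction)
    return w

def ridge_loss(X, y, w, lam):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w

# Hypothetical ill-conditioned problem: feature scales span two orders of magnitude
rng = np.random.default_rng(0)
n, d, lam = 1000, 10, 1e-4
X = rng.standard_normal((n, d)) * np.logspace(0, -2, d)
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)

w_hat = preconditioned_svrg(X, y, lam, epochs=5, batch=32, eta=0.2, rng=rng)
initial_loss = ridge_loss(X, y, np.zeros(d), lam)
final_loss = ridge_loss(X, y, w_hat, lam)
```

Solving with `P` rescales the poorly conditioned directions, so a single fixed step size works across all of them.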

Causal effects of intervening variables in settings with unmeasured confounding
測定されていない交絡因子のある設定における介在変数の因果効果

We present new results on average causal effects in settings with unmeasured exposure-outcome confounding. Our results are motivated by a class of estimands, e.g., frequently of interest in medicine and public health, that are currently not targeted by standard approaches for average causal effects. We recognize these estimands as queries about the average causal effect of an intervening variable. We anchor our introduction of these estimands in an investigation of the role of chronic pain and opioid prescription patterns, and illustrate how conventional approaches will lead to non-replicable estimates with ambiguous policy implications. We argue that our alternative effects are replicable and have clear policy implications, and furthermore are non-parametrically identified by the classical frontdoor formula. As an independent contribution, we derive a new semiparametric efficient estimator of the frontdoor formula with a uniform sample boundedness guarantee. This property is unique among previously-described estimators in its class, and we demonstrate superior performance in finite-sample settings. The theoretical results are applied to data from the National Health and Nutrition Examination Survey.

我々は、測定されていない曝露-結果交絡がある設定における平均因果効果に関する新しい結果を提示します。我々の結果は、例えば医学や公衆衛生で頻繁に関心を集めている、平均因果効果の標準的なアプローチでは現在対象とされていない推定値のクラスに動機付けられています。我々はこれらの推定値を、介在変数の平均因果効果に関するクエリとして認識しています。我々はこれらの推定値の導入を、慢性疼痛とオピオイド処方パターンの役割の調査に結び付け、従来のアプローチが曖昧な政策的含意を持つ再現不可能な推定値につながることを示しています。我々は、代替効果は再現可能で明確な政策的含意があり、さらに古典的なフロントドア式によってノンパラメトリックに識別されると主張します。独立した貢献として、我々は均一なサンプルの有界性保証を備えたフロントドア式の新しいセミパラメトリックな効率的な推定値を導出します。この特性は、そのクラスの以前に説明された推定値の中では独特であり、有限サンプル設定で優れたパフォーマンスを示す。理論的結果は、国民健康栄養調査のデータに適用されます。
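
The classical frontdoor formula mentioned in the abstract, P(y | do(a)) = Σ_m P(m | a) Σ_a' P(y | m, a') P(a'), can be checked numerically on a small discrete model. The sketch below is a plain plug-in evaluation on a known joint distribution, not the semiparametric efficient estimator proposed in the paper; the toy probabilities are assumptions.

```python
import numpy as np

def frontdoor(obs):
    """obs[a, m, y] = observed joint P(A=a, M=m, Y=y). Returns P(Y=1 | do(A=a)) for each a."""
    p_a = obs.sum(axis=(1, 2))
    p_m_given_a = obs.sum(axis=2) / p_a[:, None]
    p_y1_given_am = obs[:, :, 1] / obs.sum(axis=2)
    inner = p_a @ p_y1_given_am        # sum over a' of P(a') P(Y=1 | a', m), indexed by m
    return p_m_given_a @ inner

# Toy model: hidden U confounds A and Y; M carries all of A's effect on Y.
p_u = np.array([0.5, 0.5])
p_a_u = np.array([[0.8, 0.2], [0.2, 0.8]])    # P(A=a | U=u), rows indexed by u
p_m_a = np.array([[0.7, 0.3], [0.1, 0.9]])    # P(M=m | A=a), rows indexed by a
p_y1_mu = np.array([[0.1, 0.5], [0.4, 0.9]])  # P(Y=1 | M=m, U=u), rows indexed by m

p_y_mu = np.stack([1 - p_y1_mu, p_y1_mu], axis=-1)              # axes [m, u, y]
joint = np.einsum("u,ua,am,muy->uamy", p_u, p_a_u, p_m_a, p_y_mu)
obs = joint.sum(axis=0)                   # U is never observed
truth = p_m_a @ (p_y1_mu @ p_u)           # interventional P(Y=1 | do(A=a)) computed using U
estimate = frontdoor(obs)
```

The two vectors agree even though `frontdoor` never sees U, which is what identification by the frontdoor formula means.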

Lower Complexity Adaptation for Empirical Entropic Optimal Transport
経験的エントロピー最適輸送のための低複雑性適応

Entropic optimal transport (EOT) presents an effective and computationally viable alternative to unregularized optimal transport (OT), offering diverse applications for large-scale data analysis. In this work, we derive novel statistical bounds for empirical plug-in estimators of the EOT cost and show that their statistical performance in the entropy regularization parameter $\varepsilon$ and the sample size $n$ only depends on the simpler of the two probability measures. For instance, under sufficiently smooth costs this yields the parametric rate $n^{-1/2}$ with factor $\varepsilon^{-d/2}$, where $d$ is the minimum dimension of the two population measures. This confirms that empirical EOT also adheres to the lower complexity adaptation principle, a hallmark feature only recently identified for unregularized OT. As a consequence of our theory, we show that the empirical entropic Gromov-Wasserstein distance and its unregularized version for measures on Euclidean spaces also obey this principle. Additionally, we comment on computational aspects and complement our findings with Monte Carlo simulations. Our technique employs empirical process theory and relies on a dual formulation of EOT over a single function class. Central to our analysis is the observation that the entropic cost-transformation of a function class does not increase its uniform metric entropy by much.

エントロピー最適輸送(EOT)は、非正則化最適輸送(OT)に代わる効果的で計算上実用的な手段であり、大規模データ分析にさまざまな用途を提供します。この研究では、EOTコストの経験的プラグイン推定量の新しい統計的境界を導出し、エントロピー正則化パラメーター$\varepsilon$とサンプルサイズ$n$に関する統計的パフォーマンスは、2つの確率測度のうちのより単純なものにのみ依存することを示します。たとえば、コストが十分に滑らかな場合、これにより、係数$\varepsilon^{-d/2}$を持つパラメトリックレート$n^{-1/2}$が得られます。ここで、$d$は2つの母集団測度の最小次元です。これは、経験的EOTも、非正則化OTで最近になって特定された特徴的な性質である低複雑性適応原理に従っていることを裏付けています。私たちの理論の帰結として、経験的エントロピーGromov-Wasserstein距離と、ユークリッド空間上の測度に対するその非正則化バージョンもこの原理に従うことを示します。さらに、計算面についてもコメントし、モンテカルロシミュレーションで調査結果を補完します。私たちの手法は経験過程理論を採用し、単一の関数クラスに対するEOTの双対定式化に依存しています。私たちの分析の中心となるのは、関数クラスのエントロピーコスト変換によって、その一様メトリックエントロピーがそれほど増加しないという観察です。
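
The empirical plug-in estimator of the EOT cost is usually computed with Sinkhorn iterations. The sketch below assumes a squared Euclidean cost, uniform weights on the samples, and the objective ⟨P, C⟩ + ε·KL(P ‖ a⊗b); these choices and the name `sinkhorn_eot` are illustrative assumptions, and the paper's exact normalization may differ.

```python
import numpy as np

def sinkhorn_eot(x, y, eps, n_iter=500):
    """Empirical EOT cost between uniform measures on samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean cost matrix
    K = np.exp(-C / eps)                                     # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)        # rescale to match the row marginal
        v = b / (K.T @ u)      # rescale to match the column marginal
    plan = u[:, None] * K * v[None, :]
    kl = np.sum(plan * np.log(plan / np.outer(a, b)))
    return float(np.sum(plan * C) + eps * kl), plan

rng = np.random.default_rng(0)
x, y = rng.random((50, 2)), rng.random((60, 2))
eot_cost, plan = sinkhorn_eot(x, y, eps=0.5)
```

For very small `eps` the kernel underflows, and a log-domain implementation would be needed instead.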

A Note on Entrywise Consistency for Mixed-data Matrix Completion
混合データ行列補完のためのエントリワイズ一貫性に関するノート

This note studies matrix completion for a partially observed $n$ by $p$ data matrix involving mixed types of variables (e.g., continuous, binary, ordinal). A general family of non-linear factor models is considered, under which the matrix completion problem becomes the estimation of an $n$ by $p$ low-rank matrix ${\mathbf M}$. For existing methods in the literature, estimation consistency is established by showing $\Vert \hat {\mathbf M} - {\mathbf M}^*\Vert_F/\sqrt{np}$, the scaled Frobenius norm of the difference between the estimated and true ${\mathbf M}$ matrices, converges to zero in probability as $n$ and $p$ grow to infinity. However, this notion of consistency does not guarantee the convergence of each individual entry and, thus, may not be sufficient when specific data entries or the worst-case scenario is of interest. To address this issue, we consider the notion of entrywise consistency based on $\Vert \hat {\mathbf M} - {\mathbf M}^* \Vert_{\max}$, the max norm of the estimation error matrix. We propose refinement procedures that turn estimators, which are consistent in the Frobenius norm sense, into entrywise consistent estimators through a one-step refinement. Tight probabilistic error bounds are derived for the proposed estimators. The proposed methods are evaluated by simulation studies and real-data applications for collaborative filtering and large-scale educational assessment.

このノートでは、混合型の変数(連続、バイナリ、順序など)を含む部分的に観測された$n$行$p$列のデータ行列の行列補完について検討します。行列補完問題が$n$行$p$列の低ランク行列${\mathbf M}$の推定になる、非線形因子モデルの一般的なファミリーを検討します。文献の既存の方法では、推定の一貫性は、推定された${\mathbf M}$行列と真の${\mathbf M}$行列の差のスケーリングされたフロベニウスノルム$\Vert \hat {\mathbf M} - {\mathbf M}^*\Vert_F/\sqrt{np}$が、$n$と$p$が無限大に大きくなるにつれて0に確率収束することを示すことによって確立されます。ただし、この一貫性の概念は、個々のエントリの収束を保証するものではないため、特定のデータエントリまたは最悪のシナリオが対象の場合は十分ではない可能性があります。この問題に対処するために、推定誤差行列の最大ノルムである$\Vert \hat {\mathbf M} - {\mathbf M}^* \Vert_{\max}$に基づくエントリワイズ一貫性の概念を検討します。フロベニウスノルムの意味で一貫性のある推定量を、1ステップの改良によってエントリワイズ一貫性のある推定量に変換する改良手順を提案します。提案された推定量に対して、タイトな確率的誤差境界が導出されます。提案された方法は、シミュレーション研究と、協調フィルタリングおよび大規模な教育評価の実データアプリケーションによって評価されます。
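
The gap between the two notions of consistency is easy to see numerically. In the hypothetical example below, an estimate that is wrong in a single entry has a scaled Frobenius error that shrinks as the matrix grows, while its max-norm error stays at 1.

```python
import numpy as np

n, p = 200, 300
M_true = np.zeros((n, p))
M_hat = M_true.copy()
M_hat[0, 0] = 1.0  # a single badly estimated entry

err = M_hat - M_true
scaled_frobenius = np.linalg.norm(err, "fro") / np.sqrt(n * p)  # equals 1 / sqrt(n * p)
max_norm = np.abs(err).max()                                     # largest entrywise error
```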

A Characterization of Multioutput Learnability
マルチ出力学習可能性の特性評価

We consider the problem of learning multioutput function classes in the batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classification and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a similar characterization as in the full-feedback setting.

バッチ設定とオンライン設定でマルチ出力関数クラスを学習する問題について検討します。両方の設定において、関数クラスの各単一出力制約が学習可能である場合にのみ、マルチ出力関数クラスが学習可能であることを示します。これにより、バッチ設定とオンライン設定の両方で、マルチラベル分類とマルチ出力回帰の学習可能性の完全な特性が提供されます。拡張として、バンディットフィードバック設定でのマルチラベル学習可能性も検討し、フルフィードバック設定の場合と同様の特性を示します。

Sample Complexity of Variance-Reduced Distributionally Robust Q-Learning
分散低減分布ロバストQ学習のサンプル複雑性

Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: The distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the $q$-function of an infinite-horizon $\gamma$-discounted robust Markov decision process with Kullback-Leibler ambiguity set to an entry-wise $\epsilon$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines the synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $\tilde O(|\mathbf{S}||\mathbf{A}|(1-\gamma)^{-4}\epsilon^{-2})$, where $\mathbf{S}$ and $\mathbf{A}$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $\delta$, thereby providing new complexity theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.

分布シフト下での動的意思決定は、強化学習の理論と応用において基本的な関心事です。データが収集される環境の分布は、モデルが展開される環境の分布とは異なる場合があります。この論文では、分布シフトにもかかわらずロバストなポリシーを効果的に学習できる、分布ロバストなQ学習とその分散低減版という2つの新しいモデルフリーアルゴリズムを紹介します。これらのアルゴリズムは、Kullback-Leibler曖昧性集合を持つ無限期間の$\gamma$割引ロバストマルコフ決定プロセスの$q$関数を、エントリごとに$\epsilon$の精度で効率的に近似するように設計されています。さらに、分散低減分布ロバストなQ学習は、同期Q学習と分散低減手法を組み合わせてパフォーマンスを向上させます。その結果、このアルゴリズムが$\tilde O(|\mathbf{S}||\mathbf{A}|(1-\gamma)^{-4}\epsilon^{-2})$というミニマックスサンプル複雑度の上限を達成することを示します。ここで、$\mathbf{S}$と$\mathbf{A}$は状態空間と行動空間を表します。これは、曖昧性のサイズ$\delta$に依存しない最初の複雑度結果であり、新しい複雑度理論の洞察を提供します。さらに、一連の数値実験により、理論的発見と分布シフトを処理するアルゴリズムの効率性が確認されています。
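
The worst-case expectation over a Kullback-Leibler ambiguity set of radius $\delta$ has a well-known one-dimensional dual, sup over α > 0 of −α·log E_P[exp(−V/α)] − α·δ, which is the building block of a robust Bellman update. The sketch below evaluates that dual on a grid of α values. It illustrates the robust expectation only and is not the Q-learning algorithm of the paper; the grid and the function name are assumptions.

```python
import numpy as np

def kl_robust_expectation(values, probs, delta, alphas=None):
    """Approximate inf over Q with KL(Q || P) <= delta of E_Q[values] through the dual
    sup_{alpha > 0} { -alpha * log E_P[exp(-values / alpha)] - alpha * delta }.
    The sup is taken over a finite grid, so the result is a lower bound on the true value."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    if alphas is None:
        alphas = np.logspace(-3, 3, 2000)
    v_min = values.min()
    best = -np.inf
    for alpha in alphas:
        # shift by v_min so the exponentials cannot overflow
        log_mgf = np.log(probs @ np.exp(-(values - v_min) / alpha)) - v_min / alpha
        best = max(best, -alpha * log_mgf - alpha * delta)
    return float(best)

# Hypothetical next-state values and nominal transition probabilities
values, probs = [1.0, 2.0, 4.0], [0.2, 0.5, 0.3]
nominal = float(np.dot(values, probs))
robust_small = kl_robust_expectation(values, probs, delta=0.1)
robust_large = kl_robust_expectation(values, probs, delta=0.5)
```

A larger radius lets the adversary shift more mass toward low-value states, so the robust value decreases from the nominal mean toward the minimum value.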

Lower Bounds on the Bayesian Risk via Information Measures
情報測度によるベイズリスクの下限

This paper focuses on parameter estimation and introduces a new method for lower bounding the Bayesian risk. The method allows for the use of virtually any information measure, including Rényi’s $\alpha$, $\varphi$-divergences, and Sibson’s $\alpha$-Mutual Information. The approach considers divergences as functionals of measures and exploits the duality between spaces of measures and spaces of functions. In particular, we show that one can lower bound the risk with any information measure by upper bounding its dual via Markov’s inequality. We are thus able to provide estimator-independent impossibility results thanks to the Data-Processing Inequalities that divergences satisfy. The results are then applied to settings of interest involving both discrete and continuous parameters, including the “Hide-and-Seek” problem, and compared to the state-of-the-art techniques. An important observation is that the behaviour of the lower bound in the number of samples is influenced by the choice of the information measure. We leverage this by introducing a new divergence inspired by the “Hockey-Stick” divergence, which is demonstrated empirically to provide the largest lower bound across all considered settings. If the observations are subject to privatisation, stronger impossibility results can be obtained via Strong Data-Processing Inequalities. The paper also discusses some generalisations and alternative directions.

この論文では、パラメータ推定に焦点を当て、ベイズリスクの下限を設定するための新しい方法を紹介します。この方法では、Rényiの$\alpha$、$\varphi$ダイバージェンス、およびSibsonの$\alpha$相互情報量を含む、事実上あらゆる情報尺度を使用できます。このアプローチでは、ダイバージェンスを測度の汎関数として扱い、測度の空間と関数の空間の双対性を活用します。特に、マルコフの不等式を介してその双対を上から抑えることにより、あらゆる情報尺度でリスクを下から抑えられることを示します。したがって、ダイバージェンスが満たすデータ処理不等式のおかげで、推定量に依存しない不可能性の結果を提供できます。次に、結果を「かくれんぼ」問題を含む離散パラメータと連続パラメータの両方を含む関心のある設定に適用し、最先端の技術と比較します。重要な観察結果は、サンプル数に関する下限の振る舞いが情報尺度の選択によって影響を受けることです。私たちは、「ホッケースティック」ダイバージェンスに着想を得た新しいダイバージェンスを導入することでこれを活用し、この新しいダイバージェンスが検討したすべての設定で最大の下限を与えることを実験的に示します。観測がプライベート化の対象となる場合、強データ処理不等式を介してより強力な不可能性結果を得ることができます。この論文では、いくつかの一般化と代替の方向性についても説明しています。

Bayesian Structural Learning with Parametric Marginals for Count Data: An Application to Microbiota Systems
カウントデータに対するパラメトリック周辺値によるベイジアン構造学習：微生物叢システムへの応用

High dimensional and heterogeneous count data are collected in various applied fields. In this paper, we look closely at high-resolution sequencing data on the microbiome, which have enabled researchers to study the genomes of entire microbial communities. Revealing the underlying interactions between these communities is of vital importance to learn how microbes influence human health. To perform structural learning from multivariate count data such as these, we develop a novel Gaussian copula graphical model with two key elements. Firstly, we employ parametric regression to characterize the marginal distributions. This step is crucial for accommodating the impact of external covariates. Neglecting this adjustment could potentially introduce distortions in the inference of the underlying network of dependences. Secondly, we advance a Bayesian structure learning framework, based on a computationally efficient search algorithm that is suited to high dimensionality. The approach returns simultaneous inference of the marginal effects and of the dependence structure, including graph uncertainty estimates. A simulation study and a real data analysis of microbiome data highlight the applicability of the proposed approach at inferring networks from multivariate count data in general, and its relevance to microbiome analyses in particular. The proposed method is implemented in the R package BDgraph.

さまざまな応用分野で、高次元で異質なカウントデータが収集されています。この論文では、研究者が微生物群全体のゲノムを研究することを可能にした、マイクロバイオームの高解像度シーケンスデータについて詳しく説明します。これらの群間の根本的な相互作用を明らかにすることは、微生物が人間の健康にどのように影響するかを知るために非常に重要です。このような多変量カウントデータから構造学習を実行するために、2つの重要な要素を持つ新しいガウスコピュラグラフィカルモデルを開発しました。まず、パラメトリック回帰を使用して、周辺分布を特徴付けます。この手順は、外部共変量の影響に対応するために重要です。この調整を無視すると、依存関係の根本的なネットワークの推論に歪みが生じる可能性があります。次に、高次元に適した計算効率の高い検索アルゴリズムに基づくベイズ構造学習フレームワークを進めます。このアプローチは、グラフの不確実性の推定を含む、周辺効果と依存関係構造の同時推論を返します。シミュレーション研究とマイクロバイオームデータの実際のデータ分析により、提案されたアプローチが一般的に多変量カウントデータからネットワークを推測する際に適用可能であること、特にマイクロバイオーム分析との関連性が明らかになりました。提案された方法は、RパッケージBDgraphに実装されています。

Transfer Learning with Uncertainty Quantification: Random Effect Calibration of Source to Target (RECaST)
不確かさ定量化による転移学習:ソースからターゲットへのランダム効果キャリブレーション(RECaST)

Transfer learning uses a data model, trained to make predictions or inferences on data from one population, to make reliable predictions or inferences on data from another population. Most existing transfer learning approaches are based on fine-tuning pre-trained neural network models, and fail to provide crucial uncertainty quantification. We develop a statistical framework for model predictions based on transfer learning, called RECaST. The primary mechanism is a Cauchy random effect that recalibrates a source model to a target population; we mathematically and empirically demonstrate the validity of our RECaST approach for transfer learning between linear models, in the sense that prediction sets will achieve their nominal stated coverage, and we numerically illustrate the method’s robustness to asymptotic approximations for nonlinear models. Whereas many existing techniques are built on particular source models, RECaST is agnostic to the choice of source model, and does not require access to source data. For example, our RECaST transfer learning approach can be applied to a continuous or discrete data model with linear or logistic regression, deep neural network architectures, etc. Furthermore, RECaST provides uncertainty quantification for predictions, which is mostly absent in the literature. We examine our method’s performance in a simulation study and in an application to real hospital data.

転移学習では、ある集団のデータに対して予測や推論を行うようにトレーニングされたデータモデルを使用して、別の集団のデータに対して信頼性の高い予測や推論を行います。既存の転移学習アプローチのほとんどは、事前トレーニング済みのニューラルネットワークモデルの微調整に基づいており、重要な不確実性の定量化を提供できません。私たちは、転移学習に基づくモデル予測の統計フレームワークを開発し、RECaSTと名付けました。主なメカニズムは、ソースモデルをターゲット集団に再調整するコーシーランダム効果です。私たちは、予測セットが名目上の規定範囲を達成するという意味で、線形モデル間の転移学習に対するRECaSTアプローチの有効性を数学的かつ経験的に実証し、非線形モデルの漸近近似に対するこの方法の堅牢性を数値的に示します。既存の多くの手法が特定のソースモデルに基づいて構築されているのに対し、RECaSTはソースモデルの選択に依存せず、ソースデータへのアクセスを必要としません。たとえば、当社のRECaST転移学習アプローチは、線形回帰またはロジスティック回帰、ディープニューラルネットワークアーキテクチャなどを使用した連続または離散データモデルに適用できます。さらに、RECaSTは予測の不確実性の定量化を提供しますが、これは文献にはほとんど記載されていません。当社は、シミュレーション研究と実際の病院データへの適用において、当社の方法のパフォーマンスを検証します。

Inference on High-dimensional Single-index Models with Streaming Data
ストリーミングデータを用いた高次元単一インデックスモデルの推論

Traditional statistical methods are faced with new challenges due to streaming data. The major challenge is the rapidly growing volume and velocity of data, which makes storing such huge data sets in memory impossible. The paper presents an online inference framework for regression parameters in high-dimensional semiparametric single-index models with unknown link functions. The proposed online procedure updates only the current data batch and summary statistics of historical data instead of re-accessing the entire raw data set. At the same time, we do not need to estimate the unknown link function, which is a highly challenging task. In addition, a generalized convex loss function is used in the proposed inference procedure. To illustrate the proposed method, we use the Huber loss function and the negative log-likelihood of the logistic regression model. In this study, the asymptotic normality of the proposed online debiased Lasso estimators and the bounds of the proposed online Lasso estimators are investigated. To evaluate the performance of the proposed method, extensive simulation studies have been conducted. We provide applications to Nasdaq stock prices and financial distress data sets.

従来の統計手法は、ストリーミングデータによる新たな課題に直面しています。主な課題は、データの量と速度が急速に増加していることであり、このような巨大なデータセットをメモリに保存することは不可能です。この論文では、未知のリンク関数を持つ高次元セミパラメトリック単一インデックスモデルの回帰パラメーターのオンライン推論フレームワークを紹介します。提案されたオンライン手順では、生データセット全体に再アクセスするのではなく、現在のデータバッチと履歴データの要約統計のみを用いて更新します。同時に、非常に困難なタスクである未知のリンク関数を推定する必要はありません。さらに、提案された推論手順では、一般化された凸損失関数が使用されます。提案された方法を説明するために、Huber損失関数とロジスティック回帰モデルの負の対数尤度を使用します。この研究では、提案されたオンラインのバイアス補正(debiased)Lasso推定量の漸近正規性と、提案されたオンラインLasso推定量の誤差限界を調査します。提案された方法のパフォーマンスを評価するために、広範なシミュレーション研究が実施されました。また、Nasdaq株価および財務難データセットへの適用例を示します。
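The idea of updating from the current batch plus summary statistics of past data, without re-reading the raw history, can be illustrated with a deliberately simple stand-in. The sketch below is not the paper's debiased Lasso procedure: it assumes a one-parameter least-squares fit through the origin, and the class name `StreamingSlope` and the toy batches are hypothetical.

```python
class StreamingSlope:
    """Least-squares slope through the origin, updated batch by batch."""

    def __init__(self):
        self.sum_xx = 0.0  # running sum of x * x over all batches seen so far
        self.sum_xy = 0.0  # running sum of x * y over all batches seen so far

    def update(self, xs, ys):
        # Only the current batch is touched; older raw data is never revisited.
        self.sum_xx += sum(x * x for x in xs)
        self.sum_xy += sum(x * y for x, y in zip(xs, ys))
        return self.estimate()

    def estimate(self):
        # Assumes at least one nonzero x has been seen.
        return self.sum_xy / self.sum_xx


# Hypothetical toy stream arriving in two batches.
batch_1 = ([1.0, 2.0], [2.1, 3.9])
batch_2 = ([3.0, 4.0], [6.2, 7.8])

online = StreamingSlope()
for xs, ys in (batch_1, batch_2):
    slope_online = online.update(xs, ys)

# Reference value computed from the full data set in one pass.
all_x = batch_1[0] + batch_2[0]
all_y = batch_1[1] + batch_2[1]
slope_full = sum(x * y for x, y in zip(all_x, all_y)) / sum(x * x for x in all_x)
```

The two running sums are the only state kept between batches, so memory use does not grow with the length of the stream, and the online estimate matches the full-data one up to floating-point rounding.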

On the Convergence of Projected Alternating Maximization for Equitable and Optimal Transport
公平かつ最適な輸送のための射影交互最大化の収束について

This paper studies the equitable and optimal transport (EOT) problem, which has many applications, such as fair division problems and optimal transport with multiple agents. In the discrete distributions case, the EOT problem can be formulated as a linear program (LP). Since this LP is prohibitively large for general LP solvers, Scetbon et al. (2021) suggest perturbing the problem by adding an entropy regularization. They proposed a projected alternating maximization algorithm (PAM) to solve the dual of the entropy regularized EOT. In this paper, we provide the first convergence analysis of PAM. A novel rounding procedure is proposed to help construct the primal solution for the original EOT problem. We also propose a variant of PAM by incorporating the extrapolation technique that can numerically improve the performance of PAM. Results in this paper may shed light on block coordinate (gradient) descent methods for general optimization problems.

この論文では、公平な分割問題や複数のエージェントによる最適輸送など、多くの用途がある公平かつ最適な輸送(EOT)問題を研究します。離散分布の場合、EOT問題は線形計画(LP)として定式化できます。このLPは一般的なLPソルバーにとって法外に大きいため、(Scetbonら、2021)はエントロピー正則化を追加して問題を摂動することを提案しています。彼らは、エントロピー正則化EOTの双対を解決するために、投影交互最大化アルゴリズム(PAM)を提案しました。この論文では、PAMの最初の収束分析を提供します。元のEOT問題の基本解の構築に役立つ新しい丸め手順が提案されています。また、PAMのパフォーマンスを数値的に向上できる外挿手法を組み込んだPAMのバリアントも提案しています。この論文の結果は、一般的な最適化問題に対するブロック座標(勾配)降下法に光を当てる可能性があります。

ENNS: Variable Selection, Regression, Classification, and Deep Neural Network for High-Dimensional Data
ENNS：高次元データのための変数選択、回帰、分類、ディープニューラルネットワーク

High-dimensional, low-sample-size (HDLSS) data have been attracting people’s attention for a long time. Many studies have proposed different approaches to dealing with this situation, among which variable selection is a significant idea. However, neural networks have been used to model complicated relationships. This paper discusses current variable selection techniques with neural networks. We showed that the stage-wise algorithm with the neural network suffers from some disadvantages, such as that the variables entering the model later may not be consistent. We also proposed an ensemble method to achieve better variable selection and proved that it has a probability tending to zero that a false variable will be selected. Moreover, we discussed further regularization to deal with over-fitting. Simulations and examples of real data are given to support the theory.

高次元、低サンプルサイズ(HDLSS)データは長い間人々の注目を集めてきました。多くの研究がこの状況に対処するためのさまざまなアプローチを提案しており、その中で変数選択は重要なアイデアです。しかし、複雑な関係をモデル化するためにニューラルネットワークが使用されてきました。この論文では、ニューラルネットワークを使用した現在の変数選択手法について説明します。ニューラルネットワークを使用した段階的なアルゴリズムには、モデルに後で入力される変数が一貫していない可能性があるなど、いくつかの欠点があることを示しました。また、より優れた変数選択を実現するためのアンサンブル法を提案し、誤った変数が選択される確率がゼロに近づくことを証明しました。さらに、過剰適合に対処するためのさらなる正則化についても説明しました。理論をサポートするために、シミュレーションと実際のデータの例を示します。

On the Optimality of Gaussian Kernel Based Nonparametric Tests against Smooth Alternatives
滑らかな対立仮説に対するガウスカーネルに基づくノンパラメトリック検定の最適性について

Nonparametric tests via kernel embedding of distributions have witnessed a great deal of practical success in recent years. However, statistical properties of these tests are largely unknown beyond consistency against a fixed alternative. To fill in this void, we study here the asymptotic properties of goodness-of-fit, homogeneity and independence tests using Gaussian kernels, arguably the most popular and successful among such tests. Our results provide theoretical justifications for this common practice by showing that tests using a Gaussian kernel with an appropriately chosen scaling parameter are minimax optimal against smooth alternatives in all three settings. In addition, our analysis pinpoints the importance of choosing a diverging scaling parameter when using Gaussian kernels and suggests a data-driven choice of the scaling parameter that yields tests optimal, up to an iterated logarithmic factor, over a wide range of smooth alternatives. Numerical experiments are also presented to further demonstrate the practical merits of the methodology.

分布のカーネル埋め込みによるノンパラメトリック検定は、近年、多くの実用的成功を収めています。しかし、これらの検定の統計的特性は、固定された代替検定に対する一貫性以外にはほとんど知られていません。この空白を埋めるために、ここでは、おそらく最も一般的で成功しているガウスカーネルを使用した適合度、同次性、独立性検定の漸近特性について研究します。私たちの結果は、適切に選択されたスケーリングパラメータを持つガウスカーネルを使用した検定が、3つの設定すべてにおいて滑らかな代替検定に対してミニマックス最適であることを示すことにより、この一般的な方法の理論的正当性を提供します。さらに、私たちの分析は、ガウスカーネルを使用する際に発散するスケーリングパラメータを選択することの重要性も指摘し、反復対数係数まで、広範囲の滑らかな代替検定に対して最適な検定をもたらすスケーリングパラメータをデータに基づいて選択することを提案しています。数値実験も提示され、この方法論の実用的メリットをさらに実証しています。

Open-Source Conversational AI with SpeechBrain 1.0
SpeechBrain 1.0によるオープンソースの会話型AI

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete recipes of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

SpeechBrainは、PyTorchをベースとしたオープンソースの会話型AIツールキットで、音声認識、音声強調、話者認識、テキスト読み上げなどの音声処理タスクに特に重点を置いています。事前トレーニング済みのモデルと、それらのトレーニングに必要なコードとアルゴリズムの完全なレシピの両方を公開することで、透明性と再現性を促進します。この論文では、ツールキットの進化における重要なマイルストーンであるSpeechBrain 1.0を紹介します。現在、このツールキットには音声、オーディオ、言語処理タスクのレシピが200個以上あり、Hugging Faceで100個以上のモデルが利用可能です。SpeechBrain 1.0では、多様な学習様式、大規模言語モデル(LLM)の統合、高度なデコード戦略をサポートする新しいテクノロジーと、新しいモデル、タスク、様式が導入されています。また、新しいベンチマークリポジトリも含まれており、研究者はさまざまなタスクにわたってモデルを評価するための統合プラットフォームを利用できます。

Triple Component Matrix Factorization: Untangling Global, Local, and Noisy Components
三重成分行列の因数分解: グローバル、ローカル、ノイズの多い成分のもつれを解きほぐす

In this work, we study the problem of common and unique feature extraction from noisy data. When we have $N$ observation matrices from $N$ different and associated sources corrupted by sparse and potentially gross noise, can we recover the common and unique components from these noisy observations? This is a challenging task as the number of parameters to estimate is approximately thrice the number of observations. Despite the difficulty, we propose an intuitive alternating minimization algorithm called triple component matrix factorization (TCMF) to recover the three components exactly. TCMF is distinguished from existing works in literature thanks to two salient features. First, TCMF is a principled method to separate the three components given noisy observations provably. Second, the bulk of the computation in TCMF can be distributed. On the technical side, we formulate the problem as a constrained nonconvex nonsmooth optimization problem. Despite the intricate nature of the problem, we provide a Taylor series characterization of its solution by solving the corresponding Karush–Kuhn–Tucker conditions. Using this characterization, we can show that the alternating minimization algorithm makes significant progress at each iteration and converges into the ground truth at a linear rate. Numerical experiments in video segmentation and anomaly detection highlight the superior feature extraction abilities of TCMF.

本研究では、ノイズの多いデータから共通および固有の特徴を抽出する問題を研究します。スパースで潜在的に大きなノイズによって破損した、N個の異なる関連ソースからのN個の観測行列がある場合、これらのノイズの多い観測から共通および固有のコンポーネントを復元できますか?推定するパラメーターの数は観測数の約3倍であるため、これは困難な作業です。困難にもかかわらず、3つのコンポーネントを正確に復元するために、三重成分行列因子分解(TCMF)と呼ばれる直感的な交互最小化アルゴリズムを提案します。TCMFは、2つの顕著な特徴により、文献の既存の研究とは一線を画しています。まず、TCMFは、ノイズの多い観測が与えられた場合に3つのコンポーネントを証明可能に分離する原理的な方法です。次に、TCMFの計算の大部分は分散できます。技術面では、問題を制約付き非凸非平滑最適化問題として定式化します。問題の複雑さにもかかわらず、対応するKarush-Kuhn-Tucker条件を解くことで、そのソリューションのテイラー級数特性を提供します。この特性を使用して、交互最小化アルゴリズムが各反復で大幅に進歩し、線形速度でグラウンドトゥルースに収束することを示すことができます。ビデオセグメンテーションと異常検出の数値実験により、TCMFの優れた特徴抽出機能が明らかになりました。

Generalization on the Unseen, Logic Reasoning and Degree Curriculum
未知データへの汎化、論理推論、次数カリキュラム

This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an ‘extrapolating’ or ‘reasoning’ learner. We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements. These findings lead to two implications: (1) we provide an explanation to the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can be potentially corrected when undesirable.

この論文では、分布外一般化の強力な例である、見えないものの一般化(GOTU)設定に焦点を当てた論理(ブール)関数の学習について検討します。これは、特定の推論タスク(算術/論理など)におけるデータの豊富な組み合わせ特性により、代表的なデータのサンプリングが困難になり、GOTUでの学習が成功すると、「外挿」または「推論」学習者の最初のビネットが得られるという事実に基づいています。(S)GDによってトレーニングされたさまざまなネットワークアーキテクチャがGOTUでどのように機能するかを調査し、スパース関数と、Transformer、ランダムフィーチャモデル、線形ネットワークのインスタンスを含むネットワークモデルのクラスについて、見えないものの最小次数補間が学習されるという理論的および実験的証拠を示します。より具体的には、これは、高次基底要素に最小のフーリエ質量を持つトレーニングデータの補間を意味します。これらの発見は、2つの意味合いを導きます。(1)ブール関数の長さの一般化問題に対する説明を提供します(例: Anilら2022)。(2)サポートを増分することで単項式をより効率的に学習する、Degree-Curriculumと呼ばれるカリキュラム学習アルゴリズムを紹介します。最後に、最小次数バイアスが依然として発生したり消えたりする可能性がある他のモデルや非スパース領域への拡張、および望ましくない場合にそれを潜在的に修正する方法を説明します。

Goal-Space Planning with Subgoal Models
サブゴールモデルによるゴール・スペース・プランニング

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a given set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.

この論文では、バックグラウンドプランニングを使用したモデルベース強化学習への新しいアプローチ、つまりDynaアーキテクチャに似た(近似)動的プログラミング更新とモデルフリー更新の混合について調査します。学習済みモデルを使用したバックグラウンドプランニングは、前者の方がメモリと計算量が大幅に多いにもかかわらず、Double DQNなどのモデルフリーの代替手段よりも劣ることがよくあります。根本的な問題は、学習済みモデルが不正確になる可能性があり、特に多くのステップを反復する場合に無効な状態を生成することが多いことです。この論文では、バックグラウンドプランニングを特定の(抽象的な)サブゴールのセットに制限し、ローカルでサブゴール条件付きモデルのみを学習することで、この制限を回避します。この目標空間プランニング(GSP)アプローチは、計算効率が高く、自然な形で時間的抽象化が組み込まれているため、長期的なプランニングが高速化され、遷移ダイナミクスの学習が完全に回避されます。GSPアルゴリズムは、さまざまな基本学習者がさまざまなドメインで大幅に高速に学習できるように、抽象空間から値を伝播できることを示します。

Homeomorphic Projection to Ensure Neural-Network Solution Feasibility for Constrained Optimization
制約付き最適化のためのニューラルネットワーク解の実現可能性を確保するためのホメオモルフィック射影

There has been growing interest in employing neural networks (NNs) to directly solve constrained optimization problems with low run-time complexity. However, it is non-trivial to ensure NN solutions strictly satisfy problem constraints due to inherent NN prediction errors. Existing feasibility-ensuring methods are either computationally expensive or lack performance guarantee. In this paper, we propose Homeomorphic Projection as a low-complexity scheme to guarantee NN solution feasibility for optimization over a general set homeomorphic to a unit ball, covering all compact convex sets and certain classes of non-convex sets. The idea is to (i) learn a minimum distortion homeomorphic mapping between the constraint set and a unit ball using a bi-Lipschitz invertible NN (INN), and then (ii) perform a simple bisection operation concerning the unit ball such that the INN-mapped final solution is feasible with respect to the constraint set with minor distortion-induced optimality loss. We prove the feasibility guarantee and bounded optimality loss under mild conditions. Simulation results, including those for non-convex AC-OPF problems in power grid operation, show that homeomorphic projection outperforms existing methods in solution feasibility and run-time complexity while achieving similar optimality loss.

ニューラルネットワーク(NN)を用いて、実行時間の複雑さが低い制約付き最適化問題を直接解くことへの関心が高まっています。しかし、NNの予測誤差が内在するため、NNソリューションが問題の制約を厳密に満たすことを保証するのは簡単ではありません。既存の実行可能性保証方法は、計算コストが高いか、パフォーマンスが保証されていません。本稿では、すべてのコンパクトな凸集合と特定のクラスの非凸集合を網羅し、単位球に同相な一般集合の最適化に対するNNソリューションの実行可能性を保証する低複雑さの方式として、同相射影を提案します。アイデアは、(i)双リプシッツ可逆NN (INN)を使用して制約集合と単位球の間の歪みが最小の同相マッピングを学習し、次に(ii)単位球に関する単純な二分演算を実行して、INNマッピングされた最終ソリューションが制約集合に関して実行可能であり、歪みによる最適性損失が小さいというものです。軽度の条件下で、実行可能性保証と制限された最適性損失を証明します。電力網運用における非凸AC-OPF問題を含むシミュレーション結果は、同相射影が、同様の最適性損失を達成しながら、ソリューションの実現可能性と実行時の複雑さにおいて既存の方法よりも優れていることを示しています。
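The bisection step (ii) can be sketched under strong simplifying assumptions. Here the learned invertible NN is replaced by a hand-written, exactly known map between the unit ball and an elliptical constraint set; the functions `to_ball`, `from_ball`, `is_feasible`, and `bisection_repair` are hypothetical names for illustration, not the paper's implementation.

```python
def is_feasible(x):
    # Assumed constraint set: the ellipse x1^2 / 4 + x2^2 <= 1.
    return x[0] ** 2 / 4.0 + x[1] ** 2 <= 1.0


def to_ball(x):
    # Stand-in for the inverse of the learned homeomorphism (constraint set -> ball).
    return (x[0] / 2.0, x[1])


def from_ball(z):
    # Stand-in for the learned homeomorphism (ball -> constraint set).
    return (2.0 * z[0], z[1])


def bisection_repair(x, steps=40):
    """Shrink an infeasible point toward the ball's center until it becomes feasible."""
    if is_feasible(x):
        return x
    z = to_ball(x)
    lo, hi = 0.0, 1.0  # scale 0 maps to the (feasible) center, scale 1 maps back to x
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if is_feasible(from_ball((mid * z[0], mid * z[1]))):
            lo = mid  # feasible: try to stay closer to the original prediction
        else:
            hi = mid  # infeasible: shrink further toward the center
    return from_ball((lo * z[0], lo * z[1]))


# Hypothetical infeasible network output.
repaired = bisection_repair((4.0, 0.0))
```

Because `lo` only ever takes values that were checked to be feasible, the returned point satisfies the constraint, and each iteration halves the gap to the boundary along the ray. With an exact map the bisection is not strictly needed; it matters when the mapping is only approximately learned, as in the setting the abstract describes.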

Label Noise Robustness of Conformal Prediction
共形予測のラベルノイズロバスト性

We study the robustness of conformal prediction, a powerful tool for uncertainty quantification, to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. We further extend our theory and formulate the requirements for correctly controlling a general loss function, such as the false negative proportion, with noisy labels. Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other adversarial cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure achieving the correct risk of the ground truth labels without score or data regularity.

我々は、不確実性定量化の強力なツールである共形予測のラベルノイズに対する堅牢性を研究します。我々の分析は、回帰問題と分類問題の両方に取り組んでおり、観測されていないノイズのない真のラベルを正しくカバーする不確実性セットをいつどのように構築できるかを特徴付けます。我々はさらに理論を拡張し、ノイズのあるラベルで偽陰性比率などの一般的な損失関数を正しく制御するための要件を定式化します。我々の理論と実験は、ノイズが分散的で変動性を高めるときはいつでも、ノイズのあるラベルでの共形予測とリスク制御技術は、クリーンな真のラベルに対して保守的なリスクを達成することを示唆しています。他の敵対的なケースでは、スコアやデータの規則性なしに真のラベルに対する正しいリスクを確実に達成するために、共形予測アルゴリズムで制限されたサイズのノイズを補正することもできます。
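For readers unfamiliar with the baseline being analyzed, the following is a minimal sketch of standard split conformal prediction for regression, calibrated on observed (possibly noisy) labels. It does not include the paper's noise correction; the calibration pairs and the function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this miscoverage level
    return sorted(scores)[rank - 1]


def prediction_interval(point_prediction, q_hat):
    # Symmetric interval around the model's point prediction.
    return (point_prediction - q_hat, point_prediction + q_hat)


# Hypothetical calibration pairs: (model prediction, observed label).
calibration = [(1.0, 1.2), (2.0, 1.7), (3.0, 3.1), (4.0, 4.6), (5.0, 4.9),
               (6.0, 6.3), (7.0, 6.8), (8.0, 8.4), (9.0, 9.1)]

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(label - pred) for pred, label in calibration]
q_hat = conformal_quantile(scores, alpha=0.25)
interval = prediction_interval(5.5, q_hat)
```

If the observed labels are noisier than the clean ones, the residual scores tend to be larger, so `q_hat` and the resulting intervals widen; this is the intuition behind the conservative behavior under dispersive noise mentioned in the abstract.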

PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium
PAPAL：混合ナッシュ均衡のための証明可能な粒子ベースのプライマル・デュアル・アルゴリズム

We consider the non-convex non-concave objective function in two-player zero-sum continuous games. The existence of pure Nash equilibrium requires stringent conditions, posing a major challenge for this problem. To circumvent this difficulty, we examine the problem of identifying a mixed Nash equilibrium, where strategies are randomized and characterized by probability distributions over continuous domains. To this end, we propose PArticle-based Primal-dual ALgorithm (PAPAL) tailored for a weakly entropy-regularized min-max optimization over probability distributions. This algorithm employs the stochastic movements of particles to represent the updates of random strategies for the $\epsilon$-mixed Nash equilibrium. We offer a comprehensive convergence analysis of the proposed algorithm, demonstrating its effectiveness. In contrast to prior research that attempted to update particle importance without movements, PAPAL is the first implementable particle-based algorithm accompanied by non-asymptotic quantitative convergence results, running time, and sample complexity guarantees. Our framework contributes novel insights into the particle-based algorithms for continuous min-max optimization in the general non-convex non-concave setting.

2人のプレイヤーによるゼロ和連続ゲームにおける非凸非凹目的関数について考察します。純粋なナッシュ均衡が存在するには厳しい条件が必要であり、この問題の大きな課題となっています。この困難を回避するために、連続領域上の確率分布によって戦略がランダム化され特徴付けられる混合ナッシュ均衡を特定する問題を検討します。この目的のために、確率分布上の弱エントロピー正規化最小最大最適化に合わせて調整されたPArticleベースのPrimal-dual ALgorithm (PAPAL)を提案します。このアルゴリズムは、粒子の確率的移動を使用して、$\epsilon$混合ナッシュ均衡のランダム戦略の更新を表します。提案アルゴリズムの包括的な収束分析を提供し、その有効性を実証します。移動なしで粒子の重要性を更新しようとした以前の研究とは対照的に、PAPALは、非漸近的な定量的収束結果、実行時間、およびサンプルの複雑さの保証を伴う、実装可能な最初の粒子ベースのアルゴリズムです。私たちのフレームワークは、一般的な非凸非凹設定における連続最小最大最適化のための粒子ベースのアルゴリズムに新たな洞察をもたらします。

Geometric Learning with Positively Decomposable Kernels
正に分解可能なカーネルによる幾何学的学習

Kernel methods are powerful tools in machine learning. Classical kernel methods are based on positive definite kernels, which enable learning in reproducing kernel Hilbert spaces (RKHS). For non-Euclidean data spaces, positive definite kernels are difficult to come by. In this case, we propose the use of reproducing kernel Krein space (RKKS) based methods, which require only kernels that admit a positive decomposition. We show that one does not need to access this decomposition to learn in RKKS. We then investigate the conditions under which a kernel is positively decomposable. We show that invariant kernels admit a positive decomposition on homogeneous spaces under tractable regularity assumptions. This makes them much easier to construct than positive definite kernels, providing a route for learning with kernels for non-Euclidean data. By the same token, this provides theoretical foundations for RKKS-based methods in general.

カーネル法は機械学習の強力なツールです。従来のカーネル法は正定値カーネルに基づいており、再生カーネルヒルベルト空間(RKHS)での学習を可能にします。非ユークリッドデータ空間の場合、正定値カーネルは入手困難です。この場合、正の分解を許容するカーネルのみを必要とする再生カーネルクライン空間(RKKS)ベースの方法の使用を提案します。RKKSでの学習にはこの分解にアクセスする必要がないことを示します。次に、カーネルが正に分解可能な条件を調査します。扱いやすい正則性仮定の下で、不変カーネルが同次空間上で正の分解を許容することを示します。これにより、不変カーネルは正定値カーネルよりもはるかに簡単に構築でき、非ユークリッドデータのカーネルで学習するためのルートが提供されます。同様に、これはRKKSベースの方法全般の理論的基礎を提供します。

Mentored Learning: Improving Generalization and Convergence of Student Learner
メンター学習：学生学習者の一般化と収束の改善

Student learners typically engage in an iterative process of actively updating their hypotheses, as in active learning. While this behavior can be advantageous, there is an inherent risk of introducing mistakes through incremental updates, including weak initialization and inaccurate or insignificant history states, resulting in expensive convergence cost. In this work, rather than solely monitoring the update of the learner’s status, we propose monitoring the disagreement w.r.t. $\mathcal{F}^\mathcal{T}(\cdot)$ between the learner and teacher, and call this new paradigm “Mentored Learning”, which consists of ‘how to teach’ and ‘how to learn’. By actively incorporating feedback that deviates from the learner’s current hypotheses, convergence becomes much easier to analyze without strict assumptions on the learner’s historical status, which in turn yields tighter generalization bounds on error and label complexity. Formally, we introduce an approximately optimal teaching hypothesis, $h^\mathcal{T}$, incorporating a tighter slack term $\left(1+\mathcal{F}^{\mathcal{T}}(\widehat{h}_t)\right)\Delta_t$ to replace the typical $2\Delta_t$ used in hypothesis pruning. Theoretically, we demonstrate that, guided by this teaching hypothesis, the learner can converge to tighter generalization bounds on error and label complexity compared to non-educated learners who lack guidance from a teacher: 1) the generalization error upper bound can be reduced from $R(h^*)+4\Delta_{T-1}$ to approximately $R(h^{\mathcal{T}})+2\Delta_{T-1}$, and 2) the label complexity upper bound can be decreased from $4 \theta\left(TR(h^{*})+2O(\sqrt{T})\right)$ to approximately $2\theta\left(2TR(h^{\mathcal{T}})+3 O(\sqrt{T})\right)$. To adhere strictly to our assumption, self-improvement of teaching is proposed when $h^\mathcal{T}$ loosely approximates $h^*$. In the context of learning, we further consider two teaching scenarios: instructing a white-box and black-box learner. Experiments validate this teaching concept and demonstrate superior generalization performance compared to fundamental active learning strategies, such as IWAL, IWAL-D, etc.

学生学習者は通常、能動学習のように、仮説を積極的に更新する反復プロセスに従事します。この動作は有利になる可能性がありますが、弱い初期化、不正確または重要でない履歴状態などの増分更新を通じて間違いが導入され、高価な収束コストが発生するという固有のリスクがあります。この研究では、学習者のステータスの更新を単に監視するのではなく、学習者と教師の間の$\mathcal{F}^\mathcal{T}(\cdot)$に関する不一致を監視することを提案し、この新しいパラダイムを「指導方法」と「学習方法」で構成される「指導学習」と呼びます。学習者の現在の仮説から逸脱するフィードバックを積極的に取り入れることで、学習者の履歴ステータスに関する厳密な仮定なしに収束を分析し、エラーとラベルの複雑さに関するより厳密な一般化境界を導き出すことがはるかに簡単になります。正式には、仮説の刈り込みで使用される典型的な$2\Delta_t$の代わりに、よりタイトなスラック項$\left(1+\mathcal{F}^{\mathcal{T}}(\widehat{h}_t)\right)\Delta_t$を組み込んだ、ほぼ最適な教育仮説$h^\mathcal{T}$を導入します。理論的には、この教授仮説に導かれて、学習者は教師からの指導を受けていない非教育学習者と比較して、エラーとラベル複雑性に関するより厳しい一般化境界に収束できることを示しています。1)一般化エラーの上限は、$R(h^*)+4\Delta_{T-1}$から約$R(h^{\mathcal{T}})+2\Delta_{T-1}$に減少でき、2)ラベル複雑性の上限は、$4 \theta\left(TR(h^{*})+2O(\sqrt{T})\right)$から約$2\theta\left(2TR(h^{\mathcal{T}})+3 O(\sqrt{T})\right)$に減少できます。私たちの仮定に厳密に従うために、$h^\mathcal{T}$が$h^*$に緩く近似する場合、教授の自己改善が提案されます。学習の文脈では、ホワイトボックス学習者とブラックボックス学習者を指導するという2つの教育シナリオをさらに検討します。実験によりこの教育コンセプトが検証され、IWAL、IWAL-Dなどの基本的な能動学習戦略と比較して優れた一般化パフォーマンスが実証されました。

Robust Principal Component Analysis using Density Power Divergence
密度パワーダイバージェンスを用いたロバストな主成分分析

Principal component analysis (PCA) is a widely employed statistical tool used primarily for dimensionality reduction. However, it is known to be adversely affected by the presence of outlying observations in the sample, which is quite common. Robust PCA methods using M-estimators have theoretical benefits, but their robustness drops substantially for high dimensional data. On the other end of the spectrum, robust PCA algorithms solving principal component pursuit or similar optimization problems have high breakdown, but lack theoretical richness and demand high computational power compared to the M-estimators. We introduce a novel robust PCA estimator based on the minimum density power divergence estimator. This combines the theoretical strength of the M-estimators and the minimum divergence estimators with a high breakdown guarantee regardless of data dimension. We present a computationally efficient algorithm for this estimate. Our theoretical findings are supported by extensive simulations and comparisons with existing robust PCA methods. We also showcase the proposed algorithm’s applicability on two benchmark data sets and a credit card transactions data set for fraud detection.

主成分分析(PCA)は、主に次元削減に使用される統計ツールとして広く採用されています。しかし、サンプル内に外れ値が存在すると、それが悪影響を及ぼすことが知られています。これは非常に一般的です。M推定量を使用する堅牢なPCA法には理論的な利点がありますが、高次元データでは堅牢性が大幅に低下します。その一方で、主成分追求や同様の最適化問題を解決する堅牢なPCAアルゴリズムは、ブレークダウンが高いものの、理論的な豊かさに欠け、M推定量に比べて高い計算能力を必要とします。私たちは、最小密度べき乗ダイバージェンス推定量に基づく新しい堅牢なPCA推定量を紹介します。これは、M推定量と最小ダイバージェンス推定量の理論的な強みを、データ次元に関係なく高いブレークダウン保証と組み合わせたものです。私たちは、この推定のための計算効率の高いアルゴリズムを紹介します。私たちの理論的発見は、広範なシミュレーションと既存の堅牢なPCA法との比較によって裏付けられています。また、提案されたアルゴリズムが、詐欺検出のための2つのベンチマークデータセットとクレジットカード取引データセットに適用可能であることも示します。

Graphical Dirichlet Process for Clustering Non-Exchangeable Grouped Data
交換不可能なグループ化されたデータをクラスタリングするためのグラフィカルなディリクレプロセス

We consider the problem of clustering grouped data with possibly non-exchangeable groups whose dependencies can be characterized by a known directed acyclic graph. To allow the sharing of clusters among the non-exchangeable groups, we propose a Bayesian nonparametric approach, termed graphical Dirichlet process, that jointly models the dependent group-specific random measures by assuming each random measure to be distributed as a Dirichlet process whose concentration parameter and base probability measure depend on those of its parent groups. The resulting joint stochastic process respects the Markov property of the directed acyclic graph that links the groups. We characterize the graphical Dirichlet process using a novel hypergraph representation as well as the stick-breaking representation, the restaurant-type representation, and the representation as a limit of a finite mixture model. We develop an efficient posterior inference algorithm and illustrate our model with simulations and a real grouped single-cell data set.

我々は、依存関係が既知の有向非巡回グラフによって特徴付けられる、おそらく交換不可能なグループを持つグループ化されたデータのクラスタリングの問題を検討します。交換不可能なグループ間でクラスターを共有できるようにするために、我々はグラフィカル・ディリクレ過程と呼ばれるベイジアン・ノンパラメトリック手法を提案します。これは、各ランダム測度が、集中パラメータと基底確率測度が親グループのそれらに依存するディリクレ過程として分布していると仮定することにより、従属するグループ固有のランダム測度を同時にモデル化します。結果として得られる同時確率過程は、グループをリンクする有向非巡回グラフのマルコフ性に従います。我々は、棒折り表現、レストラン型表現、有限混合モデルの極限としての表現に加えて、新しいハイパーグラフ表現を使用してグラフィカル・ディリクレ過程を特徴付けます。我々は効率的な事後推論アルゴリズムを開発し、シミュレーションと実際のグループ化された単一細胞データセットでモデルを説明します。
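The stick-breaking representation mentioned above can be sketched for an ordinary (single-group) Dirichlet process. This is the textbook construction, truncated to a finite number of atoms, and not the graphical version in which a group's concentration parameter and base measure depend on its parent groups; the function name and parameter values are hypothetical.

```python
import random


def stick_breaking_weights(concentration, num_atoms, rng):
    """Draw the first `num_atoms` mixture weights of a Dirichlet process.

    Each step breaks off a Beta(1, concentration) fraction of the stick
    that is still left; the broken-off piece is the next cluster weight.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any atom
    for _ in range(num_atoms):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights, leftover = stick_breaking_weights(concentration=2.0, num_atoms=10, rng=rng)
```

A smaller concentration parameter puts most of the mass on the first few weights (few large clusters), while a larger one spreads it over many atoms; `leftover` is the mass ignored by the truncation.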

Stability and L2-penalty in Model Averaging
モデル平均化における安定性とL2ペナルティ

Model averaging, which integrates available information by averaging over potential models, has received much attention in the past two decades. Although various model averaging methods have been developed, there is little literature on the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. Thus, we define the stability, asymptotic empirical risk minimization, generalization and consistency of model averaging, and study the relationship among them. Similar to the existing results in literature, we find that stability can ensure that the model averaging estimator has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose an $L_2$-penalty model averaging method without limiting model weights, and prove that it has stability and consistency. In order to overcome selection uncertainty of the $L_2$-penalty parameter, we use cross-validation to select a candidate set of $L_2$-penalty parameters, and then perform a weighted average of the estimators of model weights based on cross-validation errors. We demonstrate the usefulness of the proposed method with a Monte Carlo simulation and application to a prediction task on the wage1 dataset.

モデル平均化は、過去20年間で多くの注目を集めており、これは、潜在的なモデルを平均化することによって利用可能な情報を統合するものです。さまざまなモデル平均化方法が開発されていますが、安定性の観点からモデル平均化の理論的特性に関する文献はほとんどなく、これらの方法の大部分はモデルの重みを単体に制限しています。この論文の目的は、統計学習理論の安定性をモデル平均化に導入することです。したがって、モデル平均化の安定性、漸近的な経験的リスク最小化、一般化、一貫性を定義し、それらの関係を調べます。文献の既存の結果と同様に、安定性により、モデル平均化推定量が適切な条件下で優れた一般化パフォーマンスと一貫性を持つことを保証できることがわかります。一貫性とは、モデル平均化推定量が平均二乗予測誤差を漸近的に最小化できることを意味します。また、モデルの重みを制限しない$L_2$ペナルティモデル平均化方法を提案し、それが安定性と一貫性を持つことを証明します。$L_2$-ペナルティパラメータの選択不確実性を克服するために、クロスバリデーションを使用して$L_2$-ペナルティパラメータの候補セットを選択し、クロスバリデーションエラーに基づいてモデル重みの推定値の加重平均を実行します。モンテカルロシミュレーションとwage1データセットの予測タスクへの適用により、提案手法の有用性を実証します。
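To make "an $L_2$ penalty on model weights without a simplex constraint" concrete, here is a ridge-style sketch for two candidate models. It assumes a squared-error criterion plus `penalty` times the squared norm of the weights, which may differ in detail from the paper's criterion; the predictions and the function name are hypothetical.

```python
def ridge_averaging_weights(pred_a, pred_b, y, penalty):
    """Solve (P^T P + penalty * I) w = P^T y for two candidate models.

    P has the two models' predictions as columns. The weights are not
    forced to be nonnegative or to sum to one.
    """
    aa = sum(a * a for a in pred_a) + penalty
    bb = sum(b * b for b in pred_b) + penalty
    ab = sum(a * b for a, b in zip(pred_a, pred_b))
    ay = sum(a * t for a, t in zip(pred_a, y))
    by = sum(b * t for b, t in zip(pred_b, y))
    det = aa * bb - ab * ab  # assumed nonzero (predictions not collinear, or penalty > 0)
    return ((bb * ay - ab * by) / det, (aa * by - ab * ay) / det)


# Hypothetical predictions of two candidate models and observed responses.
pred_a = [1.0, 2.0, 3.0, 4.0]
pred_b = [1.0, 0.0, 1.0, 0.0]
y = [1.0, 4.0, 5.0, 8.0]  # equals 2 * pred_a - pred_b exactly

w_unpenalized = ridge_averaging_weights(pred_a, pred_b, y, penalty=0.0)
w_penalized = ridge_averaging_weights(pred_a, pred_b, y, penalty=1.0)
```

In this toy example the unpenalized weights are 2 and -1, which lie outside the simplex, and a positive penalty shrinks them toward zero. In practice the penalty value would be chosen or averaged over using cross-validation, as the abstract describes.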

Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK
大きなバイアスによって誘発されるスパース活性化を持つニューラルネットワーク:バイアス一般化NTKによるより厳密な解析

We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime, where the networks’ biases are initialized to some constant rather than zero. We prove that under such initialization, the neural network will have sparse activation throughout the entire training process, which enables fast training procedures via some sophisticated computational methods. With such initialization, we show that the neural networks possess a different limiting kernel which we call bias-generalized NTK, and we study various properties of the neural networks with this new kernel. We first characterize the gradient descent dynamics. In particular, we show that the network in this case can achieve as fast convergence as the dense network, as opposed to the previous work suggesting that the sparse networks converge slower. In addition, our result improves the previous required width to ensure convergence. Secondly, we study the networks’ generalization: we show a width-sparsity dependence, which yields a sparsity-dependent Rademacher complexity and generalization bound. To our knowledge, this is the first sparsity-dependent generalization result via Rademacher complexity. Lastly, we study the smallest eigenvalue of this new kernel. We identify a data-dependent region where we can derive a much sharper lower bound on the NTK’s smallest eigenvalue than the worst-case bound previously known. This can lead to improvement in the generalization bound.

我々は、ネットワークのバイアスがゼロではなく何らかの定数に初期化されるニューラルタンジェントカーネル(NTK)領域で、1つの隠れ層を持つReLUネットワークのトレーニングについて研究します。このような初期化では、ニューラルネットワークはトレーニングプロセス全体を通じてスパースアクティベーションを持ち、洗練された計算方法によって高速トレーニング手順が可能になることを証明します。このような初期化により、ニューラルネットワークがバイアス一般化NTKと呼ぶ別の制限カーネルを持つことを示し、この新しいカーネルを持つニューラルネットワークのさまざまな特性について研究します。まず、勾配降下ダイナミクスを特徴付けます。特に、この場合のネットワークは、スパースネットワークの収束が遅いと示唆した以前の研究とは対照的に、密なネットワークと同じくらい高速な収束を達成できることを示します。さらに、我々の結果は、収束を確実にするために以前必要だった幅を改善します。次に、ネットワークの一般化について研究します。幅とスパース性の依存関係を示し、スパース性に依存するRademacher複雑度と一般化境界をもたらします。私たちの知る限り、これはRademacher複雑性による最初のスパース性依存の一般化結果です。最後に、この新しいカーネルの最小の固有値を調べます。これまでに知られている最悪のケースの境界よりも、NTKの最小の固有値のより明確な下限を導出できるデータ依存領域を特定します。これにより、一般化境界の改善につながる可能性があります。

Optimal Weighted Random Forests
最適な重み付けランダムフォレスト

The random forest (RF) algorithm has become a very popular prediction method for its great flexibility and promising accuracy. In RF, it is conventional to put equal weights on all the base learners (trees) to aggregate their predictions. However, the predictive performance of different trees within the forest can vary significantly due to the randomization of the embedded bootstrap sampling and feature selection. In this paper, we focus on RF for regression and propose two optimal weighting algorithms, namely the 1 Step Optimal Weighted RF (1step-WRF$_\mathrm{opt}$) and 2 Steps Optimal Weighted RF (2steps-WRF$_\mathrm{opt}$), that combine the base learners through the weights determined by weight choice criteria. Under some regularity conditions, we show that these algorithms are asymptotically optimal in the sense that the resulting squared loss and risk are asymptotically identical to those of the infeasible but best possible weighted RF. Numerical studies conducted on real-world data sets and semi-synthetic data sets indicate that these algorithms outperform the equal-weight forest and two other weighted RFs proposed in the existing literature in most cases.

ランダムフォレスト(RF)アルゴリズムは、その優れた柔軟性と期待できる精度から、非常に人気の高い予測方法になっています。RFでは、すべてのベース学習者(ツリー)に等しい重みを付けて予測を集約するのが一般的です。ただし、フォレスト内の異なるツリーの予測パフォーマンスは、埋め込まれたブートストラップサンプリングと特徴選択のランダム化により、大幅に異なる場合があります。この論文では、回帰のRFに焦点を当て、重み選択基準によって決定される重みを介してベース学習者を組み合わせる1ステップ最適重み付けRF (1step-WRF$_\mathrm{opt}$)と2ステップ最適重み付けRF (2steps-WRF$_\mathrm{opt}$)という2つの最適重み付けアルゴリズムを提案します。いくつかの規則性条件下では、結果として得られる損失とリスクの二乗が、実行不可能ではあるが可能な限り最良の重み付けRFのものと漸近的に同一であるという意味で、これらのアルゴリズムが漸近的に最適であることを示します。現実世界のデータセットと半合成データセットに対して実施された数値的研究によると、これらのアルゴリズムはほとんどの場合、等重みフォレストや既存の文献で提案されている他の2つの重み付きRFよりも優れていることが示されています。

Efficient Active Manifold Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization
加速反復的に重み付けされた核ノルムの最小化による効率的なアクティブ多様体同定

This paper considers the problem of minimizing the sum of a smooth function and the Schatten-$p$ norm of the matrix. Our contribution involves proposing accelerated iteratively reweighted nuclear norm methods designed to solve the nonconvex low-rank minimization problem. Two major novelties characterize our approach. First, the proposed method possesses an active manifold identification property, enabling the provable identification of the correct rank of the stationary point within a finite number of iterations. Second, we introduce an adaptive updating strategy for smoothing parameters. This strategy automatically fixes parameters associated with zero singular values as constants upon detecting the correct rank while quickly driving the remaining parameters to zero. This adaptive behavior transforms the algorithm into one that effectively solves smooth problems after a few iterations, setting our work apart from existing iteratively reweighted methods for low-rank optimization. We prove the global convergence of the proposed algorithm, guaranteeing that every limit point of the iterates is a critical point. Furthermore, a local convergence rate analysis is provided under the Kurdyka-Łojasiewicz property. We conduct numerical experiments using both synthetic and real data to showcase our algorithm’s efficiency and superiority over existing methods.

本論文では、滑らかな関数と行列のSchatten-$p$ノルムの和を最小化する問題を考察します。私たちの貢献は、非凸低ランク最小化問題を解くために設計された、加速反復再重み付け核ノルム法の提案です。私たちのアプローチの特徴は、2つの大きな新規性です。まず、提案された方法は、アクティブな多様体識別特性を備えており、有限回数の反復内で定常点の正しいランクを証明可能に識別できます。次に、平滑化パラメータの適応更新戦略を導入します。この戦略は、正しいランクを検出すると、ゼロ特異値に関連付けられたパラメータを定数として自動的に固定し、残りのパラメータをすばやくゼロにします。この適応動作により、アルゴリズムは、数回の反復後に滑らかな問題を効果的に解決するアルゴリズムに変換され、私たちの研究は、既存の低ランク最適化の反復再重み付け方法とは一線を画しています。提案されたアルゴリズムのグローバル収束を証明し、反復のすべての極限点が臨界点であることを保証します。さらに、Kurdyka-Łojasiewiczプロパティに基づいて、ローカル収束率分析が提供されます。合成データと実際のデータの両方を使用して数値実験を実施し、既存の方法に対する当社のアルゴリズムの効率性と優位性を示します。

Empirical Design in Reinforcement Learning
強化学習における実証設計

Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times significant computational resources. While compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyperparameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout we highlight common mistakes found in the literature and the statistical consequences of those in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.

強化学習における経験的設計は、決して簡単な作業ではありません。良い実験を実行するには、細部への注意と、時にはかなりの計算リソースが必要です。1ドルあたりに利用できる計算リソースが急速に増加し続ける一方で、強化学習における典型的な実験の規模も増加しています。現在では、何百万ものパラメータを持つエージェントを、それぞれ30日間の経験に相当する数十のタスクに対してベンチマークすることが一般的です。これらの実験の規模は、特にアルゴリズムを比較する場合に、統計的証拠の必要性と矛盾することがよくあります。最近の研究では、人気のあるアルゴリズムがハイパーパラメータ設定と実装の詳細に敏感であること、そして一般的な経験的慣行が弱い統計的証拠につながることが強調されています(Machadoら、2018年、Hendersonら、2018年)。この原稿は、強化学習で良い実験を行う方法に関する行動の呼びかけと包括的なリソースの両方を表しています。特に、一般的なパフォーマンス測定の基礎となる統計的仮定、パフォーマンスの変動と安定性を適切に特徴付ける方法、仮説検定、複数のエージェントを比較する際の特別な考慮事項、ベースラインと説明例の構築、ハイパーパラメータと実験者のバイアスに対処する方法について取り上げます。全体を通して、文献で見られる一般的な間違いと、サンプル実験におけるそれらの統計的結果を強調します。このドキュメントの目的は、強化学習で優れた科学を行うために前例のないコンピューティングをどのように使用できるかについての答えを提供し、経験的設計の潜在的な落とし穴に注意を払うことです。

A Data-Adaptive RKHS Prior for Bayesian Learning of Kernels in Operators
演算子におけるカーネルのベイジアン学習のためのデータ適応RKHS事前

Kernels effectively represent nonlocal dependencies and are extensively employed in formulating operators between function spaces. Thus, learning kernels in operators from data is an inverse problem of general interest. Due to the nonlocal dependence, the inverse problem is often severely ill-posed with a data-dependent normal operator. Traditional Bayesian methods address the ill-posedness by a non-degenerate prior, which may result in an unstable posterior mean in the small noise regime, especially when data induces a perturbation in the null space of the normal operator. We propose a new data-adaptive Reproducing Kernel Hilbert Space (RKHS) prior, which ensures the stability of the posterior mean in the small noise regime. We analyze this adaptive prior and showcase its efficacy through applications on Toeplitz matrices and integral operators. Numerical experiments reveal that fixed non-degenerate priors can produce divergent posterior means under errors from discretization, model inaccuracies, partial observations, or erroneous noise assumptions. In contrast, our data-adaptive RKHS prior consistently yields convergent posterior means.

カーネルは非局所的な依存関係を効果的に表現し、関数空間間の演算子を定式化する際に広く使用されています。したがって、データから演算子のカーネルを学習することは、一般的な関心事である逆問題です。非局所的な依存関係のため、逆問題は、データ依存の正規演算子では深刻な不良設定になることがよくあります。従来のベイズ法では、非退化事前分布によって不良設定性に対処しますが、これは、特にデータが正規演算子のヌル空間で摂動を引き起こす場合に、ノイズが小さい状態で事後平均が不安定になる可能性があります。私たちは、ノイズが小さい状態で事後平均の安定性を保証する、新しいデータ適応型再生カーネルヒルベルト空間(RKHS)事前分布を提案します。私たちはこの適応事前分布を分析し、テプリッツ行列と積分演算子への応用を通じてその有効性を示します。数値実験により、固定された非退化事前分布は、離散化、モデルの不正確さ、部分的な観測、または誤ったノイズ仮定による誤差の下で、発散事後平均を生成する可能性があることが明らかになりました。対照的に、データ適応型RKHS事前分布は、一貫して収束事後平均を生成します。

GGD: Grafting Gradient Descent
GGD：グラフト勾配降下法

Simple random sampling has been widely used in traditional stochastic optimization algorithms. Although the gradient sampled by simple random sampling is a descent direction in expectation, it may have a relatively high variance, which causes the descent curve to wiggle and slows down the optimization process. In this paper, we propose a novel stochastic optimization method called grafting gradient descent (GGD), which combines the strengths of mini-batching and importance sampling, and provide the convergence results of GGD. We show that the grafting gradient possesses a doubly robust property which ensures that GGD performs better than the worse of the two methods it combines, SGD with importance sampling and mini-batch SGD. We also provide composite GGD-based methods, together with their theoretical bounds, that combine GGD with advanced variance reduction techniques such as stochastic variance reduced gradient and adaptive stepsize methods such as Adam. The real data studies also show that GGD achieves a performance intermediate between SGD with importance sampling and mini-batch SGD, and outperforms the original SGD method. Thus the proposed GGD is a better and more robust stochastic optimization framework in practice.

単純ランダムサンプリングは、従来の確率的最適化アルゴリズムで広く使用されています。単純ランダムサンプリングによってサンプリングされた勾配は、期待される下降方向ですが、比較的高い分散を持つ可能性があり、下降曲線の揺れを引き起こし、最適化プロセスを遅くします。この論文では、ミニバッチと重要度サンプリングの長所を組み合わせたグラフティング勾配降下法(GGD)と呼ばれる新しい確率的最適化手法を提案し、GGDの収束結果を示します。グラフティング勾配は、GGD法のパフォーマンスが、重要度サンプリング法とミニバッチSGD法の劣ったものよりも優れていることを保証する二重の堅牢性を備えていることを示します。確率的分散削減勾配などの高度な分散削減手法や、Adamなどの適応ステップサイズ法と組み合わせることで、これらの複合GGDベースの方法とその理論的境界が提供されます。実際のデータ研究によると、GGDは重要度サンプリングを使用したSGDとミニバッチSGDの中間のパフォーマンスを達成し、元のSGD方法よりも優れていることも示されています。したがって、提案されたGGDは、実際にはより優れた、より堅牢な確率的最適化フレームワークです。

Debiasing Evaluations That Are Biased by Evaluations
評価によって偏った評価のバイアス除去

It is common to evaluate a set of items by soliciting people to rate them. For example, universities ask students to rate the teaching quality of their instructors, and conference organizers ask authors of submissions to evaluate the quality of the reviews. However, in these applications, students often give a higher rating to a course if they receive higher grades in a course, and authors often give a higher rating to the reviews if their papers are accepted to the conference. In this work, we call these external factors the “outcome” experienced by people, and consider the problem of mitigating these outcome-induced biases in the given ratings when some information about the outcome is available. We formulate the information about the outcome as a known partial ordering on the bias. We propose a debiasing method by solving a regularized optimization problem under this ordering constraint, and also provide a carefully designed cross-validation method that adaptively chooses the appropriate amount of regularization. We provide theoretical guarantees on the performance of our algorithm, as well as experimental evaluations.

人々に評価を依頼して一連の項目を評価することは一般的です。たとえば、大学では学生に講師の教育の質を評価するよう求め、会議の主催者は論文の著者にレビューの質を評価するよう求めます。しかし、これらのアプリケーションでは、学生はコースで高い成績を取った場合にコースに高い評価を与えることが多く、著者は論文が会議に受け入れられた場合にレビューに高い評価を与えることがよくあります。この研究では、これらの外部要因を人々が経験する「結果」と呼び、結果に関する情報が利用可能である場合に、結果によって引き起こされるバイアスを所定の評価で緩和する問題を検討します。結果に関する情報は、バイアスの既知の部分順序として定式化します。この順序付け制約の下で正規化された最適化問題を解決することにより、バイアス除去方法を提案し、適切な量の正規化を適応的に選択する慎重に設計されたクロス検証方法も提供します。アルゴリズムのパフォーマンスに関する理論的な保証と実験的評価を提供します。

Optimal Learning Policies for Differential Privacy in Multi-armed Bandits
多腕バンディットにおける差分プライバシーのための最適学習方策

This paper studies the multi-armed bandit problem with a requirement of differential privacy guarantee or global differential privacy guarantee. We first prove that the lower bound for the extra regret to protect $(\epsilon,\delta)$-global differential privacy is $\Omega({N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$ ($N$ is the number of arms and $T$ is the time horizon), which is independent of $T$ for $\delta > 0$ and large enough $T$. Moreover, the lower bound for the extra regret to protect $(\epsilon,\delta)$-differential privacy can be no more than the above bound. This means that, unlike the case $\delta = 0$, it is possible to design algorithms that protect privacy and achieve the same asymptotic regret upper bound as the non-private algorithms when $\delta > 0$. Then we adapt the Follow the Perturbed Leader (FTPL) framework, and propose learning policies with both Gaussian and Beta perturbed distributions (DP-FTPL-Gauss and DP-FTPL-Beta) to protect $(\epsilon,\delta)$-differential privacy. The analysis shows that they achieve an $O({N\log T\over \Delta_{\min}} + N \min\{{1\over \delta^2}, {1\over \epsilon^2}\log{1\over \delta}\})$ regret upper bound, where $\Delta_{\min}$ is the minimum expected reward gap between the optimal arm and any other arm. We also design a unique perturbed distribution to protect $(\epsilon,\delta)$-differential privacy in the FTPL framework (DP-FTPL-New), which reduces the regret upper bound to $O({N\log T\over \Delta_{\min}} + {N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$. We further show that this perturbed distribution could also be used to protect $(\epsilon,\delta)$-global differential privacy, and design a corresponding algorithm GDP-Elim-New. We show that its regret upper bound is $O({\Delta_{\max} \over \Delta_{\min}}({N\log T\over \Delta_{\min}} + {N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T}))$. This shows that our $\Omega({N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$ regret lower bound is tight (e.g. when ${\Delta_{\max}\over \Delta_{\min}}$ is bounded).

この論文では、差分プライバシー保証またはグローバル差分プライバシー保証を必要とする多腕バンディット問題を研究します。まず、$(\epsilon,\delta)$-グローバル差分プライバシーを保護するための余分な後悔の下限が$\Omega({N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$（$N$はアームの数、$T$は時間範囲）であることを証明します。これは、$\delta > 0$で$T$が十分に大きい場合、$T$とは独立です。さらに、$(\epsilon,\delta)$-差分プライバシーを保護するための余分な後悔の下限は、上記の下限を超えることはありません。これは、$\delta = 0$の場合とは異なり、$\delta > 0$の場合はプライバシーを保護し、非プライバシーアルゴリズムと同じ漸近的な後悔の上限を達成するアルゴリズムを設計できることを意味します。次に、Follow the Perturbed Leader (FTPL)フレームワークを採用し、ガウス分布とベータ摂動分布の両方を使用した学習ポリシー(DP-FTPL-GaussとDP-FTPL-Beta)を提案して、$(\epsilon,\delta)$-差分プライバシーを保護します。分析により、$O({N\log T\over \Delta_{\min}} + N \min\{{1\over \delta^2}, {1\over \epsilon^2}\log{1\over \delta}\})$の後悔の上限を達成することが示されています。ここで、$\Delta_{\min}$は、最適なアームと他のアームの間の最小の期待報酬ギャップです。また、FTPLフレームワーク(DP-FTPL-New)で$(\epsilon,\delta)$-差分プライバシーを保護するための独自の摂動分布を設計します。これにより、後悔の上限が$O({N\log T\over \Delta_{\min}} + {N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$に削減されます。さらに、この摂動分布は$(\epsilon,\delta)$-グローバル差分プライバシーの保護にも使用できることを示し、対応するアルゴリズムGDP-Elim-Newを設計します。我々は、その後悔の上限が$O({\Delta_{\max} \over \Delta_{\min}}({N\log T\over \Delta_{\min}} + {N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T}))$であることを示します。これは、$\Omega({N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$の後悔の下限がタイトであることを示しています(たとえば、${\Delta_{\max}\over \Delta_{\min}}$が有界である場合)。
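
The claim that the lower bound does not depend on $T$ when $\delta > 0$ can be checked directly from the expression in the abstract. The short derivation below is a reader's worked step, not taken from the paper: dividing the numerator and denominator of the logarithm's argument by $T$ gives the limit, and setting $\delta = 0$ recovers a $\log T$ term.

```latex
\[
\lim_{T\to\infty}\frac{(e^{\epsilon}-1)T+\delta T}{(e^{\epsilon}-1)+\delta T}
=\lim_{T\to\infty}\frac{(e^{\epsilon}-1)+\delta}{(e^{\epsilon}-1)/T+\delta}
=1+\frac{e^{\epsilon}-1}{\delta}
\qquad (\delta>0),
\]
\[
\left.\frac{(e^{\epsilon}-1)T+\delta T}{(e^{\epsilon}-1)+\delta T}\right|_{\delta=0}=T
\quad\Longrightarrow\quad
\Omega\!\left(\frac{N}{\epsilon}\log T\right).
\]
```

So for $\delta > 0$ the extra regret bound levels off at order ${N\over \epsilon}\log(1 + {e^{\epsilon}-1\over \delta})$, while for $\delta = 0$ it keeps growing logarithmically in $T$.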

Data-Efficient Policy Evaluation Through Behavior Policy Search
行動ポリシー検索によるデータ効率の高い政策評価

We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy — a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.

マルコフ決定プロセス(MDP)のポリシーを評価するタスクについて検討します。ポリシーを評価するための標準的な偏りのない手法は、ポリシーを展開してそのパフォーマンスを観察することです。異なるポリシー(一般に動作ポリシーと呼ばれる)を展開して収集したデータを使用して、この標準的な手法よりも平均二乗誤差が低い偏りのない推定値を生成できることを示します。最小分散動作ポリシー(結果として得られる推定値の平均二乗誤差を最小化する動作ポリシー)の解析式を導きます。この式は実際には不明な条件に依存するため、平均二乗誤差を削減する動作ポリシーを検索するという新しいポリシー評価サブ問題、動作ポリシー検索を提案します。2つの動作ポリシー検索アルゴリズムを紹介し、ポリシーパフォーマンス推定値の平均二乗誤差の低減におけるその有効性を実証します。
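
The idea of evaluating one policy from data collected by another can be illustrated with a small sketch. It assumes ordinary importance sampling as the off-policy estimator, which the abstract does not name; the function name, the trajectory format, and the toy policies are all hypothetical choices made for this illustration.

```python
def importance_sampling_estimate(trajectories, target_policy, behavior_policy):
    """Average importance-weighted return over trajectories.

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_policy, behavior_policy: functions (state, action) -> probability.
    The behavior policy must give positive probability to every logged action.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0   # product of per-step probability ratios
        ret = 0.0      # undiscounted return of this trajectory
        for state, action, reward in trajectory:
            weight *= target_policy(state, action) / behavior_policy(state, action)
            ret += reward
        total += weight * ret
    return total / len(trajectories)


# Toy one-step problem: action 1 pays reward 1, action 0 pays reward 0.
def target(state, action):
    return 0.8 if action == 1 else 0.2

def behavior(state, action):
    return 0.5  # uniform over the two actions

logged = [[("s", 1, 1.0)], [("s", 0, 0.0)]]
estimate = importance_sampling_estimate(logged, target, behavior)
```

In this toy case the estimate equals 0.8, the true expected return of the target policy. Which behavior policy gives the lowest mean squared error is exactly the question the paper studies, and this sketch does not address it.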

Just Wing It: Near-Optimal Estimation of Missing Mass in a Markovian Sequence
Just Wing It：マルコフ数列における欠損質量の最適近似推定

We study the problem of estimating the stationary mass—also called the unigram mass—that is missing from a single trajectory of a discrete-time, ergodic Markov chain. This problem has several applications—for example, estimating the stationary missing mass is critical for accurately smoothing probability estimates in sequence models. While the classical Good–Turing estimator from the 1950s has appealing properties for i.i.d. data, it is known to be biased in the Markovian setting, and other heuristic estimators do not come equipped with guarantees. Operating in the general setting in which the size of the state space may be much larger than the length $n$ of the trajectory, we develop a linear-runtime estimator called Windowed Good–Turing (WingIt) and show that its risk decays as $\widetilde{O}(\mathsf{T_{mix}}/n)$, where $\mathsf{T_{mix}}$ denotes the mixing time of the chain in total variation distance. Notably, this rate is independent of the size of the state space and minimax-optimal up to a logarithmic factor in $n / \mathsf{T_{mix}}$. We also present an upper bound on the variance of the missing mass random variable, which may be of independent interest. We extend our estimator to approximate the stationary mass placed on elements occurring with small frequency in the trajectory. Finally, we demonstrate the efficacy of our estimators both in simulations on canonical chains and on sequences constructed from natural language text.

我々は、離散時間エルゴードマルコフ連鎖の単一の軌跡から欠落している定常質量(ユニグラム質量とも呼ばれる)を推定する問題を研究します。この問題にはいくつかの応用があります。たとえば、定常欠落質量を推定することは、シーケンスモデルで確率推定を正確に平滑化するために重要です。1950年代の古典的なグッドチューリング推定量はi.i.d.データに対して魅力的な特性を持っていますが、マルコフ設定では偏りがあることが知られており、他のヒューリスティック推定量には保証がありません。状態空間のサイズが軌跡の長さ$n$よりもはるかに大きい可能性がある一般的な設定で動作して、Windowed Good–Turing (WingIt)と呼ばれる線形実行時間推定器を開発し、そのリスクが$\widetilde{O}(\mathsf{T_{mix}}/n)$として減衰することを示します。ここで、$\mathsf{T_{mix}}$は、総変動距離におけるチェーンの混合時間を表します。注目すべきことに、この率は状態空間のサイズに依存せず、$n / \mathsf{T_{mix}}$の対数係数まで最小最大最適です。また、独立した関心事である可能性のある、欠損質量ランダム変数の分散の上限も提示します。推定器を拡張して、軌跡内で低頻度で発生する要素に配置される定常質量を近似します。最後に、標準チェーンと自然言語テキストから構築されたシーケンスの両方のシミュレーションで推定器の有効性を示します。
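
For reference, the classical Good–Turing missing mass estimate mentioned in the abstract is the fraction of observations whose symbol appears exactly once. The sketch below shows that i.i.d. baseline only; it is not the windowed WingIt estimator, whose construction is not given in the abstract, and the function name is an illustrative choice.

```python
from collections import Counter


def good_turing_missing_mass(sequence):
    """Classical Good-Turing estimate of the missing mass.

    Returns (number of symbols seen exactly once) / (sequence length).
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


sample = ["a", "b", "a", "c"]
estimate = good_turing_missing_mass(sample)  # "b" and "c" are singletons
```

Here two of the four observations are singletons, so the estimate is 0.5. As the abstract notes, this estimator is known to be biased when the sequence is Markovian rather than i.i.d.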

Estimating the Replication Probability of Significant Classification Benchmark Experiments
有意な分類ベンチマーク実験の再現確率の推定

A fundamental question in machine learning is: “What are the chances that a statistically significant result will replicate?” The standard framework of null hypothesis significance testing, however, cannot answer this question directly. In this work, we derive formulas for estimating the replication probability that are applicable in two of the most widely used experimental designs in machine learning: the comparison of two classifiers over multiple benchmark datasets and the comparison of two classifiers in k-fold cross-validation. Using simulation studies, we show that p-values just below the common significance threshold of 0.05 are insufficient to warrant a high confidence in the replicability of significant results, as such p-values are barely more informative than the flip of a coin. If a replication probability of around 0.95 is desired, then the significance threshold should be lowered to at least 0.003. This observation might explain, at least in part, why many published research findings fail to replicate.

機械学習における基本的な質問は、「統計的に有意な結果が再現される可能性はどれくらいか？」です。しかし、帰無仮説有意性検定の標準的なフレームワークでは、この質問に直接答えることはできません。この研究では、機械学習で最も広く使用されている2つの実験設計、つまり複数のベンチマークデータセットでの2つの分類器の比較と、k分割交差検証での2つの分類器の比較に適用できる、再現確率を推定する式を導き出します。シミュレーション研究を使用して、一般的な有意性しきい値0.05をわずかに下回るp値は、有意な結果の再現性に高い信頼性を保証するには不十分であることを示しています。このようなp値は、コインを投げるよりもほとんど情報量が多くありません。約0.95の再現確率が必要な場合は、有意性しきい値を少なくとも0.003に下げる必要があります。この観察は、多くの公開された研究結果が再現されない理由を少なくとも部分的に説明できるかもしれません。
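
The "flip of a coin" remark can be illustrated with a textbook simplification. This is not the formula derived in the paper: the sketch assumes a two-sided z-test, assumes the true effect equals the observed one, and ignores the opposite tail. The function name is hypothetical.

```python
from statistics import NormalDist


def naive_replication_probability(p_value, alpha=0.05):
    """Probability that an identical repeat study is again significant,
    under the (strong) assumption that the true effect equals the
    observed effect of a two-sided z-test."""
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1 - p_value / 2)
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # The replicate's z-score is modeled as Normal(z_observed, 1).
    return 1 - std_normal.cdf(z_critical - z_observed)


at_threshold = naive_replication_probability(0.05)
```

When the p-value sits exactly at the threshold, the observed and critical z-scores coincide and the result is 0.5. Smaller p-values give larger values under this simplified model.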

Causal Discovery with Generalized Linear Models through Peeling Algorithms
ピーリングアルゴリズムによる一般化線形モデルによる因果関係の発見

This article presents a novel method for causal discovery with generalized structural equation models suited for analyzing diverse types of outcomes, including discrete, continuous, and mixed data. Causal discovery often faces challenges due to unmeasured confounders that hinder the identification of causal relationships. The proposed approach addresses this issue by developing two peeling algorithms (bottom-up and top-down) to ascertain causal relationships and valid instruments. This approach first reconstructs a super-graph to represent ancestral relationships between variables, using a peeling algorithm based on nodewise GLM regressions that exploit relationships between primary and instrumental variables. Then, it estimates parent-child effects from the ancestral relationships using another peeling algorithm while deconfounding a child’s model with information borrowed from its parents’ models. The article offers a theoretical analysis of the proposed approach, establishing conditions for model identifiability and providing statistical guarantees for accurately discovering parent-child relationships via the peeling algorithms. Furthermore, the article presents numerical experiments showcasing the effectiveness of our approach in comparison to state-of-the-art structure learning methods without confounders. Lastly, it demonstrates an application to Alzheimer’s disease (AD), highlighting the method’s utility in constructing gene-to-gene and gene-to-disease regulatory networks involving Single Nucleotide Polymorphisms (SNPs) for healthy and AD subjects.

本稿では、離散データ、連続データ、混合データなど、さまざまなタイプの結果を分析するのに適した一般化構造方程式モデルを使用した因果発見の新しい方法を紹介します。因果関係の特定を妨げる測定されていない交絡因子のために、因果関係の発見はしばしば困難に直面します。提案されたアプローチでは、因果関係と有効な手段を突き止めるための2つのピーリングアルゴリズム(ボトムアップとトップダウン)を開発することで、この問題に対処します。このアプローチでは、まず、主変数と手段変数の関係を利用するノードワイズGLM回帰に基づくピーリングアルゴリズムを使用して、変数間の祖先関係を表すスーパーグラフを再構築します。次に、別のピーリングアルゴリズムを使用して祖先関係から親子効果を推定し、同時に親のモデルから借りた情報で子のモデルを交絡除去します。本稿では、提案されたアプローチの理論的分析を提供し、モデルの識別可能性の条件を確立し、ピーリングアルゴリズムを介して親子関係を正確に発見するための統計的保証を提供します。さらに、本論文では、交絡因子のない最先端の構造学習法と比較した、当社のアプローチの有効性を示す数値実験を紹介します。最後に、アルツハイマー病(AD)への応用を示し、健康な被験者とAD被験者の単一ヌクレオチド多型(SNP)を含む遺伝子間および遺伝子と疾患間の制御ネットワークを構築する際のこの方法の有用性を強調します。

Spectral Regularized Kernel Goodness-of-Fit Tests
スペクトル正則化カーネル適合度検定

Maximum mean discrepancy (MMD) has enjoyed a lot of success in many machine learning and statistical applications, including non-parametric hypothesis testing, because of its ability to handle non-Euclidean data. Recently, it has been demonstrated in Balasubramanian et al. (2021) that the goodness-of-fit test based on MMD is not minimax optimal while a Tikhonov regularized version of it is, for an appropriate choice of the regularization parameter. However, the results in Balasubramanian et al. (2021) are obtained under the restrictive assumptions of the mean element being zero, and the uniform boundedness condition on the eigenfunctions of the integral operator. Moreover, the test proposed in Balasubramanian et al. (2021) is not practical as it is not computable for many kernels. In this paper, we address these shortcomings and extend the results to general spectral regularizers that include Tikhonov regularization.

最大平均乖離度(MMD)は、非ユークリッドデータを扱う能力があるため、ノンパラメトリック仮説検定を含む多くの機械学習および統計アプリケーションで大きな成功を収めてきました。最近、Balasubramanianら(2021)では、MMDに基づく適合度検定はミニマックス最適ではないが、そのTikhonov正規化バージョンは、正規化パラメータを適切に選択することでミニマックス最適になることが実証されました。ただし、Balasubramanianら(2021)の結果は、平均要素がゼロであり、積分演算子の固有関数に一様有界条件があるという制限的な仮定の下で得られています。さらに、Balasubramanianら(2021)で提案された検定は、多くのカーネルでは計算できないため実用的ではありません。本論文では、これらの欠点に対処し、結果をTikhonov正規化を含む一般的なスペクトル正規化器に拡張します。
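
As background for the statistic being regularized, here is a minimal sketch of the plain squared MMD between two samples. It assumes a Gaussian kernel and the biased V-statistic form, both chosen for illustration. It is the unregularized two-sample quantity, not the spectral regularized goodness-of-fit test proposed in the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of the squared MMD between samples."""
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy


near = [(0.0,), (1.0,)]
far = [(5.0,), (6.0,)]
same_sample = mmd_squared(near, near)
different_samples = mmd_squared(near, far)
```

The estimate is zero when a sample is compared with itself and strictly positive for the two well-separated samples above.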

Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality
エントロピー正則化RLに対するマトリョーシカ方策勾配：収束と大域最適性

A novel Policy Gradient (PG) algorithm, called Matryoshka Policy Gradient (MPG), is introduced and studied, in the context of fixed-horizon max-entropy reinforcement learning, where an agent aims at maximizing entropy bonuses additional to its cumulative rewards. In the linear function approximation setting with softmax policies, we prove uniqueness and characterize the optimal policy of the entropy regularized objective, together with global convergence of MPG. These results are proved in the case of continuous state and action space. MPG is intuitive, theoretically sound and we furthermore show that the optimal policy of the infinite horizon max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework. Finally, we provide a criterion for global optimality when the policy is parametrized by a neural network in terms of the neural tangent kernel at convergence. As a proof of concept, we numerically evaluate MPG on standard test benchmarks.

マトリョーシカポリシー勾配(MPG)と呼ばれる新しいポリシー勾配(PG)アルゴリズムが、エージェントが累積報酬に加えてエントロピーボーナスを最大化することを目指す固定期間最大エントロピー強化学習のコンテキストで導入され、研究されています。ソフトマックスポリシーを使用した線形関数近似設定では、エントロピー正規化目的の一意性を証明し、最適ポリシーとMPGのグローバル収束を特徴付けます。これらの結果は、連続状態およびアクション空間の場合に証明されています。MPGは直感的で、理論的に健全であり、さらに、無限期間最大エントロピー目的の最適ポリシーは、MPGフレームワークの最適ポリシーによって任意に近似できることを示しています。最後に、ポリシーが収束時にニューラルタンジェントカーネルに関してニューラルネットワークによってパラメーター化される場合のグローバル最適性の基準を提供します。概念実証として、標準テストベンチマークでMPGを数値的に評価します。

Non-Euclidean Monotone Operator Theory and Applications
非ユークリッド単調演算子の理論と応用

While monotone operator theory is often studied on Hilbert spaces, many interesting problems in machine learning and optimization arise naturally in finite-dimensional vector spaces endowed with non-Euclidean norms, such as diagonally-weighted $\ell_{1}$ or $\ell_{\infty}$ norms. This paper provides a natural generalization of monotone operator theory to finite-dimensional non-Euclidean spaces. The key tools are weak pairings and logarithmic norms. We show that the resolvent and reflected resolvent operators of non-Euclidean monotone mappings exhibit similar properties to their counterparts in Hilbert spaces. Furthermore, classical iterative methods and splitting methods for finding zeros of monotone operators are shown to converge in the non-Euclidean case. We apply our theory to equilibrium computation and Lipschitz constant estimation of recurrent neural networks, obtaining novel iterations and tighter upper bounds via forward-backward splitting.

単調作用素理論はヒルベルト空間で研究されることが多いが、機械学習や最適化における興味深い問題の多くは、対角重み付き$\ell_{1}$や$\ell_{\infty}$ノルムなどの非ユークリッドノルムを備えた有限次元ベクトル空間で自然に生じる。本論文では、単調作用素理論を有限次元非ユークリッド空間に自然に一般化します。重要なツールは、弱いペアリングと対数ノルムです。非ユークリッド単調写像のレゾルベント作用素と反射レゾルベント作用素が、ヒルベルト空間の対応する作用素と同様の特性を示すことを示す。さらに、単調作用素の零点を見つけるための古典的な反復法と分割法は、非ユークリッドの場合に収束することが示されます。本理論をリカレントニューラルネットワークの平衡計算とリプシッツ定数推定に適用し、前向き-後ろ向き分割によって新しい反復とより厳しい上限を得る。

Stochastic Regularized Majorization-Minimization with weakly convex and multi-convex surrogates
弱凸および多重凸サロゲートによる確率的正則化主要化最小化

Stochastic majorization-minimization (SMM) is a class of stochastic optimization algorithms that proceed by sampling new data points and minimizing a recursive average of surrogate functions of an objective function. The surrogates are required to be strongly convex, and no convergence rate analysis was available for the general nonconvex setting. In this paper, we propose an extension of SMM where surrogates are allowed to be only weakly convex or block multi-convex, and the averaged surrogates are approximately minimized with proximal regularization or block-minimized within diminishing radii, respectively. For the general nonconvex constrained setting with non-i.i.d. data samples, we show that the first-order optimality gap of the proposed algorithm decays at the rate $\widetilde{O}(n^{-1/4})$ for the empirical loss and $\widetilde{O}(n^{-1/8})$ for the expected loss, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $\widetilde{O}(n^{-1/4})$. As a corollary, we obtain the first convergence rate bounds for various optimization methods under general nonconvex non-i.i.d. data setting: double-averaging projected gradient descent and its generalizations, proximal point empirical risk minimization, and online matrix/tensor decomposition algorithms. We also provide experimental validation of our results.

確率的主要化最小化(SMM)は、新しいデータポイントをサンプリングし、目的関数の代理関数の再帰平均を最小化することによって処理を進める、確率的最適化アルゴリズムの一種です。代理関数は強く凸である必要があり、一般的な非凸設定に対する収束率分析はこれまで利用できませんでした。この論文では、代理関数が弱凸またはブロック多重凸のみであることが許可され、平均代理関数がそれぞれ近接正則化で近似的に最小化されるか、減少半径内でブロック最小化される、SMMの拡張を提案します。非i.i.d.データサンプルを伴う一般的な非凸制約設定の場合、提案アルゴリズムの一次最適性ギャップが、経験的損失については$\widetilde{O}(n^{-1/4})$、期待損失については$\widetilde{O}(n^{-1/8})$の速度で減少することを示します。ここで、$n$は処理されるデータサンプルの数を表します。いくつかの追加の仮定の下では、後者の収束率は$\widetilde{O}(n^{-1/4})$まで改善できます。結果として、一般的な非凸非i.i.d.データ設定の下でのさまざまな最適化手法の最初の収束率境界を取得します。二重平均射影勾配降下法とその一般化、近接点経験的リスク最小化、オンライン行列/テンソル分解アルゴリズムです。また、結果の実験的検証も提供します。

Pure Differential Privacy for Functional Summaries with a Laplace-like Process
ラプラス様過程による機能要約のための純粋な差分プライバシー

Many existing mechanisms for achieving differential privacy (DP) on infinite-dimensional functional summaries typically involve embedding these functional summaries into finite-dimensional subspaces and applying traditional multivariate DP techniques. These mechanisms generally treat each dimension uniformly and struggle with complex, structured summaries. This work introduces a novel mechanism to achieve pure DP for functional summaries in a separable infinite-dimensional Hilbert space, named the Independent Component Laplace Process (ICLP) mechanism. This mechanism treats the summaries of interest as truly infinite-dimensional functional objects, thereby addressing several limitations of the existing mechanisms. Several statistical estimation problems are considered, and we demonstrate how one can enhance the utility of private summaries by oversmoothing the non-private counterparts. Numerical experiments on synthetic and real datasets demonstrate the effectiveness of the proposed mechanism.

無限次元の機能サマリーで差分プライバシー(DP)を実現するための既存のメカニズムの多くは、通常、これらの機能サマリーを有限次元サブスペースに埋め込み、従来の多変量DP手法を適用することを伴います。これらのメカニズムは、一般に各次元を均一に扱い、複雑で構造化されたサマリーを扱いにくくなっています。この研究では、分離可能な無限次元ヒルベルト空間で機能サマリーの純粋なDPを実現するための新しいメカニズム、独立成分ラプラス過程(ICLP)メカニズムを紹介します。このメカニズムは、対象のサマリーを真に無限次元の機能オブジェクトとして扱うため、既存のメカニズムのいくつかの制限に対処します。いくつかの統計的推定問題を考慮し、非プライベートなサマリーを過剰に平滑化することでプライベートサマリーの有用性を高める方法を示します。合成データセットと実際のデータセットでの数値実験により、提案されたメカニズムの有効性が実証されています。

Sparse Recovery With Multiple Data Streams: An Adaptive Sequential Testing Approach
複数のデータストリームによるスパースリカバリ:適応型シーケンシャル・テスト・アプローチ

Multistage design has been utilized across a variety of scientific fields, enabling the adaptive allocation of sensing resources to effectively eliminate null locations and localize signals. We present a decision-theoretic framework for multi-stage adaptive testing that minimizes the total number of measurements while ensuring pre-specified constraints on both the false positive rate (FPR) and the missed discovery rate (MDR). Our method, SMART, explicitly addresses the often-overlooked aspect of uncertainty quantification in machine learning algorithms, incorporating it at every decision stage. This enables SMART to respond adaptively to important patterns in the data streams, adjusting its decisions based on the strength of evidence at specific locations. By leveraging technical tools and key concepts from multiple testing, adaptive thresholding, and compound decision theory, SMART not only enhances the aggregation of information across individual tests but also allows for varying thresholds tailored to the observed data, thereby ensuring effective error rate control and resulting in significant savings on total study costs. Through comprehensive analyses of large-scale A/B tests, high-throughput screening, and image analysis, we demonstrate that our approach yields substantial efficiency gains and improved control over error rates compared to existing methodologies.

多段階設計はさまざまな科学分野で利用されており、感知リソースの適応的な割り当てにより、ヌル位置を効果的に排除し、信号の位置を特定することができます。私たちは、測定の総数を最小限に抑えながら、偽陽性率(FPR)と見逃し発見率(MDR)の両方に事前に指定された制約を保証する、多段階適応テストの意思決定理論的フレームワークを提示します。私たちの方法であるSMARTは、機械学習アルゴリズムにおける不確実性の定量化という見落とされがちな側面に明示的に対処し、それをすべての意思決定段階に組み込んでいます。これにより、SMARTはデータストリーム内の重要なパターンに適応的に応答し、特定の場所の証拠の強さに基づいて決定を調整できます。多重テスト、適応しきい値設定、複合意思決定理論の技術ツールと主要概念を活用することで、SMARTは個々のテスト間の情報の集約を強化するだけでなく、観測データに合わせて調整されたさまざまなしきい値を可能にするため、効果的なエラー率制御が保証され、総研究コストが大幅に節約されます。大規模なA/Bテスト、ハイスループットスクリーニング、画像分析の包括的な分析を通じて、当社のアプローチにより、既存の方法論と比較して大幅な効率性の向上とエラー率の制御の改善がもたらされることを実証しました。

Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning
因果的オフライン強化学習のための操作変数値の反復

In offline reinforcement learning (RL), an optimal policy is learned solely from a priori collected observational data. However, in observational data, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are the variables whose influence on the state variables is entirely mediated by the action. When a valid instrument is present, we can recover the confounded transition dynamics through observational data. We study a confounded Markov decision process where the transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction through which we can identify transition dynamics based on observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of the conditional moment restriction. To our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL.

オフライン強化学習(RL)では、事前に収集された観測データのみから最適なポリシーが学習されます。ただし、観測データでは、アクションは観測されない変数によってしばしば混同されます。RLのコンテキストにおける道具変数(IV)は、状態変数への影響がすべてアクションによって媒介される変数です。有効な道具変数が存在する場合、観測データを通じて混同された遷移ダイナミクスを回復できます。遷移ダイナミクスが加法的非線形関数形式を許容する混同されたマルコフ決定プロセスを研究します。IVを使用して、条件付きモーメント制約を導出し、それを通じて観測データに基づいて遷移ダイナミクスを識別できます。条件付きモーメント制約の主双対再定式化に基づく、証明可能で効率的なIV支援値反復(IVVI)アルゴリズムを提案します。私たちの知る限り、これは道具支援オフラインRLの証明可能で効率的な最初のアルゴリズムです。

Identifying Causal Effects using Instrumental Time Series: Nuisance IV and Correcting for the Past
インストゥルメンタル時系列を使用した因果効果の特定：迷惑IVと過去の補正

Instrumental variable (IV) regression relies on instruments to infer causal effects from observational data with unobserved confounding. We consider IV regression in time series models, such as vector auto-regressive (VAR) processes. Direct applications of i.i.d. techniques are generally inconsistent as they do not correctly adjust for dependencies in the past. In this paper, we outline the difficulties that arise due to time structure and propose methodology for constructing identifying equations that can be used for consistent parametric estimation of causal effects in time series data. One method uses extra nuisance covariates to obtain identifiability (an idea that can be of interest even in the i.i.d. case). We further propose a graph marginalization framework that allows us to apply nuisance IV and other IV methods in a principled way to time series. Our methods make use of a version of the global Markov property, which we prove holds for VAR(p) processes. For VAR(1) processes, we prove identifiability conditions that relate to Jordan forms and are different from the well-known rank conditions in the i.i.d. case (they do not require as many instruments as covariates, for example). We provide methods, prove their consistency, and show how the inferred causal effect can be used for distribution generalization. Simulation experiments corroborate our theoretical results. We provide ready-to-use Python code.

操作変数(IV)回帰は、観測されない交絡を伴う観測データから因果効果を推測するために操作変数に依存します。ベクトル自己回帰(VAR)プロセスなどの時系列モデルにおけるIV回帰を検討します。i.i.d.手法を直接適用すると、過去の依存関係が正しく調整されないため、一般的に一貫性がありません。この論文では、時間構造によって生じる困難について概説し、時系列データにおける因果効果の一貫したパラメトリック推定に使用できる識別方程式を構築するための方法論を提案します。1つの方法では、追加の迷惑共変量を使用して識別可能性を取得します(i.i.d.の場合でも興味深いアイデアです)。さらに、時系列に迷惑IVおよびその他のIV方法を原理的に適用できるようにするグラフ周辺化フレームワークを提案します。私たちの方法は、グローバルマルコフ特性のバージョンを利用しており、これがVAR(p)プロセスに当てはまることを証明しています。VAR(1)プロセスについては、ジョルダン形式に関連し、i.i.d.の場合のよく知られているランク条件とは異なる識別可能性条件を証明しています(たとえば、共変量ほど多くの操作変数を必要としません)。方法を提供し、その一貫性を証明し、推定された因果効果が分布の一般化にどのように使用できるかを示します。シミュレーション実験は、私たちの理論的結果を裏付けています。すぐに使用できるPythonコードを提供します。
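To make the role of an instrument concrete, here is a minimal sketch of the textbook i.i.d. IV estimator in the scalar case. It is not the paper's time-series method (which the abstract says is needed precisely because i.i.d. techniques are inconsistent under temporal dependence). The data-generating process, the coefficient values, and the function names `cov` and `simulate` are assumptions made for illustration only.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def simulate(n, beta=2.0, seed=0):
    """Hypothetical i.i.d. model: Z -> X -> Y with a hidden confounder U of X and Y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)                  # instrument: affects Y only through X
        ui = rng.gauss(0, 1)                  # unobserved confounder
        xi = zi + ui + rng.gauss(0, 1)        # treatment
        yi = beta * xi + 3.0 * ui + rng.gauss(0, 1)  # outcome, true effect = beta
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate(20000)
ols_estimate = cov(x, y) / cov(x, x)  # biased by the confounder (population value 3)
iv_estimate = cov(z, y) / cov(z, x)   # consistent for beta = 2 in this i.i.d. model
print(f"OLS: {ols_estimate:.2f}, IV: {iv_estimate:.2f}")
```

In this toy model the ordinary regression slope converges to 3 rather than the causal coefficient 2, while the ratio of covariances with the instrument recovers 2.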

RLtools: A Fast, Portable Deep Reinforcement Learning Library for Continuous Control
RLtools：連続制御のための高速でポータブルな深層強化学習ライブラリ

Deep Reinforcement Learning (RL) can yield capable agents and control policies in several domains but is commonly plagued by prohibitively long training times. Additionally, in the case of continuous control problems, the applicability of learned policies on real-world embedded devices is limited due to the lack of real-time guarantees and portability of existing libraries. To address these challenges, we present RLtools, a dependency-free, header-only, pure C++ library for deep supervised and reinforcement learning. Its novel architecture allows RLtools to be used on a wide variety of platforms, from HPC clusters through workstations and laptops to smartphones, smartwatches, and microcontrollers. Specifically, due to the tight integration of the RL algorithms with simulation environments, RLtools can solve popular RL problems up to 76 times faster than other popular RL frameworks. We also benchmark the inference on a diverse set of microcontrollers and show that in most cases our optimized implementation is by far the fastest. Finally, RLtools enables the first-ever demonstration of training a deep RL algorithm directly on a microcontroller, giving rise to the field of TinyRL. The source code as well as documentation and live demos are available through our project page at https://rl.tools.

深層強化学習(RL)は、複数のドメインで有能なエージェントと制御ポリシーを生み出すことができますが、通常、法外に長いトレーニング時間という問題があります。さらに、連続制御の問題の場合、リアルタイム保証と既存のライブラリの移植性の欠如により、学習したポリシーを実際の組み込みデバイスに適用することは制限されます。これらの課題に対処するために、深層教師あり学習および強化学習用の依存性のない、ヘッダーのみの純粋なC++ライブラリであるRLtoolsを紹介します。その斬新なアーキテクチャにより、RLtoolsは、HPCクラスターからワークステーション、ラップトップ、さらにスマートフォン、スマートウォッチ、マイクロコントローラーまで、さまざまなプラットフォームで使用できます。具体的には、RLアルゴリズムとシミュレーション環境の緊密な統合により、RLtoolsは一般的なRL問題を他の一般的なRLフレームワークよりも最大76倍高速に解決できます。また、さまざまなマイクロコントローラーのセットで推論をベンチマークし、ほとんどの場合、最適化された実装が群を抜いて最速であることを示しています。最後に、RLtoolsは、ディープRLアルゴリズムをマイクロコントローラ上で直接トレーニングする初めてのデモンストレーションを可能にし、TinyRLの分野を生み出しました。ソースコード、ドキュメント、ライブデモは、プロジェクトページ(https://rl.tools)から入手できます。

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
スパースレートリダクションによるホワイトボックストランスフォーマー：圧縮がすべてですか?

In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve strong performance across different settings: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression.

本稿では、表現学習の自然な目的は、データの分布、つまりトークンのセットを、非一貫性な部分空間でサポートされる低次元のガウス混合に向けて圧縮および変換することであると主張します。このような表現の良し悪しは、学習された表現の内在的情報ゲインと外在的スパース性を同時に最大化する、スパースレート削減と呼ばれる原理的な尺度によって評価できます。この観点から、トランスフォーマーを含む一般的なディープネットワークアーキテクチャは、この尺度を最適化するための反復スキームを実現するものと見なすことができます。特に、この目的の一部に対する交互の最適化からトランスフォーマーブロックを導出します。マルチヘッドセルフアテンション演算子は、特徴のコーディングレートに対して近似勾配降下ステップを実装することで表現を圧縮し、後続のマルチレイヤーパーセプトロンは特徴をスパース化します。これにより、数学的に完全に解釈可能な、CRATEと呼ばれるホワイトボックストランスフォーマーのようなディープネットワークアーキテクチャのファミリーが生まれます。ノイズ除去と圧縮の新しい関係を通じて、前述の圧縮エンコーディングの逆が、同じクラスのCRATEアーキテクチャによって実現できることを示します。したがって、このようにして導出されたホワイトボックスアーキテクチャは、エンコーダとデコーダの両方に共通です。実験では、これらのネットワークは、その単純さにもかかわらず、大規模な実世界の画像とテキストデータセットの表現を圧縮およびスパース化することを実際に学習し、ViT、MAE、DINO、BERT、GPT2などのさまざまな設定で優れたパフォーマンスを達成することが示されています。提案された計算フレームワークは、データ圧縮の統一された観点から、ディープラーニングの理論と実践のギャップを埋める大きな可能性を示していると考えています。
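The abstract refers to the "coding rate of the features" without giving a formula. As a rough illustration, the sketch below uses the log-determinant coding rate commonly used in the rate reduction literature, R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) for a feature matrix Z with d rows and n columns. That this is the exact quantity optimized in CRATE, the value of `eps`, and the function name `coding_rate` are assumptions here, not statements from the abstract.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Log-det coding rate of features Z (shape d x n) at distortion eps (assumed form)."""
    d, n = Z.shape
    gram = Z @ Z.T
    # slogdet is numerically safer than log(det(...)) for larger matrices.
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet


# Features spread over three orthogonal directions versus features
# collapsed onto a single direction.
spread = np.eye(3)
collapsed = np.tile(np.array([[1.0], [0.0], [0.0]]), (1, 3))

rate_spread = coding_rate(spread)
rate_collapsed = coding_rate(collapsed)
print(rate_spread, rate_collapsed)
```

Features that occupy fewer directions have a lower coding rate, which is the sense in which lowering this quantity "compresses" a representation.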

Commutative Scaling of Width and Depth in Deep Neural Networks
深層ニューラルネットワークにおける幅と深さの可換スケーリング

In this paper, we study the commutativity of infinite width and depth limits in deep neural networks. Our aim is to understand the behavior of neural functions (functions that depend on a neural network model) as width and depth go to infinity (in some sense), and eventually identify settings under which commutativity holds, i.e. the neural function tends to the same limit no matter how width and depth limits are taken. We formally introduce and define the commutativity framework, and discuss its implications on neural network design and scaling. We study commutativity for the neural covariance kernel, which reflects how network layers separate data. Our findings extend previous results established in Hayou and Yang (2023) by showing that taking the width and depth to infinity in a deep neural network with skip connections, when branches are suitably scaled to avoid exploding behavior, results in the same covariance structure no matter how that limit is taken. This has a number of theoretical and practical implications that we discuss in the paper. The proof techniques in this paper are new and rely on tools that are more accessible to readers who are not familiar with stochastic calculus (used in the proofs of Hayou and Yang (2023)).

本論文では、ディープニューラルネットワークにおける無限の幅と深さの制限の可換性について検討します。私たちの目的は、幅と深さが無限大(ある意味で)になるときのニューラル関数(ニューラルネットワークモデルに依存する関数)の動作を理解し、最終的に可換性が保持される設定、つまり幅と深さの制限がどのように取られてもニューラル関数が同じ制限に近づく設定を特定することです。本論文では、可換性フレームワークを正式に導入して定義し、ニューラルネットワークの設計とスケーリングへの影響について説明します。ネットワークレイヤーがデータを分離する方法を反映するニューラル共分散カーネルの可換性について説明します。私たちの研究結果は、HayouとYang (2023)で確立された以前の結果を拡張したもので、スキップ接続のあるディープニューラルネットワークで幅と深さを無限大にすると、爆発的な動作を回避するために分岐が適切にスケーリングされている場合、その制限がどのように取られても同じ共分散構造になることを示しています。これには、本論文で説明するいくつかの理論的および実用的な意味合いがあります。この論文の証明手法は新しいものであり、確率計算に精通していない読者にとってよりアクセスしやすいツールに依存しています（HayouとYang（2023）の証明で使用されています）。

Value-Distributional Model-Based Reinforcement Learning
価値分布モデルに基づく強化学習

Quantifying uncertainty about a policy’s long-term performance is important to solve sequential decision-making tasks. We study the problem from a model-based Bayesian reinforcement learning perspective, where the goal is to learn the posterior distribution over value functions induced by parameter (epistemic) uncertainty of the Markov decision process. Previous work restricts the analysis to a few moments of the distribution over values or imposes a particular distribution shape, e.g., Gaussians. Inspired by distributional reinforcement learning, we introduce a Bellman operator whose fixed-point is the value distribution function. Based on our theory, we propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function. We combine EQR with soft actor-critic (SAC) for policy optimization with an arbitrary differentiable objective function of the learned value distribution. Evaluation across several continuous-control tasks shows performance benefits with respect to both model-based and model-free algorithms. The code is available at https://github.com/boschresearch/dist-mbrl.

ポリシーの長期的なパフォーマンスに関する不確実性を定量化することは、逐次的な意思決定タスクを解決するために重要です。私たちは、モデルベースのベイジアン強化学習の観点からこの問題を研究します。その目的は、マルコフ決定プロセスのパラメータ（認識論的）不確実性によって誘導される価値関数の事後分布を学習することです。以前の研究では、分析を値の分布のいくつかのモーメントに制限するか、ガウス分布などの特定の分布形状を課しています。分布強化学習に触発されて、私たちは、固定点が価値分布関数であるベルマン演算子を導入します。私たちの理論に基づいて、価値分布関数を学習するモデルベースのアルゴリズムである認識分位回帰（EQR）を提案します。私たちは、学習した価値分布の任意の微分可能な目的関数を使用して、ポリシーを最適化するために、EQRをソフトアクタークリティック（SAC）と組み合わせます。いくつかの連続制御タスクにわたる評価では、モデルベースとモデルフリーの両方のアルゴリズムに関してパフォーマンス上の利点が示されています。コードはhttps://github.com/boschresearch/dist-mbrlで入手できます。
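The quantile-regression idea behind the name EQR can be illustrated with the standard pinball (quantile) loss, whose minimizer over a constant is the corresponding quantile. This is a generic sketch, not the EQR algorithm itself. The function names and the toy sample of values are hypothetical.

```python
def pinball_loss(residual, tau):
    """Quantile-regression loss for one residual y - q at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def empirical_quantile(samples, tau):
    """Pick the sample value that minimizes the total pinball loss."""
    return min(samples, key=lambda q: sum(pinball_loss(y - q, tau) for y in samples))


# Toy "value samples" with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
median = empirical_quantile(values, 0.5)
upper = empirical_quantile(values, 0.9)
print(median, upper)
```

Learning several such quantiles at once describes a whole distribution over values instead of only its mean or variance, which is the restriction the abstract attributes to earlier work.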

Optimistic Search: Change Point Estimation for Large-scale Data via Adaptive Logarithmic Queries
楽観的探索：適応対数クエリによる大規模データの変化点推定

Change point estimation is often formulated as a search for the maximum of a gain function describing improved fits when segmenting the data. Searching one change point through all candidates requires $O(n)$ evaluations of the gain function for an interval with $n$ observations. If each evaluation is computationally demanding (e.g. in high-dimensional models), this can become infeasible. Instead, we propose optimistic search, a methodology that only requires $O(\log n)$ evaluations of the gain function, leading to huge computational gains for massive (large-scale, high-dimensional) data for single and multiple change point estimation. Towards solid understanding of our strategy, we investigate in detail the $p$-dimensional Gaussian changing means setup, including high-dimensional scenarios. For some of our proposals, we prove asymptotic minimax optimality for detecting change points and derive sharp asymptotic rates for localizing change points. Our search strategy generalizes far beyond the theoretically analyzed setup. We illustrate, as an example, massive computational speedup in change point detection for high-dimensional Gaussian graphical models.

変化点推定は、データをセグメント化するときに適合度が向上するゲイン関数の最大値の検索として定式化されることが多い。すべての候補から1つの変化点を検索するには、観測値がn個の区間に対してゲイン関数を$O(n)$回評価する必要があります。各評価に計算負荷がかかる場合(高次元モデルなど)、これは実行不可能になる可能性があります。代わりに、我々は楽観的検索を提案します。これはゲイン関数を$O(\log n)$回評価するだけで済む方法論であり、単一および複数の変化点推定のための大量の(大規模、高次元)データに対して大きな計算上の利益をもたらす。我々の戦略をしっかりと理解するために、高次元シナリオを含む$p$次元のガウス平均変化設定を詳細に調査します。我々の提案のいくつかでは、変化点を検出するための漸近的ミニマックス最適性を証明し、変化点を特定するための急激な漸近率を導出します。我々の検索戦略は、理論的に分析された設定をはるかに超えて一般化されます。例として、高次元ガウスグラフィカルモデルの変化点検出における計算の大幅な高速化を示します。
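The contrast between $O(n)$ and $O(\log n)$ gain evaluations can be sketched as follows. This is a simplified interval-shrinking search in the spirit of the description above, not the authors' exact algorithm. The CUSUM gain for a change in mean, the step fraction of 1/2, and all function names are assumptions chosen for illustration.

```python
from itertools import accumulate
from math import sqrt


def make_cusum_gain(x):
    """Return gain(t): CUSUM statistic for splitting x into x[:t] and x[t:]."""
    n = len(x)
    prefix = [0.0] + list(accumulate(x))

    def gain(t):
        left_mean = prefix[t] / t
        right_mean = (prefix[n] - prefix[t]) / (n - t)
        return sqrt(t * (n - t) / n) * abs(left_mean - right_mean)

    return gain


def optimistic_search(gain, n):
    """Search split points 1..n-1 with few gain evaluations; returns (split, n_evals)."""
    cache = {}

    def g(t):
        if t not in cache:
            cache[t] = gain(t)
        return cache[t]

    left, right = 0, n          # candidates are left < t < right
    probe = n // 2
    while right - left > 5:
        if probe - left > right - probe:
            # Probe the longer (left) side and discard the less promising part.
            w = probe - max(1, (probe - left) // 2)
            if g(w) >= g(probe):
                right, probe = probe, w
            else:
                left = w
        else:
            w = probe + max(1, (right - probe) // 2)
            if g(w) >= g(probe):
                left, probe = probe, w
            else:
                right = w
    best = max(range(left + 1, right), key=g)  # exhaustive search on the small remainder
    return best, len(cache)


# Noise-free mean shift after 700 of 1000 observations.
x = [0.0] * 700 + [1.0] * 300
estimate, n_evals = optimistic_search(make_cusum_gain(x), len(x))
print(estimate, n_evals)
```

A full grid search would evaluate the gain at all 999 candidate splits. The sketch discards a constant fraction of the interval per evaluation, which is only guaranteed to find the maximum when the gain is unimodal, as it is in this noise-free example.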

PyPop7: A Pure-Python Library for Population-Based Black-Box Optimization
PyPop7：集団ベースのブラックボックス最適化のためのPure-Pythonライブラリ

In this paper, we present an open-source pure-Python library called PyPop7 for black-box optimization (BBO). As population-based methods (e.g., evolutionary algorithms, swarm intelligence, and pattern search) become increasingly popular for BBO, the design goal of PyPop7 is to provide a unified API and elegant implementations for them, particularly in challenging high-dimensional scenarios. Because random sampling is a core operation in most of these methods, they easily suffer from the notorious curse of dimensionality. Various improvements and enhancements have recently been proposed to alleviate this issue, mainly by exploiting possible problem structures: decomposition of the search distribution or space, low-memory approximation, low-rank metric learning, variance reduction, ensembles of random subspaces, model self-adaptation, and fitness smoothing. These novel sampling strategies can better exploit different problem structures in a high-dimensional search space and therefore often yield faster rates of convergence and/or better solution quality for large-scale BBO. PyPop7 now covers many of these important advances across a set of well-established BBO algorithm families and also provides an open-access interface for adding the latest or missing black-box optimizers for further functionality extensions. Its well-designed source code (under the GPL-3.0 license) and full-fledged online documentation (under the CC-BY 4.0 license) are freely available at https://github.com/Evolutionary-Intelligence/pypop and https://pypop.readthedocs.io, respectively.

本稿では、ブラックボックス最適化(BBO)用のオープンソースの純粋なPythonライブラリPyPop7を紹介します。集団ベースの方法(進化的アルゴリズム、群知能、パターン検索など)がBBOでますます普及するにつれて、PyPop7の設計目標は、特に困難な高次元シナリオで、それらのための統一されたAPIとエレガントな実装を提供することです。これらの集団ベースの方法は、ほとんどのコア操作の1つとしてランダムサンプリングを行うため、悪名高い次元の呪いに簡単に悩まされるため、最近、主に考えられる問題構造を利用することでこの問題を軽減するためのさまざまな改善と機能強化が提案されています。たとえば、検索分布または空間の分解、低メモリ近似、低ランクメトリック学習、分散削減、ランダムサブスペースのアンサンブル、モデルの自己適応、適応度スムージングなどです。これらの新しいサンプリング戦略は、高次元検索空間におけるさまざまな問題構造をより有効に活用できるため、多くの場合、大規模BBOの収束速度が速くなったり、ソリューションの品質が向上したりします。現在、PyPop7は、確立されたBBOアルゴリズムファミリのセットでこれらの重要な進歩の多くをカバーしており、さらに機能を拡張するために最新または見逃されたブラックボックスオプティマイザーを追加するためのオープンアクセスインターフェイスも提供しています。その適切に設計されたソースコード(GPL-3.0ライセンス)と本格的なオンラインドキュメント(CC-BY 4.0ライセンス)は、それぞれhttps://github.com/Evolutionary-Intelligence/pypopとhttps://pypop.readthedocs.ioで無料で入手できます。
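For readers unfamiliar with population-based BBO, the sketch below shows why random sampling is the core operation: a plain (1+λ) evolution strategy on a toy sphere function. It deliberately does not use the PyPop7 API. The function names, step size, and population size are hypothetical choices for illustration.

```python
import random


def sphere(x):
    """Toy black-box objective: only function values are available to the optimizer."""
    return sum(v * v for v in x)


def one_plus_lambda_es(f, x0, sigma=0.3, lam=10, generations=200, seed=0):
    """Elitist (1+lambda) evolution strategy with a fixed Gaussian step size."""
    rng = random.Random(seed)
    parent, parent_fit = list(x0), f(x0)
    for _ in range(generations):
        # Sample a population of offspring around the current parent.
        offspring = [[v + sigma * rng.gauss(0, 1) for v in parent] for _ in range(lam)]
        best = min(offspring, key=f)
        best_fit = f(best)
        if best_fit <= parent_fit:  # keep the parent unless an offspring is at least as good
            parent, parent_fit = best, best_fit
    return parent, parent_fit


start = [3.0] * 5
best_x, best_fit = one_plus_lambda_es(sphere, start)
print(best_fit)
```

With isotropic sampling like this, the chance that a random step improves the objective shrinks as the dimension grows, which is the curse of dimensionality that the structure-exploiting variants listed in the abstract aim to mitigate.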

Evidence Estimation in Gaussian Graphical Models Using a Telescoping Block Decomposition of the Precision Matrix
精度行列の伸縮ブロック分解を用いたガウスグラフィカルモデルにおける証拠推定

Marginal likelihood, also known as model evidence, is a fundamental quantity in Bayesian statistics. It is used for model selection using Bayes factors or for empirical Bayes tuning of prior hyper-parameters. Yet, the calculation of evidence has remained a longstanding open problem in Gaussian graphical models. Currently, the only feasible solutions that exist are for special cases such as the Wishart or G-Wishart, in moderate dimensions. We develop an approach based on a novel telescoping block decomposition of the precision matrix that allows the estimation of evidence by application of Chib’s technique under a very broad class of priors under mild requirements. Specifically, the requirements are: (a) the priors on the diagonal terms on the precision matrix can be written as gamma or scale mixtures of gamma random variables and (b) those on the off-diagonal terms can be represented as normal or scale mixtures of normal. This includes structured priors such as the Wishart or G-Wishart, and more recently introduced element-wise priors, such as the Bayesian graphical lasso and the graphical horseshoe. Among these, the true marginal is known in an analytically closed form for Wishart, providing a useful validation of our approach. For the general setting of the other three, and several more priors satisfying conditions (a) and (b) above, the calculation of evidence has remained an open question that this article resolves under a unifying framework.

周辺尤度はモデル証拠とも呼ばれ、ベイズ統計における基本的な量です。ベイズ係数を使用したモデル選択や、事前ハイパーパラメータの経験的ベイズ調整に使用されます。しかし、証拠の計算はガウスグラフィカルモデルにおいて長年未解決の問題のままです。現在、存在する唯一の実行可能なソリューションは、中程度の次元でのウィシャートやG-ウィシャートなどの特殊なケースです。私たちは、軽度の要件の下で非常に広範な事前分布の下でチブの手法を適用することにより証拠を推定できる、精度行列の新しいテレスコーピングブロック分解に基づくアプローチを開発しました。具体的には、要件は次のとおりです。(a)精度行列の対角項の事前分布は、ガンマ確率変数またはそのスケール混合として記述でき、(b)非対角項の事前分布は、正規または正規のスケール混合として表すことができます。これには、ウィシャートやG-ウィシャートなどの構造化事前分布や、ベイジアングラフィカルラッソやグラフィカルホースシューなどの最近導入された要素ごとの事前分布が含まれます。これらのうち、ウィシャートの真の周辺尤度は解析的に閉じた形式で知られており、私たちのアプローチの有効な検証を提供します。他の3つの一般的な設定、および上記の条件(a)と(b)を満たすいくつかの事前分布については、証拠の計算が未解決の問題のままでしたが、この記事では統一的なフレームワークの下で解決します。
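For context, Chib's technique rests on the basic marginal likelihood identity, which follows from rearranging Bayes' rule and holds at any parameter value $\theta^*$ (in practice a high-posterior-density point). The notation below ($y$ for data, $\theta$ for parameters) is assumed here and is not taken from the abstract.

```latex
\log p(y) \;=\; \log p(y \mid \theta^{*}) \;+\; \log p(\theta^{*}) \;-\; \log p(\theta^{*} \mid y)
```

The likelihood and prior terms are usually available in closed form, so the difficulty lies in estimating the posterior ordinate $p(\theta^{*} \mid y)$. Factoring that ordinate into a product of conditional densities is presumably where a block decomposition of the precision matrix becomes useful.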

An Asymptotic Study of Discriminant and Vote-Averaging Schemes for Randomly-Projected Linear Discriminants
ランダムに投影された線形判別変数に対する判別方式と投票平均化方式の漸近研究

Modern technology has contributed to the rise of high-dimensional data in various domains such as bio-informatics, chemometrics, and face recognition. In the recent literature, random projections and, in particular, randomly-projected ensembles based on the classical Linear Discriminant Analysis (LDA), have been proposed for classification problems involving such high-dimensional data. In this work, we study the two main classes of randomly-projected LDA ensemble classifiers, namely discriminant averaging and vote averaging. Through asymptotic analysis in a growth regime where the problem dimensions are assumed to grow at constant rates to each other for a fixed ensemble size, we determine the exact mechanism through which the ensemble size affects the classification performance. Furthermore, we investigate whether projection selection truly matters in an ensemble setting, and, ultimately, derive the optimal form of the randomly-projected LDA ensemble. Motivated by these findings, we propose a framework for efficient tuning of the optimal classifier’s ensemble size and projection dimension based on an estimator of the classifier probability of misclassification which is consistent under the assumed growth regime. The proposed framework is shown to outperform the existing rule-of-thumb, as well as other methods for parameter tuning, on both real and synthetic data.

現代の技術は、バイオインフォマティクス、ケモメトリクス、顔認識など、さまざまな領域で高次元データの増加に貢献しています。最近の文献では、ランダム射影、特に古典的な線形判別分析(LDA)に基づくランダム射影アンサンブルが、このような高次元データを含む分類問題に対して提案されています。この研究では、ランダム射影LDAアンサンブル分類器の2つの主要なクラス、つまり判別平均化と投票平均化について検討します。問題の次元が固定されたアンサンブルサイズに対して互いに一定の割合で増加すると仮定した成長レジームでの漸近解析を通じて、アンサンブルサイズが分類パフォーマンスに影響を与える正確なメカニズムを決定します。さらに、アンサンブル設定で射影選択が本当に重要であるかどうかを調査し、最終的にランダム射影LDAアンサンブルの最適な形式を導き出します。これらの発見に基づいて、想定される成長体制下で一貫性のある分類器の誤分類確率の推定値に基づいて、最適な分類器のアンサンブルサイズと投影次元を効率的に調整するためのフレームワークを提案します。提案されたフレームワークは、実際のデータと合成データの両方で、既存の経験則やその他のパラメータ調整方法よりも優れていることが示されています。

Learning and scoring Gaussian latent variable causal models with unknown additive interventions
未知の相加的介入によるガウス潜在変数因果モデルの学習と採点

With observational data alone, causal structure learning is a challenging problem. The task becomes easier when having access to data collected from perturbations of the underlying system, even when the nature of these is unknown. Existing methods either do not allow for the presence of latent variables or assume that these remain unperturbed. However, these assumptions are hard to justify if the nature of the perturbations is unknown. We provide results that enable scoring causal structures in the setting with additive, but unknown interventions. Specifically, we propose a maximum-likelihood estimator in a structural equation model that exploits system-wide invariances to output an equivalence class of causal structures from perturbation data. Furthermore, under certain structural assumptions on the population model, we provide a simple graphical characterization of all the DAGs in the interventional equivalence class. We illustrate the utility of our framework on synthetic data as well as real data involving California reservoirs and protein expressions. The software implementation is available as the Python package utlvce.

観測データだけでは、因果構造の学習は困難な問題です。基礎システムの摂動から収集されたデータにアクセスできると、その性質が不明であっても、タスクは容易になります。既存の方法では、潜在変数の存在が考慮されていないか、潜在変数が摂動されないままであると想定しています。しかし、摂動の性質が不明な場合、これらの仮定を正当化することは困難です。私たちは、追加的ではあるが未知の介入を伴う設定で因果構造をスコアリングできる結果を提供します。具体的には、システム全体の不変性を利用して摂動データから因果構造の同値クラスを出力する構造方程式モデルの最大尤度推定量を提案します。さらに、集団モデルに関する特定の構造仮定の下で、介入同値クラスのすべてのDAGの簡単なグラフィカルな特性評価を提供します。私たちは、合成データだけでなく、カリフォルニアの貯留層とタンパク質発現を含む実際のデータでも、私たちのフレームワークの有用性を示します。ソフトウェア実装は、Pythonパッケージutlvceとして利用できます。

Non-splitting Neyman-Pearson Classifiers
非分割ネイマン・ピアソン分類器

The Neyman-Pearson (NP) binary classification paradigm constrains the more severe type of error (e.g., the type I error) under a preferred level while minimizing the other (e.g., the type II error). This paradigm is suitable for applications such as severe disease diagnosis, fraud detection, among others. A series of NP classifiers have been developed to guarantee the type I error control with high probability. However, these existing classifiers involve a sample splitting step: a mixture of class 0 and class 1 observations to construct a scoring function and some left-out class 0 observations to construct a threshold. This splitting enables classifier threshold construction built upon independence, but it amounts to insufficient use of data for training and a potentially higher type II error. Leveraging a canonical linear discriminant analysis (LDA) model, we derive a quantitative CLT for a certain functional of quadratic forms of the inverse of sample and population covariance matrices, and based on this result, develop for the first time NP classifiers without splitting the training sample. Numerical experiments have confirmed the advantages of our new non-splitting parametric strategy.

ネイマン-ピアソン(NP)バイナリ分類パラダイムは、より重大なタイプのエラー(タイプIエラーなど)を優先レベル以下に制限し、その他のエラー(タイプIIエラーなど)を最小限に抑えます。このパラダイムは、重篤な病気の診断、詐欺の検出などのアプリケーションに適しています。タイプIエラーの制御を高い確率で保証するために、一連のNP分類器が開発されました。ただし、これらの既存の分類器には、サンプル分割ステップが含まれます。つまり、クラス0とクラス1の観測値を混合してスコアリング関数を作成し、クラス0の観測値を一部除外してしきい値を作成します。この分割により、独立性に基づいた分類器しきい値の構築が可能になりますが、トレーニング用のデータが十分に使用されず、タイプIIエラーが高くなる可能性があります。標準線形判別分析(LDA)モデルを活用して、サンプルと母集団の共分散行列の逆行列の二次形式の特定の関数に対する定量的CLTを導出し、この結果に基づいて、トレーニングサンプルを分割せずに初めてNP分類器を開発しました。数値実験により、新しい非分割パラメトリック戦略の利点が確認されました。
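The sample-splitting step described above can be sketched as follows: given scores of left-out class 0 observations, pick an order statistic as the threshold so that the type I error exceeds the target level only with small probability. This illustrates the existing split-based approach, not the non-splitting classifiers proposed in the paper. The function names and the default levels are hypothetical.

```python
from math import comb


def violation_bound(n, k, alpha):
    """Bound on P(type I error > alpha) when thresholding at the k-th smallest of n class-0 scores."""
    return sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j) for j in range(k, n + 1))


def np_threshold_rank(n, alpha=0.05, delta=0.05):
    """Smallest rank k whose violation bound is at most delta, or None if n is too small."""
    for k in range(1, n + 1):
        if violation_bound(n, k, alpha) <= delta:
            return k
    return None


def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Threshold on the score; predict class 1 when a new score exceeds it."""
    k = np_threshold_rank(len(class0_scores), alpha, delta)
    if k is None:
        return None
    return sorted(class0_scores)[k - 1]


print(np_threshold_rank(58), np_threshold_rank(59), np_threshold_rank(200))
```

With alpha = delta = 0.05, at least 59 left-out class 0 observations are needed before any threshold qualifies, which shows how much data the splitting step can consume and why avoiding it may lower the type II error.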

Studying the Interplay between Information Loss and Operation Loss in Representations for Classification
分類のための表現における情報損失と操作損失の相互作用の研究

Information-theoretic measures have been widely adopted for machine learning (ML) feature design. Inspired by this, we look at the relationship between information loss in the Shannon sense and the operation loss in the minimum probability of error (MPE) sense when considering a family of lossy representations. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete encoder. When considering a general family of lossy continuous representations, we show that a form of vanishing information loss (a weak informational sufficiency (WIS)) implies a vanishing MPE loss. Our findings support the observation that selecting/designing representations that capture informational sufficiency is appropriate for learning. However, this selection is rather conservative if the intended goal is achieving MPE in classification. Supporting this, we show that it is possible to adopt an alternative notion of informational sufficiency (strictly weaker than pure sufficiency in the mutual information sense) to achieve operational sufficiency in learning. Furthermore, our new WIS condition is used to demonstrate the expressive power of digital encoders and the capacity of two existing compression-based algorithms to achieve lossless prediction in ML.

情報理論的尺度は、機械学習(ML)の特徴設計に広く採用されています。これに着想を得て、我々は、非可逆表現のファミリーを考慮した場合の、シャノンの意味での情報の損失と最小エラー確率(MPE)の意味での演算損失の関係に注目します。最初の結果は、離散エンコーダーを採用した場合のそれぞれの演算損失の関数として、弱い形式の情報損失の下限を提供します。一般的な非可逆連続表現のファミリーを考慮すると、消失する情報損失の形式(弱い情報の十分性(WIS))は消失するMPE損失を意味することがわかります。我々の調査結果は、情報の十分性を捉える表現を選択/設計することが学習に適しているという観察を裏付けています。ただし、分類でMPEを達成することが目的である場合、この選択はかなり保守的です。これを裏付けるために、学習で演算の十分性を達成するために、情報の十分性の代替概念(相互情報量の意味での純粋な十分性よりも厳密に弱い)を採用できることを示します。さらに、私たちの新しいWIS条件は、デジタルエンコーダーの表現力と、MLでロスレス予測を実現する既存の2つの圧縮ベースのアルゴリズムの能力を実証するために使用されます。

skscope: Fast Sparsity-Constrained Optimization in Python
skscope: Python での高速スパース性制約最適化

Applying iterative solvers on sparsity-constrained optimization (SCO) requires tedious mathematical deduction and careful programming/debugging that hinders these solvers’ broad impact. In the paper, the library skscope is introduced to overcome such an obstacle. With skscope, users can solve the SCO by just programming the objective function. The convenience of skscope is demonstrated through two examples in the paper, where sparse linear regression and trend filtering are addressed with just four lines of code. More importantly, skscope’s efficient implementation allows state-of-the-art solvers to quickly attain the sparse solution regardless of the high dimensionality of parameter space. Numerical experiments reveal the available solvers in skscope can achieve up to 80x speedup on the competing relaxation solutions obtained via the benchmarked convex solver. skscope is published on the Python Package Index (PyPI) and Conda, and its source code is available at: https://github.com/abess-team/skscope.

スパース制約最適化(SCO)に反復ソルバーを適用するには、面倒な数学的推論と慎重なプログラミング/デバッグが必要であり、これらのソルバーの幅広い影響を妨げています。この論文では、このような障害を克服するためにライブラリskscopeが紹介されています。skscopeを使用すると、ユーザーは目的関数をプログラミングするだけでSCOを解くことができます。skscopeの利便性は、論文の2つの例で実証されており、スパース線形回帰とトレンドフィルタリングはわずか4行のコードで処理されています。さらに重要なのは、skscopeの効率的な実装により、最先端のソルバーは、パラメーター空間の高次元性に関係なく、スパースソリューションを迅速に取得できることです。数値実験により、skscopeで利用可能なソルバーは、ベンチマークされた凸ソルバーを介して取得された競合する緩和ソリューションの最大80倍の高速化を実現できることが明らかになりました。skscopeはPython Package Index (PyPI)とCondaで公開されており、ソースコードはhttps://github.com/abess-team/skscopeで入手できます。

aeon: a Python Toolkit for Learning from Time Series
aeon: 時系列から学習するための Python ツールキット

aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at https://github.com/aeon-toolkit/aeon.

aeonは、時系列に関連するすべての機械学習タスク用の統合Python 3ライブラリです。このパッケージには、時系列予測、分類、外在的回帰、クラスタリング用のモジュールのほか、時系列データ用に設計されたさまざまなユーティリティ、変換、距離測定が含まれています。また、aeonには、異常検出、類似性検索、セグメンテーションなどのタスク用の実験的なモジュールも多数あります。aeonは、新しいユーザーを支援し、モデル選択やパイプラインなどの便利なツールとaeon推定器を簡単に統合できるように、可能な限りscikit-learn APIに準拠しています。最新の研究成果の効率的な実装を含む、時系列アルゴリズムの幅広いライブラリを提供します。オプションの依存関係システムを使用して、aeonはさまざまなパッケージを単一のインターフェイスに統合しながら、コアフレームワークの依存関係を最小限に抑えます。このパッケージは3条項BSDライセンスの下で配布され、https://github.com/aeon-toolkit/aeonで入手できます。

Compressed and distributed least-squares regression: convergence rates with applications to federated learning
圧縮および分布最小二乗回帰:連合学習への応用による収束率

In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate that, despite the non-regularity of the stochastic field, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.

この論文では、分散学習や連合学習で広く使用されている手法である機械学習の確率的勾配アルゴリズムに対する圧縮の影響を調査します。分散に関して同じ条件を満たす複数の不偏圧縮演算子間の収束率の違いを強調し、従来の最悪ケース分析を超えています。そのために、最小二乗回帰(LSR)の場合に焦点を当て、ランダムフィールドに依存する二次関数を最小化する一般的な確率近似アルゴリズムを分析します。分析に合わせて調整されたランダムフィールドに関する弱い仮定(具体的には、期待されるHölder正則性)とノイズ共分散を考慮し、圧縮を含むさまざまなランダム化メカニズムの分析を可能にします。次に、結果を連合学習の場合に拡張します。より正式には、アルゴリズムによって誘発される加法ノイズの共分散$\mathfrak{C}_{\mathrm{ania}}$の収束への影響を強調します。確率場が非正則であるにもかかわらず、極限分散項は$\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (ここで$H$は最適化問題のヘッセ行列、$K$は反復回数)に比例し、バニラLSRの場合のレートが$\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013)となることを実証します。次に、$\mathfrak{C}_{\mathrm{ania}}$の圧縮戦略への依存性と、最終的にはそれが収束に与える影響を、最初は集中型のケースで、次に2つの異種FLフレームワークで分析します。
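As a concrete instance of an unbiased compression operator with a bounded relative variance, the sketch below enumerates every outcome of random-k sparsification (keep k of d coordinates uniformly at random and rescale by d/k) and checks both properties exactly. The choice of this operator, the variance condition E||C(x) - x||^2 <= omega ||x||^2, and the function names are assumptions for illustration, not details taken from the abstract.

```python
from itertools import combinations


def rand_k_outcomes(x, k):
    """Yield every equally likely output of rand-k sparsification applied to x."""
    d = len(x)
    scale = d / k  # rescaling that makes the operator unbiased
    for kept in combinations(range(d), k):
        kept = set(kept)
        yield [scale * v if i in kept else 0.0 for i, v in enumerate(x)]


def mean_and_relative_variance(x, k):
    """Exact E[C(x)] and E||C(x) - x||^2 / ||x||^2 over all rand-k outcomes."""
    outcomes = list(rand_k_outcomes(x, k))
    m = len(outcomes)
    mean = [sum(o[i] for o in outcomes) / m for i in range(len(x))]
    sq_norm = sum(v * v for v in x)
    variance = sum(sum((o[i] - x[i]) ** 2 for i in range(len(x))) for o in outcomes) / m
    return mean, variance / sq_norm


x = [1.0, -2.0, 0.5, 4.0]
mean, omega = mean_and_relative_variance(x, k=2)
print(mean, omega)
```

Here the relative variance equals d/k - 1. Different operators can share the same value of this constant while inducing different noise covariances, which is the kind of distinction a worst-case analysis based only on the variance bound cannot see.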

Contamination-source based K-sample clustering
汚染源ベースのKサンプルクラスタリング

In this work, we investigate the $K$-sample clustering of populations subject to contamination phenomena. A contamination model is a two-component mixture model where one component is known (standard behaviour) and the second component, modeling a departure from the standard behaviour, is unknown. When $K$ populations from such a model are observed, we propose a semiparametric clustering methodology to detect which populations are impacted by the same type of contamination, with the aim of facilitating coordinated diagnosis and the sharing of best practices. We prove the consistency of our approach under the assumption that true clusters exist and demonstrate the performance of our methodology through an extensive Monte Carlo study. Finally, we apply our methodology, implemented in the R admix package, to a COVID-19 excess mortality dataset for European countries, aiming to cluster countries similarly impacted by the pandemic across different age groups.

この研究では、汚染現象の影響を受ける集団の$K$サンプルクラスタリングを調査します。コンタミネーションモデルは、1つの成分が既知(標準挙動)で、2つ目の成分(標準挙動からの逸脱をモデル化)が不明である2成分混合モデルです。このようなモデルから$K$の母集団が観察された場合、調整された診断とベストプラクティスの共有を容易にすることを目的として、どの母集団が同じタイプの汚染の影響を受けているかを検出するためのセミパラメトリッククラスタリング方法論を提案します。私たちは、真のクラスターの存在を前提として、私たちのアプローチの一貫性を証明し、広範なモンテカルロ研究を通じて私たちの方法論の性能を実証します。最後に、R admixパッケージに実装された方法論を、ヨーロッパ諸国のCOVID-19超過死亡率データセットに適用し、パンデミックの影響を同様に受けた国を異なる年齢層でクラスター化することを目指しています。

Measuring Sample Quality in Algorithms for Intractable Normalizing Function Problems
難解な正規化関数問題のためのアルゴリズムにおけるサンプル品質の測定

Models with intractable normalizing functions have numerous applications. Because the normalizing constants are functions of the parameters of interest, standard Markov chain Monte Carlo cannot be used for Bayesian inference for these models. A number of algorithms have been developed for such models. Some have the posterior distribution as their asymptotic distribution. Other “asymptotically inexact” algorithms do not possess this property. There is limited guidance for evaluating approximations based on these algorithms. Hence it is very hard to tune them. We propose two new diagnostics that address these problems for intractable normalizing function models. Our first diagnostic, inspired by the second Bartlett identity, is in principle broadly applicable to Monte Carlo approximations beyond the normalizing function problem. We develop an approximate version of this diagnostic that is applicable to intractable normalizing function problems. Our second diagnostic is a Monte Carlo approximation to a kernel Stein discrepancy-based diagnostic introduced by Gorham and Mackey (2017). We provide theoretical justification for our methods and apply them to several algorithms in challenging simulated and real data examples including an Ising model, an exponential random graph model, and a Conway-Maxwell-Poisson regression model, obtaining interesting insights about the algorithms in these contexts.

扱いにくい正規化関数を持つモデルには、さまざまな用途があります。正規化定数は関心のあるパラメータの関数であるため、標準的なマルコフ連鎖モンテカルロ法は、これらのモデルのベイズ推論には使用できません。このようなモデル用に、多数のアルゴリズムが開発されています。一部のアルゴリズムでは、事後分布が漸近分布となります。その他の「漸近的に不正確な」アルゴリズムには、この特性がありません。これらのアルゴリズムに基づく近似を評価するためのガイダンスは限られています。したがって、これらを調整するのは非常に困難です。私たちは、扱いにくい正規化関数モデルの問題に対処する2つの新しい診断法を提案します。最初の診断法は、第2バートレット恒等式にヒントを得たもので、原理的には、正規化関数の問題を超えてモンテカルロ近似法に広く適用できます。私たちは、扱いにくい正規化関数の問題に適用できるこの診断法の近似バージョンを開発します。2番目の診断法は、GorhamとMackey (2017)によって導入されたカーネルStein不一致ベースの診断法に対するモンテカルロ近似法です。私たちは、私たちの方法の理論的根拠を示し、それをイジングモデル、指数ランダムグラフモデル、コンウェイ・マクスウェル・ポアソン回帰モデルなどの難しいシミュレーションおよび実際のデータ例のいくつかのアルゴリズムに適用し、これらのコンテキストにおけるアルゴリズムに関する興味深い洞察を得ました。

OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research
OmniSafe:安全な強化学習研究を加速する基盤

AI systems empowered by reinforcement learning (RL) algorithms harbor immense potential to catalyze societal advancement, yet their deployment is often impeded by significant safety concerns. Particularly in safety-critical applications, researchers have raised concerns about unintended harms or unsafe behaviors of unaligned RL agents. The philosophy of safe reinforcement learning (SafeRL) is to align RL agents with harmless intentions and safe behavioral patterns. In SafeRL, agents learn to develop optimal policies by receiving feedback from the environment, while also fulfilling the requirement of minimizing the risk of unintended harm or unsafe behavior. However, due to the intricate nature of SafeRL algorithm implementation, combining methodologies across various domains presents a formidable challenge. This has led to the absence of a cohesive and efficacious learning framework within the contemporary SafeRL research milieu. In this work, we introduce a foundational framework designed to expedite SafeRL research endeavors. Our comprehensive framework encompasses an array of algorithms spanning different RL domains and places heavy emphasis on safety elements. Our aim is to make the SafeRL-related research process more streamlined and efficient, thereby facilitating further research in AI safety.

強化学習(RL)アルゴリズムによって強化されたAIシステムは、社会の進歩を促進する大きな可能性を秘めていますが、安全性に関する重大な懸念によってその導入が妨げられることがよくあります。特に安全性が重要なアプリケーションでは、研究者は、調整されていないRLエージェントの意図しない危害や危険な動作について懸念を表明しています。安全な強化学習(SafeRL)の哲学は、RLエージェントを無害な意図と安全な動作パターンに合わせることです。SafeRLでは、エージェントは環境からのフィードバックを受け取ることで最適なポリシーを開発することを学習し、同時に意図しない危害や危険な動作のリスクを最小限に抑えるという要件も満たします。ただし、SafeRLアルゴリズムの実装は複雑なため、さまざまなドメインにわたる方法論を組み合わせることは困難な課題となります。このため、現在のSafeRL研究環境には、まとまりのある効果的な学習フレームワークが存在しません。この研究では、SafeRL研究の取り組みを促進するために設計された基礎フレームワークを紹介します。当社の包括的なフレームワークは、さまざまなRLドメインにまたがる一連のアルゴリズムを網羅しており、安全性の要素に重点を置いています。当社の取り組みは、SafeRL関連の研究プロセスをより合理化および効率化し、AIの安全性に関するさらなる研究を促進することです。

Random Smoothing Regularization in Kernel Gradient Descent Learning
カーネル勾配降下学習におけるランダム平滑化正則化

Random smoothing data augmentation is a unique form of regularization that can prevent overfitting by introducing noise to the input data, encouraging the model to learn more generalized features. Despite its success in various applications, there has been a lack of systematic study on the regularization ability of random smoothing. In this paper, we aim to bridge this gap by presenting a framework for random smoothing regularization that can adaptively and effectively learn a wide range of ground truth functions belonging to the classical Sobolev spaces. Specifically, we investigate two underlying function spaces: the Sobolev space of low intrinsic dimension, which includes the Sobolev space in D-dimensional Euclidean space or low-dimensional sub-manifolds as special cases, and the mixed smooth Sobolev space with a tensor structure. By using random smoothing regularization as novel convolution-based smoothing kernels, we can attain optimal convergence rates in these cases using a kernel gradient descent algorithm, either with early stopping or weight decay. It is noteworthy that our estimator can adapt to the structural assumptions of the underlying data and avoid the curse of dimensionality. This is achieved through various choices of injected noise distributions such as Gaussian, Laplace, or general polynomial noises, allowing for broad adaptation to the aforementioned structural assumptions of the underlying data. The convergence rate depends only on the effective dimension, which may be significantly smaller than the actual data dimension. We conduct numerical experiments on simulated data to validate our theoretical results.

ランダムスムージングデータ拡張は、入力データにノイズを導入することで過剰適合を防ぎ、モデルがより一般化された特徴を学習するように促すことができる、独自の形式の正則化です。さまざまなアプリケーションで成功しているにもかかわらず、ランダムスムージングの正則化能力に関する体系的な研究は不足しています。この論文では、古典的なソボレフ空間に属するさまざまなグラウンドトゥルース関数を適応的かつ効果的に学習できるランダムスムージング正則化のフレームワークを提示することで、このギャップを埋めることを目指します。具体的には、D次元ユークリッド空間または低次元部分多様体のソボレフ空間を特殊なケースとして含む、固有次元の低いソボレフ空間と、テンソル構造を持つ混合平滑ソボレフ空間という2つの基礎関数空間を調査します。ランダムスムージング正則化を新しい畳み込みベースのスムージングカーネルとして使用することで、早期停止または重み減衰のいずれかを使用したカーネル勾配降下アルゴリズムを使用して、これらのケースで最適な収束率を達成できます。注目すべきは、私たちの推定器が基礎データの構造的仮定に適応し、次元の呪いを回避できることです。これは、ガウス、ラプラス、または一般的な多項式ノイズなどの注入ノイズ分布のさまざまな選択によって実現され、基礎データの前述の構造的仮定に幅広く適応できます。収束率は有効次元のみに依存しますが、これは実際のデータ次元よりも大幅に小さい場合があります。私たちは、シミュレーションデータで数値実験を行い、理論的な結果を検証します。

MLRegTest: A Benchmark for the Machine Learning of Regular Languages
MLRegTest: 正規言語の機械学習のベンチマーク

Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known obstacle to successful generalization by ML systems. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or restricted propositional) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.

形式言語から構築された合成データセットを使用すると、シーケンス分類用の機械学習システムの学習および一般化機能を詳細に調べることができます。この記事では、MLRegTestと呼ばれるシーケンス分類に関する機械学習システムの新しいベンチマークを紹介します。これには、1,800の正規言語のトレーニングセット、開発セット、テストセットが含まれています。さまざまな種類の形式言語はさまざまな種類の長距離依存関係を表し、シーケンス内の長距離依存関係を正しく識別することは、MLシステムがうまく一般化するための既知の課題です。MLRegTestは、論理的複雑さ(モナド2階、1階、命題、または制限付き命題)と論理リテラルの種類(文字列、階層文字列、サブシーケンス、またはそれらの組み合わせ)に従って言語を整理します。論理的複雑さとリテラルの選択により、正規言語のさまざまな種類の長距離依存関係を体系的に理解し、したがって、さまざまなMLシステムがそのような長距離依存関係を学習する能力を理解することができます。最後に、MLRegTestにおけるさまざまなニューラルネットワーク(単純なRNN、LSTM、GRU、トランスフォーマー)のパフォーマンスを検証します。主な結論は、パフォーマンスはテストセットの種類、言語のクラス、およびニューラルネットワークアーキテクチャに大きく依存するということです。
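
The three kinds of literals named in the abstract (string, tier-string, subsequence) can be illustrated with a small sketch. The helper names and the example words are hypothetical and are not taken from MLRegTest itself. The sketch only shows how the same two-symbol pattern gives different membership answers depending on whether it is read as a contiguous block, as a block after erasing off-tier symbols, or as a subsequence with arbitrary gaps.

```python
def has_substring(word, factor):
    """True if `factor` occurs as a contiguous block of `word`."""
    return factor in word


def has_subsequence(word, pattern):
    """True if the symbols of `pattern` occur in `word` in order, gaps allowed."""
    symbols = iter(word)
    # `symbol in symbols` advances the iterator until it finds a match,
    # so later pattern symbols are only searched for after earlier ones.
    return all(symbol in symbols for symbol in pattern)


def project_tier(word, tier):
    """Erase every symbol that is not on the tier."""
    return "".join(symbol for symbol in word if symbol in tier)


def has_tier_substring(word, factor, tier):
    """True if `factor` is contiguous once non-tier symbols are erased."""
    return has_substring(project_tier(word, tier), factor)
```

For the word `"acccb"` and the pattern `"ab"`, the string literal fails (the symbols are not adjacent), while the subsequence literal and the tier-string literal over the tier `{"a", "b"}` both hold. For `"adb"` with tier `{"a", "b", "d"}`, the subsequence literal holds but the tier-string literal does not, because `d` stays on the tier and separates the two symbols.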

A tensor factorization model of multilayer network interdependence
多層ネットワーク相互依存性のテンソル分解モデル

Multilayer networks describe the rich ways in which nodes are related by accounting for different relationships in separate layers. These multiple relationships are naturally represented by an adjacency tensor. In this work we study the use of the nonnegative Tucker decomposition (NNTuck) of such tensors under a KL loss as an expressive factor model that naturally generalizes existing stochastic block models of multilayer networks. Quantifying interdependencies between layers can identify redundancies in the structure of a network, indicate relationships between disparate layers, and potentially inform survey instruments for collecting social network data. We propose definitions of layer independence, dependence, and redundancy based on likelihood ratio tests between nested nonnegative Tucker decompositions. Using both synthetic and real-world data, we evaluate the use and interpretation of the NNTuck as a model of multilayer networks. Algorithmically, we show that using expectation maximization (EM) to maximize the log-likelihood under the NNTuck is step-by-step equivalent to tensorial multiplicative updates for the NNTuck under a KL loss, extending a previously known equivalence from nonnegative matrices to nonnegative tensors.

多層ネットワークは、別々の層における異なる関係を考慮することにより、ノードが関係する多様な方法を説明します。これらの複数の関係は、隣接テンソルによって自然に表現されます。この研究では、KL損失の下でのこのようなテンソルの非負Tucker分解(NNTuck)を、多層ネットワークの既存の確率的ブロックモデルを自然に一般化する表現因子モデルとして使用することを研究します。層間の相互依存性を定量化することで、ネットワーク構造の冗長性を識別し、異なる層間の関係性を示し、ソーシャルネットワークデータを収集するための調査手段に情報を提供できる可能性があります。入れ子になった非負Tucker分解間の尤度比検定に基づいて、層の独立性、依存性、冗長性の定義を提案します。合成データと実世界のデータの両方を使用して、多層ネットワークモデルとしてのNNTuckの使用と解釈を評価します。アルゴリズム的には、期待最大化(EM)を使用してNNTuckでの対数尤度を最大化することが、KL損失でのNNTuckのテンソル乗法更新と段階的に同等であり、非負行列から非負テンソルへの既知の同等性を拡張することを示します。

Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces
リー群とその等質空間上の定常カーネルとガウス過程 II: 非コンパクト対称空間

Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process’ covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.

ガウス過程は、機械学習における時空間モデルの最も重要なクラスであると言えるでしょう。モデル化された関数に関する事前情報をエンコードし、正確なベイズ学習または近似ベイズ学習に使用できます。多くのアプリケーション、特に物理科学や工学、さらには地統計学や神経科学などの分野では、対称性に対する不変性は、考えられる事前情報の最も基本的な形式の1つです。ガウス過程の共分散がこのような対称性に対して不変性を持つことで、定常性の概念をこのような空間に最も自然に一般化できます。この研究では、対称性のコンテキストで生じる非常に大規模な非ユークリッド空間クラスで定常ガウス過程を構築するための、建設的で実用的な手法を開発します。この手法により、(i)共分散カーネルを計算し、(ii)このような空間で定義された事前および事後ガウス過程から実用的な方法でサンプリングすることが可能になります。この研究は2つの部分に分かれており、それぞれ異なる技術的検討事項が含まれています。第1部ではコンパクト空間を研究し、第2部では特定の構造を持つ非コンパクト空間を研究します。私たちの貢献により、研究対象の非ユークリッドガウス過程モデルは、標準的なガウス過程ソフトウェアパッケージで利用できる、よく理解されている計算手法と互換性を持つようになり、実践者が利用できるようになります。

Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case
リー群とその等質空間上の定常カーネルとガウス過程 I: コンパクトケース

Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process’ covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.

ガウス過程は、機械学習における時空間モデルの最も重要なクラスであると言えるでしょう。モデル化された関数に関する事前情報をエンコードし、正確なベイズ学習または近似ベイズ学習に使用できます。多くのアプリケーション、特に物理科学や工学、さらには地統計学や神経科学などの分野では、対称性に対する不変性は、考えられる事前情報の最も基本的な形式の1つです。ガウス過程の共分散がこのような対称性に対して不変性を持つことで、定常性の概念をこのような空間に最も自然に一般化できます。この研究では、対称性のコンテキストで生じる非常に大規模な非ユークリッド空間クラスで定常ガウス過程を構築するための、建設的で実用的な手法を開発します。この手法により、(i)共分散カーネルを計算し、(ii)このような空間で定義された事前および事後ガウス過程から実用的な方法でサンプリングすることが可能になります。この研究は2つの部分に分かれており、それぞれ異なる技術的検討事項が含まれています。第1部ではコンパクト空間を研究し、第2部では特定の構造を持つ非コンパクト空間を研究します。私たちの貢献により、研究対象の非ユークリッドガウス過程モデルは、標準的なガウス過程ソフトウェアパッケージで利用できる、よく理解されている計算手法と互換性を持つようになり、実践者が利用できるようになります。

On Doubly Robust Inference for Double Machine Learning in Semiparametric Regression
セミパラメトリック回帰における二重機械学習のための二重ロバスト推論について

Due to concerns about parametric model misspecification, there is interest in using machine learning to adjust for confounding when evaluating the causal effect of an exposure on an outcome. Unfortunately, exposure effect estimators that rely on machine learning predictions are generally subject to so-called plug-in bias, which can render naive p-values and confidence intervals invalid. Progress has been made via proposals like targeted minimum loss estimation and more recently double machine learning, which rely on learning the conditional mean of both the outcome and exposure. Valid inference can then be obtained so long as both predictions converge (sufficiently fast) to the truth. Focusing on partially linear regression models, we show that a specific implementation of the machine learning techniques can yield exposure effect estimators that have small bias even when one of the first-stage predictions does not converge to the truth. The resulting tests and confidence intervals are doubly robust. We also show that the proposed estimators may fail to be regular when only one nuisance parameter is consistently estimated; nevertheless, we observe in simulation studies that our proposal can lead to reduced bias and improved confidence interval coverage in moderate-to-large samples.

パラメトリックモデルの誤指定に関する懸念から、曝露が結果に及ぼす因果効果を評価する際に、機械学習を使用して交絡を調整することに関心が寄せられています。残念ながら、機械学習の予測に依存する曝露効果推定量は、一般的にいわゆるプラグインバイアスの影響を受けやすく、単純なp値と信頼区間が無効になる可能性があります。目標最小損失推定や、最近では二重機械学習などの提案によって進歩が遂げられており、これらは結果と曝露の両方の条件付き平均を学習することに依存しています。両方の予測が(十分に速く)真実に収束する限り、有効な推論を得ることができます。部分線形回帰モデルに焦点を当て、機械学習手法の特定の実装により、第1段階の予測の1つが真実に収束しない場合でも、バイアスが小さい曝露効果推定量が得られることを示します。結果として得られる検定と信頼区間は二重にロバストです。また、1つの局外パラメータのみが一致推定される場合、提案された推定量が正則でなくなる可能性があることも示します。それにもかかわらず、シミュレーション研究では、私たちの提案により、中規模から大規模のサンプルにおけるバイアスの低減と信頼区間の被覆率の改善が実現できることが観察されています。

Deep Neural Network Approximation of Invariant Functions through Dynamical Systems
動的システムによる不変関数の深層ニューラルネットワーク近似

We study the approximation of functions which are invariant with respect to certain permutations of the input indices using flow maps of dynamical systems. Such invariant functions include the much-studied translation-invariant ones involving image tasks, but also encompass many permutation-invariant functions that find emerging applications in science and engineering. We prove sufficient conditions for universal approximation of these functions by a controlled dynamical system, which can be viewed as a general abstraction of deep residual networks with symmetry constraints. These results not only imply the universal approximation for a variety of commonly employed neural network architectures for symmetric function approximation, but also guide the design of architectures with approximation guarantees for applications involving new symmetry requirements.

私たちは、動的システムのフローマップを使用して、入力インデックスの特定の置換に関して不変である関数の近似を研究します。このような不変関数には、画像タスクに関連してよく研究されている並進不変関数が含まれるだけでなく、科学や工学で新たな応用が生まれている多くの置換不変関数も含まれます。私たちは、対称性制約を持つ深層残差ネットワークの一般的な抽象化と見なすことができる制御された動的システムによって、これらの関数を普遍近似するための十分条件を証明します。これらの結果は、対称関数の近似に一般的に使用されるさまざまなニューラルネットワークアーキテクチャの普遍近似性を示すだけでなく、新しい対称性要件を伴うアプリケーション向けに、近似保証を備えたアーキテクチャの設計指針も与えます。

A Statistical Experimental Design Method for Constructing Deterministic Sensing Matrices for Compressed Sensing
圧縮センシングのための決定論的センシング行列を構築するための統計的実験計画法

Compressed sensing is a signal processing technique used to efficiently acquire and reconstruct signals across various fields, including science, engineering, and business. A critical research challenge in compressed sensing is constructing a sensing matrix with desirable reconstruction properties. For optimal performance, the reconstruction process requires the sensing matrix to have low coherence. Several methods have been proposed to create deterministic sensing matrices. We propose a new statistical method to construct deterministic sensing matrices by intelligently sampling rows of Walsh-Hadamard matrices. Compared to existing methods, our approach yields sensing matrices with lower coherence, accommodates a more flexible number of measurements, and entails lower computational cost.

圧縮センシングは、科学、工学、ビジネスなど、さまざまな分野で信号を効率的に集録し、再構成するために使用される信号処理技術です。圧縮センシングにおける重要な研究課題は、望ましい再構成特性を持つセンシングマトリックスを構築することです。最適な性能を発揮するには、再構成プロセスでセンシングマトリックスのコヒーレンスを低くする必要があります。決定論的センシング行列を作成するために、いくつかの方法が提案されています。私たちは、Walsh-Hadamard行列の行をインテリジェントにサンプリングすることにより、決定論的センシング行列を構築するための新しい統計的手法を提案します。既存の方法と比較して、私たちのアプローチは、より低いコヒーレンスでセンシングマトリックスを生成し、より柔軟な測定数に対応し、計算コストを低く抑えます。
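
As a rough illustration of the quantity being optimized, the sketch below builds a Walsh-Hadamard matrix in Sylvester (natural) ordering, keeps a chosen subset of its rows, and computes the mutual coherence of the resulting sensing matrix, that is, the largest absolute normalized inner product between two distinct columns. The row subsets used here are arbitrary choices for demonstration; the paper's statistical row-selection procedure is not described in the abstract and is not reproduced. All function names are hypothetical.

```python
import math


def sylvester_hadamard(order):
    """Walsh-Hadamard matrix of size `order` (a power of two), Sylvester form."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    matrix = [[1]]
    while len(matrix) < order:
        # Doubling step: [[H, H], [H, -H]]
        matrix = ([row + row for row in matrix]
                  + [row + [-v for v in row] for row in matrix])
    return matrix


def select_rows(matrix, row_indices):
    """Sensing matrix made of the chosen rows (each row is one measurement)."""
    return [matrix[i] for i in row_indices]


def mutual_coherence(matrix):
    """Largest absolute normalized inner product between distinct columns."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    best = 0.0
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            inner = sum(a * b for a, b in zip(columns[i], columns[j]))
            best = max(best, abs(inner) / (norms[i] * norms[j]))
    return best
```

With the $4 \times 4$ matrix, keeping rows 1, 2 and 3 gives coherence $1/3$, which equals the Welch bound $\sqrt{(n-m)/(m(n-1))}$ for $m = 3$ measurements and $n = 4$ columns. Keeping only row 0 (all ones) gives coherence 1, the worst case. This shows why the choice of rows matters.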

Functional optimal transport: regularized map estimation and domain adaptation for functional data
関数最適輸送: 関数データの正則化写像推定と領域適応

We introduce a formulation of the regularized optimal transport problem for distributions on function spaces, where the stochastic map between functional domains can be approximated in terms of an (infinite-dimensional) Hilbert-Schmidt operator mapping a Hilbert space of functions to another. For numerous machine learning applications, data can be naturally viewed as samples drawn from spaces of functions, such as curves and surfaces, in high dimensions. Optimal transport for functional data analysis provides a useful framework of treatment for such domains. Since probability measures in infinite dimensional spaces generally lack absolute continuity (i.e., with respect to non-degenerate Gaussian measures), the Monge map in the standard optimal transport theory for finite dimensional spaces typically does not exist in the functional settings arising in such machine learning applications. This necessitates a suitable notion of approximation for the best pushforward measure to be obtained via a transport map. Indeed, our approach to the transportation problem in functional spaces is by a suitable regularization technique: we restrict the class of transport maps to be a Hilbert-Schmidt space of operators. Within this regularization framework, we develop an efficient algorithm for finding the stochastic transport map between functional domains and provide theoretical guarantees on the existence, uniqueness, and consistency of our estimate for the Hilbert-Schmidt space of compact linear operators. We validate our method on synthetic datasets and examine the functional properties of the transport map. Experiments on real-world datasets of robot arm trajectories further demonstrate the effectiveness of our method on applications in domain adaptation.

私たちは、関数空間上の分布に対する正規化された最適輸送問題の定式化を導入します。ここで関数領域間の確率マップは、関数のヒルベルト空間を別のヒルベルト空間にマップする(無限次元)ヒルベルト-シュミット演算子によって近似できます。多くの機械学習アプリケーションでは、データは高次元の曲線や面などの関数の空間から抽出されたサンプルとして自然に見ることができます。関数データ分析の最適輸送は、そのような領域の処理に役立つフレームワークを提供します。無限次元空間の確率測度は一般に絶対的な連続性(つまり、非退化ガウス測度に関して)を欠くため、有限次元空間の標準的な最適輸送理論におけるモンジュマップは、このような機械学習アプリケーションで生じる関数設定には通常存在しない。このため、輸送マップを介して取得される最適なプッシュフォワード測度には、適切な近似の概念が必要です。実際、関数空間における輸送問題に対する私たちのアプローチは、適切な正則化技術によるものです。つまり、輸送マップのクラスを演算子のヒルベルト・シュミット空間に制限します。この正則化フレームワーク内で、関数領域間の確率輸送マップを見つけるための効率的なアルゴリズムを開発し、コンパクト線形演算子のヒルベルト・シュミット空間の推定値の存在、一意性、一貫性について理論的な保証を提供します。合成データセットで私たちの方法を検証し、輸送マップの機能特性を調べます。ロボットアームの軌跡の実際のデータセットでの実験により、ドメイン適応のアプリケーションにおける私たちの方法の有効性がさらに実証されます。

Desiderata for Representation Learning: A Causal Perspective
表現学習の要件: 因果的観点

Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing non-spuriousness and efficiency (in supervised representation learning) and disentanglement (in unsupervised representation learning) using counterfactual quantities and observable consequences of causal assertions. This yields computable metrics that can be used to assess the degree to which representations satisfy the desiderata of interest and learn non-spurious and disentangled representations from single observational datasets.

表現学習は、高次元データの本質的な特徴を要約するために低次元表現を構築します。この学習問題には、学習された表現に求められるさまざまな要件(desiderata)、たとえば非スプリアス性、効率性、もつれのなさ(disentanglement)を記述することで取り組むことがよくあります。しかし、これらの直感的な要件を、観測データに基づいて測定および改善できる形式的な基準に変えるのは難しい場合があります。この論文では、表現学習を因果的な観点から捉え、反事実的な量と因果的主張の観測可能な帰結を用いて、(教師あり表現学習における)非スプリアス性と効率性、および(教師なし表現学習における)もつれ解消を定式化します。これにより、表現が関心のある要件をどの程度満たしているかを評価し、単一の観測データセットから非スプリアスでもつれ解消された表現を学習するために使用できる、計算可能なメトリックが得られます。

Accelerated Gradient Tracking over Time-varying Graphs for Decentralized Optimization
分散最適化のための時間変動グラフ上の加速勾配追跡

Decentralized optimization over time-varying graphs has become increasingly common in modern machine learning with massive data stored on millions of mobile devices, such as in federated learning. This paper revisits the widely used accelerated gradient tracking and extends it to time-varying graphs. We prove that the practical single loop accelerated gradient tracking needs $O((\frac{\gamma}{1-\sigma_{\gamma}})^2\sqrt{\frac{L}{\epsilon}})$ and $O((\frac{\gamma}{1-\sigma_{\gamma}})^{1.5}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon})$ iterations to reach an $\epsilon$-optimal solution over time-varying graphs when the problems are nonstrongly convex and strongly convex, respectively, where $\gamma$ and $\sigma_{\gamma}$ are two common constants characterizing the network connectivity, $L$ and $\mu$ are the smoothness and strong convexity constants, respectively, and one iteration corresponds to one gradient oracle call and one communication round. Our convergence rates improve significantly over the ones of $O(\frac{1}{\epsilon^{5/7}})$ and $O((\frac{L}{\mu})^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon})$, respectively, which were proved in the original literature of accelerated gradient tracking only for static graphs, where $\frac{\gamma}{1-\sigma_{\gamma}}$ equals $\frac{1}{1-\sigma}$ when the network is time-invariant. When combined with a multiple consensus subroutine, the dependence on the network connectivity constants can be further improved to $O(1)$ and $O(\frac{\gamma}{1-\sigma_{\gamma}})$ for the gradient oracle and communication round complexities, respectively. When the network is static, by employing the Chebyshev acceleration, our complexities exactly match the lower bounds without hiding any poly-logarithmic factor for both nonstrongly convex and strongly convex problems.

時間とともに変化するグラフに対する分散最適化は、連合学習など、何百万台ものモバイルデバイスに大量のデータが保存される現代の機械学習ではますます一般的になっています。この論文では、広く使用されている加速勾配追跡を再検討し、それを時間とともに変化するグラフに拡張します。問題がそれぞれ非強凸および強凸である場合、実用的な単一ループ加速勾配追跡では、時間変動グラフ上で$\epsilon$最適解に到達するために、$O((\frac{\gamma}{1-\sigma_{\gamma}})^2\sqrt{\frac{L}{\epsilon}})$および$O((\frac{\gamma}{1-\sigma_{\gamma}})^{1.5}\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon})$回の反復が必要であることを証明します。ここで、$\gamma$および$\sigma_{\gamma}$は、ネットワークの接続性を特徴付ける2つの共通定数、$L$および$\mu$は、それぞれ滑らかさおよび強凸性定数であり、1回の反復は、1回の勾配オラクル呼び出しと1回の通信ラウンドに対応します。我々の収束率は、それぞれ$O(\frac{1}{\epsilon^{5/7}})$および$O((\frac{L}{\mu})^{5/7}\frac{1}{(1-\sigma)^{1.5}}\log\frac{1}{\epsilon})$よりも大幅に改善されています。これらの収束率は、ネットワークが時間不変である場合に$\frac{\gamma}{1-\sigma_{\gamma}}$が$\frac{1}{1-\sigma}$に等しい、静的グラフのみの高速勾配追跡の元の文献で証明されています。複数のコンセンサスサブルーチンと組み合わせると、ネットワーク接続定数への依存性は、勾配オラクルおよび通信ラウンドの複雑さに対してそれぞれ$O(1)$および$O(\frac{\gamma}{1-\sigma_{\gamma}})$にさらに改善されます。ネットワークが静的な場合、チェビシェフ加速法を採用することで、非強凸問題と強凸問題の両方で、多重対数係数を隠すことなく、複雑度が下限と正確に一致します。
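
For readers unfamiliar with gradient tracking, the following minimal sketch shows the plain, non-accelerated version on a static graph, not the accelerated time-varying method analyzed in the paper. It assumes scalar local objectives $f_i(x) = \frac{1}{2}(x - b_i)^2$ and a fixed doubly stochastic mixing matrix $W$; these choices and all names are illustrative assumptions. Each agent keeps an iterate $x_i$ and a tracker $y_i$ of the average gradient, and updates $x^{k+1} = W x^k - \alpha y^k$ and $y^{k+1} = W y^k + \nabla f(x^{k+1}) - \nabla f(x^k)$.

```python
def mix(weights, values):
    """One communication round: each agent averages its neighbours' values."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def gradient_tracking(targets, weights, step_size, iterations):
    """Plain gradient tracking for f_i(x) = 0.5 * (x - targets[i]) ** 2.

    `weights` is assumed to be a doubly stochastic mixing matrix of a
    connected static graph (an assumption of this sketch).
    """
    n = len(targets)
    x = [0.0] * n
    grad = [xi - bi for xi, bi in zip(x, targets)]
    y = list(grad)  # tracker initialised with the local gradients
    for _ in range(iterations):
        mixed_x = mix(weights, x)
        x = [mx - step_size * yi for mx, yi in zip(mixed_x, y)]
        new_grad = [xi - bi for xi, bi in zip(x, targets)]
        mixed_y = mix(weights, y)
        # Add the change in local gradients so y keeps tracking their average.
        y = [my + g_new - g_old
             for my, g_new, g_old in zip(mixed_y, new_grad, grad)]
        grad = new_grad
    return x
```

With three agents, targets `[1.0, 2.0, 6.0]` and a mixing matrix with 0.5 on the diagonal and 0.25 elsewhere, every agent's iterate approaches the minimizer of the summed objective, which is the mean of the targets. Each loop iteration uses one local gradient evaluation and one exchange of $(x_i, y_i)$ with neighbours, matching the abstract's accounting of one gradient oracle call and one communication round per iteration.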

Pearl: A Production-Ready Reinforcement Learning Agent
Pearl: 本番環境に対応した強化学習エージェント

Reinforcement learning (RL) is a versatile framework for optimizing long-term goals. Although many real-world problems can be formalized with RL, learning and deploying a performant RL policy requires a system designed to address several important challenges, including the exploration-exploitation dilemma, partial observability, dynamic action spaces, and safety concerns. While the importance of these challenges has been well recognized, existing open-source RL libraries do not explicitly address them. This paper introduces Pearl, a Production-Ready RL software package designed to embrace these challenges in a modular way. In addition to presenting benchmarking results, we also highlight examples of Pearl’s ongoing industry adoption to demonstrate its advantages for production use cases. Pearl is open sourced on GitHub at github.com/facebookresearch/pearl and its official website is pearlagent.github.io.

強化学習(RL)は、長期的な目標を最適化するための汎用性の高いフレームワークです。多くの現実世界の問題はRLで定式化できますが、パフォーマンスの高いRLポリシーを学習して展開するには、探索と活用のジレンマ、部分的な可観測性、動的アクションスペース、安全性の懸念など、いくつかの重要な課題に対処するように設計されたシステムが必要です。これらの課題の重要性は十分に認識されていますが、既存のオープンソースのRLライブラリは、これらの課題に明示的に対処していません。この論文では、これらの課題をモジュール方式で受け入れるように設計された、量産対応のRLソフトウェアパッケージであるPearlを紹介します。ベンチマーク結果を提示するだけでなく、Pearlの継続的な業界での採用例も取り上げ、生産ユースケースでの利点を実証します。Pearlはgithub.com/facebookresearch/pearlのGitHubでオープンソース化されており、その公式Webサイトはpearlagent.github.ioです。

Boundary constrained Gaussian processes for robust physics-informed machine learning of linear partial differential equations
線形偏微分方程式のロバストな物理情報に基づく機械学習のための境界制約付きガウス過程

We introduce a framework for designing boundary constrained Gaussian process (BCGP) priors for exact enforcement of linear boundary conditions, and apply it to the machine learning of (initial) boundary value problems involving linear partial differential equations (PDEs). In contrast to existing work, we illustrate how to design boundary constrained mean and kernel functions for all classes of boundary conditions typically used in PDE modelling, namely Dirichlet, Neumann, Robin and mixed conditions. Importantly, this is done in a manner which allows for both forward and inverse problems to be naturally accommodated. We prove that the BCGP kernel has a universal representational capacity under Dirichlet conditions, and establish a formal equivalence between BCGPs and boundary-constrained neural networks (BCNNs) of infinite width. Finally, extensive numerical experiments are performed involving several linear PDEs, the results of which demonstrate the effectiveness and robustness of BCGP inference in the presence of sparse, noisy data.

私たちは、線形境界条件の厳密な施行のための境界制約ガウス過程(BCGP)事前分布を設計するためのフレームワークを紹介し、それを線形偏微分方程式(PDE)を含む(初期)境界値問題の機械学習に適用します。既存の研究とは対照的に、我々はPDEモデリングで一般的に使用されるすべてのクラスの境界条件、すなわちディリクレ、ノイマン、ロビン、および混合条件に対して、境界制約平均関数とカーネル関数を設計する方法を示します。重要なのは、これが順問題と逆問題の両方に自然に適応できる方法で行われることです。私たちは、BCGPカーネルがディリクレ条件下で普遍的な表現能力を持つことを証明し、BCGPと無限幅の境界制約ニューラルネットワーク(BCNN)との形式的な同等性を確立します。最後に、いくつかの線形PDEを含む広範な数値実験を実行し、その結果は、スパースでノイズの多いデータがある場合のBCGP推論の有効性と堅牢性を実証します。

Almost Sure Convergence Rates Analysis and Saddle Avoidance of Stochastic Gradient Methods
確率的勾配法のほぼ確実収束率解析とサドル回避

The vast majority of convergence rate analyses for stochastic gradient methods in the literature focus on convergence in expectation, whereas trajectory-wise almost sure convergence is clearly important to ensure that any instantiation of the stochastic algorithms would converge with probability one. Here we provide a unified almost sure convergence rates analysis for stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov’s accelerated gradient (SNAG) methods. We show, for the first time, that the almost sure convergence rates obtained for these stochastic gradient methods on strongly convex functions are arbitrarily close to the optimal convergence rates possible. For non-convex objective functions, we not only show that a weighted average of the squared gradient norms converges to zero almost surely, but also the last iterates of the algorithms. We further provide last-iterate almost sure convergence rates analysis for stochastic gradient methods on general convex smooth functions, in contrast with most existing results in the literature that only provide convergence in expectation for a weighted average of the iterates. The last-iterate almost sure convergence results also enable us to obtain almost sure avoidance of any strict saddle manifold by stochastic gradient methods with or without momentum. To the best of our knowledge, this is the first time such results are obtained for SHB and SNAG methods.

文献における確率的勾配法の収束率分析の大部分は期待値の収束に焦点を当てていますが、軌道ごとのほぼ確実な収束は、確率的アルゴリズムのあらゆるインスタンスが確率1で収束することを保証する上で明らかに重要です。ここでは、確率的勾配降下法(SGD)、確率的ヘビーボール法(SHB)、および確率的ネステロフの加速勾配法(SNAG)の統一されたほぼ確実な収束率分析を提供します。私たちは初めて、強凸関数に対するこれらの確率的勾配法で得られたほぼ確実な収束率が、可能な限り最適な収束率に任意に近いことを示します。非凸目的関数の場合、勾配ノルムの二乗の加重平均がほぼ確実にゼロに収束するだけでなく、アルゴリズムの最後の反復もゼロに収束することを示します。さらに、一般的な凸平滑関数に対する確率的勾配法の最後の反復のほぼ確実な収束率の分析を提供します。これは、反復の加重平均に対する期待収束のみを提供する文献の既存の結果のほとんどとは対照的です。最後の反復のほぼ確実な収束の結果により、モメンタムの有無にかかわらず、確率的勾配法によって厳密なサドル多様体をほぼ確実に回避することもできます。私たちの知る限り、SHB法とSNAG法でこのような結果が得られたのはこれが初めてです。
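
To make the update rules concrete, here is a minimal sketch of stochastic heavy-ball (SHB) on the strongly convex toy objective f(x) = x²/2 with noisy gradients. The objective, the 1/(t+1) step-size schedule, the noise model, and the function name `stochastic_heavy_ball` are all illustrative assumptions, not the setting analyzed in the paper. Setting `beta = 0` recovers plain SGD.

```python
import random


def stochastic_heavy_ball(steps: int, beta: float, seed: int = 0) -> float:
    """Minimize f(x) = x**2 / 2 from noisy gradients; returns the last iterate.

    beta = 0 gives plain SGD; beta in (0, 1) adds heavy-ball momentum.
    """
    rng = random.Random(seed)
    x_prev = x = 5.0  # arbitrary starting point
    for t in range(steps):
        step = 1.0 / (t + 1)            # diminishing step size (assumed schedule)
        grad = x + rng.gauss(0.0, 1.0)  # exact gradient x plus unit-variance noise
        x_next = x - step * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```

Running the function with different seeds gives different trajectories; almost sure convergence is a statement about each such trajectory individually, whereas convergence in expectation only describes their average.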

False discovery proportion envelopes with m-consistency
m一貫性のある偽発見割合エンベロープ

We provide new nonasymptotic false discovery proportion (FDP) confidence envelopes in several multiple testing settings relevant for modern high-dimensional data methods. We revisit the multiple testing scenarios considered in the recent work of Katsevich and Ramdas (2020): top-$k$, preordered (including knockoffs), online. Our emphasis is on obtaining FDP confidence bounds that both have nonasymptotic coverage and are asymptotically accurate in a specific sense, as the number $m$ of tested hypotheses grows. Namely, we introduce and study the property (which we call $m$-consistency) that the confidence bound converges to or below the desired level $\alpha$ when applied to a specific reference $\alpha$-level false discovery rate (FDR) controlling procedure. In this perspective, we derive new bounds that provide improvements over existing ones, both theoretically and practically, and are suitable for situations where at least a moderate number of rejections is expected. These improvements are illustrated with numerical experiments and real data examples. In particular, the improvement is significant in the knockoffs setting, which shows the impact of the method for practical use. As side results, we introduce a new confidence envelope for the empirical cumulative distribution function of i.i.d. uniform variables, and we provide new power results in sparse cases, both being of independent interest.

私たちは、現代の高次元データ手法に関連するいくつかの多重検定設定における、新しい非漸近的な偽発見割合(FDP)信頼エンベロープを提供します。KatsevichとRamdas (2020)の最近の研究で検討された多重検定シナリオ、すなわちトップ$k$、事前順序付け(ノックオフを含む)、オンラインを再検討します。私たちは、非漸近的なカバレッジを持ち、かつ検定される仮説の数$m$が増えるにつれて特定の意味で漸近的に正確となるFDP信頼限界を得ることに重点を置いています。すなわち、私たちは、特定の参照$\alpha$レベルの偽発見率(FDR)制御手順に適用された場合、信頼限界が目的のレベル$\alpha$またはそれ以下に収束するという特性(我々は$m$一貫性と呼ぶ)を導入し、研究します。この観点から、私たちは、理論的にも実際的にも既存のものよりも改善され、少なくとも中程度の数の棄却が予想される状況に適した新しい限界を導出します。これらの改善は、数値実験と実際のデータ例で説明されます。特に、ノックオフ設定では改善が顕著であり、実用における本手法のインパクトを示しています。副次的な結果として、i.i.d.一様変数の経験的累積分布関数の新しい信頼エンベロープを導入し、スパースケースでの新しい検出力結果を提供します。どちらもそれ自体として興味深いものです。
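
The distinction between the FDR-controlling procedure and the realized FDP can be sketched as follows. The example assumes the Benjamini-Hochberg step-up procedure as the reference procedure, which is a common choice but is not named in the abstract; the function names are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_max = rank  # largest rank whose p-value is under its threshold
    return sorted(order[:k_max])


def false_discovery_proportion(rejected: Sequence[int], is_null: Sequence[bool]) -> float:
    """Fraction of rejected hypotheses that are true nulls (0 if nothing is rejected)."""
    if not rejected:
        return 0.0
    false_rejections = sum(1 for i in rejected if is_null[i])
    return false_rejections / len(rejected)
```

Computing the FDP requires knowing which hypotheses are true nulls, which is unavailable in practice. That is why a confidence envelope, an upper bound on the FDP that holds with high probability, is the object of interest.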

Wasserstein Proximal Coordinate Gradient Algorithms
ワッサーシュタイン近似座標勾配アルゴリズム

Motivated by approximate Bayesian computation using mean-field variational approximation and the computation of equilibrium in multi-species systems with cross-interaction, this paper investigates the composite geodesically convex optimization problem over multiple distributions. The objective functional under consideration is composed of a convex potential energy on a product of Wasserstein spaces and a sum of convex self-interaction and internal energies associated with each distribution. To efficiently solve this problem, we introduce the Wasserstein Proximal Coordinate Gradient (WPCG) algorithms with parallel, sequential, and random update schemes. Under a quadratic growth (QG) condition that is weaker than the usual strong convexity requirement on the objective functional, we show that WPCG converges exponentially fast to the unique global optimum. In the absence of the QG condition, WPCG is still demonstrated to converge to the global optimal solution, albeit at a slower polynomial rate. Numerical results for both motivating examples are consistent with our theoretical findings.

この論文では、平均場変分近似を用いた近似ベイズ計算と、相互作用のある多種システムにおける平衡の計算に着目し、複数の分布にわたる複合測地凸最適化問題を調査します。検討中の目的関数は、ワッサーシュタイン空間の積上の凸ポテンシャルエネルギーと、各分布に関連付けられた凸自己相互作用エネルギーと内部エネルギーの合計で構成されます。この問題を効率的に解くために、並列、順次、ランダム更新スキームを備えたワッサーシュタイン近似座標勾配(WPCG)アルゴリズムを導入します。目的関数に対する通常の強い凸性要件よりも弱い二次成長(QG)条件の下で、WPCGが唯一のグローバル最適値に指数関数的に速く収束することを示す。QG条件がない場合でも、WPCGは、より遅い多項式速度ではあるが、依然としてグローバル最適解に収束することが実証されています。両方の動機となる例の数値結果は、理論的発見と一致しています。

Concentration and Moment Inequalities for General Functions of Independent Random Variables with Heavy Tails
重い裾を持つ独立確率変数の一般関数に対する集中不等式とモーメント不等式

The concentration of measure phenomenon serves an essential role in statistics and machine learning. This paper gives bounded difference-type concentration and moment inequalities for general functions of independent random variables with heavy tails. A general framework is presented, which can be used to prove inequalities for general functions once the moment inequality for sums of independent random variables is established. We illustrate the power of the framework by showing how it can be used to derive novel concentration and moment inequalities for bounded, Bernstein’s moment condition, weak-exponential, and polynomial-moment random variables. Furthermore, we give potential applications of these inequalities to statistical learning theory.

測度の集中現象は、統計学と機械学習において重要な役割を果たします。この論文では、裾が重い独立確率変数の一般関数について、有界差分型の集中不等式とモーメント不等式を示します。独立確率変数の和に対するモーメント不等式が確立されれば、一般関数に対する不等式の証明に利用できる一般的なフレームワークを提示します。このフレームワークの有効性を、有界確率変数、バーンスタインのモーメント条件を満たす確率変数、弱指数型確率変数、多項式モーメントを持つ確率変数に対する新しい集中不等式とモーメント不等式を導出できることを示して説明します。さらに、これらの不等式の統計的学習理論への応用の可能性を示します。

Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies
摂動的に解ける階層としてのランダム全結合ニューラルネットワーク

We study the distribution of fully connected neural networks with Gaussian random weights/biases and L hidden layers, each of width proportional to a large parameter n. For polynomially bounded non-linearities we give sharp estimates in powers of 1/n for the joint cumulants of the network output and its derivatives. We further show that network cumulants form a perturbatively solvable hierarchy in powers of 1/n. That is, the k-th order cumulants in each layer are determined to leading order in 1/n by cumulants of order at most k computed at the previous layer. By explicitly deriving and then solving several such recursions, we find that the depth-to-width ratio L/n plays the role of an effective network depth, controlling both the distance to Gaussianity and the size of inter-neuron correlations.

私たちは、ガウスのランダムな重み/バイアスとL個の隠れ層を持つ全結合ニューラルネットワークの分布を研究します。各層の幅は大きなパラメータnに比例します。多項式有界非線形性の場合、ネットワーク出力とその導関数の結合キュムラントについて、1/nの累乗でシャープな推定値を与えます。さらに、ネットワークキュムラントが1/nの累乗で摂動的に解ける階層を形成することを示します。つまり、各層のk番目のキュムラントは、前の層で計算された最大でkの次数のキュムラントによって1/nの先行次数に決定されます。このような再帰を明示的に導出し、解くことで、深さと幅の比L/nが有効なネットワーク深度の役割を果たし、ガウス性までの距離とニューロン間相関のサイズの両方を制御することがわかりました。

On Regularized Radon-Nikodym Differentiation
正則化ラドン・ニコディム微分について

We discuss the problem of estimating Radon-Nikodym derivatives. This problem appears in various applications, such as covariate shift adaptation, likelihood-ratio testing, mutual information estimation, and conditional probability estimation. However, in many of the above applications one is interested in the pointwise evaluation of the Radon-Nikodym derivatives rather than in their approximation as elements of some spaces of functions, and this aspect has been left unexplored in the previous studies. To address the above problem, we employ the general regularization scheme in reproducing kernel Hilbert spaces. The convergence rate of the corresponding regularized algorithm is established by taking into account both the smoothness of the derivative and the capacity of the space in which it is estimated. This is done in terms of general source conditions and the regularized Christoffel functions. We also find that the reconstruction of Radon-Nikodym derivatives at any particular point can be done with higher order of accuracy as compared to the reported work available so far. Our theoretical results are illustrated by numerical simulations.

私たちは、ラドン・ニコディム導関数の推定の問題について議論します。この問題は、共変量シフト適応、尤度比検定、相互情報量推定、条件付き確率推定など、さまざまなアプリケーションで現れます。しかし、上記のアプリケーションの多くでは、ある関数空間の要素としての近似ではなく、ラドン・ニコディム導関数の点ごとの評価に関心があり、この側面は以前の研究では未調査のまま残されています。上記の問題に対処するために、再生核ヒルベルト空間における一般的な正則化スキームを使用します。対応する正則化アルゴリズムの収束率は、導関数の滑らかさと、それが推定される空間の容量の両方を考慮することによって確立されます。これは、一般的なソース条件と正則化されたクリストッフェル関数の観点から行われます。また、特定のポイントでのラドン・ニコディム導関数の再構築は、これまでに報告されている研究と比較して、より高い精度で実行できることがわかりました。私たちの理論的結果は、数値シミュレーションによって説明されます。

pgmpy: A Python Toolkit for Bayesian Networks
pgmpy: ベイジアンネットワークのための Python ツールキット

Bayesian Networks (BNs) are used in various fields for modeling, prediction, and decision making. pgmpy is a Python package that provides a collection of algorithms and tools to work with BNs and related models. It implements algorithms for structure learning, parameter estimation, approximate and exact inference, causal inference, and simulations. These implementations focus on modularity and easy extensibility to allow users to quickly modify/add to existing algorithms, or to implement new algorithms for different use cases. pgmpy is released under the MIT License; the source code is available at: https://github.com/pgmpy/pgmpy, and the documentation at: https://pgmpy.org.

ベイジアンネットワーク(BN)は、モデリング、予測、および意思決定のためにさまざまな分野で使用されています。pgmpyは、BNおよび関連モデルを操作するためのアルゴリズムとツールのコレクションを提供するPythonパッケージです。構造学習、パラメータ推定、近似および正確な推論、因果推論、およびシミュレーションのためのアルゴリズムを実装します。これらの実装は、モジュール性と容易な拡張性に重点を置いており、ユーザーが既存のアルゴリズムを迅速に変更/追加したり、さまざまなユースケースに対して新しいアルゴリズムを実装したりできるようにします。pgmpyはMITライセンスの下でリリースされています。ソースコードはhttps://github.com/pgmpy/pgmpyから、ドキュメントはhttps://pgmpy.orgから入手できます。

Recursive Estimation of Conditional Kernel Mean Embeddings
条件付きカーネル平均埋め込みの再帰的推定

Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone’s theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.

カーネル平均埋め込みは、機械学習で広く使用されている手法で、確率分布を再生カーネルヒルベルト空間(RKHS)の要素にマッピングします。入力と出力のペアが観察される教師あり学習の問題では、入力が与えられた場合の出力の条件付き分布が重要なオブジェクトです。出力の入力依存の条件付き分布は、RKHS値関数、条件付きカーネル平均マップでエンコードできます。この論文では、ヒルベルト空間値の$L_2$空間、つまりボクナー空間で条件付きカーネル平均マップを推定する新しい再帰アルゴリズムを紹介します。穏やかな条件下での再帰推定量の弱および強$L_2$一貫性を証明します。アイデアは、局所的にコンパクトなポーランド空間でのヒルベルト空間値回帰に対するストーンの定理を一般化することです。条件付きカーネル平均埋め込みに関する新しい洞察を示し、提案された再帰法の収束に関する強い漸近境界を示します。最後に、ユークリッド空間、リーマン多様体、関数空間の局所的にコンパクトなサブセットからの入力の3つのアプリケーションドメインで結果を示します。

Penalized Overdamped and Underdamped Langevin Monte Carlo Algorithms for Constrained Sampling
制約付きサンプリングのためのペナルティ付き過減衰および不足減衰ランジュバンモンテカルロアルゴリズム

We consider the constrained sampling problem where the goal is to sample from a target distribution $\pi(x)\propto e^{-f(x)}$ when $x$ is constrained to lie on a convex body $C\subset \mathbb{R}^d$. Motivated by penalty methods from continuous optimization, we propose and study penalized Langevin Dynamics (PLD) and penalized underdamped Langevin Monte Carlo (PULMC) methods for constrained sampling that convert the constrained sampling problem into an unconstrained sampling problem by introducing a penalty function for constraint violations. When $f$ is smooth and gradients of $f$ are available, we show ${\tilde{O}}(d/\varepsilon^{10})$ iteration complexity for PLD to sample the target up to an $\varepsilon$-error where the error is measured in terms of the total variation distance and $\tilde{O}(\cdot)$ hides some logarithmic factors. For PULMC, we improve this result to $\tilde{O}(\sqrt{d}/\varepsilon^{7})$ when the Hessian of $f$ is Lipschitz and the boundary of $C$ is sufficiently smooth. To our knowledge, these are the first convergence rate results for underdamped Langevin Monte Carlo methods in the constrained sampling setting that can handle non-convex choices of $f$ and can provide guarantees with the best dimension dependency among existing methods for constrained sampling when the gradients are deterministically available. We then consider the setting where only unbiased stochastic estimates of the gradients of $f$ are available, motivated by applications to large-scale Bayesian learning problems. We propose PSGLD and PSGULMC methods that are variants of PLD and PULMC that can handle stochastic gradients and that are scalable to large datasets without requiring Metropolis-Hastings correction steps. For PSGLD and PSGULMC, when $f$ is strongly convex and smooth, we obtain an iteration complexity of $\tilde{O}(d/\varepsilon^{18})$ and $\tilde{O}(d\sqrt{d}/\varepsilon^{39})$ respectively in the 2-Wasserstein distance. For the more general case, when $f$ is smooth and $f$ can be non-convex, we also provide finite-time performance bounds and iteration complexity results. Finally, we illustrate the performance of our algorithms on Bayesian LASSO regression and Bayesian constrained deep learning problems.

私たちは、$x$が凸体$C\subset \mathbb{R}^d$上にあるように制約されているときに、目標分布$\pi(x)\propto e^{-f(x)}$からサンプリングすることを目的とする制約付きサンプリング問題を考察します。連続最適化のペナルティ法に着想を得て、制約違反に対するペナルティ関数を導入することで制約付きサンプリング問題を制約なしのサンプリング問題に変換する、制約付きサンプリングのためのペナルティ付きランジュバンダイナミクス(PLD)法とペナルティ付き減衰不足ランジュバンモンテカルロ(PULMC)法を提案し、検討します。$f$が滑らかで、$f$の勾配が利用可能な場合、PLDがターゲットを$\varepsilon$誤差までサンプリングするための反復計算量は${\tilde{O}}(d/\varepsilon^{10})$であることを示します。誤差は総変動距離で測定され、$\tilde{O}(\cdot)$はいくつかの対数因子を隠します。PULMCについては、$f$のヘッセ行列がLipschitzで$C$の境界が十分に滑らかな場合、この結果を$\tilde{O}(\sqrt{d}/\varepsilon^{7})$に改善します。私たちが知る限り、これらは制約付きサンプリング設定における減衰不足のランジュバンモンテカルロ法の収束率に関する最初の結果であり、$f$の非凸選択を処理でき、勾配が決定論的に利用できる場合に制約付きサンプリングの既存の方法の中で最良の次元依存性を保証できます。次に、大規模なベイズ学習問題への応用を動機として、$f$の勾配の不偏確率的推定値のみが利用できる設定を検討します。私たちは、確率的勾配を処理でき、メトロポリス-ヘイスティング補正手順を必要とせずに大規模なデータセットに拡張可能なPLDおよびPULMCのバリアントであるPSGLDおよびPSGULMC法を提案します。PSGLDとPSGULMCの場合、$f$が強凸かつ滑らかな場合、2-ワッサースタイン距離でそれぞれ$\tilde{O}(d/\varepsilon^{18})$と$\tilde{O}(d\sqrt{d}/\varepsilon^{39})$の反復複雑度が得られます。より一般的なケースとして、$f$が滑らかで$f$が非凸になる可能性がある場合、有限時間のパフォーマンス境界と反復複雑度の結果も提供します。最後に、ベイジアンLASSO回帰とベイジアン制約付きディープラーニングの問題に対するアルゴリズムのパフォーマンスを示します。
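
The penalty idea can be sketched in one dimension. The code below runs unadjusted overdamped Langevin dynamics on the penalized potential U(x) = f(x) + dist(x, C)²/(2δ), with f(x) = x²/2 and C = [-1, 1]. The quadratic distance penalty, the values of `eta` and `delta`, and the function names are assumptions made for illustration; the abstract does not specify the paper's penalty function or parameter choices.

```python
import math
import random
from typing import List


def project_interval(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Euclidean projection onto the convex set C = [lo, hi]."""
    return min(max(x, lo), hi)


def penalized_langevin(steps: int, eta: float = 1e-3, delta: float = 1e-2,
                       seed: int = 0) -> List[float]:
    """Unadjusted Langevin on U(x) = f(x) + dist(x, C)**2 / (2 * delta), f(x) = x**2 / 2."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(steps):
        grad_f = x                                        # gradient of f(x) = x**2 / 2
        grad_penalty = (x - project_interval(x)) / delta  # zero inside C
        noise = math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
        x = x - eta * (grad_f + grad_penalty) + noise
        samples.append(x)
    return samples
```

The chain is never projected back onto C. Iterates may leave the interval, but the penalty gradient pushes them back, so a smaller `delta` keeps samples closer to C at the cost of a stiffer potential that needs a smaller step size.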

Fast Rates in Pool-Based Batch Active Learning
プールベースのバッチアクティブラーニングの高速レート

We consider a batch active learning scenario where the learner adaptively issues batches of points to a labeling oracle. Sampling labels in batches is highly desirable in practice due to the smaller number of interactive rounds with the labeling oracle (often human beings). However, batch active learning typically pays the price of a reduced adaptivity, leading to suboptimal results. In this paper we propose a solution which requires a careful trade off between the informativeness of the queried points and their diversity. We theoretically investigate batch active learning in the practically relevant scenario where the unlabeled pool of data is available beforehand (pool-based active learning). We analyze a novel stage-wise greedy algorithm and show that, as a function of the label complexity, the excess risk of this algorithm matches the known minimax rates in a standard statistical learning setting with linear function spaces. Our results also exhibit a mild dependence on the batch size. These initial results are then extended to hold for general function spaces with similar algorithmics. These are the first theoretical results that employ careful trade offs between informativeness and diversity to rigorously quantify the statistical performance of batch active learning in the pool-based scenario.

私たちは、学習者が適応的にポイントのバッチをラベル付けオラクルに発行するバッチアクティブラーニングのシナリオを検討します。ラベルをバッチでサンプリングすることは、ラベル付けオラクル(多くの場合、人間)との対話ラウンドの数が少ないため、実際には非常に望ましいことです。ただし、バッチアクティブラーニングは通常、適応性が低下するという代償を払い、最適ではない結果につながります。この論文では、クエリされたポイントの情報量とその多様性の間で慎重にトレードオフする必要があるソリューションを提案します。ラベルなしのデータプールが事前に利用できる(プールベースのアクティブラーニング)という実際的なシナリオで、バッチアクティブラーニングを理論的に調査します。新しい段階的な貪欲アルゴリズムを分析し、ラベルの複雑さの関数として、このアルゴリズムの過剰リスクが、線形関数空間を使用した標準的な統計学習設定での既知のミニマックスレートと一致することを示します。また、結果はバッチサイズにわずかに依存していることも示しています。これらの初期結果は、同様のアルゴリズムを持つ一般的な関数空間に当てはまるように拡張されます。これらは、情報量と多様性の間の慎重なトレードオフを採用して、プールベースのシナリオにおけるバッチアクティブラーニングの統計的パフォーマンスを厳密に定量化した最初の理論的結果です。

On Causality in Domain Adaptation and Semi-Supervised Learning: an Information-Theoretic Analysis for Parametric Models
ドメイン適応と半教師あり学習における因果関係について:パラメトリックモデルの情報理論的分析

Recent advancements in unsupervised domain adaptation (UDA) and semi-supervised learning (SSL), particularly incorporating causality, have led to significant methodological improvements in these learning problems. However, a formal theory that explains the role of causality in the generalization performance of UDA/SSL is still lacking. In this paper, we consider the UDA/SSL scenarios where we access $m$ labelled source data and $n$ unlabelled target data as training instances under different causal settings with a parametric probabilistic model. We study the learning performance (e.g., excess risk) of prediction in the target domain from an information-theoretic perspective. Specifically, we distinguish two scenarios: the learning problem is called causal learning if the feature is the cause and the label is the effect, and is called anti-causal learning otherwise. We show that in causal learning, the excess risk depends on the size of the source sample at a rate of $O(\frac{1}{m})$ only if the labelling distribution between the source and target domains remains unchanged. In anti-causal learning, we show that the unlabelled data dominate the performance at a rate of typically $O(\frac{1}{n})$. These results bring out the relationship between the data sample size and the hardness of the learning problem with different causal mechanisms.

教師なし領域適応(UDA)と半教師あり学習(SSL)の最近の進歩、特に因果関係の組み込みにより、これらの学習問題における方法論的改善が著しく進みました。しかし、UDA/SSLの一般化パフォーマンスにおける因果関係の役割を説明する正式な理論はまだありません。この論文では、パラメトリック確率モデルを使用して、異なる因果設定の下で、ラベル付きソースデータ$m$個とラベルなしターゲットデータ$n$個をトレーニングインスタンスとしてアクセスするUDA/SSLシナリオを検討します。情報理論的観点から、ターゲットドメインでの予測の学習パフォーマンス(過剰リスクなど)を調べます。具体的には、2つのシナリオを区別します。特徴が原因でラベルが結果である場合は学習問題を因果学習と呼び、それ以外の場合は反因果学習と呼びます。因果学習では、ソースドメインとターゲットドメイン間のラベル分布が変化しない場合にのみ、過剰リスクはソースサンプルのサイズに$O(\frac{1}{m})$の割合で依存することを示しています。反因果学習では、ラベルなしデータが通常$O(\frac{1}{n})$の割合でパフォーマンスを支配することを示しています。これらの結果は、データサンプルサイズと、異なる因果メカニズムを持つ学習問題の難しさとの関係を示しています。

Mean-Field Approximation of Cooperative Constrained Multi-Agent Reinforcement Learning (CMARL)
協調制約付きマルチエージェント強化学習 (CMARL) の平均場近似

Mean-Field Control (MFC) has recently been proven to be a scalable tool to approximately solve large-scale multi-agent reinforcement learning (MARL) problems. However, these studies are typically limited to the unconstrained cumulative reward maximization framework. In this paper, we show that one can use the MFC approach to approximate the MARL problem even in the presence of constraints. Specifically, we prove that an $N$-agent constrained MARL problem, with the state and action spaces of each individual agent being of sizes $|\mathcal{X}|$ and $|\mathcal{U}|$ respectively, can be approximated by an associated constrained MFC problem with an error $e\triangleq \mathcal{O}\left([\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}]/\sqrt{N}\right)$. In a special case where the reward, cost, and state transition functions are independent of the action distribution of the population, we prove that the error can be improved to $e=\mathcal{O}(\sqrt{|\mathcal{X}|}/\sqrt{N})$. Also, we provide a Natural Policy Gradient based algorithm, and prove that it can solve the constrained MARL problem within an error of $\mathcal{O}(e)$ with a sample complexity of $\mathcal{O}(e^{-6})$.

平均場制御(MFC)は、大規模なマルチエージェント強化学習(MARL)問題を近似的に解決するためのスケーラブルなツールであることが最近証明されました。ただし、これらの研究は通常、制約のない累積報酬最大化フレームワークに限定されています。この論文では、制約がある場合でもMFCアプローチを使用してMARL問題を近似できることを示します。具体的には、個々のエージェントの状態空間とアクション空間のサイズがそれぞれ$|\mathcal{X}|$と$|\mathcal{U}|$である$N$エージェント制約付きMARL問題は、誤差$e\triangleq \mathcal{O}\left([\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}]/\sqrt{N}\right)$を持つ関連する制約付きMFC問題で近似できることを証明します。報酬、コスト、状態遷移関数が集団の行動分布に依存しない特殊なケースでは、誤差を$e=\mathcal{O}(\sqrt{|\mathcal{X}|}/\sqrt{N})$まで改善できることを証明します。また、自然ポリシー勾配ベースのアルゴリズムを提供し、サンプル複雑度が$\mathcal{O}(e^{-6})$で、誤差が$\mathcal{O}(e)$以内で制約付きMARL問題を解決できることを証明します。
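
To get a feel for how these rates scale, the following sketch evaluates the two expressions with the constants hidden by the O(·) notation set to 1. That simplification is an assumption, so the numbers indicate scaling only, not actual error values. The function names are hypothetical.

```python
import math


def mfc_error_scale(num_states: int, num_actions: int, num_agents: int) -> float:
    """(sqrt(|X|) + sqrt(|U|)) / sqrt(N), with the constant hidden by O(.) set to 1."""
    return (math.sqrt(num_states) + math.sqrt(num_actions)) / math.sqrt(num_agents)


def sample_complexity_scale(error: float) -> float:
    """e**(-6) scaling of the sample complexity, again ignoring constants."""
    return error ** -6
```

For instance, with |X| = 4 and |U| = 9, going from N = 100 to N = 400 agents halves the error scale from 0.5 to 0.25, while the e⁻⁶ term grows by a factor of 2⁶ = 64.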

Structured Optimal Variational Inference for Dynamic Latent Space Models
動的潜在空間モデルのための構造化最適変分推論

We consider a latent space model for dynamic networks, where our objective is to estimate the pairwise inner products plus the intercept of the latent positions. To balance posterior inference and computational scalability, we consider a structured mean-field variational inference framework, where the time-dependent properties of the dynamic networks are exploited to facilitate computation and inference. Additionally, an easy-to-implement block coordinate ascent algorithm is developed with message-passing type updates in each block, and its complexity per iteration is linear in the number of nodes and time points. To certify the optimality, we demonstrate that the variational risk of the proposed variational inference approach attains the minimax optimal rate up to a logarithmic factor under certain conditions. To this end, we first derive the minimax lower bound, which might be of independent interest. In addition, we show that the posterior under commonly adopted Gaussian random walk priors can achieve the minimax lower bound up to a logarithmic factor. To the best of our knowledge, this is the first such thorough theoretical analysis of Bayesian dynamic latent space models. Simulations and real data analysis demonstrate the efficacy of our methodology and the efficiency of our algorithm.

私たちは、動的ネットワークの潜在空間モデルを検討します。ここでの目的は、ペアワイズ内積と潜在位置の切片を推定することです。事後推論と計算のスケーラビリティのバランスをとるために、動的ネットワークの時間依存特性を利用して計算と推論を容易にする構造化平均場変分推論フレームワークを検討します。さらに、各ブロックでメッセージパッシング型の更新を行う、実装が容易なブロック座標上昇アルゴリズムが開発されていますが、反復あたりの複雑さはノード数と時間点に比例します。最適性を証明するために、提案された変分推論アプローチの変分リスクが、特定の条件下で対数係数のみを使用してミニマックス最適率を達成することを実証します。このために、まず、独立した関心事である可能性のあるミニマックス下限を導出します。さらに、一般的に採用されているガウスランダムウォーク事前分布の下での事後分布が、対数係数のみを使用してミニマックス下限を達成できることを示します。私たちの知る限り、これはベイジアン動的潜在空間モデルの徹底した理論的分析としては初めてのものです。シミュレーションと実際のデータ分析により、私たちの方法論の有効性とアルゴリズムの効率性が実証されています。

Stable and Consistent Density-Based Clustering via Multiparameter Persistence
マルチパラメータ永続性による安定で一貫性のある密度ベースのクラスタリング

We consider the degree-Rips construction from topological data analysis, which provides a density-sensitive, multiparameter hierarchical clustering algorithm. We analyze its stability to perturbations of the input data using the correspondence-interleaving distance, a metric for hierarchical clusterings that we introduce. Taking certain one-parameter slices of degree-Rips recovers well-known methods for density-based clustering, but we show that these methods are unstable. However, we prove that degree-Rips, as a multiparameter object, is stable, and we propose an alternative approach for taking slices of degree-Rips, which yields a one-parameter hierarchical clustering algorithm with better stability properties. We prove that this algorithm is consistent, using the correspondence-interleaving distance. We provide an algorithm for extracting a single clustering from one-parameter hierarchical clusterings, which is stable with respect to the correspondence-interleaving distance. And, we integrate these methods into a pipeline for density-based clustering, which we call Persistable. Adapting tools from multiparameter persistent homology, we propose visualization tools that guide the selection of all parameters of the pipeline. We demonstrate Persistable on benchmark data sets, showing that it identifies multi-scale cluster structure in data.

私たちは、密度に敏感なマルチパラメータ階層的クラスタリングアルゴリズムを提供する位相データ解析からのdegree-Rips構築について考察します。私たちは、我々が導入する階層的クラスタリングの測定基準である対応インターリーブ距離を使用して、入力データの摂動に対するその安定性を解析します。degree-Ripsの特定の1パラメータスライスを取ることで、密度ベースのクラスタリングのよく知られた方法を回復できるが、これらの方法は不安定であることを示す。しかし、私たちは、マルチパラメータオブジェクトとしてのdegree-Ripsが安定していることを証明し、degree-Ripsのスライスを取るための代替アプローチを提案します。これにより、より安定性の高い1パラメータ階層的クラスタリングアルゴリズムが得られます。私たちは、対応インターリーブ距離を使用して、このアルゴリズムが一貫していることを証明します。私たちは、対応インターリーブ距離に関して安定している、1パラメータ階層的クラスタリングから単一のクラスタリングを抽出するアルゴリズムを提供します。そして、これらの手法を密度ベースのクラスタリングのパイプラインに統合し、これをPersistableと呼んでいます。マルチパラメータ持続相同性からツールを適応させて、パイプラインのすべてのパラメータの選択をガイドする視覚化ツールを提案します。ベンチマークデータセットでPersistableを実証し、データ内のマルチスケールクラスター構造を識別できることを示します。

Faster Randomized Methods for Orthogonality Constrained Problems
直交性制約問題に対するより高速なランダム化法

Recent literature has advocated the use of randomized methods for accelerating the solution of various matrix problems arising in machine learning and data science. One popular strategy for leveraging randomization in numerical linear algebra is to use it as a way to reduce problem size. However, methods based on this strategy lack sufficient accuracy for some applications. Randomized preconditioning is another approach for leveraging randomization in numerical linear algebra, which provides higher accuracy. The main challenge in using randomized preconditioning is the need for an underlying iterative method, thus randomized preconditioning so far has been applied almost exclusively to solving regression problems and linear systems. In this article, we show how to expand the application of randomized preconditioning to another important set of problems prevalent in machine learning: optimization problems with (generalized) orthogonality constraints. We demonstrate our approach, which is based on the framework of Riemannian optimization and Riemannian preconditioning, on the problem of computing the dominant canonical correlations and on the Fisher linear discriminant analysis problem. More broadly, our method is designed for problems with input matrices featuring one dimension much larger than the other (e.g., the number of samples much larger than the number of features). For both problems, we evaluate the effect of preconditioning on the computational costs and asymptotic convergence and demonstrate empirically the utility of our approach.

最近の文献では、機械学習やデータサイエンスで発生するさまざまな行列問題の解決を加速するために、ランダム化法の使用が推奨されています。数値線形代数でランダム化を活用するための一般的な戦略の1つは、問題のサイズを縮小する方法として使用することです。ただし、この戦略に基づく方法は、一部のアプリケーションでは精度が不十分です。ランダム化前処理は、数値線形代数でランダム化を活用するもう1つのアプローチであり、精度が向上します。ランダム化前処理を使用する際の主な課題は、基礎となる反復法が必要であることです。そのため、ランダム化前処理は、これまで回帰問題と線形システムの解決にほぼ独占的に適用されてきました。この記事では、ランダム化前処理の適用を、機械学習でよく見られる別の重要な問題セット、つまり(一般化)直交性制約のある最適化問題に拡張する方法を示します。リーマン最適化とリーマン前処理のフレームワークに基づくアプローチを、支配的な正準相関を計算する問題とフィッシャー線形判別分析問題で示します。より広義には、私たちの方法は、ある次元が他の次元よりもはるかに大きい入力行列(たとえば、サンプル数が特徴の数よりもはるかに大きい)の問題向けに設計されています。両方の問題に対して、計算コストと漸近収束に対する前処理の影響を評価し、私たちのアプローチの有用性を実証します。

Estimation of Sparse Gaussian Graphical Models with Hidden Clustering Structure
隠れクラスタリング構造を持つスパースガウスグラフィカルモデルの推定

Estimation of Gaussian graphical models is important in natural science when modeling the statistical relationships between variables in the form of a graph. The sparsity and clustering structure of the concentration matrix is enforced to reduce model complexity and describe inherent regularities. We propose a model to estimate the sparse Gaussian graphical models with hidden clustering structure, which also allows additional linear constraints to be imposed on the concentration matrix. We design an efficient two-phase algorithm for solving the proposed model. Specifically, we develop a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to generate an initial point to warm start the second phase algorithm, which is a proximal augmented Lagrangian method (pALM), to get a solution with high accuracy. Numerical experiments on both synthetic data and real data demonstrate the good performance of our model, as well as the efficiency and robustness of our proposed algorithm.

ガウスグラフィカルモデルの推定は、自然科学において、変数間の統計的関係をグラフの形でモデル化する際に重要です。濃度行列のスパース性とクラスタリング構造は、モデルの複雑さを軽減し、固有の規則性を記述するために強制されます。隠れたクラスタリング構造を持つスパースガウスグラフィカルモデルを推定するモデルを提案します。これにより、濃度行列に追加の線形制約を課すこともできるようになります。提案モデルを解くための効率的な2段階アルゴリズムを設計します。具体的には、対称ガウスザイデルベースの交互方向乗数法(sGS-ADMM)を開発して初期点を生成し、第2段階アルゴリズムである近似拡張ラグランジュ法(pALM)をウォームスタートして、高精度のソリューションを取得します。合成データと実際のデータの両方での数値実験により、モデルの優れたパフォーマンスと、提案アルゴリズムの効率性と堅牢性が実証されています。

Rethinking Discount Regularization: New Interpretations, Unintended Consequences, and Solutions for Regularization in Reinforcement Learning
割引正則化の再考:強化学習における正則化の新しい解釈、意図しない結果、および解決策

Discount regularization, using a shorter planning horizon when calculating the optimal policy, is a popular choice to avoid overfitting when faced with sparse or noisy data. It is commonly interpreted as de-emphasizing or ignoring delayed effects. In this paper, we prove two alternative views of discount regularization that expose unintended consequences and motivate novel regularization methods. In model-based RL, planning under a lower discount factor acts like a prior with stronger regularization on state-action pairs with more transition data. This leads to poor performance when the transition matrix is estimated from data sets with uneven amounts of data across state-action pairs. In model-free RL, discount regularization equates to planning using a weighted average Bellman update, where the agent plans as if the values of all state-action pairs are closer than implied by the data. Our equivalence theorems motivate simple methods that generalize discount regularization by setting parameters locally for individual state-action pairs rather than globally. We demonstrate the failures of discount regularization and how we remedy them using our state-action-specific methods across empirical examples with both tabular and continuous state spaces.

割引正則化は、最適ポリシーを計算するときに計画期間を短くする手法で、スパースなデータやノイズの多いデータに直面したときに過剰適合を回避するための一般的な選択肢です。これは通常、遅延効果を軽視または無視していると解釈されます。この論文では、意図しない結果を明らかにし、新しい正則化手法の動機となる割引正則化の2つの代替ビューを証明します。モデルベースのRLでは、割引係数が低い場合の計画は、遷移データが多い状態とアクションのペアに対してより強力な正則化を伴う事前確率のように機能します。これにより、状態とアクションのペア間でデータ量が不均一なデータセットから遷移行列が推定される場合、パフォーマンスが低下します。モデルフリーRLでは、割引正則化は加重平均ベルマン更新を使用した計画に相当し、エージェントはすべての状態とアクションのペアの値がデータによって示唆される値よりも近いかのように計画します。私たちの同値定理は、パラメーターをグローバルではなく個々の状態とアクションのペアに対してローカルに設定することで割引正則化を一般化する単純な手法の動機となります。割引正規化の失敗と、表形式と連続状態空間の両方の経験的例にわたって状態アクション固有の方法を使用してそれをどのように修正するかを示します。
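
The common interpretation mentioned in the abstract, that planning with a lower discount factor de-emphasizes delayed effects, can be seen on a toy problem. The three-state deterministic MDP below and the function name `plan` are hypothetical and chosen only for illustration; the sketch does not implement the paper's state-action-specific methods.

```python
from typing import Dict, Tuple

# Deterministic toy MDP (hypothetical): (state, action) -> (reward, next_state).
# State 2 is terminal and has no actions.
MDP: Dict[Tuple[int, str], Tuple[float, int]] = {
    (0, "now"): (1.0, 2),      # small immediate reward, then stop
    (0, "later"): (0.0, 1),    # no reward yet, move on to state 1
    (1, "collect"): (3.0, 2),  # larger delayed reward
}


def plan(gamma: float, sweeps: int = 50) -> Dict[int, str]:
    """Value iteration with discount factor gamma; returns the greedy policy."""
    values = {0: 0.0, 1: 0.0, 2: 0.0}
    for _ in range(sweeps):
        for state in (0, 1):
            values[state] = max(
                r + gamma * values[nxt]
                for (s, _), (r, nxt) in MDP.items() if s == state
            )
    policy = {}
    for state in (0, 1):
        options = {a: r + gamma * values[nxt]
                   for (s, a), (r, nxt) in MDP.items() if s == state}
        policy[state] = max(options, key=options.get)
    return policy
```

With `gamma = 0.9` the delayed reward is worth 0.9 × 3 = 2.7 in state 0, so the planner waits. With `gamma = 0.2` it is worth only 0.6, so the planner takes the immediate reward of 1 instead.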

PromptBench: A Unified Library for Evaluation of Large Language Models
PromptBench: 大規模言語モデルの評価のための統合ライブラリ

The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that can be easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed as an open, general, and flexible codebase for research purpose. It aims to facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.

大規模言語モデル(LLM)の評価は、そのパフォーマンスを評価し、潜在的なセキュリティリスクを軽減するために重要です。この論文では、LLMを評価するための統合ライブラリであるPromptBenchについて紹介します。これは、研究者が簡単に使用および拡張できるいくつかの主要コンポーネント(プロンプト構築、プロンプトエンジニアリング、データセットとモデルの読み込み、敵対的プロンプト攻撃、動的評価プロトコル、分析ツール)で構成されています。PromptBenchは、研究目的のためのオープンで一般的、かつ柔軟なコードベースとして設計されています。これは、新しいベンチマークの作成、ダウンストリームアプリケーションのデプロイ、および新しい評価プロトコルの設計における独自の研究を促進することを目的としています。コードはhttps://github.com/microsoft/promptbenchで入手でき、継続的にサポートされます。

Gaussian Interpolation Flows
ガウス補間フロー

Gaussian denoising has emerged as a powerful method for constructing simulation-free continuous normalizing flows for generative modeling. Despite their empirical successes, theoretical properties of these flows and the regularizing effect of Gaussian denoising have remained largely unexplored. In this work, we aim to address this gap by investigating the well-posedness of simulation-free continuous normalizing flows built on Gaussian denoising. Through a unified framework termed Gaussian interpolation flow, we establish the Lipschitz regularity of the flow velocity field, the existence and uniqueness of the flow, and the Lipschitz continuity of the flow map and the time-reversed flow map for several rich classes of target distributions. This analysis also sheds light on the auto-encoding and cycle consistency properties of Gaussian interpolation flows. Additionally, we study the stability of these flows in source distributions and perturbations of the velocity field, using the quadratic Wasserstein distance as a metric. Our findings offer valuable insights into the learning techniques employed in Gaussian interpolation flows for generative modeling, providing a solid theoretical foundation for end-to-end error analyses of learning Gaussian interpolation flows with empirical observations.

ガウスノイズ除去は、生成モデリングのためのシミュレーションフリーの連続正規化フローを構築する強力な方法として登場しました。実験的には成功していますが、これらのフローの理論的特性とガウスノイズ除去の正規化効果は、ほとんど未調査のままでした。この研究では、ガウスノイズ除去に基づいて構築されたシミュレーションフリーの連続正規化フローの適切性を調査することで、このギャップを埋めることを目指します。ガウス補間フローと呼ばれる統一されたフレームワークを通じて、フロー速度場のLipschitz正規性、フローの存在と一意性、およびいくつかの豊富なターゲット分布のクラスに対するフローマップと時間反転フローマップのLipschitz連続性を確立します。この分析は、ガウス補間フローの自動エンコードとサイクル一貫性の特性にも光を当てます。さらに、2次ワッサーシュタイン距離をメトリックとして使用して、ソース分布と速度場の摂動におけるこれらのフローの安定性を調査します。私たちの研究結果は、生成モデリングのためのガウス補間フローで採用されている学習手法に関する貴重な洞察を提供し、経験的観察によるガウス補間フローの学習のエンドツーエンドのエラー分析のための強固な理論的基礎を提供します。

Gaussian Mixture Models with Rare Events
まれなイベントを持つガウス混合モデル

We study a Gaussian mixture model (GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits an extremely slow numerical convergence rate. To theoretically understand this phenomenon, we formulate the numerical convergence problem of the EM algorithm with rare events data as a problem about a contraction operator. Theoretical analysis reveals that the spectral radius of the contraction operator in this case could be arbitrarily close to 1 asymptotically. This theoretical finding explains the empirically slow numerical convergence of the EM algorithm with rare events data. To overcome this challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. Compared with the standard EM algorithm, the key feature of the MEM algorithm is that it requires additional labeled data. We find that the MEM algorithm significantly improves the numerical convergence rate compared with the standard EM algorithm. The finite sample performance of the proposed method is illustrated by both simulation studies and a real-world dataset of Swedish traffic signs.

私たちは、稀少イベントデータを使用したガウス混合モデル(GMM)を研究します。この場合、一般的に使用される期待値最大化(EM)アルゴリズムは、数値収束率が極めて低くなります。この現象を理論的に理解するために、稀少イベントデータを使用したEMアルゴリズムの数値収束問題を、縮約演算子に関する問題として定式化します。理論的分析により、この場合の縮約演算子のスペクトル半径は、漸近的に1に任意に近づく可能性があることが明らかになりました。この理論的発見は、稀少イベントデータを使用したEMアルゴリズムの実験的な数値収束の遅さを説明しています。この課題を克服するために、部分的にラベル付けされたデータによって提供される情報を利用する混合EM (MEM)アルゴリズムが開発されました。標準のEMアルゴリズムと比較して、MEMアルゴリズムの主な特徴は、追加のラベル付けされたデータを必要とすることです。MEMアルゴリズムは、標準のEMアルゴリズムと比較して数値収束率を大幅に向上させることがわかりました。提案された方法の有限サンプルのパフォーマンスは、シミュレーション研究とスウェーデンの交通標識の実際のデータセットの両方によって実証されています。
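As a rough illustration of the setting described above, the following sketch runs plain EM on a one-dimensional two-component mixture in which one component is rare. The unit variances, the initialization, the sample sizes, and the optional `labels` argument (which simply clamps the responsibilities of labeled points) are all assumptions made for this example; the `labels` mechanism is only a stand-in for the idea of using partially labeled data and is not claimed to be the MEM algorithm of the paper.

```python
import math
import random

def em_two_gaussians(xs, labels=None, max_iter=500, tol=1e-8):
    """EM for the mixture pi * N(mu1, 1) + (1 - pi) * N(mu0, 1) in one dimension.

    labels[i] may be 0, 1, or None; a known label clamps that responsibility.
    Returns (pi, mu0, mu1, number_of_iterations).
    """
    n = len(xs)
    labels = labels or [None] * n
    pi, mu0, mu1 = 0.5, min(xs), max(xs)  # crude initialization (assumption)
    for it in range(1, max_iter + 1):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x, lab in zip(xs, labels):
            if lab is not None:
                resp.append(float(lab))
                continue
            a = pi * math.exp(-0.5 * (x - mu1) ** 2)
            b = (1 - pi) * math.exp(-0.5 * (x - mu0) ** 2)
            resp.append(a / (a + b))
        # M-step: update the mixing weight and the two means.
        s1 = sum(resp)
        new_pi = s1 / n
        new_mu1 = sum(r * x for r, x in zip(resp, xs)) / s1
        new_mu0 = sum((1 - r) * x for r, x in zip(resp, xs)) / (n - s1)
        change = abs(new_pi - pi) + abs(new_mu0 - mu0) + abs(new_mu1 - mu1)
        pi, mu0, mu1 = new_pi, new_mu0, new_mu1
        if change < tol:
            break
    return pi, mu0, mu1, it

rng = random.Random(0)
truth = [1 if rng.random() < 0.02 else 0 for _ in range(2000)]  # about 2% rare events
xs = [rng.gauss(4.0, 1.0) if z else rng.gauss(0.0, 1.0) for z in truth]

pi_hat, mu0_hat, mu1_hat, iters = em_two_gaussians(xs)
# Reveal the labels of every fifth point (a hypothetical labeling budget).
partial = [z if i % 5 == 0 else None for i, z in enumerate(truth)]
pi_lab, mu0_lab, mu1_lab, iters_lab = em_two_gaussians(xs, labels=partial)
print(iters, iters_lab)
```

Comparing `iters` and `iters_lab` gives a feel for how labeled points affect the number of iterations in this toy problem; no particular speed-up is asserted here.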

On the Concentration of the Minimizers of Empirical Risks
経験的リスクの最小化者の集中について

Obtaining guarantees on the convergence of the minimizers of empirical risks to the ones of the true risk is a fundamental matter in statistical learning. Instead of deriving guarantees on the usual estimation error, the goal of this paper is to provide concentration inequalities on the distance between the sets of minimizers of the risks for a broad spectrum of estimation problems. In particular, the risks are defined on metric spaces through probability measures that are also supported on metric spaces. Particular attention is therefore given to including unbounded spaces and non-convex cost functions that might also be unbounded. This work identifies a set of high-level assumptions that describe a regime which seems to govern the concentration in many estimation problems, where the empirical minimizers are stable. This stability can then be leveraged to prove parametric concentration rates in probability and in expectation. The assumptions are verified, and the bounds showcased, on a selection of estimation problems such as barycenters on metric spaces with positive or negative curvature, subspaces of covariance matrices, regression problems and entropic-Wasserstein barycenters.

経験的リスクの最小化が真のリスクの最小化に収束することを保証することは、統計学習における基本的な事項です。通常の推定誤差の保証を導く代わりに、本論文の目的は、幅広い推定問題に対するリスクの最小化集合間の距離に関する集中不等式を提供することです。特に、リスクは、距離空間でもサポートされている確率測度を通じて距離空間上で定義されます。したがって、特に注意を払うのは、無制限の空間と、無制限である可能性のある非凸コスト関数を含めることです。この研究では、多くの推定問題で集中を支配すると思われる体制を記述できる一連の高レベルの仮定を特定し、そこでは経験的最小化が安定しています。この安定性を利用して、確率と期待値におけるパラメトリック集中率を証明できます。正または負の曲率を持つ距離空間上の重心、共分散行列の部分空間、回帰問題、エントロピー・ワッサーシュタイン重心などの推定問題の選択において、仮定が検証され、境界が示されます。

Variance estimation in graphs with the fused lasso
Fused Lassoを用いたグラフ上の分散推定

We study the problem of variance estimation in general graph-structured problems. First, we develop a linear time estimator for the homoscedastic case that can consistently estimate the variance in general graphs. We show that our estimator attains minimax rates for the chain and 2D grid graphs when the mean signal has total variation with canonical scaling. Furthermore, we provide general upper bounds on the mean squared error performance of the fused lasso estimator in general graphs under a moment condition and a bound on the tail behavior of the errors. These upper bounds allow us to generalize for broader classes of distributions, such as sub-Exponential, many existing results on the fused lasso that are only known to hold with the assumption that errors are sub-Gaussian random variables. Exploiting our upper bounds, we then study a simple total variation regularization estimator for estimating the signal of variances in the heteroscedastic case. We also provide lower bounds showing that our heteroscedastic variance estimator attains minimax rates for estimating signals of bounded variation in grid graphs, and $K$-nearest neighbor graphs, and the estimator is consistent for estimating the variances in any connected graph.

私たちは、一般的なグラフ構造の問題における分散推定の問題を研究します。まず、一般的なグラフの分散を一貫して推定できる、等分散の場合の線形時間推定量を開発します。平均信号が正準スケーリングによる全変動を持つ場合、推定量がチェーングラフと2Dグリッドグラフのミニマックスレートを達成することを示す。さらに、モーメント条件の下での一般的なグラフにおける融合Lasso推定量の平均二乗誤差パフォーマンスの一般的な上限と、誤差のテール動作の上限を提供します。これらの上限により、誤差がガウス分布以下のランダム変数であるという仮定の下でのみ成立することが知られている融合Lassoに関する多くの既存の結果を、サブ指数分布などのより広範な分布クラスに一般化することができます。次に、我々の上限を利用して、異分散の場合の分散信号を推定するための単純な全変動正規化推定量を研究します。また、私たちの異分散分散推定量がグリッドグラフや$K$近傍グラフにおける有界変動の信号を推定するためのミニマックス率を達成し、推定量が任意の接続グラフにおける分散の推定に一貫していることを示す下限も提供します。
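To make the idea of a linear-time variance estimator on a chain graph concrete, here is a minimal sketch of the classical difference-based estimator, which averages squared differences along the edges. It is shown only as an illustration under the assumption of homoscedastic noise and a slowly varying mean; it is not claimed to be the estimator proposed in the paper, and the signal, noise level, and sample size are made up for the example.

```python
import random

def chain_variance_estimate(y):
    """Half the mean squared difference along the chain edges (i, i + 1).

    If the mean signal changes little across most edges, each squared
    difference has expectation close to 2 * sigma^2, so halving the
    average gives an estimate of sigma^2 in a single O(n) pass.
    """
    n = len(y)
    return sum((y[i + 1] - y[i]) ** 2 for i in range(n - 1)) / (2 * (n - 1))

rng = random.Random(0)
n, sigma = 5000, 0.5
signal = [0.0 if i < n // 2 else 2.0 for i in range(n)]  # piecewise constant, one jump
y = [m + rng.gauss(0.0, sigma) for m in signal]

sigma2_hat = chain_variance_estimate(y)
print(sigma2_hat)  # should be close to sigma**2 = 0.25
```

The single jump in the mean contributes a bias of order 1/n, which is why a signal of small total variation leaves the estimate nearly unaffected.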

Random measure priors in Bayesian recovery from sketches
スケッチからのベイズ回復におけるランダム測度事前確率

This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol’s empirical frequency given the sketch. This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general “traits” setting, where each data point has integer levels of association with multiple symbols, typically referred to as “traits”. By employing a generalized Indian buffet process, we compute the posterior distribution of a trait’s frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions.

この論文では、ランダムハッシュによって得られたスケッチに含まれるすべての情報を活用し、非可逆圧縮された離散データから頻度を回復するためのベイジアンノンパラメトリックアプローチを紹介します。データポイントを、ポアソン-キングマン事前分布を備えた未知の離散分布からのランダムサンプルとしてモデル化することにより、スケッチが与えられた場合のシンボルの実験頻度の事後分布を導出します。これにより、事後平均、中央値、最頻値などの平均関数を通じて、原理に基づいた頻度推定が得られます。この一般的な結果のディリクレ過程およびピットマン-ヨー過程事前分布への適用に焦点を当てます。特に、前者の事前分布は事後分布を簡素化する十分性プロパティを一意に満たし、後者は便利な大規模サンプルの漸近近似を可能にすることを証明します。さらに、このアプローチをカーディナリティ回復の問題に拡張し、スケッチされたデータセット内の異なるシンボルの数を推定します。頻度回復に対する当社のアプローチは、より一般的な「特性」設定にも適応します。この設定では、各データポイントが複数のシンボル(通常「特性」と呼ばれます)との整数レベルの関連を持ちます。一般化されたインドビュッフェプロセスを採用することで、特性の関連レベルにポアソン分布とベルヌーイ分布の両方を使用して特性の頻度の事後分布を計算し、それぞれ正確な事後頻度分布と近似事後頻度分布を生成します。
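The kind of sketch discussed above can be illustrated with a single random hash function that maps symbols to a small number of buckets and keeps only the bucket counts. The code below is a minimal sketch under that assumption; the hash construction, the bucket count, and the toy data are hypothetical. It shows only the lossy compression step and the trivial upper bound it yields, not the Bayesian posterior derived in the paper.

```python
import hashlib
from collections import Counter

def bucket(symbol, num_buckets, seed=0):
    """Map a symbol to a bucket index with a seeded hash (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{symbol}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def sketch(data, num_buckets, seed=0):
    """Compress the data into bucket counts; symbol identities are lost."""
    counts = [0] * num_buckets
    for symbol in data:
        counts[bucket(symbol, num_buckets, seed)] += 1
    return counts

data = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
num_buckets = 3
counts = sketch(data, num_buckets)

true_freq = Counter(data)
# The count of a symbol's bucket is an upper bound on its empirical frequency,
# because colliding symbols add to the same counter.
upper_bounds = {s: counts[bucket(s, num_buckets)] for s in true_freq}
print(counts, upper_bounds)
```

With four symbols and three buckets at least two symbols must collide, so some bounds are strictly loose; recovering a sharper frequency estimate from such counts is the problem the abstract addresses.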

From continuous-time formulations to discretization schemes: tensor trains and robust regression for BSDEs and parabolic PDEs
連続時間定式化から離散化スキームへ:BSDEと放物線偏微分方程式のためのテンソル列とロバスト回帰

The numerical approximation of partial differential equations (PDEs) poses formidable challenges in high dimensions since classical grid-based methods suffer from the so-called curse of dimensionality. Recent attempts rely on a combination of Monte Carlo methods and variational formulations, using neural networks for function approximation. Extending previous work (Richter et al., 2021), we argue that tensor trains provide an appealing framework for parabolic PDEs: The combination of reformulations in terms of backward stochastic differential equations and regression-type methods holds the promise of leveraging latent low-rank structures, enabling both compression and efficient computation. Emphasizing a continuous-time viewpoint, we develop iterative schemes, which differ in terms of computational efficiency and robustness. We demonstrate both theoretically and numerically that our methods can achieve a favorable trade-off between accuracy and computational efficiency. While previous methods have been either accurate or fast, we have identified a novel numerical strategy that can often combine both of these aspects.

偏微分方程式(PDE)の数値近似は、古典的なグリッドベースの方法にはいわゆる次元の呪いがあるため、高次元では困難な課題となります。最近の試みは、モンテカルロ法と変分定式化の組み合わせに依存しており、関数近似にニューラルネットワークを使用しています。以前の研究(Richterら, 2021)を拡張して、テンソル列が放物型PDEの魅力的なフレームワークを提供すると主張します。後方確率微分方程式と回帰型方法による再定式化の組み合わせは、潜在的な低ランク構造を活用し、圧縮と効率的な計算の両方を可能にする可能性を秘めています。連続時間の観点を重視して、計算効率と堅牢性の点で異なる反復スキームを開発します。私たちの方法が精度と計算効率の間で好ましいトレードオフを実現できることを理論的にも数値的にも実証します。以前の方法は正確か高速かのどちらかでしたが、私たちはこれらの両方の側面を組み合わせることができる新しい数値戦略を特定しました。

Label Alignment Regularization for Distribution Shift
分布シフトのためのラベルアラインメント正則化

Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this observation, we propose a regularization method for unsupervised domain adaptation that encourages alignment between the predictions in the target domain and its top singular vectors. Unlike conventional domain adaptation approaches that focus on regularizing representations, we instead regularize the classifier to align with the unsupervised target data, guided by the LAP in both the source and target domains. Theoretical analysis demonstrates that, under certain assumptions, our solution resides within the span of the top right singular vectors of the target domain data and aligns with the optimal solution. By removing the reliance on the commonly used optimal joint risk assumption found in classic domain adaptation theory, we showcase the effectiveness of our method on addressing problems where traditional domain adaptation methods often fall short due to high joint error. Additionally, we report improved performance over domain adaptation baselines in well-known tasks such as MNIST-USPS domain adaptation and cross-lingual sentiment analysis. An implementation is available at https://github.com/EhsanEI/lar/.

最近の研究では、教師あり学習におけるラベルアラインメントプロパティ(LAP)が強調されています。これは、データセット内のすべてのラベルのベクトルが、データマトリックスの上位数個の特異ベクトルの範囲内にほとんど収まるというものです。この観察からヒントを得て、ターゲットドメインの予測とその上位特異ベクトル間のアラインメントを促進する教師なしドメイン適応の正則化方法を提案します。表現の正則化に重点を置く従来のドメイン適応アプローチとは異なり、ソースドメインとターゲットドメインの両方でLAPをガイドとして、分類器を教師なしターゲットデータと一致するように正則化します。理論分析により、特定の仮定の下で、私たちのソリューションはターゲットドメインデータの右上の特異ベクトルの範囲内にあり、最適なソリューションと一致することが実証されています。従来のドメイン適応理論で一般的に使用されている最適なジョイントリスクの仮定への依存を排除することで、従来のドメイン適応方法ではジョイントエラーが高く、多くの場合不十分な問題に対処する私たちの方法の有効性を示します。さらに、MNIST-USPSドメイン適応やクロスリンガル感情分析などのよく知られたタスクにおいて、ドメイン適応ベースラインよりもパフォーマンスが向上したことを報告します。実装はhttps://github.com/EhsanEI/lar/で入手できます。

Fairness in Survival Analysis with Distributionally Robust Optimization
分布ロバスト最適化による生存時間解析の公平性

We propose a general approach for encouraging fairness in survival analysis models that is based on minimizing a worst-case error across all subpopulations that are “large enough” (occurring with at least a user-specified probability threshold). This approach can be used to convert a wide variety of existing survival analysis models into ones that simultaneously encourage fairness, without requiring the user to specify which attributes or features to treat as sensitive in the training loss function. From a technical standpoint, our approach applies recent methodological developments of distributionally robust optimization (DRO) to survival analysis. The complication is that existing DRO theory uses a training loss function that decomposes across contributions of individual data points, i.e., any term that shows up in the loss function depends only on a single training point. This decomposition does not hold for commonly used survival loss functions, including for the standard Cox proportional hazards model, its deep neural network variants, and many other recently developed survival analysis models that use loss functions involving ranking or similarity score calculations. We address this technical hurdle using a sample splitting strategy. We demonstrate our sample splitting DRO approach by using it to create fair versions of a diverse set of existing survival analysis models including the classical Cox model (and its deep neural network variant DeepSurv), the discrete-time model DeepHit, and the neural ODE model SODEN. We also establish a finite-sample theoretical guarantee to show what our sample splitting DRO loss converges to. Specifically for the Cox model, we further derive an exact DRO approach that does not use sample splitting. For all the survival models that we convert into DRO variants, we show that the DRO variants often score better on recently established fairness metrics (without incurring a significant drop in accuracy) compared to existing survival analysis fairness regularization techniques, including ones which directly use sensitive demographic information in their training loss functions.

私たちは、生存分析モデルにおける公平性を促進するための一般的なアプローチを提案します。このアプローチは、すべての「十分に大きい」（少なくともユーザーが指定した確率しきい値で発生する）サブポピュレーション全体で最悪のケースのエラーを最小化することに基づいています。このアプローチを使用すると、トレーニング損失関数でどの属性または機能を敏感として扱うかをユーザーが指定する必要なく、既存のさまざまな生存分析モデルを公平性を同時に促進するモデルに変換できます。技術的な観点から、我々のアプローチは、分布ロバスト最適化（DRO）の最近の方法論的発展を生存分析に適用します。複雑なのは、既存のDRO理論が、個々のデータポイントの寄与にわたって分解するトレーニング損失関数を使用していることです。つまり、損失関数に表示される項は、単一のトレーニングポイントにのみ依存します。この分解は、標準的なCox比例ハザードモデル、そのディープニューラルネットワークのバリアント、およびランキングや類似性スコアの計算を含む損失関数を使用する他の多くの最近開発された生存分析モデルなど、一般的に使用されている生存損失関数には当てはまりません。私たちは、サンプル分割戦略を使用してこの技術的なハードルに対処します。サンプル分割DROアプローチを実証するために、従来のCoxモデル(およびそのディープニューラルネットワークバリアントDeepSurv)、離散時間モデルDeepHit、ニューラルODEモデルSODENなど、既存のさまざまな生存分析モデルの公平なバージョンを作成します。また、サンプル分割DRO損失が何に収束するかを示すために、有限サンプルの理論的保証を確立します。特にCoxモデルについては、サンプル分割を使用しない正確なDROアプローチをさらに導出します。DROバリアントに変換するすべての生存モデルについて、DROバリアントは、トレーニング損失関数で機密性の高い人口統計情報を直接使用するものを含む、既存の生存分析公平性正規化手法と比較して、最近確立された公平性メトリックで(精度が大幅に低下することなく)優れたスコアを示すことが多いことを示します。
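The worst-case objective over "large enough" subpopulations can be sketched for the simple case where the loss decomposes over individual data points. Under that assumption, the worst subpopulation of at least a fraction `alpha` of the sample is the set of points with the largest losses, so the objective reduces to an average over the top losses. The function name and the toy loss values below are hypothetical, and this is exactly the decomposable case that, according to the abstract, does not cover Cox-type losses without the sample splitting strategy.

```python
import math

def worst_case_subpopulation_loss(losses, alpha):
    """Mean of the ceil(alpha * n) largest per-point losses.

    Any subset containing at least a fraction alpha of the points has an
    average loss no larger than this value, so it upper-bounds the loss
    on every such subpopulation without naming a sensitive attribute.
    """
    n = len(losses)
    k = max(1, math.ceil(alpha * n))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical per-point losses: two individuals are served much worse.
losses = [0.1, 0.2, 0.1, 0.3, 2.0, 1.8, 0.2, 0.1, 0.15, 0.05]
average_loss = sum(losses) / len(losses)
robust_loss = worst_case_subpopulation_loss(losses, alpha=0.2)
print(average_loss, robust_loss)
```

Minimizing `robust_loss` instead of `average_loss` pushes a model to improve on whichever group it currently serves worst, which is the intuition behind the fairness claim.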

FineMorphs: Affine-Diffeomorphic Sequences for Regression
FineMorphs: 回帰のためのアフィン-微分同相シーケンス

A multivariate regression model of affine and diffeomorphic transformation sequences—FineMorphs—is presented. Leveraging concepts from shape analysis, model states are optimally “reshaped” by diffeomorphisms generated by smooth vector fields during learning. Affine transformations and vector fields are optimized within an optimal control setting, and the model can naturally reduce (or increase) dimensionality and adapt to large data sets via sub-optimal vector fields. An existence proof of solution and necessary conditions for optimality for the model are derived. Experimental results on real data sets from the UCI repository are presented, with favorable results in comparison with state-of-the-art in the literature, neural ordinary differential equation models, and densely-connected neural networks in TensorFlow.

アフィン変換シーケンスと微分同相変換シーケンスの多変量回帰モデル(FineMorphs)が表示されます。形状解析の概念を活用して、モデルの状態は、学習中に滑らかなベクトル場によって生成された微分同相によって最適に「再形成」されます。アフィン変換とベクトル場は最適な制御設定内で最適化され、モデルは自然に次元を減少(または増加)し、最適でないベクトル場を介して大規模なデータセットに適応できます。解の存在証明とモデルの最適性に必要な条件が導出されます。UCIリポジトリの実際のデータセットでの実験結果が提示され、最先端の文献、ニューラル常微分方程式モデル、およびTensorFlowの密に接続されたニューラルネットワークと比較して良好な結果が得られます。

Tensor-train methods for sequential state and parameter learning in state-space models
状態空間モデルにおける逐次状態およびパラメータ学習のためのテンソル列法

We consider sequential state and parameter learning in state-space models with intractable state transition and observation processes. By exploiting low-rank tensor train (TT) decompositions, we propose new sequential learning methods for joint parameter and state estimation under the Bayesian framework. Our key innovation is the introduction of scalable function approximation tools such as TT for recursively learning the sequentially updated posterior distributions. The function approximation perspective of our methods offers tractable error analysis and potentially alleviates the particle degeneracy faced by many particle-based methods. In addition to the new insights into the algorithmic design, our methods complement conventional particle-based methods. Our TT-based approximations naturally define conditional Knothe–Rosenblatt (KR) rearrangements that lead to parameter estimation, filtering, smoothing and path estimation accompanying our sequential learning algorithms, which open the door to removing potential approximation bias. We also explore several preconditioning techniques based on either linear or nonlinear KR rearrangements to enhance the approximation power of TT for practical problems. We demonstrate the efficacy and efficiency of our proposed methods on several state-space models, in which our methods achieve state-of-the-art estimation accuracy and computational performance.

私たちは、扱いにくい状態遷移と観測プロセスを持つ状態空間モデルにおける逐次状態およびパラメータ学習について考察します。低ランクのテンソル列(TT)分解を利用して、ベイズ枠組みの下での共同パラメータおよび状態推定のための新しい逐次学習法を提案します。主な革新は、逐次更新事後分布を再帰的に学習するためのTTなどのスケーラブルな関数近似ツールの導入です。関数近似の観点から見た私たちの方法は扱いやすい誤差分析を提供し、多くの粒子ベースの方法が直面する粒子の退化を軽減できる可能性があります。アルゴリズム設計への新しい洞察に加えて、私たちの方法は従来の粒子ベースの方法を補完します。TTベースの近似は、条件付きKnothe-Rosenblatt (KR)再配置を自然に定義し、逐次学習アルゴリズムに伴うパラメータ推定、フィルタリング、スムージング、およびパス推定につながるため、潜在的な近似バイアスを排除する道が開かれます。また、実用的な問題に対するTTの近似力を高めるために、線形または非線形KR再配置に基づくいくつかの前処理手法も検討します。提案手法の有効性と効率性をいくつかの状態空間モデルで実証し、最先端の推定精度と計算性能を実現します。

Memory of recurrent networks: Do we compute it right?
リカレントネットワークのメモリ:正しく計算されていますか?

Numerical evaluations of the memory capacity (MC) of recurrent neural networks reported in the literature often contradict well-established theoretical bounds. In this paper, we study the case of linear echo state networks, for which the total memory capacity has been proven to be equal to the rank of the corresponding Kalman controllability matrix. We shed light on various reasons for the inaccurate numerical estimations of the memory, and we show that these issues, often overlooked in the recent literature, are of an exclusively numerical nature. More explicitly, we prove that when the Krylov structure of the linear MC is ignored, a gap between the theoretical MC and its empirical counterpart is introduced. As a solution, we develop robust numerical approaches by exploiting a result of MC neutrality with respect to the input mask matrix. Simulations show that the memory curves that are recovered using the proposed methods fully agree with the theory.

文献で報告されているリカレントニューラルネットワークの記憶容量(MC)の数値評価は、確立された理論的限界と矛盾することがよくあります。この論文では、総メモリ容量が対応するカルマン制御可能性行列のランクに等しいことが証明されている線形エコー状態ネットワークのケースを研究します。私たちは、記憶の不正確な数値推定のさまざまな理由を明らかにし、最近の文献では見過ごされがちなこれらの問題が、もっぱら数値的な性質のものであることを示しています。より明確に言えば、線形MCのクリロフ構造を無視すると、理論的なMCとその経験的な対応物との間にギャップが生じることを証明します。解決策として、入力マスク行列に対するMC中立性の結果を利用して、ロバストな数値アプローチを開発します。シミュレーションの結果、提案された方法を用いて復元された記憶曲線は、この理論と完全に一致していることが示されています。
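The theoretical quantity mentioned above, the rank of the Kalman controllability matrix, can be computed exactly for small examples. The sketch below assumes a linear state recursion of the form x_t = A x_{t-1} + c z_t with a single input channel, builds the Krylov matrix [c, Ac, ..., A^{n-1} c], and evaluates its rank with exact rational arithmetic. The matrices are toy examples chosen for illustration, and the symbols A and c are assumptions of this sketch rather than notation taken from the paper.

```python
from fractions import Fraction

def mat_vec(A, v):
    """Multiply the square matrix A (list of rows) by the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def rank(columns):
    """Rank of the matrix with the given columns, by exact Gaussian elimination."""
    rows = [list(r) for r in zip(*columns)]  # transpose into row-major form
    n_rows, n_cols = len(rows), len(rows[0])
    rk = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rk, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, n_rows):
            factor = rows[r][col] / rows[rk][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
    return rk

def controllability_rank(A, c):
    """Rank of the Krylov matrix [c, A c, ..., A^(n-1) c]."""
    columns, v = [], list(c)
    for _ in range(len(A)):
        columns.append(v)
        v = mat_vec(A, v)
    return rank(columns)

def diag(values):
    return [[v if i == j else Fraction(0) for j in range(len(values))]
            for i, v in enumerate(values)]

ones = [Fraction(1)] * 3
full = controllability_rank(diag([Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]), ones)
deficient = controllability_rank(diag([Fraction(1, 2), Fraction(1, 2), Fraction(1, 4)]), ones)
print(full, deficient)
```

Exact fractions are used because Krylov matrices tend to be badly conditioned, so a floating-point rank computation on the same matrix may report a smaller value; this is in the spirit of the numerical issues the abstract describes.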

The Loss Landscape of Deep Linear Neural Networks: a Second-order Analysis
深層線形ニューラルネットワークの損失ランドスケープ:二次解析

We study the optimization landscape of deep linear neural networks with square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms’ dynamics, have only been lightly studied. We go a step further with a complete analysis of the optimization landscape at order $2$. Among all critical points, we characterize global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that has been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.

私たちは、二乗損失を持つ深層線形ニューラルネットワークの最適化ランドスケープを研究しています。弱い仮定の下では、偽の局所的最小値も局所的最大値も存在しないことが知られています。しかし、一階アルゴリズムのダイナミクスに役割を果たすことができる非厳密なサドルポイントの存在と多様性は、あまり研究されていませんでした。さらに一歩進んで、$2$次までの最適化ランドスケープの完全な分析を行います。すべての臨界点の中で、グローバルミニマイザー、厳密なサドルポイント、および非厳密なサドルポイントを特徴付けます。関連するすべての臨界値を列挙します。特性評価は単純で、部分行列積のランクに関する条件を含み、線形ニューラルネットワークを最適化するときに証明または観察されたグローバル収束または暗黙的な正則化に光を当てます。ついでに、すべてのグローバルミニマイザーの集合の明示的なパラメーター化を提供し、厳密なサドルポイントと非厳密なサドルポイントの大規模な集合を示します。

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise
サブワイブルノイズを伴う非凸確率的勾配降下法の高確率収束限界

Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.

確率的勾配降下法は、機械学習で使用される最も一般的な反復アルゴリズムの1つであり、その収束分析は豊富な研究分野です。その収束特性を理解することは、さまざまな設定でどのような修正を使用すればよいかを判断するのに役立ちます。ただし、ほとんどの理論的結果は、凸性を前提としているか、平均でのみ収束結果を示しています。一方、この論文では、凸性を前提とせずに高確率で収束境界を証明しています。強い平滑性を前提として、2つの設定で高確率収束境界を証明しています。(1) Polyak-Łojasiewicz不等式とノルムサブガウス勾配ノイズを想定、(2)ノルムサブワイブル勾配ノイズを想定。2番目の設定では、収束を証明するための中間ステップとして、それ自体興味深い、サブワイブルマルチンゲール差分列に対する自己正規化集中不等式を証明します。これは、フリードマン型の集中不等式をサブ指数のしきい値を超えて、より重い裾を持つマルチンゲール差分列に拡張します。また、未知の最適な反復に対する通常の境界ではなく、証明可能な収束保証を持つ単一の反復を選択する後処理方法も提供します。サブワイブルノイズに対する私たちの収束結果は、確率的勾配降下法が、クリッピング、モメンタム、正規化などの変更を加えた確率的勾配降下法と同等かそれ以上の収束保証を持つ領域を拡張します。

Euler Characteristic Tools for Topological Data Analysis
トポロジカルデータ解析のためのオイラー特性ツール

In this article, we study Euler characteristic techniques in topological data analysis. Pointwise computing the Euler characteristic of a family of simplicial complexes built from data gives rise to the so-called Euler characteristic profile. We show that this simple descriptor achieves state-of-the-art performance in supervised tasks at a meagre computational cost. Inspired by signal analysis, we compute hybrid transforms of Euler characteristic profiles. These integral transforms mix Euler characteristic techniques with Lebesgue integration to provide highly efficient compressors of topological signals. As a consequence, they show remarkable performances in unsupervised settings. On the qualitative side, we provide numerous heuristics on the topological and geometric information captured by Euler profiles and their hybrid transforms. Finally, we prove stability results for these descriptors as well as asymptotic guarantees in random settings.

この記事では、トポロジカルデータ解析におけるオイラー特性手法について研究します。データから構築された単体複体の族のオイラー特性を点単位で計算すると、いわゆるオイラー特性プロファイルが生じます。この単純な記述子が、わずかな計算コストで教師ありタスクにおいて最先端のパフォーマンスを達成することを示します。信号解析に触発されて、オイラー特性プロファイルのハイブリッド変換を計算します。これらの積分変換は、オイラー特性手法とルベーグ積分を組み合わせ、トポロジカル信号の高効率なコンプレッサーを提供します。その結果、これらは教師なしの設定で優れたパフォーマンスを示します。定性的側面では、オイラープロファイルとそのハイブリッド変換によってキャプチャされたトポロジカルおよび幾何学的情報に関する多数のヒューリスティックを提供します。最後に、これらの記述子の安定性結果と、ランダム設定での漸近保証を証明します。
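An Euler characteristic profile can be computed by hand for a tiny point cloud. The sketch below assumes a Vietoris-Rips filtration (a simplex is present at scale r when all its vertices are pairwise within distance r) and evaluates the alternating sum of simplex counts at a few scales. The choice of filtration, the point cloud, and the scales are assumptions made for this example, and the brute-force enumeration is only suitable for very small inputs.

```python
import math
from itertools import combinations

def euler_characteristic(points, r):
    """Euler characteristic of the Vietoris-Rips complex of `points` at scale r.

    chi = (#vertices) - (#edges) + (#triangles) - ..., where a set of points
    spans a simplex when all pairwise distances are at most r.
    """
    n = len(points)
    close = [[math.dist(p, q) <= r for q in points] for p in points]
    chi = 0
    for size in range(1, n + 1):
        count = sum(
            1
            for simplex in combinations(range(n), size)
            if all(close[i][j] for i, j in combinations(simplex, 2))
        )
        chi += (-1) ** (size - 1) * count
    return chi

# Corners of the unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
scales = [0.5, 1.0, 1.5]
profile = [euler_characteristic(square, r) for r in scales]
print(profile)  # [4, 0, 1]
```

At scale 0.5 there are four isolated points, at scale 1.0 the four sides form a loop (four vertices minus four edges), and at scale 1.5 the complex is a filled tetrahedron, which is contractible.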

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization
ニューラルネットワークにおける深さ縮退:初期化時の全結合ReLUネットワークにおける消失角度

Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.

さまざまなタスクで優れたパフォーマンスを発揮するにもかかわらず、ディープニューラルネットワークの多くの特性はまだ理論的に解明されていません。そのような謎の1つが、深度退化現象です。ネットワークを深くすればするほど、ネットワークは初期化時に定数関数に近づきます。この論文では、ReLUニューラルネットワークへの2つの入力間の角度の変化を、レイヤー数の関数として調べます。組み合わせ展開を使用することで、深度が増すにつれてこの角度が0に近づく速さを正確に表す式を見つけます。これらの式は、無限幅制限の一般的なフレームワークでは見えない微視的な変動を捉え、質的に異なる予測につながります。モンテカルロ実験で理論的結果を検証し、結果が有限ネットワークの動作を正確に近似していることを示します。また、深度退化現象が実際のネットワークのトレーニングに悪影響を与える可能性があることを経験的に調査します。式は、ReLU関数に渡される相関ガウス分布の混合モーメントで示されます。また、これらの混合モーメントとベッセル数の間には驚くべき組み合わせ関係があり、これによってこれらのモーメントを明示的に評価できるようになります。
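The shrinking angle can be observed in a small Monte Carlo run. The sketch below pushes two orthogonal inputs through a randomly initialized fully connected ReLU network and records the angle between their hidden representations after every layer. The width, depth, seed, and the Gaussian initialization with variance 2 / fan_in are assumptions of this example, and a single random draw is shown rather than the averaged quantities analyzed in the paper.

```python
import math
import random

def angle(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

def relu_layer(weights, x):
    """Apply one bias-free fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * a for w, a in zip(row, x))) for row in weights]

def angles_through_depth(x, y, width, depth, rng):
    """Angle between the two representations at the input and after each layer."""
    out = [angle(x, y)]
    for _ in range(depth):
        fan_in = len(x)
        std = math.sqrt(2.0 / fan_in)  # assumed initialization scale
        weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(width)]
        x, y = relu_layer(weights, x), relu_layer(weights, y)
        out.append(angle(x, y))
    return out

rng = random.Random(0)
angles = angles_through_depth([1.0, 0.0], [0.0, 1.0], width=64, depth=10, rng=rng)
print([round(a, 3) for a in angles])
```

Because ReLU outputs are nonnegative, the inner product of the two representations cannot be negative after the first layer, so the angle never exceeds a right angle and in practice keeps decreasing with depth.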

Fortuna: A Library for Uncertainty Quantification in Deep Learning
Fortuna:深層学習における不確実性定量化のためのライブラリ

We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems.

私たちは、ディープラーニングにおける不確実性の定量化のためのオープンソースライブラリであるFortunaを紹介します。Fortunaは、任意の学習済みニューラルネットワークに適用して信頼性の高い不確実性推定値を生成できるコンフォーマル予測や、ゼロから学習したディープニューラルネットワークに適用して不確実性の定量化と精度を向上させるスケーラブルなベイズ推論手法など、さまざまなキャリブレーション手法をサポートしています。Fortunaは、高度な不確実性定量化手法のための一貫したフレームワークを提供することで、ベンチマークのプロセスを簡素化し、実務家が堅牢なAIシステムを構築できるよう支援します。
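Conformal prediction, mentioned above as a technique that can be applied to any trained model, can be sketched in a few lines for regression. The code below implements generic split conformal prediction with absolute residuals as scores. It deliberately does not use Fortuna and does not reproduce its API; the `predict` function is a hypothetical stand-in for an already-trained model, and the synthetic data are made up for the example.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]

def predict(x):
    # Stand-in for any already-trained regression model (hypothetical).
    return 2.0 * x

def sample(rng, size):
    """Draw (x, y) pairs with y = 2x + standard Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(size)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]

rng = random.Random(0)
calibration, test_points = sample(rng, 500), sample(rng, 2000)

# Calibrate: absolute residuals on held-out data give the interval half-width.
scores = [abs(y - predict(x)) for x, y in calibration]
q = conformal_quantile(scores, alpha=0.1)

# Evaluate: fraction of test targets inside [predict(x) - q, predict(x) + q].
covered = sum(1 for x, y in test_points if abs(y - predict(x)) <= q)
coverage = covered / len(test_points)
print(q, coverage)  # coverage should be near the 0.9 target
```

The coverage guarantee relies only on exchangeability of calibration and test points, which is why the procedure does not depend on how the model was trained.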

Characterization of translation invariant MMD on Rd and connections with Wasserstein distances
Rd 上の変換不変 MMD の特徴とワッサーシュタイン距離との関連

Kernel mean embeddings and maximum mean discrepancies (MMD) associated with positive definite kernels are important tools in machine learning that make it possible to compare probability measures and sample distributions. We provide a full characterization of translation invariant MMDs on $\mathbb{R}^d$ that are parametrized by a spectral measure and a positive semi-definite symmetric matrix. Furthermore, we investigate the connections between translation invariant MMDs and Wasserstein distances on $\mathbb{R}^d$. We show in particular that convergence with respect to the MMD associated with the Energy Kernel of order $\alpha\in(0,1)$ implies convergence with respect to the Wasserstein distance of order $\beta<\alpha$. We also provide examples of kernels metrizing the Wasserstein space of order $\alpha\geq 1$. A short numerical experiment illustrates our findings in the framework of the one-sample-test.

正定値カーネルに関連するカーネル平均の埋め込みと最大平均不一致(MMD)は、確率測定とサンプル分布を比較できる機械学習の重要なツールです。スペクトル測度と半定値正対称行列によってパラメータ化される$\mathbb{R}^d$上の並進不変MMDの完全な特性評価を提供します。さらに、移動不変MMDと$\mathbb{R}^d$上のWasserstein距離との間の関係を調査します。特に、次数$\alpha\in(0,1)$のエネルギーカーネルに関連付けられたMMDに対する収束は、次数$\beta<\alpha$のワッサーシュタイン距離に対する収束を意味することを示します。また、次数$\alpha\geq 1$のWasserstein空間をメートル化するカーネルの例も提供します。短い数値実験は、1サンプル検定のフレームワークで私たちの発見を示しています。
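For readers who want to experiment with an energy-type MMD, the following sketch estimates the squared MMD between two samples. It assumes one common convention for the energy kernel of order alpha, under which the squared MMD equals 2 E|X - Y|^alpha - E|X - X'|^alpha - E|Y - Y'|^alpha; the paper's normalization may differ by a constant factor. The plug-in estimator and the Gaussian toy samples are choices made for this example.

```python
import math
import random

def mean_pairwise(xs, ys, alpha):
    """Average of |x - y|^alpha over all pairs (x in xs, y in ys)."""
    total = sum(math.dist(x, y) ** alpha for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def energy_mmd2(xs, ys, alpha=0.5):
    """Plug-in estimate of the squared energy MMD of order alpha."""
    return (2 * mean_pairwise(xs, ys, alpha)
            - mean_pairwise(xs, xs, alpha)
            - mean_pairwise(ys, ys, alpha))

def gaussian_sample(rng, size, shift):
    """Two-dimensional standard Gaussian points translated by `shift` along x."""
    return [(rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)) for _ in range(size)]

rng = random.Random(0)
xs = gaussian_sample(rng, 200, shift=0.0)
ys = gaussian_sample(rng, 200, shift=1.0)

same = energy_mmd2(xs, xs)     # identical samples
shifted = energy_mmd2(xs, ys)  # distributions differ by a translation
print(same, shifted)
```

The statistic vanishes when a sample is compared with itself and is positive for the translated sample, which is the behaviour a two-sample or one-sample test would exploit.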

On the Hyperparameters in Stochastic Gradient Descent with Momentum
運動量を伴う確率的勾配降下法におけるハイパーパラメータについて

Following the same routine as Shi et al. (2023), we present in this paper the theoretical analysis for stochastic gradient descent with momentum (SGD with momentum). In contrast to plain SGD, we demonstrate that for SGD with momentum the two hyperparameters together, the learning rate and the momentum coefficient, play a significant role in the linear convergence rate in non-convex optimization. Our analysis is based on a hyperparameters-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. As in the earlier work, we establish linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison with SGD, where they depend only on the learning rate, we demonstrate how the optimal linear rate of convergence and the final gap vary as the momentum coefficient increases from zero to one once momentum is introduced. Then, we propose a mathematical interpretation of why, in practice, SGD with momentum converges faster and is more robust to the learning rate than standard stochastic gradient descent (SGD). Finally, we show that Nesterov momentum in the presence of noise has no essential difference from traditional momentum.

Shiら(2023)と同じ手順に従い、本論文では引き続き、モメンタム付き確率的勾配降下法(SGD with momentum)の理論分析を提示します。これとは異なり、SGD with momentumでは、学習率とモメンタム係数の2つのハイパーパラメータが、非凸最適化における線形収束率に重要な役割を果たすことを示します。この分析は、SGD with momentumの連続的なサロゲートとして機能するハイパーパラメータ依存の確率微分方程式(hp依存SDE)を使用することに基づいています。同様に、SGD with momentumの連続時間定式化の線形収束を確立し、Kramers-Fokker-Planck演算子のスペクトルを解析することにより、最適線形速度の明示的な表現を取得します。比較により、SGDのみの最適線形収束率と学習率に関する最終ギャップが、モメンタムが導入されたときにモメンタム係数が0から1に増加するとどのように変化するかを示します。次に、実際には、モーメンタムを使用したSGDが標準的な確率的勾配降下法(SGD)よりも速く収束し、学習率においてより堅牢である理由について、数学的な解釈を提案します。最後に、ノイズが存在する場合のネステロフモーメンタムは、従来のモーメンタムと本質的な違いがないことを示します。
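The interplay between the learning rate and the momentum coefficient can be seen on a toy problem. The sketch below runs the heavy-ball form of SGD with momentum on an ill-conditioned quadratic with additive Gaussian gradient noise and compares it with plain SGD (momentum coefficient zero) at the same learning rate. The objective, the noise level, the step count, and the particular velocity parameterization are assumptions of this example; the paper itself analyzes non-convex objectives through a continuous-time surrogate.

```python
import random

def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    return [theta[0], 100.0 * theta[1]]

def loss(theta):
    return 0.5 * (theta[0] ** 2 + 100.0 * theta[1] ** 2)

def sgd_momentum(lr, beta, steps, noise_std, seed=0):
    """Heavy-ball SGD: v <- beta * v - lr * g, theta <- theta + v."""
    rng = random.Random(seed)
    theta, velocity = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        # Noisy gradient: exact gradient plus independent Gaussian noise.
        g = [gi + rng.gauss(0.0, noise_std) for gi in grad(theta)]
        velocity = [beta * v - lr * gi for v, gi in zip(velocity, g)]
        theta = [t + v for t, v in zip(theta, velocity)]
    return loss(theta)

plain_loss = sgd_momentum(lr=0.01, beta=0.0, steps=200, noise_std=0.01)
momentum_loss = sgd_momentum(lr=0.01, beta=0.9, steps=200, noise_std=0.01)
print(plain_loss, momentum_loss)
```

With this learning rate plain SGD contracts the flat direction by only a factor of 0.99 per step, whereas the momentum iteration contracts both directions at roughly the square root of the momentum coefficient, which is why it ends much closer to the minimum after the same number of steps.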

Improved Random Features for Dot Product Kernels
ドット積カーネルのランダム特徴の改善

Dot product kernels, such as polynomial and exponential (softmax) kernels, are among the most widely used kernels in machine learning, as they enable modeling the interactions between input features, which is crucial in applications like computer vision, natural language processing, and recommender systems. We make several novel contributions for improving the efficiency of random feature approximations for dot product kernels, to make these kernels more useful in large scale learning. First, we present a generalization of existing random feature approximations for polynomial kernels, such as Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random features. We show empirically that the use of complex features can significantly reduce the variances of these approximations. Second, we provide a theoretical analysis for understanding the factors affecting the efficiency of various random feature approximations, by deriving closed-form expressions for their variances. These variance formulas elucidate conditions under which certain approximations (e.g., TensorSRHT) achieve lower variances than others (e.g., Rademacher sketches), and conditions under which the use of complex features leads to lower variances than real features. Third, by using these variance formulas, which can be evaluated in practice, we develop a data-driven optimization approach to improve random feature approximations for general dot product kernels, which is also applicable to the Gaussian kernel. We describe the improvements brought by these contributions with extensive experiments on a variety of tasks and datasets.

多項式カーネルや指数（ソフトマックス）カーネルなどのドット積カーネルは、入力特徴間の相互作用をモデル化できるため、機械学習で最も広く使用されているカーネルの1つです。これは、コンピュータービジョン、自然言語処理、レコメンデーションシステムなどのアプリケーションで重要です。本論文では、ドット積カーネルのランダム特徴近似の効率を改善し、これらのカーネルを大規模学習でより有用なものにするための、いくつかの新しい貢献を行います。まず、複素数値のランダム特徴を使用して、RademacherスケッチやGaussianスケッチ、TensorSRHTなどの多項式カーネルの既存のランダム特徴近似の一般化を示します。複素特徴を使用すると、これらの近似の分散を大幅に削減できることを経験的に示します。次に、分散の閉形式の表現を導出することにより、さまざまなランダム特徴近似の効率に影響を与える要因を理解するための理論的分析を提供します。これらの分散式は、特定の近似(TensorSRHTなど)が他の近似(Rademacherスケッチなど)よりも低い分散を達成する条件と、複素特徴の使用が実数特徴よりも低い分散につながる条件を明らかにします。3番目に、実際に評価できるこれらの分散式を使用して、一般的なドット積カーネルのランダム特徴近似を改善するためのデータ駆動型最適化アプローチを開発します。これはガウスカーネルにも適用できます。さまざまなタスクとデータセットでの広範な実験により、これらの貢献によってもたらされた改善を示します。
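The following is a minimal sketch, not the paper's implementation, of Rademacher random features for the polynomial kernel $(x^\top y)^p$, in a real-valued and a complex-valued variant. The function names `rademacher_features` and `approx_kernel` are hypothetical, and the complex variant assumes entries drawn uniformly from $\{1, -1, i, -i\}$. TensorSRHT and the data-driven optimization mentioned in the abstract are not covered.

```python
import numpy as np


def rademacher_features(X, degree, n_features, rng, complex_valued=False):
    """Random feature map Z such that E[Z(x) . conj(Z(y))] = (x . y) ** degree.

    X has shape (n_samples, n_dims). Each output feature is a product of
    `degree` independent random projections of the input.
    """
    n, d = X.shape
    Z = np.ones((n, n_features), dtype=complex if complex_valued else float)
    for _ in range(degree):
        if complex_valued:
            # Assumption: entries uniform on the four complex roots of unity.
            W = rng.choice(np.array([1, -1, 1j, -1j]), size=(d, n_features))
        else:
            # Classical real Rademacher entries in {+1, -1}.
            W = rng.choice(np.array([1.0, -1.0]), size=(d, n_features))
        Z *= X @ W
    return Z / np.sqrt(n_features)


def approx_kernel(Zx, Zy):
    """Kernel estimate from feature maps; the real part is kept for complex features."""
    return np.real(Zx @ Zy.conj().T)
```

Comparing `approx_kernel(Z, Z)` against the exact matrix `(X @ X.T) ** degree` over repeated draws gives an empirical view of the variance of each variant on a given data set.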

Regret Analysis of Bilateral Trade with a Smoothed Adversary
平滑化された敵対者との二者間取引のリグレット分析

We study repeated bilateral trade where an adaptive $\sigma$-smooth adversary generates the valuations of sellers and buyers. We completely characterize the regret regimes for fixed-price mechanisms under different feedback models in the two cases where the learner can post the same or different prices to buyers and sellers. We begin by showing that, in the full-feedback scenario, the minimax regret after $T$ rounds is of order $\sqrt{T}$. Under partial feedback, any algorithm that has to post the same price to buyers and sellers suffers worst-case linear regret. However, when the learner can post two different prices at each round, we design an algorithm enjoying regret of order $T^{3/4}$, ignoring log factors. We prove that this rate is optimal by presenting a surprising $T^{3/4}$ lower bound, which is the paper’s main technical contribution.

私たちは、適応的な$\sigma$-滑らかな敵対者が売り手と買い手の評価額を生成する、繰り返し二者間取引を研究します。学習者が買い手と売り手に同じ価格を提示する場合と異なる価格を提示できる場合の2つのケースについて、さまざまなフィードバックモデルの下での固定価格メカニズムのリグレットの領域を完全に特徴付けます。まず、フルフィードバックのシナリオでは、$T$ラウンド後のミニマックスリグレットが$\sqrt{T}$のオーダーであることを示します。部分フィードバックの下では、買い手と売り手に同じ価格を提示しなければならないアルゴリズムは、最悪の場合、線形リグレットを被ります。ただし、学習者が各ラウンドで2つの異なる価格を提示できる場合には、対数因子を無視して$T^{3/4}$のオーダーのリグレットを達成するアルゴリズムを設計します。さらに、驚くべき$T^{3/4}$の下限を示すことで、このレートが最適であることを証明します。この下限が本論文の主な技術的貢献です。

Invariant Physics-Informed Neural Networks for Ordinary Differential Equations
常微分方程式のための不変物理情報ニューラルネットワーク

Physics-informed neural networks have emerged as a prominent new method for solving differential equations. While conceptually straightforward, they often suffer training difficulties that lead to relatively large discretization errors or the failure to obtain correct solutions. In this paper we introduce invariant physics-informed neural networks for ordinary differential equations that admit a finite-dimensional group of Lie point symmetries. Using the method of equivariant moving frames, a differential equation is invariantized to obtain a, generally, simpler equation in the space of differential invariants. A solution to the invariantized equation is then mapped back to a solution of the original differential equation by solving the reconstruction equations for the left moving frame. The invariantized differential equation together with the reconstruction equations are solved using a physics-informed neural network, and form what we call an invariant physics-informed neural network. We illustrate the method with several examples, all of which considerably outperform standard non-invariant physics-informed neural networks.

物理学に基づくニューラルネットワークは、微分方程式を解くための新しい有力な方法として登場しました。概念的には単純ですが、トレーニングが困難で、離散化エラーが比較的大きくなったり、正しい解が得られなかったりすることがよくあります。この論文では、有限次元のリー点対称群を許容する常微分方程式用の不変物理学に基づくニューラルネットワークを紹介します。等変移動フレーム法を使用して、微分方程式を不変化し、微分不変量の空間で一般に簡単な方程式を取得します。次に、左移動フレームの再構成方程式を解くことで、不変化された方程式の解を元の微分方程式の解にマッピングし直します。不変化された微分方程式と再構成方程式は、物理学に基づくニューラルネットワークを使用して解かれ、不変物理学に基づくニューラルネットワークと呼ばれるものを形成します。この方法をいくつかの例で説明しますが、そのすべてが標準的な非不変物理学に基づくニューラルネットワークを大幅に上回っています。

Distribution Learning via Neural Differential Equations: A Nonparametric Statistical Perspective
ニューラル微分方程式による分布学習:ノンパラメトリック統計的視点

Ordinary differential equations (ODEs), via their induced flow maps, provide a powerful framework to parameterize invertible transformations for representing complex probability distributions. While such models have achieved enormous success in machine learning, little is known about their statistical properties. This work establishes the first general nonparametric statistical convergence analysis for distribution learning via ODE models trained through likelihood maximization. We first prove a convergence theorem applicable to arbitrary velocity field classes $\mathcal{F}$ satisfying certain simple boundary constraints. This general result captures the trade-off between the approximation error and complexity of the ODE model. We show that the latter can be quantified via the $C^1$-metric entropy of the class $\mathcal{F}$. We then apply this general framework to the setting of $C^k$-smooth target densities, and establish nearly minimax-optimal convergence rates for two relevant velocity field classes $\mathcal{F}$: $C^k$ functions and neural networks. The latter is the practically important case of neural ODEs. Our results also provide insight on how the choice of velocity field class, and the dependence of this choice on sample size (e.g., the scaling of neural network classes), impact statistical performance.

常微分方程式(ODE)は、その誘導フローマップを介して、複雑な確率分布を表す可逆変換をパラメータ化する強力なフレームワークを提供します。このようなモデルは機械学習で大きな成功を収めていますが、その統計的特性についてはほとんどわかっていません。この研究では、尤度最大化によってトレーニングされたODEモデルを介して、分布学習に対する最初の一般的なノンパラメトリック統計収束分析を確立します。まず、特定の単純な境界制約を満たす任意の速度場クラス$\mathcal{F}$に適用できる収束定理を証明します。この一般的な結果は、近似誤差とODEモデルの複雑さの間のトレードオフを捉えています。後者は、クラス$\mathcal{F}$の$C^1$メトリックエントロピーを介して定量化できることを示します。次に、この一般的なフレームワークを$C^k$滑らかなターゲット密度の設定に適用し、2つの関連する速度場クラス$\mathcal{F}$、$C^k$関数とニューラルネットワークのほぼミニマックス最適収束率を確立します。後者はニューラルODEの実際上重要なケースです。私たちの結果は、速度場クラスの選択と、この選択のサンプルサイズへの依存性(ニューラルネットワーククラスのスケーリングなど)が統計パフォーマンスにどのように影響するかについても洞察を提供します。

Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression
マルチ出力ニューラルネットワークのための変動空間:マルチタスク学習とネットワーク圧縮に関する洞察

This paper introduces a novel theoretical framework for the analysis of vector-valued neural networks through the development of vector-valued variation spaces, a new class of reproducing kernel Banach spaces. These spaces emerge from studying the regularization effect of weight decay in training networks with activation functions like the rectified linear unit (ReLU). This framework offers a deeper understanding of multi-output networks and their function-space characteristics. A key contribution of this work is the development of a representer theorem for the vector-valued variation spaces. This representer theorem establishes that shallow vector-valued neural networks are the solutions to data-fitting problems over these infinite-dimensional spaces, where the network widths are bounded by the square of the number of training data. This observation reveals that the norm associated with these vector-valued variation spaces encourages the learning of features that are useful for multiple tasks, shedding new light on multi-task learning with neural networks. Finally, this paper develops a connection between weight-decay regularization and the multi-task lasso problem. This connection leads to novel bounds for layer widths in deep networks that depend on the intrinsic dimensions of the training data representations. This insight not only deepens the understanding of the deep network architectural requirements, but also yields a simple convex optimization method for deep neural network compression. The performance of this compression procedure is evaluated on various architectures.

この論文では、ベクトル値変動空間(新しい種類の再生カーネルバナッハ空間)の開発を通じて、ベクトル値ニューラルネットワークの分析のための新しい理論的枠組みを紹介します。これらの空間は、ReLU (Rectified Linear Unit)などの活性化関数を持つネットワークの訓練における重み減衰の正則化効果の研究から生まれました。この枠組みにより、マルチ出力ネットワークとその関数空間特性をより深く理解できます。本研究の重要な貢献は、ベクトル値変動空間に対するリプレゼンター定理の開発です。このリプレゼンター定理は、これらの無限次元空間上のデータフィッティング問題の解が浅いベクトル値ニューラルネットワークであり、そのネットワーク幅がトレーニングデータ数の2乗で抑えられることを示します。この観察から、これらのベクトル値変動空間に関連付けられたノルムが、複数のタスクに役立つ特徴の学習を促進することがわかり、ニューラルネットワークによるマルチタスク学習に新たな光が当てられます。最後に、この論文では、重み減衰正則化とマルチタスクLasso問題の関係を明らかにします。この関係により、トレーニングデータ表現の固有次元に依存する、ディープネットワークのレイヤー幅の新しい上界が導き出されます。この洞察により、ディープネットワークのアーキテクチャ要件の理解が深まるだけでなく、ディープニューラルネットワーク圧縮のためのシンプルな凸最適化手法も得られます。この圧縮手順のパフォーマンスは、さまざまなアーキテクチャで評価されます。

Individual-centered Partial Information in Social Networks
ソーシャルネットワークにおける個人中心の部分的情報

In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length $L$ and gives rise to a partial adjacency matrix. Under $L=2$, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive theoretical properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual’s partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure.

統計的ネットワーク分析では、ネットワーク全体が利用可能であるか、または複数のサブグラフをサンプリングしてネットワークのさまざまなグローバル特性を推定できると想定されることが多い。しかし、実際のソーシャルネットワークでは、人々はネットワークのローカルビューのみに基づいて決定を下すことが多い。ここでは、特定の個人を中心としたローカルネットワークをパス長$L$で特徴付け、部分隣接行列を生成する部分情報フレームワークを検討します。$L=2$の下では、一般的な確率ブロックモデル(SBM)とその次数補正バリアント(DCSBM)を使用した(グローバル)コミュニティ検出の問題に焦点を当てる。部分隣接行列の信号項から固有値と固有ベクトルの理論的特性を導出し、適切な条件下で一貫性を実現する新しいスペクトルベースのコミュニティ検出アルゴリズムを提案します。また、この分析により、グローバルコミュニティ構造を決定する際の個人の部分情報の重要性を評価する新しい中心性尺度を提案することも可能になります。シミュレーションされたネットワークと実際のネットワークを使用して、アルゴリズムのパフォーマンスを実証し、中心性測定を他の一般的な代替手段と比較して、それが固有のノード情報をキャプチャすることを示します。私たちの結果は、部分情報フレームワークによって、グローバル構造に関するさまざまな個人の視点を比較できることを示しています。

Data-driven Automated Negative Control Estimation (DANCE): Search for, Validation of, and Causal Inference with Negative Controls
データ駆動型自動ネガティブコントロール推定(DANCE):ネガティブコントロールの探索、検証、およびそれを用いた因果推論

Negative control variables are increasingly used to adjust for unmeasured confounding bias in causal inference using observational data. They are typically identified by subject matter knowledge and there is currently a severe lack of data-driven methods to find negative controls. In this paper, we present a statistical test for discovering negative controls of a special type—disconnected negative controls—that can serve as surrogates of the unmeasured confounder, and we incorporate that test into the Data-driven Automated Negative Control Estimation (DANCE) algorithm. DANCE first uses the new validation test to identify subsets of a set of candidate negative control variables that satisfy the assumptions of disconnected negative controls. It then applies a negative control method to each pair of these validated negative control variables, and aggregates the output to produce an unbiased point estimate and confidence interval for a causal effect in the presence of unmeasured confounding. We (1) prove the correctness of this validation test, and thus of DANCE; (2) demonstrate via simulation experiments that DANCE outperforms both naive analysis ignoring unmeasured confounding and negative control method with randomly selected candidate negative controls; and (3) demonstrate the effectiveness of DANCE on a challenging real-world problem.

ネガティブコントロール変数は、観察データを使用した因果推論における測定されていない交絡バイアスを調整するためにますます使用されています。これらは通常、主題知識によって識別されますが、現在、ネガティブコントロールを見つけるためのデータ駆動型の方法が非常に不足しています。この論文では、測定されていない交絡因子の代理として機能することができる特別なタイプのネガティブコントロール（切断されたネガティブコントロール）を発見するための統計テストを提示し、そのテストをデータ駆動型自動ネガティブコントロール推定（DANCE）アルゴリズムに組み込みます。DANCEは最初に新しい検証テストを使用して、切断されたネガティブコントロールの仮定を満たす候補ネガティブコントロール変数セットのサブセットを識別します。次に、これらの検証されたネガティブコントロール変数の各ペアにネガティブコントロールメソッドを適用し、出力を集計して、測定されていない交絡がある場合の因果効果の偏りのない点推定値と信頼区間を生成します。（1）この検証テスト、ひいてはDANCEの正しさを証明します。（２）シミュレーション実験により、DANCEが、測定されていない交絡因子を無視した単純な分析と、ランダムに選択された候補ネガティブコントロールを用いたネガティブコントロール法の両方よりも優れていることを実証します。（３）困難な現実世界の問題に対するDANCEの有効性を実証します。

Continuous Prediction with Experts’ Advice
専門家のアドバイスによる連続予測

Prediction with experts’ advice is one of the most fundamental problems in online learning and captures many of its technical challenges. A recent line of work has looked at online learning through the lens of differential equations and continuous-time analysis. This viewpoint has yielded optimal results for several problems in online learning. In this paper, we employ continuous-time stochastic calculus in order to study the discrete-time experts’ problem. We use these tools to design a continuous-time, parameter-free algorithm with improved guarantees on the quantile regret. We then develop an analogous discrete-time algorithm with a very similar analysis and identical quantile regret bounds. Finally, we design an anytime continuous-time algorithm with regret matching the optimal fixed-time rate when the gains are independent Brownian motions; in many settings, this is the most difficult case. This gives some evidence that, even with adversarial gains, the optimal anytime and fixed-time regrets may coincide.

専門家のアドバイスによる予測は、オンライン学習における最も基本的な問題の1つであり、その技術的課題の多くを捉えています。最近の一連の研究では、微分方程式と連続時間解析の観点からオンライン学習が検討されています。この観点は、オンライン学習のいくつかの問題に対して最適な結果をもたらしました。この論文では、離散時間のエキスパート問題を研究するために、連続時間の確率解析を用います。これらのツールを使用して、分位点リグレットに対する保証が改善された、連続時間のパラメーターフリーアルゴリズムを設計します。次に、非常によく似た分析と同一の分位点リグレット上界を持つ、類似の離散時間アルゴリズムを開発します。最後に、ゲインが独立したブラウン運動である場合に、リグレットが最適な固定時間レートと一致するエニータイム(anytime)型の連続時間アルゴリズムを設計します。多くの設定では、これが最も難しいケースです。これは、敵対的なゲインであっても、エニータイムの最適リグレットと固定時間の最適リグレットが一致する可能性があることを示す一定の証拠となります。
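For orientation, here is a minimal sketch of the classical exponential-weights (Hedge) forecaster for the discrete-time experts problem with gains in $[0, 1]$. This is the textbook baseline, not the parameter-free or anytime algorithms of the paper. The function name `hedge` and the fixed learning rate `eta` are illustrative assumptions.

```python
import numpy as np


def hedge(gains, eta):
    """Run exponential weights on a (T, n_experts) array of gains in [0, 1].

    Returns the learner's cumulative expected gain and its regret
    against the best single expert in hindsight.
    """
    T, n = gains.shape
    cumulative = np.zeros(n)  # cumulative gain of each expert so far
    learner_gain = 0.0
    for t in range(T):
        # Weights proportional to exp(eta * cumulative gain); the max is
        # subtracted only for numerical stability.
        w = np.exp(eta * (cumulative - cumulative.max()))
        p = w / w.sum()
        learner_gain += float(p @ gains[t])
        cumulative += gains[t]
    regret = float(cumulative.max() - learner_gain)
    return learner_gain, regret
```

With a known horizon $T$ and $\eta = \sqrt{8 \ln n / T}$, the standard analysis bounds the regret by $\sqrt{(T/2) \ln n}$. The abstract concerns the harder cases where no such parameter is tuned in advance.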

Memory-Efficient Sequential Pattern Mining with Hybrid Tries
ハイブリッドトライによるメモリ効率の高いシーケンシャルパターンマイニング

This paper develops a memory-efficient approach for Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck for large data sets. Our methodology involves a novel hybrid trie data structure that exploits recurring patterns to compactly store the data set in memory; and a corresponding mining algorithm designed to effectively extract patterns from this compact representation. Numerical results on small to medium-sized real-life test instances show an average improvement of 85% in memory consumption and 49% in computation time compared to the state of the art. For large data sets, our algorithm stands out as the only capable SPM approach within 256GB of system memory, potentially saving 1.7TB in memory consumption.

この論文では、ナレッジディスカバリーの基本的なトピックであり、大規模なデータセットに対してメモリがボトルネックになることがよく知られているシーケンシャルパターンマイニング(SPM)のための、メモリ効率の高いアプローチを開発します。私たちの方法論は、繰り返し現れるパターンを利用してデータセットをメモリ上にコンパクトに格納する新しいハイブリッドトライデータ構造と、このコンパクトな表現からパターンを効果的に抽出するように設計された対応するマイニングアルゴリズムから構成されます。小規模から中規模の実データのテストインスタンスでの数値結果は、最先端の手法と比較して、メモリ消費量が平均85%、計算時間が平均49%改善されることを示しています。大規模なデータセットの場合、私たちのアルゴリズムは、256GBのシステムメモリ内で実行可能な唯一のSPMアプローチとして際立っており、メモリ消費量を1.7TB節約できる可能性があります。

Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds
低次元多様体上の方策最適化のためのニューラル方策ミラーディセントのサンプル複雑性

Policy gradient methods equipped with deep neural networks have achieved great success in solving high-dimensional reinforcement learning (RL) problems. However, current analyses cannot explain why they are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNN). Motivated by the empirical observation that many high-dimensional environments have state spaces possessing low-dimensional structures, such as those taking images as states, we consider the state space to be a $d$-dimensional manifold embedded in the $D$-dimensional Euclidean space with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the previous networks can be inherited. As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\tilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of environment. Compared to previous work, our result exhibits that NPMD can leverage the low-dimensional structure of state space to escape from the curse of dimensionality, explaining the efficacy of deep policy gradient algorithms.

ディープニューラルネットワークを備えたポリシー勾配法は、高次元強化学習(RL)問題を解決する上で大きな成功を収めてきました。しかし、現在の分析では、なぜ次元の呪いに耐性があるのか説明できません。この研究では、ディープ畳み込みニューラルネットワーク(CNN)を使用したニューラルポリシーミラー降下法(NPMD)アルゴリズムのサンプル複雑性を調べます。多くの高次元環境には、画像を状態としてとらえる環境など、低次元構造を持つ状態空間があるという経験的観察に基づいて、状態空間を、固有次元が$d\ll D$である$D$次元ユークリッド空間に埋め込まれた$d$次元多様体と見なします。NPMDの各反復で、価値関数とポリシーの両方がCNNによって適切に近似できることを示します。近似誤差はネットワークのサイズによって制御され、以前のネットワークの滑らかさを継承できます。その結果、ネットワークサイズとハイパーパラメータを適切に選択することで、NPMDは期待値の$\tilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$サンプルで$\epsilon$最適ポリシーを見つけることができます。ここで、$\alpha\in(0,1]$は環境の滑らかさを示します。以前の研究と比較して、私たちの結果は、NPMDが状態空間の低次元構造を活用して次元の呪いから逃れることができることを示しており、ディープポリシー勾配アルゴリズムの有効性を説明しています。

Split Conformal Prediction and Non-Exchangeable Data
分割共形予測と交換不可能なデータ

Split conformal prediction (CP) is arguably the most popular CP method for uncertainty quantification, enjoying both academic interest and widespread deployment. However, the original theoretical analysis of split CP makes the crucial assumption of data exchangeability, which hinders many real-world applications. In this paper, we present a novel theoretical framework based on concentration inequalities and decoupling properties of the data, proving that split CP remains valid for many non-exchangeable processes by adding a small coverage penalty. Through experiments with both real and synthetic data, we show that our theoretical results translate to good empirical performance under non-exchangeability, e.g., for time series and spatiotemporal data. Compared to recent conformal algorithms designed to counter specific exchangeability violations, we show that split CP is competitive in terms of coverage and interval size, with the benefit of being extremely simple and orders of magnitude faster than alternatives.

スプリット・コンフォーマル予測(CP)は、不確実性定量化のためのCP法としておそらく最も広く普及しており、学術的な関心を集めるとともに広く実運用されています。しかし、スプリットCPの元々の理論分析では、データの交換可能性という重要な仮定が置かれており、これが多くの現実世界への応用を妨げています。この論文では、集中不等式とデータのデカップリング特性に基づく新しい理論的枠組みを提示し、小さなカバレッジ・ペナルティを加えることで、スプリットCPが多くの交換不可能な過程に対しても有効であり続けることを証明します。実データと合成データの両方を使用した実験により、時系列データや時空間データなど、交換可能性が成り立たない状況でも、理論結果が良好な実証的パフォーマンスにつながることを示します。特定の交換可能性違反に対処するように設計された最近のコンフォーマルアルゴリズムと比較して、スプリットCPはカバレッジと区間サイズの点で競争力があり、非常にシンプルで、他の方法よりも桁違いに高速であるという利点があることを示します。
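The simplicity claimed for split CP can be seen in a minimal sketch of the standard procedure with absolute-residual scores. It assumes a pre-fitted point predictor passed as a callable. The name `split_conformal_interval` is hypothetical, and the coverage penalty that the paper derives for non-exchangeable data is not implemented here.

```python
import numpy as np


def split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
    """Prediction intervals from a held-out calibration set.

    `predict` maps an array of inputs to point predictions. Under
    exchangeability the intervals cover with probability >= 1 - alpha.
    """
    scores = np.abs(y_cal - predict(X_cal))  # nonconformity scores
    n = len(scores)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # If the calibration set is too small, the interval is unbounded.
    q = np.inf if k > n else np.sort(scores)[k - 1]
    preds = predict(X_test)
    return preds - q, preds + q
```

The only cost beyond fitting the predictor is one pass over the calibration set and a sort, which is consistent with the speed advantage described in the abstract.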

Structured Dynamic Pricing: Optimal Regret in a Global Shrinkage Model
ストラクチャード・ダイナミック・プライシング:グローバル・シュリンケージ・モデルにおける最適な後悔

We consider dynamic pricing strategies in a streamed longitudinal data set-up where the objective is to maximize, over time, the cumulative profit across a large number of customer segments. We consider a dynamic model with the consumers’ preferences as well as price sensitivity varying over time. Building on the well-known finding that consumers sharing similar characteristics act in similar ways, we consider a global shrinkage structure, which assumes that the consumers’ preferences across the different segments can be well approximated by a spatial autoregressive (SAR) model. In such a streamed longitudinal setup, we measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance. We propose a pricing policy based on penalized stochastic gradient descent (PSGD) and explicitly characterize its regret as functions of time, the temporal variability in the model parameters as well as the strength of the auto-correlation network structure spanning the varied customer segments. Our regret analysis results not only demonstrate asymptotic optimality of the proposed policy but also show that for policy planning it is essential to incorporate available structural information as policies based on unshrunken models are highly sub-optimal in the aforementioned set-up. We conduct simulation experiments across a wide range of regimes as well as real-world networks based studies and report encouraging performance for our proposed method.

私たちは、多数の顧客セグメントにわたる累積利益を時間の経過とともに最大化することを目的とする、ストリーミングされた縦断的データ設定における動的価格設定戦略について検討します。私たちは、消費者の嗜好と価格感度が時間とともに変化する動的モデルを検討します。同様の特性を持つ消費者は同様の行動をとるというよく知られた発見に基づいて、異なるセグメントにわたる消費者の嗜好が空間自己回帰(SAR)モデルによって十分に近似できると仮定するグローバル収縮構造を検討します。このようなストリーミングされた縦断的設定では、動的価格設定ポリシーのパフォーマンスを後悔(モデルパラメータのシーケンスを事前に知っている千里眼と比較した予測収益損失)によって測定します。私たちは、ペナルティ付き確率的勾配降下法(PSGD)に基づく価格設定ポリシーを提案し、その後悔を時間の関数、モデルパラメータの時間的変動、およびさまざまな顧客セグメントにまたがる自己相関ネットワーク構造の強度として明示的に特徴付けます。私たちの後悔分析の結果は、提案されたポリシーの漸近最適性を証明するだけでなく、縮小されていないモデルに基づくポリシーは前述の設定では非常に最適ではないため、ポリシー計画には利用可能な構造情報を組み込むことが不可欠であることを示しています。私たちは、さまざまな体制と実際のネットワークに基づく研究でシミュレーション実験を行い、提案された方法の有望なパフォーマンスを報告しています。

Sparse Graphical Linear Dynamical Systems
スパースグラフィカル線形力学系

Time-series datasets are central in machine learning with applications in numerous fields of science and engineering, such as biomedicine, Earth observation, and network analysis. Extensive research exists on state-space models (SSMs), which are powerful mathematical tools that allow for probabilistic and interpretable learning on time series. Learning the model parameters in SSMs is arguably one of the most complicated tasks, and the inclusion of prior knowledge is known to both ease the interpretation but also to complicate the inferential tasks. Very recent works have attempted to incorporate a graphical perspective on some of those model parameters, but they present notable limitations that this work addresses. More generally, existing graphical modeling tools are designed to incorporate either static information, focusing on statistical dependencies among independent random variables (e.g., graphical Lasso approach), or dynamic information, emphasizing causal relationships among time series samples (e.g., graphical Granger approaches). However, there are no joint approaches combining static and dynamic graphical modeling within the context of SSMs. This work proposes a novel approach to fill this gap by introducing a joint graphical modeling framework that bridges the graphical Lasso model and a causal-based graphical approach for the linear-Gaussian SSM. We present DGLASSO (Dynamic Graphical Lasso), a new inference method within this framework that implements an efficient block alternating majorization-minimization algorithm. The algorithm’s convergence is established by departing from modern tools from nonlinear analysis. Experimental validation on various synthetic data showcases the effectiveness of the proposed model and inference algorithm. This work will significantly contribute to the understanding and utilization of time-series data in diverse scientific and engineering applications where incorporating a graphical approach is essential to perform the inference.

時系列データセットは機械学習の中心であり、バイオメディカル、地球観測、ネットワーク分析など、科学や工学のさまざまな分野で応用されています。状態空間モデル(SSM)に関する研究は広範に行われており、これは時系列の確率的かつ解釈可能な学習を可能にする強力な数学的ツールです。SSMのモデルパラメータの学習は、おそらく最も複雑なタスクの1つであり、事前知識を含めると解釈が容易になる一方で推論タスクが複雑になることが知られています。ごく最近の研究では、それらのモデルパラメータの一部にグラフィカルな視点を取り入れることが試みられましたが、それらには顕著な制限があり、本研究はその制限に対処します。より一般的には、既存のグラフィカルモデリングツールは、独立したランダム変数間の統計的依存関係に重点を置いた静的情報(グラフィカルLassoアプローチなど)、または時系列サンプル間の因果関係を強調した動的情報(グラフィカルGrangerアプローチなど)のいずれかを組み込むように設計されています。ただし、SSMのコンテキスト内で静的グラフィカルモデリングと動的グラフィカルモデリングを組み合わせた共同アプローチはありません。この研究では、グラフィカルLassoモデルと線形ガウスSSMの因果ベースのグラフィカルアプローチを橋渡しする共同グラフィカルモデリングフレームワークを導入することで、このギャップを埋める新しいアプローチを提案します。このフレームワーク内で、効率的なブロック交互メジャー化最小化アルゴリズムを実装する新しい推論方法であるDGLASSO (Dynamic Graphical Lasso)を紹介します。このアルゴリズムの収束は、非線形解析の最新ツールを出発点として確立されます。さまざまな合成データでの実験的検証により、提案されたモデルと推論アルゴリズムの有効性が実証されています。本研究は、推論を実行するためにグラフィカルアプローチを組み込むことが不可欠な、さまざまな科学および工学アプリケーションにおける時系列データの理解と利用に大きく貢献します。

Statistical analysis for a penalized EM algorithm in high-dimensional mixture linear regression model
高次元混合線形回帰モデルにおけるペナルティ付きEMアルゴリズムの統計解析

The expectation-maximization (EM) algorithm and its variants are widely used in statistics. In high-dimensional mixture linear regression, the model is assumed to be a finite mixture of linear regression and the number of predictors is much larger than the sample size. The standard EM algorithm, which attempts to find the maximum likelihood estimator, becomes infeasible for such model. We devise a group lasso penalized EM algorithm and study its statistical properties. Existing theoretical results of regularized EM algorithms often rely on dividing the sample into many independent batches and employing a fresh batch of sample in each iteration of the algorithm. Our algorithm and theoretical analysis do not require sample-splitting, and can be extended to multivariate response cases. The proposed methods also have encouraging performances in numerical studies.

期待値最大化(EM)アルゴリズムとそのバリアントは、統計学で広く使用されています。高次元混合線形回帰では、モデルは線形回帰の有限混合であると想定され、予測変数の数はサンプルサイズよりもはるかに大きくなります。最尤推定量を見つけようとする標準のEMアルゴリズムは、そのようなモデルでは実行不可能になります。本論文では、グループLassoペナルティ付きEMアルゴリズムを考案し、その統計的特性を研究します。正則化されたEMアルゴリズムの既存の理論的結果は、多くの場合、サンプルを多くの独立したバッチに分割し、アルゴリズムの各反復で新しいバッチを使用することに依存しています。私たちのアルゴリズムと理論分析はサンプル分割を必要とせず、多変量応答の場合に拡張できます。提案された方法は、数値実験においても有望な性能を示します。

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
証明可能な後悔限界を持つ分布的およびリスク感受性強化学習の橋渡し

We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. By leveraging a key property of the EntRM, the independence property, we establish the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one. We prove that they both attain $\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\right)$ regret upper bound, where $S$, $A$, $K$, $H$, $T=KH$, and $\beta$ represent the number of states, actions, episodes, time horizon, number of total time-steps, and risk parameter respectively. It matches RSVI2, with novel distributional analysis that focuses on the distributions of returns rather than the risk values associated with these returns. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity. To address the computational inefficiencies inherent in the model-free DRL algorithm, we propose an alternative DRL algorithm with distribution representation. This approach effectively represents any bounded distribution using a refined distribution class. It significantly amplifies computational efficiency while maintaining the established regret bounds. We also prove a tighter minimax lower bound of $\Omega\left(\frac{\exp(\beta H/6)-1}{\beta }\sqrt{SAT}\right)$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.

私たちは、分布強化学習(DRL)法を用いて、リスクに敏感な強化学習(RSRL)の後悔保証を研究します。特に、リターンのエントロピーリスク尺度(EntRM)を目的とする有限エピソードマルコフ決定過程を検討します。EntRMの重要な特性である独立性特性を活用して、リスクに敏感な分布動的計画法フレームワークを確立します。次に、モデルフリーとモデルベースの2つの異なるスキームを通じて楽観主義を実装する2つの新しいDRLアルゴリズムを提案します。両方とも、$\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\right)$の後悔上限を達成することを証明します。ここで、$S$、$A$、$K$、$H$、$T=KH$、および$\beta$は、それぞれ状態数、アクション数、エピソード数、時間範囲、合計時間ステップ数、およびリスクパラメータを表す。これは、リターンに関連するリスク値ではなく、リターンの分布に焦点を当てた新しい分布分析により、RSVI2と一致します。私たちの知る限り、これはサンプルの複雑さに関してDRLとRSRLをつなぐ最初のリグレット分析です。モデルフリーDRLアルゴリズムに固有の計算上の非効率性に対処するために、分布表現を使用した代替DRLアルゴリズムを提案します。このアプローチは、洗練された分布クラスを使用して、任意の制限付き分布を効果的に表現します。確立されたリグレット境界を維持しながら、計算効率を大幅に向上させます。また、$\beta>0$の場合のより厳しいミニマックス下限$\Omega\left(\frac{\exp(\beta H/6)-1}{\beta }\sqrt{SAT}\right)$を証明します。これは、リスク中立設定で厳しい下限$\Omega(H\sqrt{SAT})$を回復します。
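As a small illustration of the objective, the entropic risk measure of a discrete return distribution can be computed as below. The sketch assumes the common convention $\mathrm{EntRM}_\beta(X) = \frac{1}{\beta}\log \mathbb{E}[e^{\beta X}]$, with $\beta \to 0$ recovering the expectation; the paper's exact sign convention is not given in the abstract. The function name `entropic_risk` is hypothetical.

```python
import numpy as np


def entropic_risk(returns, beta, probs=None):
    """Entropic risk (1 / beta) * log E[exp(beta * X)] of a discrete return X.

    `returns` are the support points and `probs` their probabilities
    (uniform if omitted). beta = 0 is treated as the risk-neutral mean.
    """
    returns = np.asarray(returns, dtype=float)
    if probs is None:
        probs = np.full(returns.shape, 1.0 / returns.size)
    probs = np.asarray(probs, dtype=float)
    if beta == 0:
        return float(np.sum(probs * returns))
    a = beta * returns
    m = np.max(a)  # shift by the max for a numerically stable log-sum-exp
    return float((m + np.log(np.sum(probs * np.exp(a - m)))) / beta)
```

By Jensen's inequality the value lies above the mean for $\beta > 0$ and below it for $\beta < 0$, so under this convention the sign of $\beta$ selects between risk-seeking and risk-averse evaluation of the same return distribution.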

Low-Rank Matrix Estimation in the Presence of Change-Points
変化点の存在下での低ランク行列推定

We consider a general trace regression model with multiple structural changes and propose a universal approach for simultaneous exact or near-low-rank matrix recovery and change-point detection. It incorporates nuclear norm penalized least-squares minimization into a grid search scheme that determines the potential structural break. Under a set of general conditions, we establish the non-asymptotic error bounds with a nearly-oracle rate for the matrix estimators as well as the super-consistency rate for the change-point localization. We use concrete random design instances to justify the appropriateness of the proposed conditions. Numerical results demonstrate the validity and effectiveness of the proposed scheme.

私たちは、複数の構造変化を伴う一般的なトレース回帰モデルを検討し、厳密に、またはほぼ低ランクである行列の復元と変化点検出を同時に行うための普遍的なアプローチを提案します。このアプローチは、核ノルムのペナルティ付き最小二乗最小化を、潜在的な構造変化点を決定するグリッド探索スキームに組み込んでいます。一連の一般的な条件下で、行列推定量に対するほぼオラクルレートの非漸近誤差上界と、変化点の位置推定に対する超一致性のレートを確立します。提案された条件の適切性を正当化するために、具体的なランダムデザインの例を使用します。数値結果は、提案されたスキームの妥当性と有効性を示しています。

A Framework for Improving the Reliability of Black-box Variational Inference
ブラックボックス変分推論の信頼性を向上させるためのフレームワーク

Black-box variational inference (BBVI) now sees widespread use in machine learning and statistics as a fast yet flexible alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, stochastic optimization methods for BBVI remain unreliable and require substantial expertise and hand-tuning to apply effectively. In this paper, we propose robust and automated black-box VI (RABVI), a framework for improving the reliability of BBVI optimization. RABVI is based on rigorously justified automation techniques, includes just a small number of intuitive tuning parameters, and detects inaccurate estimates of the optimal variational approximation. RABVI adaptively decreases the learning rate by detecting convergence of the fixed–learning-rate iterates, then estimates the symmetrized Kullback–Leibler (KL) divergence between the current variational approximation and the optimal one. It also employs a novel optimization termination criterion that enables the user to balance desired accuracy against computational cost by comparing (i) the predicted relative decrease in the symmetrized KL divergence if a smaller learning rate were used and (ii) the predicted computation required to converge with the smaller learning rate. We validate the robustness and accuracy of RABVI through carefully designed simulation studies and on a diverse set of real-world model and data examples.

ブラックボックス変分推論(BBVI)は、近似ベイズ推論のためのマルコフ連鎖モンテカルロ法の高速かつ柔軟な代替手段として、現在、機械学習と統計学で広く使用されています。しかし、BBVIの確率的最適化法は依然として信頼性が低く、効果的に適用するには相当の専門知識と手動の調整が必要です。この論文では、BBVI最適化の信頼性を向上させるフレームワークである、堅牢で自動化されたブラックボックスVI (RABVI)を提案します。RABVIは厳密に正当化された自動化技術に基づいており、直感的な調整パラメータを少数しか含まず、最適な変分近似の不正確な推定値を検出します。RABVIは、固定学習率の反復の収束を検出することで学習率を適応的に下げ、その後、現在の変分近似と最適な変分近似の間の対称化されたKullback-Leibler (KL)ダイバージェンスを推定します。また、新しい最適化終了基準も採用しており、これにより、(i)より小さな学習率を使用した場合の対称化KLダイバージェンスの予測される相対的減少と、(ii)より小さな学習率で収束するために必要な予測計算量を比較することで、ユーザーは必要な精度と計算コストのバランスをとることができます。RABVIの堅牢性と精度は、慎重に設計されたシミュレーション研究と、さまざまな実世界のモデルとデータの例を通じて検証されています。

Understanding Entropic Regularization in GANs
GAN のエントロピー正則化の理解

Generative Adversarial Networks (GANs) are a popular method for learning distributions from data by modeling the target distribution as a function of a known distribution. The function, often referred to as the generator, is optimized to minimize a chosen distance measure between the generated and target distributions. One commonly used measure for this purpose is the Wasserstein distance. However, Wasserstein distance is hard to compute and optimize, and in practice entropic regularization techniques are used to facilitate its computation and improve numerical convergence. The influence of regularization on the learned solution, however, remains not well-understood. In this paper, we study how several popular entropic regularizations of Wasserstein distance impact the solution learned by a Wasserstein GAN in a simple benchmark setting where the generator is linear and the target distribution is high-dimensional Gaussian. We show that entropy regularization of Wasserstein distance promotes sparsification of the solution, while replacing the Wasserstein distance with the Sinkhorn divergence recovers the unregularized solution. The significant benefit of both regularization techniques is that they remove the curse of dimensionality suffered by Wasserstein distance. We show that in both cases the optimal generator can be learned to accuracy $\epsilon$ with $O(1/\epsilon^2)$ samples from the target distribution without requiring to constrain the discriminator. We thus conclude that these regularization techniques can improve the quality of the generator learned from empirical data in a way that is applicable for a large class of distributions.

生成的敵対ネットワーク(GAN)は、既知の分布の関数としてターゲット分布をモデル化することにより、データから分布を学習する一般的な方法です。この関数は、ジェネレーターとも呼ばれ、生成された分布とターゲット分布の間の選択された距離尺度を最小化するように最適化されます。この目的でよく使用される尺度の1つは、ワッサーシュタイン距離です。ただし、ワッサーシュタイン距離は計算と最適化が難しく、実際にはエントロピー正則化手法を使用して計算を容易にし、数値収束を改善します。ただし、学習されたソリューションに対する正則化の影響はまだ十分に理解されていません。この論文では、ジェネレーターが線形でターゲット分布が高次元ガウス分布である単純なベンチマーク設定で、ワッサーシュタイン距離のいくつかの一般的なエントロピー正則化が、ワッサーシュタインGANによって学習されたソリューションにどのように影響するかを調べます。ワッサーシュタイン距離のエントロピー正則化はソリューションのスパース化を促進し、ワッサーシュタイン距離をシンクホーンダイバージェンスに置き換えると、正則化されていないソリューションが回復されることを示します。両方の正規化手法の重要な利点は、ワッサーシュタイン距離が被る次元の呪いを取り除くことです。どちらの場合も、識別器を制約することなく、ターゲット分布から$O(1/\epsilon^2)$サンプルを使用して、最適なジェネレーターを精度$\epsilon$まで学習できることを示しています。したがって、これらの正規化手法により、大規模な分布クラスに適用できる方法で、経験的データから学習したジェネレーターの品質を向上させることができると結論付けています。
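
As a small illustration of what "entropic regularization" of an optimal transport problem computes, the sketch below runs Sinkhorn iterations between two toy discrete distributions. It is not the paper's linear-generator Gaussian setting; the function name `sinkhorn_plan`, the inputs, and the values of `eps` and `iters` are all assumptions chosen for illustration.

```python
import math

def sinkhorn_plan(a, b, cost, eps=0.5, iters=500):
    """Entropy-regularized transport plan between two discrete distributions.

    a, b : probability vectors (toy inputs, assumed for illustration)
    cost : cost[i][j] of moving mass from source i to target j
    eps  : regularization strength (assumed value)
    """
    n, m = len(a), len(b)
    # Gibbs kernel induced by the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_plan(a, b, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

After convergence the row and column sums of `plan` match `a` and `b`. Smaller `eps` brings the plan closer to the unregularized optimum but slows the iterations.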

BenchMARL: Benchmarking Multi-Agent Reinforcement Learning
BenchMARL: マルチエージェント強化学習のベンチマーキング

The field of Multi-Agent Reinforcement Learning (MARL) is currently facing a reproducibility crisis. While solutions for standardized reporting have been proposed to address the issue, we still lack a benchmarking tool that enables standardization and reproducibility, while leveraging cutting-edge Reinforcement Learning (RL) implementations. In this paper, we introduce BenchMARL, the first MARL training library created to enable standardized benchmarking across different algorithms, models, and environments. BenchMARL uses TorchRL as its backend, granting it high-performance and maintained state-of-the-art implementations while addressing the broad community of MARL PyTorch users. Its design enables systematic configuration and reporting, thus allowing users to create and run complex benchmarks from simple one-line inputs. BenchMARL is open-sourced on GitHub at https://github.com/facebookresearch/BenchMARL

現在、マルチエージェント強化学習(MARL)の分野は再現性の危機に直面しています。この問題を解決するために、標準化されたレポート作成のソリューションが提案されていますが、最先端の強化学習(RL)の実装を活用しながら、標準化と再現性を可能にするベンチマークツールはまだ不足しています。この論文では、さまざまなアルゴリズム、モデル、および環境で標準化されたベンチマークを可能にするために作成された最初のMARLトレーニングライブラリであるBenchMARLを紹介します。BenchMARLはTorchRLをバックエンドとして使用し、MARL PyTorchユーザーの幅広いコミュニティに対応しながら、高性能で継続的に保守されている最先端の実装を提供します。その設計により、体系的な構成とレポート作成が可能になるため、ユーザーは単純な1行の入力から複雑なベンチマークを作成および実行できます。BenchMARLは、GitHub (https://github.com/facebookresearch/BenchMARL) でオープンソース化されています。

Learning from many trajectories
多くの軌跡から学ぶ

We initiate a study of supervised learning from many independent sequences (“trajectories”) of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting—trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.

私たちは、シーケンスモデリング、制御、強化学習のタスクを反映した、独立していない共変量の多数の独立したシーケンス(「軌跡」)からの教師あり学習の研究を開始します。概念的には、我々のマルチ軌跡設定は、統計学習理論における2つの従来の設定、つまり独立した例からの学習と単一の自己相関シーケンスからの学習の中間に位置します。効率的な学習の条件は、前者の設定を一般化したものです。つまり、軌跡は、独立した例の標準要件を拡張する方法で非退化である必要があります。特に、軌跡がエルゴード的、長い、または厳密に安定である必要はありません。線形最小二乗回帰の場合、それぞれの長さが$T$である$m$個の軌跡によって生成された$n$次元の例を考えると、軌跡の数が少数(つまり、$m \lesssim n$)から多数(つまり、$m \gtrsim n$)に増加すると、統計効率に顕著な変化が見られます。具体的には、この問題の最悪ケースのエラー率は、$m \gtrsim n$の場合は常に$\Theta(n / m T)$であることを証明します。一方、$m \lesssim n$の場合は、単純で限界的に不安定な線形動的システムによって実現される最悪ケースのエラー率の(タイトな)下限値$\Omega(n^2 / m^2 T)$を確立します。重要な結果は、軌跡が定期的にリセットされる領域では、エラー率は最終的に、すべての例がそれらの周辺分布から独立に抽出されたかのように振る舞うということです。分析の系として、線形システム同定問題の保証も改善されます。

Interpretable algorithmic fairness in structured and unstructured data
構造化データと非構造化データにおける解釈可能なアルゴリズムの公平性

Systemic bias with respect to gender and race is prevalent in datasets, making it challenging to train classification models that are accurate and alleviate bias. We propose a unified method for alleviating bias in structured and unstructured data, based on a novel optimization approach for optimally flipping outcome labels and training classification models simultaneously. In the case of structured data, we introduce constraints on selected objective measures of meritocracy, and present four case studies, demonstrating that our approach often outperforms state-of-the-art methods in terms of fairness and meritocracy. In the case of unstructured data, we present two case studies on image classification, demonstrating that our method outperforms state-of-the-art approaches in terms of fairness. Moreover, we note that the decrease in accuracy over the nominal model is $3.31 \%$ on structured data and $0.65 \%$ on unstructured data. Finally, we leverage Optimal Classification Trees (OCTs) to provide insights on which attributes of individuals lead to flipping of their labels, and apply them to interpret the flipping decisions on structured data. Utilizing OCTs with auxiliary tabular data as well as Gradient-weighted Class Activation Mapping (Grad-CAM), we provide insights on the flipping decisions for unstructured data.

データセットには性別や人種に関する体系的なバイアスが蔓延しており、正確でバイアスを軽減する分類モデルをトレーニングすることが困難になっています。私たちは、結果ラベルを最適に反転し、分類モデルを同時にトレーニングする新しい最適化アプローチに基づいて、構造化データと非構造化データのバイアスを軽減する統一された方法を提案します。構造化データの場合、選択された客観的な実力主義の尺度に制約を導入し、4つのケーススタディを提示して、私たちのアプローチが公平性と実力主義の点で最先端の方法よりも優れていることが多いことを実証します。非構造化データの場合、画像分類に関する2つのケーススタディを提示して、私たちの方法が公平性の点で最先端の方法よりも優れていることを実証します。さらに、名目モデルに対する精度の低下は、構造化データでは$3.31 \%$、非構造化データでは$0.65 \%$であることに注意してください。最後に、最適分類ツリー(OCT)を活用して、個々の属性がラベルの反転につながるかどうかについての洞察を提供し、それを構造化データの反転決定の解釈に適用します。補助的な表形式データと勾配加重クラス活性化マッピング(Grad-CAM)を使用したOCTを利用して、非構造化データの反転決定に関する洞察を提供します。

FedCBO: Reaching Group Consensus in Clustered Federated Learning through Consensus-based Optimization
FedCBO:コンセンサスベースの最適化によるクラスター化された連合学習におけるグループコンセンサスの達成

Federated learning is an important framework in modern machine learning that seeks to integrate the training of learning models from multiple users, each user having their own local data set, in a way that is sensitive to data privacy and to communication loss constraints. In clustered federated learning, one assumes an additional unknown group structure among users, and the goal is to train models that are useful for each group, rather than simply training a single global model for all users. In this paper, we propose a novel solution to the problem of clustered federated learning that is inspired by ideas in consensus-based optimization (CBO). Our new CBO-type method is based on a system of interacting particles that is oblivious to group memberships. Our model is motivated by rigorous mathematical reasoning, which includes a mean-field analysis describing the large number of particles limit of our particle system, as well as convergence guarantees for the simultaneous global optimization of general non-convex objective functions (corresponding to the loss functions of each cluster of users) in the mean-field regime. Experimental results demonstrate the efficacy of our FedCBO algorithm compared to other state-of-the-art methods and help validate our methodological and theoretical work.

フェデレーテッドラーニングは、データのプライバシーと通信損失の制約に配慮しながら、各ユーザーが独自のローカルデータセットを持つ複数のユーザーからの学習モデルのトレーニングを統合しようとする、現代の機械学習における重要なフレームワークです。クラスター化されたフェデレーテッドラーニングでは、ユーザー間に追加の未知のグループ構造が想定され、すべてのユーザーに対して単一のグローバルモデルをトレーニングするのではなく、各グループに役立つモデルをトレーニングすることが目標となります。この論文では、コンセンサスベース最適化(CBO)のアイデアに触発された、クラスター化されたフェデレーテッドラーニングの問題に対する新しいソリューションを提案します。新しいCBOタイプの方法は、グループメンバーシップを意識せずに相互作用する粒子のシステムに基づいています。このモデルは、粒子システムの多数の粒子の制限を説明する平均場解析、および平均場領域での一般的な非凸目的関数(各ユーザークラスターの損失関数に対応)の同時グローバル最適化の収束保証を含む厳密な数学的推論に基づいています。実験結果は、他の最先端の方法と比較した当社のFedCBOアルゴリズムの有効性を実証し、当社の方法論的および理論的作業を検証するのに役立ちます。
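
To make the phrase "consensus-based optimization" concrete, here is a minimal one-dimensional sketch of the generic CBO ingredients: a Gibbs-weighted consensus point and a drift step toward it. It is not the FedCBO algorithm itself. It handles a single objective, omits the stochastic exploration term, and ignores group structure. The objective `f`, the particle positions, and the parameters `alpha`, `lam` and `dt` are assumed for illustration.

```python
import math

def consensus_point(particles, f, alpha):
    """Weighted average of particles with weights exp(-alpha * f(x))."""
    weights = [math.exp(-alpha * f(x)) for x in particles]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, particles)) / total

def drift_step(particles, f, alpha=10.0, lam=1.0, dt=0.1):
    """Move every particle a fraction lam*dt of the way to the consensus point.

    Noise is deliberately left out, so this is only the deterministic drift.
    """
    target = consensus_point(particles, f, alpha)
    return [x - lam * dt * (x - target) for x in particles]

f = lambda x: (x - 2.0) ** 2          # toy objective with minimizer at 2.0
particles = [-3.0, -1.0, 0.5, 1.8, 4.0]
target = consensus_point(particles, f, alpha=10.0)
moved = drift_step(particles, f)
```

With a large `alpha` the consensus point is dominated by the particle with the lowest objective value (here the one at 1.8), and each drift step contracts the swarm toward it.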

On the Connection between Lp- and Risk Consistency and its Implications on Regularized Kernel Methods
Lp一致性とリスク一致性の関係、および正則化カーネル法への示唆について

As a predictor’s quality is often assessed by means of its risk, it is natural to regard risk consistency as a desirable property of learning methods, and many such methods have indeed been shown to be risk consistent. The first aim of this paper is to establish the close connection between risk consistency and $L_p$-consistency for a considerably wider class of loss functions than has been done before. The attempt to transfer this connection to shifted loss functions surprisingly reveals that this shift does not reduce the assumptions needed on the underlying probability measure to the same extent as it does for many other results. The results are applied to regularized kernel methods such as support vector machines.

予測器の品質はそのリスクによって評価されることが多いため、リスク一致性を学習方法の望ましい特性と見なすのは自然なことであり、実際に多くの学習方法がリスク一致性を持つことが示されています。この論文の最初の目的は、これまでよりもかなり広範な損失関数のクラスについて、リスク一致性と$L_p$一致性との間の密接な関係を確立することです。この関係をシフトされた損失関数に移そうとすると、驚くべきことに、このシフトは、他の多くの結果の場合と同じ程度には、基礎となる確率測度に必要な仮定を減らさないことが明らかになります。これらの結果は、サポートベクターマシンなどの正則化カーネル法に適用されます。

Pre-trained Gaussian Processes for Bayesian Optimization
ベイズ最適化のための事前学習済みガウス過程

Bayesian optimization (BO) has become a popular strategy for global optimization of expensive real-world functions. Contrary to a common expectation that BO is suited to optimizing black-box functions, it actually requires domain knowledge about those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process (GP) priors that specify initial beliefs on functions. However, even with expert knowledge, it is non-trivial to quantitatively define a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. We detail what pre-training entails for GPs using a KL divergence based loss function, and propose a new pre-training based BO framework named HyperBO. Theoretically, we show bounded posterior predictions and near-zero regrets for HyperBO without assuming the “ground truth” GP prior is known. To verify our approach in realistic setups, we collect a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, HyperBO is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods on both our new tuning dataset and existing multi-task BO benchmarks.

ベイズ最適化(BO)は、高価な実世界の関数のグローバル最適化のための一般的な戦略となっています。BOはブラックボックス関数の最適化に適しているという一般的な期待に反して、実際にはBOを正常に展開するにはそれらの関数に関するドメイン知識が必要です。このようなドメイン知識は、関数の初期信念を指定するガウス過程(GP)事前確率によく現れます。ただし、専門知識があっても、事前確率を定量的に定義することは簡単ではありません。これは、チューニング目標のランドスケープを理解するのが難しいことが多い複雑な機械学習モデルのハイパーパラメータ調整問題に特に当てはまります。私たちは、これらの機能事前確率を設定するための代替手法を求めています。特に、類似関数からのデータがあり、よりタイトな分布を事前に事前トレーニングできるシナリオを検討します。KLダイバージェンスベースの損失関数を使用してGPの事前トレーニングに必要なことを詳しく説明し、HyperBOという新しい事前トレーニングベースのBOフレームワークを提案します。理論的には、GP事前の「真実」が既知であると仮定することなく、HyperBOの制限された事後予測とほぼゼロの後悔を示します。現実的な設定でこのアプローチを検証するために、一般的な画像とテキストのデータセット、およびタンパク質配列データセットで、ほぼ最先端のディープラーニングモデルの構成を数万個トレーニングすることにより、大規模なマルチタスクハイパーパラメータチューニングデータセットを収集します。結果から、平均して、HyperBOは、新しいチューニングデータセットと既存のマルチタスクBOベンチマークの両方で、競合する最良の方法よりも少なくとも3倍効率的に適切なハイパーパラメータを特定できることがわかります。

Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis
マルチソースデータ分析のための異種混合を考慮したクラスタ分散学習

In diverse fields ranging from finance to omics, it is increasingly common that data is distributed with multiple individual sources (referred to as “clients” in some studies). Integrating raw data, although powerful, is often not feasible, for example, when there are considerations on privacy protection. Distributed learning techniques have been developed to integrate summary statistics as opposed to raw data. In many existing distributed learning studies, it is stringently assumed that all the clients have the same model. To accommodate data heterogeneity, some federated learning methods allow for client-specific models. In this article, we consider the scenario that clients form clusters, those in the same cluster have the same model, and different clusters have different models. Further considering the clustering structure can lead to a better understanding of the “interconnections” among clients and reduce the number of parameters. To this end, we develop a novel penalization approach. Specifically, group penalization is imposed for regularized estimation and selection of important variables, and fusion penalization is imposed to automatically cluster clients. An effective ADMM algorithm is developed, and the estimation, selection, and clustering consistency properties are established under mild conditions. Simulation and data analysis further demonstrate the practical utility and superiority of the proposed approach.

金融からオミクスまで、さまざまな分野で、データが複数の個別のソース（一部の研究では「クライアント」と呼ばれる）に分散されることがますます一般的になっています。生のデータを統合することは強力ですが、プライバシー保護に関する考慮事項がある場合など、多くの場合実現可能ではありません。生のデータではなく要約統計を統合するための分散学習技術が開発されました。既存の多くの分散学習研究では、すべてのクライアントが同じモデルを持っていると厳密に想定されています。データの異質性に対応するために、一部の連合学習方法では、クライアント固有のモデルが許可されています。この記事では、クライアントがクラスターを形成し、同じクラスター内のクライアントは同じモデルを持ち、異なるクラスターは異なるモデルを持つシナリオを検討します。クラスタリング構造をさらに考慮すると、クライアント間の「相互接続」をよりよく理解し、パラメーターの数を減らすことができます。この目的のために、新しいペナルティアプローチを開発しました。具体的には、グループペナルティを課して重要な変数の正規化された推定と選択を行い、フュージョンペナルティを課してクライアントを自動的にクラスタリングします。効果的なADMMアルゴリズムが開発され、穏やかな条件下で推定、選択、クラスタリングの一貫性特性が確立されています。シミュレーションとデータ分析により、提案されたアプローチの実用性と優位性がさらに実証されています。

From Small Scales to Large Scales: Distance-to-Measure Density based Geometric Analysis of Complex Data
小規模スケールから大規模スケールまで:Distance-to-Measure密度に基づく複雑なデータの幾何学的解析

How can we tell complex point clouds with different small scale characteristics apart, while disregarding global features? Can we find a suitable transformation of such data in a way that allows us to discriminate between differences in this sense with statistical guarantees? In this paper, we consider the analysis and classification of complex point clouds as they are obtained, e.g., via single molecule localization microscopy. We focus on the task of identifying differences between noisy point clouds based on small scale characteristics, while disregarding large scale information such as overall size. We propose an approach based on a transformation of the data via the so-called Distance-to-Measure (DTM) function, a transformation which is based on the average of nearest neighbor distances. For each data set, we estimate the probability density of average local distances of all data points and use the estimated densities for classification. While the applicability is immediate and the practical performance of the proposed methodology is very good, the theoretical study of the density estimators is quite challenging, as they are based on non-i.i.d. observations that have been obtained via a complicated transformation. In fact, the transformed data are stochastically dependent in a non-local way that is not captured by commonly considered dependence measures. Nonetheless, we show that the asymptotic behaviour of the density estimator is driven by a kernel density estimator of certain i.i.d. random variables by using theoretical properties of $U$-statistics, which allow us to handle the dependencies via a Hoeffding decomposition. We show via a numerical study and in an application to simulated single molecule localization microscopy data of chromatin fibers that unsupervised classification tasks based on estimated DTM-densities achieve excellent separation results.

全体的な特徴を無視して、異なる小規模な特性を持つ複雑な点群を区別するにはどうすればよいでしょうか。統計的な保証をもって、この意味での違いを区別できるような、そのようなデータの適切な変換方法を見つけることはできるでしょうか。この論文では、単一分子局在顕微鏡法などによって得られた複雑な点群の分析と分類について検討します。全体的なサイズなどの大規模な情報を無視して、小規模な特性に基づいてノイズの多い点群間の違いを識別するというタスクに焦点を当てます。私たちは、いわゆるDistance-to-Measure (DTM)関数によるデータの変換に基づくアプローチを提案します。この変換は、最近傍距離の平均に基づいています。各データセットについて、すべてのデータポイントの平均ローカル距離の確率密度を推定し、推定された密度を分類に使用します。提案された方法論はすぐに適用でき、実際のパフォーマンスも非常に優れていますが、密度推定量は複雑な変換を経て得られた非i.i.d.の観測値に基づいているため、その理論的研究は非常に困難です。実際、変換されたデータは、一般的に考えられている依存性の尺度では捉えられない非局所的な方法で確率的に依存しています。それでもなお、私たちは、密度推定量の漸近的挙動が、特定のi.i.d.確率変数のカーネル密度推定量によって駆動されることを、$U$統計量の理論的特性を使用して示します。これにより、Hoeffding分解を介して依存性を処理できます。数値研究と、クロマチン繊維のシミュレートされた単一分子局在顕微鏡データへの適用により、推定されたDTM密度に基づく教師なし分類タスクが優れた分離結果を達成することを示します。
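
The transformation described above, replacing each point by an average of its nearest neighbor distances, can be sketched in a few lines. This is an illustrative simplification. The helper name `dtm_values`, the choice of `k`, the exclusion of the point itself, and the use of plain rather than squared distances are assumptions; the paper's exact DTM definition may differ.

```python
import math

def dtm_values(points, k):
    """Average distance from each point to its k nearest neighbours.

    The point itself is excluded (an assumption made for this sketch).
    """
    values = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(sum(dists[:k]) / k)
    return values

# Two toy clouds with the same shape but different small-scale spacing.
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
loose = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
tight_dtm = dtm_values(tight, k=2)
loose_dtm = dtm_values(loose, k=2)
```

Each cloud is thereby reduced to a one-dimensional sample of local distances. In the approach summarized above, the density of that sample is then estimated and used as the feature for classification.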

PAMI: An Open-Source Python Library for Pattern Mining
PAMI:パターンマイニング用のオープンソースPythonライブラリ

Crucial information that can empower users with competitive information to achieve socio-economic development lies hidden in big data. Pattern mining aims to discover this needed information by finding user interest-based patterns in big data. Unfortunately, existing pattern mining libraries are limited to finding a few types of patterns in transactional and sequence databases. This paper tackles this problem by providing a cross-platform open-source Python library called PAttern MIning (PAMI). PAMI provides several algorithms to discover different types of patterns hidden in various types of databases across multiple computing architectures. PAMI also contains algorithms to generate various types of synthetic databases. PAMI offers a command line interface, Jupyter Notebook support, and easy maintenance through the Python Package Index. Furthermore, the source code is available under the GNU General Public License, version 3. Finally, PAMI offers several resources, such as a user’s guide, a developer’s guide, datasets, and a bug report.

ビッグデータには、社会経済発展を達成するためにユーザーに競争力のある情報を提供できる重要な情報が隠されています。パターンマイニングは、ビッグデータでユーザーの関心に基づくパターンを見つけることで、この必要な情報を発見することを目的としています。残念ながら、既存のパターンマイニングライブラリは、トランザクションデータベースとシーケンスデータベースで数種類のパターンを見つけることに限定されています。この論文では、クロスプラットフォームのオープンソースPythonライブラリであるPAttern MIning (PAMI)を提供することでこの問題に取り組んでいます。PAMIは、複数のコンピューティングアーキテクチャにわたるさまざまな種類のデータベースに隠されたさまざまな種類のパターンを発見するためのいくつかのアルゴリズムを提供します。PAMIには、さまざまな種類の合成データベースを生成するアルゴリズムも含まれています。PAMIは、コマンドラインインターフェイス、Jupyter Notebookサポート、およびPythonパッケージインデックスによる簡単なメンテナンスを提供します。さらに、ソースコードはGNU General Public Licenseバージョン3の下で利用できます。最後に、PAMIは、ユーザーガイド、開発者ガイド、データセット、バグレポートなどのいくつかのリソースを提供します。

Law of Large Numbers and Central Limit Theorem for Wide Two-layer Neural Networks: The Mini-Batch and Noisy Case
広幅2層ニューラルネットワークのための大数の法則と中心極限定理:ミニバッチとノイズケース

In this work, we consider a wide two-layer neural network and study the behavior of its empirical weights under a dynamics set by a stochastic gradient descent along the quadratic loss with mini-batches and noise. Our goal is to prove a trajectorial law of large numbers as well as a central limit theorem for their evolution. When the noise is scaling as $1/N^\beta$ and $1/2<\beta\le\infty$, we rigorously derive and generalize the LLN obtained for example by Rotskoff and Vanden-Eijnden (Comm. Pure Appl. Math., 2022), Mei, Montanari and Nguyen (PNAS, 2018) or Sirignano and Spiliopoulos (SIAM J. Appl. Math., 2020). When $3/4<\beta\le\infty$, we also generalize the CLT of Sirignano and Spiliopoulos (Stoch. Proc. Appl., 2020) and further exhibit the effect of mini-batching on the asymptotic variance which leads the fluctuations. The case $\beta=3/4$ is trickier and we give an example showing the divergence with time of the variance, thus establishing the instability of the predictions of the neural network in this case. It is illustrated by simple numerical examples.

この研究では、幅の広い2層ニューラルネットワークを検討し、ミニバッチとノイズを伴う2次損失に沿った確率的勾配降下法によって設定されたダイナミクスの下での経験的重みの挙動を調べます。私たちの目標は、その時間発展に対する軌道レベルの大数の法則と中心極限定理を証明することです。ノイズが$1/N^\beta$としてスケーリングされ、$1/2<\beta\le\infty$である場合、たとえばRotskoffとVanden-Eijnden（Comm. Pure Appl. Math., 2022）、Mei、Montanari、Nguyen（PNAS, 2018）、またはSirignanoとSpiliopoulos（SIAM J. Appl. Math., 2020）によって得られたLLNを厳密に導出し、一般化します。$3/4<\beta\le\infty$の場合、SirignanoとSpiliopoulos (Stoch. Proc. Appl., 2020)のCLTも一般化し、さらに、変動をもたらす漸近分散に対するミニバッチングの効果を示します。$\beta=3/4$の場合はより複雑であり、分散の時間による発散を示す例を示し、この場合のニューラルネットワークの予測の不安定性を確立します。これは、簡単な数値例で説明されます。

Risk Measures and Upper Probabilities: Coherence and Stratification
リスク指標と上限確率:一貫性と層別化

Machine learning typically presupposes classical probability theory, which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of alternative aggregation functionals, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we arrive at a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackle practical machine learning problems.

機械学習は通常、古典的な確率論を前提としており、これは集約が期待に基づいて構築されることを意味します。現在、機械学習の数学的基盤として、古典的な確率論に代わるより豊かな選択肢を検討する動機付けとなる理由は複数あります。私たちは、スペクトルリスク測度、チョケ積分、またはローレンツノルムとしてさまざまに知られている、強力で豊富なクラスの代替凝集汎関数を体系的に調べます。さまざまな特性評価結果を提示し、このスペクトルファミリーが特別な理由を示します。そうすることで、再配列不変バナッハ空間の理論からの結果を利用して誘導される上位確率の観点から、すべての首尾一貫したリスク尺度の自然な層別化に到達します。不確実性に対するこの新しいアプローチが、機械学習の実際的な問題への取り組みにどのように役立つかを経験的に示しています。
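
For a finite sample, a spectral risk measure can be read as a weighted average of the sorted losses in which larger losses never receive less weight. The sketch below shows how the expectation, an average over the worst half, and the worst case all arise from one aggregation rule. The function name `spectral_risk` and the toy losses are assumptions made for illustration, not notation from the paper.

```python
def spectral_risk(losses, weights):
    """Weighted average of the sorted losses.

    weights must be non-negative, non-decreasing and sum to one, so that
    larger losses are weighted at least as heavily as smaller ones.
    """
    assert all(w >= 0 for w in weights)
    assert all(w1 <= w2 for w1, w2 in zip(weights, weights[1:]))
    assert abs(sum(weights) - 1.0) < 1e-12
    return sum(w * x for w, x in zip(weights, sorted(losses)))

losses = [3.0, 1.0, 4.0, 2.0]
expectation = spectral_risk(losses, [0.25, 0.25, 0.25, 0.25])  # plain mean
worst_half = spectral_risk(losses, [0.0, 0.0, 0.5, 0.5])       # mean of worst 50%
worst_case = spectral_risk(losses, [0.0, 0.0, 0.0, 1.0])       # maximum
```

Uniform weights recover the classical expectation, while concentrating the weight on the largest losses yields progressively more pessimistic aggregations.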

Parallel-in-Time Probabilistic Numerical ODE Solvers
時間並列確率数値 ODE ソルバー

Probabilistic numerical solvers for ordinary differential equations (ODEs) treat the numerical simulation of dynamical systems as problems of Bayesian state estimation. Aside from producing posterior distributions over ODE solutions and thereby quantifying the numerical approximation error of the method itself, one less-often noted advantage of this formalism is the algorithmic flexibility gained by formulating numerical simulation in the framework of Bayesian filtering and smoothing. In this paper, we leverage this flexibility and build on the time-parallel formulation of iterated extended Kalman smoothers to formulate a parallel-in-time probabilistic numerical ODE solver. Instead of simulating the dynamical system sequentially in time, as done by current probabilistic solvers, the proposed method processes all time steps in parallel and thereby reduces the computational complexity from linear to logarithmic in the number of time steps. We demonstrate the effectiveness of our approach on a variety of ODEs and compare it to a range of both classic and probabilistic numerical ODE solvers.

常微分方程式(ODE)の確率数値ソルバーは、動的システムの数値シミュレーションをベイズ状態推定の問題として扱います。ODE解の事後分布を生成し、それによって方法自体の数値近似誤差を定量化する以外に、この形式主義のあまり注目されていない利点の1つは、ベイズフィルタリングとスムージングのフレームワークで数値シミュレーションを定式化することによって得られるアルゴリズムの柔軟性です。この論文では、この柔軟性を活用し、反復拡張カルマンスムーザーの時間並列定式化に基づいて、時間並列の確率数値ODEソルバーを定式化します。現在の確率ソルバーのように動的システムを時間的に順番にシミュレートする代わりに、提案された方法はすべての時間ステップを並列に処理し、それによって計算の複雑さを時間ステップの数の線形から対数に削減します。さまざまなODEでこのアプローチの有効性を示し、従来の数値ODEソルバーと確率数値ODEソルバーの両方と比較します。
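
One standard way to turn a sequential recursion into a computation of logarithmic depth is an associative prefix scan. The abstract does not spell out the mechanism, so the sketch below is only a generic illustration on scalar affine updates $x_{t+1} = a_t x_t + b_t$, not the proposed solver. The names `combine` and `parallel_prefix` and the toy coefficients are assumptions.

```python
def combine(first, second):
    """Compose two affine maps x -> a*x + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def parallel_prefix(elems):
    """Hillis-Steele inclusive scan.

    All updates inside one round are independent of each other, so with
    enough processors each round costs constant time.
    """
    out = list(elems)
    rounds, step = 0, 1
    while step < len(out):
        out = [combine(out[i - step], out[i]) if i >= step else out[i]
               for i in range(len(out))]
        step *= 2
        rounds += 1
    return out, rounds

steps = [(2, 1), (1, 3), (3, 0), (1, 1), (2, 2), (1, 0), (1, 4), (2, 1)]
prefix, rounds = parallel_prefix(steps)

# Reference: the ordinary sequential recursion from x0.
x0 = 1
x, sequential = x0, []
for a, b in steps:
    x = a * x + b
    sequential.append(x)

scanned = [a * x0 + b for a, b in prefix]
```

For the eight steps here the scan needs three rounds instead of eight sequential updates. The same pattern applies whenever the per-step operation is associative, which matches the reduction from linear to logarithmic complexity described above, provided enough parallel workers are available.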

Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
特徴分布データのためのスケーラブルな高次元多変量線形回帰

Feature-distributed data, which refers to data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.

特徴分散データは、特徴によって分割され、複数のコンピューティングノードにまたがって格納されるデータと呼ばれ、多数の特徴を持つアプリケーションでますます一般的になっています。この論文では、そのようなデータに多変量線形回帰を適用するための2段階緩和貪欲アルゴリズム(TSRGA)を提案します。TSRGAの主な利点は、通信の複雑さが特徴ディメンションに依存しないため、非常に大規模なデータセットに対して高度にスケーラブルになることです。さらに、多変量応答変数の場合、TSRGAを使用して低ランク係数推定値を生成できます。TSRGAの高速収束は、シミュレーション実験によって検証されています。最後に、提案されたTSRGAを、10-Kレポートからの非構造化データを活用する金融アプリケーションに適用し、高密度の大次元行列を多数含むアプリケーションでの有用性を実証します。

Dropout Regularization Versus l2-Penalization in the Linear Model
線形モデルにおけるドロップアウト正則化と l2-ペナルティ化

We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and $\ell_2$-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.

私たちは、線形回帰モデルにおけるドロップアウトを伴う勾配降下反復の統計的振る舞いを調査します。特に、反復の期待値と共分散行列の収束に関する非漸近的な上界が導出されます。この結果は、線形モデルにおけるドロップアウトと$\ell_2$正則化との間の広く引用されている関係に、より光を当てています。私たちは、勾配降下ダイナミクスとドロップアウトによって誘発される追加のランダム性との間の相互作用により、より微妙な関係があることを示します。さらに、正則化効果を持たず、最小二乗推定量に収束するドロップアウトの単純化されたバリアントを研究します。
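
The "widely cited connection" is usually stated for the dropout loss averaged over masks: with independent Bernoulli($p$) keep-masks, the expected squared loss equals a squared loss at $p\beta$ plus a weighted $\ell_2$ penalty. The sketch below checks this identity exactly by enumerating all masks on a toy problem. It says nothing about the gradient descent iterates, which are the subject of the paper; the data, `beta`, and `p` are assumed values.

```python
from itertools import product

def sq_loss(X, y, beta):
    """Sum of squared residuals of the linear model y ~ X beta."""
    return sum((yi - sum(xij * bj for xij, bj in zip(row, beta))) ** 2
               for row, yi in zip(X, y))

def expected_dropout_loss(X, y, beta, p):
    """Exact expectation of the loss over all 2^d Bernoulli(p) keep-masks."""
    total = 0.0
    for mask in product([0, 1], repeat=len(beta)):
        prob = 1.0
        for m in mask:
            prob *= p if m else 1.0 - p
        total += prob * sq_loss(X, y, [m * b for m, b in zip(mask, beta)])
    return total

def penalized_form(X, y, beta, p):
    """||y - p X beta||^2 + p (1 - p) * sum_j beta_j^2 * sum_i X_ij^2."""
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(len(beta))]
    penalty = p * (1.0 - p) * sum(c * b * b for c, b in zip(col_sq, beta))
    return sq_loss(X, y, [p * b for b in beta]) + penalty

X = [[1.0, 2.0, 0.5], [0.0, 1.0, -1.0], [2.0, -1.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 0.0, 2.0, 1.5]
beta = [0.5, -0.25, 1.0]
lhs = expected_dropout_loss(X, y, beta, p=0.8)
rhs = penalized_form(X, y, beta, p=0.8)
```

The two quantities agree up to floating-point error. The abstract's point is that this averaged picture does not fully describe the behavior of the iterates themselves.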

Efficient Convex Algorithms for Universal Kernel Learning
ユニバーサルカーネル学習のための効率的な凸アルゴリズム

The accuracy and complexity of machine learning algorithms based on kernel optimization are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (for tractability); be dense in the set of all kernels (for robustness); be universal (for accuracy). Recently, a framework was proposed for using positive matrices to parameterize a class of positive semi-separable kernels. Although this class can be shown to meet all three criteria, previous algorithms for optimization of such kernels were limited to classification and furthermore relied on computationally complex Semidefinite Programming (SDP) algorithms. In this paper, we pose the problem of learning semiseparable kernels as a minimax optimization problem and propose an SVD-QCQP primal-dual algorithm which dramatically reduces the computational complexity as compared with previous SDP-based approaches. Furthermore, we provide an efficient implementation of this algorithm for both classification and regression, which enables us to solve problems with 100 features and up to 30,000 data points. Finally, when applied to benchmark data, the algorithm demonstrates the potential for significant improvement in accuracy over typical (but non-convex) approaches such as Neural Nets and Random Forest with similar or better computation time.

カーネル最適化に基づく機械学習アルゴリズムの精度と複雑さは、最適化できるカーネルのセットによって決まります。理想的なカーネルのセットは、線形パラメータ化が可能であること(扱いやすさのため)、すべてのカーネルのセット内で密であること(堅牢性のため)、普遍的であること(精度のため)である必要があります。最近、正の行列を使用して正の半分離カーネルのクラスをパラメータ化するためのフレームワークが提案されました。このクラスは3つの基準すべてを満たすことが示されていますが、このようなカーネルを最適化する以前のアルゴリズムは分類に限定されており、さらに計算が複雑な半正定値計画(SDP)アルゴリズムに依存していました。この論文では、半分離カーネルの学習の問題をミニマックス最適化問題として提起し、以前のSDPベースのアプローチと比較して計算の複雑さを大幅に削減するSVD-QCQPプライマルデュアルアルゴリズムを提案します。さらに、分類と回帰の両方に対してこのアルゴリズムの効率的な実装を提供します。この実装により、100個の機能と最大30,000個のデータを持つ問題を解決できます。最後に、ベンチマークデータに適用すると、このアルゴリズムは、ニューラルネットやランダムフォレストなどの一般的な(ただし非凸)アプローチと比較して、同等またはより短い計算時間で精度を大幅に向上できる可能性を示します。

Manifold Learning by Mixture Models of VAEs for Inverse Problems
逆問題に対するVAEの混合モデルによる多様体学習

Representing a manifold of very high-dimensional data with generative models has been shown to be computationally efficient in practice. However, this requires that the data manifold admits a global parameterization. In order to represent manifolds of arbitrary topology, we propose to learn a mixture model of variational autoencoders. Here, every encoder-decoder pair represents one chart of a manifold. We propose a loss function for maximum likelihood estimation of the model weights and choose an architecture that provides us the analytical expression of the charts and of their inverses. Once the manifold is learned, we use it for solving inverse problems by minimizing a data fidelity term restricted to the learned manifold. To solve the arising minimization problem we propose a Riemannian gradient descent algorithm on the learned manifold. We demonstrate the performance of our method for low-dimensional toy examples as well as for deblurring and electrical impedance tomography on certain image manifolds.

生成モデルを使用して非常に高次元のデータの多様体を表現することは、実際には計算効率が高いことが示されています。ただし、これには、データ多様体がグローバルパラメータ化を許容する必要があります。任意のトポロジの多様体を表現するために、変分オートエンコーダの混合モデルを学習することを提案します。ここでは、すべてのエンコーダとデコーダのペアが多様体の1つのチャートを表します。モデルの重みを最大尤度で推定するための損失関数を提案し、チャートとその逆の解析表現を提供するアーキテクチャを選択します。多様体が学習されると、学習した多様体に制限されたデータ忠実度項を最小化することにより、逆問題を解決するためにそれを使用します。発生する最小化問題を解決するために、学習した多様体に対するリーマン勾配降下アルゴリズムを提案します。低次元のおもちゃの例、および特定の画像多様体でのぼかし除去と電気インピーダンストモグラフィーに対するこの方法のパフォーマンスを示します。

An Algorithmic Framework for the Optimization of Deep Neural Networks Architectures and Hyperparameters
深層ニューラルネットワークアーキテクチャとハイパーパラメータの最適化のためのアルゴリズムフレームワーク

In this paper, we propose DRAGON (for DiRected Acyclic Graph OptimizatioN), an algorithmic framework to automatically generate efficient deep neural network architectures and optimize their associated hyperparameters. The framework is based on evolving Directed Acyclic Graphs (DAGs), defining a more flexible search space than the existing ones in the literature. It allows mixtures of different classical operations (convolutions, recurrences and dense layers) as well as newer operations such as self-attention. Based on this search space we propose neighbourhood and evolution search operators to optimize both the architecture and hyperparameters of our networks. These search operators can be used with any metaheuristic capable of handling mixed search spaces. We tested our algorithmic framework with an asynchronous evolutionary algorithm on a time series forecasting benchmark. The results demonstrate that DRAGON outperforms state-of-the-art handcrafted models and AutoML techniques for time series forecasting on numerous datasets. DRAGON has been implemented as an open-source Python package.

この論文では、効率的なディープニューラルネットワークアーキテクチャを自動的に生成し、関連するハイパーパラメータを最適化するアルゴリズムフレームワークであるDRAGON (DiRected Acyclic Graph OptimizatioN)を提案します。このフレームワークは、進化する有向非巡回グラフ(DAG)に基づいており、文献にある既存のものよりも柔軟な検索空間を定義します。畳み込み、再帰、密なレイヤーなどのさまざまな古典的な操作を混在させることができますが、セルフアテンションなどのより新しい操作も混在させることができます。この検索空間に基づいて、ネットワークのアーキテクチャとハイパーパラメータの両方を最適化するための近傍検索演算子と進化検索演算子を提案します。これらの検索演算子は、混合検索空間を処理できる任意のメタヒューリスティックで使用できます。時系列予測ベンチマークで非同期進化アルゴリズムを使用して、アルゴリズムフレームワークをテストしました。結果は、DRAGONが多数のデータセットでの時系列予測において最先端の手作りモデルやAutoMLテクニックよりも優れていることを示しています。DRAGONはPythonオープンソースパッケージとして実装されています。

Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity
最適に近いサンプル複雑度による分布ロバストなモデルベースオフライン強化学習

This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from history data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy, with as few samples as possible, that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes (RMDPs) with an uncertainty set specified by the Kullback-Leibler divergence in both finite-horizon and infinite-horizon settings. To combat sample scarcity, a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty is proposed, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the history dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithms. We further develop an information-theoretic lower bound, which suggests that learning RMDPs is at least as hard as learning standard MDPs when the uncertainty level is sufficiently small, and corroborates the tightness of our upper bound up to polynomial factors of the (effective) horizon length for a range of uncertainty levels. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.

この論文は、積極的な探索なしに履歴データから意思決定を学習することを目的とするオフライン強化学習(RL)における、モデルの堅牢性とサンプル効率という中心的な問題を扱います。環境の不確実性と変動性のため、展開された環境が履歴データセットの収集に使用された公称環境から逸脱した場合でも、できるだけ少ないサンプルで、適切に機能する堅牢なポリシーを学習することが重要です。有限期間と無限期間の両方の設定で、Kullback-Leiblerダイバージェンスによって指定された不確実性セットを持つ表形式の堅牢なマルコフ決定プロセスに焦点を当て、オフラインRLの分布的に堅牢な定式化を検討します。サンプルの不足に対処するために、堅牢な値推定に慎重に設計されたデータ駆動型のペナルティ項を課すことによって、分布的に堅牢な値反復と不確実性に対する悲観主義の原則を組み合わせたモデルベースのアルゴリズムが提案されています。状態アクション空間を完全にカバーする必要なく分布シフトを測定する履歴データセットの穏やかで調整された仮定の下で、提案されたアルゴリズムの有限サンプル複雑性を確立します。さらに、情報理論的な下限を導出しました。これは、不確実性のレベルが十分に小さい場合、RMDPの学習は少なくとも標準MDPと同じくらい難しいことを示唆し、さまざまな不確実性のレベルに対して(有効な)ホライズン長の多項式因子を除いて上限がタイトであることを裏付けています。私たちの知る限り、これは、モデルの不確実性と部分的なカバレッジの下で学習する、証明可能なほぼ最適な堅牢なオフラインRLアルゴリズムを初めて提供します。
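As an illustration of the kind of quantity a KL-robust value iteration has to evaluate, the worst-case expectation over a KL ball admits a standard one-dimensional dual form, $\inf_{P:\,\mathrm{KL}(P\|P^0)\le\sigma}\mathbb{E}_P[V]=\sup_{\lambda\ge 0}\{-\lambda\log\mathbb{E}_{P^0}[e^{-V/\lambda}]-\lambda\sigma\}$. The sketch below evaluates that dual by a simple grid search. The function name, the grid range, and the grid search itself are assumptions made for this example; the abstract does not specify how the paper computes this step, and the pessimism penalty is not included.

```python
import math


def kl_robust_expectation(values, probs, sigma, num_grid=2000):
    """Approximate inf over {P : KL(P || P0) <= sigma} of E_P[V] via its dual.

    values, probs: support points of V and their nominal probabilities P0.
    sigma: radius of the KL uncertainty set (sigma >= 0).
    The grid search over the dual variable is an illustrative choice.
    """
    support = [(v, p) for v, p in zip(values, probs) if p > 0]
    # As lam -> 0+ the dual objective tends to the smallest supported value.
    best = min(v for v, _ in support)
    for t in range(num_grid):
        # Log-spaced grid for the dual variable lam in [1e-3, 1e3].
        lam = 10.0 ** (-3.0 + 6.0 * t / (num_grid - 1))
        # log E_{P0}[exp(-V / lam)], computed with the log-sum-exp trick.
        exponents = [-v / lam for v, _ in support]
        shift = max(exponents)
        log_mgf = shift + math.log(
            sum(p * math.exp(e - shift) for (_, p), e in zip(support, exponents))
        )
        best = max(best, -lam * log_mgf - lam * sigma)
    return best
```

Every value of the dual variable gives a lower bound on the robust expectation, so a finite grid can only under-estimate it slightly. A larger `sigma` yields a more pessimistic (smaller) value.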

Grokking phase transitions in learning local rules with gradient descent
勾配降下法による局所ルールの学習におけるグロッキング相転移

We discuss two solvable grokking (generalisation beyond overfitting) models in a rule-learning scenario. We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and provide evidence that grokking is a consequence of the locality of the teacher model. We analyze the rule-30 cellular automaton learning task, numerically determine the critical exponent and the grokking time distribution, and compare them with the prediction of the proposed grokking model. Finally, we numerically study the connection between structure formation and grokking.

私たちは、ルール学習シナリオにおける2つの可解なグロッキング(過学習を超えた汎化)モデルについて説明します。グロッキングが相転移であることを示し、臨界指数、グロッキング確率、およびグロッキング時間分布の厳密な解析式を見つけます。さらに、提案されたグロッキングの設定を標準的な(パーセプトロン)統計学習理論と接続するテンソルネットワークマップを導入し、グロッキングが教師モデルの局所性の結果であるという証拠を提供します。ルール30セルオートマトン学習課題を解析し、臨界指数とグロッキング時間分布を数値的に決定し、それらを提案されたグロッキングモデルの予測と比較します。最後に、構造形成とグロッキングの関係を数値的に研究します。
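For readers unfamiliar with the rule-30 cellular automaton mentioned above, the following minimal sketch generates (state, next state) pairs from it. The update rule itself is standard (new cell = left XOR (center OR right)). The periodic boundary and the pairing into a dataset are assumptions for illustration; the abstract does not describe how the paper sets up its learning task.

```python
def rule30_step(cells):
    """Apply one rule-30 update to a list of 0/1 cells (periodic boundary assumed)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]


def rule30_pairs(initial, num_steps):
    """Return (state, next_state) pairs along one trajectory (hypothetical helper)."""
    pairs = []
    state = list(initial)
    for _ in range(num_steps):
        nxt = rule30_step(state)
        pairs.append((state, nxt))
        state = nxt
    return pairs
```

Each output cell depends only on its three-cell neighbourhood, which is the locality the abstract refers to.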

Unsupervised Tree Boosting for Learning Probability Distributions
確率分布を学習するための教師なし木ブースティング

We propose an unsupervised tree boosting algorithm for inferring the underlying sampling distribution of an i.i.d. sample based on fitting additive tree ensembles in a manner analogous to supervised tree boosting. Integral to the algorithm is a new notion of “addition” on probability distributions that leads to a coherent notion of “residualization”, i.e., subtracting a probability distribution from an observation to remove the distributional structure from the sampling distribution of the latter. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions due to several “group-like” properties of univariate CDFs. While the traditional multivariate CDF does not preserve these properties, a new definition of multivariate CDF can restore these properties, thereby allowing the notions of “addition” and “residualization” to be formulated for multivariate settings as well. This then gives rise to the unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm allows analytic evaluation of the fitted density and outputs a generative model that can be readily sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that separately fits the marginals and the copula. The algorithm then performs competitively with state-of-the-art deep-learning approaches in multivariate density estimation on multiple benchmark data sets.

私たちは、教師ありツリーブースティングに類似した方法で加法ツリーアンサンブルをフィッティングすることにより、i.i.d.サンプルの基礎となるサンプリング分布を推測するための教師なしツリーブースティングアルゴリズムを提案します。このアルゴリズムに不可欠なのは、確率分布に対する「加算」という新しい概念であり、これは「残差化」という一貫した概念、つまり、観測値から確率分布を減算して、後者のサンプリング分布から分布構造を除去することにつながります。私たちは、これらの概念が、単変量CDFのいくつかの「群のような」特性による累積分布関数(CDF)変換および合成を通じて、単変量分布に対して自然に生じることを示します。従来の多変量CDFはこれらの特性を保持しませんが、多変量CDFの新しい定義はこれらの特性を復元できるため、「加算」と「残差化」の概念を多変量設定に対しても定式化できます。これにより、加法ツリーアンサンブルの前向き段階的フィッティングに基づく教師なしブースティングアルゴリズムが生まれ、真の分布とのKullback-Leiblerダイバージェンスを逐次的に減少させます。このアルゴリズムにより、フィッティングされた密度を解析的に評価でき、簡単にサンプリングできる生成モデルが出力されます。スケール依存の縮小と、周辺分布とコピュラを個別にフィッティングする2段階戦略でアルゴリズムを強化します。このアルゴリズムは、複数のベンチマークデータセットでの多変量密度推定において、最先端のディープラーニングアプローチと競合するパフォーマンスを発揮します。
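One way to read the univariate case described above is through the probability integral transform: applying a continuous CDF $F$ to an observation drawn from $F$ leaves a Uniform(0,1) "residual" with no remaining structure, and CDFs on the unit interval can be chained by composition. The sketch below illustrates only that reading. The function names are hypothetical, and the paper's precise definitions of "addition", "residualization", and its multivariate CDF are not given in the abstract.

```python
import math


def exp_cdf(x, rate=1.0):
    """CDF of an exponential distribution, used here as an example fit."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def residualize(sample, cdf):
    """Map each observation through a fitted CDF (probability integral transform)."""
    return [cdf(x) for x in sample]


def compose(first, second):
    """Chain two CDFs on [0, 1]: apply `first`, then `second` (assumed 'addition')."""
    return lambda u: second(first(u))
```

If the fitted CDF matches the sampling distribution, the residuals are uniform on (0, 1); any structure left in them is what a further boosting round would try to fit.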

Linear Regression With Unmatched Data: A Deconvolution Perspective
不一致データによる線形回帰: デコンボリューションの視点

Consider the regression problem where the response $Y\in\mathbb{R}$ and the covariate $X\in\mathbb{R}^d$ for $d\geq 1$ are unmatched. Under this scenario, we do not have access to pairs of observations from the distribution of $(X, Y)$, but instead, we have separate data sets $\{Y_i\}_{i=1}^{n_Y}$ and $\{X_j\}_{j=1}^{n_X}$, possibly collected from different sources. We study this problem assuming that the regression function is linear and the noise distribution is known, an assumption that we relax in the applications we consider. We introduce an estimator of the regression vector based on deconvolution and demonstrate its consistency and asymptotic normality under identifiability. Even when identifiability does not hold, we show in some cases that our estimator, the DLSE (Deconvolution Least Squared Estimator), is consistent in terms of an extended $\ell_2$ norm. Using this observation, we devise a method for semi-supervised learning, i.e., when we have access to a small sample of matched pairs $\{(X_k, Y_k)\}_{k=1}^m$. Several applications with synthetic and real data sets are considered to illustrate the theory.

応答$Y\in\mathbb{R}$と共変量$X\in\mathbb{R}^d$ ($d\geq 1$)が対応付けられていない回帰問題を考えてみましょう。このシナリオでは、$(X, Y)$の分布からの観測値のペアにアクセスすることはできませんが、代わりに、異なるソースから収集された可能性のある別々のデータセット$\{Y_i\}_{i=1}^{n_Y}$と$\{X_j\}_{j=1}^{n_X}$があります。回帰関数が線形でノイズ分布が既知であると仮定してこの問題を検討しますが、検討するアプリケーションではこの仮定を緩めます。デコンボリューションに基づく回帰ベクトルの推定量を導入し、識別可能性の下での一致性と漸近正規性を実証します。識別可能性が成立しない場合でも、いくつかのケースでは、推定量DLSE (Deconvolution Least Squared Estimator)が拡張$\ell_2$ノルムに関して一致性を持つことを示します。この観察結果を使用して、対応付けられたペアの小さなサンプル$\{(X_k, Y_k)\}_{k=1}^m$にアクセスできる場合の半教師あり学習の手法を考案します。合成データセットと実際のデータセットを使用したいくつかのアプリケーションを検討して、理論を説明します。

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit
無限幅極限での深層ニューラルネットワークの可積分パラメーター化の学習

To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models makes these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks more regimes are possible, and in this paper we study in detail a specific choice of “small” initialization corresponding to “mean-field” limits of neural networks, which we call integrable parameterizations (IPs). First, we show that under standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit and no learning occurs. We then propose various methods to avoid this trivial behavior and analyze in detail the resulting dynamics. In particular, one of these methods consists in using large initial learning rates, and we show that it is equivalent to a modification of the recently proposed maximal update parameterization µP. We confirm our results with numerical experiments on image classification tasks, which additionally show a strong difference in behavior between various choices of activation functions that is not yet captured by theory.

訓練されたディープニューラルネットワークの動作を理論的に理解するには、ランダム初期化から勾配法によって誘発されるダイナミクスを研究する必要があります。ただし、これらのモデルの非線形かつ合成的な構造により、これらのダイナミクスを分析することは困難です。これらの課題を克服するために、最近、幅を大きくする漸近解析が有益な視点として登場し、現実のディープネットワークに関する実用的な洞察につながりました。2層ニューラルネットワークの場合、この漸近解析を通じて、訓練されたモデルの性質は、カーネルレジーム(大きな初期分散の場合)から特徴学習レジーム(小さな初期分散の場合)まで、初期のランダム重みのスケールに応じて根本的に変化することが理解されています。より深いネットワークの場合は、より多くのレジームが可能であり、この論文では、ニューラルネットワークの「平均場」極限に対応する「小さな」初期化の特定の選択を詳細に研究します。これを可積分パラメーター化(IP)と呼びます。まず、標準的なi.i.d.ゼロ平均初期化の下では、4層を超えるニューラルネットワークの可積分パラメーター化は無限幅極限において定常点から開始し、学習が起こらないことを示します。次に、この自明な動作を回避するためのさまざまな方法を提案し、結果として生じるダイナミクスを詳細に分析します。特に、これらの方法の1つは、大きな初期学習率を使用することです。これは、最近提案された最大更新パラメーター化 µPの修正に相当することを示します。画像分類タスクの数値実験で結果を確認します。この実験では、理論ではまだ捉えられていない、さまざまな活性化関数の選択間での動作に大きな違いがあることも示しています。

Sharp analysis of power iteration for tensor PCA
テンソルPCAのパワー反復のシャープな解析

We investigate the power iteration algorithm for the tensor PCA model introduced in Richard and Montanari (2014). Previous work studying the properties of tensor power iteration is either limited to a constant number of iterations, or requires a non-trivial data-independent initialization. In this paper, we move beyond these limitations and analyze the dynamics of randomly initialized tensor power iteration up to polynomially many steps. Our contributions are threefold: First, we establish sharp bounds on the number of iterations required for power method to converge to the planted signal, for a broad range of the signal-to-noise ratios. Second, our analysis reveals that the actual algorithmic threshold for power iteration is smaller than the one conjectured in the literature by a $\mathrm{polylog}(n)$ factor, where $n$ is the ambient dimension. Finally, we propose a simple and effective stopping criterion for power iteration, which provably outputs a solution that is highly correlated with the true signal. Extensive numerical experiments verify our theoretical results.

私たちは、RichardとMontanari (2014)で導入されたテンソルPCAモデルのべき乗反復アルゴリズムを調査します。テンソルのべき乗反復の特性を研究するこれまでの研究は、反復回数が定数に制限されているか、または非自明なデータ非依存の初期化を必要とします。この論文では、これらの制限を超えて、多項式的に多くのステップまでランダムに初期化されたテンソルのべき乗反復のダイナミクスを分析します。私たちの貢献は3つあります。まず、広範囲の信号対雑音比に対して、べき乗法が植え込まれた信号に収束するために必要な反復回数にシャープな上下界を確立しました。次に、分析により、べき乗反復の実際のアルゴリズムしきい値は、文献で推測されているしきい値よりも$\mathrm{polylog}(n)$倍だけ小さいことが明らかになりました。ここで、$n$はアンビエント次元です。最後に、真の信号と高い相関を持つ解を出力することが証明できる、べき乗反復のシンプルで効果的な停止基準を提案します。広範囲にわたる数値実験により、私たちの理論的結果が検証されます。
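The iteration analysed above can be sketched for an order-3 spiked tensor $T=\lambda\, v\otimes v\otimes v+W$, where each step maps $x$ to the normalized contraction $T(\cdot,x,x)$. This is a minimal illustration only. The tensor order, the unsymmetrized i.i.d. Gaussian noise, and all function names are assumptions, and the paper's stopping criterion is not reproduced because the abstract does not state it.

```python
import math
import random


def normalize(x):
    """Scale a vector to unit Euclidean norm (raises ZeroDivisionError on the zero vector)."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]


def spiked_tensor(v, snr, noise_std, rng):
    """Build T[i][j][k] = snr * v_i v_j v_k + noise_std * N(0, 1) as nested lists."""
    n = len(v)
    return [[[snr * v[i] * v[j] * v[k] + noise_std * rng.gauss(0.0, 1.0)
              for k in range(n)] for j in range(n)] for i in range(n)]


def tensor_power_iteration(T, x0, num_steps):
    """Repeat x <- T(., x, x) / ||T(., x, x)|| for a fixed number of steps."""
    n = len(T)
    x = normalize(x0)
    for _ in range(num_steps):
        y = [sum(T[i][j][k] * x[j] * x[k] for j in range(n) for k in range(n))
             for i in range(n)]
        x = normalize(y)
    return x


def correlation(x, v):
    """Absolute inner product between two unit vectors."""
    return abs(sum(a * b for a, b in zip(x, v)))
```

A random start can be drawn with `normalize([rng.gauss(0.0, 1.0) for _ in range(n)])`. How many steps such a start needs at a given signal-to-noise ratio is exactly what the paper's bounds address, so this sketch makes no claim about it.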

On the Intrinsic Structures of Spiking Neural Networks
スパイキングニューラルネットワークの固有構造について

Recent years have seen a surge of interest in spiking neural networks (SNNs). The performance of SNNs hinges not only on searching for apposite architectures and connection weights, similar to conventional artificial neural networks, but also on the meticulous configuration of their intrinsic structures. However, there has been a dearth of comprehensive studies examining the impact of intrinsic structures; thus developers often find it challenging to apply a standardized configuration of SNNs across diverse datasets or tasks. This work delves deep into the intrinsic structures of SNNs. Initially, we draw two key conclusions: (1) the membrane time hyper-parameter is intimately linked to the eigenvalues of the integration operation, dictating the functional topology of spiking dynamics; (2) various hyper-parameters of the firing-reset mechanism govern the overall firing capacity of an SNN, mitigating the injection ratio or sampling density of input data. These findings elucidate why the efficacy of SNNs hinges heavily on the configuration of intrinsic structures and lead to a recommendation that enhancing the adaptability of these structures contributes to improving the overall performance and applicability of SNNs. Inspired by this recognition, we propose two feasible approaches to enhance SNN learning, involving developing self-connection architectures and stochastic spiking neurons to augment the adaptability of the integration operation and firing-reset mechanism, respectively. We theoretically prove that (1) both methods promote the expressive property for universal approximation, (2) the incorporation of self-connection architectures fosters ample solutions and structural stability for SNNs approximating adaptive dynamical systems, (3) the stochastic spiking neurons maintain generalization bounds with an exponential reduction in Rademacher complexity. Empirical experiments conducted on various real-world datasets affirm the effectiveness of our proposed methods.

近年、スパイキングニューラルネットワーク(SNN)への関心が高まっています。SNNのパフォーマンスは、従来の人工ニューラルネットワークと同様に適切なアーキテクチャと接続の重みを探すだけでなく、その固有構造の綿密な構成にも左右されます。しかし、固有構造の影響を調査する包括的な研究は不足しており、開発者はさまざまなデータセットやタスクに標準化されたSNN構成を適用することに困難を感じることがよくあります。この研究では、SNNの固有構造を深く掘り下げます。まず、2つの重要な結論を導き出します。(1)膜時間ハイパーパラメータは積分演算の固有値と密接に関連しており、スパイキングダイナミクスの機能トポロジを決定します。(2)発火リセットメカニズムのさまざまなハイパーパラメータがSNNの全体的な発火能力を制御し、入力データの注入率またはサンプリング密度を緩和します。これらの発見は、SNNの有効性が固有構造の構成に大きく依存する理由を明らかにし、これらの構造の適応性を高めることがSNNの全体的なパフォーマンスと適用性の向上に寄与するという推奨につながります。この認識に触発されて、我々はSNN学習を強化するための2つの実行可能なアプローチを提案します。それは、それぞれ統合操作と発火リセットメカニズムの適応性を高めるために自己接続アーキテクチャと確率的スパイキングニューロンを開発することです。我々は理論的に、(1)両方の方法が普遍近似の表現特性を促進すること、(2)自己接続アーキテクチャを組み込むことで、適応型動的システムを近似するSNNの十分なソリューションと構造的安定性が促進されること、(3)確率的スパイキングニューロンがRademacher複雑度の指数関数的な減少で一般化境界を維持することを証明しました。さまざまな実際のデータセットで実施された実証実験により、提案方法の有効性が確認されています。

Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance
多目的学習における三者間トレードオフ:最適化、一般化、競合回避

Multi-objective learning (MOL) often arises in machine learning problems when there are multiple data modalities or tasks. One critical challenge in MOL is the potential conflict among different objectives during the optimization process. Recent works have developed various dynamic weighting algorithms for MOL, where the central idea is to find an update direction that avoids conflicts among objectives. Despite its appealing intuition, empirical studies show that dynamic weighting methods may not outperform static ones. To understand this theory-practice gap, we focus on a stochastic variant of MGDA, the Multi-objective gradient with Double sampling (MoDo), and study the generalization performance and its interplay with optimization through the lens of algorithmic stability in the framework of statistical learning theory. We find that the key rationale behind MGDA, updating along a conflict-avoidant direction, may hinder dynamic weighting algorithms from achieving the optimal $O(1/\sqrt{n})$ population risk, where $n$ is the number of training samples. We further demonstrate the impact of dynamic weights on the three-way trade-off among optimization, generalization, and conflict avoidance unique in MOL. We showcase the generality of our theoretical framework by analyzing other algorithms under the framework. Experiments on various multi-task learning benchmarks are performed to demonstrate the practical applicability. Code is available at https://github.com/heshandevaka/Trade-Off-MOL.

多目的学習(MOL)は、複数のデータモダリティまたはタスクがある場合に、機械学習の問題でよく発生します。MOLの重要な課題の1つは、最適化プロセス中に異なる目的間で競合が発生する可能性があることです。最近の研究では、MOL用のさまざまな動的重み付けアルゴリズムが開発されており、その中心的な考え方は、目的間の競合を回避する更新方向を見つけることです。直感的には魅力的ですが、経験的研究では、動的重み付け方法が静的方法よりも優れているとは限らないことが示されています。この理論と実践のギャップを理解するために、MGDAの確率的バリアントである、ダブルサンプリングによる多目的勾配(MoDo)に焦点を当て、統計学習理論のフレームワークにおけるアルゴリズムの安定性の観点から、一般化のパフォーマンスと最適化との相互作用を調べます。MGDAの背後にある主要な理論的根拠である競合回避方向に沿った更新は、動的重み付けアルゴリズムが最適な$O(1/\sqrt{n})$集団リスクを達成することを妨げる可能性があることがわかりました。ここで、$n$はトレーニングサンプルの数です。さらに、MOLに特有の最適化、一般化、競合回避の3方向のトレードオフに対する動的重みの影響を示します。このフレームワークの下で他のアルゴリズムを分析することにより、理論的フレームワークの一般性を示します。実際の適用可能性を実証するために、さまざまなマルチタスク学習ベンチマークで実験が行われます。コードはhttps://github.com/heshandevaka/Trade-Off-MOLで入手できます。

Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data
不均衡なデータによるクロスエントロピー損失下における制約なし特徴モデルに対するニューラル崩壊

Neural Collapse (NC) is a fascinating phenomenon that arises during the terminal phase of training (TPT) of deep neural networks (DNNs). Specifically, for balanced training datasets (each class shares the same number of samples), it is observed that the feature vectors of samples from the same class converge to their corresponding in-class mean features and their pairwise angles are the same. In this paper, we study the extension of the NC phenomenon to imbalanced datasets under cross-entropy loss function in the context of the unconstrained feature model (UFM). Our contribution is multi-fold compared with the state-of-the-art results: (a) we show that the feature vectors within the same class still collapse to the same mean vector; (b) the mean feature vectors no longer share the same pairwise angle. Instead, those angles depend on sample sizes; (c) we also characterize the sharp threshold on which the minority collapse (the feature vectors of the minority groups collapse to one single vector) will happen; (d) finally, we argue that the effect of the imbalance in datasets diminishes as the sample size grows. Our results provide a complete picture of the NC under the cross-entropy loss for imbalanced datasets. Numerical experiments confirm our theories.

ニューラルコラプス(NC)は、ディープニューラルネットワーク(DNN)のトレーニングの最終段階(TPT)で発生する興味深い現象です。具体的には、バランスの取れたトレーニングデータセット(各クラスのサンプル数が同じ)の場合、同じクラスのサンプルの特徴ベクトルが、対応するクラス内平均特徴に収束し、それらのペアごとの角度が等しくなることが観察されます。この論文では、制約なし特徴モデル(UFM)のコンテキストで、クロスエントロピー損失関数の下で、NC現象の不均衡なデータセットへの拡張について検討します。最先端の結果と比較して、私たちの貢献は多岐にわたります。(a)同じクラス内の特徴ベクトルは依然として同じ平均ベクトルに収束することを示します。(b)平均特徴ベクトルはもはや同じペアごとの角度を共有しません。代わりに、それらの角度はサンプルサイズに依存します。(c)少数派の崩壊(少数派グループの特徴ベクトルが1つのベクトルに収束する)が発生するシャープなしきい値も特徴付けます。(d)最後に、サンプルサイズが大きくなるにつれて、データセットの不均衡の影響は減少すると主張します。私たちの結果は、不均衡なデータセットのクロスエントロピー損失の下でのNCの全体像を提供します。数値実験は私たちの理論を裏付けています。

Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables
潜在変数による因果構造推定のための一般化独立ノイズ条件

We investigate the challenging task of learning causal structure in the presence of latent variables, including locating latent variables, determining their quantity, and identifying causal relationships among both latent and observed variables. To address this, we propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. Specifically, for two observed random vectors $\mathbf{Y}$ and $\mathbf{Z}$, GIN holds if and only if $\omega^{\intercal}\mathbf{Y}$ and $\mathbf{Z}$ are statistically independent, where $\omega$ is a non-zero parameter vector determined by the cross-covariance between $\mathbf{Y}$ and $\mathbf{Z}$. We then give necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic causal models. From a graphical perspective, roughly speaking, GIN implies the existence of a set $\mathcal{S}$ such that $\mathcal{S}$ is causally earlier (w.r.t. the causal ordering) than $\mathbf{Y}$, and that every active (collider-free) path between $\mathbf{Y}$ and $\mathbf{Z}$ must contain a node from $\mathcal{S}$. Interestingly, we find that the independent noise condition (i.e., if there is no confounder, causes are independent of the residual derived from regressing the effect on the causes) can be seen as a special case of GIN. With such a connection between GIN and latent causal structures, we further leverage the proposed GIN condition, together with a well-designed search procedure, to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. We show that the underlying causal structure of a LiNGLaH is identifiable in light of GIN conditions under mild assumptions. Experimental results on both synthetic and three real-world data sets show the effectiveness of the proposed approach.

私たちは、潜在変数の存在下で因果構造を学習するという困難なタスクを調査します。これには、潜在変数の特定、その量の決定、潜在変数と観測変数の両方の因果関係の特定が含まれます。これに対処するために、潜在変数を組み込んだ線形非ガウス非巡回因果モデルに対して、一般化独立ノイズ(GIN)条件を提案します。この条件は、特定の測定変数の線形結合と他の測定変数との間の独立性を確立します。具体的には、2つの観測ランダムベクトル$\bf{Y}$と$\bf{Z}$に対して、GINが成立するのは、$\omega^{\intercal}\mathbf{Y}$と$\mathbf{Z}$が統計的に独立している場合のみです。ここで、$\omega$は、$\mathbf{Y}$と$\mathbf{Z}$間の相互共分散によって決定されるゼロ以外のパラメーターベクトルです。次に、線形非ガウス非巡回因果モデルにおけるGIN条件の必要かつ十分なグラフィカル基準を示します。グラフィカルな観点から、大まかに言えば、GINは、$\mathcal{S}$が$\mathbf{Y}$よりも因果的に早い(因果順序に関して)ような集合$\mathcal{S}$の存在と、$\mathbf{Y}$と$\mathbf{Z}$間のすべてのアクティブな(衝突のない)パスに$\mathcal{S}$からのノードが含まれている必要があることを意味します。興味深いことに、独立ノイズ条件(つまり、交絡因子がない場合、原因は原因への効果の回帰から得られる残差から独立している)は、GINの特殊なケースと見なすことができます。GINと潜在的な因果構造とのこのような関係を利用して、提案されたGIN条件と適切に設計された検索手順をさらに活用し、潜在的な交絡因子も因果関係があり、階層構造に従う可能性のある線形非ガウス潜在階層モデル(LiNGLaH)を効率的に推定します。LiNGLaHの根本的な因果構造は、軽度の仮定の下でGIN条件に照らして識別可能であることを示します。合成データセットと3つの実際のデータセットの両方での実験結果は、提案されたアプローチの有効性を示しています。
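To make the role of $\omega$ concrete, the sketch below takes $\omega$ to be a non-zero vector satisfying $\omega^{\intercal}\,\mathrm{Cov}(\mathbf{Y},\mathbf{Z})=0$ and computes it for a two-dimensional $\mathbf{Y}$ and scalar $Z$ that share one latent cause. The data-generating model, the uniform noise, and the helper names are assumptions chosen for illustration. The sketch only constructs $\omega$ and the combination $\omega^{\intercal}\mathbf{Y}$; checking statistical independence would need a separate test, which is not included.

```python
import random


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def gin_omega(y1, y2, z):
    """Non-zero omega with omega^T Cov(Y, Z) = 0 for Y = (y1, y2) and scalar Z."""
    return (covariance(y2, z), -covariance(y1, z))


def simulate_one_latent(n, seed=0):
    """Toy model (assumed): Y1 = L + e1, Y2 = 2 L + e2, Z = L + e3, uniform noise."""
    rng = random.Random(seed)
    latent, y1, y2, z = [], [], [], []
    for _ in range(n):
        l = rng.uniform(-1.0, 1.0)
        latent.append(l)
        y1.append(l + rng.uniform(-0.5, 0.5))
        y2.append(2.0 * l + rng.uniform(-0.5, 0.5))
        z.append(l + rng.uniform(-0.5, 0.5))
    return latent, y1, y2, z
```

In this toy model $\omega$ is proportional to $(2,-1)$, so $\omega^{\intercal}\mathbf{Y}$ is proportional to $2e_1-e_2$: the latent cause cancels and only noise terms unrelated to $Z$ remain.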

Classification of Data Generated by Gaussian Mixture Models Using Deep ReLU Networks
ディープ ReLU ネットワークを使用したガウス混合モデルによって生成されたデータの分類

This paper studies the binary classification of unbounded data from ${\mathbb R}^d$ generated under Gaussian Mixture Models (GMMs) using deep ReLU neural networks. We obtain — for the first time — non-asymptotic upper bounds and convergence rates of the excess risk (excess misclassification error) for the classification without restrictions on model parameters. While the majority of existing generalization analysis of classification algorithms relies on a bounded domain, we consider an unbounded domain by leveraging the analyticity and fast decay of Gaussian distributions. To facilitate our analysis, we give a novel approximation error bound for general analytic functions using ReLU networks, which may be of independent interest. Gaussian distributions can be adopted nicely to model data arising in applications, e.g., speeches, images, and texts; our results provide a theoretical verification of the observed efficiency of deep neural networks in practical classification problems.

この論文では、深層ReLUニューラルネットワークを使用して、Gaussian Mixture Models(GMM)の下で生成された${\mathbb R}^d$からの非有界データの二項分類を研究します。モデルパラメータの制限なしに、分類の過剰リスク(過剰誤分類誤差)の非漸近的な上限と収束率を初めて取得しました。分類アルゴリズムの既存の一般化分析の大部分は有界領域に依存していますが、ガウス分布の解析性と高速減衰を活用して、非有界領域を検討します。分析を容易にするために、ReLUネットワークを使用した一般的な解析関数に対して新しい近似誤差範囲を与えますが、これは独立した関心事である可能性があります。ガウス分布は、スピーチ、画像、テキストなどのアプリケーションで発生するデータをモデル化するためにうまく採用できます。私たちの結果は、実際の分類問題における深層ニューラルネットワークの観察された効率の理論的検証を提供します。

Differentially Private Topological Data Analysis
差分プライベートトポロジカルデータ解析

This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Cech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of Cech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real data set tracking human movement.

この論文は、差分プライバシー(DP)を満たすトポロジカルデータ解析(TDA)を初めて試み、ほぼ最適なプライベートパーシスタンスダイアグラムを生成します。ボトルネック距離の観点からパーシスタンスダイアグラムの感度を分析し、一般的に使用されるCech複体の感度はサンプルサイズ$n$が増加しても低下しないことを示します。このため、Cech複体のパーシスタンスダイアグラムをプライベート化することは困難です。代替案として、$L^1$測度への距離(DTM)によって取得されるパーシスタンスダイアグラムの感度は$O(1/n)$であることを示します。感度分析に基づいて、$L^1$-DTMパーシスタンスダイアグラムのボトルネック距離の観点からユーティリティ関数が定義される指数メカニズムの使用を提案します。また、プライバシーメカニズムの精度の上限と下限を導出します。得られた境界は、メカニズムのプライバシーエラーがほぼ最適であることを示しています。私たちは、シミュレーションと人間の動きを追跡する実際のデータセットを通じて、プライベート化されたパーシスタンスダイアグラムのパフォーマンスを実証します。
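The intuition behind the $O(1/n)$ sensitivity can be seen on an empirical distance-to-measure function. The sketch below assumes the empirical $L^1$-DTM at a query point is the average Euclidean distance to its $k$ nearest sample points, with $k$ a fixed fraction of $n$. That definition and the function name are assumptions for this example; the abstract does not spell them out, and the persistence diagram and exponential mechanism are not implemented here.

```python
import math


def l1_dtm(x, points, k):
    """Average Euclidean distance from x to its k nearest sample points.

    Assumed empirical form of the L^1 distance to measure; k plays the role
    of (mass parameter) * n.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must satisfy 1 <= k <= len(points)")
    dists = sorted(math.dist(x, p) for p in points)
    return sum(dists[:k]) / k
```

Replacing one sample point changes one distance by at most the diameter $D$ of the domain, and the sum of the $k$ smallest distances is 1-Lipschitz in each distance, so the value moves by at most $D/k$. With $k$ proportional to $n$ this shrinks like $1/n$, in contrast to the Cech-complex sensitivity described above.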

On the Optimality of Misspecified Spectral Algorithms
誤指定スペクトルアルゴリズムの最適性について

In the misspecified spectral algorithms problem, researchers usually assume the underlying true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ for some $s\in (0,1)$. The existing minimax optimal results require $\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$ which implicitly requires $s > \alpha_{0}$ where $\alpha_{0}\in (0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether the spectral algorithms are optimal for all $s\in (0,1)$ is an outstanding problem lasting for years. In this paper, we show that spectral algorithms are minimax optimal for any $\alpha_{0}-\frac{1}{\beta} < s < 1$, where $\beta$ is the eigenvalue decay rate of $\mathcal{H}$. We also give several classes of RKHSs whose embedding index satisfies $ \alpha_0 = \frac{1}{\beta} $. Thus, the spectral algorithms are minimax optimal for all $s\in (0,1)$ on these RKHSs.

誤指定スペクトルアルゴリズムの問題では、研究者は通常、根底にある真の関数が$f_{\rho}^{*} \in [\mathcal{H}]^{s}$であると仮定します。ここで$[\mathcal{H}]^{s}$は、ある$s\in (0,1)$に対する、再生カーネルヒルベルト空間(RKHS) $\mathcal{H}$のより滑らかでない補間空間です。既存のミニマックス最適性の結果には$\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$が必要であり、これは暗黙的に$s > \alpha_{0}$を要求します。ここで$\alpha_{0}\in (0,1)$は埋め込みインデックスであり、$\mathcal{H}$に依存する定数です。スペクトルアルゴリズムがすべての$s\in (0,1)$に対して最適であるかどうかは、何年も続く未解決の問題です。この論文では、スペクトルアルゴリズムが任意の$\alpha_{0}-\frac{1}{\beta} < s < 1$に対してミニマックス最適であることを示します(ここで、$\beta$は$\mathcal{H}$の固有値減衰率です)。また、埋め込みインデックスが$ \alpha_0 = \frac{1}{\beta} $を満たすRKHSのいくつかのクラスも与えます。したがって、スペクトルアルゴリズムは、これらのRKHS上のすべての$s\in (0,1)$に対してミニマックス最適です。

An Entropy-Based Model for Hierarchical Learning
階層学習のためのエントロピーベースモデル

Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance’s norm can be construed as the data instance’s complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.

人工知能の分野で主流のアプローチである機械学習により、コンピューターはデータと経験から学習することができます。教師あり学習フレームワークでは、データインスタンスとそれに対応するラベル間の依存関係を正確かつ効率的に学習するには、データ分布とターゲット関数に関する補助情報が必要です。この中心概念は、統計学習理論における正則化の概念と一致しています。実際のデータセットは、多くの場合、マルチスケールのデータインスタンス分布と、適切に動作する滑らかなターゲット関数によって特徴付けられます。べき乗分布などのスケール不変確率分布は、さまざまなコンテキストでのマルチスケールのデータインスタンス分布の注目すべき例を提供します。この論文では、マルチスケールエントロピーベースのトレーニング手順でこのようなマルチスケールデータ構造を活用する階層学習モデルを紹介し、その統計的および計算上の利点を探ります。階層学習モデルは、簡単なタスクから複雑なタスクへの人間の学習の論理的進行にヒントを得たもので、解釈可能なレベルを備えています。このモデルでは、データインスタンスのノルムの対数はデータインスタンスの複雑さとして解釈でき、計算リソースの割り当てはこの複雑さに合わせて調整されるため、推論速度の向上などの利点が得られます。さらに、統計リスクのマルチスケール分析により、従来の均一な収束境界と比較して強力な保証が得られます。

Optimal Clustering with Bandit Feedback
バンディットフィードバックによる最適なクラスタリング

This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent’s task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC’s performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.

この論文では、バンディットフィードバックによるオンラインクラスタリングの問題について検討します。一連のアーム(またはアイテム)は、さまざまな未知のグループに分割できます。各グループ内では、各アームに関連付けられた観測は同じ平均ベクトルを持つ同じ分布に従います。各タイムステップで、エージェントはアームを照会またはプルし、関連付けられている分布から独立した観測を取得します。後続のプルは、以前のプルと以前に取得したサンプルに依存します。エージェントのタスクは、最小のアームプル数で、エラーの確率が規定の定数$\delta$を超えないようにして、アームの基本的なパーティションを明らかにすることです。提案された問題は、ウイルスの亜種のクラスタリングからオンライン市場のセグメンテーションまで、さまざまな用途があります。このタスクの予想されるサンプル複雑性に関するインスタンス依存の情報理論的下限値を提示し、計算効率が高く漸近的に最適なアルゴリズム、つまりBandit Online Clustering (BOC)を設計します。このアルゴリズムには、サブルーチンとしてNP困難な重み付きクラスタリング問題を正確に解く必要性を回避する、適応型シーケンシャルテストの新しい停止規則が含まれています。合成データセットと実世界のデータセットでの広範なシミュレーションを通じて、BOCのパフォーマンスが下限に漸近的に一致し、非適応型ベースラインアルゴリズムを大幅に上回ることを示しています。

A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression
多重線形回帰とペナルティ付き回帰との接続に対する柔軟な経験的ベイズアプローチ

We introduce a new empirical Bayes approach for large-scale multiple linear regression. Our approach combines two key ideas: (i) the use of flexible “adaptive shrinkage” priors, which approximate the nonparametric family of scale mixture of normal distributions by a finite mixture of normal distributions; and (ii) the use of variational approximations to efficiently estimate prior hyperparameters and compute approximate posteriors. Combining these two ideas results in fast and flexible methods, with computational speed comparable to fast penalized regression methods such as the Lasso, and with competitive prediction accuracy across a wide range of scenarios. Further, we provide new results that establish conceptual connections between our empirical Bayes methods and penalized methods. Specifically, we show that the posterior mean from our method solves a penalized regression problem, with the form of the penalty function being learned from the data by directly solving an optimization problem (rather than being tuned by cross-validation). Our methods are implemented in an R package, mr.ash.alpha, available from https://github.com/stephenslab/mr.ash.alpha.

私たちは、大規模な多重線形回帰のための新しい経験的ベイズ手法を紹介します。この手法は、2つの重要なアイデアを組み合わせています。(i)柔軟な「適応型収縮」事前分布の使用。これは、正規分布のスケール混合というノンパラメトリックな族を、正規分布の有限混合で近似するものです。(ii)事前ハイパーパラメータを効率的に推定し、近似事後分布を計算する変分近似の使用です。この2つのアイデアを組み合わせると、高速で柔軟な方法が得られます。計算速度はLassoなどの高速ペナルティ付き回帰方法に匹敵し、さまざまなシナリオで競争力のある予測精度が得られます。さらに、経験的ベイズ方法とペナルティ付き方法の概念的なつながりを確立する新しい結果を提供します。具体的には、この方法の事後平均がペナルティ付き回帰問題の解であり、ペナルティ関数の形式は、交差検証によって調整されるのではなく、最適化問題を直接解くことによってデータから学習されることを示します。私たちの方法は、https://github.com/stephenslab/mr.ash.alphaから入手できるRパッケージmr.ash.alphaに実装されています。

Spectral Analysis of the Neural Tangent Kernel for Deep Residual Networks
深層残差ネットワークのためのニューラルタンジェントカーネルのスペクトル解析

Deep residual network architectures have been shown to achieve superior accuracy over classical feed-forward networks, yet their success is still not fully understood. Focusing on massively over-parameterized, fully connected residual networks with ReLU activation through their respective neural tangent kernels (ResNTK), we provide here a spectral analysis of these kernels. Specifically, we show that, much like NTK for fully connected networks (FC-NTK), for input distributed uniformly on the hypersphere $S^d$, the eigenvalues of ResNTK corresponding to their spherical harmonics eigenfunctions decay polynomially with frequency $k$ as $k^{-d}$. These in turn imply that the set of functions in their Reproducing Kernel Hilbert Space are identical to those of both FC-NTK as well as the standard Laplace kernel. Our spectral analysis allows us to highlight several additional properties of ResNTK, which depend on the choice of a hyper-parameter that balances between the skip and residual connections. Specifically, (1) with no bias, deep ResNTK is significantly biased toward even frequency functions; (2) unlike FC-NTK for deep networks, which is spiky and therefore yields poor generalization, ResNTK is stable and yields small generalization errors. We finally demonstrate these with experiments showing further that these phenomena arise in real networks.

ディープ残差ネットワークアーキテクチャは、従来のフィードフォワードネットワークよりも優れた精度を実現することが示されていますが、その成功はまだ完全には理解されていません。それぞれのニューラルタンジェントカーネル(ResNTK)によるReLUアクティベーションを備えた、大規模に過剰パラメータ化された完全接続残差ネットワークに焦点を当て、ここではこれらのカーネルのスペクトル分析を示します。具体的には、完全接続ネットワーク(FC-NTK)のNTKと同様に、超球$S^d$上に均一に分散された入力に対して、球面調和関数の固有関数に対応するResNTKの固有値は、周波数$k$で$k^{-d}$として多項式的に減衰することを示します。これは、再生カーネルヒルベルト空間の関数セットが、FC-NTKと標準ラプラスカーネルの両方の関数セットと同じであることを意味します。スペクトル分析により、スキップ接続と残差接続のバランスをとるハイパーパラメータの選択に依存するResNTKのいくつかの追加特性を強調できます。具体的には、(1)バイアスがない場合、深層ResNTKは偶数頻度関数に大きく偏ります。(2)スパイク状で一般化が不十分な深層ネットワークのFC-NTKとは異なり、ResNTKは安定しており、一般化エラーが小さくなります。最後に、これらの現象が実際のネットワークで発生することを示す実験でこれらを実証します。

Permuted and Unlinked Monotone Regression in R^d: an approach based on mixture modeling and optimal transport
R^dにおける置換および非リンク単調回帰:混合モデリングと最適輸送に基づくアプローチ

Suppose that we have a regression problem with response variable $Y \in \mathbb{R}^d$ and predictor $X \in \mathbb{R}^d$, for $d \ge 1$. In permuted or unlinked regression we have access to separate unordered data on $X$ and $Y$, as opposed to data on $(X,Y)$-pairs in usual regression. So far in the literature the case $d=1$ has received attention, see e.g., the recent papers by Rigollet and Weed [Information & Inference, 8, 619-717] and Balabdaoui et al. [J. Mach. Learn. Res., 22 (172), 1-60]. In this paper, we consider the general multivariate setting with $d \geq 1$. We show that the notion of cyclical monotonicity of the regression function is sufficient for identification and estimation in the permuted/unlinked regression model. We study permutation recovery in the permuted regression setting and develop a computationally efficient and easy-to-use algorithm for denoising based on the Kiefer-Wolfowitz [Ann. Math. Statist., 27, 887-906] nonparametric maximum likelihood estimator and techniques from the theory of optimal transport. We provide explicit upper bounds on the associated mean squared denoising error for Gaussian noise. As in previous work on the case $d = 1$, the permuted/unlinked setting involves slow (logarithmic) rates of convergence rooted in the underlying deconvolution problem. We also provide an extension to a certain class of elliptic noise distributions that includes a multivariate generalization of the Laplace distribution, for which polynomial rates can be obtained. Numerical studies complement our theoretical analysis and show that the proposed approach performs at least on par with the methods in the aforementioned prior work in the case $d = 1$ while achieving substantial reductions in terms of computational complexity.

応答変数$Y \in \mathbb{R}^d$と予測変数$X \in \mathbb{R}^d$（$d \ge 1$）の回帰問題があるとします。置換回帰または非連結回帰では、通常の回帰の$(X,Y)$ペアのデータとは対照的に、$X$と$Y$の個別の順序なしデータにアクセスできます。これまでの文献では、$d=1$の場合が注目されています。たとえば、RigolletとWeed [Information & Inference, 8, 619-717]およびBalabdaouiら[J. Mach. Learn. Res., 22 (172), 1-60]の最近の論文を参照してください。この論文では、$d \geq 1$の一般的な多変量設定を検討します。回帰関数の周期的単調性の概念は、置換/非リンク回帰モデルにおける識別と推定に十分であることを示す。置換回帰設定における置換回復を研究し、Kiefer-Wolfowitz [Ann. Math. Statist., 27, 887-906]ノンパラメトリック最大尤度推定量と最適輸送理論の手法に基づいて、計算効率が高く使いやすいノイズ除去アルゴリズムを開発します。ガウスノイズに関連する平均二乗ノイズ除去誤差の明示的な上限を提供します。ケース$d = 1$に関する以前の研究と同様に、置換/非リンク設定では、基礎となるデコンボリューション問題に起因する遅い(対数的)収束速度が伴う。また、多項式速度を取得できるラプラス分布の多変量一般化を含む特定の楕円ノイズ分布クラスへの拡張も提供します。数値研究は私たちの理論的分析を補完し、提案されたアプローチは、$d = 1$の場合に前述の先行研究の方法と少なくとも同等のパフォーマンスを発揮し、計算の複雑さの点で大幅な削減を達成することを示しています。

Volterra Neural Networks (VNNs)
ボルテラニューラルネットワーク (VNN)

The importance of inference in Machine Learning (ML) has led to an explosive number of different proposals, particularly in Deep Learning. In an attempt to reduce the complexity of Convolutional Neural Networks, we propose a Volterra filter-inspired Network architecture. This architecture introduces controlled non-linearities in the form of interactions between the delayed input samples of data. We propose a cascaded implementation of Volterra Filtering so as to significantly reduce the number of parameters required to carry out the same classification task as that of a conventional Neural Network. We demonstrate an efficient parallel implementation of this Volterra Neural Network (VNN), along with its remarkable performance while retaining a relatively simpler and potentially more tractable structure. Furthermore, we show a rather sophisticated adaptation of this network to nonlinearly fuse the RGB (spatial) information and the Optical Flow (temporal) information of a video sequence for action recognition. The proposed approach is evaluated on UCF-101 and HMDB-51 datasets for action recognition, and is shown to outperform state of the art CNN approaches.

機械学習(ML)における推論の重要性は、特にディープラーニングにおいて、爆発的に多くの異なる提案を生み出してきました。畳み込みニューラルネットワークの複雑さを軽減する試みとして、我々はVolterraフィルターにヒントを得たネットワークアーキテクチャを提案します。このアーキテクチャは、遅延入力データサンプル間の相互作用という形で、制御された非線形性を導入します。私たちは、従来のニューラルネットワークと同じ分類タスクを実行するために必要なパラメーターの数を大幅に削減するために、Volterraフィルターのカスケード実装を提案します。私たちは、このVolterraニューラルネットワーク(VNN)の効率的な並列実装と、比較的単純で潜在的に扱いやすい構造を維持しながら優れたパフォーマンスを発揮することを示します。さらに、このネットワークをかなり高度に適応させて、アクション認識のためにビデオシーケンスのRGB (空間)情報とオプティカルフロー(時間)情報を非線形に融合することを示します。提案されたアプローチは、アクション認識のためのUCF-101およびHMDB-51データセットで評価され、最先端のCNNアプローチよりも優れていることが示されています。
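
The "interactions between the delayed input samples" can be illustrated with a plain second-order Volterra filter, $y[n] = \sum_k w_1[k]\,x[n-k] + \sum_{k_1,k_2} w_2[k_1,k_2]\,x[n-k_1]\,x[n-k_2]$. The sketch below is a minimal one-dimensional illustration under that textbook form. The function name `volterra_second_order`, the zero-padding of the history, and the example weights are assumptions made here; the cascaded, convolutional VNN layers of the paper are not reproduced.

```python
import numpy as np


def volterra_second_order(x, w1, w2):
    """y[n] = sum_k w1[k] x[n-k] + sum_{k1,k2} w2[k1,k2] x[n-k1] x[n-k2]."""
    x = np.asarray(x, dtype=float)
    w1 = np.asarray(w1, dtype=float)
    w2 = np.asarray(w2, dtype=float)
    memory = len(w1)
    # Assumption: samples before n = 0 are treated as zero.
    padded = np.concatenate([np.zeros(memory - 1), x])
    y = np.empty_like(x)
    for n in range(len(x)):
        # window[k] holds the delayed sample x[n - k]
        window = padded[n:n + memory][::-1]
        y[n] = w1 @ window + window @ w2 @ window
    return y


# Linear term passes x[n] through; quadratic term adds x[n] * x[n-1].
signal = [1.0, 2.0, 3.0]
output = volterra_second_order(signal, w1=[1.0, 0.0], w2=[[0.0, 1.0], [0.0, 0.0]])
```

The quadratic kernel `w2` is where the controlled non-linearity enters: no activation function is applied, only products of delayed samples.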

Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm
ベクトル値正則化最小二乗アルゴリズムの最適ソボレフノルム率に向けて

We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between L2 and the hypothesis space, which we consider as a vector-valued reproducing kernel Hilbert space. These rates allow us to treat the misspecified case in which the true regression function is not contained in the hypothesis space. We combine standard assumptions on the capacity of the hypothesis space with a novel tensor product construction of vector-valued interpolation spaces in order to characterize the smoothness of the regression function. Our upper bound not only attains the same rate as real-valued kernel ridge regression, but also removes the assumption that the target regression function is bounded. For the lower bound, we reduce the problem to the scalar setting using a projection argument. We show that these rates are optimal in most cases and independent of the dimension of the output space. We illustrate our results for the special case of vector-valued Sobolev spaces.

私たちは、L2と仮説空間の間を補間する連続スケールのノルム上で、無限次元ベクトル値リッジ回帰の最初の最適レートを提示します。仮説空間は、ベクトル値再生カーネルヒルベルト空間と見なします。これらのレートにより、真の回帰関数が仮説空間に含まれない、誤って指定されたケースを処理できます。回帰関数の滑らかさを特徴付けるために、仮説空間の容量に関する標準的な仮定と、ベクトル値補間空間の新しいテンソル積構成を組み合わせます。上限は、実数値カーネルリッジ回帰と同じレートを達成するだけでなく、ターゲット回帰関数が有界であるという仮定も取り除きます。下限については、射影引数を使用して問題をスカラー設定に縮小します。これらのレートはほとんどの場合に最適であり、出力空間の次元とは無関係であることを示します。ベクトル値ソボレフ空間の特殊なケースの結果を示します。

Bayesian Regression Markets
ベイズ回帰市場

Although machine learning tasks are highly sensitive to the quality of input data, relevant datasets can often be challenging for firms to acquire, especially when held privately by a variety of owners. For instance, if these owners are competitors in a downstream market, they may be reluctant to share information. Focusing on supervised learning for regression tasks, we develop a regression market to provide a monetary incentive for data sharing. Our mechanism adopts a Bayesian framework, allowing us to consider a more general class of regression tasks. We present a thorough exploration of the market properties, and show that similar proposals in literature expose the market agents to sizeable financial risks, which can be mitigated in our setup.

機械学習タスクは入力データの品質に非常に敏感ですが、特にさまざまな所有者が非公開で保有している場合、関連するデータセットは企業が取得するのが難しいことがよくあります。例えば、これらの所有者が下流市場の競争相手である場合、情報の共有に消極的かもしれません。回帰タスクの教師あり学習に焦点を当て、データ共有に金銭的なインセンティブを提供する回帰市場を開発します。私たちのメカニズムはベイズフレームワークを採用しているため、回帰タスクのより一般的なクラスを考えることができます。私たちは、市場の特性を徹底的に調査し、文献にある同様の提案は市場のエージェントをかなりの金融リスクにさらすこと、そしてそのリスクは私たちの設定では軽減できることを示します。

Sharpness-Aware Minimization and the Edge of Stability
シャープネスを意識した最小化と安定性のエッジ

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the “edge of stability” based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an “edge of stability” for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.

最近の実験では、ステップサイズ$\eta$の勾配降下法(GD)を使用してニューラルネットワークを訓練すると、損失のヘッセ行列の作用素ノルムが約$2/\eta$に達するまで成長し、その後、この値を中心に変動することが多いと示されています。量$2/\eta$は、損失の局所的な2次近似の考慮に基づいて「安定性のエッジ」と呼ばれています。同様の計算を実行して、汎化を改善することが示されているGDの変種であるSharpness-Aware Minimization(SAM)の「安定性のエッジ」を導きます。GDの場合とは異なり、結果として得られるSAMエッジは勾配のノルムに依存します。3つの深層学習の訓練タスクを用いて、SAMがこの分析によって特定された安定性のエッジ上で動作することを経験的に確認します。
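
The $2/\eta$ threshold for GD follows from the local quadratic approximation mentioned in the abstract: for $L(w) = \tfrac{1}{2}\lambda w^2$, one GD step gives $w_{t+1} = (1 - \eta\lambda)\,w_t$, which contracts exactly when $\lambda < 2/\eta$. The sketch below checks this numerically on a one-dimensional quadratic. The function name `gd_on_quadratic` and the chosen values are illustrative assumptions, and the gradient-norm-dependent SAM edge derived in the paper is not reproduced here.

```python
def gd_on_quadratic(curvature, step_size, w0=1.0, num_steps=200):
    """Run gradient descent on L(w) = 0.5 * curvature * w**2 and return the final |w|."""
    w = w0
    for _ in range(num_steps):
        grad = curvature * w          # dL/dw
        w = w - step_size * grad      # equivalent to w <- (1 - step_size * curvature) * w
    return abs(w)


step_size = 0.1                       # the edge of stability is 2 / step_size = 20
below_edge = gd_on_quadratic(curvature=19.0, step_size=step_size)
above_edge = gd_on_quadratic(curvature=21.0, step_size=step_size)
```

With curvature 19 the iterate shrinks by a factor of 0.9 per step, while with curvature 21 it grows by a factor of 1.1 per step, so the two runs end up many orders of magnitude apart.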

Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization
確率的および敵対的オンライン凸最適化を橋渡しするための楽観的オンラインミラー降下法

The stochastically extended adversarial (SEA) model, introduced by Sachs et al. (2022), serves as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition on expected loss functions, it is shown that the expected static regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. Sachs et al. (2022) also provide a regret bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model with smooth expected loss functions. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, but with a relaxation of the convexity requirement from individual functions to expected functions. For strongly convex and smooth functions, we establish an $\mathcal{O}\left(\frac{1}{\lambda}\left(\sigma_{\max}^2+\Sigma_{\max}^2\right)\log \left(\left(\sigma_{1:T}^2 + \Sigma_{1:T}^2\right)/\left(\sigma_{\max}^2+\Sigma_{\max}^2\right)\right)\right)$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ result. For exp-concave and smooth functions, our approach yields a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Moreover, we introduce the first expected dynamic regret guarantee for the SEA model with convex and smooth expected functions, which is more favorable than static regret bounds in non-stationary environments. Furthermore, we expand our investigation to scenarios with non-smooth expected loss functions and propose novel algorithms built upon optimistic OMD with an implicit update, successfully attaining both static and dynamic regret guarantees.

Sachsら(2022)によって導入された確率的に拡張された敵対的(SEA)モデルは、確率的オンライン凸最適化と敵対的オンライン凸最適化の間の補間として機能します。期待損失関数の平滑性条件下では、楽観的なfollow-the-regularized-leader (FTRL)の期待される静的後悔は、凸関数の累積確率分散$\sigma_{1:T}^2$と累積敵対的変動$\Sigma_{1:T}^2$に依存することが示されています。Sachsら(2022)は、強凸関数の最大確率分散$\sigma_{\max}^2$と最大敵対的変動$\Sigma_{\max}^2$に基づく後悔の境界も提供しています。彼らの研究に触発されて、我々は滑らかな期待損失関数を持つSEAモデルに対する楽観的オンラインミラー降下法(OMD)の理論的保証を調査します。凸関数と滑らかな関数の場合、同じ$\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$の後悔境界が得られますが、個々の関数から期待関数への凸性要件が緩和されます。強凸関数および滑らかな関数の場合、我々は$\mathcal{O}\left(\frac{1}{\lambda}\left(\sigma_{\max}^2+\Sigma_{\max}^2\right)\log \left(\left(\sigma_{1:T}^2 + \Sigma_{1:T}^2\right)/\left(\sigma_{\max}^2+\Sigma_{\max}^2\right)\right)\right)$境界を確立します。これは、それらの$\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$結果よりも優れています。指数凹関数および滑らかな関数の場合、我々のアプローチは新しい$\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$境界をもたらします。さらに、凸型で滑らかな期待関数を持つSEAモデルの最初の期待動的後悔保証を導入します。これは、非定常環境における静的後悔境界よりも有利です。さらに、滑らかでない期待損失関数を持つシナリオに調査を拡張し、暗黙の更新を伴う楽観的OMDに基づいて構築された新しいアルゴリズムを提案し、静的および動的後悔保証の両方を正常に達成します。

Multi-Objective Neural Architecture Search by Learning Search Space Partitions
探索空間分割の学習による多目的ニューラルアーキテクチャ探索

Deploying deep learning models requires taking into consideration neural network metrics such as model size, inference latency, and #FLOPs, aside from inference accuracy. This results in deep learning model designers leveraging multi-objective optimization to design effective deep neural networks in multiple criteria. However, applying multi-objective optimizations to neural architecture search (NAS) is nontrivial because NAS tasks usually have a huge search space, along with a non-negligible searching cost. This requires effective multi-objective search algorithms to alleviate the GPU costs. In this work, we implement a novel multi-objective optimizer based on a recently proposed meta-algorithm called LaMOO on NAS tasks. In a nutshell, LaMOO speeds up the search process by learning a model from observed samples to partition the search space and then focusing on promising regions likely to contain a subset of the Pareto frontier. Using LaMOO, we observe an improvement of more than 200% in sample efficiency compared to Bayesian optimization and evolutionary-based multi-objective optimizers on different NAS datasets. For example, when combined with LaMOO, qEHVI achieves a 225% improvement in sample efficiency compared to using qEHVI alone in NasBench201. For real-world tasks, LaMOO achieves 97.36% accuracy with only 1.62M #Params on CIFAR10 in only 600 search samples. On ImageNet, our large model reaches 80.4% top-1 accuracy with only 522M #FLOPs.

ディープラーニングモデルを展開するには、推論精度のほかに、モデルサイズ、推論レイテンシ、#FLOPなどのニューラルネットワークメトリックを考慮する必要があります。このため、ディープラーニングモデルの設計者は、多目的最適化を活用して、複数の基準で効果的なディープニューラルネットワークを設計します。ただし、ニューラルアーキテクチャ検索(NAS)タスクには通常、検索スペースが広く、検索コストも無視できないため、多目的最適化をNASに適用するのは簡単ではありません。GPUコストを軽減するには、効果的な多目的検索アルゴリズムが必要です。この研究では、NASタスクで最近提案されたLaMOOと呼ばれるメタアルゴリズムに基づく新しい多目的最適化プログラムを実装します。簡単に言うと、LaMOOは、観測されたサンプルからモデルを学習して検索スペースを分割し、パレートフロンティアのサブセットを含む可能性のある有望な領域に焦点を当てることで、検索プロセスを高速化します。LaMOOを使用すると、さまざまなNASデータセットでベイジアン最適化や進化ベースの多目的最適化ツールと比較して、サンプル効率が200%以上向上することが確認されています。たとえば、LaMOOと組み合わせると、qEHVIはNasBench201でqEHVIのみを使用した場合と比較して、サンプル効率が225%向上します。実際のタスクでは、LaMOOはCIFAR10でわずか600の検索サンプルで、わずか1.62M #Paramsで97.36%の精度を達成します。ImageNetでは、当社の大規模モデルはわずか522M #FLOPsで80.4%のトップ1精度に達します。

Fermat Distances: Metric Approximation, Spectral Convergence, and Clustering Algorithms
フェルマー距離:メトリック近似、スペクトル収束、クラスタリングアルゴリズム

We analyze the convergence properties of Fermat distances, a family of density-driven metrics defined on Riemannian manifolds with an associated probability measure. Fermat distances may be defined either on discrete samples from the underlying measure, in which case they are random, or in the continuum setting, where they are induced by geodesics under a density-distorted Riemannian metric. We prove that discrete, sample-based Fermat distances converge to their continuum analogues in small neighborhoods with a precise rate that depends on the intrinsic dimensionality of the data and the parameter governing the extent of density weighting in Fermat distances. This is done by leveraging novel geometric and statistical arguments in percolation theory that allow for non-uniform densities and curved domains. Our results are then used to prove that discrete graph Laplacians based on discrete, sample-driven Fermat distances converge to corresponding continuum operators. In particular, we show the discrete eigenvalues and eigenvectors converge to their continuum analogues at a dimension-dependent rate, which allows us to interpret the efficacy of discrete spectral clustering using Fermat distances in terms of the resulting continuum limit. The perspective afforded by our discrete-to-continuum Fermat distance analysis leads to new clustering algorithms for data and related insights into efficient computations associated to density-driven spectral clustering. Our theoretical analysis is supported with numerical simulations and experiments on synthetic and real image data.

私たちは、リーマン多様体上で定義され、関連する確率測度を持つ密度駆動計量の一種であるフェルマー距離の収束特性を解析します。フェルマー距離は、基礎となる測度からの離散サンプル上で定義される場合はランダムであり、連続体設定の場合は密度歪んだリーマン計量の下で測地線によって誘導されます。私たちは、離散的なサンプルベースのフェルマー距離が、データの固有の次元とフェルマー距離における密度の重み付けの範囲を制御するパラメータに依存する正確な速度で、小さな近傍内の連続体類似物に収束することを証明した。これは、不均一な密度と湾曲した領域を許容する浸透理論における新しい幾何学的および統計的議論を活用することによって行われます。次に、我々の結果を使用して、離散的なサンプル駆動フェルマー距離に基づく離散グラフラプラシアンが、対応する連続体演算子に収束することを証明した。特に、離散固有値と固有ベクトルが次元依存の速度で連続類似体に収束することを示しています。これにより、結果として得られる連続極限の観点から、フェルマー距離を使用した離散スペクトルクラスタリングの効果を解釈できます。離散から連続へのフェルマー距離分析によって得られる視点は、データの新しいクラスタリングアルゴリズムと、密度駆動型スペクトルクラスタリングに関連する効率的な計算に関する洞察につながります。私たちの理論分析は、数値シミュレーションと、合成画像データと実画像データの実験によってサポートされています。
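
For readers unfamiliar with the discrete object, a sample Fermat distance is commonly defined as the cheapest path through the sample points when each hop from $x_i$ to $x_j$ costs $\|x_i - x_j\|^p$; the power $p$ plays the role of the density-weighting parameter. The abstract does not spell out this formula, so the sketch below assumes that standard definition, uses the complete graph on the sample, and relies on a Floyd-Warshall pass; the function name `sample_fermat_distances` is hypothetical.

```python
import numpy as np


def sample_fermat_distances(points, p=2.0):
    """All-pairs sample Fermat distances over the complete graph on `points`.

    Edge (i, j) costs ||x_i - x_j||**p; the Fermat distance is the cheapest
    path cost, computed here with Floyd-Warshall in O(n^3) time.
    """
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1) ** p
    for k in range(len(points)):
        # Allow paths that pass through point k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist


# Three collinear points: hopping through the middle one is cheaper when p > 1.
pts = [[0.0], [1.0], [2.0]]
fermat = sample_fermat_distances(pts, p=2.0)
```

In this toy case the direct hop from 0 to 2 costs $2^2 = 4$, while the two-hop path costs $1 + 1 = 2$, which is why paths through densely sampled regions are preferred for $p > 1$.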

Spherical Rotation Dimension Reduction with Geometric Loss Functions
幾何学的損失関数による球面回転次元削減

Modern datasets often exhibit high dimensionality, yet the data reside in low-dimensional manifolds that can reveal underlying geometric structures critical for data analysis. A prime example of such a dataset is a collection of cell cycle measurements, where the inherently cyclical nature of the process can be represented as a circle or sphere. Motivated by the need to analyze these types of datasets, we propose a nonlinear dimension reduction method, Spherical Rotation Component Analysis (SRCA), that incorporates geometric information to better approximate low-dimensional manifolds. SRCA is a versatile method designed to work in both high-dimensional and small sample size settings. By employing spheres or ellipsoids, SRCA provides a low-rank spherical representation of the data with general theoretic guarantees, effectively retaining the geometric structure of the dataset during dimensionality reduction. A comprehensive simulation study, along with a successful application to human cell cycle data, further highlights the advantages of SRCA compared to state-of-the-art alternatives, demonstrating its superior performance in approximating the manifold while preserving inherent geometric structures.

現代のデータセットは高次元であることが多いですが、データは低次元の多様体に存在するため、データ分析に重要な基礎となる幾何学的構造を明らかにすることができます。このようなデータセットの代表的な例は細胞周期測定のコレクションで、プロセスの本来の周期的な性質は円または球として表すことができます。このような種類のデータセットを分析する必要性から、我々は非線形次元削減法である球状回転成分分析(SRCA)を提案します。これは幾何学的情報を組み込んで低次元多様体をより適切に近似します。SRCAは、高次元と小規模なサンプルサイズの設定の両方で機能するように設計された多目的な方法です。球または楕円体を使用することで、SRCAは一般的な理論的保証を備えたデータの低ランク球面表現を提供し、次元削減中にデータセットの幾何学的構造を効果的に保持します。包括的なシミュレーション研究と、ヒト細胞周期データへの成功した適用により、最先端の代替手段と比較したSRCAの利点がさらに強調され、固有の幾何学的構造を維持しながら多様体を近似する際の優れたパフォーマンスが実証されています。

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks
ニューラルネットワークの学習における極端な数値感度と安定性エッジのPDEに基づく説明

We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD) and its variants. We show numerical error (on the order of the smallest floating point bit and thus the most extreme or limiting numerical perturbations induced from floating point arithmetic in training deep nets) can be amplified significantly and result in significant test accuracy variance (sensitivities), comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new insights and predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.

私たちは、確率的勾配降下法(SGD)とその変種を使用した深層ネットワークの現在のトレーニング方法において、抑制された数値的不安定性を発見しました。私たちは、ディープネットのトレーニングにおける浮動小数点演算から生じる数値誤差(最小浮動小数点ビットのオーダー、したがって最も極端または限界的な数値摂動)が大幅に増幅され、SGDの確率性によるテスト精度の変動に匹敵する、大きなテスト精度の変動(感度)につながる可能性があることを示します。私たちは、これが、抑制された、つまり反復および重みテンソル空間の領域にわたって局所化された最適化ダイナミクスの不安定性に起因する可能性が高いことを示します。私たちは、偏微分方程式(PDE)の数値解析を使用した理論的枠組みを提示し、畳み込みニューラルネットワーク(CNN)の勾配降下法PDEを解析することによってこれを行います。私たちは、学習率と重み減衰に関する特定の条件下でのみ、それが安定であることを示します。私たちは、条件に違反したときに発散するのではなく、不安定性が抑制されうることを示します。私たちは、これがCNNの勾配降下法に関連する非線形PDEの帰結であり、離散化のステップサイズを過度に大きくするとその局所的な線形化が変化し、安定化効果をもたらすことを示します。私たちは、抑制された不安定性を、最近発見された安定性のエッジ(EoS)現象に関連付けます。この現象では、古典理論によって予測される安定なステップサイズを超えていながら、損失の最適化が続き、依然として収束します。抑制された不安定性はEoSで発生するため、私たちの理論は、特に正則化の役割とネットワークの複雑さへの依存性について、EoSに関する新しい洞察と予測を提供します。

Two is Better Than One: Regularized Shrinkage of Large Minimum Variance Portfolios
2つは1つよりも優れている:大規模な最小分散ポートフォリオの正則化された縮小

In this paper, we construct a shrinkage estimator of the global minimum variance (GMV) portfolio by combining two techniques: Tikhonov regularization and direct shrinkage of portfolio weights. More specifically, we employ a double shrinkage approach, where the covariance matrix and portfolio weights are shrunk simultaneously. The ridge parameter controls the stability of the covariance matrix, while the portfolio shrinkage intensity shrinks the regularized portfolio weights to a predefined target. Both parameters simultaneously minimize, with probability one, the out-of-sample variance as the number of assets $p$ and the sample size $n$ tend to infinity, while their ratio $p/n$ tends to a constant $c > 0$. This method can also be seen as the optimal combination of the well-established linear shrinkage approach of Ledoit and Wolf (2004) and the shrinkage of the portfolio weights by Bodnar, Parolya and Schmid (2018). No specific distribution is assumed for the asset returns, except for the assumption of finite moments of order $4 + \varepsilon$ for $\varepsilon > 0$. The performance of the double shrinkage estimator is investigated via extensive simulation and empirical studies. The suggested method significantly outperforms its predecessor (without regularization) and the nonlinear shrinkage approach in terms of the out-of-sample variance, Sharpe ratio, and other empirical measures in the majority of scenarios. Moreover, it maintains the most stable portfolio weights with uniformly smallest turnover.

この論文では、ティホノフ正則化とポートフォリオウェイトの直接縮小という2つの手法を組み合わせて、グローバル最小分散(GMV)ポートフォリオの縮小推定量を構築します。具体的には、共分散行列とポートフォリオウェイトが同時に縮小される二重縮小アプローチを採用しています。リッジパラメータは共分散行列の安定性を制御し、ポートフォリオ縮小強度は正則化されたポートフォリオウェイトを事前定義されたターゲットに縮小します。両方のパラメータは、資産数$p$とサンプルサイズ$n$が無限大に近づき、それらの比率$p/n$が定数$c > 0$に近づく漸近領域において、確率1で同時にサンプル外分散を最小化します。この方法は、LedoitとWolf (2004)の確立された線形縮小アプローチと、Bodnar、Parolya、Schmid (2018)によるポートフォリオウェイトの縮小の最適な組み合わせと見ることもできます。資産収益については、ある$\varepsilon > 0$に対する$4 + \varepsilon$次の有限モーメントの仮定を除き、特定の分布は想定されていません。二重縮小推定量のパフォーマンスは、広範なシミュレーションと実証研究によって調査されています。提案された方法は、サンプル外分散、シャープレシオ、およびその他の実証的尺度の点で、ほとんどのシナリオで先行手法(正則化なし)および非線形縮小アプローチを大幅に上回っています。さらに、一様に最小の売買回転率(ターンオーバー)で最も安定したポートフォリオウェイトを維持します。
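
The two shrinkage steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's estimator: the ridge form $S + \lambda I$ is one common way to write Tikhonov regularization and may differ from the parameterization in the paper, the target is assumed to be the equally weighted portfolio, and `ridge` and `intensity` are fixed by hand instead of being set to the asymptotically optimal values the paper derives. The function name `double_shrinkage_gmv` is hypothetical.

```python
import numpy as np


def double_shrinkage_gmv(sample_cov, ridge, intensity, target=None):
    """Ridge-regularized GMV weights shrunk linearly toward a target portfolio."""
    p = sample_cov.shape[0]
    ones = np.ones(p)
    if target is None:
        target = ones / p                       # assumed target: equal weights
    # Step 1: Tikhonov (ridge) regularization of the covariance matrix.
    regularized = sample_cov + ridge * np.eye(p)
    inv_ones = np.linalg.solve(regularized, ones)
    gmv_weights = inv_ones / (ones @ inv_ones)  # weights sum to one
    # Step 2: direct shrinkage of the weights toward the target.
    return intensity * gmv_weights + (1.0 - intensity) * target


rng = np.random.default_rng(0)
returns = rng.standard_normal((60, 40))         # n = 60 observations, p = 40 assets
sample_cov = np.cov(returns, rowvar=False)
weights = double_shrinkage_gmv(sample_cov, ridge=0.5, intensity=0.7)
```

Because both the regularized GMV weights and the target sum to one, any convex combination of them is again a fully invested portfolio.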

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning
協調的マルチエージェント強化学習のための分散低減付き分散型自然方策勾配法

This paper studies a policy optimization problem arising from collaborative multi-agent reinforcement learning in a decentralized setting where agents communicate with their neighbors over an undirected graph to maximize the sum of their cumulative rewards. A novel decentralized natural policy gradient method, dubbed Momentum-based Decentralized Natural Policy Gradient (MDNPG), is proposed, which incorporates natural gradient, momentum-based variance reduction, and gradient tracking into the decentralized stochastic gradient ascent framework. The $\mathcal{O}(n^{-1}\epsilon^{-3})$ sample complexity for MDNPG to converge to an $\epsilon$-stationary point has been established under standard assumptions, where $n$ is the number of agents. It indicates that MDNPG can achieve the optimal convergence rate for decentralized policy gradient methods and possesses a linear speedup in contrast to centralized optimization methods. Moreover, superior empirical performance of MDNPG over other state-of-the-art algorithms has been demonstrated by extensive numerical experiments.

この論文では、分散設定における協調的マルチエージェント強化学習から生じるポリシー最適化問題を研究します。分散設定では、エージェントが無向グラフを介して近隣エージェントと通信し、累積報酬の合計を最大化します。Momentum-based Decentralized Natural Policy Gradient (MDNPG)と呼ばれる新しい分散型自然ポリシー勾配法が提案されています。これは、自然勾配、モメンタムに基づく分散低減、および勾配追跡を、分散型確率的勾配上昇フレームワークに組み込んだものです。MDNPGが$\epsilon$定常点に収束するための$\mathcal{O}(n^{-1}\epsilon^{-3})$のサンプル複雑度が、標準的な仮定の下で確立されています($n$はエージェント数)。これは、MDNPGが分散型ポリシー勾配法の最適な収束率を達成でき、集中型最適化法とは対照的に線形の高速化を実現できることを示しています。さらに、MDNPGが他の最先端アルゴリズムよりも優れた実験的パフォーマンスを発揮することが、広範な数値実験によって実証されています。

Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning
安全な強化学習への応用による安全なブラックボックス最適化のためのログバリア

Optimizing noisy functions online, when evaluating the objective requires experiments on a deployed system, is a crucial task arising in manufacturing, robotics and various other domains. Often, constraints on safe inputs are unknown ahead of time, and we only obtain noisy information, indicating how close we are to violating the constraints. Yet, safety must be guaranteed at all times, not only for the final output of the algorithm. We introduce a general approach for seeking a stationary point in high dimensional non-linear stochastic optimization problems in which maintaining safety during learning is crucial. Our approach, called LB-SGD, is based on applying stochastic gradient descent (SGD) with a carefully chosen adaptive step size to a logarithmic barrier approximation of the original problem. We provide a complete convergence analysis of non-convex, convex, and strongly-convex smooth constrained problems, with first-order and zeroth-order feedback. Our approach yields efficient updates and scales better with dimensionality compared to existing approaches. We empirically compare the sample complexity and the computational cost of our method with existing safe learning approaches. Beyond synthetic benchmarks, we demonstrate the effectiveness of our approach on minimizing constraint violation in policy search tasks in safe reinforcement learning (RL).

ノイズの多い関数をオンラインで最適化することは、目的を評価するために展開されたシステムでの実験が必要な場合、製造、ロボット工学、およびその他のさまざまな分野で発生する重要なタスクです。多くの場合、安全な入力の制約は事前に不明であり、制約に違反する可能性を示すノイズの多い情報のみを取得します。ただし、アルゴリズムの最終出力だけでなく、常に安全性を保証する必要があります。学習中に安全性を維持することが重要である高次元の非線形確率最適化問題で定常点を探すための一般的なアプローチを紹介します。LB-SGDと呼ばれるアプローチは、慎重に選択された適応ステップサイズで確率的勾配降下法（SGD）を元の問題の対数バリア近似に適用することに基づいています。1次および0次のフィードバックを使用して、非凸、凸、および強凸の滑らかな制約付き問題の完全な収束分析を提供します。私たちのアプローチは、既存のアプローチと比較して、効率的な更新をもたらし、次元に応じてより適切にスケーリングします。私たちは、サンプルの複雑さと我々の方法の計算コストを、既存の安全な学習アプローチと経験的に比較します。合成ベンチマークを超えて、安全な強化学習(RL)のポリシー検索タスクにおける制約違反を最小限に抑える我々のアプローチの有効性を実証します。
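
The log-barrier idea can be seen on a one-dimensional toy problem: minimize $f(x) = (x-2)^2$ subject to $g(x) = x - 1 \le 0$ by descending the barrier objective $B_\eta(x) = f(x) - \eta \log(-g(x))$. The sketch below is an illustration under simplifying assumptions, not the LB-SGD algorithm itself: gradients are exact instead of noisy, and the step-size cap (move at most half the distance to the boundary) is a simple stand-in for the paper's adaptive rule. The function name `log_barrier_descent` and all constants are hypothetical.

```python
def log_barrier_descent(eta=0.01, base_step=0.002, num_steps=2000):
    """Minimize f(x) = (x - 2)**2 subject to g(x) = x - 1 <= 0 via a log barrier.

    The barrier objective is B(x) = f(x) - eta * log(-g(x)). Each step is capped
    so that the iterate covers at most half of its distance to the boundary,
    which keeps every iterate strictly feasible.
    """
    x = 0.0                                   # strictly feasible start
    iterates = [x]
    for _ in range(num_steps):
        slack = 1.0 - x                       # -g(x) > 0 inside the feasible set
        grad = 2.0 * (x - 2.0) + eta / slack  # dB/dx
        step = base_step
        if grad != 0.0:
            step = min(base_step, 0.5 * slack / abs(grad))
        x = x - step * grad
        iterates.append(x)
    return x, iterates


x_final, iterates = log_barrier_descent()
```

The unconstrained minimizer $x = 2$ is infeasible, so the iterates settle just inside the boundary at $x = 1$, and no intermediate iterate ever crosses it, which is the "safety at all times" property the abstract emphasizes.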

Cluster-Adaptive Network A/B Testing: From Randomization to Estimation
クラスター適応型ネットワーク A/B テスト: ランダム化から推定まで

The performance of A/B testing in both online and offline experimental settings hinges on mitigating network interference and achieving covariate balancing. These experiments often involve an observable network with identifiable clusters, and measurable cluster-level and individual-level attributes. Exploiting these inherent characteristics holds potential for refining experimental design and subsequent statistical analyses. In this article, we propose a novel cluster-adaptive network A/B testing procedure, which contains a cluster-adaptive randomization (CLAR) and a cluster-adjusted estimator (CAE) to facilitate the design of the experiment and enhance the performance of ATE estimation. The CLAR sequentially assigns clusters to minimize the Mahalanobis distance, which further leads to the balance of the cluster-level covariates and the within-cluster-averaged individual-level covariates. The cluster-adjusted estimator (CAE) is tailored to offset biases caused by network interference. The proposed procedure has two desirable properties. First, we show that the Mahalanobis distance calculated for the two levels of covariates is $O_p(m^{-1})$, where $m$ represents the number of clusters. This result justifies the simultaneous balance of the cluster-level and individual-level covariates. Second, under mild conditions, we derive the asymptotic normality of CAE and demonstrate the benefit of covariate balancing on improving the precision for estimating ATE. The proposed A/B testing procedure is easy to calculate, consistent, and achieves higher accuracy. Extensive numerical studies are conducted to demonstrate the finite sample property of the proposed network A/B testing procedure.

オンラインとオフラインの両方の実験設定でのA/Bテストのパフォーマンスは、ネットワーク干渉を緩和し、共変量のバランスをとることにかかっています。これらの実験では、多くの場合、識別可能なクラスターと、測定可能なクラスターレベルおよび個人レベルの属性を持つ観測可能なネットワークが関係します。これらの固有の特性を利用することで、実験設計とその後の統計分析を改良できる可能性があります。この記事では、実験の設計を容易にし、ATE推定のパフォーマンスを向上させるクラスター適応ランダム化(CLAR)とクラスター調整推定量(CAE)を含む、新しいクラスター適応型ネットワークA/Bテスト手順を提案します。CLARは、マハラノビス距離を最小化するようにクラスターを順次割り当て、クラスターレベルの共変量とクラスター内平均の個人レベルの共変量のバランスをさらにとります。クラスター調整推定量(CAE)は、ネットワーク干渉によって引き起こされるバイアスを相殺するように調整されています。提案された手順には、次の2つの望ましい特性があります。まず、2つのレベルの共変量に対して計算されたマハラノビス距離が$O_p(m^{-1})$であることを示します。ここで、$m$はクラスターの数を表します。この結果は、クラスターレベルと個人レベルの共変量の同時バランスを正当化します。次に、穏やかな条件下で、CAEの漸近正規性を導出し、共変量バランスがATEの推定精度を向上させる利点を示します。提案されたA/Bテスト手順は計算が簡単で、一貫性があり、より高い精度を実現します。提案されたネットワークA/Bテスト手順の有限サンプル特性を実証するために、広範な数値研究が行われます。
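
The balance criterion that CLAR drives down can be illustrated with a squared Mahalanobis distance between the covariate means of the treated and control clusters. The sketch below only computes that imbalance measure. It assumes the pooled sample covariance as the scaling matrix, which may differ from the paper's normalization, and it does not implement the sequential assignment rule. The function name `mahalanobis_imbalance` and the data are made up for illustration.

```python
import numpy as np


def mahalanobis_imbalance(covariates, treated):
    """Squared Mahalanobis distance between treated and control covariate means."""
    covariates = np.asarray(covariates, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    mean_gap = covariates[treated].mean(axis=0) - covariates[~treated].mean(axis=0)
    # Assumption: scale by the pooled sample covariance of all clusters.
    cov = np.atleast_2d(np.cov(covariates, rowvar=False))
    return float(mean_gap @ np.linalg.solve(cov, mean_gap))


# Six clusters with two cluster-level covariates each (made-up numbers).
x = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.5],
              [1.0, 0.0], [2.0, 1.0], [3.0, 0.5]])
balanced = mahalanobis_imbalance(x, [1, 1, 1, 0, 0, 0])     # identical groups
unbalanced = mahalanobis_imbalance(x, [1, 1, 0, 1, 0, 0])
```

An assignment whose two arms have identical covariate means scores zero, and any mean gap yields a strictly positive score, so a sequential scheme can compare candidate assignments by this number.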

On the Computational and Statistical Complexity of Over-parameterized Matrix Sensing
過剰パラメータ化された行列センシングの計算および統計の複雑さについて

We consider solving the low-rank matrix sensing problem with the Factorized Gradient Descent (FGD) method when the specified rank is larger than the true rank. We refer to this as over-parameterized matrix sensing. If the ground truth signal $\mathbf{X}^* \in \mathbb{R}^{d \times d}$ is of rank $r$, but we try to recover it using $\mathbf{F} \mathbf{F}^\top$ where $\mathbf{F} \in \mathbb{R}^{d \times k}$ and $k>r$, the existing statistical analysis either no longer holds or produces a vacuous statistical error upper bound (infinity) due to the flat local curvature of the loss function around the global minima. By decomposing the factorized matrix $\mathbf{F}$ into separate column spaces to capture the impact of using $k > r$, we show that $\left\| \mathbf{F}_t \mathbf{F}_t^\top - \mathbf{X}^* \right\|_F^2$ converges sub-linearly to a statistical error of $\tilde{\mathcal{O}} (k d \sigma^2/n)$ after $\tilde{\mathcal{O}}(\frac{\sigma_{r}}{\sigma}\sqrt{\frac{n}{d}})$ iterations, where $\mathbf{F}_t$ is the output of FGD after $t$ iterations, $\sigma^2$ is the variance of the observation noise, $\sigma_{r}$ is the $r$-th largest eigenvalue of $\mathbf{X}^*$, and $n$ is the number of samples. With a precise characterization of the convergence behavior and the statistical error, our results therefore offer a comprehensive picture of the statistical and computational complexity of solving the over-parameterized matrix sensing problem with FGD.

私たちは、指定されたランクが真のランクよりも大きい場合、因数分解勾配降下法(FGD)を使用して低ランク行列センシング問題を解決することを検討します。これを過剰パラメータ化された行列センシングと呼びます。グラウンドトゥルース信号$\mathbf{X}^* \in \mathbb{R}^{d \times d}$のランクが$r$であるが、$\mathbf{F} \in \mathbb{R}^{d \times k}$かつ$k>r$である$\mathbf{F} \mathbf{F}^\top$を使用してこれを回復しようとすると、既存の統計分析は成り立たなくなるか、損失関数の大域的最小値付近の平坦な局所曲率により、空虚な統計誤差の上限(無限大)しか得られません。因数分解された行列$\mathbf{F}$を別々の列空間に分解して、$k > r$を使用する影響を捉えることで、$\left\| \mathbf{F}_t \mathbf{F}_t^\top - \mathbf{X}^* \right\|_F^2$は、$\tilde{\mathcal{O}}(\frac{\sigma_{r}}{\sigma}\sqrt{\frac{n}{d}})$回の反復後に、統計誤差$\tilde{\mathcal{O}} (k d \sigma^2/n)$に劣線形に収束することを示します。ここで、$\mathbf{F}_t$は、$t$回の反復後のFGDの出力、$\sigma^2$は、観測ノイズの分散、$\sigma_{r}$は、$\mathbf{X}^*$の$r$番目に大きい固有値、$n$はサンプル数です。収束挙動と統計誤差の正確な特徴付けにより、私たちの結果は、FGDを使用して過剰パラメータ化された行列センシング問題を解く場合の統計的および計算的複雑さの包括的な全体像を示します。

Optimization-based Causal Estimation from Heterogeneous Environments
異種環境からの最適化に基づく因果推定

This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments—and ones that exhibit sufficient heterogeneity—CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model and more accurate predictions under interventions.

この論文では、因果推定に対する新しい最適化アプローチを紹介します。共変量と結果を含むデータが与えられた場合、どの共変量が結果の原因であり、因果関係の強さはどの程度でしょうか。従来の機械学習(ML)では、最適化の目標は予測精度を最大化することです。ただし、一部の共変量は結果と非因果関係を示す場合があります。このような偽の関連は従来のMLに予測力を提供しますが、結果を因果的に解釈することを妨げます。この論文では、純粋な予測と因果推論のギャップを埋める最適化アルゴリズムであるCoCoを提案します。CoCoは、最近提案された環境という概念を活用します。環境とは、因果関係は不変であるが、共変量の分布は環境ごとに変化する共変量/応答のデータセットです。複数の環境からのデータセット(十分な異質性を示すもの)が与えられた場合、CoCoは因果解が唯一の解である目的を最大化します。このアプローチの理論的基礎を説明し、シミュレーションおよび実際のデータセットでの有効性を実証します。従来のMLや既存の方法と比較して、CoCoは因果モデルのより正確な推定と介入下でのより正確な予測を提供します。

Optimal Locally Private Nonparametric Classification with Public Data
公開データによる最適局所プライベートノンパラメトリック分類

In this work, we investigate the problem of public data assisted non-interactive Local Differentially Private (LDP) learning with a focus on non-parametric classification. Under the posterior drift assumption, we for the first time derive the mini-max optimal convergence rate with LDP constraint. Then, we present a novel approach, the locally differentially private classification tree, which attains the mini-max optimal convergence rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tuning and provides a fast converging estimator. Comprehensive experiments conducted on synthetic and real data sets show the superior performance of our proposed methods. Both our theoretical and experimental findings demonstrate the effectiveness of public data compared to private data, which leads to practical suggestions for prioritizing non-private data collection.

この研究では、ノンパラメトリック分類に焦点を当てて、非対話型のLDP(Local Differentially Private)学習を支援する公開データの問題を調査します。事後ドリフトの仮定の下で、LDP制約を使用したミニマックス最適収束率を初めて導出します。次に、ミニマックスの最適収束率を達成する新しいアプローチである局所微分プライベート分類ツリーを提示します。さらに、パラメータの調整を回避し、高速な収束推定量を提供するデータ駆動型のプルーニング手順を設計します。合成データセットと実データセットで行われた包括的な実験は、私たちが提案した方法の優れた性能を示しています。私たちの理論的および実験的知見は、プライベートデータと比較して公開データの有効性を実証しており、非プライベートデータ収集を優先するための実用的な提案につながります。

Learning to Warm-Start Fixed-Point Optimization Algorithms
不動点最適化アルゴリズムのウォームスタートの学習

We introduce a machine-learning framework to warm-start fixed-point optimization algorithms. Our architecture consists of a neural network mapping problem parameters to warm starts, followed by a predefined number of fixed-point iterations. We propose two loss functions designed to either minimize the fixed-point residual or the distance to a ground truth solution. In this way, the neural network predicts warm starts with the end-to-end goal of minimizing the downstream loss. An important feature of our architecture is its flexibility, in that it can predict a warm start for fixed-point algorithms run for any number of steps, without being limited to the number of steps it has been trained on. We provide PAC-Bayes generalization bounds on unseen data for common classes of fixed-point operators: contractive, linearly convergent, and averaged. Applying this framework to well-known applications in control, statistics, and signal processing, we observe a significant reduction in the number of iterations and solution time required to solve these problems, through learned warm starts.

私たちは、不動点最適化アルゴリズムをウォームスタートするための機械学習フレームワークを紹介します。このアーキテクチャは、問題パラメータをウォームスタートにマッピングするニューラルネットワークと、それに続くあらかじめ定められた回数の不動点反復で構成されます。不動点残差またはグラウンドトゥルース解までの距離のいずれかを最小化するように設計された2つの損失関数を提案します。このようにして、ニューラルネットワークは、ダウンストリーム損失を最小化するというエンドツーエンドの目標のもとでウォームスタートを予測します。このアーキテクチャの重要な特徴は柔軟性です。つまり、トレーニング時のステップ数に制限されることなく、任意のステップ数で実行される不動点アルゴリズムのウォームスタートを予測できます。一般的なクラスの不動点作用素(縮小的、線形収束、平均化)について、未知のデータに対するPAC-Bayes汎化境界を提供します。このフレームワークを制御、統計、信号処理のよく知られたアプリケーションに適用すると、学習されたウォームスタートによって、これらの問題を解くために必要な反復回数と求解時間が大幅に削減されることが確認されました。

Nonparametric Regression Using Over-parameterized Shallow ReLU Neural Networks
過度にパラメーター化された浅い ReLU ニューラルネットワークを使用したノンパラメトリック回帰

It is shown that over-parameterized neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes, if the weights are suitably constrained or regularized. Specifically, we consider the nonparametric regression of estimating an unknown $d$-variate function by using shallow ReLU neural networks. It is assumed that the regression function is from the Hölder space with smoothness $\alpha<(d+3)/2$ or a variation space corresponding to shallow neural networks, which can be viewed as an infinitely wide neural network. In this setting, we prove that least squares estimators based on shallow neural networks with certain norm constraints on the weights are minimax optimal, if the network width is sufficiently large. As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks, which may be of independent interest.

過度にパラメーター化されたニューラルネットワークは、重みが適切に制約または正則化されている場合、特定の滑らかな関数クラスに属する関数の学習において、ミニマックス最適な収束率(対数因子を除く)を達成できることが示されます。具体的には、浅いReLUニューラルネットワークを使用して未知の$d$変量関数を推定するノンパラメトリック回帰を考えます。回帰関数は、滑らかさが$\alpha<(d+3)/2$のHölder空間、または浅いニューラルネットワークに対応する変動空間(無限に広いニューラルネットワークと見なすことができます)に属すると仮定します。この設定では、ネットワーク幅が十分に大きい場合、重みに特定のノルム制約を課した浅いニューラルネットワークに基づく最小二乗推定量がミニマックス最適であることを証明します。副産物として、浅いReLUニューラルネットワークの局所Rademacher複雑性に対して、サイズに依存しない新しい境界を導出します。これはそれ自体で興味深い結果である可能性があります。

Nonparametric Copula Models for Multivariate, Mixed, and Missing Data
多変量データ、混合データ、欠損データのノンパラメトリックコピュラモデル

Modern data sets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully-observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modeling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.

現代のデータセットは、一般的に、大幅な欠損と混合データタイプの多くの変数の両方を特徴としており、推定と推論に大きな課題をもたらします。完全に観測された変数を含む観測のみを使用して続行する完全なケース分析は、多くの場合、深刻な偏りがあり、一方、欠損値のモデルベースの補完は、混合データタイプの（おそらく多数の）変数間の複雑な依存関係を捕捉するモデルの能力によって制限されます。これらの課題に対処するために、私たちは、多変量カウント、連続、順序、および順序なしカテゴリ変数の結合およびノンパラメトリックモデリングのための新しいベイズ混合コピュラを開発し、このモデルを推論、予測、および欠損データの補完に展開します。最もユニークなのは、私たちは、周辺分布推定のための新しい計算効率の高い戦略を導入し、周辺モデルを指定する必要性を排除しながら、ランダム欠損の下での各周辺分布とコピュラパラメータの事後一貫性を実現します。広範囲にわたるシミュレーション研究により、特に混合データタイプ、複雑な欠損メカニズム、非線形依存関係において、競合方法に比べて優れたモデリングおよび補完機能が実証されています。最後に、欠損データの不適切な処理によって統計分析が歪む可能性があること、および提案されたアプローチによって解決策がどのように提供されるかを強調したデータ分析を示します。

An Analysis of Quantile Temporal-Difference Learning
分位点時間差分学習の解析

We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.

私たちは、強化学習のいくつかの成功した大規模応用において重要な要素であることが実証されている分布強化学習アルゴリズム、分位点時間差分学習(QTD)を解析します。こうした経験的な成功にもかかわらず、QTDの理論的な理解はこれまで得られていませんでした。標準的な確率的近似ツールで解析できる従来のTD学習とは異なり、QTDの更新は縮小写像を近似せず、非線形性が高く、複数の不動点を持つ場合があります。この論文の中心的な結果は、関連する動的計画法の手続き群の不動点へ確率1で収束することの証明であり、これによりQTDに確固たる理論的基盤が与えられます。この証明は、確率的近似理論と非平滑解析を通じて、QTDと非線形微分包含との間の関係を確立します。
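
To make the QTD update concrete, here is a minimal tabular sketch. The abstract does not state the update rule, so the form used here (quantile midpoints $\tau_i = (2i-1)/(2m)$ and an average over the next state's quantiles) is the commonly cited one and should be read as an assumption. The two-state chain, the function names `qtd_update` and `run_demo`, and all hyperparameters are hypothetical.

```python
import random


def qtd_update(theta, state, reward, next_state, gamma, lr):
    """Apply one QTD step to the tabular quantile estimates theta[state].

    next_state is None for a terminal transition, in which case all
    next-state quantiles are treated as 0.
    """
    m = len(theta[state])
    next_quantiles = [0.0] * m if next_state is None else theta[next_state]
    targets = [reward + gamma * q for q in next_quantiles]
    new_row = []
    for i, value in enumerate(theta[state]):
        tau = (2 * i + 1) / (2 * m)  # quantile midpoint for 0-based index i
        # Fraction of bootstrapped targets that lie strictly below the estimate.
        below = sum(1 for t in targets if t < value) / m
        new_row.append(value + lr * (tau - below))
    theta[state] = new_row


def run_demo(episodes=20000, m=5, gamma=0.5, lr=0.002, seed=0):
    """Toy chain (assumed): state 0 -> state 1 (reward 0) -> terminal (reward ~ U(0, 1))."""
    rng = random.Random(seed)
    theta = {0: [0.0] * m, 1: [0.0] * m}
    for _ in range(episodes):
        qtd_update(theta, 0, 0.0, 1, gamma, lr)
        qtd_update(theta, 1, rng.random(), None, gamma, lr)
    return theta


if __name__ == "__main__":
    for state, quantiles in run_demo().items():
        print(state, [round(q, 3) for q in quantiles])
```

In this toy chain the return from state 1 is uniform on $[0, 1]$, so its estimated quantiles should settle near the levels $0.1, 0.3, \dots, 0.9$, and the estimates for state 0 should settle near half of those values because of the discount $\gamma = 0.5$.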

Conformal Inference for Online Prediction with Arbitrary Distribution Shifts
任意の分布シフトを持つオンライン予測のための共形推論

We consider the problem of forming prediction sets in an online setting where the distribution generating the data is allowed to vary over time. Previous approaches to this problem suffer from over-weighting historical data and thus may fail to quickly react to the underlying dynamics. Here, we correct this issue and develop a novel procedure with provably small regret over all local time intervals of a given width. We achieve this by modifying the adaptive conformal inference (ACI) algorithm of Gibbs and Candès (2021) to contain an additional step in which the step-size parameter of ACI’s gradient descent update is tuned over time. Crucially, this means that unlike ACI, which requires knowledge of the rate of change of the data-generating mechanism, our new procedure is adaptive to both the size and type of the distribution shift. Our methods are highly flexible and can be used in combination with any baseline predictive algorithm that produces point estimates or estimated quantiles of the target without the need for distributional assumptions. We test our techniques on two real-world datasets aimed at predicting stock market volatility and COVID-19 case counts and find that they are robust and adaptive to real-world distribution shifts.

私たちは、データを生成する分布が時間とともに変化することが許されるオンライン設定で予測セットを形成する問題を考察します。この問題に対するこれまでのアプローチは、履歴データに過剰な重み付けをするという問題を抱えており、そのため根本的なダイナミクスに迅速に対応できない可能性があります。ここでは、この問題を修正し、特定の幅のすべてのローカル時間間隔にわたって証明可能なほど小さい後悔を伴う新しい手順を開発します。これは、GibbsとCandès (2021)の適応共形推論(ACI)アルゴリズムを修正して、ACIの勾配降下更新のステップサイズパラメータを時間とともに調整する追加ステップを含めることで実現します。重要なのは、データ生成メカニズムの変化率に関する知識を必要とするACIとは異なり、新しい手順は分布シフトのサイズとタイプの両方に適応できるということです。私たちの方法は非常に柔軟性が高く、分布の仮定を必要とせずにターゲットの点推定値または推定分位値を生成するベースライン予測アルゴリズムと組み合わせて使用できます。私たちは、株式市場のボラティリティとCOVID-19の感染者数を予測することを目的とした2つの現実世界のデータセットでこの手法をテストし、それが堅牢で現実世界の分布の変化に適応できることを発見しました。
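
The base ACI step that the paper builds on can be sketched as follows. This shows only the fixed-step-size update $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \mathrm{err}_t)$ of Gibbs and Candès (2021), not the step-size tuning procedure proposed in the paper. The score stream, the helper names `quantile_sorted` and `run_aci`, and the default values of `alpha` and `gamma` are assumptions for illustration.

```python
import bisect
import math
import random


def quantile_sorted(sorted_scores, level):
    """Empirical quantile of past scores; +/-inf once the level leaves (0, 1)."""
    if level <= 0.0:
        return -math.inf  # empty prediction set
    if level >= 1.0 or not sorted_scores:
        return math.inf   # prediction set covers everything
    k = math.ceil(level * len(sorted_scores)) - 1
    return sorted_scores[max(k, 0)]


def run_aci(scores, alpha=0.1, gamma=0.01):
    """Return (miscoverage indicators, alpha_t trajectory) for a stream of scores.

    A score above the current threshold means the true value fell outside
    the prediction set at that step.
    """
    alpha_t = alpha
    history = []  # past scores, kept sorted
    errors, alphas = [], []
    for s in scores:
        threshold = quantile_sorted(history, 1.0 - alpha_t)
        err = 1 if s > threshold else 0
        errors.append(err)
        alphas.append(alpha_t)
        alpha_t += gamma * (alpha - err)  # widen sets after a miss, shrink otherwise
        bisect.insort(history, s)
    return errors, alphas


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed stream: the noise scale jumps from 1 to 3 halfway through.
    stream = [abs(rng.gauss(0.0, 1.0 if t < 1000 else 3.0)) for t in range(2000)]
    errs, _ = run_aci(stream)
    print("empirical miscoverage:", sum(errs) / len(errs))
```

With this update the long-run miscoverage frequency stays within $(\max(\alpha_1, 1-\alpha_1) + \gamma)/(\gamma T)$ of the target $\alpha$ for any score sequence, which is why the choice of $\gamma$ matters: a larger $\gamma$ reacts faster to shifts but makes the sets less stable.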

More Efficient Estimation of Multivariate Additive Models Based on Tensor Decomposition and Penalization
テンソル分解とペナルティ化に基づく多変量加法モデルのより効率的な推定

We consider parsimonious modeling of high-dimensional multivariate additive models using regression splines, with or without sparsity assumptions. The approach is based on treating the coefficients in the spline expansions as a third-order tensor. Note the data does not have tensor predictors or tensor responses, which distinguishes our study from the existing ones. A Tucker decomposition is used to reduce the number of parameters in the tensor. We also combined the Tucker decomposition with penalization to enable variable selection. The proposed method can avoid the statistical inefficiency caused by estimating a large number of nonparametric functions. We provide sufficient conditions under which the proposed tensor-based estimators achieve the optimal rate of convergence for the nonparametric regression components. We conduct simulation studies to demonstrate the effectiveness of the proposed novel approach in fitting high-dimensional multivariate additive models and illustrate its application on a breast cancer copy number variation and gene expression data set.

私たちは、スパース仮定の有無にかかわらず、回帰スプラインを使用した高次元多変量加法モデルの簡潔なモデリングを検討します。このアプローチは、スプライン展開の係数を3次テンソルとして扱うことに基づいています。データにはテンソル予測子やテンソル応答がないため、我々の研究は既存の研究と区別されます。テンソルのパラメータ数を減らすためにタッカー分解を使用します。また、変数選択を可能にするために、タッカー分解とペナルティを組み合わせました。提案された方法は、多数のノンパラメトリック関数を推定することによって生じる統計的非効率性を回避できます。私たちは、提案されたテンソルベースの推定量がノンパラメトリック回帰成分の最適な収束率を達成するための十分な条件を提供します。私たちは、高次元多変量加法モデルのフィッティングにおける提案された新しいアプローチの有効性を実証するためにシミュレーション研究を実施し、乳がんのコピー数変異と遺伝子発現データセットへの適用を示す。

A Kernel Test for Causal Association via Noise Contrastive Backdoor Adjustment
ノイズコントラスティブバックドア調整による因果関係のカーネルテスト

Causal inference grows increasingly complex as the dimension of confounders increases. Given treatments $X$, outcomes $Y$, and measured confounders $Z$, we develop a non-parametric method to test the do-null hypothesis that, after an intervention on $X$, there is no marginal dependence of $Y$ on $X$, against the general alternative. Building on the Hilbert-Schmidt Independence Criterion (HSIC) for marginal independence testing, we propose backdoor-HSIC (bd-HSIC), an importance weighted HSIC which combines density ratio estimation with kernel methods. Experiments on simulated data verify the correct size and that the estimator has power for both binary and continuous treatments under a large number of confounding variables. Additionally, we establish convergence properties of the estimators of covariance operators used in bd-HSIC. We investigate the advantages and disadvantages of bd-HSIC against parametric tests as well as the importance of using the do-null testing in contrast to marginal or conditional independence testing. A complete implementation can be found at https://github.com/MrHuff/kgformula.

交絡因子の次元が大きくなるにつれて、因果推論はますます複雑になります。処置$X$、結果$Y$、測定された交絡因子$Z$が与えられた場合に、$X$への介入後、$Y$の$X$への周辺依存性がないというdo-null仮説を、一般的な対立仮説に対して検定するためのノンパラメトリックな方法を開発します。周辺独立性検定のためのヒルベルト-シュミット独立性基準(HSIC)に基づいて、密度比推定とカーネル法を組み合わせた重要度加重HSICであるバックドアHSIC (bd-HSIC)を提案します。シミュレートされたデータでの実験により、検定のサイズが正しいこと、および多数の交絡変数の下で二値処置と連続処置の両方に対して推定量が検出力を持つことが確認されます。さらに、bd-HSICで使用される共分散作用素の推定量の収束特性を確立します。パラメトリック検定に対するbd-HSICの利点と欠点、および周辺独立性検定や条件付き独立性検定と対比してdo-null検定を使用することの重要性を調査します。完全な実装は https://github.com/MrHuff/kgformula にあります。
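
For orientation, the plain (unweighted) HSIC statistic that bd-HSIC builds on can be computed in a few lines. This sketch uses the biased estimator $\frac{1}{n^2}\operatorname{tr}(KHLH)$ with Gaussian kernels on scalar data; it does not include the importance weights or density ratio estimation of bd-HSIC, and it is unrelated to the authors' implementation. The function names and the bandwidth are assumptions.

```python
import math
import random


def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel on a list of scalars."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in xs] for a in xs]


def center(gram):
    """Return H K H for a symmetric Gram matrix K, with H = I - (1/n) 11^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    # K is symmetric, so its column means equal its row means.
    return [
        [gram[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, bandwidth=1.0):
    """Biased HSIC estimate tr(K H L H) / n^2 for paired scalar samples."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, bandwidth))
    l_gram = gaussian_gram(ys, bandwidth)
    total = sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n))
    return total / (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(100)]
    y_dep = [xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x]
    y_ind = [rng.gauss(0.0, 1.0) for _ in range(100)]
    print("dependent:", hsic(x, y_dep), "independent:", hsic(x, y_ind))
```

A value close to zero is what one expects under marginal independence; a permutation test on this statistic would be one common way to calibrate a threshold.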

Assessing the Overall and Partial Causal Well-Specification of Nonlinear Additive Noise Models
非線形加法ノイズモデルの全体的および部分的な因果的適正特定(well-specification)の評価

We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We develop a general framework based on knowledge of the multivariate observational data distribution. We then propose an algorithm for finite sample data, discuss its asymptotic properties, and illustrate its performance on simulated and real data.

私たちは、非線形で不均一分散の可能性がある因果加法ノイズモデルにおいて、モデルの誤特定を検出する方法を提案します。私たちは、そのような誤特定がある場合でも因果効果を推論できる予測変数を特定することを目指しています。多変量観測データ分布の知識に基づいて一般的なフレームワークを開発します。次に、有限サンプルデータに対するアルゴリズムを提案し、その漸近特性について議論し、シミュレーションデータと実データでその性能を示します。

Simple Cycle Reservoirs are Universal
シンプルサイクルリザーバーはユニバーサルです

Reservoir computation models form a subclass of recurrent neural networks with fixed non-trainable input and dynamic coupling weights. Only the static readout from the state space (reservoir) is trainable, thus avoiding the known problems with propagation of gradient information backwards through time. Reservoir models have been successfully applied in a variety of tasks and were shown to be universal approximators of time-invariant fading memory dynamic filters under various settings. Simple cycle reservoirs (SCR) have been suggested as severely restricted reservoir architecture, with equal weight ring connectivity of the reservoir units and input-to-reservoir weights of binary nature with the same absolute value. Such architectures are well suited for hardware implementations without performance degradation in many practical tasks. In this contribution, we rigorously study the expressive power of SCR in the complex domain and show that they are capable of universal approximation of any unrestricted linear reservoir system (with continuous readout) and hence any time-invariant fading memory filter over uniformly bounded input streams.

リザーバ計算モデルは、固定された学習不可能な入力重みと動的結合重みを持つリカレントニューラルネットワークのサブクラスを形成します。状態空間(リザーバ)からの静的な読み出しのみが学習可能であるため、勾配情報を時間方向に逆伝播させる際の既知の問題を回避できます。リザーバモデルはさまざまなタスクにうまく適用されており、さまざまな設定の下で、時不変フェーディングメモリ動的フィルタの普遍近似器であることが示されています。単純サイクルリザーバ(SCR)は、リザーバユニットが等しい重みのリング状に接続され、入力からリザーバへの重みが同じ絶対値を持つ二値である、厳しく制限されたリザーバアーキテクチャとして提案されています。このようなアーキテクチャは、多くの実用的なタスクで性能を低下させることなくハードウェア実装に適しています。本稿では、複素領域におけるSCRの表現力を厳密に調査し、SCRが(連続な読み出しを持つ)任意の制限のない線形リザーバシステムを普遍近似でき、したがって一様有界な入力ストリーム上の任意の時不変フェーディングメモリフィルタを普遍近似できることを示します。
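
The SCR structure described above (a single ring weight and input weights of one shared magnitude with binary signs) can be sketched as a linear state update. This is a real-valued illustration only, whereas the paper works in the complex domain; the alternating sign pattern, the parameter values, and the names `scr_step` and `run_scr` are assumptions.

```python
def scr_step(state, u, ring_weight, input_weight, signs):
    """One linear SCR update: x_t[i] = r * x_{t-1}[i-1] + v * s_i * u_t.

    Unit i only listens to unit i-1, and index -1 wraps around to close the ring.
    """
    n = len(state)
    return [ring_weight * state[i - 1] + input_weight * signs[i] * u for i in range(n)]


def run_scr(inputs, n_units=5, ring_weight=0.9, input_weight=0.5, signs=None):
    """Drive the reservoir with a scalar input sequence and collect all states."""
    if signs is None:
        # Assumed sign pattern; any fixed +/-1 pattern fits the description.
        signs = [1 if i % 2 == 0 else -1 for i in range(n_units)]
    state = [0.0] * n_units
    states = []
    for u in inputs:
        state = scr_step(state, u, ring_weight, input_weight, signs)
        states.append(state)
    return states


if __name__ == "__main__":
    # Impulse response: after n_units silent steps the state returns to its
    # initial pattern, scaled by ring_weight ** n_units.
    for s in run_scr([1.0] + [0.0] * 5):
        print([round(v, 4) for v in s])
```

Only a readout trained on the collected states would be learned; the two scalars `ring_weight` and `input_weight` and the sign pattern fully determine the reservoir itself.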

On the Computational Complexity of Metropolis-Adjusted Langevin Algorithms for Bayesian Posterior Sampling
ベイズ事後サンプリングのためのメトロポリス調整ランジュバンアルゴリズムの計算複雑性について

In this paper, we examine the computational complexity of sampling from a Bayesian posterior (or pseudo-posterior) using the Metropolis-adjusted Langevin algorithm (MALA). MALA first employs a discrete-time Langevin SDE to propose a new state, and then adjusts the proposed state using Metropolis-Hastings rejection. Most existing theoretical analyses of MALA rely on the smoothness and strong log-concavity properties of the target distribution, which are often lacking in practical Bayesian problems. Our analysis hinges on statistical large sample theory, which constrains the deviation of the Bayesian posterior from being smooth and log-concave in a very specific way. In particular, we introduce a new technique for bounding the mixing time of a Markov chain with a continuous state space via the $s$-conductance profile, offering improvements over existing techniques in several aspects. By employing this new technique, we establish the optimal parameter dimension dependence of $d^{1/3}$ and condition number dependence of $\kappa$ in the non-asymptotic mixing time upper bound for MALA after the burn-in period, under a standard Bayesian setting where the target posterior distribution is close to a $d$-dimensional Gaussian distribution with a covariance matrix having a condition number $\kappa$. We also prove a matching mixing time lower bound for sampling from a multivariate Gaussian via MALA to complement the upper bound.

この論文では、メトロポリス調整ランジュバンアルゴリズム(MALA)を使用してベイズ事後分布(または擬似事後分布)からサンプリングする際の計算の複雑さについて検討します。MALAは、最初に離散時間ランジュバンSDEを使用して新しい状態を提案し、次にメトロポリス-ヘイスティングス棄却法を使用して提案された状態を調整します。MALAの既存の理論分析のほとんどは、ターゲット分布の滑らかさと強い対数凹性に依存していますが、これらは実際のベイズ問題では欠けていることがよくあります。私たちの分析は、ベイズ事後分布が滑らかで対数凹であることからの逸脱を非常に具体的な方法で制限する統計的大標本理論に依存しています。特に、$s$-コンダクタンスプロファイルを介して連続状態空間を持つマルコフ連鎖の混合時間を制限するための新しい手法を紹介し、いくつかの点で既存の手法よりも改善されています。この新しい手法を採用することで、ターゲット事後分布が条件数$\kappa$を持つ共分散行列を持つ$d$次元ガウス分布に近い標準的なベイズ設定の下で、バーンイン期間後のMALAの非漸近的混合時間上限における$d^{1/3}$の最適なパラメータ次元依存性と$\kappa$の条件数依存性を確立します。また、上限を補完するために、MALAを介して多変量ガウスからサンプリングするための一致する混合時間の下限も証明します。
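
The two MALA stages described above, a discretized Langevin proposal followed by a Metropolis-Hastings correction, can be sketched in plain Python. The standard Gaussian target, the step size, and the function names are assumptions chosen for illustration; the paper's analysis concerns Bayesian posteriors, not this toy target.

```python
import math
import random


def mala_sample(log_density, grad_log_density, x0, step, n_steps, rng):
    """Run MALA and return (list of samples, acceptance rate)."""

    def log_q(to, frm):
        # Log proposal density of moving frm -> to, up to an additive constant.
        g = grad_log_density(frm)
        sq = sum((t - f - step * gi) ** 2 for t, f, gi in zip(to, frm, g))
        return -sq / (4.0 * step)

    x = list(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        g = grad_log_density(x)
        # Langevin proposal: gradient drift plus Gaussian noise.
        proposal = [
            xi + step * gi + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)
        ]
        # Metropolis-Hastings log acceptance ratio.
        log_ratio = (
            log_density(proposal) + log_q(x, proposal)
            - log_density(x) - log_q(proposal, x)
        )
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(list(x))
    return samples, accepted / n_steps


def std_normal_log_density(x):
    """Unnormalized log density of a standard Gaussian (assumed toy target)."""
    return -0.5 * sum(xi * xi for xi in x)


def std_normal_grad(x):
    """Gradient of the log density above."""
    return [-xi for xi in x]


if __name__ == "__main__":
    rng = random.Random(1)
    draws, rate = mala_sample(std_normal_log_density, std_normal_grad, [3.0, -3.0], 0.5, 20000, rng)
    print("acceptance rate:", rate)
```

The rejection step is what removes the discretization bias of the Langevin proposal; the step size trades off acceptance rate against how far each accepted move travels, which is the quantity the mixing-time bounds in the abstract are about.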

Generalization and Stability of Interpolating Neural Networks with Minimal Width
最小幅のニューラルネットワーク補間における一般化と安定性

We investigate the generalization and optimization properties of shallow neural-network classifiers trained by gradient descent in the interpolating regime. Specifically, in a realizable scenario where model weights can achieve arbitrarily small training error $\epsilon$ and their distance from initialization is $g(\epsilon)$, we demonstrate that gradient descent with $n$ training data achieves training error $O(g(1/T)^2\big/T)$ and generalization error $O(g(1/T)^2\big/n)$ at iteration $T$, provided there are at least $m=\Omega(g(1/T)^4)$ hidden neurons. We then show that our realizable setting encompasses a special case where data are separable by the model’s neural tangent kernel. For this and logistic-loss minimization, we prove the training loss decays at a rate of $\tilde O(1/ T)$ given polylogarithmic number of neurons $m=\Omega(\log^4 (T))$. Moreover, with $m=\Omega(\log^{4} (n))$ neurons and $T\approx n$ iterations, we bound the test loss by $\tilde{O}(1/ n)$. Our results differ from existing generalization outcomes using the algorithmic-stability framework, which necessitate polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak-convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Eventually, despite the objective’s non-convexity, this leads to convergence and generalization-gap bounds that resemble those found in the convex setting of linear logistic regression.

私たちは、補間方式で勾配降下法によって訓練された浅いニューラルネットワーク分類器の一般化と最適化の特性を調査します。具体的には、モデルの重みが任意の小さな訓練誤差$\epsilon$を達成でき、初期化からの距離が$g(\epsilon)$である実現可能なシナリオでは、少なくとも$m=\Omega(g(1/T)^4)$個の隠れニューロンがある場合、$n$個の訓練データを使用した勾配降下法が、反復$T$で訓練誤差$O(g(1/T)^2\big/T)$と一般化誤差$O(g(1/T)^2\big/n)$を達成することを示します。次に、実現可能な設定には、データがモデルのニューラル接線カーネルによって分離可能な特殊なケースが含まれることを示します。これとロジスティック損失の最小化のために、ニューロンの多重対数数$m=\Omega(\log^4 (T))$が与えられた場合、トレーニング損失は$\tilde O(1/ T)$の速度で減少することを証明します。さらに、ニューロンが$m=\Omega(\log^{4} (n))$個で反復回数が$T\approx n$回の場合、テスト損失は$\tilde{O}(1/ n)$に制限されます。私たちの結果は、多項式幅を必要とし、最適ではない一般化率をもたらす、アルゴリズム安定性フレームワークを使用した既存の一般化結果とは異なります。私たちの分析の中心となるのは、新しい自己有界弱凸性プロパティの使用です。これは、十分にパラメータ化されたニューラルネットワーク分類器の一般化されたローカル準凸性プロパティにつながります。最終的には、目的関数が非凸であるにもかかわらず、線形ロジスティック回帰の凸設定で見られるものに似た収束と一般化ギャップ境界が得られます。

Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression
分割統治法カーネルベースの関数線形回帰の統計的最適性

Previous analysis of regularized functional linear regression in a reproducing kernel Hilbert space (RKHS) typically requires the target function to be contained in this kernel space. This paper studies the convergence performance of divide-and-conquer estimators in the scenario that the target function does not necessarily reside in the underlying RKHS. As a decomposition-based scalable approach, the divide-and-conquer estimators of functional linear regression can substantially reduce the algorithmic complexities in time and memory. We develop an integral operator approach to establish sharp finite sample upper bounds for prediction with divide-and-conquer estimators under various regularity conditions of explanatory variables and target function. We also prove the asymptotic optimality of the derived rates by building the mini-max lower bounds. Finally, we consider the convergence of noiseless estimators and show that the rates can be arbitrarily fast under mild conditions.

再現カーネルヒルベルト空間(RKHS)での正則化関数線形回帰の以前の分析では、通常、ターゲット関数がこのカーネル空間に含まれている必要があります。この論文では、ターゲット関数が必ずしも基礎となるRKHSに存在するとは限らないシナリオでの分割統治推定器の収束性能を研究します。分解ベースのスケーラブルなアプローチとして、関数型線形回帰の分割統治推定量は、時間とメモリのアルゴリズムの複雑さを大幅に軽減できます。説明変数とターゲット関数の様々な規則性条件下での分割統治推定量を用いた予測のためのシャープな有限サンプル上限を確立するための積分演算子アプローチを開発します。また、導出されたレートの漸近最適性を、ミニマックス下限を構築することで証明します。最後に、ノイズレス推定量の収束を検討し、穏やかな条件下でレートが任意に速くなることを示します。

Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations
離散観測からの均質線形ODEシステムの学習における識別可能性と漸近性

Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning. However, the theoretical aspects, for example, identifiability and asymptotic properties of statistical estimation are still obscure. This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally-spaced error-free observations sampled from a single trajectory. When observations are disturbed by measurement noise, we prove that under mild conditions, the parameter estimator based on the Nonlinear Least Squares (NLS) method is consistent and asymptotic normal with $n^{-1/2}$ convergence rate. Based on the asymptotic normality property, we construct confidence sets for the unknown system parameters and propose a new method to infer the causal structure of the ODE system, that is, inferring whether there is a causal link between system variables. Furthermore, we extend the results to degraded observations, including aggregated and time-scaled ones. To the best of our knowledge, our work is the first systematic study of the identifiability and asymptotic properties in learning linear ODE systems. We also construct simulations with various system dimensions to illustrate the established theoretical results.

常微分方程式(ODE)は最近、機械学習で大きな注目を集めています。しかし、統計的推定の識別可能性や漸近特性などの理論的側面はまだ不明瞭です。この論文では、単一の軌跡からサンプリングされた等間隔のエラーフリー観測のシーケンスから、同次線形ODEシステムの識別可能性の十分条件を導出します。観測が測定ノイズによって乱れた場合、軽度の条件下では、非線形最小二乗法(NLS)に基づくパラメータ推定量が、収束率が$n^{-1/2}$で一貫性があり漸近正規性があることを証明します。漸近正規性特性に基づいて、未知のシステムパラメータの信頼セットを構築し、ODEシステムの因果構造を推測する、つまりシステム変数間に因果関係があるかどうかを推測する新しい方法を提案します。さらに、集約された観測や時間スケールされた観測を含む劣化した観測に結果を拡張します。私たちの知る限り、私たちの研究は線形ODEシステムの学習における識別可能性と漸近特性に関する初めての体系的な研究です。また、確立された理論的結果を示すために、さまざまなシステム次元でのシミュレーションを構築しています。

Robust Black-Box Optimization for Stochastic Search and Episodic Reinforcement Learning
確率的探索とエピソード的強化学習のためのロバストブラックボックス最適化

Black-box optimization is a versatile approach to solve complex problems where the objective function is not explicitly known and no higher order information is available. Due to its general nature, it finds widespread applications in function optimization as well as machine learning, especially episodic reinforcement learning tasks. While traditional black-box optimizers like CMA-ES may falter in noisy scenarios due to their reliance on ranking-based transformations, a promising alternative emerges in the form of the Model-based Relative Entropy Stochastic Search (MORE) algorithm. MORE can be derived from natural policy gradients and compatible function approximation and directly optimizes the expected fitness without resorting to rankings. However, in its original formulation, MORE often cannot achieve state of the art performance. In this paper, we improve MORE by decoupling the update of the search distribution’s mean and covariance and an improved entropy scheduling technique based on an evolution path resulting in faster convergence, and a simplified model learning approach in comparison to the original paper. We show that our algorithm performs comparable to state-of-the-art black-box optimizers on standard benchmark functions. Further, it clearly outperforms ranking-based methods and other policy-gradient based black-box algorithms as well as state of the art deep reinforcement learning algorithms when used for episodic reinforcement learning tasks.

ブラックボックス最適化は、目的関数が明示的にわかっておらず、高次の情報も利用できない複雑な問題を解決するための多目的なアプローチです。その汎用性により、関数最適化や機械学習、特にエピソード強化学習タスクに広く応用されています。CMA-ESなどの従来のブラックボックス最適化は、ランキングベースの変換に依存しているため、ノイズの多いシナリオではうまく機能しない可能性がありますが、モデルベースの相対エントロピー確率的探索(MORE)アルゴリズムという有望な代替手段が登場しています。MOREは、自然なポリシー勾配と互換性のある関数近似から導出でき、ランキングに頼ることなく期待される適合性を直接最適化します。ただし、元の定式化では、MOREは最先端のパフォーマンスを達成できないことがよくあります。この論文では、検索分布の平均と共分散の更新と、進化パスに基づく改良されたエントロピースケジューリング手法を切り離してMOREを改善し、収束を高速化し、元の論文と比較してモデル学習アプローチを簡素化します。私たちのアルゴリズムは、標準的なベンチマーク関数において最先端のブラックボックスオプティマイザーに匹敵するパフォーマンスを発揮することを示しています。さらに、エピソード強化学習タスクに使用した場合、ランキングベースの方法やその他のポリシー勾配ベースのブラックボックスアルゴリズム、最先端の深層強化学習アルゴリズムよりも明らかに優れたパフォーマンスを発揮します。

Kernel Thinning
カーネルの薄化

We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}_{\star}$ and $O(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. The maximum discrepancy in integration error is $O_d(n^{-1/2}\sqrt{\log n})$ in probability for compactly supported $\mathbb{P}$ and $O_d(n^{-\frac{1}{2}} (\log n)^{(d+1)/2}\sqrt{\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-1/4})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. Moreover, the same construction delivers near-optimal $L^\infty$ coresets in $O(n^2)$ time. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning, in dimensions $d=2$ through $100$.

私たちは、i.i.d.サンプリングや標準的なシンニングよりも効果的に分布$\mathbb{P}$を圧縮するための新しい手順であるカーネルシンニングを紹介します。適切な再生カーネル$\mathbf{k}_{\star}$と$O(n^2)$の時間が与えられると、カーネルシンニングは、$\mathbb{P}$の$n$点近似を、関連する再生カーネルヒルベルト空間全体での最悪ケースの積分誤差が同程度である$\sqrt{n}$点近似に圧縮します。積分誤差の最大の差は、コンパクトな台を持つ$\mathbb{P}$については確率的に$O_d(n^{-1/2}\sqrt{\log n})$であり、$\mathbb{R}^d$上の劣指数的な$\mathbb{P}$については$O_d(n^{-\frac{1}{2}} (\log n)^{(d+1)/2}\sqrt{\log\log n})$です。対照的に、$\mathbb{P}$からの同じサイズのi.i.d.サンプルは、$\Omega(n^{-1/4})$の積分誤差を被ります。私たちの劣指数的な場合の保証は、$[0,1]^d$上の一様な$\mathbb{P}$に対する古典的な準モンテカルロ誤差率に似ていますが、$\mathbb{R}^d$上の一般的な分布と広範囲の一般的なカーネルに適用されます。さらに、同じ構成により、ほぼ最適な$L^\infty$コアセットが$O(n^2)$時間で得られます。私たちは、この結果を使用して、ガウス、Matérn、およびBスプラインカーネルに対する明示的な非漸近的最大平均不一致(MMD)境界を導出し、次元$d=2$から$100$において、i.i.d.サンプリングおよび標準的なマルコフ連鎖モンテカルロのシンニングに対するカーネルシンニングの実用上の利点を示す2つの事例を紹介します。

Optimal Algorithms for Stochastic Bilevel Optimization under Relaxed Smoothness Conditions
緩和された滑らかさ条件下での確率的二段階最適化のための最適アルゴリズム

We consider stochastic bilevel optimization problems involving minimizing an upper-level ($\texttt{UL}$) function that is dependent on the arg-min of a strongly-convex lower-level ($\texttt{LL}$) function. Several algorithms utilize Neumann series to approximate certain matrix inverses involved in estimating the implicit gradient of the $\texttt{UL}$ function (hypergradient). The state-of-the-art StOchastic Bilevel Algorithm ($\texttt{SOBA}$) instead uses stochastic gradient descent steps to solve the linear system associated with the explicit matrix inversion. This modification enables $\texttt{SOBA}$ to obtain a sample complexity of $\mathcal{O}(1/\epsilon^{2})$ for finding an $\epsilon$-stationary point. Unfortunately, the current analysis of $\texttt{SOBA}$ relies on the assumption of higher-order smoothness for the $\texttt{UL}$ and $\texttt{LL}$ functions to achieve optimality. In this paper, we introduce a novel fully single-loop and Hessian-inversion-free algorithmic framework for stochastic bilevel optimization and present a tighter analysis under standard smoothness assumptions (first-order Lipschitzness of the $\texttt{UL}$ function and second-order Lipschitzness of the $\texttt{LL}$ function). Furthermore, we show that a slight modification of our algorithm can handle a more general multi-objective robust bilevel optimization problem. For this case, we obtain the state-of-the-art oracle complexity results demonstrating the generality of both the proposed algorithmic and analytic frameworks. Numerical experiments demonstrate the performance gain of the proposed algorithms over existing ones.

私たちは、強凸な下位レベル($\texttt{LL}$)関数のarg-minに依存する上位レベル($\texttt{UL}$)関数を最小化する確率的二段階最適化問題を考察します。いくつかのアルゴリズムは、$\texttt{UL}$関数の暗黙的な勾配(超勾配)の推定に現れる特定の逆行列を近似するためにノイマン級数を使用します。最先端のStOchastic Bilevel Algorithm ($\texttt{SOBA}$)は、その代わりに確率的勾配降下法のステップを使用して、明示的な逆行列計算に対応する線形システムを解きます。この変更により、$\texttt{SOBA}$は$\epsilon$定常点を見つけるために$\mathcal{O}(1/\epsilon^{2})$のサンプル複雑度を達成できます。残念ながら、現在の$\texttt{SOBA}$の解析は、最適性を達成するために$\texttt{UL}$関数と$\texttt{LL}$関数の高次の滑らかさの仮定に依存しています。この論文では、確率的二段階最適化のための、完全にシングルループでヘッセ行列の逆行列計算を必要としない新しいアルゴリズムフレームワークを紹介し、標準的な滑らかさの仮定($\texttt{UL}$関数の1次のリプシッツ性と$\texttt{LL}$関数の2次のリプシッツ性)の下でのより厳密な解析を示します。さらに、アルゴリズムを少し変更することで、より一般的な多目的ロバスト二段階最適化問題を扱えることを示します。この場合、提案されたアルゴリズムと解析フレームワークの両方の一般性を示す最先端のオラクル複雑性の結果が得られます。数値実験により、提案されたアルゴリズムが既存のアルゴリズムよりも高い性能を示すことが実証されています。
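
The single-loop idea mentioned above, replacing the explicit Hessian inverse with gradient steps on an auxiliary linear system, can be illustrated on a scalar toy problem. This is a sketch of SOBA-style updates as they are commonly written, not the algorithm proposed in the paper: the gradients are exact rather than stochastic, and the toy objectives, step sizes, and the name `soba_toy` are assumptions.

```python
def soba_toy(steps=2000, rho=0.1, gamma=0.05, reg=0.1):
    """Single-loop bilevel updates on an assumed scalar toy problem.

    Upper level: f(x, y) = 0.5 * (y - 1)^2 + 0.5 * reg * x^2
    Lower level: g(x, y) = 0.5 * (y - x)^2, so y*(x) = x.
    The auxiliary variable v tracks the solution of
    hess_yy(g) * v = -grad_y(f), which avoids inverting the Hessian.
    """
    x, y, v = 0.0, 0.0, 0.0
    for _ in range(steps):
        d_y = y - x                # grad_y g
        d_v = 1.0 * v + (y - 1.0)  # hess_yy g * v + grad_y f
        d_x = -1.0 * v + reg * x   # hess_xy g * v + grad_x f (hypergradient estimate)
        # All three variables move simultaneously, using the old iterates.
        x, y, v = x - gamma * d_x, y - rho * d_y, v - rho * d_v
    return x, y, v


if __name__ == "__main__":
    x, y, v = soba_toy()
    # Minimizer of h(x) = 0.5 * (x - 1)^2 + 0.05 * x^2 is x = 1 / 1.1.
    print(x, y, v)
```

On this problem the hypergradient is $h'(x) = (x - 1) + 0.1x$, so the iterates should approach $x = y = 1/1.1$ and $v = 1 - x$. In the stochastic setting each of the three directions would be replaced by an unbiased sample-based estimate.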

Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks
二部ネットワークのための次数補正潜在ブロックモデルの変分推定量

Bipartite graphs are ubiquitous across various scientific and engineering fields. Simultaneously grouping the two types of nodes in a bipartite graph via biclustering represents a fundamental challenge in network analysis for such graphs. The latent block model (LBM) is a commonly used model-based tool for biclustering. However, the effectiveness of the LBM is often limited by the influence of row and column sums in the data matrix. To address this limitation, we introduce the degree-corrected latent block model (DC-LBM), which accounts for the varying degrees in row and column clusters, significantly enhancing performance on real-world data sets and simulated data. We develop an efficient variational expectation-maximization algorithm by creating closed-form solutions for parameter estimates in the M steps. Furthermore, we prove the label consistency and the rate of convergence of the variational estimator under the DC-LBM, allowing the expected graph density to approach zero as long as the average expected degrees of rows and columns approach infinity when the size of the graph increases.

二部グラフは、さまざまな科学および工学の分野で広く使用されています。二部グラフ内の2種類のノードをバイクラスタリングによって同時にグループ化することは、このようなグラフのネットワーク分析における基本的な課題です。潜在ブロックモデル(LBM)は、バイクラスタリングによく使用されるモデルベースのツールです。ただし、LBMの有効性は、データマトリックス内の行と列の合計の影響によって制限されることがよくあります。この制限に対処するために、次数補正潜在ブロックモデル(DC-LBM)を導入します。これは、行と列のクラスターのさまざまな次数を考慮し、実際のデータセットとシミュレーションデータのパフォーマンスを大幅に向上させます。Mステップでパラメーター推定の閉じた形式のソリューションを作成することにより、効率的な変分期待値最大化アルゴリズムを開発します。さらに、DC-LBMでの変分推定量のラベルの一貫性と収束率を証明し、グラフのサイズが大きくなると、行と列の平均期待次数が無限大に近づく限り、期待グラフ密度がゼロに近づくことを可能にします。

Statistical Inference for Fairness Auditing
公正性監査のための統計的推論

Before deploying a black-box model in high-stakes problems, it is important to evaluate the model’s performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates or certify that no such groups exist. In this paper, we frame this task, often referred to as “fairness auditing,” in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich—even infinite—collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance over certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.

ブラックボックスモデルをハイステークスの問題に展開する前に、敏感なサブポピュレーションに対するモデルのパフォーマンスを評価することが重要です。たとえば、再犯予測タスクでは、予測モデルの誤検出率が許容できないほど高い人口統計グループを特定したり、そのようなグループが存在しないことを証明したりする必要がある場合があります。この論文では、このタスク(「公平性監査」と呼ばれることが多い)を多重仮説検定の観点から説明します。ブートストラップを使用して、統計的保証のあるグループの集合全体でパフォーマンスの格差を同時に制限する方法を示します。私たちの方法を使用して、モデルのパフォーマンス不足の影響を受けるサブポピュレーションにフラグを付け、モデルが適切に機能するサブポピュレーションを証明できます。重要なのは、私たちの監査はモデルに依存せず、ほぼすべてのパフォーマンスメトリックまたはグループ公平性基準に適用できることです。私たちの方法は、非常に豊富な(無限の)サブポピュレーションの集合にも対応します。さらに、特定の分布シフトでパフォーマンスを評価する方法を示すことで、サブポピュレーションを超えて一般化します。予測推論とアルゴリズムの公平性に関するベンチマークデータセットで提案された方法をテストし、監査によって解釈可能で信頼できる保証を提供できることがわかりました。
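The idea of bounding disparities over several groups at once with the bootstrap can be sketched as follows. This is a simplified max-deviation bootstrap written for illustration, not the procedure from the paper: the helper names (`false_positive_rate`, `disparities`, `simultaneous_bounds`), the synthetic data, and the choice of "group false positive rate minus overall false positive rate" as the disparity are all assumptions.

```python
import random

def false_positive_rate(rows):
    """FPR among rows with true label 0; each row is (group, y_true, y_pred)."""
    negatives = [r for r in rows if r[1] == 0]
    if not negatives:
        return 0.0
    return sum(r[2] for r in negatives) / len(negatives)

def disparities(rows, groups):
    """Group FPR minus overall FPR, for every group."""
    overall = false_positive_rate(rows)
    return {g: false_positive_rate([r for r in rows if r[0] == g]) - overall
            for g in groups}

def simultaneous_bounds(rows, groups, num_boot=200, alpha=0.1, seed=0):
    """Intervals that share one critical value taken from the bootstrap
    distribution of the largest deviation across groups."""
    rng = random.Random(seed)
    estimate = disparities(rows, groups)
    max_devs = []
    for _ in range(num_boot):
        sample = [rng.choice(rows) for _ in rows]   # resample with replacement
        boot = disparities(sample, groups)
        max_devs.append(max(abs(boot[g] - estimate[g]) for g in groups))
    max_devs.sort()
    critical = max_devs[min(num_boot - 1, int((1 - alpha) * num_boot))]
    return {g: (estimate[g] - critical, estimate[g] + critical) for g in groups}

# Synthetic negatives only (y_true = 0), with assumed FPRs of 0.1 and 0.4.
data_rng = random.Random(1)
rows = []
for group, fpr in (("A", 0.1), ("B", 0.4)):
    for _ in range(200):
        rows.append((group, 0, 1 if data_rng.random() < fpr else 0))

bounds = simultaneous_bounds(rows, ["A", "B"])
print(bounds)
```

Using a single critical value for all groups is what makes the intervals simultaneous in this sketch. A group whose whole interval lies above zero would be flagged, and one whose interval stays below a tolerance could be certified.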

Adjusted Wasserstein Distributionally Robust Estimator in Statistical Learning
統計学習における調整済みワッサースタイン分布ロバスト推定量

We propose an adjusted Wasserstein distributionally robust estimator—based on a nonlinear transformation of the Wasserstein distributionally robust (WDRO) estimator in statistical learning. The classic WDRO estimator is asymptotically biased, while our adjusted WDRO estimator is asymptotically unbiased, resulting in a smaller asymptotic mean squared error. Further, under certain conditions, our proposed adjustment technique provides a general principle to de-bias asymptotically biased estimators. Specifically, we will investigate how the adjusted WDRO estimator is developed in the generalized linear model, including logistic regression, linear regression, and Poisson regression. Numerical experiments demonstrate the favorable practical performance of the adjusted estimator over the classic one.

私たちは、統計的学習におけるワッサースタイン分布ロバスト(WDRO)推定量の非線形変換に基づく、調整済みワッサースタイン分布ロバスト推定量を提案します。従来のWDRO推定量は漸近的に偏っていますが、調整済みWDRO推定量は漸近的に不偏であるため、漸近平均二乗誤差が小さくなります。さらに、特定の条件下では、提案された調整手法は、漸近的に偏った推定量の偏りを除去するための一般原則を提供します。具体的には、ロジスティック回帰、線形回帰、ポアソン回帰などの一般化線形モデルで調整済みWDRO推定量がどのように導出されるかを調査します。数値実験は、調整済み推定量が従来の推定量よりも良好な実用的性能を持つことを示しています。

DoWhy-GCM: An Extension of DoWhy for Causal Inference in Graphical Causal Models
DoWhy-GCM:グラフィカル因果モデルにおける因果推論のためのDoWhyの拡張

We present DoWhy-GCM, an extension of the DoWhy Python library that leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation, DoWhy-GCM addresses diverse causal queries, such as identifying the root causes of outliers and distributional changes, attributing causal influences to the data generating process of each node, or diagnosing causal structures. With DoWhy-GCM, users typically specify cause-effect relations via a causal graph, fit causal mechanisms, and pose causal queries, all with just a few lines of code. The general documentation is available at https://www.pywhy.org/dowhy and the DoWhy-GCM specific code at https://github.com/py-why/dowhy/tree/main/dowhy/gcm.

私たちは、DoWhy Pythonライブラリの拡張機能であり、グラフィカル因果モデルを活用するDoWhy-GCMを紹介します。DoWhy-GCMは、効果推定を主眼とする既存の因果推論ライブラリとは異なり、外れ値や分布変化の根本原因の特定、各ノードのデータ生成プロセスへの因果的影響の帰属、因果構造の診断など、多様な因果クエリに対応します。DoWhy-GCMでは、通常、ユーザーは因果グラフを介して因果関係を指定し、因果メカニズムを適合させ、因果クエリを提起します。これらすべてをわずか数行のコードで行うことができます。一般的なドキュメントはhttps://www.pywhy.org/dowhy で、DoWhy-GCM固有のコードはhttps://github.com/py-why/dowhy/tree/main/dowhy/gcm で入手できます。

Flexible Bayesian Product Mixture Models for Vector Autoregressions
ベクトル自己回帰のための柔軟なベイジアン積混合モデル

Bayesian non-parametric methods based on Dirichlet process mixtures have seen tremendous success in various domains and are appealing in being able to borrow information by clustering samples that share identical parameters. However, such methods can face hurdles in heterogeneous settings where objects are expected to cluster only along a subset of axes or where clusters of samples share only a subset of identical parameters. We overcome such limitations by developing a novel class of product of Dirichlet process location-scale mixtures that enables independent clustering at multiple scales, which results in varying levels of information sharing across samples. First, we develop the approach for independent multivariate data. Subsequently, we generalize it to multivariate time-series data under the framework of multi-subject Vector Autoregressive (VAR) models, which is our primary focus and which goes beyond parametric single-subject VAR models. We establish posterior consistency and develop efficient posterior computation for implementation. Extensive numerical studies involving VAR models show distinct advantages over competing methods in terms of estimation, clustering, and feature selection accuracy. Our resting-state fMRI analysis from the Human Connectome Project reveals biologically interpretable connectivity differences between distinct intelligence groups, while another air pollution application illustrates the superior forecasting accuracy compared to alternative methods.

ディリクレ過程混合に基づくベイジアンノンパラメトリック法は、さまざまな分野で大きな成功を収めており、同じパラメータを共有するサンプルをクラスタリングすることで情報を借用できるという点で魅力的です。ただし、このような方法は、オブジェクトが軸のサブセットに沿ってのみクラスタリングされると予想される場合や、サンプルのクラスタが同一のパラメータのサブセットのみを共有する場合など、異種環境では障害に直面する可能性があります。私たちは、複数のスケールで独立したクラスタリングを可能にし、サンプル間でさまざまなレベルの情報共有をもたらす、ディリクレ過程の位置スケール混合の積の新しいクラスを開発することで、このような制限を克服します。まず、独立した多変量データに対するアプローチを開発します。次に、パラメトリックな単一被験者VARモデルを超える、私たちの主な焦点である多被験者ベクトル自己回帰(VAR)モデルのフレームワークの下で、それを多変量時系列データに一般化します。私たちは事後一貫性を確立し、実装のために効率的な事後計算を開発します。VARモデルに関する広範な数値研究では、推定、クラスタリング、および特徴選択の精度に関して、競合する手法よりも明確な利点が示されています。ヒューマンコネクトームプロジェクトによる安静時のfMRI分析では、異なる知能グループ間の生物学的に解釈可能な接続性の違いが明らかになり、また別の大気汚染アプリケーションでは、他の方法と比較して優れた予測精度が実証されました。

A Variational Approach to Bayesian Phylogenetic Inference
ベイズ系統推定への変分アプローチ

Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive graphical model for tree topology distributions, and a structured amortization of the branch lengths over tree topologies for a suitable variational family of distributions. We train the variational approximation via stochastic gradient ascent and adopt gradient estimators for continuous and discrete variational parameters separately to deal with the composite latent space of phylogenetic models. We show that our variational approach provides competitive performance to MCMC, while requiring much fewer (though more costly) iterations due to a more efficient exploration mechanism enabled by variational inference. Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.

ベイズ系統分類推論は現在、単純な提案メカニズムによるマルコフ連鎖モンテカルロ法で行われています。これは探索効率を妨げ、正確な事後推定値を得るためにはしばしば長時間の実行を必要とします。この論文では、ベイズ系統分類解析のための変分フレームワークという代替アプローチを提示します。私たちは、サブスプリットベイズネットワーク、ツリートポロジ分布の表現力豊かなグラフィカルモデル、ツリートポロジ上の枝の長さの構造化された償却を組み合わせて、適切な変分分布ファミリーを作成することを提案します。私たちは、変分近似を確率的勾配上昇法でトレーニングし、連続および離散変分パラメータの勾配推定量を個別に採用して、系統分類モデルの複合潜在空間を処理します。変分アプローチは、変分推論によって可能になるより効率的な探索メカニズムにより、必要な反復回数がはるかに少ない（ただしコストは高い）一方で、MCMCと競合するパフォーマンスを提供することを示します。実際のデータを用いたベイズ系統推論問題のベンチマーク実験により、当社の方法の有効性と効率性が実証されています。

Fat-Shattering Dimension of k-fold Aggregations
k重集約のファットシャッタリング次元

We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing k functions, one from each of the k classes, and computing pointwise an “aggregate” function of these, such as the median, mean, and maximum. The bounds are stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on k. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.

私たちは、実数値関数クラスの集約ルールのファットシャッタリング次元の推定値を提供します。集約ルールは、k個のクラスのそれぞれから1つずつ、計k個の関数を選択し、それらの中央値、平均値、最大値などの「集約」関数を点ごとに計算するすべての方法で構成されます。上界は、各構成クラスのファットシャッタリング次元を用いて示されます。線形関数クラスとアフィン関数クラスについては、大幅に鋭い上界とそれに一致する下界を提供し、特にkへの最適な依存性を達成します。その過程で、文献内の多くの誤った主張を指摘し修正することに加えて、いくつかの既知の結果を改善します。

Unified Binary and Multiclass Margin-Based Classification
統一されたバイナリおよびマルチクラスマージンベースの分類

The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated.

マージン損失の概念は、二値分類のアルゴリズムの開発と分析の中心となってきました。しかし、今日まで、マルチクラス分類におけるマージン損失に相当するものについてはコンセンサスが得られていません。この研究では、多くの一般的なものを含む広範なマルチクラス損失関数を、二値損失のマージン形式の一般化である相対マージン形式で表現できることを示します。相対マージン形式は、私たちの以前の研究(Wang and Scott, 2020, 2021)で示されたように、マルチクラス損失の理解と分析に広く役立ちます。このマルチクラス損失の表現方法の有用性をさらに実証するために、これを使用して、二値マージン損失の分類較正に関するBartlettら(2006)の重要な結果をマルチクラスに拡張します。次に、Fenchel-Young損失のクラスを分析し、分類較正されていることが知られている損失の集合を拡張します。

Neural Feature Learning in Function Space
関数空間におけるニューラル特徴学習

We present a novel framework for learning system design with neural feature extractors. First, we introduce the feature geometry, which unifies statistical dependence and feature representations in a function space equipped with inner products. This connection defines function-space concepts on statistical dependence, such as norms, orthogonal projection, and spectral decomposition, exhibiting clear operational meanings. In particular, we associate each learning setting with a dependence component and formulate learning tasks as finding corresponding feature approximations. We propose a nesting technique, which provides systematic algorithm designs for learning the optimal features from data samples with off-the-shelf network architectures and optimizers. We further demonstrate multivariate learning applications, including conditional inference and multimodal learning, where we present the optimal features and reveal their connections to classical approaches.

私たちは、ニューラル特徴抽出器を用いたシステム設計の学習のための新しいフレームワークを紹介します。まず、内積を備えた関数空間において統計的依存性と特徴表現を統一する特徴ジオメトリを紹介します。この接続は、ノルム、直交射影、スペクトル分解などの統計的依存性に関する関数空間の概念を定義し、明確な操作上の意味を示します。特に、各学習設定を依存関係コンポーネントに関連付け、対応する特徴近似を見つけるように学習課題を定式化します。私たちは、既製のネットワークアーキテクチャとオプティマイザを使用して、データサンプルから最適な特徴を学習するための体系的なアルゴリズム設計を提供するネスティング技術を提案します。さらに、条件付き推論やマルチモーダル学習などの多変量学習アプリケーションを示し、最適な特徴を提示し、古典的なアプローチとの関連性を明らかにします。

PyGOD: A Python Library for Graph Outlier Detection
PyGOD: グラフ外れ値検出のための Python ライブラリ

PyGOD is an open-source Python library for detecting outliers in graph data. As the first comprehensive library of its kind, PyGOD supports a wide array of leading graph-based methods for outlier detection under an easy-to-use, well-documented API designed for use by both researchers and practitioners. PyGOD provides modularized components of the different detectors implemented so that users can easily customize each detector for their purposes. To ease the construction of detection workflows, PyGOD offers numerous commonly used utility functions. To scale computation to large graphs, PyGOD supports functionalities for deep models such as sampling and mini-batch processing. PyGOD uses best practices in fostering code reliability and maintainability, including unit testing, continuous integration, and code coverage. To facilitate accessibility, PyGOD is released under a BSD 2-Clause license at https://pygod.org and at the Python Package Index (PyPI).

PyGODは、グラフデータの外れ値を検出するためのオープンソースのPythonライブラリです。PyGODは、この種の最初の包括的なライブラリとして、研究者と実務家の両方が使用できるように設計された、使いやすく、十分に文書化されたAPIの下で、外れ値検出のための主要なグラフベースの方法を幅広くサポートしています。PyGODは、実装されたさまざまな検出器のモジュール化されたコンポーネントを提供し、ユーザーが各検出器を自分の目的に合わせて簡単にカスタマイズできるようにします。検出ワークフローの構築を容易にするために、PyGODは一般的に使用される多数のユーティリティ関数を提供しています。計算を大きなグラフにスケーリングするために、PyGODはサンプリングやミニバッチ処理などのディープモデルの機能をサポートしています。PyGODは、単体テスト、継続的インテグレーション、コードカバレッジなど、コードの信頼性と保守性を促進するためのベストプラクティスを使用します。アクセシビリティを容易にするために、PyGODはBSD 2-Clauseライセンスの下でhttps://pygod.orgとPython Package Index (PyPI)でリリースされています。

Blessings and Curses of Covariate Shifts: Adversarial Learning Dynamics, Directional Convergence, and Equilibria
共変量シフトの祝福と呪い:敵対的学習ダイナミクス、方向性収束、および均衡

Covariate distribution shifts and adversarial perturbations present robustness challenges to the conventional statistical learning framework: mild shifts in the test covariate distribution can significantly affect the performance of the statistical model learned based on the training distribution. The model performance typically deteriorates when extrapolation happens: namely, covariates shift to a region where the training distribution is scarce, and naturally, the learned model has little information. For robustness and regularization considerations, adversarial perturbation techniques are proposed as a remedy; however, careful study needs to be carried out about what extrapolation region adversarial covariate shift will focus on, given a learned model. This paper precisely characterizes the extrapolation region, examining both regression and classification in an infinite-dimensional setting. We study the implications of adversarial covariate shifts to subsequent learning of the equilibrium—the Bayes optimal model—in a sequential game framework. We exploit the dynamics of the adversarial learning game and reveal the curious effects of the covariate shift to equilibrium learning and experimental design. In particular, we establish two directional convergence results that exhibit distinctive phenomena: (1) a blessing in regression, the adversarial covariate shifts in an exponential rate to an optimal experimental design for rapid subsequent learning; (2) a curse in classification, the adversarial covariate shifts in a subquadratic rate to the hardest experimental design trapping subsequent learning.

共変量分布シフトと敵対的摂動は、従来の統計学習フレームワークの堅牢性に課題をもたらします。テスト共変量分布のわずかなシフトが、トレーニング分布に基づいて学習された統計モデルのパフォーマンスに重大な影響を与える可能性があります。モデルのパフォーマンスは、通常、外挿が発生すると低下します。つまり、共変量はトレーニング分布が乏しい領域にシフトし、当然、学習されたモデルには情報がほとんどありません。堅牢性と正則化の考慮事項については、敵対的摂動手法が解決策として提案されていますが、学習されたモデルが与えられた場合、敵対的共変量シフトがどの外挿領域に焦点を当てるかについて、慎重な研究を行う必要があります。この論文では、無限次元設定での回帰と分類の両方を調べて、外挿領域を正確に特徴付けます。シーケンシャルゲームフレームワークでの均衡、つまりベイズ最適モデルのその後の学習に対する敵対的共変量シフトの影響を調査します。我々は敵対的学習ゲームのダイナミクスを利用し、共変量シフトが均衡学習と実験設計に及ぼす興味深い効果を明らかにします。特に、特徴的な現象を示す2つの方向性収束結果を確立します。(1)回帰における祝福として、敵対的共変量は指数関数的な速度で、その後の学習を迅速に行うための最適な実験設計へとシフトします。(2)分類における呪いとして、敵対的共変量は準二次の速度で、その後の学習を妨げる最も困難な実験設計へとシフトします。

Fixed points of nonnegative neural networks
非負のニューラルネットワークの固定点

We use fixed point theory to analyze nonnegative neural networks, which we define as neural networks that map nonnegative vectors to nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable mappings within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks having inputs and outputs of the same dimension, and these conditions are weaker than those recently obtained using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general nonnegative neural networks. From a practical perspective, our results contribute to the understanding of the behavior of autoencoders, and we also offer valuable mathematical machinery for future developments in deep equilibrium models.

私たちは、非負値ニューラルネットワークを解析するために不動点理論を使用します。非負値ニューラルネットワークとは、非負値ベクトルを非負値ベクトルにマッピングするニューラルネットワークと定義します。まず、非負値の重みとバイアスを持つ非負値ニューラルネットワークは、非線形ペロン・フロベニウス理論の枠組みの中で、単調かつ（弱く）スケーラブルなマッピングとして認識できることを示す。この事実により、同じ次元の入力と出力を持つ非負値ニューラルネットワークの不動点の存在条件を与えることができ、これらの条件は、最近凸解析の議論を使用して得られた条件よりも弱い。さらに、非負値の重みとバイアスを持つ非負値ニューラルネットワークの不動点集合の形状は区間であり、穏やかな条件下では点に退化することを証明します。これらの結果は、より一般的な非負値ニューラルネットワークの不動点の存在を取得するために使用されます。実用的な観点から見ると、私たちの研究結果はオートエンコーダの動作の理解に貢献し、また深層平衡モデルの将来の発展のための貴重な数学的仕組みも提供します。
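A fixed point of a nonnegative network can be found numerically by repeatedly applying the network to its own output. The sketch below is a toy example under assumptions that are not taken from the paper: a single ReLU layer, hand-picked nonnegative weights `W` and biases `b`, and row sums of `W` below 1 so that the iteration is a contraction. The names `layer` and `fixed_point` are hypothetical.

```python
def layer(x, W, b):
    """One layer x -> relu(W x + b); W and b are assumed to be nonnegative."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def fixed_point(W, b, num_iters=200):
    """Iterate the layer starting from the zero vector."""
    x = [0.0] * len(b)
    for _ in range(num_iters):
        x = layer(x, W, b)
    return x

# Toy parameters (assumed): nonnegative entries, every row of W sums to less than 1.
W = [[0.2, 0.1], [0.3, 0.4]]
b = [1.0, 0.5]
x_star = fixed_point(W, b)
print(x_star)  # approximately [13/9, 14/9]
```

Because the weights and biases are nonnegative, the map is monotone, and the iterates started at zero increase toward the fixed point. For this toy layer the limit solves the linear system x = W x + b.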

Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks
ノルム制約、過剰パラメータ化、2層ニューラルネットワークによる学習

Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks, as the curse of dimensionality (CoD) cannot be evaded when trying to approximate even a single ReLU neuron. In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms (e.g., the path norm, the Barron norm) from the perspective of sample complexity and generalization properties. First, we show that the path norm (as well as the Barron norm) is able to obtain width-independent sample complexity bounds, which allows for uniform convergence guarantees. Based on this result, we derive an improved metric entropy result for $\epsilon$-covering of up to $O(\epsilon^{-\frac{2d}{d+2}})$ ($d$ is the input dimension and the depending constant is at most of polynomial order in $d$) via the convex hull technique, which demonstrates the separation from kernel methods, which require $\Omega(\epsilon^{-d})$ to learn the target function in a Barron space. Second, this metric entropy result allows for building a sharper generalization bound under a general moment hypothesis setting, achieving a rate of $O(n^{-\frac{d+2}{2d+2}})$. Our analysis is novel in that it offers a sharper and refined estimation for metric entropy (with a clear dependence relationship on the dimension $d$) and unbounded sampling in the estimation of the sample error and the output error.

最近の研究では、再生カーネルヒルベルト空間(RKHS)は、単一のReLUニューロンを近似しようとするときにも次元の呪い(CoD)を回避できないため、ニューラルネットワークによる関数のモデル化に適した空間ではないことが示されています。この論文では、サンプル複雑度と一般化特性の観点から、制限付きノルム(パスノルム、バロンノルムなど)を持つオーバーパラメータ化された2層ニューラルネットワークに適した関数空間について検討します。まず、パスノルム(およびバロンノルム)は幅に依存しないサンプル複雑度の境界を取得でき、均一な収束保証を可能にすることを示します。この結果に基づいて、凸包技術を介して最大$O(\epsilon^{-\frac{2d}{d+2}})$ ($d$は入力次元で、従属定数は最大で$d$の多項式次数)までの$\epsilon$被覆の計量エントロピーの改善された結果を導出します。これは、バロン空間でターゲット関数を学習するための$\Omega(\epsilon^{-d})$を使用したカーネル法との分離を示しています。第2に、この計量エントロピーの結果により、一般モーメント仮説設定の下でより鋭い一般化境界を構築することができ、$O(n^{-\frac{d+2}{2d+2}})$で速度を達成できます。私たちの分析は、計量エントロピーのより鋭く洗練された推定(次元$d$への明確な依存関係を持つ)と、サンプル誤差と出力誤差の推定における無制限のサンプリングを提供するという点で斬新です。

A Survey on Multi-player Bandits
マルチプレイバンディットに関する調査

Due mostly to their application to cognitive radio networks, multiplayer bandits have gained a lot of interest in the last decade. Considerable progress has been made on the theoretical side. However, the current algorithms are far from applicable, and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandit algorithms in real communication networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that a further study of these different directions might lead to theoretical algorithms adapted to real-world situations.

主にコグニティブ無線ネットワークへの応用により、マルチプレイヤーバンディットは過去10年間で多くの関心を集めました。その理論的側面では、かなりの進歩が見られました。しかし、現在のアルゴリズムは適用可能とはほど遠いものであり、これらの理論的な結果と実際の通信ネットワークでのマルチプレイヤーバンディットアルゴリズムの実装の可能性との間には多くの障害が残っています。この調査は、豊富なマルチプレイヤーバンディットの文献を文脈化し、整理します。既存の研究に照らして、将来の研究のための明確な方向性がいくつか現れます。これらの異なる方向性をさらに研究することで、現実世界の状況に適応した理論的なアルゴリズムにつながる可能性があると考えています。

Transport-based Counterfactual Models
トランスポートベースの反事実モデル

Counterfactual frameworks have grown popular in machine learning both for explaining algorithmic decisions and for defining individual notions of fairness, which are more intuitive than typical group fairness conditions. However, state-of-the-art models to compute counterfactuals are either unrealistic or infeasible. In particular, while Pearl’s causal inference provides appealing rules to calculate counterfactuals, it relies on a model that is unknown and hard to discover in practice. We address the problem of designing realistic and feasible counterfactuals in the absence of a causal model. We define transport-based counterfactual models as collections of joint probability distributions between observable distributions, and show their connection to causal counterfactuals. More specifically, we argue that optimal-transport theory defines relevant transport-based counterfactual models, as they are numerically feasible, statistically faithful, and can coincide under some assumptions with causal counterfactual models. Finally, these models make counterfactual approaches to fairness feasible, and we illustrate their practicality and efficiency on fair learning. With this paper, we aim to lay out the theoretical foundations for a new, implementable approach to counterfactual thinking.

反事実的フレームワークは、アルゴリズムの決定を説明するだけでなく、一般的なグループ公平性条件よりも直感的な個々の公平性の概念を定義するために、機械学習で人気が高まっています。ただし、反事実を計算するための最先端のモデルは非現実的または実行不可能です。特に、パールの因果推論は反事実を計算するための魅力的なルールを提供しますが、実際には未知で発見が難しいモデルに依存しています。因果モデルがない場合に現実的で実行可能な反事実を設計する問題に取り組みます。輸送ベースの反事実モデルを、観測可能な分布間の結合確率分布のコレクションとして定義し、因果反事実との関連を示します。より具体的には、最適輸送理論が、数値的に実行可能で、統計的に忠実であり、いくつかの仮定の下で因果反事実モデルと一致できるため、関連する輸送ベースの反事実モデルを定義すると主張します。最後に、これらのモデルにより公平性に対する反事実的アプローチが実現可能となり、公平な学習におけるその実用性と効率性を示します。この論文では、反事実的思考に対する新しい実装可能なアプローチの理論的基礎を示すことを目指しています。

Adaptive Latent Feature Sharing for Piecewise Linear Dimensionality Reduction
区分的線形次元削減のための適応潜在特徴共有

Linear Gaussian exploratory tools such as principal component analysis (PCA) and factor analysis (FA) are widely used for exploratory analysis, pre-processing, data visualization, and related tasks. Because the linear-Gaussian assumption is restrictive, for very high dimensional problems, they have been replaced by robust, sparse extensions or more flexible discrete-continuous latent feature models. Discrete-continuous latent feature models specify a dictionary of features dependent on subsets of the data and then infer the likelihood that each data point shares any of these features. This is often achieved using rich-get-richer assumptions about the feature allocation process where the dictionary tries to couple the feature frequency with the portion of total variance that it explains. In this work, we propose an alternative approach that allows for better control over the feature to data point allocation. This new approach is based on two-parameter discrete distribution models which decouple feature sparsity and dictionary size, hence capturing both common and rare features in a parsimonious way. The new framework is used to derive a novel adaptive variant of factor analysis (aFA), as well as an adaptive probabilistic principal component analysis (aPPCA) capable of flexible structure discovery and dimensionality reduction in a wide variety of scenarios. We derive both a standard Gibbs sampler and efficient expectation-maximisation inference approximations that converge orders of magnitude faster to a reasonable point estimate solution. The utility of the proposed aPPCA and aFA models is demonstrated on standard tasks such as feature learning, data visualization, and data whitening. We show that aPPCA and aFA can extract interpretable, high-level features for raw MNIST or COIL-20 images, or when applied to the analysis of autoencoder features. We also demonstrate that replacing common PCA pre-processing pipelines in the analysis of functional magnetic resonance imaging (fMRI) data with aPPCA leads to more robust and better-localised blind source separation of neural activity.

主成分分析(PCA)や因子分析(FA)などの線形ガウス探索ツールは、探索的分析、前処理、データ視覚化、および関連タスクに広く使用されています。線形ガウス仮定は制限的であるため、非常に高次元の問題では、堅牢でスパースな拡張またはより柔軟な離散連続潜在特徴モデルに置き換えられています。離散連続潜在特徴モデルは、データのサブセットに依存する特徴の辞書を指定し、各データポイントがこれらの特徴のいずれかを共有する可能性を推測します。これは、辞書が特徴の頻度とそれが説明する総分散の部分を結合しようとする特徴割り当てプロセスに関するrich-get-richer仮定を使用して実現されることがよくあります。この研究では、特徴からデータポイントへの割り当てをより適切に制御できる代替アプローチを提案します。この新しいアプローチは、特徴のスパース性と辞書のサイズを切り離す2パラメーターの離散分布モデルに基づいており、一般的な特徴とまれな特徴の両方を簡潔な方法でキャプチャします。新しいフレームワークは、さまざまなシナリオで柔軟な構造発見と次元削減が可能な、新しい適応型因子分析(aFA)と適応型確率主成分分析(aPPCA)を導出するために使用されます。標準的なギブスサンプリングと、桁違いに高速に収束して妥当な点推定ソリューションに至る効率的な期待最大化推論近似の両方を導出します。提案されたaPPCAおよびaFAモデルの有用性は、特徴学習、データ視覚化、データホワイトニングなどの標準的なタスクで実証されています。aPPCAおよびaFAは、生のMNISTまたはCOIL-20画像、またはオートエンコーダー特徴の分析に適用した場合に、解釈可能な高レベルの特徴を抽出できることを示しています。また、機能的磁気共鳴画像(fMRI)データの分析で一般的なPCA前処理パイプラインをaPPCAに置き換えると、神経活動のブラインドソース分離がより堅牢で適切にローカライズされることも示しています。

Topological Node2vec: Enhanced Graph Embedding via Persistent Homology
トポロジカルNode2vec:パーシステントホモロジーによるグラフ埋め込みの強化

Node2vec is a graph embedding method that learns a vector representation for each node of a weighted graph while seeking to preserve relative proximity and global structure. Numerical experiments suggest Node2vec struggles to recreate the topology of the input graph. To resolve this we introduce a topological loss term to be added to the training loss of Node2vec which tries to align the persistence diagram (PD) of the resulting embedding as closely as possible to that of the input graph. Following results in computational optimal transport, we carefully adapt entropic regularization to PD metrics, allowing us to measure the discrepancy between PDs in a differentiable way. Our modified loss function can then be minimized through gradient descent to reconstruct both the geometry and the topology of the input graph. We showcase the benefits of this approach using demonstrative synthetic examples.

Node2vecは、相対的な近接性とグローバル構造を保持しながら、重み付けグラフの各ノードのベクトル表現を学習するグラフ埋め込み方法です。数値実験では、Node2vecが入力グラフのトポロジーを再現するのに苦労していることが示唆されています。これを解決するために、Node2vecの学習損失に追加するトポロジカル損失項を導入し、結果として得られる埋め込みの永続性図(PD)を入力グラフの永続性図(PD)にできるだけ近づけようとします。計算上の最適輸送の結果に続いて、エントロピー正則化をPDメトリックに慎重に適応させ、PD間の不一致を微分可能な方法で測定できるようにします。その後、勾配降下法によって修正損失関数を最小化し、入力グラフのジオメトリとトポロジーの両方を再構築できます。このアプローチの利点を、実証的な合成例を使用して紹介します。

Granger Causal Inference in Multivariate Hawkes Processes by Minimum Message Length
最小メッセージ長による多変量ホークス過程のグレンジャー因果推論

Multivariate Hawkes processes (MHPs) are versatile probabilistic tools used to model various real-life phenomena: earthquakes, operations on stock markets, neuronal activity, virus propagation and many others. In this paper, we focus on MHPs with exponential decay kernels and estimate connectivity graphs, which represent the Granger causal relations between their components. We approach this inference problem by proposing an optimization criterion and model selection algorithm based on the minimum message length (MML) principle. MML compares Granger causal models using Occam’s razor in the following way: even when models have a comparable goodness-of-fit to the observed data, the one generating the most concise explanation of the data is preferred. While most state-of-the-art methods using lasso-type penalization tend to overfit in scenarios with short time horizons, the proposed MML-based method achieves high F1 scores in these settings. We conduct a numerical study comparing the proposed algorithm to other related classical and state-of-the-art methods, where we achieve the highest F1 scores in specific sparse graph settings. We also illustrate the proposed method on G7 sovereign bond data and obtain causal connections that are in agreement with the expert knowledge available in the literature.

多変量ホークス過程(MHP)は、地震、株式市場での操作、神経活動、ウイルスの伝播など、さまざまな現実の現象をモデル化するために使用される多目的な確率ツールです。この論文では、指数関数的減衰カーネルを持つMHPに焦点を当て、コンポーネント間のGranger因果関係を表す接続グラフを推定します。この推論問題にアプローチするために、最小メッセージ長(MML)原理に基づく最適化基準とモデル選択アルゴリズムを提案します。MMLは、オッカムの剃刀原理を使用してGranger因果モデルを次のように比較します。つまり、モデルが観測データに匹敵する適合度を持つ場合でも、データの最も簡潔な説明を生成するモデルが優先されます。Lassoタイプのペナルティを使用する最先端の方法のほとんどは、短い時間範囲のシナリオで過剰適合する傾向がありますが、提案されたMMLベースの方法は、これらの設定で高いF1スコアを達成します。提案されたアルゴリズムを他の関連する古典的および最先端の方法と比較する数値研究を実施し、特定のスパースグラフ設定で最高のF1スコアを達成しました。提案された方法をG7ソブリン債データでも説明し、文献で入手可能な専門知識と一致する因果関係を取得しました。
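To make the objects in this abstract concrete, the sketch below evaluates the conditional intensity of a multivariate Hawkes process with exponential decay kernels and reads a connectivity graph off the excitation coefficients. It is only an illustration, not the MML-based method: the parameterization `alpha[i][j] * beta * exp(-beta * (t - s))`, the single shared decay rate `beta`, and the names `intensity` and `granger_graph` are assumptions. It relies on the usual property of linear MHPs that component j does not Granger-cause component i when the kernel from j to i is identically zero.

```python
import math

def intensity(i, t, events, mu, alpha, beta):
    """Conditional intensity of component i at time t.

    events[j] lists the past event times of component j, mu[i] is the
    baseline rate, and alpha[i][j] is the excitation of i by j.
    """
    value = mu[i]
    for j, times in enumerate(events):
        for s in times:
            if s < t:
                value += alpha[i][j] * beta * math.exp(-beta * (t - s))
    return value

def granger_graph(alpha, tol=0.0):
    """Edges (j, i) wherever the excitation coefficient alpha[i][j] is nonzero."""
    return [(j, i) for i, row in enumerate(alpha)
            for j, a in enumerate(row) if abs(a) > tol]

# Toy parameters (assumed): component 0 excites component 1, nothing else.
mu = [0.2, 0.1]
alpha = [[0.0, 0.0], [0.5, 0.0]]
beta = 1.0
events = [[1.0], []]          # one event of component 0 at time 1.0

lam = intensity(1, 2.0, events, mu, alpha, beta)
edges = granger_graph(alpha)
print(lam, edges)
```

In an estimation setting the coefficients are unknown, and the model selection step decides which entries of `alpha` are kept as nonzero.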

Representation Learning via Manifold Flattening and Reconstruction
多様体の平坦化と再構成による表現学習

A common assumption for real-world, learnable data is its possession of some low-dimensional structure, and one way to formalize this structure is through the manifold hypothesis: that learnable data lies near some low-dimensional manifold. Deep learning architectures often have a compressive autoencoder component, where data is mapped to a lower-dimensional latent space, but many architecture design choices are made by hand, since such models do not inherently exploit the mathematical structure of the data. To utilize this geometric data structure, we propose an iterative process in the style of a geometric flow for explicitly constructing a pair of neural networks layer-wise that linearize and reconstruct an embedded submanifold, from finite samples of this manifold. The resulting neural networks, called Flattening Networks (FlatNet), are theoretically interpretable, computationally feasible at scale, and generalize well to test data, a balance not typically found in manifold-based learning methods. We present empirical results and comparisons to other models on synthetic high-dimensional manifold data and 2D image data. Our code is publicly available.

現実世界の学習可能なデータに対する一般的な仮定は、それが何らかの低次元構造を持っているということであり、この構造を形式化する1つの方法は、学習可能なデータが何らかの低次元多様体の近くにあるという多様体仮説を使用することです。ディープラーニングアーキテクチャには、データが低次元の潜在空間にマッピングされる圧縮オートエンコーダコンポーネントが含まれることがよくありますが、このようなモデルはデータの数学的構造を本質的に利用しないため、アーキテクチャ設計の選択の多くは手作業で行われることがよくあります。この幾何学的データ構造を利用するために、この多様体の有限サンプルから埋め込まれたサブ多様体を線形化して再構築するニューラルネットワークのペアを層ごとに明示的に構築するための、幾何学的フローのスタイルの反復プロセスを提案します。このように生成されたニューラルネットワークは、フラット化ネットワーク(FlatNet)と呼ばれ、理論的に解釈可能で、大規模な計算が可能であり、テストデータに適切に一般化されます。これは、多様体ベースの学習方法では通常見られないバランスです。合成高次元多様体データと2D画像データに関する実験結果と他のモデルとの比較を示します。コードは公開されています。

Bagging Provides Assumption-free Stability
バギングは仮定のない安定性を提供します

Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant. Empirical results validate our findings, showing that bagging successfully stabilizes even highly unstable base algorithms.

バギングは、機械学習モデルを安定させるための重要な手法です。この論文では、あらゆるモデルに対するバギングの安定性に関する有限サンプル保証を導き出します。この結果は、データの分布、基本アルゴリズムの性質、または共変量の次元について何の仮定も置きません。私たちの保証は、バギングの多くのバリエーションに適用され、定数倍を除いて最適です。経験的な結果により、私たちの発見が検証され、バギングが非常に不安定な基本アルゴリズムでさえも安定化することが示されています。
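As a rough illustration of why averaging over resamples stabilizes an unstable base algorithm, the sketch below bags a 1-nearest-neighbor regressor with bootstrap resampling. The helper names (`bagged_predict`, `fit_1nn`) and the toy data are assumptions made for this example; it does not reproduce the paper's stability bound or its experiments.

```python
import random
import statistics

def fit_1nn(sample):
    """Unstable base algorithm: 1-nearest-neighbor regression on (x, y) pairs."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagged_predict(base_fit, data, x, n_bags=200, seed=0):
    """Average the base algorithm's prediction over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    preds = []
    for _ in range(n_bags):
        # Draw n points with replacement, refit, and record the prediction.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        preds.append(base_fit(sample)(x))
    return statistics.fmean(preds)

data = [(0.0, 0.0), (1.0, 1.0)]
print(fit_1nn(data)(0.4))                   # base prediction: nearest point decides alone
print(bagged_predict(fit_1nn, data, 0.4))   # bagged prediction: a smoothed average
```

Dropping the point `(0.0, 0.0)` flips the base prediction at `x = 0.4` from 0 to 1, while the bagged prediction is already an average over resamples that include or omit that point, so a single removal moves it less.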

Fairness guarantees in multi-class classification with demographic parity
人口統計学的同等性を備えたマルチクラス分類における公平性の保証

Algorithmic Fairness is an established area of machine learning that aims to reduce the influence of hidden bias in the data. Yet, despite its wide range of applications, very few works consider the multi-class classification setting from the fairness perspective. We focus on this question and extend the definition of approximate fairness in the case of Demographic Parity to multi-class classification. We specify the corresponding expressions of the optimal fair classifiers in the attribute-aware case, for both binary and multi-categorical sensitive attributes. This suggests a plug-in data-driven procedure, for which we establish theoretical guarantees. The enhanced estimator is proved to mimic the behavior of the optimal rule both in terms of fairness and risk. Notably, fairness guarantees are distribution-free. The approach is evaluated on both synthetic and real datasets and proves very effective in decision making with a preset level of unfairness. In addition, our method is competitive with (if not better than) the state-of-the-art in binary and multi-class tasks.

アルゴリズムによる公平性は、データに隠れたバイアスの影響を減らすことを目的とした、機械学習の確立された分野です。しかし、その幅広い応用範囲にもかかわらず、公平性の観点からマルチクラス分類設定を考慮した研究はほとんどありません。私たちはこの問題に焦点を当て、人口統計的平等の場合のおおよその公平性の定義をマルチクラス分類に拡張します。属性認識の場合、およびバイナリとマルチカテゴリの敏感な属性の両方について、最適な公平な分類器の対応する式を指定します。これは、プラグイン型のデータ駆動型手順を示唆しており、私たちはその理論的保証を確立します。強化された推定器は、公平性とリスクの両方の観点から最適なルールの動作を模倣することが証明されています。特に、公平性の保証は分布フリーです。このアプローチは、合成データセットと実際のデータセットの両方で評価され、事前に設定されたレベルの不公平性で意思決定を行うのに非常に効果的であることが明らかになりました。さらに、私たちの方法は、バイナリおよびマルチクラスタスクにおいて最先端の方法と同等、あるいはそれ以上の性能を示します。
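One way to make "a preset level of unfairness" tangible is to measure, for each class, how far apart the groups are in the rate at which that class is predicted, and take the worst case. The sketch below computes that quantity from predictions and group labels. The function name and the max-over-classes form are assumptions chosen for illustration; the abstract does not spell out the paper's exact definition.

```python
from collections import Counter

def dp_unfairness(preds, groups):
    """Worst-case gap, over classes, in P(prediction = k | group) between groups."""
    by_group = {}
    for p, g in zip(preds, groups):
        by_group.setdefault(g, []).append(p)
    counts = {g: Counter(ps) for g, ps in by_group.items()}
    worst = 0.0
    for k in set(preds):
        # Empirical frequency of class k within each group.
        freqs = [counts[g][k] / len(by_group[g]) for g in by_group]
        worst = max(worst, max(freqs) - min(freqs))
    return worst

# Both groups receive each class equally often: exact demographic parity.
print(dp_unfairness([0, 1, 2, 0, 1, 2], ["a", "a", "a", "b", "b", "b"]))  # 0.0
# Class 0 goes to group "a" half of the time and never to group "b".
print(dp_unfairness([0, 1, 1, 2], ["a", "a", "b", "b"]))  # 0.5
```

A classifier would then count as approximately fair at level `eps` when this value is at most `eps`.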

Regimes of No Gain in Multi-class Active Learning
マルチクラスアクティブラーニングにおける利益なしの体制

We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $\mathbb{P}(Y=y|X=x)$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction—most relevant in multi-class settings—between active and passive learning. Namely, we show that some seemingly benign nuances in notions of margin—involving the uniqueness of the Bayes classes, which have no apparent effect on rates in passive learning—determine whether or not any active learner can outperform passive learning rates. While a shorter conference version of this work already alluded to these nuances, it focused on the binary case and thus failed to be conclusive as to the source of difficulty in the multi-class setting: we show here that it suffices that the Bayes classifier fails to be unique, as opposed to needing all classes to be Bayes optimal, for active learning to yield no gain over passive learning.

私たちは、滑らかな回帰関数を持つノンパラメトリック分類を検討します。この設定では、$\mathbb{P}(Y=y|X=x)$におけるマージンの概念が、能動学習と受動学習の両方で収束速度が速いか遅いかを決定することがよく知られています。ここでは、能動学習と受動学習の間の顕著な違い(マルチクラス設定で最も関連性が高い)を解明します。つまり、マージンの概念における一見無害なニュアンス(ベイズクラスの一意性に関わるもので、受動学習の速度には明らかな影響を与えない)が、能動学習者が受動学習の速度を上回ることができるかどうかを決定することを示します。この研究の短い会議版はすでにこれらのニュアンスに言及していましたが、バイナリの場合に焦点を当てていたため、マルチクラス設定における困難の原因について決定的な結論を出せませんでした。ここでは、能動学習が受動学習に対して利得をもたらさないためには、すべてのクラスがベイズ最適である必要はなく、ベイズ分類器が一意でないことで十分であることを示します。

Learning Optimal Dynamic Treatment Regimens Subject to Stagewise Risk Controls
ステージごとのリスクコントロールの対象となる最適な動的治療レジメンの学習

Dynamic treatment regimens (DTRs) aim at tailoring individualized sequential treatment rules that maximize cumulative beneficial outcomes by accommodating patients’ heterogeneity in decision-making. For many chronic diseases including type 2 diabetes mellitus (T2D), treatments are usually multifaceted in the sense that aggressive treatments with a higher expected reward are also likely to elevate the risk of acute adverse events. In this paper, we propose a new weighted learning framework, namely benefit-risk dynamic treatment regimens (BR-DTRs), to address the benefit-risk trade-off. The new framework relies on a backward learning procedure by restricting the induced risk of the treatment rule to be no larger than a pre-specified risk constraint at each treatment stage. Computationally, the estimated treatment rule solves a weighted support vector machine problem with a modified smooth constraint. Theoretically, we show that the proposed DTRs are Fisher consistent, and we further obtain the convergence rates for both the value and risk functions. Finally, the performance of the proposed method is demonstrated via extensive simulation studies and application to a real study for T2D patients.

動的治療レジメン(DTR)は、患者の意思決定における多様性に対応することで、累積的な有益な結果を最大化する個別化された連続治療ルールをカスタマイズすることを目的としています。2型糖尿病(T2D)を含む多くの慢性疾患では、治療は通常多面的であり、期待される報酬が高い積極的な治療は、急性有害事象のリスクも高める可能性があります。この論文では、ベネフィットリスク動的治療レジメン(BR-DTR)という新しい重み付け学習フレームワークを提案し、ベネフィットリスクのトレードオフに対処します。新しいフレームワークは、治療ルールの誘発リスクを各治療段階で事前に指定されたリスク制約以下に制限することにより、後方学習手順に依存します。計算的には、推定された治療ルールは、修正された滑らかな制約を持つ重み付けサポートベクターマシンの問題を解決します。理論的には、提案されたDTRがフィッシャー整合であることを示し、さらに価値関数とリスク関数の両方の収束率を取得します。最後に、提案された方法のパフォーマンスは、広範なシミュレーション研究と、2型糖尿病患者に対する実際の研究への適用を通じて実証されます。

Margin-Based Active Learning of Classifiers
分類器のマージンベースのアクティブラーニング

We study active learning of multiclass classifiers, focusing on the realizable transductive setting. The input is a finite subset $X$ of some metric space, and the concept to be learned is a partition $\mathcal{C}$ of $X$ into $k$ classes. The goal is to learn $\mathcal{C}$ by querying the labels of as few elements of $X$ as possible. This is a useful subroutine in pool-based active learning, and is motivated by applications where labels are expensive to obtain. Our main result is that, in very different settings, there exist interesting notions of margin that yield efficient active learning algorithms. First, we consider the case $X \subset \mathbb{R}^m$, assuming that each class has an unknown “personalized” margin separating it from the rest. Second, we consider the case where $X$ is a finite metric space, and the classes are convex with margin according to the geodesic distances in the thresholded connectivity graph. In both cases, we give algorithms that learn $\mathcal{C}$ exactly, in polynomial time, using $\mathcal{O}(\log n)$ label queries, where $\mathcal{O}(\cdot)$ hides a near-optimal dependence on the dimension of the metric spaces. Our results actually hold for or can be adapted to more general settings, such as pseudometric and semimetric spaces.

私たちは、実現可能なトランスダクティブ設定に焦点を当てて、マルチクラス分類器の能動学習を研究しています。入力は、あるメトリック空間の有限サブセット$X$であり、学習する概念は、$X$の$k$クラスへの分割$\mathcal{C}$です。目標は、可能な限り少ない$X$の要素のラベルを照会することで、$\mathcal{C}$を学習することです。これは、プールベースの能動学習で役立つサブルーチンであり、ラベルの取得にコストがかかるアプリケーションを動機としています。主な結果は、非常に異なる設定で、効率的な能動学習アルゴリズムを生み出す興味深いマージンの概念が存在することです。まず、各クラスが他のクラスと区別する未知の「パーソナライズされた」マージンを持っていると仮定して、ケース$X \subset \mathbb{R}^m$を検討します。次に、$X$が有限メトリック空間であり、クラスが、しきい値接続グラフの測地線距離に従ってマージンを持つ凸型である場合を検討します。どちらの場合も、$\mathcal{O}(\log n)$ラベルクエリを使用して、多項式時間で$\mathcal{C}$を正確に学習するアルゴリズムを提供します。ここで、$\mathcal{O}(\cdot)$は、距離空間の次元に対するほぼ最適な依存性を隠します。私たちの結果は、擬似距離空間や半距離空間などのより一般的な設定に実際に当てはまるか、または適応できます。

Random Subgraph Detection Using Queries
クエリを使用したランダムなサブグラフ検出

The planted densest subgraph detection problem refers to the task of testing whether in a given (random) graph there is a subgraph that is unusually dense. Specifically, we observe an undirected and unweighted graph on $n$ vertices. Under the null hypothesis, the graph is a realization of an Erdős-Rényi graph with edge probability (or, density) $q$. Under the alternative, there is a subgraph on $k$ vertices with edge probability $p>q$. The statistical as well as the computational barriers of this problem are well-understood for a wide range of the edge parameters $p$ and $q$. In this paper, we consider a natural variant of the above problem, where one can only observe a relatively small part of the graph using adaptive edge queries. For this model, we determine the number of queries necessary and sufficient (accompanied with a quasi-polynomial optimal algorithm) for detecting the presence of the planted subgraph. We also propose a polynomial-time algorithm which is able to detect the planted subgraph, albeit with more queries compared to the above lower bound. We conjecture that in the leftover regime, no polynomial-time algorithms exist. Our results resolve two open questions posed in the past literature.

植えられた最も密なサブグラフ検出問題とは、与えられた（ランダムな）グラフに異常に密なサブグラフがあるかどうかをテストするタスクを指します。具体的には、$n$個の頂点上の無向で重みなしのグラフを観察します。帰無仮説では、グラフはエッジ確率（または密度）$q$を持つErdős-Rényiグラフの実現です。対立仮説では、$k$個の頂点上にエッジ確率$p>q$を持つサブグラフがあります。この問題の統計的および計算上の障壁は、エッジパラメーター$p$および$q$の広い範囲でよく理解されています。この論文では、上記の問題の自然な変種を検討します。この変種では、適応エッジクエリを使用してグラフの比較的小さな部分のみを観察できます。このモデルでは、植えられたサブグラフの存在を検出するために必要かつ十分なクエリの数（準多項式最適アルゴリズムを伴う）を決定します。また、上記の下限値に比べてクエリ数が多くなるものの、植えられたサブグラフを検出できる多項式時間アルゴリズムも提案します。残りの領域では、多項式時間アルゴリズムは存在しないと推測します。私たちの結果は、過去の文献で提起された2つの未解決の問題を解決します。

Classification with Deep Neural Networks and Logistic Loss
ディープニューラルネットワークとロジスティック損失による分類

Deep neural networks (DNNs) trained with the logistic loss (also known as the cross entropy loss) have made impressive advancements in various binary classification tasks. Despite the considerable success in practice, generalization analysis for binary classification with deep neural networks and the logistic loss remains scarce. The unboundedness of the target function for the logistic loss in binary classification is the main obstacle to deriving satisfactory generalization bounds. In this paper, we aim to fill this gap by developing a novel theoretical analysis and using it to establish tight generalization bounds for training fully connected ReLU DNNs with logistic loss in binary classification. Our generalization analysis is based on an elegant oracle-type inequality which enables us to deal with the boundedness restriction of the target function. Using this oracle-type inequality, we establish generalization bounds for fully connected ReLU DNN classifiers $\hat{f}^{\text{FNN}}_n$ trained by empirical logistic risk minimization with respect to i.i.d. samples of size $n$, which lead to sharp rates of convergence as $n\to\infty$. In particular, we obtain optimal convergence rates for $\hat{f}^{\text{FNN}}_n$ (up to some logarithmic factor) only requiring the Hölder smoothness of the conditional class probability $\eta$ of data. Moreover, we consider a compositional assumption that requires $\eta$ to be the composition of several vector-valued multivariate functions of which each component function is either a maximum value function or a Hölder smooth function only depending on a small number of its input variables. Under this assumption, we can even derive optimal convergence rates for $\hat{f}^{\text{FNN}}_n$ (up to some logarithmic factor) which are independent of the input dimension of data. This result explains why in practice DNN classifiers can overcome the curse of dimensionality and perform well in high-dimensional classification problems. Furthermore, we establish dimension-free rates of convergence under other circumstances such as when the decision boundary is piecewise smooth and the input data are bounded away from it. Besides the novel oracle-type inequality, the sharp convergence rates presented in our paper also owe to a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims for the optimality of rates by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with deep neural networks.

ロジスティック損失(クロスエントロピー損失とも呼ばれる)でトレーニングされたディープニューラルネットワーク(DNN)は、さまざまなバイナリ分類タスクで目覚ましい進歩を遂げてきました。実際にはかなりの成功を収めているにもかかわらず、ディープニューラルネットワークとロジスティック損失を使用したバイナリ分類の一般化分析は依然として不足しています。バイナリ分類におけるロジスティック損失のターゲット関数の非有界性は、満足のいく一般化境界を導き出す上での主な障害です。この論文では、新しい理論的分析を開発し、それを使用してバイナリ分類でロジスティック損失を持つ完全接続ReLU DNNをトレーニングするための厳密な一般化境界を確立することで、このギャップを埋めることを目指しています。私たちの一般化分析は、ターゲット関数の有界性制限に対処できるエレガントなオラクル型不等式に基づいています。このオラクル型不等式を使用して、サイズ$n$のi.i.d.サンプルに関する経験的ロジスティックリスク最小化によってトレーニングされた完全接続ReLU DNN分類器$\hat{f}^{\text{FNN}}_n$の一般化境界を確立し、$n\to\infty$のときのシャープな収束率を導きます。特に、データの条件付きクラス確率$\eta$のHölder平滑性のみを必要とする、$\hat{f}^{\text{FNN}}_n$の最適収束率(対数因子を除いて)を取得します。さらに、$\eta$が、各コンポーネント関数が最大値関数または少数の入力変数のみに依存するHölder平滑関数のいずれかである、いくつかのベクトル値多変量関数の合成であることを要求する合成仮定を考慮します。この仮定の下では、データの入力次元に依存しない$\hat{f}^{\text{FNN}}_n$の最適収束率(対数因子を除いて)を導出することもできます。この結果は、実際にDNN分類器が次元の呪いを克服し、高次元の分類問題で優れたパフォーマンスを発揮できる理由を説明しています。さらに、決定境界が区分的に滑らかで、入力データが決定境界から一定の距離だけ離れている場合など、他の状況下での次元フリー収束率を確立します。新しいオラクル型不等式に加えて、私たちの論文で提示されたシャープな収束率は、ReLU DNNによって自然対数関数をゼロ付近(そこでは非有界となる)で近似するためのタイトな誤差境界にも起因しています。さらに、対応するミニマックス下限を証明することで、レートの最適性に関する私たちの主張を正当化します。これらの結果はすべて文献では新しいものであり、ディープニューラルネットワークによる分類の理論的理解を深めるでしょう。
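The unboundedness of the target function can be seen from a short standard calculation. The notation here is assumed for illustration: labels $Y\in\{-1,1\}$, logistic loss $\phi(t)=\log(1+e^{-t})$, and $\eta(x)=\mathbb{P}(Y=1|X=x)$. Minimizing the conditional risk pointwise in $f$ gives:

$$
\frac{\partial}{\partial f}\Big[\eta\,\phi(f)+(1-\eta)\,\phi(-f)\Big]
= -\frac{\eta}{1+e^{f}}+\frac{(1-\eta)\,e^{f}}{1+e^{f}} = 0
\quad\Longrightarrow\quad
f^{*}(x)=\log\frac{\eta(x)}{1-\eta(x)}.
$$

As $\eta(x)\to 0$ or $\eta(x)\to 1$, $|f^{*}(x)|\to\infty$, so the minimizer of the logistic risk is unbounded wherever the labels are nearly deterministic. This is also why approximating the logarithm near zero matters in the analysis.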

Spectral learning of multivariate extremes
多変量極値のスペクトル学習

We propose a spectral clustering algorithm for analyzing the dependence structure of multivariate extremes. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory. Our work studies the theoretical performance of spectral clustering based on a random $k$-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.

私たちは、多変量極値の依存構造を解析するためのスペクトルクラスタリングアルゴリズムを提案します。より具体的には、極値理論における角度測度またはスペクトル測度によって特徴付けられる多変量極値の漸近依存性に焦点を当てます。私たちの研究では、極値サンプル、つまり半径が大きなしきい値を超えるランダムベクトルの角度部分から構築されたランダムな$k$最近傍グラフに基づいて、スペクトルクラスタリングの理論的パフォーマンスを調べます。特に、線形因子モデルから生じる極値の漸近分布を導き出し、特定の条件下で、スペクトルクラスタリングがこのモデルで発生する極値のクラスターを一貫して識別できることを証明します。この結果を活用して、角度測度を学習するためのシンプルで一致性のある推定戦略を提案します。私たちの理論的知見は、私たちの方法の有限サンプル性能を示す数値実験によって補完されます。

Sum-of-norms clustering does not separate nearby balls
ノルム和クラスタリングでは近くのボールを分離しない

Sum-of-norms clustering is a popular convexification of $K$-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as $2\sqrt{2}$. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.

ノルム和(sum-of-norms)クラスタリングは、$K$-meansクラスタリングの一般的な凸化です。データセットが、単位半径の2つの互いに素なボールの和集合上の一様測度に従って分布した多数の独立確率変数で構成されており、かつボールが互いに十分に接近している場合、ノルム和クラスタリングは通常、データセットの2つのクラスターへの分解を回復できないことを示します。次元が無限大に近づくにつれて、これは2つのボールの中心間の距離を$2\sqrt{2}$まで大きくしても発生します。これを示すために、データセットを一般的な測度に置き換えた、ノルム和クラスタリングの連続バージョンを導入して分析します。特に、クラスタリングの局所-大域的な特徴付けを述べ、証明します。これは、離散的なデータポイントの場合でも新しいようです。
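For readers unfamiliar with the method, the sketch below evaluates the usual sum-of-norms objective: each data point gets its own centroid, a squared-distance term keeps centroids near their points, and a fusion term `lam * ||x_i - x_j||` over all pairs pulls centroids together. Points whose optimal centroids coincide form a cluster. The unweighted form and the 1/2 normalization are assumptions (the abstract does not fix them), and the code only evaluates the objective rather than solving the convex problem.

```python
import math

def son_objective(centroids, data, lam):
    """Sum-of-norms objective: 0.5 * sum ||x_i - a_i||^2 + lam * sum_{i<j} ||x_i - x_j||."""
    fit = 0.5 * sum(math.dist(x, a) ** 2 for x, a in zip(centroids, data))
    n = len(centroids)
    fusion = sum(
        math.dist(centroids[i], centroids[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return fit + lam * fusion

data = [(0.0, 0.0), (2.0, 0.0)]
separate = data                        # every point keeps its own centroid
merged = [(1.0, 0.0), (1.0, 0.0)]      # both centroids fused at the midpoint
print(son_objective(separate, data, lam=1.0))  # 2.0
print(son_objective(merged, data, lam=1.0))    # 1.0
```

With `lam = 1.0` the fused configuration is cheaper than keeping the two points apart, which shows how the penalty can merge nearby groups of points.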

An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization
ゼロ次非平滑非凸確率最適化のための最適次元依存性を持つアルゴリズム

We study the complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations. Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $\Omega(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal. We refute this conjecture by providing a faster algorithm that has complexity $O(d\delta^{-1}\epsilon^{-3})$, which is optimal (up to numerical constants) with respect to $d$ and also optimal with respect to the accuracy parameters $\delta,\epsilon$, thus solving an open question due to Lin et al. (2022). Moreover, the convergence rate achieved by our algorithm is also optimal for smooth objectives, proving that in the nonconvex stochastic zero-order setting, nonsmooth optimization is as easy as smooth optimization. We provide algorithms that achieve the aforementioned convergence rate in expectation as well as with high probability. Our analysis is based on a simple yet powerful lemma regarding the Goldstein-subdifferential set, which allows utilizing recent advancements in first-order nonsmooth nonconvex optimization.

私たちは、ノイズの多い関数評価のみを使用して、滑らかでも凸でもない可能性のあるリプシッツ目的関数の$(\delta,\epsilon)$定常点を生成する複雑性について研究します。最近の研究では、このタスクを解決するいくつかの確率的ゼロ次アルゴリズムが提案されていますが、それらはすべて、最適であると推測された問題の次元$d$に対して、$\Omega(d^{3/2})$の次元依存性に悩まされています。私たちは、複雑性$O(d\delta^{-1}\epsilon^{-3})$を持つより高速なアルゴリズムを提供することでこの推測を反駁します。これは、$d$に関して最適(数値定数まで)であり、精度パラメーター$\delta,\epsilon$に関しても最適であり、Linら(2022)による未解決の問題を解決します。さらに、私たちのアルゴリズムによって達成される収束率は滑らかな目的に対しても最適であり、非凸確率的ゼロ次設定では、非滑らかな最適化は滑らかな最適化と同じくらい簡単であることを証明しています。私たちは、期待値と高い確率で前述の収束率を達成するアルゴリズムを提供します。私たちの分析は、ゴールドスタイン部分微分集合に関するシンプルでありながら強力な補題に基づいており、これにより、1次非滑らかな非凸最適化の最近の進歩を活用できます。
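To show where the dimension $d$ enters zero-order methods, here is a sketch of the generic two-point randomized gradient estimator widely used in this literature: evaluate the function at `x + delta*u` and `x - delta*u` for a random unit direction `u`, and scale the difference by `d / (2*delta)`. This is a textbook building block, not the algorithm from the paper, and the function names are illustrative.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere, via a normalized Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def two_point_gradient_estimate(f, x, delta, u):
    """Gradient surrogate built from two function evaluations along direction u."""
    d = len(x)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    scale = d * (f(x_plus) - f(x_minus)) / (2.0 * delta)
    return [scale * ui for ui in u]

f = lambda v: v[0] + 2.0 * v[1] + 3.0 * v[2]
rng = random.Random(0)
u = random_unit_vector(3, rng)
print(two_point_gradient_estimate(f, [0.0, 0.0, 0.0], 0.1, u))
```

Averaged over random directions, this estimate equals the gradient of a smoothed version of `f` with smoothing radius `delta`, which is how such estimators connect to $(\delta,\epsilon)$-stationarity for nonsmooth objectives.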

Linear Distance Metric Learning with Noisy Labels
ノイズの多いラベルを使用した線形距離メトリック学習

In linear distance metric learning, we are given data in one Euclidean metric space and the goal is to find an appropriate linear map to another Euclidean metric space which respects certain distance conditions as much as possible. In this paper, we formalize a simple and elegant method which reduces to a general continuous convex loss optimization problem, and for different noise models we derive the corresponding loss functions. We show that even if the data is noisy, the ground truth linear metric can be learned with any precision provided access to enough samples, and we provide a corresponding sample complexity bound. Moreover, we present an effective way to truncate the learned model to a low-rank model that can provably maintain the accuracy in the loss function and in parameters — the first such results of this type. Several experimental observations on synthetic and real data sets support and inform our theoretical results.

線形距離メトリック学習では、1つのユークリッドメトリック空間にデータが与えられ、特定の距離条件を可能な限り尊重する別のユークリッドメトリック空間への適切な線形マップを見つけることが目標です。この論文では、一般的な連続凸損失最適化問題に還元するシンプルでエレガントな方法を形式化し、さまざまなノイズモデルに対応する損失関数を導き出します。データにノイズが多い場合でも、十分なサンプルへのアクセスが提供されれば、グラウンドトゥルース線形メトリックを任意の精度で学習でき、対応するサンプルの複雑さの範囲を提供することを示します。さらに、学習したモデルを、損失関数とパラメータの精度を証明可能に維持できる低ランクモデルに切り捨てる効果的な方法を提示します(このタイプの最初の結果)。合成データセットと実データセットに関するいくつかの実験的観察は、私たちの理論的結果を裏付け、情報を提供します。

OpenBox: A Python Toolkit for Generalized Black-box Optimization
OpenBox: 一般化されたブラックボックス最適化のための Python ツールキット

Black-box optimization (BBO) has a broad range of applications, including automatic machine learning, experimental design, and database knob tuning. However, users still face challenges when applying BBO methods to their problems at hand with existing software packages in terms of applicability, performance, and efficiency. This paper presents OpenBox, an open-source BBO toolkit with improved usability. It implements user-friendly interfaces and visualization for users to define and manage their tasks. The modular design behind OpenBox facilitates its flexible deployment in existing systems. Experimental results demonstrate the effectiveness and efficiency of OpenBox over existing systems. The source code of OpenBox is available at https://github.com/PKU-DAIR/open-box.

ブラックボックス最適化(BBO)には、自動機械学習、実験計画、データベースノブチューニングなど、幅広いアプリケーションがあります。しかし、ユーザーが既存のソフトウェアパッケージを用いて手元の問題にBBO手法を適用する場合、適用性、パフォーマンス、効率性の面で依然として課題に直面しています。この論文では、ユーザビリティが向上したオープンソースのBBOツールキットであるOpenBoxを紹介します。ユーザーフレンドリーなインターフェースと視覚化を実装し、ユーザーがタスクを定義および管理できるようにします。OpenBoxの背後にあるモジュラー設計により、既存のシステムへの柔軟な展開が容易になります。実験結果は、既存のシステムに対するOpenBoxの有効性と効率性を示しています。OpenBoxのソースコードはhttps://github.com/PKU-DAIR/open-boxで入手できます。

Generative Adversarial Ranking Nets
敵対的生成ランキングネット

We propose a new adversarial training framework — generative adversarial ranking networks (GARNet) to learn from user preferences among a list of samples so as to generate data meeting user-specific criteria. Verbosely, GARNet consists of two modules: a ranker and a generator. The generator fools the ranker to raise generated samples to the top; while the ranker learns to rank generated samples at the bottom. Meanwhile, the ranker learns to rank samples regarding the interested property by training with preferences collected on real samples. The adversarial ranking game between the ranker and the generator enables an alignment between the generated data distribution and the user-preferred data distribution with theoretical guarantees and empirical verification. Specifically, we first prove that when training with full preferences on a discrete property, the learned distribution of GARNet rigorously coincides with the distribution specified by the given score vector based on user preferences. The theoretical results are then extended to partial preferences on a discrete property and further generalized to preferences on a continuous property. Meanwhile, numerous experiments show that GARNet can retrieve the distribution of user-desired data based on full/partial preferences in terms of various interested properties (i.e., discrete/continuous property, single/multiple properties). Code is available at https://github.com/EvaFlower/GARNet.

私たちは、サンプルのリストの中からユーザーの好みを学習し、ユーザー固有の基準を満たすデータを生成するための新しい敵対的トレーニングフレームワーク、生成的敵対的ランキングネットワーク(GARNet)を提案します。簡単に言うと、GARNetはランカーとジェネレーターの2つのモジュールで構成されます。ジェネレーターはランカーをだまして生成されたサンプルを最上位に上げ、ランカーは生成されたサンプルを最下位にランク付けすることを学習します。一方、ランカーは実際のサンプルで収集された好みを使用してトレーニングすることで、関心のあるプロパティに関するサンプルをランク付けすることを学習します。ランカーとジェネレーター間の敵対的ランキングゲームにより、理論的な保証と経験的な検証により、生成されたデータ分布とユーザーが好むデータ分布の整合が可能になります。具体的には、まず、離散的プロパティの完全な好みを使用してトレーニングすると、GARNetの学習された分布が、ユーザーの好みに基づいて指定されたスコアベクトルによって指定された分布と厳密に一致することを証明します。次に、理論的な結果を離散的プロパティの部分的な好みに拡張し、さらに連続的プロパティの好みに一般化します。一方、多数の実験により、GARNetはさまざまな関心のあるプロパティ(離散/連続プロパティ、単一/複数のプロパティなど)に関する完全/部分的な設定に基づいて、ユーザーが希望するデータの分布を取得できることが示されています。コードはhttps://github.com/EvaFlower/GARNetで入手できます。

Predictive Inference with Weak Supervision
弱い監視による予測推論

The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets—sets that cover a true label with a prescribed probability, independent of the underlying distribution—using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments.

大規模な統計的機械学習でラベルを取得する費用は、部分的および弱くラベル付けされたデータを魅力的にしますが、そのようなデータをモデルの適合や検証にどのように活用するかは必ずしも明らかではありません。私たちは、部分的な教師あり学習と検証の間のギャップを埋める方法論を提示し、弱ラベル付きデータを使用して有効な予測信頼集合(基礎となる分布に依存せず、規定された確率で真のラベルをカバーする集合)を提供する共形予測フレームワークを開発します。そのために、カバレッジと予測妥当性の(必要な)新しい概念を導入し、次にいくつかのアプリケーションシナリオを開発して、分類といくつかの大規模構造化予測問題のための効率的なアルゴリズムを提供します。いくつかの実験を通じて、新しいカバレッジ定義により、よりタイトで情報量の多い(しかし有効な)信頼集合が可能になるという仮説を裏付けます。
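For context on what a "valid predictive confidence set" looks like in practice, the sketch below implements standard split conformal prediction with fully labeled calibration data. This is only the classical baseline that the paper extends; handling weak labels and the new coverage notion are not shown. The nonconformity scores (lower means the label fits better) and the function names are assumptions made for this example.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Score threshold giving coverage of at least 1 - alpha under exchangeability."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibration score to use
    if k > n:
        return math.inf  # too few calibration points: every label must be kept
    return sorted(cal_scores)[k - 1]

def prediction_set(label_scores, threshold):
    """All labels whose nonconformity score does not exceed the threshold."""
    return {label for label, s in label_scores.items() if s <= threshold}

# Calibration scores: nonconformity of the true label on held-out examples.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
tau = conformal_threshold(cal_scores, alpha=0.5)
print(tau)                                              # 0.5
print(prediction_set({"cat": 0.3, "dog": 0.7}, tau))    # {'cat'}
```

The guarantee rests only on the calibration and test points being exchangeable, which is the sense in which coverage holds independent of the underlying distribution.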

Functions with average smoothness: structure, algorithms, and learning
平均的な滑らかさを持つ関数: 構造、アルゴリズム、学習

We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds — assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the empirical covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.

私たちは、計量空間上の実数値関数を効率的に学習するための平均平滑度分析プログラムを開始します。リプシッツ定数を正則化項として使用する代わりに、各ポイントで局所勾配を定義し、関数の複雑さをこれらの値の平均として測定します。平均は最大値よりも大幅に小さくなる可能性があるため、リプシッツ定数を局所勾配の平均に置き換える改良が一般化境界に対して可能であれば、この複雑さの尺度はかなりシャープな一般化境界をもたらします。私たちの最初の大きな貢献は、まさにそのような分布に敏感な境界を得ることです。これには、いくつかの技術的課題を克服する必要がありました。おそらく最も困難だったのは、周囲空間の被覆数よりもはるかに挙動が悪くなる可能性がある経験的被覆数の境界を設定することでした。私たちの組み合わせ論的結果には、ランダムサンプルのラベルを平滑化する効率的なアルゴリズムと、サンプルから空間全体への拡張が高い確率で平均的に滑らかであり続けることの保証が伴います。その過程で、私たちが定義する関数クラスに驚くほど豊かな組み合わせ論的および解析的構造が見つかります。
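The contrast between the Lipschitz constant and the average of local slopes is easy to see on a finite sample. The sketch below computes an empirical local slope at each sample point (the largest difference quotient to any other point), then reports the maximum and the uniform average. Restricting to sample points and weighting them uniformly are simplifying assumptions; the paper's definitions are taken with respect to the underlying distribution.

```python
def local_slope(points, values, dist, i):
    """Largest difference quotient between sample point i and any other point."""
    return max(
        (
            abs(values[i] - values[j]) / dist(points[i], points[j])
            for j in range(len(points))
            if j != i and dist(points[i], points[j]) > 0
        ),
        default=0.0,
    )

def slope_summary(points, values, dist):
    """Return (empirical Lipschitz constant, average local slope)."""
    slopes = [local_slope(points, values, dist, i) for i in range(len(points))]
    return max(slopes), sum(slopes) / len(slopes)

# A function that is flat except for one jump near the last point.
points = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 0.0, 0.0, 3.0]
lipschitz, average = slope_summary(points, values, lambda a, b: abs(a - b))
print(lipschitz, average)  # 3.0 2.125
```

The steep region drives the maximum, while points far from it contribute smaller slopes to the average. With more sample points in the flat region, the gap between the two quantities grows.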

Differentially Private Data Release for Mixed-type Data via Latent Factor Models
潜在因子モデルによる混合型データの差分プライベートデータ公開

Differential privacy is a particular data privacy-preserving technology which enables synthetic data or statistical analysis results to be released with a minimum disclosure of private information from individual records. The tradeoff between privacy-preserving and utility guarantee is always a challenge for differential privacy technology, especially for synthetic data generation. In this paper, we propose a differentially private data synthesis algorithm for mixed-type data with correlation based on latent factor models. The proposed method can add a relatively small amount of noise to synthetic data under a given level of privacy protection while capturing correlation information. Moreover, the proposed algorithm can generate synthetic data preserving the same data type as mixed-type original data, which greatly improves the utility of synthetic data. The key idea of our method is to perturb the factor matrix and factor loading matrix to construct a synthetic data generation model, and to utilize link functions with privacy protection to ensure consistency of synthetic data type with original data. The proposed method can generate privacy-preserving synthetic data at low computation cost even when the original data is high-dimensional. In theory, we establish differentially private properties of the proposed method. Our numerical studies also demonstrate superb performance of the proposed method on the utility guarantee of the statistical analysis based on privacy-preserved synthetic data.

差分プライバシーは、個々の記録からの個人情報の開示を最小限に抑えながら、合成データまたは統計分析結果を公開できる特別なデータプライバシー保護技術です。プライバシー保護と有用性の保証との間のトレードオフは、差分プライバシー技術、特に合成データ生成にとって常に課題です。この論文では、潜在因子モデルに基づく相関のある混合型データに対する差分プライバシーデータ合成アルゴリズムを提案します。提案された方法は、相関情報をキャプチャしながら、特定のレベルのプライバシー保護の下で合成データに比較的少量のノイズを追加できます。さらに、提案されたアルゴリズムは、混合型の元のデータと同じデータ型を保持する合成データを生成できるため、合成データの有用性が大幅に向上します。私たちの方法の重要なアイデアは、因子行列と因子負荷行列を摂動させて合成データ生成モデルを構築し、プライバシー保護付きのリンク関数を使用して、合成データ型と元のデータの一貫性を確保することです。提案された方法は、元のデータが高次元であっても、低い計算コストでプライバシー保護された合成データを生成できます。理論的には、提案された方法の差分プライバシー特性を確立します。私たちの数値的研究は、プライバシーが保護された合成データに基づく統計分析の有用性保証に関して、提案された方法が優れたパフォーマンスを発揮することも実証しています。

The Non-Overlapping Statistical Approximation to Overlapping Group Lasso
重なり合うグループラッソに対する非重なり統計的近似

The group lasso penalty is widely used to introduce structured sparsity in statistical learning, characterized by its ability to eliminate predefined groups of parameters automatically. However, when the groups overlap, solving the group lasso problem can be time-consuming in high-dimensional settings due to groups’ non-separability. This computational challenge has limited the applicability of the overlapping group lasso penalty in cutting-edge areas, such as gene pathway selection and graphical model estimation. This paper introduces a non-overlapping and separable penalty designed to efficiently approximate the overlapping group lasso penalty. The approximation substantially enhances the computational efficiency in optimization, especially for large-scale and high-dimensional problems. We show that the proposed penalty is the tightest separable relaxation of the overlapping group lasso norm within the family of $\ell_{q_1}/\ell_{q_2}$ norms. Moreover, the estimators derived from our proposed norm are statistically equivalent to those derived from the overlapping group lasso penalty in terms of estimation error, support recovery, and minimax rate under the squared loss. The effectiveness of our method is demonstrated through extensive simulation examples and a predictive task of cancer tumors.

グループLassoペナルティは、統計学習に構造化されたスパース性を導入するために広く使用されており、定義済みのパラメータグループを自動的に削除できることが特徴です。ただし、グループが重複している場合、グループの非分離性のために、高次元設定でグループLasso問題を解くのに時間がかかることがあります。この計算上の課題により、遺伝子経路の選択やグラフィカルモデル推定などの最先端領域では、重複グループLassoペナルティの適用が制限されています。この論文では、重複グループLassoペナルティを効率的に近似するように設計された、重複しない分離可能なペナルティを紹介します。この近似により、特に大規模で高次元の問題の場合、最適化の計算効率が大幅に向上します。提案されたペナルティは、$\ell_{q_1}/\ell_{q_2}$ノルムのファミリー内で、重複グループLassoノルムの最も緊密な分離可能な緩和であることを示します。さらに、提案したノルムから導出された推定量は、推定誤差、サポート回復、および二乗損失の下でのミニマックス率に関して、重複グループLassoペナルティから導出された推定量と統計的に同等です。この方法の有効性は、広範なシミュレーション例と癌腫瘍の予測タスクを通じて実証されています。
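The non-separability mentioned above comes from coordinates that belong to more than one group. The sketch below evaluates the overlapping group lasso penalty as a weighted sum of Euclidean norms over groups. The function name and the default unit weights are assumptions for illustration, and the code shows the original overlapping penalty rather than the paper's separable approximation.

```python
import math

def overlapping_group_lasso_penalty(beta, groups, weights=None):
    """Sum over groups g of w_g * ||beta restricted to g||_2; groups may share indices."""
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(
        w * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g, w in zip(groups, weights)
    )

beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [1, 2]]  # index 1 belongs to both groups
print(overlapping_group_lasso_penalty(beta, groups))  # 5.0 + 4.0 = 9.0
```

Because `beta[1]` appears in both norms, the penalty cannot be split into independent per-group problems, which is what makes proximal steps expensive when groups overlap.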

Faster Rates of Differentially Private Stochastic Convex Optimization
差分プライベート確率的凸最適化の高速レート

In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results of general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{d}{n\epsilon})^\frac{\theta}{\theta-1})$ and $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively when $\theta\geq 2$, where $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors and extend to the case where $\theta\geq \bar{\theta}>1$ for some known $\bar{\theta}$. Next, we show that the excess population risk of population functions satisfying TNC with parameter $\theta\geq 2$ is always lower bounded by $\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1})$ and $\Omega((\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively, which matches our upper bounds. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike the previous studies, here we assume the loss function is non-negative and the optimal value of population risk is sufficiently small. With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log(1/\delta)}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ and $O(\frac{d^2}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ for any $\tau> 1$ in $(\epsilon,\delta)$-DP and $\epsilon$-DP model respectively if the sample size $n$ is sufficiently large. These results circumvent their corresponding lower bounds in (Feldman et al., 2020) for general strongly convex functions. Finally, we conduct experiments of our new methods on real-world data. Experimental results also provide new insights into established theories.

この論文では、差分プライバシー確率的凸最適化(DP-SCO)の問題を再検討し、一般的な凸関数と強凸関数の以前の結果よりも高速な、いくつかの特別なクラスの関数の過剰人口リスクを示します。論文の最初の部分では、人口リスク関数が、あるパラメータ$\theta>1$でTysbakovノイズ条件(TNC)を満たすケースを検討します。具体的には、まず、損失関数に関するいくつかの軽度の仮定の下で、$\theta\geq 2$のとき、$\epsilon$-DPと$(\epsilon, \delta)$-DPに対してそれぞれ$\tilde{O}((\frac{1}{\sqrt{n}}+\frac{d}{n\epsilon})^\frac{\theta}{\theta-1}) $と$\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})^\frac{\theta}{\theta-1})$という上限を達成できる出力を実現できるアルゴリズムが存在することを示します。ここで、$n$はサンプルサイズ、$d$は空間の次元です。次に、非効率性の問題に対処し、上限を$\text{Poly}(\log n)$係数で改善し、既知の$\bar{\theta}$に対して$\theta\geq \bar{\theta}>1$の場合に拡張します。次に、パラメーター$\theta\geq 2$でTNCを満たす集団関数の過剰集団リスクは、$\epsilon$-DPと$(\epsilon, \delta)$-DPに対してそれぞれ$\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1}) $と$\Omega((\frac{\sqrt{d\log(1/\delta)}}{n\epsilon})^\frac{\theta}{\theta-1})$によって常に下限が定められ、これが上限と一致することを示します。第2部では、集団リスク関数が強く凸である特殊なケースに焦点を当てます。これまでの研究とは異なり、ここでは損失関数が非負であり、母集団リスクの最適値が十分に小さいと仮定します。これらの追加の仮定に基づいて、サンプルサイズ$n$が十分に大きい場合、出力がそれぞれ$(\epsilon,\delta)$-DPモデルと$\epsilon$-DPモデルの任意の$\tau> 1$に対して$O(\frac{d\log(1/\delta)}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$と$O(\frac{d^2}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$の上限を達成できる新しい方法を提案します。これらの結果は、一般的な強凸関数に対する(Feldmanら、2020)の対応する下限を回避します。最後に、実際のデータで新しい方法の実験を行います。実験結果は、確立された理論への新しい洞察も提供します。
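
As a worked instance of the rates quoted in the abstract above, the following block plugs in the boundary value $\theta = 2$ for the $\epsilon$-DP case. It only rearranges the stated bounds using the elementary inequality $(a+b)^2 \le 2a^2 + 2b^2$; it is an illustration, not a result taken from the paper beyond what the abstract states.

```latex
% Exponent at \theta = 2:
\frac{\theta}{\theta-1}\Big|_{\theta=2} = 2 .

% Upper bound for \epsilon-DP at \theta = 2:
\tilde{O}\!\left(\Big(\frac{1}{\sqrt{n}}+\frac{d}{n\epsilon}\Big)^{2}\right)
\;\subseteq\;
\tilde{O}\!\left(\frac{1}{n}+\frac{d^{2}}{n^{2}\epsilon^{2}}\right),
\qquad\text{since } (a+b)^2 \le 2a^2 + 2b^2 .

% Lower bound for \epsilon-DP at \theta = 2:
\Omega\!\left(\Big(\frac{d}{n\epsilon}\Big)^{2}\right)
= \Omega\!\left(\frac{d^{2}}{n^{2}\epsilon^{2}}\right).
```

So at $\theta = 2$ the privacy-dependent term of the upper bound and the lower bound agree up to logarithmic factors. As $\theta$ grows, the exponent $\theta/(\theta-1)$ decreases towards $1$, so the bound approaches $\frac{1}{\sqrt{n}}+\frac{d}{n\epsilon}$ itself.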

Nonasymptotic analysis of Stochastic Gradient Hamiltonian Monte Carlo under local conditions for nonconvex optimization
非凸最適化のための局所条件下における確率勾配ハミルトニアンモンテカルロの非漸近解析

We provide a nonasymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) to a target measure in Wasserstein-2 distance without assuming log-concavity. Our analysis quantifies key theoretical properties of the SGHMC as a sampler under local conditions, which significantly improves on previous results. In particular, we prove that the Wasserstein-2 distance between the target and the law of the SGHMC is uniformly controlled by the step-size of the algorithm, and therefore demonstrate that the SGHMC can provide high-precision results uniformly in the number of iterations. The analysis also allows us to obtain nonasymptotic bounds for nonconvex optimization problems under local conditions and implies that the SGHMC, when viewed as a nonconvex optimizer, converges to a global minimum with the best known rates. We apply our results to obtain nonasymptotic bounds for scalable Bayesian inference and nonasymptotic generalization bounds.

私たちは、確率勾配ハミルトニアンモンテカルロ(SGHMC)のWasserstein-2距離のターゲット測度への収束の非漸近解析を、対数凹を仮定せずに提供します。私たちの分析は、局所条件下でのサンプラーとしてのSGHMCの主要な理論的特性を定量化し、以前の結果の調査結果を大幅に改善します。特に、目標とSGHMCの法則との間のWasserstein-2距離がアルゴリズムのステップサイズによって一様に制御されることを証明し、したがって、SGHMCが反復回数で一様に高精度の結果を提供できることを実証します。また、この解析では、局所条件下での非凸最適化問題の非漸近限界を得ることができ、SGHMCを非凸最適化子と見なすと、既知の最良のレートでグローバル最小値に収束することを意味します。結果を適用して、スケーラブルなベイズ推論と非漸近一般化境界の非漸近境界を取得します。

Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits
全球非定常多腕バンディットの有限時間解析

We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.

私たちは、アームのモデルパラメータが時間とともに変化する非定常多腕バンディット問題を考察します。私たちは、データストリームに関する文献から適応型ウィンドウ技法を活用するバンディットアルゴリズムクラスである適応型リセットバンディット（ADR-バンディット）を紹介します。我々はまず、独立した関心事である適応型ウィンドウ技法から得られる推定値の品質に関する新しい保証を提供します。さらに、私たちは、変化が瞬時に起こる急激な環境と変化が徐々に起こる漸進的な環境という2つの典型的な環境でADR-バンディットの有限時間分析を行う。私たちは、急激な変化または漸進的な変化が協調的に起こる場合、すなわちグローバル変化と呼ぶ場合に、ADR-バンディットがほぼ最適なパフォーマンスを発揮することを実証します。私たちは、このようなグローバル変化を想定する場合、強制的な探索は不要であることを実証します。既存の非定常バンディットアルゴリズムとは異なり、ADR-バンディットは、定常環境だけでなく、グローバル変化を伴う非定常環境でも最適なパフォーマンスを発揮します。我々の実験は、提案されたアルゴリズムが合成環境と現実世界の環境で既存のアプローチよりも優れていることを示しています。

Stable Implementation of Probabilistic ODE Solvers
確率的ODEソルバーの安定した実装

Probabilistic solvers for ordinary differential equations (ODEs) provide efficient quantification of numerical uncertainty associated with the simulation of dynamical systems. Their convergence rates have been established by a growing body of theoretical analysis. However, these algorithms suffer from numerical instability when run at high order or with small step sizes—that is, exactly in the regime in which they achieve the highest accuracy. The present work proposes and examines a solution to this problem. It involves three components: accurate initialisation, a coordinate change preconditioner that makes numerical stability concerns step-size-independent, and square-root implementation. Using all three techniques enables numerical computation of probabilistic solutions of ODEs with algorithms of order up to 11, as demonstrated on a set of challenging test problems. The resulting rapid convergence is shown to be competitive with high-order, state-of-the-art, classical methods. As a consequence, a barrier between analysing probabilistic ODE solvers and applying them to interesting machine learning problems is effectively removed.

常微分方程式(ODE)の確率的ソルバーは、動的システムのシミュレーションに関連する数値的不確実性を効率的に定量化します。その収束率は、理論分析の増加によって確立されています。ただし、これらのアルゴリズムは、高次または小さなステップサイズで実行すると、つまり、最高の精度を達成する領域で実行すると、数値的に不安定になります。この研究では、この問題の解決策を提案し、検証します。解決策には、正確な初期化、数値的安定性をステップサイズに依存しない座標変更前処理、および平方根実装の3つの要素が含まれます。3つの手法すべてを使用すると、一連の難しいテスト問題で実証されているように、最大11次までのアルゴリズムを使用してODEの確率的解を数値的に計算できます。結果として得られる迅速な収束は、高次で最先端の古典的な方法と競合できることが示されています。その結果、確率的ODEソルバーの分析と、それらを興味深い機械学習の問題に適用することの間の障壁が効果的に取り除かれます。
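
The abstract above names "square-root implementation" as one of its three components. A common way to realise that idea in Gaussian filtering is to propagate a triangular factor of the covariance with a QR decomposition instead of forming the covariance itself. The sketch below assumes that standard construction; the function name `sqrt_predict` and the random test matrices are hypothetical, and nothing here is taken from the paper's code.

```python
import numpy as np

def sqrt_predict(A, L_P, L_Q):
    """Return S with S @ S.T == A @ P @ A.T + Q, where P = L_P @ L_P.T
    and Q = L_Q @ L_Q.T, without ever forming P or Q explicitly."""
    # Stack the transposed factors: stacked.T @ stacked equals A P A^T + Q.
    stacked = np.vstack([(A @ L_P).T, L_Q.T])
    # QR gives stacked = Q_orth @ R, hence R.T @ R = stacked.T @ stacked.
    R = np.linalg.qr(stacked, mode="r")
    # R is upper triangular, so R.T is a lower-triangular factor
    # (its diagonal signs are not normalised to be positive).
    return R.T

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))               # hypothetical transition matrix
L_P = np.tril(rng.normal(size=(d, d)))    # factor of the current covariance
L_Q = np.tril(rng.normal(size=(d, d)))    # factor of the process noise

S = sqrt_predict(A, L_P, L_Q)
P_next = A @ (L_P @ L_P.T) @ A.T + L_Q @ L_Q.T
print(np.allclose(S @ S.T, P_next))
```

Working with factors keeps the implied covariance symmetric and positive semi-definite by construction, which is the usual motivation for square-root forms when step sizes are small.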

More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity
さらなるPACベイズ境界: 制限された損失から、一般的なテール動作を伴う損失、いつでも有効になる損失まで

In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni’s bound that holds uniformly for all parameter values. This leads to new fast-rate and mixed-rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast-rate bound is equivalent to the Seeger–Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss’ cumulative generating function is bounded, and a bound when the loss’ second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the “in probability” parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters’ space. Finally, using a simple technique that is applicable to any existing bound, we extend all previous results to anytime-valid bounds.

この論文では、さまざまなタイプの損失に対する新しい高確率PAC-Bayes境界を紹介します。まず、範囲が制限された損失に対して、すべてのパラメータ値に一様に当てはまるCatoni境界の強化版を復元します。これにより、解釈可能で、文献の以前の境界よりも厳密な新しい高速レート境界と混合レート境界が得られます。特に、高速レート境界はSeeger-Langford境界と同等です。次に、より一般的なテール動作を持つ損失に対して、2つの新しいパラメータフリー境界を導入します。1つは、損失の累積生成関数が制限されている場合のPAC-Bayes Chernoffアナログ、もう1つは損失の2番目のモーメントが制限されている場合の境界です。これら2つの境界は、「確率」パラメータ最適化問題に対する可能なイベントの空間の離散化に基づく新しい手法を使用して取得されます。この手法は、パラメータ空間のグリッドを最適化する以前のアプローチよりも単純で、より一般的です。最後に、既存の境界に適用できる単純な手法を使用して、これまでのすべての結果をいつでも有効な境界に拡張します。

Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space
ニューラルヒルベルトラダー:関数空間における多層ニューラルネットワーク

Characterizing the function space explored by neural networks (NNs) is an important aspect of learning theory. In this work, noticing that a multi-layer NN implicitly generates a hierarchy of reproducing kernel Hilbert spaces (RKHSs), named a neural Hilbert ladder (NHL), we define the function space as an infinite union of RKHSs, which generalizes the existing Barron space theory of two-layer NNs. We then establish several theoretical properties of the new space. First, we prove a correspondence between functions expressed by L-layer NNs and those belonging to L-level NHLs. Second, we prove generalization guarantees for learning an NHL with a controlled complexity measure. Third, we derive a non-Markovian dynamics of random fields that governs the evolution of the NHL which is induced by the training of multi-layer NNs in an infinite-width mean-field limit. Fourth, we show examples of depth separation in NHLs under the ReLU activation function. Finally, we perform numerical experiments to illustrate the feature learning aspect of NN training through the lens of NHLs.

ニューラルネットワーク(NN)によって探索される関数空間を特徴付けることは、学習理論の重要な側面です。この研究では、多層NNが暗黙的に再生カーネルヒルベルト空間(RKHS)の階層(ニューラルヒルベルトラダー(NHL)と呼ばれる)を生成することに注目し、関数空間をRKHSの無限和集合として定義し、2層NNの既存のバロン空間理論を一般化します。次に、新しい空間のいくつかの理論的特性を確立します。まず、L層NNによって表現される関数とLレベルNHLに属する関数との間の対応を証明します。次に、制御された複雑性尺度を使用してNHLを学習するための一般化保証を証明します。3番目に、無限幅の平均場限界で多層NNをトレーニングすることによって誘導されるNHLの進化を支配するランダム場の非マルコフダイナミクスを導出します。4番目に、ReLU活性化関数の下でのNHLの深度分離の例を示す。最後に、NHLの観点からNNトレーニングの特徴学習の側面を説明するために数値実験を実行します。

QDax: A Library for Quality-Diversity and Population-based Algorithms with Hardware Acceleration
QDax:ハードウェアアクセラレーションを備えた品質ダイバーシティおよびポピュレーションベースのアルゴリズムのライブラリ

QDax is an open-source library with a streamlined and modular API for Quality-Diversity (QD) optimisation algorithms in Jax. The library serves as a versatile tool for optimisation purposes, ranging from black-box optimisation to continuous control. QDax offers implementations of popular QD, Neuroevolution, and Reinforcement Learning (RL) algorithms, supported by various examples. All the implementations can be just-in-time compiled with Jax, facilitating efficient execution across multiple accelerators, including GPUs and TPUs. These implementations effectively demonstrate the framework’s flexibility and user-friendliness, easing experimentation for research purposes. Furthermore, the library is thoroughly documented and has 93% test coverage.

QDaxは、JaxのQuality-Diversity(QD)最適化アルゴリズムのための合理化されたモジュール式APIを備えたオープンソースライブラリです。このライブラリは、ブラックボックスの最適化から継続的な制御まで、最適化のための汎用性の高いツールとして機能します。QDaxは、一般的なQD、Neuroevolution、および強化学習(RL)アルゴリズムの実装を提供し、さまざまな例でサポートされています。すべての実装はJaxを使用してジャストインタイムコンパイルできるため、GPUやTPUなどの複数のアクセラレータ間で効率的に実行できます。これらの実装は、フレームワークの柔軟性と使いやすさを効果的に示し、研究目的での実験を容易にします。さらに、ライブラリは徹底的に文書化されており、93%のテストカバレッジを備えています。

Random Forest Weighted Local Fréchet Regression with Random Objects
ランダムオブジェクトによるランダムフォレスト重み付け局所フレシェ回帰

Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and Müller (2019) established a general paradigm of Fréchet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, in this paper we propose a novel random forest weighted local Fréchet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method uses these weights as the local average to solve the conditional Fréchet mean, while the second method performs local linear Fréchet regression, both significantly improving existing Fréchet regression methods. Based on the theory of infinite order U-processes and infinite order $M_{m_n}$-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data.

統計分析は、距離空間からの複雑なデータにますます直面しています。PetersenとMüller (2019)は、複雑な距離空間値の応答とユークリッド予測子を使用したフレシェ回帰の一般的なパラダイムを確立しました。ただし、そのローカルアプローチにはノンパラメトリックカーネルスムージングが含まれており、次元の呪いに悩まされています。この問題に対処するために、この論文では、ランダムフォレストの重み付けローカルフレシェ回帰パラダイムという新しいパラダイムを提案します。このアプローチの主なメカニズムは、ランダムフォレストによって生成されたローカル適応カーネルに依存しています。最初の方法では、これらの重みをローカル平均として使用して条件付きフレシェ平均を解き、2番目の方法ではローカル線形フレシェ回帰を実行します。どちらも既存のフレシェ回帰方法を大幅に改善しています。無限次数U過程と無限次数$M_{m_n}$推定量の理論に基づいて、ユークリッド応答を伴うランダムフォレストの現在の大規模サンプル理論を特別なケースとしてカバーする、局所定数推定量の一貫性、収束率、および漸近正規性を確立します。数値研究では、分布関数、対称正定値行列、球面データなど、一般的に遭遇するいくつかの種類の応答で、私たちの方法の優位性が示されています。私たちの提案の実用的なメリットは、ニューヨークのタクシーデータと人間の死亡率データへの適用を通じても実証されています。

PhAST: Physics-Aware, Scalable, and Task-Specific GNNs for Accelerated Catalyst Design
PhAST:加速触媒設計のための物理認識型、スケーラブル、タスク固有のGNN

Mitigating the climate crisis requires a rapid transition towards lower-carbon energy. Catalyst materials play a crucial role in the electrochemical reactions involved in numerous industrial processes key to this transition, such as renewable energy storage and electrofuel synthesis. To reduce the energy spent on such activities, we must quickly discover more efficient catalysts to drive electrochemical reactions. Machine learning (ML) holds the potential to efficiently model materials properties from large amounts of data, accelerating electrocatalyst design. The Open Catalyst Project OC20 dataset was constructed to that end. However, ML models trained on OC20 are still neither scalable nor accurate enough for practical applications. In this paper, we propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy. This includes improvements in (1) the graph creation step, (2) atom representations, (3) the energy prediction head, and (4) the force prediction head. We describe these contributions, referred to as PhAST, and evaluate them thoroughly on multiple architectures. Overall, PhAST improves energy MAE by 4 to 42% while dividing compute time by 3 to 8× depending on the targeted task/model. PhAST also enables CPU training, leading to 40× speedups in highly parallelized settings. Python package: https://phast.readthedocs.io.

気候危機を緩和するには、低炭素エネルギーへの急速な移行が必要です。触媒材料は、再生可能エネルギー貯蔵や電気燃料合成など、この移行の鍵となる多くの産業プロセスに関与する電気化学反応において重要な役割を果たします。このような活動に費やされるエネルギーを削減するには、電気化学反応を促進するより効率的な触媒を迅速に発見する必要があります。機械学習(ML)は、大量のデータから材料特性を効率的にモデル化し、電気触媒の設計を加速させる可能性があります。Open Catalyst Project OC20データセットは、この目的のために構築されました。ただし、OC20でトレーニングされたMLモデルは、実際のアプリケーションに十分なスケーラビリティも精度もまだありません。この論文では、ほとんどのアーキテクチャに適用可能なタスク固有のイノベーションを提案し、計算効率と精度の両方を向上させます。これには、(1)グラフ作成ステップ、(2)原子表現、(3)エネルギー予測ヘッド、(4)力予測ヘッドの改善が含まれます。PhASTと呼ばれるこれらの貢献について説明し、複数のアーキテクチャで徹底的に評価します。全体的に、PhASTは、対象となるタスク/モデルに応じて計算時間を3～8倍に短縮しながら、エネルギーMAEを4～42%向上させます。また、PhASTはCPUトレーニングも可能にし、高度に並列化された設定で40倍の高速化を実現します。Pythonパッケージ: https://phast.readthedocs.io。

Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need?
実世界のデータに対する教師なし異常検出アルゴリズム:いくつ必要ですか?

In this study we evaluate 33 unsupervised anomaly detection algorithms on 52 real-world multivariate tabular data sets, performing the largest comparison of unsupervised anomaly detection algorithms to date. On this collection of data sets, the EIF (Extended Isolation Forest) algorithm significantly outperforms most other algorithms. Visualizing and then clustering the relative performance of the considered algorithms on all data sets, we identify two clear clusters: one with “local” data sets, and another with “global” data sets. “Local” anomalies occupy a region with low density when compared to nearby samples, while “global” anomalies occupy an overall low density region in the feature space. On the local data sets the $k$NN ($k$-nearest neighbor) algorithm comes out on top. On the global data sets, the EIF (extended isolation forest) algorithm performs the best. Also taking into consideration the algorithms’ computational complexity, a toolbox with these two unsupervised anomaly detection algorithms suffices for finding anomalies in this representative collection of multivariate data sets. By providing access to code and data sets, our study can be easily reproduced and extended with more algorithms and/or data sets.

この研究では、52の実際の多変量表形式データセットで33の教師なし異常検出アルゴリズムを評価し、これまでで最大の教師なし異常検出アルゴリズムの比較を行いました。このデータセットのコレクションでは、EIF (Extended Isolation Forest)アルゴリズムが他のほとんどのアルゴリズムを大幅に上回りました。検討対象のアルゴリズムのすべてのデータセットでの相対的なパフォーマンスを視覚化してクラスタリングすると、2つの明確なクラスタが特定されました。1つは「ローカル」データセット、もう1つは「グローバル」データセットです。「ローカル」異常は近くのサンプルと比較して密度が低い領域を占めますが、「グローバル」異常は特徴空間で全体的に密度が低い領域を占めます。ローカルデータセットでは、$k$NN ($k$近傍)アルゴリズムがトップになります。グローバルデータセットでは、EIF (拡張分離フォレスト)アルゴリズムが最高のパフォーマンスを発揮します。また、アルゴリズムの計算の複雑さを考慮すると、この2つの教師なし異常検出アルゴリズムを備えたツールボックスは、この代表的な多変量データセットのコレクションで異常を見つけるのに十分です。コードとデータセットへのアクセスを提供することで、私たちの研究は簡単に再現でき、より多くのアルゴリズムやデータセットを使用して拡張できます。
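
To make the $k$NN detector mentioned in the abstract above concrete, here is a minimal sketch of one common variant, which scores each point by the distance to its $k$-th nearest neighbour. The function name `knn_anomaly_scores`, the choice $k=5$, and the synthetic data are assumptions made for illustration; the study's own implementation and settings may differ.

```python
import numpy as np

def knn_anomaly_scores(X, k=5):
    """Score each row of X by the Euclidean distance to its k-th nearest
    neighbour (larger score = more anomalous). Brute force, O(n^2) memory."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    return np.sort(dists, axis=1)[:, k - 1]   # k-th smallest distance per row

# Hypothetical data: a Gaussian cluster plus one far-away point (index 200).
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([inliers, outlier])

scores = knn_anomaly_scores(X, k=5)
print(int(np.argmax(scores)))   # index of the most anomalous point
```

The isolated point here sits in an overall low-density region, which matches the abstract's description of a "global" anomaly. A "local" anomaly would be one whose neighbourhood is sparse only relative to nearby samples.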

Multi-class Probabilistic Bounds for Majority Vote Classifiers with Partially Labeled Data
部分的にラベル付けされたデータを持つ多数決投票分類器の多クラス確率的境界

In this paper, we propose a probabilistic framework for analyzing a multi-class majority vote classifier in the case where training data is partially labeled. First, we derive a multi-class transductive bound over the risk of the majority vote classifier, which is based on the classifier’s vote distribution over each class. Then, we introduce a mislabeling error model to analyze the error of the majority vote classifier in the case of the pseudo-labeled training data. We derive a generalization bound over the majority vote error when imperfect labels are given, taking into account the mean and the variance of the prediction margin. Finally, we demonstrate an application of the derived transductive bound for self-training to find automatically the confidence threshold used to determine unlabeled examples for pseudo-labeling. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised approaches.

この論文では、トレーニングデータが部分的にラベル付けされている場合に、マルチクラス多数決分類器を分析するための確率的フレームワークを提案します。まず、各クラスに対する分類器の投票分布に基づく多数決分類器のリスクに対する多クラス変換限界を導き出します。次に、擬似ラベル付けされた教師データの場合の多数決分類器の誤差を分析するために、誤ラベル付けエラーモデルを導入します。不完全なラベルが与えられた場合、予測マージンの平均と分散を考慮して、多数決エラーに対する一般化を導き出します。最後に、擬似ラベリングのラベル付けされていない例を決定するために使用される信頼閾値を自動的に見つけるための自己トレーニングのための導出された変換限界のアプリケーションを示します。さまざまなデータセットでの経験的結果は、いくつかの最先端の半教師ありアプローチと比較して、私たちのフレームワークの有効性を示しています。

Information Processing Equalities and the Information–Risk Bridge
情報処理の等価性と情報リスクの橋渡し

We introduce two new classes of measures of information for statistical experiments which generalise and subsume φ-divergences, integral probability metrics, N-distances (MMD), and (f,Γ) divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational φ-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical information processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.

私たちは、φ発散を一般化および包含する統計実験のための情報測定の2つの新しいクラス、積分確率メトリック、N距離(MMD)、および2つ以上の分布間の(f、Γ)発散を導入します。これにより、情報の測定値と統計的決定問題のベイズリスクとの間に単純な幾何学的関係を導き出すことができ、したがって、変分φ発散表現を完全に対称的な方法で複数の分布に拡張できます。新しい発散の族は、マルコフ作用素の作用の下で閉じられ、古典的な情報処理の不等式の洗練と一般化である情報処理の等式がもたらされます。この等式は、古典的なリスク最小化における仮説クラスの選択の重要性についての洞察を提供します。

Nonparametric Regression for 3D Point Cloud Learning
3D点群学習のためのノンパラメトリック回帰

In recent years, the number of irregularly shaped point clouds collected in various areas has grown exponentially. Motivated by the importance of solid modeling for point clouds, we develop a novel and efficient smoothing tool based on multivariate splines over the triangulation to extract the underlying signal and build up a 3D solid model from the point cloud. The proposed method can denoise or deblur the point cloud effectively, provide a multi-resolution reconstruction of the actual signal, and handle sparse and irregularly distributed point clouds to recover the underlying trajectory. In addition, our method provides a natural way of numerosity data reduction. We establish the theoretical guarantees of the proposed method, including the convergence rate and asymptotic normality of the estimator, and show that the convergence rate achieves optimal nonparametric convergence. We also introduce a bootstrap method to quantify the uncertainty of the estimators. Through extensive simulation studies and a real data example, we demonstrate the superiority of the proposed method over traditional smoothing methods in terms of estimation accuracy and efficiency of data reduction.

近年、さまざまな分野で不規則な形状の点群が急増しています。点群のソリッドモデリングの重要性に着目し、三角測量上の多変量スプラインに基づく新しい効率的なスムージングツールを開発して、基礎となる信号を抽出し、点群から3Dソリッドモデルを構築します。提案された方法は、点群のノイズやぼかしを効果的に除去し、実際の信号の多重解像度再構成を提供し、まばらで不規則に分布する点群を処理して基礎となる軌跡を復元できます。さらに、この方法は、自然な方法で数のデータ削減を実現します。推定量の収束率や漸近正規性など、提案方法の理論的保証を確立し、収束率が最適なノンパラメトリック収束を達成することを示します。また、推定量の不確実性を定量化するためのブートストラップ法も紹介します。広範なシミュレーション研究と実際のデータ例を通じて、推定精度とデータ削減の効率の点で、提案された方法が従来の平滑化方法よりも優れていることを実証します。

AMLB: an AutoML Benchmark
AMLB: AutoML のベンチマーク

Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.

異なるAutoMLフレームワークを比較することは、非常に困難で、多くの場合、間違って行われます。AutoMLフレームワークを比較する際に、ベストプラクティスに従い、よくある間違いを回避する、オープンで拡張可能なベンチマークを紹介します。71の分類タスクと33の回帰タスクにわたって、9つのよく知られたAutoMLフレームワークを徹底的に比較します。AutoMLフレームワーク間の違いは、多面的な分析で調査され、モデルの精度、推論時間とのトレードオフ、フレームワークの障害が評価されます。また、Bradley-Terryツリーを使用して、相対的なAutoMLフレームワークのランキングが異なるタスクのサブセットを検出します。ベンチマークには、多くのAutoMLフレームワークと統合され、フレームワークのインストールとリソースの割り当てから詳細な評価まで、経験的評価プロセスをエンドツーエンドで自動化するオープンソースツールが付属しています。ベンチマークはパブリックデータセットを使用し、他のAutoMLフレームワークやタスクで簡単に拡張でき、最新の結果が掲載されたWebサイトがあります。

Materials Discovery using Max K-Armed Bandit
Max K-Armed Bandit を使用した材料の発見

Search algorithms for bandit problems are applicable in materials discovery. However, the objectives of the conventional bandit problem are different from those of materials discovery. The conventional bandit problem aims to maximize the total rewards, whereas materials discovery aims to achieve breakthroughs in material properties. The max $K$-armed bandit (MKB) problem, which aims to acquire the single best reward, matches the discovery tasks better than the conventional bandit. However, typical MKB algorithms are not directly applicable to materials discovery: they have many hyperparameters and are difficult to implement directly for materials discovery. Thus, we propose a new MKB algorithm using an upper confidence bound of expected improvement of the best reward. This approach is guaranteed to be asymptotic to greedy oracles, which does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process, unless the optimal arm coincides in the MKB and conventional bandit settings.

バンディット問題の検索アルゴリズムは、材料の発見に適用できます。ただし、従来のバンディット問題の目的は、材料の発見の目的とは異なります。従来のバンディット問題は総報酬の最大化を目指しますが、材料の発見は材料特性のブレークスルーを目指します。単一の最良の報酬の獲得を目指す最大Kアームドバンディット(MKB)問題は、従来のバンディットよりも発見タスクに適しています。ただし、一般的なMKBアルゴリズムは、いくつかの困難さのため、材料の発見に直接適用できません。一般的なアルゴリズムには多くのハイパーパラメータがあり、材料の発見に直接実装するにはいくつかの困難があります。そのため、私たちは、最良の報酬の期待される改善の信頼上限を使用する新しいMKBアルゴリズムを提案します。このアプローチは、時間範囲に依存しない貪欲オラクルに漸近することが保証されています。さらに、他のMKBアルゴリズムと比較して、提案されたアルゴリズムにはハイパーパラメータが1つしかないため、材料の発見に有利です。私たちは、モンテカルロツリー検索を使用して、提案されたアルゴリズムを合成問題と分子設計のデモンストレーションに適用しました。結果によると、MKBと従来のバンディット設定で最適なアームが一致しない限り、提案されたアルゴリズムは、検索プロセスの後期段階で他のバンディットアルゴリズムよりも安定して優れたパフォーマンスを発揮しました。
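
The difference between the two objectives described in the abstract above can be seen on a toy example. The sketch below compares a hypothetical "steady" arm (high mean, small spread) with a hypothetical "heavy" arm (lower mean, large spread). The arm names and reward distributions are assumptions for illustration only, and no bandit algorithm from the paper is implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pulls = 1000

# Hypothetical reward streams for two arms.
steady = rng.normal(1.0, 0.1, size=n_pulls)   # high mean, narrow spread
heavy = rng.normal(0.0, 2.0, size=n_pulls)    # low mean, wide spread

# Conventional bandit objective: total reward collected.
cumulative = {"steady": steady.sum(), "heavy": heavy.sum()}
# Max K-armed bandit objective: single best reward observed.
best_single = {"steady": steady.max(), "heavy": heavy.max()}

print(max(cumulative, key=cumulative.get))    # arm preferred for total reward
print(max(best_single, key=best_single.get))  # arm preferred for best single reward
```

In this setup the optimal arm differs between the two objectives, which is the situation in which the abstract reports an advantage for the MKB formulation.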

Semi-supervised Inference for Block-wise Missing Data without Imputation
補完なしのブロックワイズ欠損データに対する半教師あり推論

We consider statistical inference for single or low-dimensional parameters in a high-dimensional linear model under a semi-supervised setting, wherein the data are a combination of a labelled block-wise missing data set of a relatively small size and a large unlabelled data set. The proposed method utilises both labelled and unlabelled data without any imputation or removal of the missing observations. The asymptotic properties of the estimator are established under regularity conditions. Hypothesis testing for low-dimensional coefficients is also studied. Extensive simulations are conducted to examine the theoretical results. The method is evaluated on the Alzheimer’s Disease Neuroimaging Initiative data.

私たちは、データが比較的小さいサイズのラベル付きブロック単位の欠損データセットと大きなラベル付けされていないデータセットの組み合わせである半教師付き設定の下で、高次元線形モデルにおける単一次元または低次元パラメータの統計的推論を検討します。提案された方法は、ラベル付きデータとラベルなしデータの両方を利用し、欠落している観測値の補完や削除は行わずに行います。推定量の漸近特性は、規則性条件下で確立されます。低次元係数の仮説検定も研究されます。理論的な結果を検証するために、広範なシミュレーションが行われます。この方法は、AlzheimerのDisease Neuroimaging Initiativeのデータで評価されます。

Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization
適応性と非定常性:オンライン凸最適化のための問題依存動的後悔

We investigate online convex optimization in non-stationary environments and choose dynamic regret as the performance measure, defined as the difference between cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path length, which essentially reflects the non-stationarity of environments; the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1+P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper, we demonstrate that it is possible to further enhance the guarantee for some easy problem instances, particularly when online functions are smooth. Specifically, we introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in dynamic regret with problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of these two terms. These quantities are at most $\mathcal{O}(T)$ but can be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile safeguard the same rate in the worst case. Notably, our proposed algorithms can achieve favorable dynamic regret with only one gradient per iteration, sharing the same gradient query complexity as the static regret minimization methods. To accomplish this, we introduce the collaborative online ensemble framework. The proposed framework employs a two-layer online ensemble to handle non-stationarity, uses optimistic online learning, and further introduces crucial correction terms to enable effective collaboration between the meta and base layers, thereby attaining adaptivity. We believe the framework can be useful for broader problems.

私たちは、非定常環境におけるオンライン凸最適化を調査し、オンラインアルゴリズムによって発生した累積損失と任意の実行可能なコンパレータシーケンスの損失との差として定義される動的リグレットをパフォーマンス指標として選択します。$T$を時間範囲、$P_T$を環境の非定常性を本質的に反映するパス長とすると、最先端の動的リグレットは$\mathcal{O}(\sqrt{T(1+P_T)})$です。この境界は凸関数に対してミニマックス最適であることが証明されていますが、この論文では、特にオンライン関数が滑らかな場合、いくつかの簡単な問題インスタンスの保証をさらに強化できることを示します。具体的には、滑らかさを活用し、動的リグレットの$T$への依存性を問題に依存する量、つまり損失関数の勾配の変化、コンパレータシーケンスの累積損失、およびこれら2つの項の最小値に置き換えることができる新しいオンラインアルゴリズムを紹介します。これらの量は最大で$\mathcal{O}(T)$ですが、良性の環境でははるかに小さくなる可能性があります。したがって、私たちの結果は問題の本質的な難しさに適応的です。なぜなら、境界は簡単な問題に対する既存の結果よりも狭く、一方で最悪の場合でも同じレートを保護するからです。特に、私たちが提案するアルゴリズムは、静的な後悔最小化方法と同じ勾配クエリの複雑さを共有しながら、反復ごとに1つの勾配だけで好ましい動的後悔を達成できます。これを実現するために、私たちは協力的なオンラインアンサンブルフレームワークを導入します。提案されたフレームワークは、非定常性を処理するために2層のオンラインアンサンブルを採用し、楽観的なオンライン学習を使用し、さらにメタベースの2層内で効果的なコラボレーションを可能にする重要な修正項を導入して、適応性を実現します。私たちは、このフレームワークがより広範な問題に役立つと考えています。
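
The quantities in the abstract above can be computed directly for a given comparator sequence. The sketch below assumes the usual definition of path length, $P_T = \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert_2$, which the abstract does not spell out. The helper names and the toy comparator sequences are hypothetical, and the "rate" function evaluates only the expression inside the $\mathcal{O}(\cdot)$, ignoring constants.

```python
import numpy as np

def path_length(comparators):
    """P_T = sum over t >= 2 of ||u_t - u_{t-1}||_2 (assumed definition)."""
    u = np.asarray(comparators, dtype=float)
    if len(u) < 2:
        return 0.0
    return float(np.linalg.norm(u[1:] - u[:-1], axis=1).sum())

def dynamic_regret(learner_losses, comparator_losses):
    """Cumulative loss of the learner minus that of the comparator sequence."""
    return float(np.sum(learner_losses) - np.sum(comparator_losses))

def worst_case_rate(T, P_T):
    """Expression inside the O(.) of the worst-case bound: sqrt(T * (1 + P_T))."""
    return float(np.sqrt(T * (1.0 + P_T)))

# Hypothetical comparator sequences over T = 4 rounds in two dimensions.
static = [[1.0, 1.0]] * 4
moving = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [0.0, 0.0]]

print(path_length(static), path_length(moving))
print(worst_case_rate(4, path_length(static)), worst_case_rate(4, path_length(moving)))
```

A fixed comparator has zero path length, so the expression reduces to $\sqrt{T}$, the familiar static-regret scaling. A comparator that moves more gives a larger $P_T$ and hence a looser guarantee.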

Scaling Speech Technology to 1,000+ Languages
音声技術を1,000+言語に拡張

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task, while providing improved accuracy compared to prior work. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.

音声技術の言語範囲を拡大することで、より多くの人が情報にアクセスしやすくなる可能性があります。しかし、現在の音声技術は約100言語に制限されており、これは世界中で話されている7,000を超える言語のごく一部です。Massively Multilingual Speech (MMS)プロジェクトでは、タスクに応じてサポートされる言語の数を10～40倍に増やし、以前の作業と比較して精度を向上させています。主な要素は、公開されている宗教文書の読み上げに基づく新しいデータセットと、自己教師あり学習の効果的な活用です。私たちは、1,406の言語をカバーする事前トレーニング済みのwav2vec 2.0モデル、1,107の言語に対応する単一の多言語自動音声認識モデル、同じ数の言語に対応する音声合成モデル、および4,017の言語に対応する言語識別モデルを構築しました。実験では、当社の多言語音声認識モデルは、ラベル付けされたデータのごく一部でトレーニングしながら、FLEURSベンチマークの54言語でWhisperの単語エラー率を半分以上に低減することが示されています。

MAP- and MLE-Based Teaching
MAPおよびMLEベースの教育

Imagine a learner $L$ who tries to infer a hidden concept from a collection of observations. Building on the work of Ferri et al., we assume the learner to be parameterized by priors $P(c)$ and by $c$-conditional likelihoods $P(z|c)$, where $c$ ranges over all concepts in a given class $C$ and $z$ ranges over all observations in an observation set $Z$. $L$ is called a MAP-learner (resp. an MLE-learner) if it thinks of a collection $S$ of observations as a random sample and returns the concept with the maximum a-posteriori probability (resp. the concept which maximizes the $c$-conditional likelihood of $S$). Depending on whether $L$ assumes that $S$ is obtained from ordered or unordered sampling, resp. from sampling with or without replacement, we can distinguish four different sampling modes. Given a target concept $c^* \in C$, a teacher for a MAP-learner $L$ aims at finding a smallest collection of observations that causes $L$ to return $c^*$. This approach leads in a natural manner to various notions of a MAP- or MLE-teaching dimension of a concept class $C$. Our main results are as follows. First, we show that this teaching model has some desirable monotonicity properties. Second, we clarify how the four sampling modes are related to each other. As for the (important!) special case, where concepts are subsets of a domain and observations are 0,1-labeled examples, we obtain some additional results. First of all, we characterize the MAP- and MLE-teaching dimension associated with an optimally parameterized MAP-learner graph-theoretically. From this central result, some other ones are easy to derive. It is shown, for instance, that the MLE-teaching dimension is either equal to the MAP-teaching dimension or exceeds the latter by $1$. It is shown furthermore that these dimensions can be bounded from above by the so-called antichain number, the VC-dimension and related combinatorial parameters. Moreover, they can be computed in polynomial time.

観測値の集合から隠れた概念を推論しようとする学習者$L$を想像してください。Ferriらの研究に基づいて、学習者は事前確率$P(c)$と$c$条件付き尤度$P(z|c)$によってパラメータ化されると仮定します。ここで、$c$は特定のクラス$C$内のすべての概念にわたり、$z$は観測セット$Z$内のすべての観測にわたります。観測値の集合$S$をランダムサンプルと見なし、事後確率が最大となる概念(それぞれ$S$の$c$条件付き尤度を最大化する概念)を返す場合、$L$はMAP学習者(それぞれMLE学習者)と呼ばれます。$L$が、$S$は順序付きサンプリングと順序なしサンプリングのどちらから得られたと仮定するか、また復元抽出と非復元抽出のどちらから得られたと仮定するかによって、4つの異なるサンプリングモードを区別できます。対象概念$c^* \in C$が与えられた場合、MAP学習者$L$の教師は、$L$が$c^*$を返すような観測値の最小コレクションを見つけることを目指します。このアプローチは、概念クラス$C$のMAPまたはMLE教師次元のさまざまな概念に自然につながります。主な結果は次のとおりです。まず、この教師モデルには望ましい単調性特性があることを示します。次に、4つのサンプリングモードが互いにどのように関連しているかを明らかにします。概念がドメインのサブセットであり、観測値が0,1ラベル付きの例であるという重要な特殊ケースについては、追加の結果が得られます。まず、最適にパラメーター化されたMAP学習者に関連付けられたMAPおよびMLE教師次元をグラフ理論的に特徴付けます。この中心的な結果から、他のいくつかの結果を簡単に導き出すことができます。たとえば、MLE教師次元はMAP教師次元と等しいか、後者を1だけ超えることが示されます。さらに、これらの次元は、いわゆる反鎖数、VC次元、および関連する組み合わせパラメータによって上から抑えられることが示されます。さらに、これらは多項式時間で計算できます。
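The MAP- and MLE-learner definitions above can be made concrete with a small sketch. It assumes ordered sampling with replacement, so the likelihood of a sample is the product of the individual $P(z|c)$. The concept names, priors and likelihood values are hypothetical, ties are broken by dictionary order (an assumption, not something the abstract specifies), and the brute-force search is only an illustration of "a smallest collection of observations", not the paper's polynomial-time method.

```python
from itertools import combinations_with_replacement
from math import prod

def likelihood(sample, cond):
    """P(S|c) for an ordered sample with replacement: product of P(z|c)."""
    return prod(cond[z] for z in sample)

def mle_learner(sample, likelihoods):
    """Return the concept maximizing P(S|c). Ties: first concept in dict order (assumption)."""
    return max(likelihoods, key=lambda c: likelihood(sample, likelihoods[c]))

def map_learner(sample, priors, likelihoods):
    """Return the concept maximizing P(c) * P(S|c). Ties: first concept in dict order (assumption)."""
    return max(likelihoods, key=lambda c: priors[c] * likelihood(sample, likelihoods[c]))

def smallest_teaching_sample(target, learner, observations, max_size=8):
    """Brute-force search for a smallest sample that makes `learner` return `target`."""
    for size in range(max_size + 1):
        # The product likelihood ignores order, so multisets are enough here.
        for sample in combinations_with_replacement(observations, size):
            if learner(sample) == target:
                return sample
    return None

# Hypothetical toy class with two concepts and two observations.
priors = {"c1": 0.9, "c2": 0.1}
likelihoods = {
    "c1": {"a": 0.5, "b": 0.5},
    "c2": {"a": 0.8, "b": 0.2},
}

mle_sample = smallest_teaching_sample(
    "c2", lambda s: mle_learner(s, likelihoods), ["a", "b"])
map_sample = smallest_teaching_sample(
    "c2", lambda s: map_learner(s, priors, likelihoods), ["a", "b"])
print(len(mle_sample), len(map_sample))
```

In this toy class the strong prior on `c1` means the MAP-learner needs more copies of the observation `a` than the MLE-learner before it switches to `c2`. This shows how the learner's parameterization affects the size of a teaching sample.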

A General Framework for the Analysis of Kernel-based Tests
カーネルベースのテストを分析するための一般的なフレームワーク

Kernel-based tests provide a simple yet effective framework that uses the theory of reproducing kernel Hilbert spaces to design non-parametric testing procedures. In this paper, we propose new theoretical tools that can be used to study the asymptotic behaviour of kernel-based tests in various data scenarios and in different testing problems. Unlike current approaches, our methods avoid working with U and V-statistics expansions that usually lead to lengthy and tedious computations and asymptotic approximations. Instead, we work directly with random functionals on the Hilbert space to analyse kernel-based tests. By harnessing the use of random functionals, our framework leads to much cleaner analyses, involving less tedious computations. Additionally, it offers the advantage of accommodating pre-existing knowledge regarding test-statistics as many of the random functionals considered in applications are known statistics that have been studied comprehensively. To demonstrate the efficacy of our approach, we thoroughly examine two categories of kernel tests, along with three specific examples of kernel tests, including a novel kernel test for conditional independence testing.

カーネルベースのテストは、再生核ヒルベルト空間の理論を使用してノンパラメトリックテスト手順を設計する、シンプルでありながら効果的なフレームワークを提供します。この論文では、さまざまなデータシナリオとさまざまなテスト問題におけるカーネルベースのテストの漸近的動作を研究するために使用できる新しい理論的ツールを提案します。現在のアプローチとは異なり、私たちの方法では、通常、長くて退屈な計算と漸近近似につながるU統計量とV統計量の展開を扱いません。代わりに、ヒルベルト空間上のランダム汎関数を直接操作してカーネルベースのテストを分析します。ランダム汎関数を活用することで、私たちのフレームワークは、退屈な計算が少なく、はるかにクリーンな分析につながります。さらに、アプリケーションで考慮されるランダム汎関数の多くは、包括的に研究されている既知の統計量であるため、検定統計量に関する既存の知識を取り込めるという利点があります。私たちのアプローチの有効性を示すために、2つのカテゴリのカーネルテストと、条件付き独立性テスト用の新しいカーネルテストを含む3つのカーネルテストの具体的な例を徹底的に調べます。

Overparametrized Multi-layer Neural Networks: Uniform Concentration of Neural Tangent Kernel and Convergence of Stochastic Gradient Descent
オーバーパラメータ化多層ニューラルネットワーク:ニューラルタンジェントカーネルの一様集中と確率的勾配降下法の収束

There has been exciting progress in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks through the lens of the neural tangent kernel (NTK). However, there remain two significant gaps between theory and practice. First, the existing convergence theory only takes into account the contribution of the NTK from the last hidden layer, while in practice the intermediate layers also play an instrumental role. Second, most existing works assume that the training data are provided a priori in a batch, while less attention has been paid to the important setting where the training data arrive in a stream. In this paper, we close these two gaps. We first show that with random initialization, the NTK function converges to some deterministic function uniformly for all layers as the number of neurons tends to infinity. Then we apply the uniform convergence result to further prove that the prediction error of multi-layer neural networks under SGD converges in expectation in the streaming data setting. A key ingredient in our proof is to show that the number of activation patterns of an $L$-layer neural network with width $m$ is only polynomial in $m$ although there are $mL$ neurons in total.

ニューラルタンジェントカーネル(NTK)の観点から、過剰パラメータ化されたニューラルネットワークにおける勾配降下法(GD)と確率的勾配降下法(SGD)の収束を理解する上で、大きな進歩がありました。しかし、理論と実践の間には依然として2つの大きなギャップがあります。まず、既存の収束理論では最後の隠れ層からのNTKの寄与のみが考慮されていますが、実際には中間層も重要な役割を果たします。次に、ほとんどの既存の研究では、トレーニングデータがバッチで事前に提供されることを前提としていますが、トレーニングデータがストリームで到着するという重要な設定にはあまり注意が払われていません。この論文では、この2つのギャップを埋めます。まず、ランダム初期化では、ニューロン数が無限大に近づくにつれて、NTK関数がすべての層で一様に何らかの決定論的関数に収束することを示します。次に、一様収束の結果を適用して、ストリーミングデータ設定でSGD下の多層ニューラルネットワークの予測誤差が期待値の意味で収束することをさらに証明します。私たちの証明の重要な要素は、合計で$mL$個のニューロンがあるにもかかわらず、幅$m$の$L$層ニューラルネットワークの活性化パターンの数は$m$の多項式でしかないことを示すことです。

Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces
再生核バナッハ空間における学習のためのスパース表現定理

Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity for learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set by a linear combination of the extreme points of the subdifferential set of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS that can transform the explicit representation of the solutions to a sparse kernel representation having fewer terms than the number of the observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter on sparsity of the regularized solutions. We further show that two specific RKBSs, the sequence space $\ell_1(\mathbb{N})$ and the measure space, can have sparse representer theorems for both MNI and regularization models.

学習ソリューションのスパース性は、機械学習において望ましい特徴です。特定の再生核バナッハ空間(RKBS)は、スパース学習法に適した仮説空間です。この論文の目的は、どのような種類のRKBSが学習ソリューションのスパース性を促進できるかを理解することです。RKBSにおける2つの典型的な学習モデル、最小ノルム補間(MNI)問題と正則化問題を検討します。まず、これらの問題のソリューションに対する明示的な表現定理を確立します。これは、データに依存するノルム関数の劣微分集合の端点の線形結合によってソリューションセットの端点を表します。次に、ソリューションの明示的な表現を、観測データの数よりも少ない項を持つスパースカーネル表現に変換できるRKBSの十分条件を提案します。提案された十分条件の下で、正則化パラメーターが正則化されたソリューションのスパース性に及ぼす役割を調査します。さらに、2つの特定のRKBS、数列空間$\ell_1(\mathbb{N})$と測度空間が、MNIと正則化モデルの両方に対してスパース表現定理を持つことができることを示します。

Exploration of the Search Space of Gaussian Graphical Models for Paired Data
ペアデータのためのガウスグラフィカルモデルの探索空間の探索

We consider the problem of learning a Gaussian graphical model in the case where the observations come from two dependent groups sharing the same variables. We focus on a family of coloured Gaussian graphical models specifically suited for the paired data problem. Commonly, graphical models are ordered by the submodel relationship so that the search space is a lattice, called the model inclusion lattice. We introduce a novel order between models, named the twin order. We show that, embedded with this order, the model space is a lattice that, unlike the model inclusion lattice, is distributive. Furthermore, we provide the relevant rules for the computation of the neighbours of a model. The latter are more efficient than the same operations in the model inclusion lattice, and are then exploited to achieve a more efficient exploration of the search space. These results can be applied to improve the efficiency of both greedy and Bayesian model search procedures. Here, we implement a stepwise backward elimination procedure and evaluate its performance both on synthetic and real-world data.

私たちは、観測値が同じ変数を共有する2つの従属グループから得られる場合のガウスグラフィカルモデルの学習の問題を検討します。特にペアデータの問題に適した、色付きのガウスグラフィカルモデルのファミリーに焦点を当てます。通常、グラフィカルモデルはサブモデルの関係によって順序付けられ、検索空間はモデル包含格子と呼ばれる格子になります。モデル間に双子順序という新しい順序を導入します。この順序を備えたモデル空間は、モデル包含格子とは異なり、分配的な格子であることを示します。さらに、モデルの近傍を計算するための関連ルールを提供します。これらのルールは、モデル包含格子における同じ操作よりも効率的であり、検索空間のより効率的な探索を実現するために活用されます。これらの結果は、貪欲法とベイジアン法の両方のモデル検索手順の効率を向上させるために適用できます。ここでは、段階的な後方消去手順を実装し、合成データと実世界のデータの両方でそのパフォーマンスを評価します。

The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective
データ拡張の良い面、悪い面、そして醜い面:暗黙のスペクトル正則化の視点

Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between overparameterized and underparameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.

データ拡張(DA)は、現代の機械学習のパフォーマンスを強化するための強力な手段です。コンピュータービジョンにおける平行移動やスケーリングなどの特定の拡張は、従来、同じ分布から新しい(人工的な)データを生成することで汎化を改善すると考えられてきました。しかし、この従来の観点では、トレーニングデータの分布を大きく変える現代の機械学習で普及している拡張(ランダムマスキング、カットアウト、ミックスアップなど)の成功を説明できません。この研究では、DAの一般的なクラスが、過少パラメータ化および過剰パラメータ化された線形モデルの汎化に与える影響を特徴付けるための新しい理論的フレームワークを開発します。このフレームワークにより、DAは、a)トレーニングデータに依存した方法でデータ共分散行列の固有値の相対的な割合を操作すること、およびb)リッジ回帰によってデータ共分散行列のスペクトル全体を一様に底上げすることという2つの異なる効果の組み合わせによって、暗黙的なスペクトル正則化を誘発することが明らかになりました。これらの効果を一般的な拡張に適用すると、過剰パラメータ化と過少パラメータ化のレジーム間の汎化の不一致や、回帰タスクと分類タスク間の違いなど、さまざまな現象が発生します。私たちのフレームワークは、DAが汎化に及ぼす微妙で時には驚くべき影響を強調し、新しい拡張設計のテストベッドとして機能します。
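Effect (b) can be illustrated with one standard special case that is not taken from the paper itself: least-squares regression with additive noise augmentation. The notation here is assumed purely for illustration: a design matrix $X \in \mathbb{R}^{n \times p}$, responses $y \in \mathbb{R}^{n}$, a coefficient vector $\beta \in \mathbb{R}^{p}$, and an augmentation that replaces $X$ by $X + N$, where $N$ has independent entries with mean zero and variance $\sigma^2$. Averaging the augmented squared loss over $N$ gives

$$\mathbb{E}_N \|y - (X+N)\beta\|_2^2 = \|y - X\beta\|_2^2 + \mathbb{E}_N \|N\beta\|_2^2 = \|y - X\beta\|_2^2 + n\sigma^2 \|\beta\|_2^2,$$

because the cross term vanishes ($\mathbb{E}[N] = 0$) and $\mathbb{E}[N^\top N] = n\sigma^2 I_p$. In this simple setting, training on noise-augmented data is, in expectation, ridge regression with penalty $n\sigma^2$, which adds the same constant $n\sigma^2$ to every eigenvalue of $X^\top X$. The augmentations analysed in the paper (masking, cutout, mixup) are more general than this example.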

Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality
決定依存分布による確率近似:漸近正規性と最適性

We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal.

私たちは、アルゴリズムが使用するデータ分布が反復列に沿って変化する、決定依存問題に対する確率近似アルゴリズムを解析します。このような問題の主な例は、パフォーマティブ予測(performative prediction)とそのマルチプレイヤー拡張に現れます。穏やかな仮定の下で、アルゴリズムの平均反復と解の間の偏差は漸近的に正規であり、その共分散は勾配ノイズと分布シフトの影響を明確に分離することを示します。さらに、HájekとLe Camの研究に基づいて、平均化を伴うアルゴリズムの漸近性能が局所的にミニマックス最適であることを示します。
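A toy sketch may help to picture a decision-dependent problem. It assumes a one-dimensional model invented for illustration, not taken from the paper: the data are drawn as $z \sim \mathcal{N}(\mu + \varepsilon x, 1)$, so the distribution shifts with the current iterate $x$, and the loss is $\tfrac{1}{2}(x - z)^2$. The equilibrium point satisfies $x = \mu + \varepsilon x$, i.e. $x^* = \mu / (1 - \varepsilon)$. The step-size schedule $t^{-0.75}$ and all parameter values are likewise assumptions.

```python
import random

def averaged_sa(mu=1.0, eps=0.5, steps=20000, seed=0):
    """Stochastic approximation with iterate averaging on a toy decision-dependent problem.

    Hypothetical model: z ~ N(mu + eps * x, 1), loss 0.5 * (x - z)**2.
    """
    rng = random.Random(seed)
    x = 0.0
    running_sum = 0.0
    for t in range(1, steps + 1):
        z = rng.gauss(mu + eps * x, 1.0)  # the data distribution depends on the current x
        grad = x - z                      # gradient of the loss in x, with the sample z held fixed
        x -= t ** -0.75 * grad            # decaying step size (assumed schedule)
        running_sum += x
    return running_sum / steps            # average iterate

estimate = averaged_sa()
equilibrium = 1.0 / (1.0 - 0.5)
print(estimate, equilibrium)
```

The average iterate should land close to the equilibrium $x^* = 2$ for these parameter values. The sketch only shows the kind of quantity whose fluctuations the paper studies and does not reproduce its asymptotic normality or optimality analysis.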

Minimax Rates for High-Dimensional Random Tessellation Forests
高次元ランダムテッセレーションフォレストのミニマックスレート

Random forests are a popular class of algorithms used for regression and classification. The algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax optimal rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. This work shows that a large class of random forests with general split directions also achieve minimax optimal rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, and random forests derived from Poisson hyperplane tessellations. These are the first results showing that random forest variants with oblique splits can obtain minimax optimality in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory.

ランダムフォレストは、回帰と分類に使用される一般的なアルゴリズムのクラスです。2001年にBreimanによって導入されたアルゴリズムとその多くのバリアントは、特徴空間の軸に沿った分割から構築されたランダム化された決定木のアンサンブルです。そのようなバリアントの1つであるMondrianフォレストは、オンライン設定を処理するために提案され、任意の次元でミニマックス最適率が得られた最初のクラスのランダムフォレストです。ただし、軸に沿った分割への制限では、特徴間の依存関係を捉えることができず、斜めの分割を使用するランダムフォレストは、多くのタスクで経験的なパフォーマンスが向上しています。この研究では、一般的な分割方向を持つ大規模なクラスのランダムフォレストでも、任意の次元でミニマックス最適率が達成されることを示しています。このクラスには、Mondrianフォレストを任意の分割方向に一般化したSTITフォレスト、およびポアソン超平面分割から派生したランダムフォレストが含まれます。これらは、斜めの分割を持つランダムフォレストのバリアントが任意の次元でミニマックス最適性を実現できることを示す最初の結果です。私たちの証明手法は、確率幾何学における定常ランダムテッセレーションの理論を統計学習理論に新しく応用することに依存しています。

Nonparametric Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural Networks
深層ReQUニューラルネットワークによるノンクロッシング分位点回帰過程のノンパラメトリック推定

We propose a penalized nonparametric approach to estimating the quantile regression process (QRP) in a nonseparable model using rectifier quadratic unit (ReQU) activated deep neural networks and introduce a novel penalty function to enforce non-crossing of quantile regression curves. We establish the non-asymptotic excess risk bounds for the estimated QRP and derive the mean integrated squared error for the estimated QRP under mild smoothness and regularity conditions. To establish these non-asymptotic risk and estimation error bounds, we also develop a new error bound for approximating $C^s$ smooth functions with $s >1$ and their derivatives using ReQU activated neural networks. This is a new approximation result for ReQU networks and is of independent interest and may be useful in other problems. Our numerical experiments demonstrate that the proposed method is competitive with or outperforms two existing methods, including methods using reproducing kernels and random forests for nonparametric quantile regression.

私たちは、整流器二次単位(ReQU)活性化深層ニューラルネットワークを使用して、非分離モデルにおける分位回帰プロセス(QRP)を推定するためのペナルティ付きノンパラメトリック手法を提案し、分位回帰曲線の非交差を強制する新しいペナルティ関数を導入します。推定されたQRPの非漸近的過剰リスク境界を確立し、軽度の滑らかさと規則性の条件下での推定QRPの平均積分二乗誤差を導出します。これらの非漸近的リスクと推定誤差境界を確立するために、我々はまた、ReQU活性化ニューラルネットワークを使用して、$s >1$の$C^s$滑らかな関数とその導関数を近似するための新しい誤差境界を開発します。これは、ReQUネットワークの新しい近似結果であり、独立した関心事であり、他の問題にも役立つ可能性があります。数値実験により、提案された方法は、ノンパラメトリック分位回帰に再生カーネルとランダムフォレストを使用する方法を含む、既存の2つの方法と競合するか、それらよりも優れていることが実証されています。
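The building blocks named in this abstract can be sketched in a few lines: the ReQU activation, the standard pinball (check) loss used for quantile regression, and a simple penalty on crossing quantile predictions. The penalty below is a generic illustration chosen for clarity. The paper introduces its own novel penalty, which is not reproduced here, and all function names are hypothetical.

```python
def requ(x):
    """Rectified quadratic unit: max(0, x) squared."""
    return max(0.0, x) ** 2

def pinball_loss(y, q, tau):
    """Check loss for quantile level tau, with residual u = y - q."""
    u = y - q
    return u * (tau - (1.0 if u < 0 else 0.0))

def crossing_penalty(quantiles):
    """Illustrative penalty (an assumption, not the paper's): total amount by which
    predictions at increasing quantile levels fail to be non-decreasing."""
    return sum(max(0.0, lo - hi) for lo, hi in zip(quantiles, quantiles[1:]))

# Predictions at tau = 0.1, 0.5, 0.9 for one input; the middle one crosses below the first.
print(crossing_penalty([1.0, 0.5, 3.0]))
print(crossing_penalty([1.0, 2.0, 3.0]))
```

A penalized objective in this spirit would add a multiple of such a penalty to the average pinball loss over several quantile levels, so that fitted quantile curves are discouraged from crossing.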

Spatial meshing for general Bayesian multivariate models
一般的なベイズ多変量モデルの空間メッシュ

Quantifying spatial and/or temporal associations in multivariate geolocated data of different types is achievable via spatial random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. In this article, we introduce Bayesian models of spatially referenced data in which the likelihood or the latent process (or both) are not Gaussian. First, we exploit the advantages of spatial processes built via directed acyclic graphs, in which case the spatial nodes enter the Bayesian hierarchy and lead to posterior sampling via routine Markov chain Monte Carlo (MCMC) methods. Second, motivated by the possible inefficiencies of popular gradient-based sampling approaches in the multivariate contexts on which we focus, we introduce the simplified manifold preconditioner adaptation (SiMPA) algorithm which uses second order information about the target but avoids expensive matrix operations. We demonstrate the performance and efficiency improvements of our methods relative to alternatives in extensive synthetic and real world remote sensing and community ecology applications with large scale data at up to hundreds of thousands of spatial locations and up to tens of outcomes. Software for the proposed methods is part of R package meshed, available on CRAN.

さまざまな種類の多変量地理位置情報データの空間的および/または時間的関連性をベイズ階層モデルで空間ランダム効果を介して定量化することは可能ですが、私たちが焦点を当てているますます一般的になっている大規模データ設定で空間依存性が潜在的ガウス過程(GP)としてエンコードされると、深刻な計算ボトルネックが発生します。分析の扱いやすさが低下すると計算効率にさらなる障害が生じるため、非ガウスモデルでは状況がさらに悪化します。この記事では、尤度または潜在過程(またはその両方)がガウスではない、空間参照データのベイズモデルを紹介します。まず、有向非巡回グラフを介して構築された空間過程の利点を活用します。この場合、空間ノードはベイズ階層に入り、通常のマルコフ連鎖モンテカルロ(MCMC)法を介して事後サンプリングにつながります。次に、私たちが注目している多変量コンテキストにおける一般的な勾配ベースのサンプリング手法の非効率性の可能性に着目し、ターゲットに関する2次情報を使用しながら高価な行列演算を回避する、簡素化されたマニフォールド前処理適応(SiMPA)アルゴリズムを導入します。私たちは、数十万の空間位置と数十の結果の大規模データを使用した、広範な合成および実世界のリモートセンシングとコミュニティエコロジーアプリケーションにおける代替方法と比較した私たちの方法のパフォーマンスと効率性の向上を実証します。提案された方法のソフトウェアは、CRANで入手できるRパッケージmeshedの一部です。

A Semi-parametric Estimation of Personalized Dose-response Function Using Instrumental Variables
操作変数を用いた個別線量反応関数のセミパラメトリック推定

In the application of instrumental variable analysis that conducts causal inference in the presence of unmeasured confounding, invalid instrumental variables and weak instrumental variables often exist which complicate the analysis. In this paper, we propose a model-free dimension reduction procedure to select the invalid instrumental variables and refine them into lower-dimensional linear combinations. The procedure also combines the weak instrumental variables into a few stronger instrumental variables that best condense their information. We then introduce the personalized dose-response function that incorporates the subject’s personal characteristics into the conventional dose-response function, and use the reduced data from dimension reduction to propose a novel and easily implementable nonparametric estimator of this function. The proposed approach is suitable for both discrete and continuous treatment variables, and is robust to the dimensionality of data. Its effectiveness is illustrated by the simulation studies and the data analysis of ADNI-DoD study, where the causal relationship between depression and dementia is investigated.

測定されていない交絡因子の存在下で因果推論を行う操作変数分析の適用では、無効な操作変数と弱い操作変数が存在することが多く、分析が複雑になります。この論文では、無効な操作変数を選択し、それらをより低次元の線形結合に精製するためのモデルフリー次元削減手順を提案します。この手順では、弱い操作変数を、それらの情報を最もよく凝縮するいくつかのより強い操作変数に結合します。次に、被験者の個人特性を従来の用量反応関数に組み込んだパーソナライズされた用量反応関数を導入し、次元削減から削減されたデータを使用して、この関数の新しい簡単に実装できるノンパラメトリック推定量を提案します。提案されたアプローチは、離散的および連続的な治療変数の両方に適しており、データの次元に対して堅牢です。その有効性は、うつ病と認知症の因果関係を調査したシミュレーション研究とADNI-DoD研究のデータ分析によって実証されています。

Learning Non-Gaussian Graphical Models via Hessian Scores and Triangular Transport
ヘッセスコアと三角輸送による非ガウスグラフィカルモデルの学習

Undirected probabilistic graphical models represent the conditional dependencies, or Markov properties, of a collection of random variables. Knowing the sparsity of such a graphical model is valuable for modeling multivariate distributions and for efficiently performing inference. While the problem of learning graph structure from data has been studied extensively for certain parametric families of distributions, most existing methods fail to consistently recover the graph structure for non-Gaussian data. Here we propose an algorithm for learning the Markov structure of continuous and non-Gaussian distributions. To characterize conditional independence, we introduce a score based on integrated Hessian information from the joint log-density, and we prove that this score upper bounds the conditional mutual information for a general class of distributions. To compute the score, our algorithm SING estimates the density using a deterministic coupling, induced by a triangular transport map, and iteratively exploits sparse structure in the map to reveal sparsity in the graph. For certain non-Gaussian datasets, we show that our algorithm recovers the graph structure even with a biased approximation to the density. Among other examples, we apply SING to learn the dependencies between the states of a chaotic dynamical system with local interactions.

無向確率グラフィカルモデルは、ランダム変数の集合の条件付き依存性、つまりマルコフ特性を表します。このようなグラフィカルモデルのスパース性を知ることは、多変量分布をモデル化し、推論を効率的に実行するために役立ちます。データからグラフ構造を学習する問題は、特定のパラメトリック分布ファミリーについて広範に研究されてきましたが、既存の方法のほとんどは、非ガウスデータのグラフ構造を一貫して回復できません。ここでは、連続分布と非ガウス分布のマルコフ構造を学習するアルゴリズムを提案します。条件付き独立性を特徴付けるために、結合対数密度からの統合ヘッセ情報に基づくスコアを導入し、このスコアが一般的なクラスの分布の条件付き相互情報量の上限となることを証明します。スコアを計算するために、アルゴリズムSINGは、三角形の輸送マップによって誘導される決定論的結合を使用して密度を推定し、マップ内のスパース構造を繰り返し利用してグラフ内のスパース性を明らかにします。特定の非ガウスデータセットでは、密度の偏った近似値であっても、アルゴリズムがグラフ構造を回復できることを示します。他の例として、SINGを適用して、ローカル相互作用を持つカオス動的システムの状態間の依存関係を学習します。

On the Learnability of Out-of-distribution Detection
分布外検出の学習可能性について

Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms, and corresponding learning theory is still an open problem. To study the generalization of OOD detection, this paper investigates the probably approximately correct (PAC) learning theory of OOD detection that fits the commonly used evaluation metrics in the literature. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we offer theoretical support for representative OOD detection works based on our OOD theory.

教師あり学習は、トレーニングデータとテストデータが同じ分布からのものであるという仮定の下で分類器をトレーニングすることを目的としています。上記の仮定を緩和するために、研究者はより現実的な設定、つまり、テストデータがトレーニング中に不明なクラス(つまり、OODデータ)から取得される可能性がある分布外(OOD)検出を研究してきました。OODデータは入手できず、多様性があるため、効果的なOOD検出アルゴリズムには優れた一般化能力が不可欠であり、対応する学習理論はまだ未解決の問題です。OOD検出の一般化を研究するために、この論文では、文献で一般的に使用されている評価メトリックに適合する、OOD検出のおそらくほぼ正しい(PAC)学習理論を調査します。まず、OOD検出の学習可能性に必要な条件を見つけます。次に、この条件を使用して、いくつかのシナリオでのOOD検出の学習可能性に関するいくつかの不可能性定理を証明します。不可能性定理はイライラさせられますが、これらの不可能性定理のいくつかの条件は、いくつかの実際のシナリオでは成り立たない可能性があることがわかりました。この観察に基づいて、次に、いくつかの実際のシナリオにおけるOOD検出の学習可能性を特徴付けるために必要かつ十分な条件をいくつか示します。最後に、OOD理論に基づいて、代表的なOOD検出作業に対する理論的サポートを提供します。

Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training
Win:ウェイトディケイを統合したネステロフアクセラレーションによるネットワークトレーニングの高速化

Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of “how to accelerate adaptive gradient algorithms in a general manner”, and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, per iteration, we construct a dynamical loss that combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method, and respectively minimize the first- and second-order Taylor approximations of the dynamical loss to update the variable. This yields our Win acceleration that uses a conservative step and an aggressive step to update, and linearly combines these two updates for acceleration. Next, we extend Win into Win2 which uses multiple aggressive update steps for faster convergence. Then we apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB compared to their non-accelerated counterparts. Experimental results demonstrate the faster convergence speed and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks.

大規模なデータセットでディープネットワークをトレーニングするのは、計算上困難です。この研究では、「適応勾配アルゴリズムを一般的な方法で加速する方法」という問題を検討し、適応アルゴリズムを加速するための効果的な重み減衰統合ネステロフ加速(Win)を提案しています。AdamWとAdamを例にとり、反復ごとに、バニラトレーニング損失と近接点法にヒントを得た動的正則化を組み合わせた動的損失を構築し、動的損失の1次および2次テイラー近似をそれぞれ最小化して変数を更新します。これにより、保守的なステップと積極的なステップを使用して更新し、これら2つの更新を線形に結合するWin加速が得られます。次に、WinをWin2に拡張し、複数の積極的な更新ステップを使用して収束を高速化します。次に、WinとWin2を一般的なLAMBおよびSGDオプティマイザーに適用します。この透明な導出により、他の加速方法とそれらの適応アルゴリズムへの統合に関する洞察が得られる可能性があります。さらに、WinおよびWin2で加速されたAdamW、Adam、LAMBが、加速なしのものよりも速く収束することを理論的に示します。実験結果では、視覚分類および言語モデリングタスクにおいて、WinおよびWin2で加速されたAdamW、Adam、LAMB、SGDが、標準のオプティマイザーよりも収束速度が速く、パフォーマンスが優れていることが実証されています。

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains
一般ドメインで定義されたニューラルネットワーク関連カーネル関数のクラスの固有値減衰率について

In this paper, we provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain rather than $\mathbb{S}^{d}$. This class of kernel functions includes, but is not limited to, the neural tangent kernel associated with neural networks with different depths and various activation functions. After proving that the dynamics of training wide neural networks uniformly approximate those of neural tangent kernel regression on general domains, we further illustrate the minimax optimality of the wide neural network provided that the underlying ground-truth function $f\in [\mathcal H_{\mathrm{NTK}}]^{s}$, an interpolation space associated with the RKHS $\mathcal{H}_{\mathrm{NTK}}$ of NTK. We also show that the overfitted neural network cannot generalize well. We believe our approach for determining the EDR of kernels may also be of independent interest.

この論文では、$\mathbb{S}^{d}$ではなく、一般的なドメインで定義されたカーネル関数の大規模なクラスの固有値減衰率(EDR)を決定するための戦略を提供します。このクラスのカーネル関数には、異なる深さおよび種々の活性化関数を持つニューラルネットワークに関連付けられたニューラルタンジェントカーネルが含まれますが、これに限定されません。ワイドニューラルネットワークの学習のダイナミクスが一般ドメイン上のニューラルタンジェントカーネル回帰のダイナミクスを一様に近似することを証明した後、真の関数が$f\in [\mathcal H_{\mathrm{NTK}}]^{s}$(NTKのRKHS $\mathcal{H}_{\mathrm{NTK}}$に関連付けられた補間空間)に属すると仮定して、ワイドニューラルネットワークのミニマックス最適性をさらに示します。また、過適合したニューラルネットワークはうまく汎化できないことも示します。カーネルのEDRを決定するための私たちのアプローチは、それ自体としても興味深いものである可能性があると考えています。

Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions
パワー則スペクトル条件下での最適化のための厳密な収束速度限界

Performance of optimization on quadratic problems sensitively depends on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions, resulting in power law convergence rates for iterative solutions of these problems by gradient-based algorithms. In this paper, we propose a new spectral condition providing tighter upper bounds for problems with power law optimization trajectories. We use this condition to build a complete picture of upper and lower bounds for a wide range of optimization algorithms – Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients – with an emphasis on the underlying schedules of learning rate and momentum. In particular, we demonstrate how an optimally accelerated method, its schedule, and convergence upper bound can be obtained in a unified manner for a given shape of the spectrum. Also, we provide first proofs of tight lower bounds for convergence rates of Steepest Descent and Conjugate Gradients under spectral power laws with general exponents. Our experiments show that the obtained convergence bounds and acceleration strategies are not only relevant for exactly quadratic optimization problems, but also fairly accurate when applied to the training of neural networks.

二次問題における最適化のパフォーマンスは、スペクトルの低位部分に敏感に依存します。大規模な(実質的に無限次元の)問題の場合、スペクトルのこの部分は、多くの場合、自然にべき乗分布で表現または近似できるため、勾配ベースのアルゴリズムによるこれらの問題の反復ソリューションの収束率はべき乗法則になります。この論文では、べき乗法則の最適化軌道を持つ問題に対して、より厳しい上限を提供する新しいスペクトル条件を提案します。この条件を使用して、学習率と運動量の基本的なスケジュールに重点を置き、勾配降下法、最急降下法、ヘビーボール法、共役勾配法など、さまざまな最適化アルゴリズムの上限と下限の完全な図を作成します。特に、スペクトルの特定の形状に対して、最適に加速された方法、そのスケジュール、および収束の上限を統一された方法で取得する方法を示します。また、一般指数を持つスペクトルべき法則の下で、最急降下法と共役勾配法の収束率の厳しい下限値の最初の証明も提供します。実験では、得られた収束上限値と加速戦略は、厳密に二次最適化問題に関係するだけでなく、ニューラルネットワークのトレーニングに適用した場合にもかなり正確であることが示されています。
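The link between a power-law spectrum and power-law convergence can be checked numerically on a toy problem. The sketch below assumes a diagonal quadratic $f(w) = \tfrac{1}{2}\sum_k \lambda_k w_k^2$ with $\lambda_k = k^{-\nu}$ and all initial coordinates equal to 1. These choices, the truncation at `n` coordinates and the constant learning rate are assumptions made for illustration and are not the paper's spectral condition. For such a diagonal problem, gradient descent decouples per coordinate, so the loss after $t$ steps has a closed form and no iteration loop is needed.

```python
import math

def gd_loss(t, nu=2.0, n=5000, lr=1.0):
    """Loss after t gradient-descent steps on f(w) = 0.5 * sum_k lam_k * w_k**2.

    Assumes lam_k = k**(-nu) and w_k(0) = 1; each coordinate evolves as
    w_k(t) = (1 - lr * lam_k)**t, so the loss is available in closed form.
    """
    total = 0.0
    for k in range(1, n + 1):
        lam = k ** -nu
        total += lam * (1.0 - lr * lam) ** (2 * t)
    return 0.5 * total

def loglog_slope(t1, t2, **kwargs):
    """Empirical exponent of the loss curve between steps t1 and t2."""
    return (math.log(gd_loss(t2, **kwargs)) - math.log(gd_loss(t1, **kwargs))) / (
        math.log(t2) - math.log(t1))

slope = loglog_slope(100, 1000)
print(slope)
```

A back-of-the-envelope integral approximation for this toy setup suggests a loss decaying roughly like $t^{-(1 - 1/\nu)}$, so with $\nu = 2$ the measured slope should be near $-0.5$. The paper's bounds cover a much wider family of algorithms and schedules than this single example.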

ptwt – The PyTorch Wavelet Toolbox
ptwt – PyTorch ウェーブレットツールボックス

The fast wavelet transform is an essential workhorse in signal processing. Wavelets are local in the spatial or temporal domain and in the frequency domain. This property enables frequency domain analysis while preserving some spatiotemporal information. Until recently, wavelets rarely appeared in the machine learning literature. We provide the PyTorch Wavelet Toolbox to make wavelet methods more accessible to the deep learning community. Our PyTorch Wavelet Toolbox is well documented. A pip package is installable with `pip install ptwt`.

高速ウェーブレット変換は、信号処理に不可欠な主力ツールです。ウェーブレットは、空間領域または時間領域と周波数領域の両方で局所的です。この性質により、一部の時空間情報を保持しながら周波数領域解析が可能になります。最近まで、ウェーブレットが機械学習の文献に登場することはめったにありませんでした。PyTorch Wavelet Toolboxは、ウェーブレット手法を深層学習コミュニティにとってより身近なものにするために提供されています。PyTorch Wavelet Toolboxは十分に文書化されています。pipパッケージは`pip install ptwt`でインストールできます。
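To make the idea of a wavelet transform concrete, here is a minimal pure-Python sketch of one level of the orthonormal Haar transform and its inverse. It is deliberately independent of the toolbox: the function names `haar_dwt` and `haar_idwt` are hypothetical and do not mirror the ptwt API, for which the package documentation is the authoritative source.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar transform of an even-length sequence.

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]
    detail = [(a - b) * s for a, b in pairs]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert haar_dwt, interleaving the reconstructed samples."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

signal = [4.0, 4.0, 2.0, 0.0]
approx, detail = haar_dwt(signal)
reconstructed = haar_idwt(approx, detail)
print(detail, reconstructed)
```

The detail coefficient is zero where neighbouring samples are equal and non-zero where the signal changes, which is the locality property described in the abstract. Applying the same step recursively to the approximation coefficients gives a multi-level fast wavelet transform.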

Choosing the Number of Topics in LDA Models – A Monte Carlo Comparison of Selection Criteria
LDAモデルでのトピック数の選択 – 選択基準のモンテカルロ比較

Selecting the number of topics in Latent Dirichlet Allocation (LDA) models is considered to be a difficult task, for which various approaches have been proposed. In this paper the performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be applied to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the considered data generation processes (DGPs) are revealed. Practical recommendations for LDA model selection in applications are derived.

LDA(Latent Dirichlet Allocation)モデルでは、トピック数の選択が難しい課題と考えられており、様々なアプローチが提案されています。この論文では、最近開発された特異ベイズ情報量基準(sBIC)の性能を評価し、代替モデル選択基準の性能と比較します。sBICは、特異な統計モデルに適用できる標準BICの一般化です。比較はモンテカルロシミュレーションに基づいており、トピックの数、ドキュメントの数、およびコーパス内のドキュメントのサイズに関して異なる、いくつかの代替設定に対して実行されます。パフォーマンスは、トピックの正しい数だけでなく、考慮されたデータ生成プロセス(DGP)からの関連トピックが明らかになっているかどうかも考慮したさまざまな基準を使用して測定されます。アプリケーションでのLDAモデル選択に関する実用的な推奨事項を導き出します。

Functional Directed Acyclic Graphs
関数有向非巡回グラフ

In this article, we introduce a new method to estimate a directed acyclic graph (DAG) from multivariate functional data. We build on the notion of faithfulness that relates a DAG with a set of conditional independences among the random functions. We develop two linear operators, the conditional covariance operator and the partial correlation operator, to characterize and evaluate the conditional independence. Based on these operators, we adapt and extend the PC-algorithm to estimate the functional directed graph, so that the computation time depends on the sparsity rather than the full size of the graph. We study the asymptotic properties of the two operators, derive their uniform convergence rates, and establish the uniform consistency of the estimated graph, all of which are obtained while allowing the graph size to diverge to infinity with the sample size. We demonstrate the efficacy of our method through both simulations and an application to a time-course proteomic dataset.

この記事では、多変量関数データから有向非巡回グラフ(DAG)を推定する新しい方法を紹介します。私たちは、DAGをランダム関数間の条件付き独立性のセットに関連付ける忠実性の概念に基づいています。条件付き共分散演算子と偏相関演算子の2つの線形演算子を開発して、条件付き独立性を特徴付けて評価します。これらの演算子に基づいて、PCアルゴリズムを適応および拡張して、関数有向グラフを推定し、計算時間がグラフのフルサイズではなくスパース性に依存するようにします。2つの演算子の漸近特性を研究し、それらの均一な収束率を導き出し、推定されたグラフの均一な一貫性を確立します。これらはすべて、グラフサイズがサンプルサイズに対して無限大に発散できるようにしながら得られます。私たちは、シミュレーションと時間経過プロテオミクスデータセットへの適用の両方を通じて、この方法の有効性を実証しています。

Unlabeled Principal Component Analysis and Matrix Completion
ラベルなし主成分分析と行列補完

We introduce robust principal component analysis from a data matrix in which the entries of its columns have been corrupted by permutations, termed Unlabeled Principal Component Analysis (UPCA). Using algebraic geometry, we establish that UPCA is a well-defined algebraic problem since we prove that the only matrices of minimal rank that agree with the given data are row-permutations of the ground-truth matrix, arising as the unique solutions of a polynomial system of equations. Further, we propose an efficient two-stage algorithmic pipeline for UPCA suitable for the practically relevant case where only a fraction of the data have been permuted. Stage-I employs outlier-robust PCA methods to estimate the ground-truth column-space. Equipped with the column-space, Stage-II applies recent methods for unlabeled sensing to restore the permuted data. Allowing for missing entries on top of permutations in UPCA leads to the problem of unlabeled matrix completion, for which we derive theory and algorithms of similar flavor. Experiments on synthetic data, face images, educational and medical records reveal the potential of our algorithms for applications such as data privatization and record linkage.

私たちは、列のエントリが置換によって破損したデータ行列からの堅牢な主成分分析、つまりラベルなし主成分分析(UPCA)を紹介します。代数幾何学を使用して、与えられたデータと一致する最小ランクの行列はグラウンドトゥルース行列の行置換に限られ、それらが多項式方程式系の一意の解として生じることを証明することにより、UPCAが明確に定義された代数問題であることを示します。さらに、データの一部のみが置換されているという実際的なケースに適した、UPCAの効率的な2段階アルゴリズムパイプラインを提案します。ステージIでは、外れ値に強いPCA手法を使用してグラウンドトゥルース列空間を推定します。ステージIIでは、列空間を使用して、ラベルなしセンシングの最新の手法を適用し、置換されたデータを復元します。UPCAで置換に加えてエントリの欠落も許容すると、ラベルなし行列補完の問題が生じます。この問題に対して、同様の理論とアルゴリズムを導出します。合成データ、顔画像、教育記録、医療記録に関する実験により、データのプライバシー保護やレコードリンケージなどのアプリケーションにおける私たちのアルゴリズムの可能性が明らかになりました。

Distributed Estimation on Semi-Supervised Generalized Linear Model
半教師あり一般化線形モデルにおける分散推定

Semi-supervised learning is devoted to using unlabeled data to improve the performance of machine learning algorithms. In this paper, we study the semi-supervised generalized linear model (GLM) in the distributed setup. In the cases of single or multiple machines containing unlabeled data, we propose two distributed semi-supervised algorithms based on the distributed approximate Newton method. When the labeled local sample size is small, our algorithms still give a consistent estimation, while fully supervised methods fail to converge. Moreover, we theoretically prove that the convergence rate is greatly improved when sufficient unlabeled data exists. Therefore, the proposed method requires much fewer rounds of communications to achieve the optimal rate than its fully-supervised counterpart. In the case of the linear model, we prove the rate lower bound after one round of communication, which shows that rate improvement is essential. Finally, several simulation analyses and real data studies are provided to demonstrate the effectiveness of our method.

半教師あり学習は、ラベルなしデータを使用して機械学習アルゴリズムのパフォーマンスを向上させることを目的としています。この論文では、分散設定における半教師あり一般化線形モデル(GLM)について検討します。ラベルなしデータを含むマシンが単一または複数ある場合について、分散近似ニュートン法に基づく2つの分散半教師ありアルゴリズムを提案します。ラベル付きローカルサンプルサイズが小さい場合でも、提案アルゴリズムは一致推定を与えますが、完全教師ありの方法は収束しません。さらに、十分なラベルなしデータが存在すると収束レートが大幅に向上することを理論的に証明します。したがって、提案された方法では、完全教師ありの方法よりも、最適なレートを達成するために必要な通信回数がはるかに少なくなります。線形モデルの場合、1回の通信後のレートの下限を証明し、レートの向上が本質的であることを示します。最後に、いくつかのシミュレーション分析と実データの分析を実施して、この方法の有効性を示します。

Towards Explainable Evaluation Metrics for Machine Translation
機械翻訳の説明可能な評価指標に向けて

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, mediately, also contribute to better and more transparent machine translation systems.

BLEUなどの古典的な語彙重複メトリクスとは異なり、機械翻訳の現在の評価メトリクスのほとんど(COMETやBERTScoreなど)は、ブラックボックスの大規模言語モデルに基づいています。これらのメトリクスは人間の判断と強い相関関係にあることが多いのですが、最近の研究では、品質の低い古典的なメトリクスが依然として優勢であることが示されています。その理由の1つは、意思決定プロセスの透明性が高いためです。したがって、新しい高品質のメトリクスをより広く受け入れるためには、説明可能性が重要になります。このコンセプトペーパーでは、説明可能な機械翻訳メトリクスの主要な特性と主要な目標を特定し、確立された目標と特性に関連付けて、最新の手法を包括的に統合します。このコンテキストでは、ChatGPTやGPT4などの生成モデルに基づく説明可能なメトリクスに対する最新の最先端のアプローチについても説明します。最後に、自然言語の説明を含む次世代のアプローチのビジョンを提供します。私たちの研究が、説明可能な評価メトリクスに関する将来の研究を促進およびガイドし、間接的に、より優れた透明性の高い機械翻訳システムにも貢献することを願っています。

Differentially private methods for managing model uncertainty in linear regression
線形回帰におけるモデルの不確実性を管理するための差分プライベートな手法

In this article, we propose differentially private methods for hypothesis testing, model averaging, and model selection for normal linear models. We propose Bayesian methods based on mixtures of $g$-priors and non-Bayesian methods based on likelihood-ratio statistics and information criteria. The procedures are asymptotically consistent and straightforward to implement with existing software. We focus on practical issues such as adjusting critical values so that hypothesis tests have adequate type I error rates and quantifying the uncertainty introduced by the privacy-ensuring mechanisms.

この記事では、正規線形モデルの仮説検定、モデル平均化、およびモデル選択のための差分プライベートな手法を提案します。$g$事前分布の混合に基づくベイズ法と、尤度比統計量と情報量基準に基づく非ベイズ法を提案します。手順は漸近的に一致性を持ち、既存のソフトウェアで簡単に実装できます。仮説検定が適切な第一種の過誤率を持つように臨界値を調整することや、プライバシー保護メカニズムによってもたらされる不確実性を定量化するなど、実際的な問題に焦点を当てています。

Data Summarization via Bilevel Optimization
バイレベル最適化によるデータ要約

The increasing availability of massive data sets poses various challenges for machine learning. Prominent among these is learning models under hardware or human resource constraints. In such resource-constrained settings, a simple yet powerful approach is operating on small subsets of the data. Coresets are weighted subsets of the data that provide approximation guarantees for the optimization objective. However, existing coreset constructions are highly model-specific and are limited to simple models such as linear regression, logistic regression, and k-means. In this work, we propose a generic coreset construction framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem. In contrast to existing approaches, our framework does not require model-specific adaptations and applies to any twice differentiable model, including neural networks. We show the effectiveness of our framework for a wide range of models in various settings, including training non-convex models online and batch active learning.

膨大なデータセットが利用可能になることで、機械学習にはさまざまな課題が生じています。これらの課題の中で顕著なのは、ハードウェアや人的資源の制約下でのモデルの学習です。このようなリソースが制約された設定では、データの小さなサブセットを操作するというシンプルでありながら強力なアプローチがあります。コアセットは、最適化目的関数に対する近似保証を与えるデータの重み付きサブセットです。ただし、既存のコアセット構築はモデル固有性が高く、線形回帰、ロジスティック回帰、k-meansなどの単純なモデルに限定されています。この研究では、コアセット選択をカーディナリティ制約付きバイレベル最適化問題として定式化する、汎用的なコアセット構築フレームワークを提案します。既存のアプローチとは対照的に、私たちのフレームワークはモデル固有の適応を必要とせず、ニューラルネットワークを含む任意の2回微分可能なモデルに適用されます。オンラインでの非凸モデルのトレーニングやバッチアクティブラーニングなど、さまざまな設定での幅広いモデルに対するフレームワークの有効性を示します。
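
As a rough sketch of the formulation the abstract describes, the cardinality-constrained bilevel problem can be written as below. The notation is assumed for illustration and is not taken from the paper: $n$ data points with per-example losses $\ell_i(\theta)$, nonnegative weights $w$, and a coreset budget $m$. The inner problem trains a model on the weighted subset; the outer problem picks weights so that this model does well on the full data.

```latex
\min_{w \in \mathbb{R}^{n}_{\ge 0},\ \|w\|_{0} \le m}
  \; \sum_{i=1}^{n} \ell_i\bigl(\theta^{*}(w)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \arg\min_{\theta} \sum_{i=1}^{n} w_i \, \ell_i(\theta)
```

Here $\|w\|_0$ counts the nonzero weights, so at most $m$ points are selected.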

Pareto Smoothed Importance Sampling
パレート平滑化重要度サンプリング

Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be highly variable when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates, and convergence diagnostics. The presented Pareto $\hat{k}$ finite sample convergence rate diagnostic is useful for any Monte Carlo estimator.

重要度重み付けは、誤った分布から得られたサンプルを補正するためにモンテカルロ積分を調整する一般的な方法ですが、重要度比の右裾が重い場合、結果の推定値は非常に変動する可能性があります。これは、近似分布によって十分に捕捉されないターゲット分布の側面がある場合に日常的に発生し、その場合は、極端な重要度比を修正することでより安定した推定値を取得できます。シミュレートされた重要度比の分布の上裾に一般化パレート分布を当てはめて、重要度の重みを安定させる新しい方法を提示します。この方法は、重要度サンプリング推定値を安定させるための既存の方法よりも経験的に優れたパフォーマンスを発揮し、安定化された有効サンプルサイズの推定値、モンテカルロ誤差推定、および収束診断が含まれます。提示されたパレート$\hat{k}$有限サンプル収束率診断は、任意のモンテカルロ推定量に役立ちます。
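
To make the heavy-right-tail problem concrete, here is a minimal Python sketch. It is not the paper's method: instead of fitting a generalized Pareto distribution, it uses the simpler Hill estimator as a stand-in tail-index diagnostic. The target N(0, 1), the proposal N(0, 0.7^2), the function names, and the tail size of about 3√n are all assumptions made for this illustration.

```python
import math
import random


def importance_ratios(n, sigma=0.7, seed=0):
    """Draw from the proposal N(0, sigma^2) and weight toward the target N(0, 1)."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, sigma) for _ in range(n)]
    # ratio = target density / proposal density (normalizing constants cancel to sigma)
    ratios = [sigma * math.exp(0.5 * x * x * (1.0 / sigma**2 - 1.0)) for x in draws]
    return draws, ratios


def effective_sample_size(ratios):
    """Classical ESS estimate: (sum r)^2 / sum r^2, between 1 and len(ratios)."""
    total = sum(ratios)
    return total * total / sum(r * r for r in ratios)


def hill_tail_index(ratios, m):
    """Hill estimate of the tail index from the m largest ratios (stand-in for a Pareto fit)."""
    ordered = sorted(ratios)
    threshold = ordered[-m - 1]
    tail = ordered[-m:]
    return sum(math.log(r / threshold) for r in tail) / m


n = 5000
draws, ratios = importance_ratios(n)
ess = effective_sample_size(ratios)
k_hat = hill_tail_index(ratios, m=int(3 * math.sqrt(n)))
print(f"ESS = {ess:.1f} of {n}, Hill tail index = {k_hat:.2f}")
```

Because the proposal is narrower than the target, a few draws far from zero receive very large ratios, which pulls the effective sample size well below n. A larger tail index signals a heavier tail and a less reliable estimate.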

Policy Gradient Methods in the Presence of Symmetries and State Abstractions
対称性と状態の抽象化の存在下での方策勾配法

Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method’s ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction.

高次元で複雑な問題に対する強化学習(RL)では、効率と一般化を向上させるために抽象化が利用されています。この論文では、連続制御設定における抽象化について研究し、マルコフ決定過程(MDP)準同型の定義を連続状態およびアクション空間の設定に拡張します。確率的および決定論的ポリシーの両方について、抽象MDPに関するポリシー勾配定理を導出します。ポリシー勾配の結果により、環境のおおよその対称性を利用してポリシーを最適化できます。これらの定理に基づいて、緩いバイシミュレーションメトリックを使用して、ポリシーとMDP準同型マップを同時に学習できるアクタークリティックアルゴリズムのファミリを提案します。最後に、連続対称性を持つ一連の環境を紹介し、そのような対称性がある場合のアクション抽象化に対するアルゴリズムの能力をさらに実証します。環境だけでなく、DeepMind Control Suiteの困難な視覚制御タスクでも、この方法の有効性を示します。私たちの方法は、表現学習にMDP準同型を利用する能力があり、パフォーマンスが向上し、潜在空間の視覚化により、学習した抽象化の構造が明確に示されます。

Scaling Instruction-Finetuned Language Models
命令微調整された言語モデルのスケーリング

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks (at time of release), such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

命令として表現されたデータセットのコレクションで言語モデルを微調整すると、モデルのパフォーマンスと未知のタスクへの一般化が向上することが示されています。この論文では、(1)タスク数のスケーリング、(2)モデルサイズのスケーリング、(3)思考連鎖データでの微調整に特に焦点を当てて、命令微調整を検討します。上記の側面を取り入れた命令微調整により、さまざまなモデルクラス(PaLM、T5、U-PaLM)、プロンプト設定(ゼロショット、少数ショット、CoT)、評価ベンチマーク(MMLU、BBH、TyDiQA、MGSM、オープンエンド生成、RealToxicityPrompts)のパフォーマンスが劇的に向上することがわかりました。たとえば、1.8Kタスクで命令微調整したFlan-PaLM 540Bは、PaLM 540Bを大幅に上回ります(平均で+9.4%)。Flan-PaLM 540Bは、5ショットMMLUで75.2%など、いくつかのベンチマークで最先端のパフォーマンスを達成しています(リリース時)。また、PaLM 62Bなどのはるかに大規模なモデルと比較しても強力な少数ショットのパフォーマンスを実現するFlan-T5チェックポイントも公開しています。全体として、命令微調整は、事前トレーニング済み言語モデルのパフォーマンスと使いやすさを向上させる一般的な方法です。

Tangential Wasserstein Projections
タンジェンシャルワッサースタイン投影法

We develop a notion of projections between sets of probability measures using the geometric properties of the $2$-Wasserstein space. In contrast to existing methods, it is designed for multivariate probability measures that need not be regular, and is computationally efficient to implement via regression. The idea is to work on tangent cones of the Wasserstein space using generalized geodesics. Its structure and computational properties make the method applicable in a variety of settings where probability measures need not be regular, from causal inference to the analysis of object data. An application to estimating causal effects yields a generalization of the synthetic controls method for systems with general heterogeneity described via multivariate probability measures.

私たちは、$2$-Wasserstein空間の幾何学的特性を使用して、確率測度の集合間の射影の概念を開発します。既存の方法とは対照的に、正則である必要のない多変量確率測度用に設計されており、回帰を介して計算効率よく実装できます。このアイデアは、一般化測地線を使用してWasserstein空間の接錐上で作業することです。その構造と計算特性により、この手法は、因果推論からオブジェクトデータの分析まで、確率測度が正則である必要のないさまざまな設定に適用できます。因果効果の推定への応用により、多変量確率測度によって記述される一般的な不均一性を持つシステムに対する合成コントロール法の一般化が得られます。

Learnability of Linear Port-Hamiltonian Systems
線形ポート・ハミルトニアンシステムの学習可能性

A complete structure-preserving learning scheme for single-input/single-output (SISO) linear port-Hamiltonian systems is proposed. The construction is based on the solution, when possible, of the unique identification problem for these systems, in ways that reveal fundamental relationships between classical notions in control theory and crucial properties in the machine learning context, like structure-preservation and expressive power. In the canonical case, it is shown that, up to initializations, the set of uniquely identified systems can be explicitly characterized as a smooth manifold endowed with global Euclidean coordinates, which allows concluding that the parameter complexity necessary for the replication of the dynamics is only $\mathcal{O}(n)$ and not $\mathcal{O}(n^2)$, as suggested by the standard parametrization of these systems. Furthermore, it is shown that linear port-Hamiltonian systems can be learned while remaining agnostic about the dimension of the underlying data-generating system. Numerical experiments show that this methodology can be used to efficiently estimate linear port-Hamiltonian systems out of input-output realizations, making the contributions in this paper the first example of a structure-preserving machine learning paradigm for linear port-Hamiltonian systems based on explicit representations of this model category.

単一入力/単一出力(SISO)線形ポートハミルトンシステムのための完全な構造保存学習スキームが提案されています。この構築は、可能な場合は、これらのシステムの一意の識別問題の解決に基づいており、制御理論の古典的な概念と、構造保存や表現力などの機械学習コンテキストの重要な特性との間の基本的な関係を明らかにする方法で行われます。標準的なケースでは、初期化の違いを除いて、一意に識別されたシステムのセットは、グローバルユークリッド座標を備えた滑らかな多様体として明示的に特徴付けることができることが示されています。これにより、ダイナミクスの複製に必要なパラメーターの複雑さは、これらのシステムの標準的なパラメーター化が示唆する$\mathcal{O}(n^2)$ではなく、$\mathcal{O}(n)$にすぎないと結論付けることができます。さらに、線形ポートハミルトンシステムは、基礎となるデータ生成システムの次元について不可知のままで学習できることが示されています。数値実験により、この方法論は入出力実現から線形ポートハミルトンシステムを効率的に推定するために使用できることが示されており、本論文の貢献は、このモデルカテゴリの明示的な表現に基づく線形ポートハミルトンシステムに対する構造保存機械学習パラダイムの最初の例となっています。

Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning
マルチエージェント強化学習におけるオフポリシー行動予測

Learning anticipation in Multi-Agent Reinforcement Learning (MARL) is a reasoning paradigm where agents anticipate the learning steps of other agents to improve cooperation among themselves. As MARL uses gradient-based optimization, learning anticipation requires using Higher-Order Gradients (HOG), with so-called HOG methods. Existing HOG methods are based on policy parameter anticipation, i.e., agents anticipate the changes in policy parameters of other agents. Currently, however, these existing HOG methods have only been developed for differentiable games or games with small state spaces. In this work, we demonstrate that in the case of non-differentiable games with large state spaces, existing HOG methods do not perform well and are inefficient due to their inherent limitations related to policy parameter anticipation and multiple sampling stages. To overcome these problems, we propose Off-Policy Action Anticipation (OffPA2), a novel framework that approaches learning anticipation through action anticipation, i.e., agents anticipate the changes in actions of other agents, via off-policy sampling. We theoretically analyze our proposed OffPA2 and employ it to develop multiple HOG methods that are applicable to non-differentiable games with large state spaces. We conduct a large set of experiments and illustrate that our proposed HOG methods outperform the existing ones regarding efficiency and performance.

マルチエージェント強化学習(MARL)における学習予測は、エージェントが他のエージェントの学習ステップを予測して、エージェント間の協力を改善する推論パラダイムです。MARLは勾配ベースの最適化を使用するため、学習予測には、いわゆるHOG法による高次勾配(HOG)を使用する必要があります。既存のHOG法は、ポリシーパラメータ予測に基づいています。つまり、エージェントは他のエージェントのポリシーパラメータの変更を予測します。ただし、現在、これらの既存のHOG法は、微分可能なゲームまたは小さな状態空間を持つゲームに対してのみ開発されています。この研究では、大きな状態空間を持つ微分不可能なゲームの場合、ポリシーパラメータ予測と複数のサンプリングステージに関連する固有の制限により、既存のHOG法のパフォーマンスが悪く、非効率的であることを示します。これらの問題を克服するために、我々はオフポリシーアクション予測(OffPA2)を提案します。これは、アクション予測を通じて予測を学習する新しいフレームワークです。つまり、エージェントはオフポリシーサンプリングを介して他のエージェントのアクションの変化を予測します。我々は提案したOffPA2を理論的に分析し、それを使用して、大きな状態空間を持つ微分不可能なゲームに適用できる複数のHOG手法を開発します。我々は大規模な一連の実験を行い、提案したHOG手法が効率とパフォーマンスに関して既存の手法よりも優れていることを示しています。

On Unbiased Estimation for Partially Observed Diffusions
部分観測拡散の偏りなし推定について

We consider a class of diffusion processes with finite-dimensional parameters and partially observed at discrete time instances. We propose a methodology to unbiasedly estimate the expectation of a given functional of the diffusion process conditional on parameters and data. When these unbiased estimators with appropriately chosen functionals are employed within an expectation-maximization algorithm or a stochastic gradient method, this enables statistical inference using the maximum likelihood or Bayesian framework. Compared to existing approaches, the use of our unbiased estimators allows one to remove any time-discretization bias and Markov chain Monte Carlo burn-in bias. Central to our methodology is a novel and natural combination of multilevel randomization schemes and unbiased Markov chain Monte Carlo methods, and the development of new couplings of multiple conditional particle filters. We establish under assumptions that our estimators are unbiased and have finite variance. We illustrate various aspects of our method on an Ornstein–Uhlenbeck model, a logistic diffusion model for population dynamics, and a neural network model for grid cells.

私たちは、有限次元パラメータを持ち、離散時点で部分的に観測される拡散過程のクラスを考察します。私たちは、パラメータとデータに条件付けされた拡散過程の与えられた汎関数の期待値を、偏りなく推定する方法を提案します。適切に選択された汎関数を持つこれらの不偏推定量を期待値最大化アルゴリズムまたは確率的勾配法内で使用すると、最尤法またはベイズフレームワークを使用した統計的推論が可能になります。既存のアプローチと比較して、私たちの不偏推定量を使用すると、時間離散化バイアスとマルコフ連鎖モンテカルロのバーンインバイアスを排除することができます。私たちの方法論の中心となるのは、マルチレベルランダム化スキームと不偏マルコフ連鎖モンテカルロ法の新しく自然な組み合わせと、複数の条件付き粒子フィルタの新しいカップリングの開発です。私たちは、いくつかの仮定の下で、推定量が不偏であり有限分散を持つことを示します。私たちは、オルンシュタイン-ウーレンベックモデル、個体群動態のロジスティック拡散モデル、およびグリッドセルのニューラルネットワークモデルで、この方法のさまざまな側面を示します。

Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions
活性化関数の学習によるリプシッツ制約付きニューラルネットワークの改善

Lipschitz-constrained neural networks have several advantages over unconstrained ones and can be applied to a variety of problems, making them a topic of attention in the deep learning community. Unfortunately, it has been shown both theoretically and empirically that they perform poorly when equipped with ReLU activation functions. By contrast, neural networks with learnable 1-Lipschitz linear splines are known to be more expressive. In this paper, we show that such networks correspond to global optima of a constrained functional optimization problem that consists of the training of a neural network composed of 1-Lipschitz linear layers and 1-Lipschitz freeform activation functions with second-order total-variation regularization. Further, we propose an efficient method to train these neural networks. Our numerical experiments show that our trained networks compare favorably with existing 1-Lipschitz neural architectures.

リプシッツ制約付きニューラルネットワークは、制約なしニューラルネットワークに比べていくつかの利点があり、さまざまな問題に適用できるため、ディープラーニングコミュニティで注目されています。残念ながら、ReLU活性化関数を用いるとパフォーマンスが低下することが、理論的にも経験的にも示されています。対照的に、学習可能な1-Lipschitz線形スプラインを持つニューラルネットワークは、より表現力豊かであることが知られています。この論文では、このようなネットワークが、1-Lipschitz線形層と1-Lipschitzの自由形式の活性化関数で構成されるニューラルネットワークを2次全変動正則化付きで学習するという、制約付き関数最適化問題の大域的最適解に対応することを示します。さらに、これらのニューラルネットワークを効率的に訓練する方法を提案します。私たちの数値実験は、訓練されたネットワークが既存の1-Lipschitzニューラルアーキテクチャと比べて遜色がないことを示しています。
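
To illustrate what a 1-Lipschitz linear spline activation looks like, the following Python sketch builds a piecewise-linear function from knots and values and forces every segment slope into [-1, 1]. The slope clipping used here is a simple stand-in chosen for illustration, not the training procedure proposed in the paper, and the function names and example knots are hypothetical.

```python
from bisect import bisect_right


def clip_slopes(knots, values):
    """Rebuild the spline values so that every segment slope lies in [-1, 1]."""
    clipped = [values[0]]
    for i in range(1, len(knots)):
        step = knots[i] - knots[i - 1]
        slope = (values[i] - values[i - 1]) / step
        slope = max(-1.0, min(1.0, slope))
        clipped.append(clipped[-1] + slope * step)
    return clipped


def linear_spline(x, knots, values):
    """Evaluate the spline at x, extending the boundary segments linearly."""
    i = bisect_right(knots, x) - 1
    i = max(0, min(len(knots) - 2, i))
    slope = (values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
    return values[i] + slope * (x - knots[i])


knots = [-2.0, -1.0, 0.0, 1.0, 2.0]
raw_values = [3.0, 0.0, 0.0, 0.5, 4.0]   # raw slopes: -3, 0, 0.5, 3.5
values = clip_slopes(knots, raw_values)  # clipped slopes: -1, 0, 0.5, 1
print(values)
```

Since each segment has slope of magnitude at most 1, the resulting activation changes its output by no more than the change in its input, which is the 1-Lipschitz property the abstract refers to.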

Mathematical Framework for Online Social Media Auditing
オンラインソーシャルメディア監査のための数学的フレームワーク

Social media platforms (SMPs) leverage algorithmic filtering (AF) as a means of selecting the content that constitutes a user’s feed with the aim of maximizing their rewards. Selectively choosing the contents to be shown on the user’s feed may yield a certain extent of influence, either minor or major, on the user’s decision-making, compared to what it would have been under a natural/fair content selection. As we have witnessed over the past decade, algorithmic filtering can cause detrimental side effects, ranging from biasing individual decisions to shaping those of society as a whole, for example, diverting users’ attention from whether to get the COVID-19 vaccine or inducing the public to choose a presidential candidate. The government’s constant attempts to regulate the adverse effects of AF are often complicated, due to bureaucracy, legal affairs, and financial considerations. On the other hand SMPs seek to monitor their own algorithmic activities to avoid being fined for exceeding the allowable threshold. In this paper, we mathematically formalize this framework and utilize it to construct a data-driven statistical auditing procedure to regulate AF from deflecting users’ beliefs over time, along with sample complexity guarantees. This state-of-the-art algorithm can be used either by authorities acting as external regulators or by SMPs for self-auditing.

ソーシャルメディアプラットフォーム(SMP)は、ユーザーのフィードを構成するコンテンツを選択する手段としてアルゴリズムフィルタリング(AF)を活用し、その報酬を最大化することを目指しています。ユーザーのフィードに表示されるコンテンツを厳選すると、自然で公正なコンテンツ選択の場合と比較して、ユーザーの意思決定に多少なりとも大きな影響を与える可能性があります。過去10年間に目撃したように、アルゴリズムフィルタリングは、個人の意思決定に偏りを与えることから社会全体の意思決定を形作るまで、有害な副作用を引き起こす可能性があります。たとえば、COVID-19ワクチンを接種するかどうかからユーザーの注意をそらしたり、大統領候補を選ぶよう国民を誘導したりします。AFの悪影響を規制しようとする政府の絶え間ない試みは、官僚主義、法務、財政上の考慮により、複雑になることがよくあります。一方、SMPは、許容しきい値を超えた場合に罰金を科せられないように、独自のアルゴリズム活動を監視しようとします。この論文では、このフレームワークを数学的に形式化し、それを利用して、サンプルの複雑性保証とともに、AFが時間の経過とともにユーザーの信念を歪めないように規制するためのデータ駆動型の統計監査手順を構築します。この最先端のアルゴリズムは、外部規制機関として機能する当局またはSMPが自己監査に使用できます。

An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates
一致性を持つ多面体サロゲートの設計と解析のための埋め込みフレームワーク

We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for discrete problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g. rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and “convexifies” the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses: every discrete loss is embedded by some polyhedral loss, and every polyhedral loss embeds some discrete loss. Moreover, an embedding gives rise to a consistent link function as well as linear surrogate regret bounds. Our results are constructive, as we illustrate with several examples. In particular, our framework gives succinct proofs of consistency or inconsistency for existing polyhedral surrogates, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We go on to show additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redundancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.

私たちは、分類、ランキング、構造化予測などの離散問題に対して、埋め込みを介して凸サロゲート損失関数を設計する自然なアプローチを形式化し、研究します。このアプローチでは、有限個の予測（ランキングなど）のそれぞれを$\mathbb{R}^d$内の点として埋め込み、元の損失値をこれらの点に割り当て、何らかの方法で損失を「凸化」してサロゲートを得ます。私たちは、このアプローチと多面体（区分線形凸）サロゲート損失の間に強いつながりを確立します。すなわち、すべての離散損失は何らかの多面体損失によって埋め込まれ、すべての多面体損失は何らかの離散損失を埋め込みます。さらに、埋め込みにより、一致性を持つリンク関数と線形のサロゲートリグレット境界が得られます。いくつかの例で示すように、私たちの結果は構成的です。特に、私たちのフレームワークは、既存の多面体サロゲートの一致性または非一致性の簡潔な証明を提供し、一致性を持たないサロゲートについては、これらのサロゲートが一致性を持つ離散損失をさらに明らかにします。さらに、埋め込みとベイズリスクの一致との同値性、非冗長性のさまざまな概念の同値性など、埋め込みの追加構造を示します。これらの結果を使用して、一致性の必要条件である間接的な誘導が、多面体サロゲートを使用する場合には十分条件でもあることを確立します。
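
A familiar instance of the embedding idea is the hinge loss for binary classification. The Python sketch below is only an illustration with hypothetical names: it embeds the two predictions -1 and +1 as the points -1.0 and +1.0 on the real line, checks that the hinge loss at those points equals twice the 0-1 loss, and checks on a grid that the expected hinge loss is minimized at an embedded point. It does not verify the paper's full definition of an embedding.

```python
def zero_one_loss(pred, y):
    """Discrete loss: 1 if the prediction differs from the label, else 0."""
    return 0.0 if pred == y else 1.0


def hinge_loss(u, y):
    """Polyhedral (piecewise-linear convex) surrogate on the real line."""
    return max(0.0, 1.0 - u * y)


def link(u):
    """Map a surrogate prediction back to a discrete prediction."""
    return 1 if u >= 0 else -1


EMBED = {+1: 1.0, -1: -1.0}  # embedding of the discrete predictions


def best_surrogate_prediction(p):
    """Grid-search minimizer of the expected hinge loss when P(y = +1) = p."""
    grid = [i / 100 for i in range(-200, 201)]
    return min(grid, key=lambda u: p * hinge_loss(u, 1) + (1 - p) * hinge_loss(u, -1))


for pred in (+1, -1):
    for y in (+1, -1):
        # At embedded points the surrogate reproduces the discrete loss up to a factor 2.
        assert hinge_loss(EMBED[pred], y) == 2 * zero_one_loss(pred, y)

print(best_surrogate_prediction(0.8), best_surrogate_prediction(0.2))
```

Applying `link` to the surrogate minimizer recovers the prediction that minimizes the expected 0-1 loss, which is the kind of consistency the abstract is concerned with.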

Low-rank Variational Bayes correction to the Laplace method
ラプラス法に対する低ランク変分ベイズ補正

Approximate inference methods like the Laplace method, Laplace approximations and variational methods, amongst others, are popular methods when exact inference is not feasible due to the complexity of the model or the abundance of data. In this paper we propose a hybrid approximate method called Low-Rank Variational Bayes correction (VBC), that uses the Laplace method and subsequently a Variational Bayes correction in a lower dimension, to the joint posterior mean. The cost is essentially that of the Laplace method which ensures scalability of the method, in both model complexity and data size. Models with fixed and unknown hyperparameters are considered, for simulated and real examples, for small and large data sets.

ラプラス法、ラプラス近似法、変分法などの近似推論法は、モデルの複雑さやデータの豊富さのために正確な推論が不可能な場合によく使用される方法です。この論文では、ラプラス法を使用し、その後、より低い次元で同時事後平均に対する変分ベイズ補正を行う、低ランク変分ベイズ補正(VBC)と呼ばれるハイブリッド近似法を提案します。コストは基本的にラプラス法のコストと同等であり、モデルの複雑さとデータサイズの両方に対して手法のスケーラビリティが確保されます。固定および未知のハイパーパラメーターを持つモデルを、シミュレーション例と実例、小規模および大規模なデータセットについて検討します。

Scaling the Convex Barrier with Sparse Dual Algorithms
スパース双対アルゴリズムによる凸障壁のスケーリング

Tight and efficient neural network bounding is crucial to the scaling of neural network verification systems. Many efficient bounding algorithms have been presented recently, but they are often too loose to verify more challenging properties. This is due to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise-linear activations exists, it comes at the cost of exponentially many constraints and currently lacks an efficient customized solver. We alleviate this deficiency by presenting two novel dual algorithms: one operates a subgradient method on a small active set of dual variables, the other exploits the sparsity of Frank-Wolfe type optimizers to incur only a linear memory cost. Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle. At the same time, they share the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we can obtain better bounds than off-the-shelf solvers in only a fraction of their running time, attaining significant formal verification speed-ups.

ニューラルネットワーク検証システムのスケーリングには、厳密で効率的なニューラルネットワーク境界が不可欠です。最近、多くの効率的な境界アルゴリズムが発表されていますが、より困難な特性を検証するには緩すぎることがよくあります。これは、通常、ニューロンの数に比例するサイズの線形プログラムである、採用されている緩和の弱点によるものです。区分線形活性化のためのより厳密な線形緩和は存在しますが、指数関数的に多くの制約を犠牲にし、現在、効率的なカスタマイズされたソルバーがありません。私たちは、2つの新しいデュアルアルゴリズムを提示することでこの欠点を軽減します。1つは、デュアル変数の小さなアクティブセットに対してサブグラディエント法を実行し、もう1つはFrank-Wolfe型最適化のスパース性を利用して線形メモリコストのみを発生させます。どちらの方法も、新しい緩和の長所である厳密さと線形分離オラクルを取り戻します。同時に、より弱い緩和に対する以前のデュアルアプローチの利点(大規模な並列処理、GPU実装、反復あたりの低コスト、いつでも有効な境界)を共有します。その結果、既製のソルバーよりも優れた境界を、実行時間のほんの一部で取得でき、形式検証の大幅な高速化を実現できます。

Causal-learn: Causal Discovery in Python
Causal-learn:Pythonによる因果探索

Causal discovery aims at revealing causal relations from observational data, which is a fundamental task in science and engineering. We describe causal-learn, an open-source Python library for causal discovery. This library focuses on bringing a comprehensive collection of causal discovery methods to both practitioners and researchers. It provides easy-to-use APIs for non-specialists, modular building blocks for developers, detailed documentation for learners, and comprehensive methods for all. Different from previous packages in R or Java, causal-learn is fully developed in Python, which could be more in tune with the recent preference shift in programming languages within related communities. The library is available at https://github.com/py-why/causal-learn.

因果関係の発見は、科学や工学における基本的な課題である観測データから因果関係を明らかにすることを目的としています。因果関係の発見のためのオープンソースのPythonライブラリであるcausal-learnについて説明します。このライブラリは、因果関係の発見方法の包括的なコレクションを実務家と研究者の両方に提供することに焦点を当てています。非専門家向けの使いやすいAPI、開発者向けのモジュール式ビルディングブロック、学習者向けの詳細なドキュメント、すべての人向けの包括的なメソッドを提供します。RやJavaの以前のパッケージとは異なり、causal-learnは完全にPythonで開発されており、関連するコミュニティ内のプログラミング言語の最近の好みの変化により調和している可能性があります。ライブラリはhttps://github.com/py-why/causal-learnで入手できます。

Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics
神経ダイナミクスの潜在成分を学習するための分解線形力学系(dLDS)

Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how observed neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time series data as a sparse combination of simpler, more interpretable components. Our model is trained through a dictionary learning procedure, where we leverage recent results in tracking sparse vectors over time. The decomposed nature of the dynamics is more expressive than previous switched approaches for a given number of parameters and enables modeling of overlapping and non-stationary dynamics. In both continuous-time and discrete-time instructional examples, we demonstrate that our model effectively approximates the original system, learns efficient representations, and captures smooth transitions between dynamical modes. Furthermore, we highlight our model’s ability to efficiently capture and demix population dynamics generated from multiple independent subnetworks, a task that is computationally impractical for switched models. Finally, we apply our model to neural “full brain” recordings of C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states.

集団レベルで神経ダイナミクスの解釈可能な表現を学習することは、観察された神経活動が知覚や行動にどのように関係するかを理解するための重要な第一歩です。神経ダイナミクスのモデルは、多くの場合、神経活動の低次元投影、または時間の経過に伴う神経状態に明示的に関連する動的システムの学習に焦点を当てています。動的システムを低次元多様体上のフローの代表として考えることで、これら2つのアプローチがどのように相互に関連しているかについて議論します。この概念に基づいて、時系列データの複雑な非定常および非線形ダイナミクスを、より単純で解釈しやすいコンポーネントのスパースな組み合わせとして表す、新しい分解された動的システムモデルを提案します。このモデルは、辞書学習手順を通じてトレーニングされ、時間の経過に伴うスパースベクトルの追跡に関する最近の結果を活用します。ダイナミクスの分解された性質は、特定の数のパラメーターに対する以前の切り替えアプローチよりも表現力があり、重複する非定常ダイナミクスのモデル化を可能にします。連続時間と離散時間の両方の教育例において、私たちのモデルが効果的に元のシステムを近似し、効率的な表現を学習し、動的モード間のスムーズな遷移を捉えることを実証します。さらに、複数の独立したサブネットワークから生成された個体群動態を効率的に捉えて分離するモデルの能力を強調します。これは、切り替えモデルでは計算上非現実的なタスクです。最後に、私たちのモデルをC. elegansデータの神経「フルブレイン」記録に適用し、離散状態に分類すると不明瞭になるダイナミクスの多様性を示します。
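
One way to read "a sparse combination of simpler, more interpretable components" is the following sketch. The notation is assumed for illustration and may differ from the paper: $x_t$ is the latent state, $A_1, \dots, A_K$ are a dictionary of linear dynamics operators, and $c_t$ is a sparse coefficient vector that may change over time.

```latex
x_{t+1} = \Bigl( \sum_{k=1}^{K} c_{k,t} \, A_k \Bigr) x_t ,
\qquad c_t = (c_{1,t}, \dots, c_{K,t}) \ \text{sparse}
```

In a switched model exactly one coefficient would be active at each time step. Allowing several nonzero coefficients at once is what lets overlapping dynamics be represented.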

Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification
二項分類における敵対的代理リスクの存在と最小最大定理

We prove existence, minimax, and complementary slackness theorems for adversarial surrogate risks in binary classification. These results extend recent work that established analogous minimax and existence theorems for the adversarial classification risk. We show that such statements continue to hold for a very general class of surrogate losses; moreover, we remove some of the technical restrictions present in prior work. Our results provide an explanation for the phenomenon of transfer attacks and inform new directions in algorithm development.

私たちは、二項分類における敵対的代理リスクの存在、最小最大、および相補的スラックネス定理を証明します。これらの結果は、敵対的分類リスクの類似のミニマックス定理と存在定理を確立した最近の研究を拡張するものです。私たちは、そのようなステートメントが代理損失の非常に一般的なクラスに対して引き続き保持されることを示します。さらに、以前の作業に存在していた技術的な制限の一部を取り除きます。私たちの結果は、転送攻撃の現象を説明し、アルゴリズム開発の新たな方向性を示唆しています。

Data Thinning for Convolution-Closed Distributions
畳み込み閉分布のデータ間引き

We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.

私たちは、データ間引きを提案します。これは、観測値を、合計すると元の観測値になり、パラメータの（既知の）スケーリングを除いて元の観測値と同じ分布に従う2つ以上の独立した部分に分割する手法です。この非常に一般的な提案は、ガウス分布、ポアソン分布、負の二項分布、ガンマ分布、二項分布などを含む任意の畳み込み閉分布に適用できます。データ間引きは、モデルの選択、評価、推論に多くの用途があります。たとえば、データ間引きによるクロスバリデーションは、特にサンプル分割が適用できない設定において、通常のサンプル分割によるクロスバリデーション手法の魅力的な代替手段となります。シミュレーションと単一細胞RNAシーケンスデータへの応用において、従来のサンプル分割が魅力的でない、または利用できないk平均法クラスタリングや主成分分析などの教師なし学習手法の結果を、データ間引きを使用して検証できることを示します。
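
The splitting described in this abstract can be sketched for the Poisson case, where the construction is a standard fact: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 and X2 = X − X1 are independent Poisson variables with means ελ and (1 − ε)λ. The function name `thin_poisson` and the values of `lam` and `eps` below are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum to x.

    Assumes x ~ Poisson(lam). Drawing x1 | x ~ Binomial(x, eps) and setting
    x2 = x - x1 gives independent x1 ~ Poisson(eps * lam) and
    x2 ~ Poisson((1 - eps) * lam).
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)
    return x1, x - x1


rng = np.random.default_rng(0)
lam, eps = 10.0, 0.5  # illustrative values only
x = rng.poisson(lam, size=100_000)
x_train, x_test = thin_poisson(x, eps, rng)

# The two folds sum to the original data, each has mean close to eps * lam,
# and their empirical correlation is close to zero.
print(x_train.mean(), x_test.mean())
print(np.corrcoef(x_train, x_test)[0, 1])
```

In a cross-validation setting, one would fit a model on `x_train` and evaluate it on `x_test`, after rescaling the fitted parameter by the known factor.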

A projected semismooth Newton method for a class of nonconvex composite programs with strong prox-regularity
強い近接規則性を持つ非凸複合プログラムのクラスに対する射影セミスムースニュートン法

This paper aims to develop a Newton-type method to solve a class of nonconvex composite programs. In particular, the nonsmooth part is possibly nonconvex. To tackle the nonconvexity, we develop a notion of strong prox-regularity which is related to the singleton property and Lipschitz continuity of the associated proximal operator, and we verify it in various classes of functions, including weakly convex functions, indicator functions of proximally smooth sets, and two specific sphere-related nonconvex nonsmooth functions. In this case, the problem class we are concerned with covers smooth optimization problems on manifold and certain composite optimization problems on manifold. For the latter, the proposed algorithm is the first second-order type method. Combining with the semismoothness of the proximal operator, we design a projected semismooth Newton method to find a root of the natural residual induced by the proximal gradient method. Due to the possible nonconvexity of the feasible domain, an extra projection is added to the usual semismooth Newton step and new criteria are proposed for the switching between the projected semismooth Newton step and the proximal step. The global convergence is then established under the strong prox-regularity. Based on the BD regularity condition, we establish local superlinear convergence. Numerical experiments demonstrate the effectiveness of our proposed method compared with state-of-the-art ones.

この論文では、非凸複合計画のクラスを解くためのニュートン型法の開発を目的としています。特に、非平滑部分は非凸である可能性があります。非凸性に対処するために、関連する近似演算子のシングルトン特性とリプシッツ連続性に関連する強い近似正則性の概念を開発し、弱凸関数、近似的に滑らかな集合の指示関数、および2つの特定の球面関連の非凸非平滑関数を含むさまざまな関数のクラスでそれを検証します。この場合、関心のある問題クラスは、多様体上の滑らかな最適化問題と、多様体上の特定の複合最適化問題をカバーします。後者の場合、提案されたアルゴリズムは最初の2次型法です。近似演算子の半平滑性と組み合わせて、近似勾配法によって誘導される自然残差の根を見つけるための射影半平滑ニュートン法を設計します。実行可能領域が非凸である可能性があるため、通常の半滑らかなニュートンステップに追加の投影が追加され、投影された半滑らかなニュートンステップと近位ステップ間の切り替えに新しい基準が提案されます。その後、強い近似正則性の下でグローバル収束が確立されます。BD正則性条件に基づいて、ローカルな超線形収束を確立します。数値実験により、最先端の方法と比較して、提案された方法の有効性が実証されています。

Revisiting RIP Guarantees for Sketching Operators on Mixture Models
混合モデルにおけるスケッチ演算子に対するRIP保証の再検討

In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixture models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators.

圧縮混合モデリングのスケッチのコンテキストでは、特定の混合モデルに関するスケッチ演算子の制限付きアイソメトリ特性の既存の証明を再検討します。既存の保証の欠点を検討した後、ランダムなフーリエ特徴を描画してランダムなスケッチ演算子を構築する際に重要度サンプリングを想定する必要性を回避する代替分析を提案します。私たちの分析は、スケッチ演算子を定義するために使用される周波数のセットにのみ依存する制限されたアイソメトリ定数の新しい決定論的境界に基づいています。次に、これらの境界を活用して、目的のRIP保証につながるランダムスケッチ演算子の濃度不等式を確立します。また、私たちの分析は、高速ランダム線形演算子に関連付けられた周波数を持つ構造化スケッチの理論的保証への扉を開きます。

Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization
正則化リスク最小化のための分布シフト下での単調リスク関係

Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship or more generally a monotonic one holds, it has important consequences. For example, it allows one to optimize performance on one distribution as a proxy for performance on the other. In this paper, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error for ridge-regularized general linear models under covariate shift, as well as an approximate linear relation for linear inverse problems.

機械学習システムは、多くの場合、トレーニング分布とは異なる分布から引き出されたデータに適用されます。最近の研究では、さまざまな分類問題や信号再構成問題について、分布外の性能が分布内の性能と強く線形相関していることが示されています。この関係、またはより一般的には単調な関係が成り立つ場合、それは重要な結果をもたらします。たとえば、一方の分布でのパフォーマンスを、もう一方の分布でのパフォーマンスのプロキシとして最適化できます。この論文では、2つの分布に対するモデルのパフォーマンス間の単調な関係が予想される条件を研究します。共変量シフトの下でのリッジ正則化一般線形モデルについて、二乗誤差の正確な漸近線形関係と誤分類誤差の単調関係、および線形逆問題に対する近似線形関係を証明します。

Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks
Polygonal Unadjusted Langevin Algorithms:ニューラルネットワークのための安定で効率的な適応アルゴリズムの作成

We present a new class of Langevin-based algorithms, which overcomes many of the known shortcomings of popular adaptive optimizers that are currently used for the fine tuning of deep learning models. Its underpinning theory relies on recent advances of Euler-Krylov polygonal approximations for stochastic differential equations (SDEs) with monotone coefficients. As a result, it inherits the stability properties of tamed algorithms, while it addresses other known issues, e.g. vanishing gradients in deep learning. In particular, we provide a nonasymptotic analysis and full theoretical guarantees for the convergence properties of an algorithm of this novel class, which we named TH$\varepsilon$O POULA (or, simply, TheoPouLa). Finally, several experiments are presented with different types of deep learning models, which show the superior performance of TheoPouLa over many popular adaptive optimization algorithms.

私たちは、深層学習モデルの微調整に現在使用されている一般的な適応オプティマイザーの既知の欠点の多くを克服する、新しいクラスのLangevinベースのアルゴリズムを紹介します。その基礎となる理論は、単調係数を持つ確率微分方程式(SDE)のオイラー・クリロフ多角形近似の最近の進歩に依存しています。その結果、tamed(抑制付き)アルゴリズムの安定性特性を継承しながら、ディープラーニングの勾配の消失など、他の既知の問題にも対処します。特に、この新しいクラスに属するアルゴリズムの収束特性に対する非漸近解析と完全な理論的保証を提供します。このアルゴリズムをTH$\varepsilon$O POULA (または単にTheoPouLa)と名付けました。最後に、さまざまなタイプのディープラーニングモデルを使用したいくつかの実験が提示され、多くの一般的な適応最適化アルゴリズムに対するTheoPouLaの優れたパフォーマンスが示されています。

Axiomatic effect propagation in structural causal models
構造因果モデルにおける公理的効果伝播

We study effect propagation in a causal directed acyclic graph (DAG), with the goal of providing a flow-based decomposition of the effect (i.e., change in the outcome variable) as a result of changes in the source variables. We first compare various ideas on causality to quantify effect propagation, such as direct and indirect effects, path-specific effects, and degree of responsibility. We discuss the shortcomings of such approaches and propose a flow-based methodology, which we call recursive Shapley value (RSV). By considering a broader set of counterfactuals than existing methods, RSV obeys a unique adherence to four desirable flow-based axioms. Further, we provide a general path-based characterization of RSV for an arbitrary non-parametric structural equations model (SEM) defined on the underlying DAG. Interestingly, for the special class of linear SEMs, RSV exhibits a simple and tractable characterization (and hence, computation), which recovers the classical method of path coefficients and is equivalent to path-specific effects. For non-parametric SEMs, we use our general characterization to develop an unbiased Monte-Carlo estimation procedure with an exponentially decaying sample complexity. We showcase the application of RSV on two challenging problems on causality (causal overdetermination and causal unfairness).

私たちは、因果的有向非巡回グラフ(DAG)における効果の伝播を研究し、ソース変数の変化の結果としての効果(つまり、結果変数の変化)のフローベースの分解を提供することを目標としています。まず、直接的効果と間接的効果、パス固有の効果、責任の程度など、効果の伝播を定量化するための因果関係に関するさまざまな考え方を比較します。このようなアプローチの欠点について説明し、再帰的シャプレイ値(RSV)と呼ぶフローベースの方法論を提案します。既存の方法よりも広範な反事実的セットを考慮することにより、RSVは4つの望ましいフローベースの公理に独自に準拠します。さらに、基礎となるDAGで定義された任意のノンパラメトリック構造方程式モデル(SEM)のRSVの一般的なパスベースの特性評価を提供します。興味深いことに、特殊なクラスの線形SEMの場合、RSVは単純で扱いやすい特性(したがって計算)を示し、パス係数の古典的な方法を回復し、パス固有の効果と同等です。ノンパラメトリックSEMの場合、一般的な特性を使用して、サンプルの複雑性が指数関数的に減少する偏りのないモンテカルロ推定手順を開発します。因果関係に関する2つの困難な問題(因果過剰決定と因果不公平)に対するRSVの適用を紹介します。

Optimal First-Order Algorithms as a Function of Inequalities
不等式の関数としての最適1次アルゴリズム

In this work, we present a novel algorithm design methodology that finds the optimal algorithm as a function of inequalities. Specifically, we restrict convergence analyses of algorithms to use a prespecified subset of inequalities, rather than utilizing all true inequalities, and find the optimal algorithm subject to this restriction. This methodology allows us to design algorithms with certain desired characteristics. As concrete demonstrations of this methodology, we find new state-of-the-art accelerated first-order gradient methods using randomized coordinate updates and backtracking line searches.

この研究では、不等式の関数として最適なアルゴリズムを見つける新しいアルゴリズム設計方法論を提示します。具体的には、アルゴリズムの収束解析を、すべての真の不等式を利用するのではなく、事前に指定された不等式のサブセットを使用するように制限し、この制限の対象となる最適なアルゴリズムを見つけます。この方法論により、特定の望ましい特性を持つアルゴリズムを設計することができます。この方法論の具体的なデモンストレーションとして、ランダム化された座標更新とバックトラッキングライン検索を使用した新しい最先端の加速1次勾配法を見つけます。

Resource-Efficient Neural Networks for Embedded Systems
組み込みシステム向けのリソース効率の高いニューラルネットワーク

While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on resource-efficient inference based on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark data sets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and prediction quality.

機械学習は伝統的にリソースを大量に消費するタスクですが、組み込みシステム、自律ナビゲーション、モノのインターネットのビジョンにより、リソース効率の高いアプローチへの関心が高まっています。これらのアプローチは、計算とエネルギーの観点から、パフォーマンスとリソース消費の間で慎重に選択されたトレードオフを目指しています。このようなアプローチの開発は、現在の機械学習研究における主要な課題の1つであり、事実上無制限の計算リソースを備えた科学環境から日常のアプリケーションへの機械学習テクノロジーのスムーズな移行を確実にするための鍵となります。この記事では、これらの現実世界の要件を容易にする機械学習技術の現在の最先端の概要を示します。特に、過去10年間の主流の機械学習モデルであるディープニューラルネットワーク(DNN)に基づくリソース効率の高い推論に焦点を当てます。主に、(i)量子化ニューラルネットワーク、(ii)ネットワークプルーニング、(iii)構造効率という3つの相互に排他的ではないカテゴリに分類できる膨大な文献の包括的な概要を示します。これらの手法は、トレーニング中または後処理として適用でき、メモリフットプリント、推論速度、エネルギー効率の観点から計算要件を削減するために広く使用されています。また、DNNの組み込みハードウェアのさまざまな概念と、機械学習手法との互換性、およびエネルギーとレイテンシの削減の可能性についても簡単に説明します。CPU、GPU、FPGAなどのリソースに制約のある組み込みシステムのセットに対して圧縮手法(量子化、プルーニング)を使用したよく知られたベンチマークデータセットの実験により、この議論を実証します。得られた結果は、リソース効率と予測品質の間の適切なトレードオフを見つけることの難しさを浮き彫りにしています。

Trained Transformers Learn Linear Models In-Context
トレーニング済みの Transformer は、コンテキスト内で線形モデルを学習します

Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models’ predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.

トランスフォーマーなどの注意ベースのニューラルネットワークは、コンテキスト内学習(ICL)を示す優れた能力を示しています。つまり、目に見えないタスクからのトークンの短いプロンプトシーケンスが与えられると、パラメーターを更新せずに、トークンごとおよび次のトークンの適切な予測を作成できます。ラベル付きトレーニングデータとラベルなしテストデータのシーケンスをプロンプトとして埋め込むことで、トランスフォーマーは教師あり学習アルゴリズムのように動作できます。実際、最近の研究では、線形回帰問題のランダムなインスタンスでトランスフォーマーアーキテクチャをトレーニングすると、これらのモデルの予測が通常の最小二乗法の予測を模倣することが示されています。この現象の根底にあるメカニズムを理解するために、線形回帰タスクで勾配フローによってトレーニングされた単一の線形自己注意層を持つトランスフォーマーのICLのダイナミクスを調査します。非凸性にもかかわらず、適切なランダム初期化による勾配フローは目的関数のグローバル最小値を見つけることを示します。このグローバル最小値では、新しい予測タスクからのラベル付き例のテストプロンプトが与えられた場合、トランスフォーマーはテストプロンプト分布全体で最良の線形予測子と競合する予測誤差を達成します。さらに、さまざまな分布シフトに対するトレーニング済みトランスフォーマーの堅牢性を評価し、いくつかのシフトは許容されるものの、プロンプトの共変量分布のシフトは許容されないことを示します。これに動機付けられて、共変量分布がプロンプト間で異なる可能性がある一般化ICL設定を検討します。勾配フローはこの設定でグローバル最小値を見つけることに成功しますが、トレーニング済みトランスフォーマーは軽度の共変量シフトに対しては依然として脆弱であることを示します。この発見を、共変量シフトに対してより堅牢であることを示す大規模な非線形トランスフォーマーアーキテクチャでの実験で補完します。

Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees
収束保証付きの非平滑最適化のためのAdamファミリー法

In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel two-timescale framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noises are only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods.

この論文では、特に非平滑ニューラルネットワークのトレーニングにおける非平滑最適化のためのAdamファミリー法の収束特性に関する包括的な研究を紹介します。私たちは、2タイムスケール更新方式を採用した新しい2タイムスケールフレームワークを導入し、その収束特性を穏やかな仮定の下で証明します。私たちが提案するフレームワークは、さまざまな一般的なAdamファミリーの方法を網羅しており、非平滑ニューラルネットワークのトレーニングにおけるこれらの方法の収束を保証します。さらに、ヘビーテールノイズを持つ非平滑ニューラルネットワークを訓練するための勾配クリッピング技術を組み込んだ確率的劣勾配法を開発します。私たちのフレームワークを通じて、私たちが提案した方法は、評価ノイズが可積分であることのみを仮定した場合でも収束することを示します。広範な数値実験により、提案手法の高効率とロバスト性が実証されています。

Efficient Modality Selection in Multimodal Learning
マルチモーダル学習における効率的なモダリティ選択

Multimodal learning aims to learn from data of different modalities by fusing information from heterogeneous sources. Although it is beneficial to learn from more modalities, it is often infeasible to use all available modalities under limited computational resources. Modeling with all available modalities can also be inefficient and unnecessary when information across input modalities overlaps. In this paper, we study the modality selection problem, which aims to select the most useful subset of modalities for learning under a cardinality constraint. To that end, we propose a unified theoretical framework to quantify the learning utility of modalities, and we identify dependence assumptions to flexibly model the heterogeneous nature of multimodal data, which also allows efficient algorithm design. Accordingly, we derive a greedy modality selection algorithm via submodular maximization, which selects the most useful modalities with an optimality guarantee on learning performance. We also connect marginal-contribution-based feature importance scores, such as Shapley value, from the feature selection domain to the context of modality selection, to efficiently compute the importance of individual modality. We demonstrate the efficacy of our theoretical results and modality selection algorithms on 2 synthetic and 4 real-world data sets on a diverse range of multimodal data.

マルチモーダル学習は、異種ソースからの情報を融合することにより、異なるモダリティのデータから学習することを目的としています。より多くのモダリティから学習することは有益ですが、限られた計算リソースの下で利用可能なすべてのモダリティを使用することは多くの場合実行不可能です。利用可能なすべてのモダリティを使用したモデリングは、入力モダリティ間の情報が重複している場合、非効率的で不必要になることもあります。この論文では、カーディナリティ制約の下で学習に最も有用なモダリティのサブセットを選択することを目的とするモダリティ選択問題を研究します。そのために、モダリティの学習効用を定量化する統一された理論的枠組みを提案し、マルチモーダルデータの異種性質を柔軟にモデル化するための依存性仮定を特定し、効率的なアルゴリズム設計も可能にします。したがって、学習パフォーマンスの最適性が保証された最も有用なモダリティを選択する、サブモジュラー最大化による貪欲なモダリティ選択アルゴリズムを導出します。また、シャプレー値などの限界寄与ベースの特徴重要度スコアを特徴選択ドメインからモダリティ選択のコンテキストに接続して、個々のモダリティの重要性を効率的に計算します。さまざまなマルチモーダルデータに関する2つの合成データセットと4つの実世界データセットで、理論的結果とモダリティ選択アルゴリズムの有効性を実証します。
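
The greedy selection under a cardinality constraint mentioned in this abstract can be illustrated with the generic greedy rule for monotone submodular maximization. The paper's actual utility function is not given in the abstract, so the sketch below substitutes a simple set-coverage utility; the modality names and the "information units" they cover are made up for illustration.

```python
def coverage(selected, info):
    """Stand-in utility: number of distinct information units covered.

    Set coverage is monotone and submodular; it is NOT the utility used in
    the paper, only a simple function with the same structural properties.
    """
    covered = set()
    for name in selected:
        covered |= info[name]
    return len(covered)


def greedy_select(info, k):
    """Pick at most k modalities, each time adding the largest marginal gain."""
    selected = []
    for _ in range(k):
        base = coverage(selected, info)
        best, best_gain = None, 0
        for name in info:
            if name in selected:
                continue
            gain = coverage(selected + [name], info) - base
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining modality adds anything
            break
        selected.append(best)
    return selected


# Hypothetical modalities and the information units each one carries.
info = {
    "image": {1, 2, 3, 4},
    "audio": {3, 4, 5},
    "text": {5, 6},
    "depth": {1, 2},
}
chosen = greedy_select(info, k=2)
print(chosen, coverage(chosen, info))
```

For monotone submodular utilities, this greedy rule is known to achieve at least a (1 − 1/e) fraction of the optimal value under a cardinality constraint, which is the kind of optimality guarantee the abstract refers to.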

A Multilabel Classification Framework for Approximate Nearest Neighbor Search
近似最近傍探索のための多ラベル分類フレームワーク

To learn partition-based index structures for approximate nearest neighbor (ANN) search, both supervised and unsupervised machine learning algorithms have been used. Existing supervised algorithms select all the points that belong to the same partition element as the query point as nearest neighbor candidates. Consequently, they formulate the learning task as finding a partition in which the nearest neighbors of a query point belong to the same partition element with it as often as possible. In contrast, we formulate the candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point. In the proposed framework, partition-based index structures are interpreted as partitioning classifiers for solving this classification problem. Empirical results suggest that, when combined with any partitioning strategy, the natural classifier based on the proposed framework leads to a strictly improved performance compared to the earlier candidate set selection methods. We also prove a sufficient condition for the consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees and (both dense and sparse) random projection trees.

近似最近傍(ANN)検索用のパーティションベースのインデックス構造を学習するために、教師ありおよび教師なしの両方の機械学習アルゴリズムが使用されてきました。既存の教師ありアルゴリズムは、クエリポイントと同じパーティション要素に属するすべてのポイントを最近傍候補として選択します。その結果、学習タスクは、クエリポイントの最近傍ができるだけ頻繁にクエリポイントと同じパーティション要素に属するようなパーティションを見つけることとして定式化されます。対照的に、私たちはANN検索での候補セットの選択を、ラベルがクエリポイントの最近傍に対応するマルチラベル分類問題として直接定式化します。提案されたフレームワークでは、パーティションベースのインデックス構造は、この分類問題を解決するためのパーティション分類子として解釈されます。実験結果によると、任意のパーティション戦略と組み合わせた場合、提案されたフレームワークに基づく自然な分類子は、以前の候補セット選択方法と比較して厳密に改善されたパフォーマンスをもたらします。また、ANN検索のパーティション分類子の一致性に対する十分条件を証明し、chronological $k$-dツリーと(密と疎の両方の)ランダム投影ツリーに対してこの条件を検証することで結果を例示します。

Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization
スコアリングルール最小化による生成ネットワークを用いた確率的予測

Probabilistic forecasting relies on past observations to provide a probability distribution for a future outcome, which is often evaluated against the realization using a scoring rule. Here, we perform probabilistic forecasting with generative neural networks, which parametrize distributions on high-dimensional spaces by transforming draws from a latent variable. Generative networks are typically trained in an adversarial framework. In contrast, we propose to train generative networks to minimize a predictive-sequential (or prequential) scoring rule on a recorded temporal sequence of the phenomenon of interest, which is appealing as it corresponds to the way forecasting systems are routinely evaluated. Adversarial-free minimization is possible for some scoring rules; hence, our framework avoids the cumbersome hyperparameter tuning and uncertainty underestimation due to unstable adversarial training, thus unlocking reliable use of generative networks in probabilistic forecasting. Further, we prove consistency of the minimizer of our objective with dependent data, while adversarial training assumes independence. We perform simulation studies on two chaotic dynamical models and a benchmark data set of global weather observations; for this last example, we define scoring rules for spatial data by drawing from the relevant literature. Our method outperforms state-of-the-art adversarial approaches, especially in probabilistic calibration, while requiring less hyperparameter tuning.

確率予測は、過去の観測値に基づいて将来の結果の確率分布を提供するものであり、その分布は多くの場合、スコアリングルールを使用して実現値と照らして評価されます。ここでは、潜在変数からのサンプルを変換することで高次元空間上の分布をパラメーター化する生成ニューラルネットワークを使用して確率予測を実行します。生成ネットワークは通常、敵対的フレームワークでトレーニングされます。対照的に、対象現象の記録された時系列に対する予測逐次的(prequential)スコアリングルールを最小化するように生成ネットワークをトレーニングすることを提案します。これは、予測システムが日常的に評価される方法に対応するため魅力的です。一部のスコアリングルールでは、敵対的トレーニングを用いない最小化が可能です。したがって、私たちのフレームワークでは、不安定な敵対的トレーニングに起因する面倒なハイパーパラメータ調整や不確実性の過小評価を回避し、確率予測で生成ネットワークを信頼性高く使用できるようにします。さらに、敵対的トレーニングでは独立性が前提とされているのに対し、従属データのもとで目的関数の最小化解の一致性を証明します。2つのカオス力学モデルと、地球規模の気象観測のベンチマークデータセットに関するシミュレーション研究を実施します。この最後の例では、関連文献を参考にして空間データのスコアリングルールを定義します。私たちの方法は、特に確率的キャリブレーションにおいて最先端の敵対的アプローチよりも優れており、ハイパーパラメータの調整も少なくて済みます。
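
To make concrete how a scoring rule can be evaluated from generative-network samples without a discriminator, the sketch below estimates the energy score, a well-known strictly proper scoring rule, from a set of draws. The abstract does not say which scoring rules the paper uses, so treating the energy score as the example is an assumption; the function name and the toy Gaussian "forecasts" are illustrative.

```python
import numpy as np


def energy_score(samples, y):
    """Unbiased sample estimate of the energy score ES(P, y).

    ES(P, y) = E||X - y|| - 0.5 * E||X - X'||, with X, X' i.i.d. from P.
    Lower is better. `samples` has shape (m, d) with m >= 2; `y` has shape (d,).
    """
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    m = samples.shape[0]
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    term2 = pairwise.sum() / (m * (m - 1))  # diagonal entries are zero
    return term1 - 0.5 * term2


rng = np.random.default_rng(0)
y_obs = np.zeros(3)                               # realized outcome
good = rng.normal(0.0, 1.0, size=(200, 3))        # forecast centered on y_obs
bad = rng.normal(3.0, 1.0, size=(200, 3))         # forecast centered far away
score_good = energy_score(good, y_obs)
score_bad = energy_score(bad, y_obs)
print(score_good, score_bad)
```

Because the estimate depends on the samples only through differentiable operations, the same expression could serve as a training loss when the samples come from a generator network; summing it over successive time steps gives a prequential-style objective.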

Multiple Descent in the Multiple Random Feature Model
多重ランダム特徴モデルにおける多重降下

Recent works have demonstrated a double descent phenomenon in over-parameterized learning. Although this phenomenon has been investigated by recent works, it has not been fully understood in theory. In this paper, we investigate the multiple descent phenomenon in a class of multi-component prediction models. We first consider a “double random feature model” (DRFM) concatenating two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk under the high dimensional framework where the training sample size, the dimension of data, and the dimension of random features tend to infinity proportionally. Based on the calculation, we further theoretically demonstrate that the risk curves of DRFMs can exhibit triple descent. We then provide a thorough experimental study to verify our theory. Finally, we extend our study to the “multiple random feature model” (MRFM), and show that MRFMs ensembling $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis points out that risk curves with a specific number of descents generally exist in learning multi-component prediction models.

最近の研究では、過剰パラメータ化学習における二重降下現象が実証されています。この現象は最近の研究で調査されていますが、理論的には完全には理解されていません。この論文では、多成分予測モデルのクラスにおける多重降下現象を調査します。まず、2種類のランダムフィーチャを連結した「二重ランダムフィーチャモデル」(DRFM)を検討し、リッジ回帰でDRFMによって達成される過剰リスクを調査します。トレーニングサンプルサイズ、データの次元、ランダムフィーチャの次元が比例して無限大になる高次元フレームワークの下で、過剰リスクの正確な限界を計算します。計算に基づいて、DRFMのリスク曲線が三重降下を示す可能性があることをさらに理論的に実証します。次に、理論を検証するための徹底的な実験研究を提供します。最後に、研究を「多重ランダムフィーチャモデル」(MRFM)に拡張し、$K$種類のランダムフィーチャをアンサンブルするMRFMが$(K+1)$倍降下を示す可能性があることを示します。私たちの分析は、多成分予測モデルの学習では、一般的に特定の降下数を持つリスク曲線が存在することを指摘しています。

Mean-Square Analysis of Discretized Itô Diffusions for Heavy-tailed Sampling
ヘビーテールサンプリングのための離散化伊藤拡散の平均二乗解析

We analyze the complexity of sampling from a class of heavy-tailed distributions by discretizing a natural class of Itô diffusions associated with weighted Poincaré inequalities. Based on a mean-square analysis, we establish the iteration complexity for obtaining a sample whose distribution is $\epsilon$ close to the target distribution in the Wasserstein-2 metric. In this paper, our results take the mean-square analysis to its limits, i.e., we invariably only require that the target density has finite variance, the minimal requirement for a mean-square analysis. To obtain explicit estimates, we compute upper bounds on certain moments associated with heavy-tailed targets under various assumptions. We also provide similar iteration complexity results for the case where only function evaluations of the unnormalized target density are available by estimating the gradients using a Gaussian smoothing technique. We provide illustrative examples based on the multivariate $t$-distribution.

私たちは、重み付きポアンカレ不等式に関連する伊藤拡散の自然なクラスを離散化することにより、ヘビーテール分布のクラスからのサンプリングの複雑さを分析します。平均二乗解析に基づいて、分布がWasserstein-2距離において目標分布に$\epsilon$近いサンプルを取得するための反復計算量を確立します。この論文の結果は、平均二乗解析をその限界まで押し進めるものです。つまり、平均二乗解析の最小要件である、ターゲット密度が有限分散を持つことのみを常に要求します。明示的な推定値を得るために、さまざまな仮定の下で、ヘビーテールターゲットに関連する特定のモーメントの上限を計算します。また、ガウス平滑化手法を使用して勾配を推定することにより、正規化されていないターゲット密度の関数評価のみが利用可能な場合についても、同様の反復計算量の結果を提供します。多変量$t$分布に基づく実例を提供します。
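
The Gaussian smoothing idea mentioned in this abstract, estimating a gradient from function evaluations alone, can be sketched with the standard forward-difference estimator g = E[(f(x + μu) − f(x)) / μ · u], u ~ N(0, I). The sketch applies it to the potential (negative log of the unnormalized density) of a multivariate t-distribution. The exact estimator, smoothing parameter, and sample sizes used in the paper are not stated in the abstract; everything below is an illustrative assumption.

```python
import numpy as np


def smoothed_grad(f, x, mu, n_dirs, rng):
    """Zeroth-order gradient estimate via Gaussian smoothing.

    Averages (f(x + mu * u) - f(x)) / mu * u over u ~ N(0, I).
    `f` is assumed to accept a batch of points with shape (n, d).
    """
    x = np.asarray(x, dtype=float)
    u = rng.standard_normal((n_dirs, x.size))
    fx = f(x[None, :])[0]
    f_pert = f(x[None, :] + mu * u)
    return (((f_pert - fx) / mu)[:, None] * u).mean(axis=0)


nu, d = 3.0, 2  # degrees of freedom and dimension (illustrative values)


def t_potential(z):
    """Negative log of the unnormalized multivariate t density."""
    return 0.5 * (nu + d) * np.log1p(np.sum(z ** 2, axis=-1) / nu)


x0 = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
g_est = smoothed_grad(t_potential, x0, mu=1e-2, n_dirs=50_000, rng=rng)
g_true = (nu + d) * x0 / (nu + np.sum(x0 ** 2))  # closed-form gradient
print(g_est, g_true)
```

The estimate is noisy for a small number of directions; in a sampler it would replace the exact gradient in each discretization step.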

Invariant and Equivariant Reynolds Networks
不変量および等変量のレイノルズネットワーク

Various data exhibit symmetry, including permutations in graphs and point clouds. Machine learning methods that utilize this symmetry have achieved considerable success. In this study, we explore learning models for data exhibiting group symmetry. Our focus is on transforming deep neural networks using Reynolds operators, which average over the group to convert a function into an invariant or equivariant form. While learning methods based on Reynolds operators are well-established, they often face computational complexity challenges. To address this, we introduce two new methods that reduce the computational burden associated with the Reynolds operator: (i) Although the Reynolds operator traditionally averages over the entire group, we demonstrate that it can be effectively approximated by averaging over specific subsets of the group, termed the Reynolds design. (ii) We reveal that the pre-model does not require all input variables. Instead, using a select number of partial inputs (Reynolds dimension) is sufficient to achieve a universally applicable model. Employing these methods, which hinge on the Reynolds design and Reynolds dimension concepts, allows us to construct universally applicable models with manageable computational complexity. Our experiments on benchmark data indicate that our approach is more efficient than existing methods.

グラフやポイントクラウドの順列など、さまざまなデータが対称性を示します。この対称性を利用する機械学習手法は、かなりの成功を収めています。この研究では、群対称性を示すデータの学習モデルを検討します。焦点は、群上で平均をとることで関数を不変形式または等変形式に変換するレイノルズ演算子を使用してディープニューラルネットワークを変換することです。レイノルズ演算子に基づく学習手法は十分に確立されていますが、計算の複雑さの課題に直面することがよくあります。これに対処するために、レイノルズ演算子に関連する計算負荷を軽減する2つの新しい手法を紹介します。(i)レイノルズ演算子は従来、群全体にわたって平均をとりますが、レイノルズ設計と呼ばれる群の特定の部分集合上で平均をとることで効果的に近似できることを実証します。(ii)プレモデルにすべての入力変数が必要ではないことを明らかにします。代わりに、選択した数の部分入力(レイノルズ次元)を使用するだけで、普遍的に適用可能なモデルを実現できます。レイノルズ設計とレイノルズ次元の概念に基づくこれらの方法を採用することで、扱いやすい計算複雑性で普遍的に適用可能なモデルを構築できます。ベンチマークデータでの実験では、このアプローチが既存の方法よりも効率的であることが示されています。
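
The Reynolds operator described in this abstract can be sketched for the permutation group acting on a short input vector: averaging any function over all permutations of its input yields a permutation-invariant function. This is the plain full-group average, which costs n! evaluations; the paper's Reynolds design (averaging over a subset of the group) is not reproduced here, and the toy function `f` is an illustrative assumption.

```python
import math
from itertools import permutations


def reynolds_invariant(f, x):
    """Average f over every permutation of the entries of x.

    This is the Reynolds operator for the symmetric group S_n acting by
    permuting coordinates; the result does not depend on the order of x.
    """
    perms = list(permutations(range(len(x))))
    return sum(f([x[i] for i in p]) for p in perms) / len(perms)


def f(x):
    """A toy function that is NOT permutation invariant."""
    return x[0] + 2.0 * x[1] * x[2]


a = reynolds_invariant(f, [1.0, 2.0, 3.0])
b = reynolds_invariant(f, [3.0, 1.0, 2.0])
print(f([1.0, 2.0, 3.0]), f([3.0, 1.0, 2.0]))  # differ: f is not invariant
print(a, b, math.isclose(a, b))                # agree after averaging
```

The factorial cost of the full average is the computational burden the abstract refers to, and it motivates averaging over a smaller, carefully chosen subset of group elements.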

Personalized PCA: Decoupling Shared and Unique Features
パーソナライズされたPCA:共有機能と独自の機能のデカップリング

In this paper, we tackle a significant challenge in PCA: heterogeneity. When data are collected from different sources with heterogeneous trends while still sharing some congruency, it is critical to extract shared knowledge while retaining the unique features of each source. To this end, we propose personalized PCA (PerPCA), which uses mutually orthogonal global and local principal components to encode both unique and shared features. We show that, under mild conditions, both unique and shared features can be identified and recovered by a constrained optimization problem, even if the covariance matrices are immensely different. Also, we design a fully federated algorithm inspired by distributed Stiefel gradient descent to solve the problem. The algorithm introduces a new group of operations called generalized retractions to handle orthogonality constraints, and only requires global PCs to be shared across sources. We prove the linear convergence of the algorithm under suitable assumptions. Comprehensive numerical experiments highlight PerPCA’s superior performance in feature extraction and prediction from heterogeneous datasets. As a systematic approach to decouple shared and unique features from heterogeneous datasets, PerPCA finds applications in several tasks, including video segmentation, topic extraction, and feature clustering.

この論文では、PCAにおける重要な課題である異質性に取り組みます。データが異質な傾向を持ちながらもある程度の一致性を持つさまざまなソースから収集される場合、各ソースの固有の特徴を保持しながら共有知識を抽出することが重要です。この目的のために、相互に直交するグローバルおよびローカル主成分を使用して固有の特徴と共有の特徴の両方をエンコードするパーソナライズされたPCA (PerPCA)を提案します。緩やかな条件下では、共分散行列が非常に異なっていても、制約付き最適化問題によって固有の特徴と共有の特徴の両方を識別および回復できることを示します。また、この問題を解決するために、分散Stiefel勾配降下法にヒントを得た完全連合アルゴリズムを設計します。このアルゴリズムは、直交性制約を処理するために一般化レトラクションと呼ばれる新しい操作群を導入し、ソース間でグローバルPCを共有するだけで済みます。適切な仮定の下でアルゴリズムの線形収束を証明します。包括的な数値実験により、異質なデータセットからの特徴抽出と予測におけるPerPCAの優れたパフォーマンスが強調されます。PerPCAは、異質なデータセットから共有の特徴と固有の特徴を分離する体系的なアプローチとして、ビデオのセグメンテーション、トピック抽出、特徴クラスタリングなど、さまざまなタスクに応用されています。

Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee
Survival Kernets:スケーラブルで解釈可能なディープカーネル生存解析と精度保証

Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive compared to various baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets

カーネル生存分析モデルは、任意の2つのデータポイント間の類似性を測定するカーネル関数を利用して、個々の生存分布を推定します。このようなカーネル関数は、ディープカーネル生存モデルを使用して学習できます。この論文では、モデル解釈と理論分析に適した方法で大規模なデータセットに拡張できる、survival kernetと呼ばれる新しいディープカーネル生存モデルを紹介します。具体的には、トレーニングデータは、カーネルネッティングと呼ばれる分類と回帰のための最近開発されたトレーニングセット圧縮方式(本論文ではこれを生存分析設定に拡張します)に基づいてクラスターに分割されます。テスト時に、各データポイントはこれらのクラスターの加重組み合わせとして表され、各クラスターを視覚化できます。survival kernetの特殊なケースでは、予測された生存分布に対して、対数係数を除いて最適な有限サンプル誤差境界を確立します。テスト時のスケーラビリティは前述のカーネルネッティング圧縮戦略を使用して実現されますが、トレーニング中のスケーラビリティは、XGBoostなどのツリーアンサンブルに基づくウォームスタート手順と、ニューラルアーキテクチャ検索を加速するヒューリスティックアプローチによって実現されます。さまざまなサイズ(最大約300万データポイント)の4つの標準生存分析データセットで、時間依存の一致指数に関して、テストされたさまざまなベースラインと比較してsurvival kernetが非常に競争力があることを示します。コードは、https://github.com/georgehc/survival-kernetsで入手できます。

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control
連続制御におけるヘビーテール方策探索のサンプルの複雑さとメタ安定性について

Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter $\alpha$, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy’s tail index $\alpha$, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.

強化学習は、システムダイナミクスモデルを使用せずに、時間の経過とともにインセンティブが順次明らかになる対話型の意思決定のフレームワークです。連続空間へのスケーリングのため、確率的ポリシー勾配(PG)更新を使用してパラメーター化されたポリシーを反復的に改善するポリシー検索に焦点を当てます。表形式のマルコフ決定問題(MDP)では、永続的な探索と適切なパラメーター化により、グローバルな最適性が得られます。対照的に、連続空間では、既存の収束結果が主に定常性または任意の局所極値に限定されていることからもわかるように、非凸性は病的な課題をもたらします。このギャップを埋めるために、状態空間でのジャンプの可能性を高めるテールインデックスパラメーター$\alpha$によって定義されるより重いテールの分布によって定義されるポリシーパラメーター化を通じて、連続空間での永続的な探索に踏み込みます。そうすることで、PGに共通するスコア関数の平滑性条件が無効になります。このように、定常性への収束率が、ポリシーのテールインデックス$\alpha$、Hölder連続性パラメーター、積分可能性条件、および初めてここで導入された探索許容パラメーターによってどのように決まるかを確立します。さらに、適切に定義されたマルコフ連鎖の終了および遷移時間分析を通じて、ローカル最大値のセットがテールインデックスに依存することを特徴付け、より重いテールのレヴィ過程に関連付けられたポリシーがより広いピークに収束することを特定します。この現象により、教師あり学習の摂動に対する安定性が向上し、特に近視眼的インセンティブと遠視的インセンティブが一致していない場合に、ポリシー検索のパフォーマンスが向上することも確認されています。

Convergence for nonconvex ADMM, with applications to CT imaging
非凸型ADMMの収束とCTイメージングへの応用

The alternating direction method of multipliers (ADMM) algorithm is a powerful and flexible tool for complex optimization problems of the form $\min\{f(x)+g(y) : Ax+By=c\}$. ADMM exhibits robust empirical performance across a range of challenging settings including nonsmoothness and nonconvexity of the objective functions $f$ and $g$, and provides a simple and natural approach to the inverse problem of image reconstruction for computed tomography (CT) imaging. From the theoretical point of view, existing results for convergence in the nonconvex setting generally assume smoothness in at least one of the component functions in the objective. In this work, our new theoretical results provide convergence guarantees under a restricted strong convexity assumption without requiring smoothness or differentiability, while still allowing differentiable terms to be treated approximately if needed. We validate these theoretical results empirically, with a simulated example where both $f$ and $g$ are nondifferentiable—and thus outside the scope of existing theory—as well as a simulated CT image reconstruction problem.

交互方向乗数法(ADMM)アルゴリズムは、形式$\min\{f(x)+g(y) : Ax+By=c\}$の複雑な最適化問題に対する強力で柔軟なツールです。ADMMは、目的関数$f$および$g$の非滑らかさや非凸性など、さまざまな困難な設定にわたって堅牢な経験的パフォーマンスを示し、コンピューター断層撮影(CT)画像の画像再構成の逆問題に対するシンプルで自然なアプローチを提供します。理論的な観点からは、非凸設定での収束に関する既存の結果は、一般に、目的関数の少なくとも1つのコンポーネント関数が滑らかであることを前提としています。この研究では、新しい理論的結果により、滑らかさや微分可能性を必要とせずに、制限された強い凸性仮定の下で収束が保証されますが、微分可能な項を必要に応じて近似的に扱うこともできます。私たちは、$f$と$g$の両方が微分不可能であり、したがって既存の理論の範囲外であるシミュレーション例と、シミュレーションされたCT画像再構成問題を使用して、これらの理論的結果を経験的に検証します。
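
As a point of reference for the problem form $\min\{f(x)+g(y) : Ax+By=c\}$, the following is a minimal sketch of scaled-form ADMM on a convex textbook instance: $f(x)=\tfrac12\|x-a\|^2$, $g(y)=\lambda\|y\|_1$, with the constraint $x-y=0$. It is not the nonconvex setting analyzed in the paper. The function names, the penalty parameter `rho`, and the iteration count are illustrative assumptions.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(a, lam, rho=1.0, iters=200):
    """Scaled-form ADMM for min 0.5*||x - a||^2 + lam*||y||_1  s.t.  x - y = 0.

    Hypothetical helper for illustration only (convex case).
    """
    n = len(a)
    x = [0.0] * n
    y = [0.0] * n
    u = [0.0] * n  # scaled dual variable
    for _ in range(iters):
        # x-update: minimize 0.5*(x - a)^2 + (rho/2)*(x - y + u)^2 in closed form
        x = [(ai + rho * (yi - ui)) / (1.0 + rho) for ai, yi, ui in zip(a, y, u)]
        # y-update: proximal step on the l1 term
        y = [soft_threshold(xi + ui, lam / rho) for xi, ui in zip(x, u)]
        # dual update: accumulate the constraint residual x - y
        u = [ui + xi - yi for ui, xi, yi in zip(u, x, y)]
    return y


if __name__ == "__main__":
    print(admm_l1_denoise([3.0, -0.5, -2.0], lam=1.0))
```

For this separable problem the exact minimizer is the soft-thresholding of `a` at level `lam`, so the iterates can be checked against a closed form.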

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms
通信制約下における分散ガウス平均推定:最適レートと通信効率の高いアルゴリズム

Distributed estimation of a Gaussian mean under communication constraints is studied in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between communication costs and statistical accuracy, are established under the independent protocols. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under communication constraints, both in terms of the optimal procedure design and the lower bound argument. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design. The optimality results and techniques developed in the present paper can be useful for solving other problems such as distributed nonparametric function estimation and sparse signal recovery.

通信制約下でのガウス平均の分散推定を、決定理論の枠組みで研究します。通信コストと統計的精度のトレードオフを特徴付けるミニマックス収束率は、独立したプロトコルの下で確立されます。通信効率が高く統計的に最適な手順が開発されます。単変量の場合、各ローカルマシンに少なくとも1ビットがある限り、最適率は総通信予算のみに依存します。しかし、多変量の場合、ミニマックス率はローカルマシン間の通信予算の特定の割り当てに依存します。ガウス平均の最適推定は、従来の設定では比較的単純であるが、通信制約下では、最適手順設計と下限の議論の両方の点で非常に複雑です。重要なステップは、ミニマックス推定問題を2つの段階、つまりローカリゼーションとリファインメントに分解することです。この重要な分解は、下限分析と最適手順設計の両方の枠組みを提供します。本論文で開発された最適性の結果と手法は、分散ノンパラメトリック関数推定やスパース信号回復などの他の問題を解決するために役立つ可能性があります。
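
To make the one-bit-per-machine regime concrete, here is a minimal sketch in which each local machine observes a single draw from $N(\theta, \sigma^2)$ and transmits only the bit $\mathbf{1}\{X_i > 0\}$. The center inverts the Gaussian CDF of the observed fraction. This is a textbook illustration, not the localization-and-refinement procedure of the paper, and it assumes $\theta$ is already known to lie within a few $\sigma$ of the threshold 0. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def one_bit_messages(theta, m, sigma=1.0, seed=0):
    """Simulate m machines, each sending the single bit 1{X_i > 0}."""
    rng = random.Random(seed)
    return [1 if rng.gauss(theta, sigma) > 0.0 else 0 for _ in range(m)]


def estimate_mean_from_bits(bits, sigma=1.0):
    """Recover theta from the bits using P(X > 0) = Phi(theta / sigma)."""
    m = len(bits)
    p_hat = sum(bits) / m
    # Clamp away from 0 and 1 so the inverse CDF stays finite.
    p_hat = min(max(p_hat, 0.5 / m), 1.0 - 0.5 / m)
    return sigma * NormalDist().inv_cdf(p_hat)


if __name__ == "__main__":
    bits = one_bit_messages(theta=0.3, m=20000)
    print(estimate_mean_from_bits(bits))
```

The total communication here is exactly `m` bits, which matches the flavor of the univariate statement that the rate is governed by the total budget once every machine has at least one bit.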

Sparse NMF with Archetypal Regularization: Computational and Robustness Properties
原型正則化によるスパースNMF:計算特性とロバスト性特性

We consider the problem of sparse nonnegative matrix factorization (NMF) using archetypal regularization. The goal is to represent a collection of data points as nonnegative linear combinations of a few nonnegative sparse factors with appealing geometric properties, arising from the use of archetypal regularization. We generalize the notion of robustness studied in Javadi and Montanari (2019) (without sparsity) to the notions of (a) strong robustness that implies each estimated archetype is close to the underlying archetypes and (b) weak robustness that implies there exists at least one recovered archetype that is close to the underlying archetypes. Our theoretical results on robustness guarantees hold under minimal assumptions on the underlying data, and apply to settings where the underlying archetypes need not be sparse. We present theoretical results and illustrative examples to strengthen the insights underlying the notions of robustness. We propose new algorithms for our optimization problem, and present numerical experiments on synthetic and real data sets that shed further light on our proposed framework and theoretical developments.

私たちは、原型正則化を用いたスパース非負値行列因子分解(NMF)の問題を検討します。目標は、原型正則化の使用から生じる魅力的な幾何学的特性を持つ少数の非負スパース因子の非負線形結合としてデータポイントのコレクションを表現することです。私たちは、JavadiとMontanari (2019) (スパース性なし)で研究された堅牢性の概念を、(a)推定された各原型が基礎となる原型に近いことを意味する強い堅牢性と、(b)基礎となる原型に近い回復された原型が少なくとも1つ存在することを意味する弱い堅牢性の概念に一般化します。堅牢性の保証に関する私たちの理論的結果は、基礎となるデータに関する最小限の仮定の下で成り立ち、基礎となる原型がスパースである必要がない設定に適用できます。私たちは、堅牢性の概念の根底にある洞察を強化するために、理論的結果と例示的な例を示します。また、最適化問題のための新しいアルゴリズムを提案し、提案するフレームワークと理論的発展へのさらなる洞察をもたらす、合成データセットと実際のデータセットに関する数値実験を紹介します。

Deep Network Approximation: Beyond ReLU to Diverse Activation Functions
深層ネットワーク近似:ReLUを超えて多様な活性化関数へ

This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the (width,$\,$depth) scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$.

この論文では、さまざまな活性化関数に対するディープニューラルネットワークの表現力について検討します。活性化関数セット$\mathscr{A}$は、$\mathtt{ReLU}$、$\mathtt{LeakyReLU}$、$\mathtt{ReLU}^2$、$\mathtt{ELU}$、$\mathtt{CELU}$、$\mathtt{SELU}$、$\mathtt{Softplus}$、$\mathtt{GELU}$、$\mathtt{SiLU}$、$\mathtt{Swish}$、$\mathtt{Mish}$、$\mathtt{Sigmoid}$、$\mathtt{Tanh}$、$\mathtt{Arctan}$、$\mathtt{Softsign}$、$\mathtt{dSiLU}$、$\mathtt{SRS}$など、一般的に使用される活性化関数の大部分を包含するように定義されています。任意の活性化関数$\varrho\in \mathscr{A}$について、幅$N$、深さ$L$の$\mathtt{ReLU}$ネットワークは、任意の有界集合上で、幅$3N$、深さ$2L$の$\varrho$活性化ネットワークによって任意の精度で近似できることを実証します。この発見により、$\mathtt{ReLU}$ネットワークで達成されるほとんどの近似結果を、定数はわずかに増えるものの、さまざまな他の活性化関数に拡張できます。重要なことに、$\varrho$が$\mathscr{A}$の特定のサブセット内にある場合、(幅,$\,$深さ)スケーリング係数を$(3,2)$から$(1,1)$にさらに減らすことができることを確立します。このサブセットには、$\mathtt{ELU}$、$\mathtt{CELU}$、$\mathtt{SELU}$、$\mathtt{Softplus}$、$\mathtt{GELU}$、$\mathtt{SiLU}$、$\mathtt{Swish}$、$\mathtt{Mish}$などの活性化関数が含まれます。
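
A single-neuron illustration of why $(1,1)$ scaling is plausible for $\mathtt{Softplus}$: the rescaled function $\mathtt{Softplus}(kx)/k$ differs from $\mathtt{ReLU}(x)$ by at most $\ln 2 / k$ everywhere, and the factors $k$ and $1/k$ can be absorbed into the adjacent affine maps of a network. This is a classical observation offered as a sketch. It is not the construction used in the paper, and the helper names and grid are assumptions.

```python
import math


def relu(x):
    return max(x, 0.0)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def scaled_softplus(x, k):
    """Softplus with input scaled by k and output scaled by 1/k."""
    return softplus(k * x) / k


def max_gap(k, lo=-5.0, hi=5.0, steps=2001):
    """Largest |scaled_softplus - relu| on a uniform grid over [lo, hi]."""
    worst = 0.0
    for i in range(steps):
        x = lo + i * (hi - lo) / (steps - 1)
        worst = max(worst, abs(scaled_softplus(x, k) - relu(x)))
    return worst


if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, max_gap(k), math.log(2) / k)
```

The gap is largest at $x = 0$, where it equals $\ln 2 / k$, so increasing `k` drives the uniform error to zero without changing width or depth.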

Effect-Invariant Mechanisms for Policy Generalization
政策一般化のための効果不変メカニズム

Policy learning is an important component of many real-world learning systems. A major challenge in policy learning is how to adapt efficiently to unseen environments or tasks. Recently, it has been suggested to exploit invariant conditional distributions to learn models that generalize better to unseen environments. However, assuming invariance of entire conditional distributions (which we call full invariance) may be too strong of an assumption in practice. In this paper, we introduce a relaxation of full invariance called effect-invariance (e-invariance for short) and prove that it is sufficient, under suitable assumptions, for zero-shot policy generalization. We also discuss an extension that exploits e-invariance when we have a small sample from the test environment, enabling few-shot policy generalization. Our work does not assume an underlying causal graph or that the data are generated by a structural causal model; instead, we develop testing procedures to test e-invariance directly from data. We present empirical results using simulated data and a mobile health intervention dataset to demonstrate the effectiveness of our approach.

ポリシー学習は、多くの現実世界の学習システムの重要な構成要素です。ポリシー学習における大きな課題は、目に見えない環境やタスクに効率的に適応する方法です。最近、目に見えない環境によりよく一般化するモデルを学習するために、不変条件分布を利用することが提案されています。しかし、条件分布全体の不変性(完全不変性と呼ぶ)を想定することは、実際には強すぎる仮定である可能性があります。この論文では、効果不変性(略してe不変性)と呼ばれる完全不変性の緩和を導入し、適切な仮定の下では、ゼロショットポリシー一般化に十分であることを証明します。また、テスト環境から小さなサンプルがある場合にe不変性を利用して、少数ショットポリシー一般化を可能にする拡張についても説明します。私たちの研究では、基礎となる因果グラフや、データが構造的因果モデルによって生成されることを想定していません。代わりに、データから直接e不変性をテストするテスト手順を開発しています。シミュレーションデータとモバイルヘルス介入データセットを使用して実証結果を示し、私たちのアプローチの有効性を示します。

Pygmtools: A Python Graph Matching Toolkit
pygmtools: Python グラフマッチングツールキット

Graph matching aims to find node-to-node matching among multiple graphs, which is a fundamental yet challenging problem. To facilitate graph matching in scientific research and industrial applications, we release pygmtools, a Python graph matching toolkit that implements a comprehensive collection of two-graph matching and multi-graph matching solvers, covering both learning-free solvers and learning-based neural graph matching solvers. Our implementation supports numerical backends including NumPy, PyTorch, Jittor, and Paddle, runs on Windows, macOS, and Linux, and is easy to install and configure. Comprehensive documentation covering a beginner’s guide, API reference, and examples is available online. pygmtools is open-sourced under the Mulan PSL v2 license.

グラフマッチングは、複数のグラフ間でノード間のマッチングを見つけることを目的としており、これは基本的でありながら困難な問題です。科学研究や産業アプリケーションでのグラフマッチングを容易にするために、2グラフマッチングとマルチグラフマッチングソルバーの包括的なコレクションを実装したPythonグラフマッチングツールキットであるpygmtoolsがリリースされました。これは、学習不要のソルバーと学習ベースのニューラルグラフマッチングソルバーの両方をカバーしています。私たちの実装は、Numpy、PyTorch、Jittor、Paddleなどの数値バックエンドをサポートし、Windows、MacOS、Linuxで動作し、インストールと構成が簡単です。ビギナーズガイド、APIリファレンス、および例を網羅した包括的なドキュメントは、オンラインで入手できます。pygmtoolsは、Mulan PSL v2ライセンスの下でオープンソース化されています。

Heterogeneous-Agent Reinforcement Learning
異種エージェント強化学習

The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours heavily rely on parameter sharing among agents, which confines them to the homogeneous-agent setting and leads to training instability and lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve the aforementioned issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), and derive HATRPO and HAPPO by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithmic designs. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of joint return and convergence to Nash Equilibrium. As its natural outcome, HAML validates more novel algorithms in addition to HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which generally outperform their existing MA-counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability for coordinating heterogeneous agents compared to strong baselines such as MAPPO and QMIX.

インテリジェントマシン間の協力の必要性から、AI研究において協力型マルチエージェント強化学習（MARL）が普及しました。しかし、多くの研究はエージェント間のパラメータ共有に大きく依存しており、同種エージェント設定のみに限定され、トレーニングの不安定性と収束保証の欠如につながります。一般的な異種エージェント設定で効果的な協力を実現するために、前述の問題を解決する異種エージェント強化学習（HARL）アルゴリズムを提案します。私たちの研究結果の中心となるのは、マルチエージェント利点分解補題と順次更新スキームです。これらに基づいて、証明可能に正しい異種エージェント信頼領域学習（HATRL）を開発し、扱いやすい近似によってHATRPOとHAPPOを導出します。さらに、異種エージェントミラー学習（HAML）という新しいフレームワークを発見しました。これは、HATRPOとHAPPOの理論的保証を強化し、協力型MARLアルゴリズム設計の一般的なテンプレートを提供します。HAMLから派生したすべてのアルゴリズムは、本質的にジョイントリターンの単調な改善とナッシュ均衡への収束を享受できることを証明します。当然の結果として、HAMLは、HATRPOとHAPPOに加えて、HAA2C、HADDPG、HATD3など、既存のMA対応アルゴリズムよりも一般的に優れたパフォーマンスを発揮する新しいアルゴリズムを検証します。6つの難しいベンチマークでHARLアルゴリズムを包括的にテストし、MAPPOやQMIXなどの強力なベースラインと比較して、異種エージェントの調整における優れた有効性と安定性を実証します。

Sample-efficient Adversarial Imitation Learning
サンプル効率の良い敵対的模倣学習

Imitation learning, in which learning is performed by demonstration, has been studied and advanced for sequential decision-making tasks in which a reward function is not predefined. However, imitation learning methods still require numerous expert demonstration samples to successfully imitate an expert’s behavior. To improve sample efficiency, we utilize self-supervised representation learning, which can generate vast training signals from the given data. In this study, we propose a self-supervised representation-based adversarial imitation learning method to learn state and action representations that are robust to diverse distortions and temporally predictive, on non-image control tasks. In particular, in comparison with existing self-supervised learning methods for tabular data, we propose a different corruption method for state and action representations that is robust to diverse distortions. We theoretically and empirically observe that making an informative feature manifold with less sample complexity significantly improves the performance of imitation learning. The proposed method shows a 39% relative improvement over existing adversarial imitation learning methods on MuJoCo in a setting limited to 100 expert state-action pairs. Moreover, we conduct comprehensive ablations and additional experiments using demonstrations with varying optimality to provide insights into a range of factors.

デモによって学習を行う模倣学習は、報酬関数が事前に定義されていない逐次意思決定タスクを対象に研究・開発されてきた。しかし、模倣学習法では、専門家の行動をうまく模倣するために、依然として多数の専門家のデモサンプルが必要です。サンプル効率を改善するために、我々は与えられたデータから膨大なトレーニング信号を生成できる自己教師あり表現学習を利用します。この研究では、非画像制御タスクにおいて、多様な歪みに対して堅牢で時間的に予測可能な状態・行動表現を学習するための自己教師あり表現ベースの敵対的模倣学習法を提案します。特に、表形式データに対する既存の自己教師あり学習法と比較して、多様な歪みに対して堅牢な状態・行動表現の異なる破損法を提案します。私たちは、より少ないサンプル複雑性で有益な特徴多様体を作成すると、模倣学習のパフォーマンスが大幅に向上することを理論的かつ経験的に観察します。提案された方法は、100の専門家の状態と行動のペアに限定された設定で、MuJoCo上の既存の敵対的模倣学習方法と比較して39%の相対的改善を示しています。さらに、さまざまな要因に関する洞察を提供するために、さまざまな最適性を持つデモンストレーションを使用して包括的なアブレーションと追加の実験を実施します。

Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent
確率的修正流れ,平均場極限および確率的勾配降下法の動力学

We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate – infinite width scaling regime.

私たちは、確率的修正流れと呼ばれる小さな学習率レジームにおける確率的勾配降下のための新しい制限ダイナミクスを提案します。これらのSDEは、円筒形のブラウン運動によって駆動され、規則的な拡散係数を持ち、多点統計を一致させることにより、いわゆる確率的修正方程式を改善します。2番目の貢献として、分布依存の確率的修正流れを導入し、小さな学習率-無限幅スケーリング体制における確率的勾配降下の変動する制限ダイナミクスを説明することを証明します。

Rates of convergence for density estimation with generative adversarial networks
敵対的生成ネットワークによる密度推定のための収束率

In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove an oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate with a significantly better statistical error term compared to the previously known results. The advantage of our bound becomes clear in application to nonparametric density estimation. We show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2\beta/(2\beta + d)}$, where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. This rate of convergence coincides (up to logarithmic factors) with the minimax optimal rate for the considered class of densities.

この研究では、バニラ敵対的生成ネットワーク(GAN)の非漸近特性を徹底的に研究します。私たちは、基礎となる密度$\mathsf{p}^*$とGAN推定値との間のJensen-Shannon (JS)ダイバージェンスについて、以前に知られていた結果と比較して有意に優れた統計的誤差項を持つオラクル不等式を証明します。この境界の利点は、ノンパラメトリック密度推定への適用で明らかになります。GAN推定値と$\mathsf{p}^*$との間のJSダイバージェンスは、$(\log{n}/n)^{2\beta/(2\beta + d)}$と同じ速さで減衰することを示します。ここで、$n$はサンプルサイズであり、$\beta$は$\mathsf{p}^*$の滑らかさを決定します。この収束率は、考慮された密度のクラスに対するミニマックス最適レートと(対数係数を除いて)一致します。

Additive smoothing error in backward variational inference for general state-space models
一般状態空間モデルの後方変分推論における加法平滑化誤差

We consider the problem of state estimation in general state-space models using variational inference. For a generic variational family defined using the same backward decomposition as the actual joint smoothing distribution, we establish under mixing assumptions that the variational approximation of expectations of additive state functionals induces an error which grows at most linearly in the number of observations. This guarantee is consistent with the known upper bounds for the approximation of smoothing distributions using standard Monte Carlo methods. We illustrate our theoretical result with state-of-the-art variational solutions based both on the backward parameterization and on alternatives using forward decompositions. This numerical study proposes guidelines for variational inference based on neural networks in state-space models.

私たちは、変分推論を用いた一般的な状態空間モデルにおける状態推定の問題を考えます。実際の同時平滑化分布と同じ後方分解を使用して定義される一般的な変分族の場合、加法状態汎関数の期待値の変分近似が観測数で最大で線形に増加する誤差を誘発することを混合仮定の下で確立します。この保証は、標準のモンテカルロ法を使用した平滑化分布の近似に関する既知の上限と一致しています。私たちは、後方パラメータ化と前方分解を使用した代替法の両方に基づく最先端の変分解を使用して、理論的な結果を示しています。この数値研究は、状態空間モデルのニューラルネットワークに基づく変分推論のガイドラインを提案します。

Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality
浅いReLUネットワークに対する最適バンプ関数:重み減衰、深度分離、次元の呪い

In this note, we study how neural networks with a single hidden layer and ReLU activation interpolate data drawn from a radially symmetric distribution with target labels 1 at the origin and 0 outside the unit ball, if no labels are known inside the unit ball. With weight decay regularization and in the infinite neuron, infinite data limit, we prove that a unique radially symmetric minimizer exists, whose average parameters and Lipschitz constant grow as $d$ and $\sqrt{d}$ respectively. We furthermore show that the average weight variable grows exponentially in $d$ if the label $1$ is imposed on a ball of radius $\varepsilon$ rather than just at the origin. By comparison, a neural network with two hidden layers can approximate the target function without encountering the curse of dimensionality.

このノートでは、単一の隠れ層とReLU活性化を持つニューラルネットワークが、単位球の内部ではラベルが与えられていない場合に、原点でターゲットラベル1、単位球の外側で0を持つ放射対称な分布から引き出されたデータをどのように補間するかを研究します。重み減衰正則化の下で、ニューロン数とデータ数を無限大とする極限において、一意な放射対称の最小化解が存在し、その平均パラメータとリプシッツ定数がそれぞれ$d$と$\sqrt{d}$のオーダーで増加することを証明します。さらに、ラベル$1$が原点のみではなく半径$\varepsilon$の球上に課される場合、平均重み変数は$d$に関して指数関数的に増加することを示します。これに対し、2つの隠れ層を持つニューラルネットワークは、次元の呪いに遭遇することなくターゲット関数を近似できます。

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees
被覆木を用いた最小分離による数値的に安定なスパースガウス過程

Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.

ガウス過程は、地理空間モデリング、ベイズ最適化、潜在ガウスモデルなど、大規模な機械学習および意思決定システムの一部として頻繁に導入されています。システム内では、ガウス過程モデルは、システムの他の部分と正しく相互作用することを保証するために、安定して信頼性の高い方法で実行する必要があります。この研究では、誘導点に基づくスケーラブルなスパース近似の数値安定性を調査します。そのために、まず数値安定性を確認し、ガウス過程モデルが不安定になる可能性のある一般的な状況を示します。補間の文献で最初に開発された安定性理論に基づいて、実行される計算が数値的に安定するための誘導点に関する十分な条件、および場合によっては必要な条件を導きます。地理空間モデリングなどの低次元タスクの場合、これらの条件を満たす誘導点を計算するための自動化された方法を提案します。これは、独立した関心事であるカバーツリーデータ構造を変更することによって行われます。さらに、ガウス尤度による回帰の代替スパース近似を提案します。この近似では、パフォーマンスを少し犠牲にして、安定性をさらに向上させます。計算の安定性と、空間タスクにおける誘導ポイント法の予測パフォーマンスの関係を示す実例を示します。

On Tail Decay Rate Estimation of Loss Function Distributions
損失関数分布の裾部減衰率推定

The study of loss-function distributions is critical to characterize a model’s behaviour on a given machine-learning problem. While model quality is commonly measured by the average loss assessed on a testing set, this quantity does not ascertain the existence of the mean of the loss distribution. Conversely, the existence of a distribution’s statistical moments can be verified by examining the thickness of its tails. Cross-validation schemes determine a family of testing loss distributions conditioned on the training sets. By marginalizing across training sets, we can recover the overall (marginal) loss distribution, whose tail-shape we aim to estimate. Small sample-sizes diminish the reliability and efficiency of classical tail-estimation methods like Peaks-Over-Threshold, and we demonstrate that this effect is notably significant when estimating tails of marginal distributions composed of conditional distributions with substantial tail-location variability. We mitigate this problem by utilizing a result we prove: under certain conditions, the marginal-distribution’s tail-shape parameter is the maximum tail-shape parameter across the conditional distributions underlying the marginal. We label the resulting approach “cross-tail estimation” (CTE). We test CTE in a series of experiments on simulated and real data showing the improved robustness and quality of tail estimation as compared to classical approaches.

損失関数分布の研究は、特定の機械学習問題におけるモデルの挙動を特徴付けるために重要です。モデルの品質は、テストセットで評価される平均損失によって一般的に測定されますが、この量では損失分布の平均の存在は確認されません。逆に、分布の統計モーメントの存在は、分布の裾の厚さを調べることで確認できます。クロス検証スキームは、トレーニングセットを条件とするテスト損失分布のファミリーを決定します。トレーニングセット全体で周辺化することで、全体的な(周辺)損失分布を復元でき、その裾の形状を推定することを目指します。サンプルサイズが小さいと、Peaks-Over-Thresholdなどの従来の裾推定方法の信頼性と効率が低下します。この影響は、裾の位置の変動が大きい条件付き分布で構成される周辺分布の裾を推定するときに特に重要であることを実証します。私たちは、証明した結果を利用してこの問題を軽減します。特定の条件下では、周辺分布のテール形状パラメータは、周辺分布の基礎となる条件付き分布全体で最大のテール形状パラメータになります。結果として得られるアプローチを「クロステール推定(CTE)」と呼びます。シミュレーションデータと実際のデータを使用した一連の実験でCTEをテストし、従来のアプローチと比較してテール推定の堅牢性と品質が向上していることを示します。
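
The max-over-conditionals idea can be sketched as follows: estimate a tail-shape parameter separately for each conditional sample (for example, the test losses from each training set) and take the maximum. The sketch swaps in the Hill estimator, which is simpler than Peaks-Over-Threshold and applies only to positive, heavy-tailed data. The function names and the choice of `k` are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of the tail-shape parameter from the k largest values.

    Requires positive data and 0 < k < len(sample).
    """
    xs = sorted(sample, reverse=True)
    threshold = xs[k]  # the (k+1)-th largest observation
    return sum(math.log(x / threshold) for x in xs[:k]) / k


def cross_tail_estimate(conditional_samples, k):
    """Maximum per-group tail-shape estimate across the conditional samples."""
    return max(hill_estimator(s, k) for s in conditional_samples)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic "conditional" loss samples with Pareto tails;
    # paretovariate(alpha) has tail-shape parameter 1 / alpha.
    light = [rng.paretovariate(4.0) for _ in range(20000)]  # shape 0.25
    heavy = [rng.paretovariate(2.0) for _ in range(20000)]  # shape 0.50
    print(cross_tail_estimate([light, heavy], k=1000))
```

In this synthetic setup the heavier group determines the result, which mirrors the statement that the marginal tail shape equals the largest conditional tail shape under the paper's conditions.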

Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces
無限次元空間間の演算子の深いノンパラメトリック推定

Learning operators between infinitely dimensional spaces is an important learning task arising in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.

無限次元空間間の演算子の学習は、機械学習、画像科学、数学的モデリング、シミュレーションなどで生じる重要な学習タスクです。この論文では、ディープニューラルネットワークを使用したLipschitz演算子のノンパラメトリック推定について検討します。適切に選択されたネットワーククラスに対する経験的リスク最小化器の一般化誤差の非漸近的上限が導出されます。ターゲット演算子が低次元構造を示すという仮定の下で、誤差境界はトレーニングサンプルサイズが増加するにつれて減少し、推定の固有次元に応じて魅力的な高速レートが実現します。この仮定は実際のアプリケーションのほとんどのシナリオをカバーしており、演算子推定でデータの低次元構造を利用することで高速レートを実現しています。また、ネットワーク構造(ネットワークの幅、深さ、スパース性など)がニューラルネットワーク推定器の一般化誤差に与える影響を調査し、学習効率を定量的に最大化するためのネットワーク構造の選択に関する一般的な提案を行います。

Post-Regularization Confidence Bands for Ordinary Differential Equations
常微分方程式の正則化後の信頼帯

Ordinary differential equations (ODEs) are an important tool for studying systems of biological and physical processes. A central question in ODE modeling is to infer the significance of the individual regulatory effect of one signal variable on another. However, building a confidence band for an ODE with unknown regulatory relations is challenging, and it remains largely an open question. In this article, we construct the post-regularization confidence band for the individual regulatory function in an ODE with unknown functionals and noisy data observations. Our proposal is the first of its kind, and is built on two novel ingredients. The first is a new localized kernel learning approach that combines reproducing kernel learning with local Taylor approximation, and the second is a new de-biasing method that tackles infinite-dimensional functionals and additional measurement errors. We show that the constructed confidence band has the desired asymptotic coverage probability, and the recovered regulatory network approaches the truth with probability tending to one. We establish the theoretical properties when the number of variables in the system can be either smaller or larger than the number of sampling time points, and we study the regime-switching phenomenon. We demonstrate the efficacy of the proposed method through both simulations and illustrations with two data applications.

常微分方程式(ODE)は、生物学的および物理的プロセスのシステムを研究するための重要なツールです。ODEモデリングにおける中心的な問題は、ある信号変数が別の信号変数に及ぼす個々の調節効果の重要性を推測することです。しかし、未知の調節関係を持つODEの信頼帯を構築することは困難であり、ほとんど未解決の問題のままです。この記事では、未知の関数とノイズの多いデータ観測を持つODEの個々の調節関数の正規化後の信頼帯を構築します。私たちの提案は、この種のものとしては初めてのものであり、2つの新しい要素に基づいています。1つ目は、再現カーネル学習とローカルテイラー近似を組み合わせた新しいローカルカーネル学習アプローチであり、2つ目は、無限次元関数と追加の測定誤差に対処する新しいバイアス除去方法です。構築された信頼帯には目的の漸近的カバレッジ確率があり、回復された調節ネットワークは確率が1に近づくにつれて真実に近づくことを示します。システム内の変数の数がサンプリング時点の数より少ないか多い場合の理論的特性を確立し、レジーム切り替え現象を研究します。提案された方法の有効性を、2つのデータアプリケーションを使用したシミュレーションと図解の両方を通じて実証します。

On the Generalization of Stochastic Gradient Descent with Momentum
モメンタム付き確率的勾配降下法の汎化について

While momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models, there is little theoretical understanding on the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes, and show that it can train machine learning models for multiple epochs with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, sample size, and momentum. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. SGDEM improves the generalization error of SGDM when training ResNet-18 on ImageNet in practical distributed settings.

機械学習モデルのトレーニングでは、モメンタムベースの加速型確率的勾配降下法(SGD)が広く使用されていますが、このような方法の一般化誤差に関する理論的理解はほとんどありません。この研究では、まず、標準ヘビーボールモメンタム(SGDM)を使用したSGDの複数エポックの安定性ギャップが無制限になる凸損失関数が存在することを示します。次に、滑らかなLipschitz損失関数について、修正されたモメンタムベースの更新規則、つまり初期モメンタムを使用したSGD (SGDEM)を幅広いステップサイズで解析し、それが一般化を保証しながら複数のエポックで機械学習モデルをトレーニングできることを示します。最後に、強く凸状の損失関数の特殊なケースについて、SGDEMの特殊な形式である標準SGDMの複数エポックも一般化されるようなモメンタムの範囲を見つけます。一般化に関する結果を拡張して、トレーニングステップの数、サンプルサイズ、およびモメンタムの観点から、予想される真のリスクの上限も作成します。実験評価により、数値結果と理論上の境界の一貫性が検証されます。SGDEMは、実際の分散設定でImageNet上のResNet-18をトレーニングするときに、SGDMの一般化エラーを改善します。
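
The heavy-ball update and the "early momentum" idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the SGDEM update rule, so switching the momentum coefficient to zero after a fixed number of steps (`momentum_steps`) is an assumption, and the function name, the toy least-squares data, and all hyperparameters are hypothetical.

```python
import random


def sgd_early_momentum(grad, w0, data, lr=0.05, mu=0.9,
                       momentum_steps=20, epochs=50, seed=0):
    """Heavy-ball SGD whose momentum term is switched off after
    `momentum_steps` updates (assumed reading of "early momentum")."""
    rng = random.Random(seed)
    w, w_prev = w0, w0
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # one pass over the data in random order
        for i in order:
            m = mu if step < momentum_steps else 0.0
            # heavy-ball step: gradient step plus m * (previous displacement)
            w_next = w + m * (w - w_prev) - lr * grad(w, data[i])
            w_prev, w = w, w_next
            step += 1
    return w


# Toy problem (hypothetical): fit y = w * x with noise-free data, true w = 2.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
grad = lambda w, s: 2.0 * (w * s[0] - s[1]) * s[0]
w_hat = sgd_early_momentum(grad, 0.0, data)
```

Setting `momentum_steps` to a value at least as large as the total number of updates recovers standard heavy-ball SGD (SGDM) in this sketch.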

Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension
ネットワーク Lasso のクラスター構造の追求: 回復条件と非凸拡張

Network lasso (NL for short) is a technique for estimating models by simultaneously clustering data samples and fitting the models to them. It often succeeds in forming clusters thanks to the geometry of the sum of $\ell_2$ norm employed therein, but there may be limitations due to the convexity of the regularizer. This paper focuses on clustering generated by NL and strengthens it by creating a non-convex extension, called network trimmed lasso (NTL for short). Specifically, we initially investigate a sufficient condition that guarantees the recovery of the latent cluster structure of NL on the basis of the result of Sun et al. (2021) for convex clustering, which is a special case of NL for ordinary clustering. Second, we extend NL to NTL to incorporate a cardinality (or, $\ell_0$-)constraint and rewrite the constrained optimization problem defined with the $\ell_0$ norm, a discontinuous function, into an equivalent unconstrained continuous optimization problem. We develop ADMM algorithms to solve NTL and show their convergence results. Numerical illustrations indicate that the non-convex extension provides a more clear-cut cluster structure when NL fails to form clusters without incorporating prior knowledge of the associated parameters.

ネットワークラッソ(略してNL)は、データサンプルのクラスタリングとそれに対するモデルのフィッティングを同時に行うことでモデルを推定する手法です。NLは、そこで使用される$\ell_2$ノルムの和の幾何学により、多くの場合クラスターの形成に成功しますが、正則化子の凸性により制限がある場合があります。この論文では、NLによって生成されるクラスタリングに焦点を当て、ネットワークトリムラッソ(略してNTL)と呼ばれる非凸拡張を作成することでクラスタリングを強化します。具体的には、まず、通常のクラスタリングに対するNLの特殊ケースである凸クラスタリングに対するSunら(2021)の結果に基づいて、NLの潜在的なクラスター構造の回復を保証する十分条件を調査します。次に、NLをNTLに拡張してカーディナリティ(または$\ell_0$-)制約を組み込み、不連続関数である$\ell_0$ノルムで定義された制約付き最適化問題を、同等の制約なし連続最適化問題に書き換えます。NTLを解決するためのADMMアルゴリズムを開発し、その収束結果を示します。数値図は、関連するパラメータの事前知識を組み込まずにNLがクラスターを形成できない場合、非凸拡張によってより明確なクラスター構造が提供されることを示しています。

Iterate Averaging in the Quest for Best Test Error
最良のテスト誤差を求める反復平均化

We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena from our theoretical results: (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved generalisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent (SGD), require less tuning and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures.

私たちは、高次元二次関数上の真のリスク面とバッチリスク面の間のガウス過程摂動モデルを使用して、反復平均化の一般化パフォーマンスの向上を分析および説明します。理論的結果から、次の3つの現象を導き出しました。(1)反復平均化(IA)を大きな学習率および正則化と組み合わせて一般化を改善することの重要性。(2)平均化の頻度を減らす正当性。(3)適応型勾配法は、反復平均化を使用すると、非適応型のものと同じかそれ以上にうまく機能すると予想されること。これらの結果と、反復の解の多様性に対する適切な正則化の重要性に関する実証的調査に着想を得て、反復平均化を使用する2つの適応型アルゴリズムを提案します。これらは、確率的勾配降下法(SGD)と比較して大幅に優れた結果をもたらし、必要なチューニングが少なく、早期停止や検証セットの監視を必要としません。私たちは、さまざまな最新および従来のネットワークアーキテクチャ上のCIFAR-10/100、ImageNet、Penn Treebankデータセットで私たちのアプローチの有効性を紹介します。
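
As a rough illustration of iterate averaging, the sketch below runs SGD on a one-dimensional noisy quadratic and keeps a running mean of the iterates after a burn-in. It is not one of the two algorithms proposed in the paper; the function name, the `avg_start` and `avg_every` parameters, and the toy objective are assumptions made for the example.

```python
import random


def sgd_with_iterate_averaging(grad, w0, steps, lr, avg_start,
                               avg_every=1, seed=0):
    """Plain SGD that also returns the average of the iterates visited
    from step `avg_start` on, sampled every `avg_every` steps."""
    rng = random.Random(seed)
    w = w0
    avg, n_avg = 0.0, 0
    for t in range(steps):
        w = w - lr * grad(w, rng)
        if t >= avg_start and (t - avg_start) % avg_every == 0:
            n_avg += 1
            avg += (w - avg) / n_avg  # running mean of the sampled iterates
    return w, avg


# Noisy gradient of 0.5 * (w - 1)^2; the true minimiser is w = 1.
noisy_grad = lambda w, rng: (w - 1.0) + rng.gauss(0.0, 1.0)
last, averaged = sgd_with_iterate_averaging(
    noisy_grad, 5.0, steps=5000, lr=0.1, avg_start=1000)
```

With a constant, fairly large learning rate the last iterate keeps fluctuating around the minimiser, while the averaged iterate smooths that noise out. Increasing `avg_every` corresponds to the less frequent averaging mentioned in point (2).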

Nonparametric Inference under B-bits Quantization
Bビット量子化におけるノンパラメトリック推論

Statistical inference based on lossy or incomplete samples is often needed in research areas such as signal/image processing, medical image storage, remote sensing, and signal transmission. In this paper, we propose a nonparametric testing procedure based on samples quantized to $B$ bits through a computationally efficient algorithm. Under mild technical conditions, we establish the asymptotic properties of the proposed test statistic and investigate how the testing power changes as $B$ increases. In particular, we show that if $B$ exceeds a certain threshold, the proposed nonparametric testing procedure achieves the classical minimax rate of testing (Shang and Cheng, 2015) for spline models. We further extend our theoretical investigations to a nonparametric linearity test and an adaptive nonparametric test, expanding the applicability of the proposed methods. Extensive simulation studies together with a real-data analysis are used to demonstrate the validity and effectiveness of the proposed tests.

信号/画像処理、医療画像ストレージ、リモートセンシング、信号伝送などの研究分野では、損失のあるサンプルや不完全なサンプルに基づく統計的推論が必要になることがよくあります。この論文では、計算効率の高いアルゴリズムによって$B$ビットに量子化されたサンプルに基づくノンパラメトリック検定手順を提案します。軽度の技術的条件下で、提案された検定統計量の漸近特性を確立し、$B$が増加するにつれて検定力がどのように変化するかを調べます。特に、$B$が特定のしきい値を超えると、提案されたノンパラメトリック検定手順はスプラインモデルの古典的なミニマックス検定率(Shang and Cheng、2015)を達成することを示します。さらに、理論的調査をノンパラメトリック線形性検定と適応型ノンパラメトリック検定に拡張し、提案された方法の適用範囲を拡大します。広範なシミュレーション研究(実際のデータ分析と併せて)を使用して、提案された検定の妥当性と有効性を実証します。
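
To make "samples quantized to $B$ bits" concrete, here is a minimal sketch of a uniform scalar quantizer. The abstract does not specify the quantization scheme used in the paper, so the uniform cells, the `[lo, hi]` range, and the function names are assumptions for illustration only.

```python
def quantize(x, bits, lo=0.0, hi=1.0):
    """Map x (clipped to [lo, hi]) to one of 2**bits integer codes,
    using equal-width cells."""
    levels = 2 ** bits
    x = min(max(x, lo), hi)
    # the right endpoint hi is assigned to the last cell
    return min(int((x - lo) / (hi - lo) * levels), levels - 1)


def dequantize(code, bits, lo=0.0, hi=1.0):
    """Reconstruct a value at the midpoint of the cell indexed by `code`."""
    levels = 2 ** bits
    return lo + (code + 0.5) * (hi - lo) / levels


code = quantize(0.3, bits=2)         # cell index in {0, 1, 2, 3}
approx = dequantize(code, bits=2)    # midpoint of that cell
```

In this scheme the reconstruction error is at most `(hi - lo) / 2**(bits + 1)`, so it halves with every extra bit. This is one simple way to see why the loss from quantization can become negligible once $B$ is large enough.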

Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box
決定論的目的関数によるブラックボックス変分推論:より速く、より正確に、さらにブラックボックス化

Automatic differentiation variational inference (ADVI) offers fast and easy-to-use posterior approximation in multiple modern probabilistic programming languages. However, its stochastic optimizer lacks clear convergence criteria and requires tuning parameters. Moreover, ADVI inherits the poor posterior uncertainty estimates of mean-field variational Bayes (MFVB). We introduce “deterministic ADVI” (DADVI) to address these issues. DADVI replaces the intractable MFVB objective with a fixed Monte Carlo approximation, a technique known in the stochastic optimization literature as the “sample average approximation” (SAA). By optimizing an approximate but deterministic objective, DADVI can use off-the-shelf second-order optimization, and, unlike standard mean-field ADVI, is amenable to more accurate posterior covariances via linear response (LR). In contrast to existing worst-case theory, we show that, on certain classes of common statistical problems, DADVI and the SAA can perform well with relatively few samples even in very high dimensions, though we also show that such favorable results cannot extend to variational approximations that are too expressive relative to mean-field ADVI. We show on a variety of real-world problems that DADVI reliably finds good solutions with default settings (unlike ADVI) and, together with LR covariances, is typically faster and more accurate than standard ADVI.

自動微分変分推論(ADVI)は、複数の最新の確率的プログラミング言語で、高速で使いやすい事後近似を提供します。ただし、その確率的最適化には明確な収束基準がなく、調整パラメータが必要です。さらに、ADVIは平均場変分ベイズ(MFVB)の事後不確実性推定値の低さを継承しています。これらの問題に対処するために、「決定論的ADVI」(DADVI)を導入します。DADVIは、扱いにくいMFVB目的を固定モンテカルロ近似に置き換えます。この手法は、確率的最適化の文献では「サンプル平均近似」(SAA)として知られています。近似的だが決定論的な目的を最適化することで、DADVIは既製の2次最適化を使用でき、標準的な平均場ADVIとは異なり、線形応答(LR)を介してより正確な事後共分散を利用できます。既存の最悪ケース理論とは対照的に、一般的な統計問題の特定のクラスでは、DADVIとSAAは、非常に高い次元でも比較的少ないサンプルで良好なパフォーマンスを発揮できることを示しています。ただし、このような好ましい結果は、平均場ADVIに比べて表現力が強すぎる変分近似には適用できないことも示しています。さまざまな現実の問題で、DADVIはデフォルト設定で確実に良好なソリューションを見つけられること(ADVIとは異なります)、LR共分散と組み合わせると、通常は標準のADVIよりも高速で正確であることを示します。
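
The sample average approximation (SAA) idea can be sketched in a few lines: draw the Monte Carlo noise once, freeze it, and minimise the resulting deterministic objective. The example below is a toy under stated assumptions and not the DADVI implementation. The target is a one-dimensional Gaussian N(3, 0.5^2), the variational family is N(m, s^2) with s = exp(rho), and plain gradient descent stands in for the second-order optimizer mentioned in the abstract. All names are hypothetical.

```python
import math
import random


def saa_fit(n_draws=200, lr=0.05, steps=2000, seed=0):
    """Minimise F(m, rho) = -(1/N) sum_i log p(m + exp(rho) * z_i) - rho
    with log p(theta) = -2 * (theta - 3)^2, using FIXED draws z_i."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]  # drawn once, then frozen
    m, rho = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(rho)
        resid = [m + s * zi - 3.0 for zi in z]
        # exact gradients of the deterministic objective F
        g_m = 4.0 * sum(resid) / n_draws
        g_rho = 4.0 * s * sum(r * zi for r, zi in zip(resid, z)) / n_draws - 1.0
        m -= lr * g_m
        rho -= lr * g_rho
    return m, math.exp(rho), z


m_hat, s_hat, z = saa_fit()
```

Because the draws are frozen, the objective has a well-defined minimiser and an ordinary convergence check applies. In this toy case the minimiser is also available in closed form, namely s = 1 / (2 * std(z)) and m = 3 - s * mean(z), which is close to, but not exactly, the true values (0.5 and 3) because of the finite number of draws.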

On Sufficient Graphical Models
十分なグラフィカルモデルについて

We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on the high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.

私たちは、最近開発された非線形の十分な次元削減手法を条件付き独立性の評価に適用することで、十分なグラフィカルモデルを導入します。グラフィカルモデルは、ガウスまたはコピュラガウスの仮定などの分布仮定を行わないため、本質的にノンパラメトリックです。ただし、条件付き独立性を特徴付けるために高次元カーネルに依存する完全にノンパラメトリックなグラフィカルモデルとは異なり、グラフィカルモデルは、十分な予測子のセットが与えられ、次元が大幅に削減された条件付き独立性に基づいています。このようにして、高次元カーネルに伴う次元の呪いを回避します。推定値の集団レベルの特性、収束率、および変数選択の一貫性を開発します。シミュレーション比較とDREAM 4 Challengeデータセットの分析により、ガウスまたはコピュラガウスの仮定に違反した場合に、この方法が既存の方法よりも優れており、高次元設定でも優れたパフォーマンスを維持することを実証します。

Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond
局所的バイアス除去機械学習:分位点処理効果とその先に関する効率的な推論

We consider estimating a low-dimensional parameter in an estimating equation involving high-dimensional nuisance functions that depend on the target parameter as an input. A central example is the efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference, which involves the covariate-conditional cumulative distribution function evaluated at the quantile to be estimated. Existing approaches based on flexibly estimating the nuisances and plugging in the estimates, such as debiased machine learning (DML), require we learn the nuisance at all possible inputs. For (L)QTE, DML requires we learn the whole covariate-conditional cumulative distribution function. We instead propose localized debiased machine learning (LDML), which avoids this burdensome step and needs only estimate nuisances at a single initial rough guess for the target parameter. For (L)QTE, LDML involves learning just two regression functions, a standard task for machine learning methods. We prove that under lax rate conditions our estimator has the same favorable asymptotic behavior as the infeasible estimator that uses the unknown true nuisances. Thus, LDML notably enables practically-feasible and theoretically-grounded efficient estimation of important quantities in causal inference such as (L)QTEs when we must control for many covariates and/or flexible relationships, as we demonstrate in empirical studies.

私たちは、入力としてターゲットパラメータに依存する高次元のニューサンス関数を含む推定方程式で低次元パラメータを推定することを検討します。代表的な例は、因果推論における（局所）分位処理効果（（L）QTE）の効率的な推定方程式であり、推定する分位で評価される共変量条件付き累積分布関数を含む。ニューサンスを柔軟に推定し、推定値を差し込むことに基づく既存のアプローチ、例えばバイアス除去機械学習（DML）では、すべての可能な入力でニューサンスを学習する必要があります。（L）QTEの場合、DMLでは共変量条件付き累積分布関数全体を学習する必要があります。我々は代わりに、この面倒なステップを回避し、ターゲットパラメータの単一の初期の大まかな推測でニューサンスを推定するだけでよい、局所的バイアス除去機械学習（LDML）を提案します。（L）QTEの場合、LDMLでは機械学習法の標準的なタスクである2つの回帰関数のみを学習する必要があります。緩いレート条件下では、私たちの推定量は、未知の真のニューサンスを使用する実行不可能な推定量と同じ好ましい漸近挙動を示すことを証明します。したがって、LDMLは、実証研究で実証されているように、多くの共変量や柔軟な関係を制御する必要がある場合、(L)QTEなどの因果推論における重要な量の実際的に実行可能で理論に基づいた効率的な推定を可能にします。

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks
初期化の影響について:2層ニューラルネットワークのスケーリングパス

In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with nonzero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal-transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite-dimensional convex counterpart. We formulate the corresponding functional-optimization problem and investigate its main properties. In particular, we show that, as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. Numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly, even beyond these extreme points.

教師あり学習では、正規化パスは、ゼロから初期化される勾配降下法の最適化パスの便利な理論的代理として使用されることがあります。この論文では、異なるスケールで重みの非ゼロ初期分布を持つ無限幅2層ReLUニューラルネットワークの正規化パスの修正について検討します。不均衡な最適輸送理論との関連を利用して、2層ネットワークトレーニングの非凸性にもかかわらず、この問題は無限次元の凸対応を許容することを示します。対応する機能最適化問題を定式化し、その主な特性を調査します。特に、初期化のスケールが$0$から$+\infty$の範囲にある場合、関連するパスはいわゆるカーネルレジームとリッチレジームの間を連続的に補間することを示します。数値実験により、私たちの設定では、スケーリングパスと最適化パスの最終状態は、これらの極端なポイントを超えても同様に動作することが確認されています。

Improving physics-informed neural networks with meta-learned optimization
メタ学習最適化による物理情報に基づくニューラルネットワークの改善

We show that the error achievable using physics-informed neural networks for solving differential equations can be substantially reduced when these networks are trained using meta-learned optimization methods rather than using fixed, hand-crafted optimizers as traditionally done. We choose a learnable optimization method based on a shallow multi-layer perceptron that is meta-trained for specific classes of differential equations. We illustrate meta-trained optimizers for several equations of practical relevance in mathematical physics, including the linear advection equation, Poisson’s equation, the Korteweg-de Vries equation and Burgers’ equation. We also illustrate that meta-learned optimizers exhibit transfer learning abilities, in that a meta-trained optimizer on one differential equation can also be successfully deployed on another differential equation.

私たちは、微分方程式を解くために物理学に基づいたニューラルネットワークを使用して達成可能な誤差は、これらのネットワークが従来のように固定された手作りの最適化子を使用するのではなく、メタ学習された最適化方法を使用して訓練された場合に大幅に減少できることを示します。私たちは、微分方程式の特定のクラスに対してメタトレーニングされた浅い多層パーセプトロンに基づく学習可能な最適化手法を選択します。線形移流方程式、ポアソン方程式、Korteweg-de Vries方程式、Burgers方程式など、数理物理学で実用的に関連するいくつかの方程式について、メタトレーニング済みのオプティマイザを示します。また、メタ学習オプティマイザーは転移学習能力を示し、ある微分方程式でメタ学習されたオプティマイザーを別の微分方程式にもうまく展開できることも示しています。

A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent
確率的勾配降下法に対する連続時間近似の比較

Applying a stochastic gradient descent (SGD) method for minimizing an objective gives rise to a discrete-time process of estimated parameter values. In order to better understand the dynamics of the estimated values, many authors have considered continuous-time approximations of SGD. We refine existing results on the weak error of first-order ODE and SDE approximations to SGD for non-infinitesimal learning rates. In particular, we explicitly compute the linear term in the error expansion of gradient flow and two of its stochastic counterparts, with respect to a discretization parameter $h$. In the example of linear regression, we demonstrate the general inferiority of the deterministic gradient flow approximation in comparison to the stochastic ones, for batch sizes which are not too large. Further, we demonstrate that for Gaussian features an SDE approximation with state-independent noise (CC) is preferred over using a state-dependent coefficient (NCC). The same comparison holds true for features of low kurtosis or large batch sizes. However, the relationship reverses for highly leptokurtic features or small batch sizes.

確率的勾配降下法(SGD)を目的関数の最小化に適用すると、推定パラメータ値の離散時間プロセスが生じます。推定値のダイナミクスをより深く理解するために、多くの著者がSGDの連続時間近似を検討してきました。私たちは、非無限小学習率に対するSGDへの1次ODEおよびSDE近似の弱誤差に関する既存の結果を改良します。特に、離散化パラメータ$h$に関して、勾配フローとその確率的対応物2つの誤差展開における線形項を明示的に計算します。線形回帰の例では、バッチサイズが大きすぎない場合、確率的近似と比較して決定論的勾配フロー近似が一般に劣っていることを示します。さらに、ガウス特徴の場合、状態依存係数(NCC)を使用するよりも状態非依存ノイズ(CC)を使用したSDE近似が望ましいことを示します。同じ比較は、尖度が低い特徴やバッチサイズが大きい特徴にも当てはまります。ただし、尖度が高い特徴やバッチサイズが小さい特徴では、関係が逆になります。

Critically Assessing the State of the Art in Neural Network Verification
ニューラルネットワーク検証の最先端を批判的に評価する

Recent research has proposed various methods to formally verify neural networks against minimal input perturbations; this verification task is also known as local robustness verification. The research area of local robustness verification is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability modulo theories. At the same time, the problem instances encountered when performing local robustness verification differ based on the network to be verified, the property to be verified and the specific network input. This raises the question of which verification algorithm is most suitable for solving specific types of instances of the local robustness verification problem. To answer this question, we performed a systematic performance analysis of several CPU- and GPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, of which we verified a broad range of robustness properties, while taking a practitioner’s point of view — a perspective that complements the insights from initiatives such as the VNN competition, where the participating tools are carefully adapted to the given benchmarks by their developers. Notably, we show that no single best algorithm dominates performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. We quantify this complementarity using various performance measures, such as the Shapley value. Furthermore, we confirm the notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported.

最近の研究では、最小限の入力変動に対してニューラルネットワークを形式的に検証するさまざまな方法が提案されています。この検証タスクは、ローカルロバストネス検証とも呼ばれます。ローカルロバストネス検証の研究分野は非常に多様で、検証者は混合整数計画法や満足可能性法理論など、多数の手法に依存しています。同時に、ローカルロバストネス検証を実行するときに発生する問題のインスタンスは、検証対象のネットワーク、検証対象のプロパティ、および特定のネットワーク入力によって異なります。これにより、ローカルロバストネス検証問題の特定のタイプのインスタンスを解決するのに最も適した検証アルゴリズムはどれかという疑問が生じます。この疑問に答えるために、私たちは、79個のニューラルネットワークの新しく慎重に組み立てられたセットに対して、いくつかのCPUベースおよびGPUベースのローカルロバストネス検証システムの体系的なパフォーマンス分析を実行しました。これらのニューラルネットワークでは、幅広いロバストネスプロパティを検証しましたが、その際に実践者の視点を取り入れました。この視点は、参加ツールが開発者によって特定のベンチマークに慎重に適応されるVNNコンペティションなどのイニシアチブからの洞察を補完するものです。特に、すべての検証問題インスタンスでパフォーマンスを左右する単一の最善のアルゴリズムは存在しないことを示しています。代わりに、私たちの結果は検証者のパフォーマンスの相補性を明らかにし、アルゴリズムポートフォリオを活用してより効率的なローカル堅牢性検証を実現する可能性を示しています。この相補性は、Shapley値などのさまざまなパフォーマンスメトリックを使用して定量化します。さらに、ほとんどのアルゴリズムはReLUベースのネットワークのみをサポートし、他のアクティベーション関数は十分にサポートされていないという考えを確認します。
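
The Shapley value mentioned above assigns each verifier its average marginal contribution to a portfolio. A minimal sketch follows, assuming the portfolio's value is the number of instances solved by at least one member. The verifier names `A`, `B`, `C` and the solved-instance sets are made-up toy data, not results from the paper, and the paper's actual performance measure may differ.

```python
from itertools import permutations


def shapley_values(solved):
    """solved: dict mapping a verifier name to the set of instance ids it
    solves. Returns each verifier's Shapley value for the coverage game."""
    names = list(solved)

    def value(coalition):
        # instances solved by at least one verifier in the coalition
        return len(set().union(*(solved[n] for n in coalition)))

    totals = {n: 0.0 for n in names}
    orders = list(permutations(names))
    for order in orders:
        seen = []
        for n in order:
            totals[n] += value(seen + [n]) - value(seen)  # marginal gain
            seen.append(n)
    return {n: t / len(orders) for n, t in totals.items()}


solved = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4}}
phi = shapley_values(solved)
# phi == {"A": 2.5, "B": 1.0, "C": 0.5}; the values sum to the 4 instances covered
```

Enumerating all orderings costs n! evaluations, which is fine for a handful of verifiers. Larger portfolios would need sampling or a closed form specific to the chosen value function.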

Estimating the Minimizer and the Minimum Value of a Regression Function under Passive Design
パッシブ設計における回帰関数の最小化器と最小値の推定

We propose a new method for estimating the minimizer $\boldsymbol{x}^*$ and the minimum value $f^*$ of a smooth and strongly convex regression function $f$ from the observations contaminated by random noise. Our estimator $\boldsymbol{z}_n$ of the minimizer $\boldsymbol{x}^*$ is based on a version of the projected gradient descent with the gradient estimated by a regularized local polynomial algorithm. Next, we propose a two-stage procedure for estimation of the minimum value $f^*$ of regression function $f$. At the first stage, we construct an accurate enough estimator of $\boldsymbol{x}^*$, which can be, for example, $\boldsymbol{z}_n$. At the second stage, we estimate the function value at the point obtained in the first stage using a rate optimal nonparametric procedure. We derive non-asymptotic upper bounds for the quadratic risk and optimization risk of $\boldsymbol{z}_n$, and for the risk of estimating $f^*$. We establish minimax lower bounds showing that, under certain choice of parameters, the proposed algorithms achieve the minimax optimal rates of convergence on the class of smooth and strongly convex functions.

私たちは、ランダムノイズに汚染された観測値から、滑らかで強く凸な回帰関数$f$の最小点$\boldsymbol{x}^*$と最小値$f^*$を推定する新しい方法を提案します。最小点$\boldsymbol{x}^*$の推定量$\boldsymbol{z}_n$は、正則化された局所多項式アルゴリズムによって推定された勾配を持つ投影勾配降下法のバージョンに基づきます。次に、回帰関数$f$の最小値$f^*$を推定する2段階の手順を提案します。第1段階では、十分に正確な$\boldsymbol{x}^*$の推定量(たとえば$\boldsymbol{z}_n$)を構築します。第2段階では、速度最適ノンパラメトリック手順を使用して、第1段階で取得したポイントでの関数値を推定します。私たちは、$\boldsymbol{z}_n$の二次リスクと最適化リスク、および$f^*$を推定するリスクの非漸近的な上限を導出します。特定のパラメータ選択の下で、提案されたアルゴリズムが滑らかで強い凸関数のクラスでミニマックス最適収束率を達成することを示すミニマックス下限を確立します。

Modeling Random Networks with Heterogeneous Reciprocity
異種相反性を持つランダムネットワークのモデル化

Reciprocity, or the tendency of individuals to mirror behavior, is a key measure that describes information exchange in a social network. Users in social networks tend to engage in different levels of reciprocal behavior. Differences in such behavior may indicate the existence of communities that reciprocate links at varying rates. In this paper, we develop methodology to model the diverse reciprocal behavior in growing social networks. In particular, we present a preferential attachment model with heterogeneous reciprocity that imitates the attraction users have for popular users, plus the heterogeneous nature by which they reciprocate links. We compare Bayesian and frequentist model fitting techniques for large networks, as well as computationally efficient variational alternatives. Cases where the number of communities is known and unknown are both considered. We apply the presented methods to the analysis of Facebook and Reddit networks where users have non-uniform reciprocal behavior patterns. The fitted model captures the heavy-tailed nature of the empirical degree distributions in the datasets and identifies multiple groups of users that differ in their tendency to reply to and receive responses to wallposts and comments.

相互性、つまり個人が行動を真似する傾向は、ソーシャルネットワークでの情報交換を表す重要な尺度です。ソーシャルネットワークのユーザーは、さまざまなレベルの相互行動を行う傾向があります。このような行動の違いは、さまざまな速度でリンクを相互に交換するコミュニティの存在を示している可能性があります。この論文では、成長中のソーシャルネットワークにおける多様な相互行動をモデル化する方法を開発します。特に、ユーザーが人気のあるユーザーに抱く魅力と、ユーザーがリンクを相互に交換する異質性を模倣する、異質な相互性を持つ優先的アタッチメントモデルを紹介します。大規模ネットワークのベイズモデルと頻度モデルフィッティング手法、および計算効率の高い変分代替法を比較します。コミュニティの数がわかっている場合とわからない場合の両方を考慮します。提示した方法を、ユーザーの相互行動パターンが一様でないFacebookネットワークとRedditネットワークの分析に適用します。適合モデルは、データセット内の経験的度分布のヘビーテール特性を捉え、ウォールポストやコメントへの返信や応答の受信傾向が異なる複数のユーザーグループを識別します。

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment
放棄を伴う多腕バンディットにおける探索、活用、エンゲージメント

The traditional multi-armed bandit (MAB) model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where “A” stands for abandonment and the abandonment probability depends on the current recommended item and the user’s past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results show that the proposed algorithms have significantly lower regret than the traditional UCB and KL-UCB, and Q-learning-based algorithms.

推薦システムの従来の多腕バンディット(MAB)モデルでは、ユーザーが学習期間全体にわたってシステム内に留まることを前提としています。ALEKSなどの新しいオンライン教育プラットフォームやTikTokなどの新しい動画推薦システムでは、ユーザーがアプリに費やす時間は、推薦コンテンツの魅力度によって異なります。推薦されたアイテムがユーザーを引き付けない場合、ユーザーは一時的にシステムを離れることがあります。これらのシステムにおける探索、活用、エンゲージメントを理解するために、MAB-Aと呼ばれる新しいモデルを提案します。ここで、「A」は放棄を表し、放棄確率は現在の推薦アイテムとユーザーの過去の経験(状態と呼ばれる)によって異なります。ULCBとKL-ULCBという2つのアルゴリズムを提案します。どちらも、ユーザーが以前の推薦アイテムを気に入った場合は探索を増やし(楽観的)、そうでない場合は探索を減らします(悲観的)。ULCBとKL-ULCBの両方が対数後悔$O(\log K)$を達成することを証明します。ここで、$K$は訪問数(またはエピソード数)です。さらに、KL-ULCBにおける後悔の境界は漸近的に鋭い。提案されたアルゴリズムを一般状態設定にも拡張します。シミュレーション結果によると、提案されたアルゴリズムは、従来のUCBやKL-UCB、Q学習ベースのアルゴリズムよりも後悔が大幅に低いことがわかった。

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models
混合モデルにおけるノンパラメトリック最尤推定量の効率的でスケーラブルな計算について

In this paper, we focus on the computation of the nonparametric maximum likelihood estimator (NPMLE) in multivariate mixture models. Our approach discretizes this infinite dimensional convex optimization problem by setting fixed support points for the NPMLE and optimizing over the mixing proportions. We propose an efficient and scalable semismooth Newton based augmented Lagrangian method (ALM). Our algorithm outperforms the state-of-the-art methods (Kim et al., 2020; Koenker and Gu, 2017), capable of handling $n \approx 10^6$ data points with $m \approx 10^4$ support points. A key advantage of our approach is its strategic utilization of the solution’s sparsity, leading to structured sparsity in Hessian computations. As a result, our algorithm demonstrates better scaling in terms of $m$ when compared to the mixsqp method (Kim et al., 2020). The computed NPMLE can be directly applied to denoising the observations in the framework of empirical Bayes. We propose new denoising estimands in this context along with their consistent estimates. Extensive numerical experiments are conducted to illustrate the efficiency of our ALM. In particular, we employ our method to analyze two astronomy data sets: (i) Gaia-TGAS Catalog (Anderson et al., 2018) containing approximately $1.4 \times 10^6$ data points in two dimensions, and (ii) a data set from the APOGEE survey (Majewski et al., 2017) with approximately $2.7 \times 10^4$ data points.

この論文では、多変量混合モデルにおけるノンパラメトリック最大尤度推定量(NPMLE)の計算に焦点を当てています。私たちのアプローチは、NPMLEに固定サポートポイントを設定し、混合比率を最適化することで、この無限次元凸最適化問題を離散化します。効率的でスケーラブルな半滑らかなニュートンベースの拡張ラグランジュ法(ALM)を提案します。私たちのアルゴリズムは最先端の方法(Kimら, 2020; Koenker and Gu, 2017)よりも優れており、$m \approx 10^4$のサポートポイントを持つ$n \approx 10^6$データポイントを処理できます。私たちのアプローチの主な利点は、ソリューションのスパース性を戦略的に活用し、ヘッセ行列の計算で構造化されたスパース性を実現することです。その結果、私たちのアルゴリズムは、mixsqp法(Kimら, 2020)と比較して、$m$に関して優れたスケーリングを示します。計算されたNPMLEは、経験的ベイズの枠組みの中で観測データのノイズ除去に直接適用できます。この文脈で、私たちは新しいノイズ除去推定量とその一貫した推定値を提案します。私たちのALMの効率性を示すために、広範な数値実験が行われます。特に、私たちは私たちの方法を使用して、2つの天文学データセットを分析します。(i) 2次元で約$1.4 \times 10^6$のデータポイントを含むGaia-TGASカタログ(Andersonら、2018)、および(ii)約$2.7 \times 10^4$のデータポイントを含むAPOGEEサーベイ(Majewskiら、2017)のデータセットです。
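
Once the support points are fixed, the discretised NPMLE problem is to maximise the average of log(sum_j L_ij w_j) over mixing weights w on the probability simplex. The sketch below solves a tiny one-dimensional instance with the classical EM fixed-point update. This is deliberately not the semismooth Newton augmented Lagrangian method proposed in the paper. EM is swapped in because it fits in a few lines, and the unit-variance Gaussian kernel, function name, and toy data are assumptions.

```python
import math


def npmle_em(x, grid, iters=200):
    """EM updates for mixing weights over fixed support points `grid`."""
    n, m = len(x), len(grid)
    # L[i][j]: unit-variance Gaussian likelihood of x[i] at support point
    # grid[j] (the 1/sqrt(2*pi) constant is dropped; it cancels in the update)
    L = [[math.exp(-0.5 * (xi - g) ** 2) for g in grid] for xi in x]
    w = [1.0 / m] * m
    for _ in range(iters):
        mix = [sum(lij * wj for lij, wj in zip(row, w)) for row in L]
        # new w_j = w_j * average over i of L_ij / (mixture density at x_i)
        w = [wj * sum(L[i][j] / mix[i] for i in range(n)) / n
             for j, wj in enumerate(w)]
    return w


x = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
w = npmle_em(x, grid=[-2.0, 0.0, 2.0])
```

On this toy data the weight on the middle support point shrinks towards zero, which illustrates the sparsity of the solution that the abstract says the proposed method exploits.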

Decorrelated Variable Importance
非相関変数の重要度

Because of the widespread use of black box prediction methods such as random forests and neural nets, there is renewed interest in developing methods for quantifying variable importance as part of the broader goal of interpretable prediction. A popular approach is to define a variable importance parameter — known as LOCO (Leave Out COvariates) — based on dropping covariates from a regression model. This is essentially a nonparametric version of $R^2$. This parameter is very general and can be estimated nonparametrically, but it can be hard to interpret because it is affected by correlation between covariates. We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models.

ランダムフォレストやニューラルネットなどのブラックボックス予測手法が広く使用されているため、解釈可能な予測という広範な目標の一部として、変数の重要度を定量化する手法の開発に新たな関心が寄せられています。一般的なアプローチは、回帰モデルからの共変量のドロップに基づいて、LOCO(Leave Out COvariates)—と呼ばれる変数重要度パラメータ—を定義することです。これは基本的に、$R^2$のノンパラメトリックバージョンです。このパラメータは非常に一般的であり、ノンパラメトリックに推定できますが、共変量間の相関の影響を受けるため、解釈が難しい場合があります。LOCOの修正バージョンを定義することにより、相関の影響を軽減する方法を提案します。この新しいパラメーターをノンパラメトリックに推定することは困難ですが、セミパラメトリックモデルを使用して推定する方法を示します。
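
For reference, one common way to write the LOCO parameter is given below. The notation is an assumption made for this note, not taken from the abstract: $X$ denotes the covariate (or group of covariates) whose importance is assessed, $Z$ the remaining covariates, and $\mu$ the regression function.

```latex
\psi_{\mathrm{LOCO}}
  = \mathbb{E}\big[(Y - \mu(Z))^2\big] - \mathbb{E}\big[(Y - \mu(X, Z))^2\big]
  = \mathbb{E}\big[(\mu(X, Z) - \mu(Z))^2\big],
\qquad
\mu(x, z) = \mathbb{E}[Y \mid X = x, Z = z], \quad
\mu(z) = \mathbb{E}[Y \mid Z = z].
```

The second equality holds because $\mu(Z) = \mathbb{E}[\mu(X, Z) \mid Z]$, so the cross term vanishes. Dividing by $\mathrm{Var}(Y)$ gives the increase in explained variance from adding $X$, which is the sense in which the parameter is a nonparametric version of $R^2$. When $X$ is strongly correlated with $Z$, $\mu(Z)$ already captures most of what $X$ contributes and the parameter becomes small. This is the interpretability problem the abstract describes.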

Model-Free Representation Learning and Exploration in Low-Rank MDPs
低ランク MDP におけるモデルフリー表現の学習と探索

The low-rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low-rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.

低ランクのMDPは、強化学習における表現学習と探索を研究するための重要なモデルとして浮上しています。既知の表現では、モデルフリーの探索戦略がいくつか存在します。対照的に、未知の表現設定のすべてのアルゴリズムはモデルベースであるため、完全なダイナミクスをモデル化する機能が必要です。この研究では、低ランクMDPのための初のモデルフリー表現学習アルゴリズムを紹介します。アルゴリズムによる重要な貢献は、新しいミニマックス表現学習目標であり、統計的特性と計算特性のトレードオフが異なるバリアントを提供します。この表現学習ステップを、報酬のない方法で状態空間をカバーする探索戦略とインターリーブします。結果として得られるアルゴリズムは、サンプル効率が高いことが証明されており、複雑な環境に拡張するための一般的な関数近似に対応できます。

Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method
投影累乗法による相関ガウス・ウィグナーモデルのシード付きグラフマッチング

In the graph matching problem we observe two graphs $G,H$ and the goal is to find an assignment (or matching) between their vertices such that some measure of edge agreement is maximized. We assume in this work that the observed pair $G,H$ has been drawn from the Correlated Gaussian Wigner (CGW) model, a popular model for correlated weighted graphs, where the entries of the adjacency matrices of $G$ and $H$ are independent Gaussians and each edge of $G$ is correlated with one edge of $H$ (determined by the unknown matching) with the edge correlation described by a parameter $\sigma \in [0,1)$. In this paper, we analyse the performance of the projected power method (PPM) as a seeded graph matching algorithm where we are given an initial partially correct matching (called the seed) as side information. We prove that if the seed is close enough to the ground-truth matching, then with high probability, PPM iteratively improves the seed and recovers the ground-truth matching (either partially or exactly) in $O(\log n)$ iterations. Our results prove that PPM works even in regimes of constant $\sigma$, thus extending the analysis of Mao et al. (2023) for the sparse Correlated Erdős-Rényi (CER) model to the (dense) CGW model. As a byproduct of our analysis, we see that the PPM framework generalizes some of the state-of-the-art algorithms for seeded graph matching. We support and complement our theoretical findings with numerical experiments on synthetic data.

グラフマッチング問題では、2つのグラフ$G,H$を観察し、その目標は、エッジの一致の尺度が最大化されるように、それらの頂点間の割り当て(またはマッチング)を見つけることです。この研究では、観察されたペア$G,H$は、相関重み付きグラフの一般的なモデルである相関ガウスウィグナー(CGW)モデルから抽出されたものと仮定します。このモデルでは、$G$と$H$の隣接行列のエントリは独立したガウス分布であり、$G$の各エッジは、パラメーター$\sigma \in [0,1)$によって記述されるエッジ相関で、$H$の1つのエッジ(未知のマッチングによって決定)と相関しています。この論文では、サイド情報として初期の部分的に正しいマッチング(シードと呼ばれる)が与えられるシードグラフマッチングアルゴリズムとしての投影累乗法(PPM)のパフォーマンスを分析します。シードが真のマッチングに十分近い場合、PPMは高い確率でシードを反復的に改善し、$O(\log n)$回の反復で真のマッチングを（部分的にまたは完全に）回復することを証明します。私たちの結果は、PPMが一定の$\sigma$の状態でも機能することを証明しており、これにより、（Maoら、2023）の疎な相関エルデシュ・レニ（CER）モデルの分析を（密な）CGWモデルに拡張しています。私たちの分析の副産物として、PPMフレームワークはシード付きグラフマッチングの最先端のアルゴリズムの一部を一般化していることがわかります。私たちは、合成データでの数値実験により理論的発見をサポートし、補完しています。
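The iteration described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes $\sigma$ acts as a noise level, so that $B = \sqrt{1-\sigma^2}\,\Pi A \Pi^\top + \sigma Z$ and $\sigma = 0$ gives identical graphs. It also assumes the projection onto permutations is done by solving a linear assignment problem with SciPy's `linear_sum_assignment`. The helper names `sample_cgw`, `corrupt`, and `ppm` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_cgw(n, sigma, rng):
    """Return (A, B, perm) where B[i, j] is a noisy copy of A[perm[i], perm[j]]."""
    def wigner():
        g = rng.normal(size=(n, n))
        return (g + g.T) / np.sqrt(2 * n)  # symmetric, off-diagonal variance 1/n
    A, Z = wigner(), wigner()
    perm = rng.permutation(n)
    B = np.sqrt(1 - sigma**2) * A[np.ix_(perm, perm)] + sigma * Z
    return A, B, perm

def corrupt(perm, frac_wrong, rng):
    """Build a seed by shuffling the images of a random fraction of vertices."""
    seed = perm.copy()
    idx = rng.choice(len(perm), size=int(frac_wrong * len(perm)), replace=False)
    seed[idx] = seed[rng.permutation(idx)]
    return seed

def ppm(A, B, seed, n_iter=10):
    """Projected power iterations; est[i] is the vertex of G matched to vertex i of H."""
    n = len(seed)
    est = np.asarray(seed).copy()
    for _ in range(n_iter):
        M = np.zeros((n, n))
        M[est, np.arange(n)] = 1.0            # current matching as a permutation matrix
        score = A @ M @ B                     # power step
        rows, cols = linear_sum_assignment(score, maximize=True)  # projection step
        new_est = np.empty(n, dtype=int)
        new_est[cols] = rows
        if np.array_equal(new_est, est):      # fixed point reached
            break
        est = new_est
    return est

rng = np.random.default_rng(0)
A, B, perm = sample_cgw(n=200, sigma=0.3, rng=rng)
seed = corrupt(perm, frac_wrong=0.4, rng=rng)
est = ppm(A, B, seed)
print("seed overlap:", np.mean(seed == perm), "final overlap:", np.mean(est == perm))
```

Each iteration costs one assignment solve, which is cubic in $n$ for dense matrices. Cheaper greedy projections are possible but are not shown here.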

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization
エントロピー正則化を用いた競争ゲームのための高速方策Extragradient法

This paper investigates the problem of computing the equilibrium of competitive games in the form of two-player zero-sum games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE), the solution to zero-sum two-player matrix games with entropy regularization, at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent’s actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, that is, independent of the size of the state and action spaces up to logarithmic factors, highlighting the positive role of entropy regularization for accelerating convergence.

この論文では、2人プレイのゼロ和ゲームの形式での競争ゲームの均衡を計算する問題を調査します。これは、確率単体制約付きの制約付き鞍点最適化問題としてモデル化されることが多いです。制約のない設定でのエクストラグラディエント法の最後の反復収束を理解するための最近の取り組みにもかかわらず、制約のある設定でのこれらの方法、特に乗法更新を使用する方法の理論的根拠は、目的関数が双線形の場合でも、依然として非常に不十分です。シングルエージェント強化学習とゲーム理論におけるエントロピー正則化のアルゴリズム的役割に動機付けられて、私たちは、エントロピー正則化によるゼロ和2人プレイのマトリックスゲームの解である量子応答均衡(QRE)を線形速度で見つけるための、証明可能な効率性を持つエクストラグラディエント法を開発します。提案されたアルゴリズムは、分散型の方法で実装することができ、各プレイヤーは、対戦相手の行動を直接観察することなく、独自の報酬を使用して対称的かつ乗法的な更新を反復的に実行します。さらに、エントロピー正則化のノブを制御することにより、提案されたアルゴリズムは、ナッシュ均衡が一意であると仮定することなく、非正則化マトリックスゲームの近似ナッシュ均衡を線形以下の速度で見つけることができます。私たちの方法は、同様の速度で（エントロピー正則化された）ゼロサムマルコフゲームを解くための効率的なポリシー超勾配アルゴリズムにもつながります。私たちの収束速度はすべて、対数係数まで状態空間と行動空間のサイズに依存しないほぼ次元フリーであり、収束を加速するためのエントロピー正則化の積極的な役割を強調しています。
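A minimal sketch of an extragradient scheme with multiplicative updates for the entropy-regularized matrix game $\max_\mu \min_\nu\, \mu^\top A \nu + \tau H(\mu) - \tau H(\nu)$ is given below. The update form, the step size, and the function names are assumptions for illustration and may differ from the paper's exact algorithms. The QRE is characterized by the fixed point $\mu^\star \propto \exp(A\nu^\star/\tau)$, $\nu^\star \propto \exp(-A^\top\mu^\star/\tau)$, and the sketch measures how far the iterates are from it.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def _normalize(log_p):
    """Renormalize log-probabilities so that exp(log_p) sums to one."""
    return log_p - logsumexp(log_p)

def entropy_reg_extragradient(A, tau, eta, n_iter):
    """Extragradient with multiplicative updates, carried out in log space."""
    m, n = A.shape
    log_mu = np.full(m, -np.log(m))   # uniform initial strategies
    log_nu = np.full(n, -np.log(n))
    for _ in range(n_iter):
        mu, nu = np.exp(log_mu), np.exp(log_nu)
        # Extrapolation (midpoint) step from the current iterate.
        mu_bar = np.exp(_normalize((1 - eta * tau) * log_mu + eta * (A @ nu)))
        nu_bar = np.exp(_normalize((1 - eta * tau) * log_nu - eta * (A.T @ mu)))
        # Update step: same base point, payoffs evaluated at the midpoint.
        log_mu = _normalize((1 - eta * tau) * log_mu + eta * (A @ nu_bar))
        log_nu = _normalize((1 - eta * tau) * log_nu - eta * (A.T @ mu_bar))
    return np.exp(log_mu), np.exp(log_nu)

def qre_residual(A, tau, mu, nu):
    """Largest deviation from the QRE fixed-point equations."""
    return max(np.abs(mu - softmax(A @ nu / tau)).max(),
               np.abs(nu - softmax(-A.T @ mu / tau)).max())

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(5, 7))   # hypothetical payoff matrix
mu, nu = entropy_reg_extragradient(A, tau=0.5, eta=0.2, n_iter=500)
print("QRE residual:", qre_residual(A, 0.5, mu, nu))
```

Each player only needs its own payoff vector (`A @ nu` or `A.T @ mu`) at each step, which is what makes a decentralized implementation possible.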

Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic
ノックオフのパワー:ランキングアルゴリズム、拡張設計、対称統計の影響

The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.

ノックオフフィルターは、高次元線形モデルに対する最近の偽発見率(FDR)制御方法です。ノックオフには、ランキングアルゴリズム、拡張設計、対称統計という3つの主要コンポーネントがあり、各コンポーネントには複数の選択肢があることを指摘します。3つのコンポーネントのさまざまな組み合わせを検討することで、ノックオフのさまざまなバリエーションのコレクションが得られます。これらすべてのバリエーションは有限サンプルのFDR制御を保証し、そのパワーを比較することが私たちの目標です。回帰係数にまれで弱い信号モデルを想定し、偽陽性率と偽陰性率の明示的な式を導出することで、ノックオフのさまざまなバリエーションのパワーを比較します。私たちの結果は、FDRをターゲットレベルで制御する際のパワーを向上させる方法に関する新しい洞察を提供します。また、ノックオフのパワーをそのプロトタイプ(同じランキングアルゴリズムを使用するが、理想的なしきい値にアクセスできる方法)と比較します。この比較により、FDRを制御するためのデータ駆動型しきい値を見つけることで支払う追加の代償が明らかになります。
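The "data-driven threshold" mentioned above can be illustrated with the standard knockoff+ selection rule. Given symmetric statistics $W_j$, where large positive values favor the original variable over its knockoff, the rule picks the smallest $t$ whose estimated false discovery proportion is at most $q$. The sketch below assumes the statistics have already been computed; the values in `W` are made up for illustration.

```python
import numpy as np

def knockoff_threshold(W, q, offset=1):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the knockoff+ rule; returns np.inf if no t qualifies.
    """
    W = np.asarray(W, dtype=float)
    candidates = np.unique(np.abs(W[W != 0]))  # sorted ascending by np.unique
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

# Hypothetical symmetric statistics for 10 variables.
W = np.array([5, 4, 3, 2.5, -1, 2, -0.5, 1.5, 6, 3.5])
T = knockoff_threshold(W, q=0.2)
selected = np.flatnonzero(W >= T)
print("threshold:", T, "selected variables:", selected)
```

The ranking algorithm and the augmented design determine how `W` is produced, so the variants compared in the paper differ before this thresholding step.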

Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction
有限和最適化問題の下側複雑性限界: 結果と構成

In this paper we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider a so-called proximal incremental first-order oracle (PIFO) algorithm, which employs the individual component function’s gradient and proximal information provided by PIFO to update the variable. To incorporate loopless methods, we also allow the PIFO algorithm to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We also derive lower bounds for finite-sum minimization problems, similar to those of previous work, under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles.

この論文では、目的関数が$n$個の個々のコンポーネント関数の平均である有限和最適化問題の複雑さの下限について検討します。変数を更新するために、個々のコンポーネント関数の勾配とPIFOによって提供される近接情報を使用する、いわゆる近位増分一次オラクル(PIFO)アルゴリズムを検討します。ループレスメソッドを組み込むために、PIFOアルゴリズムが全勾配をまれに取得できるようにもします。私たちは、従来の例の三角行列を$n$個のグループに分割する、ハードインスタンスを構築する新しいアプローチを開発します。この構成は、PIFOアルゴリズムの解析に適しています。この構成に基づいて、目的関数が凸凹または非凸強凹で、コンポーネント関数のクラスが$L$平均平滑である場合の有限和ミニマックス最適化問題の複雑さの下限を確立します。これらの境界のほとんどは、対数因数までの既存の上限とほぼ一致しています。また、平滑性と平均平滑性の両方の仮定の下で、有限和最小化問題に対して、以前の研究と同様の下限を導出します。下限は、滑らかな関数の近似オラクルの方が勾配オラクルよりもそれほど強力ではないことを示唆しています。
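To illustrate what a PIFO returns, here is a minimal sketch for hypothetical least-squares components $f_i(x) = \tfrac12 (a_i^\top x - b_i)^2$, where the proximal point has a closed form. The component family and the function name `pifo` are assumptions chosen for illustration and are unrelated to the hard instances constructed in the paper.

```python
import numpy as np

def pifo(A, b, x, i, gamma):
    """Return (f_i(x), grad f_i(x), prox_{gamma * f_i}(x)) for component i.

    f_i(x) = 0.5 * (A[i] @ x - b[i])**2, and
    prox_{gamma f_i}(x) = argmin_u f_i(u) + ||u - x||^2 / (2 * gamma).
    """
    a = A[i]
    r = a @ x - b[i]                                 # residual of component i
    value = 0.5 * r**2
    grad = r * a
    prox = x - gamma * r / (1 + gamma * (a @ a)) * a  # closed-form proximal point
    return value, grad, prox

rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 3)), rng.normal(size=8)   # n = 8 components in R^3
x = rng.normal(size=3)
value, grad, prox = pifo(A, b, x, i=2, gamma=0.5)
print("f_i(x) =", value, "prox =", prox)
```

A PIFO algorithm in the sense above updates its iterate using only such per-component answers, plus an occasional full gradient.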

On Truthing Issues in Supervised Classification
教師付き分類における真実性の問題について

Ideal supervised classification assumes known correct labels, but various truthing issues can arise in practice: noisy labels; multiple, conflicting labels for a sample; missing labels; and different labeler combinations for different samples. Previous work introduced a noisy-label model, which views the observed noisy labels as random variables conditioned on the unobserved correct labels. It has mainly focused on estimating the conditional distribution of the noisy labels and the class prior, as well as estimating the correct labels or training with noisy labels. In a complementary manner, given the conditional distribution and class prior, we apply estimation theory to classifier testing, training, and comparison of different combinations of labelers. First, for binary classification, we construct a testing model and derive approximate marginal posteriors for accuracy, precision, recall, probability of false alarm, and F-score, and joint posteriors for ROC and precision-recall analysis. We propose minimum mean-square error (MMSE) testing, which employs empirical Bayes algorithms to estimate the testing-model parameters and then computes optimal point estimates and credible regions for the metrics. We extend the approach to multi-class classification to obtain optimal estimates of accuracy and individual confusion-matrix elements. Second, we present a unified view of training that covers probabilistic (i.e., discriminative or generative) and non-probabilistic models. For the former, we adjust maximum-likelihood or maximum a posteriori training for truthing issues; for the latter, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk. We also describe suboptimal training that is compatible with existing infrastructure. Third, we observe that mutual information lets one express any labeler combination as an equivalent single labeler, implying that multiple mediocre labelers can be as informative as, or more informative than, a single expert labeler. Experiments demonstrate the effectiveness of the methods and confirm the implication.

理想的な教師あり分類では、既知の正しいラベルが想定されますが、実際にはさまざまな真理値の問題が発生する可能性があります。たとえば、ノイズの多いラベル、サンプルに対する複数の矛盾するラベル、ラベルの欠落、サンプルごとに異なるラベラーの組み合わせなどです。以前の研究では、観測されたノイズの多いラベルを、観測されていない正しいラベルを条件とするランダム変数と見なすノイズの多いラベルモデルが導入されました。このモデルでは、主に、ノイズの多いラベルの条件付き分布とクラス事前分布の推定、および正しいラベルの推定やノイズの多いラベルを使用したトレーニングに重点を置いています。補完的に、条件付き分布とクラス事前分布が与えられた場合、推定理論を分類器のテスト、トレーニング、およびさまざまなラベラーの組み合わせの比較に適用します。まず、バイナリ分類の場合、テストモデルを構築し、精度、精度、再現率、誤警報確率、Fスコアのおおよその限界事後分布と、ROCおよび精度-再現率分析の結合事後分布を導出します。私たちは、最小平均二乗誤差(MMSE)テストを提案します。これは、経験的ベイズアルゴリズムを使用してテストモデルのパラメーターを推定し、次にメトリックの最適な点推定値と信頼できる領域を計算します。このアプローチをマルチクラス分類に拡張して、精度と個々の混同行列要素の最適な推定値を取得します。次に、確率的(つまり、識別的または生成的)モデルと非確率的モデルをカバーするトレーニングの統一的なビューを示します。前者の場合、真実の問題に対して最大尤度または最大事後確率トレーニングを調整します。後者の場合、経験的リスクのMMSE推定を最小化するMMSEトレーニングを提案します。また、既存のインフラストラクチャと互換性のある次善のトレーニングについても説明します。3番目に、相互情報量によって、任意のラベル付け者の組み合わせを同等の単一のラベル付け者として表現できることを観察します。これは、複数の平凡なラベル付け者が、単一の熟練したラベル付け者と同等か、それ以上に有益である可能性があることを意味します。実験により、この方法の有効性が実証され、その意味が確認されます。
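The noisy-label model and the mutual-information comparison can be sketched in a few lines. The sketch assumes that labelers are conditionally independent given the correct label and that each labeler's confusion matrix is known; the 75% and 85% accuracies are made-up numbers. It illustrates the idea only and does not reproduce the paper's estimators.

```python
import itertools
import numpy as np

def posterior(prior, confusions, labels):
    """P(correct label | observed labels); a label of None means "missing".

    confusions[k][y, l] = P(labeler k reports l | correct label y).
    """
    post = np.array(prior, dtype=float)
    for C, l in zip(confusions, labels):
        if l is not None:
            post *= C[:, l]
    return post / post.sum()

def mutual_information(prior, confusions):
    """I(Y; L_1, ..., L_K) in bits, by enumerating all label combinations."""
    prior = np.asarray(prior, dtype=float)
    mi = 0.0
    for labels in itertools.product(range(len(prior)), repeat=len(confusions)):
        joint = prior.copy()                  # becomes P(y, labels) for each y
        for C, l in zip(confusions, labels):
            joint *= C[:, l]
        p_labels = joint.sum()
        nz = joint > 0
        mi += np.sum(joint[nz] * np.log2(joint[nz] / (prior[nz] * p_labels)))
    return mi

prior = np.array([0.5, 0.5])
mediocre = np.array([[0.75, 0.25], [0.25, 0.75]])  # hypothetical 75%-accurate labeler
expert = np.array([[0.85, 0.15], [0.15, 0.85]])    # hypothetical 85%-accurate labeler

post = posterior(prior, [mediocre] * 3, labels=[1, 1, 0])
mi_three = mutual_information(prior, [mediocre] * 3)
mi_expert = mutual_information(prior, [expert])
print("posterior:", post)
print("MI, three mediocre labelers:", mi_three, "MI, one expert:", mi_expert)
```

With these numbers the three mediocre labelers together carry more information about the correct label than the single expert, matching the qualitative implication stated in the abstract.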
