Journal of Machine Learning Research Papers: List of Papers in Volume 24

This page lists the papers published in Journal of Machine Learning Research Papers Volume 24 and presents each title and abstract in Japanese with the help of machine translation.

Foundation Models and Fair Use
基礎モデルとフェアユース

Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Third, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.

既存の基礎モデルは、著作権で保護された素材でトレーニングされています。これらのモデルを展開すると、データ作成者が適切な帰属や補償を受けられなかった場合、法的および倫理的リスクの両方が生じる可能性があります。米国および他のいくつかの国では、著作権で保護されたコンテンツを使用して基礎モデルを構築しても、フェアユースの原則により責任を負わされることはありません。ただし、注意点があります。モデルが著作権で保護されたデータに類似した出力を生成する場合、特にそのデータの市場に影響を与えるシナリオでは、モデルの出力にフェアユースが適用されなくなる可能性があります。この作業では、フェアユースが保証されているわけではなく、モデルの開発と展開をフェアユースの領域にしっかりと収めるために追加の作業が必要になる可能性があることを強調します。まず、著作権で保護されたコンテンツに基づいて基礎モデルを開発および展開する潜在的なリスクを調査します。関連する米国の判例法を確認し、テキスト、ソースコード、および視覚芸術を生成するための既存および潜在的なアプリケーションと比較します。実験により、人気のある基礎モデルは、著作権で保護された素材にかなり類似したコンテンツを生成できることが確認されています。次に、フェアユースに準拠した基礎モデルを維持するのに役立つ技術的な緩和策について説明します。私たちは、緩和戦略を現在の法律の状態に合わせるために、さらなる研究が必要であると主張します。第三に、法律と技術的緩和策は共進化すべきであると提案します。たとえば、他の政策メカニズムと組み合わせることで、強力な技術的ツールが侵害の被害を緩和するために使用される場合、法律はより明示的にセーフハーバーを考慮することができます。この共進化は、知的財産とイノベーションのバランスをとるのに役立つ可能性があり、これはフェアユースの本来の目的を物語っています。ただし、ここで説明する戦略は万能薬ではなく、基盤モデルの潜在的な被害に対処するポリシーを開発するには、さらなる作業が必要であることを強調します。

Boosting Multi-agent Reinforcement Learning via Contextual Prompting
コンテキストプロンプトによるマルチエージェント強化学習の促進

Multi-agent reinforcement learning (MARL) has gained increasing attention due to its ability to enable multiple agents to learn policies simultaneously. However, the bootstrapping error arises from the difference between the estimated Q value and the real discounted return and accumulates backward through dynamic programming iterations. This error can become even larger as the number of agents increases, due to the exponential growth of agent interactions, resulting in infeasible learning time and incorrect actions during early training steps. To address this challenge, we observe that previously collected trajectories are useful contexts, model them using a contextual predictor to yield the next action and observation, and use the contextual predictor to replace the Q value function or utility function during the early training phase. Furthermore, we employ a joint-action sampling mechanism to restrict the action space and dynamically select policies from the vanilla utility network and those from the contextual trajectory predictor to perform rollout processes. By reasonably constraining the action space and rollout process, we can significantly accelerate the algorithm training process. Our framework applies to various value-based MARL methods in both centralized training decentralized execution (CTDE) and non-CTDE scenarios where agents are accessible (non-accessible) to global states during the training process. Experimental results on three tasks, Spread, Tag, and Reference, from the Particle World Environment (PWE) show that our framework significantly accelerates the training process of existing state-of-the-art CTDE and non-CTDE MARL methods, while also competing with or outperforming their original versions.

マルチエージェント強化学習(MARL)は、複数のエージェントが同時にポリシーを学習できることから、ますます注目を集めています。しかし、ブートストラップ誤差は推定Q値と実際の割引収益の差から生じ、動的プログラミングの反復を通じて後方に蓄積されます。エージェントの相互作用が指数関数的に増加するため、エージェントの数が増えるとこの誤差はさらに大きくなり、学習時間が実行不可能になり、初期のトレーニング手順で誤ったアクションが発生します。この課題に対処するために、以前に収集した軌跡が有用なコンテキストであることに注目し、コンテキスト予測子を使用してそれらをモデル化して次のアクションと観察を生成し、コンテキスト予測子を使用して、初期のトレーニング段階でQ値関数またはユーティリティ関数を置き換えます。さらに、ジョイントアクションサンプリングメカニズムを使用してアクションスペースを制限し、バニラユーティリティネットワークからのポリシーとコンテキスト軌跡予測子からのポリシーを動的に選択して、ロールアウトプロセスを実行します。アクションスペースとロールアウトプロセスを合理的に制約することで、アルゴリズムのトレーニングプロセスを大幅に加速できます。私たちのフレームワークは、集中型トレーニング分散実行(CTDE)と、トレーニングプロセス中にエージェントがグローバル状態にアクセスできる(アクセスできない)非CTDEシナリオの両方で、さまざまな値ベースのMARL方法に適用されます。Particle World Environment (PWE)の3つのタスク(Spread、Tag、Reference)での実験結果から、私たちのフレームワークは、既存の最先端のCTDEおよび非CTDE MARL方法のトレーニングプロセスを大幅に加速すると同時に、元のバージョンと競合するか、それを上回るパフォーマンスを発揮することがわかります。
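
The bootstrapping error named in this abstract, the gap between an estimated Q value and the realized discounted return, can be illustrated with a minimal single-agent sketch. The reward sequence, the discount factor, and the Q estimate below are hypothetical values chosen for illustration; none of them come from the paper, and the sketch does not implement the paper's contextual predictor.

```python
def discounted_return(rewards, gamma):
    """Realized discounted return: sum over t of gamma^t * r_t."""
    total = 0.0
    # Accumulate backward so each step applies exactly one discount factor.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def bootstrapping_error(q_estimate, rewards, gamma):
    """Gap between an estimated Q value and the realized discounted return."""
    return q_estimate - discounted_return(rewards, gamma)


# Hypothetical episode: three rewards of 1.0 with discount factor 0.5.
rewards = [1.0, 1.0, 1.0]
gamma = 0.5
q_estimate = 2.5  # assumed output of a hypothetical Q network for the first state

realized = discounted_return(rewards, gamma)
error = bootstrapping_error(q_estimate, rewards, gamma)
print(realized, error)
```

In a multi-agent setting the same gap is computed over joint actions, which is where the growth in agent interactions described above makes it harder to control.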

Finding Groups of Cross-Correlated Features in Bi-View Data
Bi-View データでの相互相関特徴量のグループの検索

Datasets in which measurements of two (or more) types are obtained from a common set of samples arise in many scientific applications. A common problem in the exploratory analysis of such data is to identify groups of features of different data types that are strongly associated. A bimodule is a pair (A,B) of feature sets from two data types such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A,B) is stable if A coincides with the set of features that have significant aggregate correlation with the features in B, and vice-versa. This paper proposes an iterative-testing based bimodule search procedure (BSP) to identify stable bimodules. Compared to existing methods for detecting cross-correlated features, BSP was the best at recovering true bimodules with sufficient signal, while limiting the false discoveries. In addition, we applied BSP to the problem of expression quantitative trait loci (eQTL) analysis using data from the GTEx consortium. BSP identified several thousand SNP-gene bimodules. While many of the individual SNP-gene pairs appearing in the discovered bimodules were identified by standard eQTL methods, the discovered bimodules revealed genomic subnetworks that appeared to be biologically meaningful and worthy of further scientific investigation.

多くの科学的応用において、共通のサンプルセットから2種類(またはそれ以上)の測定値が得られるデータセットが発生します。このようなデータの探索的分析でよくある問題は、強く関連している異なるデータタイプの特徴のグループを識別することです。バイモジュールとは、2つのデータタイプからの特徴セットのペア(A、B)であり、Aの特徴とBの特徴の間の総計相互相関が大きいものです。バイモジュール(A、B)が安定しているのは、AがBの特徴と有意な総計相関を持つ特徴セットと一致する場合であり、その逆も同様です。この論文では、安定したバイモジュールを識別するために、反復テストベースのバイモジュール検索手順(BSP)を提案します。相互相関特徴を検出するための既存の方法と比較して、BSPは、誤った検出を制限しながら、十分な信号で真のバイモジュールを回復するのに最適でした。さらに、GTExコンソーシアムのデータを使用して、BSPを発現量的形質遺伝子座(eQTL)分析の問題に適用しました。BSPは数千のSNP遺伝子バイモジュールを特定しました。発見されたバイモジュールに現れる個々のSNP遺伝子ペアの多くは標準的なeQTL方法によって特定されましたが、発見されたバイモジュールは、生物学的に意味があり、さらなる科学的調査に値すると思われるゲノムサブネットワークを明らかにしました。

Bayesian Spanning Tree: Estimating the Backbone of the Dependence Graph
ベイジアンスパニングツリー: 依存関係グラフのバックボーンの推定

In multivariate data analysis, it is often important to estimate a graph characterizing dependence among $p$ variables. A popular strategy in Gaussian graphical models and latent Gaussian graphical models uses the non-zero entries in a $p\times p$ covariance or precision matrix, typically requiring restrictive modeling assumptions for accurate graph recovery. To improve model robustness, we instead focus on estimating the backbone of the dependence graph. We use a spanning tree likelihood, based on a minimalist graphical model that is purposely overly-simplified. Taking a Bayesian approach, we place a prior on the space of trees and quantify uncertainty in the graphical model. In both theory and experiments, we show that this model does not require the population graph to be a spanning tree or the covariance to satisfy assumptions beyond positive-definiteness. The model accurately recovers the backbone of the population graph at a rate competitive with existing approaches but with better robustness. We show combinatorial properties of the spanning tree, which may be of independent interest, and develop an efficient Gibbs sampler for Bayesian inference. Analyzing electroencephalography data using a hidden Markov model with each latent state modeled by a spanning tree, we show that results are much more interpretable compared with popular alternatives.

多変量データ分析では、多くの場合、$p$個の変数間の依存関係を特徴付けるグラフを推定することが重要になります。ガウスグラフィカルモデルと潜在ガウスグラフィカルモデルの一般的な戦略では、$p\times p$の共分散行列または精度行列の非ゼロエントリを使用しますが、通常、正確なグラフ回復には制限的なモデリング仮定が必要です。モデルの堅牢性を向上させるために、代わりに依存関係グラフのバックボーンの推定に焦点を当てます。意図的に過度に単純化された最小限のグラフィカルモデルに基づく、スパニングツリー尤度を使用します。ベイジアンアプローチを採用して、ツリーの空間に事前確率を配置し、グラフィカルモデルの不確実性を定量化します。理論と実験の両方で、このモデルでは、集団グラフがスパニングツリーである必要も、共分散が正定値性を超える仮定を満たす必要もないことを示しています。このモデルは、既存のアプローチと競合する速度で、より堅牢に、集団グラフのバックボーンを正確に回復します。私たちは、独立した関心事であるかもしれないスパニングツリーの組み合わせ特性を示し、ベイズ推論のための効率的なギブスサンプラーを開発します。各潜在状態をスパニングツリーでモデル化した隠れマルコフモデルを使用して脳波データを分析すると、一般的な代替方法と比較して結果がはるかに解釈しやすいことがわかります。

RVCL: Evaluating the Robustness of Contrastive Learning via Verification
RVCL:検証によるコントラスティブ学習のロバスト性の評価

Contrastive adversarial training has successfully improved the robustness of contrastive learning (CL). However, the robustness metric in these methods depends on attack algorithms, image labels, and downstream tasks, introducing reliability concerns. To address these issues, this paper proposes a novel Robustness Verification framework for Contrastive Learning (RVCL). Specifically, we define the verification problem of CL from deterministic and probabilistic perspectives, then provide several effective metrics to evaluate the robustness of CL encoder. Furthermore, we use extreme value theory to reveal the relationship between the robust radius of the CL encoder and that of the supervised downstream task. Extensive experiments on various benchmark models and datasets validate theoretical findings, and further demonstrate RVCL’s capability to evaluate the robustness of both CL encoders and images. Our code is available at https://github.com/wzekai99/RVCL-JMLR.

敵対的コントラスティブ学習は、コントラスティブ学習(CL)の堅牢性を向上させることに成功しました。ただし、これらの方法の堅牢性メトリックは、攻撃アルゴリズム、イメージラベル、およびダウンストリームタスクに依存するため、信頼性の問題が生じます。これらの問題に対処するために、この論文では、新しいRobustness Verification framework for Contrastive Learning(RVCL)を提案します。具体的には、CLの検証問題を決定論的および確率的観点から定義し、CLエンコーダのロバスト性を評価するためのいくつかの効果的な指標を提供します。さらに、極値理論を用いて、CLエンコーダのロバスト半径と教師下流タスクのロバスト半径との関係を明らかにします。さまざまなベンチマークモデルとデータセットでの広範な実験により、理論的な知見が検証され、CLエンコーダーと画像の両方のロバスト性を評価するRVCLの能力がさらに実証されています。当社のコードはhttps://github.com/wzekai99/RVCL-JMLRで入手できます。

Adaptive Learning of Density Ratios in RKHS
RKHSにおける密度比の適応学習

Estimating the ratio of two probability densities from finitely many observations of the densities is a central problem in machine learning and statistics with applications in two-sample testing, divergence estimation, generative modeling, covariate shift adaptation, conditional density estimation, and novelty detection. In this work, we analyze a large class of density ratio estimation methods that minimize a regularized Bregman divergence between the true density ratio and a model in a reproducing kernel Hilbert space (RKHS). We derive new finite-sample error bounds, and we propose a Lepskii type parameter choice principle that minimizes the bounds without knowledge of the regularity of the density ratio. In the special case of square loss, our method adaptively achieves a minimax optimal error rate. A numerical illustration is provided.

密度の有限個の観測値から2つの確率密度の比を推定することは、機械学習と統計学の中心的な問題であり、2標本検定、ダイバージェンス推定、生成モデリング、共変量シフト適応、条件付き密度推定、新規性検出に応用できます。この研究では、真の密度比と再生核ヒルベルト空間(RKHS)内のモデルとの間の正則化されたブレグマンダイバージェンスを最小化する、密度比推定法の大きなクラスを分析します。新しい有限サンプル誤差限界を導出し、密度比の正則性を知らなくてもその限界を最小化するLepskii型のパラメータ選択原理を提案します。二乗損失の特殊なケースでは、この手法は適応的にミニマックス最適な誤差率を達成します。数値例も示します。
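
For the square-loss special case mentioned in this abstract, the objective can be written out explicitly. The following is a sketch under assumed notation that does not appear in the abstract: $p$ and $q$ are the two densities, $\beta = p/q$ is the true ratio, $F$ is the convex generator of the Bregman divergence, $\mathcal{H}$ is the RKHS, and $\lambda > 0$ is the regularization parameter. The paper's own normalization may differ.

```latex
% Bregman divergence between the true ratio \beta = p/q and a model f,
% integrated against q (notation assumed for illustration):
B_F(\beta, f) = \int \Big( F(\beta(x)) - F(f(x)) - F'(f(x))\,\big(\beta(x) - f(x)\big) \Big)\, q(x)\, dx .

% Square loss: F(t) = t^2/2, so F'(t) = t and
B_F(\beta, f) = \frac{1}{2} \int \big(\beta(x) - f(x)\big)^2 q(x)\, dx
              = \frac{1}{2}\,\mathbb{E}_q[f^2] - \mathbb{E}_p[f] + \frac{1}{2}\,\mathbb{E}_q[\beta^2],

% using \mathbb{E}_q[\beta f] = \mathbb{E}_p[f]. Dropping the term that does not depend on f
% and replacing expectations by sample means over x_1,\dots,x_m \sim p and y_1,\dots,y_n \sim q:
\hat f_\lambda = \arg\min_{f \in \mathcal{H}}
  \frac{1}{2n} \sum_{j=1}^{n} f(y_j)^2 - \frac{1}{m} \sum_{i=1}^{m} f(x_i)
  + \frac{\lambda}{2}\, \|f\|_{\mathcal{H}}^2 .
```

The objective involves the unknown ratio only through expectations under $p$ and $q$, which is why it can be estimated from the two samples alone; the choice of $\lambda$ is the parameter that a Lepskii type principle would select.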

Revisiting inference after prediction
予測後の推論の再検討

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.

最近の研究では、予測ベースの推論という非常に一般的な方法に焦点が当てられています。つまり、(i)事前トレーニング済みの機械学習モデルを使用して観測されていない応答変数を予測し、次に(ii)その予測された応答といくつかの共変量との関連性について推論を実行します。Wangら(2020)が指摘しているように、(ii)で標準的な推論アプローチを適用しても、(予測された応答ではなく)観測されていない応答と共変量との関連性を正確に定量化することはできません。最近の研究では、Wangら(2020)とAngelopoulosら(2023)は、観測されていない応答と共変量との関連性について有効な推論を可能にするために、ステップ(ii)の修正を提案しています。ここでは、Angelopoulosら(2023)によって提案された方法が、観測されていない応答を予測するために使用される事前トレーニング済みの機械学習モデルの品質に関係なく、第1種の過誤率をうまく制御し、正しい名目被覆率を持つ信頼区間を提供することを示します。ただし、Wangら(2020)によって提案された方法は、実際にはめったに成り立たない非常に強い条件下でのみ有効な推論を提供します。たとえば、機械学習モデルが関心のある研究対象母集団における真の回帰関数を完全に推定する場合などです。
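
The correction idea behind prediction-based inference can be sketched in its simplest form: estimating a population mean rather than a regression association. The sketch follows the commonly described prediction-powered estimator, which averages the model's predictions on a large unlabeled set and then subtracts the average prediction error measured on a small labeled set. The function name, the toy numbers, and the normal-approximation interval are assumptions for illustration, not code from either cited paper.

```python
import math
import statistics


def prediction_powered_mean(preds_unlabeled, preds_labeled, y_labeled, z=1.96):
    """Mean estimate corrected by the prediction error seen on labeled data.

    Returns (estimate, (lower, upper)) using a normal approximation.
    """
    n_unlabeled = len(preds_unlabeled)
    n_labeled = len(y_labeled)
    # Prediction errors on the labeled sample (the correction term).
    errors = [f - y for f, y in zip(preds_labeled, y_labeled)]
    estimate = statistics.fmean(preds_unlabeled) - statistics.fmean(errors)
    # The two samples are independent, so their variances add.
    var = (statistics.variance(preds_unlabeled) / n_unlabeled
           + statistics.variance(errors) / n_labeled)
    half_width = z * math.sqrt(var)
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical toy data: the model overpredicts by 1.0 on average.
y_labeled = [2.0, 4.0, 6.0, 8.0]
preds_labeled = [3.0, 5.5, 6.5, 9.0]
preds_unlabeled = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

estimate, interval = prediction_powered_mean(preds_unlabeled, preds_labeled, y_labeled)
print(estimate, interval)
```

Because the correction term is estimated from observed responses, the interval widens when the model predicts poorly instead of becoming invalid, which matches the abstract's point that validity should not depend on the quality of the pre-trained model.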

A Unified Approach to Controlling Implicit Regularization via Mirror Descent
ミラーディセントによる陰的正則化の制御に対する統一的なアプローチ

Inspired by the remarkable success of large neural networks, there has been significant interest in understanding the generalization performance of over-parameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their “preferred” solutions, a phenomenon commonly referred to as implicit regularization. In particular, it has been argued that gradient descent (GD) induces an implicit $\ell_2$-norm regularization in regression and classification problems. However, the implicit regularization of different algorithms is confined to either a specific geometry or a particular class of learning problems, indicating a gap in a general approach for controlling the implicit regularization. To address this, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Further, we show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions. Through comprehensive experiments, we demonstrate that MD is a versatile method to produce learned models with different regularizers, which in turn have different generalization performances.

大規模ニューラルネットワークの目覚ましい成功に触発されて、過剰パラメータ化モデルの一般化パフォーマンスを理解することに大きな関心が寄せられています。最適化アルゴリズムが「推奨」ソリューションを通じて一般化にどのように影響するか、一般に暗黙的正則化と呼ばれる現象を特徴付けることに多大な努力が注がれてきました。特に、勾配降下法(GD)は回帰問題と分類問題で暗黙的な$\ell_2$ノルム正則化を誘導すると主張されてきました。しかし、さまざまなアルゴリズムの暗黙的正則化は、特定のジオメトリまたは特定の学習問題クラスに限定されており、暗黙的正則化を制御するための一般的なアプローチにギャップがあることを示しています。これに対処するために、GDの注目すべき一般化であるミラー降下法(MD)を使用して、回帰と分類の両方の設定で暗黙的正則化を制御する統一的なアプローチを紹介します。具体的には、一般的な同次ポテンシャル関数のクラスを持つMDが、線形分類問題に対する一般化された最大マージン解に向かって収束することを示し、それによって分類設定における長年の疑問に答えます。さらに、MDは効率的に実装でき、適切な条件下では高速収束を実現できることを示します。包括的な実験を通じて、MDは、異なる正規化子を持つ学習済みモデルを生成するための多目的な方法であり、その結果、異なる一般化パフォーマンスが得られることを実証します。
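
The mirror descent update described in this abstract can be sketched for one concrete homogeneous potential. The sketch assumes $\psi(w) = \frac{1}{p}\sum_i |w_i|^p$ with $p > 1$, a choice made here for illustration since the abstract does not name a specific potential. The update is $\nabla\psi(w_{t+1}) = \nabla\psi(w_t) - \eta\,\nabla L(w_t)$, and with $p = 2$ it reduces to plain gradient descent. All function names are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def mirror_map(w, p):
    """Gradient of psi(w) = (1/p) * sum |w_i|^p, applied coordinate-wise."""
    return [sign(wi) * abs(wi) ** (p - 1) for wi in w]


def inverse_mirror_map(z, p):
    """Inverse of mirror_map: recovers w from z = grad psi(w)."""
    return [sign(zi) * abs(zi) ** (1.0 / (p - 1)) for zi in z]


def mirror_descent_step(w, grad, lr, p):
    """One update: grad psi(w_next) = grad psi(w) - lr * grad."""
    z = mirror_map(w, p)
    z_next = [zi - lr * gi for zi, gi in zip(z, grad)]
    return inverse_mirror_map(z_next, p)


# Hypothetical iterate and loss gradient.
w = [1.0, -2.0]
grad = [0.5, 0.5]

w_gd = mirror_descent_step(w, grad, lr=0.1, p=2)  # coincides with a gradient descent step
w_p3 = mirror_descent_step(w, grad, lr=0.1, p=3)  # same gradient, different geometry
print(w_gd, w_p3)
```

The same gradient produces different iterates under different values of `p`, which is the mechanism by which the choice of potential steers the solution that the algorithm prefers.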

Instance-Dependent Confidence and Early Stopping for Reinforcement Learning
インスタンス依存の信頼度と強化学習のための早期停止

Reinforcement learning algorithms are known to exhibit a variety of convergence rates depending on the problem structure. Recent years have witnessed considerable progress in developing theory that is instance-dependent, along with algorithms that achieve such instance-optimal guarantees. However, important questions remain in how to utilize such notions for inferential purposes, or for early stopping, so that data and computational resources can be saved for “easy” problems. This paper develops data-dependent procedures that output instance-dependent confidence regions for evaluating and optimizing policies in a Markov decision process. Notably, our procedures require only black-box access to an instance-optimal algorithm, and re-use the samples used in the estimation algorithm itself. The resulting data-dependent stopping rule adapts to the instance-specific difficulty of the problem and allows for early termination for problems with favorable structure. We highlight the benefit of such early stopping rules via some numerical studies.

強化学習アルゴリズムは、問題の構造に応じてさまざまな収束率を示すことが知られています。近年、インスタンスに依存する理論の開発と、そのようなインスタンス最適な保証を実現するアルゴリズムにおいて、かなりの進歩が見られてきました。しかし、「簡単な」問題ではデータと計算リソースを節約できるように、そのような概念を推論目的または早期停止にどのように利用するかについては、重要な問題が残っています。この論文では、マルコフ決定プロセスでポリシーを評価および最適化するためのインスタンス依存の信頼領域を出力するデータ依存の手順を開発します。特に、私たちの手順では、インスタンス最適アルゴリズムへのブラックボックスアクセスのみが必要であり、推定アルゴリズム自体で使用されるサンプルを再利用します。結果として得られるデータ依存の停止規則は、問題のインスタンス固有の難易度に適応し、好ましい構造を持つ問題では早期終了を可能にします。私たちは、いくつかの数値研究を通じて、そのような早期停止規則の利点を強調します。

Hierarchical Kernels in Deep Kernel Learning
深層カーネル学習における階層カーネル

Kernel methods are built upon the mathematical theory of reproducing kernels and reproducing kernel Hilbert spaces. They enjoy good interpretability thanks to the solid mathematical foundation. Recently, motivated by deep neural networks in deep learning, which construct learning functions by successive compositions of activation functions and linear functions, a class of methods termed as deep kernel learning has appeared in the literature. The core of deep kernel learning is hierarchical kernels that are constructed from a base reproducing kernel by successive compositions. In this paper, we characterize the corresponding reproducing kernel Hilbert spaces of hierarchical kernels, and study conditions ensuring that the reproducing kernel Hilbert space will be expanding as the layer of hierarchical kernels increases. The results will answer whether the expressive power of hierarchical kernels will be improving as the layer increases, and give guidance to the construction of hierarchical kernels for deep kernel learning.

カーネル法は、再生カーネルと再生カーネルヒルベルト空間の数学的理論に基づいて構築されています。これらは、堅固な数学的基礎のおかげで、優れた解釈可能性を享受しています。最近、活性化関数と線形関数の連続的な合成によって学習関数を構築するディープラーニングのディープニューラルネットワークに触発されて、ディープカーネル学習と呼ばれる一連の方法が文献に登場しました。ディープカーネル学習の中核は、連続的な合成によって基本再生カーネルから構築される階層カーネルです。この論文では、階層カーネルの対応する再生カーネルヒルベルト空間を特徴付け、階層カーネルの層が増えるにつれて再生カーネルヒルベルト空間が拡大することを保証する条件を検討します。結果は、層が増えるにつれて階層カーネルの表現力が向上するかどうかに答え、ディープカーネル学習の階層カーネルの構築に指針を与えます。

A Scalable and Efficient Iterative Method for Copying Machine Learning Classifiers
機械学習分類器をコピーするためのスケーラブルで効率的な反復方法

Differential replication through copying refers to the process of replicating the decision behavior of a machine learning model using another model that possesses enhanced features and attributes. This process is relevant when external constraints limit the performance of an industrial predictive system. Under such circumstances, copying enables the retention of original prediction capabilities while adapting to new demands. Previous research has focused on the single-pass implementation for copying. This paper introduces a novel sequential approach that significantly reduces the amount of computational resources needed to train or maintain a copy, leading to reduced maintenance costs for companies using machine learning models in production. The effectiveness of the sequential approach is demonstrated through experiments with synthetic and real-world datasets, showing significant reductions in time and resources, while maintaining or improving accuracy.

コピーによる差分レプリケーションとは、強化された機能と属性を持つ別のモデルを使用して、機械学習モデルの決定動作を複製するプロセスを指します。このプロセスは、外部制約によって産業用予測システムのパフォーマンスが制限される場合に関連します。このような状況下では、コピーにより、新しい要求に適応しながら、元の予測機能を保持することができます。これまでの研究では、コピーのシングルパス実装に焦点が当てられていました。この論文では、コピーのトレーニングや保守に必要な計算リソースの量を大幅に削減する新しいシーケンシャルアプローチを紹介し、本番環境で機械学習モデルを使用する企業のメンテナンスコストを削減します。シーケンシャルアプローチの有効性は、合成データセットと実世界のデータセットを使用した実験を通じて実証されており、精度を維持または向上させながら、時間とリソースを大幅に削減することが示されています。

Semiparametric Inference Using Fractional Posteriors
分数事後分布を用いたセミパラメトリック推論

We establish a general Bernstein–von Mises theorem for approximately linear semiparametric functionals of fractional posterior distributions based on nonparametric priors. This is illustrated in a number of nonparametric settings and for different classes of prior distributions, including Gaussian process priors. We show that fractional posterior credible sets can provide reliable semiparametric uncertainty quantification, but have inflated size. To remedy this, we further propose a shifted-and-rescaled fractional posterior set that is an efficient confidence set having optimal size under regularity conditions. As part of our proofs, we also refine existing contraction rate results for fractional posteriors by sharpening the dependence of the rate on the fractional exponent.

私たちは、ノンパラメトリック事前分布に基づく分数事後分布の近似線形セミパラメトリック汎関数に対する一般的なバーンスタイン・フォン・ミーゼスの定理を確立します。これは、いくつかのノンパラメトリック設定と、ガウス過程事前分布を含む事前分布のさまざまなクラスで示されています。分数事後分布の信用集合は、信頼性の高いセミパラメトリックな不確実性の定量化を提供できるものの、サイズが過大になることを示します。これを改善するために、正則性条件下で最適なサイズを持つ効率的な信頼集合となる、シフトおよび再スケーリングされた分数事後集合をさらに提案します。証明の一部として、収縮率の分数指数への依存性をより精密にすることにより、分数事後分布に関する既存の収縮率の結果も改良します。

Fourier Neural Operator with Learned Deformations for PDEs on General Geometries
一般幾何学上の偏微分方程式の学習変形を持つフーリエニューラル演算子

Deep learning surrogate models have shown promise in solving partial differential equations (PDEs). Among them, the Fourier neural operator (FNO) achieves good accuracy, and is significantly faster compared to numerical solvers, on a variety of PDEs, such as fluid flows. However, the FNO uses the Fast Fourier transform (FFT), which is limited to rectangular domains with uniform grids. In this work, we propose a new framework, viz., Geo-FNO, to solve PDEs on arbitrary geometries. Geo-FNO learns to deform the input (physical) domain, which may be irregular, into a latent space with a uniform grid. The FNO model with the FFT is applied in the latent space. The resulting Geo-FNO model has both the computation efficiency of FFT and the flexibility of handling arbitrary geometries. Our Geo-FNO is also flexible in terms of its input formats, viz., point clouds, meshes, and design parameters are all valid inputs. We consider a variety of PDEs such as the Elasticity, Plasticity, Euler’s, and Navier-Stokes equations, and both forward modeling and inverse design problems. Comprehensive cost-accuracy experiments show that Geo-FNO is $10^5$ times faster than the standard numerical solvers and twice more accurate compared to direct interpolation on existing ML-based PDE solvers such as the standard FNO.

ディープラーニングの代替モデルは、偏微分方程式(PDE)を解くのに有望であることが示されています。その中でも、フーリエニューラルオペレータ(FNO)は、流体の流れなどのさまざまなPDEに対して優れた精度を実現し、数値ソルバーに比べて大幅に高速です。ただし、FNOは高速フーリエ変換(FFT)を使用しますが、これは均一なグリッドを持つ長方形の領域に限定されます。この研究では、任意のジオメトリ上のPDEを解くための新しいフレームワーク、つまりGeo-FNOを提案します。Geo-FNOは、不規則な可能性がある入力(物理)領域を均一なグリッドを持つ潜在空間に変形することを学習します。FFTを使用したFNOモデルは、潜在空間に適用されます。結果として得られるGeo-FNOモデルは、FFTの計算効率と任意のジオメトリを処理できる柔軟性の両方を備えています。Geo-FNOは、入力形式に関しても柔軟性があり、ポイントクラウド、メッシュ、設計パラメータはすべて有効な入力です。弾性方程式、塑性方程式、オイラー方程式、ナビエ・ストークス方程式などのさまざまなPDEと、順方向モデリング問題および逆設計問題の両方を考慮します。包括的なコスト精度実験により、Geo-FNOは標準の数値ソルバーよりも$10^5$倍高速であり、標準FNOなどの既存のMLベースのPDEソルバーの直接補間と比較して2倍の精度であることが示されました。

Distributed Statistical Inference under Heterogeneity
不均一性下での分散統計的推論

We consider distributed statistical optimization and inference in the presence of heterogeneity among distributed data blocks. A weighted distributed estimator is proposed to improve the statistical efficiency of the standard "split-and-conquer" estimator for the common parameter shared by all the data blocks. The weighted distributed estimator is at least as efficient as the would-be full sample and the generalized method of moment estimators with the latter two estimators requiring full data access. A bias reduction is formulated for the weighted distributed estimator to accommodate much larger numbers of data blocks (relaxing the constraint from $K = o(N^{1/2})$ to $K = o(N^{2/3})$, where $K$ is the number of blocks and $N$ is the total sample size) than the existing methods without sacrificing the statistical efficiency at the same time. The mean squared error bounds, the asymptotic distributions, and the corresponding statistical inference procedures of the weighted distributed and the debiased estimators are derived, which show an advantageous performance of the debiased weighted estimators when the number of data blocks is large.

私たちは、分散データブロック間に異質性がある場合の分散統計最適化と推論を検討します。すべてのデータブロックで共有される共通パラメータに対する標準的な「分割統治」推定量の統計的効率を改善するために、重み付き分散推定量が提案されています。重み付き分散推定量は、完全なサンプル推定量や一般化モーメント法推定量と少なくとも同等の効率性があります。後者の2つの推定量は完全なデータアクセスを必要とします。重み付き分散推定量に対してバイアス削減が定式化され、統計的効率を犠牲にすることなく、既存の方法よりもはるかに多くのデータブロックに対応します(制約を$K = o(N^{1/2})$から$K = o(N^{2/3})$に緩和します。ここで、$K$はブロックの数、$N$はサンプルの総サイズです)。重み付き分散推定量とバイアス除去推定量の平均二乗誤差限界、漸近分布、および対応する統計的推論手順が導出され、データブロックの数が多い場合にバイアス除去重み付き推定量が有利なパフォーマンスを示すことが明らかになります。
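
The contrast between the standard split-and-conquer estimator and a weighted alternative can be sketched for the simplest case, a common mean observed in blocks with different noise levels. The sketch uses inverse-variance weights as a stand-in; this weighting scheme, the function names, and the toy data are assumptions for illustration and are not claimed to be the estimator proposed in the paper.

```python
import statistics


def block_estimates(blocks):
    """Per-block sample mean and estimated variance of that mean."""
    means = [statistics.fmean(b) for b in blocks]
    variances = [statistics.variance(b) / len(b) for b in blocks]
    return means, variances


def split_and_conquer(means):
    """Standard estimator: simple average of the block estimates."""
    return statistics.fmean(means)


def weighted_estimator(means, variances):
    """Inverse-variance weighted average (an assumed weighting scheme)."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate, weights


# Hypothetical blocks sharing a mean near 5.0; the second block is much noisier.
blocks = [
    [4.9, 5.1, 5.0, 5.0],
    [2.0, 8.0, 5.0, 6.0],
]

means, variances = block_estimates(blocks)
simple = split_and_conquer(means)
weighted, weights = weighted_estimator(means, variances)
print(simple, weighted, weights)
```

The simple average gives the noisy block the same influence as the precise one, while the weighted version downweights it, which is the intuition for why weighting can improve efficiency under heterogeneity.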

Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice
PAC最適超事後分布によるスケーラブルなPACベイズ・メタ学習: 理論から実践へ

Meta-Learning aims to speed up the learning process on new tasks by acquiring useful inductive biases from datasets of related learning tasks. While, in practice, the number of related tasks available is often small, most of the existing approaches assume an abundance of tasks; making them unrealistic and prone to overfitting. A central question in the meta-learning literature is how to regularize to ensure generalization to unseen tasks. In this work, we provide a theoretical analysis using the PAC-Bayesian theory and present a generalization bound for meta-learning, which was first derived by Rothfuss et al. (2021). Crucially, the bound allows us to derive the closed form of the optimal hyper-posterior, referred to as PACOH, which leads to the best performance guarantees. We provide a theoretical analysis and empirical case study under which conditions and to what extent these guarantees for meta-learning improve upon PAC-Bayesian per-task learning bounds. The closed-form PACOH inspires a practical meta-learning approach that avoids the reliance on bi-level optimization, giving rise to a stochastic optimization problem that is amenable to standard variational methods that scale well. Our experiments show that, when instantiating the PACOH with Gaussian processes and Bayesian Neural Networks models, the resulting methods are more scalable, and yield state-of-the-art performance, both in terms of predictive accuracy and the quality of uncertainty estimates.

メタ学習は、関連する学習タスクのデータセットから有用な帰納的バイアスを取得することにより、新しいタスクの学習プロセスを高速化することを目的としています。実際には、利用可能な関連タスクの数は少ないことが多いですが、既存のアプローチのほとんどはタスクが豊富にあることを前提としているため、非現実的で過剰適合になりがちです。メタ学習の文献における中心的な問題は、未知のタスクへの一般化を確実にするためにどのように正規化するかということです。この研究では、PAC-ベイジアン理論を使用した理論的分析を提供し、メタ学習の一般化境界を提示します。これは、Rothfussら(2021)によって最初に導き出されました。重要なのは、この境界により、PACOHと呼ばれる最適な超事後分布の閉じた形式を導き出すことができ、最高のパフォーマンス保証につながることです。メタ学習のこれらの保証がどのような条件で、どの程度PAC-ベイジアンのタスクごとの学習境界を改善するかについて、理論的分析と実証的なケーススタディを提供します。閉じた形式のPACOHは、2レベル最適化への依存を回避する実用的なメタ学習アプローチを刺激し、拡張性に優れた標準変分法に適した確率的最適化問題を生み出します。私たちの実験では、ガウス過程とベイジアンニューラルネットワークモデルを使用してPACOHをインスタンス化すると、結果として得られる方法はより拡張性が高く、予測精度と不確実性推定の品質の両方の点で最先端のパフォーマンスが得られることが示されています。

Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning
二重双対性:制約付き強化学習のための変分主双対方策最適化

We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure, subject to a convex constraint. Designing algorithms for a constrained convex MDP faces several challenges, including (1) handling the large state space, (2) managing the exploration/exploitation tradeoff, and (3) solving the constrained optimization where the objective and the constraint are both nonlinear functions of the visitation measure. In this work, we present a model-based algorithm, Variational Primal-Dual Policy Optimization (VPDPO), in which Lagrangian and Fenchel duality are implemented to reformulate the original constrained problem into an unconstrained primal-dual optimization. The primal variables are updated by model-based value iteration following the principle of Optimism in the Face of Uncertainty (OFU), while the dual variables are updated by gradient ascent. Moreover, by embedding the visitation measure into a finite-dimensional space, we can handle large state spaces by incorporating function approximation. Two notable examples are (1) Kernelized Nonlinear Regulators and (2) Low-rank MDPs. We prove that with an optimistic planning oracle, our algorithm achieves sublinear regret and constraint violation in both cases and can attain the globally optimal policy of the original constrained problem.

私たちは、制約付き凸マルコフ決定過程(MDP)を研究します。その目的は、凸制約の下で、訪問測度の凸関数を最小化することにあります。制約付き凸MDPのアルゴリズムの設計には、(1)大きな状態空間の処理、(2)探索/活用のトレードオフの管理、(3)目的と制約が両方とも訪問測度の非線形関数である場合の制約付き最適化の解決など、いくつかの課題があります。この研究では、モデルベースのアルゴリズムである変分プライマル-デュアルポリシー最適化(VPDPO)を提示します。このアルゴリズムでは、ラグランジュおよびフェンシェルの双対性が実装され、元の制約付き問題を制約なしのプライマル-デュアル最適化に再定式化しています。プライマル変数は、不確実性に直面した楽観主義(OFU)の原則に従ってモデルベースの値反復によって更新され、デュアル変数は勾配上昇によって更新されます。さらに、訪問測度を有限次元空間に埋め込むことで、関数近似を組み込むことで大きな状態空間を扱うことができます。注目すべき2つの例は、(1)カーネル化非線形レギュレータと(2)低ランクMDPです。楽観的計画オラクルを使用すると、アルゴリズムは両方のケースで線形以下の後悔と制約違反を達成し、元の制約付き問題のグローバルに最適なポリシーを実現できることを証明します。

On Unbalanced Optimal Transport: Gradient Methods, Sparsity and Approximation Error
不平衡最適輸送について:勾配法、スパース性および近似誤差

We study the Unbalanced Optimal Transport (UOT) between two measures of possibly different masses with at most $n$ components, where the marginal constraints of standard Optimal Transport (OT) are relaxed via Kullback-Leibler divergence with regularization factor $\tau$. Although only Sinkhorn-based UOT solvers have been analyzed in the literature with the iteration complexity of ${O}\big(\tfrac{\tau \log(n)}{\varepsilon} \log\big(\tfrac{\log(n)}{{\varepsilon}}\big)\big)$ and per-iteration cost of $O(n^2)$ for achieving the desired error $\varepsilon$, their positively dense output transportation plans strongly hinder the practicality. On the other hand, while being vastly used as heuristics for computing UOT in modern deep learning applications and having shown success in sparse OT problem, gradient methods applied to UOT have not been formally studied. In this paper, we propose a novel algorithm based on Gradient Extrapolation Method (GEM-UOT) to find an $\varepsilon$-approximate solution to the UOT problem in $O\big( \kappa \log\big(\frac{\tau n}{\varepsilon}\big) \big)$ iterations with $\widetilde{O}(n^2)$ per-iteration cost, where $\kappa$ is the condition number depending on only the two input measures. Our proof technique is based on a novel dual formulation of the squared $\ell_2$-norm UOT objective, which fills the lack of sparse UOT literature and also leads to a new characterization of approximation error between UOT and OT. To this end, we further present a novel approach of OT retrieval from UOT, which is based on GEM-UOT with fine tuned $\tau$ and a post-process projection step. Extensive experiments on synthetic and real datasets validate our theories and demonstrate the favorable performance of our methods in practice. We showcase GEM-UOT on the task of color transfer in terms of both the quality of the transfer image and the sparsity of the transportation plan.

私たちは、最大$n$個の成分を持つ、おそらく異なる質量を持つ2つの測度間の不均衡最適輸送(UOT)を研究します。ここでは、標準最適輸送(OT)の周辺制約が、正則化係数$\tau$によるKullback-Leiblerダイバージェンスによって緩和されます。文献では、反復複雑度が${O}\big(\tfrac{\tau \log(n)}{\varepsilon} \log\big(\tfrac{\log(n)}{{\varepsilon}}\big)\big)$で、目的の誤差$\varepsilon$を達成するための反復あたりのコストが$O(n^2)$であるSinkhornベースのUOTソルバーのみが分析されていますが、それらの正に密な出力輸送計画は実用性を著しく妨げています。一方、現代のディープラーニングアプリケーションでUOTを計算するためのヒューリスティックとして広く使用されており、スパースOT問題で成功していることが示されているものの、UOTに適用される勾配法は正式には研究されていません。この論文では、勾配外挿法(GEM-UOT)に基づく新しいアルゴリズムを提案し、反復あたりのコストが$\widetilde{O}(n^2)$である$O\big( \kappa \log\big(\frac{\tau n}{\varepsilon}\big) \big)$回の反復でUOT問題に対する$\varepsilon$近似解を見つけます。ここで、$\kappa$は2つの入力測度のみに依存する条件数です。私たちの証明手法は、2乗$\ell_2$ノルムUOT目的関数の新しい双対定式化に基づいており、スパースUOT文献の不足を補うとともに、UOTとOT間の近似誤差の新しい特徴付けにもつながります。この目的のために、我々はさらに、微調整された$\tau$と後処理投影ステップを備えたGEM-UOTに基づく、UOTからのOT取得の新しいアプローチを提示します。合成データセットと実際のデータセットでの広範な実験により、我々の理論が検証され、実際の方法の良好なパフォーマンスが実証されています。私たちは、転送画像の品質と輸送計画のスパース性の両方の観点から、色転送タスクにおけるGEM-UOTを紹介します。
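To make the Sinkhorn baseline mentioned in the abstract concrete, here is a minimal sketch of the standard entropic-regularized scaling iteration for KL-relaxed UOT. It is not GEM-UOT, the method the paper proposes. The function name `sinkhorn_uot`, the entropic strength `eta`, and the toy inputs are assumptions made for illustration; only $\tau$ and the $O(n^2)$ per-iteration cost come from the abstract.

```python
import numpy as np

def sinkhorn_uot(a, b, C, tau=1.0, eta=0.05, n_iter=500):
    """Sinkhorn-style scaling for entropic UOT (illustrative baseline, not GEM-UOT).

    a, b : nonnegative mass vectors (their totals may differ)
    C    : cost matrix of shape (len(a), len(b))
    tau  : weight of the KL marginal relaxation
    eta  : entropic regularization strength (assumed parameter)
    """
    K = np.exp(-C / eta)           # Gibbs kernel: dense n x n matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = tau / (tau + eta)      # exponent < 1 softens the marginal projection
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power     # each update costs O(n^2)
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

# Toy problem: two measures with different total mass on a 1-D grid.
rng = np.random.default_rng(0)
n = 5
a = rng.random(n)
b = 2.0 * rng.random(n)
x = np.linspace(0.0, 1.0, n)
C = (x[:, None] - x[None, :]) ** 2
plan = sinkhorn_uot(a, b, C)
print("strictly positive entries:", int(np.sum(plan > 0)), "of", plan.size)
```

Every entry of the returned plan is strictly positive because the Gibbs kernel is, which is the density issue the abstract raises as a practical drawback of Sinkhorn-based solvers.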

Over-parameterized Deep Nonparametric Regression for Dependent Data with Its Applications to Reinforcement Learning
従属データに対する過剰パラメータ化された深層ノンパラメトリック回帰と強化学習への応用

In this paper, we provide statistical guarantees for over-parameterized deep nonparametric regression in the presence of dependent data. By decomposing the error, we establish non-asymptotic error bounds for deep estimation, which is achieved by effectively balancing the approximation and generalization errors. We have derived an approximation result for Hölder functions with constrained weights. Additionally, the generalization error is bounded by the weight norm, allowing for a neural network parameter number that is much larger than the training sample size. Furthermore, we address the issue of the curse of dimensionality by assuming that the samples originate from distributions with low intrinsic dimensions. Under this assumption, we are able to overcome the challenges posed by high-dimensional spaces. By incorporating an additional error propagation mechanism, we derive oracle inequalities for the over-parameterized deep fitted $Q$-iteration.

この論文では、従属データが存在する場合の過度にパラメーター化されたディープノンパラメトリック回帰の統計的保証を提供します。誤差を分解することにより、近似誤差と一般化誤差を効果的にバランスさせることにより、深い推定のための非漸近誤差境界を確立します。重みが制約されたHölder関数の近似結果を導き出しました。さらに、汎化誤差は重みノルムによって制限されるため、学習サンプルサイズよりもはるかに大きいニューラルネットワークパラメーター数が可能になります。さらに、次元の呪いの問題に、サンプルが低固有次元の分布に由来すると仮定することで対処します。この仮定の下で、私たちは高次元空間がもたらす課題を克服することができます。追加の誤差伝播メカニズムを組み込むことにより、過度にパラメータ化されたディープフィットされた$Q$-iterationのオラクル不等式を導出します。

A Novel Integer Linear Programming Approach for Global L0 Minimization
大域L0最小化のための新しい整数線形計画法

Given a vector $y \in \mathbb{R}^n$ and a matrix $H \in \mathbb{R}^{n\times m}$, the sparse approximation problem $\mathcal P_{0/p}$ asks for a point $x$ such that $\|y - Hx\|_p \leq \alpha$, for a given scalar $\alpha$, minimizing the size of the support $\|x\|_0 := \#\{j \ |\ x_j \neq 0 \}$. Existing convex mixed-integer programming formulations for $\mathcal P_{0/p}$ are of a kind referred to as “big-$M$”, meaning that they involve the use of a bound $M$ on the values of $x$. When a proper value for $M$ is not known beforehand, these formulations are not exact, in the sense that they may fail to recover the wanted global minimizer. In this work, we study the polytopes arising from these formulations and derive valid inequalities for them. We first use these inequalities to design a branch-and-cut algorithm for these models. Additionally, we prove that these inequalities are sufficient to describe the set of feasible supports for $\mathcal P_{0/p}$. Based on this result, we introduce a new (and the first to our knowledge) $M$-independent integer linear programming formulation for $\mathcal P_{0/p}$, which guarantees the recovery of the global minimizer. We propose a practical approach to tackle this formulation, which has exponentially many constraints. The proposed methods are then compared in computational experimentation to test their potential practical contribution.

ベクトル$y \in \mathbb{R}^n$と行列$H \in \mathbb{R}^{n\times m}$が与えられた場合、疎近似問題$\mathcal P_{0/p}$は、与えられたスカラー$\alpha$に対して、サポートのサイズ$\|x\|_0 := \#\{j \ |\ x_j \neq 0 \}$を最小化する、$\|y - Hx\|_p \leq \alpha$となる点$x$を求めます。$\mathcal P_{0/p}$の既存の凸混合整数計画法の定式化は、「big-$M$」と呼ばれる種類のものであり、$x$の値に対して境界$M$を使用することを意味します。$M$の適切な値が事前にわからない場合、これらの定式化は正確ではなく、必要なグローバル最小化を回復できない可能性があります。この研究では、これらの定式化から生じる多面体を調べ、それらに対して有効な不等式を導出します。最初にこれらの不等式を使用して、これらのモデルの分岐切断アルゴリズムを設計します。さらに、これらの不等式が$\mathcal P_{0/p}$の実行可能なサポートのセットを表すのに十分であることを証明します。この結果に基づいて、グローバル最小化の回復を保証する、$\mathcal P_{0/p}$の新しい(そして私たちの知る限り初の) $M$に依存しない整数線形計画法定式化を導入します。指数関数的に多くの制約があるこの定式化に対処するための実用的なアプローチを提案します。次に、提案された方法を計算実験で比較し、潜在的な実用的な貢献をテストします。
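For orientation, a typical "big-$M$" mixed-integer formulation of $\mathcal P_{0/p}$ can be written as below. This is a common textbook form given here as an assumption; the paper's exact formulation may differ. The binary variables $b_j$ are introduced for illustration and indicate whether $x_j$ is allowed to be nonzero.

```latex
\begin{aligned}
\min_{x \in \mathbb{R}^m,\; b \in \{0,1\}^m} \quad & \sum_{j=1}^{m} b_j \\
\text{s.t.} \quad & \|y - Hx\|_p \leq \alpha, \\
& -M\, b_j \leq x_j \leq M\, b_j, \qquad j = 1, \dots, m.
\end{aligned}
```

If $M$ is chosen smaller than the largest entry (in absolute value) of every true minimizer, the constraint $|x_j| \leq M b_j$ cuts those minimizers off. This is the inexactness the abstract refers to, and it is what an $M$-independent formulation avoids.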

Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence
リーマンガウス・ニュートンによる低位テンソル推定:統計的最適性と2次収束

In this paper, we consider the estimation of a low Tucker rank tensor from a number of noisy linear measurements. The general problem covers many specific examples arising from applications, including tensor regression, tensor completion, and tensor PCA/SVD. We consider an efficient Riemannian Gauss-Newton (RGN) method for low Tucker rank tensor estimation. Different from the generic (super)linear convergence guarantee of RGN in the literature, we prove the first local quadratic convergence guarantee of RGN for low-rank tensor estimation in the noisy setting under some regularity conditions and provide the corresponding estimation error upper bounds. A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. The merit of RGN is illustrated through two machine learning applications: tensor regression and tensor SVD. Finally, we provide the simulation results to corroborate our theoretical findings.

この論文では、いくつかのノイズの多い線形測定から低いタッカーランクテンソルの推定について考察します。一般的な問題は、テンソル回帰、テンソル完了、テンソルPCA/SVDなど、アプリケーションから生じる多くの具体的な例をカバーしています。低タッカーランクテンソル推定のための効率的なリーマンガウスニュートン(RGN)法を検討します。文献のRGNの一般的な(超)線形収束保証とは異なり、いくつかの規則性条件下でのノイズの多い設定での低ランクテンソル推定に対するRGNの最初の局所二次収束保証を証明し、対応する推定誤差の上限を提供します。RGNの統計的最適性を示す、上限と一致する決定論的推定誤差の下限が提供されます。RGNの利点は、テンソル回帰とテンソルSVDという2つの機械学習アプリケーションを通じて示されています。最後に、理論的な結果を裏付けるためにシミュレーション結果を提供します。

Randomized Spectral Co-Clustering for Large-Scale Directed Networks
大規模有向ネットワークのためのランダム化スペクトル共クラスタリング

Directed networks are broadly used to represent asymmetric relationships among units. Co-clustering aims to cluster the senders and receivers of directed networks simultaneously. In particular, the well-known spectral clustering algorithm could be modified as the spectral co-clustering to co-cluster directed networks. However, large-scale networks pose great computational challenges to it. In this paper, we leverage sketching techniques and derive two randomized spectral co-clustering algorithms, one random-projection-based and the other random-sampling-based, to accelerate the co-clustering of large-scale directed networks. We theoretically analyze the resulting algorithms under two generative models – the stochastic co-block model and the degree-corrected stochastic co-block model, and establish their approximation error rates and misclustering error rates, indicating better bounds than the state-of-the-art results of co-clustering literature. Numerically, we design and conduct simulations to support our theoretical results and test the efficiency of the algorithms on real networks with up to millions of nodes. A publicly available R package RandClust is developed for better usability and reproducibility of the proposed methods.

有向ネットワークは、ユニット間の非対称な関係を表すために広く使用されています。共クラスタリングは、有向ネットワークの送信者と受信者を同時にクラスタリングすることを目的としています。特に、よく知られているスペクトルクラスタリングアルゴリズムは、スペクトル共クラスタリングとして修正され、有向ネットワークを共クラスタリングすることができます。ただし、大規模ネットワークでは計算上の大きな課題があります。この論文では、スケッチ技法を活用して、ランダム射影ベースとランダムサンプリングベースの2つのランダム化スペクトル共クラスタリングアルゴリズムを導出し、大規模有向ネットワークの共クラスタリングを高速化します。得られたアルゴリズムを2つの生成モデル(確率的共ブロックモデルと次数補正確率的共ブロックモデル)で理論的に分析し、近似エラー率とミスクラスタリングエラー率を確立して、共クラスタリングの文献の最先端の結果よりも優れた境界を示しています。数値的には、理論的な結果を裏付けるシミュレーションを設計して実行し、最大数百万のノードを持つ実際のネットワークでアルゴリズムの効率をテストします。提案された方法の使いやすさと再現性を向上させるために、公開されているRパッケージRandClustが開発されています。

On Learning Rates and Schrödinger Operators
学習率とシュレーディンガー演算子について

Understanding the iterative behavior of stochastic optimization algorithms for minimizing nonconvex functions remains a crucial challenge in demystifying deep learning. In particular, it is not yet understood why certain simple techniques are remarkably effective for tuning the learning rate in stochastic gradient descent (SGD), arguably the most basic optimizer for training deep neural networks. This class of techniques includes learning rate decay, which begins with a large initial learning rate and is gradually reduced. In this paper, we present a general theoretical analysis of the effect of the learning rate in SGD. Our analysis is based on the use of a learning-rate-dependent stochastic differential equation (LR-dependent SDE) as a tool that allows us to set SGD distinctively apart from both gradient descent and stochastic gradient Langevin dynamics (SGLD). In contrast to prior research, our analysis builds on the analysis of a partial differential equation that models the evolution of probability densities, drawing insights from Wainwright and Jordan (2006); Jordan (2018). From this perspective, we derive the linear convergence rate of the probability densities, highlighting its dependence on the learning rate. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten-Laplacian, a special case of the Schrödinger operator associated with the LR-dependent SDE. This expression clearly reveals the dependence of the linear convergence rate on the learning rate—the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.

非凸関数を最小化する確率的最適化アルゴリズムの反復動作を理解することは、ディープラーニングの謎を解く上で依然として重要な課題です。特に、ディープニューラルネットワークのトレーニングのための最も基本的な最適化法と言える確率的勾配降下法(SGD)において、学習率の調整に特定の単純な手法がなぜ驚くほど効果的なのかはまだわかっていません。このクラスの手法には、大きな初期学習率から始まり、徐々に減少していく学習率の減衰が含まれます。この論文では、SGDにおける学習率の影響に関する一般的な理論的分析を示します。私たちの分析は、学習率依存の確率微分方程式(LR依存SDE)を、SGDを勾配降下法や確率的勾配ランジュバン動力学(SGLD)と明確に区別できるツールとして使用することに基づいています。以前の研究とは対照的に、私たちの分析は確率密度の進化をモデル化する偏微分方程式の分析に基づいており、Wainwright and Jordan (2006)およびJordan (2018)から洞察を得ています。この観点から、確率密度の線形収束率を導出し、学習率への依存性を強調します。さらに、LR依存SDEに関連するシュレーディンガー演算子の特殊なケースであるWitten-Laplacianのスペクトルを解析することで、最適な線形率の明示的な表現を取得します。この表現は、線形収束率が学習率に依存していることを明確に示しています。つまり、広範な非凸関数のクラスでは学習率がゼロに近づくにつれて線形率は急速にゼロに減少しますが、強い凸関数では線形率は一定のままです。非凸問題と凸問題のこの明確な違いに基づいて、非凸最適化に学習率減衰を使用する利点の数学的解釈を示します。

Principled Out-of-Distribution Detection via Multiple Testing
多重検定による原則的な分布外検出

We study the problem of out-of-distribution (OOD) detection, that is, detecting whether a machine learning (ML) model’s output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the ML model, which provides insights for the construction of powerful tests for OOD detection. We also propose a procedure, inspired by multiple hypothesis testing, to systematically combine any number of different statistics from the ML model using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural network architectures.

私たちは、分布外(OOD)検出、つまり機械学習(ML)モデルの出力が推論時に信頼できるかどうかを検出する問題を研究します。これまでの研究では、OOD検出のテストが数多く提案されていますが、この問題を研究するための正式なフレームワークが欠けています。私たちは、入力分布とMLモデルの両方を含むOODの概念の定義を提案します。これは、OOD検出の強力なテストの構築に役立つ情報を提供します。また、共形p値を使用して、MLモデルから任意の数の異なる統計を体系的に組み合わせるための、多重仮説検定にヒントを得た手順も提案します。さらに、分布内サンプルを誤ってOODとして分類する確率について強力な保証を提供します。我々の実験では、これまでの研究で提案されたしきい値ベースのテストは特定の設定ではうまく機能しますが、異なるOODインスタンス間で一様にうまく機能するわけではないことがわかりました。対照的に、複数の統計を組み合わせる我々の提案方法は、異なるデータセットとニューラルネットワークアーキテクチャ間で一様にうまく機能します。
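The idea of combining several model statistics through conformal p-values can be sketched as follows. The abstract does not spell out the combination rule, so this sketch swaps in a plain Bonferroni correction as a stand-in; it is not the paper's procedure. The function names, the convention that larger scores mean "more atypical", and the synthetic calibration data are all assumptions for illustration.

```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Conformal p-value: (1 + #{calibration scores >= test score}) / (n + 1).

    Assumes larger scores indicate more atypical inputs.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    return (1 + int(np.sum(cal_scores >= test_score))) / (len(cal_scores) + 1)

def flag_ood_bonferroni(cal_scores_per_stat, test_scores, alpha=0.05):
    """Flag a test point as OOD if any statistic's p-value is <= alpha / k.

    Bonferroni is used here only as a simple, well-known stand-in.
    """
    p_values = [conformal_p_value(c, t)
                for c, t in zip(cal_scores_per_stat, test_scores)]
    k = len(p_values)
    return bool(min(p_values) <= alpha / k), p_values

# Synthetic calibration scores for two hypothetical statistics.
rng = np.random.default_rng(0)
calibration = [rng.normal(size=99), rng.normal(size=99)]
flagged, p_vals = flag_ood_bonferroni(calibration, test_scores=[10.0, 10.0])
print(flagged, p_vals)
```

When the calibration and test scores are exchangeable, each conformal p-value is super-uniform. The Bonferroni threshold then keeps the probability of wrongly flagging an in-distribution sample at or below `alpha`.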

Scaling Up Models and Data with t5x and seqio
t5x と seqio によるモデルとデータのスケールアップ

Scaling up training datasets and model parameters has benefited neural network-based language models, but also presents challenges like distributed compute, input data bottlenecks and reproducibility of results. We introduce two simple and scalable software libraries that simplify these issues: t5x enables training large language models at scale, while seqio enables reproducible input and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on multi-terabyte datasets. Configurations and instructions for T5-like and GPT-like models are also provided. The libraries can be found at https://github.com/google-research/t5x and https://github.com/google/seqio.

トレーニングデータセットとモデルパラメータのスケールアップは、ニューラルネットワークベースの言語モデルにメリットをもたらしましたが、分散コンピューティング、入力データのボトルネック、結果の再現性などの課題も提示しています。これらの問題を簡素化する2つのシンプルでスケーラブルなソフトウェアライブラリを紹介します:t5xは大規模な言語モデルのトレーニングを可能にし、seqioは再現性のある入力および評価パイプラインを可能にします。これらのオープンソースライブラリは、数テラバイトのデータセットで数千億のパラメータを持つモデルをトレーニングするために使用されてきました。T5ライクモデルとGPTライクモデルの設定と手順も記載されています。ライブラリはhttps://github.com/google-research/t5xとhttps://github.com/google/seqioにあります。

On the Dynamics Under the Unhinged Loss and Beyond
アンヒンジド・ロスとその先でのダイナミクスについて

Recent works have studied implicit biases in deep learning, especially the behavior of last-layer features and classifier weights. However, they usually need to simplify the intermediate dynamics under gradient flow or gradient descent due to the intractability of loss functions and model architectures. In this paper, we introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze the closed-form dynamics while requiring as few simplifications or assumptions as possible. The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization. Based on the layer-peeled model that views last-layer features as free optimization variables, we conduct a thorough analysis in the unconstrained, regularized, and spherical constrained cases, as well as the case where the neural tangent kernel remains invariant. To bridge the performance of the unhinged loss to that of Cross-Entropy (CE), we investigate the scenario of fixing classifier weights with a specific structure (e.g., a simplex equiangular tight frame). Our analysis shows that these dynamics converge exponentially fast to a solution depending on the initialization of features and classifier weights. These theoretical results not only offer valuable insights, including explicit feature regularization and rescaled learning rates for enhancing practical training with the unhinged loss, but also extend their applicability to other loss functions. Finally, we empirically demonstrate these theoretical results and insights through extensive experiments.

最近の研究では、ディープラーニングにおける暗黙のバイアス、特に最終層の特徴と分類器の重みの挙動が研究されています。しかし、損失関数とモデルアーキテクチャの扱いにくさのため、勾配フローまたは勾配降下法の下での中間ダイナミクスを簡略化する必要があります。この論文では、簡潔な損失関数であるアンヒンジ損失を紹介します。アンヒンジ損失は、可能な限り簡略化や仮定を少なくしながら、閉形式のダイナミクスを分析するためのより多くの数学的機会を提供します。アンヒンジ損失により、時間変動学習率や特徴の正規化など、より実用的な手法を検討できます。最終層の特徴を自由な最適化変数と見なすレイヤーピールモデルに基づいて、制約なし、正規化、球面制約の場合、およびニューラル接線カーネルが不変のままの場合について徹底的な分析を行います。アンヒンジ損失のパフォーマンスをクロスエントロピー(CE)のパフォーマンスに橋渡しするために、特定の構造(たとえば、単純等角タイトフレーム)で分類器の重みを固定するシナリオを調査します。分析により、これらのダイナミクスは、特徴と分類器の重みの初期化に応じて、指数関数的に速くソリューションに収束することが示されています。これらの理論的結果は、明示的な特徴の正規化や、アンヒンジ損失による実際のトレーニングを強化するための再スケーリングされた学習率など、貴重な洞察を提供するだけでなく、他の損失関数への適用範囲も拡張します。最後に、広範な実験を通じて、これらの理論的結果と洞察を実証します。

Set-valued Classification with Out-of-distribution Detection for Many Classes
多クラスに対する分布外検出を備えた集合値分類

Set-valued classification, a new classification paradigm that aims to identify all the plausible classes that an observation belongs to, improves over the traditional classification paradigms in multiple aspects. Existing set-valued classification methods do not consider the possibility that the test set may contain out-of-distribution data, that is, the emergence of a new class that never appeared in the training data. Moreover, they are computationally expensive when the number of classes is large. We propose a Generalized Prediction Set (GPS) approach to set-valued classification while considering the possibility of a new class in the test data. The proposed classifier uses kernel learning and empirical risk minimization to encourage a small expected size of the prediction set while guaranteeing that the class-specific accuracy is at least some value specified by the user. For high-dimensional data, further improvement is obtained through kernel feature selection. Unlike previous methods, the proposed method achieves a good balance between accuracy, efficiency, and out-of-distribution detection rate. Moreover, our method can be applied in parallel to all the classes to alleviate the computational burden. Both theoretical analysis and numerical experiments are conducted to illustrate the effectiveness of the proposed method.

セット値分類は、観測が属する可能性のあるすべてのクラスを識別することを目的とした新しい分類パラダイムであり、従来の分類パラダイムをさまざまな面で改善しています。既存のセット値分類法では、テストセットに分布外データが含まれる可能性、つまりトレーニングデータに出現したことのない新しいクラスの出現が考慮されていません。さらに、クラス数が多い場合、計算コストが高くなります。テストデータに新しいクラスが存在する可能性を考慮しながら、セット値分類に一般化予測セット(GPS)アプローチを提案します。提案された分類器は、カーネル学習と経験的リスク最小化を使用して、クラス固有の精度が少なくともユーザーが指定した値であることを保証する一方で、予測セットの予想サイズを小さくすることを推奨します。高次元データの場合、カーネル特徴選択によってさらに改善されます。以前の方法とは異なり、提案された方法は、精度、効率、分布外検出率の間で良好なバランスを実現します。さらに、私たちの方法はすべてのクラスに並列に適用できるため、計算負荷を軽減できます。提案された方法の有効性を示すために、理論分析と数値実験の両方が実施されています。

Diffusion Bridge Mixture Transports, Schrödinger Bridge Problems and Generative Modeling
拡散橋混合輸送、シュレーディンガー橋問題、および生成モデリング

The dynamic Schrödinger bridge problem seeks a stochastic process that defines a transport between two target probability measures, while optimally satisfying the criteria of being closest, in terms of Kullback-Leibler divergence, to a reference process. We propose a novel sampling-based iterative algorithm, the iterated diffusion bridge mixture (IDBM) procedure, aimed at solving the dynamic Schrödinger bridge problem. The IDBM procedure exhibits the attractive property of realizing a valid transport between the target probability measures at each iteration. We perform an initial theoretical investigation of the IDBM procedure, establishing its convergence properties. The theoretical findings are complemented by numerical experiments illustrating the competitive performance of the IDBM procedure. Recent advancements in generative modeling employ the time-reversal of a diffusion process to define a generative process that approximately transports a simple distribution to the data distribution. As an alternative, we propose utilizing the first iteration of the IDBM procedure as an approximation-free method for realizing this transport. This approach offers greater flexibility in selecting the generative process dynamics and exhibits accelerated training and superior sample quality over larger discretization intervals. In terms of implementation, the necessary modifications are minimally intrusive, being limited to the training loss definition.

動的シュレーディンガー橋問題は、2つのターゲット確率測度間の転送を定義し、カルバックライブラーダイバージェンスの観点から参照プロセスに最も近いという基準を最適に満たす確率プロセスを求めます。動的シュレーディンガー橋問題を解決することを目的とした、新しいサンプリングベースの反復アルゴリズム、反復拡散橋混合(IDBM)手順を提案します。IDBM手順は、各反復でターゲット確率測度間の有効な転送を実現するという魅力的な特性を示します。IDBM手順の初期理論的調査を行い、その収束特性を確立します。理論的発見は、IDBM手順の競争力のあるパフォーマンスを示す数値実験によって補完されます。生成モデリングの最近の進歩は、拡散プロセスの時間反転を使用して、単純な分布をデータ分布に近似的に転送する生成プロセスを定義します。代わりに、この転送を実現するための近似を使用しない方法として、IDBM手順の最初の反復を使用することを提案します。このアプローチは、生成プロセスのダイナミクスを選択する際の柔軟性を高め、より長い離散化間隔にわたってトレーニングの加速と優れたサンプル品質を実現します。実装の点では、必要な変更は最小限の侵襲性しかなく、トレーニング損失の定義に限定されています。

Multilevel CNNs for Parametric PDEs
パラメトリック偏微分方程式のマルチレベル CNN

We combine concepts from multilevel solvers for partial differential equations (PDEs) with neural network based deep learning and propose a new methodology for the efficient numerical solution of high-dimensional parametric PDEs. An in-depth theoretical analysis shows that the proposed architecture is able to approximate multigrid V-cycles to arbitrary precision with the number of weights only depending logarithmically on the resolution of the finest mesh. As a consequence, approximation bounds for the solution of parametric PDEs by neural networks that are independent on the (stochastic) parameter dimension can be derived. The performance of the proposed method is illustrated on high-dimensional parametric linear elliptic PDEs that are common benchmark problems in uncertainty quantification. We find substantial improvements over state-of-the-art deep learning-based solvers. As particularly challenging examples, random conductivity with high-dimensional non-affine Gaussian fields in 100 parameter dimensions and a random cookie problem are examined. Due to the multilevel structure of our method, the amount of training samples can be reduced on finer levels, hence significantly lowering the generation time for training data and the training time of our method.

私たちは、偏微分方程式(PDE)のマルチレベルソルバーの概念とニューラルネットワークベースのディープラーニングを組み合わせ、高次元パラメトリックPDEの効率的な数値解法のための新しい方法論を提案します。詳細な理論分析により、提案されたアーキテクチャは、重みの数が最細メッシュの解像度に対数的にのみ依存する状態で、マルチグリッドVサイクルを任意の精度で近似できることが示されています。結果として、(確率的)パラメーター次元に依存しないニューラルネットワークによるパラメトリックPDEの解法の近似境界を導出できます。提案された方法のパフォーマンスは、不確実性定量化の一般的なベンチマーク問題である高次元パラメトリック線形楕円PDEで実証されています。最先端のディープラーニングベースのソルバーに比べて大幅に改善されていることがわかります。特に難しい例として、100パラメーター次元の高次元非アフィンガウス場によるランダム伝導率とランダムクッキー問題が検討されます。私たちの方法はマルチレベル構造であるため、トレーニングサンプルの量をより細かいレベルで削減でき、トレーニングデータの生成時間と私たちの方法のトレーニング時間が大幅に短縮されます。

A Unified Recipe for Deriving (Time-Uniform) PAC-Bayes Bounds
(時間一様な) PACベイズ境界を導出するための統一レシピ

We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville’s inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-iid data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.

私たちは、PAC-ベイズの一般化境界を導出するための統一されたフレームワークを提示します。このトピックに関するこれまでのほとんどの文献とは異なり、私たちの境界はいつでも有効（つまり時間均一）であり、つまり、固定サンプルサイズだけでなく、すべての停止時間で成立します。私たちのアプローチは、次の順序で4つのツールを組み合わせたものである：(a)非負スーパーマルチンゲールまたは逆サブマルチンゲール、(b)混合法、(c)ドンスカー-バラダン公式（またはその他の凸双対原理）、および(d)ヴィルの不等式。私たちの主な結果は、幅広いクラスの離散確率過程に成立するPAC-ベイズ定理です。私たちは、この結果が、Seeger、McAllester、Maurer、およびCatoniなどのよく知られた古典的なPAC-ベイズ境界の時間均一バージョン、さらに最近の多くの境界を意味する方法を示す。私たちはまた、いくつかの新しい境界を提示します。私たちのフレームワークにより、従来の仮定を緩和することもできます。特に、非定常損失関数と非iidデータを考慮します。要約すると、過去の境界の導出を統一し、将来の境界の検索を容易にします。つまり、スーパーマルチンゲールまたはサブマルチンゲールの条件が満たされているかどうかを確認するだけで、満たされている場合は(時間的に均一な) PACベイズ境界が保証されます。

Decentralized Robust V-learning for Solving Markov Games with Model Uncertainty
モデル不確実性を持つマルコフゲームを解くための分散型ロバストV学習

The Markov game is a popular reinforcement learning framework for modeling competitive players in a dynamic environment. However, most of the existing works on Markov games focus on computing a certain equilibrium following uncertain interactions among the players but ignore the uncertainty of the environment model, which is ubiquitous in practical scenarios. In this work, we develop a theoretical solution to Markov games with environment model uncertainty. Specifically, we propose a new and tractable notion of robust correlated equilibria for Markov games with environment model uncertainty. In particular, we prove that the robust correlated equilibrium has a simple modification structure, and its characterization of equilibria critically depends on the environment model uncertainty. Moreover, we propose the first fully-decentralized stochastic algorithm for computing such a robust correlated equilibrium. Our analysis proves that the algorithm achieves the polynomial episode complexity $\widetilde{O}( SA^2 H^5 \epsilon^{-2})$ for computing an approximate robust correlated equilibrium with $\epsilon$ accuracy.

マルコフゲームは、動的環境における競争的なプレイヤーをモデル化するための人気の強化学習フレームワークです。しかし、マルコフゲームに関する既存の研究のほとんどは、プレイヤー間の不確実な相互作用に従う特定の均衡を計算することに焦点を当てており、実際のシナリオで普遍的な環境モデルの不確実性を無視しています。この研究では、環境モデルの不確実性を伴うマルコフゲームに対する理論的ソリューションを開発します。具体的には、環境モデルの不確実性を伴うマルコフゲームに対するロバスト相関均衡の新しい扱いやすい概念を提案します。特に、ロバスト相関均衡には単純な修正構造があり、均衡の特徴付けは環境モデルの不確実性に大きく依存することを証明します。さらに、そのようなロバスト相関均衡を計算するための最初の完全に分散化された確率アルゴリズムを提案します。私たちの分析は、アルゴリズムが$\epsilon$の精度で近似ロバスト相関平衡を計算するために多項式エピソード複雑度$\widetilde{O}( SA^2 H^5 \epsilon^{-2})$を達成することを証明しています。

Densely Connected G-invariant Deep Neural Networks with Signed Permutation Representations
符号付き順列表現を持つ密結合G不変深層ニューラルネットワーク

We introduce and investigate, for finite groups $G$, $G$-invariant deep neural network ($G$-DNN) architectures with ReLU activation that are densely connected–i.e., include all possible skip connections. In contrast to other $G$-invariant architectures in the literature, the preactivations of the $G$-DNNs presented here are able to transform by signed permutation representations (signed perm-reps) of $G$. Moreover, the individual layers of the $G$-DNNs are not required to be $G$-equivariant; instead, the preactivations are constrained to be $G$-equivariant functions of the network input in a way that couples weights across all layers. The result is a richer family of $G$-invariant architectures never seen previously. We derive an efficient implementation of $G$-DNNs after a reparameterization of weights, as well as necessary and sufficient conditions for an architecture to be “admissible”– i.e., nondegenerate and inequivalent to smaller architectures. We include code that allows a user to build a $G$-DNN interactively layer-by-layer, with the final architecture guaranteed to be admissible. We show that there are far more admissible $G$-DNN architectures than those accessible with the “concatenated ReLU” activation function from the literature. Finally, we apply $G$-DNNs to two example problems—(1) multiplication in $\{-1, 1\}$ (with theoretical guarantees) and (2) 3D object classification—finding that the inclusion of signed perm-reps significantly boosts predictive performance compared to baselines with only ordinary (i.e., unsigned) perm-reps.

私たちは、有限群$G$に対して、ReLU活性化を備えた密に接続された、つまりすべての可能なスキップ接続を含む$G$不変ディープニューラルネットワーク($G$-DNN)アーキテクチャを導入し、調査します。文献にある他の$G$不変アーキテクチャとは対照的に、ここで紹介する$G$-DNNの事前活性化は、$G$の符号付き順列表現(signed perm-reps)によって変換できます。さらに、$G$-DNNの個々のレイヤーは$G$等変である必要はありません。代わりに、事前活性化は、すべてのレイヤーにわたって重みを結合する方法で、ネットワーク入力の$G$等変関数になるように制約されます。その結果、これまでにない豊富な$G$不変アーキテクチャファミリが生まれます。重みの再パラメータ化後の$G$-DNNの効率的な実装と、アーキテクチャが「許容可能」であるための必要十分条件(つまり、非退化かつより小さなアーキテクチャと同等ではない)を導き出します。ユーザーが$G$-DNNをレイヤーごとに対話的に構築し、最終的なアーキテクチャが許容可能であることが保証されるコードも含まれています。文献の「連結ReLU」活性化関数でアクセスできるものよりも、許容可能な$G$-DNNアーキテクチャがはるかに多く存在することを示します。最後に、$G$-DNNを2つの例題((1) $\{-1, 1\}$での乗算(理論的な保証あり)、(2) 3Dオブジェクト分類)に適用し、符号付きperm-repを含めると、通常の(つまり、符号なし) perm-repのみのベースラインと比較して予測パフォーマンスが大幅に向上することを発見しました。

A Permutation-Free Kernel Independence Test
順列フリーのカーネル独立性テスト

In nonparametric independence testing, we observe i.i.d. data $\{(X_i,Y_i)\}_{i=1}^n$, where $X \in \mathcal{X}, Y \in \mathcal{Y}$ lie in any general spaces, and we wish to test the null that $X$ is independent of $Y$. Modern test statistics such as the kernel Hilbert–Schmidt Independence Criterion (HSIC) and Distance Covariance (dCov) have intractable null distributions due to the degeneracy of the underlying U-statistics. Hence, in practice, one often resorts to using permutation testing, which provides a nonasymptotic guarantee at the expense of recalculating the quadratic-time statistics (say) a few hundred times. In this paper, we provide a simple but nontrivial modification of HSIC and dCov (called xHSIC and xdCov, pronounced “cross” HSIC/dCov) so that they have a limiting Gaussian distribution under the null, and thus do not require permutations. We show that our new tests, like the originals, are consistent against fixed alternatives, and minimax rate optimal against smooth local alternatives. Numerical simulations demonstrate that compared to the permutation tests, our variants have the same power within a constant factor, giving practitioners a new option for large problems or data-analysis pipelines where computation, not sample size, could be the bottleneck.

ノンパラメトリック独立性検定では、i.i.d.データ$\{(X_i,Y_i)\}_{i=1}^n$を観察します。ここで、$X \in \mathcal{X}, Y \in \mathcal{Y}$は任意の一般空間にあり、$X$が$Y$から独立しているという帰無仮説を検定します。カーネルヒルベルト-シュミット独立基準(HSIC)や距離共分散(dCov)などの最新の検定統計量では、基礎となるU統計量の退化により、帰無分布が扱いにくくなります。そのため、実際には、多くの場合、2次時間統計量を(たとえば)数百回再計算する代償として非漸近的保証を提供する順列検定を使用します。この論文では、HSICとdCov (xHSICとxdCovと呼ばれ、「クロス」HSIC/dCovと発音)の単純だが重要な修正を提案し、帰無仮説の下で極限ガウス分布を持つようにして、順列を必要としないようにします。新しい検定は、元の検定と同様に、固定された対立仮説に対して一致性があり、滑らかな局所対立仮説に対してはミニマックスレート最適であることを示します。数値シミュレーションでは、順列検定と比較して、私たちのバリアントは定数係数内で同じ検出力を持つことが実証されており、サンプルサイズではなく計算がボトルネックになる可能性がある大規模な問題やデータ分析パイプラインに対する新しいオプションを実践者に提供します。
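The cost that motivates the permutation-free variants is easiest to see in the standard permutation test the abstract describes: the quadratic-time HSIC statistic is recomputed once per permutation. The sketch below implements that baseline with a biased HSIC estimate and Gaussian kernels. It is not xHSIC, and the kernel choice, fixed bandwidth, function names, and toy data are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(z, bandwidth=1.0):
    """Gaussian-kernel Gram matrix for n samples (rows of z)."""
    z = np.asarray(z, dtype=float)
    z = z.reshape(len(z), -1)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def hsic_biased(K, L):
    """Biased HSIC estimate trace(K H L H) / n^2, quadratic in n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

def hsic_permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value: recompute the statistic once per permutation of y."""
    rng = np.random.default_rng(seed)
    K, L = gaussian_gram(x), gaussian_gram(y)
    observed = hsic_biased(K, L)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(L))
        if hsic_biased(K, L[np.ix_(idx, idx)]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)

# Toy data: y depends strongly on x, so the test should reject independence.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + 0.1 * rng.normal(size=100)
stat, p_value = hsic_permutation_test(x, y)
print(f"HSIC = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

With `n_perm=200` the statistic is evaluated 201 times. A test whose null distribution is Gaussian in the limit needs only one evaluation plus a variance estimate, which is the saving the abstract points to.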

LapGym – An Open Source Framework for Reinforcement Learning in Robot-Assisted Laparoscopic Surgery
LapGym – ロボット支援腹腔鏡手術における強化学習のためのオープンソースフレームワーク

Recent advances in reinforcement learning (RL) have increased the promise of introducing cognitive assistance and automation to robot-assisted laparoscopic surgery (RALS). However, progress in algorithms and methods depends on the availability of standardized learning environments that represent skills relevant to RALS. We present LapGym, a framework for building RL environments for RALS that models the challenges posed by surgical tasks, and sofaenv, a diverse suite of 12 environments. Motivated by surgical training, these environments are organized into 4 tracks: Spatial Reasoning, Deformable Object Manipulation & Grasping, Dissection, and Thread Manipulation. Each environment is highly parametrizable for increasing difficulty, resulting in a high performance ceiling for new algorithms. We use Proximal Policy Optimization (PPO) to establish a baseline for model-free RL algorithms, investigating the effect of several environment parameters on task difficulty. Finally, we show that many environments and parameter configurations reflect well-known, open problems in RL research, allowing researchers to continue exploring these fundamental problems in a surgical context. We aim to provide a challenging, standard environment suite for further development of RL for RALS, ultimately helping to realize the full potential of cognitive surgical robotics. LapGym is publicly accessible through GitHub (https://github.com/ScheiklP/lap_gym).

強化学習(RL)の最近の進歩により、ロボット支援腹腔鏡手術(RALS)に認知支援と自動化を導入する可能性が高まっています。ただし、アルゴリズムと方法の進歩は、RALSに関連するスキルを表す標準化された学習環境の利用可能性に依存します。ここでは、外科手術の課題をモデル化するRALSのRL環境を構築するためのフレームワークであるLapGymと、12の環境からなる多様なスイートであるsofaenvを紹介します。外科手術のトレーニングに動機付けられたこれらの環境は、空間推論、変形可能オブジェクトの操作と把持、解剖、および糸操作の4つのトラックに編成されています。各環境は高度にパラメーター化可能で難易度が上がるため、新しいアルゴリズムのパフォーマンス上限が高くなります。私たちは、近似ポリシー最適化(PPO)を使用して、モデルフリーRLアルゴリズムのベースラインを確立し、いくつかの環境パラメーターがタスクの難易度に与える影響を調査します。最後に、多くの環境とパラメータ構成がRL研究におけるよく知られた未解決の問題を反映していることを示し、研究者が外科的コンテキストでこれらの基本的な問題を引き続き調査できるようにします。私たちの目標は、RALS向けRLのさらなる開発のための挑戦的な標準環境スイートを提供し、最終的には認知外科ロボットの潜在能力を最大限に引き出すことです。LapGymはGitHub (https://github.com/ScheiklP/lap_gym)から公開されています。

TorchOpt: An Efficient Library for Differentiable Optimization
TorchOpt:微分可能な最適化のための効率的なライブラリ

Differentiable optimization algorithms often involve expensive computations of various meta-gradients. To address this, we design and implement TorchOpt, a new PyTorch-based differentiable optimization library. TorchOpt provides an expressive and unified programming interface that simplifies the implementation of explicit, implicit, and zero-order gradients. Moreover, TorchOpt has a distributed execution runtime capable of parallelizing diverse operations linked to differentiable optimization tasks across CPU and GPU devices. Experimental results demonstrate that TorchOpt achieves a 5.2× training time speedup in a cluster. TorchOpt is open-sourced at https://github.com/metaopt/torchopt and has become a PyTorch Ecosystem project.

微分可能最適化アルゴリズムには、多くの場合、さまざまなメタ勾配の高価な計算が含まれます。これに対処するために、PyTorchベースの新しい微分可能最適化ライブラリであるTorchOptを設計し、実装します。TorchOptは、明示的、暗黙的、およびゼロ次勾配の実装を簡素化する、表現力豊かで統一されたプログラミングインターフェースを提供します。さらに、TorchOptは、CPUおよびGPUデバイス間で微分可能な最適化タスクにリンクされた多様な操作を並列化できる分散実行ランタイムを備えています。実験結果から、TorchOptはクラスタ内での学習時間を5.2倍高速化することが実証されています。TorchOptはhttps://github.com/metaopt/torchoptでオープンソース化されており、PyTorchエコシステムプロジェクトになっています。

Confidence and Uncertainty Assessment for Distributional Random Forests
分布ランダムフォレストの信頼度と不確実性の評価

The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.

Distributional Random Forest (DRF)は、多変量条件付き分布を推定するために最近導入されたランダムフォレストアルゴリズムです。その一般的な推定手順により、条件付き平均処置効果、条件付き分位数、条件付き相関など、幅広いターゲットを推定するために使用できます。ただし、これまでのところ、DRF予測の一致性と収束率に関する結果しか得られていません。DRFの漸近分布を特徴付け、そのブートストラップ近似を開発します。これにより、標準誤差を定量化するための推論ツールを導き出し、漸近的なカバレッジ保証を持つ信頼領域を構築できます。シミュレーション研究では、低次元のターゲットの推論と2つの母集団間の分布の違いの検定について、開発された理論を経験的に検証します。

Hard-Constrained Deep Learning for Climate Downscaling
気候のダウンスケーリングのためのハード制約付き深層学習

The availability of reliable, high-resolution climate and weather data is important to inform long-term decisions on climate adaptation and mitigation and to guide rapid responses to extreme events. Forecasting models are limited by computational costs and, therefore, often generate coarse-resolution predictions. Statistical downscaling, including super-resolution methods from deep learning, can provide an efficient method of upsampling low-resolution data. However, despite achieving visually compelling results in some cases, such models frequently violate conservation laws when predicting physical variables. In order to conserve physical quantities, here we introduce methods that guarantee statistical constraints are satisfied by a deep learning downscaling model, while also improving their performance according to traditional metrics. We compare different constraining approaches and demonstrate their applicability across different neural architectures as well as a variety of climate and weather data sets. Besides enabling faster and more accurate climate predictions through downscaling, we also show that our novel methodologies can improve super-resolution for satellite data and natural images data sets.

信頼性の高い高解像度の気候および気象データが利用可能であることは、気候への適応と緩和に関する長期的な決定に情報を提供し、異常気象への迅速な対応を導くために重要です。予測モデルは計算コストによって制限されるため、粗い解像度の予測を生成することがよくあります。統計的ダウンスケーリング(ディープラーニングの超解像法を含む)は、低解像度のデータをアップサンプリングする効率的な方法を提供できます。ただし、視覚的に説得力のある結果が得られる場合もありますが、このようなモデルは物理変数を予測する際に保存則に違反することがよくあります。物理量を保存するために、ここでは、統計的制約がディープラーニングのダウンスケーリングモデルによって満たされることを保証すると同時に、従来の指標に従ってパフォーマンスを向上させる方法を紹介します。さまざまな制約アプローチを比較し、さまざまなニューラルアーキテクチャやさまざまな気候および気象データセットに適用できることを実証します。ダウンスケーリングによって気候予測をより迅速かつ正確に行うことができるだけでなく、新しい方法論によって衛星データや自然画像データセットの超解像を改善できることも示します。
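As a rough illustration of what a hard constraint can look like in downscaling, the sketch below forces the mean of a high-resolution patch to equal the value of the low-resolution pixel it refines. The two function names and the toy numbers are hypothetical and chosen for illustration; this is a minimal sketch of additive and multiplicative renormalization, not the constraint layers from the paper.

```python
def enforce_mean_multiplicative(patch, coarse_value):
    """Rescale a non-negative high-res patch so its mean equals coarse_value."""
    mean = sum(patch) / len(patch)
    if mean == 0:
        # Degenerate patch: fall back to spreading the coarse value uniformly.
        return [coarse_value for _ in patch]
    return [v * coarse_value / mean for v in patch]


def enforce_mean_additive(patch, coarse_value):
    """Shift every high-res value by the same offset so the mean matches."""
    mean = sum(patch) / len(patch)
    return [v + (coarse_value - mean) for v in patch]


# Hypothetical unconstrained network output for one flattened 2x2 patch.
raw_patch = [0.8, 1.4, 1.1, 0.9]
# Value of the matching low-resolution pixel (the quantity to conserve).
coarse = 1.2

mult = enforce_mean_multiplicative(raw_patch, coarse)
add = enforce_mean_additive(raw_patch, coarse)
```

The multiplicative variant keeps non-negative inputs non-negative, which suits quantities such as precipitation, while the additive variant preserves differences between neighbouring pixels. Either step can be applied after the last layer of a network so that the constraint holds by construction rather than through a penalty term.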

Partial Order in Chaos: Consensus on Feature Attributions in the Rashomon Set
混沌の中の部分的な順序:羅生門セットの特徴属性に関するコンセンサス

Post-hoc global/local feature attribution methods are progressively being employed to understand the decisions of complex machine learning models. Yet, because of limited amounts of data, it is possible to obtain a diversity of models with good empirical performance but that provide very different explanations for the same prediction, making it hard to derive insight from them. In this work, instead of aiming at reducing the under-specification of model explanations, we fully embrace it and extract logical statements about feature attributions that are consistent across all models with good empirical performance (i.e. all models in the Rashomon Set). We show that partial orders of local/global feature importance arise from this methodology enabling more nuanced interpretations by allowing pairs of features to be incomparable when there is no consensus on their relative importance. We prove that every relation among features present in these partial orders also holds in the rankings provided by existing approaches. Finally, we present three use cases employing hypothesis spaces with tractable Rashomon Sets (Additive models, Kernel Ridge, and Random Forests) and show that partial orders allow one to extract consistent local and global interpretations of models despite their under-specification.

複雑な機械学習モデルの決定を理解するために、事後的なグローバル/ローカル特徴帰属法が徐々に採用されています。しかし、データ量が限られているため、経験的パフォーマンスは良好であるものの、同じ予測に対して非常に異なる説明を提供する多様なモデルが得られる可能性があり、そこから洞察を引き出すのは困難です。この研究では、モデルの説明の不十分な仕様を削減することを目指すのではなく、それを完全に受け入れ、経験的パフォーマンスが良好なすべてのモデル（つまり、羅生門セットのすべてのモデル）にわたって一貫した特徴帰属に関する論理ステートメントを抽出します。この方法論からローカル/グローバル特徴の重要性の部分的な順序が生まれ、相対的な重要性についてコンセンサスがない場合に特徴のペアを比較できないようにすることで、より微妙な解釈が可能になることを示します。これらの部分的な順序に存在する特徴間のすべての関係は、既存のアプローチによって提供されるランキングでも保持されることを証明します。最後に、扱いやすい羅生門セット(加法モデル、カーネルリッジ、ランダムフォレスト)を使用した仮説空間を使用する3つのユースケースを示し、部分順序によって、モデルの不十分な仕様にもかかわらず、一貫したローカルおよびグローバルなモデルの解釈を抽出できることを示します。
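The consensus idea can be sketched on a finite sample of models: feature i is placed below feature j only if every model agrees, and the pair is left incomparable otherwise. The function name and the attribution values below are hypothetical, and a finite list of models stands in for the full Rashomon Set, which the paper handles analytically.

```python
def consensus_partial_order(attributions):
    """Return pairs (i, j) where feature i is at most as important as
    feature j in every model and strictly less important in at least one.

    attributions: list of per-model lists of feature attributions.
    Importance is measured here by absolute attribution value.
    """
    n_features = len(attributions[0])
    relations = set()
    for i in range(n_features):
        for j in range(n_features):
            if i == j:
                continue
            never_above = all(abs(a[i]) <= abs(a[j]) for a in attributions)
            sometimes_below = any(abs(a[i]) < abs(a[j]) for a in attributions)
            if never_above and sometimes_below:
                relations.add((i, j))
    return relations


# Hypothetical attributions from three well-performing models, three features.
models = [
    [0.10, 0.50, 0.40],
    [0.20, 0.60, 0.70],
    [0.05, 0.90, 0.30],
]
order = consensus_partial_order(models)
```

In this toy input all three models rank feature 0 below features 1 and 2, so those two relations are kept. The models disagree on features 1 and 2, so that pair stays incomparable instead of being forced into a total ranking.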

Avalanche: A PyTorch Library for Deep Continual Learning
Avalanche: 深層継続学習のための PyTorch ライブラリ

Continual learning is the problem of learning from a nonstationary stream of data, a fundamental issue for sustainable and efficient training of deep neural networks over time. Unfortunately, deep learning libraries only provide primitives for offline training, assuming that the model’s architecture and data are fixed. Avalanche is an open source library maintained by the ContinualAI non-profit organization that extends PyTorch by providing first-class support for dynamic architectures, streams of datasets, and incremental training and evaluation methods. Avalanche provides a large set of predefined benchmarks and training algorithms, and it is modular and easy to extend while supporting a wide range of continual learning scenarios. Documentation is available at https://avalanche.continualai.org.

継続的学習とは、非定常的なデータの流れから学習する問題であり、長期にわたってディープニューラルネットワークを持続可能で効率的に学習するための基本的な問題です。残念ながら、ディープラーニングライブラリは、モデルのアーキテクチャとデータが固定されていると仮定して、オフライントレーニング用のプリミティブのみを提供します。Avalancheは、非営利団体ContinualAIが管理するオープンソースライブラリで、動的アーキテクチャ、データセットのストリーム、インクリメンタルトレーニングおよび評価方法に対するファーストクラスのサポートを提供することでPyTorchを拡張します。Avalancheは、事前定義されたベンチマークとトレーニングアルゴリズムの大規模なセットを提供し、拡張が容易でモジュール式でありながら、幅広い継続的な学習シナリオをサポートします。ドキュメントはhttps://avalanche.continualai.orgで入手できます。

Discovering Salient Neurons in deep NLP models
深層NLPモデルにおける顕著なニューロンの発見

While a lot of work has been done in understanding representations learned within deep NLP models and what knowledge they capture, work done towards analyzing individual neurons is relatively sparse. We present a technique called Linguistic Correlation Analysis to extract salient neurons in the model, with respect to any extrinsic property, with the goal of understanding how such knowledge is preserved within neurons. We carry out a fine-grained analysis to answer the following questions: (i) can we identify subsets of neurons in the network that learn a specific linguistic property? (ii) is a certain linguistic phenomenon in a given model localized (encoded in few individual neurons) or distributed across many neurons? (iii) how redundantly is the information preserved? (iv) how does fine-tuning pre-trained models towards downstream NLP tasks impact the learned linguistic knowledge? (v) how do models vary in learning different linguistic properties? Our data-driven, quantitative analysis illuminates interesting findings: (i) we found small subsets of neurons that can predict different linguistic tasks; (ii) neurons capturing basic lexical information, such as suffixation, are localized in the lowermost layers; (iii) neurons learning complex concepts, such as syntactic role, are predominantly found in middle and higher layers; (iv) salient linguistic neurons are relocated from higher to lower layers during transfer learning, as the network preserves the higher layers for task-specific information; (v) we found interesting differences across pre-trained models regarding how linguistic information is preserved within them; and (vi) we found that concepts exhibit similar neuron distribution across different languages in the multilingual transformer models. Our code is publicly available as part of the NeuroX toolkit (Dalvi et al., 2023).

深層NLPモデル内で学習された表現とそれが捕捉する知識を理解するための研究は数多く行われてきましたが、個々のニューロンを分析するための研究は比較的まばらです。この研究では、そのような知識がニューロン内でどのように保持されるかを理解するために、外在的特性に関してモデル内の顕著なニューロンを抽出する言語相関分析と呼ばれる手法を紹介します。以下の質問に答えるためにきめ細かい分析を行います。(i)特定の言語特性を学習するネットワーク内のニューロンのサブセットを識別できますか? (ii)特定のモデル内の特定の言語現象は局所化されていますか(少数の個々のニューロンにエンコードされています)、それとも多くのニューロンに分散されていますか? (iii)情報はどの程度冗長的に保持されていますか? (iv)事前トレーニング済みモデルを下流のNLPタスクに向けて微調整すると、学習した言語知識にどのような影響がありますか? (v)さまざまな言語特性を学習する際のモデルの違いは何ですか?データに基づく定量分析により、興味深い発見が明らかになりました。(i)さまざまな言語タスクを予測できるニューロンの小さなサブセットを発見しました。(ii)接尾辞などの基本的な語彙情報を捕捉するニューロンは、最下層に局在しています。(iii)統語的役割などの複雑な概念を学習するニューロンは、主に中間層と高層に存在します。(iv)顕著な言語ニューロンは、ネットワークがタスク固有の情報のために高層を保存するため、転移学習中に高層から低層に再配置されます。(v)事前学習済みモデル間で、言語情報がどのように保存されるかに関して興味深い違いが見つかりました。(vi)多言語トランスフォーマーモデルでは、概念がさまざまな言語間で同様のニューロン分布を示すことがわかりました。私たちのコードは、NeuroXツールキット(Dalviら、2023)の一部として公開されています。

Differentially Private Hypothesis Testing for Linear Regression
線形回帰のための差分プライベート仮説検定

In this work, we design differentially private hypothesis tests for the following problems in the multivariate linear regression model: testing a linear relationship and testing for the presence of mixtures. The majority of our hypothesis tests are based on differentially private versions of the $F$-statistic for the multivariate linear regression model framework. We also present other differentially private tests—not based on the $F$-statistic—for these problems. We show that the differentially private $F$-statistic converges to the asymptotic distribution of its non-private counterpart. As a corollary, the statistical power of the differentially private $F$-statistic converges to the statistical power of the non-private $F$-statistic. Through a suite of Monte Carlo based experiments, we show that our tests achieve desired significance levels and have a high power that approaches the power of the non-private tests as we increase sample sizes or the privacy-loss parameter. We also show when our tests outperform existing methods in the literature.

この研究では、多変量線形回帰モデルにおける線形関係の検定と混合の存在の検定という問題に対して、差分プライバシー仮説検定を設計します。仮説検定の大部分は、多変量線形回帰モデルフレームワークの$F$統計量の差分プライバシーバージョンに基づいています。また、これらの問題に対する他の差分プライバシー検定($F$統計量に基づかない検定)も示します。差分プライバシー$F$統計量は、非プライベートの対応するものの漸近分布に収束することを示します。結果として、差分プライバシー$F$統計量の統計的検出力は、非プライベート$F$統計量の統計的検出力に収束します。一連のモンテカルロベースの実験を通じて、サンプルサイズまたはプライバシー損失パラメータを増やすにつれて、検定が望ましい有意水準を達成し、非プライベート検定の検出力に近づく高い検出力を持つことを示します。また、検定が文献の既存の方法よりも優れている場合も示します。

Attribution-based Explanations that Provide Recourse Cannot be Robust
リコース(救済手段)を提供する帰属ベースの説明は頑健ではあり得ない

Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision f(x) of a machine learning system by making limited changes to its input x. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input x that is being explained, should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of x. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions, and we provide sufficient conditions for specific classes of continuous functions to be recourse sensitive. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of x, by providing an exact characterization of the functions f to which impossibility applies.

機械学習手法のさまざまなユーザーは、その目的に応じてさまざまな説明を必要とします。機械学習が社会に対して説明責任を果たせるようにするための重要な目標の1つは、影響を受けるユーザーが入力xに限定的な変更を加えることで機械学習システムの決定f(x)を変更できるようにする、実行可能なリコース(救済手段)の選択肢を得ることです。私たちはこれを、リコース感度の一般的な定義を与えることで形式化します。この定義は、決定に対するどの変更がユーザーにとって意味を持つかを記述する効用関数でインスタンス化する必要があります。この定義は、各入力特徴量に重要度の重みを割り当てるローカル帰属手法に適用されます。このようなローカル帰属は、説明対象の入力xの小さな変化が特徴量の重みに大きな変化を引き起こさないという意味で頑健であるべきだ、とよく主張されます。しかし私たちは、単一の帰属手法がリコース感度と頑健性を同時に満たすことは一般に不可能であることを形式的に証明します。したがって、これらの性質の少なくとも一方に対する反例が常に存在しなければなりません。私たちは、LIME、SHAP、Integrated Gradients、SmoothGradなど、いくつかの一般的な帰属手法に対して、このような反例を示します。この結果は、xの摂動を記述する帰属とみなせる反事実的説明にも当てはまります。さらに、例えば出力を複数の帰属からなる集合にすることを許すなど、不可能性の結果を回避しうる方法について議論し、特定のクラスの連続関数がリコース感度を持つための十分条件を与えます。最後に、ユーザーがxの単一の属性しか変更できないという制限された場合について、不可能性が成り立つ関数fを正確に特徴付けることで、不可能性の結果を強化します。

A Unified Theory of Diversity in Ensemble Learning
アンサンブル学習における多様性の統一理論

We present a theory of ensemble diversity, explaining the nature of diversity for a wide range of supervised learning scenarios. This challenge has been referred to as the “holy grail” of ensemble learning, an open research issue for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of the ensemble loss. We prove a family of exact bias-variance-diversity decompositions, for a wide range of losses in both regression and classification, e.g., squared, cross-entropy, and Poisson losses. For losses where an additive bias-variance decomposition is not available (e.g., 0/1 loss) we present an alternative approach: quantifying the effects of diversity, which turn out to be dependent on the label distribution. Overall, we argue that diversity is a measure of model fit, in precisely the same sense as bias and variance, but accounting for statistical dependencies between ensemble members. Thus, we should not be ‘maximising diversity’ as so many works aim to do—instead, we have a bias/variance/diversity trade-off to manage.

私たちは、アンサンブル多様性の理論を提示し、幅広い教師あり学習シナリオにおける多様性の性質を説明します。この課題はアンサンブル学習の「聖杯」と呼ばれ、30年以上にわたって未解決の研究課題でした。我々のフレームワークは、多様性が実際にはアンサンブル損失のバイアス-分散分解における隠れた次元であることを明らかにしています。私たちは、回帰と分類の両方における幅広い損失、例えば二乗損失、クロスエントロピー損失、ポアソン損失について、正確なバイアス-分散-多様性分解のファミリーを証明します。加法的バイアス-分散分解が利用できない損失（例えば0/1損失）については、代替アプローチを提示します。それは、ラベル分布に依存することが判明した多様性の影響を定量化するものです。全体として、私たちは、多様性はバイアスや分散とまったく同じ意味でモデル適合の尺度であり、ただしアンサンブルメンバー間の統計的依存性を考慮したものであると主張します。したがって、多くの研究が目指しているように「多様性を最大化」するべきではなく、むしろ、バイアス/分散/多様性のトレードオフを管理する必要があります。
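For squared loss, the role of diversity can be checked numerically with the classical ambiguity decomposition: the loss of the averaged prediction equals the average member loss minus the average squared spread of the members around the ensemble. The sketch below verifies this identity for one data point. It is a simpler, per-example identity than the bias-variance-diversity decompositions proved in the paper, and the function name and numbers are hypothetical.

```python
def ambiguity_terms(predictions, target):
    """Return (ensemble_loss, average_member_loss, diversity) for squared loss.

    The ensemble is the arithmetic mean of the member predictions.
    """
    m = len(predictions)
    ensemble = sum(predictions) / m
    ensemble_loss = (ensemble - target) ** 2
    average_member_loss = sum((f - target) ** 2 for f in predictions) / m
    # Diversity: average squared deviation of members from the ensemble.
    diversity = sum((f - ensemble) ** 2 for f in predictions) / m
    return ensemble_loss, average_member_loss, diversity


# Hypothetical predictions of three ensemble members for one target value.
ens, avg, div = ambiguity_terms([2.0, 3.5, 4.1], target=3.0)
# Identity being illustrated: ens == avg - div (up to rounding).
gap = abs(ens - (avg - div))
```

Because the diversity term is subtracted, more spread lowers the ensemble loss only if the average member loss does not rise by more. This is the trade-off the abstract describes, as opposed to maximising diversity for its own sake.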

Optimal Parameter-Transfer Learning by Semiparametric Model Averaging
セミパラメトリックモデル平均化による最適パラメータ伝達学習

In this article, we focus on prediction of a target model by transferring the information of source models. To be flexible, we use semiparametric additive frameworks for the target and source models. Inheriting the spirit of parameter-transfer learning, we assume that different models possibly share common knowledge across parametric components that is helpful for the target predictive task. Unlike existing parameter-transfer approaches, which need to construct auxiliary source models by parameter similarity with the target model and then adopt a regularization procedure, we propose a frequentist model averaging strategy with a $J$-fold cross-validation criterion so that auxiliary parameter information from different models can be adaptively transferred through data-driven weight assignments. The asymptotic optimality and weight convergence of our proposed method are built under some regularity conditions. Extensive numerical results demonstrate the superiority of the proposed method over competitive methods.

この記事では、ソースモデルの情報を転送することによるターゲットモデルの予測に焦点を当てています。柔軟性を持たせるために、ターゲットモデルとソースモデルにセミパラメトリック加法フレームワークを使用します。パラメータ転送学習の精神を継承し、異なるモデルがパラメトリックコンポーネント間で共通の知識を共有している可能性があり、それがターゲット予測タスクに役立つと想定しています。ターゲットモデルとのパラメータ類似性によって補助ソースモデルを構築し、次に正則化手順を採用する必要がある既存のパラメータ転送アプローチとは異なり、異なるモデルからの補助パラメータ情報をデータ駆動型の重み割り当てを通じて適応的に転送できるように、$J$フォールドクロスバリデーション基準を備えた頻度論的モデル平均化戦略を提案します。提案方法の漸近最適性と重み収束は、いくつかの正則性条件下で構築されます。広範な数値結果により、競合方法に対する提案方法の優位性が実証されています。

Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov Spaces
ソボレフ空間とベゾフ空間におけるディープ ReLU ニューラルネットワークの最適近似率

Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev spaces $W^s(L_q(\Omega))$ and Besov spaces $B^s_r(L_q(\Omega))$, with error measured in the $L_p(\Omega)$ norm. This problem is important when studying the application of neural networks in a variety of fields, including scientific computing and signal processing, and has previously been solved only when $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$ for which the corresponding Sobolev or Besov space compactly embeds into $L_p$. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.

$\Omega = [0,1]^d$を$\mathbb{R}^d$の単位立方体とします。ReLU活性化関数を持つディープニューラルネットワークが、パラメータ数の観点から、ソボレフ空間$W^s(L_q(\Omega))$とベゾフ空間$B^s_r(L_q(\Omega))$の関数を、$L_p(\Omega)$ノルムで測定された誤差でいかに効率的に近似できるかという問題を研究します。この問題は、科学計算や信号処理など、さまざまな分野でニューラルネットワークの応用を研究する際に重要であり、これまでは$p=q=\infty$の場合にのみ解決されていました。私たちの貢献は、対応するソボレフ空間またはベゾフ空間が$L_p$にコンパクトに埋め込まれるすべての$1\leq p,q\leq \infty$および$s > 0$に対する完全なソリューションを提供することです。重要な技術ツールは、スパースベクトルの最適なエンコードを実現する新しいビット抽出技術です。これにより、$p > q$の非線形領域で明確な上限を得ることができます。また、$p < \infty$の場合にVC次元に基づいて$L_p$近似下限を導出する新しい方法も提供します。結果から、非常に深いReLUネットワークは、パラメーターの数の点で従来の近似方法を大幅に上回っていることがわかりますが、これはエンコードできないパラメーターの犠牲を伴います。

MAUVE Scores for Generative Models: Theory and Practice
生成モデルのMAUVEスコア:理論と実践

Generative artificial intelligence has made significant strides, producing text indistinguishable from human prose and remarkably photorealistic images. Automatically measuring how close the generated data distribution is to the target distribution is central to diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore three approaches to statistically estimate these scores: vector quantization, non-parametric estimation, and classifier-based estimation. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics. In conclusion, we present practical recommendations for using MAUVE effectively with language and image modalities.

生成型人工知能は大きな進歩を遂げ、人間の文章と区別がつかないテキストや、驚くほど写実的な画像を生み出しています。生成されたデータ分布がターゲット分布にどれだけ近いかを自動的に測定することは、既存のモデルを診断し、より優れたモデルを開発する上で重要です。ここでは、テキストや画像の生成モデル化で遭遇するような分布のペア間の比較尺度ファミリーであるMAUVEを紹介します。これらのスコアは、生成モデル化における2種類のエラーを捉えた発散フロンティアの統計的要約です。これらのスコアを統計的に推定する3つのアプローチ、つまりベクトル量子化、ノンパラメトリック推定、分類器ベースの推定を検討します。ベクトル量子化アプローチの統計的境界を示します。経験的に、提案されたスコアをさまざまな$f$発散および統計的推定方法と組み合わせると、人間の判断と相関し、生成されたテキストの既知の特性を識別することで、人間が書いたテキストの分布と最新のニューラル言語モデルの分布のギャップを定量化できることがわかりました。視覚領域において、MAUVEは既存の指標と同等かそれ以上の精度で、生成された画像の既知の特性を識別できることを実証しました。結論として、言語および画像モダリティでMAUVEを効果的に使用するための実用的な推奨事項を示します。
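To make "statistical summaries of divergence frontiers" concrete, the sketch below computes an area-under-the-frontier score for two small discrete distributions. For each mixture R = λP + (1−λ)Q it records the point (exp(−c·KL(Q‖R)), exp(−c·KL(P‖R))) and integrates the resulting curve. The scaling constant `c`, the grid size, and the function names are assumptions made for illustration. Real use also requires estimating P and Q from samples (for example by quantizing embeddings), which this sketch skips.

```python
import math


def kl(p, q):
    """KL divergence between discrete distributions, with 0 * log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def frontier_area(p, q, c=5.0, num_points=99):
    """Area under the exponentiated divergence frontier between p and q."""
    # Limiting endpoints of the curve as lambda tends to 1 and to 0.
    points = [(0.0, 1.0), (1.0, 0.0)]
    for k in range(1, num_points + 1):
        lam = k / (num_points + 1)
        r = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
        points.append((math.exp(-c * kl(q, r)), math.exp(-c * kl(p, r))))
    # Sort by x (ties: larger y first) and integrate with the trapezoid rule.
    points.sort(key=lambda xy: (xy[0], -xy[1]))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


same = frontier_area([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
close = frontier_area([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
disjoint = frontier_area([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give an area close to 1, and distributions with disjoint support give an area close to 0. The two coordinates separate the two error types: mass the model puts where the target has little, and target mass the model fails to cover.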

Beyond Spectral Gap: The Role of the Topology in Decentralized Learning
スペクトルギャップを超えて:分散学習におけるトポロジーの役割

In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. In the decentralized setting, in which workers communicate over a sparse graph, current theory fails to capture important aspects of real-world behavior. First, the ‘spectral gap’ of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence dynamics in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.

機械学習モデルのデータ並列最適化では、ワーカーは協力してモデルの推定値を改善します。勾配の精度が高ければ高いほど、より大きな学習率を使用してより速く最適化できます。ワーカーがスパースグラフを介して通信する分散設定では、現在の理論は現実世界の挙動の重要な側面を捉えることができません。まず、通信グラフの「スペクトルギャップ」は、(深層)学習における経験的パフォーマンスを予測するものではありません。次に、現在の理論では、協力することで単独で学習する場合よりも大きな学習率が可能になることを説明できません。実際、現在の理論はより小さな学習率を規定しており、グラフが大きくなるにつれてさらに小さくなるため、無限グラフにおける収束ダイナミクスを説明できません。この論文は、スパースに接続された分散最適化の正確な全体像を描くことを目的としています。グラフトポロジーが2次のトイ問題の収束にどのように影響するかを定量化し、一般的な滑らかで(強)凸な目的関数に対する理論的結果を示します。私たちの理論は、深層学習における経験的観察と一致し、さまざまなグラフトポロジーの相対的なメリットを正確に説明します。
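For readers unfamiliar with the quantity under discussion, the sketch below runs plain gossip averaging on a ring of workers and computes the spectral gap of the corresponding mixing matrix. The uniform 1/3 weights, the ring size, and the function names are assumptions for illustration. It only shows what the spectral gap measures for simple averaging and does not reproduce the paper's analysis, which argues that this quantity alone does not predict behavior in deep learning.

```python
import math


def ring_gossip_step(values):
    """One gossip round: each worker averages itself with its two ring neighbours."""
    n = len(values)
    return [
        (values[i - 1] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def ring_spectral_gap(n):
    """Spectral gap of the ring mixing matrix with uniform weights 1/3.

    The matrix is circulant, so its eigenvalues are
    (1 + 2 * cos(2 * pi * k / n)) / 3 for k = 0, ..., n - 1.
    """
    second_largest = max(
        abs((1 + 2 * math.cos(2 * math.pi * k / n)) / 3) for k in range(1, n)
    )
    return 1.0 - second_largest


# Hypothetical scalar "models" held by 8 workers.
values = [float(i) for i in range(8)]
initial_mean = sum(values) / len(values)
for _ in range(50):
    values = ring_gossip_step(values)

spread = max(values) - min(values)
gap = ring_spectral_gap(8)
```

Each round preserves the global mean and shrinks disagreement at a rate governed by the gap, so larger rings (smaller gap) reach consensus more slowly.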

A Unified Experiment Design Approach for Cyclic and Acyclic Causal Models
巡回および非巡回因果モデルのための統一実験デザインアプローチ

We study experiment design for unique identification of the causal graph of a simple SCM, where the graph may contain cycles. The presence of cycles in the structure introduces major challenges for experiment design as, unlike acyclic graphs, learning the skeleton of causal graphs with cycles may not be possible from merely the observational distribution. Furthermore, intervening on a variable in such graphs does not necessarily lead to orienting all the edges incident to it. In this paper, we propose an experiment design approach that can learn both cyclic and acyclic graphs and hence, unifies the task of experiment design for both types of graphs. We provide a lower bound on the number of experiments required to guarantee the unique identification of the causal graph in the worst case, showing that the proposed approach is order-optimal in terms of the number of experiments up to an additive logarithmic term. Moreover, we extend our result to the setting where the size of each experiment is bounded by a constant. For this case, we show that our approach is optimal in terms of the size of the largest experiment required for uniquely identifying the causal graph in the worst case.

私たちは、サイクルを含む可能性のある単純なSCMの因果グラフを一意に識別するための実験設計について研究します。構造にサイクルが存在すると、非巡回グラフの場合とは異なり、サイクルを含む因果グラフのスケルトンを観測分布のみから学習できない可能性があるため、実験設計に大きな課題が生じます。さらに、このようなグラフの変数に介入しても、必ずしもその変数に接続するすべてのエッジの向きが決まるわけではありません。この論文では、巡回グラフと非巡回グラフの両方を学習できる実験設計アプローチを提案し、両方のタイプのグラフの実験設計タスクを統合します。最悪の場合に因果グラフの一意な識別を保証するために必要な実験数の下限を示し、提案アプローチが実験数に関して加法的な対数項を除いてオーダー最適であることを示します。さらに、各実験のサイズが定数で制限される設定に結果を拡張します。この場合、最悪のケースで因果グラフを一意に識別するために必要な最大の実験のサイズに関して、私たちのアプローチが最適であることを示します。

Limitations on approximation by deep and shallow neural networks
深層ニューラルネットワークと浅いニューラルネットワークによる近似の限界

We prove Carl’s type inequalities for the error of approximation of compact sets K by deep and shallow neural networks. This in turn gives estimates from below on how well we can approximate the functions in K when requiring the approximants to come from outputs of such networks. Our results are obtained as a byproduct of the study of the recently introduced Lipschitz widths.

私たちは、深いニューラルネットワークと浅いニューラルネットワークによるコンパクト集合Kの近似誤差に対して、カール型の不等式を証明します。これにより、近似関数がそのようなネットワークの出力であることを要求した場合に、Kの関数をどれだけうまく近似できるかについての下からの評価が得られます。私たちの結果は、最近導入されたリプシッツ幅の研究の副産物として得られます。

Group SLOPE Penalized Low-Rank Tensor Regression
グループSLOPEペナルティ付き低ランクテンソル回帰

This article aims to seek a selection and estimation procedure for a class of tensor regression problems with multivariate covariates and matrix responses, which can provide theoretical guarantees for model selection in finite samples. Considering the frontal slice sparsity and low-rankness inherited in the coefficient tensor, we formulate the regression procedure as a group SLOPE penalized low-rank tensor optimization problem based on an orthogonal decomposition, namely TgSLOPE. This procedure provably controls the newly introduced tensor group false discovery rate (TgFDR), provided that the predictor matrix is column-orthogonal. Moreover, we establish the asymptotically minimax convergence with respect to the TgSLOPE estimate risk. For efficient problem resolution, we equivalently transform the TgSLOPE problem into a difference-of-convex (DC) program with the level-coercive objective function. This allows us to solve the reformulation problem of TgSLOPE by an efficient proximal DC algorithm (DCA) with global convergence. Numerical studies conducted on synthetic data and a real human brain connection data illustrate the efficacy of the proposed TgSLOPE estimation procedure.

この記事の目的は、多変量共変量と行列応答を持つテンソル回帰問題のクラスに対して、有限サンプルでのモデル選択に理論的保証を与えられる選択および推定手順を探ることです。係数テンソルが持つフロンタルスライスのスパース性と低ランク性を考慮して、回帰手順を、直交分解に基づくグループSLOPEペナルティ付き低ランクテンソル最適化問題として定式化し、これをTgSLOPEと呼びます。この手順は、予測子行列が列直交である場合に、新しく導入されたテンソルグループ偽発見率(TgFDR)を証明可能な形で制御します。さらに、TgSLOPE推定のリスクに関して漸近的なミニマックス収束を確立します。問題を効率的に解くために、TgSLOPE問題を、レベル強圧的な目的関数を持つ凸関数の差(DC)プログラムに等価変換します。これにより、大域収束性を持つ効率的な近接DCアルゴリズム(DCA)によってTgSLOPEの再定式化問題を解くことができます。合成データと実際の人間の脳接続データに対して実施された数値研究により、提案されたTgSLOPE推定手順の有効性が示されています。

Modular Regression: Improving Linear Models by Incorporating Auxiliary Data
モジュラー回帰: 補助データの組み込みによる線形モデルの改善

This paper develops a new framework, called modular regression, to utilize auxiliary information — such as variables other than the original features or additional data sets — in the training process of linear models. At a high level, our method follows the routine: (i) decomposing the regression task into several sub-tasks, (ii) fitting the sub-task models, and (iii) using the sub-task models to provide an improved estimate for the original regression problem. This routine applies to widely-used low-dimensional (generalized) linear models and high-dimensional regularized linear regression. It also naturally extends to missing-data settings where only partial observations are available. By incorporating auxiliary information, our approach improves the estimation efficiency and prediction accuracy upon linear regression or the Lasso under a conditional independence assumption for predicting the outcome. For high-dimensional settings, we develop an extension of our procedure that is robust to violations of the conditional independence assumption, in the sense that it improves efficiency if this assumption holds and coincides with the Lasso otherwise. We demonstrate the efficacy of our methods with simulated and real data sets.

この論文では、線形モデルのトレーニングプロセスで補助情報(元の特徴以外の変数や追加のデータセットなど)を利用するための、モジュラー回帰と呼ばれる新しいフレームワークを開発しています。大まかに言うと、この方法は、(i)回帰タスクを複数のサブタスクに分解し、(ii)サブタスクモデルを適合し、(iii)サブタスクモデルを使用して元の回帰問題に対する推定値を改善するという手順に従います。この手順は、広く使用されている低次元(一般化)線形モデルと高次元の正規化線形回帰に適用されます。また、部分的な観測値しか利用できない欠損データ設定にも当然適用されます。補助情報を組み込むことで、結果を予測するための条件付き独立性仮定の下で線形回帰またはLassoの推定効率と予測精度が向上します。高次元設定では、条件付き独立性仮定に違反してもロバストな手順の拡張を開発します。つまり、この仮定が成り立つ場合は効率が向上し、そうでない場合はLassoと一致するということです。私たちは、シミュレーションと実際のデータセットを使用して、私たちの方法の有効性を実証します。

Robust High-Dimensional Low-Rank Matrix Estimation: Optimal Rate and Data-Adaptive Tuning
ロバストな高次元低ランク行列推定:最適レートとデータ適応調整

The matrix lasso, which minimizes a least-squared loss function with the nuclear-norm regularization, offers a generally applicable paradigm for high-dimensional low-rank matrix estimation, but its efficiency is adversely affected by heavy-tailed distributions. This paper introduces a robust procedure by incorporating a Wilcoxon-type rank-based loss function with the nuclear-norm penalty for a unified high-dimensional low-rank matrix estimation framework. It includes matrix regression, multivariate regression and matrix completion as special examples. This procedure enjoys several appealing features. First, it relaxes the distributional conditions on random errors from sub-exponential or sub-Gaussian to more general distributions and thus it is robust with substantial efficiency gain for heavy-tailed random errors. Second, as the gradient function of the rank-based loss function is completely pivotal, it overcomes the challenge of tuning parameter selection and substantially saves the computation time by using an easily simulated tuning parameter. Third, we theoretically establish non-asymptotic error bounds with a nearly-oracle rate for the new estimator. Numerical results indicate that the new estimator can be highly competitive among existing methods, especially for heavy-tailed or skewed errors.

核ノルム正則化を用いて最小二乗損失関数を最小化する行列Lassoは、高次元低ランク行列推定に一般的に適用可能なパラダイムを提供しますが、その効率は裾の重い分布によって悪影響を受けます。この論文では、統一された高次元低ランク行列推定フレームワークのために、Wilcoxon型のランクベース損失関数と核ノルムペナルティを組み合わせたロバストな手順を導入します。これには、特別な例として、行列回帰、多変量回帰、および行列補完が含まれます。この手順には、いくつかの魅力的な特徴があります。第一に、ランダム誤差に関する分布条件を、劣指数分布または劣ガウス分布からより一般的な分布へと緩和するため、裾の重いランダム誤差に対してロバストであり、大幅な効率向上が得られます。第二に、ランクベース損失関数の勾配関数が完全にピボタル(誤差分布に依存しない)であるため、チューニングパラメーターの選択という課題を克服し、簡単にシミュレートできるチューニングパラメーターを使用することで計算時間を大幅に節約します。第三に、新しい推定量について、ほぼオラクルに近いレートを持つ非漸近的な誤差限界を理論的に確立します。数値結果は、新しい推定量が、特に裾の重い誤差や歪んだ誤差に対して、既存の方法の中で高い競争力を持ち得ることを示しています。

Instance-Dependent Generalization Bounds via Optimal Transport
最適輸送によるインスタンス依存の汎化境界

Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.

既存の汎化境界では、現代のニューラルネットワークの汎化を推進する重要な要素を説明できません。このような境界は多くの場合、すべてのパラメーターに対して一様に成り立つため、過剰パラメーター化の影響を受け、初期化と確率的勾配降下法の強い帰納的バイアスを考慮できません。代替案として、汎化問題の新しい最適輸送解釈を提案します。これにより、学習された予測関数のデータ空間における局所Lipschitz正則性に依存する、インスタンス依存の汎化境界を導出できます。したがって、この境界はモデルのパラメーター化に依存せず、トレーニングサンプルの数がパラメーターの数よりはるかに少ない場合にうまく機能します。わずかな変更を加えると、このアプローチにより、低次元多様体上のデータに対する加速された収束率と、分布シフト下での保証が得られます。ニューラルネットワークの汎化境界を経験的に分析し、境界値が意味を持ち、トレーニング中に一般的な正則化方法の効果を捉えていることを示します。

Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries
離散スケッチデータを用いた共形度数推定と個別クエリのカバレッジ

This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and variations thereof. After explaining how to achieve marginal coverage for exchangeable random queries, we extend our solution to provide stronger inferences that can account for the discreteness of the data and for heterogeneous query frequencies, increasing also robustness to possible distribution shifts. These results are facilitated by a novel conformal calibration technique that guarantees valid coverage for a large fraction of distinct random queries. Finally, we show our methods have improved empirical performance compared to existing frequentist and Bayesian alternatives in simulations as well as in examples of text and SARS-CoV-2 DNA data.

この論文では、メモリ使用量の少ないスケッチに基づいて、非常に大規模な離散データセット内でクエリされたオブジェクトの頻度の信頼区間を構築するコンフォーマル推論法を開発します。このアプローチでは、データ分布の知識は必要なく、有名なcount-minスケッチ、count-sketch、およびそれらのバリエーションを含むがこれらに限定されない、任意のスケッチアルゴリズムと組み合わせることができます。交換可能なランダムクエリに対する周辺カバレッジを達成する方法を説明した後、データの離散性と異質なクエリ頻度を考慮できるより強力な推論を提供するためにソリューションを拡張し、起こり得る分布シフトに対する堅牢性も向上させます。これらの結果は、異なるランダムクエリの大部分に対して有効なカバレッジを保証する新しいコンフォーマルキャリブレーション手法によって可能になります。最後に、シミュレーションならびにテキストおよびSARS-CoV-2 DNAデータの例において、既存の頻度主義およびベイズの代替手法と比較して、提案手法の実証的なパフォーマンスが向上していることを示します。
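
For readers unfamiliar with the count-min sketch named in this abstract, a minimal sketch follows. It shows only the standard data structure (several hashed counter rows, with the minimum taken at query time), not the conformal calibration proposed in the paper. The class name, the default `width` and `depth`, and the use of SHA-256 as the hash are illustrative assumptions.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency sketch; estimates never fall below the true count."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One deterministic hash per row, obtained by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # Hash collisions can only inflate a counter, so the row-wise minimum is an upper bound
        # on the true frequency that is as tight as this sketch allows.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

The one-sided error of `query` is what makes a confidence interval around the sketched count a natural object to study: the deterministic upper bound is free, and the statistical work goes into the lower end.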

Implicit Regularization and Entrywise Convergence of Riemannian Optimization for Low Tucker-Rank Tensor Completion
低タッカーランクテンソル補完のためのリーマン最適化の暗黙的正則化とエントリごとの収束

This paper is concerned with the low Tucker-rank tensor completion problem, which is about reconstructing a tensor $\mathcal{T}\in\mathbb{R}^{n\times n\times n}$ of low multilinear rank from partially observed entries. Riemannian optimization algorithms are a class of efficient methods for this problem, but the theoretical convergence analysis is still lacking. In this manuscript, we establish the entrywise convergence of the vanilla Riemannian gradient method for low Tucker-rank tensor completion under the nearly optimal sampling complexity $O(n^{3/2})$. Meanwhile, the implicit regularization phenomenon of the algorithm has also been revealed. As far as we know, this is the first work that has shown the entrywise convergence and implicit regularization property of a non-convex method for low Tucker-rank tensor completion. The analysis relies on the leave-one-out technique, and some of the technical results developed in the paper might be of broader interest in investigating the properties of other non-convex methods for this problem.

この論文では、低タッカーランクテンソル補完問題、つまり、部分的に観測されたエントリから低多重線形ランクのテンソル$\mathcal{T}\in\mathbb{R}^{n\times n\times n}$を再構築する問題を扱っています。リーマン最適化アルゴリズムは、この問題に対する効率的な方法の一種ですが、理論的な収束分析はまだ不足しています。この論文では、ほぼ最適なサンプリング複雑度$O(n^{3/2})$の下で、低タッカーランクテンソル補完のためのバニラリーマン勾配法のエントリごとの収束を確立します。同時に、アルゴリズムの暗黙的な正則化現象も明らかにされました。私たちの知る限り、これは、低タッカーランクテンソル補完のための非凸法のエントリごとの収束と暗黙的な正則化特性を示した最初の研究です。この分析は、Leave-One-Out手法に依存しており、この論文で開発された技術的結果の一部は、この問題に対する他の非凸手法の特性を調査する際に、より広範な関心を引く可能性があります。

Linear Partial Monitoring for Sequential Decision Making: Algorithms, Regret Bounds and Applications
逐次的意思決定のための線形部分監視:アルゴリズム、後悔限界および応用

Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.

部分監視(パーシャルモニタリング)は、グラフ構造バンディットやデュエリングバンディット、動的価格設定、トランスダクティブフィードバックモデルなど、豊富な応用を持つ逐次的意思決定のための表現力豊かなフレームワークです。私たちは、標準的な線形バンディット設定を自然に一般化する部分監視の線形定式化に関する最近の結果を調査し、拡張します。主な結果は、単一のアルゴリズムである情報指向サンプリング(IDS)が、すべての有限アクションゲームで(ほぼ)最悪ケースのレートの意味で最適であるということです。確率的部分監視のシンプルで統一された分析を提示し、モデルをさらにコンテキスト付きおよびカーネル化された設定に拡張します。

Radial Basis Approximation of Tensor Fields on Manifolds: From Operator Estimation to Manifold Learning
多様体上のテンソル場の動径基底近似:演算子推定から多様体学習へ

In this paper, we study the Radial Basis Function (RBF) approximation to differential operators on smooth tensor fields defined on closed Riemannian submanifolds of Euclidean space, identified by randomly sampled point cloud data. The formulation in this paper leverages a fundamental fact that the covariant derivative on a submanifold is the projection of the directional derivative in the ambient Euclidean space onto the tangent space of the submanifold. To differentiate a test function (or vector field) on the submanifold with respect to the Euclidean metric, the RBF interpolation is applied to extend the function (or vector field) in the ambient Euclidean space. When the manifolds are unknown, we develop an improved second-order local SVD technique for estimating local tangent spaces on the manifold. When the classical pointwise non-symmetric RBF formulation is used to solve Laplacian eigenvalue problems, we found that while accurate estimation of the leading spectra can be obtained with large enough data, such an approximation often produces irrelevant complex-valued spectra (or pollution) as the true spectra are real-valued and positive. To avoid such an issue, we introduce a symmetric RBF discrete approximation of the Laplacians induced by a weak formulation on appropriate Hilbert spaces. Unlike the non-symmetric approximation, this formulation guarantees non-negative real-valued spectra and the orthogonality of the eigenvectors. Theoretically, we establish the convergence of the eigenpairs of both the Laplace-Beltrami operator and Bochner Laplacian for the symmetric formulation in the limit of large data with convergence rates. Numerically, we provide supporting examples for approximations of the Laplace-Beltrami operator and various vector Laplacians, including the Bochner, Hodge, and Lichnerowicz Laplacians.

この論文では、ランダムにサンプリングされた点群データによって識別されるユークリッド空間の閉じたリーマン部分多様体上に定義された滑らかなテンソル場上の微分作用素に対するラジアル基底関数(RBF)近似について検討します。本稿の定式化は、部分多様体上の共変微分が、周囲ユークリッド空間の方向微分を部分多様体の接空間に投影したものであるという基本的な事実を利用しています。部分多様体上のテスト関数(またはベクトル場)をユークリッド計量に関して微分するには、RBF補間を適用して関数(またはベクトル場)を周囲ユークリッド空間に拡張します。多様体が未知の場合は、多様体上の局所接空間を推定するための改良された2次局所SVD手法を開発します。古典的な点ごとの非対称RBF定式化を使用してラプラシアン固有値問題を解く場合、十分な大きさのデータがあれば主要なスペクトルの正確な推定が得られるものの、真のスペクトルは実数値で正であるため、このような近似では無関係な複素数値スペクトル(または汚染)が生成されることが多いことがわかりました。このような問題を回避するために、適切なヒルベルト空間上の弱い定式化によって誘導されるラプラシアンの対称RBF離散近似を導入します。非対称近似とは異なり、この定式化では、非負の実数値スペクトルと固有ベクトルの直交性が保証されます。理論的には、収束率を持つ大規模データの限界において、対称定式化のラプラス-ベルトラミ演算子とボッホナーラプラシアンの両方の固有ペアの収束を確立します。数値的には、ラプラス-ベルトラミ演算子と、ボクナー、ホッジ、リヒネロヴィッチのラプラシアンを含むさまざまなベクトルラプラシアンの近似値のサポート例を提供します。

Large data limit of the MBO scheme for data clustering: convergence of the dynamics
データクラスタリングのためのMBOスキームの大規模データ極限:ダイナミクスの収束

We prove that the dynamics of the MBO scheme for data clustering converge to a viscosity solution to mean curvature flow. The main ingredients are (i) a new abstract convergence result based on quantitative estimates for heat operators and (ii) the derivation of these estimates in the setting of random geometric graphs. To implement the scheme in practice, two important parameters are the number of eigenvalues for computing the heat operator and the step size of the scheme. The results of the current paper give a theoretical justification for the choice of these parameters in relation to sample size and interaction width.

私たちは、データクラスタリングのMBOスキームのダイナミクスが、平均曲率流の粘性解に収束することを証明します。主な要素は、(i)熱演算子の定量的推定値に基づく新しい抽象的な収束結果と、(ii)ランダムな幾何学的グラフの設定におけるこれらの推定値の導出です。このスキームを実際に実装する際には、熱演算子を計算するための固有値の数とスキームのステップサイズという2つのパラメーターが重要になります。本論文の結果は、サンプルサイズと相互作用幅に関連してこれらのパラメーターを選択することを理論的に正当化するものです。
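
The two parameters highlighted in this abstract (number of eigenvalues and step size) appear directly in a basic implementation of one MBO step on a graph: diffuse the current labels with a spectrally truncated heat operator, then threshold. The following is a minimal sketch under the assumption of an unnormalized graph Laplacian `L` and a two-class 0/1 labeling; the function name and the 0.5 threshold are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def mbo_step(L, u, tau, k):
    """One diffuse-then-threshold step using the k smallest Laplacian eigenpairs."""
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    V = eigvecs[:, :k]
    lam = eigvals[:k]
    # Truncated heat operator exp(-tau * L) applied to u in the spectral basis.
    diffused = V @ (np.exp(-tau * lam) * (V.T @ u))
    # Thresholding turns the smoothed values back into a hard 0/1 labeling.
    return (diffused > 0.5).astype(float)
```

On a graph made of two disconnected triangles, starting from a labeling that marks only two of the three nodes in the first triangle, a single step with `tau=1.0` labels the whole first triangle 1 and leaves the second at 0.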

Accelerated Primal-Dual Mirror Dynamics for Centralized and Distributed Constrained Convex Optimization Problems
集中型および分散型制約付き凸最適化問題のための加速主双対ミラーダイナミクス

This paper investigates two accelerated primal-dual mirror dynamical approaches for smooth and nonsmooth convex optimization problems with affine and closed, convex set constraints. In the smooth case, an accelerated primal-dual mirror dynamical approach (APDMD) based on accelerated mirror descent and primal-dual framework is proposed and accelerated convergence properties of primal-dual gap, feasibility measure and the objective function value along with trajectories of APDMD are derived by the Lyapunov analysis method. Then, we extend APDMD into two distributed dynamical approaches to deal with two types of distributed smooth optimization problems, i.e., distributed constrained consensus problem (DCCP) and distributed extended monotropic optimization (DEMO) with accelerated convergence guarantees. Moreover, in the nonsmooth case, we propose a smoothing accelerated primal-dual mirror dynamical approach (SAPDMD) with the help of smoothing approximation technique and the above APDMD. We further also prove that primal-dual gap, objective function value and feasibility measure along with trajectories of SAPDMD have the same accelerated convergence properties as APDMD by choosing the appropriate smooth approximation parameters. Later, we propose two smoothing accelerated distributed dynamical approaches to deal with nonsmooth DEMO and DCCP to obtain accelerated and efficient solutions. Finally, numerical and comparative experiments are given to demonstrate the effectiveness and superiority of the proposed accelerated mirror dynamical approaches.

この論文では、アフィンおよび閉じた凸集合制約を持つ滑らかな凸最適化問題と滑らかでない凸最適化問題に対する2つの加速主双対ミラー動的アプローチについて検討します。滑らかなケースでは、加速ミラー降下法と主双対フレームワークに基づく加速主双対ミラー動的アプローチ(APDMD)が提案され、主双対ギャップ、実現可能性尺度、およびAPDMDの軌跡に沿った目的関数値の加速収束特性が、リアプノフ解析法によって導出されます。次に、APDMDを2つの分散動的アプローチに拡張し、加速収束保証付きの2種類の分散滑らかな最適化問題、つまり分散制約付きコンセンサス問題(DCCP)と分散拡張単向性最適化(DEMO)を処理します。さらに、滑らかでないケースでは、平滑化近似手法と上記のAPDMDの助けを借りて、平滑化加速主双対ミラー動的アプローチ(SAPDMD)を提案します。さらに、適切な平滑化近似パラメータを選択することで、SAPDMDの軌跡に沿った主双対ギャップ、目的関数値、実現可能性尺度がAPDMDと同じ加速収束特性を持つことを証明します。その後、滑らかでないDEMOとDCCPに対処して加速された効率的なソリューションを得るための2つの平滑化加速分散動的アプローチを提案します。最後に、提案された加速ミラー動的アプローチの有効性と優位性を示す数値実験と比較実験を示します。

The Geometry and Calculus of Losses
損失の幾何学と計算

Statistical decision problems lie at the heart of statistical machine learning. The simplest problems are multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. It enables the development of a fundamental relationship between losses and (anti)-norms that appears to have not been noticed before. Second, it enables the development of a calculus of losses induced by the calculus of convex sets which allows the interpolation between different losses, and thus is a potential useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend, existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of “polar” loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk’s aggregating algorithm.

統計的決定問題は、統計的機械学習の核心です。最も単純な問題は、多クラス分類とクラス確率推定です。それらの定義の中心となるのは、ソリューションの品質を評価する手段である損失関数の選択です。この論文では、特定の構造を持つ凸集合を基本要素とする新しい観点から、このような問題の損失関数の理論を体系的に展開します。損失関数は、凸集合のサポート関数のサブグラディエントとして定義されます。したがって、自動的に適切です(確率推定用に調整されています)。この視点は、3つの新しい機会を提供します。損失と(反)ノルムの間の基本的な関係を開発できますが、これはこれまで注目されていなかったようです。2番目に、凸集合の計算によって誘導される損失の計算を開発できます。これにより、異なる損失間の補間が可能になり、特定の問題に合わせて損失を調整するための有用な設計ツールになる可能性があります。これを行うことで、凸集合のM和に関する既存の結果を基にして、大幅に拡張します。第三に、この観点は、損失を定義する凸集合の極双対から導出され、Vovkの集約アルゴリズムの自然な普遍的置換関数を形成する「極」損失関数の自然な理論につながります。

A Bayesian Bradley-Terry model to compare multiple ML algorithms on multiple data sets
複数のデータセットで複数の ML アルゴリズムを比較するためのベイジアン Bradley-Terry モデル

This paper presents a Bayesian model, called the Bayesian Bradley-Terry (BBT) model, for comparing multiple algorithms on multiple data sets based on any metric. The model is an extension of the Bradley-Terry model, which tracks the number of wins each algorithm has on different data sets. Unlike frequentist methods such as Demsar tests on mean rank or multiple pairwise Wilcoxon tests, the Bayesian approach provides a more nuanced understanding of the algorithms’ performance and allows for the definition of the “region of practical equivalence” (ROPE) for two algorithms. Additionally, the paper introduces the concept of “local ROPE,” which assesses the significance of the difference in mean measure between two algorithms using effect sizes, and can be applied in frequentist approaches as well. Both an R package and a Python program implementing the BBT are available for use.

この論文では、任意のメトリックに基づいて複数のデータセット上で複数のアルゴリズムを比較するための、ベイジアン・ブラッドリー・テリー(BBT)モデルと呼ばれるベイジアンモデルを紹介します。このモデルは、各アルゴリズムが異なるデータセットで獲得した勝利数を追跡するBradley-Terryモデルの拡張です。平均ランクのデムサー検定や多重ペアワイズウィルコクソン検定などの頻度論的手法とは異なり、ベイズアプローチでは、アルゴリズムのパフォーマンスをよりきめ細かく理解でき、2つのアルゴリズムの「実用等価領域」(ROPE)を定義できます。さらに、この論文では、効果サイズを使用して2つのアルゴリズム間の平均測定値の差の有意性を評価する「ローカルROPE」の概念を紹介しており、これは頻度論的アプローチにも適用できます。BBTを実装したRパッケージとPythonプログラムの両方が利用可能です。
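
The classical Bradley-Terry model that BBT extends assigns each algorithm a positive strength and models the probability that algorithm i beats algorithm j as p_i / (p_i + p_j). The sketch below fits these strengths by maximum likelihood with the standard minorize-maximize update. It is the plain frequentist model, not the Bayesian version from the paper, and the function name and the `wins` matrix layout are assumptions made for illustration.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of data sets on which algorithm i beat algorithm j."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T                  # comparisons between each pair
    p = np.full(n, 1.0 / n)                # start from equal strengths
    for _ in range(n_iter):
        # MM update: total wins divided by sum_j games_ij / (p_i + p_j).
        denom = games / (p[:, None] + p[None, :])
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p /= p.sum()                       # strengths are identifiable only up to scale
    return p

def win_probability(p, i, j):
    """Model probability that algorithm i beats algorithm j."""
    return p[i] / (p[i] + p[j])
```

With two algorithms where the first wins 3 of 4 comparisons, the fitted win probability of the first over the second is 0.75, matching the observed win rate.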

Topological Hidden Markov Models
トポロジカル隠れマルコフモデル

The Hidden Markov Model is a classic modelling tool with a wide swath of applications. Its inception considered observations restricted to a finite alphabet, but it was quickly extended to multivariate continuous distributions. In this article, we further extend the Hidden Markov Model from mixtures of normal distributions in $d$-dimensional Euclidean space to general Gaussian measure mixtures in locally convex topological spaces, and hence, we christen this method the Topological Hidden Markov Model. The main innovation is the use of the Onsager-Machlup functional as a proxy for the probability density function in infinite dimensional spaces. This allows for choice of a Cameron-Martin space suitable for a given application. We demonstrate the versatility of this methodology by applying it to simulated diffusion processes such as Brownian and fractional Brownian sample paths as well as the Ornstein-Uhlenbeck process. Our methodology is applied to the identification of sleep states from overnight polysomnography time series data with the aim of diagnosing Obstructive Sleep Apnea in pediatric patients. It is also applied to a series of annual cumulative snowfall curves from 1940 to 1990 in the city of Edmonton, Alberta.

隠れマルコフモデルは、幅広い用途を持つ古典的なモデリングツールです。当初は観測値は有限のアルファベットに限定されていましたが、すぐに多変量連続分布に拡張されました。この記事では、隠れマルコフモデルを、d次元ユークリッド空間の正規分布の混合から局所凸位相空間の一般的なガウス測度の混合にさらに拡張し、この方法を位相隠れマルコフモデルと名付けました。主な革新は、無限次元空間の確率密度関数のプロキシとしてOnsager-Machlup関数を使用することです。これにより、特定のアプリケーションに適したCameron-Martin空間を選択できます。この方法論の汎用性を示すために、ブラウン運動や分数ブラウン運動のサンプルパス、およびオルンシュタイン-ウーレンベック過程などのシミュレートされた拡散過程に適用します。私たちの方法論は、小児患者の閉塞性睡眠時無呼吸症の診断を目的として、夜間の睡眠ポリグラフの時系列データから睡眠状態を識別するために適用されています。また、アルバータ州エドモントン市における1940年から1990年までの年間累積降雪量曲線にも適用されています。

Compression, Generalization and Learning
圧縮、一般化、学習

A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the condition that one new observation makes the compressed set change is interpreted that this observation brings in extra information and, in learning theory, this corresponds to misclassification, or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control on the probability of change of compression (which maps into the statistical “risk” in learning applications). Under suitable conditions, the cardinality of the compressed set is shown to be a consistent estimator of the probability of change of compression (without any upper limit on the size of the compressed set); moreover, unprecedentedly tight finite-sample bounds to evaluate the probability of change of compression are obtained under a generally applicable condition of preference. All results are usable in a fully agnostic setup, i.e., without requiring any a priori knowledge on the probability distribution of the observations. Not only these results offer a valid support to develop trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning.

圧縮関数は、情報内容を保持しながら、観測セットを縮小したサイズのサブセットにスリム化するマップです。複数のアプリケーションでは、1つの新しい観測によって圧縮セットが変更されるという条件は、この観測によって追加の情報がもたらされると解釈され、学習理論では、これは誤分類または誤予測に相当します。この論文では、圧縮の変更の確率(学習アプリケーションにおける統計的「リスク」にマップされます)を制御できるようにする新しい理論の基礎を築きます。適切な条件下では、圧縮セットの基数は、圧縮の変更の確率の一貫した推定値であることが示されています(圧縮セットのサイズに上限はありません)。さらに、一般的に適用可能な優先条件下で、圧縮の変更の確率を評価するための前例のないほど厳しい有限サンプル境界が得られます。すべての結果は、完全に不可知論的な設定で使用できます。つまり、観測の確率分布に関する事前の知識は必要ありません。これらの結果は、観察主導の方法論に対する信頼を構築するための有効なサポートを提供するだけでなく、ハイパーパラメータ調整のツールとしての学習技術においても基本的な役割を果たします。

Community Recovery in the Geometric Block Model
幾何学的ブロックモデルにおけるコミュニティ回復

To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model builds on the random geometric graphs (Gilbert, 1961), one of the basic models of random graphs for spatial networks, in the same way that the well-studied stochastic block model builds on the Erdos-Renyi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancements in community detection. To analyze the geometric block model, we first provide new connectivity results for random annulus graphs which are generalizations of random geometric graphs. The connectivity properties of geometric graphs have been studied since their introduction, and analyzing them has been more difficult than their Erdos-Renyi counterparts, due to correlated edge formation. We then use the connectivity results of random annulus graphs to provide necessary and sufficient conditions for efficient recovery of communities for the geometric block model. We show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. For this, we consider the following two regimes of graph density. In the regime where the average degree of the graph grows logarithmically with the number of vertices, we show that our algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from being optimum for the stochastic block model in the logarithmic degree regime. We simulate our results on both real and synthetic datasets to show superior performance of both the new model as well as our algorithm.

多くのコミュニティ検出問題に固有の幾何学的特徴を捉えるために、我々は幾何学的ブロックモデルと呼ぶコミュニティの新しいランダムグラフモデルの使用を提案します。幾何学的ブロックモデルは、よく研究されている確率的ブロックモデルがエルデシュ・レニランダムグラフに基づいているのと同様に、空間ネットワークのランダムグラフの基本モデルの1つであるランダム幾何学的グラフ(Gilbert、1961)に基づいています。これはまた、コミュニティ検出における最近の理論的および実践的進歩に触発されたランダムコミュニティモデルの自然な拡張でもあります。幾何学的ブロックモデルを分析するために、我々はまず、ランダム幾何学的グラフの一般化であるランダム環状グラフの新しい接続結果を提供します。幾何学的グラフの接続特性は導入以来研究されてきたが、相関エッジ形成のため、それらの分析はエルデシュ・レニの対応するグラフよりも困難であった。次に、ランダム環状グラフの接続結果を使用して、幾何学的ブロックモデルのコミュニティの効率的な回復に必要かつ十分な条件を提供します。幾何学的ブロックモデルでコミュニティを検出するための単純な三角形カウントアルゴリズムがほぼ最適であることを示します。このために、次の2つのグラフ密度の状態を考慮します。グラフの平均次数が頂点の数とともに対数的に増加する状態では、理論的にも実用的にも、アルゴリズムのパフォーマンスが非常に優れていることを示します。対照的に、三角形カウントアルゴリズムは、対数次数状態における確率的ブロックモデルには最適とはほど遠いものです。実際のデータセットと合成データセットの両方で結果をシミュレートし、新しいモデルとアルゴリズムの両方の優れたパフォーマンスを示します。
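
The idea behind triangle counting can be illustrated as follows: in a geometric graph, two adjacent vertices from the same community tend to share many neighbors, so the number of triangles through an edge is informative about whether it is an intra-community edge. The sketch below is a simplified illustration, assuming a single hypothetical `threshold` on the common-neighbor count; the decision rule and thresholds used in the paper are not stated in the abstract and may differ.

```python
from collections import defaultdict

def triangle_filter_communities(edges, threshold):
    """Keep edges lying on at least `threshold` triangles, then return connected components."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    kept = defaultdict(set)
    for u, v in edges:
        # Triangles through edge (u, v) equal the number of common neighbors of u and v.
        if len(nbrs[u] & nbrs[v]) >= threshold:
            kept[u].add(v)
            kept[v].add(u)

    seen, communities = set(), []
    for start in sorted(nbrs):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first search over the kept edges
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            component.add(x)
            stack.extend(kept[x] - seen)
        communities.append(component)
    return communities
```

On two triangles joined by a single bridge edge, the bridge lies on no triangle, so a threshold of 1 removes it and the two triangles are returned as separate communities.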

The Art of BART: Minimax Optimality over Nonhomogeneous Smoothness in High Dimension
BARTの技術:高次元における不均質な平滑性に対するミニマックス最適性

Many asymptotically minimax procedures for function estimation often rely on somewhat arbitrary and restrictive assumptions such as isotropy or spatial homogeneity. This work enhances the theoretical understanding of Bayesian additive regression trees under substantially relaxed smoothness assumptions. We provide a comprehensive study of asymptotic optimality and posterior contraction of Bayesian forests when the regression function has anisotropic smoothness that possibly varies over the function domain. The regression function can also be possibly discontinuous. We introduce a new class of sparse piecewise heterogeneous anisotropic Holder functions and derive their minimax lower bound of estimation in high-dimensional scenarios under the $L_2$-loss. We then find that the Bayesian tree priors, coupled with a Dirichlet subset selection prior for sparse estimation in high-dimensional scenarios, adapt to unknown heterogeneous smoothness, discontinuity, and sparsity. These results show that Bayesian forests are uniquely suited for more general estimation problems that would render other default machine learning tools, such as Gaussian processes, suboptimal. Our numerical study shows that Bayesian forests often outperform other competitors such as random forests and deep neural networks, which are believed to work well for discontinuous or complicated smooth functions. Beyond nonparametric regression, we also examined posterior contraction of Bayesian forests for density estimation and binary classification using the technique developed in this study.

関数推定のための漸近的ミニマックス手順の多くは、等方性や空間的均一性などのいくぶん恣意的で制限的な仮定に依存していることが多い。本研究は、大幅に緩和された平滑性仮定の下でのベイズ加法回帰ツリーの理論的理解を深めるものです。回帰関数が関数ドメイン上で変化する可能性のある異方性平滑性を持つ場合のベイズ森林の漸近最適性と事後収縮の包括的な研究を提供します。回帰関数は不連続である可能性もあります。私たちは、新しいクラスのスパースで区分的に異質な異方性ホルダー関数を導入し、高次元シナリオでの$L_2$損失の下での推定のミニマックス下限を導出します。次に、ベイズツリー事前分布を、高次元シナリオでのスパース推定のためのディリクレサブセット選択事前分布と組み合わせると、未知の異質な平滑性、不連続性、およびスパース性に適応できることを発見します。これらの結果は、ベイジアンフォレストが、ガウス過程などの他のデフォルトの機械学習ツールでは最適ではない、より一般的な推定問題に特に適していることを示しています。数値研究では、ベイジアンフォレストが、不連続または複雑で滑らかな関数に適していると考えられているランダムフォレストやディープニューラルネットワークなどの他の競合ツールよりも優れていることがよくあります。ノンパラメトリック回帰以外にも、この研究で開発された手法を使用して、密度推定とバイナリ分類のためのベイジアンフォレストの事後収縮も調べました。

Finite-time Koopman Identifier: A Unified Batch-online Learning Framework for Joint Learning of Koopman Structure and Parameters
有限時間 Koopman 識別子: Koopman 構造とパラメータの共同学習のための統合バッチオンライン学習フレームワーク

In this paper, a unified batch-online learning approach is introduced to learn a linear representation of nonlinear system dynamics using the Koopman operator. The presented system modeling approach leverages a novel incremental Koopman-based update law that regains a mini-collection of samples stored in a memory to minimize not only the instantaneous Koopman operator’s identification errors but also the identification errors for the collection of retrieved samples. Discontinuous modifications of gradient flows are presented for the online update law to assure finite-time convergence under easy-to-verify conditions defined on the batch of data. Therefore, this unified online-batch framework allows joint sample- and time-domain analysis to converge the Koopman operator’s parameters. More specifically, it is shown that if the collected mini-batch of samples guarantees a rank condition, then finite-time guarantee in the time domain can be certified, and the settling time depends on the quality of collected samples being reused in the update law. Moreover, the efficiency of the proposed Koopman-based update law is further analyzed by showing that the identification regret in continuous time grows sub-linearly with time. Furthermore, to avoid learning corrupted dynamics due to the selection of an inappropriate set of Koopman observables, a higher-layer meta-learner employs a discrete Bayesian optimization algorithm to obtain the best library of observable functions for the operator. Since finite-time convergence of the Koopman model for each set of observables is guaranteed under a rank condition on stored data, the fitness of each set of observables can be obtained based on the identification error on the stored samples in the proposed framework and even without implementing any controller based on the learned system. Finally, to confirm the effectiveness of the proposed scheme, two simulation examples are presented.

この論文では、クープマン演算子を使用して非線形システムダイナミクスの線形表現を学習するための、統合バッチオンライン学習アプローチを紹介します。提示されたシステムモデリングアプローチは、メモリに保存されたサンプルのミニコレクションを回復する新しい増分クープマンベースの更新法則を活用して、瞬間的なクープマン演算子の識別エラーだけでなく、取得されたサンプルのコレクションの識別エラーも最小化します。データのバッチで定義された検証しやすい条件下で有限時間の収束を保証するために、オンライン更新法則の勾配フローの不連続な変更が提示されます。したがって、この統合オンラインバッチフレームワークにより、サンプル領域と時間領域の共同分析により、クープマン演算子のパラメータを収束させることができます。より具体的には、収集されたサンプルのミニバッチがランク条件を保証する場合、時間領域での有限時間の保証を証明でき、整定時間は更新法則で再利用される収集サンプルの品質に依存することが示されています。さらに、提案されたKoopmanベースの更新法則の効率は、連続時間における識別後悔が時間とともに線形以下で増加することを示すことによってさらに分析されます。さらに、Koopman観測可能量の不適切なセットの選択による破損したダイナミクスの学習を回避するために、上位層のメタ学習器は離散ベイズ最適化アルゴリズムを使用して、演算子の観測可能関数の最適なライブラリを取得します。各観測可能量セットに対するKoopmanモデルの有限時間収束は、格納されたデータのランク条件の下で保証されるため、提案されたフレームワークでは、学習されたシステムに基づくコントローラーを実装しなくても、格納されたサンプルの識別エラーに基づいて各観測可能量セットの適合性を取得できます。最後に、提案された方式の有効性を確認するために、2つのシミュレーション例を示します。
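
As background for this abstract, a linear Koopman representation can be fitted from stored snapshot pairs by ordinary batch least squares over a chosen library of observables. The sketch below shows only that baseline step. It does not implement the paper's finite-time online update law or its Bayesian search over observable libraries, and the function name and data layout (samples as columns) are assumptions.

```python
import numpy as np

def fit_koopman(X, Y, observables):
    """Least-squares K with observables(y_t) ~= K @ observables(x_t).

    X and Y hold one sample per column, and Y[:, t] is the successor state of X[:, t].
    """
    PsiX = np.column_stack([observables(x) for x in X.T])
    PsiY = np.column_stack([observables(y) for y in Y.T])
    # Solve PsiX.T @ K.T ~= PsiY.T in the least-squares sense.
    K_T, *_ = np.linalg.lstsq(PsiX.T, PsiY.T, rcond=None)
    return K_T.T

def has_full_rank(X, observables):
    """Checks that the lifted stored samples span the observable space."""
    PsiX = np.column_stack([observables(x) for x in X.T])
    return np.linalg.matrix_rank(PsiX) == PsiX.shape[0]
```

The rank check mirrors, in a simplified batch form, the kind of condition on stored data that the abstract ties to finite-time convergence: without it the least-squares solution is not unique.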

T-Cal: An Optimal Test for the Calibration of Predictive Models
T-Cal:予測モデルのキャリブレーションに最適なテスト

The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are Holder continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which—combined with classical tests for discrete-valued predictors—can be used to test the calibration of virtually any probabilistic classification method.

機械学習手法の予測精度は着実に向上していますが、その不確実性予測のキャリブレーションは大きな課題となっています。多くの研究は、適切にキャリブレーションされた予測モデルを取得することに焦点を当てていますが、モデルキャリブレーションを確実に評価することについてはあまり知られていません。このため、キャリブレーションを改善するためのアルゴリズムが実際に効果を発揮する場合と、その改善が有限データセットのランダムノイズによる単なるアーティファクトである場合を判断する能力が制限されます。この研究では、有限の検証データセットを使用して予測モデルの誤ったキャリブレーションを検出することを仮説検定問題として検討します。帰無仮説は予測モデルがキャリブレーションされていることであり、対立仮説はキャリブレーションからの偏差が十分に大きいことです。誤ったキャリブレーションの検出は、クラスの条件付き確率が予測の十分に滑らかな関数である場合にのみ可能であることがわかりました。条件付きクラス確率がHolder連続である場合、$\ell_2$期待キャリブレーション誤差(ECE)のバイアス除去プラグイン推定量に基づくキャリブレーションのミニマックス最適検定であるT-Calを提案します。さらに、未知の滑らかさに適応する適応型T-Calを提案します。いくつかの一般的なディープニューラルネットアーキテクチャやいくつかの標準的な事後キャリブレーション方法を含む、幅広い実験で理論的な発見を検証します。T-Calは実用的な汎用ツールであり、離散値予測子の従来のテストと組み合わせることで、事実上あらゆる確率的分類方法のキャリブレーションをテストするために使用できます。
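
To make the quantity under test concrete, the following sketch computes a simple binned plug-in estimate of the squared ℓ2 calibration error for a binary classifier. This is the naive, biased estimator; T-Cal relies on a debiased version whose construction is not given in the abstract. The function name, equal-width binning, and `n_bins` are illustrative assumptions.

```python
import numpy as np

def binned_squared_ece(probs, labels, n_bins=10):
    """Weighted average over bins of (mean label - mean predicted probability)^2."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Equal-width bins on [0, 1]; a prediction of exactly 1.0 goes into the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        gap = labels[mask].mean() - probs[mask].mean()
        total += mask.mean() * gap ** 2   # mask.mean() is the fraction of samples in the bin
    return total
```

On a finite sample this estimate is typically positive even for a perfectly calibrated model, because sampling noise in each bin is squared. That is the kind of artifact the hypothesis-testing view in the abstract is meant to separate from genuine miscalibration.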

Dimensionality Reduction and Wasserstein Stability for Kernel Regression
カーネル回帰における次元削減とワッサーシュタイン安定性

In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable with kernel regression. In order to analyze the resulting regression errors, a novel stability result for kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit the regression function. We apply the general stability result to principal component analysis (PCA). Exploiting known estimates from the literature on both principal component analysis and kernel regression, we deduce convergence rates for the two-step procedure. The latter turns out to be particularly useful in a semi-supervised setting.

高次元回帰フレームワークでは、最初に入力変数の次元が縮小され、次に、縮小された入力変数を使用してカーネル回帰で出力変数を予測するという単純な2ステップ手順の結果を研究します。結果として生じる回帰エラーを分析するために、Wasserstein距離に関するカーネル回帰の新しい安定性結果が導出されます。これにより、摂動された入力データが回帰関数を適合するために使用されたときに発生する誤差を限定することができます。一般的な安定性結果を主成分分析(PCA)に適用します。主成分分析とカーネル回帰の両方に関する文献からの既知の推定値を利用して、2ステップ手順の収束率を推定します。後者は、半教師ありの設定で特に役立つことがわかりました。
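
The two-step procedure studied here can be written down directly: reduce the inputs with PCA, then fit a kernel regressor on the reduced inputs. The sketch below uses Gaussian-kernel ridge regression as one common reading of "kernel regression"; that choice, the function names, and the parameters `bandwidth` and `ridge` are assumptions and are not taken from the paper.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the sample mean and the top principal directions (as rows)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

def gaussian_kernel(A, B, bandwidth):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(Z, y, bandwidth, ridge):
    """Solve (K + ridge * I) alpha = y on the reduced inputs Z."""
    K = gaussian_kernel(Z, Z, bandwidth)
    return np.linalg.solve(K + ridge * np.eye(len(Z)), y)

def kernel_ridge_predict(Z_train, alpha, Z_new, bandwidth):
    return gaussian_kernel(Z_new, Z_train, bandwidth) @ alpha
```

Step one needs only the inputs, so `pca_fit` can be run on a larger pool of unlabeled samples than the labeled set used in `kernel_ridge_fit`, which is one way to read the abstract's remark about the semi-supervised setting.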

Weisfeiler and Leman go Machine Learning: The Story so far
ワイスファイラーとレマンが機械学習に取り組む: これまでのストーリー

In recent years, algorithms and neural architectures based on the Weisfeiler–Leman algorithm, a well-known heuristic for the graph isomorphism problem, have emerged as a powerful tool for machine learning with graphs and relational data. Here, we give a comprehensive overview of the algorithm’s use in a machine-learning setting, focusing on the supervised regime. We discuss the theoretical background, show how to use it for supervised graph and node representation learning, discuss recent extensions, and outline the algorithm’s connection to (permutation-)equivariant neural architectures. Moreover, we give an overview of current applications and future directions to stimulate further research.

近年、グラフ同型問題のヒューリスティックとして有名なワイスファイラー・レマンアルゴリズムに基づくアルゴリズムとニューラルアーキテクチャが、グラフとリレーショナルデータを用いた機械学習の強力なツールとして登場しています。ここでは、機械学習環境でのアルゴリズムの使用について、教師ありレジームに焦点を当てて包括的に概観します。理論的な背景について議論し、教師ありグラフとノード表現学習にそれを使用する方法を示し、最近の拡張について議論し、アルゴリズムの(順列)等変ニューラルアーキテクチャへの接続を概説します。さらに、現在のアプリケーションの概要と、さらなる研究を刺激するための将来の方向性を示します。
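
The Weisfeiler-Leman heuristic at the center of this survey is easiest to understand in its 1-dimensional form, often called color refinement: every node repeatedly replaces its color with a label determined by its own color and the multiset of its neighbors' colors. A minimal sketch follows, assuming an undirected graph given as an adjacency dictionary; the function name is illustrative.

```python
def wl_refine(adj, n_rounds=None):
    """1-WL color refinement on a graph given as {node: list of neighbors}."""
    colors = {v: 0 for v in adj}                       # uniform initial coloring
    rounds = len(adj) if n_rounds is None else n_rounds
    for _ in range(rounds):
        # Signature = own color plus the sorted multiset of neighbor colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        # Refinement only ever splits color classes, so an unchanged class count means stability.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors
```

On a path with three nodes the two endpoints end up with one color and the middle node with another, while every node of a cycle keeps the same color. Message-passing graph neural networks follow the same aggregate-then-update pattern, which is the connection the survey develops.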

Learning Conditional Generative Models for Phase Retrieval
位相回復のための条件付き生成モデルの学習

Reconstructing images from magnitude measurements is an important and difficult problem arising in many research areas, such as X-ray crystallography, astronomical imaging and more. While optimization-based approaches often struggle with the non-convexity and non-linearity of the problem, learning-based approaches are able to produce reconstructions of high quality for data similar to a given training dataset. In this work, we analyze a class of methods based on conditional generative adversarial networks (CGAN). We show how the benefits of optimization-based and learning-based methods can be combined to improve reconstruction quality. Furthermore, we show that these combined methods are able to generalize to out-of-distribution data and analyze their robustness to measurement noise. In addition to that, we compare how the methods are impacted by missing measurements. Extensive ablation studies demonstrate that all components of our approach are essential and justify the choice of network architecture.

振幅測定から画像を再構成することは、X線結晶構造解析、天体イメージングなど、多くの研究分野で生じる重要かつ困難な問題です。最適化ベースのアプローチは、多くの場合、問題の非凸性および非線形性に苦労しますが、学習ベースのアプローチは、特定のトレーニングデータセットに類似したデータに対して高品質の再構成を生成できます。この研究では、条件付き生成的敵対的ネットワーク(CGAN)に基づく一連の方法を分析します。最適化ベースの方法と学習ベースの方法の利点を組み合わせて、再構成の品質を向上させる方法を示します。さらに、これらの組み合わせた方法が分布外データに汎化できることを示し、測定ノイズに対する堅牢性を分析します。それに加えて、測定の欠落によって方法がどのように影響を受けるかを比較します。広範なアブレーション研究により、私たちのアプローチのすべてのコンポーネントが不可欠であることが実証され、ネットワークアーキテクチャの選択が正当化されます。

Fair Data Representation for Machine Learning at the Pareto Frontier
パレートフロンティアにおける機械学習のための公正なデータ表現

As machine learning powered decision-making becomes increasingly important in our daily lives, it is imperative to strive for fairness in the underlying data processing. We propose a pre-processing algorithm for fair data representation via which supervised learning results in estimations of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal affine transport to approach the post-processing Wasserstein barycenter characterization of the optimal fair $L^2$-objective supervised learning via a pre-processing data deformation. Furthermore, we show that the Wasserstein geodesics from the conditional (on sensitive information) distributions of the learning outcome to their barycenter characterize the Pareto frontier between $L^2$-loss and the average pairwise Wasserstein distance among sensitive groups on the learning outcome. Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.

機械学習を活用した意思決定が日常生活でますます重要になるにつれ、基礎となるデータ処理の公平性を追求することが必須となっています。私たちは、教師あり学習によって予測誤差と統計的差異の間のパレート最適境界を推定できる、公平なデータ表現のための前処理アルゴリズムを提案します。特に、この研究では、最適アフィン輸送を適用して、前処理データ変形を介して最適で公平な$L^2$目的の教師あり学習の後処理ワッサーシュタイン重心特性評価にアプローチします。さらに、学習結果の条件付き(機密情報に基づく)分布からその重心までのワッサーシュタイン測地線が、学習結果に対する敏感なグループ間の平均ペアワイズワッサーシュタイン距離と$L^2$損失の間のパレート境界を特徴付けることを示します。数値シミュレーションにより、次の利点が強調されています。(1)前処理ステップは、任意の条件付き期待値推定教師あり学習方法および目に見えないデータと合成可能です。（２）公平な表現は、機密データに関する残りのデータの推論能力を制限することによって機密情報を保護します。（３）最適なアフィンマップは、高次元データに対しても計算効率が優れています。

The Power of Contrast for Feature Learning: A Theoretical Analysis
特徴学習におけるコントラストの力:理論的分析

Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, theoretical understanding of the superiority of contrastive learning is still limited. In this paper, under linear representation settings, (i) we provably show that contrastive learning outperforms the standard autoencoders and generative adversarial networks, two classical generative unsupervised learning methods, for both feature recovery and in-domain downstream tasks; (ii) we also illustrate the impact of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations in the in-domain downstream task, but it can harm the performance in transfer learning. We verify our theory with numerical experiments.

コントラスティブ学習は、さまざまな自己教師あり学習タスクで最先端のパフォーマンスを達成し、教師あり学習タスクよりも優れたパフォーマンスを発揮しています。その経験的な成功にもかかわらず、コントラスティブ学習の優位性に関する理論的理解はまだ限られています。この論文では、線形表現設定の下で、(i)コントラスティブ学習が、特徴回復とドメイン内ダウンストリームタスクの両方で、標準的な自己エンコーダと敵対的生成ネットワーク(2つの古典的な生成教師なし学習方法)よりも優れていることを証明できることを示しています。(ii)また、教師ありコントラスティブ学習におけるラベル付きデータの影響についても説明します。これは、ラベルを使用したコントラスティブ学習が、ドメイン内のダウンストリームタスクで学習した表現のパフォーマンスを向上させるが、転移学習のパフォーマンスを損なう可能性があるという最近の発見を理論的に裏付けるものです。私たちは、数値実験で理論を検証しています。

Mini-batching error and adaptive Langevin dynamics
ミニバッチエラーと適応ランジュバンダイナミクス

Bayesian inference allows one to obtain useful information on the parameters of models, either in computational statistics or more recently in the context of Bayesian Neural Networks. The computational cost of usual Monte Carlo methods for sampling posterior laws in Bayesian inference scales linearly with the number of data points. One option to reduce it to a fraction of this cost is to resort to mini-batching in conjunction with unadjusted discretizations of Langevin dynamics, in which case only a random fraction of the data is used to estimate the gradient. However, this leads to an additional noise in the dynamics and hence a bias on the invariant measure which is sampled by the Markov chain. We advocate using the so-called Adaptive Langevin dynamics (AdL), which is a modification of standard inertial Langevin dynamics with a dynamical friction which automatically corrects for the increased noise arising from mini-batching. We investigate the practical relevance of the assumptions underpinning Adaptive Langevin (constant covariance for the estimation of the gradient, Gaussian mini-batching noise), which are not satisfied in typical models of Bayesian inference, and quantify the bias induced by mini-batching in this case. We also suggest a possible extension of AdL to further reduce the bias on the posterior distribution, by considering a dynamical friction depending on the current value of the parameter to sample.

ベイズ推論により、計算統計学、または最近ではベイズニューラルネットワークのコンテキストで、モデルのパラメータに関する有用な情報を得ることができます。ベイズ推論で事後法則をサンプリングするための通常のモンテカルロ法の計算コストは、データポイントの数に比例して増加します。このコストを数分の1にまで削減する1つの方法は、ランジュバンダイナミクスの調整されていない離散化と組み合わせたミニバッチングに頼ることです。この場合、データのランダムな一部のみが勾配の推定に使用されます。ただし、これによりダイナミクスにノイズが追加され、マルコフ連鎖によってサンプリングされる不変測度にバイアスが生じます。私たちは、いわゆる適応型ランジュバンダイナミクスの使用を推奨します。これは、ミニバッチングから生じるノイズの増加を自動的に修正する動的摩擦を備えた標準的な慣性ランジュバンダイナミクスの修正版です。私たちは、ベイズ推論の典型的なモデルでは満たされていない適応型ランジュバンの仮定（勾配の推定に対する一定の共分散、ガウス型ミニバッチノイズ）の実際的な関連性を調査し、この場合のミニバッチによって誘発されるバイアスを定量化します。また、サンプリングするパラメータの現在の値に応じて動的摩擦を考慮することにより、事後分布のバイアスをさらに削減するためのAdLの可能な拡張を提案します。

Bandit problems with fidelity rewards
忠実度報酬に関するバンディットの問題

The fidelity bandits problem is a variant of the $K$-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoff depending on how ‘loyal’ the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which can be bounded sublinearly. We study in detail the special cases of increasing, decreasing and coupon (where the player gets an additional reward after every $m$ plays of an arm) fidelity rewards. For the models which do not necessarily enjoy sublinear regret, we provide a worst case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret.

忠実度バンディット問題は、各アームの報酬が忠実度報酬によって増強される、Kアームバンディット問題の変形です。忠実度報酬は、プレーヤーが過去にそのアームにどれだけ「忠実」であったかに応じてプレーヤーに追加の見返りを提供します。忠実度について2つのモデルを提案します。忠誠ポイントモデルでは、追加報酬の量は、アームが以前にプレイされた回数によって決まります。サブスクリプションモデルでは、追加報酬は、アームの現在の連続抽選回数によって決まります。確率的問題と敵対的問題の両方を検討します。単一アーム戦略は確率的問題では常に最適とは限らないため、敵対的設定での後悔の概念は慎重に調整する必要があります。後悔の概念として考えられる3つの概念を紹介し、どれが線形以下で制限できるかを調べます。増加、減少、クーポン(プレーヤーがアームをm回プレイするごとに追加の報酬を得る)の忠実度報酬の特殊なケースを詳細に調査します。必ずしも線形以下の後悔を享受しないモデルについては、最悪のケースの下限値を提供します。線形以下の後悔を示すモデルについては、アルゴリズムを提供し、その後悔を制限します。
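
The two fidelity models differ only in which counter drives the bonus. The sketch below tracks both counters for a sequence of arm pulls. It is an illustration under stated assumptions, not the paper's formulation: the names `fidelity_counters` and `coupon_bonus` are hypothetical, the streak is assumed to include the current draw, and the coupon is assumed to pay a fixed bonus on every `m`-th play.

```python
from collections import defaultdict

def fidelity_counters(pulls):
    """For each pull return (prior_plays, streak).

    prior_plays: how often this arm was played before (loyalty-points model).
    streak: current run of consecutive draws of this arm, counting the
            current draw (subscription model; counting convention assumed).
    """
    plays = defaultdict(int)
    last_arm, streak = None, 0
    out = []
    for arm in pulls:
        streak = streak + 1 if arm == last_arm else 1
        last_arm = arm
        out.append((plays[arm], streak))
        plays[arm] += 1
    return out

def coupon_bonus(prior_plays, m, bonus=1.0):
    """Coupon-style fidelity: pay `bonus` on every m-th play of an arm (assumed form)."""
    return bonus if (prior_plays + 1) % m == 0 else 0.0

print(fidelity_counters([0, 0, 1, 0]))
```

Switching away from arm 0 and back resets the streak but not the loyalty count, which is why a single-arm strategy need not be optimal under every fidelity function.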

Reproducing Kernels and New Approaches in Compositional Data Analysis
再生カーネルと組成データ解析における新しいアプローチ

Compositional data, such as human gut microbiomes, consist of non-negative variables where only the relative values of these variables are available. Analyzing compositional data requires careful treatment of the geometry of the data. A common geometrical approach to understanding such data is through a regular simplex. The majority of existing approaches rely on log-ratio or power transformations to address the inherent simplicial geometry. In this work, based on the key observation that compositional data are projective, we reinterpret the compositional domain as a group quotient of a sphere, leveraging the intrinsic connection between projective and spherical geometry. This interpretation enables us to understand the function spaces on the compositional domain in terms of those on a sphere, and furthermore, to utilize spherical harmonics theory for constructing a compositional Reproducing Kernel Hilbert Space (RKHS). The construction of RKHS for compositional data opens up new research avenues for future methodology developments, particularly introducing well-developed kernel methods to compositional data analysis. We demonstrate the wide applicability of the proposed theoretical framework with examples of nonparametric density estimation, kernel exponential family, and support vector machine for compositional data.

ヒト腸内微生物叢などの構成データは、非負の変数で構成され、これらの変数の相対値のみが利用可能です。構成データを分析するには、データの幾何学を慎重に扱う必要があります。このようなデータを理解する一般的な幾何学的アプローチは、正則単体を介したものです。既存のアプローチの大部分は、固有の単体幾何学に対処するために対数比またはべき乗変換に依存しています。この研究では、構成データが射影的であるという重要な観察に基づいて、射影幾何学と球面幾何学の本質的なつながりを利用して、構成領域を球面の群商として再解釈します。この解釈により、構成領域上の関数空間を球面上の関数空間の観点から理解できるようになり、さらに、球面調和関数理論を使用して構成的再生カーネルヒルベルト空間(RKHS)を構築できます。構成データのRKHSの構築は、特に十分に開発されたカーネル法を構成データ分析に導入するなど、将来の方法論開発のための新しい研究の道を開きます。ノンパラメトリック密度推定、カーネル指数族、構成データのサポートベクターマシンの例を使用して、提案された理論的枠組みの幅広い適用可能性を示します。

Benign Overfitting of Constant-Stepsize SGD for Linear Regression
線形回帰のための一定ステップサイズSGDの良性過適合

There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging or tail averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific for SGD) and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). More specifically, for SGD with iterate averaging, we demonstrate the sharpness of the established excess risk bound by proving a matching lower bound (up to constant factors). For SGD with tail averaging, we show its advantage over SGD with iterate averaging by proving a better excess risk bound together with a nearly matching lower bound. Moreover, we reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares (minimum-norm interpolation) and ridge regression. Experimental results on synthetic data corroborate our theoretical findings.

アルゴリズムの帰納的バイアスが過剰適合の防止に重要であるという認識が高まっています。経験的には、明示的な正則化がほとんどまたはまったく使用されていない確率的勾配降下法(SGD)などの自然学習アルゴリズムの過剰パラメータ設定で、良性の過剰適合現象がよく見られます。この研究では、この問題を、おそらく最も基本的な設定、つまり過剰パラメータ化された状態での線形回帰に対する一定ステップサイズのSGD (反復平均化または裾平均化を使用)で検討します。主な結果は、データ共分散行列の完全な固有スペクトルの観点から述べられた鋭い過剰リスク境界を提供し、一般化が可能な場合を特徴付けるバイアス-分散分解を明らかにします。(i)分散境界は有効次元の観点から特徴付けられ(SGDに固有)、(ii)バイアス境界は初期反復の位置(およびそれがデータ共分散行列とどのように一致するか)の観点から鋭い幾何学的特徴付けを提供します。具体的には、反復平均化によるSGDの場合、一致する下限値(定数倍まで)を証明することで、確立された超過リスク境界の厳しさを実証します。テール平均化によるSGDの場合、より良い超過リスク境界値とほぼ一致する下限値を証明することで、反復平均化によるSGDに対する優位性を示します。さらに、(非正規化) SGDによって提供されるアルゴリズム正規化と、通常の最小二乗法(最小ノルム補間)およびリッジ回帰との比較における、いくつかの注目すべき違いについて考察します。合成データでの実験結果は、理論的発見を裏付けています。
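
To make the two averaging schemes concrete, here is a minimal sketch of one-pass constant-stepsize SGD for least squares that returns both the average of all iterates and the tail average. The helper names `make_data` and `sgd_linear`, the zero initialisation, and the noise-free synthetic data are assumptions for illustration; the paper's analysis concerns the overparameterized regime, which this toy example does not reproduce.

```python
import random

def make_data(n, w_star, seed=0):
    """Noise-free synthetic regression data with Gaussian features (illustrative)."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in w_star] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(w_star, row)) for row in xs]
    return xs, ys

def sgd_linear(xs, ys, stepsize, tail_start):
    """One pass of constant-stepsize SGD; returns (iterate average, tail average)."""
    d = len(xs[0])
    w = [0.0] * d
    full_sum = [0.0] * d
    tail_sum = [0.0] * d
    for t, (x, y) in enumerate(zip(xs, ys)):
        for j in range(d):
            full_sum[j] += w[j]          # average over all iterates
            if t >= tail_start:
                tail_sum[j] += w[j]      # average over the last iterates only
        resid = sum(wj * xj for wj, xj in zip(w, x)) - y
        w = [wj - stepsize * resid * xj for wj, xj in zip(w, x)]
    n = len(xs)
    full_avg = [s / n for s in full_sum]
    tail_avg = [s / (n - tail_start) for s in tail_sum]
    return full_avg, tail_avg

xs, ys = make_data(2000, [1.0, -2.0])
full_avg, tail_avg = sgd_linear(xs, ys, stepsize=0.05, tail_start=1000)
print(full_avg, tail_avg)
```

In this toy run the tail average discards the early iterates that still carry the initialisation, which is the intuition behind the bias advantage of tail averaging.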

ProtoShotXAI: Using Prototypical Few-Shot Architecture for Explainable AI
ProtoShotXAI:説明可能なAIのためのプロトタイプのFew-Shotアーキテクチャの使用

Unexplainable black-box models create scenarios where anomalies cause deleterious responses, thus creating unacceptable risks. These risks have motivated the field of eXplainable Artificial Intelligence (XAI) which improves trust by evaluating local interpretability in black-box neural networks. Unfortunately, the ground truth is unavailable for the model’s decision, so evaluation is limited to qualitative assessment. Further, interpretability may lead to inaccurate conclusions about the model or a false sense of trust. We propose to improve XAI from the vantage point of the user’s trust by exploring a black-box model’s latent feature space. We present an approach, ProtoShotXAI, that uses a Prototypical few-shot network to explore the contrastive manifold between nonlinear features of different classes. A user explores the manifold by perturbing the input features of a query sample and recording the response for a subset of exemplars from any class. Our approach is a locally interpretable XAI model that can be extended to, and demonstrated on, few-shot networks. We compare ProtoShotXAI to the state-of-the-art XAI approaches on MNIST, Omniglot, and ImageNet to demonstrate, both quantitatively and qualitatively, that ProtoShotXAI provides more flexibility for model exploration. Finally, ProtoShotXAI also demonstrates novel explainability and detectability on adversarial samples.

説明不可能なブラックボックスモデルは、異常が有害な応答を引き起こすシナリオを作成し、許容できないリスクを生み出します。これらのリスクが、ブラックボックスニューラルネットワークのローカルな解釈可能性を評価することで信頼性を向上させる、説明可能な人工知能(XAI)の分野への動機となっています。残念ながら、モデルの決定にはグラウンドトゥルースが利用できないため、評価は定性的な評価に限定されます。さらに、解釈可能性により、モデルに関する結論が不正確になったり、信頼感が誤っている可能性があります。ブラックボックスモデルの潜在的な特徴空間を探索することで、ユーザーの信頼の観点からXAIを改善することを提案します。プロトタイプの少数ショットネットワークを使用して、異なるクラスの非線形特徴間の対照的な多様体を探索するアプローチ、ProtoShotXAIを紹介します。ユーザーは、クエリサンプルの入力特徴を摂動し、任意のクラスのサンプルのサブセットに対する応答を記録することで、多様体を探索します。私たちのアプローチは、ローカルで解釈可能なXAIモデルであり、少数ショットネットワークに拡張して実証することができます。私たちは、ProtoShotXAIをMNIST、Omniglot、ImageNetの最先端のXAIアプローチと比較し、ProtoShotXAIがモデル探索に高い柔軟性を提供することを定量的にも定性的にも実証します。最後に、ProtoShotXAIは敵対的サンプルに対する新しい説明可能性と検出可能性も実証します。

Be More Active! Understanding the Differences Between Mean and Sampled Representations of Variational Autoencoders
もっとアクティブに!変分オートエンコーダの平均表現とサンプル表現の違いの理解

The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement_lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream task applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.

変分オートエンコーダは、分離表現を学習できるため、実用的用途に魅力的です。しかし、下流のタスクに一般的に使用される平均表現は、分離の測定に使用されるサンプリングされた表現よりも相関が高いことが最近明らかになりました。この論文では、選択的事後崩壊の観点からこの観察結果を精緻化します。選択的事後崩壊とは、学習された表現のサブセットであるアクティブ変数のみが有用な情報をエンコードし、残り(パッシブ変数)は破棄されるというものです。まず、既存の定義を複数のデータ例に拡張し、アクティブ変数が平均表現とサンプリング表現で等しく分離されていることを示します。この拡張とdisentanglement_libの事前トレーニング済みモデルに基づいて、パッシブ変数を分離し、平均表現とサンプリング表現の不一致の原因がパッシブ変数であることを示します。具体的には、パッシブ変数は平均表現では他の変数と高い相関スコアを示しますが、サンプリング表現では完全に無相関です。したがって、相関関係が高いことが示唆するにもかかわらず、平均表現は下流のタスクアプリケーションにとって依然として優れた候補であると結論付けています。ただし、相関関係のある特徴に敏感なモデルで使用する場合は特に、パッシブ変数を削除することが有益な場合があります。

Semi-Supervised Off-Policy Reinforcement Learning and Value Estimation for Dynamic Treatment Regimes
動的治療レジームのための半教師ありオフポリシー強化学習と価値推定

Reinforcement learning (RL) has shown great promise in estimating dynamic treatment regimes which take into account patient heterogeneity. However, health-outcome information, used as the reward for RL methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small-sized labeled data set with actual outcomes observed and a large unlabeled data set with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to $Q$-learning and doubly robust off-policy value estimation. Generalizing SSL to dynamic treatment regimes brings interesting challenges: 1) Feature distribution for $Q$-learning is unknown as it includes previous outcomes. 2) The surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative of the optimal policy or value function. We provide theoretical results for our $Q$ function and value function estimators to understand the degree of efficiency gained from SSL. Our method is at least as efficient as the supervised approach, and robust to bias from mis-specification of the imputation models.

強化学習(RL)は、患者の異質性を考慮した動的治療体制の推定に大きな可能性を示しています。しかし、RL手法の報酬として使用される健康アウトカム情報は、適切にコード化されておらず、臨床記録に埋め込まれていることがよくあります。正確なアウトカム情報を抽出することはリソース集約型のタスクであるため、利用可能な適切に注釈付けされたコホートのほとんどは小規模です。この問題に対処するために、実際のアウトカムが観察された小規模のラベル付きデータセットと、アウトカムのサロゲートを含む大規模なラベルなしデータセットを効率的に活用する半教師あり学習(SSL)アプローチを提案します。特に、$Q$学習と二重に堅牢なオフポリシー値推定に対する半教師ありの効率的なアプローチを提案します。SSLを動的治療体制に一般化すると、興味深い課題が生じます。1) $Q$学習の特徴分布は、以前のアウトカムを含むため不明です。2)修正されたSSLフレームワークで活用するサロゲート変数は、アウトカムを予測しますが、最適なポリシーまたは値関数に関する情報は提供しません。SSLから得られる効率の程度を理解するために、$Q$関数と価値関数の推定値の理論的結果を提供します。私たちの方法は、少なくとも教師ありアプローチと同程度の効率性があり、代入モデルの誤った指定によるバイアスに対して堅牢です。

Consistent Second-Order Conic Integer Programming for Learning Bayesian Networks
ベイジアンネットワークの学習のための無矛盾な2次円錐整数計画法

Bayesian Networks (BNs) represent conditional probability relations among a set of random variables (nodes) in the form of a directed acyclic graph (DAG), and have found diverse applications in knowledge discovery. We study the problem of learning the sparse DAG structure of a BN from continuous observational data. The central problem can be modeled as a mixed-integer program with an objective function composed of a convex quadratic loss function and a regularization penalty subject to linear constraints. The optimal solution to this mathematical program is known to have desirable statistical properties under certain conditions. However, the state-of-the-art optimization solvers are not able to obtain provably optimal solutions to the existing mathematical formulations for medium-size problems within reasonable computational times. To address this difficulty, we tackle the problem from both computational and statistical perspectives. On the one hand, we propose a concrete early stopping criterion to terminate the branch-and-bound process in order to obtain a near-optimal solution to the mixed-integer program, and establish the consistency of this approximate solution. On the other hand, we improve the existing formulations by replacing the linear “big-$M$” constraints that represent the relationship between the continuous and binary indicator variables with second-order conic constraints. Our numerical results demonstrate the effectiveness of the proposed approaches.

ベイジアンネットワーク(BN)は、有向非巡回グラフ(DAG)の形式で一連のランダム変数(ノード)間の条件付き確率関係を表し、知識発見のさまざまな用途に使用されています。私たちは、連続観測データからBNのスパースDAG構造を学習する問題を研究しています。中心的な問題は、線形制約を受ける凸二次損失関数と正則化ペナルティで構成される目的関数を持つ混合整数計画としてモデル化できます。この数学計画の最適解は、特定の条件下では望ましい統計特性を持つことが知られています。ただし、最先端の最適化ソルバーは、中規模問題に対する既存の数学的定式化の証明可能な最適解を妥当な計算時間内に得ることができません。この困難に対処するために、私たちは計算と統計の両方の観点から問題に取り組んでいます。一方では、混合整数計画のほぼ最適な解を得るために分岐限定法プロセスを終了し、この近似解の一貫性を確立するための具体的な早期停止基準を提案します。一方、連続変数とバイナリ指標変数の関係を表す線形「big-$M$」制約を2次円錐制約に置き換えることで、既存の定式化を改善します。数値結果は、提案されたアプローチの有効性を実証しています。

Scale Invariant Power Iteration
スケール不変べき乗反復

We introduce a new class of optimization problems called scale invariant problems that cover interesting problems in machine learning and statistics and show that they are efficiently solved by a general form of power iteration called scale invariant power iteration (SCI-PI). SCI-PI is a special case of the generalized power method (GPM) (Journée et al., 2010) where the constraint set is the unit sphere. In this work, we provide the convergence analysis of SCI-PI for scale invariant problems which yields a better rate than the analysis of GPM. Specifically, we prove that it attains local linear convergence with a generalized rate of power iteration to find an optimal solution for scale invariant problems. Moreover, we discuss some extended settings of scale invariant problems and provide similar convergence results. In numerical experiments, we introduce applications to independent component analysis, Gaussian mixtures, and non-negative matrix factorization with the KL-divergence. Experimental results demonstrate that SCI-PI is competitive with application-specific state-of-the-art algorithms and often yields better solutions.

私たちは、機械学習と統計学における興味深い問題をカバーするスケール不変問題と呼ばれる新しいクラスの最適化問題を紹介し、スケール不変べき乗反復(SCI-PI)と呼ばれる一般的な形式のべき乗反復によってそれらが効率的に解決されることを示します。SCI-PIは、制約セットが単位球である一般化べき乗法(GPM) (Journéeら, 2010)の特殊なケースです。この研究では、スケール不変問題に対するSCI-PIの収束解析を提供し、GPMの解析よりも優れた速度をもたらします。具体的には、スケール不変問題の最適解を見つけるために、一般化されたべき乗反復速度で局所線形収束を達成することを証明します。さらに、スケール不変問題のいくつかの拡張設定について説明し、同様の収束結果を示します。数値実験では、独立成分分析、ガウス混合、およびKL情報による非負値行列因子分解への応用を紹介します。実験結果から、SCI-PIはアプリケーション固有の最先端のアルゴリズムと競合し、より優れたソリューションを生み出すことが多いことが実証されています。
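
Because SCI-PI is described as the generalized power method restricted to the unit sphere, its basic step can be sketched as "take the gradient, then renormalise". The code below is a minimal sketch under that reading, not the authors' implementation; the names `sci_pi`, `normalize`, and `grad_quadratic`, the fixed iteration count, and the quadratic test objective are assumptions. For f(x) = ½ xᵀAx the gradient is Ax, so the loop reduces to ordinary power iteration.

```python
import math

def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sci_pi(grad, x0, iters=200):
    """Repeat x <- grad(x) / ||grad(x)|| starting from x0 (assumed update form)."""
    x = normalize(x0)
    for _ in range(iters):
        x = normalize(grad(x))
    return x

A = [[2.0, 0.0], [0.0, 1.0]]

def grad_quadratic(x):
    # gradient of f(x) = 0.5 * x^T A x is A x
    return [sum(a * c for a, c in zip(row, x)) for row in A]

x = sci_pi(grad_quadratic, [1.0, 1.0])
print(x)  # approaches the leading eigenvector of A
```

Swapping `grad_quadratic` for the gradient of another scale invariant objective changes the problem without changing the loop.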

Higher-Order Spectral Clustering Under Superimposed Stochastic Block Models
重畳確率的ブロックモデルの下での高次スペクトルクラスタリング

Higher-order motif structures and multi-vertex interactions are becoming increasingly important in studies of functionalities and evolution patterns of complex networks. To elucidate the role of higher-order structures in community detection over networks, we introduce a Superimposed Stochastic Block Model (SupSBM). The model is based on a random graph framework in which certain higher-order structures or subgraphs are generated through an independent hyperedge generation process and then replaced with graphs superimposed with edges generated by an inhomogeneous random graph model. Consequently, the model introduces dependencies between edges which allow for capturing more realistic network phenomena, namely strong local clustering in a sparse network, short average path length, and community structure. We then proceed to rigorously analyze the performance of a recently proposed higher-order spectral clustering method on the SupSBM. In particular, we prove non-asymptotic upper bounds on the misclustering error of higher-order spectral community detection for a SupSBM setting in which triangles are superimposed with undirected edges. We assess the model fit of the proposed model and compare it with existing random graph models in terms of observed properties of real network data obtained from diverse domains by sampling networks from the fitted models and a nonparametric network cross-validation approach.

高次モチーフ構造と多頂点相互作用は、複雑ネットワークの機能と進化パターンの研究においてますます重要になっています。ネットワーク上のコミュニティ検出における高次構造の役割を明らかにするために、重ね合わせた確率的ブロックモデル(SupSBM)を紹介します。このモデルはランダムグラフフレームワークに基づいており、特定の高次構造またはサブグラフが独立したハイパーエッジ生成プロセスによって生成され、その後、不均質ランダムグラフモデルによって生成されたエッジが重ね合わされたグラフに置き換えられます。その結果、このモデルはエッジ間の依存関係を導入し、より現実的なネットワーク現象、つまり疎なネットワークでの強力なローカルクラスタリング、短い平均パス長、およびコミュニティ構造を捉えることができます。次に、最近提案された高次スペクトルクラスタリング法のSupSBMでのパフォーマンスを厳密に分析します。特に、三角形が無向エッジで重ね合わされたSupSBM設定に対する高次スペクトルコミュニティ検出のミスクラスタリングエラーの非漸近的上限を証明します。提案モデルのモデル適合性を評価し、適合モデルからネットワークをサンプリングし、ノンパラメトリックネットワーククロス検証アプローチを使用して、さまざまなドメインから取得した実際のネットワークデータの観測特性の観点から、既存のランダムグラフモデルと比較します。

Bagging in overparameterized learning: Risk characterization and risk monotonization
過剰パラメータ化学習におけるバギング:リスク特性評価とリスク単調化

Bagging is a commonly used ensemble technique in statistics and machine learning to improve the performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged predictors under the proportional asymptotics regime, in which the ratio of the number of features to the number of observations converges to a constant. Specifically, we propose a general strategy to analyze the prediction risk under squared error loss of bagged predictors using classical results on simple random sampling. Specializing the strategy, we derive the exact asymptotic risk of the bagged ridge and ridgeless predictors with an arbitrary number of bags under a well-specified linear model with arbitrary feature covariance matrices and signal vectors. Furthermore, we prescribe a generic cross-validation procedure to select the optimal subsample size for bagging and discuss its utility to eliminate the non-monotonic behavior of the limiting risk in the sample size (i.e., double or multiple descents). In demonstrating the proposed procedure for bagged ridge and ridgeless predictors, we thoroughly investigate the oracle properties of the optimal subsample size and provide an in-depth comparison between different bagging variants.

バギングは、統計と機械学習において予測手順のパフォーマンスを向上させるために一般的に使用されるアンサンブル手法です。この論文では、特徴数と観測数の比率が定数に収束する比例漸近領域における、バギングされた予測子のバリアントの予測リスクについて検討します。具体的には、単純ランダムサンプリングの古典的な結果を使用して、バギングされた予測子の二乗誤差損失における予測リスクを分析するための一般的な戦略を提案します。この戦略を特殊化して、任意の特徴共分散行列と信号ベクトルを含む適切に指定された線形モデルの下で、任意の数のバッグを持つバギングされたリッジ予測子とリッジレス予測子の正確な漸近リスクを導出します。さらに、バギングに最適なサブサンプルサイズを選択するための一般的なクロス検証手順を規定し、サンプルサイズにおける制限リスクの非単調な動作（つまり、二重降下または多重降下）を排除するためのその有用性について説明します。バギングされたリッジとリッジレス予測子の提案された手順を実証する際に、最適なサブサンプルサイズのオラクル特性を徹底的に調査し、さまざまなバギングバリアント間の詳細な比較を提供します。

Operator learning with PCA-Net: upper and lower complexity bounds
PCA-Netによるオペレーター学習:複雑さの上限と下限

PCA-Net is a recently proposed neural operator architecture which combines principal component analysis (PCA) with neural networks to approximate operators between infinite-dimensional function spaces. The present work develops approximation theory for this approach, improving and significantly extending previous work in this direction: First, a novel universal approximation result is derived, under minimal assumptions on the underlying operator and the data-generating distribution. Then, two potential obstacles to efficient operator learning with PCA-Net are identified, and made precise through lower complexity bounds; the first relates to the complexity of the output distribution, measured by a slow decay of the PCA eigenvalues. The other obstacle relates to the inherent complexity of the space of operators between infinite-dimensional input and output spaces, resulting in a rigorous and quantifiable statement of a “curse of parametric complexity”, an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. In addition to these lower bounds, upper complexity bounds are finally derived. A suitable smoothness criterion is shown to ensure an algebraic decay of the PCA eigenvalues. Furthermore, it is shown that PCA-Net can overcome the general curse for specific operators of interest, arising from the Darcy flow and the Navier-Stokes equations.

PCA-Netは、主成分分析(PCA)とニューラルネットワークを組み合わせて無限次元関数空間間の演算子を近似する、最近提案されたニューラル演算子アーキテクチャです。この研究では、このアプローチの近似理論を開発し、この方向での以前の研究を改善し、大幅に拡張します。まず、基礎となる演算子とデータ生成分布に関する最小限の仮定の下で、新しい普遍的な近似結果が導出されます。次に、PCA-Netによる効率的な演算子学習に対する2つの潜在的な障害が特定され、より低い複雑性境界によって明確にされます。1つ目は、PCA固有値の緩やかな減衰によって測定される出力分布の複雑性に関するものです。もう1つの障害は、無限次元の入力空間と出力空間間の演算子の空間の固有の複雑性に関するもので、その結果、「パラメトリック複雑性の呪い」という厳密で定量化可能なステートメントが生まれます。これは、高次元近似問題で遭遇するよく知られた次元の呪いの無限次元版です。これらの下限に加えて、最終的に複雑さの上限が導出されます。PCA固有値の代数的減衰を保証するために適切な平滑性基準が示されています。さらに、PCA-Netは、ダルシーフローとナビエ-ストークス方程式から生じる、特定の対象演算子の一般的な呪いを克服できることが示されています。

Mixed Regression via Approximate Message Passing
近似メッセージパッシングによる混合回帰

We study the problem of regression in a generalized linear model (GLM) with multiple signals and latent variables. This model, which we call a matrix GLM, covers many widely studied problems in statistical learning, including mixed linear regression, max-affine regression, and mixture-of-experts. The goal in all these problems is to estimate the signals, and possibly some of the latent variables, from the observations. We propose a novel approximate message passing (AMP) algorithm for estimation in a matrix GLM and rigorously characterize its performance in the high-dimensional limit. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic mean-squared error. The state evolution characterization can be used to tailor the AMP algorithm to take advantage of any structural information known about the signals. Using state evolution, we derive an optimal choice of AMP ‘denoising’ functions that minimizes the estimation error in each iteration. The theoretical results are validated by numerical simulations for mixed linear regression, max-affine regression, and mixture-of-experts. For max-affine regression, we propose an algorithm that combines AMP with expectation-maximization to estimate the intercepts of the model along with the signals. The numerical results show that AMP significantly outperforms other estimators for mixed linear regression and max-affine regression in most parameter regimes.

私たちは、複数の信号と潜在変数を持つ一般化線形モデル(GLM)における回帰の問題を研究します。私たちがマトリックスGLMと呼ぶこのモデルは、混合線形回帰、最大アフィン回帰、およびエキスパート混合など、統計学習において広く研究されている多くの問題をカバーします。これらすべての問題の目標は、観測から信号、および場合によっては潜在変数の一部を推定することです。私たちは、マトリックスGLMでの推定のための新しい近似メッセージパッシング(AMP)アルゴリズムを提案し、高次元の限界におけるそのパフォーマンスを厳密に特徴付けます。この特徴付けは状態進化再帰の観点から行われ、これにより、漸近平均二乗誤差などのパフォーマンス指標を正確に計算できます。状態進化の特徴付けは、AMPアルゴリズムをカスタマイズして、信号について知られている構造情報を活用するために使用できます。状態進化を使用して、各反復における推定エラーを最小化するAMP「ノイズ除去」関数の最適な選択を導き出します。理論的な結果は、混合線形回帰、最大アフィン回帰、およびエキスパート混合の数値シミュレーションによって検証されます。最大アフィン回帰については、AMPと期待値最大化を組み合わせて、モデルの切片と信号を推定するアルゴリズムを提案します。数値結果では、ほとんどのパラメーター領域で、AMPが混合線形回帰および最大アフィン回帰の他の推定器を大幅に上回る性能を発揮することが示されています。

The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima
シャープネスを意識した最小化のダイナミクス:渓谷を跳ね返り、幅の広い最小値に向かって漂流

We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM’s update may be regarded as a third derivative—the derivative of the Hessian in the leading eigenvector direction—that encourages drift toward wider minima.

私たちは、画像と言語の予測問題でパフォーマンスの向上を示した深層ネットワークの勾配ベースの最適化手法であるSharpness-Aware Minimization(SAM)を検討しています。SAMを凸2次目的関数で適用すると、ほとんどのランダム初期化では、最小値の両側間で曲率が最大の方向に振動するサイクルに収束することを示し、収束率の境界を与えます。非二次の場合、そのような振動がヘッセ行列のスペクトルノルムに対して、より小さなステップサイズで勾配降下を効果的に実行することを示します。このような場合、SAMの更新は、より広い極小値へのドリフトを促す3次導関数(ヘッセ行列の主固有ベクトル方向の導関数)と見なすことができます。
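
The oscillation on a convex quadratic can be reproduced with a tiny simulation. The sketch below assumes the commonly used normalised-gradient form of the SAM step and a diagonal quadratic loss L(w) = ½ Σ λᵢ wᵢ²; the function names, step size, perturbation radius, and curvatures are all illustrative choices, not values from the paper.

```python
import math

def sam_step(w, curvatures, lr, rho):
    """One SAM step on L(w) = 0.5 * sum(lam_i * w_i**2) (assumed update form)."""
    grad = [lam * wi for lam, wi in zip(curvatures, w)]
    norm = math.sqrt(sum(g * g for g in grad))
    # ascend by rho along the normalised gradient, then use the gradient there
    probe = [wi + rho * g / norm for wi, g in zip(w, grad)]
    probe_grad = [lam * p for lam, p in zip(curvatures, probe)]
    return [wi - lr * g for wi, g in zip(w, probe_grad)]

def run_sam(w0, curvatures, lr=0.1, rho=0.1, steps=2000):
    """Return the full trajectory of SAM iterates."""
    traj = [list(w0)]
    for _ in range(steps):
        traj.append(sam_step(traj[-1], curvatures, lr, rho))
    return traj

traj = run_sam([1.0, 1.0], [2.0, 0.5])
print(traj[-2], traj[-1])  # first coordinate flips sign; second has decayed
```

In one dimension the map is w ↦ (1 − ηλ)w − ηρλ·sign(w), whose two-point cycle sits at |w| = ηρλ / (2 − ηλ); the simulation settles on that value along the highest-curvature coordinate.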

MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library
MARLlib: スケーラブルで効率的なマルチエージェント強化学習ライブラリ

A significant challenge facing researchers in the area of multi-agent reinforcement learning (MARL) pertains to the identification of a library that can offer fast and compatible development for multi-agent tasks and algorithm combinations, while obviating the need to consider compatibility issues. In this paper, we present MARLlib, a library designed to address the aforementioned challenge by leveraging three key mechanisms: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy. By utilizing these mechanisms, MARLlib can effectively disentangle the intertwined nature of the multi-agent task and the learning process of the algorithm, with the ability to automatically alter the training strategy based on the current task’s attributes. The MARLlib library’s source code is publicly accessible on GitHub: https://github.com/Replicable-MARL/MARLlib.

マルチエージェント強化学習(MARL)の分野の研究者が直面している大きな課題は、マルチエージェントタスクとアルゴリズムの組み合わせに対して高速で互換性のある開発を提供できるライブラリを特定することであり、互換性の問題を考慮する必要がなくなります。この論文では、1)標準化されたマルチエージェント環境ラッパー、2)エージェントレベルのアルゴリズム実装、3)柔軟なポリシーマッピング戦略という3つの主要なメカニズムを活用して、前述の課題に対処するように設計されたライブラリであるMARLlibを紹介します。これらのメカニズムを利用することで、MARLlibは、マルチエージェントタスクの絡み合った性質とアルゴリズムの学習プロセスを効果的に解きほぐすことができ、現在のタスクの属性に基づいてトレーニング戦略を自動的に変更することができます。MARLlibライブラリのソースコードは、GitHub: https://github.com/Replicable-MARL/MARLlibで公開されています。

Fast Expectation Propagation for Heteroscedastic, Lasso-Penalized, and Quantile Regression
不均一分散、Lasso ペナルティ付き、および分位回帰のための高速期待値伝播

Expectation propagation (EP) is an approximate Bayesian inference (ABI) method which has seen widespread use across machine learning and statistics, owing to its accuracy and speed. However, it is often difficult to apply EP to models with complex likelihoods, where the EP updates do not have a tractable form and need to be calculated using methods such as multivariate numerical quadrature. These methods increase run time and reduce the appeal of EP as a fast approximate method. In this paper, we demonstrate that EP can still be made fast for certain models in this category. We focus on various types of linear regression, for which fast Bayesian inference is becoming increasingly important in the transition to big data. Fast EP updates are achieved through analytic integral reductions in certain moment computations. EP is compared to other ABI methods across simulations and benchmark datasets, and is shown to offer a good balance between accuracy and speed.

期待伝播(EP)は、その精度と速度により、機械学習と統計全体で広く使用されている近似ベイズ推論(ABI)手法です。ただし、EPの更新が扱いやすい形式を持たず、多変量数値求積法などの方法を使用して計算する必要がある複雑な尤度を持つモデルにEPを適用することは、多くの場合困難です。これらの方法は、実行時間を増加させ、高速近似法としてのEPの魅力を低下させます。この論文では、このカテゴリの特定のモデルでEPを高速化できることを示します。私たちは、ビッグデータへの移行において高速ベイズ推論がますます重要になっているさまざまなタイプの線形回帰に焦点を当てています。高速EP更新は、特定のモーメント計算における解析的積分削減によって実現されます。EPは、シミュレーションやベンチマークデータセット全体で他のABI手法と比較され、精度と速度のバランスが取れていることが示されています。

Zeroth-Order Alternating Gradient Descent Ascent Algorithms for A Class of Nonconvex-Nonconcave Minimax Problems
非凸-非凹型ミニマックス問題のクラスに対する0次交互勾配降下上昇アルゴリズム

In this paper, we consider a class of nonconvex-nonconcave minimax problems, i.e., NC-PL minimax problems, whose objective functions satisfy the Polyak-Lojasiewicz (PL) condition with respect to the inner variable. We propose a zeroth-order alternating gradient descent ascent (ZO-AGDA) algorithm and a zeroth-order variance reduced alternating gradient descent ascent (ZO-VRAGDA) algorithm for solving NC-PL minimax problems under the deterministic and the stochastic setting, respectively. The total number of function value queries needed by the ZO-AGDA and ZO-VRAGDA algorithms to obtain an $\varepsilon$-stationary point of an NC-PL minimax problem is upper bounded by $\mathcal{O}(\varepsilon^{-2})$ and $\mathcal{O}(\varepsilon^{-3})$, respectively. To the best of our knowledge, they are the first two zeroth-order algorithms with an iteration complexity guarantee for solving NC-PL minimax problems.

この論文では、非凸-非凹型ミニマックス問題のクラス、つまり、目的関数が内部変数に関してPolyak-Lojasiewicz(PL)条件を満たすNC-PLミニマックス問題について考えます。決定論的および確率的設定の下でNC-PLミニマックス問題を解くために、ゼロ次交互勾配降下上昇(ZO-AGDA)アルゴリズムとゼロ次分散縮小交互勾配降下上昇(ZO-VRAGDA)アルゴリズムをそれぞれ提案します。NC-PLミニマックス問題を解くためのZO-AGDAアルゴリズムとZO-VRAGDAアルゴリズムの$\varepsilon$-定常点を取得するための関数値クエリの総数は、それぞれ$\mathcal{O}(\varepsilon^{-2})$と$\mathcal{O}(\varepsilon^{-3})$によって上限になります。私たちの知る限りでは、これらはNC-PLミニマックス問題を解くための反復計算量が保証された最初の2つの0次アルゴリズムです。
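To make the idea of a zeroth-order alternating descent-ascent loop concrete, here is a minimal sketch. It is not the paper's ZO-AGDA: the central-difference coordinate estimator, the step sizes `alpha` and `beta`, the smoothing parameter `mu`, and the toy objective are all assumptions chosen for illustration. The toy objective is strongly concave in `y`, so it satisfies the PL condition in the inner variable, but it is also convex in `x`, which is simpler than the nonconvex setting the paper targets.

```python
import numpy as np

def zo_grad(fun, z, mu=1e-4):
    """Estimate the gradient of fun at z from function values only (central differences)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = mu
        g[i] = (fun(z + e) - fun(z - e)) / (2.0 * mu)
    return g

def zo_agda(f, x0, y0, alpha=0.1, beta=0.1, iters=500, mu=1e-4):
    """Alternate a descent step in x with an ascent step in y, using only queries to f."""
    x = np.array(x0, dtype=float)
    y = np.array(y0, dtype=float)
    for _ in range(iters):
        x = x - alpha * zo_grad(lambda u: f(u, y), x, mu)   # descent on the outer variable
        y = y + beta * zo_grad(lambda v: f(x, v), y, mu)    # ascent uses the updated x
    return x, y

def toy_objective(x, y):
    """Hypothetical test problem with its unique saddle point at the origin."""
    return 0.5 * x @ x + x @ y - 0.5 * y @ y

x_final, y_final = zo_agda(toy_objective, [1.0, -2.0], [0.5, 1.5])
print(x_final, y_final)
```

Each iteration costs two function queries per coordinate of `x` and of `y`, which is the quantity the query-complexity bounds in the abstract count.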

The Measure and Mismeasure of Fairness
公平性の尺度と誤認

The field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last decade, several formal, mathematical definitions of fairness have gained prominence. Here we first assemble and categorize these definitions into two broad families: (1) those that constrain the effects of decisions on disparities; and (2) those that constrain the effects of legally protected characteristics, like race and gender, on decisions. We then show, analytically and empirically, that both families of definitions typically result in strongly Pareto dominated decision policies. For example, in the case of college admissions, adhering to popular formal conceptions of fairness would simultaneously result in lower student-body diversity and a less academically prepared class, relative to what one could achieve by explicitly tailoring admissions policies to achieve desired outcomes. In this sense, requiring that these fairness definitions hold can, perversely, harm the very groups they were designed to protect. In contrast to axiomatic notions of fairness, we argue that the equitable design of algorithms requires grappling with their context-specific consequences, akin to the equitable design of policy. We conclude by listing several open challenges in fair machine learning and offering strategies to ensure algorithms are better aligned with policy goals.

公平な機械学習の分野は、アルゴリズムによって導かれる決定が公平であることを保証することを目的としています。過去10年間で、公平性の正式な数学的定義がいくつか注目を集めるようになりました。ここでは、まずこれらの定義を2つの大きなグループにまとめ、分類します。(1)決定が格差に与える影響を制限するもの、(2)人種や性別などの法的に保護された特性が決定に与える影響を制限するものです。次に、両方の定義グループが通常、強くパレート支配された決定ポリシーをもたらすことを分析的かつ実証的に示します。たとえば、大学入学の場合、公平性の一般的な正式な概念に固執すると、望ましい結果を達成するために入学ポリシーを明示的に調整することによって達成できるものと比較して、学生の多様性が低下し、学業の準備が整っていないクラスになります。この意味で、これらの公平性の定義が保持されることを要求すると、逆に、保護するように設計されたグループ自体に害を及ぼす可能性があります。公平性の公理的な概念とは対照的に、公平なアルゴリズムの設計には、公平な政策設計と同様に、コンテキスト固有の結果に取り組む必要があると私たちは主張します。結論として、公平な機械学習におけるいくつかの未解決の課題を列挙し、アルゴリズムが政策目標とよりよく一致するようにするための戦略を提示します。

Microcanonical Hamiltonian Monte Carlo
マイクロカノニカルハミルトニアンモンテカルロ

We develop Microcanonical Hamiltonian Monte Carlo (MCHMC), a class of models that follow fixed energy Hamiltonian dynamics, in contrast to Hamiltonian Monte Carlo (HMC), which follows canonical distribution with different energy levels. MCHMC tunes the Hamiltonian function such that the marginal of the uniform distribution on the constant-energy-surface over the momentum variables gives the desired target distribution. We show that MCHMC requires occasional energy-conserving billiard-like momentum bounces for ergodicity, analogous to momentum resampling in HMC. We generalize the concept of bounces to a continuous version with partial direction preserving bounces at every step, which gives energy-conserving underdamped Langevin-like dynamics with non-Gaussian noise (MCLMC). MCHMC and MCLMC exhibit favorable scalings with condition number and dimensionality. We develop an efficient hyperparameter tuning scheme that achieves high performance and consistently outperforms NUTS HMC on several standard benchmark problems, in some cases by orders of magnitude.

私たちは、異なるエネルギーレベルの正準分布に従うハミルトニアンモンテカルロ（HMC）とは対照的に、固定エネルギーのハミルトニアンダイナミクスに従うモデルのクラスであるミクロカノニカルハミルトニアンモンテカルロ（MCHMC）を開発しました。MCHMCは、運動量変数上の一定エネルギー面上の一様分布の周辺が目的の分布を与えるようにハミルトニアン関数を調整します。私たちは、HMCの運動量リサンプリングに類似して、MCHMCがエルゴード性のために時折エネルギーを保存するビリヤードのような運動量バウンスを必要とすることを示します。私たちは、バウンスの概念を、すべてのステップで部分的な方向保存バウンスを含む連続バージョンに一般化し、非ガウスノイズを伴うエネルギー保存の減衰不足のランジュバンのようなダイナミクス（MCLMC）を与えます。MCHMCとMCLMCは、条件数と次元に対して好ましいスケーリングを示します。私たちは、いくつかの標準的なベンチマーク問題において、高いパフォーマンスを達成し、場合によっては桁違いにNUTS HMCを一貫して上回る、効率的なハイパーパラメータ調整スキームを開発しました。

Prediction Equilibrium for Dynamic Network Flows
動的ネットワークフローの予測平衡

We study a dynamic traffic assignment model, where agents base their instantaneous routing decisions on real-time delay predictions. We formulate a mathematically concise model and define dynamic prediction equilibrium (DPE) in which no agent can at any point during their journey improve their predicted travel time by switching to a different route. We demonstrate the versatility of our framework by showing that it subsumes the well-known full information and instantaneous information models, in addition to admitting further realistic predictors as special cases. We then proceed to derive properties of the predictors that ensure a dynamic prediction equilibrium exists. Additionally, we define $\varepsilon$-approximate DPE wherein no agent can improve their predicted travel time by more than $\varepsilon$ and provide further conditions of the predictors under which such an approximate equilibrium can be computed. Finally, we complement our theoretical analysis by an experimental study, in which we systematically compare the induced average travel times of different predictors, including two machine-learning based models trained on data gained from previously computed approximate equilibrium flows, both on synthetic and real world road networks.

私たちは、エージェントがリアルタイムの遅延予測に基づいて瞬間的な経路決定を行う動的交通割り当てモデルを研究します。我々は数学的に簡潔なモデルを定式化し、どのエージェントも移動中のどの時点でも別の経路に切り替えることで予測移動時間を改善できない動的予測均衡(DPE)を定義します。私たちは、このフレームワークが、よく知られている完全情報モデルと瞬間情報モデルを包含し、さらに現実的な予測子を特別なケースとして認めることを示すことで、このフレームワークの汎用性を実証します。次に、動的予測均衡が存在することを保証する予測子の特性を導出します。さらに、どのエージェントも予測移動時間を$\varepsilon$以上改善できない$\varepsilon$近似DPEを定義し、このような近似均衡を計算できる予測子のさらなる条件を提供します。最後に、私たちは理論的分析を実験的研究で補完し、合成道路網と現実世界の道路網の両方で、以前に計算された近似平衡フローから得られたデータでトレーニングされた2つの機械学習ベースのモデルを含む、さまざまな予測子の誘導平均移動時間を体系的に比較します。

Dimension Reduction and MARS
次元削減とMARS

The multivariate adaptive regression spline (MARS) is one of the popular estimation methods for nonparametric multivariate regression. However, as MARS is based on marginal splines, to incorporate interactions of covariates, products of the marginal splines must be used, which often leads to an unmanageable number of basis functions when the order of interaction is high and results in low estimation efficiency. In this paper, we improve the performance of MARS by using linear combinations of the covariates which achieve sufficient dimension reduction. The special basis functions of MARS facilitate the calculation of gradients of the regression function, and estimation of these linear combinations is obtained via eigen-analysis of the outer-product of the gradients. Under some technical conditions, the consistency property is established for the proposed estimation method. Numerical studies including both simulation and empirical applications show its effectiveness in dimension reduction and improvement over MARS and other commonly-used nonparametric methods in regression estimation and prediction.

多変量適応型回帰スプライン(MARS)は、ノンパラメトリック多変量回帰の一般的な推定法の1つです。しかし、MARSは周辺スプラインに基づいているため、共変量の相互作用を組み込むには周辺スプラインの積を使用する必要があります。これにより、相互作用の次数が高い場合に基底関数の数が扱いきれないほど多くなり、推定効率が低下します。この論文では、十分次元削減を実現する共変量の線形結合を使用してMARSのパフォーマンスを改善します。MARSの特別な基底関数は、回帰関数の勾配の計算を容易にし、これらの線形結合の推定は、勾配の外積の固有値解析によって得られます。いくつかの技術的条件下で、提案された推定法の一致性が確立されます。シミュレーションと実データへの適用の両方を含む数値研究により、次元削減における有効性と、回帰推定と予測においてMARSやその他の一般的なノンパラメトリック手法を上回ることが示されます。
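The eigen-analysis step mentioned in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the gradients here are the exact gradients of a hypothetical single-index function $\sin(\beta^\top x)$, used as a stand-in for the MARS-based gradient estimates of the paper, and `opg_directions` is a name introduced for this example.

```python
import numpy as np

def opg_directions(grads, d):
    """Return the top-d eigenvectors of the averaged outer product of gradients."""
    grads = np.asarray(grads, dtype=float)
    M = grads.T @ grads / grads.shape[0]      # p x p outer-product-of-gradients matrix
    eigvals, eigvecs = np.linalg.eigh(M)      # eigh returns eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d]            # reorder so leading eigenvectors come first

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.0])  # hypothetical true direction
X = rng.normal(size=(500, 5))

# Stand-in for estimated gradients: exact gradient of sin(x @ beta) at each sample
grads = np.cos(X @ beta)[:, None] * beta[None, :]

B_hat = opg_directions(grads, d=1)
cosine = abs(B_hat[:, 0] @ beta) / np.linalg.norm(beta)
print("absolute cosine between estimate and beta:", cosine)
```

Because every gradient of a single-index function is a multiple of `beta`, the matrix has rank one and its leading eigenvector recovers the direction up to sign. With estimated gradients the recovery would only be approximate.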

NEVIS’22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research
NEVIS’22:30年間のコンピュータービジョン研究からサンプリングされた100タスクのストリーム

A shared goal of several machine learning communities like continual learning, meta-learning and transfer learning, is to design algorithms and models that efficiently and robustly adapt to unseen tasks. An even more ambitious goal is to build models that never stop adapting, and that become increasingly more efficient through time by suitably transferring the accrued knowledge. Beyond the study of the actual learning algorithm and model architecture, there are several hurdles towards our quest to build such models, such as the choice of learning protocol, metric of success and data needed to validate research hypotheses. In this work, we introduce the Never-Ending VIsual-classification Stream (NEVIS’22), a benchmark consisting of a stream of over 100 visual classification tasks, sorted chronologically and extracted from papers sampled uniformly from computer vision proceedings spanning the last three decades. The resulting stream reflects what the research community thought was meaningful at any point in time, and it serves as an ideal test bed to assess how well models can adapt to new tasks, and do so better and more efficiently as time goes by. Despite being limited to classification, the resulting stream has a rich diversity of tasks from OCR, to texture analysis, scene recognition, and so forth. The diversity is also reflected in the wide range of dataset sizes, spanning over four orders of magnitude. Overall, NEVIS’22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks, yet with a low entry barrier as it is limited to a single modality and well understood supervised learning problems. Moreover, we provide a reference implementation including strong baselines and an evaluation protocol to compare methods in terms of their trade-off between accuracy and compute. 
We hope that NEVIS’22 can be useful to researchers working on continual learning, meta-learning, AutoML and more generally sequential learning, and help these communities join forces towards more robust models that efficiently adapt to a never ending stream of data.

継続的学習、メタ学習、転移学習など、いくつかの機械学習コミュニティの共通の目標は、未知のタスクに効率的かつ堅牢に適応するアルゴリズムとモデルを設計することです。さらに野心的な目標は、蓄積された知識を適切に転移することで、適応を止めず、時間の経過とともにますます効率的になるモデルを構築することです。実際の学習アルゴリズムとモデルアーキテクチャの研究を超えて、学習プロトコルの選択、成功の指標、研究仮説を検証するために必要なデータなど、そのようなモデルを構築するための私たちの探求に向けて、いくつかのハードルがあります。この研究では、Never-Ending VIsual-classification Stream (NEVIS’22)を紹介します。これは、過去30年間にわたるコンピュータービジョンの会議録から均一にサンプリングされた論文から抽出され、時系列に並べられた100を超える視覚分類タスクのストリームで構成されるベンチマークです。結果として得られるストリームは、研究コミュニティがどの時点でも意味があると考えていたものを反映しており、モデルが新しいタスクにどれだけうまく適応できるか、そして時間の経過とともにより良く、より効率的に適応できるかを評価するための理想的なテストベッドとして機能します。分類に限定されているにもかかわらず、結果として得られるストリームには、OCR、テクスチャ分析、シーン認識など、さまざまなタスクが含まれています。この多様性は、4桁を超えるデータセットサイズの幅広い範囲にも反映されています。全体として、NEVIS’22は、タスクの規模と多様性により、現在のシーケンシャル学習アプローチに前例のない課題をもたらしますが、単一のモダリティと十分に理解されている教師あり学習の問題に限定されているため、参入障壁は低くなっています。さらに、強力なベースラインと、精度と計算量のトレードオフの観点から手法を比較するための評価プロトコルを含むリファレンス実装を提供します。NEVIS’22が、継続的学習、メタ学習、AutoML、より一般的にはシーケンシャル学習に取り組む研究者にとって役立ち、これらのコミュニティが力を合わせて、終わりのないデータストリームに効率的に適応するより堅牢なモデルを開発するのに役立つことを願っています。

Fast Screening Rules for Optimal Design via Quadratic Lasso Reformulation
二次Lasso再定式化による最適設計のための高速スクリーニングルール

The problems of Lasso regression and optimal design of experiments share a critical property: their optimal solutions are typically sparse, i.e., only a small fraction of the optimal variables are non-zero. Therefore, the identification of the support of an optimal solution reduces the dimensionality of the problem and can yield a substantial simplification of the calculations. It has recently been shown that linear regression with a squared $\ell_1$-norm sparsity-inducing penalty is equivalent to an optimal experimental design problem. In this work, we use this equivalence to derive safe screening rules that can be used to discard inessential samples. Compared to previously existing rules, the new tests are much faster to compute, especially for problems involving a parameter space of high dimension, and can be used dynamically within any iterative solver, with negligible computational overhead. Moreover, we show how an existing homotopy algorithm to compute the regularization path of the lasso method can be reparametrized with respect to the squared $\ell_1$-penalty. This allows the computation of a Bayes $c$-optimal design in a finite number of steps and can be several orders of magnitude faster than standard first-order algorithms. The efficiency of the new screening rules and of the homotopy algorithm is demonstrated on different examples based on real data.

Lasso回帰と最適実験計画の問題には、重要な特性が共通しています。最適解は一般にスパースであり、つまり、最適変数のごく一部だけが非ゼロです。したがって、最適解のサポートを特定することで、問題の次元が削減され、計算が大幅に簡素化されます。最近、2乗$\ell_1$ノルムのスパース性誘導ペナルティを伴う線形回帰は、最適実験計画問題と同等であることが示されました。この研究では、この同等性を利用して、不要なサンプルを破棄するために使用できる安全なスクリーニングルールを導出します。既存のルールと比較して、新しいテストは、特に高次元のパラメーター空間を含む問題の場合、計算がはるかに高速であり、計算オーバーヘッドを無視できる任意の反復ソルバー内で動的に使用できます。さらに、Lasso法の正規化パスを計算する既存のホモトピーアルゴリズムを、$\ell_1$ペナルティの2乗に関して再パラメータ化する方法を示します。これにより、ベイズ$c$最適設計を有限数のステップで計算できるようになり、標準的な1次アルゴリズムよりも数桁高速化できます。新しいスクリーニングルールとホモトピーアルゴリズムの効率は、実際のデータに基づくさまざまな例で実証されています。

Multi-Consensus Decentralized Accelerated Gradient Descent
マルチコンセンサス分散型加速勾配降下法

This paper considers the decentralized convex optimization problem, which has a wide range of applications in large-scale machine learning, sensor networks, and control theory. We propose novel algorithms that achieve optimal computation complexity and near optimal communication complexity. Our theoretical results give affirmative answers to the open problem on whether there exists an algorithm that can achieve a communication complexity (nearly) matching the lower bound depending on the global condition number instead of the local one. Furthermore, the linear convergence of our algorithms only depends on the strong convexity of the global objective and it does not require the local functions to be convex. The design of our methods relies on a novel integration of well-known techniques including Nesterov’s acceleration, multi-consensus and gradient-tracking. Empirical studies show that our methods outperform existing ones in machine learning applications.

この論文では、大規模な機械学習、センサーネットワーク、および制御理論に幅広く応用できる分散型凸最適化問題について考察します。私たちは、最適な計算量とほぼ最適な通信計算量を実現する新しいアルゴリズムを提案します。私たちの理論的な結果は、ローカル条件数ではなくグローバル条件数に依存する下限に(ほぼ)一致する通信計算量を達成できるアルゴリズムが存在するかどうかという未解決の問題に対して肯定的な答えを提供します。さらに、アルゴリズムの線形収束は、大域目的関数の強凸性にのみ依存し、局所関数が凸である必要はありません。私たちの手法の設計は、ネステロフの加速、マルチコンセンサス、勾配追跡などのよく知られた手法の斬新な統合に依存しています。実証研究は、機械学習アプリケーションにおいて私たちの手法が優れていることを示しています。

Continuous-in-time Limit for Bayesian Bandits
ベイジアンバンディットの連続時間極限

This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges toward a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.

この論文では、ベイジアン設定におけるバンディット問題を再検討します。ベイジアンアプローチでは、バンディット問題を最適化問題として定式化し、その目標はベイジアンリグレットを最小化する最適なポリシーを見つけることです。ベイジアンアプローチが直面する主な課題の1つは、最適なポリシーの計算がしばしば扱いにくいことです。特に、問題のホライズンの長さやアームの数が多い場合はそうです。この論文では、まず、適切な再スケーリングを行うと、ベイジアンバンディット問題が連続的なハミルトン・ヤコビ・ベルマン(HJB)方程式に収束することを示します。極限のHJB方程式の最適なポリシーは、いくつかの一般的なバンディット問題に対して明示的に取得できます。また、明示的な解が得られない場合にHJB方程式を解くための数値的方法を示します。これらの結果に基づいて、大きなホライズンを持つベイジアンバンディット問題を解決するための近似的なベイズ最適ポリシーを提案します。この方法には、ホライズンが長くなっても計算コストが増加しないという追加の利点があります。

Two Sample Testing in High Dimension via Maximum Mean Discrepancy
最大平均不一致による高次元での2サンプルテスト

Maximum Mean Discrepancy (MMD) has been widely used in the areas of machine learning and statistics to quantify the distance between two distributions in the $p$-dimensional Euclidean space. The asymptotic property of the sample MMD has been well studied when the dimension $p$ is fixed using the theory of U-statistics. Motivated by the frequent use of the MMD test for data of moderate/high dimension, we propose to investigate the behavior of the sample MMD in a high-dimensional environment and develop a new studentized test statistic. Specifically, we obtain the central limit theorems for the studentized sample MMD as both the dimension $p$ and sample sizes $n,m$ diverge to infinity. Our results hold for a wide range of kernels, including popular Gaussian and Laplacian kernels, and also cover energy distance as a special case. We also derive the explicit rate of convergence under mild assumptions and our results suggest that the accuracy of normal approximation can improve with dimensionality. Additionally, we provide a general theory on the power analysis under the alternative hypothesis and show that our proposed test can detect differences between two distributions in the moderately high dimensional regime. Numerical simulations demonstrate the effectiveness of our proposed test statistic and normal approximation.

最大平均乖離度(MMD)は、機械学習や統計の分野で、$p$次元ユークリッド空間の2つの分布間の距離を定量化するために広く使用されています。次元$p$が固定されている場合のサンプルMMDの漸近特性は、U統計量の理論を用いて十分に研究されてきました。中/高次元のデータに対するMMD検定の頻繁な使用に動機付けられて、高次元環境でのサンプルMMDの挙動を調査し、新しいスチューデント化された検定統計量を開発することを提案します。具体的には、次元$p$とサンプルサイズ$n,m$の両方が無限大に発散するときの、スチューデント化されたサンプルMMDの中心極限定理を得ます。結果は、一般的なガウスカーネルやラプラシアンカーネルを含む広範囲のカーネルに当てはまり、特殊なケースとしてエネルギー距離もカバーします。また、緩やかな仮定の下で明示的な収束率を導出し、結果は正規近似の精度が次元とともに向上しうることを示唆しています。さらに、対立仮説の下での検出力分析に関する一般理論を提供し、提案する検定が中程度に高次元の領域における2つの分布間の差を検出できることを示します。数値シミュレーションは、提案する検定統計量と正規近似の有効性を実証します。
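For readers unfamiliar with the sample MMD, the following sketch computes the standard unbiased U-statistic estimate of the squared MMD with a Gaussian kernel. It does not implement the studentized statistic proposed in the paper. The bandwidth choice $\sqrt{p}$, the sample sizes, and the mean shift are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth):
    """Unbiased estimate of squared MMD; within-sample sums exclude the diagonal."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
p = 5
X = rng.normal(size=(200, p))          # sample from N(0, I)
Y = rng.normal(size=(200, p))          # second sample from the same distribution
Z = rng.normal(size=(200, p)) + 1.0    # sample with a shifted mean

bw = np.sqrt(p)                        # assumed bandwidth, not a recommendation
mmd_same = mmd2_unbiased(X, Y, bw)
mmd_shift = mmd2_unbiased(X, Z, bw)
print("same distribution:", mmd_same, "shifted:", mmd_shift)
```

The estimate for the two samples from the same distribution fluctuates around zero and can be slightly negative, while the shifted pair gives a clearly positive value. A test would compare a suitably normalized version of this quantity with a reference distribution.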

Random Feature Amplification: Feature Learning and Generalization in Neural Networks
ランダム特徴増幅:ニューラルネットワークにおける特徴学習と一般化

In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics ‘amplify’ these weak, random features to strong, useful features.

この研究では、ランダム初期化後のロジスティック損失に対する勾配降下法によって訓練された2層ReLUネットワークにおける特徴学習プロセスの特性評価を提供します。入力フィーチャのXORのような関数によって生成されたバイナリラベルを持つデータを考慮します。私たちは、トレーニングラベルの一定の割合が敵対者によって破損されることを許しています。線形分類器は、考慮する分布のランダム推測よりも優れていませんが、勾配降下法によって訓練された2層ReLUネットワークは、ラベルノイズ率に近い一般化誤差を達成することを示します。私たちは、初期化時に、ニューロンの大部分が有用な特徴と弱く相関するランダムな特徴として機能し、勾配降下ダイナミクスがこれらの弱いランダムな特徴を強力で有用な特徴に「増幅」することを示す新しい証明技術を開発しています。
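The claim about linear classifiers can be illustrated with a small synthetic dataset. This sketch makes several assumptions that are not taken from the paper: the cluster centres `mu1` and `mu2`, the noise level, and the restriction to bias-free classifiers of the form $\mathrm{sign}(w^\top x)$. The data are built so that every point $x$ appears together with its mirror image $-x$ carrying the same label, which is the symmetry that makes an XOR-like problem hard for such classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1 = np.array([1.0, 0.0])   # hypothetical centre for label +1 (and its mirror -mu1)
mu2 = np.array([0.0, 1.0])   # hypothetical centre for label -1 (and its mirror -mu2)

noise = 0.1 * rng.normal(size=(250, 2))
half_x = np.vstack([mu1 + noise[:125], mu2 + noise[125:]])
half_y = np.concatenate([np.ones(125), -np.ones(125)])

# Add the mirror image -x of every point with the same label (XOR-like symmetry)
X = np.vstack([half_x, -half_x])
y = np.concatenate([half_y, half_y])

def linear_accuracy(w):
    """Accuracy of the bias-free linear classifier sign(w @ x) on the dataset."""
    return np.mean(np.sign(X @ w) == y)

# Try a few random weight vectors
accs = [linear_accuracy(rng.normal(size=2)) for _ in range(5)]
print(accs)
```

Since $x$ and $-x$ share a label but always receive opposite predictions, exactly one point of each mirrored pair is classified correctly, whatever `w` is chosen.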

Pivotal Estimation of Linear Discriminant Analysis in High Dimensions
高次元における線形判別分析のピボタル推定

We consider the linear discriminant analysis problem in high-dimensional settings. In this work, we propose PANDA (PivotAl liNear Discriminant Analysis), a tuning-insensitive method in the sense that it requires very little effort to tune the parameters. Moreover, we prove that PANDA achieves the optimal convergence rate in terms of both the estimation error and misclassification rate. Our theoretical results are backed up by thorough numerical studies using both simulated and real datasets. In comparison with the existing methods, we observe that our proposed PANDA yields equal or better performance, and requires substantially less effort in parameter tuning.

私たちは、高次元の設定で線形判別分析問題を検討します。この研究では、パラメータの調整にほとんど労力を必要としないという意味で、調整感度の低い手法であるPANDA(PivotAl liNear Discriminant Analysis)を提案します。さらに、PANDAが推定誤差と誤分類率の両方に関して最適な収束率を達成していることを証明します。私たちの理論的な結果は、シミュレーションデータセットと実際のデータセットの両方を使用した徹底的な数値研究によって裏付けられています。既存の方法と比較すると、提案されているPANDAは同等以上のパフォーマンスをもたらし、パラメータの調整に必要な労力が大幅に少ないことがわかります。

Learning Optimal Feedback Operators and their Sparse Polynomial Approximations
最適フィードバック演算子とそのスパース多項式近似の学習

A learning-based method for obtaining feedback laws for nonlinear optimal control problems is proposed. The learning problem is posed such that the open loop value function is its optimal solution. This infinite-dimensional function space problem is approximated by a polynomial ansatz and its convergence is analyzed. An $\ell_1$ penalty term is employed, which, combined with the proximal point method, allows sparse solutions of the learning problem to be found. The approach requires multiple evaluations of the elements of the polynomial basis and of their derivatives. In order to do this efficiently, a graph-theoretic algorithm is devised. Several examples underline that the proposed methodology provides a promising approach for mitigating the curse of dimensionality which would be involved if the optimal feedback law were obtained by solving the Hamilton-Jacobi-Bellman equation.

非線形最適制御問題のフィードバック則を求めるための学習ベースの方法を提案します。学習問題は、開ループ値関数がその最適解であるように提起されます。この無限次元の関数空間の問題は、多項式のアンサッツによって近似され、その収束が分析されます。$\ell_1$ペナルティ項が使用され、近位点法と組み合わせることで、学習問題のスパース解を見つけることができます。このアプローチでは、多項式基底の要素とその導関数を複数回評価する必要があります。これを効率的に行うために、グラフ理論アルゴリズムが考案されています。いくつかの例は、提案された方法論が、ハミルトン・ヤコビ・ベルマン方程式を解くことによって最適なフィードバック則を求める場合に生じる次元の呪いを軽減するための有望なアプローチを提供することを強調しています。

Sensitivity-Free Gradient Descent Algorithms
感度フリー勾配降下アルゴリズム

We introduce two block coordinate descent algorithms for solving optimization problems with ordinary differential equations (ODEs) as dynamical constraints. In contrast to prior algorithms, ours do not need to implement sensitivity analysis methods to evaluate loss function gradients. They result from the reformulation of the original problem as an equivalent optimization problem with equality constraints. In our first algorithm we avoid explicitly solving the ODE by integrating the ODE solver as a sequence of implicit constraints. In our second algorithm, we add an ODE solver to reset the estimate of the ODE solution, but no sensitivity analysis method is needed. We test the proposed algorithms on the problem of learning the parameters of the Cucker-Smale model. The algorithms are compared with gradient descent algorithms based on ODE solvers endowed with sensitivity analysis capabilities. We show that the proposed algorithms are at least 4x faster when implemented in PyTorch, and at least 16x faster when implemented in JAX. For large versions of the Cucker-Smale model, the JAX implementation is thousands of times faster. Our algorithms generate more accurate results both on training and test data. In addition, we show how the proposed algorithms scale with the number of optimization variables, and how they can be applied to learning black-box models of dynamical systems. Moreover, we demonstrate how our approach can be combined with approaches based on sensitivity-analysis-enabled ODE solvers to reduce the training time.

私たちは、動的制約として常微分方程式(ODE)を持つ最適化問題を解くための2つのブロック座標降下アルゴリズムを紹介します。従来のアルゴリズムとは対照的に、私たちのアルゴリズムは損失関数の勾配を評価するために感度分析法を実装する必要がありません。それらは、元の問題を等式制約を持つ同等の最適化問題として再定式化することによって生じます。最初のアルゴリズムでは、ODEソルバーを一連の暗黙的制約として組み込むことにより、ODEを明示的に解くことを回避します。2番目のアルゴリズムでは、ODEソルバーを追加してODEの解の推定値をリセットしますが、感度分析法は必要ありません。提案されたアルゴリズムをCucker-Smaleモデルのパラメーターを学習する問題でテストします。アルゴリズムは、感度分析機能を備えたODEソルバーに基づく勾配降下アルゴリズムと比較されます。提案されたアルゴリズムは、PyTorchで実装すると少なくとも4倍高速になり、JAXで実装すると少なくとも16倍高速になることを示します。Cucker-Smaleモデルの大規模バージョンでは、JAX実装は数千倍高速です。私たちのアルゴリズムは、トレーニングデータとテストデータの両方でより正確な結果を生成します。さらに、提案されたアルゴリズムが最適化変数の数に応じてどのようにスケールするか、および動的システムのブラックボックスモデルの学習にどのように適用できるかを示します。さらに、私たちのアプローチを、感度分析に対応したODEソルバーに基づくアプローチと組み合わせて、トレーニング時間を短縮する方法を示します。

A PDE approach for regret bounds under partial monitoring
パーシャルモニタリング下の後悔限界に対するPDEアプローチ

In this paper, we study a learning problem in which a forecaster only observes partial information. By properly rescaling the problem, we heuristically derive a limiting PDE on Wasserstein space which characterizes the asymptotic behavior of the regret of the forecaster. Using a verification type argument, we show that the problem of obtaining regret bounds and efficient algorithms can be tackled by finding appropriate smooth sub/supersolutions of this parabolic PDE.

この論文では、予測者が部分的な情報しか観察しない学習問題について研究します。問題を適切に再スケーリングすることにより、予測者のリグレットの漸近的な振る舞いを特徴付けるワッサーシュタイン空間上の極限PDEをヒューリスティックに導出します。検証型の議論を用いて、リグレット上界と効率的なアルゴリズムを得る問題は、この放物型PDEの適切な滑らかな劣解/優解を見つけることによって対処できることを示します。

A General Learning Framework for Open Ad Hoc Teamwork Using Graph-based Policy Learning
グラフベースのポリシー学習を用いたオープンアドホックチームワークのための一般的な学習フレームワーク

Open ad hoc teamwork is the problem of training a single agent to efficiently collaborate with an unknown group of teammates whose composition may change over time. A variable team composition creates challenges for the agent, such as the requirement to adapt to new team dynamics and dealing with changing state vector sizes. These challenges are aggravated in real-world applications in which the controlled agent only has a partial view of the environment. In this work, we develop a class of solutions for open ad hoc teamwork under full and partial observability. We start by developing a solution for the fully observable case that leverages graph neural network architectures to obtain an optimal policy based on reinforcement learning. We then extend this solution to partially observable scenarios by proposing different methodologies that maintain belief estimates over the latent environment states and team composition. These belief estimates are combined with our solution for the fully observable case to compute an agent’s optimal policy under partial observability in open ad hoc teamwork. Empirical results demonstrate that our solution can learn efficient policies in open ad hoc teamwork in fully and partially observable cases. Further analysis demonstrates that our methods’ success is a result of effectively learning the effects of teammates’ actions while also inferring the inherent state of the environment under partial observability.

オープンアドホックチームワークは、時間の経過とともに構成が変化する可能性のある未知のチームメイトのグループと効率的に協力できるように、単一のエージェントをトレーニングする問題です。チーム構成が変化すると、新しいチームダイナミクスに適応したり、状態ベクトルのサイズの変化に対処したりする必要性など、エージェントにとっての課題が生じます。これらの課題は、制御されたエージェントが環境の部分的なビューしか持たない現実世界のアプリケーションでは悪化します。この研究では、完全および部分的な可観測性の下でのオープンアドホックチームワークのソリューションのクラスを開発します。まず、グラフニューラルネットワークアーキテクチャを活用して強化学習に基づく最適なポリシーを取得する、完全に観測可能なケースのソリューションを開発します。次に、潜在的な環境状態とチーム構成の確信推定を維持するさまざまな方法論を提案することにより、このソリューションを部分的に観測可能なシナリオに拡張します。これらの確信推定は、完全に観測可能なケースのソリューションと組み合わされ、オープンアドホックチームワークにおける部分的な可観測性の下でのエージェントの最適ポリシーを計算します。実験結果から、私たちのソリューションは、完全に観測可能な場合と部分的に観測可能な場合のオープンアドホックチームワークにおいて、効率的なポリシーを学習できることが実証されています。さらに分析を進めると、私たちの手法の成功は、部分的な可観測性の下で環境の固有の状態を推測しながら、チームメイトの行動の影響を効果的に学習した結果であることが実証されています。

Causal Bandits for Linear Structural Equation Models
線形構造方程式モデルの因果バンディット

This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model to minimize cumulative regret with respect to the best intervention in hindsight. This is, naturally, posed as a causal bandit problem. The focus is on causal bandits for linear structural equation models (SEMs) and soft interventions. It is assumed that the graph’s structure is known and has $N$ nodes. Two linear mechanisms, one soft intervention and one observational, are assumed for each node, giving rise to $2^N$ possible interventions. The majority of the existing causal bandit algorithms assume that at least the interventional distributions of the reward node’s parents are fully specified. However, there are $2^N$ such distributions (one corresponding to each intervention), and acquiring them becomes prohibitive even in moderate-sized graphs. This paper dispenses with the assumption of knowing these distributions or their marginals. Two algorithms are proposed for the frequentist (UCB-based) and Bayesian (Thompson sampling-based) settings. The key idea of these algorithms is to avoid directly estimating the $2^N$ reward distributions and instead estimate the parameters that fully specify the SEMs (linear in $N$) and use them to compute the rewards. In both algorithms, under boundedness assumptions on noise and the parameter space, the cumulative regrets scale as $\tilde{\cal O} (d^{L+\frac{1}{2}} \sqrt{NT})$, where $d$ is the graph’s maximum degree, and $L$ is the length of its longest causal path. Additionally, a minimax lower bound of $\Omega(d^{\frac{L}{2}-2}\sqrt{T})$ is presented, which suggests that the achievable and lower bounds conform in their scaling behavior with respect to the horizon $T$ and graph parameters $d$ and $L$.

この論文では、因果グラフィカルモデルで介入の最適なシーケンスを設計し、後から見て最善の介入に関する累積後悔を最小化する問題を研究します。これは、当然、因果バンディット問題として提起されます。焦点は、線形構造方程式モデル(SEM)とソフト介入の因果バンディットにあります。グラフの構造は既知であり、$N$個のノードがあると仮定します。各ノードには、ソフト介入時のものと観測時のものの2つの線形メカニズムが想定され、$2^N$個の可能な介入が生じます。既存の因果バンディットアルゴリズムの大部分は、少なくとも報酬ノードの親の介入分布が完全に指定されていると仮定しています。ただし、そのような分布は$2^N$個(各介入に1つずつ対応)あり、中規模のグラフでも取得するのは困難です。この論文では、これらの分布またはその周辺分布を知っているという仮定を置きません。頻度主義(UCBベース)とベイズ主義(トンプソンサンプリングベース)の設定用に2つのアルゴリズムが提案されています。これらのアルゴリズムの重要なアイデアは、$2^N$個の報酬分布を直接推定することを避け、代わりにSEMを完全に指定するパラメータ($N$に対して線形の個数)を推定し、それらを使用して報酬を計算することです。両方のアルゴリズムでは、ノイズとパラメータ空間の有界性の仮定の下で、累積後悔は$\tilde{\cal O} (d^{L+\frac{1}{2}} \sqrt{NT})$としてスケーリングされます。ここで、$d$はグラフの最大次数、$L$は最長因果パスの長さです。さらに、$\Omega(d^{\frac{L}{2}-2}\sqrt{T})$のミニマックス下界が提示されており、これは達成可能な上界と下界が、ホライズン$T$とグラフパラメータ$d$および$L$に関するスケーリング挙動において一致することを示唆しています。
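The key idea, that a number of parameters linear in $N$ determines the rewards of all $2^N$ interventions, can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: the weights are taken as known rather than estimated, nodes are in topological order with the last node as the reward, a soft intervention on node $i$ swaps column $i$ of the weight matrix, and the three-node chain with its numbers is hypothetical.

```python
import itertools
import numpy as np

def expected_reward(a, W_obs, W_int, noise_mean):
    """Expected value of the last node under the binary intervention vector a."""
    N = len(a)
    # Column i holds the weights on the parents of node i; choose it per intervention bit
    W = np.where(np.asarray(a, dtype=bool)[None, :], W_int, W_obs)
    # Linear SEM: X = W^T X + noise, hence E[X] = (I - W^T)^{-1} E[noise]
    mu = np.linalg.solve(np.eye(N) - W.T, noise_mean)
    return mu[-1]

# Hypothetical chain 0 -> 1 -> 2 with observational and interventional edge weights
N = 3
W_obs = np.zeros((N, N))
W_int = np.zeros((N, N))
W_obs[0, 1], W_obs[1, 2] = 0.5, 0.5
W_int[0, 1], W_int[1, 2] = 1.0, 1.0
noise_mean = np.array([1.0, 0.0, 0.0])

# Enumerate all 2^N interventions using only the 2N weight columns
rewards = {
    a: expected_reward(a, W_obs, W_int, noise_mean)
    for a in itertools.product([0, 1], repeat=N)
}
best = max(rewards, key=rewards.get)
print(best, rewards[best])
```

In a bandit setting the entries of `W_obs` and `W_int` would be replaced by running estimates, and exhaustive enumeration is only feasible for small $N$.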

High-Dimensional Inference for Generalized Linear Models with Hidden Confounding
隠れ交絡を持つ一般化線形モデルのための高次元推論

Statistical inferences for high-dimensional regression models have been extensively studied for their wide applications ranging from genomics, neuroscience, to economics. However, in practice, there are often potential unmeasured confounders associated with both the response and covariates, which can lead to invalidity of standard debiasing methods. This paper focuses on a generalized linear regression framework with hidden confounding and proposes a debiasing approach to address this high-dimensional problem, by adjusting for the effects induced by the unmeasured confounders. We establish consistency and asymptotic normality for the proposed debiased estimator. The finite sample performance of the proposed method is demonstrated through extensive numerical studies and an application to a genetic data set.

高次元回帰モデルの統計的推論は、ゲノミクス、神経科学、経済学など、幅広い応用で広く研究されてきました。しかし、実際には、応答と共変量の両方に関連する潜在的な未測定の交絡因子が存在することが多く、標準的なバイアス除去法の無効性につながる可能性があります。この論文では、隠れた交絡因子を持つ一般化線形回帰フレームワークに焦点を当て、測定されていない交絡因子によって誘発される影響を調整することにより、この高次元の問題に対処するための偏り除去アプローチを提案します。提案された偏り除去推定量の一貫性と漸近正規性を確立します。提案された方法の有限サンプル性能は、広範な数値研究と遺伝的データセットへの適用を通じて実証されています。

Weibull Racing Survival Analysis with Competing Events, Left Truncation, and Time-Varying Covariates
競合イベント、左切り捨て、および時変共変量によるワイブルレーシング生存時間分析

We propose Bayesian nonparametric Weibull delegate racing (WDR) to fill the gap in interpretable nonlinear survival analysis with competing events, left truncation, and time-varying covariates. We set a two-phase race among a potentially infinite number of sub-events to model nonlinear covariate effects, which does not rely on transformations or complex functions of the covariates. Using gamma processes, the nonlinear capacity of WDR is parsimonious and data-adaptive. In prediction accuracy, WDR dominates cause-specific Cox and Fine-Gray models and is comparable to random survival forests in the presence of time-invariant covariates. More importantly, WDR can cope with different types of censoring, missing outcomes, left truncation, and time-varying covariates, on which other nonlinear models, such as the random survival forests, Gaussian processes, and deep learning approaches, are largely silent. We develop an efficient MCMC algorithm based on Gibbs sampling. We analyze biomedical data, interpret disease progression affected by covariates, and show the potential of WDR in discovering and diagnosing new diseases.

私たちは、競合イベント、左側切り捨て、および時間変動共変量による解釈可能な非線形生存分析のギャップを埋めるために、ベイジアンノンパラメトリックワイブルデリゲートレーシング(WDR)を提案します。潜在的に無限の数のサブイベント間で2段階のレースを設定し、共変量の変換や複雑な関数に依存しない非線形共変量効果をモデル化します。ガンマプロセスを使用すると、WDRの非線形容量は簡素でデータ適応的です。予測精度では、WDRは原因固有のCoxモデルとFine-Grayモデルを上回り、時間不変共変量が存在する場合のランダム生存フォレストに匹敵します。さらに重要なことは、WDRは、ランダム生存フォレスト、ガウス過程、およびディープラーニングアプローチなどの他の非線形モデルではほとんど無視される、さまざまな種類の打ち切り、欠落結果、左側切り捨て、および時間変動共変量に対処できることです。私たちは、ギブスサンプリングに基づく効率的なMCMCアルゴリズムを開発します。私たちは生物医学データを分析し、共変量によって影響を受ける病気の進行を解釈し、新しい病気の発見と診断におけるWDRの可能性を示します。

Erratum: Risk Bounds for the Majority Vote: From a PAC-Bayesian Analysis to a Learning Algorithm
正誤表:多数決のリスク限界:PAC-ベイズ分析から学習アルゴリズムへ

This work shows that the demonstration of Proposition 15 of Germain et al. (2015) is flawed and the proposition is false in a general setting. This proposition gave an inequality that upper-bounds the variance of the margin of a weighted majority vote classifier. Even though this flaw has little impact on the validity of the other results presented in Germain et al. (2015), correcting it leads to a deeper understanding of the $\mathcal{C}$-bound, which is a key inequality that upper-bounds the risk of a majority vote classifier by the moments of its margin, and to a new result, namely a lower-bound on the $\mathcal{C}$-bound. Notably, Germain et al.’s statement that “the $\mathcal{C}$-bound can be arbitrarily small” is invalid in presence of irreducible error in learning problems with label noise. In this erratum, we pinpoint the mistake present in the demonstration of the said proposition, we give a corrected version of the proposition, and we propose a new theoretical lower bound on the $\mathcal{C}$-bound.

この研究では、Germainら(2015)の命題15の証明に欠陥があり、この命題が一般的な設定では誤りであることを示しています。この命題は、重み付き多数決分類器のマージンの分散の上限となる不等式を与えました。この欠陥はGermainら(2015)で提示された他の結果の妥当性にはほとんど影響しませんが、これを修正することで、多数決分類器のリスクをそのマージンのモーメントで上限とする重要な不等式である$\mathcal{C}$境界のより深い理解と、新しい結果、つまり$\mathcal{C}$境界の下限につながります。特に、Germainらの「$\mathcal{C}$境界は任意に小さくなる可能性がある」という記述は、ラベルノイズのある学習問題で不可避なエラーが存在する場合は無効です。この訂正では、上記の命題の証明に存在する誤りを正確に指摘し、命題の修正版を示し、$\mathcal{C}$境界の新しい理論的な下限を提案します。
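
As a small numerical illustration of the inequality discussed above, the sketch below computes the $\mathcal{C}$-bound in its usual form, $1 - \mu_1^2/\mu_2$, where $\mu_1$ and $\mu_2$ are the first and second moments of the margin (valid when $\mu_1 > 0$). The voters, weights, and labels are hypothetical toy data, and the bound is evaluated on the empirical distribution only.

```python
def margins(voters, weights, labels):
    """Margin of each example: label times the weighted vote in [-1, 1]."""
    return [y * sum(w * h[i] for w, h in zip(weights, voters))
            for i, y in enumerate(labels)]


def c_bound(m):
    """C-bound 1 - (first moment)^2 / (second moment) of the margin."""
    mu1 = sum(m) / len(m)
    mu2 = sum(v * v for v in m) / len(m)
    if mu1 <= 0:
        raise ValueError("the C-bound requires a positive mean margin")
    return 1.0 - mu1 * mu1 / mu2


def majority_vote_risk(m):
    """Fraction of examples with a non-positive margin."""
    return sum(v <= 0 for v in m) / len(m)


# Hypothetical toy data: three voters with outputs in {-1, +1}, four examples.
labels = [1, -1, 1, 1]
voters = [[1, -1, 1, -1], [1, -1, -1, -1], [1, 1, 1, 1]]
weights = [1 / 3, 1 / 3, 1 / 3]

m = margins(voters, weights, labels)
risk = majority_vote_risk(m)
bound = c_bound(m)
```

Here one of four examples is misclassified, so the bound cannot be driven arbitrarily close to zero on this sample, in the spirit of the lower bound described in the erratum.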

Augmented Transfer Regression Learning with Semi-non-parametric Nuisance Models
セミノンパラメトリック撹乱モデルによる拡張転送回帰学習

We develop an augmented transfer regression learning (ATReL) approach that introduces an imputation model to augment the importance weighting equation to achieve double robustness for covariate shift correction. More significantly, we propose a novel semi-non-parametric (SNP) construction framework for the two nuisance models. Compared with existing doubly robust approaches relying on fully parametric or fully non-parametric (machine learning) nuisance models, our proposal is more flexible and balanced to address model misspecification and the curse of dimensionality, achieving a better trade-off in terms of model complexity. The SNP construction presents a new technical challenge in controlling the first-order bias caused by the nuisance estimators. To overcome this, we propose a two-step calibrated estimating approach to construct the nuisance models that ensures the effective reduction of potential bias. Under this SNP framework, our ATReL estimator is root-n-consistent when (i) at least one nuisance model is correctly specified and (ii) the nonparametric components are rate-doubly robust. Simulation studies demonstrate that our method is more robust and efficient than existing methods under various configurations. We also examine the utility of our method through a real transfer learning example of the phenotyping algorithm for rheumatoid arthritis across different time windows. Finally, we propose ways to enhance the intrinsic efficiency of our estimator and to incorporate modern machine-learning methods in the proposed SNP framework.

私たちは、補完モデルを導入して重要度重み付け方程式を拡張し、共変量シフト補正の二重頑健性を実現する拡張転送回帰学習（ATReL）アプローチを開発しました。さらに重要なことに、2つの撹乱モデル用の新しいセミノンパラメトリック（SNP）構築フレームワークを提案します。完全にパラメトリックまたは完全にノンパラメトリック（機械学習）の撹乱モデルに依存する既存の二重頑健なアプローチと比較して、私たちの提案はモデルの誤指定と次元の呪いに対処するためにより柔軟でバランスが取れており、モデルの複雑さに関してより良いトレードオフを実現します。SNP構築は、撹乱推定量によって引き起こされる一次バイアスを制御するという新しい技術的課題を提示します。これを克服するために、潜在的なバイアスを効果的に削減することを保証する撹乱モデルを構築するための2段階の較正推定アプローチを提案します。このSNPフレームワークでは、(i)少なくとも1つの撹乱モデルが正しく指定され、(ii)ノンパラメトリックコンポーネントがレート二重頑健である場合、ATReL推定量はルートn一貫性を持ちます。シミュレーション研究では、さまざまな構成で、この方法が既存の方法よりもロバストかつ効率的であることが実証されています。また、異なる時間枠にわたる関節リウマチの表現型アルゴリズムの実際の転移学習例を通じて、この方法の有用性も検証します。最後に、推定量の本質的な効率性を高め、提案されたSNPフレームワークに最新の機械学習手法を組み込む方法を提案します。

From Understanding Genetic Drift to a Smart-Restart Mechanism for Estimation-of-Distribution Algorithms
遺伝的浮動の理解から分布推定アルゴリズムのためのスマートリスタートメカニズムまで

Estimation-of-distribution algorithms (EDAs) are optimization algorithms that learn a distribution from which good solutions can be sampled easily. A key parameter of most EDAs is the sample size (population size). Too small values lead to the undesired effect of genetic drift, while larger values slow down the process. Building on a quantitative analysis of how the population size leads to genetic drift, we design a smart-restart mechanism for EDAs. By stopping runs when the risk for genetic drift is high, it automatically runs the EDA in good parameter regimes. Via a mathematical runtime analysis, we prove a general performance guarantee for this smart-restart scheme. For many situations where the optimal parameter values are known, this shows that the restart scheme automatically finds these optimal values, leading to the asymptotically optimal performance. We also conduct an extensive experimental analysis. On four classic benchmarks, the smart-restart scheme leads to a performance close to the one obtainable with optimal parameter values. We also conduct experiments with PBIL (cross-entropy algorithm) on the max-cut problem and the bipartition problem. Again, the smart-restart mechanism finds much better values for the population size than those suggested in the literature, leading to a much better performance.

分布推定アルゴリズム(EDA)は、良い解を簡単にサンプリングできる分布を学習する最適化アルゴリズムです。ほとんどのEDAの重要なパラメータは、サンプルサイズ(集団サイズ)です。値が小さすぎると、遺伝的浮動の望ましくない影響が生じ、値が大きいとプロセスが遅くなります。集団サイズが遺伝的浮動にどのようにつながるかを定量的に分析して、EDAのスマートリスタートメカニズムを設計します。遺伝的浮動のリスクが高いときに実行を停止することで、EDAを適切なパラメータレジームで自動的に実行します。数学的な実行時間分析により、このスマートリスタートスキームの一般的なパフォーマンス保証を証明します。最適なパラメータ値がわかっている多くの状況では、リスタートスキームによってこれらの最適値が自動的に検出され、漸近的に最適なパフォーマンスが得られることが示されています。また、広範な実験分析も行っています。4つの標準的なベンチマークでは、スマートリスタートスキームによって、最適なパラメータ値で得られるパフォーマンスに近いパフォーマンスが得られます。また、最大カット問題と二分割問題に対してPBIL (クロスエントロピーアルゴリズム)の実験も行っています。この場合も、スマートリスタートメカニズムによって、文献で提案されている値よりもはるかに適切な集団サイズの値が検出され、パフォーマンスが大幅に向上しました。

A Unified Analysis of Multi-task Functional Linear Regression Models with Manifold Constraint and Composite Quadratic Penalty
多様体制約と複合二次ペナルティを持つマルチタスク関数線形回帰モデルの統一解析

This work studies the multi-task functional linear regression models where both the covariates and the unknown regression coefficients (called slope functions) are curves. For slope function estimation, we employ penalized splines to balance bias, variance, and computational complexity. The power of multi-task learning is brought in by imposing additional structures over the slope functions. We propose a general model with double regularization over the spline coefficient matrix: i) a matrix manifold constraint, and ii) a composite penalty as a summation of quadratic terms. Many multi-task learning approaches can be treated as special cases of this proposed model, such as a reduced-rank model and a graph Laplacian regularized model. We show the composite penalty induces a specific norm, which helps quantify the manifold curvature and determine the corresponding proper subset in the manifold tangent space. The complexity of tangent space subset is then bridged to the complexity of geodesic neighbor via generic chaining. A unified upper bound of the convergence rate is obtained and specifically applied to the reduced-rank model and the graph Laplacian regularized model. The phase transition behaviors for the estimators are examined as we vary the configurations of model parameters.

この研究では、共変量と未知の回帰係数(傾斜関数と呼ばれる)の両方が曲線であるマルチタスク機能線形回帰モデルを研究します。傾斜関数の推定には、バイアス、分散、計算の複雑さのバランスをとるためにペナルティ付きスプラインを使用します。マルチタスク学習の威力は、傾斜関数に追加の構造を課すことによってもたらされます。スプライン係数行列に対する二重の正則化(i)行列多様体制約、およびii) 2次項の合計としての複合ペナルティ)を伴う一般的なモデルを提案します。ランク削減モデルやグラフラプラシアン正則化モデルなど、多くのマルチタスク学習アプローチは、この提案モデルの特殊なケースとして扱うことができます。複合ペナルティによって特定のノルムが誘導され、それが多様体の曲率を定量化し、多様体の接線空間内の対応する適切なサブセットを決定するのに役立つことを示します。接線空間サブセットの複雑さは、汎用的な連鎖を介して測地線近傍の複雑さに橋渡しされます。収束率の統一された上限が得られ、特に、ランクを下げたモデルとグラフラプラシアン正規化モデルに適用されます。モデルパラメータの構成を変えながら、推定量の位相遷移動作が調べられます。

Deletion and Insertion Tests in Regression Models
回帰モデルでの削除テストと挿入テスト

A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function f. The insertion and deletion tests of Petsiuk et al. (2018) can be used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of f. We find an expression for the expected value of the AUC under a random ordering of inputs to f and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS) as well as LIME, DeepLIFT, vanilla gradient and input×gradient methods. KS has the best overall performance in two datasets we consider but it is very expensive to compute. We find that IG is nearly as good as KS while being much faster. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels and so we consider ways to handle binary variables in IG. We show that sorting variables by their Shapley value does not necessarily give the optimal ordering for an insertion-deletion test. It will however do that for monotone functions of additive models, such as logistic regression.

説明可能なAI (XAI)の基本的なタスクは、ブラックボックス関数fによる予測の背後にある最も重要な特徴を特定することです。Petsiukら(2018)の挿入テストと削除テストは、分類においてピクセルを最も重要なものから最も重要でないものにランク付けするアルゴリズムの品質を判断するために使用できます。回帰問題に着目して、fのアンカー分解における特定の主効果と相互作用の観点から、曲線下面積(AUC)基準の式を確立します。fへの入力をランダムに順序付けた場合のAUCの期待値の式を見つけ、回帰設定の直線上面積の代替を提案します。この基準を使用して、統合勾配(IG)によって計算された特徴の重要性を、カーネルSHAP (KS)やLIME、DeepLIFT、バニラ勾配、入力×勾配法によって計算された特徴の重要性と比較します。KSは、検討する2つのデータセットで全体的なパフォーマンスが最も優れていますが、計算コストが非常に高くなります。IGはKSとほぼ同等の性能があり、はるかに高速であることがわかりました。比較問題には、IGにとって課題となるバイナリ入力がいくつか含まれます。これは、可能な変数レベル間の値を使用する必要があるためです。そのため、IGでバイナリ変数を処理する方法を検討します。変数をShapley値で並べ替えても、挿入削除テストに最適な順序付けが必ずしも得られるわけではないことを示します。ただし、ロジスティック回帰などの加法モデルの単調関数の場合は、最適な順序付けが行われます。
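
The deletion test described above can be sketched for a regression function as follows. This is a minimal illustration under stated assumptions: the function `f`, the all-zero baseline, and the trapezoid-rule normalization of the AUC are choices made here, not taken from the paper.

```python
def deletion_curve(f, x, baseline, order):
    """Values of f as features of x are replaced by baseline values,
    one at a time, in the given order (most important first)."""
    z = list(x)
    curve = [f(z)]
    for j in order:
        z[j] = baseline[j]
        curve.append(f(z))
    return curve


def auc(curve):
    """Trapezoid-rule area under the curve, with the x-axis scaled to [0, 1]."""
    n = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(n)) / n


def f(z):
    """Hypothetical additive model used only for this example."""
    return 3 * z[0] + 1 * z[1] + 2 * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

auc_good = auc(deletion_curve(f, x, baseline, [0, 2, 1]))  # largest effect first
auc_bad = auc(deletion_curve(f, x, baseline, [1, 2, 0]))   # smallest effect first
```

Deleting the most influential features first makes the prediction drop fastest, so a better ranking gives a smaller deletion AUC. The insertion test runs the same loop from the baseline towards `x`, where a larger AUC is better.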

Deep Neural Networks with Dependent Weights: Gaussian Process Mixture Limit, Heavy Tails, Sparsity and Compressibility
従属重みを持つ深層ニューラルネットワーク:ガウス過程混合限界、ヘビーテール、スパース性、圧縮率

This article studies the infinite-width limit of deep feedforward neural networks whose weights are dependent, and modelled via a mixture of Gaussian distributions. Each hidden node of the network is assigned a nonnegative random variable that controls the variance of the outgoing weights of that node. We make minimal assumptions on these per-node random variables: they are iid and their sum, in each layer, converges to some finite random variable in the infinite-width limit. Under this model, we show that each layer of the infinite-width neural network can be characterised by two simple quantities: a non-negative scalar parameter and a Lévy measure on the positive reals. If the scalar parameters are strictly positive and the Lévy measures are trivial at all hidden layers, then one recovers the classical Gaussian process (GP) limit, obtained with iid Gaussian weights. More interestingly, if the Lévy measure of at least one layer is non-trivial, we obtain a mixture of Gaussian processes (MoGP) in the large-width limit. The behaviour of the neural network in this regime is very different from the GP regime. One obtains correlated outputs, with non-Gaussian distributions, possibly with heavy tails. Additionally, we show that, in this regime, the weights are compressible, and some nodes have asymptotically non-negligible contributions, therefore representing important hidden features. Many sparsity-promoting neural network models can be recast as special cases of our approach, and we discuss their infinite-width limits; we also present an asymptotic analysis of the pruning error. We illustrate some of the benefits of the MoGP regime over the GP regime in terms of representation learning and compressibility on simulated, MNIST and Fashion MNIST datasets.

この記事では、重みが従属し、ガウス分布の混合によってモデル化されたディープフィードフォワードニューラルネットワークの無限幅の限界について研究します。ネットワークの各隠しノードには、そのノードの出力重みの分散を制御する非負のランダム変数が割り当てられます。これらのノードごとのランダム変数については、最小限の仮定を行います。つまり、それらはiidであり、各層での合計は、無限幅の限界で有限のランダム変数に収束します。このモデルでは、無限幅ニューラルネットワークの各層が、非負のスカラーパラメータと正の実数に対するLévy測度の2つの単純な量によって特徴付けられることを示します。スカラーパラメータが厳密に正であり、Lévy測度がすべての隠し層で自明である場合、iidガウス重みで得られる古典的なガウス過程(GP)の限界が回復されます。さらに興味深いことに、少なくとも1つの層のLévy測度が自明でない場合、広い幅の極限でガウス過程の混合(MoGP)が得られます。この領域でのニューラルネットワークの動作は、GP領域とは大きく異なります。非ガウス分布を持つ相関出力が得られ、裾が重くなる可能性があります。さらに、この領域では重みが圧縮可能であり、一部のノードは漸近的に無視できない寄与を持つため、重要な隠れた特徴を表すことを示します。多くのスパース性を促進するニューラルネットワークモデルは、このアプローチの特殊なケースとして作り直すことができます。それらの無限幅の極限について説明します。また、刈り込みエラーの漸近分析も示します。シミュレーションデータ、MNIST、およびFashion MNISTデータセットでの表現学習と圧縮性に関して、GP領域よりもMoGP領域の方が優れている点をいくつか示します。

A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits
非定常確率的バンディットの動的後悔の新たな見方

We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of its dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the time horizon of the problem and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.

私たちは、各アームの報酬統計が学習の過程で複数回変化する可能性がある非定常確率的多腕バンディット問題を研究します。学習アルゴリズムのパフォーマンスは、動的後悔の観点から評価されます。動的後悔は、エージェントが各タイムステップで最適なアームを選択する場合の期待累積報酬と学習アルゴリズムの累積報酬との差として定義されます。このような環境の困難さを測定する1つの方法は、最適なアームのIDが何回変化するかを考慮することです。私たちは、$K$アームバンディット問題において、$N$は問題のタイムホライズン、$S$は最適なアームのIDが変化する回数であり、$S$に関する事前知識がなくても、ほぼ最適な$\widetilde O(\sqrt{K N(S+1)})$動的後悔を実現する方法を提案します。この問題に関するこれまでの研究では、報酬関数の変化数(または変化量)に応じて変化する後悔の境界が得られているが、これははるかに大きい場合があり、同様の境界を実現するために$S$に関する事前知識を前提としています。
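
The two quantities in the abstract, dynamic regret and the number $S$ of changes in the identity of the optimal arm, can be computed on a toy instance as sketched below. The mean-reward table and the sequence of chosen arms are hypothetical, regret is computed from mean rewards (a pseudo-regret), and ties in the optimal arm are broken by lowest index as a simplifying assumption.

```python
def dynamic_regret(means, chosen):
    """Sum over rounds of (best mean in that round) - (mean of the chosen arm)."""
    return sum(max(m) - m[a] for m, a in zip(means, chosen))


def optimal_arm_switches(means):
    """S: number of rounds at which the identity of the optimal arm changes."""
    best = [max(range(len(m)), key=m.__getitem__) for m in means]
    return sum(b1 != b2 for b1, b2 in zip(best, best[1:]))


def reward_changes(means):
    """Number of rounds at which any arm's mean reward changes."""
    return sum(m1 != m2 for m1, m2 in zip(means, means[1:]))


# Hypothetical 2-armed instance over 8 rounds with two distribution changes.
means = [[0.9, 0.5]] * 3 + [[0.8, 0.5]] * 2 + [[0.2, 0.5]] * 3
chosen = [0, 0, 1, 0, 0, 0, 1, 1]

regret = dynamic_regret(means, chosen)
S = optimal_arm_switches(means)
changes = reward_changes(means)
```

The rewards change twice here but the best arm changes only once, which is the gap between the two hardness measures that the abstract points to.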

Universal Approximation Property of Invertible Neural Networks
可逆ニューラルネットワークの普遍近似特性

Invertible neural networks (INNs) are neural network architectures with invertibility by design. Thanks to their invertibility and the tractability of their Jacobians, INNs have various machine learning applications such as probabilistic modeling, generative modeling, and representation learning. However, their attractive properties often come at the cost of restricting the layer design, which poses a question on their representation power: can we use these models to approximate sufficiently diverse functions? To answer this question, we have developed a general theoretical framework to investigate the representation power of INNs, building on a structure theorem of differential geometry. The framework simplifies the approximation problem of diffeomorphisms, which enables us to show the universal approximation properties of INNs. We apply the framework to two representative classes of INNs, namely Coupling-Flow-based INNs (CF-INNs) and Neural Ordinary Differential Equations (NODEs), and elucidate their high representation power despite the restrictions on their architectures.

可逆ニューラルネットワーク(INN)は、設計上可逆性を備えたニューラルネットワークアーキテクチャです。可逆性とヤコビアンの扱いやすさにより、INNは確率モデル、生成モデル、表現学習などのさまざまな機械学習アプリケーションに使用できます。ただし、その魅力的な特性は、多くの場合、レイヤー設計の制限を犠牲にして得られるため、表現力に関する疑問が生じます。これらのモデルを使用して、十分に多様な関数を近似できるでしょうか。この疑問に答えるために、微分幾何学の構造定理に基づいて、INNの表現力を調査するための一般的な理論的フレームワークを開発しました。このフレームワークは、微分同相写像の近似問題を簡素化し、INNの普遍的な近似特性を示すことを可能にします。私たちはこのフレームワークを、カップリングフローベースのINN (CF-INN)とニューラル常微分方程式(NODE)という2つの代表的なINNクラスに適用し、そのアーキテクチャ上の制約にもかかわらず、高い表現力を発揮できることを明らかにしました。
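
To make "invertibility by design" concrete, here is a minimal sketch of a single two-dimensional affine coupling layer, the building block usually associated with coupling-flow-based INNs. The scale and shift functions `s` and `t` are arbitrary stand-ins chosen here; in practice they would be small neural networks.

```python
import math


def s(x1):
    """Hypothetical scale function (stand-in for a neural network)."""
    return math.tanh(x1)


def t(x1):
    """Hypothetical shift function (stand-in for a neural network)."""
    return 0.5 * x1


def coupling_forward(x1, x2):
    """Keep x1 unchanged; transform x2 with a scale and shift that depend on x1."""
    return x1, x2 * math.exp(s(x1)) + t(x1)


def coupling_inverse(y1, y2):
    """Exact inverse: s and t are only evaluated, never inverted."""
    return y1, (y2 - t(y1)) * math.exp(-s(y1))


def log_abs_det_jacobian(x1, x2):
    """The Jacobian is triangular, so its log-determinant is just s(x1)."""
    return s(x1)


x = (0.3, -1.2)
y = coupling_forward(*x)
x_back = coupling_inverse(*y)
```

The layer leaves half of its input untouched, which is the kind of architectural restriction that motivates the question of representation power raised in the abstract.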

Low Tree-Rank Bayesian Vector Autoregression Models
低ツリーランクのベイジアンベクトル自己回帰モデル

Vector autoregression has been widely used for modeling and analysis of multivariate time series data. In high-dimensional settings, model parameter regularization schemes inducing sparsity yield interpretable models and achieve good forecasting performance. However, in many data applications, such as those in neuroscience, the Granger causality graph estimates from existing vector autoregression methods tend to be quite dense and difficult to interpret, unless one compromises on the goodness-of-fit. To address this issue, this paper proposes to incorporate a commonly used structural assumption: that the ground-truth graph should be largely connected, in the sense that it should only contain at most a few components. We take a Bayesian approach and develop a novel tree-rank prior distribution for the regression coefficients. Specifically, this prior distribution forces the non-zero coefficients to appear only on the union of a few spanning trees. Since each spanning tree connects $p$ nodes with only $(p-1)$ edges, it effectively achieves both high connectivity and high sparsity. We develop a computationally efficient Gibbs sampler that is scalable to large sample size and high dimension. In analyzing test-retest functional magnetic resonance imaging data, our model produces a much more interpretable graph estimate, compared to popular existing approaches. In addition, we show appealing properties of this new method, such as efficient computation, mild stability conditions and posterior consistency.

ベクトル自己回帰は、多変量時系列データのモデリングと分析に広く使用されています。高次元設定では、スパース性を誘導するモデルパラメータ正規化スキームにより、解釈可能なモデルが生成され、優れた予測パフォーマンスが達成されています。ただし、神経科学などの多くのデータアプリケーションでは、既存のベクトル自己回帰法によるグレンジャー因果関係グラフ推定値は非常に密になりがちで、適合度を妥協しない限り、解釈が困難です。この問題に対処するために、この論文では、一般的に使用されている構造仮定、つまり、グラウンドトゥルースグラフは、せいぜい数個のコンポーネントのみを含むという意味で、大部分が接続されている必要があるという仮定を組み込むことを提案します。ベイズアプローチを採用し、回帰係数の新しいツリーランク事前分布を開発します。具体的には、この事前分布により、非ゼロ係数が少数のスパニングツリーの和集合にのみ出現するように強制されます。各スパニングツリーは、p個のノードを(p-1)個のエッジのみで接続するため、高い接続性と高いスパース性の両方を効果的に実現します。大規模なサンプルサイズと高次元に拡張可能な、計算効率の高いギブスサンプラーを開発しました。テスト再テスト機能的磁気共鳴イメージングデータを分析する際、私たちのモデルは、一般的な既存のアプローチと比較して、はるかに解釈しやすいグラフ推定値を生成します。さらに、効率的な計算、穏やかな安定条件、事後一貫性など、この新しい方法の魅力的な特性を示します。
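
The counting argument behind the tree-rank prior (a union of a few spanning trees is connected yet sparse) can be checked with a short sketch. It assumes an undirected support graph for simplicity, and the two trees (a path and a star) are arbitrary examples rather than draws from the paper's prior.

```python
def path_tree(p):
    """Spanning tree 0-1-2-...-(p-1)."""
    return {(i, i + 1) for i in range(p - 1)}


def star_tree(p):
    """Spanning tree with node 0 joined to every other node."""
    return {(0, i) for i in range(1, p)}


def is_connected(p, edges):
    """Depth-first search over an undirected edge set."""
    adj = {i: set() for i in range(p)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == p


p = 6
support = path_tree(p) | star_tree(p)      # union of two spanning trees
max_edges = p * (p - 1) // 2               # edges of the complete graph
```

With $k$ trees the support has at most $k(p-1)$ edges, so the number of non-zero coefficients grows linearly in $p$ instead of quadratically.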

Generic Unsupervised Optimization for a Latent Variable Model With Exponential Family Observables
指数族観測量を持つ潜在変数モデルに対する汎用教師なし最適化

Latent variable models (LVMs) represent observed variables by parameterized functions of latent variables. Prominent examples of LVMs for unsupervised learning are probabilistic PCA or probabilistic sparse coding which both assume a weighted linear summation of the latents to determine the mean of a Gaussian distribution for the observables. In many cases, however, observables do not follow a Gaussian distribution. For unsupervised learning, LVMs which assume specific non-Gaussian observables (e.g., Bernoulli or Poisson) have therefore been considered. Already for specific choices of distributions, parameter optimization is challenging and only a few previous contributions considered LVMs with more generally defined observable distributions. In this contribution, we do consider LVMs that are defined for a range of different distributions, i.e., observables can follow any (regular) distribution of the exponential family. Furthermore, the novel class of LVMs presented here is defined for binary latents, and it uses maximization in place of summation to link the latents to observables. In order to derive an optimization procedure, we follow an expectation maximization approach for maximum likelihood parameter estimation. We then show, as our main result, that a set of very concise parameter update equations can be derived which feature the same functional form for all exponential family distributions. The derived generic optimization can consequently be applied (without further derivations) to different types of metric data (Gaussian and non-Gaussian) as well as to different types of discrete data. Moreover, the derived optimization equations can be combined with a recently suggested variational acceleration which is likewise generically applicable to the LVMs considered here. Thus, the combination maintains generic and direct applicability of the derived optimization procedure, but, crucially, enables efficient scalability. We numerically verify our analytical results using different observable distributions, and, furthermore, discuss some potential applications such as learning of variance structure, noise type estimation and denoising.

潜在変数モデル(LVM)は、潜在変数のパラメータ化された関数によって観測変数を表します。教師なし学習用のLVMの代表的な例としては、確率的PCAまたは確率的スパースコーディングがあります。これらはどちらも、潜在変数の加重線形和を仮定して観測変数のガウス分布の平均を決定します。ただし、多くの場合、観測変数はガウス分布に従いません。そのため、教師なし学習では、特定の非ガウス観測変数(ベルヌーイやポアソンなど)を仮定するLVMが検討されてきました。特定の分布を選択した場合でも、パラメータの最適化は困難であり、より一般的に定義された観測変数分布を持つLVMを検討した以前の論文はごくわずかでした。この論文では、さまざまな分布に対して定義されたLVMを検討します。つまり、観測変数は指数族の任意の(正規の)分布に従うことができます。さらに、ここで紹介する新しいクラスのLVMはバイナリ潜在変数に対して定義され、潜在変数を観測可能なものにリンクするために合計の代わりに最大化を使用します。最適化手順を導出するために、最大尤度パラメータ推定の期待値最大化アプローチに従います。次に、主な結果として、すべての指数族分布に対して同じ関数形式を特徴とする非常に簡潔なパラメータ更新方程式のセットを導出できることを示します。導出された一般的な最適化は、結果として(それ以上の導出なしで)さまざまな種類のメトリックデータ(ガウスおよび非ガウス)やさまざまな種類の離散データに適用できます。さらに、導出された最適化方程式は、最近提案された変分加速と組み合わせることができ、これもここで検討されているLVMに一般的に適用できます。したがって、この組み合わせにより、導出された最適化手順の汎用性と直接的な適用性が維持されますが、重要なことに、効率的なスケーラビリティが可能になります。私たちは、さまざまな観測可能な分布を使用して解析結果を数値的に検証し、さらに、分散構造の学習、ノイズの種類の推定、ノイズ除去などのいくつかの潜在的な応用について説明します。

A Complete Characterization of Linear Estimators for Offline Policy Evaluation
オフライン政策評価のための線形推定器の完全な特性評価

Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.

オフラインポリシー評価は、強化学習における基本的な統計的問題であり、潜在的に異なるポリシーによって収集されたデータを前提として、何らかの意思決定ポリシーの価値関数を推定します。複雑で高次元の観測値を伴う問題に取り組むために、理論家と実践者の両方から、強化学習における関数近似の可能性を理解することに大きな関心が寄せられています。多くの研究にもかかわらず、最も単純な線形関数近似の設定であっても、オフラインポリシー評価が扱いやすくなると予想される時期を明確に特徴付けることは、これまでのところ困難であり、最近、驚くほど多くの強い否定的な結果が文献に現れています。この研究では、古典的な方法、特に適合Q反復(FQI)と最小二乗時間差学習(LSTD)がオフラインポリシー評価を成功させるために必要かつ十分な、単純な制御理論的および線形代数的条件を特定します。この特徴付けを使用して、これらの推定器が成功する正確な階層を確立します。LSTDはFQIよりも厳密に弱い条件下で機能することを証明します。さらに、問題がLSTDで解決できない場合、無限データの限界であっても、幅広い種類の線形推定器では解決できないことが証明されました。総合すると、私たちの結果は、オフラインポリシー評価における線形推定器の動作の完全な図を提供し、標準アルゴリズムのこれまでばらばらだった分析を統合し、オフラインポリシー評価の根底にある統計的複雑さについて、はるかに明確な概念を提供します。
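
The claim that LSTD succeeds under weaker conditions than FQI can be seen on a classic one-dimensional toy instance, sketched below. The instance is constructed here for illustration and is not taken from the paper: every sample has feature 1, reward 0, and a next-state feature of 2, so the true value function is zero and is represented by $\theta = 0$.

```python
def lstd(data, gamma):
    """Scalar-feature LSTD: solve sum(phi * (phi - gamma * phi_next)) * theta
    = sum(phi * r) for theta."""
    a = sum(phi * (phi - gamma * phi_next) for phi, r, phi_next in data)
    b = sum(phi * r for phi, r, phi_next in data)
    return b / a


def fqi(data, gamma, theta0, iters):
    """Scalar-feature fitted Q-iteration: repeated least-squares regression
    onto the bootstrapped targets r + gamma * phi_next * theta."""
    cov = sum(phi * phi for phi, _, _ in data)
    theta = theta0
    for _ in range(iters):
        theta = sum(phi * (r + gamma * phi_next * theta)
                    for phi, r, phi_next in data) / cov
    return theta


# Hypothetical offline data: (feature, reward, next-state feature) triples.
data = [(1.0, 0.0, 2.0)] * 10
gamma = 0.9

theta_lstd = lstd(data, gamma)
theta_fqi = fqi(data, gamma, theta0=1.0, iters=50)
```

Each FQI step multiplies $\theta$ by $\gamma \cdot 2 = 1.8$, so the iterates blow up from any non-zero start, while LSTD solves the fixed-point equation directly and returns zero.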

Near-Optimal Weighted Matrix Completion
最適に近い重み付け行列の完了

Recent work in the matrix completion literature has shown that prior knowledge of a matrix’s row and column spaces can be successfully incorporated into reconstruction programs to substantially benefit matrix recovery. This paper proposes a novel methodology that exploits more general forms of known matrix structure in terms of subspaces. The work derives reconstruction error bounds that are informative in practice, providing insight into previous approaches in the literature while introducing novel programs with reduced sample complexities. The main result shows that a family of weighted nuclear norm minimization programs incorporating an $M_1 r$-dimensional subspace of $n\times n$ matrices (where $M_1\geq 1$ conveys structural properties of the subspace) allows accurate approximation of a rank $r$ matrix aligned with the subspace from a near-optimal number of observed entries (within a logarithmic factor of $M_1 r$). The result is robust: the error is proportional to measurement noise, the bound applies to full rank matrices, and it reflects degraded output when erroneous prior information is imposed. Numerical experiments are presented that validate the theoretical behavior derived for several example weighted programs.

行列補完に関する文献の最近の研究では、行列の行と列の空間に関する事前知識を再構成プログラムにうまく組み込むことで、行列の復元に大きく役立つことが示されています。この論文では、サブスペースの観点から、既知の行列構造のより一般的な形式を活用する新しい方法論を提案します。この研究では、実践に役立つ再構成エラー境界を導き、文献の以前のアプローチへの洞察を提供すると同時に、サンプルの複雑さを軽減した新しいプログラムを紹介します。主な結果は、$n\times n$行列の$M_1 r$次元サブスペース(ここで、$M_1\geq 1$はサブスペースの構造特性を伝えます)を組み込んだ重み付き核ノルム最小化プログラムファミリにより、サブスペースに揃えられたランク$r$行列を、観測されたエントリのほぼ最適な数($M_1 r$の対数係数以内)から正確に近似できることが示されています。結果は堅牢で、誤差は測定ノイズに比例し、フルランクの行列に適用され、誤った事前情報が課された場合に出力が低下します。いくつかの例の重み付けプログラムから導出された理論的な動作を検証する数値実験が提示されています。

Community models for networks observed through edge nominations
エッジノミネートを通じて観察されるネットワークのコミュニティモデル

Communities are a common and widely studied structure in networks, typically assuming that the network is fully and correctly observed. In practice, network data are often collected by querying nodes about their connections. In some settings, all edges of a sampled node will be recorded, and in others, a node may be asked to name its connections. These sampling mechanisms introduce noise and bias, which can obscure the community structure and invalidate assumptions underlying standard community detection methods. We propose a general model for a class of network sampling mechanisms based on recording edges via querying nodes, designed to improve community detection for network data collected in this fashion. We model edge sampling probabilities as a function of both individual preferences and community parameters, and show community detection can be performed by spectral clustering under this general class of models. We also propose, as a special case of the general framework, a parametric model for directed networks we call the nomination stochastic block model, which allows for meaningful parameter interpretations and can be fitted by the method of moments. In this case, spectral clustering and the method of moments are computationally efficient and come with theoretical guarantees of consistency. We evaluate the proposed model in simulation studies on unweighted and weighted networks and under misspecified models. The method is applied to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.

コミュニティは、ネットワークにおいて一般的で広く研究されている構造であり、通常はネットワークが完全に正しく観察されていることを前提としています。実際には、ネットワークデータは、接続についてノードに問い合わせることによって収集されることがよくあります。設定によっては、サンプリングされたノードのすべてのエッジが記録され、他の設定では、ノードに接続の名前を尋ねることがあります。これらのサンプリングメカニズムによってノイズやバイアスが生じ、コミュニティ構造が不明瞭になり、標準的なコミュニティ検出方法の基礎となる仮定が無効になる可能性があります。私たちは、クエリノードを介してエッジを記録することに基づく、この方法で収集されたネットワークデータのコミュニティ検出を改善するように設計された、ネットワークサンプリングメカニズムのクラスの一般的なモデルを提案します。エッジサンプリング確率を個人の好みとコミュニティパラメータの両方の関数としてモデル化し、この一般的なクラスのモデルの下でスペクトルクラスタリングによってコミュニティ検出を実行できることを示します。また、一般的なフレームワークの特殊なケースとして、有向ネットワークのパラメトリックモデルを提案します。これは、意味のあるパラメータ解釈を可能にし、モーメント法でフィッティングできる、指名確率ブロックモデルと呼ばれます。この場合、スペクトルクラスタリングとモーメント法は計算効率が高く、理論的に一貫性が保証されています。重み付けされていないネットワークと重み付けされたネットワーク、および誤って指定されたモデルでのシミュレーション研究で、提案されたモデルを評価します。この方法は、教員採用データセットに適用され、米国のビジネススクール間のコミュニティの重要な階層を発見しました。

The Bayesian Learning Rule
ベイズ学習ルール

We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.

私たちは、多くの機械学習アルゴリズムは、ベイズ学習ルールと呼ばれる単一のアルゴリズムの特定のインスタンスであることを示します。ベイズ原理から導出されたこのルールは、最適化、深層学習、グラフィカルモデルなどの分野から幅広いアルゴリズムを生み出します。これには、リッジ回帰、ニュートン法、カルマンフィルターなどの従来のアルゴリズムだけでなく、確率的勾配降下法、RMSprop、Dropoutなどの最新の深層学習アルゴリズムも含まれます。このようなアルゴリズムを導出する際の重要な考え方は、自然勾配を使用して推定された候補分布を使用して事後分布を近似することです。候補分布が異なると、アルゴリズムも異なり、自然勾配へのさらなる近似により、それらのアルゴリズムのバリアントが生じます。私たちの仕事は、既存のアルゴリズムを統一し、一般化し、改善するだけでなく、新しいアルゴリズムの設計にも役立ちます。

Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD
データの異種性の影響の除去による分散型 SGD のネットワークトポロジ依存性の強化

We consider decentralized stochastic optimization problems, where a network of $n$ nodes cooperates to find a minimizer of the globally-averaged cost. A widely studied decentralized algorithm for this problem is the decentralized SGD (D-SGD), in which each node averages only with its neighbors. D-SGD is efficient in single-iteration communication, but it is very sensitive to the network topology. For smooth objective functions, the transient stage (which measures the number of iterations the algorithm has to experience before achieving the linear speedup stage) of D-SGD is on the order of $O(n/(1-\beta)^2)$ and $O(n^3/(1-\beta)^4)$ for strongly and generally convex cost functions, respectively, where $1-\beta \in (0,1)$ is a topology-dependent quantity that approaches $0$ for a large and sparse network. Hence, D-SGD suffers from slow convergence for large and sparse networks. In this work, we revisit the convergence property of the D$^2$/Exact-Diffusion algorithm. By eliminating the influence of data heterogeneity between nodes, D$^2$/Exact-Diffusion is shown to have an enhanced transient stage that is on the order of $\tilde{O}(n/(1-\beta))$ and $O(n^3/(1-\beta)^2)$ for strongly and generally convex cost functions (where $\tilde{O}(\cdot)$ hides all logarithm factors), respectively. Moreover, when D$^2$/Exact-Diffusion is implemented with both gradient accumulation and multi-round gossip communications, its transient stage can be further improved to $\tilde{O}(1/(1-\beta)^{\frac{1}{2}})$ and $\tilde{O}(n/(1-\beta))$ for strongly and generally convex cost functions, respectively. To our knowledge, these established results for D$^2$/Exact-Diffusion have the best (i.e., weakest) dependence on network topology compared to existing decentralized algorithms. Numerical simulations are conducted to validate our theories.

私たちは、$n$個のノードのネットワークが協力して、グローバル平均コストの最小値を見つける、分散型確率最適化問題を検討します。この問題に対する広く研究されている分散型アルゴリズムは、各ノードが近隣ノードとのみ平均化する分散型SGD (D-SGD)です。D-SGDは単一反復通信では効率的ですが、ネットワークトポロジーに非常に敏感です。滑らかな目的関数の場合、D-SGDの過渡段階(線形高速化段階に到達するまでにアルゴリズムが経験しなければならない反復回数を測定)は、それぞれ、強く凸なコスト関数と一般に凸なコスト関数に対して$O(n/(1-\beta)^2)$と$O(n^3/(1-\beta)^4)$のオーダーです。ここで、$1-\beta \in (0,1)$は、大規模で疎なネットワークでは$0$に近づく、トポロジーに依存する量です。したがって、D-SGDは大規模で疎なネットワークでは収束が遅いという問題があります。この研究では、D$^2$/Exact-Diffusionアルゴリズムの収束特性を再検討します。ノード間のデータの異質性の影響を排除することで、D$^2$/Exact-Diffusionは、それぞれ強く凸なコスト関数と一般的に凸なコスト関数に対して、$\tilde{O}(n/(1-\beta))$と$O(n^3/(1-\beta)^2)$のオーダーの強化された過渡段階を持つことが示されます(ここで、$\tilde{O}(\cdot)$はすべての対数因子を隠します)。さらに、D$^2$/Exact-Diffusionが勾配蓄積とマルチラウンドゴシップ通信の両方で実装されると、その過渡段階は、それぞれ強く凸なコスト関数と一般的に凸なコスト関数に対して、$\tilde{O}(1/(1-\beta)^{\frac{1}{2}})$と$\tilde{O}(n/(1-\beta))$にさらに改善されます。私たちの知る限り、D$^2$/Exact-Diffusionのこれらの確立された結果は、既存の分散アルゴリズムと比較して、ネットワークトポロジへの依存性が最良(つまり最も弱い)です。数値シミュレーションを実施して、私たちの理論を検証します。
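
To make the topology-dependent quantity $1-\beta$ more concrete, the sketch below computes it for a ring network and shows one D-SGD-style update. It rests on assumptions not stated in the abstract: $\beta$ is taken to be the second-largest eigenvalue magnitude of a symmetric doubly stochastic mixing matrix, the ring uses weight 1/3 on each node and its two neighbours, and the update order (average, then subtract the local gradient) is only one common variant. The function names are hypothetical.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric doubly stochastic matrix for a ring of n >= 3 nodes.

    Assumed weighting: each node averages itself and its two
    neighbours with weight 1/3.
    """
    if n < 3:
        raise ValueError("this sketch needs at least 3 nodes")
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def spectral_gap(W):
    """Return 1 - beta, with beta the second-largest eigenvalue magnitude of W."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - magnitudes[1]

def d_sgd_step(X, G, W, lr):
    """One illustrative D-SGD step.

    X: (n, d) array, row i is the iterate held by node i.
    G: (n, d) array, row i is node i's local stochastic gradient.
    Each node averages with its neighbours (W @ X), then takes a gradient step.
    """
    return W @ X - lr * G
```

For the ring, `spectral_gap(ring_mixing_matrix(n))` shrinks toward 0 as `n` grows, which is the large-and-sparse regime in which the abstract says D-SGD slows down.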

Sparse Markov Models for High-dimensional Inference
高次元推論のためのスパースマルコフモデル

Finite-order Markov models are well-studied models for dependent finite alphabet data. Despite their generality, application in empirical work is rare when the order $d$ is large relative to the sample size $n$ (e.g., $d = \mathcal{O}(n)$). Practitioners rarely use higher-order Markov models because (1) the number of parameters grows exponentially with the order, (2) the sample size $n$ required to estimate each parameter grows exponentially with the order, and (3) the interpretation is often difficult. Here, we consider a subclass of Markov models called Mixture of Transition Distribution (MTD) models, proving that when the set of relevant lags is sparse (i.e., $\mathcal{O}(\log(n))$), we can consistently and efficiently recover the lags and estimate the transition probabilities of high-dimensional ($d = \mathcal{O}(n)$) MTD models. Moreover, the estimated model allows straightforward interpretation. The key innovation is a recursive procedure for a priori selection of the relevant lags of the model. We prove a new structural result for the MTD and an improved martingale concentration inequality to prove our results. Using simulations, we show that our method performs well compared to other relevant methods. We also illustrate the usefulness of our method on weather data where the proposed method correctly recovers the long-range dependence.

有限次マルコフモデルは、従属有限アルファベットデータのためのよく研究されたモデルです。その一般性にもかかわらず、次数$d$がサンプルサイズ$n$に対して大きい場合(例: $d = \mathcal{O}(n)$)、実証研究への応用はまれです。実践者が高次マルコフモデルを使用することはめったにありません。その理由は、(1)パラメータの数が次数とともに指数的に増加する、(2)各パラメータを推定するために必要なサンプルサイズ$n$が次数とともに指数的に増加する、(3)解釈が難しいことが多いためです。ここでは、混合遷移分布(MTD)モデルと呼ばれるマルコフモデルのサブクラスを検討し、関連するラグのセットがスパースである場合(つまり、$\mathcal{O}(\log(n))$)、一貫して効率的にラグを回復し、高次元($d = \mathcal{O}(n)$) MTDモデルの遷移確率を推定できることを証明します。さらに、推定モデルは簡単に解釈できます。重要な革新は、モデルの関連ラグを事前に選択するための再帰手順です。MTDの新しい構造結果と、改善されたマルチンゲール集中不等式を証明して、結果を証明します。シミュレーションを使用して、この方法が他の関連方法と比較して優れていることを示します。また、提案された方法が長距離依存性を正しく回復する気象データに対するこの方法の有用性も示します。

Distinguishing Cause and Effect in Bivariate Structural Causal Models: A Systematic Investigation
二変量構造因果モデルにおける因果関係の区別:系統的検討

Distinguishing cause and effect from purely observational data is a fundamental problem in science. Even the atomic bivariate case, seemingly the simplest, is challenging and requires further assumptions to be identifiable at all. In recent years a variety of approaches to address this problem have been developed, each with its own assumptions, strengths, and weaknesses. In machine learning, common benchmarks with real and synthetic data have been a main driver of innovation. Synthetic benchmarks can explicitly model data characteristics such as the underlying functional relations and distributions to assess how methods deal with these. However, a systematic assessment of state-of-the-art methods is currently missing. We provide a detailed and systematic comparison of a range of methods on a novel collection of datasets that systematically models individual data challenges. Further, we evaluate more recent methods missing in previous benchmarks. The novel suite of datasets will be contributed to the causeme.net benchmark platform to provide a continuously updated and searchable causal discovery method intercomparison database. Our aim is to assist users in finding the most suitable methods for their problem setting and to help method developers improve current methods and develop new ones.

純粋に観察されたデータから原因と結果を区別することは、科学における基本的な問題です。一見最も単純な原子二変量の場合でさえ、困難であり、識別するためにはさらなる仮定が必要です。近年、この問題に対処するためのさまざまなアプローチが開発されてきましたが、それぞれに独自の仮定、長所、短所があります。機械学習では、実際のデータと合成データを使用した共通のベンチマークがイノベーションの主な原動力となっています。合成ベンチマークは、基礎となる機能関係や分布などのデータ特性を明示的にモデル化して、メソッドがこれらに対処する方法を評価できます。ただし、最先端のメソッドの体系的な評価は現在欠如しています。私たちは、個々のデータの課題を体系的にモデル化する新しいデータセットのコレクションで、さまざまなメソッドの詳細で体系的な比較を提供します。さらに、以前のベンチマークでは欠落していたより最近のメソッドを評価します。新しいデータセットスイートは、causeme.netベンチマークプラットフォームに提供され、継続的に更新され、検索可能な因果発見メソッドの相互比較データベースを提供します。私たちの目的は、ユーザーが問題設定に最も適した方法を見つけられるように支援し、方法開発者が現在の方法を改善して新しい方法を開発できるように支援することです。

Elastic Gradient Descent, an Iterative Optimization Method Approximating the Solution Paths of the Elastic Net
弾性勾配降下法、弾性ネットの解の経路を近似する反復最適化法

The elastic net combines lasso and ridge regression to fuse the sparsity property of lasso with the grouping property of ridge regression. The connections between ridge regression and gradient descent and between lasso and forward stagewise regression have previously been shown. Similar to how the elastic net generalizes lasso and ridge regression, we introduce elastic gradient descent, a generalization of gradient descent and forward stagewise regression. We theoretically analyze elastic gradient descent and compare it to the elastic net and forward stagewise regression. Parts of the analysis are based on elastic gradient flow, a piecewise analytical construction, obtained for elastic gradient descent with infinitesimal step size. We also compare elastic gradient descent to the elastic net on real and simulated data and show that it provides similar solution paths, but is several orders of magnitude faster. Compared to forward stagewise regression, elastic gradient descent selects a model that, although still sparse, provides considerably lower prediction and estimation errors.

弾性ネットは、Lassoとリッジ回帰を組み合わせて、Lassoのスパース特性とリッジ回帰のグループ化特性を融合します。リッジ回帰と勾配降下法、Lassoと前向き段階的回帰の関係は、以前に示されました。弾性ネットがLassoとリッジ回帰を一般化するのと同様に、勾配降下法と前向き段階的回帰を一般化した弾性勾配降下法を紹介します。弾性勾配降下法を理論的に分析し、弾性ネットおよび前向き段階的回帰と比較します。分析の一部は、弾性勾配フローに基づいています。これは、微小ステップサイズで弾性勾配降下法に対して得られる区分解析構成です。また、実際のデータとシミュレートされたデータで弾性勾配降下法を弾性ネットと比較し、同様のソリューションパスを提供しますが、数桁高速であることを示します。前向き段階的回帰と比較すると、弾性勾配降下法では、まだスパースではあるものの、予測および推定エラーが大幅に低いモデルが選択されます。

On Biased Compression for Distributed Learning
分散学習のためのバイアス圧縮について

In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left( \delta L \exp[-\frac{\mu K}{\delta L}] + \frac{(C + \delta D)}{K\mu}\right)$, where $\delta\ge1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.

ここ数年、さまざまな通信圧縮技術が、分散学習における通信ボトルネックの緩和に役立つ不可欠なツールとして登場しました。しかし、バイアス圧縮は、より研究され理解されているアンバイアス圧縮と比較して、実際には優れたパフォーマンスを示すことが多いにもかかわらず、それらについてはほとんどわかっていません。この研究では、3つのクラスのバイアス圧縮演算子(そのうち2つは新しいもの)と、(確率的)勾配降下法と分散(確率的)勾配降下法に適用した場合のパフォーマンスについて研究します。バイアス圧縮が、単一ノードと分散設定の両方で線形収束率につながることを初めて示します。私たちは、誤差フィードバック機構を採用した分散圧縮SGD法がエルゴード率$O\left( \delta L \exp[-\frac{\mu K}{\delta L}] + \frac{(C + \delta D)}{K\mu}\right)$を享受することを証明します。ここで、$\delta\ge1$は圧縮をさらに適用すると増加する圧縮パラメータ、$L$と$\mu$は平滑度定数と強凸性定数、$C$は確率的勾配ノイズを捕捉し(各ノードで完全な勾配が計算される場合は$C=0$)、$D$は最適値での勾配の分散を捕捉します(過剰パラメータ化モデルの場合は$D=0$)。さらに、通信勾配のいくつかの合成分布と経験分布の理論的研究を通じて、バイアス付き圧縮器がバイアスなしの圧縮器よりも優れている理由と程度を明らかにします。最後に、有望な理論的保証と実用的なパフォーマンスを備えたいくつかの新しいバイアス付き圧縮器を提案します。
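
As an illustration of a biased compressor combined with error feedback, the following minimal single-node sketch uses Top-k sparsification, a well-known biased operator. The abstract does not say which operators the paper studies, so Top-k is used here only as an example; the function names and the exact update order are assumptions made for this sketch.

```python
import numpy as np

def top_k(v, k):
    """Biased compressor: keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd(grad, x0, lr, k, steps):
    """Gradient descent with Top-k compression and error feedback (single node).

    grad: callable returning the (possibly stochastic) gradient at x.
    The vector e stores everything the compressor dropped so far and is
    added back before the next compression.
    """
    x = np.array(x0, dtype=float)
    e = np.zeros_like(x)
    for _ in range(steps):
        p = e + lr * grad(x)   # scaled gradient plus accumulated error
        delta = top_k(p, k)    # only this sparse vector would be communicated
        x = x - delta
        e = p - delta          # remember what was not applied
    return x
```

For example, `ef_sgd(lambda x: x - np.ones(5), np.zeros(5), lr=0.1, k=1, steps=200)` minimizes a simple quadratic while transmitting a single coordinate per step.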

Adaptive Clustering Using Kernel Density Estimators
カーネル密度推定器を使用した適応クラスタリング

We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature. In addition, we compare our results to other guarantees found in the literature and also present some experiments comparing our algorithm to $k$-means and hierarchical clustering.

私たちは、有限クラスターツリーのすべての分割と対応するクラスターを推定するための汎用的な再帰的アルゴリズムを導き出し、分析します。この汎用クラスタリングアルゴリズムがカーネル密度推定器からレベルセット推定値を受け取るときの統計的特性をさらに調査します。特に、有限のサンプル保証、一貫性、収束率、およびカーネル帯域幅を選択するための適応型データ駆動型戦略を導き出します。これらの結果では、ヘルダー連続性のような密度の連続性の仮定は必要なく、ノンパラメトリックな性質の直感的な幾何学的仮定のみが必要です。さらに、結果を文献にある他の保証と比較し、アルゴリズムを$k$-meansおよび階層クラスタリングと比較するいくつかの実験も提示します。

A Continuous-time Stochastic Gradient Descent Method for Continuous Data
連続データに対する連続時間確率的勾配降下法

Optimization problems with continuous data appear in, e.g., robust machine learning, functional data analysis, and variational inference. Here, the target function is given as an integral over a family of (continuously) indexed target functions—integrated with respect to a probability measure. Such problems can often be solved by stochastic optimization methods: performing optimization steps with respect to the indexed target function with randomly switched indices. In this work, we study a continuous-time variant of the stochastic gradient descent algorithm for optimization problems with continuous data. This so-called stochastic gradient process consists in a gradient flow minimizing an indexed target function that is coupled with a continuous-time index process determining the index. Index processes are, e.g., reflected diffusions, pure jump processes, or other Lévy processes on compact spaces. Thus, we study multiple sampling patterns for the continuous data space and allow for data simulated or streamed at runtime of the algorithm. We analyze the approximation properties of the stochastic gradient process and study its longtime behavior and ergodicity under constant and decreasing learning rates. We end with illustrating the applicability of the stochastic gradient process in a polynomial regression problem with noisy functional data, as well as in a physics-informed neural network.

連続データによる最適化問題は、例えば、ロバスト機械学習、機能データ解析、変分推論などで発生します。ここで、ターゲット関数は、確率測度に関して積分された、（連続的に）インデックス付けされたターゲット関数の族の積分として与えられます。このような問題は、多くの場合、確率的最適化手法によって解決できます。つまり、ランダムに切り替えられたインデックスを持つインデックス付けされたターゲット関数に関して最適化手順を実行します。この研究では、連続データによる最適化問題に対する確率的勾配降下アルゴリズムの連続時間バリアントを研究します。このいわゆる確率的勾配プロセスは、インデックス付けされたターゲット関数を最小化する勾配フローと、インデックスを決定する連続時間インデックスプロセスで構成されます。インデックスプロセスには、例えば、反射拡散、純粋ジャンププロセス、またはコンパクト空間上のその他のレヴィプロセスがあります。したがって、連続データ空間の複数のサンプリングパターンを研究し、アルゴリズムの実行時にデータをシミュレートまたはストリーミングできるようにします。確率的勾配過程の近似特性を分析し、一定および減少する学習率での長期的な動作とエルゴード性を研究します。最後に、ノイズの多い関数データを使用した多項式回帰問題と、物理学に基づくニューラルネットワークにおける確率的勾配過程の適用性を示します。

Online Non-stochastic Control with Partial Feedback
パーシャルフィードバックによるオンライン非確率制御

Online control with non-stochastic disturbances and adversarially chosen convex cost functions, referred to as online non-stochastic control, has recently attracted increasing attention. We study online non-stochastic control with partial feedback, where learners can only access partially observed states and partially informed (bandit) costs. The problem setting arises naturally in real-world decision-making applications and strictly generalizes exceptional cases studied disparately by previous works. We propose the first online algorithm for this problem, with an $\tilde{O}(T^{3/4})$ regret competing with the best policy in hindsight, where $T$ denotes the time horizon and the $\tilde{O}(\cdot)$-notation omits the poly-logarithmic factors in $T$. To further enhance the algorithms’ robustness to changing environments, we then design a novel method with a two-layer structure to optimize the dynamic regret, a more challenging measure that competes with time-varying policies. Our method is based on the online ensemble framework by treating the controller above as the base learner. On top of that, we design two different meta-combiners to simultaneously handle the unknown variation of environments and the memory issue arising from the online control. We prove that the two resulting algorithms enjoy $\tilde{O}(T^{3/4}(1+P_T)^{1/2})$ and $\tilde{O}(T^{3/4}(1+P_T)^{1/4}+T^{5/6})$ dynamic regret respectively, where $P_T$ measures the environmental non-stationarity. Our results are further extended to unknown transition matrices. Finally, empirical studies in both synthetic linear and simulated nonlinear tasks validate our method’s effectiveness, thus supporting the theoretical findings.

非確率的外乱と敵対的に選択された凸コスト関数を伴うオンライン制御は、オンライン非確率的制御と呼ばれ、最近ますます注目を集めています。私たちは、学習者が部分的に観測された状態と部分的に情報に基づいた（バンディット）コストにのみアクセスできる、部分的なフィードバックを伴うオンライン非確率的制御を研究します。問題設定は、現実世界の意思決定アプリケーションで自然に発生し、以前の研究で個別に研究された例外的なケースを厳密に一般化します。私たちはこの問題に対する最初のオンラインアルゴリズムを提案します。これは、後知恵での最善のポリシーと競合する$\tilde{O}(T^{3/4})$の後悔を伴うもので、ここで$T$は時間範囲を表し、$\tilde{O}(\cdot)$表記は$T$の多重対数因子を省略します。変化する環境に対するアルゴリズムの堅牢性をさらに強化するために、時間変動ポリシーと競合するより困難な尺度である動的後悔を最適化する2層構造の新しい方法を設計します。私たちの方法は、上記のコントローラーを基本学習器として扱うことで、オンラインアンサンブルフレームワークに基づいています。その上で、環境の未知の変動とオンライン制御から生じるメモリの問題を同時に処理するために、2つの異なるメタコンバイナーを設計します。結果として得られる2つのアルゴリズムが、それぞれ$\tilde{O}(T^{3/4}(1+P_T)^{1/2})$と$\tilde{O}(T^{3/4}(1+P_T)^{1/4}+T^{5/6})$の動的後悔を実現することを証明します。ここで、$P_T$は環境の非定常性を測定します。私たちの結果は、未知の遷移行列にさらに拡張されます。最後に、合成線形タスクとシミュレートされた非線形タスクの両方での経験的研究により、私たちの方法の有効性が検証され、理論的発見が裏付けられています。

Distributed Sparse Regression via Penalization
ペナルティ化による分散スパース回帰

We study sparse linear regression over a network of agents, modeled as an undirected graph (with no centralized node). The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint—the latter being instrumental to obtain distributed solution methods. While penalty-based consensus methods have been extensively studied in the optimization literature, their statistical and computational guarantees in the high dimensional setting remain unclear. This work provides an answer to this open problem. Our contribution is two-fold. First, we establish statistical consistency of the estimator: under a suitable choice of the penalty parameter, the optimal solution of the penalized problem achieves near optimal minimax rate $O(s \log d/N)$ in $\ell_2$-loss, where $s$ is the sparsity value, $d$ is the ambient dimension, and $N$ is the total sample size in the network—this matches centralized sample rates. Second, we show that the proximal-gradient algorithm applied to the penalized problem, which naturally leads to distributed implementations, converges linearly up to a tolerance of the order of the centralized statistical error—the rate scales as $O(d)$, revealing an unavoidable speed-accuracy dilemma. Numerical results demonstrate the tightness of the derived sample rate and convergence rate scalings.

私たちは、無向グラフ（集中ノードなし）としてモデル化されたエージェントのネットワーク上のスパース線形回帰を研究します。推定問題は、ローカルLASSO損失関数の合計とコンセンサス制約の2次ペナルティの最小化として定式化されます。後者は、分散ソリューションメソッドを取得するために役立ちます。ペナルティベースのコンセンサスメソッドは最適化の文献で広範に研究されていますが、高次元設定での統計的および計算上の保証は不明のままです。本研究は、この未解決の問題に対する答えを提供します。私たちの貢献は2つあります。まず、推定量の統計的一貫性を確立します。ペナルティパラメーターを適切に選択すると、ペナルティ付き問題の最適解は、$\ell_2$損失でほぼ最適なミニマックスレート$O(s \log d/N)$を達成します。ここで、$s$はスパース値、$d$は周囲の次元、$N$はネットワークの合計サンプルサイズであり、これは集中サンプルレートに一致します。次に、ペナルティ付き問題に適用された近似勾配アルゴリズムは、分散実装に自然とつながり、集中統計誤差の許容範囲まで線形に収束することを示します。収束率は$O(d)$に比例し、避けられない速度と精度のジレンマが明らかになります。数値結果は、導出されたサンプルレートと収束率のスケーリングの厳密さを示しています。

Causal Discovery with Unobserved Confounding and Non-Gaussian Data
観測されていない交絡データと非ガウスデータによる因果関係の発見

We consider recovering causal structure from multivariate observational data. We assume the data arise from a linear structural equation model (SEM) in which the idiosyncratic errors are allowed to be dependent in order to capture possible latent confounding. Each SEM can be represented by a graph where vertices represent observed variables, directed edges represent direct causal effects, and bidirected edges represent dependence among error terms. Specifically, we assume that the true model corresponds to a bow-free acyclic path diagram; i.e., a graph that has at most one edge between any pair of nodes and is acyclic in the directed part. We show that when the errors are non-Gaussian, the exact causal structure encoded by such a graph, and not merely an equivalence class, can be recovered from observational data. The method we propose for this purpose uses estimates of suitable moments, but, in contrast to previous results, does not require specifying the number of latent variables a priori. We also characterize the output of our procedure when the assumptions are violated and the true graph is acyclic, but not bow-free. We illustrate the effectiveness of our procedure in simulations and an application to an ecology data set.

私たちは、多変量観測データから因果構造を復元することを検討します。データは、潜在的な交絡を捉えるために特異な誤差が従属関係にあることが許される線形構造方程式モデル(SEM)から生じたものと仮定します。各SEMは、頂点が観測変数、有向辺が直接的な因果効果、双方向辺が誤差項間の依存関係を表すグラフで表すことができます。具体的には、真のモデルは、弓形のない(bow-free)非巡回パス図、つまり、任意の2つのノード間に最大1つの辺を持ち、有向部分が非巡回であるグラフに対応するものと仮定します。誤差が非ガウス分布である場合、そのようなグラフによってエンコードされた正確な因果構造(同値類だけではない)が観測データから復元できることを示します。この目的で提案する方法では、適切なモーメントの推定値を使用しますが、以前の結果とは対照的に、潜在変数の数を事前に指定する必要がありません。また、仮定に違反し、真のグラフが非巡回ではあるもののbow-freeではない場合の手順の出力も特徴付けます。シミュレーションと生態学データセットへの適用における手順の有効性を示します。

Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation
ミニバッチ確率的近位点法のよりシャープな解析: 安定性、滑らかさ、および偏差

The stochastic proximal point (SPP) methods have gained recent attention for stochastic optimization, with strong convergence guarantees and superior robustness compared to the classic stochastic gradient descent (SGD) methods, at little to no added computational overhead. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds of M-SPP derived through the lens of algorithmic stability theory. Particularly under smoothness and quadratic growth conditions, we show that M-SPP with minibatch-size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance decaying term. In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches by revealing the impact of noise level of model on convergence rate. In the complementary small-$T$-large-$n$ regime, we propose a two-phase extension of M-SPP to achieve comparable convergence rates. Additionally, we establish a deviation bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP, which holds with high probability over the randomness of data while in expectation over the randomness of algorithm. Numerical evidence is provided to support our theoretical predictions when instantiated to Lasso and logistic regression models.

確率的近点(SPP)法は、強力な収束保証と、従来の確率的勾配降下法(SGD)法よりも優れた堅牢性を備え、計算オーバーヘッドをほとんどまたはまったく追加することなく、確率的最適化において最近注目を集めています。この記事では、凸複合リスク最小化問題を解決するためのSPPのミニバッチバリアントであるM-SPPについて検討します。その中核となるのは、アルゴリズム安定性理論の観点から導き出されたM-SPPの一連の新しい過剰リスク境界です。特に、平滑性と二次成長の条件下では、ミニバッチサイズ$n$および反復回数$T$のM-SPPが、$\mathcal{O}\left(\frac{1}{T^2}\right)$バイアス減衰項と$\mathcal{O}\left(\frac{1}{nT}\right)$分散減衰項からなる期待どおりの高速収束率を実現することを示します。小$n$-大$T$設定では、この結果は、モデルのノイズレベルが収束率に与える影響を明らかにすることで、SPPタイプのアプローチの最もよく知られている結果を大幅に改善します。補完的な小$T$-大$n$領域では、同等の収束率を達成するために、M-SPPの2段階拡張を提案します。さらに、M-SPPの非復元サンプリング変種のパラメーター推定誤差の偏差境界を確立します。これは、データのランダム性に対して高い確率で成り立ち、アルゴリズムのランダム性に対しても期待どおりに成り立ちます。Lassoおよびロジスティック回帰モデルに具体化した場合の理論的予測を裏付ける数値的証拠が提供されます。
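
A minibatch proximal point step can be written in closed form for plain least squares, which gives a compact way to see how it differs from a gradient step. The sketch below is only an illustration under simplifying assumptions: it uses a squared loss with no composite regularizer, a constant step size, and a hypothetical function name; it is not the M-SPP procedure analyzed in the paper.

```python
import numpy as np

def mspp_least_squares(A, b, batch_size, step, iters, seed=0):
    """Minibatch stochastic proximal point iterations for least squares.

    Each iteration draws a minibatch (A_B, b_B) without replacement and solves
        min_z  1/(2m) * ||A_B z - b_B||^2  +  1/(2*step) * ||z - x||^2
    exactly, where m is the batch size. Setting the gradient to zero gives
    the linear system
        (A_B^T A_B / m + I / step) z = A_B^T b_B / m + x / step.
    """
    rng = np.random.default_rng(seed)
    n_samples, dim = A.shape
    x = np.zeros(dim)
    for _ in range(iters):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        A_B, b_B = A[idx], b[idx]
        H = A_B.T @ A_B / batch_size + np.eye(dim) / step
        rhs = A_B.T @ b_B / batch_size + x / step
        x = np.linalg.solve(H, rhs)
    return x
```

Unlike a gradient step, this implicit update stays stable for large values of `step`, which is one way to read the robustness claim in the abstract.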

Dynamic Ranking with the BTL Model: A Nearest Neighbor based Rank Centrality Method
BTLモデルによる動的ランク付け:最近傍に基づくランク中心法

Many applications such as recommendation systems or sports tournaments involve pairwise comparisons within a collection of $n$ items, the goal being to aggregate the binary outcomes of the comparisons in order to recover the latent strength and/or global ranking of the items. In recent years, this problem has received significant interest from a theoretical perspective with a number of methods being proposed, along with associated statistical guarantees under the assumption of a suitable generative model. These results typically collect the pairwise comparisons as one comparison graph $G$; however, in many applications, such as the outcomes of soccer matches during a tournament, the nature of pairwise outcomes can evolve with time. Theoretical results for such a dynamic setting are relatively limited compared to the aforementioned static setting. We study in this paper an extension of the classic BTL (Bradley-Terry-Luce) model for the static setting to our dynamic setup under the assumption that the probabilities of the pairwise outcomes evolve smoothly over the time domain $[0,1]$. Given a sequence of comparison graphs $(G_{t'})_{t' \in \mathcal{T}}$ on a regular grid $\mathcal{T} \subset [0,1]$, we aim at recovering the latent strengths of the items $w_t^* \in \mathbb{R}^n$ at any time $t \in [0,1]$. To this end, we adapt the Rank Centrality method, a popular spectral approach for ranking in the static case, by locally averaging the available data on a suitable neighborhood of $t$. When $(G_{t'})_{t' \in \mathcal{T}}$ is a sequence of Erdős-Rényi graphs, we provide non-asymptotic $\ell_2$ and $\ell_{\infty}$ error bounds for estimating $w_t^*$ which in particular establish the consistency of this method in terms of $n$ and the grid size $|\mathcal{T}|$. We also complement our theoretical analysis with experiments on real and synthetic data.

推薦システムやスポーツトーナメントなどの多くのアプリケーションでは、$n$個のアイテムのコレクション内でのペアワイズ比較が行われます。その目的は、比較のバイナリ結果を集約して、アイテムの潜在的な強さやグローバルランキングを回復することです。近年、この問題は理論的な観点から大きな関心を集めており、適切な生成モデルを前提とした関連する統計的保証とともに、さまざまな方法が提案されています。これらの結果では通常、ペアワイズ比較が1つの比較グラフ$G$として収集されますが、多くのアプリケーション(トーナメント中のサッカーの試合の結果など)では、ペアワイズ結果の性質は時間とともに変化する可能性があります。このような動的設定の理論的結果は、前述の静的設定と比較して比較的限られています。この論文では、ペアワイズ結果の確率が時間領域$[0,1]$にわたってスムーズに変化するという仮定の下、静的設定の古典的なBTL (Bradley-Terry-Luce)モデルを動的設定に拡張することを検討します。正規グリッド$\mathcal{T} \subset [0,1]$上の比較グラフのシーケンス$(G_{t'})_{t' \in \mathcal{T}}$が与えられた場合、任意の時間$t \in [0,1]$におけるアイテム$w_t^* \in \mathbb{R}^n$の潜在的な強度を回復することを目指します。この目的のために、利用可能なデータを適切な近傍$t$で局所的に平均化することにより、静的なケースでのランキングによく使われるスペクトルアプローチであるランク中心性法を採用します。$(G_{t'})_{t' \in \mathcal{T}}$がエルデシュ・レニグラフのシーケンスである場合、$w_t^*$を推定するための非漸近的な$\ell_2$および$\ell_{\infty}$誤差境界を提供します。これは特に、$n$およびグリッドサイズ$|\mathcal{T}|$に関してこの方法の一貫性を確立します。また、実際のデータと合成データでの実験により理論分析を補完します。
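
The idea of running Rank Centrality on locally pooled data can be sketched as follows. This is a simplified illustration, not the paper's estimator: it assumes win-count matrices with `wins[i, j]` equal to the number of times item `i` beat item `j`, pools all grid points within a fixed `bandwidth` of `t` with equal weight, and finds the stationary distribution by power iteration. All function names and parameters are hypothetical.

```python
import numpy as np

def rank_centrality(wins, n_iter=1000):
    """Static Rank Centrality from a matrix of pairwise win counts.

    Builds a Markov chain that moves from i to j with probability
    proportional to the fraction of i-vs-j comparisons won by j, and
    returns its stationary distribution as the strength estimate.
    """
    wins = np.array(wins, dtype=float)
    np.fill_diagonal(wins, 0.0)
    total = wins + wins.T                      # comparisons per pair
    compared = total > 0
    frac = np.divide(wins.T, total, out=np.zeros_like(total), where=compared)
    d_max = compared.sum(axis=1).max()         # maximum degree of the comparison graph
    if d_max == 0:
        raise ValueError("no comparisons available")
    P = frac / d_max
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))   # make each row sum to 1
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(n_iter):                    # power iteration (assumes a connected graph)
        pi = pi @ P
    return pi / pi.sum()

def dynamic_rank_centrality(wins_by_time, grid, t, bandwidth):
    """Pool win counts from grid points within `bandwidth` of t, then rank."""
    mask = np.abs(np.asarray(grid, dtype=float) - t) <= bandwidth
    pooled = np.asarray(wins_by_time, dtype=float)[mask].sum(axis=0)
    return rank_centrality(pooled)
```

When the win fractions match BTL probabilities $w_i/(w_i+w_j)$ exactly, the stationary distribution is proportional to $w$, which is why the output can be read as an estimate of the latent strengths.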

Revisiting minimum description length complexity in overparameterized models
過度にパラメータ化されたモデルにおける最小記述長の複雑さの再検討

Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen’s principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or the kernel matrix and the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d<n$, and logarithmically with $d/n$ for $d>n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings also suggest that the recently observed double descent phenomena in overparameterized models might be a consequence of the choice of non-ideal estimators.

複雑性は、一般化のパフォーマンスを通知することを目的とした統計学習理論の基礎となる基本概念です。パラメータ数は、低次元設定では有効ですが、パラメータ数がトレーニングサンプル数より多い場合、過剰パラメータ設定では十分に正当化されません。Rissanenの最小記述長(MDL)原理に基づく複雑性の尺度を再検討し、過剰パラメータ化モデルに対して有効な新しいMDLベースの複雑性(MDL-COMP)を定義します。MDL-COMPは、優れたリッジ推定クラスによって誘導されるエンコードに対する最適性基準によって定義されます。線形モデルとカーネル法に対するMDL-COMPの広範な理論的特徴付けを提供し、それがパラメータ数の関数ではなく、むしろ設計またはカーネルマトリックスの特異値と信号対雑音比の関数であることを示します。観測値$n$個、パラメータ$d$個、およびi.i.d.ガウス予測子を持つ線形モデルの場合、MDL-COMPは$d<n$のとき$d$に線形に、$d>n$のとき$d/n$に対数的に増加します。カーネル法については、MDL-COMPがサンプル内誤差の最小値を示し、入力の次元が増加するにつれて減少する可能性があることを示しています。また、MDL-COMPがサンプル内平均二乗誤差(MSE)の上限を示すことも証明しています。一連のシミュレーションと実データ実験により、データ駆動型のPrac-MDL-COMPが、限られたデータ設定でリッジ回帰によるテストMSEを最適化するためのハイパーパラメータ調整を示し、場合によってはクロス検証を改善し、(常に)計算コストを節約することを示しています。最後に、私たちの調査結果は、最近観察された過剰パラメータ化モデルでの二重降下(double descent)現象は、非理想的な推定量の選択の結果である可能性があることも示唆しています。

Sparse Plus Low Rank Matrix Decomposition: A Discrete Optimization Approach
スパース行列と低ランク行列の分解:離散最適化アプローチ

We study the Sparse Plus Low-Rank decomposition problem (SLR), which is the problem of decomposing a corrupted data matrix into a sparse matrix of perturbations plus a low-rank matrix containing the ground truth. SLR is a fundamental problem in Operations Research and Machine Learning which arises in various applications, including data compression, latent semantic indexing, collaborative filtering, and medical imaging. We introduce a novel formulation for SLR that directly models its underlying discreteness. For this formulation, we develop an alternating minimization heuristic that computes high-quality solutions and a novel semidefinite relaxation that provides meaningful bounds for the solutions returned by our heuristic. We also develop a custom branch-and-bound algorithm that leverages our heuristic and convex relaxations to solve small instances of SLR to certifiable (near) optimality. Given an input n-by-n matrix, our heuristic scales to solve instances where n = 10000 in minutes, our relaxation scales to instances where n = 200 in hours, and our branch-and-bound algorithm scales to instances where n = 25 in minutes. Our numerical results demonstrate that our approach outperforms existing state-of-the-art approaches in terms of rank, sparsity, and mean-square error while maintaining a comparable runtime.

私たちは、破損したデータ行列を、摂動のスパース行列と、真実を含む低ランク行列に分解する問題である、スパースプラス低ランク分解問題(SLR)を研究します。SLRは、オペレーションズリサーチと機械学習の基本的な問題であり、データ圧縮、潜在的意味インデックス、協調フィルタリング、医療用画像処理など、さまざまなアプリケーションで発生します。私たちは、SLRの基礎となる離散性を直接モデル化する新しい定式化を導入します。この定式化では、高品質のソリューションを計算する交互最小化ヒューリスティックと、ヒューリスティックによって返されるソリューションに意味のある境界を提供する新しい半正定値緩和を開発します。また、ヒューリスティックと凸緩和を活用して、SLRの小さなインスタンスを証明可能な(ほぼ)最適に解決するカスタム分岐限定アルゴリズムも開発します。入力n行n列の行列が与えられた場合、ヒューリスティックはn = 10000の場合のインスタンスを分単位で解決するようにスケーリングされ、緩和法はn = 200の場合のインスタンスを時間単位で解決するようにスケーリングされ、分岐限定アルゴリズムはn = 25の場合のインスタンスを分単位で解決するようにスケーリングされます。数値結果は、ランク、スパース性、平均二乗誤差の点で、同等の実行時間を維持しながら、既存の最先端のアプローチよりも優れたパフォーマンスを発揮することを示しています。
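
A generic alternating minimization scheme for sparse plus low-rank decomposition can be sketched as below. It is offered only as an illustration of the idea under stated assumptions: the rank and the number of nonzeros are fixed hard constraints, the low-rank update is a truncated SVD, and the sparse update keeps the largest-magnitude residual entries. The paper's own formulation and heuristic may differ, and the function names are hypothetical.

```python
import numpy as np

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def keep_largest(M, nnz):
    """Keep the `nnz` largest-magnitude entries of M and zero out the rest."""
    out = np.zeros_like(M)
    if nnz > 0:
        idx = np.argpartition(np.abs(M).ravel(), -nnz)[-nnz:]
        out.flat[idx] = M.flat[idx]
    return out

def sparse_plus_low_rank(D, rank, nnz, iters=50):
    """Alternate exact updates of X (rank <= rank) and Y (at most nnz nonzeros).

    Each update minimizes ||D - X - Y||_F over one block with the other
    fixed, so the residual never increases from one step to the next.
    """
    D = np.asarray(D, dtype=float)
    Y = np.zeros_like(D)
    X = np.zeros_like(D)
    for _ in range(iters):
        X = truncated_svd(D - Y, rank)
        Y = keep_largest(D - X, nnz)
    return X, Y
```

Such a scheme only returns a locally good solution, which is consistent with the abstract pairing its heuristic with a relaxation and a branch-and-bound method to certify quality.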

On the Estimation of Derivatives Using Plug-in Kernel Ridge Regression Estimators
プラグインカーネルリッジ回帰推定量を用いた導関数の推定について

We study the problem of estimating the derivatives of a regression function, which has a wide range of applications as a key nonparametric functional of unknown functions. Standard analysis may be tailored to specific derivative orders, and parameter tuning remains a daunting challenge particularly for high-order derivatives. In this article, we propose a simple plug-in kernel ridge regression (KRR) estimator in nonparametric regression with random design that is broadly applicable for multi-dimensional support and arbitrary mixed-partial derivatives. We provide a non-asymptotic analysis to study the behavior of the proposed estimator in a unified manner that encompasses the regression function and its derivatives, leading to two error bounds for a general class of kernels under the strong $L_\infty$ norm. In a concrete example specialized to kernels with polynomially decaying eigenvalues, the proposed estimator recovers the minimax optimal rate up to a logarithmic factor for estimating derivatives of functions in Hölder and Sobolev classes. Interestingly, the proposed estimator achieves the optimal rate of convergence with the same choice of tuning parameter for any order of derivatives. Hence, the proposed estimator enjoys a plug-in property for derivatives in that it automatically adapts to the order of derivatives to be estimated, enabling easy tuning in practice. Our simulation studies show favorable finite sample performance of the proposed method relative to several existing methods and corroborate the theoretical findings on its minimax optimality.

私たちは、回帰関数の導関数を推定する問題を研究します。導関数は、未知関数の主要なノンパラメトリック汎関数として幅広い用途を持ちます。標準的な分析は特定の導関数の次数に合わせて調整される場合があり、パラメータの調整は特に高次導関数の場合に困難な課題となっています。この記事では、多次元の台と任意の混合偏導関数に広く適用可能な、ランダム設計のノンパラメトリック回帰におけるシンプルなプラグインカーネルリッジ回帰(KRR)推定量を提案します。回帰関数とその導関数を包含する統一的な方法で提案推定量の挙動を研究するための非漸近分析を提供し、強い$L_\infty$ノルムの下で、一般的なカーネルのクラスに対する2つの誤差限界を導きます。多項式的に減衰する固有値を持つカーネルに特化した具体例では、提案された推定量は、ヘルダーおよびソボレフクラスの関数の導関数の推定において、対数因子を除いてミニマックス最適率を達成します。興味深いことに、提案された推定量は、導関数の次数に関係なく、同じ調整パラメータの選択で最適な収束率を実現します。したがって、提案された推定量は、推定対象となる導関数の次数に自動的に適応するという意味で導関数に関するプラグイン特性を備えており、実際の調整が容易になります。シミュレーション研究では、提案された方法がいくつかの既存の方法と比較して良好な有限サンプル性能を示し、ミニマックス最適性に関する理論的結果を裏付けています。
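
The plug-in idea in this abstract can be sketched in one dimension: fit an ordinary kernel ridge regression, then differentiate the fitted function analytically instead of fitting a separate model for the derivative. The sketch below is only an illustration under assumptions that are not in the abstract: a Gaussian kernel, a fixed grid of noise-free inputs rather than a random design, and hypothetical names (`fit_krr`, `predict_derivative`) and parameter values (`ridge`, `h`).

```python
import math


def gaussian_kernel(x, y, h=1.0):
    """Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))


def gaussian_kernel_dx(x, y, h=1.0):
    """Partial derivative of k(x, y) with respect to its first argument."""
    return -(x - y) / (h * h) * gaussian_kernel(x, y, h)


def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def fit_krr(xs, ys, ridge=1e-3, h=1.0):
    """Return the KRR coefficients alpha = (K + ridge * I)^{-1} y."""
    n = len(xs)
    gram = [
        [gaussian_kernel(xs[i], xs[j], h) + (ridge if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(gram, ys)


def predict(x, xs, alpha, h=1.0):
    """Fitted regression function: sum_i alpha_i * k(x, x_i)."""
    return sum(a * gaussian_kernel(x, xi, h) for a, xi in zip(alpha, xs))


def predict_derivative(x, xs, alpha, h=1.0):
    """Plug-in derivative: differentiate the fitted function term by term."""
    return sum(a * gaussian_kernel_dx(x, xi, h) for a, xi in zip(alpha, xs))


# Toy data (an assumption for illustration): y = sin(x) on a grid over [0, 2*pi].
xs = [2.0 * math.pi * i / 39 for i in range(40)]
ys = [math.sin(x) for x in xs]
alpha = fit_krr(xs, ys)
```

The same coefficients `alpha` serve both `predict` and `predict_derivative`, which mirrors the abstract's point that one tuning choice is reused for every derivative order. Higher-order or mixed partial derivatives would need the corresponding kernel derivatives.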

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction
高次元リスク予測のための代理支援半教師付き推論

Risk modeling with electronic health records (EHR) data is challenging due to the lack of direct observations of the disease outcome and the high dimensionality of the predictors. In this paper, we develop a surrogate assisted semi-supervised learning approach, leveraging small labeled data with annotated outcomes and extensive unlabeled data of outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes by constructing a sparse imputation model with outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of utilizing unlabeled data enables high-dimensional statistical inference for the challenging setting with a dense risk prediction model. We present an extensive simulation study to demonstrate the superiority of our approach compared to existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.

電子健康記録(EHR)データによるリスクモデリングは、疾患の結果を直接観察できないことと、予測子が高次元であることから困難です。この論文では、注釈付きの結果を含む小さなラベル付きデータと、結果サロゲートおよび高次元予測子の広範なラベルなしデータを活用する、サロゲート支援型半教師あり学習アプローチを開発します。結果サロゲートと高次元予測子を含むスパース補完モデルを構築することにより、観測されていない結果を補完することを提案します。さらに、リスク予測の区間推定を可能にするために、1ステップのバイアス補正を実行します。補完モデルとリスク予測モデルの両方が誤って指定されている場合でも、推論手順は有効です。ラベルなしデータを利用する新しい方法により、密な(スパースでない)リスク予測モデルという困難な設定でも高次元の統計的推論が可能になります。既存の教師あり方法と比較して、このアプローチの優位性を示すために、広範なシミュレーション研究を紹介します。この方法を、EHRバイオバンクコホートを使用して2型糖尿病の遺伝的リスク予測に適用します。

ProtoryNet – Interpretable Text Classification Via Prototype Trajectories
ProtoryNet – プロトタイプの軌跡による解釈可能なテキスト分類

We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public datasets demonstrate that ProtoryNet achieves higher accuracy than the baseline prototype-based deep neural net and narrows the performance gap when compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models only need less than or around 20 prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report survey results indicating that human users find ProtoryNet more intuitive and easier to understand compared to other prototype-based methods.

私たちは、プロトタイプ軌跡という新しい概念に基づいた、テキスト分類用の新しい解釈可能なディープニューラルネットワーク、ProtoryNetを提案します。現代言語学のプロトタイプ理論に着想を得たProtoryNetは、テキストシーケンス内の各文に最も類似したプロトタイプを見つけ、各文と対応するアクティブプロトタイプとの近さをRNNバックボーンに入力することで予測を行います。次に、RNNバックボーンはプロトタイプの時間的パターンをキャプチャします。これをプロトタイプ軌跡と呼びます。プロトタイプ軌跡により、人間がテキストを分析する方法と同様に、RNNモデルの推論プロセスを直感的かつ詳細に解釈できます。また、モデルで使用されるプロトタイプの総数を減らして解釈性を高めるプロトタイププルーニング手順も設計しました。複数の公開データセットでの実験により、ProtoryNetはベースラインのプロトタイプベースのディープニューラルネットよりも高い精度を達成し、最先端のブラックボックスモデルとのパフォーマンスギャップを縮小することが実証されています。さらに、プロトタイプの削減後、結果として得られるProtoryNetモデルでは、すべてのデータセットに対して20個以下のプロトタイプしか必要ないため、解釈可能性が大幅に向上します。さらに、人間のユーザーは、他のプロトタイプベースの方法と比較して、ProtoryNetの方が直感的で理解しやすいと感じていることを示す調査結果を報告します。

Distributed Algorithms for U-statistics-based Empirical Risk Minimization
U統計に基づく経験的リスク最小化のための分散アルゴリズム

Empirical risk minimization, where the underlying loss function depends on a pair of data points, covers a wide range of application areas in statistics including pairwise ranking and survival analysis. The common empirical risk estimator obtained by averaging values of a loss function over all possible pairs of observations is essentially a U-statistic. One well-known problem with minimizing U-statistic type empirical risks is that the computational complexity of U-statistics increases quadratically with the sample size. When faced with big data, this poses computational challenges as the colossal number of observation pairs virtually prohibits centralized computing from being performed on a single machine. This paper addresses this problem by developing two computationally and statistically efficient methods based on the divide-and-conquer strategy on a decentralized computing system, whereby the data are distributed among machines to perform the tasks. One of these methods is based on a surrogate of the empirical risk, while the other method extends the one-step updating scheme in classical M-estimation to the case of pairwise loss. We show that the proposed estimators are as asymptotically efficient as the benchmark global U-estimator obtained under centralized computing. We also introduce two distributed iterative algorithms to facilitate the implementation of the proposed methods, and conduct extensive numerical experiments to demonstrate their merit.

経験的リスク最小化は、基礎となる損失関数が1組のデータポイントに依存し、ペアワイズランキングや生存分析を含む統計の幅広い応用分野をカバーします。損失関数の値をすべての可能な観測ペアにわたって平均化することによって得られる一般的な経験的リスク推定量は、本質的にU統計量です。U統計量タイプの経験的リスクを最小化する際のよく知られた問題の1つは、U統計量の計算の複雑さがサンプルサイズとともに2乗で増加することです。ビッグデータに直面すると、観測ペアの数が膨大であるため、1台のマシンで集中計算を実行することが事実上不可能になるため、計算上の課題が生じます。この論文では、分散コンピューティングシステムでの分割統治戦略に基づく2つの計算的および統計的に効率的な方法を開発することでこの問題に対処します。分散コンピューティングシステムでは、データが複数のマシンに分散されてタスクが実行されます。これらの方法の1つは経験的リスクの代理に基づいており、もう1つは古典的なM推定の1ステップ更新スキームをペアワイズ損失の場合に拡張したものです。提案された推定量は、集中コンピューティングで得られたベンチマークのグローバルU推定量と同様に漸近的に効率的であることを示します。また、提案された方法の実装を容易にするために2つの分散反復アルゴリズムを導入し、そのメリットを実証するために広範な数値実験を実施します。
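
To make the quadratic cost concrete, the sketch below averages a pairwise kernel over all unordered pairs, then shows a naive block-wise average that only looks at pairs inside each block. The block-wise version is the simplest divide-and-conquer baseline, not either of the two estimators proposed in the paper; the function names and the toy sample are assumptions for illustration.

```python
from itertools import combinations
import statistics


def u_statistic(data, kernel):
    """Average a symmetric pairwise kernel over all unordered pairs (O(n^2) work)."""
    pairs = list(combinations(data, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def half_squared_difference(a, b):
    """Pairwise kernel whose U-statistic is the unbiased sample variance."""
    return 0.5 * (a - b) ** 2


def blockwise_u_statistic(data, kernel, n_blocks):
    """Average of within-block U-statistics; cross-block pairs are ignored.

    Trailing observations are dropped when len(data) is not divisible by n_blocks.
    """
    size = len(data) // n_blocks
    blocks = [data[i * size:(i + 1) * size] for i in range(n_blocks)]
    return sum(u_statistic(block, kernel) for block in blocks) / n_blocks


sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 1.0, 8.0]
full_value = u_statistic(sample, half_squared_difference)               # uses 28 pairs
block_value = blockwise_u_statistic(sample, half_squared_difference, 2)  # uses 12 pairs
```

With `n` observations split into `m` blocks, the block-wise version touches roughly `n^2 / (2m)` pairs instead of `n^2 / 2`, at the price of discarding cross-block information. That loss is the gap the paper's surrogate and one-step methods aim to close.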

Minimax Estimation for Personalized Federated Learning: An Alternative between FedAvg and Local Training?
パーソナライズされた連合学習のためのミニマックス推定:FedAvgとローカルトレーニングの代替案?

A widely recognized difficulty in federated learning arises from the statistical heterogeneity among clients: local datasets often originate from distinct yet not entirely unrelated probability distributions, and personalization is, therefore, necessary to achieve optimal results from each individual’s perspective. In this paper, we show how the excess risks of personalized federated learning using a smooth, strongly convex loss depend on data heterogeneity from a minimax point of view, with a focus on the FedAvg algorithm (McMahan et al., 2017) and pure local training (i.e., clients solve empirical risk minimization problems on their local datasets without any communication). Our main result reveals an approximate alternative between these two baseline algorithms for federated learning: the former algorithm is minimax rate optimal over a collection of instances when data heterogeneity is small, whereas the latter is minimax rate optimal when data heterogeneity is large, and the threshold is sharp up to a constant. As an implication, our results show that from a worst-case point of view, a dichotomous strategy that makes a choice between the two baseline algorithms is rate-optimal. Another implication is that the popular strategy of FedAvg followed by local fine tuning is also minimax optimal under additional regularity conditions. Our analysis relies on a new notion of algorithmic stability that takes into account the nature of federated learning.

フェデレーテッドラーニングの難しさとして広く認識されているのは、クライアント間の統計的異質性です。ローカルデータセットは、異なるが完全に無関係ではない確率分布から生成されることが多く、そのため、各個人の観点から最適な結果を得るにはパーソナライズが必要です。この論文では、滑らかで強凸な損失を使用するパーソナライズされたフェデレーテッドラーニングの過剰リスクが、ミニマックスの観点からデータの異質性にどのように依存するかを、FedAvgアルゴリズム(McMahanら、2017年)と純粋なローカルトレーニング(つまり、クライアントが通信なしでローカルデータセットで経験的リスク最小化問題を解く)に焦点を当てて示します。私たちの主な結果は、フェデレーテッドラーニングのこれら2つのベースラインアルゴリズムの間に近似的な二者択一が成り立つことを示しています。前者のアルゴリズムは、データの異質性が小さい場合にインスタンスの集合全体でミニマックスレート最適であり、後者のアルゴリズムは、データの異質性が大きい場合にミニマックスレート最適です。また、その境界となるしきい値は定数倍を除いてシャープです。その帰結として、私たちの結果は、最悪ケースの観点から、2つのベースラインアルゴリズムのいずれかを選択する二分法戦略がレート最適であることを示しています。もう1つの帰結は、FedAvgの後にローカル微調整を行うという一般的な戦略も、追加の正則性条件の下でミニマックス最適になるということです。私たちの分析は、フェデレーテッドラーニングの性質を考慮したアルゴリズムの安定性という新しい概念に基づいています。

Nearest Neighbor Dirichlet Mixtures
最近傍ディリクレ混合

There is a rich literature on Bayesian methods for density estimation, which characterize the unknown density as a mixture of kernels. Such methods have advantages in terms of providing uncertainty quantification in estimation, while being adaptive to a rich variety of densities. However, relative to frequentist locally adaptive kernel methods, Bayesian approaches can be slow and unstable to implement in relying on Markov chain Monte Carlo algorithms. To maintain most of the strengths of Bayesian approaches without the computational disadvantages, we propose a class of nearest neighbor-Dirichlet mixtures. The approach starts by grouping the data into neighborhoods based on standard algorithms. Within each neighborhood, the density is characterized via a Bayesian parametric model, such as a Gaussian with unknown parameters. Assigning a Dirichlet prior to the weights on these local kernels, we obtain a pseudo-posterior for the weights and kernel parameters. A simple and embarrassingly parallel Monte Carlo algorithm is proposed to sample from the resulting pseudo-posterior for the unknown density. Desirable asymptotic properties are shown, and the methods are evaluated in simulation studies and applied to a motivating data set in the context of classification.

密度推定のためのベイズ法に関する文献は豊富にあり、これらの方法は未知の密度をカーネルの混合として特徴づけます。このような方法は、推定における不確実性の定量化を提供するという点で利点があり、また、さまざまな密度に適応できます。しかし、頻度主義の局所適応カーネル法と比較すると、ベイズ法はマルコフ連鎖モンテカルロアルゴリズムに依存するため、実装が遅く不安定になる可能性があります。ベイズ法の長所のほとんどを計算上の欠点なしに維持するために、私たちは最近傍ディリクレ混合のクラスを提案します。このアプローチは、標準アルゴリズムに基づいてデータを近傍にグループ化することから始まります。各近傍内では、密度は、未知のパラメータを持つガウス分布などのベイズパラメトリックモデルによって特徴づけられます。これらの局所カーネルの重みにディリクレ事前分布を割り当てると、重みとカーネルパラメータの疑似事後分布が得られます。未知の密度について、結果として得られた疑似事後分布からサンプリングするための、単純で完全に並列化可能な(embarrassingly parallel)モンテカルロアルゴリズムを提案します。望ましい漸近特性が示され、提案手法はシミュレーション研究で評価され、分類の文脈における動機付けとなるデータセットに適用されます。

Learning to Rank under Multinomial Logit Choice
多項ロジット選択の下でのランク付けの学習

Learning the optimal ordering of content is an important challenge in website design. The learning to rank (LTR) framework models this problem as a sequential problem of selecting lists of content and observing where users decide to click. Most previous work on LTR assumes that the user considers each item in the list in isolation, and makes binary choices to click or not on each. We introduce a multinomial logit (MNL) choice model to the LTR framework, which captures the behaviour of users who consider the ordered list of items as a whole and make a single choice among all the items and a no-click option. Under the MNL model, the user favours items which are either inherently more attractive, or placed in a preferable position within the list. We propose upper confidence bound (UCB) algorithms to minimise regret in two settings – where the position dependent parameters are known, and unknown. We present theoretical analysis leading to an $\Omega(\sqrt{JT})$ lower bound for the problem, an $\tilde{O}(\sqrt{JT})$ upper bound on regret of the UCB algorithm in the known-parameter setting, and an $\tilde{O}(K^2\sqrt{JT})$ upper bound on regret, the first, in the more challenging unknown-position-parameter setting. Our analyses are based on tight new concentration results for Geometric random variables, and novel functional inequalities for maximum likelihood estimators computed on discrete data.

コンテンツの最適な順序付けを学習することは、ウェブサイト設計における重要な課題です。ランキング学習(LTR)フレームワークは、この問題を、コンテンツのリストを選択し、ユーザーがクリックする場所を観察する逐次的な問題としてモデル化します。LTRに関するこれまでの研究のほとんどは、ユーザーがリスト内の各項目を個別に検討し、それぞれをクリックするかしないかの二者択一を行うことを前提としています。私たちは、LTRフレームワークに多項ロジット(MNL)選択モデルを導入します。これは、順序付けられた項目のリストを全体として検討し、すべての項目と「クリックしない」という選択肢の中から1つだけを選ぶユーザーの行動を捉えます。MNLモデルでは、ユーザーは、本質的により魅力的な項目、またはリスト内で好ましい位置にある項目を好みます。私たちは、位置依存パラメータが既知である場合と未知の場合の2つの設定でリグレットを最小化する信頼上限(UCB)アルゴリズムを提案します。私たちは、この問題の下限$\Omega(\sqrt{JT})$、パラメータが既知である設定でのUCBアルゴリズムのリグレットの上限$\tilde{O}(\sqrt{JT})$、そしてより困難な位置パラメータが未知である設定での初めてのリグレット上限$\tilde{O}(K^2\sqrt{JT})$を導く理論分析を提示します。私たちの分析は、幾何分布に従う確率変数に対するタイトな新しい集中不等式と、離散データで計算された最尤推定量に対する新しい関数不等式に基づいています。
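
The single-choice behaviour described above can be illustrated with a small multinomial logit calculation. The abstract does not state the exact parameterisation, so the sketch assumes that the weight of the item in slot `k` is its attractiveness times a position effect, and that the no-click option has weight 1. The function name and all numbers are hypothetical.

```python
def mnl_click_probabilities(attractiveness, position_effects):
    """Return (click probability per slot, no-click probability).

    attractiveness[k] is the attractiveness of the item shown in slot k and
    position_effects[k] is the multiplier of slot k (both assumed positive).
    """
    weights = [a * lam for a, lam in zip(attractiveness, position_effects)]
    total = 1.0 + sum(weights)  # the 1.0 is the weight of the no-click option
    return [w / total for w in weights], 1.0 / total


position_effects = [1.0, 0.6, 0.3]  # hypothetical: earlier slots draw more attention

# Most attractive item first versus most attractive item last.
best_first, no_click_best = mnl_click_probabilities([0.5, 0.3, 0.2], position_effects)
best_last, no_click_worst = mnl_click_probabilities([0.2, 0.3, 0.5], position_effects)
```

Because the user makes one choice over the whole list, the click probabilities and the no-click probability sum to one. Under these assumed numbers, reordering the same three items changes the overall chance of any click, which is why the ordering itself is the object being learned.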

Scalable high-dimensional Bayesian varying coefficient models with unknown within-subject covariance
被験者内共分散が不明なスケーラブルな高次元ベイズ変動係数モデル

Nonparametric varying coefficient (NVC) models are useful for modeling time-varying effects on responses that are measured repeatedly for the same subjects. When the number of covariates is moderate or large, it is desirable to perform variable selection from the varying coefficient functions. However, existing methods for variable selection in NVC models either fail to account for within-subject correlations or require the practitioner to specify a parametric form for the correlation structure. In this paper, we introduce the nonparametric varying coefficient spike-and-slab lasso (NVC-SSL) for Bayesian high dimensional NVC models. Through the introduction of functional random effects, our method allows for flexible modeling of within-subject correlations without needing to specify a parametric covariance function. We further propose several scalable optimization and Markov chain Monte Carlo (MCMC) algorithms. For variable selection, we propose an Expectation Conditional Maximization (ECM) algorithm to rapidly obtain maximum a posteriori (MAP) estimates. Our ECM algorithm scales linearly in the total number of observations $N$ and the number of covariates $p$. For uncertainty quantification, we introduce an approximate MCMC algorithm that also scales linearly in both $N$ and $p$. We demonstrate the scalability, variable selection performance, and inferential capabilities of our method through simulations and a real data application. These algorithms are implemented in the publicly available R package NVCSSL on the Comprehensive R Archive Network.

ノンパラメトリック変動係数(NVC)モデルは、同じ被験者に対して繰り返し測定される応答に対する時間変動効果をモデル化するために役立ちます。共変量の数が中程度または多い場合、変動係数関数から変数選択を実行することが望ましいです。しかし、NVCモデルにおける変数選択の既存の方法は、被験者内相関を考慮に入れていないか、相関構造のパラメトリック形式を実務者が指定する必要があります。この論文では、ベイジアン高次元NVCモデル用のノンパラメトリック変動係数スパイクアンドスラブLasso (NVC-SSL)を紹介します。機能的ランダム効果の導入により、パラメトリック共分散関数を指定する必要なく、被験者内相関を柔軟にモデル化できます。さらに、スケーラブルな最適化およびマルコフ連鎖モンテカルロ(MCMC)アルゴリズムをいくつか提案します。変数選択については、最大事後確率(MAP)推定値を迅速に取得するための期待値条件付き最大化(ECM)アルゴリズムを提案します。ECMアルゴリズムは、観測の総数$N$と共変量数$p$に比例して増加します。不確実性の定量化のために、$N$と$p$の両方に比例して増加する近似MCMCアルゴリズムを導入します。シミュレーションと実際のデータアプリケーションを通じて、この方法のスケーラビリティ、変数選択パフォーマンス、推論機能を実証します。これらのアルゴリズムは、Comprehensive R Archive Networkで公開されているRパッケージNVCSSLに実装されています。

Multi-view Collaborative Gaussian Process Dynamical Systems
マルチビュー協調型ガウス過程力学系

Gaussian process dynamical systems (GPDSs) have shown their effectiveness in many tasks of machine learning. However, when they address multi-view data, current GPDSs do not explicitly model the dependence between private and shared latent variables. Instead, they introduce structurally and intrinsically discrete segmentation in the latent space. In this paper, we propose the multi-view collaborative Gaussian process dynamical systems (McGPDSs) model, which assumes that the private latent variable for each view is controlled by its dynamical prior and the shared latent variable. The relevance between private and shared latent variables can be automatically learned by optimization in the Bayesian framework. The model is capable of learning an effective latent representation and generating novel data of one view given data of the other view. We evaluate our model on two-view data sets, and our model obtains better performance compared with the state-of-the-art multi-view GPDSs.

ガウス過程力学系(GPDS)は、機械学習の多くのタスクでその有効性を示しています。ただし、マルチビューデータを扱う場合、現在のGPDSでは、プライベート潜在変数と共有潜在変数の間の依存関係は明示的にモデル化されません。代わりに、潜在空間に構造的かつ本質的に離散的なセグメンテーションを導入します。この論文では、各ビューのプライベート潜在変数がその動的事前分布と共有潜在変数によって制御されると仮定する、マルチビュー協調型ガウス過程力学系(McGPDS)モデルを提案します。プライベート潜在変数と共有潜在変数との関連性は、ベイジアンフレームワークでの最適化によって自動的に学習できます。このモデルは、効果的な潜在表現を学習し、他方のビューのデータが与えられたときに一方のビューの新規データを生成することができます。2ビューのデータセットでモデルを評価したところ、最先端のマルチビューGPDSと比較して優れた性能が得られました。

Fairlearn: Assessing and Improving Fairness of AI Systems
Fairlearn:AIシステムの公平性の評価と改善

Fairlearn is an open source project to help practitioners assess and improve fairness of artificial intelligence (AI) systems. The associated Python library, also named fairlearn, supports evaluation of a model’s output across affected populations and includes several algorithms for mitigating fairness issues. Grounded in the understanding that fairness is a sociotechnical challenge, the project integrates learning resources that aid practitioners in considering a system’s broader societal context.

Fairlearnは、実務家が人工知能(AI)システムの公平性を評価および改善するのに役立つオープンソースプロジェクトです。関連するPythonライブラリ(fairlearnとも呼ばれます)は、影響を受ける母集団全体でのモデルの出力の評価をサポートし、公平性の問題を軽減するためのいくつかのアルゴリズムが含まれています。公平性は社会技術的な課題であるという理解に基づいて、このプロジェクトは、実践者がシステムのより広範な社会的文脈を考慮するのに役立つ学習リソースを統合しています。
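
Evaluating a model's output across affected populations amounts to computing a metric separately for each group and comparing the results. The standard-library sketch below shows that idea for the selection rate. It does not use the Fairlearn API (the library's `MetricFrame` class covers this use case), and the function names and toy predictions are assumptions for illustration.

```python
from collections import defaultdict


def selection_rate_by_group(y_pred, groups):
    """Fraction of positive predictions within each group."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        counts[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / counts[group] for group in counts}


def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest group selection rates."""
    rates = selection_rate_by_group(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical predictions for eight individuals in two groups.
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rate_by_group(y_pred, groups)      # {'a': 0.75, 'b': 0.25}
gap = demographic_parity_difference(y_pred, groups)  # 0.5
```

A single number such as `gap` is only a starting point. As the abstract stresses, fairness is a sociotechnical question, so which groups and which metric matter depends on the deployment context.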

Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks
柱状建設的ネットワークを用いたスケーラブルなリアルタイム反復学習

Constructing states from sequences of observations is an important component of reinforcement learning agents. One solution for state construction is to use recurrent neural networks. Back-propagation through time (BPTT), and real-time recurrent learning (RTRL) are two popular gradient-based methods for recurrent learning. BPTT requires complete trajectories of observations before it can compute the gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the network for computationally efficient learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.

観測シーケンスから状態を構築することは、強化学習エージェントの重要なコンポーネントです。状態構築の1つのソリューションは、リカレントニューラルネットワークを使用することです。時間によるバックプロパゲーション(BPTT)とリアルタイムリカレント学習(RTRL)は、リカレント学習のための2つの一般的な勾配ベースの方法です。BPTTでは、勾配を計算する前に観測の完全な軌跡が必要であり、オンライン更新には適していません。RTRLはオンライン更新が可能ですが、大規模なネットワークへの拡張性は低くなります。この論文では、RTRLをスケーラブルにする2つの制約を提案します。ネットワークを独立したモジュールに分解するか、ネットワークを段階的に学習することで、RTRLをパラメーターの数に比例して拡張できることを示します。UOROやTruncated-BPTTなどの従来のスケーラブルな勾配推定アルゴリズムとは異なり、私たちのアルゴリズムは勾配推定にノイズやバイアスを追加しません。代わりに、ネットワークの機能容量と引き換えに計算効率の高い学習を行います。私たちは、動物の学習からヒントを得た予測ベンチマークと、Atari 2600ゲーム用の事前トレーニング済みポリシーのポリシー評価を行うことで、Truncated-BPTTに対する私たちのアプローチの有効性を実証します。
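
The online character of RTRL can be seen on a single scalar recurrent unit, used here as a stand-in for one independent column. The sensitivities of the hidden state to the parameters are carried forward step by step, so no stored trajectory is needed. This is a minimal sketch under assumed dynamics `h_t = tanh(w * h_{t-1} + u * x_t)`, with hypothetical names, and not the paper's architecture.

```python
import math


def run_column(w, u, inputs):
    """Run h_t = tanh(w * h_{t-1} + u * x_t) and track dh_t/dw, dh_t/du online.

    Returns the final hidden state and its exact sensitivities to w and u.
    """
    h, dh_dw, dh_du = 0.0, 0.0, 0.0
    for x in inputs:
        h_new = math.tanh(w * h + u * x)
        gate = 1.0 - h_new * h_new       # derivative of tanh at the pre-activation
        dh_dw = gate * (h + w * dh_dw)   # uses h_{t-1} and the previous sensitivity
        dh_du = gate * (x + w * dh_du)
        h = h_new
    return h, dh_dw, dh_du


inputs = [0.5, -1.0, 0.25, 0.8]
h_final, grad_w, grad_u = run_column(0.7, -0.3, inputs)
```

Each column keeps one sensitivity per parameter, so the cost of the update grows linearly with the number of parameters when columns do not interact. Letting columns feed into each other would bring back the cross-sensitivities that make full RTRL expensive.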

Torchhd: An Open Source Python Library to Support Research on Hyperdimensional Computing and Vector Symbolic Architectures
Torchhd: 超次元コンピューティングとベクトル記号アーキテクチャの研究を支援するオープンソースのPythonライブラリ

Hyperdimensional computing (HD), also known as vector symbolic architectures (VSA), is a framework for computing with distributed representations by exploiting properties of random high-dimensional vector spaces. The commitment of the scientific community to aggregate and disseminate research in this particularly multidisciplinary area has been fundamental for its advancement. Joining these efforts, we present Torchhd, a high-performance open source Python library for HD/VSA. Torchhd seeks to make HD/VSA more accessible and serves as an efficient foundation for further research and application development. The easy-to-use library builds on top of PyTorch and features state-of-the-art HD/VSA functionality, clear documentation, and implementation examples from well-known publications. Comparing publicly available code with their corresponding Torchhd implementation shows that experiments can run up to 100x faster. Torchhd is available at: https://github.com/hyperdimensional-computing/torchhd.

ハイパーディメンショナルコンピューティング(HD)は、ベクトルシンボリックアーキテクチャ(VSA)とも呼ばれ、ランダムな高次元ベクトル空間の特性を利用して分散表現でコンピューティングを行うフレームワークです。この特に学際的な分野で研究を集約し、普及させるという科学コミュニティの取り組みは、この分野の発展にとって不可欠でした。これらの取り組みに加わり、私たちはHD/VSA用の高性能オープンソースPythonライブラリであるTorchhdを紹介します。Torchhdは、HD/VSAをよりアクセスしやすくすることを目指しており、さらなる研究とアプリケーション開発のための効率的な基盤として機能します。使いやすいライブラリはPyTorch上に構築されており、最先端のHD/VSA機能、わかりやすいドキュメント、有名な出版物からの実装例を備えています。公開されているコードと対応するTorchhd実装を比較すると、実験を最大100倍高速に実行できることがわかります。Torchhdはhttps://github.com/hyperdimensional-computing/torchhdで入手できます。

skrl: Modular and Flexible Library for Reinforcement Learning
skrl: 強化学習のためのモジュール式で柔軟なライブラリ

skrl is an open-source modular library for reinforcement learning written in Python and designed with a focus on readability, simplicity, and transparency of algorithm implementations. In addition to supporting environments that use the traditional interfaces from OpenAI Gym/Farama Gymnasium, DeepMind and others, it provides the facility to load, configure, and operate NVIDIA Isaac Gym, Isaac Orbit, and Omniverse Isaac Gym environments. Furthermore, it enables the simultaneous training of several agents with customizable scopes (subsets of environments among all available ones), which may or may not share resources, in the same run. The library’s documentation can be found at https://skrl.readthedocs.io and its source code is available on GitHub at https://github.com/Toni-SM/skrl.

skrlは、Pythonで記述され、アルゴリズム実装の可読性、シンプルさ、透明性に重点を置いて設計された、強化学習用のオープンソースのモジュール式ライブラリです。OpenAI Gym/Farama Gymnasium、DeepMindなどの従来のインターフェイスを使用する環境をサポートするだけでなく、NVIDIA Isaac Gym、Isaac Orbit、Omniverse Isaac Gym環境をロード、構成、運用する機能を提供します。さらに、カスタマイズ可能なスコープ(利用可能なすべての環境のうちの一部)を持つ複数のエージェントを、同一の実行内で同時にトレーニングできます。これらのエージェントは、リソースを共有する場合と共有しない場合があります。ライブラリのドキュメントはhttps://skrl.readthedocs.ioにあり、そのソースコードはGitHubのhttps://github.com/Toni-SM/skrlで入手できます。

Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model
176B パラメータ言語モデルである BLOOM のカーボンフットプリントの推定

Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires computational resources, energy and materials. In the present article, we aim to quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle. We estimate that BLOOM’s final training emitted approximately 24.7 tonnes of CO2eq if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes ranging from equipment manufacturing to energy-based operational consumption. We also carry out an empirical study to measure the energy requirements and carbon emissions of its deployment for inference via an API endpoint receiving user queries in real-time. We conclude with a discussion regarding the difficulty of precisely estimating the carbon footprint of ML models and future research directions that can contribute towards improving carbon emissions reporting.

機械学習(ML)の進歩は、MLモデルのトレーニングに計算リソース、エネルギー、材料が必要であることを考えると、環境にコストがかかります。この論文では、1,760億のパラメータ言語モデルであるBLOOMのライフサイクル全体でのカーボンフットプリントを定量化することを目指しています。BLOOMの最終訓練では、動的消費電力のみを考慮すると約24.7トン、機器製造からエネルギーベースの運用消費までを対象とすると約50.5トンのCO2eqを排出したと試算しています。また、ユーザークエリをリアルタイムで受信するAPIエンドポイントを介した推論のために、そのデプロイのエネルギー要件と炭素排出量を測定するための実証研究も実施します。最後に、MLモデルのカーボンフットプリントを正確に推定することの難しさと、カーボン排出量レポートの改善に貢献できる将来の研究の方向性について議論します。

Adaptive False Discovery Rate Control with Privacy Guarantee
プライバシー保証付きの適応型誤検出率制御

Differentially private multiple testing procedures can protect the information of individuals used in hypothesis tests while guaranteeing a small fraction of false discoveries. In this paper, we propose a differentially private adaptive FDR control method that can control the classic FDR metric exactly at a user-specified level $\alpha$ with a privacy guarantee, which is a non-trivial improvement compared to the differentially private Benjamini-Hochberg method proposed in Dwork et al. (2021). Our analysis is based on two key insights: 1) a novel $p$-value transformation that preserves both privacy and the mirror conservative property, and 2) a mirror peeling algorithm that allows the construction of the filtration and application of the optimal stopping technique. Numerical studies demonstrate that the proposed DP-AdaPT performs better compared to the existing differentially private FDR control methods. Compared to the non-private AdaPT, it incurs a small accuracy loss but significantly reduces the computation cost.

差分プライバシーを満たす多重検定手順は、仮説検定で使用される個人の情報を保護しながら、誤検出の割合が小さいことを保証できます。この論文では、プライバシー保証付きで、古典的なFDR指標をユーザー指定のレベル$\alpha$で正確に制御できる差分プライベートな適応型FDR制御方法を提案します。これは、Dworkら(2021)で提案された差分プライベートなBenjamini-Hochberg法と比較して、自明でない改善です。私たちの分析は、2つの重要な洞察に基づいています。1)プライバシーとミラー保守性(mirror conservative property)の両方を維持する新しい$p$値変換、および2)フィルトレーションの構築と最適停止手法の適用を可能にするミラーピーリング(mirror peeling)アルゴリズムです。数値研究では、提案されたDP-AdaPTが、既存の差分プライベートなFDR制御方法よりも優れた性能を示すことが実証されています。非プライベートなAdaPTと比較すると、精度の低下はわずかですが、計算コストが大幅に削減されます。
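
For orientation, the classical non-private Benjamini-Hochberg step-up procedure, which the private methods mentioned above build on or compare against, can be written in a few lines. This sketch is only that textbook baseline, with no privacy mechanism and no connection to the DP-AdaPT algorithm itself. The function name and the p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule.

    The rule finds the largest rank r with p_(r) <= alpha * r / m and rejects
    the r hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values: the smallest one misses its own threshold (0.05 / 3),
# but the second smallest meets its threshold (0.05 * 2 / 3), so both are rejected.
rejected = benjamini_hochberg([0.02, 0.03, 0.5], alpha=0.05)
```

The step-up behaviour in this example, where a later rank rescues an earlier one, is what distinguishes BH from a simple per-test threshold.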

Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas:検索拡張言語モデルによる少数ショット学習

Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval-augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters.

大規模言語モデルは、さまざまなタスクで印象的な少数ショットの結果を示しています。しかし、質問応答やファクトチェックなどのタスクのように、知識がそのような結果の鍵となる場合、知識を保存するために膨大なパラメータ数が必要になるようです。検索拡張モデルは、それほど多くのパラメータを必要とせずに知識集約型のタスクに優れていることが知られていますが、少数ショット設定で機能するかどうかは不明です。本研究では、慎重に設計され、事前学習された検索拡張言語モデルであるAtlasを紹介します。Atlasは、非常に少ないトレーニング例で知識集約型のタスクを学習できます。MMLU、KILT、Natural Questionsなど幅広いタスクで評価を行い、ドキュメントインデックスの内容が与える影響を検討し、インデックスを簡単に更新できることを示しています。特に、Atlasはわずか64の例を使用してNatural Questionsで42%以上の精度を達成し、パラメータが50倍少ないにもかかわらず、540Bパラメータモデルを3%上回っています。

Convex Reinforcement Learning in Finite Trials
有限試行における凸強化学習

Convex Reinforcement Learning (RL) is a recently introduced framework that generalizes the standard RL objective to any convex (or concave) function of the state distribution induced by the agent’s policy. This framework subsumes several applications of practical interest, such as pure exploration, imitation learning, and risk-averse RL, among others. However, the previous convex RL literature implicitly evaluates the agent’s performance over infinite realizations (or trials), while most of the applications require excellent performance over a handful, or even just one, trials. To meet this practical demand, we formulate convex RL in finite trials, where the objective is any convex function of the empirical state distribution computed over a finite number of realizations. In this paper, we provide a comprehensive theoretical study of the setting, which includes an analysis of the importance of non-Markovian policies to achieve optimality, as well as a characterization of the computational and statistical complexity of the problem in various configurations.

凸強化学習(RL)は、最近導入されたフレームワークで、標準的なRL目的を、エージェントのポリシーによって誘導される状態分布の任意の凸(または凹)関数に一般化します。このフレームワークには、純粋探索、模倣学習、リスク回避RLなど、実用上の関心のあるいくつかのアプリケーションが含まれます。ただし、これまでの凸強化学習の文献では、エージェントのパフォーマンスを無限の実現(または試行)にわたって暗黙的に評価していますが、ほとんどのアプリケーションでは、少数の試行、または1つの試行で優れたパフォーマンスが求められます。この実用上の要求を満たすために、有限試行で凸強化学習を定式化します。この場合、目的は、有限数の実現で計算された経験的状態分布の任意の凸関数です。この論文では、設定の包括的な理論的研究を提供します。これには、最適性を達成するための非マルコフポリシーの重要性の分析、およびさまざまな構成における問題の計算と統計の複雑さの特性評価が含まれます。
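
The gap between the infinite-trials and finite-trials objectives follows from Jensen's inequality and shows up in a two-state toy case. Take the entropy of the state distribution as the objective (a concave function, as in pure exploration) and suppose each trial stays entirely in one state, chosen with probability 1/2. The example and all names below are assumptions for illustration and are not taken from the paper.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


# Empirical state distribution observed in a single trial, and its probability.
trial_distributions = [(1.0, 0.0), (0.0, 1.0)]
trial_probabilities = [0.5, 0.5]

# Infinite-trials objective: apply F to the expected state distribution.
expected_distribution = tuple(
    sum(prob * dist[s] for prob, dist in zip(trial_probabilities, trial_distributions))
    for s in range(2)
)
infinite_trials_value = entropy(expected_distribution)

# Single-trial objective: expectation of F applied to each trial's distribution.
single_trial_value = sum(
    prob * entropy(dist) for prob, dist in zip(trial_probabilities, trial_distributions)
)
```

Here `infinite_trials_value` equals `log 2` while `single_trial_value` is zero: the behaviour looks perfectly exploratory on average, yet no individual trial explores at all. A behaviour that alternates between the two states within a trial would score well under both objectives.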

Unbiased Multilevel Monte Carlo Methods for Intractable Distributions: MLMC Meets MCMC
難解な分布のための不偏マルチレベルモンテカルロ法:MLMCとMCMCの出会い

Constructing unbiased estimators from Markov chain Monte Carlo (MCMC) outputs is a difficult problem that has recently received a lot of attention in the statistics and machine learning communities. However, the current unbiased MCMC framework only works when the quantity of interest is an expectation, which excludes many practical applications. In this paper, we propose a general method for constructing unbiased estimators for functions of expectations and extend it to construct unbiased estimators for nested expectations. Our approach combines and generalizes the unbiased MCMC and Multilevel Monte Carlo (MLMC) methods. In contrast to traditional sequential methods, our estimator can be implemented on parallel processors. We show that our estimator has a finite variance and computational complexity and can achieve $\varepsilon$-accuracy within the optimal $O(1/\varepsilon^2)$ computational cost under mild conditions. Numerical experiments confirm our theoretical findings and demonstrate the benefits of unbiased estimators in the massively parallel regime.

マルコフ連鎖モンテカルロ(MCMC)出力から不偏推定量を構築することは、統計学や機械学習のコミュニティで最近多くの注目を集めている難しい問題です。しかし、現在の不偏MCMCフレームワークは、関心のある量が期待値である場合にのみ機能し、多くの実用的なアプリケーションが除外されます。この論文では、期待値の関数の不偏推定量を構築するための一般的な方法を提案し、それを拡張して、入れ子になった期待値の不偏推定量を構築します。私たちのアプローチは、不偏MCMC法とマルチレベルモンテカルロ(MLMC)法を組み合わせて一般化します。従来の逐次法とは対照的に、私たちの推定量は並列プロセッサに実装できます。私たちの推定量は有限の分散と計算量を持ち、穏やかな条件下で最適な$O(1/\varepsilon^2)$計算コスト内で$\varepsilon$精度を達成できることを示します。数値実験は理論的発見を確認し、超並列環境での不偏推定量の利点を実証します。

Improving multiple-try Metropolis with local balancing
ローカルバランシングによるマルチトライメトロポリスの改善

Multiple-try Metropolis (MTM) is a popular Markov chain Monte Carlo method with the appealing feature of being amenable to parallel computing. At each iteration, it samples several candidates for the next state of the Markov chain and randomly selects one of them based on a weight function. The canonical weight function is proportional to the target density. We show both theoretically and empirically that this weight function induces pathological behaviours in high dimensions, especially during the convergence phase. We propose to instead use weight functions akin to the locally-balanced proposal distributions of Zanella (2020), thus yielding MTM algorithms that do not exhibit those pathological behaviours. To theoretically analyse these algorithms, we study the high-dimensional performance of ideal schemes that can be thought of as MTM algorithms which sample an infinite number of candidates at each iteration, as well as the discrepancy between such schemes and the MTM algorithms which sample a finite number of candidates. Our analysis unveils a strong distinction between the convergence and stationary phases: in the former, local balancing is crucial and effective to achieve fast convergence, while in the latter, the canonical and novel weight functions yield similar performance. Numerical experiments include an application in precision medicine involving a computationally-expensive forward model, which makes the use of parallel computing within MTM iterations beneficial.

マルチトライメトロポリス(MTM)は、並列計算に適しているという魅力的な機能を備えた、人気の高いマルコフ連鎖モンテカルロ法です。各反復で、マルコフ連鎖の次の状態の候補をいくつかサンプリングし、重み関数に基づいてランダムに1つを選択します。標準重み関数は、ターゲット密度に比例します。この重み関数は、特に収束フェーズ中に、高次元で病的な動作を引き起こすことを理論的および経験的に示します。代わりに、Zanella (2020)のローカルバランス提案分布に似た重み関数を使用することを提案します。これにより、病的な動作を示さないMTMアルゴリズムが生成されます。これらのアルゴリズムを理論的に分析するために、各反復で無限の数の候補をサンプリングするMTMアルゴリズムと考えられる理想的なスキームの高次元パフォーマンス、およびそのようなスキームと有限の数の候補をサンプリングするMTMアルゴリズムとの相違を調べます。私たちの分析により、収束段階と定常段階の間に明確な違いがあることが明らかになりました。前者では、ローカルバランシングが高速収束の達成に重要かつ効果的ですが、後者では、標準の重み関数と新しい重み関数が同様のパフォーマンスをもたらします。数値実験には、計算コストの高いフォワードモデルを含む精密医療への応用が含まれており、MTM反復内での並列コンピューティングの使用が有益になります。

Importance Sparsification for Sinkhorn Algorithm
シンクホーンアルゴリズムの重要度スパーシフィケーション

Sinkhorn algorithm has been used pervasively to approximate the solution to optimal transport (OT) and unbalanced optimal transport (UOT) problems. However, its practical application is limited due to the high computational complexity. To alleviate the computational burden, we propose a novel importance sparsification method, called Spar-Sink, to efficiently approximate entropy-regularized OT and UOT solutions. Specifically, our method employs natural upper bounds for unknown optimal transport plans to establish effective sampling probabilities, and constructs a sparse kernel matrix to accelerate Sinkhorn iterations, reducing the computational cost of each iteration from $O(n^2)$ to $\widetilde{O}(n)$ for a sample of size $n$. Theoretically, we show the proposed estimators for the regularized OT and UOT problems are consistent under mild regularity conditions. Experiments on various synthetic data demonstrate Spar-Sink outperforms mainstream competitors in terms of both estimation error and speed. A real-world echocardiogram data analysis shows Spar-Sink can effectively estimate and visualize cardiac cycles, from which one can identify heart failure and arrhythmia. To evaluate the numerical accuracy of cardiac cycle prediction, we consider the task of predicting the end-systole time point using the end-diastole one. Results show Spar-Sink performs as well as the classical Sinkhorn algorithm, requiring significantly less computational time.

Sinkhornアルゴリズムは、最適輸送(OT)問題および不均衡最適輸送(UOT)問題の解を近似するために広く使用されています。ただし、計算量が大きいため、実際の適用は限られています。計算負荷を軽減するために、エントロピー正則化OTおよびUOTの解を効率的に近似する、Spar-Sinkと呼ばれる新しい重要度スパース化手法を提案します。具体的には、この手法では、未知の最適輸送計画に対する自然な上界を使用して効果的なサンプリング確率を定め、スパースなカーネル行列を構築してSinkhorn反復を高速化し、各反復の計算コストをサンプルサイズ$n$に対して$O(n^2)$から$\widetilde{O}(n)$に削減します。理論的には、正則化されたOTおよびUOT問題に対して提案された推定量が、穏やかな正則性条件の下で一致性を持つことを示します。さまざまな合成データでの実験では、Spar-Sinkが推定誤差と速度の両方の点で主流の競合手法よりも優れていることが実証されています。実際の心エコー図データ分析では、Spar-Sinkが心周期を効果的に推定して視覚化できることが示されており、そこから心不全や不整脈を特定することができます。心周期予測の数値的精度を評価するために、拡張末期の時点を使用して収縮末期の時点を予測するタスクを検討します。結果は、Spar-Sinkが従来のSinkhornアルゴリズムと同等のパフォーマンスを発揮しつつ、計算時間を大幅に短縮することを示しています。
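
For orientation, the classical (dense) Sinkhorn iteration that Spar-Sink accelerates can be sketched as follows. This is a minimal version of the standard algorithm for entropy-regularized OT, not the Spar-Sink method itself; the function name `sinkhorn` and the parameters `eps` and `n_iter` are illustrative assumptions.

```python
import numpy as np


def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """Approximate the entropy-regularized OT plan between histograms a and b.

    a: (n,) source weights, b: (m,) target weights, C: (n, m) cost matrix.
    """
    K = np.exp(-C / eps)          # dense kernel matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        # Each update is a dense matrix-vector product, hence O(n^2) per iteration.
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Transport plan: diag(u) K diag(v).
    return u[:, None] * K * v[None, :]
```

The two matrix-vector products are the $O(n^2)$ cost per iteration mentioned in the abstract; replacing `K` with a sparse sampled kernel is what reduces it.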

Graph Attention Retrospective
グラフ・アテンション・レトロスペクティブ

Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an “easy” regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification. In the “hard” regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classification, we provide a positive result on graph attention’s robustness against structural noise in the graph. In particular, our robustness result implies that graph attention can be strictly better than both the simple graph convolution and the best linear classifier of node features. We evaluate our theoretical results on synthetic and real-world data.

グラフベースの学習は、急速に成長している機械学習のサブフィールドであり、ソーシャルネットワーク、引用ネットワーク、バイオインフォマティクスに応用されています。最も人気のあるモデルの1つがグラフアテンションネットワークです。これは、ノードの隣接ノードを区別しない単純なグラフ畳み込みとは対照的に、ノードが隣接ノードの特徴から非均一な方法で情報を集約できるようにするために導入されました。この論文では、グラフアテンションネットワークの動作を理論的に研究します。文脈的確率ブロックモデルにおけるノード分類の問題に対するグラフアテンションメカニズムのパフォーマンスに関する複数の結果を証明します。ここで、ノードの特徴は混合ガウス分布から、エッジは確率ブロックモデルから得られます。ガウス分布の平均間の距離が十分に大きい「簡単な」状況では、グラフアテンションはクラス間エッジとクラス内エッジを区別できることを示します。したがって、重要なエッジの重みは維持され、重要でないエッジの重みは大幅に削減されます。その結果、これが完全なノード分類を意味することを示します。「困難な」状況では、すべてのアテンションメカニズムがクラス内エッジとクラス間エッジを区別できないことを示します。さらに、クラス内エッジをクラス間エッジから分離できたとしても、グラフアテンション畳み込みは(ほぼ)完全にはノードを分類できないことを示します。完全なノード分類にとどまらず、グラフ内の構造ノイズに対するグラフアテンションの堅牢性について肯定的な結果を提供します。特に、この堅牢性の結果は、グラフアテンションが単純なグラフ畳み込みとノード特徴の最良の線形分類器の両方よりも厳密に優れ得ることを意味します。合成データと実世界のデータで理論上の結果を評価します。

Confidence Intervals and Hypothesis Testing for High-dimensional Quantile Regression: Convolution Smoothing and Debiasing
高次元分位点回帰の信頼区間と仮説検定: 畳み込み平滑化とバイアス除去

$\ell_1$-penalized quantile regression ($\ell_1$-QR) is a useful tool for modeling the relationship between input and output variables when detecting heterogeneous effects in the high-dimensional setting. Hypothesis tests can then be formulated based on the debiased $\ell_1$-QR estimator that reduces the bias induced by Lasso penalty. However, the non-smoothness of the quantile loss brings great challenges to the computation, especially when the data dimension is high. Recently, the convolution-type smoothed quantile regression (SQR) model has been proposed to overcome such shortcoming, and people developed theory of estimation and variable selection therein. In this work, we combine the debiased method with SQR model and come up with the debiased $\ell_1$-SQR estimator, based on which we then establish confidence intervals and hypothesis testing in the high-dimensional setup. Theoretically, we provide the non-asymptotic Bahadur representation for our proposed estimator and also the Berry-Esseen bound, which implies the empirical coverage rates for the studentized confidence intervals. Furthermore, we build up the theory of hypothesis testing on both a single variable and a group of variables. Finally, we exhibit extensive numerical experiments on both simulated and real data to demonstrate the good performance of our method.

$\ell_1$ペナルティ付き分位点回帰($\ell_1$-QR)は、高次元設定で異質な効果を検出する際に、入力変数と出力変数の関係をモデル化するための便利なツールです。仮説検定は、Lassoペナルティによって引き起こされるバイアスを軽減する、バイアス除去された$\ell_1$-QR推定量に基づいて策定できます。ただし、分位点損失の非平滑性は、特にデータ次元が高い場合に、計算に大きな課題をもたらします。最近、畳み込み型平滑化分位点回帰(SQR)モデルがこのような欠点を克服するために提案され、推定と変数選択の理論が開発されました。この研究では、バイアス除去法とSQRモデルを組み合わせて、バイアス除去された$\ell_1$-SQR推定量を考案し、それに基づいて高次元設定で信頼区間と仮説検定を確立します。理論的には、提案する推定量に対する非漸近的Bahadur表現と、スチューデント化された信頼区間の実証的カバレッジ率を示すBerry-Esseen境界を提供します。さらに、単一の変数と変数のグループの両方に対する仮説検定の理論を構築します。最後に、シミュレーションデータと実際のデータの両方で広範な数値実験を行い、この方法の優れたパフォーマンスを実証します。

Selection by Prediction with Conformal p-values
共形 p 値による予測による選択

Decision making or scientific discovery pipelines such as job hiring and drug discovery often involve multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop a method that wraps around any prediction model to produce a subset of candidates while controlling the proportion of falsely selected units. Building upon the conformal inference framework, our method first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. In many cases, the procedure selects candidates whose predictions are above a data-dependent threshold. Our theoretical guarantee holds under mild exchangeability conditions on the samples, generalizing existing results on multiple conformal p-values. We demonstrate the empirical performance of our method via simulations, and apply it to job hiring and drug discovery datasets.

採用や創薬などの意思決定や科学的発見のパイプラインは、多くの場合、複数の段階を伴います。リソースを大量に消費するステップの前に、多くの場合、機械学習モデルからの予測を使用して、大規模なプールから少数の候補者を絞り込む初期スクリーニングが行われます。私たちは、観測されていない結果がユーザー指定の値を超える候補者を選択することを目的としたスクリーニング手順を研究します。私たちは、誤って選択されたユニットの割合を制御しながら、任意の予測モデルをラップして候補のサブセットを生成する方法を開発します。共形推論フレームワークに基づいて、私たちの方法は、まず大きな結果の統計的証拠を定量化するp値を構築します。次に、p値を多重検定の文献で導入されたしきい値と比較して、候補者リストを決定します。多くの場合、この手順は、予測がデータ依存のしきい値を超える候補者を選択します。私たちの理論的な保証は、サンプルの軽度の交換可能性条件下で保持され、複数の共形p値に関する既存の結果を一般化します。私たちは、シミュレーションによって私たちの方法の実証的なパフォーマンスを実証し、それを採用と創薬データセットに適用します。
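
The two-stage procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the score is taken to be monotone in the outcome (for example, outcome minus prediction on calibration data and threshold minus prediction on test data), the p-values are the deterministic (non-randomized) variant, and the multiple-testing threshold is assumed to be Benjamini-Hochberg. The function names are hypothetical.

```python
import numpy as np


def conformal_pvalues(calib_scores, test_scores):
    """p_j = (1 + #{i : V_i < V_hat_j}) / (n + 1).

    calib_scores: V_i = y_i - mu(x_i) on labeled calibration data.
    test_scores:  V_hat_j = c_j - mu(x_j), with c_j the user-specified threshold.
    A small p-value is evidence that the unobserved outcome exceeds c_j.
    """
    calib_sorted = np.sort(np.asarray(calib_scores, dtype=float))
    n = calib_sorted.shape[0]
    # side="left" counts calibration scores strictly below each test score.
    counts = np.searchsorted(calib_sorted, test_scores, side="left")
    return (1 + counts) / (n + 1)


def benjamini_hochberg(pvals, q):
    """Return the (sorted) indices selected by the BH procedure at level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.shape[0]
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])   # largest rank passing its threshold
    return np.sort(order[: k + 1])
```

A shortlist would then be obtained as `benjamini_hochberg(conformal_pvalues(calib_scores, test_scores), q)`, where `q` is the target proportion of falsely selected units.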

Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics
アルファダイバージェンス変分推論と重要度重み付きオートエンコーダの出会い: 方法論と漸近論

Several algorithms involving the Variational Rényi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the importance weighted auto-encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.

変分Rényi (VR)境界を用いるいくつかのアルゴリズムが、ターゲットの事後分布と変分分布との間のアルファダイバージェンスを最小化するために提案されています。有望な経験的結果にもかかわらず、これらのアルゴリズムはバイアスのある確率的勾配降下法に頼っているため、理論的な保証がありません。この論文では、重要度重み付きオートエンコーダ(IWAE)境界の一般化であるVR-IWAE境界を定式化し、研究します。VR-IWAE境界はいくつかの望ましい特性を持ち、特に再パラメータ化されたケースではVR境界と同じ確率的勾配降下手順につながりますが、今回は不偏な勾配推定量に基づいていることを示します。次に、VR-IWAE境界、したがって標準のIWAE境界について、2つの相補的な理論分析を提供します。これらの分析は、これらの境界の利点、またはその欠如を明らかにします。最後に、トイ例と実データの例で私たちの理論的な主張を例証します。

Sparse Graph Learning from Spatiotemporal Time Series
時空間時系列からのスパースグラフ学習

Outstanding achievements of graph neural networks for spatiotemporal time series analysis show that relational constraints introduce an effective inductive bias into neural forecasting architectures. Often, however, the relational information characterizing the underlying data-generating process is unavailable and the practitioner is left with the problem of inferring from data which relational graph to use in the subsequent processing stages. We propose novel, principled – yet practical – probabilistic score-based methods that learn the relational dependencies as distributions over graphs while maximizing end-to-end the performance at task. The proposed graph learning framework is based on consolidated variance reduction techniques for Monte Carlo score-based gradient estimation, is theoretically grounded, and, as we show, effective in practice. In this paper, we focus on the time series forecasting problem and show that, by tailoring the gradient estimators to the graph learning problem, we are able to achieve state-of-the-art performance while controlling the sparsity of the learned graph and the computational scalability. We empirically assess the effectiveness of the proposed method on synthetic and real-world benchmarks, showing that the proposed solution can be used as a stand-alone graph identification procedure as well as a graph learning component of an end-to-end forecasting architecture.

時空間時系列分析のためのグラフニューラルネットワークの優れた成果は、関係制約がニューラル予測アーキテクチャに効果的な帰納的バイアスを導入することを示しています。ただし、多くの場合、基礎となるデータ生成プロセスを特徴付ける関係情報は利用できず、実務者は、後続の処理段階でどの関係グラフを使用するかをデータから推測するという問題を抱えています。私たちは、タスクでエンドツーエンドのパフォーマンスを最大化しながら、関係依存関係をグラフ上の分布として学習する、新しい、原理的でありながら実用的な確率スコアベースの方法を提案します。提案されたグラフ学習フレームワークは、モンテカルロスコアベースの勾配推定のための統合された分散削減手法に基づいており、理論的に根拠があり、私たちが示すように、実際に効果的です。この論文では、時系列予測の問題に焦点を当て、勾配推定器をグラフ学習問題に合わせて調整することで、学習したグラフのスパース性と計算のスケーラビリティを制御しながら、最先端のパフォーマンスを実現できることを示します。私たちは、合成ベンチマークと実世界のベンチマークで提案手法の有効性を経験的に評価し、提案ソリューションがスタンドアロンのグラフ識別手順としてだけでなく、エンドツーエンドの予測アーキテクチャのグラフ学習コンポーネントとしても使用できることを示しています。

Improved Powered Stochastic Optimization Algorithms for Large-Scale Machine Learning
大規模機械学習のための改良型べき乗確率的最適化アルゴリズム

Stochastic optimization, especially stochastic gradient descent (SGD), is now the workhorse for the vast majority of problems in machine learning. Various strategies, e.g., control variates, adaptive learning rates, and momentum techniques, have been developed to improve canonical SGD, which suffers from a low convergence rate and poor generalization in practice. Most of these strategies improve SGD either by controlling the updating direction (e.g., gradient descent or gradient ascent direction) or by manipulating the learning rate. Along these two lines, this work first develops and analyzes a novel type of improved powered stochastic gradient descent algorithm from the perspective of variance reduction, where the updating direction is determined by the Powerball function. Additionally, to bridge the gap between powered stochastic optimization (PSO) and the learning rate, which is still an open problem for PSO, we propose an adaptive mechanism for updating the learning rate that resorts to a Barzilai-Borwein (BB)-like scheme, not only for the proposed algorithm, but also for classical PSO algorithms. The theoretical properties of the resulting algorithms for non-convex optimization problems are technically analyzed. Empirical tests using various benchmark data sets indicate the efficiency and robustness of our proposed algorithms.

確率的最適化、特に確率的勾配降下法(SGD)は、現在、機械学習における大多数の問題に対する主力です。制御変量、適応学習率、モメンタム法などのさまざまな戦略が、収束率が低く、実際には一般化が不十分な標準SGDを改善するために開発されてきました。これらの戦略のほとんどは、更新方向(勾配降下法または勾配上昇方向など)を制御するか、学習率を操作することでSGDを改善します。この2つの方針に沿って、本研究ではまず、分散削減の観点から、更新方向がパワーボール関数によって決定される、新しいタイプの改良されたべき乗確率的勾配降下法アルゴリズムを開発および分析します。さらに、PSOと学習率の間のギャップを埋めるために(PSOにとってはまだ未解決の問題です)、提案アルゴリズムだけでなく従来のPSOアルゴリズムにも、Barzilai-Borwein (BB)のようなスキームを利用する学習率更新の適応メカニズムを提案します。非凸最適化問題に対する結果として得られるアルゴリズムの理論的特性は技術的に分析されます。さまざまなベンチマークデータセットを使用した実証テストにより、提案アルゴリズムの効率性と堅牢性が示されます。
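
As a small illustration of how the Powerball function determines the updating direction, the sketch below applies the element-wise transform $\sigma_\gamma(z) = \mathrm{sign}(z)\,|z|^\gamma$ to a gradient before a plain SGD step. The variance-reduction and Barzilai-Borwein components of the paper are not reproduced here; the function names and the fixed learning rate are illustrative assumptions.

```python
import numpy as np


def powerball(z, gamma):
    """Element-wise Powerball transform sign(z) * |z|**gamma, with 0 < gamma <= 1."""
    return np.sign(z) * np.abs(z) ** gamma


def powered_sgd_step(x, grad, lr, gamma):
    """One powered SGD step: move along the Powerball-transformed gradient."""
    return x - lr * powerball(grad, gamma)
```

With `gamma = 1` the step reduces to ordinary SGD; smaller values of `gamma` compress large gradient entries and amplify small ones.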

PaLM: Scaling Language Modeling with Pathways
PaLM:パスウェイによる言語モデリングのスケーリング

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

大規模言語モデルは、少数ショット学習を使用してさまざまな自然言語タスクで優れたパフォーマンスを発揮することが示されています。これにより、モデルを特定のアプリケーションに適応させるために必要なタスク固有のトレーニング例の数が大幅に削減されます。規模が少数ショット学習に与える影響をさらに理解するために、5,400億のパラメータを持つ高密度にアクティブ化されたTransformer言語モデルをトレーニングしました。これをPathways Language Model (PaLM)と呼んでいます。複数のTPUポッドにわたる非常に効率的なトレーニングを可能にする新しいMLシステムであるPathwaysを使用して、6144個のTPU v4チップでPaLMをトレーニングしました。数百の言語理解および生成ベンチマークで最先端の少数ショット学習結果を達成することで、スケーリングの継続的なメリットを実証しています。これらのタスクの多くで、PaLM 540Bは画期的なパフォーマンスを達成し、一連のマルチステップ推論タスクで微調整された最先端のパフォーマンスを上回り、最近リリースされたBIG-benchベンチマークで平均的な人間のパフォーマンスを上回りました。多数のBIG-benchタスクは、モデル規模によって不連続な改善を示しました。つまり、最大モデルにスケールするとパフォーマンスが急激に向上したということです。PaLMは多言語タスクとソースコード生成にも強力な機能を備えており、さまざまなベンチマークでそれを実証しています。さらに、バイアスと毒性に関する包括的な分析を提供し、モデル規模に関するトレーニングデータの記憶の範囲を調査します。最後に、大規模言語モデルに関連する倫理的考慮事項について説明し、潜在的な緩和戦略について説明します。

Leaky Hockey Stick Loss: The First Negatively Divergent Margin-based Loss Function for Classification
リーキーホッケースティック損失:分類のための最初の負の発散マージンベースの損失関数

Many modern classification algorithms are formulated through the regularized empirical risk minimization (ERM) framework, where the risk is defined based on a loss function. We point out that although the loss function in decision theory is non-negative by definition, the non-negativity of the loss function in ERM is not necessary to be classification-calibrated and to produce a Bayes consistent classifier. We introduce the leaky hockey stick loss (LHS loss), the first negatively divergent margin-based loss function. We prove that the LHS loss is classification-calibrated. When the hinge loss is replaced with the LHS loss in the ERM approach for deriving the kernel support vector machine (SVM), the corresponding optimization problem has a well-defined solution named the kernel leaky hockey stick classifier (LHS classifier). Under mild regularity conditions, we prove that the kernel LHS classifier is Bayes risk consistent. In our theoretical analysis, we overcome multiple challenges caused by the negative divergence of the LHS loss that does not exist in the analysis of the usual kernel machines. For a numerical demonstration, we provide a computationally efficient algorithm to solve the kernel LHS classifier and compare it to the kernel SVM on simulated data and fifteen benchmark data sets. To conclude this work, we further present a class of negatively divergent margin-based loss functions that have similar theoretical properties to those of the LHS loss. Interestingly, the LHS loss can be viewed as a limiting case of this family of negatively divergent margin-based loss functions.

多くの最新の分類アルゴリズムは、リスクが損失関数に基づいて定義される、正規化された経験的リスク最小化(ERM)フレームワークを通じて定式化されます。決定理論の損失関数は定義により非負ですが、ERMの損失関数の非負性は、分類調整されてベイズ整合分類子を生成するために必要ではないことを指摘します。負に発散する最初のマージンベースの損失関数である、リーキーホッケースティック損失(LHS損失)を紹介します。LHS損失が分類調整されていることを証明します。カーネルサポートベクターマシン(SVM)を導出するためのERMアプローチでヒンジ損失をLHS損失に置き換えると、対応する最適化問題には、カーネルリーキーホッケースティック分類子(LHS分類子)と呼ばれる明確に定義されたソリューションがあります。軽度の正則性条件下では、カーネルLHS分類子がベイズリスク整合であることを証明します。理論分析では、通常のカーネルマシンの分析では存在しない、LHS損失の負の発散によって引き起こされる複数の課題を克服します。数値デモンストレーションでは、カーネルLHS分類器を解決するための計算効率の高いアルゴリズムを提供し、シミュレートされたデータと15個のベンチマークデータセットでカーネルSVMと比較します。この研究の結論として、LHS損失と同様の理論的特性を持つ負の発散マージンベースの損失関数のクラスをさらに提示します。興味深いことに、LHS損失は、この負の発散マージンベースの損失関数のファミリーの限界ケースとして見ることができます。

Efficient Computation of Rankings from Pairwise Comparisons
ペアワイズ比較からのランキングの効率的な計算

We study the ranking of individuals, teams, or objects, based on pairwise comparisons between them, using the Bradley-Terry model. Estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that provably returns identical results but does so much faster—over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive a number of results regarding its convergence.

私たちは、Bradley-Terryモデルを使用して、個人、チーム、またはオブジェクト間のペアワイズ比較に基づいて、それらのランク付けを研究します。このモデル内のランキングの推定は、通常、約100年前にZermeloによって最初に導入された単純な反復アルゴリズムを使用して行われます。ここでは、同一の結果を返すことが証明でき、かつはるかに高速(場合によっては100倍以上)に動作する、同様に単純な代替の反復法について説明します。このアルゴリズムをさまざまなサンプルデータセットに適用して示し、その収束に関するいくつかの結果を導き出します。
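
The two iterations can be compared in a short sketch. The abstract does not state the update rules, so both formulas below are assumptions: the first is the classical Zermelo update for Bradley-Terry strengths, and the second is our reading of the alternative update, applied one player at a time. `w[i, j]` is the number of times `i` beat `j`; the function names are hypothetical.

```python
import numpy as np


def zermelo_iteration(w, n_iter=1000):
    """Classical update: pi_i <- W_i / sum_j (w_ij + w_ji) / (pi_i + pi_j)."""
    n = w.shape[0]
    pi = np.full(n, 1.0 / n)
    games = w + w.T              # number of comparisons between each pair
    wins = w.sum(axis=1)         # total wins W_i of each player
    for _ in range(n_iter):
        denom = (games / (pi[:, None] + pi[None, :])).sum(axis=1)
        pi = wins / denom
        pi /= pi.sum()
    return pi


def alternative_iteration(w, n_iter=1000):
    """Assumed alternative update, applied sequentially:

    pi_i <- [sum_j w_ij * pi_j / (pi_i + pi_j)] / [sum_j w_ji / (pi_i + pi_j)]
    """
    n = w.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        for i in range(n):
            s = pi[i] + pi
            pi[i] = np.sum(w[i] * pi / s) / np.sum(w[:, i] / s)
        pi /= pi.sum()
    return pi
```

Both functions share the same fixed point (the maximum-likelihood strengths), so on a data set where every player has at least one win and one loss they should agree up to numerical tolerance; the difference lies in how many iterations each needs.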

Scalable Computation of Causal Bounds
因果境界のスケーラブルな計算

We consider the problem of computing bounds for causal queries on causal graphs with unobserved confounders and discrete valued observed variables, where identifiability does not hold. Existing non-parametric approaches for computing such bounds use linear programming (LP) formulations that quickly become intractable for existing solvers because the size of the LP grows exponentially in the number of edges in the causal graph. We show that this LP can be significantly pruned, allowing us to compute bounds for significantly larger causal inference problems compared to existing techniques. This pruning procedure allows us to compute bounds in closed form for a special class of problems, including a well-studied family of problems where multiple confounded treatments influence an outcome. We extend our pruning methodology to fractional LPs which compute bounds for causal queries which incorporate additional observations about the unit. We show that our methods provide significant runtime improvement compared to benchmarks in experiments and extend our results to the finite data setting. For causal inference without additional observations, we propose an efficient greedy heuristic that produces high quality bounds, and scales to problems that are several orders of magnitude larger than those for which the pruned LP can be solved.

私たちは、識別可能性が成り立たない、観測されない交絡因子と離散値の観測変数を持つ因果グラフ上の因果クエリの境界を計算する問題を考察します。このような境界を計算する既存のノンパラメトリック手法は、線形計画法(LP)定式化を使用するが、LPのサイズは因果グラフのエッジの数に応じて指数関数的に増加するため、既存のソルバーではすぐに扱いにくくなります。私たちは、このLPを大幅に削減できることを示し、既存の手法と比較して大幅に大きな因果推論問題に対する境界を計算できるようにします。この削減手順により、複数の交絡した処理が結果に影響を与える、よく研究されている問題群を含む、特殊なクラスの問題に対して、閉じた形式で境界を計算できます。私たちは、ユニットに関する追加の観測を組み込んだ因果クエリの境界を計算する分数LPに、我々の削減手法を拡張します。私たちは、我々の方法が、実験のベンチマークと比較して実行時間を大幅に改善することを示し、結果を有限データ設定に拡張します。追加の観測なしで因果推論を行うために、私たちは、高品質の境界を生成し、剪定されたLPが解決できる問題よりも数桁大きい問題に拡張できる効率的な貪欲ヒューリスティックを提案します。

Neural Q-learning for solving PDEs
偏微分方程式を解くためのニューラルQ学習

Solving high-dimensional partial differential equations (PDEs) is a major challenge in scientific computing. We develop a new numerical method for solving elliptic-type PDEs by adapting the Q-learning algorithm in reinforcement learning. To solve PDEs with Dirichlet boundary condition, our “Q-PDE” algorithm is mesh-free and therefore has the potential to overcome the curse of dimensionality. Using a neural tangent kernel (NTK) approach, we prove that the neural network approximator for the PDE solution, trained with the Q-PDE algorithm, converges to the trajectory of an infinite-dimensional ordinary differential equation (ODE) as the number of hidden units $\rightarrow \infty$. For monotone PDEs (i.e., those given by monotone operators, which may be nonlinear), despite the lack of a spectral gap in the NTK, we then prove that the limit neural network, which satisfies the infinite-dimensional ODE, strongly converges in $L^2$ to the PDE solution as the training time $\rightarrow \infty$. More generally, we can prove that any fixed point of the wide-network limit for the Q-PDE algorithm is a solution of the PDE (not necessarily under the monotone condition). The numerical performance of the Q-PDE algorithm is studied for several elliptic PDEs.

高次元偏微分方程式(PDE)を解くことは、科学計算における大きな課題です。強化学習のQ学習アルゴリズムを応用することで、楕円型PDEを解くための新しい数値手法を開発します。ディリクレ境界条件を持つPDEを解くための我々の「Q-PDE」アルゴリズムはメッシュフリーであり、したがって次元の呪いを克服する可能性を秘めています。ニューラルタンジェントカーネル(NTK)アプローチを使用して、Q-PDEアルゴリズムでトレーニングされたPDE解のニューラルネットワーク近似器が、隠れユニットの数$\rightarrow \infty$のとき、無限次元常微分方程式(ODE)の軌跡に収束することを証明します。単調PDE(すなわち、単調作用素によって与えられ、非線形でもあり得るPDE)の場合、NTKにスペクトルギャップがないにもかかわらず、無限次元ODEを満たす極限ニューラルネットワークが、トレーニング時間$\rightarrow \infty$のとき、PDE解に$L^2$で強収束することを証明します。より一般的には、Q-PDEアルゴリズムのワイドネットワーク極限の任意の不動点がPDEの解であることを(必ずしも単調性条件を仮定せずに)証明できます。いくつかの楕円型PDEについて、Q-PDEアルゴリズムの数値的性能を調べます。

Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models
汚染されたガウスモデルにおけるロバスト推定のための扱いやすく最適に近い敵対的アルゴリズム

Consider the problem of simultaneous estimation of location and variance matrix under Huber’s contaminated Gaussian model. First, we study minimum $f$-divergence estimation at the population level, corresponding to a generative adversarial method with a nonparametric discriminator and establish conditions on $f$-divergences which lead to robust estimation, similarly to robustness of minimum distance estimation. More importantly, we develop tractable adversarial algorithms with simple spline discriminators, which can be defined by nested optimization such that the discriminator parameters are determined by maximizing a concave objective function given the current generator. The proposed methods are shown to achieve minimax optimal rates or near-optimal rates depending on the $f$-divergence and the penalty used. This is the first time such near-optimal error rates are established for adversarial algorithms with linear discriminators under Huber’s contamination model. We present simulation studies to demonstrate advantages of the proposed methods over classic robust estimators, pairwise methods, and a generative adversarial method with neural network discriminators.

Huberの汚染ガウスモデルの下での位置と分散行列の同時推定の問題を考えてみましょう。まず、ノンパラメトリック識別器を使用した生成的敵対的手法に対応する、集団レベルでの最小$f$ダイバージェンス推定を研究し、最小距離推定の堅牢性と同様に、堅牢な推定につながる$f$ダイバージェンスの条件を確立します。さらに重要なことに、単純なスプライン識別器を使用した扱いやすい敵対的アルゴリズムを開発します。これは、現在の生成器を与えられた場合に凹目的関数を最大化することで識別器パラメータが決定されるように、ネストされた最適化によって定義できます。提案された方法は、$f$ダイバージェンスと使用されるペナルティに応じて、ミニマックス最適レートまたはほぼ最適レートを達成することが示されています。Huberの汚染モデルの下で線形識別器を使用した敵対的アルゴリズムでこのようなほぼ最適なエラー率が確立されたのはこれが初めてです。私たちは、従来の堅牢な推定器、ペアワイズ法、ニューラルネットワーク識別器を使用した生成的敵対的方法に比べて提案された方法の利点を示すシミュレーション研究を紹介します。

MultiZoo and MultiBench: A Standardized Toolkit for Multimodal Deep Learning
MultiZoo と MultiBench: マルチモーダル深層学習のための標準化されたツールキット

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of >20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way towards a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are publicly available, will be regularly updated, and welcome inputs from the community.

マルチモーダル表現の学習には、複数の異種データソースからの情報を統合することが含まれます。実世界での堅牢性を確保しながら、十分に研究されていないモダリティとタスクへの進歩を加速するために、20を超えるコアマルチモーダルアルゴリズムの標準化された実装で構成される公開ツールキットであるMultiZooと、15のデータセット、10のモダリティ、20の予測タスク、6つの研究分野にまたがる大規模ベンチマークであるMultiBenchをリリースしました。これらを組み合わせることで、データの読み込み、実験のセットアップ、モデル評価を簡素化および標準化する、自動化されたエンドツーエンドの機械学習パイプラインが提供されます。総合的な評価を可能にするために、(1)一般化、(2)時間と空間の複雑さ、(3)モダリティの堅牢性を評価するための包括的な方法論を提供します。MultiBenchは、使いやすさ、アクセシビリティ、再現性を確保しながら、マルチモーダルモデルの機能と制限をより深く理解するための道を開きます。当社のツールキットは公開されており、定期的に更新され、コミュニティからのインプットを歓迎します。

Strategic Knowledge Transfer
戦略的な知識の伝達

In the course of playing or solving a game, it is common to face a series of changing other-agent strategies. These strategies often share elements: the set of possible policies to play has overlap, and the policies are sampled at the beginning of play by possibly differing distributions. As it faces the series of strategies, therefore, an agent has the opportunity to transfer its learned play against the previously encountered other-agent policies. We tackle two problems: (1) how can learned responses transfer across changing opponent strategies, and (2) how can this transfer be used to reduce the cumulative cost of learning in game solving. The first problem we characterize as the strategic knowledge transfer problem. For value-based response policies, we demonstrate that Q-Mixing approximately solves this problem by appropriately averaging the component Q-values. Solutions to the first problem can be applied to reduce the computational cost of learning-based game solving algorithms. We offer two algorithms that operate within the Policy-Space Response Oracles (PSRO) framework. Mixed-Oracles reduces the per-policy construction cost by transferring responses from previously encountered opponents. Mixed-Opponents performs strategic knowledge transfer by combining the previously encountered opponents into a single novel policy. Experimental evaluation of these methods on general-sum grid-world games provides evidence about their advantages and limitations in comparison to standard PSRO.

ゲームをプレイしたり解決したりする過程で、他のエージェントの戦略が次々と変化していくのが一般的です。これらの戦略には、多くの場合、共通の要素があります。プレイ可能なポリシーのセットには重複があり、ポリシーはプレイの開始時に、異なる分布によってサンプリングされる可能性があります。したがって、エージェントは一連の戦略に直面すると、以前に遭遇した他のエージェントのポリシーに対して、学習したプレイを転送する機会があります。私たちは、(1)学習した応答を変化する対戦相手の戦略間で転送する方法、および(2)この転送を使用して、ゲーム解決における学習の累積コストを削減する方法という2つの問題に取り組んでいます。最初の問題は、戦略的知識転送問題として特徴付けられます。価値ベースの応答ポリシーの場合、Q-Mixingは、コンポーネントQ値を適切に平均化することで、この問題を近似的に解決することを示します。最初の問題の解決策は、学習ベースのゲーム解決アルゴリズムの計算コストを削減するために適用できます。ポリシー空間応答オラクル(PSRO)フレームワーク内で動作する2つのアルゴリズムを提供します。Mixed-Oraclesは、以前に遭遇した対戦相手からの応答を転送することで、ポリシーごとの構築コストを削減します。Mixed-Opponentsは、以前に遭遇した対戦相手を1つの新しいポリシーに組み合わせることで、戦略的な知識転送を実行します。一般和グリッドワールドゲームでのこれらの方法の実験的評価により、標準PSROと比較した利点と限界についての証拠が得られます。

Lifted Bregman Training of Neural Networks
ニューラルネットワークのリフテッド・ブレグマン学習

We introduce a novel mathematical formulation for the training of feed-forward neural networks with (potentially non-smooth) proximal maps as activation functions. This formulation is based on Bregman distances and a key advantage is that its partial derivatives with respect to the network’s parameters do not require the computation of derivatives of the network’s activation functions. Instead of estimating the parameters with a combination of first-order optimisation method and back-propagation (as is the state-of-the-art), we propose the use of non-smooth first-order optimisation methods that exploit the specific structure of the novel formulation. We present several numerical results that demonstrate that these training approaches can be equally well or even better suited for the training of neural network-based classifiers and (denoising) autoencoders with sparse coding compared to more conventional training frameworks.

私たちは、活性化関数として(滑らかでない可能性のある)近接写像を持つフィードフォワードニューラルネットワークのトレーニングのための新しい数学的定式化を紹介します。この定式化はブレグマン距離に基づいており、主な利点は、ネットワークのパラメーターに関する偏導関数がネットワークの活性化関数の導関数の計算を必要としないことです。(現在の最先端のように)一次最適化法とバックプロパゲーションの組み合わせでパラメータを推定する代わりに、新しい定式化の特定の構造を利用する非滑らかな一次最適化法の使用を提案します。これらの学習アプローチは、従来の学習フレームワークと比較して、ニューラルネットワークベースの分類器やスパースコーディングによる(ノイズ除去)自己符号化器の学習に同等またはそれ以上に適し得ることを示すいくつかの数値結果を示します。

Statistical Comparisons of Classifiers by Generalized Stochastic Dominance
一般化確率的優勢による分類器の統計的比較

Although being a crucial question for the development of machine learning algorithms, there is still no consensus on how to compare classifiers over multiple data sets with respect to several criteria. Every comparison framework is confronted with (at least) three fundamental challenges: the multiplicity of quality criteria, the multiplicity of data sets and the randomness of the selection of data sets. In this paper, we add a fresh view to the vivid debate by adopting recent developments in decision theory. Based on so-called preference systems, our framework ranks classifiers by a generalized concept of stochastic dominance, which powerfully circumvents the cumbersome, and often even self-contradictory, reliance on aggregates. Moreover, we show that generalized stochastic dominance can be operationalized by solving easy-to-handle linear programs and moreover statistically tested employing an adapted two-sample observation-randomization test. This yields indeed a powerful framework for the statistical comparison of classifiers over multiple data sets with respect to multiple quality criteria simultaneously. We illustrate and investigate our framework in a simulation study and with a set of standard benchmark data sets.

機械学習アルゴリズムの開発にとって極めて重要な問題であるにもかかわらず、複数のデータセットにわたって分類器をいくつかの基準に関して比較する方法については、まだコンセンサスが得られていません。すべての比較フレームワークは、（少なくとも）3つの基本的な課題に直面しています。品質基準の多様性、データセットの多様性、およびデータセットの選択のランダム性です。この論文では、意思決定理論の最近の進展を採用することで、活発な議論に新しい視点を追加します。いわゆる選好システムに基づいて、私たちのフレームワークは、分類器を確率的優位性の一般化された概念によってランク付けします。これにより、面倒でしばしば自己矛盾さえする集計への依存を強力に回避できます。さらに、一般化された確率的優位性は、扱いやすい線形計画を解くことで操作可能であり、さらに適応された2サンプルの観測ランダム化テストを使用して統計的にテストできることを示します。これにより、複数のデータセットにわたって分類器を複数の品質基準に関して同時に統計的に比較するための強力なフレームワークが得られます。私たちは、シミュレーション研究と一連の標準ベンチマークデータセットを使用してフレームワークを説明および調査します。

Sample Complexity for Distributionally Robust Learning under chi-square divergence
カイ二乗ダイバージェンスの下での分布ロバスト学習のサンプル複雑性

This paper investigates the sample complexity of learning a distributionally robust predictor under a particular distributional shift based on $\chi^2$-divergence, which is well known for its computational feasibility and statistical properties. We demonstrate that any hypothesis class $\mathcal{H}$ with finite VC dimension is distributionally robustly learnable. Moreover, we show that when the perturbation size is smaller than a constant, finite VC dimension is also necessary for distributionally robust learning by deriving a lower bound of sample complexity in terms of VC dimension.

この論文では、計算可能性と統計的特性でよく知られている$\chi^2$-divergenceに基づく特定の分布シフトの下で分布的にロバストな予測子を学習するサンプルの複雑さを調査します。有限のVC次元を持つ任意の仮説クラス$\mathcal{H}$が分布的にロバストに学習可能であることを示します。さらに、摂動サイズが定数よりも小さい場合、VC次元の観点からサンプルの複雑さの下限を導出することにより、分布的にロバストな学習には有限VC次元も必要であることを示します。
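
As a reading aid, the objects this abstract refers to can be written out in one common form. The symbols $P$ (training distribution), $Q$ (shifted distribution), $\rho$ (perturbation size), $\ell$ (loss) and $Z$ (a data point) are notation assumed here for illustration, and the normalization of the divergence is one convention among several; the paper itself may scale or orient it differently.

```latex
D_{\chi^2}(Q \,\|\, P) = \mathbb{E}_{P}\!\left[\left(\frac{dQ}{dP} - 1\right)^{2}\right],
\qquad
R_{\rho}(h) = \sup_{Q \,:\, D_{\chi^2}(Q \| P) \le \rho} \mathbb{E}_{Q}\big[\ell(h, Z)\big].
```

Under this reading, a distributionally robust learner looks for a hypothesis $h \in \mathcal{H}$ with small worst-case risk $R_{\rho}(h)$ rather than small risk under $P$ alone.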

Interpretable and Fair Boolean Rule Sets via Column Generation
列生成による解釈可能で公正なブールルールセット

This paper considers the learning of Boolean rules in disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. We also consider the fairness setting and extend the formulation to include explicit constraints on two different measures of classification parity: equality of opportunity and equalized odds. Column generation (CG) is used to efficiently search over an exponential number of candidate rules without the need for heuristic rule mining. To handle large data sets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 8 out of 16 data sets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate. Compared to other fair and interpretable classifiers, our method is able to find rule sets that meet stricter notions of fairness with a modest trade-off in accuracy.

この論文では、分類の解釈可能なモデルとして、選言正規形(DNF、ANDのOR、決定ルールセットに相当)のブールルールの学習について検討します。整数計画は、分類精度とルールの単純さを最適にトレードオフするように定式化されます。また、公平性の設定も考慮し、分類のパリティの2つの異なる尺度(機会の平等と均等化オッズ)に対する明示的な制約を含めるように定式化を拡張します。列生成(CG)は、ヒューリスティックなルールマイニングを必要とせずに、指数関数的な数の候補ルールを効率的に検索するために使用されます。大規模なデータセットを処理するために、ランダム化を使用した近似CGアルゴリズムを提案します。最近提案された3つの代替アルゴリズムと比較すると、CGアルゴリズムは16のデータセットのうち8つで精度と単純さのトレードオフを上回ります。精度を最大化すると、CGはこの目的のために設計されたルール学習器と競合し、精度が劣らない大幅に単純なソリューションを見つけることもあります。他の公平かつ解釈可能な分類器と比較して、私たちの方法は、精度をわずかに犠牲にして、より厳格な公平性の概念を満たすルールセットを見つけることができます。
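
To make the model class concrete, the sketch below evaluates a rule set in disjunctive normal form on binarized features. It only illustrates what an OR-of-ANDs classifier computes; it does not implement the integer program or column generation. The feature names, the example rules, and the `complexity` measure (clauses plus literals) are hypothetical choices made for illustration.

```python
from typing import Dict, List

# A DNF rule set: an OR over clauses, where each clause is an AND over
# binary features that must all be true.
RuleSet = List[List[str]]


def predict_dnf(rule_set: RuleSet, sample: Dict[str, bool]) -> bool:
    """Return True if at least one clause has all of its features true."""
    return any(all(sample[feature] for feature in clause) for clause in rule_set)


def complexity(rule_set: RuleSet) -> int:
    """Assumed simplicity proxy: number of clauses plus number of literals."""
    return len(rule_set) + sum(len(clause) for clause in rule_set)


# Hypothetical rules and samples, for illustration only.
rules = [["age_over_40", "income_high"], ["owns_home"]]
sample_a = {"age_over_40": True, "income_high": False, "owns_home": True}
sample_b = {"age_over_40": True, "income_high": False, "owns_home": False}

print(predict_dnf(rules, sample_a), predict_dnf(rules, sample_b), complexity(rules))
```

A learner of the kind described above would trade a complexity measure like this against classification accuracy when choosing which clauses to include.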

On the Optimality of Nuclear-norm-based Matrix Completion for Problems with Smooth Non-linear Structure
平滑非線形構造問題に対する核ノルムに基づく行列補完の最適性について

Nuclear-norm-based matrix completion was originally developed for imputing missing entries in low rank, or approximately low rank matrices. However, it has proven widely effective in many problems where there is no reason to assume low-dimensional linear structure in the underlying matrix, as would be imposed by rank constraints. In this manuscript we show that nuclear-norm-based matrix completion attains within a log factor of the minimax rate for estimating the mean structure of matrices that are not necessarily low-rank, but lie in a low-dimensional non-linear manifold, when observations are missing completely at random. In particular, we give upper bounds on the rate of convergence as a function of the number of rows, columns, and observed entries in the matrix, as well as the smoothness and dimension of the non-linear embedding. We additionally give a minimax lower bound: This lower bound agrees with our upper bound (up to a logarithmic factor), which shows that nuclear-norm penalization is (up to log terms) minimax rate optimal for these problems.

核ノルムベースの行列補完は、もともとは低ランクまたはほぼ低ランクの行列の欠損エントリを補完するために開発されました。しかし、ランク制約によって課されるような、基礎となる行列の低次元線形構造を想定する理由がない多くの問題で広く有効であることが証明されています。この論文では、観測値が完全にランダムに欠損している場合、必ずしも低ランクではないが低次元の非線形多様体にある行列の平均構造を推定するために、核ノルムベースの行列補完がミニマックス率の対数係数以内を達成することを示します。特に、行列の行、列、および観測エントリの数、および非線形埋め込みの滑らかさと次元の関数として収束率の上限を示します。さらに、ミニマックス下限を示します。この下限は、（対数係数まで）上限と一致しており、核ノルムペナルティが（対数項まで）これらの問題に対してミニマックス率最適であることを示しています。
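
For readers unfamiliar with nuclear-norm penalized completion, the sketch below shows one standard way to compute such an estimate: repeatedly fill the missing entries with the current estimate and soft-threshold the singular values. The abstract does not say which solver the authors use, so the iteration scheme, the function name `soft_impute`, and the parameters `lam` and `n_iters` are assumptions made for illustration.

```python
import numpy as np


def soft_impute(observed, mask, lam, n_iters=200):
    """Nuclear-norm regularized completion via iterative singular value thresholding.

    observed: 2-D array; values where mask is False are ignored.
    mask:     boolean array, True where an entry was observed.
    lam:      soft-threshold level applied to the singular values.
    """
    observed = np.asarray(observed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    estimate = np.zeros_like(observed)
    for _ in range(n_iters):
        # Keep observed entries, fill the rest with the current estimate.
        filled = np.where(mask, observed, estimate)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Shrinking singular values toward zero is what penalizes the nuclear norm.
        s = np.maximum(s - lam, 0.0)
        estimate = (u * s) @ vt
    return estimate


# Illustrative data: a rank-one matrix with two entries treated as missing.
X = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))
mask = np.ones_like(X, dtype=bool)
mask[0, 1] = False
mask[2, 2] = False
completed = soft_impute(X, mask, lam=0.05)
```

The threshold `lam` controls the bias of the estimate: larger values shrink more singular values to zero and give a lower-rank result.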

Autoregressive Networks
自己回帰ネットワーク

We propose a first-order autoregressive (i.e. AR(1)) model for dynamic network processes in which edges change over time while nodes remain unchanged. The model depicts the dynamic changes explicitly. It also facilitates simple and efficient statistical inference methods including a permutation test for diagnostic checking of the fitted network models. The proposed model can be applied to the network processes with various underlying structures but with independent edges. As an illustration, an AR(1) stochastic block model has been investigated in depth, which characterizes the latent communities by the transition probabilities over time. This leads to a new and more effective spectral clustering algorithm for identifying the latent communities. We have derived a finite sample condition under which the perfect recovery of the community structure can be achieved by the newly defined spectral clustering algorithm. Furthermore, the inference for a change point is incorporated into the AR(1) stochastic block model to cater for possible structure changes. We have derived the explicit error rates for the maximum likelihood estimator of the change-point. Application with three real data sets illustrates both relevance and usefulness of the proposed AR(1) models and the associated inference methods.

私たちは、エッジが時間とともに変化し、ノードは変化しない動的ネットワークプロセスのための一次自己回帰（すなわちAR(1)）モデルを提案します。このモデルは、動的変化を明示的に描写します。また、適合されたネットワークモデルの診断チェックのための順列検定を含む、単純で効率的な統計的推論法を容易にします。提案されたモデルは、さまざまな基礎構造を持ちながら独立したエッジを持つネットワークプロセスに適用できます。例として、AR(1)確率的ブロックモデルを詳細に調査し、時間経過に伴う遷移確率によって潜在的コミュニティを特徴付けます。これにより、潜在的コミュニティを識別するための新しい、より効果的なスペクトルクラスタリングアルゴリズムがもたらされます。私たちは、新たに定義されたスペクトルクラスタリングアルゴリズムによってコミュニティ構造の完全な回復を達成できる有限サンプル条件を導出しました。さらに、起こり得る構造変化に対応するため、変化点の推論をAR(1)確率的ブロックモデルに組み込みます。変化点の最尤推定量の明示的な誤差率も導出しました。3つの実データセットへの適用により、提案されたAR(1)モデルと関連する推論方法の妥当性と有用性を示します。
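
The sketch below simulates one simple edge process of the kind this abstract describes: nodes are fixed, edges evolve independently, and each edge depends only on its own previous state. The exact transition rule is an assumption for illustration (an absent edge appears with probability `alpha`, a present edge disappears with probability `beta`); the function and parameter names are hypothetical and the paper's own parameterization may differ.

```python
import numpy as np


def simulate_edge_process(n_nodes, n_steps, alpha, beta, seed=0):
    """Simulate undirected network snapshots with independent two-state edges.

    Assumed dynamics: absent -> present with prob. alpha,
    present -> absent with prob. beta, independently for every edge.
    """
    rng = np.random.default_rng(seed)
    upper = np.triu_indices(n_nodes, k=1)
    n_edges = len(upper[0])
    # Start from the stationary edge probability of the two-state chain.
    state = rng.random(n_edges) < alpha / (alpha + beta)
    snapshots = np.zeros((n_steps, n_nodes, n_nodes), dtype=int)
    for t in range(n_steps):
        if t > 0:
            u = rng.random(n_edges)
            appear = (~state) & (u < alpha)
            vanish = state & (u < beta)
            state = (state | appear) & ~vanish
        adj = np.zeros((n_nodes, n_nodes), dtype=int)
        adj[upper] = state
        snapshots[t] = adj + adj.T  # symmetric, zero diagonal
    return snapshots


snapshots = simulate_edge_process(n_nodes=5, n_steps=4, alpha=0.3, beta=0.2)
```

In a stochastic block model variant, `alpha` and `beta` would depend on the community memberships of the two endpoints, which is what allows the transition probabilities to reveal the latent communities.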

Merlion: End-to-End Machine Learning for Time Series
Merlion:時系列のエンドツーエンド機械学習

We introduce Merlion, an open-source machine learning library for time series. It features a unified interface for many commonly used models and datasets for forecasting and anomaly detection on both univariate and multivariate time series, along with standard pre/post-processing layers. It has several modules to improve ease-of-use, including a no-code visual dashboard, anomaly score calibration to improve interpretability, AutoML for hyperparameter tuning and model selection, and model ensembling. Merlion also provides an evaluation framework that simulates the live deployment of a model in production, and a distributed computing backend to run time series models at industrial scale. This library aims to provide engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs and benchmark them across multiple datasets.

私たちは、時系列のオープンソースの機械学習ライブラリであるMerlionを紹介します。これは、単変量と多変量の両方の時系列での予測と異常検出、および標準の前処理/後処理レイヤーのために一般的に使用される多くのモデルとデータセットの統一インターフェイスを備えています。ノーコードビジュアルダッシュボード、解釈可能性を向上させるための異常スコアキャリブレーション、ハイパーパラメータチューニングとモデル選択のためのAutoML、モデルアンサンブルなど、使いやすさを向上させるためのいくつかのモジュールがあります。Merlionは、本番環境でのモデルのライブデプロイをシミュレートする評価フレームワークと、産業規模で時系列モデルを実行するための分散コンピューティングバックエンドも提供しています。このライブラリは、エンジニアや研究者が特定の時系列ニーズに対応するモデルを迅速に開発し、複数のデータセット間でベンチマークを行うためのワンストップソリューションを提供することを目的としています。

Limits of Dense Simplicial Complexes
稠密な単体複体の極限

We develop a theory of limits for sequences of dense abstract simplicial complexes, where a sequence is considered convergent if its homomorphism densities converge. The limiting objects are represented by stacks of measurable $[0,1]$-valued functions on unit cubes of increasing dimension, each corresponding to a dimension of the abstract simplicial complex. We show that convergence in homomorphism density implies convergence in a cut-metric, and vice versa, as well as showing that simplicial complexes sampled from the limit objects closely resemble its structure. Applying this framework, we also partially characterize the convergence of nonuniform hypergraphs.

私たちは、稠密な抽象単体複体の列に対する極限の理論を構築します。ここでは、準同型密度が収束するとき、列は収束するとみなします。極限オブジェクトは、次元が増加していく単位立方体上の可測な$[0,1]$値関数の積み重ねによって表され、それぞれが抽象単体複体の各次元に対応します。準同型密度の収束がカットメトリックでの収束を意味し、その逆も成り立つことを示すとともに、極限オブジェクトからサンプリングされた単体複体がその構造によく似ていることを示します。このフレームワークを適用して、非一様ハイパーグラフの収束も部分的に特徴付けます。

RankSEG: A Consistent Ranking-based Framework for Segmentation
RankSEG:セグメンテーションのための一貫したランキングベースのフレームワーク

Segmentation has emerged as a fundamental field of computer vision and natural language processing, which assigns a label to every pixel/feature to extract regions of interest from an image/text. To evaluate the performance of segmentation, the Dice and IoU metrics are used to measure the degree of overlap between the ground truth and the predicted segmentation. In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice-/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses is not consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation. We study statistical properties of the proposed framework. We show it is Dice-/IoU-calibrated, and its excess risk bounds and the rate of convergence are also provided. The numerical effectiveness of RankDice/mRankDice is demonstrated in various simulated examples and Fine-annotated CityScapes, Pascal VOC and Kvasir-SEG datasets with state-of-the-art deep learning architectures. A Python module and source code are available on GitHub at (https://github.com/statmlben/rankseg).

セグメンテーションは、画像/テキストから関心領域を抽出するためにすべてのピクセル/特徴にラベルを割り当てる、コンピュータービジョンと自然言語処理の基本分野として浮上しました。セグメンテーションのパフォーマンスを評価するために、DiceおよびIoUメトリックを使用して、グラウンドトゥルースと予測されたセグメンテーションの重なりの度合いを測定します。この論文では、分類キャリブレーションまたは分類におけるフィッシャー一貫性に類似した、ベイズ規則およびDice/IoUキャリブレーションを含む、Dice/IoUメトリックに関するセグメンテーションの理論的基礎を確立します。ほとんどの動作損失を伴う既存のしきい値ベースのフレームワークは、Dice/IoUメトリックに関して一貫性がなく、したがって次善のソリューションにつながる可能性があることを証明します。この落とし穴に対処するために、ベイズセグメンテーション規則のプラグインルールに触発された、一貫性のある新しいランキングベースのフレームワーク、つまりRankDice/RankIoUを提案します。大規模で高次元のセグメンテーションで提案されたフレームワークを実装するために、GPU並列実行を備えた3つの数値アルゴリズムが開発されています。提案されたフレームワークの統計的特性を調査します。Dice/IoU調整済みであることを示し、その過剰リスク境界と収束率も提供します。RankDice/mRankDiceの数値的有効性は、さまざまなシミュレーション例と、最先端のディープラーニングアーキテクチャを使用したFine注釈付きCityScapes、Pascal VOC、Kvasir-SEGデータセットで実証されています。Pythonモジュールとソースコードは、Github (https://github.com/statmlben/rankseg)で入手できます。
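
Since the whole framework is defined relative to the Dice and IoU metrics, a small worked example of the two overlap measures may help. This is a plain NumPy sketch of the standard definitions for binary masks, not code from the rankseg package; returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np


def dice_and_iou(truth, pred):
    """Compute Dice and IoU between two binary masks of the same shape."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    intersection = np.logical_and(truth, pred).sum()
    total = truth.sum() + pred.sum()
    union = total - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2.0 * intersection / total if total > 0 else 1.0
    iou = intersection / union if union > 0 else 1.0
    return float(dice), float(iou)


# Illustrative masks: 3 true pixels, 3 predicted pixels, 2 in common.
truth = [[1, 1, 0], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 1]]
dice, iou = dice_and_iou(truth, pred)
print(dice, iou)
```

Both metrics depend on the predicted mask as a whole rather than on each pixel separately, which is why thresholding per-pixel probabilities at a fixed level need not be optimal for them.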

Conditional Distribution Function Estimation Using Neural Networks for Censored and Uncensored Data
打ち切りデータと打ち切りなしデータに対するニューラルネットワークを用いた条件付き分布関数推定

Most work in neural networks focuses on estimating the conditional mean of a continuous response variable given a set of covariates. In this article, we consider estimating the conditional distribution function using neural networks for both censored and uncensored data. The algorithm is built upon the data structure particularly constructed for the Cox regression with time-dependent covariates. Without imposing any model assumptions, we consider a loss function that is based on the full likelihood where the conditional hazard function is the only unknown nonparametric parameter, for which unconstrained optimization methods can be applied. Through simulation studies, we show that the proposed method possesses desirable performance, whereas the partial likelihood method and the traditional neural networks with $L_2$ loss yield biased estimates when model assumptions are violated. We further illustrate the proposed method with several real-world data sets.

ニューラルネットワークでのほとんどの作業は、共変量のセットが与えられた場合の連続応答変数の条件付き平均を推定することに重点を置いています。この記事では、打ち切りデータと打ち切りなしデータの両方について、ニューラルネットワークを使用して条件付き分布関数を推定することを検討します。このアルゴリズムは、時間依存の共変量を持つCox回帰用に特別に構築されたデータ構造に基づいて構築されています。モデルの仮定を課すことなく、条件付きハザード関数が唯一の未知のノンパラメトリックパラメーターであり、制約のない最適化手法を適用できる全尤度に基づく損失関数を検討します。シミュレーション研究を通じて、提案手法が望ましい性能を持つのに対し、偏尤法と$L_2$損失を持つ従来のニューラルネットワークは、モデルの仮定に違反すると偏った推定値が得られることを示します。さらに、提案された方法をいくつかの実世界のデータセットで説明します。

Single Timescale Actor-Critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees
収束保証付き線形二次レギュレータを解く単一時間スケールアクタークリティック法

We propose a single timescale actor-critic algorithm to solve the linear quadratic regulator (LQR) problem. A least squares temporal difference (LSTD) method is applied to the critic and a natural policy gradient method is used for the actor. We give a proof of convergence with sample complexity $\mathcal{O}(\varepsilon^{-1} \log(\varepsilon^{-1})^2)$. The method in the proof is applicable to general single timescale bilevel optimization problems. We also numerically validate our theoretical results on the convergence.

私たちは、線形二次レギュレーター(LQR)問題を解くために、単一タイムスケールのアクタークリティックアルゴリズムを提案します。クリティックには最小二乗時間差分(LSTD)法を適用し、アクターには自然方策勾配法を使用します。サンプル複雑度$\mathcal{O}(\varepsilon^{-1} \log(\varepsilon^{-1})^2)$での収束の証明を与えます。証明で用いた手法は、一般的な単一タイムスケールの二レベル最適化問題に適用できます。また、収束に関する理論的な結果を数値的に検証します。

Multi-source Learning via Completion of Block-wise Overlapping Noisy Matrices
ブロックワイズオーバーラップノイズ行列の完成によるマルチソース学習

Electronic healthcare records (EHR) provide a rich resource for healthcare research. An important problem for the efficient utilization of the EHR data is the representation of the EHR features, which include the unstructured clinical narratives and the structured codified data. Matrix factorization-based embeddings trained using the summary-level co-occurrence statistics of EHR data have provided a promising solution for feature representation while preserving patients’ privacy. However, such methods do not work well with multi-source data when these sources have overlapping but non-identical features. To accommodate multi-sources learning, we propose a novel word embedding generative model. To obtain multi-source embeddings, we design an efficient Block-wise Overlapping Noisy Matrix Integration (BONMI) algorithm to aggregate the multi-source pointwise mutual information matrices optimally with a theoretical guarantee. Our algorithm can also be applied to other multi-source data integration problems with a similar data structure. A by-product of BONMI is the contribution to the field of matrix completion by considering the missing mechanism other than the entry-wise independent missing. We show that the entry-wise missing assumption, despite its prevalence in the works of matrix completion, is not necessary to guarantee recovery. We prove the statistical rate of our estimator, which is comparable to the rate under independent missingness. Simulation studies show that BONMI performs well under a variety of configurations. We further illustrate the utility of BONMI by integrating multi-lingual multi-source medical text and EHR data to perform two tasks: (i) co-training semantic embeddings for medical concepts in both English and Chinese and (ii) the translation between English and Chinese medical concepts. Our method shows an advantage over existing methods.

電子医療記録(EHR)は、医療研究のための豊富なリソースを提供します。EHRデータの効率的な利用における重要な問題は、構造化されていない臨床ナラティブと構造化されたコード化データを含むEHR機能の表現です。EHRデータの要約レベルの共起統計を使用してトレーニングされた行列分解ベースの埋め込みは、患者のプライバシーを保護しながら機能を表現する有望なソリューションを提供してきました。ただし、このような方法は、これらのソースが重複しているが同一ではない機能を持つ場合、マルチソースデータではうまく機能しません。マルチソース学習に対応するために、新しい単語埋め込み生成モデルを提案します。マルチソース埋め込みを取得するために、理論的な保証付きでマルチソースのポイントワイズ相互情報行列を最適に集約する効率的なブロックワイズオーバーラップノイズ行列統合(BONMI)アルゴリズムを設計します。このアルゴリズムは、同様のデータ構造を持つ他のマルチソースデータ統合問題にも適用できます。BONMIの副産物は、エントリワイズ独立欠損以外の欠損メカニズムを考慮することで、行列補完の分野に貢献することです。エントリごとの欠損仮定は、行列補完の研究で広く普及しているにもかかわらず、回復を保証するために必要ではないことを示しています。独立した欠損がある場合の推定値と同等の統計的割合を証明します。シミュレーション研究では、BONMIがさまざまな構成で適切に機能することが示されています。さらに、多言語マルチソース医療テキストとEHRデータを統合して、(i)英語と中国語の両方での医療概念の意味埋め込みの共同トレーニングと、(ii)英語と中国語の医療概念間の翻訳という2つのタスクを実行することで、BONMIの有用性を示します。私たちの方法は、既存の方法よりも優れています。

A Unified Framework for Factorizing Distributional Value Functions for Multi-Agent Reinforcement Learning
マルチエージェント強化学習のための分布値関数の因数分解のための統一フレームワーク

In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of other agents. To address the above issues, we proposed a unified framework, called DFAC, for integrating distributional RL with value function factorization methods. This framework generalizes expected value function factorization methods to enable the factorization of return distributions. To validate DFAC, we first demonstrate its ability to factorize the value functions of a simple matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and six self-designed Ultra Hard maps, showing that DFAC is able to outperform a number of baselines.

完全に協調的なマルチエージェント強化学習(MARL)設定では、各エージェントの部分的な可観測性と他のエージェントのポリシーが絶えず変化するため、環境は非常に確率的です。そこで、上記の課題に対して、分布RLと価値関数の因数分解法を統合するための統一フレームワーク、DFACを提案しました。このフレームワークは、期待値関数の因数分解法を一般化して、戻り値の分布の因数分解を可能にします。DFACを検証するために、まず、確率的報酬を持つ単純な行列ゲームの価値関数を因数分解する能力を実証します。次に、StarCraft Multi-Agent ChallengeのすべてのSuper Hardマップと6つの自己設計Ultra Hardマップで実験を行い、DFACがいくつかのベースラインを上回ることができることを示しています。

Functional L-Optimality Subsampling for Functional Generalized Linear Models with Massive Data
大規模データにおける関数一般化線形モデルのための関数的L最適性サブサンプリング

Massive data bring the big challenges of memory and computation for analysis. These challenges can be tackled by taking subsamples from the full data as a surrogate. For functional data, it is common to collect multiple measurements over their domains, which require even more memory and computation time when the sample size is large. The computation would be much more intensive when statistical inference is required through bootstrap samples. Motivated by analyzing large-scale kidney transplant data, we propose an optimal subsampling method based on the functional L-optimality criterion for functional generalized linear models. To the best of our knowledge, this is the first attempt to propose a subsampling method for functional data analysis. The asymptotic properties of the resultant estimators are also established. The analysis results from extensive simulation studies and from the kidney transplant data show that the functional L-optimality subsampling (FLoS) method is much better than the uniform subsampling approach and can well approximate the results based on the full data while dramatically reducing the computation time and memory.

膨大なデータは、分析のためのメモリと計算に大きな課題をもたらします。これらの課題は、フルデータからサブサンプルを代理として取得することで対処できます。関数データの場合、定義域全体で複数の測定値を収集するのが一般的であり、サンプルサイズが大きい場合はさらに多くのメモリと計算時間が必要になります。ブートストラップサンプルによる統計的推論が必要な場合は、計算負荷がはるかに大きくなります。大規模な腎臓移植データの分析に動機付けられて、関数一般化線形モデルに対する関数的L最適性基準に基づく最適なサブサンプリング方法を提案します。私たちの知る限り、これは関数データ分析のサブサンプリング方法を提案する最初の試みです。結果として得られる推定量の漸近特性も確立します。広範なシミュレーション研究と腎臓移植データからの分析結果は、関数的L最適性サブサンプリング(FLoS)方法が一様サブサンプリングアプローチよりもはるかに優れており、計算時間とメモリを大幅に削減しながら、フルデータに基づく結果を適切に近似できることを示しています。

Adaptation Augmented Model-based Policy Optimization
適応拡張モデルベース方策最適化

Compared to model-free reinforcement learning (RL), model-based RL is often more sample efficient by leveraging a learned dynamics model to help decision making. However, the learned model is usually not perfectly accurate and the error will compound in multi-step predictions, which can lead to poor asymptotic performance. In this paper, we first derive an upper bound of the return discrepancy between the real dynamics and the learned model, which reveals the fundamental problem of distribution shift between simulated data and real data. Inspired by the theoretical analysis, we propose an adaptation augmented model-based policy optimization (AMPO) framework to address the distribution shift problem from the perspectives of feature learning and instance re-weighting, respectively. Specifically, the feature-based variant, namely FAMPO, introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data, while the instance-based variant, termed IAMPO, utilizes importance sampling to re-weight the real samples used to train the model. Besides model learning, we also investigate how to improve policy optimization in the model usage phase by selecting simulated samples with different probability according to their uncertainty. Extensive experiments on challenging continuous control tasks show that FAMPO and IAMPO, coupled with our model usage technique, achieve superior performance against baselines, which demonstrates the effectiveness of the proposed methods.

モデルフリー強化学習(RL)と比較して、モデルベースRLは、学習したダイナミクスモデルを活用して意思決定を支援することで、多くの場合、よりサンプル効率が高くなります。ただし、学習したモデルは通常、完全に正確ではなく、エラーが複数ステップの予測で累積し、漸近的なパフォーマンスが低下する可能性があります。この論文では、まず、実際のダイナミクスと学習したモデル間のリターンの不一致の上限を導出します。これにより、シミュレートされたデータと実際のデータ間の分布シフトという基本的な問題が明らかになります。理論分析に触発されて、特徴学習とインスタンスの再重み付けの観点から分布シフト問題に対処するための、適応拡張モデルベースポリシー最適化(AMPO)フレームワークを提案します。具体的には、特徴ベースのバリアントであるFAMPOは、実際のデータとシミュレートされたデータの特徴分布間の積分確率メトリック(IPM)を最小化するために教師なしモデル適応を導入し、インスタンスベースのバリアントであるIAMPOは、重要度サンプリングを利用して、モデルのトレーニングに使用される実際のサンプルを再重み付けします。モデル学習に加えて、不確実性に応じて異なる確率でシミュレートされたサンプルを選択することにより、モデル使用フェーズでのポリシー最適化を改善する方法も調査します。困難な連続制御タスクに関する広範な実験により、FAMPOとIAMPOをモデル使用手法と組み合わせると、ベースラインに対して優れたパフォーマンスが達成され、提案された方法の有効性が実証されます。

GANs as Gradient Flows that Converge
収束する勾配フローとしての GAN

This paper approaches the unsupervised learning problem by gradient descent in the space of probability density functions. A main result shows that along the gradient flow induced by a distribution-dependent ordinary differential equation (ODE), the unknown data distribution emerges as the long-time limit. That is, one can uncover the data distribution by simulating the distribution-dependent ODE. Intriguingly, the simulation of the ODE is shown equivalent to the training of generative adversarial networks (GANs). This equivalence provides a new “cooperative” view of GANs and, more importantly, sheds new light on the divergence of GANs. In particular, it reveals that the GAN algorithm implicitly minimizes the mean squared error (MSE) between two sets of samples, and this MSE fitting alone can cause GANs to diverge. To construct a solution to the distribution-dependent ODE, we first show that the associated nonlinear Fokker-Planck equation has a unique weak solution, by the Crandall-Liggett theorem for differential equations in Banach spaces. Based on this solution to the Fokker-Planck equation, we construct a unique solution to the ODE, using Trevisan’s superposition principle. The convergence of the induced gradient flow to the data distribution is obtained by analyzing the Fokker-Planck equation.

この論文では、確率密度関数の空間における勾配降下法によって、教師なし学習の問題にアプローチします。主な結果は、分布依存常微分方程式(ODE)によって誘導される勾配フローに沿って、未知のデータ分布が長時間の限界として現れることを示しています。つまり、分布依存ODEをシミュレートすることで、データ分布を明らかにすることができます。興味深いことに、ODEのシミュレーションは、生成的敵対ネットワーク(GAN)のトレーニングと同等であることが示されています。この同等性は、GANの新しい「協力的」な見方を提供し、さらに重要なことに、GANの発散に新たな光を当てます。特に、GANアルゴリズムは2セットのサンプル間の平均二乗誤差(MSE)を暗黙的に最小化し、このMSEフィッティングだけでGANが発散する可能性があることが明らかになりました。分布依存のODEの解を構築するには、まず、バナッハ空間の微分方程式のCrandall-Liggett定理によって、関連する非線形Fokker-Planck方程式に一意の弱解があることを示します。Fokker-Planck方程式のこの解に基づいて、Trevisanの重ね合わせ原理を使用して、ODEの一意の解を構築します。Fokker-Planck方程式を解析することで、誘導された勾配フローのデータ分布への収束が得られます。

Random Forests for Change Point Detection
変化点検出のためのランダムフォレスト

We propose a novel multivariate nonparametric multiple change point detection method using classifiers. We construct a classifier log-likelihood ratio that uses class probability predictions to compare different change point configurations. We propose a computationally feasible search method that is particularly well suited for random forests, denoted by changeforest. However, the method can be paired with any classifier that yields class probability predictions, which we illustrate by also using a $k$-nearest neighbor classifier. We prove that it consistently locates change points in single change point settings when paired with a consistent classifier. Our proposed method changeforest achieves improved empirical performance in an extensive simulation study compared to existing multivariate nonparametric change point detection methods. An efficient implementation of our method is made available for R, Python, and Rust users in the changeforest software package.

私たちは、分類器を用いた新しい多変量ノンパラメトリック多重変化点検出法を提案します。クラス確率予測を使用してさまざまな変化点の構成を比較する、分類器の対数尤度比を構築します。私たちは、ランダムフォレストに特に適した、計算的に実行可能な探索方法を提案し、これをchangeforestと呼びます。ただし、この方法は、クラス確率予測を生成する任意の分類器と組み合わせることができ、これを$k$近傍分類器も使用して例示します。一貫性のある分類器と組み合わせた場合、単一変化点の設定で変化点を一貫して特定することを証明します。私たちが提案する手法changeforestは、既存の多変量ノンパラメトリック変化点検出法と比較して、広範なシミュレーション研究において改善された経験的性能を達成します。この手法の効率的な実装は、changeforestソフトウェアパッケージとしてR、Python、Rustのユーザーに提供されています。

Least Squares Model Averaging for Distributed Data
分散データの最小二乗モデルの平均化

Divide and conquer algorithm is a common strategy applied in big data. Model averaging has the natural divide-and-conquer feature, but its theory has not been developed in big data scenarios. The goal of this paper is to fill this gap. We propose two divide-and-conquer-type model averaging estimators for linear models with distributed data. Under some regularity conditions, we show that the weights from Mallows model averaging criterion converge in L2 to the theoretically optimal weights minimizing the risk of the model averaging estimator. We also give the bounds of the in-sample and out-of-sample mean squared errors and prove the asymptotic optimality for the proposed model averaging estimators. Our conclusions hold even when the dimensions and the number of candidate models are divergent. Simulation results and a real airline data analysis illustrate that the proposed model averaging methods perform better than the commonly used model selection and model averaging methods in distributed data cases. Our approaches contribute to model averaging theory in distributed data and parallel computations, and can be applied in big data analysis to save time and reduce the computational burden.

分割統治アルゴリズムは、ビッグデータでよく適用される戦略です。モデル平均化には自然な分割統治機能がありますが、その理論はビッグデータのシナリオでは開発されていません。この論文の目的は、このギャップを埋めることです。分散データを持つ線形モデルに対して、2つの分割統治型モデル平均化推定量を提案します。いくつかの規則性条件下では、Mallowsモデル平均化基準の重みがL2で理論的に最適な重みに収束し、モデル平均化推定量のリスクを最小化することを示します。また、サンプル内およびサンプル外の平均二乗誤差の範囲を示し、提案されたモデル平均化推定量の漸近最適性を証明します。次元と候補モデルの数が発散する場合でも、結論は当てはまります。シミュレーション結果と実際の航空会社のデータ分析は、分散データの場合に、提案されたモデル平均化方法が一般的に使用されるモデル選択およびモデル平均化方法よりも優れていることを示しています。私たちのアプローチは、分散データと並列計算におけるモデル平均化理論に貢献し、ビッグデータ分析に適用して時間を節約し、計算負荷を軽減することができます。

An Empirical Investigation of the Role of Pre-training in Lifelong Learning
生涯学習における事前学習の役割に関する実証的研究

The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks, including a large-scale study using a novel data set of 15 diverse NLP tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach outperforms several state-of-the-art task-sequential continual learning algorithms across multiple settings, occasionally even without retaining a memory that scales in size with the number of tasks.

機械学習における生涯学習パラダイムは、生物学的学習に似ているだけでなく、過度のモデル再トレーニングを回避してエネルギーの無駄を減らす可能性があることからも、より主流である孤立学習スキームの魅力的な代替手段です。このパラダイムの主な課題は、破滅的忘却という現象です。機械学習における事前学習済みモデルの人気と成功が高まる中、私たちは次のような疑問を提起します。事前学習は生涯学習、特に破滅的忘却に関してどのような役割を果たすのでしょうか。私たちは、大規模な事前学習済みモデルのコンテキストで既存の方法を調査し、15の多様なNLPタスクの新しいデータセットを使用した大規模な研究を含む、さまざまなテキストおよび画像分類タスクでのパフォーマンスを評価します。すべての設定において、ランダムに初期化されたモデルと比較して、複数のタスクを順番に学習する場合、一般的な事前学習によって破滅的忘却の影響が暗黙的に軽減されることがわかりました。次に、この設定で事前学習によって忘却が軽減される理由をさらに調査します。私たちは損失ランドスケープを分析することでこの現象を研究し、事前学習された重みがより広い極小点に導くことで忘却を緩和するように見えることを発見しました。この洞察に基づいて、現在のタスク損失と損失盆地の鋭さ(シャープネス)を同時に最適化し、逐次ファインチューニング中に盆地が広くなるよう明示的に促すことを提案します。この最適化アプローチは、複数の設定にわたっていくつかの最先端のタスク逐次型継続学習アルゴリズムを上回り、場合によってはタスク数に応じてサイズが増大するメモリを保持しなくても上回ることを示します。

Polynomial-Time Algorithms for Counting and Sampling Markov Equivalent DAGs with Applications
マルコフ同値なDAGの数え上げとサンプリングのための多項式時間アルゴリズムとその応用

Counting and sampling directed acyclic graphs from a Markov equivalence class are fundamental tasks in graphical causal analysis. In this paper we show that these tasks can be performed in polynomial time, solving a long-standing open problem in this area. Our algorithms are effective and easily implementable. As we show in experiments, these breakthroughs make thought-to-be-infeasible strategies in active learning of causal structures and causal effect identification with regard to a Markov equivalence class practically applicable.

マルコフ同値類からの有向非巡回グラフの数え上げとサンプリングは、グラフィカルな因果分析の基本的なタスクです。この論文では、これらのタスクが多項式時間で実行できることを示し、この分野における長年の未解決問題を解決します。私たちのアルゴリズムは効果的で、簡単に実装できます。実験で示すように、これらのブレークスルーにより、因果構造の能動学習やマルコフ同値類に関する因果効果の識別において、実行不可能と考えられていた戦略が実用的に適用可能になります。

An Inexact Augmented Lagrangian Algorithm for Training Leaky ReLU Neural Network with Group Sparsity
グループスパース性を持つLeaky ReLUニューラルネットワークを訓練するための非厳密拡張ラグランジュアルゴリズム

The leaky ReLU network with a group sparse regularization term has been widely used in recent years. However, training such a network yields a nonsmooth nonconvex optimization problem and there is a lack of approaches to compute a stationary point deterministically. In this paper, we first resolve the multi-layer composite term in the original optimization problem by introducing auxiliary variables and additional constraints. We show the new model has a nonempty and bounded solution set and its feasible set satisfies the Mangasarian-Fromovitz constraint qualification. Moreover, we show the relationship between the new model and the original problem. Remarkably, we propose an inexact augmented Lagrangian algorithm for solving the new model, and show the convergence of the algorithm to a KKT point. Numerical experiments demonstrate that our algorithm is more efficient for training sparse leaky ReLU neural networks than some well-known algorithms.

近年、グループスパース正則化項を持つLeaky ReLUネットワークが広く利用されています。ただし、このようなネットワークの学習は非平滑かつ非凸な最適化問題となり、定常点を決定論的に計算するアプローチが不足しています。この論文では、まず、補助変数と追加の制約を導入することにより、元の最適化問題の多層合成項を解消します。新しいモデルが空でない有界な解集合を持ち、その実行可能集合がMangasarian-Fromovitz制約想定を満たすことを示します。さらに、新しいモデルと元の問題との関係を示します。特筆すべきことに、新しいモデルを解くための非厳密拡張ラグランジュアルゴリズムを提案し、アルゴリズムのKKT点への収束を示します。数値実験では、私たちのアルゴリズムは、いくつかのよく知られたアルゴリズムよりも、スパースなLeaky ReLUニューラルネットワークのトレーニングに効率的であることが示されています。

Entropic Fictitious Play for Mean Field Optimization Problem
平均場最適化問題に対するエントロピー仮想プレイ

We study two-layer neural networks in the mean field limit, where the number of neurons tends to infinity. In this regime, the optimization over the neuron parameters becomes the optimization over the probability measures, and by adding an entropic regularizer, the minimizer of the problem is identified as a fixed point. We propose a novel training algorithm named entropic fictitious play, inspired by the classical fictitious play in game theory for learning Nash equilibriums, to recover this fixed point, and the algorithm exhibits a two-loop iteration structure. Exponential convergence is proved in this paper and we also verify our theoretical results by simple numerical examples.

私たちは、ニューロンの数が無限大に向かう平均場極限において2層ニューラルネットワークを研究します。この領域では、ニューロンパラメータに関する最適化は確率測度に関する最適化となり、エントロピー正則化項を追加することで、問題の最小化解は不動点として特徴付けられます。この不動点を求めるために、ナッシュ均衡を学習するためのゲーム理論の古典的な仮想プレイに着想を得た、エントロピー仮想プレイと名付けた新しい学習アルゴリズムを提案します。このアルゴリズムは2重ループの反復構造を持ちます。この論文では指数収束を証明し、簡単な数値例によって理論的結果も検証します。
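
For orientation, the classical fictitious play that the abstract cites as its inspiration can be sketched in a few lines. This is a minimal illustration on the matching pennies game, not the entropic algorithm of the paper; the function name, the initial actions, and the tie-breaking rule are assumptions made for the example.

```python
def fictitious_play(payoff, rounds=10000):
    """Classical fictitious play for a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives -payoff[i][j].
    Returns the empirical action frequencies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] += 1  # arbitrary initial actions (an assumption of this sketch)
    col_counts[0] += 1
    for _ in range(rounds - 1):
        # Each player best-responds to the opponent's empirical action counts.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        best_row = max(range(n_rows), key=lambda i: row_values[i])  # ties: lowest index
        best_col = max(range(n_cols), key=lambda j: col_values[j])
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    total = float(rounds)
    return [c / total for c in row_counts], [c / total for c in col_counts]


matching_pennies = [[1, -1], [-1, 1]]
row_freq, col_freq = fictitious_play(matching_pennies)
```

In matching pennies the empirical frequencies drift toward the mixed equilibrium (1/2, 1/2), which is the kind of fixed point that the entropic variant is designed to recover in the space of probability measures.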

GFlowNet Foundations
GFlowNet の基礎

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets, including a new local and efficient training objective called detailed balance for the analogy with MCMC. GFlowNets can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, continuous actions and modular energy functions.

生成フローネットワーク(GFlowNet)は、能動学習コンテキストで多様な候補セットをサンプリングする方法として導入され、与えられた報酬関数にほぼ比例してサンプリングするトレーニング目標が与えられています。この論文では、MCMCとの類似性から詳細バランスと呼ばれる新しいローカルで効率的なトレーニング目標を含む、GFlowNetの追加の理論的特性をいくつか示します。GFlowNetは、一部の変数が指定されていない場合の結合確率分布とそれに対応する周辺分布を推定するために使用でき、特に興味深いことに、セットやグラフなどの複合オブジェクト上の分布を表すことができます。GFlowNetは、計算コストの高いMCMC法で通常行われる作業を、単一のトレーニング済み生成パスで償却します。また、パーティション関数と自由エネルギー、サブセット(サブグラフ)が与えられたスーパーセット(スーパーグラフ)の条件付き確率、および与えられたセット(グラフ)のすべてのスーパーセット(スーパーグラフ)上の周辺分布を推定するためにも使用できます。エントロピーと相互情報量、連続アクション、モジュラーエネルギー関数の推定を可能にするバリエーションを導入します。
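
The detailed balance objective mentioned in the abstract takes its name from the reversibility condition of MCMC. A sketch of the condition follows, in notation assumed here rather than quoted from the abstract: $F$ is a state flow, and $P_F$ and $P_B$ are forward and backward transition probabilities. The squared log-ratio is one natural way to turn the condition into a loss.

```latex
F(s)\, P_F(s' \mid s) = F(s')\, P_B(s \mid s') \quad \text{for every edge } s \to s',
\qquad
\mathcal{L}_{\mathrm{DB}}(s, s') = \left( \log \frac{F(s)\, P_F(s' \mid s)}{F(s')\, P_B(s \mid s')} \right)^{2}.
```

The loss involves a single transition, which is why the abstract calls the objective local; how terminal rewards enter the condition is left to the paper.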

LibMTL: A Python Library for Deep Multi-Task Learning
LibMTL: ディープ・マルチタスク学習のためのPythonライブラリ

This paper presents LibMTL, an open-source Python library built on PyTorch, which provides a unified, comprehensive, reproducible, and extensible implementation framework for Multi-Task Learning (MTL). LibMTL considers different settings and approaches in MTL, and it supports a large number of state-of-the-art MTL methods, including 13 optimization strategies and 8 architectures. Moreover, the modular design in LibMTL makes it easy to use and well-extensible, thus users can easily and fast develop new MTL methods, compare with existing MTL methods fairly, or apply MTL algorithms to real-world applications with the support of LibMTL. The source code and detailed documentations of LibMTL are available at https://github.com/median-research-group/LibMTL and https://libmtl.readthedocs.io, respectively.

この論文では、PyTorch上に構築されたオープンソースのPythonライブラリであるLibMTLについて紹介します。これは、マルチタスク学習(MTL)のための統一された、包括的で、再現性があり、拡張可能な実装フレームワークを提供します。LibMTLは、MTLのさまざまな設定とアプローチを考慮し、13の最適化戦略と8つのアーキテクチャを含む多数の最先端のMTLメソッドをサポートしています。さらに、LibMTLのモジュラー設計により、使いやすく、拡張性が高いため、ユーザーは新しいMTLメソッドを簡単かつ迅速に開発したり、既存のMTLメソッドと公正に比較したり、LibMTLのサポートを受けてMTLアルゴリズムを実際のアプリケーションに適用したりできます。LibMTLのソースコードと詳細なドキュメントは、それぞれhttps://github.com/median-research-group/LibMTLとhttps://libmtl.readthedocs.ioで入手できます。

Minimax Risk Classifiers with 0-1 Loss
0-1損失を用いたミニマックスリスク分類器

Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable confidence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.

教師あり分類手法は、トレーニングサンプルを使用して、期待0-1損失(誤り確率)が小さい分類規則を学習します。従来の方法では、0-1損失の代わりに代理損失を使用し、特定の規則の族(仮説クラス)を考慮することで、扱いやすい学習とサンプル外への汎化を実現しています。この論文では、基になる分布を調整可能な信頼度で含みうる分布の不確実性集合に関して、最悪ケースの0-1損失を最小化するミニマックスリスク分類器(MRC)を紹介します。MRCが学習時にタイトな性能保証を提供できること、および特性カーネルによって与えられる特徴写像を使用すると強普遍一致性を持つことを示します。また、この論文では、MRC学習のための効率的な最適化手法を提案し、提示した手法が実際に正確な分類とタイトな性能保証の両方を提供できることを示します。

Augmented Sparsifiers for Generalized Hypergraph Cuts
一般化ハイパーグラフカットのための拡張スパーシファイア

Hypergraph generalizations of many graph cut problems and algorithms have recently been introduced to better model data and systems characterized by multiway relationships. Recent work in machine learning and theoretical computer science uses a generalized cut function for a hypergraph $\mathcal{H} = (V,\mathcal{E})$ that associates each hyperedge $e \in \mathcal{E}$ with a splitting function ${\bf w}_e$, which assigns a penalty to each way of separating the nodes of $e$. When each ${\bf w}_e$ satisfies ${\bf w}_e(S) = g(\lvert S \rvert)$ for some concave function $g$, previous work has shown how to reduce the generalized hypergraph cut problem to a directed graph cut problem, although the resulting graph may be very dense. We introduce a framework for sparsifying hypergraph-to-graph reductions, where the hypergraph cut function is $(1+\varepsilon)$-approximated by a cut on a directed graph. For $\varepsilon > 0$ we need at most $O(\varepsilon^{-1}|e| \log |e|)$ edges to reduce any hyperedge $e$, while only $O(|e| \varepsilon^{-1/2} \log \log \frac{1}{\varepsilon})$ edges are needed to approximate the clique expansion, a widely used heuristic in hypergraph clustering. Our framework leads to improved results for solving cut problems in co-occurrence graphs, decomposable submodular function minimization problems, and localized hypergraph clustering problems.

最近、多くのグラフカット問題とアルゴリズムのハイパーグラフへの一般化が導入され、多項関係を特徴とするデータとシステムをより適切にモデル化できるようになりました。機械学習と理論計算機科学における最近の研究では、ハイパーグラフ$\mathcal{H} = (V,\mathcal{E})$に対する一般化カット関数が使用されています。この関数は、各ハイパーエッジ$e \in \mathcal{E}$に分割関数${\bf w}_e$を関連付け、$e$のノードを分離する各方法にペナルティを割り当てます。各${\bf w}_e$が何らかの凹関数$g$に対して${\bf w}_e(S) = g(\lvert S \rvert)$を満たす場合、一般化ハイパーグラフカット問題を有向グラフのカット問題に帰着する方法が先行研究で示されていますが、結果として得られるグラフは非常に密になる可能性があります。私たちは、ハイパーグラフからグラフへの帰着をスパース化するフレームワークを紹介します。このフレームワークでは、ハイパーグラフのカット関数が有向グラフ上のカットによって$(1+\varepsilon)$近似されます。$\varepsilon > 0$の場合、任意のハイパーエッジ$e$を帰着するために必要なエッジは高々$O(\varepsilon^{-1}|e| \log |e|)$個であり、ハイパーグラフクラスタリングで広く使用されているヒューリスティックであるクリーク展開を近似するには、$O(|e| \varepsilon^{-1/2} \log \log \frac{1}{\varepsilon})$個のエッジだけで十分です。このフレームワークにより、共起グラフのカット問題、分解可能な劣モジュラ関数の最小化問題、および局所的なハイパーグラフクラスタリング問題の解法について、改善された結果が得られます。
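
To make the cardinality-based splitting function ${\bf w}_e(S) = g(\lvert S \rvert)$ concrete, the sketch below checks that the clique expansion named in the abstract corresponds to the concave choice $g(k) = k(\lvert e \rvert - k)$. The function names and the toy hyperedge are hypothetical, and the code illustrates only the dense reduction that the paper sparsifies, not the sparsifier itself.

```python
from itertools import combinations


def clique_expansion_cut(hyperedge, side):
    """Number of clique edges on `hyperedge` that cross the partition (side, rest)."""
    side = set(side)
    return sum(1 for u, v in combinations(hyperedge, 2) if (u in side) != (v in side))


def cardinality_splitting(hyperedge, side, g):
    """Cardinality-based splitting function: g applied to the number of nodes of e in `side`."""
    return g(len(set(side) & set(hyperedge)))


e = ["a", "b", "c", "d"]


def g_clique(k):
    """Concave penalty induced by replacing the hyperedge with a clique."""
    return k * (len(e) - k)


# Every way of splitting e costs the same under both descriptions.
all_match = all(
    clique_expansion_cut(e, s) == cardinality_splitting(e, s, g_clique)
    for r in range(len(e) + 1)
    for s in combinations(e, r)
)
```

A clique on $\lvert e \rvert$ nodes uses a quadratic number of edges, which is the density that the stated $O(\lvert e \rvert \varepsilon^{-1/2} \log \log \frac{1}{\varepsilon})$ bound improves on.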

Non-stationary Online Learning with Memory and Non-stochastic Control
記憶と非確率的制御による非定常オンライン学習

We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions and thus captures temporal effects of learning problems. In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments, which competes algorithms’ decisions with a sequence of changing comparators. We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret in terms of time horizon, non-stationarity measure, and memory length. The key technical challenge is how to control the switching cost, the cumulative movements of player’s decisions, which is neatly addressed by a novel switching-cost-aware online ensemble approach equipped with a new meta-base decomposition of dynamic policy regret and a careful design of meta-learner and base-learner that explicitly regularizes the switching cost. The results are further applied to tackle non-stationarity in online non-stochastic control (Agarwal et al., 2019), i.e., controlling a linear dynamical system with adversarial disturbance and convex cost functions. We derive a novel gradient-based controller with dynamic policy regret guarantees, which is the first controller provably competitive to a sequence of changing policies for online non-stochastic control.

私たちは、メモリ付きオンライン凸最適化(OCO)の問題を研究します。メモリ付きオンライン凸最適化では、損失関数が過去の決定に依存することを可能にし、学習問題の時間的影響を捉えます。この論文では、非定常環境に堅牢なアルゴリズムを設計するためのパフォーマンス指標として動的ポリシーリグレットを導入します。動的ポリシーリグレットは、アルゴリズムの決定を変化するコンパレータのシーケンスと競合させます。時間範囲、非定常性指標、およびメモリ長の観点から、最適な動的ポリシーリグレットを証明できる、メモリ付きOCOの新しいアルゴリズムを提案します。重要な技術的課題は、切り替えコスト、つまりプレーヤーの決定の累積的な動きを制御する方法ですが、これは、動的ポリシーリグレットの新しいメタベース分解と、切り替えコストを明示的に正規化するメタ学習者とベース学習者の慎重な設計を備えた、切り替えコストを考慮した新しいオンラインアンサンブルアプローチによってうまく対処されています。この結果は、オンライン非確率制御(Agarwalら、2019)における非定常性の解決、つまり敵対的外乱と凸コスト関数を持つ線形動的システムの制御にさらに適用されます。動的ポリシーリグレット保証を備えた新しい勾配ベースのコントローラーを導出します。これは、オンライン非確率制御の一連の変化するポリシーに対して競争力があることが証明された最初のコントローラーです。

L0Learn: A Scalable Package for Sparse Learning using L0 Regularization
L0Learn: L0正則化を用いたスパース学習のためのスケーラブルなパッケージ

We present L0Learn: an open-source package for sparse linear regression and classification using $\ell_0$ regularization. L0Learn implements scalable, approximate algorithms, based on coordinate descent and local combinatorial optimization. The package is built using C++ and has user-friendly R and Python interfaces. L0Learn can address problems with millions of features, achieving competitive run times and statistical performance with state-of-the-art sparse learning packages. L0Learn is available on both CRAN and GitHub.

私たちは、$\ell_0$正則化を使用したスパース線形回帰と分類のためのオープンソースパッケージであるL0Learnを紹介します。L0Learnは、座標降下法と局所的な組み合わせ最適化に基づくスケーラブルな近似アルゴリズムを実装しています。このパッケージはC++で構築されており、ユーザーフレンドリなRとPythonのインターフェイスを備えています。L0Learnは、数百万の特徴量を持つ問題に対処でき、最先端のスパース学習パッケージに匹敵する実行時間と統計的性能を達成します。L0Learnは、CRANとGitHubの両方で利用できます。
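
As a rough picture of what coordinate descent looks like under an $\ell_0$ penalty, the sketch below minimizes $\frac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_0$ by cycling over coordinates. It is a pure-Python illustration under assumed names and does not use or reproduce the L0Learn API; the package's actual algorithms, including the local combinatorial search, are more elaborate.

```python
def l0_coordinate_descent(X, y, lam, sweeps=50):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_0 (illustrative)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta because beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        changed = False
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n))
            # Keeping the coordinate lowers the squared loss by rho^2 / (2 * col_sq[j])
            # and costs lam, so keep it only when the gain exceeds the penalty.
            new = rho / col_sq[j] if rho * rho / (2.0 * col_sq[j]) > lam else 0.0
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
                changed = True
        if not changed:
            break
    return beta


X_demo = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_demo = [3.0, 0.1, -2.0]
beta_demo = l0_coordinate_descent(X_demo, y_demo, lam=0.5)
```

With an orthonormal design the update reduces to hard thresholding at $\sqrt{2\lambda}$, so the small middle coefficient is set to zero while the two large ones are kept unchanged.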

Buffered Asynchronous SGD for Byzantine Learning
ビザンチン学習のためのバッファ付き非同期 SGD

Distributed learning has become a hot research topic due to its wide application in cluster-based large-scale learning, federated learning, edge computing, and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on the server. Furthermore, we also propose an improved variant of BASGD, called BASGD with momentum (BASGDm), by introducing local momentum into BASGD. Compared with those methods which need to store instances on server, BASGD and BASGDm have a wider scope of application. Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers.

分散学習は、クラスターベースの大規模学習、フェデレーテッドラーニング、エッジコンピューティングなどに幅広く応用されているため、注目の研究トピックになっています。従来の分散学習方法のほとんどは、通常、障害や攻撃がないことを前提としています。しかし、実際のアプリケーションでは、通信障害や悪意のある攻撃など、多くの予期しないケースが発生する可能性があります。そのため、障害や攻撃を伴う分散学習を指すビザンチン学習(BL)が最近注目を集めています。既存のBL方法のほとんどは同期型ですが、異種またはオフラインのワーカーのために、一部のアプリケーションでは実用的ではありません。このような場合、非同期BL (ABL)が通常好まれます。この論文では、ABL用のバッファ付き非同期確率的勾配降下法(BASGD)と呼ばれる新しい方法を提案します。私たちの知る限り、BASGDは、サーバーにインスタンスを保存せずに非全知攻撃に抵抗できる最初のABL方法です。さらに、ローカルモーメンタムをBASGDに導入することで、BASGDの改良版であるBASGD with momentum (BASGDm)も提案します。サーバーにインスタンスを保存する必要のある方法と比較すると、BASGDとBASGDmは適用範囲が広くなります。BASGDとBASGDmはどちらもさまざまな集約ルールと互換性があります。さらに、BASGDとBASGDmはどちらも収束し、障害や攻撃に耐えられることが証明されています。実験結果によると、ワーカーに障害や攻撃が発生した場合、私たちの手法は既存のABLベースラインを大幅に上回ります。

A Non-parametric View of FedAvg and FedProx: Beyond Stationary Points
FedAvg と FedProx のノンパラメトリックな見方:定常点を超えて

Federated Learning (FL) is a promising decentralized learning framework and has great potential in privacy preservation and in lowering the computation load at the cloud. Recent work showed that FedAvg and FedProx, the two widely adopted FL algorithms, fail to reach the stationary points of the global optimization objective even for homogeneous linear regression problems. Further, there is concern that the common model learned might not generalize well locally at all in the presence of heterogeneity. In this paper, we analyze the convergence and statistical efficiency of FedAvg and FedProx, addressing the above two concerns. Our analysis is based on the standard non-parametric regression in a reproducing kernel Hilbert space (RKHS), and allows for heterogeneous local data distributions and unbalanced local datasets. We prove that the estimation errors, measured in either the empirical norm or the RKHS norm, decay with a rate of $1/t$ in general and exponentially for finite-rank kernels. In certain heterogeneous settings, these upper bounds also imply that both FedAvg and FedProx achieve the optimal error rate. To further analytically quantify the impact of the heterogeneity at each client, we propose and characterize a novel notion, federation gain, defined as the reduction of the estimation error for a client to join the FL. We discover that when the data heterogeneity is moderate, a client with limited local data can benefit from a common model with a large federation gain. Two new insights introduced by considering the statistical aspect are: (1) requiring the standard bounded dissimilarity is pessimistic for the convergence analysis of FedAvg and FedProx; (2) despite inconsistency of stationary points, their limiting points are unbiased estimators of the underlying truth. Numerical experiments further corroborate our theoretical findings.

フェデレーテッドラーニング(FL)は、将来有望な分散学習フレームワークであり、プライバシー保護とクラウドでの計算負荷の軽減に大きな可能性を秘めています。最近の研究では、広く採用されている2つのFLアルゴリズムであるFedAvgとFedProxは、均質な線形回帰問題であっても、グローバル最適化目標の定常点に到達できないことが示されました。さらに、異質性がある場合、学習した共通モデルが局所的にまったく一般化されない可能性があることが懸念されています。この論文では、FedAvgとFedProxの収束と統計的効率を分析し、上記の2つの懸念に対処します。私たちの分析は、再生カーネルヒルベルト空間(RKHS)での標準的なノンパラメトリック回帰に基づいており、異質なローカルデータ分布と不均衡なローカルデータセットを許容します。経験的ノルムまたはRKHSノルムのいずれかで測定された推定誤差は、一般に$1/t$の速度で減少し、有限ランクカーネルの場合は指数関数的に減少することを証明します。特定の異質設定では、これらの上限は、FedAvgとFedProxの両方が最適なエラー率を達成することも意味します。各クライアントでの異質性の影響をさらに分析的に定量化するために、クライアントがFLに参加するための推定エラーの削減として定義される新しい概念である連合ゲインを提案し、特徴付けます。データの異質性が中程度の場合、ローカルデータが限られているクライアントは、大きな連合ゲインを持つ共通モデルの恩恵を受けることができることがわかりました。統計的側面を考慮することでもたらされた2つの新しい洞察は、(1)標準的な境界付き相違度を要求することは、FedAvgとFedProxの収束分析にとって悲観的である、(2)定常点の不一致にもかかわらず、それらの極限点は基礎となる真実の不偏推定値である、というものです。数値実験により、理論的発見がさらに裏付けられました。
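
For readers unfamiliar with the two algorithms, the sketch below runs FedAvg and FedProx on a one-parameter linear regression. It is a toy parametric illustration with hypothetical function names and data, not the RKHS setting analyzed in the paper. The only difference between the two methods here is the proximal term controlled by `mu`.

```python
def local_update(w_global, data, lr, steps, mu=0.0):
    """Local gradient steps on one client's half mean squared loss for the model y = w * x.

    mu > 0 adds the FedProx proximal term (mu / 2) * (w - w_global) ** 2; mu = 0 gives FedAvg.
    """
    w = w_global
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        grad += mu * (w - w_global)
        w -= lr * grad
    return w


def federated_training(clients, rounds=50, local_steps=5, lr=0.1, mu=0.0, w0=0.0):
    """Each round: broadcast w, update locally, then average weighted by local data size."""
    total = sum(len(data) for data in clients)
    w = w0
    for _ in range(rounds):
        local_models = [local_update(w, data, lr, local_steps, mu) for data in clients]
        w = sum(len(data) * w_k for data, w_k in zip(clients, local_models)) / total
    return w


# Two clients with unbalanced, noiseless data generated by y = 2 * x.
clients_demo = [[(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)], [(1.0, 2.0), (2.0, 4.0)]]
w_fedavg = federated_training(clients_demo)
w_fedprox = federated_training(clients_demo, mu=0.1)
```

In this noiseless and homogeneous toy case both variants recover the slope 2; the abstract's concerns arise once the local objectives no longer share a minimizer.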

Multiplayer Performative Prediction: Learning in Decision-Dependent Games
マルチプレイヤーパフォーマティブ予測:意思決定依存型ゲームにおける学習

Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers’ actions. This paper formulates a new game theoretic framework for this phenomenon, called multi-player performative prediction. We focus on two distinct solution concepts, namely (i) performatively stable equilibria and (ii) Nash equilibria of the game. The latter equilibria are arguably more informative, but are generally computationally difficult to find since they are solutions of non-monotone games. We show that under mild assumptions, the performatively stable equilibria can be found efficiently by a variety of algorithms, including repeated retraining and the repeated (stochastic) gradient method. We then establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.

学習の問題は、一般的に、集団データが競合する意思決定者の行動に反応するという興味深いフィードバックメカニズムを示します。この論文では、マルチプレイヤーパフォーマティブ予測と呼ばれるこの現象に対する新しいゲーム理論的フレームワークを定式化します。2つの異なるソリューションコンセプト、つまり(i)パフォーマティブ安定均衡と(ii)ゲームのナッシュ均衡に焦点を当てます。後者の均衡は、おそらくより有益ですが、非単調ゲームのソリューションであるため、一般的に計算的に見つけるのが困難です。軽度の仮定の下では、パフォーマティブ安定均衡は、反復再トレーニングや反復(確率的)勾配法などのさまざまなアルゴリズムによって効率的に見つけられることを示します。次に、ゲームの強い単調性に対する透明な十分条件を確立し、それを使用してナッシュ均衡を見つけるためのアルゴリズムを開発します。各プレイヤーが分布のパラメトリック記述と経験的リスクの勾配ステップを交互に学習する、導関数フリー法と適応勾配アルゴリズムを調査します。合成および半合成の数値実験で結果を示します。

Variational Inverting Network for Statistical Inverse Problems of Partial Differential Equations
偏微分方程式の統計的逆問題に対する変分反転ネットワーク

To quantify uncertainties in inverse problems of partial differential equations (PDEs), we formulate them into statistical inference problems using Bayes’ formula. Recently, well-justified infinite-dimensional Bayesian analysis methods have been developed to construct dimension-independent algorithms. However, there are three challenges for these infinite-dimensional Bayesian methods: prior measures usually act as regularizers and are not able to incorporate prior information efficiently; complex noises, such as more practical non-i.i.d. distributed noises, are rarely considered; and time-consuming forward PDE solvers are needed to estimate posterior statistical quantities. To address these issues, an infinite-dimensional inference framework has been proposed based on the infinite-dimensional variational inference method and deep generative models. Specifically, by introducing some measure equivalence assumptions, we derive the evidence lower bound in the infinite-dimensional setting and provide possible parametric strategies that yield a general inference framework called the Variational Inverting Network (VINet). This inference framework can encode prior and noise information from learning examples. In addition, relying on the power of deep neural networks, the posterior mean and variance can be efficiently and explicitly generated in the inference stage. In numerical experiments, we design specific network structures that yield a computable VINet from the general inference framework. Numerical examples of linear inverse problems of an elliptic equation and the Helmholtz equation are presented to illustrate the effectiveness of the proposed inference framework.

偏微分方程式(PDE)の逆問題における不確実性を定量化するために、ベイズの公式を使用して統計的推論問題に定式化します。最近、次元に依存しないアルゴリズムを構築するために、十分に正当化された無限次元ベイズ分析法が開発されました。ただし、これらの無限次元ベイズ法には3つの課題があります。事前測度は通常、正則化子として機能し、事前情報を効率的に組み込むことができません。より実用的な非i.i.d.分散ノイズなどの複雑なノイズはほとんど考慮されません。事後統計量を推定するには、時間のかかる順方向PDEソルバーが必要です。これらの問題に対処するために、無限次元変分推論法と深層生成モデルに基づく無限次元推論フレームワークが提案されています。具体的には、いくつかの測度同値仮定を導入することにより、無限次元設定での証拠の下限を導出し、変分反転ネットワーク(VINet)と呼ばれる一般的な推論フレームワークを生み出す可能なパラメトリック戦略を提供します。この推論フレームワークは、学習例から事前情報とノイズ情報をエンコードできます。さらに、ディープニューラルネットワークの力を利用することで、推論段階で事後平均と分散を効率的かつ明示的に生成できます。数値実験では、一般的な推論フレームワークから計算可能なVINetを生成する特定のネットワーク構造を設計します。楕円方程式とヘルムホルツ方程式の線形逆問題の数値例を示し、提案された推論フレームワークの有効性を示します。

Model-based Causal Discovery for Zero-Inflated Count Data
ゼロ膨張カウントデータに対するモデルベース因果関係の発見

Zero-inflated count data arise in a wide range of scientific areas such as social science, biology, and genomics. Very few causal discovery approaches can adequately account for excessive zeros as well as various features of multivariate count data such as overdispersion. In this paper, we propose a new zero-inflated generalized hypergeometric directed acyclic graph (ZiG-DAG) model for inference of causal structure from purely observational zero-inflated count data. The proposed ZiG-DAGs exploit a broad family of generalized hypergeometric probability distributions and are useful for modeling various types of zero-inflated count data with great flexibility. In addition, ZiG-DAGs allow for both linear and nonlinear causal relationships. We prove that the causal structure is identifiable for the proposed ZiG-DAGs via a general proof technique for count data, which is applicable beyond the proposed model for investigating causal identifiability. Score-based algorithms are developed for causal structure learning. Extensive synthetic experiments as well as a real dataset with known ground truth demonstrate the superior performance of the proposed method against state-of-the-art alternative methods in discovering causal structure from observational zero-inflated count data. An application of reverse-engineering a gene regulatory network from a single-cell RNA-sequencing dataset illustrates the utility of ZiG-DAGs in practice.

ゼロインフレカウントデータは、社会科学、生物学、ゲノミクスなど、幅広い科学分野で発生します。過剰なゼロや、過剰分散などの多変量カウントデータのさまざまな特徴を適切に説明できる因果発見アプローチはほとんどありません。この論文では、純粋に観測的なゼロインフレカウントデータから因果構造を推論するための、新しいゼロインフレ一般化超幾何有向非巡回グラフ(ZiG-DAG)モデルを提案します。提案されたZiG-DAGは、一般化超幾何確率分布の広範なファミリーを活用し、さまざまなタイプのゼロインフレカウントデータを非常に柔軟にモデル化するのに役立ちます。さらに、ZiG-DAGでは、線形と非線形の両方の因果関係が可能です。提案されたZiG-DAGの因果構造は、カウントデータに対する一般的な証明手法によって識別可能であることを証明します。これは、因果識別可能性の調査に提案モデルを超えて適用できます。スコアベースのアルゴリズムは因果構造学習用に開発されています。広範囲にわたる合成実験と既知のグラウンドトゥルースを持つ実際のデータセットは、観測ゼロインフレカウントデータから因果構造を発見する際の最先端の代替方法に対して提案された方法の優れたパフォーマンスを実証しています。単一細胞RNAシーケンスデータセットから遺伝子制御ネットワークをリバースエンジニアリングするアプリケーションは、ZiG-DAGの実際の有用性を示しています。

Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity
一般空間を持つMDPのためのQ学習:弱い連続性下での量子化による収束と近傍最適性

Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs) (also called controlled Markov chains) and various efforts have been made in the literature towards the applicability of such algorithms for continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions (called Quantized Q-Learning) converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality with either explicit performance bounds or which are guaranteed to be asymptotically optimal. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a partially observed Markov decision process (POMDP), (ii) utilizing near optimality and convergence results of Q-learning for POMDPs, and (iii) finally, near-optimality of finite state model approximations for MDPs with weakly continuous kernels which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs.

強化学習アルゴリズムでは、マルコフ決定過程(MDP) (制御マルコフ連鎖とも呼ばれる)の状態空間と行動空間の有限性が求められることが多く、このようなアルゴリズムを連続的な状態空間と行動空間に適用できるようにするためのさまざまな取り組みが文献で行われてきました。この論文では、非常に緩やかな正則性条件(特に、MDPの遷移カーネルの弱連続性のみを必要とする条件)の下で、状態と行動の量子化による標準Borel MDPのQ学習(量子化Q学習と呼ばれる)が極限に収束し、さらにこの極限が最適性方程式を満たし、明示的な性能限界を伴うか、漸近的に最適であることが保証されるかのいずれかの形で近似最適性が得られることを示します。私たちのアプローチは、(i)量子化を観測カーネルと見なし、量子化されたMDPを部分観測マルコフ決定過程(POMDP)と見なすこと、(ii)POMDPに対するQ学習の近似最適性と収束の結果を利用すること、そして(iii)最後に、弱連続カーネルを持つMDPに対する有限状態モデル近似の近似最適性に基づいており、この近似が構築されたPOMDPの不動点に対応することを示します。したがって、私たちの論文は、連続MDPに対するQ学習の適用性に関する非常に一般的な収束と近似の結果を示しています。
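
The idea of running tabular Q-learning on quantized states can be sketched as follows. The one-dimensional dynamics, the reward, the uniform quantizer, and the step-size rule are all hypothetical choices made for illustration; they are not the paper's construction and carry none of its guarantees.

```python
import random


def quantize(x, num_bins):
    """Map a state in [0, 1] to the index of a uniform bin."""
    return min(int(x * num_bins), num_bins - 1)


def step(x, action, rng):
    """Toy dynamics (hypothetical): drift by 0.1 left or right plus noise; reward is the next state."""
    drift = 0.1 if action == 1 else -0.1
    x_next = min(1.0, max(0.0, x + drift + rng.uniform(-0.02, 0.02)))
    return x_next, x_next


def quantized_q_learning(num_bins=4, num_steps=20000, gamma=0.8, seed=0):
    """Tabular Q-learning in which the learner only sees the bin index of the state."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(num_bins)]
    visits = [[0, 0] for _ in range(num_bins)]
    x = rng.random()
    for _ in range(num_steps):
        b = quantize(x, num_bins)
        a = rng.randrange(2)            # uniform exploration policy
        x_next, reward = step(x, a, rng)
        visits[b][a] += 1
        alpha = 1.0 / visits[b][a]      # decaying step size per (bin, action) pair
        target = reward + gamma * max(q[quantize(x_next, num_bins)])
        q[b][a] += alpha * (target - q[b][a])
        x = x_next
    return q


q_table = quantized_q_learning()
greedy_actions = [max(range(2), key=lambda a: row[a]) for row in q_table]
```

Because the learner observes only the bin index, the quantized problem behaves like a partially observed one, which is the viewpoint the abstract describes in step (i).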

CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges
CodaLab Competitions:科学チャレンジを開催するためのオープンソースプラットフォーム

CodaLab Competitions is an open source web platform designed to help data scientists and research teams to crowd-source the resolution of machine learning problems through the organization of competitions, also called challenges or contests. CodaLab Competitions provides useful features such as multiple phases, results and code submissions, multi-score leaderboards, and jobs running inside Docker containers. The platform is very flexible and can handle large scale experiments, by allowing organizers to upload large datasets and provide their own CPU or GPU compute workers.

CodaLab Competitionsは、データサイエンティストや研究チームが、チャレンジやコンテストとも呼ばれるコンペティションの組織を通じて、機械学習の問題解決をクラウドソーシングできるように設計されたオープンソースのWebプラットフォームです。CodaLab Competitionsは、複数のフェーズ、結果とコードの提出、マルチスコアリーダーボード、Dockerコンテナ内で実行されるジョブなどの便利な機能を提供します。このプラットフォームは非常に柔軟性があり、オーガナイザーが大規模なデータセットをアップロードし、独自のCPUまたはGPUコンピューティングワーカーを提供できるため、大規模な実験を処理できます。

Contrasting Identifying Assumptions of Average Causal Effects: Robustness and Semiparametric Efficiency
平均因果効果の対照的な同定仮定:ロバスト性とセミパラメトリック効率

Semiparametric inference on average causal effects from observational data is based on assumptions yielding identification of the effects. In practice, several distinct identifying assumptions may be plausible; an analyst has to make a delicate choice between these models. In this paper, we study three identifying assumptions based on the potential outcome framework: the back-door assumption, which uses pre-treatment covariates, the front-door assumption, which uses mediators, and the two-door assumption using pre-treatment covariates and mediators simultaneously. We provide the efficient influence functions and the corresponding semiparametric efficiency bounds that hold under these assumptions, and their combinations. We demonstrate that neither of the identification models provides uniformly the most efficient estimation and give conditions under which some bounds are lower than others. We show when semiparametric estimating equation estimators based on influence functions attain the bounds, and study the robustness of the estimators to misspecification of the nuisance models. The theory is complemented with simulation experiments on the finite sample behavior of the estimators. The results obtained are relevant for an analyst facing a choice between several plausible identifying assumptions and corresponding estimators. Our results show that this choice implies a trade-off between efficiency and robustness to misspecification of the nuisance models.

観察データから平均因果効果を推定するセミパラメトリック推論は、効果の識別をもたらす仮定に基づいています。実際には、複数の異なる識別仮定が妥当である可能性があり、分析者はこれらのモデルの間で微妙な選択を行う必要があります。この論文では、潜在的結果フレームワークに基づく3つの識別仮定、すなわち、処置前共変量を使用するバックドア仮定、媒介変数を使用するフロントドア仮定、処置前共変量と媒介変数を同時に使用するツードア仮定について検討します。これらの仮定およびその組み合わせの下で成立する有効影響関数と、対応するセミパラメトリック効率限界を示します。どの識別モデルも一様に最も効率的な推定を与えるわけではないことを示し、ある限界が他の限界よりも低くなる条件を与えます。影響関数に基づくセミパラメトリック推定方程式推定量がどのような場合に限界を達成するかを示し、局外モデルの誤指定に対する推定量の頑健性を検討します。この理論は、推定量の有限サンプルでの挙動に関するシミュレーション実験によって補完されます。得られた結果は、いくつかの妥当な識別仮定とそれに対応する推定量の選択に直面している分析者にとって重要です。私たちの結果は、この選択が、効率性と、局外モデルの誤指定に対する頑健性との間のトレードオフを伴うことを示しています。

Variational Gibbs Inference for Statistical Model Estimation from Incomplete Data
不完全データからの統計モデル推定のための変分ギブス推論

Statistical models are central to machine learning with broad applicability across a range of downstream tasks. The models are controlled by free parameters that are typically estimated from data by maximum-likelihood estimation or approximations thereof. However, when faced with real-world data sets many of the models run into a critical issue: they are formulated in terms of fully-observed data, whereas in practice the data sets are plagued with missing data. The theory of statistical model estimation from incomplete data is conceptually similar to the estimation of latent-variable models, where powerful tools such as variational inference (VI) exist. However, in contrast to standard latent-variable models, parameter estimation with incomplete data often requires estimating exponentially-many conditional distributions of the missing variables, hence making standard VI methods intractable. We address this gap by introducing variational Gibbs inference (VGI), a new general-purpose method to estimate the parameters of statistical models from incomplete data. We validate VGI on a set of synthetic and real-world estimation tasks, estimating important machine learning models such as variational autoencoders and normalising flows from incomplete data. The proposed method, whilst general-purpose, achieves competitive or better performance than existing model-specific estimation methods.

統計モデルは機械学習の中心であり、さまざまな下流タスクに幅広く適用できます。モデルは自由パラメータによって制御され、パラメータは通常、最尤推定またはその近似によってデータから推定されます。しかし、実際のデータセットに直面すると、多くのモデルが重大な問題にぶつかります。モデルは完全に観測されたデータを前提に定式化されているのに対し、実際のデータセットには欠損データが多く含まれます。不完全なデータからの統計モデル推定の理論は、変分推論(VI)などの強力なツールが存在する潜在変数モデルの推定と概念的に似ています。ただし、標準的な潜在変数モデルとは対照的に、不完全なデータによるパラメータ推定では、欠損変数の条件付き分布を指数関数的に多く推定する必要があることが多く、そのため標準的なVI法は計算困難になります。このギャップに対処するために、不完全なデータから統計モデルのパラメータを推定する新しい汎用手法である変分ギブス推論(VGI)を導入します。私たちは、一連の合成および実世界の推定タスクでVGIを検証し、変分オートエンコーダや正規化フローなどの重要な機械学習モデルを不完全なデータから推定します。提案手法は汎用的でありながら、既存のモデル固有の推定手法と同等以上の性能を達成します。

Clustering and Structural Robustness in Causal Diagrams
因果図におけるクラスタリングと構造ロバスト性

Graphs are commonly used to represent and visualize causal relations. For a small number of variables, this approach provides a succinct and clear view of the scenario at hand. As the number of variables under study increases, the graphical approach may become impractical, and the clarity of the representation is lost. Clustering of variables is a natural way to reduce the size of the causal diagram, but it may erroneously change the essential properties of the causal relations if implemented arbitrarily. We define a specific type of cluster, called transit cluster, that is guaranteed to preserve the identifiability properties of causal effects under certain conditions. We provide a sound and complete algorithm for finding all transit clusters in a given graph and demonstrate how clustering can simplify the identification of causal effects. We also study the inverse problem, where one starts with a clustered graph and looks for extended graphs where the identifiability properties of causal effects remain unchanged. We show that this kind of structural robustness is closely related to transit clusters.

グラフは、因果関係を表現し、視覚化するためによく使用されます。変数の数が少ない場合、このアプローチは、手元のシナリオを簡潔かつ明確に表示します。調査対象の変数の数が増えると、グラフィカルアプローチは実用的でなくなり、表現の明瞭さが失われる可能性があります。変数のクラスタリングは、因果図のサイズを縮小する自然な方法ですが、恣意的に実装すると、因果関係の重要な特性を誤って変更する可能性があります。私たちは、特定の条件下で因果効果の識別可能性特性を保持することが保証されている、トランジットクラスターと呼ばれる特定のタイプのクラスターを定義します。私たちは、特定のグラフ内のすべてのトランジットクラスターを見つけるための健全で完全なアルゴリズムを提供し、クラスタリングによって因果効果の識別がどのように簡素化されるかを示します。また、逆の問題も研究します。これは、クラスター化されたグラフから始めて、因果効果の識別可能性特性が変更されない拡張グラフを探すというものです。私たちは、この種の構造的堅牢性がトランジットクラスターと密接に関連していることを示します。

MMD Aggregated Two-Sample Test
MMDアグリゲート2サンプルテスト

We propose two novel nonparametric two-sample kernel tests based on the Maximum Mean Discrepancy (MMD). First, for a fixed kernel, we construct an MMD test using either permutations or a wild bootstrap, two popular numerical procedures to determine the test threshold. We prove that this test controls the probability of type I error non-asymptotically. Hence, it can be used reliably even in settings with small sample sizes as it remains well-calibrated, which differs from previous MMD tests which only guarantee correct test level asymptotically. When the difference in densities lies in a Sobolev ball, we prove minimax optimality of our MMD test with a specific kernel depending on the smoothness parameter of the Sobolev ball. In practice, this parameter is unknown and, hence, the optimal MMD test with this particular kernel cannot be used. To overcome this issue, we construct an aggregated test, called MMDAgg, which is adaptive to the smoothness parameter. The test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We prove that MMDAgg still controls the level non-asymptotically, and achieves the minimax rate over Sobolev balls, up to an iterated logarithmic term. Our guarantees are not restricted to a specific type of kernel, but hold for any product of one-dimensional translation invariant characteristic kernels. We provide a user-friendly parameter-free implementation of MMDAgg using an adaptive collection of bandwidths. We demonstrate that MMDAgg significantly outperforms alternative state-of-the-art MMD-based two-sample tests on synthetic data satisfying the Sobolev smoothness assumption, and that, on real-world image data, MMDAgg closely matches the power of tests leveraging the use of models such as neural networks.

私たちは、最大平均差異(MMD)に基づく2つの新しいノンパラメトリック2サンプルカーネルテストを提案します。まず、固定カーネルに対して、テストしきい値を決定するための2つの一般的な数値手順である順列またはワイルドブートストラップのいずれかを使用してMMDテストを構築します。このテストがタイプIエラーの確率を非漸近的に制御することを証明します。したがって、このテストは適切に調整されているため、サンプルサイズが小さい設定でも確実に使用できます。これは、漸近的に正しいテストレベルを保証するだけの以前のMMDテストとは異なります。密度の差がソボレフボールにある場合、ソボレフボールの滑らかさのパラメーターに依存する特定のカーネルを使用したMMDテストのミニマックス最適性を証明します。実際には、このパラメーターは不明であるため、この特定のカーネルを使用した最適なMMDテストは使用できません。この問題を克服するために、滑らかさのパラメーターに適応するMMDAggと呼ばれる集約テストを構築します。使用されるカーネルのコレクション全体でテストパワーが最大化されます。カーネル選択のためのホールドアウトデータ(テストパワーの損失につながる)や、中央値ヒューリスティックなどの任意のカーネル選択は必要ありません。MMDAggはレベルを非漸近的に制御し、反復対数項までソボレフボール上でミニマックスレートを達成することを証明します。保証は特定のタイプのカーネルに限定されず、1次元の平行移動不変特性カーネルの任意の積に当てはまります。適応的な帯域幅のコレクションを使用して、ユーザーフレンドリなパラメーターフリーのMMDAgg実装を提供します。MMDAggは、ソボレフの滑らかさの仮定を満たす合成データに対して、最先端のMMDベースの2サンプルテストよりも大幅に優れていること、また、実際の画像データに対して、MMDAggはニューラルネットワークなどのモデルを使用したテストのパワーにほぼ匹敵することを実証します。
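
The fixed-kernel permutation test described above can be sketched as follows. This is a minimal illustration, not the authors' MMDAgg implementation: the Gaussian kernel, the biased (V-statistic) MMD estimate, and all function names are assumptions of this sketch, and no aggregation over bandwidths is performed. Counting the observed statistic among the permutations (the `+ 1` terms) is the standard device that keeps a permutation p-value valid at any sample size.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def mmd_permutation_test(xs, ys, bandwidth=1.0, num_permutations=200, seed=0):
    """Return (observed statistic, permutation p-value) for one fixed kernel."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    count = 1  # the observed statistic counts as one of the permutations
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        stat = mmd_squared(pooled[:len(xs)], pooled[len(xs):], bandwidth)
        if stat >= observed:
            count += 1
    return observed, count / (num_permutations + 1)
```

Rejecting the null hypothesis when the returned p-value is at most the chosen level gives a test for a single bandwidth; choosing that bandwidth well is the problem the aggregated test in the abstract addresses.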

Divide-and-Conquer Fusion
ディバイド・アンド・コンカー・フュージョン

Combining several (sample approximations of) distributions, which we term sub-posteriors, into a single distribution proportional to their product, is a common challenge. Occurring, for instance, in distributed ‘big data’ problems, or when working under multi-party privacy constraints. Many existing approaches resort to approximating the individual sub-posteriors for practical necessity, then find either an analytical approximation or sample approximation of the resulting (product-pooled) posterior. The quality of the posterior approximation for these approaches is poor when the sub-posteriors fall out-with a narrow range of distributional form, such as being approximately Gaussian. Recently, a Fusion approach has been proposed which finds an exact Monte Carlo approximation of the posterior, circumventing the drawbacks of approximate approaches. Unfortunately, existing Fusion approaches have a number of computational limitations, particularly when unifying a large number of sub-posteriors. In this paper, we generalise the theory underpinning existing Fusion approaches, and embed the resulting methodology within a recursive divide-and-conquer sequential Monte Carlo paradigm. This ultimately leads to a competitive Fusion approach, which is robust to increasing numbers of sub-posteriors.

サブ事後分布と呼ぶ複数の分布（またはそのサンプル近似）を、それらの積に比例する単一の分布に結合することは、一般的な課題です。これはたとえば、分散型「ビッグデータ」の問題や、複数の当事者間のプライバシー制約下で作業する場合に発生します。多くの既存のアプローチでは、実用上の必要性から個々のサブ事後分布を近似し、結果として得られる（積でプールした）事後分布の解析的近似またはサンプル近似を求めています。これらのアプローチによる事後近似の品質は、サブ事後分布が狭い範囲の分布形（近似的にガウス分布であるなど）から外れる場合には低くなります。最近、事後分布の厳密なモンテカルロ近似を求めるFusionアプローチが提案され、近似的アプローチの欠点を回避しています。残念ながら、既存のFusionアプローチには、特に多数のサブ事後分布を統合する場合に、計算上の制限がいくつかあります。この論文では、既存のFusionアプローチの基礎となる理論を一般化し、その結果得られた方法論を再帰的な分割統治型の逐次モンテカルロの枠組みに組み込みます。これにより、最終的には、サブ事後分布の数の増加に対して堅牢で競争力のあるFusionアプローチが実現します。

PAC-learning for Strategic Classification
戦略的分類のためのPAC学習

The study of strategic or adversarial manipulation of testing data to fool a classifier has attracted much recent attention. Most previous works have focused on two extreme situations where any testing data point either is completely adversarial or always equally prefers the positive label. In this paper, we generalize both of these through a unified framework by considering strategic agents with heterogenous preferences, and introduce the notion of strategic VC-dimension (SVC) to capture the PAC-learnability in our general strategic setup. SVC provably generalizes the recent concept of adversarial VC-dimension (AVC) introduced by Cullina et al. (2018). We instantiate our framework for the fundamental strategic linear classification problem. We fully characterize: (1) the statistical learnability of linear classifiers by pinning down its SVC; (2) its computational tractability by pinning down the complexity of the empirical risk minimization problem. Interestingly, the SVC of linear classifiers is always upper bounded by its standard VC-dimension. This characterization also strictly generalizes the AVC bound for linear classifiers in (Cullina et al., 2018). Finally, we briefly investigate the power of randomization in our strategic classification setup. We show that randomization may strictly increase the accuracy in general, but will not help in the special case of adversarial classification with zero-manipulation-cost.

分類器を騙すためのテストデータの戦略的または敵対的操作の研究は、最近多くの注目を集めています。これまでの研究のほとんどは、テストデータポイントが完全に敵対的であるか、常に肯定的なラベルを等しく好むという2つの極端な状況に焦点を当てていました。この論文では、異質な好みを持つ戦略的エージェントを考慮することにより、統一されたフレームワークを通じてこれら両方を一般化し、一般的な戦略的設定でPAC学習可能性を捉えるために戦略的VC次元（SVC）の概念を導入します。SVCは、Cullinaら（2018）によって導入された最近の敵対的VC次元（AVC）の概念を証明可能に一般化します。基本的な戦略的線形分類問題に対するフレームワークをインスタンス化します。（1）SVCを特定することにより、線形分類器の統計的学習可能性を完全に特徴付けます。（2）経験的リスク最小化問題の複雑さを特定することにより、計算上の扱いやすさを完全に特徴付けます。興味深いことに、線形分類器のSVCは常に標準のVC次元によって上限が決まります。この特性は、(Cullinaら, 2018)の線形分類器のAVC境界を厳密に一般化します。最後に、戦略的分類設定におけるランダム化の威力を簡単に調査します。ランダム化は一般に精度を厳密に向上させる可能性がありますが、操作コストがゼロの敵対的分類の特殊なケースでは役に立たないことを示します。

Insights into Ordinal Embedding Algorithms: A Systematic Evaluation
順序埋め込みアルゴリズムへの洞察:系統的評価

The objective of ordinal embedding is to find a Euclidean representation of a set of abstract items, using only answers to triplet comparisons of the form “Is item $i$ closer to item $j$ or item $k$?”. In recent years, numerous algorithms have been proposed to solve this problem. However, there does not exist a fair and thorough assessment of these embedding methods and therefore several key questions remain unanswered: Which algorithms perform better when the embedding dimension is constrained or few triplet comparisons are available? Which ones scale better with increasing sample size or dimension? In our paper, we address these questions and provide an extensive and systematic empirical evaluation of existing algorithms as well as a new neural network approach. We find that simple, relatively unknown, non-convex methods consistently outperform all other algorithms across a broad range of tasks including more recent and elaborate methods based on neural networks or landmark approaches. This finding can be explained by the insight that many of the non-convex optimization approaches do not suffer from local optima. Our comprehensive assessment is enabled by our unified library of popular embedding algorithms that leverages GPU resources and allows for fast and accurate embeddings of millions of data points.

順序埋め込みの目的は、「アイテム$i$はアイテム$j$とアイテム$k$のどちらに近いか」という形式の三つ組比較に対する回答のみを使用して、抽象的なアイテムの集合のユークリッド表現を見つけることです。近年、この問題を解決するためのアルゴリズムが数多く提案されています。しかし、これらの埋め込み手法に対する公正で徹底した評価は存在せず、そのため、いくつかの重要な問いが未回答のままです。埋め込み次元が制約されている場合、または利用可能な三つ組比較が少ない場合に、どのアルゴリズムがより良い性能を示すのか。サンプルサイズまたは次元の増加に対して、どのアルゴリズムがより良くスケールするのか。私たちの論文では、これらの問いに取り組み、既存のアルゴリズムと新しいニューラルネットワークアプローチについて、広範かつ体系的な経験的評価を提供します。比較的知られていない単純な非凸手法が、幅広いタスクにわたって、ニューラルネットワークやランドマークアプローチに基づくより新しく精巧な手法を含む他のすべてのアルゴリズムを一貫して上回ることがわかりました。この発見は、非凸最適化アプローチの多くが局所最適解に悩まされないという洞察によって説明できます。私たちの包括的な評価は、GPUリソースを活用し、数百万のデータポイントの高速かつ正確な埋め込みを可能にする、一般的な埋め込みアルゴリズムの統合ライブラリによって実現されています。
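
To make the triplet question concrete, the sketch below answers "is item $i$ closer to item $j$ or item $k$?" from a set of points and measures how many triplet answers a candidate embedding gets wrong. It does not implement any of the embedding algorithms evaluated in the paper; the function names and the triplet-error measure are assumptions chosen for illustration.

```python
import math


def closer_to_j(points, i, j, k):
    """Answer the triplet question: is item i closer to item j than to item k?"""
    return math.dist(points[i], points[j]) < math.dist(points[i], points[k])


def triplet_error(true_points, embedding, triplets):
    """Fraction of triplets (i, j, k) answered differently by the two point sets."""
    wrong = sum(
        closer_to_j(true_points, i, j, k) != closer_to_j(embedding, i, j, k)
        for i, j, k in triplets
    )
    return wrong / len(triplets)
```

Because only distance comparisons enter, an embedding that is a translated and uniformly scaled copy of the original points has zero triplet error, which is why an ordinal embedding can be recovered at best up to such transformations.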

Clustering with Tangles: Algorithmic Framework and Theoretical Guarantees
タングルによるクラスタリング:アルゴリズムフレームワークと理論的保証

Originally, tangles were invented as an abstract tool in mathematical graph theory to prove the famous graph minor theorem. In this paper, we showcase the practical potential of tangles in machine learning applications. Given a collection of cuts of any dataset, tangles aggregate these cuts to point in the direction of a dense structure. As a result, a cluster is softly characterized by a set of consistent pointers. This highly flexible approach can solve clustering problems in various setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. The output of our proposed framework is hierarchical and induces the notion of a soft dendrogram, which can help explore the cluster structure of a dataset. The computational complexity of aggregating the cuts is linear in the number of data points. Thus the bottleneck of the tangle approach is to generate the cuts, for which simple and fast algorithms form a sufficient basis. In our paper we construct the algorithmic framework for clustering with tangles, prove theoretical guarantees in various settings, and provide extensive simulations and use cases. Python code is available on github.

もともとタングルは、有名なグラフマイナー定理を証明するための、数学的グラフ理論における抽象的な道具として考案されました。この論文では、機械学習アプリケーションにおけるタングルの実用的な可能性を紹介します。任意のデータセットのカットの集まりが与えられると、タングルはこれらのカットを集約して密な構造の方向を指し示します。その結果、クラスターは一貫したポインターの集合によってソフトに特徴付けられます。この非常に柔軟なアプローチは、アンケートデータから、グラフ内のコミュニティ検出、距離空間内の点のクラスタリングまで、さまざまな設定でクラスタリングの問題を解決できます。提案するフレームワークの出力は階層的であり、データセットのクラスター構造の探索に役立つソフトデンドログラムの概念を導きます。カットを集約する計算量は、データポイントの数に対して線形です。したがって、タングルアプローチのボトルネックはカットの生成ですが、カットの生成にはシンプルで高速なアルゴリズムで十分です。私たちの論文では、タングルを使用したクラスタリングのアルゴリズムフレームワークを構築し、さまざまな設定で理論的な保証を証明し、広範なシミュレーションと使用例を示します。PythonコードはGitHubで入手できます。

Random Feature Neural Networks Learn Black-Scholes Type PDEs Without Curse of Dimensionality
ランダム特徴ニューラルネットワークは、次元の呪いなしにブラックショールズタイプの偏微分方程式を学習します

This article investigates the use of random feature neural networks for learning Kolmogorov partial (integro-)differential equations associated to Black-Scholes and more general exponential Lévy models. Random feature neural networks are single-hidden-layer feedforward neural networks in which the hidden weights are randomly generated and only the output weights are trainable. This makes training particularly simple, but (a priori) reduces expressivity. Interestingly, this is not the case for certain Black-Scholes type PDEs, as we show here. We derive bounds for the prediction error of random neural networks for learning sufficiently non-degenerate Black-Scholes type models. A full error analysis – bounding the approximation, generalization and optimization error of the algorithm – is provided and it is shown that the derived bounds do not suffer from the curse of dimensionality. We also investigate an application of these results to basket options and validate the bounds numerically. These results prove that neural networks are able to learn solutions to suitable Black-Scholes type PDEs without the curse of dimensionality. In addition, this provides an example of a relevant learning problem in which random feature neural networks are provably efficient.

この記事では、ブラック・ショールズモデルおよびより一般的な指数レヴィモデルに関連するコルモゴロフ偏(積分)微分方程式の学習に、ランダム特徴ニューラルネットワークを使用することを調査します。ランダム特徴ニューラルネットワークは、隠れ層の重みがランダムに生成され、出力層の重みのみが学習可能な、単一隠れ層のフィードフォワードニューラルネットワークです。これにより学習は特に簡単になりますが、(先験的には)表現力が低下します。興味深いことに、ここで示すように、特定のブラック・ショールズ型偏微分方程式ではそうなりません。十分に非退化なブラック・ショールズ型モデルを学習する際のランダムニューラルネットワークの予測誤差の上界を導出します。アルゴリズムの近似誤差、汎化誤差、最適化誤差を評価する完全な誤差解析を与え、導出された上界が次元の呪いの影響を受けないことを示します。また、これらの結果のバスケットオプションへの応用を調べ、上界を数値的に検証します。これらの結果は、ニューラルネットワークが次元の呪いなしに適切なブラック・ショールズ型偏微分方程式の解を学習できることを証明しています。さらに、これはランダム特徴ニューラルネットワークが証明可能な形で効率的である、重要な学習問題の一例を与えます。
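
The architecture described above (random, frozen hidden weights and trainable output weights) can be sketched in a few lines. The ReLU activation, Gaussian sampling of the hidden layer, ridge-regularised least squares, and every function name here are assumptions of this sketch rather than the paper's exact setup. Training reduces to a single linear solve because only the output weights are fitted.

```python
import math
import random


def make_random_features(input_dim, num_features, seed=0):
    """Draw hidden weights and biases once; they are never trained."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
               for _ in range(num_features)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
    return weights, biases


def hidden_layer(x, weights, biases):
    """ReLU activations of the fixed random hidden layer for one input x."""
    return [max(0.0, sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, biases)]


def solve_linear_system(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def fit_output_weights(xs, ys, weights, biases, ridge=1e-3):
    """Ridge regression on the random features: the only trained parameters."""
    phi = [hidden_layer(x, weights, biases) for x in xs]
    m = len(weights)
    gram = [[sum(row[i] * row[j] for row in phi) + (ridge if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(m)]
    return solve_linear_system(gram, rhs)


def predict(x, weights, biases, output_weights):
    """Network output: a linear combination of the fixed random features."""
    features = hidden_layer(x, weights, biases)
    return sum(o * h for o, h in zip(output_weights, features))
```

Inputs are tuples of floats, so `fit_output_weights([(0.0,), (0.5,), (1.0,)], [0.0, 1.0, 0.0], weights, biases)` fits a one-dimensional toy target after `weights, biases = make_random_features(1, 20)`.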

The Proximal ID Algorithm
近位IDアルゴリズム

Unobserved confounding is a fundamental obstacle to establishing valid causal conclusions from observational data. Two complementary types of approaches have been developed to address this obstacle: obtaining identification using fortuitous external aids, such as instrumental variables or proxies, or by means of the ID algorithm, using Markov restrictions on the full data distribution encoded in graphical causal models. In this paper we aim to develop a synthesis of the former and latter approaches to identification in causal inference to yield the most general identification algorithm in multivariate systems currently known — the proximal ID algorithm. In addition to being able to obtain nonparametric identification in all cases where the ID algorithm succeeds, our approach allows us to systematically exploit proxies to adjust for the presence of unobserved confounders that would have otherwise prevented identification. In addition, we outline a class of estimation strategies for causal parameters identified by our method in an important special case. We illustrate our approach by simulation studies and a data application.

観測されない交絡は、観測データから有効な因果的結論を確立する上での根本的な障害です。この障害に対処するために、2つの相補的なアプローチが開発されました。1つは、操作変数やプロキシなどの偶然の外部補助を使用して識別を取得する方法、もう1つは、グラフィカル因果モデルにエンコードされた完全なデータ分布に対するマルコフ制約を使用するIDアルゴリズムによる方法です。この論文では、因果推論における識別に対する前者と後者のアプローチを統合し、現在知られている多変量システムで最も一般的な識別アルゴリズムである近位IDアルゴリズムを作成することを目指しています。IDアルゴリズムが成功するすべてのケースでノンパラメトリック識別を取得できることに加えて、このアプローチでは、プロキシを体系的に利用して、そうでなければ識別を妨げていたであろう観測されない交絡因子の存在を調整できます。さらに、重要な特殊なケースでこの方法によって識別された因果パラメータの推定戦略のクラスを概説します。シミュレーション研究とデータアプリケーションによってこのアプローチを説明します。

Quantifying Network Similarity using Graph Cumulants
グラフキュムラントを使用したネットワークの類似性の定量化

How might one test the hypothesis that networks were sampled from the same distribution? Here, we compare two statistical tests that use subgraph counts to address this question. The first uses the empirical subgraph densities themselves as estimates of those of the underlying distribution. The second test uses a new approach that converts these subgraph densities into estimates of the graph cumulants of the distribution (without any increase in computational complexity). We demonstrate — via theory, simulation, and application to real data — the superior statistical power of using graph cumulants. In summary, when analyzing data using subgraph/motif densities, we suggest using the corresponding graph cumulants instead.

ネットワークが同じ分布からサンプリングされたという仮説を、どのように検定できるでしょうか。ここでは、この問題に対処するためにサブグラフカウントを使用する2つの統計的検定を比較します。1つ目の検定は、経験的サブグラフ密度そのものを、基礎となる分布のサブグラフ密度の推定値として使用します。2つ目の検定は、これらのサブグラフ密度を分布のグラフキュムラントの推定値に変換する新しいアプローチを使用します(計算量は増加しません)。理論、シミュレーション、および実データへの適用を通じて、グラフキュムラントを使用する方が統計的検出力に優れることを実証します。要約すると、サブグラフ/モチーフ密度を使用してデータを分析する場合は、代わりに対応するグラフキュムラントを使用することをお勧めします。

Learning an Explicit Hyper-parameter Prediction Function Conditioned on Tasks
タスクに条件付けられた陽的ハイパーパラメータ予測関数の学習

Meta learning has attracted much attention recently in machine learning community. Contrary to conventional machine learning aiming to learn inherent prediction rules to predict labels for new query data, meta learning aims to learn the learning methodology for machine learning from observed tasks, so as to generalize to new query tasks by leveraging the meta-learned learning methodology. In this study, we achieve such learning methodology by learning an explicit hyper-parameter prediction function shared by all training tasks, and we call this learning process as Simulating Learning Methodology (SLeM). Specifically, this function is represented as a parameterized function called meta-learner, mapping from a training/test task to its suitable hyper-parameter setting, extracted from a pre-specified function set called meta learning machine. Such setting guarantees that the meta-learned learning methodology is able to flexibly fit diverse query tasks, instead of only obtaining fixed hyper-parameters by many current meta learning methods, with less adaptability to query task’s variations. Such understanding of meta learning also makes it easily succeed from traditional learning theory for analyzing its generalization bounds with general losses/tasks/models. The theory naturally leads to some feasible controlling strategies for ameliorating the quality of the extracted meta-learner, verified to be able to finely ameliorate its generalization capability in some typical meta learning applications, including few-shot regression, few-shot classification and domain generalization. The source code of our method is released at https://github.com/xjtushujun/SLeM-Theory.

メタ学習は、最近、機械学習コミュニティで大きな注目を集めています。従来の機械学習が、新しいクエリデータのラベルを予測するための固有の予測ルールを学習することを目的としているのに対し、メタ学習は、観測されたタスクから機械学習のための学習方法論を学習し、メタ学習された学習方法論を活用して新しいクエリタスクに汎化することを目的としています。この研究では、すべてのトレーニングタスクで共有される明示的なハイパーパラメータ予測関数を学習することでこのような学習方法論を実現し、この学習プロセスをSimulating Learning Methodology(SLeM)と呼びます。具体的には、この関数は、トレーニング/テストタスクをその適切なハイパーパラメータ設定に写像する、メタ学習器と呼ばれるパラメータ化された関数として表され、メタ学習マシンと呼ばれる事前に指定された関数集合から抽出されます。このような設定により、メタ学習された学習方法論は、さまざまなクエリタスクに柔軟に適合できることが保証されます。これは、固定のハイパーパラメータしか得られず、クエリタスクの変動への適応性が低い多くの現在のメタ学習手法とは対照的です。メタ学習をこのように理解することで、一般的な損失/タスク/モデルに対する汎化境界の解析に、従来の学習理論を容易に引き継ぐこともできます。この理論からは、抽出されるメタ学習器の品質を改善するための実行可能な制御戦略が自然に導かれ、少数ショット回帰、少数ショット分類、ドメイン汎化などの典型的なメタ学習アプリケーションで、その汎化能力を細かく改善できることが検証されています。私たちの方法のソースコードは、https://github.com/xjtushujun/SLeM-Theoryで公開されています。

On the Theoretical Equivalence of Several Trade-Off Curves Assessing Statistical Proximity
統計的近接性を評価するいくつかのトレードオフ曲線の理論的同等性について

The recent advent of powerful generative models has triggered the renewed development of quantitative measures to assess the proximity of two probability distributions. As the scalar Frechet Inception Distance remains popular, several methods have explored computing entire curves, which reveal the trade-off between the fidelity and variability of the first distribution with respect to the second one. Several of such variants have been proposed independently and while intuitively similar, their relationship has not yet been made explicit. In an effort to make the emerging picture of generative evaluation more clear, we propose a unification of four curves known respectively as: the Precision-Recall (PR) curve, the Lorenz curve, the Receiver Operating Characteristic (ROC) curve and a special case of Rényi divergence frontiers. In addition, we discuss possible links between PR / Lorenz curves with the derivation of domain adaptation bounds.

近年の強力な生成モデルの出現により、2つの確率分布の近さを評価するための定量的尺度の開発が再び活発になっています。スカラー値のFrechet Inception Distanceが依然として人気を保つ一方で、いくつかの手法は曲線全体の計算を検討しており、これにより、2番目の分布に対する最初の分布の忠実度と変動性の間のトレードオフが明らかになります。このような変種のいくつかは独立に提案されており、直感的には似ていますが、それらの関係はまだ明確にされていません。生成モデル評価について形成されつつある全体像をより明確にするために、適合率-再現率(PR)曲線、ローレンツ曲線、受信者動作特性(ROC)曲線、およびRényiダイバージェンスフロンティアの特殊ケースとしてそれぞれ知られている4つの曲線の統一を提案します。さらに、PR/ローレンツ曲線とドメイン適応境界の導出との間に考えられる関連についても議論します。

Metrizing Weak Convergence with Maximum Mean Discrepancies
最大平均不一致による弱収束の距離化

This paper characterizes the maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures for a wide class of kernels. More precisely, we prove that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel $k$, whose RKHS-functions vanish at infinity (i.e., $H_k \subset C_0$), metrizes the weak convergence of probability measures if and only if $k$ is continuous and integrally strictly positive definite ($\int$s.p.d.) over all signed, finite, regular Borel measures. We also correct a prior result of Simon-Gabriel and Schölkopf (JMLR 2018, Thm. 12) by showing that there exist both bounded continuous $\int$s.p.d. kernels that do not metrize weak convergence and bounded continuous non-$\int$s.p.d. kernels that do metrize it.

この論文では、幅広い種類のカーネルについて、確率測度の弱収束を距離化する最大平均不一致(MMD)を特徴付けます。より正確には、局所コンパクトかつ非コンパクトなハウスドルフ空間上で、RKHS関数が無限遠で消える(つまり、$H_k \subset C_0$である)有界連続ボレル可測カーネル$k$のMMDが確率測度の弱収束を距離化するのは、$k$が連続で、かつすべての符号付き有限正則ボレル測度に対して積分的に厳密に正定値($\int$s.p.d.)である場合であり、かつその場合に限ることを証明します。また、Simon-GabrielとSchölkopf (JMLR 2018, Thm. 12)の以前の結果を修正し、弱収束を距離化しない有界連続$\int$s.p.d.カーネルと、弱収束を距離化する有界連続非$\int$s.p.d.カーネルの両方が存在することを示します。

Quasi-Equivalence between Width and Depth of Neural Networks
ニューラルネットワークの幅と深さの間の準等価性

While classic studies proved that wide networks allow universal approximation, recent research and successes of deep learning demonstrate the power of deep networks. Based on a symmetric consideration, we investigate if the design of artificial neural networks should have a directional preference, and what the mechanism of interaction is between the width and depth of a network. Inspired by the De Morgan law, we address this fundamental question by establishing a quasi-equivalence between the width and depth of ReLU networks. We formulate two transforms for mapping an arbitrary ReLU network to a wide ReLU network and a deep ReLU network respectively, so that the essentially same capability of the original network can be implemented. Based on our findings, a deep network has a wide equivalent, and vice versa, subject to an arbitrarily small error.

古典的な研究では、幅の広いネットワークが普遍近似を可能にすることが証明されていますが、最近の研究とディープラーニングの成功は、深いネットワークの力を示しています。対称性の観点から、人工ニューラルネットワークの設計が方向的な優先性を持つべきかどうか、またネットワークの幅と深さの間の相互作用のメカニズムは何かを調査します。ド・モルガンの法則に着想を得て、ReLUネットワークの幅と深さの間に準等価性を確立することにより、この基本的な問題に取り組みます。任意のReLUネットワークを幅の広いReLUネットワークと深いReLUネットワークにそれぞれ写像する2つの変換を定式化し、元のネットワークと本質的に同じ能力を実現できるようにします。私たちの結果によれば、任意に小さい誤差の範囲内で、深いネットワークには等価な幅の広いネットワークが存在し、その逆も成り立ちます。

Naive regression requires weaker assumptions than factor models to adjust for multiple cause confounding
単純回帰では、多重原因の交絡を調整するために、因子モデルよりも弱い仮定が必要です

The empirical practice of using factor models to adjust for shared, unobserved confounders, $\boldsymbol{Z}$, in observational settings with multiple treatments, $\boldsymbol{A}$, is widespread in fields including genetics, networks, medicine, and politics. Wang and Blei (2019, WB) generalize these procedures to develop the “deconfounder,” a causal inference method using factor models of $\boldsymbol{A}$ to estimate “substitute confounders,” $\widehat{\boldsymbol{Z}}$, then estimating treatment effects—regressing the outcome, $\boldsymbol{Y}$, on part of $\boldsymbol{A}$ while adjusting for $\widehat{\boldsymbol{Z}}$. WB claim the deconfounder is unbiased when (among other assumptions) there are no single-cause confounders and $\widehat{\boldsymbol{Z}}$ is “pinpointed.” We clarify pinpointing requires each confounder to affect infinitely many treatments. We prove that when the conditions hold for the deconfounder to be asymptotically unbiased, a naive semiparametric regression of $\boldsymbol{Y}$ on $\boldsymbol{A}$ which ignores confounding is also asymptotically unbiased. We provide bias formulas for finite numbers of treatments and show that different deconfounders exhibit different kinds of bias. We replicate every deconfounder analysis with available data and find that neither the naive regression nor the deconfounder consistently outperform the other. In practice, the deconfounder produces implausible estimates in WB’s case study of movie earnings: estimates suggest comic author Stan Lee’s cameo appearances causally contributed $15.5 billion, most of Marvel movie revenue. We conclude neither approach is a viable substitute for careful research design in real-world applications.

複数の処置$\boldsymbol{A}$を伴う観察研究の設定において、因子モデルを使用して共通の観測されない交絡因子$\boldsymbol{Z}$を調整する経験的慣行は、遺伝学、ネットワーク、医学、政治などの分野で広く行われています。WangとBlei (2019, WB)は、これらの手順を一般化して「deconfounder(交絡除去法)」を開発しました。これは、$\boldsymbol{A}$の因子モデルを使用して「代替交絡因子」$\widehat{\boldsymbol{Z}}$を推定し、次に処置効果を推定する、すなわち$\widehat{\boldsymbol{Z}}$を調整しながら結果$\boldsymbol{Y}$を$\boldsymbol{A}$の一部に回帰させる因果推論手法です。WBは、（他の仮定に加えて）単一原因の交絡因子が存在せず、$\widehat{\boldsymbol{Z}}$が「ピンポイントに特定」されている場合、交絡除去法は不偏であると主張しています。私たちは、この特定には各交絡因子が無限に多くの処置に影響を与える必要があることを明らかにします。交絡除去法が漸近的に不偏となる条件が満たされている場合、交絡を無視した$\boldsymbol{Y}$の$\boldsymbol{A}$に対するナイーブなセミパラメトリック回帰も漸近的に不偏であることを証明します。有限個の処置に対するバイアスの公式を与え、異なる交絡除去法が異なる種類のバイアスを示すことを示します。データが利用可能なすべての交絡除去法による分析を再現し、ナイーブ回帰と交絡除去法のどちらも一貫して他方を上回るわけではないことがわかりました。実際には、交絡除去法は、WBの映画収益のケーススタディで信じ難い推定値を生み出します。推定値によると、コミック原作者スタン・リーのカメオ出演は、マーベル映画の収益の大部分にあたる155億ドルに因果的に寄与したことになります。どちらのアプローチも、現実世界の応用において慎重な研究デザインの有効な代替にはならないと結論付けます。

Factor Graph Neural Networks
因子グラフニューラルネットワーク

In recent years, we have witnessed a surge of Graph Neural Networks (GNNs), most of which can learn powerful representations in an end-to-end fashion with great success in many real-world applications. They have resemblance to Probabilistic Graphical Models (PGMs), but break free from some limitations of PGMs. By aiming to provide expressive methods for representation learning instead of computing marginals or most likely configurations, GNNs provide flexibility in the choice of information flowing rules while maintaining good performance. Despite their success and inspirations, they lack efficient ways to represent and learn higher-order relations among variables/nodes. More expressive higher-order GNNs which operate on k-tuples of nodes need increased computational resources in order to process higher-order tensors. We propose Factor Graph Neural Networks (FGNNs) to effectively capture higher-order relations for inference and learning. To do so, we first derive an efficient approximate Sum-Product loopy belief propagation inference algorithm for discrete higher-order PGMs. We then neuralize the novel message passing scheme into a Factor Graph Neural Network (FGNN) module by allowing richer representations of the message update rules; this facilitates both efficient inference and powerful end-to-end learning. We further show that with a suitable choice of message aggregation operators, our FGNN is also able to represent Max-Product belief propagation, providing a single family of architecture that can represent both Max and Sum-Product loopy belief propagation. Our extensive experimental evaluation on synthetic as well as real datasets demonstrates the potential of the proposed model.

近年、グラフニューラルネットワーク(GNN)が急増しており、そのほとんどはエンドツーエンドで強力な表現を学習でき、多くの実際のアプリケーションで大きな成功を収めています。これらは確率的グラフィカルモデル(PGM)に似ていますが、PGMのいくつかの制限から解放されています。周辺または最も可能性の高い構成を計算する代わりに表現学習のための表現方法を提供することを目的としたGNNは、優れたパフォーマンスを維持しながら、情報フロールールの選択に柔軟性を提供します。これらの成功とインスピレーションにもかかわらず、変数/ノード間の高次の関係を表現および学習する効率的な方法が欠けています。kタプルのノードで動作する、より表現力の高い高次のGNNでは、高次のテンソルを処理するために、計算リソースを増やす必要があります。推論と学習のために高次の関係を効果的にキャプチャするために、ファクターグラフニューラルネットワーク(FGNN)を提案します。そのために、まず、離散高次PGM用の効率的な近似Sum-Productループビリーフプロパゲーション推論アルゴリズムを導出します。次に、メッセージ更新ルールのより豊富な表現を可能にすることで、新しいメッセージパッシングスキームをファクターグラフニューラルネットワーク(FGNN)モジュールにニューラル化します。これにより、効率的な推論と強力なエンドツーエンドの学習の両方が可能になります。さらに、メッセージ集約演算子を適切に選択することで、FGNNはMax-Productのビリーフプロパゲーションも表現でき、MaxとSum-Productのループビリーフプロパゲーションの両方を表現できる単一のアーキテクチャファミリを提供できることを示しています。合成データセットと実際のデータセットでの広範な実験評価により、提案モデルの可能性が実証されています。

Dropout Training is Distributionally Robust Optimal
ドロップアウト学習は分布的ロバスト最適である

This paper shows that dropout training in generalized linear models is the minimax solution of a two-player, zero-sum game where an adversarial nature corrupts a statistician’s covariates using a multiplicative nonparametric errors-in-variables model. In this game, nature’s least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability $\delta$. This result implies that dropout training indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data. The paper makes a concrete recommendation on how to select the tuning parameter $\delta$. The paper also provides a novel, parallelizable, unbiased multi-level Monte Carlo algorithm to speed-up the implementation of dropout training. Our algorithm has a much smaller computational cost compared to the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector.

この論文では、一般化線形モデルにおけるドロップアウト学習が、敵対的な「自然」が乗法的なノンパラメトリック変数誤差(errors-in-variables)モデルを使用して統計家の共変量を汚染する、2人ゼロ和ゲームのミニマックス解であることを示します。このゲームにおいて、自然が選ぶ最も不利な分布(least favorable distribution)はドロップアウトノイズであり、自然は共変量ベクトルの各成分をある固定確率$\delta$で独立に削除します。この結果は、ドロップアウト学習が、サンプル内データの乗法的な摂動から生じる分布に対して、実際にサンプル外の期待損失の保証を与えることを意味します。この論文では、チューニングパラメータ$\delta$の選択方法について具体的な推奨を示します。また、ドロップアウト学習の実装を高速化するための、並列化可能で不偏な新しいマルチレベルモンテカルロアルゴリズムも提供します。データポイントの数が共変量ベクトルの次元よりもはるかに小さい場合、このアルゴリズムの計算コストは、ドロップアウトの素朴な実装に比べてはるかに小さくなります。
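
The dropout noise described above, in which each covariate entry is deleted independently with probability $\delta$, can be sketched as follows. The rescaling of surviving entries by $1/(1-\delta)$, the squared loss, and the function names are assumptions of this sketch. The plain Monte Carlo average shown here is the naive implementation, not the paper's multi-level Monte Carlo algorithm.

```python
import random


def dropout_noise(covariates, delta, rng):
    """Delete each entry independently with probability delta.

    Surviving entries are rescaled by 1 / (1 - delta) so that the corrupted
    vector equals the original in expectation (an assumption of this sketch).
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    keep_scale = 1.0 / (1.0 - delta)
    return [0.0 if rng.random() < delta else x * keep_scale for x in covariates]


def dropout_squared_loss(covariates, target, coefficients, delta,
                         num_draws=1000, seed=0):
    """Naive Monte Carlo estimate of the expected squared loss under dropout."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        noisy = dropout_noise(covariates, delta, rng)
        prediction = sum(b * x for b, x in zip(coefficients, noisy))
        total += (target - prediction) ** 2
    return total / num_draws
```

Minimising this expected loss over `coefficients` is what dropout training amounts to for a linear model with squared loss; with `delta=0.0` it reduces to the ordinary squared loss.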

Variational Inference for Deblending Crowded Starfields
混雑した星空のデブレンディングのための変分推論

In images collected by astronomical surveys, stars and galaxies often overlap visually. Deblending is the task of distinguishing and characterizing individual light sources in survey images. We propose StarNet, a Bayesian method to deblend sources in astronomical images of crowded star fields. StarNet leverages recent advances in variational inference, including amortized variational distributions and an optimization objective targeting an expectation of the forward KL divergence. In our experiments with SDSS images of the M2 globular cluster, StarNet is substantially more accurate than two competing methods: Probabilistic Cataloging (PCAT), a method that uses MCMC for inference, and DAOPHOT, a software pipeline employed by SDSS for deblending. In addition, the amortized approach to inference gives StarNet the scaling characteristics necessary to perform Bayesian inference on modern astronomical surveys.

天文調査で収集した画像では、星や銀河が視覚的に重なっていることがよくあります。デブレンディングは、調査画像内の個々の光源を区別し、特徴付けるタスクです。私たちは、混雑した星空の天文画像からソースをデブレンドするベイズ法であるStarNetを提案します。StarNetは、償却された変分分布や、前方KL発散の期待値を対象とする最適化目標など、変分推論の最近の進歩を活用しています。M2球状星団のSDSS画像を用いた実験では、StarNetは、MCMCを推論に使用するProbabilistic Cataloging(PCAT)と、SDSSがデブレンディングに採用するソフトウェアパイプラインであるDAOPHOTという2つの競合する方法よりも大幅に精度が高いことがわかりました。さらに、推論に対する償却アプローチにより、StarNetは現代の天文調査でベイズ推論を実行するために必要なスケーリング特性を得ることができます。

F2A2: Flexible Fully-decentralized Approximate Actor-critic for Cooperative Multi-agent Reinforcement Learning
F2A2: 協調的マルチエージェント強化学習のための柔軟な完全分散型近似アクター・クリティック

Traditional centralized multi-agent reinforcement learning (MARL) algorithms are sometimes unpractical in complicated applications due to non-interactivity between agents, the curse of dimensionality, and computation complexity. Hence, several decentralized MARL algorithms are motivated. However, existing decentralized methods only handle the fully cooperative setting where massive information needs to be transmitted in training. The block coordinate gradient descent scheme they used for successive independent actor and critic steps can simplify the calculation, but it causes serious bias. This paper proposes a flexible fully decentralized actor-critic MARL framework, which can combine most of the actor-critic methods and handle large-scale general cooperative multi-agent settings. A primal-dual hybrid gradient descent type algorithm framework is designed to learn individual agents separately for decentralization. From the perspective of each agent, policy improvement and value evaluation are jointly optimized, which can stabilize multi-agent policy learning. Furthermore, the proposed framework can achieve scalability and stability for the large-scale environment. This framework also reduces information transmission by the parameter sharing mechanism and novel modeling-other-agents methods based on theory-of-mind and online supervised learning. Sufficient experiments in cooperative Multi-agent Particle Environment and StarCraft II show that the proposed decentralized MARL instantiation algorithms perform competitively against conventional centralized and decentralized methods.

従来の集中型マルチエージェント強化学習（MARL）アルゴリズムは、エージェント間の非対話性、次元の呪い、計算の複雑さのため、複雑なアプリケーションでは実用的でない場合があります。そのため、いくつかの分散型MARLアルゴリズムが提案されています。しかし、既存の分散型手法は、学習時に大量の情報を伝送する必要がある完全協調設定しか扱えません。それらがアクターとクリティックを独立に順次更新するステップに用いるブロック座標勾配降下スキームは、計算を簡素化できますが、重大なバイアスを引き起こします。この論文では、ほとんどのアクター・クリティック手法と組み合わせることができ、大規模な一般の協調型マルチエージェント設定を扱える、柔軟な完全分散型アクター・クリティックMARLフレームワークを提案します。分散化のために個々のエージェントを個別に学習できるよう、主双対ハイブリッド勾配降下型のアルゴリズムフレームワークを設計します。各エージェントの観点からは、方策改善と価値評価が同時に最適化され、これによりマルチエージェントの方策学習を安定させることができます。さらに、提案するフレームワークは、大規模環境に対してスケーラビリティと安定性を実現できます。このフレームワークはまた、パラメータ共有メカニズムと、心の理論およびオンライン教師あり学習に基づく新しい他エージェントモデリング手法によって、情報伝送を削減します。協調型のMulti-agent Particle EnvironmentとStarCraft IIでの十分な実験により、提案する分散型MARLの具体化アルゴリズムが、従来の集中型および分散型の手法に対して競争力のある性能を発揮することが示されています。

Comprehensive Algorithm Portfolio Evaluation using Item Response Theory
項目応答理論を用いた包括的なアルゴリズムポートフォリオ評価

Item Response Theory (IRT) has been proposed within the field of Educational Psychometrics to assess student ability as well as test question difficulty and discrimination power. More recently, IRT has been applied to evaluate machine learning algorithm performance on a single classification dataset, where the student is now an algorithm, and the test question is an observation to be classified by the algorithm. In this paper we present a modified IRT-based framework for evaluating a portfolio of algorithms across a repository of datasets, while simultaneously eliciting a richer suite of characteristics – such as algorithm consistency and anomalousness – that describe important aspects of algorithm performance. These characteristics arise from a novel inversion and reinterpretation of the traditional IRT model without requiring additional dataset feature computations. We test this framework on algorithm portfolios for a wide range of applications, demonstrating the broad applicability of this method as an insightful algorithm evaluation tool. Furthermore, the explainable nature of IRT parameters yield an increased understanding of algorithm portfolios.

項目反応理論(IRT)は、教育心理測定学の分野で、学生の能力、テスト問題の難易度、識別力を評価することを提案してきました。最近では、IRTは単一の分類データセットに対する機械学習アルゴリズムのパフォーマンスを評価するために適用されています。ここでは、学生がアルゴリズムとなり、テスト問題はアルゴリズムによって分類される観察となります。この論文では、データセットのリポジトリ全体でアルゴリズムのポートフォリオを評価するための、修正されたIRTベースのフレームワークを提示します。同時に、アルゴリズムの一貫性や異常性など、アルゴリズムのパフォーマンスの重要な側面を説明するより豊富な特性セットを引き出します。これらの特性は、追加のデータセット機能計算を必要とせずに、従来のIRTモデルを新たに反転および再解釈することで生じます。このフレームワークを幅広いアプリケーションのアルゴリズムポートフォリオでテストし、この方法が洞察に富んだアルゴリズム評価ツールとして幅広く適用できることを実証しました。さらに、IRTパラメーターの説明可能な性質により、アルゴリズムポートフォリオの理解が深まります。
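As a rough illustration of the traditional IRT model that the abstract starts from, the sketch below evaluates the standard two-parameter logistic (2PL) item response function, with an algorithm playing the role of the student. The function name and the example parameter values are assumptions made here for illustration; the paper's modified, inverted framework is not reproduced.

```python
import math


def two_pl_probability(ability: float, difficulty: float, discrimination: float) -> float:
    """Probability of a correct response under the standard 2PL IRT model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical values: an "algorithm as student" facing one "observation as test question".
p_equal = two_pl_probability(ability=0.0, difficulty=0.0, discrimination=1.5)
p_strong = two_pl_probability(ability=2.0, difficulty=0.0, discrimination=1.5)
p_weak = two_pl_probability(ability=-3.0, difficulty=0.0, discrimination=1.5)
print(p_equal, p_strong, p_weak)
```

When ability equals difficulty the probability is exactly one half, and a larger discrimination value makes the curve steeper around that point.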

Evaluating Instrument Validity using the Principle of Independent Mechanisms
独立メカニズムの原理を用いた操作変数の妥当性の評価

The validity of instrumental variables to estimate causal effects is typically justified narratively and often remains controversial. Critical assumptions are difficult to evaluate since they involve unobserved variables. Building on Janzing and Schoelkopf’s (2018) method to quantify a degree of confounding in multivariate linear models, we develop a test that evaluates instrument validity without relying on Balke and Pearl’s (1997) inequality constraints. Instead, our approach is based on the Principle of Independent Mechanisms, which states that causal models have a modular structure. Monte Carlo studies show a high accuracy of the procedure. We apply our method to two empirical studies: first, we can corroborate the narrative justification given by Card (1995) for the validity of college proximity as an instrument for educational attainment in his work on the financial returns to education. Second, we cannot reject the validity of past savings rates as an instrument for economic development to estimate its causal effect on democracy (Acemoglu et al, 2008).

因果効果を推定するための操作変数の妥当性は、通常、物語的に正当化され、しばしば議論の的となっています。重要な仮定は、観測されない変数を含むため、評価が困難です。多変量線型モデルにおける交絡の程度を定量化するJanzingとSchoelkopf (2018)の方法に基づいて、BalkeとPearl (1997)の不等式制約に依存せずに、道具変数の妥当性を評価するテストを開発した。代わりに、私たちのアプローチは、因果モデルがモジュール構造を持つという独立メカニズムの原理に基づいています。モンテカルロ研究では、この手順の高い精度が示されています。私たちは、この方法を2つの実証研究に適用した。まず、教育の経済的リターンに関する研究で、大学への近さが教育達成の手段として妥当であるとしてCard (1995)が示した物語的正当性を裏付けることができます。第二に、経済発展が民主主義に及ぼす因果的影響を推定するための手段としての過去の貯蓄率の妥当性を否定することはできない（Acemogluら、2008）。

Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity
最適に近いサンプル複雑度を持つゼロサムマルコフゲームにおけるモデルベースマルチエージェントRL

Model-based reinforcement learning (RL), which finds an optimal policy after establishing an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, we aim to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S$, $A$, $B$ denote the state space and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, where the sample complexity lower bound is $\tilde{\Omega}(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$, and this model-based approach is near-optimal with only a gap on the $|A|$, $|B|$ dependence. Our results not only illustrate the sample-efficiency of this basic model-based MARL approach, but also elaborate on the fundamental tradeoff between its power (easily handling the reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|$, $|B|$), which particularly arises in the multi-agent context.

モデルベースの強化学習(RL)は、経験的モデルを確立した後に最適なポリシーを見つけるもので、長い間RLの基礎の1つとして認識されてきました。これは、学習フェーズと計画フェーズを自然に分離し、すべてのエージェントが同時にポリシーを改善するときに非定常性の問題を回避するため、マルチエージェントRL (MARL)に特に適しています。直感的で広く使用されているにもかかわらず、モデルベースのMARLアルゴリズムのサンプル複雑性は十分に調査されていません。この論文では、そのサンプル複雑性に関する基本的な疑問に取り組むことを目的としています。おそらく最も基本的なMARL設定、つまり生成モデルへのアクセスのみが与えられた2人のプレーヤーの割引ゼロ和マルコフゲームを研究します。モデルベースのMARLは、ある$\epsilon$誤差までのナッシュ均衡(NE)値と、スムーズプランニングオラクルによる$\epsilon$-NEポリシーを見つけるためのサンプル複雑度が$\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$になることを示します。ここで、$\gamma$は割引率、$S$、$A$、$B$は状態空間と2つのエージェントのアクション空間を表します。さらに、アルゴリズムが報酬に依存しない場合、一致する下限を確立することで、このようなサンプル境界がミニマックス最適(対数係数まで)であることを示します。これは、サンプル複雑度の下限が$\tilde{\Omega}(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$である通常の報酬を考慮した設定とは対照的であり、このモデルベースのアプローチは、$|A|$、$|B|$依存性にのみギャップがあるほぼ最適です。私たちの結果は、この基本的なモデルベースのMARLアプローチのサンプル効率を示すだけでなく、そのパワー(報酬に依存しないケースを簡単に処理できる)と制限($|A|$、$|B|$で適応性が低く、最適ではない)の間の基本的なトレードオフについても詳しく説明しています。これは、特にマルチエージェントのコンテキストで発生します。

Posterior Consistency for Bayesian Relevance Vector Machines
ベイズ関連性ベクトルマシンの事後一貫性

Statistical modeling and inference problems with sample sizes substantially smaller than the number of available covariates are challenging. Chakraborty et al. (2012) did a full hierarchical Bayesian analysis of nonlinear regression in such situations using relevance vector machines based on reproducing kernel Hilbert space (RKHS). But they did not provide any theoretical properties associated with their procedure. The present paper revisits their problem, introduces a new class of global-local priors different from theirs, and provides results on posterior consistency as well as on posterior contraction rates.

サンプル・サイズが使用可能な共変量の数よりも大幅に小さい統計モデリングと推論の問題は困難です。Chakrabortyら(2012)は、再現カーネルヒルベルト空間(RKHS)に基づく関連性ベクトルマシンを使用して、このような状況での非線形回帰の完全な階層的ベイズ分析を行いました。しかし、彼らはその手順に関連する理論的な特性を提供しませんでした。この論文では、彼らの問題を再検討し、彼らのものとは異なる新しいクラスのグローバルローカル事前確率を導入し、事後一貫性と事後収縮率に関する結果を提供します。

From Classification Accuracy to Proper Scoring Rules: Elicitability of Probabilistic Top List Predictions
分類精度から適切なスコアリングルールまで:確率的トップリスト予測の誘発性

In the face of uncertainty, the need for probabilistic assessments has long been recognized in the literature on forecasting. In classification, however, comparative evaluation of classifiers often focuses on predictions specifying a single class through the use of simple accuracy measures, which disregard any probabilistic uncertainty quantification. I propose probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions. The probabilistic top list functional is elicitable through the use of strictly consistent evaluation metrics. The proposed evaluation metrics are based on symmetric proper scoring rules and admit comparison of various types of predictions ranging from single-class point predictions to fully specified predictive distributions. The Brier score yields a metric that is particularly well suited for this kind of comparison.

不確実性に直面して、確率的評価の必要性は、予測に関する文献で長い間認識されてきました。ただし、分類器の比較評価では、多くの場合、確率的な不確実性の定量化を無視する単純な精度測定を使用して1つのクラスを指定する予測に焦点を当てます。私は、単一クラスの予測と予測分布の間のギャップを埋める、分類における新しいタイプの予測として、確率的トップリストを提案します。確率的なトップリストファンクショナルは、厳密に一貫性のある評価メトリックを使用することで引き出すことができます。提案された評価メトリックは、対称的な適切なスコアリングルールに基づいており、単一クラスのポイント予測から完全に指定された予測分布まで、さまざまなタイプの予測の比較が可能です。Brierスコアは、この種の比較に特に適したメトリックを生成します。
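To make the contrast between single-class predictions and predictive distributions concrete, here is a minimal sketch of the standard multi-class Brier score. It treats a hard single-class prediction as a one-hot distribution; this is an assumption for illustration only and is not the paper's top-list metric.

```python
def brier_score(probs, true_class):
    """Multi-class Brier score: squared distance between the forecast and the one-hot outcome."""
    return sum(
        (p - (1.0 if c == true_class else 0.0)) ** 2
        for c, p in enumerate(probs)
    )


soft = [0.7, 0.2, 0.1]   # a full predictive distribution over three classes
hard = [1.0, 0.0, 0.0]   # a single-class prediction written as a one-hot vector

soft_right = brier_score(soft, true_class=0)
soft_wrong = brier_score(soft, true_class=1)
hard_right = brier_score(hard, true_class=0)
hard_wrong = brier_score(hard, true_class=1)
print(soft_right, soft_wrong, hard_right, hard_wrong)
```

The hard prediction scores best when it is right and worst when it is wrong, while the hedged distribution lands in between in both cases.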

Beyond the Golden Ratio for Variational Inequality Algorithms
変分不等式アルゴリズムの黄金比を超えて

We improve the understanding of the golden ratio algorithm, which solves monotone variational inequalities (VI) and convex-concave min-max problems via the distinctive feature of adapting the step sizes to the local Lipschitz constants. Adaptive step sizes not only eliminate the need to pick hyperparameters, but they also remove the necessity of global Lipschitz continuity and can increase from one iteration to the next. We first establish the equivalence of this algorithm with popular VI methods such as reflected gradient, Popov or optimistic gradient descent-ascent (OGDA) in the unconstrained case with constant step sizes. We then move on to the constrained setting and introduce a new analysis that allows the use of larger step sizes, to complete the bridge between the golden ratio algorithm and the existing algorithms in the literature. Doing so, we actually eliminate the link between the golden ratio $\frac{1+\sqrt{5}}{2}$ and the algorithm. Moreover, we improve the adaptive version of the algorithm, first by removing the maximum step size hyperparameter (an artifact from the analysis), and secondly, by adjusting it to nonmonotone problems with weak Minty solutions, with superior empirical performance.

私たちは、ステップサイズをローカルLipschitz定数に適応させるという特徴的な機能によって、単調変分不等式(VI)と凸凹最小最大問題を解決する黄金比アルゴリズムの理解を深めます。適応ステップサイズは、ハイパーパラメータを選択する必要性をなくすだけでなく、グローバルLipschitz連続性の必要性も排除し、反復ごとに増やすことができます。まず、一定のステップサイズで制約のないケースで、このアルゴリズムが、反射勾配法、ポポフ法、楽観的勾配降下上昇法(OGDA)などの一般的なVI法と同等であることを確認します。次に、制約付き設定に移り、より大きなステップサイズを使用できる新しい分析を導入して、黄金比アルゴリズムと文献の既存のアルゴリズムとの橋渡しを完了します。そうすることで、黄金比$\frac{1+\sqrt{5}}{2}$とアルゴリズムの間のリンクが実際に排除されます。さらに、まず最大ステップサイズのハイパーパラメータ(分析からのアーティファクト)を削除し、次に弱いMintyソリューションを持つ非単調な問題に調整することで、アルゴリズムの適応バージョンを改善し、優れた経験的パフォーマンスを実現しました。
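The abstract names optimistic gradient descent-ascent (OGDA) as one of the VI methods related to the golden ratio algorithm. The sketch below runs constant-step OGDA on the toy bilinear problem $\min_x \max_y xy$ and compares it with plain gradient descent-ascent. The problem, step size, and iteration count are assumptions chosen for illustration; the adaptive golden ratio algorithm itself is not implemented here.

```python
import numpy as np


def operator(z):
    """VI operator F(x, y) = (df/dx, -df/dy) for the toy objective f(x, y) = x * y."""
    x, y = z
    return np.array([y, -x])


def gda(z0, step, iters):
    """Plain gradient descent-ascent: z <- z - step * F(z)."""
    z = z0.copy()
    for _ in range(iters):
        z = z - step * operator(z)
    return z


def ogda(z0, step, iters):
    """Optimistic GDA: z_next = z - step * (2 F(z) - F(z_prev))."""
    z_prev = z0.copy()
    z = z0.copy()
    for _ in range(iters):
        z_next = z - step * (2.0 * operator(z) - operator(z_prev))
        z_prev, z = z, z_next
    return z


z0 = np.array([1.0, 1.0])
gda_norm = float(np.linalg.norm(gda(z0, step=0.3, iters=500)))
ogda_norm = float(np.linalg.norm(ogda(z0, step=0.3, iters=500)))
print(gda_norm, ogda_norm)
```

On this problem the unique solution is the origin: plain GDA spirals away from it, while OGDA with the same step size contracts toward it.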

Incremental Learning in Diagonal Linear Networks
対角線形ネットワークにおける増分学習

Diagonal linear networks (DLNs) are a toy simplification of artificial neural networks; they consist in a quadratic reparametrization of linear regression inducing a sparse implicit regularization. In this paper, we describe the trajectory of the gradient flow of DLNs in the limit of small initialization. We show that incremental learning is effectively performed in the limit: coordinates are successively activated, while the iterate is the minimizer of the loss constrained to have support on the active coordinates only. This shows that the sparse implicit regularization of DLNs decreases with time. This work is restricted to the underparametrized regime with anti-correlated features for technical reasons.

対角線線形ネットワーク(DLN)は、人工ニューラルネットワークを単純化したおもちゃです。これらは、スパースな陰的正則化を誘発する線形回帰の2次再パラメータ化で構成されます。この論文では、小さな初期化の限界におけるDLNの勾配流れの軌跡について述べる。インクリメンタル学習は、極限で効果的に実行されることを示します:座標は逐次的にアクティブ化されますが、反復は、アクティブ座標のみをサポートするように制約された損失の最小化器です。これは、DLNのスパースな暗黙的な正則化が時間とともに減少することを示しています。この作業は、技術的な理由から、反相関特性を持つアンダーパラメータ化されたレジームに限定されています。

Small Transformers Compute Universal Metric Embeddings
小型トランスフォーマーはユニバーサルメトリック埋め込みを計算します

We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures equipped with a transport metric (Delon and Desolneux 2020). We prove embedding guarantees for feature maps implemented by small neural networks called probabilistic transformers. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about $n\log(n)$ and width about $n^2$ can bi-Hölder embed any $n$-point dataset from $\mathcal{X}$ with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If the geometry of $\mathcal{X}$ is sufficiently regular, we obtain stronger bi-Lipschitz guarantees for all points. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers compute bi-Hölder embeddings with arbitrarily small distortion. Our results show that any finite metric dataset, from vertices on a graph to functions in a function space, can be faithfully represented in a single representation space, and that the representation can be implemented by a simple transformer architecture. Thus one may only need a modular set of machine learning tools compatible with this one representation space, many of which already exist, for downstream supervised and unsupervised learning from a great variety of data types.

私たちは、輸送メトリック（Delon and Desolneux 2020）を備えた単変量ガウス混合空間における任意のメトリック空間$\mathcal{X}$からのデータの表現を研究します。確率的トランスフォーマーと呼ばれる小さなニューラルネットワークによって実装された特徴マップの埋め込み保証を証明します。我々の保証は記憶型です。深さ約$n\log(n)$、幅約$n^2$の確率的トランスフォーマーは、$\mathcal{X}$からの任意の$n$点データセットを低い計量歪みでbi-Hölder埋め込みでき、次元の呪いを回避できることを証明しています。さらに、確率的bi-Lipschitz保証を導出します。これは、歪みの量と、ランダムに選択された2つの点がその歪みで埋め込まれる確率をトレードオフします。$\mathcal{X}$の幾何学が十分に規則的であれば、すべての点に対してより強力なbi-Lipschitz保証が得られます。応用として、リーマン多様体、計量木、および特定の種類の組合せグラフからのデータセットのニューラル埋め込み保証を導出します。代わりに多変量ガウス混合に埋め込む場合、確率的トランスフォーマーは任意に小さい歪みでbi-Hölder埋め込みを計算することを示す。我々の結果は次のことを示しています。グラフ上の頂点から関数空間の関数まで、任意の有限メトリックデータセットは単一の表現空間で忠実に表現でき、その表現は単純なトランスフォーマーアーキテクチャで実装できます。したがって、さまざまなデータタイプからの下流の教師あり学習と教師なし学習には、この1つの表現空間と互換性のある機械学習ツールのモジュールセット(その多くは既に存在している)のみが必要になる可能性があります。

DART: Distance Assisted Recursive Testing
DART: 距離支援再帰的テスト

Multiple testing is a commonly used tool in modern data science. Sometimes, the hypotheses are embedded in a space; the distances between the hypotheses reflect their co-null/co-alternative patterns. Properly incorporating the distance information in testing will boost testing power. Hence, we developed a new multiple testing framework named Distance Assisted Recursive Testing (DART). DART features in joint artificial intelligence (AI) and statistics modeling. It has two stages. The first stage uses AI models to construct an aggregation tree that reflects the distance information. The second stage uses statistical models to embed the testing on the tree and control the false discovery rate. Theoretical analysis and numerical experiments demonstrated that DART generates valid, robust, and powerful results. We applied DART to a clinical trial in the allogeneic stem cell transplantation study to identify the gut microbiota whose abundance was impacted by post-transplant care.

マルチプルテストは、現代のデータサイエンスで一般的に使用されるツールです。時には、仮説が空間に埋め込まれていることもあります。仮説間の距離は、それらの共帰無/共代替パターンを反映しています。距離情報を適切に組み込むことで、試験力を高めることができます。そこで、Distance Assisted Recursive Testing(DART)という新しいマルチプルテストフレームワークを開発しました。DARTは、人工知能(AI)と統計モデリングの共同機能を備えています。2つのステージがあります。最初のステージでは、AIモデルを使用して、距離情報を反映する集計ツリーを構築します。第2段階では、統計モデルを使用してテストをツリーに埋め込み、誤検出率を制御します。理論解析と数値実験により、DARTは有効で堅牢で強力な結果を生成することが実証されました。同種幹細胞移植研究の臨床試験にDARTを適用し、移植後ケアによってその存在量が影響を受けた腸内細菌叢を特定しました。

Inference on the Change Point under a High Dimensional Covariance Shift
高次元共分散シフト下の変化点に関する推論

We consider the problem of constructing asymptotically valid confidence intervals for the change point in a high-dimensional covariance shift setting. A novel estimator for the change point parameter is developed, and its asymptotic distribution under high dimensional scaling obtained. We establish that the proposed estimator exhibits a sharp $O_p(\psi^{-2})$ rate of convergence, wherein $\psi$ represents the jump size between model parameters before and after the change point. Further, the form of the asymptotic distributions under both a vanishing and a non-vanishing regime of the jump size are characterized. In the former case, it corresponds to the argmax of an asymmetric Brownian motion, while in the latter case to the argmax of an asymmetric random walk. We then obtain the relationship between these distributions, which allows construction of regime (vanishing vs non-vanishing) adaptive confidence intervals. Easy to implement algorithms for the proposed methodology are developed and their performance illustrated on synthetic and real data sets.

私たちは、高次元共分散シフト設定における変化点の漸近的に有効な信頼区間を構築する問題について検討します。変化点パラメータの新しい推定量が開発され、高次元スケーリング下でのその漸近分布が得られました。提案された推定量は、鋭い$O_p(\psi^{-2})$収束率を示すことが確認されました。ここで、$\psi$は変化点の前後のモデルパラメータ間のジャンプサイズを表します。さらに、ジャンプサイズがゼロになる場合とゼロにならない場合の両方での漸近分布の形が特徴付けられます。前者の場合、これは非対称ブラウン運動のargmaxに対応し、後者の場合、非対称ランダムウォークのargmaxに対応します。次に、これらの分布間の関係を取得し、これにより、(ゼロvsゼロでない)適応型信頼区間の構築が可能になります。提案された方法論の実装が容易なアルゴリズムが開発され、合成データセットと実際のデータセットでそのパフォーマンスが示されます。

Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start
低水準収縮によるバイレベル最適化:ウォームスタートなしの最適なサンプル複雑性

We analyse a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problem includes instances of meta-learning, equilibrium models, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e. they use the previous lower-level approximate solution as a starting point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, e.g., meta-learning and equilibrium models, in which the warm-start procedure is not well-suited or ineffective. In this work we show that without warm-start, it is still possible to achieve order-wise (near) optimal sample complexity. In particular, we propose a simple method which uses (stochastic) fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and the deterministic setting, respectively. Finally, compared to methods using warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.

私たちは、上位レベルの問題が滑らかな目的関数の最小化から成り、下位レベルの問題が滑らかな収縮マップの不動点を見つけることである、一般的なクラスの二階層問題を分析します。このタイプの問題には、メタ学習、平衡モデル、ハイパーパラメータ最適化、およびデータポイズニング敵対的攻撃のインスタンスが含まれます。最近のいくつかの研究では、下位レベルの問題をウォームスタートするアルゴリズムが提案されています。つまり、下位レベルのソルバーの開始点として、以前の下位レベルの近似ソリューションを使用します。このウォームスタート手順により、確率的および決定論的設定の両方でサンプル複雑度を改善でき、場合によっては順序ごとに最適なサンプル複雑度を達成できます。ただし、メタ学習や平衡モデルなど、ウォームスタート手順が適していないか効果がない状況もあります。この研究では、ウォームスタートがなくても、順序ごとに（ほぼ）最適なサンプル複雑度を達成できることを示します。特に、下位レベルでは（確率的）固定点反復を使用し、上位レベルでは投影された不正確な勾配降下法を使用する簡単な方法を提案します。この方法では、確率的設定と決定論的設定に対してそれぞれ$O(\epsilon^{-2})$と$\tilde{O}(\epsilon^{-1})$のサンプルを使用して$\epsilon$定常点に到達します。最後に、ウォームスタートを使用する方法と比較して、私たちのアプローチは、上位レベルと下位レベルの反復の間の結合された相互作用を調べる必要がない、より単純な分析をもたらします。
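The scheme described above (fixed-point iterations at the lower level, projected inexact gradient descent at the upper level, no warm start) can be sketched on a scalar toy problem. Everything concrete here is an assumption for illustration: the contraction map $\Phi(w, \lambda) = 0.5w + \lambda$, the upper objective $\tfrac{1}{2}(w^*(\lambda) - 1)^2$, the interval $[-1, 2]$, and the step size. The hypergradient uses implicit differentiation of the fixed-point equation, which need not match the paper's estimator.

```python
def lower_map(w: float, lam: float) -> float:
    """Toy contraction in w (factor 0.5); its fixed point is w*(lam) = 2 * lam."""
    return 0.5 * w + lam


def solve_lower(lam: float, iters: int = 40) -> float:
    """Fixed-point iterations from a cold start (w = 0), i.e. no warm start."""
    w = 0.0
    for _ in range(iters):
        w = lower_map(w, lam)
    return w


def hypergradient(lam: float) -> float:
    """Inexact gradient of the upper objective 0.5 * (w*(lam) - 1)^2."""
    w = solve_lower(lam)
    # Implicit differentiation: dw*/dlam = (dPhi/dlam) / (1 - dPhi/dw) = 1 / (1 - 0.5).
    dw_dlam = 1.0 / (1.0 - 0.5)
    return (w - 1.0) * dw_dlam


def project(lam: float, lo: float = -1.0, hi: float = 2.0) -> float:
    """Projection onto the feasible interval [lo, hi]."""
    return min(max(lam, lo), hi)


lam = 2.0
for _ in range(60):
    lam = project(lam - 0.1 * hypergradient(lam))
print(lam)
```

Since $w^*(\lambda) = 2\lambda$, the upper objective is minimized at $\lambda = 0.5$, which the iterates approach even though each lower-level solve restarts from zero.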

A Parameter-Free Conditional Gradient Method for Composite Minimization under Hölder Condition
ホルダー条件下の複合最小化のためのパラメータフリー条件付き勾配法

In this paper we consider a composite optimization problem that minimizes the sum of a weakly smooth function and a convex function with either a bounded domain or a uniformly convex structure. In particular, we first present a parameter-dependent conditional gradient method for this problem, whose step sizes require prior knowledge of the parameters associated with the Hölder continuity of the gradient of the weakly smooth function, and establish its rate of convergence. Given that these parameters could be unknown or known but possibly conservative, such a method may suffer from implementation issue or slow convergence. We therefore propose a parameter-free conditional gradient method whose step size is determined by using a constructive local quadratic upper approximation and an adaptive line search scheme, without using any problem parameter. We show that this method achieves the same rate of convergence as the parameter-dependent conditional gradient method. Preliminary experiments are also conducted and illustrate the superior performance of the parameter-free conditional gradient method over the methods with some other step size rules.

この論文では、弱滑らかな関数と、有界領域または一様凸構造のいずれかを持つ凸関数の和を最小化する複合最適化問題を検討します。特に、まずこの問題に対するパラメータ依存条件付き勾配法を提示します。このステップサイズには、弱滑らかな関数の勾配のHölder連続性に関連するパラメータの事前知識が必要であり、その収束率を確立します。これらのパラメータは未知または既知だが保守的である可能性があることを考えると、このような方法は実装上の問題や収束の遅さに悩まされる可能性があります。そこで、問題パラメータを使用せずに、構成的局所二次上近似と適応型直線探索スキームを使用してステップサイズを決定する、パラメータフリー条件付き勾配法を提案します。この方法が、パラメータ依存条件付き勾配法と同じ収束率を達成することを示す。予備実験も実施し、パラメータフリー条件付き勾配法が他のステップサイズ規則を持つ方法よりも優れたパフォーマンスを示すことを示した。
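To illustrate the general idea of a conditional gradient step whose size comes from a local quadratic upper model and backtracking, rather than from known smoothness constants, here is a generic Frank-Wolfe sketch over the probability simplex. It is a textbook-style backtracking variant written under assumptions made here (a quadratic objective, an initial curvature guess of 0.1, doubling on failure). It is not the paper's algorithm and it does not cover the Hölder setting.

```python
import numpy as np


def frank_wolfe_simplex(f, grad, x0, iters=2000, curvature_guess=0.1):
    """Frank-Wolfe on the probability simplex with a backtracked curvature estimate."""
    x = x0.copy()
    L = curvature_guess
    for _ in range(iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # linear minimization oracle: best simplex vertex
        d = s - x
        gap = -float(g @ d)            # Frank-Wolfe gap, nonnegative for convex f
        d_sq = float(d @ d)
        if gap <= 1e-12 or d_sq == 0.0:
            break
        while True:
            gamma = min(1.0, gap / (L * d_sq))   # minimizer of the quadratic model on [0, 1]
            x_new = x + gamma * d
            model = f(x) - gamma * gap + 0.5 * L * gamma ** 2 * d_sq
            if f(x_new) <= model + 1e-12:
                break
            L *= 2.0                   # model was too optimistic: increase the estimate
        x = x_new
    return x


target = np.array([0.2, 0.3, 0.5])


def objective(x):
    return 0.5 * float(np.sum((x - target) ** 2))


def gradient(x):
    return x - target


x_hat = frank_wolfe_simplex(objective, gradient, np.array([1.0, 0.0, 0.0]))
print(x_hat, objective(x_hat))
```

Each iterate is a convex combination of simplex vertices, so feasibility holds without any projection step.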

Robust Methods for High-Dimensional Linear Learning
高次元線形学習のためのロバストな手法

We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. We employ, in a generic learning setting, two algorithms depending on whether the considered loss function is gradient-Lipschitz or not. Then, we instantiate our framework on several applications including vanilla sparse, group-sparse and low-rank matrix recovery. This leads, for each application, to efficient and robust learning algorithms, that reach near-optimal estimation rates under heavy-tailed distributions and the presence of outliers. For vanilla $s$-sparsity, we are able to reach the $s\log (d)/n$ rate under heavy-tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source Python library called linlearn, by means of which we carry out numerical experiments which confirm our theoretical findings together with a comparison to other recent approaches proposed in the literature.

私たちは、特徴数$d$がサンプルサイズ$n$を超える可能性がある高次元バッチ設定において、統計的に堅牢で計算効率の高い線形学習法を提案します。一般的な学習設定では、考慮する損失関数が勾配リプシッツであるかどうかに応じて2つのアルゴリズムを使用します。次に、バニラスパース、グループスパース、低ランク行列回復などのいくつかのアプリケーションでフレームワークをインスタンス化します。これにより、各アプリケーションで、裾の重い分布と外れ値の存在下でほぼ最適な推定率を達成する、効率的で堅牢な学習アルゴリズムが実現されます。バニラ$s$スパース性の場合、裾の重い分布と$\eta$破損の下で$s\log (d)/n$のレートに到達でき、計算コストは堅牢でない類似物と同等です。私たちは、linlearnと呼ばれるオープンソースのPythonライブラリでアルゴリズムの効率的な実装を提供し、それを使用して数値実験を実行し、文献で提案されている他の最近のアプローチとの比較とともに理論的発見を確認します。
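Robustness to heavy tails and outliers is often obtained by replacing an empirical mean with a robust estimator. The median-of-means sketch below is one common example, shown only as an illustration. It does not use the linlearn API, and the abstract does not state which estimators the authors use, so the choice of median-of-means is an assumption, as are the function name and the data.

```python
import numpy as np


def median_of_means(values, n_blocks):
    """Split the sample into blocks, average each block, return the median of the averages."""
    values = np.asarray(values, dtype=float)
    blocks = np.array_split(values, n_blocks)
    # In practice the sample is usually shuffled first; omitted to keep the example deterministic.
    return float(np.median([block.mean() for block in blocks]))


# 99 clean observations equal to 1.0 and a single gross outlier.
samples = np.array([1.0] * 99 + [1e6])
plain_mean = float(samples.mean())
robust_mean = median_of_means(samples, n_blocks=10)
print(plain_mean, robust_mean)
```

The outlier corrupts only one block, so the median of the block averages stays at the clean value while the plain mean is pulled far away.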

A Framework and Benchmark for Deep Batch Active Learning for Regression
回帰のためのディープバッチアクティブラーニングのフレームワークとベンチマーク

The acquisition of labels for supervised learning can be expensive. To improve the sample efficiency of neural network regression, we study active learning methods that adaptively select batches of unlabeled data for labeling. We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian process approximations of neural networks as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width neural tangent kernels and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression data sets. Our proposed method outperforms the state-of-the-art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.

教師あり学習のラベルの取得はコストがかかる場合があります。ニューラルネットワーク回帰のサンプル効率を改善するために、ラベル付けのためにラベルなしデータのバッチを適応的に選択するアクティブラーニング手法を研究します。(ネットワーク依存の)基本カーネル、カーネル変換、および選択手法からこのような手法を構築するためのフレームワークを紹介します。このフレームワークには、ニューラルネットワークのガウス過程近似に基づく既存のベイズ手法や非ベイズ手法が多数含まれています。さらに、一般的に使用される最終層の特徴をスケッチされた有限幅のニューラル接線カーネルに置き換え、新しいクラスタリング手法と組み合わせることを提案します。さまざまな手法を評価するために、15の大規模な表形式の回帰データセットで構成されるオープンソースベンチマークを導入します。提案された手法は、ベンチマークで最先端の手法よりも優れており、大規模なデータセットに拡張でき、ネットワークアーキテクチャやトレーニングコードを調整することなくすぐに使用できます。私たちは、すべてのカーネル、カーネル変換、および選択方法の効率的な実装を含み、結果を再現するために使用できるオープンソースコードを提供します。

Preconditioned Gradient Descent for Overparameterized Nonconvex Burer–Monteiro Factorization with Global Optimality Certification
大域的最適性証明を伴う過剰パラメータ化非凸 Burer-Monteiro 分解のための前処理勾配降下法

We consider using gradient descent to minimize the nonconvex function $f(X)=\phi(XX^{T})$ over an $n\times r$ factor matrix $X$, in which $\phi$ is an underlying smooth convex cost function defined over $n\times n$ matrices. While only a second-order stationary point $X$ can be provably found in reasonable time, if $X$ is additionally rank deficient, then its rank deficiency certifies it as being globally optimal. This way of certifying global optimality necessarily requires the search rank $r$ of the current iterate $X$ to be overparameterized with respect to the rank $r^{\star}$ of the global minimizer $X^{\star}$. Unfortunately, overparameterization significantly slows down the convergence of gradient descent, from a linear rate with $r=r^{\star}$ to a sublinear rate when $r>r^{\star}$, even when $\phi$ is strongly convex. In this paper, we propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case, while also making it agnostic to possible ill-conditioning in the global minimizer $X^{\star}$.

私たちは、勾配降下法を使用して、$n\times r$因子行列$X$上の非凸関数$f(X)=\phi(XX^{T})$を最小化することを考えます。ここで、$\phi$は、$n\times n$行列上で定義される基礎となる滑らかな凸コスト関数です。2次定常点$X$のみが妥当な時間で証明可能に見つかりますが、$X$がさらにランク不足である場合、そのランク不足により、$X$はグローバルに最適であると証明されます。このグローバル最適性の証明方法では、必然的に、現在の反復$X$の検索ランク$r$が、グローバル最小化関数$X^{\star}$のランク$r^{\star}$に関して過剰パラメータ化される必要があります。残念ながら、過剰パラメータ化により、$\phi$が強く凸であっても、勾配降下法の収束が、$r=r^{\star}$での線形速度から、$r>r^{\star}$での劣線形速度まで大幅に遅くなります。この論文では、過剰パラメータ化されたケースで勾配降下法の収束率を線形に戻し、同時にグローバル最小化器$X^{\star}$の悪条件化の可能性を無視できる安価な前処理を提案します。

Flexible Model Aggregation for Quantile Regression
分位点回帰のための柔軟なモデル集計

Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions, or to model a diverse population without being overly reductive. For instance, epidemiological forecasts, cost estimates, and revenue predictions all benefit from being able to quantify the range of possible values accurately. As such, many models have been developed for this problem over many years of research in statistics, machine learning, and related fields. Rather than proposing yet another (new) algorithm for quantile regression we adopt a meta viewpoint: we investigate methods for aggregating any number of conditional quantile models, in order to improve accuracy and robustness. We consider weighted ensembles where weights may vary over not only individual models, but also over quantile levels, and feature values. All of the models we consider in this paper can be fit using modern deep learning toolkits, and hence are widely accessible (from an implementation point of view) and scalable. To improve the accuracy of the predicted quantiles (or equivalently, prediction intervals), we develop tools for ensuring that quantiles remain monotonically ordered, and apply conformal calibration methods. These can be used without any modification of the original library of base models. We also review some basic theory surrounding quantile aggregation and related scoring rules, and contribute a few new results to this literature (for example, the fact that post sorting or post isotonic regression can only improve the weighted interval score). Finally, we provide an extensive suite of empirical comparisons across 34 data sets from two different benchmark repositories.

分位回帰は、予測の不確実性を定量化したり、過度に簡略化することなく多様な集団をモデル化したりする必要性から生まれた、統計学習における基本的な問題です。たとえば、疫学的予測、コスト見積もり、収益予測はすべて、可能な値の範囲を正確に定量化できることから恩恵を受けます。そのため、統計、機械学習、および関連分野における長年の研究を通じて、この問題に対する多くのモデルが開発されてきました。分位回帰のさらに別の（新しい）アルゴリズムを提案するのではなく、メタ観点を採用します。つまり、精度と堅牢性を向上させるために、任意の数の条件付き分位モデルを集約する方法を調査します。重みが個々のモデルだけでなく、分位レベルや特徴値によっても変化する可能性がある重み付きアンサンブルを検討します。この論文で検討するモデルはすべて、最新のディープラーニングツールキットを使用して適合できるため、（実装の観点から）広くアクセスでき、スケーラブルです。予測された分位数(同じことですが、予測区間)の精度を向上させるために、分位数が単調に順序付けられていることを保証するツールを開発し、コンフォーマル較正法を適用します。これらは、基本モデルのオリジナルライブラリを変更することなく使用できます。また、分位数の集約と関連するスコアリングルールに関する基本理論をいくつか見直し、この文献にいくつかの新しい結果を提供します(たとえば、事後のソートや事後の単調回帰は加重区間スコアを改善することはあっても悪化させることはないという事実)。最後に、2つの異なるベンチマークリポジトリからの34のデータセットにわたる広範な実験比較を提供します。
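The abstract notes that post sorting can only improve the weighted interval score. The sketch below checks a single hand-made case using the summed pinball (quantile) loss, which is closely related to that score. The quantile levels, predictions, and observation are assumptions chosen for illustration, and one example is of course not a proof.

```python
def pinball_loss(y: float, q_pred: float, level: float) -> float:
    """Pinball loss of predicting q_pred for the `level` quantile when y is observed."""
    diff = y - q_pred
    return max(level * diff, (level - 1.0) * diff)


def total_pinball(y, preds, levels):
    """Sum of pinball losses over all quantile levels."""
    return sum(pinball_loss(y, q, t) for q, t in zip(preds, levels))


levels = [0.1, 0.5, 0.9]
crossed = [2.0, 1.5, 0.5]        # quantile crossing: predictions decrease as the level grows
repaired = sorted(crossed)       # post sorting restores monotone order

loss_crossed = total_pinball(1.0, crossed, levels)
loss_repaired = total_pinball(1.0, repaired, levels)
print(loss_crossed, loss_repaired)
```

Sorting needs no access to the base models, which matches the abstract's remark that the repair tools work without modifying the original model library.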

q-Learning in Continuous Time
連続時間でのq-学習

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term “(little) q-function”. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a “q-learning” theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor–critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.

私たちは、Wangら(2020)によって導入されたエントロピー正則化された探索的拡散過程定式化の下で、強化学習(RL)のための連続時間版Q学習を研究します。従来の(大きな) Q関数は連続時間で崩壊するため、その一次近似を考慮し、「(小さな) q関数」という用語を生み出す。この関数は、瞬間的な優位性率関数とハミルトニアンに関連しています。私たちは、時間の離散化に依存しないq関数を中心とした「q学習」理論を展開します。確率的ポリシーが与えられた場合、オンポリシー設定とオフポリシー設定の両方で、特定の確率過程のマルチンゲール条件によって、関連するq関数と価値関数を共同で特徴付ける。次に、この理論を適用して、q関数から生成されたギブス測度の密度関数を明示的に計算できるかどうかに応じて、基礎となるRL問題を解決するためのさまざまなアクター-クリティックアルゴリズムを考案します。私たちのアルゴリズムの1つは、よく知られているQ学習アルゴリズムSARSAを解釈し、もう1つはJiaとZhou (2022b)で提案されたポリシー勾配(PG)ベースの連続時間アルゴリズムを復元します。最後に、シミュレーション実験を行って、私たちのアルゴリズムのパフォーマンスをJiaとZhou (2022b)のPGベースのアルゴリズムや時間離散化された従来のQ学習アルゴリズムと比較します。

Multivariate Soft Rank via Entropy-Regularized Optimal Transport: Sample Efficiency and Generative Modeling
エントロピー正則化最適輸送による多変量ソフトランク:サンプル効率と生成モデリング

The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting as corresponding to an optimal transport map, while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper, we point to and alleviate some of the shortcomings of these GoF statistics that are of practical significance, namely high computational cost, curse of dimensionality in statistical sample complexity, and lack of differentiability with respect to the data. We show that all these issues are addressed by defining multivariate rank as an entropic transport map derived from the entropic regularization of the optimal transport problem, which we refer to as the soft rank. We consequently propose two new statistics, the soft rank energy (sRE) and soft rank maximum mean discrepancy (sRMMD). Given $n$ sample data points, we provide non-asymptotic convergence rates for the sample estimate of the entropic transport map to its population version that are essentially of the order $n^{-1/2}$ when the source measure is subgaussian and the target measure has compact support. This result is novel compared to existing results which achieve a rate of $n^{-1}$ but crucially rely on both measures having compact support. In contrast, the corresponding convergence rate of estimating an optimal transport map, and hence the rank map, is exponential in the data dimension. We leverage these fast convergence rates to show that the sample estimates of sRE and sRMMD converge rapidly to their population versions. Combined with the computational efficiency of methods in solving the entropy-regularized optimal transport problem, these results enable efficient rank-based GoF statistical computation, even in high dimensions. Furthermore, the sample estimates of sRE and sRMMD are differentiable with respect to the data and amenable to popular machine learning frameworks that rely on gradient methods. We leverage these properties towards showcasing their utility for generative modeling on two important problems: image generation and generating valid knockoffs for controlled feature selection.

最適輸送の枠組みは、ランクの概念を最適輸送写像に対応するものとして多変量設定へ拡張するために活用されており、その結果得られる適合度(GoF)統計量の望ましい性質は保持されます。特に、ランクエネルギー(RE)とランク最大平均不一致(RMMD)は、帰無仮説の下で分布フリーであり、統計的検定で高い検出力を示し、外れ値に対して頑健です。この論文では、これらのGoF統計量の実用上重要な欠点、つまり計算コストの高さ、統計的サンプル複雑度における次元の呪い、およびデータに関する微分可能性の欠如を指摘し、軽減します。これらの問題はすべて、多変量ランクを最適輸送問題のエントロピー正則化から導出されたエントロピー輸送写像として定義することで解決されることを示します。これをソフトランクと呼びます。その結果、ソフトランクエネルギー(sRE)とソフトランク最大平均不一致(sRMMD)という2つの新しい統計量を提案します。n個のサンプルデータ点が与えられたとき、ソース測度がサブガウスでターゲット測度がコンパクトサポートを持つ場合について、エントロピー輸送写像のサンプル推定値が母集団版へ収束する非漸近的な収束率を示します。この収束率は本質的にn^(-1/2)のオーダーです。この結果は、n^(-1)のレートを達成するものの、両方の測度がコンパクトサポートを持つことに決定的に依存している既存の結果と比較して新しいものです。対照的に、最適輸送写像、つまりランク写像を推定する場合の対応する収束率は、データ次元に関して指数的です。これらの高速な収束率を利用して、sREおよびsRMMDのサンプル推定値が母集団版に急速に収束することを示します。エントロピー正則化最適輸送問題を解く手法の計算効率と組み合わせると、これらの結果により、高次元でも効率的なランクベースのGoF統計量の計算が可能になります。さらに、sREおよびsRMMDのサンプル推定値はデータに関して微分可能であり、勾配法に依存する一般的な機械学習フレームワークに適しています。私たちはこれらの特性を活用して、画像生成と、制御された特徴選択のための有効なノックオフの生成という2つの重要な問題において、生成モデリングでの有用性を示します。
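
To make the idea of a soft rank concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It runs Sinkhorn iterations for entropy-regularized optimal transport between a sample and a set of reference points, then takes the barycentric projection of the resulting plan as the entropic map. The function name `soft_ranks`, the uniform reference grid on (0, 1), the quadratic cost, and the values of `eps` and `n_iter` are all assumptions chosen for illustration.

```python
import math
import random


def soft_ranks(xs, us, eps=0.5, n_iter=200):
    """Barycentric projection of the entropic transport plan from xs to us.

    xs: sample points (uniform weights), us: reference points (uniform weights).
    eps is the entropic regularization strength (illustrative default).
    """
    n, m = len(xs), len(us)
    # Gibbs kernel for the quadratic cost c(x, u) = (x - u)^2 / 2.
    K = [[math.exp(-0.5 * (x - u) ** 2 / eps) for u in us] for x in xs]
    a = [1.0] * n
    b = [1.0] * m
    for _ in range(n_iter):
        # Sinkhorn scaling: match row marginals, then column marginals.
        a = [(1.0 / n) / sum(K[i][j] * b[j] for j in range(m)) for i in range(n)]
        b = [(1.0 / m) / sum(K[i][j] * a[i] for i in range(n)) for j in range(m)]
    ranks = []
    for i in range(n):
        # Row i of the plan, up to a constant factor that cancels below.
        w = [K[i][j] * b[j] for j in range(m)]
        ranks.append(sum(wj * uj for wj, uj in zip(w, us)) / sum(w))
    return ranks


random.seed(0)
xs = sorted(random.gauss(0.0, 1.0) for _ in range(30))
us = [(j + 0.5) / 30 for j in range(30)]  # hypothetical reference grid on (0, 1)
ranks = soft_ranks(xs, us)
```

Each soft rank is a smooth weighted average of the reference points, which is what makes the resulting statistics differentiable in the data. Smaller values of `eps` bring the map closer to the hard rank at the price of slower and less stable iterations.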

Infinite-dimensional optimization and Bayesian nonparametric learning of stochastic differential equations
無限次元最適化と確率微分方程式のベイズノンパラメトリック学習

The paper has two major themes. The first part of the paper establishes certain general results for infinite-dimensional optimization problems on Hilbert spaces. These results cover the classical representer theorem and many of its variants as special cases and offer a wider scope of applications. The second part of the paper then develops a systematic approach for learning the drift function of a stochastic differential equation by integrating the results of the first part with a Bayesian hierarchical framework. Importantly, our Bayesian approach incorporates low-cost sparse learning through proper use of shrinkage priors while allowing proper quantification of uncertainty through posterior distributions. Several examples at the end illustrate the accuracy of our learning scheme.

この論文には2つの主要なテーマがあります。この論文の最初の部分では、ヒルベルト空間上の無限次元最適化問題に対する特定の一般的な結果を確立します。これらの結果は、古典的な表現定理とその多くのバリアントを特殊なケースとしてカバーし、より広い範囲のアプリケーションを提供します。次に、論文の第2部では、第1部の結果をベイズ階層フレームワークと統合することにより、確率微分方程式のドリフト関数を学習するための体系的なアプローチを開発します。重要なことに、私たちのベイジアンアプローチは、収縮事前分布の適切な使用による低コストのスパース学習を組み込んでおり、事後分布による不確実性の適切な定量化を可能にしています。最後にあるいくつかの例は、私たちの学習スキームの正確さを示しています。

Asynchronous Iterations in Optimization: New Sequence Results and Sharper Algorithmic Guarantees
最適化における非同期反復: 新しいシーケンス結果とよりシャープなアルゴリズム保証

We introduce novel convergence results for asynchronous iterations that appear in the analysis of parallel and distributed optimization algorithms. The results are simple to apply and give explicit estimates for how the degree of asynchrony impacts the convergence rates of the iterates. Our results shorten, streamline and strengthen existing convergence proofs for several asynchronous optimization methods and allow us to establish convergence guarantees for popular algorithms that were thus far lacking a complete theoretical understanding. Specifically, we use our results to derive better iteration complexity bounds for proximal incremental aggregated gradient methods, to obtain tighter guarantees depending on the average rather than maximum delay for the asynchronous stochastic gradient descent method, to provide less conservative analyses of the speedup conditions for asynchronous block-coordinate implementations of Krasnoselskii–Mann iterations, and to quantify the convergence rates for totally asynchronous iterations under various assumptions on communication delays and update rates.

私たちは、並列および分散最適化アルゴリズムの解析で現れる非同期反復の新しい収束結果を紹介します。結果は簡単に適用でき、非同期の程度が反復の収束率にどのように影響するかについて明確な推定値を提供します。我々の結果は、いくつかの非同期最適化方法の既存の収束証明を短縮、合理化、強化し、これまで完全な理論的理解が欠如していた一般的なアルゴリズムの収束保証を確立することを可能にします。具体的には、我々は結果を使用して、近似増分集約勾配法のより優れた反復複雑度境界を導出し、非同期確率的勾配降下法の最大遅延ではなく平均遅延に応じたより厳密な保証を取得し、クラスノセルスキー-マン反復の非同期ブロック座標実装の高速化条件のより保守的でない解析を提供し、通信遅延と更新レートに関するさまざまな仮定の下で完全に非同期の反復の収束率を定量化します。

Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the O(ε^(-7/4)) Complexity
リスタート付き非凸加速勾配降下法: O(ε^(-7/4)) の計算量から多重対数因子を排除

This paper studies accelerated gradient methods for nonconvex optimization with Lipschitz continuous gradient and Hessian. We propose two simple accelerated gradient methods, restarted accelerated gradient descent (AGD) and restarted heavy ball (HB) method, and establish that our methods achieve an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-7/4})$ number of gradient evaluations by elementary proofs. Theoretically, our complexity does not hide any polylogarithmic factors, and thus it improves over the best known one by the $O(\log\frac{1}{\epsilon})$ factor. Our algorithms are simple in the sense that they only consist of Nesterov’s classical AGD or Polyak’s HB iterations, as well as a restart mechanism. They do not invoke negative curvature exploitation or minimization of regularized surrogate functions as the subroutines. In contrast with existing analysis, our elementary proofs use less advanced techniques and do not invoke the analysis of strongly convex AGD or HB.

この論文では、リプシッツ連続な勾配とヘッセ行列を持つ非凸最適化のための加速勾配法について検討します。私たちは、リスタート付き加速勾配降下法(AGD)とリスタート付きヘビーボール法(HB)という2つの単純な加速勾配法を提案し、これらの方法が$O(\epsilon^{-7/4})$回の勾配評価で$\epsilon$近似の1次定常点に到達することを初等的な証明によって確立します。理論的には、我々の計算量は多重対数因子を隠していないため、既知の最良の結果を$O(\log\frac{1}{\epsilon})$倍だけ改善します。我々のアルゴリズムは、ネステロフの古典的なAGDまたはポリャクのHB反復と、リスタート機構のみで構成されるという意味で単純です。負の曲率の利用や正則化された代理関数の最小化をサブルーチンとして呼び出すことはありません。既存の解析とは対照的に、私たちの初等的な証明はそれほど高度な技術を使用せず、強凸の場合のAGDやHBの解析も援用しません。

Integrating Random Effects in Deep Neural Networks
ディープニューラルネットワークにおけるランダム効果の統合

Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.

ディープニューラルネットワーク(DNN)などの最新の教師あり学習アプローチでは、通常、観測された応答が統計的に独立していると暗黙的に想定されます。対照的に、相関データは実際の大規模アプリケーションで広く使用されており、相関の一般的なソースには空間、時間、クラスタリング構造が含まれます。これらの相関はDNNによって無視されるか、特定のユースケース向けにアドホックソリューションが開発されます。DNNで相関データを処理するには、混合モデルフレームワークを使用することを提案します。相関構造の根底にある効果をランダム効果として扱うことで、混合モデルは過剰適合したパラメーター推定を回避し、最終的に予測パフォーマンスを向上させることができます。混合モデルとDNNを組み合わせる鍵は、ガウス負対数尤度(NLL)を自然損失関数として使用し、確率的勾配降下法(SGD)などのDNN機構で最小化することです。NLLは標準のDNN損失関数のように分解されないため、NLLでのSGDの使用にはいくつかの理論的および実装上の課題があり、それらに対処します。LMMNNと呼ばれる私たちのアプローチは、さまざまなシミュレートされたデータセットと実際のデータセットのさまざまな相関シナリオで、自然な競合相手よりもパフォーマンスが向上することが実証されています。私たちの焦点は回帰設定と表形式のデータセットにありますが、分類の結果もいくつか示しています。私たちのコードはhttps://github.com/gsimchoni/lmmnnで入手できます。
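
The role of the Gaussian negative log-likelihood as a loss can be illustrated with the simplest mixed-model case, a single random intercept per cluster. The sketch below is an illustration under that assumption and is not taken from the LMMNN code base. The function name `random_intercept_nll` and its arguments are hypothetical, and `f` stands for the network outputs. Within a cluster of size n_j the covariance is sig2e * I + sig2b * 1 1^T, so its inverse and determinant have closed forms.

```python
import math
from collections import defaultdict


def random_intercept_nll(y, f, cluster, sig2e, sig2b):
    """Gaussian NLL of y given fixed-part predictions f and cluster labels.

    Covariance per cluster: sig2e * I + sig2b * (all-ones matrix).
    sig2e is the noise variance, sig2b the random-intercept variance.
    """
    residuals = defaultdict(list)
    for yi, fi, ci in zip(y, f, cluster):
        residuals[ci].append(yi - fi)
    nll = 0.5 * len(y) * math.log(2.0 * math.pi)
    for r in residuals.values():
        nj = len(r)
        s1 = sum(r)                   # sum of residuals in the cluster
        s2 = sum(v * v for v in r)    # sum of squared residuals
        denom = sig2e + nj * sig2b
        # Quadratic form r^T V^{-1} r via the Sherman-Morrison identity.
        quad = (s2 - sig2b * s1 * s1 / denom) / sig2e
        # log det V = (nj - 1) * log(sig2e) + log(sig2e + nj * sig2b).
        logdet = (nj - 1) * math.log(sig2e) + math.log(denom)
        nll += 0.5 * (quad + logdet)
    return nll


nll_pair = random_intercept_nll([1.0, 2.0], [0.5, 2.5], [0, 0], 1.0, 0.5)
```

The term `s1 * s1` couples all residuals in a cluster, which is why this loss does not split into a sum of per-observation terms the way a squared-error loss does. With `sig2b = 0` the coupling vanishes and the expression reduces to the usual independent Gaussian NLL.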

Adaptive Data Depth via Multi-Armed Bandits
マルチアームバンディットによる適応型データ深度

Data depth, introduced by Tukey (1975), is an important tool in data science, robust statistics, and computational geometry. One chief barrier to its broader practical utility is that many common measures of depth are computationally intensive, requiring on the order of $n^d$ operations to exactly compute the depth of a single point within a data set of $n$ points in $d$-dimensional space. Often however, we are not directly interested in the absolute depths of the points, but rather in their relative ordering. For example, we may want to find the most central point in a data set (a generalized median), or to identify and remove all outliers (points on the fringe of the data set with low depth). With this observation, we develop a novel instance-adaptive algorithm for adaptive data depth computation by reducing the problem of exactly computing $n$ depths to an $n$-armed stochastic multi-armed bandit problem which we can efficiently solve. We focus our exposition on simplicial depth, developed by Liu (1990), which has emerged as a promising notion of depth due to its interpretability and asymptotic properties. We provide general data-dependent theoretical guarantees for our proposed algorithms, which readily extend to many other common measures of data depth including majority depth, Oja depth, and likelihood depth. When specialized to the case where the gaps in the data follow a power law distribution with parameter $\alpha<2$, we reduce the complexity of identifying the deepest point in the data set (the simplicial median) from $O(n^d)$ to $\tilde{O}(n^{d-(d-1)\alpha/2})$, where $\tilde{O}$ suppresses a logarithmic factor. We corroborate our theoretical results with numerical experiments on synthetic data, showing the practical utility of our proposed methods.

Tukey (1975)によって導入されたデータ深度は、データサイエンス、ロバスト統計、計算幾何学における重要なツールです。その幅広い実用性に対する主な障壁の1つは、深度の一般的な測定方法の多くが計算集約的であり、d次元空間のn点のデータセット内の1点の深度を正確に計算するためにn^d回の演算が必要になることです。ただし、多くの場合、ポイントの絶対深度に直接関心があるのではなく、ポイントの相対的な順序に興味があります。たとえば、データセットの最も中心的なポイント(一般化された中央値)を見つけたり、すべての外れ値(深度が低いデータセットの端にあるポイント)を特定して削除したりしたい場合があります。この観察に基づいて、n深度を正確に計算する問題を、効率的に解決できるnアームの確率的多腕バンディット問題に縮小することで、適応型データ深度計算のための新しいインスタンス適応型アルゴリズムを開発しました。私たちは、Liu (1990)によって開発された単体深度に焦点を当てて解説します。単体深度は、その解釈可能性と漸近特性により、有望な深度の概念として浮上しています。提案アルゴリズムには、一般的なデータ依存の理論的保証を提供します。これは、多数決深度、Oja深度、尤度深度など、他の多くの一般的なデータ深度測定に容易に拡張できます。データ内のギャップがパラメーター$\alpha<2$のべき乗分布に従うケースに特化することで、データセット内の最も深いポイント(単体中央値)を特定する複雑さが$O(n^d)$から$\tilde{O}(n^{d-(d-1)\alpha/2})$に軽減されます。ここで、$\tilde{O}$は対数係数を抑制します。私たちは、合成データに対する数値実験で理論的結果を裏付け、提案方法の実用性を示します。
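
The link between simplicial depth and sampling can be sketched in two dimensions, where the depth of a point is the fraction of triangles spanned by data points that contain it. The code below is only a uniform Monte Carlo estimator, not the adaptive bandit algorithm of the paper: each sampled triangle plays the role of one pull of the arm associated with the query point. The function names and the sample sizes are assumptions for illustration.

```python
import random


def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if p lies inside or on the boundary of triangle abc."""
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def sampled_simplicial_depth(p, data, n_pulls, rng):
    """Estimate the simplicial depth of p from n_pulls random triangles."""
    hits = 0
    for _ in range(n_pulls):
        a, b, c = rng.sample(data, 3)  # three distinct data points
        hits += in_triangle(p, a, b, c)
    return hits / n_pulls


rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
center_depth = sampled_simplicial_depth((0.0, 0.0), data, 2000, rng)
far_depth = sampled_simplicial_depth((10.0, 10.0), data, 2000, rng)
```

An adaptive scheme would spend few pulls on points such as `(10, 10)`, whose depth is clearly low, and concentrate the sampling budget on the candidates that are hard to separate from the deepest point.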

Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees
勾配ブースト決定木の影響推定法の適応と評価

Influence estimation analyzes how changes to the training data can lead to different model predictions; this analysis can help us better understand these predictions, the models making those predictions, and the data sets they are trained on. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are a powerful and widely-used class of models; however, these models are black boxes with opaque decision-making processes. In the pursuit of better understanding GBDT predictions and generally improving these models, we adapt recent and popular influence-estimation methods designed for deep learning models to GBDTs. Specifically, we adapt representer-point methods and TracIn, denoting our new methods TREX and BoostIn, respectively; source code is available at https://github.com/jjbrophy47/treeinfluence. We compare these methods to LeafInfluence and other baselines using 5 different evaluation measures on 22 real-world data sets with 4 popular GBDT implementations. These experiments give us a comprehensive overview of how different approaches to influence estimation work in GBDT models. We find BoostIn is an efficient influence-estimation method for GBDTs that performs equally well or better than existing work while being four orders of magnitude faster. Our evaluation also suggests the gold-standard approach of leave-one-out (LOO) retraining consistently identifies the single-most influential training example but performs poorly at finding the most influential set of training examples for a given target prediction.

影響推定では、トレーニングデータの変更によって異なるモデル予測がどのように生じるかを分析します。この分析により、これらの予測、それらの予測を行うモデル、およびそれらのモデルがトレーニングされるデータセットをより深く理解できます。ただし、影響推定手法のほとんどは、連続パラメータを持つディープラーニングモデル用に設計されています。勾配ブースティング決定木(GBDT)は強力で広く使用されているモデルのクラスですが、これらのモデルは不透明な意思決定プロセスを持つブラックボックスです。GBDT予測をより深く理解し、これらのモデルを全体的に改善するために、ディープラーニングモデル用に設計された最近の一般的な影響推定方法をGBDTに適応させます。具体的には、代表点法とTracInを適応させ、それぞれ新しい方法TREXとBoostInを示します。ソースコードはhttps://github.com/jjbrophy47/treeinfluenceで入手できます。これらの方法を、4つの一般的なGBDT実装を使用して22の実際のデータセットで5つの異なる評価基準を使用してLeafInfluenceおよびその他のベースラインと比較します。これらの実験により、GBDTモデルで影響推定を行うさまざまなアプローチがどのように機能するかについて、包括的な概要が得られます。BoostInは、GBDTの効率的な影響推定方法であり、既存の作業と同等かそれ以上のパフォーマンスを発揮しながら、4桁も高速であることがわかりました。また、私たちの評価では、leave-one-out (LOO)再トレーニングのゴールドスタンダードアプローチは、最も影響力のある単一のトレーニング例を一貫して特定しますが、特定のターゲット予測に対して最も影響力のあるトレーニング例のセットを見つけるパフォーマンスが低いことも示されています。

Consistent Model-based Clustering using the Quasi-Bernoulli Stick-breaking Process
準ベルヌーイ棒破砕過程を用いた一貫性のあるモデルベースのクラスタリング

In mixture modeling and clustering applications, the number of components and clusters is often not known. A stick-breaking mixture model, such as the Dirichlet process mixture model, is an appealing construction that assumes infinitely many components, while shrinking the weights of most of the unused components to near zero. However, it is well-known that this shrinkage is inadequate: even when the component distribution is correctly specified, spurious weights appear and give an inconsistent estimate of the number of clusters. In this article, we propose a simple solution: when breaking each mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable, taking value one or a small constant close to zero. This effectively creates a soft truncation and further shrinks the unused weights. Asymptotically, we show that as long as this small constant diminishes to zero at a rate faster than $o(1/n^2)$, with $n$ the sample size and given data from a finite mixture model, the posterior distribution will converge to the true number of clusters. In comparison, we rigorously explore Dirichlet process mixture models using a concentration parameter that is either constant or rapidly diminishes to zero—both of which lead to inconsistency for the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. In simulations and a data application of clustering brain networks, our proposed method recovers the ground-truth number of clusters, and leads to a small number of clusters.

混合モデリングおよびクラスタリングの応用では、コンポーネントとクラスターの数がわからないことがよくあります。ディリクレ過程混合モデルなどの棒折り(スティックブレーキング)混合モデルは、コンポーネントが無限にあると想定しながら、未使用のコンポーネントのほとんどの重みをほぼゼロに縮小する魅力的な構成です。ただし、この縮小が不十分であることはよく知られています。コンポーネントの分布が正しく指定されている場合でも、見せかけの重みが現れ、クラスター数の推定が一致性を持たなくなります。本論文では、簡単な解決策を提案します。各混合重みの棒を2つに折るときに、2番目の部分の長さに、1またはゼロに近い小さな定数のいずれかの値をとる準ベルヌーイ確率変数を掛けます。これにより、実質的にソフトな打ち切りが生じ、未使用の重みがさらに縮小されます。漸近的に、サンプルサイズをnとし、有限混合モデルからのデータが与えられた場合、この小さな定数が$o(1/n^2)$よりも速い速度でゼロに減少する限り、事後分布は真のクラスター数に収束することを示します。これに対して、定数または急速にゼロに減少する濃度パラメータを使用したディリクレ過程混合モデルを厳密に調べます。どちらの場合もクラスター数について一致性を欠くことになります。提案モデルは実装が簡単で、混合モデル用の標準的なギブスサンプラーを少し変更するだけで済みます。シミュレーションおよび脳ネットワークのクラスタリングへのデータ適用において、提案方法は真のクラスター数を回復し、少数のクラスターを導きます。
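
The construction described in the abstract can be sketched as a small sampler for the mixture weights. This is a sketch under stated assumptions rather than the paper's exact model: the truncation at `n_sticks`, the Beta(1, alpha) break proportions, the parameter `p_one` (probability that the quasi-Bernoulli variable equals one), and the final renormalization of the weights are all choices made here for illustration.

```python
import random


def quasi_bernoulli_sticks(n_sticks, alpha, p_one, eps, rng):
    """Draw mixture weights from a truncated quasi-Bernoulli stick-breaking scheme.

    At each break the remaining piece is multiplied by b, where b = 1 with
    probability p_one and b = eps (a small constant) otherwise.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_sticks - 1):
        v = rng.betavariate(1.0, alpha)      # proportion broken off
        weights.append(remaining * v)
        b = 1.0 if rng.random() < p_one else eps
        remaining *= (1.0 - v) * b           # shrink the second piece
    weights.append(remaining)                # leftover goes to the last stick
    total = sum(weights)
    # Renormalize so the weights sum to one (an assumption of this sketch).
    return [w / total for w in weights]


rng = random.Random(1)
soft_truncated = quasi_bernoulli_sticks(20, 1.0, 0.0, 1e-9, rng)
ordinary = quasi_bernoulli_sticks(20, 1.0, 1.0, 1e-9, rng)
```

With `p_one = 1` the scheme is ordinary truncated stick-breaking. With `p_one = 0` every remaining piece is multiplied by `eps`, so almost all mass stays on the first component, which shows the soft-truncation effect in its most extreme form.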

Selective inference for k-means clustering
k-meansクラスタリングの選択的推論

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly-tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data.

私たちは、k-meansクラスタリングによって識別された観測値のクラスター間の平均の差を検定する問題を考えます。この設定では、古典的な仮説検定では第一種の過誤率が増大してしまいます。最近の研究で、Gaoら(2022)は、階層的クラスタリングの文脈で関連する問題を検討しました。残念ながら、その解法は階層的クラスタリングの文脈に高度に特化しているため、k-meansクラスタリングの設定には適用できません。この論文では、k-meansアルゴリズムのすべての中間的なクラスター割り当てを条件とするp値を提案します。このp値は、有限サンプルにおいて、k-meansクラスタリングで得られたクラスターのペア間の平均の差の検定に対する選択的な第一種の過誤を制御し、効率的に計算できることを示します。私たちは、手書き数字データとシングルセルRNAシーケンシングデータに提案手法を適用します。

Generalization error bounds for multiclass sparse linear classifiers
多クラススパース線形分類器の一般化誤差範囲

We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global row-wise sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.

私たちは、スパース多項ロジスティック回帰による高次元多クラス分類について考えます。二項分類とは異なり、多クラス設定では、回帰係数行列上のさまざまな構造的仮定に関連付けられたスパース性の概念の全範囲について考えることができます。私たちは、特定のタイプのスパース性を手元で捕捉する凸ペナルティを伴うペナルティ付き最尤法に基づく計算上実現可能な特徴選択手順を提案します。特に、グローバルな行単位のスパース性、二重行方向のスパース性、および低ランクのスパース性を考慮し、適切に選択された調整パラメーターを使用して、派生したプラグイン分類器が、マルチクラススパース線形分類器の対応するクラス内でminimax汎化誤差限界(誤分類の過剰リスクの観点から)を達成することを示します。開発されたアプローチは一般的であり、他のタイプのスパース性にも適応できます。

MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning
MALib:集団ベースのマルチエージェント強化学習のための並列フレームワーク

Population-based multi-agent reinforcement learning (PB-MARL) encompasses a range of methods that merge dynamic population selection with multi-agent reinforcement learning algorithms (MARL). While PB-MARL has demonstrated notable achievements in complex multi-agent tasks, its sequential execution is plagued by low computational efficiency due to the diversity in computing patterns and policy combinations. We propose a solution involving a stateless central task dispatcher and stateful workers to handle PB-MARL’s subroutines, thereby capitalizing on parallelism across various components for efficient problem-solving. In line with this approach, we introduce MALib, a parallel framework that incorporates a task control model, independent data servers, and an abstraction of MARL training paradigms. The framework has undergone extensive testing and is available under the MIT license (https://github.com/sjtu-marl/malib).

集団ベースのマルチエージェント強化学習(PB-MARL)には、動的な母集団選択とマルチエージェント強化学習アルゴリズム(MARL)を組み合わせるさまざまな方法が含まれます。PB-MARLは、複雑なマルチエージェントタスクで顕著な成果を示していますが、その逐次実行は、計算パターンとポリシーの組み合わせの多様性による計算効率の低さに悩まされています。私たちは、ステートレスな中央タスクディスパッチャーと、PB-MARLのサブルーチンを処理するステートフルワーカーからなるソリューションを提案し、それによって効率的な問題解決のためにさまざまなコンポーネント間の並列性を活用します。このアプローチに沿って、タスク制御モデル、独立したデータサーバー、およびMARLトレーニングパラダイムの抽象化を組み込んだ並列フレームワークであるMALibを導入します。このフレームワークは広範なテストを受けており、MITライセンスの下で利用できます(https://github.com/sjtu-marl/malib)。

Controlling Wasserstein Distances by Kernel Norms with Application to Compressive Statistical Learning
カーネルノルムによるワッサーシュタイン距離の制御と圧縮統計学習への応用

Comparing probability distributions is at the crux of many machine learning algorithms. Maximum Mean Discrepancies (MMD) and Wasserstein distances are two classes of distances between probability distributions that have attracted abundant attention in past years. This paper establishes some conditions under which the Wasserstein distance can be controlled by MMD norms. Our work is motivated by the compressive statistical learning (CSL) theory, a general framework for resource-efficient large scale learning in which the training data is summarized in a single vector (called sketch) that captures the information relevant to the considered learning task. Inspired by existing results in CSL, we introduce the Hölder Lower Restricted Isometric Property and show that this property comes with interesting guarantees for compressive statistical learning. Based on the relations between the MMD and the Wasserstein distances, we provide guarantees for compressive statistical learning by introducing and studying the concept of Wasserstein regularity of the learning task, that is when some task-specific metric between probability distributions can be bounded by a Wasserstein distance.

確率分布の比較は、多くの機械学習アルゴリズムの核心です。最大平均不一致(MMD)とワッサーシュタイン距離は、過去数年間に多大な注目を集めてきた確率分布間の距離の2つのクラスです。この論文では、ワッサーシュタイン距離をMMDノルムで制御できる条件をいくつか確立します。私たちの研究は、リソース効率の高い大規模学習の一般的なフレームワークである圧縮統計学習(CSL)理論を動機としています。この理論では、トレーニングデータは、検討中の学習タスクに関連する情報を捉える単一のベクトル(スケッチと呼ばれる)に要約されます。CSLの既存の結果に触発されて、Hölder下側制限等長性(Hölder Lower Restricted Isometric Property)を導入し、この性質が圧縮統計学習に対して興味深い保証をもたらすことを示します。MMDとワッサーシュタイン距離の関係に基づいて、学習タスクのワッサーシュタイン正則性という概念を導入して研究することにより、圧縮統計学習の保証を提供します。ワッサーシュタイン正則性とは、確率分布間のタスク固有の距離がワッサーシュタイン距離によって上から抑えられる場合を指します。
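
As a small point of reference for the two families of distances compared in this abstract, the sketch below computes both on one-dimensional samples. It only evaluates the two quantities and does not reproduce the bounds established in the paper. The function names, the Gaussian kernel, the bandwidth, and the restriction to equal-size samples for the Wasserstein-1 distance are assumptions made for illustration.

```python
import math


def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical measures."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal coupling matches sorted samples.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def mmd2_gaussian(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel."""
    def kernel(a, b):
        return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

    def mean_kernel(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]
w1 = wasserstein1_1d(xs, ys)
mmd2 = mmd2_gaussian(xs, ys)
```

The MMD only needs kernel mean embeddings, which can be summarized in a sketch vector, whereas the Wasserstein distance depends on a coupling of the two samples. That difference is why conditions under which one controls the other matter for compressive learning.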

Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition
PL条件を持つ非凸強凹最小-最大問題に対する高速目的関数と双対性ギャップ収束

This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their potential applications in deep learning (e.g., deep AUC maximization, distributionally robust optimization). However, most of the existing algorithms are slow in practice, and their analysis revolves around the convergence to a nearly stationary point. We consider leveraging the Polyak-Lojasiewicz (PL) condition to design faster stochastic algorithms with stronger convergence guarantee. Although PL condition has been utilized for designing many stochastic minimization algorithms, their applications for non-convex min-max optimization remain rare. In this paper, we propose and analyze a generic framework of proximal stage-based method with many well-known stochastic updates embeddable. Fast convergence is established in terms of both the primal objective gap and the duality gap. Compared with existing studies, (i) our analysis is based on a novel Lyapunov function consisting of the primal objective gap and the duality gap of a regularized function, and (ii) the results are more comprehensive with improved rates that have better dependence on the condition number under different assumptions. We also conduct deep and non-deep learning experiments to verify the effectiveness of our methods.

この論文では、ディープラーニングへの応用（ディープAUC最大化、分布的に堅牢な最適化など）の可能性からますます注目を集めている、滑らかな非凸強凹の最小最大問題を解くための確率的手法に焦点を当てています。しかし、既存のアルゴリズムのほとんどは実際には遅く、その分析はほぼ定常点への収束を中心に展開されています。Polyak-Lojasiewicz（PL）条件を活用して、より強力な収束保証を備えたより高速な確率的アルゴリズムを設計することを検討します。PL条件は多くの確率的最小化アルゴリズムの設計に利用されてきましたが、非凸最小最大最適化への応用はまだまれです。この論文では、多くのよく知られた確率的更新を埋め込むことができる近位ステージベースの方法の一般的なフレームワークを提案し、分析します。主目的ギャップと双対性ギャップの両方の観点から、高速収束が確立されています。既存の研究と比較すると、(i)私たちの分析は、正規化関数の主目的ギャップと双対ギャップからなる新しいリアプノフ関数に基づいており、(ii)結果はより包括的で、さまざまな仮定の下で条件数への依存性が向上したレートが改善されています。また、私たちの方法の有効性を検証するために、ディープラーニングと非ディープラーニングの実験も行っています。

Stochastic Optimization under Distributional Drift
分布ドリフト下での確率最適化

We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking efficiency of the proximal stochastic gradient method benefits significantly from a step decay schedule. Numerical experiments illustrate our results.

私たちは、未知で確率的なダイナミクスに従って進化している凸関数を最小化する問題を考えます。これは、時間と決定変数自体に共同で依存する可能性があります。このような問題は、機械学習や信号処理の文献に、概念ドリフト、確率的追跡、パフォーマティブ予測などの名前であふれています。私たちは、期待値と高い確率の両方で有効な範囲に焦点を当てて、反復平均化を使用した確率的アルゴリズムの新しい非漸近収束保証を提供します。得られる効率の推定値は、最適化誤差、勾配ノイズ、および時間ドリフトの影響を明確に分離します。特に、近位確率的勾配法の追跡効率がステップ減衰スケジュールから大幅に恩恵を受ける低ドリフト・トゥ・ノイズ・レジームを特定しました。数値実験は、私たちの結果を示しています。

Off-Policy Actor-Critic with Emphatic Weightings
強調重み付けを用いたオフポリシー・アクタークリティック

A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods—particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG)—converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.

オンポリシー設定には、ポリシー勾配定理により勾配の簡略化された形式が提供されるため、理論的に健全なさまざまなポリシー勾配アルゴリズムが存在します。しかし、オフポリシー設定は、複数の目的が存在し、明示的なオフポリシーポリシー勾配定理がないため、明確ではありません。この研究では、これらの目的を1つのオフポリシー目的に統合し、この統合された目的に対してポリシー勾配定理を提供します。導出には、強調重み付けと関心関数が含まれます。強調重み付け付きアクタークリティック(ACE)と呼ばれるアルゴリズムで、勾配を近似する複数の戦略を示します。反例で、以前の(半勾配)オフポリシーアクタークリティックメソッド、特にオフポリシーアクタークリティック(OffPAC)と決定論的ポリシー勾配(DPG)は間違った解に収束しますが、ACEは最適解を見つけることを証明します。また、これらの半勾配アプローチが実際に優れたパフォーマンスを発揮できる理由についても強調し、ACEの分散削減戦略を提案します。2つの従来の制御環境と、各勾配近似によるトレードオフを示すように設計された画像ベースの環境で、ACEのいくつかのバリエーションを経験的に調査します。強調重みを直接近似することにより、テストしたすべての設定でACEがOffPACと同等かそれ以上のパフォーマンスを発揮することがわかりました。

Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning
モデルに依存しないメタ学習とパーソナライズされた連合学習のためのメモリベースの最適化手法

In recent years, model-agnostic meta-learning (MAML) has become a popular research area. However, the stochastic optimization of MAML is still underdeveloped. Existing MAML algorithms rely on the “episode” idea by sampling a few tasks and data points to update the meta-model at each iteration. Nonetheless, these algorithms either fail to guarantee convergence with a constant mini-batch size or require processing a large number of tasks at every iteration, which is unsuitable for continual learning or cross-device federated learning where only a small number of tasks are available per iteration or per round. To address these issues, this paper proposes memory-based stochastic algorithms for MAML that converge with vanishing error. The proposed algorithms require sampling a constant number of tasks and data samples per iteration, making them suitable for the continual learning scenario. Moreover, we introduce a communication-efficient memory-based MAML algorithm for personalized federated learning in cross-device (with client sampling) and cross-silo (without client sampling) settings. Our theoretical analysis improves the optimization theory for MAML, and our empirical results corroborate our theoretical findings.

近年、モデルに依存しないメタ学習(MAML)が人気の研究分野となっています。しかし、MAMLの確率的最適化はまだ十分には開発されていません。既存のMAMLアルゴリズムは、各反復でメタモデルを更新するために少数のタスクとデータポイントをサンプリングする「エピソード」の考え方に依存しています。しかし、これらのアルゴリズムは、一定のミニバッチサイズでの収束を保証できないか、反復ごとに多数のタスクを処理する必要があり、反復ごとまたはラウンドごとに少数のタスクしか利用できない継続学習やクロスデバイス連合学習には適していません。これらの問題に対処するために、この論文では、誤差が消失していく形で収束するMAML用のメモリベースの確率的アルゴリズムを提案します。提案アルゴリズムは、反復ごとに一定数のタスクとデータサンプルをサンプリングするだけでよいため、継続学習シナリオに適しています。さらに、クロスデバイス(クライアントサンプリングあり)およびクロスサイロ(クライアントサンプリングなし)設定でのパーソナライズされた連合学習のための、通信効率の高いメモリベースのMAMLアルゴリズムを紹介します。私たちの理論解析はMAMLの最適化理論を改善し、実証結果は理論的な知見を裏付けます。

Escaping The Curse of Dimensionality in Bayesian Model-Based Clustering
ベイジアンモデルベースクラスタリングにおける次元の呪縛からの脱出

Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probabilities, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) on a set of low-dimensional latent variables inducing a partition on the observed data. The model is amenable to scalable posterior inference and we show that it can avoid the pitfalls of high-dimensionality under mild assumptions. The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq.

ベイズ混合モデルは、適切な不確実性定量化による高次元データのクラスタリングに広く使用されています。ただし、観測の次元が増加すると、事後推論ではクラスターが多すぎたり少なすぎたりする傾向があります。この記事では、サンプルサイズを固定し、データの次元を増やす非標準設定でランダム分割事後を研究することで、この動作について説明します。次元が増加すると、有限サンプル事後がすべての観測を異なるクラスターに割り当てるか、すべての観測を同じクラスターに割り当てる傾向がある条件を示します。興味深いことに、観測をクラスターに分割する可能性のあるすべてのものが正の事前確率を持つ限り、条件はクラスタリング事前分布の選択に依存せず、真のデータ生成モデルに関係なく当てはまります。次に、観測データの分割を誘導する低次元潜在変数のセットに対するベイズクラスタリング(Lamb)の潜在混合のクラスを提案します。このモデルはスケーラブルな事後推論に適しており、軽度の仮定の下で高次元性の落とし穴を回避できることを示しています。提案されたアプローチは、シミュレーション研究およびscRNAseqに基づく細胞タイプの推論への応用において優れたパフォーマンスを発揮することが示されています。

Large sample spectral analysis of graph-based multi-manifold clustering
グラフベース多多様体クラスタリングの大規模サンプルスペクトル解析

In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds $\mathcal{M} = \mathcal{M}_1 \cup \dots \cup \mathcal{M}_N$ that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on $\mathcal{M}$ with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.

この研究では、多多様体クラスタリング(MMC)のためのグラフベースアルゴリズムの統計的特性について研究します。MMCの目標は、与えられたユークリッドデータセットの基になる多多様体構造を復元することです。このデータセットは、互いに交差したり、異なる次元を持ったりする可能性のある多様体の和集合$\mathcal{M} = \mathcal{M}_1 \cup \dots \cup \mathcal{M}_N$上の分布をサンプリングして得られたものと想定されます。データセット上の類似性グラフについて、対応するグラフラプラシアンがMMC問題を解くための適切な幾何学的情報を捉えるために満たすべき十分条件を調査します。正確には、観測から構築された適切なグラフラプラシアンによる、$\mathcal{M}$上のテンソル化ラプラシアンのスペクトル近似について、高確率の誤差限界を提供します。復元されたテンソル化ラプラシアンには、個々の基になる多様体すべての幾何学的情報がすべて含まれています。私たちは、これらの十分条件を満たす類似性グラフの族の例として、角度制約付きの環状近接グラフと呼ぶものを示します。私たちは、このグラフ族を、文献にある接平面の整列に基づく他の構成と対比します。広範な数値実験により、我々の理論がMMC問題に与える洞察がさらに広がります。

On Tilted Losses in Machine Learning: Theory and Applications
機械学習における傾斜損失について:理論と応用

Exponential tilting is a technique commonly used in fields such as statistics, probability, information theory, and optimization to create parametric distribution shifts. Despite its prevalence in related fields, tilting has not seen widespread use in machine learning. In this work, we aim to bridge this gap by exploring the use of tilting in risk minimization. We study a simple extension to ERM—tilted empirical risk minimization (TERM)—which uses exponential tilting to flexibly tune the impact of individual losses. The resulting framework has several useful properties: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to the tail probability of losses. Our work makes connections between TERM and related objectives, such as Value-at-Risk, Conditional Value-at-Risk, and distributionally robust optimization (DRO). We develop batch and stochastic first-order optimization methods for solving TERM, provide convergence guarantees for the solvers, and show that the framework can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications in machine learning, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. Despite the straightforward modification TERM makes to traditional ERM objectives, we find that the framework can consistently outperform ERM and deliver competitive performance with state-of-the-art, problem-specific approaches.

指数傾斜は、統計、確率、情報理論、最適化などの分野でパラメトリックな分布シフトを作成するために一般的に使用される手法です。関連分野で普及しているにもかかわらず、傾斜は機械学習では広く使用されていません。この研究では、リスク最小化における傾斜の使用を検討することで、このギャップを埋めることを目指しています。指数傾斜を使用して個々の損失の影響を柔軟に調整する、ERMの単純な拡張である傾斜経験的リスク最小化(TERM)を研究します。結果として得られるフレームワークには、いくつかの有用な特性があります。TERMは外れ値の影響を増加または減少させて、それぞれ公平性または堅牢性を実現できること、汎化に役立つ分散削減特性があること、損失のテール確率の滑らかな近似として見ることができることを示します。私たちの研究では、TERMと、Value-at-Risk、条件付きValue-at-Risk、分布的に堅牢な最適化(DRO)などの関連する目的関数との関連を示します。TERMを解くためのバッチおよび確率的一次最適化手法を開発し、ソルバーの収束保証を提供し、フレームワークが一般的な代替案と比較して効率的に解けることを示します。最後に、サブグループ間の公平性の強制、外れ値の影響の緩和、クラスの不均衡の処理など、機械学習のさまざまなアプリケーションにTERMを使用できることを実証します。TERMが従来のERMの目的関数に加える変更は単純であるにもかかわらず、このフレームワークは一貫してERMを上回り、問題固有の最先端のアプローチに匹敵するパフォーマンスを提供できることがわかりました。
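
As a rough illustration of how tilting changes the influence of individual losses, the sketch below evaluates the t-tilted empirical risk, (1/t) log((1/N) Σ exp(t ℓ_i)), on a toy list of losses. The abstract itself does not spell out this formula; it is the standard form of the TERM objective, and the function name `tilted_risk` and the toy numbers are assumptions made here for illustration only.

```python
import math


def tilted_risk(losses, t):
    """t-tilted empirical risk: (1/t) * log(mean(exp(t * loss_i))).

    t = 0 is treated as the limiting case, which is the plain average (ERM).
    """
    if t == 0:
        return sum(losses) / len(losses)
    # Log-sum-exp with a max shift, to avoid overflow for large |t * loss|.
    shift = max(t * loss for loss in losses)
    total = sum(math.exp(t * loss - shift) for loss in losses)
    return (shift + math.log(total / len(losses))) / t


# Hypothetical per-sample losses; the last value plays the role of an outlier.
losses = [0.1, 0.2, 0.3, 5.0]

average = tilted_risk(losses, 0.0)   # plain ERM
robust = tilted_risk(losses, -2.0)   # t < 0: the outlier is down-weighted
fair = tilted_risk(losses, 2.0)      # t > 0: the largest loss dominates

print(average, robust, fair)
```

With negative t the value moves toward the smallest loss, and with positive t toward the largest, which matches the robustness versus fairness trade-off described in the abstract.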

Optimal Convergence Rates for Distributed Nystroem Approximation
分散ニストローム近似の最適収束率

The distributed kernel ridge regression (DKRR) has shown great potential in processing complicated tasks. However, DKRR only made use of the local samples that failed to capture the global characteristics. Besides, the existing optimal learning guarantees were provided in expectation and only pertain to the attainable case that the target regression lies exactly in the kernel space. In this paper, we propose distributed learning with globally-shared Nystroem centers (DNystroem), which utilizes global information across the local clients. We also study the statistical properties of DNystroem in expectation and in probability, respectively, and obtain several state-of-the-art results with the minimax optimal learning rates. Note that, the optimal convergence rates for DNystroem pertain to the non-attainable case, while the statistical results allow more partitions and require fewer Nystroem centers. Finally, we conduct experiments on several real-world datasets to validate the effectiveness of the proposed algorithm, and the empirical results coincide with our theoretical findings.

分散カーネルリッジ回帰(DKRR)は、複雑なタスクの処理に大きな可能性を示しています。しかし、DKRRは、グローバルな特性を捉えることができなかったローカルサンプルのみを使用していました。さらに、既存の最適学習保証は期待値で提供されており、ターゲット回帰がカーネル空間内に正確に存在する達成可能なケースにのみ関係します。この論文では、ローカルクライアント間でグローバル情報を利用する、グローバルに共有されたNystroemセンター(DNystroem)を使用した分散学習を提案します。また、期待値と確率のそれぞれにおけるDNystroemの統計特性を調査し、ミニマックス最適学習率で最先端の結果をいくつか得ました。DNystroemの最適収束率は達成不可能なケースに関係しますが、統計結果ではより多くのパーティションが許可され、より少ないNystroemセンターが必要になることに注意してください。最後に、提案されたアルゴリズムの有効性を検証するために、いくつかの実際のデータセットで実験を行い、経験的結果は理論的発見と一致しています。

Jump Interval-Learning for Individualized Decision Making with Continuous Treatments
連続治療による個別意思決定のためのジャンプ間隔学習

An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose a jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or deep neural network to model complex treatment-covariates interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a Warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.

個別決定ルール(IDR)は、各個人の観察された特性に基づいて、特定の治療を各個人に割り当てる決定関数です。文献の既存の研究のほとんどは、2値または有限数の治療オプションを持つ設定を考慮しています。この論文では、連続的な治療設定に焦点を当て、期待される結果を最大化する個別間隔値決定ルール(I2DR)を開発するためのジャンプ間隔学習を提案します。単一の治療を推奨するIDRとは異なり、提案されたI2DRは各個人に対して治療オプションの間隔を生成するため、実際に実装する際の柔軟性が高まります。最適なI2DRを導出するために、ジャンプ間隔学習法は、ジャンプペナルティ回帰によって治療と共変量を与えられた結果の条件付き平均を推定し、推定された結果回帰関数に基づいて対応する最適なI2DRを導出します。回帰子は、明確な解釈のために線形にすることも、複雑な治療と共変量の相互作用をモデル化するためにディープニューラルネットワークにすることもできます。ジャンプインターバル学習を実装するために、結果回帰関数を効率的に計算する動的プログラミングに基づく検索アルゴリズムを開発しました。結果のI2DRの統計特性は、結果回帰関数が治療空間上の区分関数または連続関数のいずれかである場合に確立されます。さらに、(推定された)最適ポリシーの下で平均結果を推測する手順を開発しました。提案されたI2DRの実証的妥当性を実証するために、広範なシミュレーションとワルファリン研究への実際のデータアプリケーションが実施されました。

Policy Gradient Methods Find the Nash Equilibrium in N-player General-sum Linear-quadratic Games
方策勾配法はNプレイヤー一般和線形二次ゲームのナッシュ均衡を見つける

We consider a general-sum N-player linear-quadratic game with stochastic dynamics over a finite horizon and prove the global convergence of the natural policy gradient method to the Nash equilibrium. In order to prove convergence of the method we require a certain amount of noise in the system. We give a condition, essentially a lower bound on the covariance of the noise in terms of the model parameters, in order to guarantee convergence. We illustrate our results with numerical experiments to show that even in situations where the policy gradient method may not converge in the deterministic setting, the addition of noise leads to convergence.

私たちは、有限ホライズン上の確率的ダイナミクスを持つ一般和Nプレイヤー線形二次ゲームを考え、自然方策勾配法のナッシュ均衡への大域収束を証明します。この手法の収束を証明するためには、システム内にある程度のノイズが必要です。収束を保証するための条件として、本質的にはモデルパラメータで表されるノイズの共分散の下限を与えます。数値実験によって結果を例証し、決定論的な設定では方策勾配法が収束しない可能性がある状況でも、ノイズの追加が収束につながることを示します。

Asymptotics of Network Embeddings Learned via Subsampling
サブサンプリングによって学習されたネットワーク埋め込みの漸近論

Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence, in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.

ネットワークデータは現代の機械学習のいたるところに存在し、関心の対象となるタスクにはノード分類、ノードクラスタリング、リンク予測などがあります。一般的なアプローチでは、まずネットワークのユークリッド埋め込みを学習し、それにベクトル値データ用に開発されたアルゴリズムを適用します。大規模なネットワークの場合、埋め込みは、サブサンプリングスキームを自由に選択できる確率的勾配法を使用して学習されます。このような方法は、実証的に優れたパフォーマンスを発揮しますが、理論的には十分に理解されていません。私たちの研究では、サブサンプリングアプローチを使用した表現方法(node2vecなど)を1つの統一フレームワークにまとめます。グラフが交換可能であるという仮定の下で、学習された埋め込みベクトルの分布が漸近的に分離することを証明します。さらに、漸近分布を特徴付け、損失関数の選択と埋め込み次元を含む潜在パラメータの観点から収束率を与えます。これにより、埋め込みベクトルが何を表すか、およびこれらの方法が下流のタスクでどの程度うまく機能するかを理解するための理論的基礎が提供されます。特に、一般的に使用される損失関数は、フィッシャー一貫性の欠如などの欠点につながる可能性があることがわかります。

Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks
2層幅ニューラルネットワークによる平均二乗誤差回帰のための勾配降下法の陰的バイアス

We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{- 1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and thence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.

私たちは、幅の広いニューラルネットワークの勾配降下法によるトレーニングと、関数空間における対応する暗黙のバイアスについて調査します。単変量回帰の場合、幅$n$の浅いReLUネットワークのトレーニングの解は、トレーニングデータに適合し、かつ初期関数との差について、ネットワークパラメーターの初期化に使用される確率分布に依存する曲率ペナルティによって重み付けされた2次導関数の2ノルムが最小となる関数の$n^{- 1/2}$以内にあることを示します。さまざまな一般的な初期化手順について、曲率ペナルティ関数を明示的に計算します。たとえば、一様分布による非対称初期化では、一定の曲率ペナルティが得られるため、解関数はトレーニングデータの自然3次スプライン補間になります。確率的勾配降下法の場合も、同じ暗黙のバイアス結果が得られます。異なる活性化関数についても同様の結果が得られます。多変量回帰の場合、2次導関数を分数ラプラシアンのラドン変換に置き換えた類似の結果を示します。定数ペナルティ関数を生成する初期化スキームの場合、解は多調和スプラインです。さらに、トレーニング軌跡は、正則化強度が減少する平滑化スプラインの軌跡によって捉えられることを示します。

Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection
ノンパラメトリック変数選択による文脈依存型オンライン学習における次元削減

We consider a contextual online learning (multi-armed bandit) problem with high-dimensional covariate $x$ and decision $y$. The reward function to learn, $f(x,y)$, does not have a particular parametric form. The literature has shown that the optimal regret is $\tilde{O}(T^{(d_x\!+\!d_y\!+\!1)/(d_x\!+\!d_y\!+\!2)})$, where $d_x$ and $d_y$ are the dimensions of $x$ and $y$, and thus it suffers from the curse of dimensionality. In many applications, only a small subset of variables in the covariate affect the value of $f$, which is referred to as sparsity in statistics. To take advantage of the sparsity structure of the covariate, we propose a variable selection algorithm called BV-LASSO, which incorporates novel ideas such as binning and voting to apply LASSO to nonparametric settings. Using it as a subroutine, we can achieve the regret $\tilde{O}(T^{(d_x^*\!+\!d_y\!+\!1)/(d_x^*\!+\!d_y\!+\!2)})$, where $d_x^*$ is the effective covariate dimension. The regret matches the optimal regret when the covariate is $d^*_x$-dimensional and thus cannot be improved. Our algorithm may serve as a general recipe to achieve dimension reduction via variable selection in nonparametric settings.

私たちは、高次元の共変量$x$と決定$y$を持つコンテキストオンライン学習(多腕バンディット)問題を考察します。学習する報酬関数$f(x,y)$には、特定のパラメトリック形式はありません。文献によると、最適な後悔は$\tilde{O}(T^{(d_x\!+\!d_y\!+\!1)/(d_x\!+\!d_y\!+\!2)})$であり、ここで$d_x$と$d_y$は$x$と$y$の次元であるため、次元の呪いに悩まされます。多くのアプリケーションでは、共変量内の変数の小さなサブセットのみが$f$の値に影響し、統計ではこれをスパース性と呼びます。共変量のスパース構造を利用するために、ビニングや投票などの新しいアイデアを取り入れてLASSOをノンパラメトリック設定に適用するBV-LASSOと呼ばれる変数選択アルゴリズムを提案します。これをサブルーチンとして使用すると、後悔$\tilde{O}(T^{(d_x^*\!+\!d_y\!+\!1)/(d_x^*\!+\!d_y\!+\!2)})$を達成できます。ここで、$d_x^*$は有効な共変量の次元です。共変量が$d^*_x$次元の場合、後悔は最適な後悔と一致し、したがって改善できません。私たちのアルゴリズムは、ノンパラメトリック設定で変数選択を介して次元削減を達成するための一般的なレシピとして役立つ可能性があります。
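
To make the gain from variable selection concrete, the following sketch plugs hypothetical dimensions into the two regret exponents quoted in the abstract, $(d_x+d_y+1)/(d_x+d_y+2)$ and $(d_x^*+d_y+1)/(d_x^*+d_y+2)$. The values $d_x=20$, $d_y=1$, $d_x^*=2$ and $T=10^6$ are assumptions chosen only for illustration, and the logarithmic factors hidden by $\tilde{O}$ are ignored.

```python
from fractions import Fraction


def regret_exponent(d_x, d_y):
    """Exponent of T in the regret rate T^((d_x + d_y + 1) / (d_x + d_y + 2))."""
    return Fraction(d_x + d_y + 1, d_x + d_y + 2)


# Hypothetical problem sizes: 20 covariates, of which only 2 matter.
full = regret_exponent(20, 1)    # all covariates used: 22/23
sparse = regret_exponent(2, 1)   # effective dimension only: 4/5

T = 10**6
print("full exponent:", full, "-> T^exponent =", T ** float(full))
print("sparse exponent:", sparse, "-> T^exponent =", T ** float(sparse))
```

The exponent approaches 1 as the dimension grows, so a rate near 22/23 is close to linear regret, while 4/5 is noticeably sublinear.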

Sparse GCA and Thresholded Gradient Descent
スパース GCA としきい値勾配降下法

Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple data sets. It generalizes canonical correlation analysis that is designed for two data sets. We study sparse GCA when there are potentially multiple leading generalized correlation tuples in data that are of interest and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as a generalized eigenvalue problem at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic data sets.

一般化相関分析(GCA)は、複数のデータセットにわたる線形関係を明らかにすることを目的としています。これは、2つのデータセット用に設計された正準相関分析を一般化したものです。私たちは、関心の対象となる主要な一般化相関タプルがデータ内に複数存在する可能性があり、負荷行列の非ゼロ行が少数である場合のスパースGCAを研究します。これには、スパースCCAと、相関行列のスパースPCAが特殊なケースとして含まれます。まず、正規化制約を慎重に選択することにより、母集団レベルとサンプルレベルの両方で一般化固有値問題としてスパースGCAを定式化します。サンプル最適化問題のラグランジュ形式に基づいて、高次元のGCA負荷ベクトルと負荷行列を推定するためのしきい値勾配降下アルゴリズムを提案します。適切な初期化の下でアルゴリズムによって生成された推定量について、タイトな推定誤差限界を導出します。また、多くの合成データセットに対するアルゴリズムの優れた性能も示します。

MARS: A Second-Order Reduction Algorithm for High-Dimensional Sparse Precision Matrices Estimation
MARS:高次元スパース精度行列推定のための2次削減アルゴリズム

Estimation of the precision matrix (or inverse covariance matrix) is of great importance in statistical data analysis and machine learning. However, as the number of parameters scales quadratically with the dimension $p$, the computation becomes very challenging when $p$ is large. In this paper, we propose an adaptive sieving reduction algorithm to generate a solution path for the estimation of precision matrices under the $\ell_1$ penalized D-trace loss, with each subproblem being solved by a second-order algorithm. In each iteration of our algorithm, we are able to greatly reduce the number of variables in the problem based on the Karush-Kuhn-Tucker (KKT) conditions and the sparse structure of the estimated precision matrix in the previous iteration. As a result, our algorithm is capable of handling data sets with very high dimensions that may go beyond the capacity of the existing methods. Moreover, for the sub-problem in each iteration, other than solving the primal problem directly, we develop a semismooth Newton augmented Lagrangian algorithm with global linear convergence rate on the dual problem to improve the efficiency. Theoretical properties of our proposed algorithm have been established. In particular, we show that the convergence rate of our algorithm is asymptotically superlinear. The high efficiency and promising performance of our algorithm are illustrated via extensive simulation studies and real data applications, with comparison to several state-of-the-art solvers.

精度行列（または逆共分散行列）の推定は、統計データ分析と機械学習において非常に重要です。しかし、パラメータの数は次元$p$の2乗に比例して増加するため、$p$が大きい場合、計算は非常に困難になります。この論文では、$\ell_1$ペナルティ付きDトレース損失の下で精度行列を推定するための解パスを生成する適応ふるい削減アルゴリズムを提案し、各サブ問題は2次アルゴリズムによって解かれます。アルゴリズムの各反復では、Karush-Kuhn-Tucker (KKT)条件と、前回の反復で推定された精度行列のスパース構造に基づいて、問題内の変数の数を大幅に削減できます。その結果、私たちのアルゴリズムは、既存の方法の能力を超える可能性のある非常に高い次元のデータセットを処理できます。さらに、各反復のサブ問題に対して、主問題を直接解くのではなく、効率性を向上させるために、双対問題に対する大域的線形収束率を持つ半滑らかなニュートン拡張ラグランジュアルゴリズムを開発します。提案されたアルゴリズムの理論的特性も確立しています。特に、アルゴリズムの収束率は漸近的に超線形であることを示しています。アルゴリズムの高い効率性と有望なパフォーマンスは、いくつかの最先端のソルバーとの比較による広範なシミュレーション研究と実際のデータアプリケーションによって実証されています。

Exploiting Discovered Regression Discontinuities to Debias Conditioned-on-observable Estimators
発見された回帰不連続性を利用した、観測変数で条件付けた推定量のバイアス除去

Regression discontinuity (RD) designs are widely used to estimate causal effects in the absence of a randomized experiment. However, standard approaches to RD analysis face two significant limitations. First, they require a priori knowledge of discontinuities in treatment. Second, they yield doubly-local treatment effect estimates, and fail to provide more general causal effect estimates away from the discontinuity. To address these limitations, we introduce a novel method for automatically detecting RDs at scale, integrating information from multiple discovered discontinuities with an observational estimator, and extrapolating away from discovered, local RDs. We demonstrate the performance of our method on two synthetic datasets, showing improved performance compared to direct use of an observational estimator, direct extrapolation of RD estimates, and existing methods for combining multiple causal effect estimates. Finally, we apply our novel method to estimate spatially heterogeneous treatment effects in the context of a recent economic development problem.

回帰不連続(RD)デザインは、ランダム化実験がない場合に因果効果を推定するために広く使用されています。ただし、RD分析の標準的なアプローチには、2つの大きな制限があります。まず、治療における不連続性の事前知識が必要です。次に、二重にローカルな治療効果推定値が生成され、不連続性から離れたより一般的な因果効果推定値を提供できません。これらの制限に対処するために、大規模なRDを自動検出し、発見された複数の不連続性からの情報を観測推定量と統合し、発見されたローカルRDから外挿する新しい方法を紹介します。2つの合成データセットでこの方法のパフォーマンスを実証し、観測推定量の直接使用、RD推定値の直接外挿、および複数の因果効果推定値を組み合わせる既存の方法と比較してパフォーマンスが向上していることを示しました。最後に、最近の経済開発問題のコンテキストで、空間的に不均一な治療効果を推定するために新しい方法を適用します。

Generalized Linear Models in Non-interactive Local Differential Privacy with Public Data
公開データを用いた非対話型局所差分プライバシーにおける一般化線形モデル

In this paper, we study the problem of estimating smooth Generalized Linear Models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Unlike its classical setting, our model allows the server to access additional public but unlabeled data. In the first part of the paper, we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by Stein’s lemma, we present an $(\epsilon, \delta)$-NLDP algorithm for GLMs. Moreover, the sample complexity of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $\alpha$ (with high probability) is ${O}(p \alpha^{-2})$ and $\tilde{O}(p^3\alpha^{-2}\epsilon^{-2})$ respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known exponential or quasi-polynomial in $\alpha^{-1}$, or exponential in $p$ sample complexities of GLMs with no public data. Then we consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein’s lemma, we propose an $(\epsilon, \delta)$-NLDP algorithm for GLMs whose sample complexity of public and private data to achieve an $\ell_\infty$-norm estimation error of $\alpha$ is ${O}(p^2\alpha^{-2})$ and $\tilde{O}(p^2\alpha^{-2}\epsilon^{-2})$ respectively, under some mild assumptions and if $\alpha$ is not too small (i.e., $\alpha\geq \Omega(\frac{1}{\sqrt{p}})$). In the second part of the paper, we extend our idea to the problem of estimating non-linear regressions and show similar results as in GLMs for both multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets. To our best knowledge, this is the first paper showing the existence of efficient and effective algorithms for GLMs and non-linear regressions in the NLDP model with unlabeled public data.

この論文では、非対話型ローカル差分プライバシー(NLDP)モデルで滑らかな一般化線形モデル(GLM)を推定する問題について検討します。従来の設定とは異なり、このモデルでは、サーバーが追加の公開されているがラベル付けされていないデータにアクセスできます。論文の最初の部分では、GLMに焦点を当てます。具体的には、各データレコードがゼロ平均多変量ガウス分布からi.i.d.サンプリングされる場合を最初に検討します。Steinの補題に基づいて、GLM用の$(\epsilon, \delta)$-NLDPアルゴリズムを紹介します。さらに、アルゴリズムが$\ell_2$ノルム推定誤差$\alpha$を(高い確率で)達成するための公開データと非公開データのサンプル複雑度は、それぞれ${O}(p \alpha^{-2})$と$\tilde{O}(p^3\alpha^{-2}\epsilon^{-2})$です($p$は特徴ベクトルの次元)。これは、公開データがないGLMについて以前に知られていた、$\alpha^{-1}$に関して指数関数的または準多項式的、あるいは$p$に関して指数関数的なサンプル複雑度に比べて大幅な改善です。次に、各データレコードが、有界な$\ell_1$ノルムを持つサブガウス分布からi.i.d.サンプリングされる、より一般的な設定を検討します。Steinの補題の変形に基づいて、GLM用の$(\epsilon, \delta)$-NLDPアルゴリズムを提案します。このアルゴリズムが$\ell_\infty$ノルム推定誤差$\alpha$を達成するための公開データと非公開データのサンプル複雑度は、いくつかの緩い仮定の下で、かつ$\alpha$が小さすぎない場合(つまり$\alpha\geq \Omega(\frac{1}{\sqrt{p}})$である場合)、それぞれ${O}(p^2\alpha^{-2})$と$\tilde{O}(p^2\alpha^{-2}\epsilon^{-2})$です。論文の後半では、このアイデアを非線形回帰の推定の問題に拡張し、多変量ガウス分布とサブガウス分布の両方のケースでGLMと同様の結果を示します。最後に、合成データセットと実際のデータセットの両方で実験を行い、アルゴリズムの有効性を示します。私たちの知る限り、これはラベル付けされていない公開データを使用したNLDPモデルにおけるGLMと非線形回帰のための効率的かつ効果的なアルゴリズムの存在を示す最初の論文です。

A Rigorous Information-Theoretic Definition of Redundancy and Relevancy in Feature Selection Based on (Partial) Information Decomposition
(部分的な)情報分解に基づく特徴選択における冗長性と関連性の厳密な情報理論的定義

Selecting a minimal feature set that is maximally informative about a target variable is a central task in machine learning and statistics. Information theory provides a powerful framework for formulating feature selection algorithms—yet, a rigorous, information-theoretic definition of feature relevancy, which accounts for feature interactions such as redundant and synergistic contributions, is still missing. We argue that this lack is inherent to classical information theory which does not provide measures to decompose the information a set of variables provides about a target into unique, redundant, and synergistic contributions. Such a decomposition has been introduced only recently by the partial information decomposition (PID) framework. Using PID, we clarify why feature selection is a conceptually difficult problem when approached using information theory and provide a novel definition of feature relevancy and redundancy in PID terms. From this definition, we show that the conditional mutual information (CMI) maximizes relevancy while minimizing redundancy and propose an iterative, CMI-based algorithm for practical feature selection. We demonstrate the power of our CMI-based algorithm in comparison to the unconditional mutual information on benchmark examples and provide corresponding PID estimates to highlight how PID allows to quantify information contribution of features and their interactions in feature-selection problems.

対象変数について最大限の情報を提供する最小限の特徴セットを選択することは、機械学習と統計学における中心的なタスクです。情報理論は、特徴選択アルゴリズムを策定するための強力なフレームワークを提供しますが、冗長性や相乗的寄与などの特徴の相互作用を考慮した、特徴関連性の厳密な情報理論的定義はまだありません。この欠如は、一連の変数が対象について提供する情報を、一意、冗長、相乗的寄与に分解する手段を提供しない古典的な情報理論に固有のものであると私たちは主張します。このような分解は、部分情報分解(PID)フレームワークによって最近になって導入されました。PIDを使用して、情報理論を使用してアプローチする場合に特徴選択が概念的に難しい問題である理由を明らかにし、PIDの用語で特徴関連性と冗長性の新しい定義を提供します。この定義から、条件付き相互情報量(CMI)が関連性を最大化し、冗長性を最小化することを示し、実用的な特徴選択のための反復的なCMIベースのアルゴリズムを提案します。ベンチマーク例における無条件相互情報量と比較して、CMIベースのアルゴリズムの威力を実証し、対応するPID推定値を提供して、PIDが特徴選択問題における特徴の情報寄与とそれらの相互作用をどのように定量化できるかを強調します。

Combinatorial Optimization and Reasoning with Graph Neural Networks
グラフニューラルネットワークによる組み合わせ最適化と推論

Combinatorial optimization is a well-established area in operations research and computer science. Until recently, its methods have focused on solving problem instances in isolation, ignoring that they often stem from related data distributions in practice. However, recent years have seen a surge of interest in using machine learning, especially graph neural networks, as a key building block for combinatorial tasks, either directly as solvers or by enhancing exact solvers. The inductive bias of GNNs effectively encodes combinatorial and relational input due to their invariance to permutations and awareness of input sparsity. This paper presents a conceptual review of recent key advancements in this emerging field, aiming at optimization and machine learning researchers.

組み合わせ最適化は、オペレーションズリサーチとコンピューターサイエンスの分野で確立された分野です。最近まで、その手法は問題事例を単独で解決することに重点を置いており、実際には関連するデータ分布に起因していることが多いことを無視していました。しかし、近年では、機械学習、特にグラフニューラルネットワークを、ソルバーとして直接、または正確なソルバーを強化することにより、組み合わせタスクの主要なビルディングブロックとして使用することへの関心が高まっています。GNNの帰納的バイアスは、順列に対する不変性と入力スパースの認識により、組み合わせ入力と関係入力を効果的にエンコードします。この論文では、最適化と機械学習の研究者を対象とした、この新しい分野における最近の主要な進歩の概念レビューを紹介します。

A First Look into the Carbon Footprint of Federated Learning
フェデレーテッド・ラーニングのカーボンフットプリントを初めて調査

Despite impressive results, deep learning-based technologies also raise severe privacy and environmental concerns induced by the training procedure often conducted in data centers. In response, alternatives to centralized training such as Federated Learning (FL) have emerged. FL is now starting to be deployed at a global scale by companies that must adhere to new legal demands and policies originating from governments and social groups advocating for privacy protection. However, the potential environmental impact related to FL remains unclear and unexplored. This article offers the first-ever systematic study of the carbon footprint of FL. We propose a rigorous model to quantify the carbon footprint, hence facilitating the investigation of the relationship between FL design and carbon emissions. We also compare the carbon footprint of FL to traditional centralized learning. Our findings show that, depending on the configuration, FL can emit up to two orders of magnitude more carbon than centralized training. However, in certain settings, it can be comparable to centralized learning due to the reduced energy consumption of embedded devices. Finally, we highlight and connect the results to the future challenges and trends in FL to reduce its environmental impact, including algorithms efficiency, hardware capabilities, and stronger industry transparency.

素晴らしい結果にもかかわらず、ディープラーニングベースのテクノロジーは、データセンターで行われることが多いトレーニング手順によって引き起こされる深刻なプライバシーと環境の懸念も引き起こします。これに対応して、フェデレーテッドラーニング(FL)などの集中型トレーニングの代替手段が登場しました。FLは現在、プライバシー保護を主張する政府や社会団体から発せられる新しい法的要求やポリシーを遵守する必要がある企業によって、世界規模で導入され始めています。しかし、FLに関連する潜在的な環境への影響は不明で、未調査のままです。この記事では、FLのカーボンフットプリントに関する初めての体系的な研究を紹介します。私たちは、カーボンフットプリントを定量化する厳密なモデルを提案し、FLの設計と炭素排出量の関係の調査を容易にします。また、FLのカーボンフットプリントを従来の集中型学習と比較します。私たちの調査結果によると、構成によっては、FLは集中型トレーニングよりも最大2桁多くの炭素を排出する可能性があります。ただし、特定の設定では、組み込みデバイスのエネルギー消費が少ないため、集中型学習に匹敵する可能性があります。最後に、アルゴリズムの効率、ハードウェア機能、業界の透明性の向上など、環境への影響を軽減するためのFLの将来の課題と傾向を強調し、得られた結果と結び付けます。

An Eigenmodel for Dynamic Multilayer Networks
動的多層ネットワークのための固有モデル

Dynamic multilayer networks frequently represent the structure of multiple co-evolving relations; however, statistical models are not well-developed for this prevalent network type. Here, we propose a new latent space model for dynamic multilayer networks. The key feature of our model is its ability to identify common time-varying structures shared by all layers while also accounting for layer-wise variation and degree heterogeneity. We establish the identifiability of the model’s parameters and develop a structured mean-field variational inference approach to estimate the model’s posterior, which scales to networks previously intractable to dynamic latent space models. We demonstrate the estimation procedure’s accuracy and scalability on simulated networks. We apply the model to two real-world problems: discerning regional conflicts in a data set of international relations and quantifying infectious disease spread throughout a school based on the student’s daily contact patterns.

動的多層ネットワークは、多くの場合、複数の共進化する関係の構造を表します。ただし、この一般的なネットワークタイプに対する統計モデルは十分に開発されていません。ここでは、動的多層ネットワークのための新しい潜在空間モデルを提案します。このモデルの主な特徴は、すべての層に共通する時間変動構造を同定すると同時に、層ごとの変動と次数の不均一性も考慮できることです。モデルのパラメータの識別可能性を確立し、モデルの事後を推定するための構造化された平均場変分推論アプローチを開発します。これは、以前は動的潜在空間モデルに難しかったネットワークにスケーリングされます。シミュレーションネットワーク上での推定手順の精度とスケーラビリティを実証します。このモデルを、国際関係のデータセットで地域紛争を識別することと、生徒の日常的な接触パターンに基づいて学校全体に広がる感染症の定量化という2つの現実世界の問題に適用します。

Graph Clustering with Graph Neural Networks
グラフニューラルネットワークによるグラフクラスタリング

Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs—does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no—current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods’ poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics.

グラフニューラルネットワーク(GNN)は、ノード分類やリンク予測などの多くのグラフ分析タスクで最先端の結果を達成しています。ただし、グラフクラスタリングなどのグラフ上の重要な教師なし問題は、GNNの進歩に対してより抵抗力があることが判明しています。グラフクラスタリングは、GNNのノードプーリングと同じ全体的な目標を持っています。これは、GNNプーリングメソッドがグラフのクラスタリングに適していることを意味するのでしょうか。驚くべきことに、答えは「いいえ」です。現在のGNNプーリングメソッドは、学習された表現に適用されたk-meansなどの単純なベースラインが適切に機能する場合に、クラスター構造を回復できないことがよくあります。グラフ構造と属性データの両方でさまざまな信号対雑音シナリオを調査するための一連の実験を慎重に設計することで、さらに調査します。これらの方法のクラスタリングにおけるパフォーマンスの低さに対処するために、クラスタリング品質のモジュール性測定にヒントを得た教師なしプーリング方法であるDeep Modularity Networks (DMoN)を紹介し、現実世界のグラフの困難なクラスタリング構造の回復にどのように対処するかを示します。同様に、現実世界のデータでは、DMoNがグラウンドトゥルースラベルと強く相関する高品質のクラスターを生成し、さまざまなメトリックにわたって他のプーリング方法よりも40%以上改善された最先端の結果を達成することを示します。
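
The modularity measure that inspires DMoN can be illustrated on a toy graph. The sketch below computes the classical modularity $Q = \frac{1}{2m}\sum_{ij}\bigl(A_{ij} - \frac{d_i d_j}{2m}\bigr)\,\delta(c_i, c_j)$ for a hard cluster assignment. It is not the DMoN objective itself, which the abstract does not specify; the function name `modularity` and the two-triangle example graph are assumptions made here for illustration.

```python
def modularity(adj, labels):
    """Classical modularity of a hard clustering on an undirected graph.

    adj is a symmetric 0/1 adjacency matrix (list of lists) and labels[i]
    is the cluster id of node i.
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # twice the number of edges
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed edge minus the edge expected from degrees alone.
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = 1
    adj[v][u] = 1

by_triangle = modularity(adj, [0, 0, 0, 1, 1, 1])  # 5/14, about 0.357
one_cluster = modularity(adj, [0, 0, 0, 0, 0, 0])  # 0 for the trivial clustering

print(by_triangle, one_cluster)
```

Splitting the graph along the bridge edge scores higher than putting every node in one cluster, which is the kind of signal a modularity-based pooling objective rewards.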

Euler-Lagrange Analysis of Generative Adversarial Networks
生成的敵対ネットワークのオイラー・ラグランジュ分析

We consider Generative Adversarial Networks (GANs) and address the underlying functional optimization problem ab initio within a variational setting. Strictly speaking, the optimization of the generator and discriminator functions must be carried out in accordance with the Euler-Lagrange conditions, which become particularly relevant in scenarios where the optimization cost involves regularizers comprising the derivatives of these functions. Considering Wasserstein GANs (WGANs) with a gradient-norm penalty, we show that the optimal discriminator is the solution to a Poisson differential equation. In principle, the optimal discriminator can be obtained in closed form without having to train a neural network. We illustrate this by employing a Fourier-series approximation to solve the Poisson differential equation. Experimental results based on synthesized Gaussian data demonstrate superior convergence behavior of the proposed approach in comparison with the baseline WGAN variants that employ weight-clipping, gradient or Lipschitz penalties on the discriminator on low-dimensional data. We also analyze the truncation error of the Fourier-series approximation and the estimation error of the Fourier coefficients in a high-dimensional setting. We demonstrate applications to real-world images considering latent-space prior matching in Wasserstein autoencoders and present performance comparisons on benchmark datasets such as MNIST, SVHN, CelebA, CIFAR-10, and Ukiyo-E. We demonstrate that the proposed approach achieves comparable reconstruction error and Frechet inception distance with faster convergence and up to two-fold improvement in image sharpness.

私たちは、生成的敵対ネットワーク(GAN)を取り上げ、その基礎となる汎関数最適化問題を変分法の枠組みで第一原理から扱います。厳密に言えば、生成器と識別器の関数の最適化は、オイラー・ラグランジュ条件に従って実行する必要があります。これは、最適化コストにこれらの関数の導関数を含む正則化項が含まれるシナリオで特に重要になります。勾配ノルムペナルティを持つWasserstein GAN (WGAN)を考えると、最適な識別器がポアソン微分方程式の解であることを示します。原則として、最適な識別器は、ニューラルネットワークをトレーニングしなくても閉じた形式で取得できます。これを、フーリエ級数近似を使用してポアソン微分方程式を解くことで示します。合成ガウスデータに基づく実験結果は、低次元データにおいて、識別器に重みクリッピング、勾配ペナルティ、またはLipschitzペナルティを使用するベースラインWGANバリアントと比較して、提案されたアプローチの優れた収束挙動を示しています。また、高次元設定におけるフーリエ級数近似の打ち切り誤差とフーリエ係数の推定誤差も分析します。Wassersteinオートエンコーダーの潜在空間事前分布マッチングを考慮した実世界の画像への応用を実証し、MNIST、SVHN、CelebA、CIFAR-10、Ukiyo-Eなどのベンチマークデータセットでのパフォーマンス比較を示します。提案されたアプローチは、同等の再構成誤差とフレシェ・インセプション距離を達成しつつ、より高速な収束と最大2倍の画像鮮明度の向上を実現することを実証します。
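The abstract states that the optimal discriminator solves a Poisson equation, which can be handled with a Fourier-series approximation. The following is only a minimal sketch of that general idea, not the paper's method: it solves the one-dimensional periodic problem $u'' = f$ by dividing the Fourier coefficients of $f$ by $-k^2$. The function name `solve_poisson_periodic`, the periodic domain, and the right-hand side are all assumptions chosen for illustration.

```python
import numpy as np

def solve_poisson_periodic(f_vals, length):
    """Solve u'' = f on a uniform periodic 1-D grid (illustrative helper).

    Returns the zero-mean solution; f is assumed to have zero mean.
    """
    n = f_vals.size
    # Angular wavenumbers matching numpy's FFT ordering.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    f_hat = np.fft.fft(f_vals)
    u_hat = np.zeros_like(f_hat)
    nonzero = k != 0
    # In Fourier space, u'' = f becomes -k^2 * u_hat = f_hat.
    u_hat[nonzero] = -f_hat[nonzero] / k[nonzero] ** 2
    return np.fft.ifft(u_hat).real

n = 64
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
f = -np.sin(x)                      # chosen so that the exact solution is sin(x)
u = solve_poisson_periodic(f, 2.0 * np.pi)
max_err = float(np.max(np.abs(u - np.sin(x))))
print(max_err)
```

In higher dimensions the same division by the squared wavenumber applies per Fourier mode, which is presumably why the abstract analyzes truncation and coefficient-estimation errors separately.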

Statistical Robustness of Empirical Risks in Machine Learning
機械学習における経験的リスクの統計的ロバスト性

This paper studies convergence of empirical risks in reproducing kernel Hilbert spaces (RKHS). A conventional assumption in the existing research is that empirical training data are generated by the unknown true probability distribution, but this may not be satisfied in some practical circumstances. Consequently, the existing convergence results may not provide a guarantee as to whether the empirical risks are reliable or not when the data are potentially corrupted (generated by a distribution perturbed from the true one). In this paper, we fill this gap from a robust statistics perspective (Krätschmer, Schied and Zähle (2012); Krätschmer, Schied and Zähle (2014); Guo and Xu (2020)). First, we derive moderate sufficient conditions under which the expected risk changes stably (continuously) against small perturbations of the probability distributions of the underlying random variables and demonstrate how the cost function and kernel affect the stability. Second, we examine the difference between laws of the statistical estimators of the expected optimal loss based on pure data and contaminated data using the Prokhorov metric and the Kantorovich metric, and derive some asymptotic qualitative and non-asymptotic quantitative statistical robustness results. Third, we identify appropriate metrics under which the statistical estimators are uniformly asymptotically consistent. These results provide theoretical grounding for analysing asymptotic convergence and examining reliability of the statistical estimators in a number of regression models.

この論文では、再生核ヒルベルト空間(RKHS)における経験的リスクの収束について研究しています。既存の研究における従来の仮定では、経験的トレーニングデータは未知の真の確率分布によって生成されるとされていますが、これは実際の状況によっては満たされない可能性があります。その結果、既存の収束結果は、データが潜在的に破損している場合(真の分布から摂動を受けた分布によって生成される場合)、経験的リスクが信頼できるかどうかについて保証を提供しない可能性があります。この論文では、ロバスト統計の観点からこのギャップを埋めます(Krätschmer、Schied、Zähle (2012)、Krätschmer、Schied、Zähle (2014)、GuoおよびXu (2020))。まず、基礎となる確率変数の確率分布の小さな摂動に対して期待リスクが安定して(連続的に)変化するための緩やかな十分条件を導出し、コスト関数とカーネルが安定性にどのように影響するかを示します。次に、プロホロフ距離とカントロビッチ距離を使用して、純粋なデータと汚染されたデータに基づく期待最適損失の統計的推定量の分布の違いを調べ、漸近的な定性的および非漸近的な定量的統計的ロバスト性の結果を導きます。最後に、統計的推定量が一様に漸近的に一致する適切な距離を特定します。これらの結果は、多くの回帰モデルにおいて、統計的推定量の漸近収束を分析し、その信頼性を検証するための理論的基盤を提供します。

HiGrad: Uncertainty Quantification for Online Learning and Stochastic Approximation
HiGrad:オンライン学習と確率的近似のための不確実性定量化

Stochastic gradient descent (SGD) is an immensely popular approach for online learning in settings where data arrives in a stream or data sizes are very large. However, despite an ever-increasing volume of work on SGD, much less is known about the statistical inferential properties of SGD-based predictions. Taking a fully inferential viewpoint, this paper introduces a novel procedure termed HiGrad to conduct statistical inference for online learning, without incurring additional computational cost compared with SGD. The HiGrad procedure begins by performing SGD updates for a while and then splits the single thread into several threads, and this procedure hierarchically operates in this fashion along each thread. With predictions provided by multiple threads in place, a $t$-based confidence interval is constructed by decorrelating predictions using covariance structures given by a Donsker-style extension of the Ruppert–Polyak averaging scheme, which is a technical contribution of independent interest. Under certain regularity conditions, the HiGrad confidence interval is shown to attain asymptotically exact coverage probability. Finally, the performance of HiGrad is evaluated through extensive simulation studies and a real data example. An R package \texttt{higrad} has been developed to implement the method.

確率的勾配降下法(SGD)は、データがストリームで到着したり、データサイズが非常に大きい環境でのオンライン学習に非常によく使われる手法です。しかし、SGDに関する研究は増え続けているにもかかわらず、SGDベースの予測の統計的推論特性についてはほとんどわかっていません。この論文では、完全な推論の観点から、SGDと比較して追加の計算コストをかけずにオンライン学習の統計的推論を実行するためのHiGradという新しい手順を紹介します。HiGrad手順は、まずしばらくSGD更新を実行し、次に単一のスレッドを複数のスレッドに分割します。この手順は、各スレッドに沿ってこのように階層的に動作します。複数のスレッドによって予測が提供されたら、独立した関心の技術的貢献であるRuppert-Polyak平均化スキームのDonskerスタイルの拡張によって提供される共分散構造を使用して予測を非相関化することで、tベースの信頼区間が構築されます。一定の規則性条件下では、HiGrad信頼区間は漸近的に正確なカバレッジ確率を達成することが示されています。最後に、広範なシミュレーション研究と実際のデータ例を通じて、HiGradのパフォーマンスが評価されます。このメソッドを実装するために、Rパッケージ\texttt{higrad}が開発されました。

Benign overfitting in ridge regression
リッジ回帰における良性のオーバーフィット

In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. arXiv:1906.11300 characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.

ディープラーニングの最近の多くのアプリケーションでは、ニューラルネットワークのパラメーターの数が、トレーニングに使用されるデータポイントの数よりはるかに多くなっています。これらの実践に触発されて、最近の理論研究の多くが、過剰パラメーター化モデルの研究に費やされてきました。この領域における中心的な現象の1つは、モデルがノイズの多いデータを補間しながらも、テスト誤差がそのデータのノイズ量より低くなりうることです。arXiv:1906.11300は、最小$\ell_2$ノルムの補間解を考え、データが独立な成分を持つ場合に、線形回帰においてデータがどのような共分散構造を持てばこのような現象が起こりうるかを特徴付けました。彼らは分散項にシャープな評価を与え、データの共分散が小さな余次元の部分空間で高い有効ランクを持つ場合、かつその場合に限り、分散項が小さくなりうることを示しました。私たちは、独立性の仮定を取り除き、バイアス項にシャープな評価を与えることで、彼らの結果を強化し、完成させます。したがって、私たちの結果は、カーネル回帰など、arXiv:1906.11300よりもはるかに一般的な設定に適用され、ノイズがどのように減衰されるかだけでなく、真の信号のどの部分が学習されるかも特徴付けます。さらに、この結果をリッジ回帰の設定に拡張することで、別の興味深い現象を説明できます。つまり、最適な正則化が負になるための一般的な十分条件を示します。
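To make the objects in this abstract concrete, the sketch below computes the minimum $\ell_2$-norm interpolating solution and a ridge solution for an overparameterized linear regression. It uses the standard dual form $w = X^\top (XX^\top + \lambda I)^{-1} y$; the synthetic data, the dimensions, and the helper name `ridge_dual` are assumptions made for illustration and do not reproduce any experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                       # fewer samples than parameters
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)   # noisy labels

def ridge_dual(X, y, lam):
    """Ridge solution in dual form; lam = 0 gives the minimum-norm interpolant
    whenever X @ X.T is invertible (illustrative helper)."""
    gram = X @ X.T
    return X.T @ np.linalg.solve(gram + lam * np.eye(X.shape[0]), y)

w_interp = ridge_dual(X, y, 0.0)     # fits the noisy training data exactly
w_ridge = ridge_dual(X, y, 1.0)      # shrinks the solution, no longer interpolates

resid_interp = float(np.max(np.abs(X @ w_interp - y)))
resid_ridge = float(np.max(np.abs(X @ w_ridge - y)))
print(resid_interp, resid_ridge)
print(np.linalg.norm(w_interp), np.linalg.norm(w_ridge))
```

The dual form stays well defined for any $\lambda$ larger than minus the smallest eigenvalue of $XX^\top$, so negative regularization values of the kind mentioned in the abstract can be plugged into the same expression.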

Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities
計算効率の高いディープラーニング:アルゴリズムのトレンドと機会

Although deep learning has made great progress in recent years, the exploding economic and environmental costs of training neural networks are becoming unsustainable. To address this problem, there has been a great deal of research on *algorithmically-efficient deep learning*, which seeks to reduce training costs not at the hardware or implementation level, but through changes in the semantics of the training program. In this paper, we present a structured and comprehensive overview of the research in this field. First, we formalize the *algorithmic speedup* problem, then we use fundamental building blocks of algorithmically efficient training to develop a taxonomy. Our taxonomy highlights commonalities of seemingly disparate methods and reveals current research gaps. Next, we present evaluation best practices to enable comprehensive, fair, and reliable comparisons of speedup techniques. To further aid research and applications, we discuss common bottlenecks in the training pipeline (illustrated via experiments) and offer taxonomic mitigation strategies for them. Finally, we highlight some unsolved research challenges and present promising future directions.

近年、ディープラーニングは大きな進歩を遂げていますが、ニューラルネットワークのトレーニングにかかる経済的および環境的コストは爆発的に増加し、持続不可能になりつつあります。この問題に対処するために、ハードウェアや実装レベルではなく、トレーニングプログラムのセマンティクスの変更を通じてトレーニングコストを削減しようとする、*アルゴリズム的に効率的なディープラーニング*に関する研究が盛んに行われてきました。この論文では、この分野の研究の構造化された包括的な概要を示します。まず、*アルゴリズムによる高速化*問題を形式化し、次にアルゴリズム的に効率的なトレーニングの基本的な構成要素を使用して分類法を開発します。私たちの分類法は、一見異なる方法の共通点を強調し、現在の研究のギャップを明らかにします。次に、高速化手法の包括的で公平かつ信頼性の高い比較を可能にする評価のベストプラクティスを示します。研究とアプリケーションをさらに支援するために、トレーニングパイプラインの一般的なボトルネック（実験によって説明）について説明し、それらに対する分類上の緩和戦略を示します。最後に、未解決の研究上の課題をいくつか強調し、有望な将来の方向性を示します。

Minimal Width for Universal Property of Deep RNN
Deep RNNの普遍的な性質のための最小幅

A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In general, deep narrow networks with bounded width and arbitrary depth are more effective than wide shallow networks with arbitrary width and bounded depth in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths $d_x+d_y+3$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required if the activation function is sigmoid or more. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and technique can shed light on further research on deep RNNs.

リカレントニューラルネットワーク(RNN)は、シーケンシャルデータを扱うための広く使用されているディープラーニングネットワークです。無限幅のRNNは、動的システムを模倣して、コンパクトな領域内の任意のオープン動的システムを近似できます。一般に、制限された幅と任意の深さを持つ深く狭いネットワークは、任意の幅と制限された深さを持つ広く浅いネットワークよりも実際には効果的ですが、深く狭い構造の普遍近似定理はまだ十分に研究されていません。この研究では、深く狭いRNNの普遍性を証明し、普遍性の最小幅の上限がデータの長さに依存しないことを示します。具体的には、ReLU活性化による深層RNNが、それぞれ幅$d_x+d_y+3$と$\max\{d_x+1,d_y\}$の任意の連続関数または$L^p$関数を近似できることを示します。ここで、ターゲット関数は$\mathbb{R}^{d_x}$のベクトルの有限シーケンスを$\mathbb{R}^{d_y}$のベクトルの有限シーケンスにマッピングします。また、活性化関数がシグモイド以上の場合に必要となる追加の幅も計算します。さらに、双方向RNNなどの他の再帰型ネットワークの普遍性を証明します。多層パーセプトロンとRNNを橋渡しする私たちの理論と技術は、深層RNNのさらなる研究に光を当てることができます。

Maximum likelihood estimation in Gaussian process regression is ill-posed
ガウス過程回帰における最尤推定は不良設定である

Gaussian process regression underpins countless academic and industrial applications of machine learning and statistics, with maximum likelihood estimation routinely used to select appropriate parameters for the covariance kernel. However, it remains an open problem to establish the circumstances in which maximum likelihood estimation is well-posed, that is, when the predictions of the regression model are insensitive to small perturbations of the data. This article identifies scenarios where the maximum likelihood estimator fails to be well-posed, in that the predictive distributions are not Lipschitz in the data with respect to the Hellinger distance. These failure cases occur in the noiseless data setting, for any Gaussian process with a stationary covariance function whose lengthscale parameter is estimated using maximum likelihood. Although the failure of maximum likelihood estimation is part of Gaussian process folklore, these rigorous theoretical results appear to be the first of their kind. The implication of these negative results is that well-posedness may need to be assessed post-hoc, on a case-by-case basis, when maximum likelihood estimation is used to train a Gaussian process model.

ガウス過程回帰は、機械学習と統計学の無数の学術的および産業的応用を支えており、共分散カーネルの適切なパラメータを選択するために、最尤推定が日常的に使用されています。しかし、最尤推定が良設定となる状況、つまり回帰モデルの予測がデータの小さな摂動に対して鈍感である状況を確立することは、未解決の問題のままです。この記事では、予測分布がヘリンガー距離に関してデータについてリプシッツ連続ではないという意味で、最尤推定量が良設定とならないシナリオを特定します。これらの失敗例は、ノイズのないデータ設定で、長さスケールパラメータが最尤法を使用して推定される定常共分散関数を持つ任意のガウス過程に対して発生します。最尤推定の失敗はガウス過程に関する経験則として知られていますが、これらの厳密な理論的結果は、その種の最初のものであると思われます。これらの否定的な結果が意味するのは、最尤推定を使用してガウス過程モデルをトレーニングする場合、良設定性を事後的にケースバイケースで評価する必要があるかもしれないということです。

An Annotated Graph Model with Differential Degree Heterogeneity for Directed Networks
有向ネットワークのための微分次数不均一性を持つ注釈付きグラフモデル

Directed networks are conveniently represented as graphs in which ordered edges encode interactions between vertices. Despite their wide availability, there is a shortage of statistical models amenable to inference, especially when contextual information and degree heterogeneity are present. This paper presents an annotated graph model with parameters explicitly accounting for these features. To overcome the curse of dimensionality due to modelling degree heterogeneity, we introduce a sparsity assumption and propose a penalized likelihood approach with $\ell_1$-regularization for parameter estimation. We study the estimation and selection consistency of this approach under a sparse network assumption, and show that inference on the covariate parameter is straightforward, thus bypassing the need for the kind of debiasing commonly employed in $\ell_1$-penalized likelihood estimation. Simulation and data analysis corroborate our theoretical findings.

有向ネットワークは、順序付けられたエッジが頂点間の相互作用をエンコードするグラフとして便利に表されます。その幅広い利用可能性にもかかわらず、特に文脈情報と次数不均一性が存在する場合、推論に適した統計モデルが不足しています。この論文では、これらの特徴を明示的に考慮したパラメーターを持つ注釈付きグラフモデルを示します。次数不均一性のモデリングによる次元の呪いを克服するために、スパース性の仮定を導入し、パラメータ推定のための$\ell_1$正則化によるペナルティ付き尤度アプローチを提案します。スパースネットワークの仮定の下で、このアプローチの推定と選択の一致性を研究し、共変量パラメータの推論が単純であり、したがって$\ell_1$ペナルティ付き尤度推定で一般的に使用される種類のバイアス除去が不要になることを示します。シミュレーションとデータ解析は、私たちの理論的発見を裏付けています。

A Unified Framework for Optimization-Based Graph Coarsening
最適化ベースのグラフ粗大化のための統一フレームワーク

Graph coarsening is a widely used dimensionality reduction technique for approaching large-scale graph machine-learning problems. Given a large graph, graph coarsening aims to learn a smaller, tractable graph while preserving the properties of the originally given graph. Graph data consist of node features and a graph matrix (e.g., adjacency or Laplacian). The existing graph coarsening methods ignore the node features and rely solely on a graph matrix to simplify graphs. In this paper, we introduce a novel optimization-based framework for graph dimensionality reduction. The proposed framework unifies graph learning and dimensionality reduction. It takes both the graph matrix and the node features as the input and learns the coarsened graph matrix and the coarsened feature matrix jointly while ensuring desired properties. The proposed optimization formulation is a multi-block non-convex optimization problem, which is solved efficiently by leveraging block majorization-minimization, $\log$ determinant, Dirichlet energy, and regularization frameworks. The proposed algorithms are provably convergent and practically amenable to numerous tasks. It is also established that the learned coarsened graph is $\epsilon\in(0,1)$ similar to the original graph. Extensive experiments elucidate the efficacy of the proposed framework for real-world applications.

グラフ粗大化は、大規模なグラフ機械学習の問題に取り組むための次元削減手法として広く使用されています。大きなグラフが与えられた場合、グラフ粗大化は、元のグラフの特性を維持しながら、より小さく扱いやすいグラフを学習することを目的としています。グラフデータは、ノード機能とグラフ行列(隣接性やラプラシアンなど)で構成されます。既存のグラフ粗大化手法では、ノード機能を無視し、グラフ行列のみを使用してグラフを簡略化します。この論文では、グラフ次元削減のための新しい最適化ベースのフレームワークを紹介します。提案されたフレームワークは、グラフ学習と次元削減の統合にあります。グラフ行列とノード機能の両方を入力として受け取り、望ましい特性を確保しながら、粗いグラフ行列と粗い機能行列を共同で学習します。提案された最適化定式化は、ブロックの主要化最小化、$\log$行列式、ディリクレエネルギー、および正則化フレームワークを活用して効率的に解決される、マルチブロックの非凸最適化問題です。提案されたアルゴリズムは、収束性が証明されており、多くのタスクに実用的に適用できます。また、学習された粗いグラフは、元のグラフと$\epsilon\in(0,1)$類似していることも確認されています。広範囲にわたる実験により、提案されたフレームワークの実世界のアプリケーションに対する有効性が明らかにされています。

Deep linear networks can benignly overfit when shallow ones do
浅い線形ネットワークが良性過適合する場合、深い線形ネットワークも良性過適合しうる

We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm’s ability to “hide the noise”. Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.

私たちは、勾配フローを使用して学習された、補間する深層線形ネットワークの過剰リスクを上から評価します。以前に最小$\ell_2$ノルム補間解のリスク境界を確立するために使用された設定において、ランダムに初期化された深層線形ネットワークが、最小$\ell_2$ノルム補間解の既知の境界に近づく、あるいは一致しうることを示します。また、この分析では、補間する深層線形モデルが、最小$\ell_2$ノルム解とまったく同じ条件付き分散を持つことも明らかになりました。ノイズは条件付き分散を通じてのみ過剰リスクに影響を与えるため、これは、深さがアルゴリズムの「ノイズを隠す」能力を向上させないことを意味します。私たちのシミュレーションでは、境界のいくつかの側面が単純なデータ分布における典型的な挙動を反映していることを検証します。また、ReLUネットワークを使用したシミュレーションでも同様の現象が見られることがわかりましたが、そこでの状況はより微妙です。

SQLFlow: An Extensible Toolkit Integrating DB and AI
SQLFlow: DB と AI を統合した拡張可能なツールキット

Integrating AI algorithms into databases is an ongoing effort in both academia and industry. We introduce SQLFlow, a toolkit seamlessly combining data manipulations and AI operations that can be run locally or remotely. SQLFlow extends SQL syntax to support typical AI tasks including model training, inference, interpretation, and mathematical optimization. It is compatible with a variety of database management systems (DBMS) and AI engines, including MySQL, TiDB, MaxCompute, and Hive, as well as TensorFlow, scikit-learn, and XGBoost. Documentation and case studies are available at https://sqlflow.org. The source code and additional details can be found at https://github.com/sql-machine-learning/sqlflow.

AIアルゴリズムをデータベースに統合することは、学界と産業界の両方で継続的な取り組みです。SQLFlowは、データ操作とAI操作をシームレスに組み合わせ、ローカルまたはリモートで実行できるツールキットです。SQLFlowは、SQL構文を拡張して、モデルのトレーニング、推論、解釈、数学的最適化などの一般的なAIタスクをサポートします。MySQL、TiDB、MaxCompute、Hive、TensorFlow、scikit-learn、XGBoostなど、さまざまなデータベース管理システム(DBMS)およびAIエンジンと互換性があります。ドキュメントとケーススタディは、https://sqlflow.orgで入手できます。ソースコードと詳細については、https://github.com/sql-machine-learning/sqlflowを参照してください。

Learning Good State and Action Representations for Markov Decision Process via Tensor Decomposition
テンソル分解によるマルコフ決定過程の良好な状態表現と行動表現の学習

The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP’s tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.

連続状態・行動マルコフ決定過程(MDP)の遷移カーネルは、自然なテンソル構造を持ちます。この論文では、経験的軌跡から意味のある低次元の状態と行動の表現を特定するための、テンソルに着想を得た教師なし学習方法を提案します。この手法は、カーネル化、重要度サンプリング、および低タッカーランク近似によってMDPのテンソル構造を利用します。この方法は、さらに、状態と行動をそれぞれクラスター化し、最適な離散MDP抽象化を見つけるために使用できます。テンソル集中と埋め込み後の拡散距離の保存について、シャープな統計的誤差限界を与えます。さらに、潜在的なブロック構造が存在する場合、学習した状態/行動の抽象化がそれを正確に近似し、方策評価などのダウンストリームタスクでの関数近似を可能にすることを証明します。

Generalization Bounds for Adversarial Contrastive Learning
敵対的対照学習の一般化限界

Deep networks are well-known to be fragile to adversarial attacks, and adversarial training is one of the most popular methods used to train a robust model. To take advantage of unlabeled data, recent works have applied adversarial training to contrastive learning (Adversarial Contrastive Learning; ACL for short) and obtained promising robust performance. However, the theory of ACL is not well understood. To fill this gap, we leverage the Rademacher complexity to analyze the generalization performance of ACL, with a particular focus on linear models and multi-layer neural networks under $\ell_p$ attack ($p \ge 1$). Our theory shows that the average adversarial risk of the downstream tasks can be upper bounded by the adversarial unsupervised risk of the upstream task. The experimental results validate our theory.

ディープネットワークは敵対的攻撃に対して脆弱であることがよく知られており、敵対的トレーニングは、堅牢なモデルをトレーニングするために使用される最も一般的な方法の1つです。ラベル付けされていないデータを活用するために、最近の研究では、対照学習に敵対的トレーニングを適用し(Adversarial Contrastive Learning、略してACL)、有望な堅牢性を得ています。しかし、ACLの理論はよく理解されていません。このギャップを埋めるために、Rademacher複雑度を活用してACLの汎化性能を分析し、特に$\ell_p$攻撃($p \ge 1$)の下での線形モデルと多層ニューラルネットワークに焦点を当てます。私たちの理論は、下流タスクの平均敵対的リスクが、上流タスクの敵対的教師なしリスクによって上から抑えられることを示しています。実験結果は私たちの理論を裏付けています。

The Implicit Bias of Benign Overfitting
良性の過学習の暗黙のバイアス

The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining near-optimal expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally *not* occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted squared hinge loss. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings.

良性過剰適合の現象は、予測子がノイズの多いトレーニングデータに完全に適合しながら、ほぼ最適な期待損失を達成するもので、近年多くの注目を集めていますが、適切に指定された線形回帰の設定以外では、まだ十分に理解されていません。この論文では、回帰タスクと分類タスクの両方で、良性過剰適合が発生する可能性がある場合と発生しない可能性がある場合に関するいくつかの新しい結果を示します。線形予測子の良性過剰適合のプロトタイプでかなり一般的なデータモデルを検討します。このモデルでは、固定次元$k$の任意の入力分布が高次元分布と連結されます。必ずしも適切に指定されていない線形回帰の場合、最小ノルム補間予測子(標準的なトレーニング方法が収束する)は一般に矛盾したソリューションに偏っているため、良性過剰適合は一般に発生しないことを示します。さらに、いくつかの回帰問題における良性過剰適合の存在が、他の回帰問題における良性過剰適合の存在を排除することを証明することで、これが標準的な線形回帰を超えてどのように拡張できるかを示します。次に、分類問題に目を向け、そこでの状況がはるかに好ましいことを示します。具体的には、最大マージン予測子(標準的なトレーニング方法が方向に収束することが知られている)が、重み付き二乗ヒンジ損失を最小化する方向に漸近的に偏っていることを証明します。これにより、分類における良性過剰適合の問題を、この損失が誤分類エラーの良い代替物であるかどうかというより単純な問題に縮小し、それを使用していくつかの新しい設定で良性過剰適合を示すことができます。

The Hyperspherical Geometry of Community Detection: Modularity as a Distance
コミュニティ検出の超球面幾何学:距離としてのモジュール性

We introduce a metric space of clusterings, where clusterings are described by a binary vector indexed by the vertex-pairs. We extend this geometry to a hypersphere and prove that maximizing modularity is equivalent to minimizing the angular distance to some modularity vector over the set of clustering vectors. In that sense, modularity-based community detection methods can be seen as a subclass of a more general class of projection methods, which we define as the community detection methods that adhere to the following two-step procedure: first, mapping the network to a point on the hypersphere; second, projecting this point to the set of clustering vectors. We show that this class of projection methods contains many interesting community detection methods. Many of these new methods cannot be described in terms of null models and resolution parameters, as is customary for modularity-based methods. We provide a new characterization of such methods in terms of meridians and latitudes of the hypersphere. In addition, by relating the modularity resolution parameter to the latitude of the corresponding modularity vector, we obtain a new interpretation of the resolution limit that modularity maximization is known to suffer from.

私たちは、クラスタリングの計量空間を導入します。クラスタリングは頂点ペアでインデックス付けされたバイナリベクトルで記述されます。この幾何学を超球に拡張し、モジュール性を最大化することは、クラスタリングベクトルのセット上のモジュール性ベクトルへの角度距離を最小化することと同等であることを証明します。その意味で、モジュール性ベースのコミュニティ検出方法は、より一般的なクラスの投影方法のサブクラスと見なすことができます。投影方法は、次の2段階の手順に従うコミュニティ検出方法として定義されます。まず、ネットワークを超球上の点にマッピングします。次に、この点をクラスタリングベクトルのセットに投影します。このクラスの投影方法には、多くの興味深いコミュニティ検出方法が含まれていることを示します。これらの新しい方法の多くは、モジュール性ベースの方法では一般的であるヌルモデルと解像度パラメーターの観点から記述できません。超球の子午線と緯度の観点から、このような方法の新しい特徴を示します。さらに、モジュール性解像度パラメータを対応するモジュール性ベクトルの緯度に関連付けることにより、モジュール性最大化が受けることが知られている解像度限界の新しい解釈が得られます。
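For readers who want to see the quantities in this abstract in concrete form, here is a small sketch that computes the standard (Newman-Girvan) modularity with a resolution parameter, together with a pair-indexed binary vector describing a clustering. The 0/1 encoding in `pair_vector`, the function names, and the toy graph are assumptions for illustration; the paper's exact vector construction and its hypersphere embedding are not reproduced here.

```python
import numpy as np

def modularity(adj, labels, resolution=1.0):
    """Modularity of a clustering of an undirected graph with the
    configuration-model null term scaled by `resolution`."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degrees = adj.sum(axis=1)
    two_m = degrees.sum()                       # twice the number of edges
    same = labels[:, None] == labels[None, :]   # True where i and j share a cluster
    null = resolution * np.outer(degrees, degrees) / two_m
    return float(((adj - null) * same).sum() / two_m)

def pair_vector(labels):
    """Binary vector over vertex pairs i < j: 1 if both are in the same cluster."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(labels.size, k=1)
    return (labels[i] == labels[j]).astype(int)

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for a, b in edges:
    adj[a, b] = adj[b, a] = 1.0

two_triangles = [0, 0, 0, 1, 1, 1]
q_split = modularity(adj, two_triangles)
q_single = modularity(adj, [0] * 6)
vec = pair_vector(two_triangles)
print(q_split, q_single, vec.sum())
```

Raising `resolution` penalizes large clusters more heavily, which is the parameter the abstract relates to the latitude of the modularity vector.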

FLIP: A Utility Preserving Privacy Mechanism for Time Series
FLIP:時系列のための有用性を保持するプライバシーメカニズム

Guaranteeing privacy in released data is an important goal for data-producing agencies. There has been extensive research on developing suitable privacy mechanisms in recent years. Particularly notable is the idea of noise addition with the guarantee of differential privacy. There are, however, concerns about compromising data utility when very stringent privacy mechanisms are applied. Such compromises can be quite stark in correlated data, such as time series data. Adding white noise to a stochastic process may significantly change the correlation structure, a facet of the process that is essential to optimal prediction. We propose the use of all-pass filtering as a privacy mechanism for regularly sampled time series data, showing that this procedure preserves certain types of utility while also providing sufficient privacy guarantees to entity-level time series. Numerical studies explore the practical performance of the new method, and an empirical application to labor force data shows the method’s favorable utility properties in comparison to other competing privacy mechanisms.

公開されたデータのプライバシーを保証することは、データ作成機関にとって重要な目標です。近年、適切なプライバシーメカニズムの開発に関する広範な研究が行われています。特に注目すべきは、差分プライバシーの保証を伴うノイズ追加というアイデアです。ただし、非常に厳格なプライバシーメカニズムを適用すると、データの有用性が損なわれるという懸念があります。このような妥協は、時系列データなどの相関データでは非常に顕著になる可能性があります。確率過程にホワイトノイズを追加すると、相関構造が大幅に変化する可能性があります。相関構造は、最適な予測に不可欠なプロセスの側面です。私たちは、定期的にサンプリングされた時系列データのプライバシーメカニズムとしてオールパスフィルタリングを使用することを提案し、この手順によって特定の種類の有用性が維持されると同時に、エンティティレベルの時系列に十分なプライバシー保証が提供されることを示しています。数値研究では、新しい方法の実際のパフォーマンスを調査し、労働力データへの実証的な適用により、他の競合するプライバシーメカニズムと比較して、この方法の好ましい有用性特性が示されました。
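The key property behind all-pass filtering is that the filter's frequency response has unit modulus, so the power spectrum (and hence the autocorrelation structure) of the series is unchanged while the values themselves are altered. The sketch below illustrates only this property with a first-order all-pass filter applied circularly via the FFT. The filter order, the coefficient `a`, the circular application, and the AR(1) test series are assumptions; this is not the FLIP mechanism itself and it makes no claim about privacy guarantees.

```python
import numpy as np

def allpass_circular(x, a):
    """Apply H(z) = (z^-1 - a) / (1 - a z^-1), with |a| < 1, circularly to x."""
    n = x.size
    omega = 2.0 * np.pi * np.fft.rfftfreq(n)    # angular frequencies in [0, pi]
    z_inv = np.exp(-1j * omega)
    response = (z_inv - a) / (1.0 - a * z_inv)  # unit modulus at every frequency
    return np.fft.irfft(np.fft.rfft(x) * response, n=n)

# Correlated toy series: AR(1) with coefficient 0.8.
rng = np.random.default_rng(0)
n = 256
noise = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + noise[t]

y = allpass_circular(x, a=0.5)

power_x = np.abs(np.fft.rfft(x)) ** 2
power_y = np.abs(np.fft.rfft(y)) ** 2
print(np.allclose(power_x, power_y), np.allclose(x, y))
```

By contrast, adding white noise would raise the spectrum uniformly and flatten the correlation structure, which is the utility loss the abstract describes.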

A General Theory for Federated Optimization with Asynchronous and Heterogeneous Clients Updates
非同期かつ異種なクライアント更新を伴う連合最適化の一般理論

We propose a novel framework to study asynchronous federated learning optimization with delays in gradient updates. Our theoretical framework extends the standard FedAvg aggregation scheme by introducing stochastic aggregation weights to represent the variability of the clients' update times, due for example to heterogeneous hardware capabilities. Our formalism applies to the general federated setting where clients have heterogeneous datasets and perform at least one step of stochastic gradient descent (SGD). We demonstrate convergence for such a scheme and provide sufficient conditions for the related minimum to be the optimum of the federated problem. We show that our general framework applies to existing optimization schemes including centralized learning, FedAvg, asynchronous FedAvg, and FedBuff. The theory provided here allows drawing meaningful guidelines for designing a federated learning experiment in heterogeneous conditions. In particular, we develop in this work FedFix, a novel extension of FedAvg enabling efficient asynchronous federated training while preserving the convergence stability of synchronous aggregation. We empirically demonstrate our theory on a series of experiments showing that asynchronous FedAvg leads to fast convergence at the expense of stability, and we finally demonstrate the improvements of FedFix over synchronous and asynchronous FedAvg.

私たちは、勾配更新の遅延を伴う非同期連合学習の最適化を研究するための新しいフレームワークを提案します。我々の理論的フレームワークは、例えば異種ハードウェア機能によるクライアント更新時間の変動性を表すために確率的集約重みを導入することにより、標準のFedAvg集約スキームを拡張します。我々の形式主義は、クライアントが異種データセットを持ち、少なくとも1ステップの確率的勾配降下法(SGD)を実行する一般的な連合設定に適用できます。我々はこのようなスキームの収束を実証し、関連する最小値が連合問題の最適値となるための十分な条件を提供します。私たちは、我々の一般的なフレームワークが、集中学習、FedAvg、非同期FedAvg、およびFedBuffを含む既存の最適化スキームに適用できることを示す。ここで提供される理論により、異種条件下での連合学習実験を設計するための有意義なガイドラインを描くことができます。特に、我々はこの研究で、同期集約の収束安定性を維持しながら効率的な非同期連合トレーニングを可能にするFedAvgの新しい拡張であるFedFixを開発します。我々は一連の実験で理論を実証し、非同期FedAvgは安定性を犠牲にして高速収束につながることを示しました。そして最後に、同期および非同期FedAvgに対するFedFixの改善を実証しました。
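The idea of aggregation weights that depend on which clients have reported can be sketched as follows. This is a deliberately simplified illustration, not the paper's formalism: each client's weight is its data share multiplied by a 0/1 arrival indicator, and the server moves the global parameters by the weighted sum of client deltas. The function name `aggregate`, the indicator-based weights, and the toy numbers are all assumptions.

```python
import numpy as np

def aggregate(global_params, client_params, data_sizes, arrived):
    """One server step: global + sum_i w_i * (client_i - global),
    where w_i = arrived_i * n_i / sum_j n_j (hypothetical weighting)."""
    global_params = np.asarray(global_params, dtype=float)
    sizes = np.asarray(data_sizes, dtype=float)
    weights = np.asarray(arrived, dtype=float) * sizes / sizes.sum()
    deltas = np.asarray(client_params, dtype=float) - global_params
    return global_params + weights @ deltas

global_params = np.zeros(2)
client_params = [[1.0, 1.0], [3.0, 3.0]]   # locally trained parameters
data_sizes = [1, 3]

# Every client reported: reduces to the usual data-weighted FedAvg average.
sync_step = aggregate(global_params, client_params, data_sizes, arrived=[1, 1])
# Only the first (small) client reported in time: a partial move.
async_step = aggregate(global_params, client_params, data_sizes, arrived=[1, 0])
print(sync_step, async_step)
```

Drawing `arrived` at random in each round makes the weights stochastic, which is one simple way to mimic heterogeneous client update times.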

Dimensionless machine learning: Imposing exact units equivariance
無次元機械学習:厳密な単位同変性の強制

Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods that are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology.

単位同変性（または単位共変性）は、物理的に意味のある測定量間の関係が自己無撞着な次元スケーリングに従わなければならないという要件から生じる厳密な対称性です。ここでは、この対称性を（非コンパクトな）群作用の観点から表現し、次元解析と同変機械学習のアイデアを使用して、厳密に単位同変な機械学習の方法論を提供します。任意の学習タスクについて、まず次元解析の古典的な結果を使用して入力の無次元バージョンを構築し、次に無次元空間で推論を実行します。私たちのアプローチは、回転やその他の群に対して同変な幅広い機械学習手法に単位同変性を課すために使用できます。対称性が重要となるシンボリック回帰やエミュレーションなどの文脈で得られる、サンプル内およびサンプル外の予測精度の向上について説明します。物理学と生態学の力学系を含む簡単な数値例を使用して、私たちのアプローチを説明します。
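The first step described in the abstract, building dimensionless inputs from classic dimensional analysis, can be illustrated with the Buckingham Pi construction: dimensionless groups correspond to null-space vectors of the dimension matrix. The sketch below uses a simple pendulum (period T, length l, gravitational acceleration g, mass m) as an assumed example; the variable choice and the SVD-based null-space computation are illustrative and are not taken from the paper.

```python
import numpy as np

names = ["T", "l", "g", "m"]
# Rows: base units (mass, length, time). Columns: exponents for T, l, g, m.
D = np.array([
    [0.0, 0.0, 0.0, 1.0],    # mass
    [0.0, 1.0, 1.0, 0.0],    # length
    [1.0, 0.0, -2.0, 0.0],   # time
])

# Null space of D via SVD: each null vector gives exponents of a dimensionless group.
_, s, vt = np.linalg.svd(D)
rank = int(np.sum(s > 1e-10))
null_basis = vt[rank:]
exponents = null_basis[0] / null_basis[0][2]   # normalise so g has exponent 1

def pi_group(values):
    """Evaluate the dimensionless product prod_i values[i] ** exponents[i]."""
    return float(np.prod(np.asarray(values, dtype=float) ** exponents))

si_values = [2.0, 1.0, 9.81, 0.5]                    # s, m, m/s^2, kg
# Same physical state expressed in ms, cm, cm/ms^2, g.
other_units = [2000.0, 100.0, 9.81 * 100.0 / 1e6, 500.0]
print(exponents, pi_group(si_values), pi_group(other_units))
```

The recovered group is $T^2 g / l$, and its value does not change when the units of the inputs are rescaled, so any model trained on it is units-equivariant by construction.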

Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors
物理情報に基づく事前確率を使用した不完全コンピュータモデルのベイズ較正

We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model’s structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2)$, where $m \ll N$. Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased, while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.

私たちは、微分方程式で表されるコンピュータモデルの物理パラメータとモデル定式化における不確実性を定量化するのに適した、計算効率の高いデータ駆動型フレームワークを紹介します。私たちは、共分散関数のモデル構造をエンコードするマルチ出力GP事前分布である物理学情報に基づく事前分布を構築します。これは、物理パラメータとモデル予測の不確実性を定量化する完全なベイズフレームワークに拡張されます。物理モデルは実際のプロセスの不完全な記述であることが多いため、食い違い関数を考慮することで、モデルが観測データから逸脱することを許可します。推論にはハミルトンモンテカルロが使用されます。さらに、ビッグデータの近似が開発され、計算の複雑さが$\mathcal{O}(N^3)$から$\mathcal{O}(N\cdot m^2),$に削減されます(ここで$m \ll N.$)。我々のアプローチは、物理が時間依存ODE (心血管モデル)と時空依存PDE (熱方程式)によって記述されるシミュレーションと実際のデータのケーススタディで実証されています。研究では、1)現実がモデルの選択よりも複雑で、2)データ取得プロセスに偏りがあり、正確な予測も生成できる場合でも、私たちのモデル化フレームワークが物理モデルの真のパラメータを回復できることが示されています。さらに、私たちのアプローチは従来のベイズ較正法よりも計算が速いことが実証されています。

Risk Bounds for Positive-Unlabeled Learning Under the Selected At Random Assumption
選択されたランダム仮定の下での正のラベルなし学習のリスク限界

Positive-Unlabeled learning (PU learning) is a special case of semi-supervised binary classification where only a fraction of positive examples is labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to the standard classification setting. Finally, we provide a lower bound on the minimax risk proving that the upper bound is almost optimal.

ポジティブ-ラベルなし学習(PU学習)は、ポジティブな例の一部のみがラベル付けされる半教師あり二項分類の特殊なケースです。次に、情報が不足しているにもかかわらず、正しい分類子を見つけるのが課題となります。最近、ラベル付けされる確率が共変量に依存する可能性がある場合に対処するための新しい方法論が導入されました。この論文では、この一般的な仮定の下でPU学習のリスク境界を確立することに関心があります。さらに、ラベルノイズがPU学習に与える影響を、標準的な分類設定と比較して定量化します。最後に、ミニマックスリスクの下限を提供し、上限がほぼ最適であることを証明します。

Concentration analysis of multivariate elliptic diffusions
多変量楕円拡散の濃度解析

We prove concentration inequalities and associated PAC bounds for both continuous- and discrete-time additive functionals for possibly unbounded functions of multivariate, nonreversible diffusion processes. Our analysis relies on an approach via the Poisson equation, allowing us to consider a very broad class of subexponentially ergodic, multivariate diffusion processes. These results add to existing concentration inequalities for additive functionals of diffusion processes, which have so far been available only for either bounded functions or for unbounded functions of processes from a significantly smaller class. We demonstrate the power of these exponential inequalities with two examples from very different areas. Considering a possibly high-dimensional, parametric, nonlinear drift model under sparsity constraints, we apply the continuous-time concentration results to validate the restricted eigenvalue condition for Lasso estimation, which is fundamental for the derivation of oracle inequalities. The results for discrete additive functionals are applied to investigate the unadjusted Langevin MCMC algorithm for sampling from moderately heavy-tailed densities $\pi$. In particular, we provide PAC bounds for the sample Monte Carlo estimator of integrals $\pi(f)$ for polynomially growing functions $f$ that quantify sufficient sample and step sizes for approximation within a prescribed margin with high probability.

私たちは、多変量非可逆拡散過程のおそらく無限関数に対する連続時間と離散時間の両方の加法関数の濃度不等式と関連するPAC境界を証明します。我々の分析は、ポアソン方程式を介したアプローチに依存しており、これにより、非常に広範な亜指数エルゴード多変量拡散過程を考慮できます。これらの結果は、これまでは有界関数または大幅に小さいクラスの過程の無限関数に対してのみ利用可能であった、拡散過程の加法関数の既存の濃度不等式に追加されます。私たちは、非常に異なる分野の2つの例によってこれらの指数不等式の威力を示します。スパース制約の下でおそらく高次元のパラメトリック非線形ドリフトモデルを考慮し、連続時間濃度の結果を適用して、オラクル不等式の導出の基礎となるLasso推定の制限付き固有値条件を検証します。離散加法関数の結果は、中程度に重い裾の密度$\pi$のサンプリングに対する未調整のLangevin MCMCアルゴリズムの調査に適用されます。特に、規定のマージン内で高い確率で近似するのに十分なサンプルとステップサイズを定量化する多項式成長関数$f$の積分$\pi(f)$のサンプルモンテカルロ推定量のPAC境界を提供します。
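
The unadjusted Langevin algorithm mentioned in the abstract can be sketched in a few lines. The Student-t target with 10 degrees of freedom, the step size, the chain length, and the test function $f(x) = x^2$ are all illustrative assumptions rather than settings from the paper. The sketch only shows how a Monte Carlo estimate of $\pi(f)$ is formed from the chain.

```python
import math
import random

NU = 10.0  # degrees of freedom of the illustrative Student-t target (assumption)

def grad_log_density(x):
    """Gradient of log pi(x) for a Student-t density with NU degrees of freedom."""
    return -(NU + 1.0) * x / (NU + x * x)

def ula_estimate(f, n_steps, step, burn_in=1000, seed=0):
    """Average f along an unadjusted Langevin chain (no Metropolis correction)."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for k in range(burn_in + n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Euler discretisation of the Langevin diffusion with step size `step`.
        x = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
        if k >= burn_in:
            total += f(x)
    return total / n_steps

estimate = ula_estimate(lambda x: x * x, n_steps=200_000, step=0.05)
exact = NU / (NU - 2.0)  # variance of the Student-t target, i.e. pi(f) for f(x) = x^2
```

Because there is no accept/reject correction, the chain carries a discretisation bias that shrinks with the step size. The trade-off between the step size and the number of samples is what the bounds in the abstract quantify.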

Knowledge Hypergraph Embedding Meets Relational Algebra
知識ハイパーグラフの埋め込みと関係代数の出会い

Relational databases are a successful model for data storage, and rely on query languages for information retrieval. Most of these query languages are based on relational algebra, a mathematical formalization at the core of relational models. Knowledge graphs are flexible data storage structures that allow for knowledge completion using machine learning techniques. Knowledge hypergraphs generalize knowledge graphs by allowing multi-argument relations. This work studies knowledge hypergraph completion through the lens of relational algebra and its core operations. We explore the space between relational algebra foundations and machine learning techniques for knowledge completion. We investigate whether such methods can capture high-level abstractions in terms of relational algebra operations. We propose a simple embedding-based model called Relational Algebra Embedding (ReAlE) that performs link prediction in knowledge hypergraphs. We show theoretically that ReAlE is fully expressive and can represent the relational algebra operations of renaming, projection, set union, selection, and set difference. We verify experimentally that ReAlE outperforms state-of-the-art models in knowledge hypergraph completion, and in representing each of these primitive relational algebra operations. For the latter experiment, we generate a synthetic knowledge hypergraph, for which we design an algorithm based on the Erdős–Rényi model for generating random graphs.

リレーショナルデータベースは、データストレージの成功したモデルであり、情報検索にはクエリ言語に依存しています。これらのクエリ言語のほとんどは、リレーショナルモデルの核となる数学的形式化であるリレーショナル代数に基づいています。ナレッジグラフは、機械学習技術を使用した知識補完を可能にする柔軟なデータストレージ構造です。ナレッジハイパーグラフは、複数の引数の関係を許可することでナレッジグラフを一般化します。この研究では、リレーショナル代数とそのコア操作の観点からナレッジハイパーグラフの補完について検討します。リレーショナル代数の基礎と知識補完のための機械学習技術の間の空間を探ります。このような方法がリレーショナル代数操作の観点から高レベルの抽象化を捉えられるかどうかを調査します。ナレッジハイパーグラフでリンク予測を実行する、リレーショナル代数埋め込み(ReAlE)と呼ばれる単純な埋め込みベースのモデルを提案します。理論的には、ReAlEは表現力に富み、名前変更、射影、集合の和集合、選択、集合の差というリレーショナル代数操作を表現できることが示されています。私たちは、ReAlEが知識ハイパーグラフの完成と、これらの基本的なリレーショナル代数演算のそれぞれを表現する点で最先端のモデルよりも優れていることを実験的に検証しました。後者の実験では、合成知識ハイパーグラフを生成し、ランダムグラフを生成するためのErdos-R’enyiモデルに基づくアルゴリズムを設計しました。

Intrinsic Gaussian Process on Unknown Manifolds with Probabilistic Metrics
確率的計量による未知多様体上の内因性ガウス過程

This article presents a novel approach to construct Intrinsic Gaussian Processes for regression on unknown manifolds with probabilistic metrics (GPUM) in point clouds. In many real world applications, one often encounters high dimensional data (e.g., ‘point cloud data’) centered around some lower dimensional unknown manifolds. The geometry of a manifold is in general different from the usual Euclidean geometry. Naively applying traditional smoothing methods such as Euclidean Gaussian Processes (GPs) to manifold-valued data, and so ignoring the geometry of the space, can potentially lead to highly misleading predictions and inferences. A manifold embedded in a high dimensional Euclidean space can be well described by a probabilistic mapping function and the corresponding latent space. We investigate the geometrical structure of the unknown manifolds using the Bayesian Gaussian Processes latent variable models (B-GPLVM) and Riemannian geometry. The distribution of the metric tensor is learned using B-GPLVM. The boundary of the resulting manifold is defined based on the uncertainty quantification of the mapping. We use the probabilistic metric tensor to simulate Brownian Motion paths on the unknown manifold. The heat kernel is estimated as the transition density of Brownian Motion and used as the covariance function of GPUM. The applications of GPUM are illustrated in the simulation studies on the Swiss roll, high dimensional real datasets of WiFi signals and image data examples. Its performance is compared with the Graph Laplacian GP, Graph Matérn GP and Euclidean GP.

この記事では、点群における確率的メトリクス(GPUM)を用いた未知の多様体への回帰のための固有ガウス過程を構築する新しいアプローチを紹介します。多くの実世界のアプリケーションでは、より低次元の未知の多様体を中心とした高次元データ(「点群データ」など)に遭遇することがよくあります。多様体の幾何学は、一般に通常のユークリッド幾何学とは異なります。ユークリッドガウス過程(GP)などの従来の平滑化手法を多様体値データに単純に適用し、空間の幾何学を無視すると、非常に誤解を招く予測や推論につながる可能性があります。高次元ユークリッド空間に埋め込まれた多様体は、確率的マッピング関数と対応する潜在空間によって適切に記述できます。ベイジアンガウス過程潜在変数モデル(B-GPLVM)とリーマン幾何学を使用して、未知の多様体の幾何学的構造を調査します。計量テンソルの分布は、B-GPLVMを使用して学習されます。結果として得られる多様体の境界は、マッピングの不確実性の定量化に基づいて定義されます。確率計量テンソルを使用して、未知の多様体上のブラウン運動パスをシミュレートします。熱核はブラウン運動の遷移密度として推定され、GPUMの共分散関数として使用されます。GPUMのアプリケーションは、スイスロール、WiFi信号の高次元の実際のデータセット、および画像データの例に関するシミュレーション研究で説明されています。そのパフォーマンスは、Graph Laplacian GP、Graph Mat\'{e}rn GP、およびEuclidean GPと比較されます。

Sparse Training with Lipschitz Continuous Loss Functions and a Weighted Group L0-norm Constraint
リプシッツ連続損失関数と重み付きグループ L0 ノルム制約によるスパーストレーニング

This paper is motivated by structured sparsity for deep neural network training. We study a weighted group $l_0$-norm constraint, and present the projection and normal cone of this set. Using randomized smoothing, we develop zeroth and first-order algorithms for minimizing a Lipschitz continuous function constrained by any closed set which can be projected onto. Non-asymptotic convergence guarantees are proven in expectation for the proposed algorithms for two related convergence criteria which can be considered as approximate stationary points. Two further methods are given using the proposed algorithms: one with non-asymptotic convergence guarantees in high probability, and the other with asymptotic guarantees to a stationary point almost surely. We believe in particular that these are the first such non-asymptotic convergence results for constrained Lipschitz continuous loss functions.

この論文では、ディープニューラルネットワークのトレーニングのための構造化されたスパース性に動機付けられています。重み付けされたグループ$l_0$-norm制約を検討し、このセットの射影と正規円錐を提示します。ランダム化平滑化を使用して、射影可能な任意の閉集合によって制約されたリプシッツ連続関数を最小化するためのゼロ次および1次アルゴリズムを開発します。非漸近収束の保証は、近似定点と見なすことができる2つの関連する収束基準について、提案されたアルゴリズムの期待で証明されています。提案されたアルゴリズムを使用して、さらに2つの方法が与えられます:1つは高確率で非漸近収束保証、もう1つはほぼ確実に静止点への漸近保証です。特に、これらは制約付きリプシッツ連続損失関数に対する最初の非漸近収束結果であると考えています。

Learning Optimal Group-structured Individualized Treatment Rules with Many Treatments
多数の治療による最適なグループ構造化個別化治療ルールの学習

Data-driven individualized decision making problems have received a lot of attention in recent years. In particular, decision makers aim to determine the optimal Individualized Treatment Rule (ITR) so that the expected specified outcome averaging over heterogeneous patient-specific characteristics is maximized. Many existing methods deal with binary or a moderate number of treatment arms and may not take potential treatment effect structure into account. However, the effectiveness of these methods may deteriorate when the number of treatment arms becomes large. In this article, we propose GRoup Outcome Weighted Learning (GROWL) to estimate the latent structure in the treatment space and the optimal group-structured ITRs through a single optimization. In particular, for estimating group-structured ITRs, we utilize the Reinforced Angle based Multicategory Support Vector Machines (RAMSVM) to learn group-based decision rules under the weighted angle based multi-class classification framework. Fisher consistency, the excess risk bound, and the convergence rate of the value function are established to provide a theoretical guarantee for GROWL. Extensive empirical results in simulation studies and real data analysis demonstrate that GROWL enjoys better performance than several other existing methods.

データ駆動型の個別意思決定問題は、近年多くの注目を集めています。特に、意思決定者は、異質な患者固有の特性を平均した期待される特定の結果が最大化されるように、最適な個別治療ルール(ITR)を決定することを目指しています。既存の方法の多くは、2種類または中程度の数の治療アームを扱っており、潜在的な治療効果構造を考慮していない可能性があります。ただし、治療アームの数が多くなると、これらの方法の有効性が低下する可能性があります。この記事では、治療空間の潜在構造と最適なグループ構造ITRを1回の最適化で推定するグループ結果加重学習(GROWL)を提案します。特に、グループ構造ITRを推定するために、強化角度ベースのマルチカテゴリサポートベクターマシン(RAMSVM)を使用して、加重角度ベースのマルチクラス分類フレームワークの下でグループベースの意思決定ルールを学習します。フィッシャー一貫性、過剰リスク境界、および価値関数の収束率は、GROWLの理論的保証を提供するために確立されています。シミュレーション研究と実際のデータ分析における広範な実験結果により、GROWLは他のいくつかの既存の方法よりも優れたパフォーマンスを発揮することが実証されています。

Inference for Gaussian Processes with Matérn Covariogram on Compact Riemannian Manifolds
コンパクトリーマン多様体上のマターンコバリオグラムによるガウス過程の推論

Gaussian processes are widely employed as versatile modelling and predictive tools in spatial statistics, functional data analysis, computer modelling and diverse applications of machine learning. They have been widely studied over Euclidean spaces, where they are specified using covariance functions or covariograms for modelling complex dependencies. There is a growing literature on Gaussian processes over Riemannian manifolds in order to develop richer and more flexible inferential frameworks for non-Euclidean data. While numerical approximations through graph representations have been well studied for the Matérn covariogram and heat kernel, the behaviour of asymptotic inference on the parameters of the covariogram has received relatively scant attention. We focus on asymptotic behaviour for Gaussian processes constructed over compact Riemannian manifolds. Building upon a recently introduced Matérn covariogram on a compact Riemannian manifold, we employ formal notions and conditions for the equivalence of two Matérn Gaussian random measures on compact manifolds to derive the parameter that is identifiable, also known as the microergodic parameter, and formally establish the consistency of the maximum likelihood estimate and the asymptotic optimality of the best linear unbiased predictor. The circle is studied as a specific example of compact Riemannian manifolds with numerical experiments to illustrate and corroborate the theory.

ガウス過程は、空間統計、機能データ解析、コンピュータモデリング、機械学習のさまざまなアプリケーションにおいて、多目的モデリングおよび予測ツールとして広く使用されています。ユークリッド空間上で広く研究されており、複雑な依存関係をモデリングするために共分散関数またはコバリオグラムを使用して指定されます。非ユークリッドデータ用のより豊富で柔軟な推論フレームワークを開発するために、リーマン多様体上のガウス過程に関する文献が増えています。グラフ表現による数値近似は、マターンコバリオグラムとヒートカーネルに対して十分に研究されてきましたが、コバリオグラムのパラメータに対する漸近推論の挙動は比較的注目されていません。私たちは、コンパクトなリーマン多様体上に構築されたガウス過程の漸近挙動に焦点を当てています。コンパクトリーマン多様体上の最近導入されたマターンコバリオグラムを基に、コンパクト多様体上の2つのマターンガウスランダム測度の同値性に関する形式的な概念と条件を使用して、識別可能なパラメーター(ミクロエルゴードパラメーターとも呼ばれる)を導出し、最大尤度推定値の一貫性と最良の線形不偏予測子の漸近最適性を形式的に確立します。円はコンパクトリーマン多様体の具体的な例として研究され、数値実験によって理論を説明および裏付けます。

FedLab: A Flexible Federated Learning Framework
FedLab:柔軟な連合学習フレームワーク

FedLab is a lightweight open-source framework for the simulation of federated learning. The design of FedLab focuses on federated learning algorithm effectiveness and communication efficiency. It allows customization on server optimization, client optimization, communication agreement, and communication compression. Also, FedLab is scalable in different deployment scenarios with different computation and communication resources. We hope FedLab could provide flexible APIs as well as reliable baseline implementations and relieve the burden of implementing novel approaches for researchers in the FL community. The source code, tutorial, and documentation can be found at https://github.com/SMILELab-FL/FedLab.

FedLabは、フェデレーテッドラーニングのシミュレーションのための軽量なオープンソースフレームワークです。FedLabの設計は、フェデレーテッドラーニングアルゴリズムの有効性と通信効率に焦点を当てています。これにより、サーバーの最適化、クライアントの最適化、通信契約、および通信圧縮のカスタマイズが可能になります。また、FedLabは、さまざまなコンピューティングリソースと通信リソースを使用するさまざまな展開シナリオでスケーラブルです。FedLabが柔軟なAPIと信頼性の高いベースライン実装を提供し、FLコミュニティの研究者が新しいアプローチを実装する負担を軽減できることを願っています。ソースコード、チュートリアル、およびドキュメントは、https://github.com/SMILELab-FL/FedLabにあります。

Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity
コネクティビティの問題:効果的なスパース性のレンズを通して刈り込むニューラルネットワーク

Neural network pruning is a fruitful area of research with surging interest in high sparsity regimes. Benchmarking in this domain heavily relies on faithful representation of the sparsity of subnetworks, which has been traditionally computed as the fraction of removed connections (direct sparsity). This definition, however, fails to recognize unpruned parameters that have become detached from the input or output layers of the underlying subnetworks, potentially underestimating actual effective sparsity: the fraction of inactivated connections. While this effect might be negligible for moderately pruned networks (compression rates of up to 10–100), we find that it plays an increasing role for sparser subnetworks, greatly distorting comparison between different pruning algorithms. For example, we show that effective compression of a randomly pruned LeNet-300-100 can be orders of magnitude larger than its direct counterpart, while no discrepancy is ever observed when using SynFlow for pruning (Tanaka et al., 2020). In this work, we adopt the lens of effective sparsity to reevaluate several recent pruning algorithms on common benchmark architectures (e.g., LeNet-300-100, VGG-19, ResNet-18) and discover that their absolute and relative performance changes dramatically in this new and, as we argue, more appropriate framework. To aim for effective, rather than direct, sparsity, we develop a low-cost extension to most pruning algorithms. Further, equipped with effective sparsity as a reference frame, we partially reconfirm that random pruning with appropriate sparsity allocation across layers performs as well as or better than more sophisticated algorithms for pruning at initialization (Su et al., 2020). In response to this observation, using an analogy of pressure distribution in coupled cylinders from thermodynamics, we design novel layerwise sparsity quotas that outperform all existing baselines in the context of random pruning.

ニューラルネットワークの剪定は、高スパース性領域への関心が高まっている、実りある研究分野です。この分野のベンチマークは、サブネットワークのスパース性の忠実な表現に大きく依存しており、これは従来、削除された接続の割合(直接スパース性)として計算されてきました。ただし、この定義では、基礎となるサブネットワークの入力層または出力層から切り離された剪定されていないパラメーターを認識できず、実際の有効スパース性、つまり非アクティブ化された接続の割合が過小評価される可能性があります。この影響は、中程度に剪定されたネットワーク(最大10～100の圧縮率)では無視できるかもしれませんが、よりスパースなサブネットワークではその影響が大きくなり、異なる剪定アルゴリズム間の比較が大きく歪むことがわかりました。たとえば、ランダムに剪定されたLeNet-300-100の有効圧縮は、直接の対応物よりも桁違いに大きくなる可能性がある一方で、剪定にSynFlowを使用した場合には矛盾がまったく観察されないことを示しています(Tanakaら, 2020)。この研究では、有効スパース性のレンズを採用して、一般的なベンチマークアーキテクチャ(LeNet-300-100、VGG-19、ResNet-18など)でいくつかの最近の剪定アルゴリズムを再評価し、この新しい、そして私たちが主張するようにより適切なフレームワークでは、それらの絶対的および相対的なパフォーマンスが劇的に変化することを発見しました。直接的なスパース性ではなく、有効なスパース性を目指すために、ほとんどの剪定アルゴリズムに対する低コストの拡張機能を開発しました。さらに、参照フレームとして有効スパース性を備え、レイヤー間で適切なスパース性割り当てを行うランダム剪定は、初期化時の剪定のより洗練されたアルゴリズムと同等かそれ以上のパフォーマンスを発揮することを部分的に再確認しました(Suら, 2020)。この観察に応えて、熱力学からの結合シリンダー内の圧力分布の類推を使用して、ランダムプルーニングのコンテキストで既存のすべてのベースラインを上回る新しいレイヤーワイズスパースクォータを設計します。
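
The distinction between direct and effective sparsity can be made concrete with a small sketch for a fully connected network. The mask layout (`masks[l][i][j]` keeps the weight from unit `i` of layer `l` to unit `j` of layer `l + 1`) and the function names are assumptions for illustration, and this is not the paper's implementation. A kept connection counts as effective only if it lies on some path from an input unit to an output unit.

```python
def effective_masks(masks):
    """Zero out kept connections that are detached from the input or the output."""
    # Forward pass: which units can be reached from the input layer?
    fwd = [[True] * len(masks[0])]
    for m in masks:
        prev = fwd[-1]
        fwd.append([any(prev[i] and m[i][j] for i in range(len(m)))
                    for j in range(len(m[0]))])
    # Backward pass: which units can still reach the output layer?
    bwd = [[True] * len(masks[-1][0])]
    for m in reversed(masks):
        nxt = bwd[0]
        bwd.insert(0, [any(m[i][j] and nxt[j] for j in range(len(nxt)))
                       for i in range(len(m))])
    return [[[int(bool(m[i][j]) and fwd[l][i] and bwd[l + 1][j])
              for j in range(len(m[0]))]
             for i in range(len(m))]
            for l, m in enumerate(masks)]

def sparsity(masks):
    """Fraction of connections that are switched off."""
    total = sum(len(m) * len(m[0]) for m in masks)
    kept = sum(sum(sum(row) for row in m) for m in masks)
    return 1.0 - kept / total

# Hypothetical 2-2-1 network: input 0 feeds hidden 0, but only hidden 1 feeds the output.
masks = [
    [[1, 0], [0, 0]],  # input -> hidden
    [[0], [1]],        # hidden -> output
]
direct = sparsity(masks)
effective = sparsity(effective_masks(masks))
```

In this toy example two of six weights survive pruning, so the direct sparsity is 2/3, yet neither surviving weight lies on an input-to-output path and the effective sparsity is 1.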

An Analysis of Robustness of Non-Lipschitz Networks
非リプシッツネットワークのロバスト性の解析

Despite significant advances, deep networks remain highly susceptible to adversarial attack. One fundamental challenge is that small input perturbations can often produce large movements in the network’s final-layer feature space. In this paper, we define an attack model that abstracts this challenge, to help understand its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space but only in random low-dimensional subspaces. We prove such adversaries can be quite powerful: defeating any algorithm that must classify any input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize over accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor style algorithms, and also have application to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, where entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.

大きな進歩にもかかわらず、ディープネットワークは依然として敵対的攻撃に対して非常に脆弱です。基本的な課題の1つは、小さな入力の変動がネットワークの最終層の特徴空間に大きな動きをもたらすことが多いことです。この論文では、この課題を抽象化した攻撃モデルを定義し、その本質的な特性を理解しやすくします。このモデルでは、敵対者は特徴空間内で任意の距離にデータを移動できますが、移動できるのはランダムな低次元サブスペースのみです。このような敵対者は非常に強力であり、与えられた入力を分類しなければならないアルゴリズムをすべて打ち負かすことができることを証明します。ただし、アルゴリズムが異常な入力を棄権できるようにすることで、特徴空間でクラスが十分に分離されている場合、このような敵対者を克服できることを示します。さらに、データ駆動型方法を使用して、精度と棄権のトレードオフを最適化するためにアルゴリズムパラメータを設定するための強力な理論的保証を提供します。この結果は、最近傍スタイルのアルゴリズムに新しい堅牢性保証を提供し、また対照学習にも応用できます。対照学習では、このようなアルゴリズムが低い棄権率で高い堅牢な精度を実現できることを実証しています。私たちのモデルは、分類対象のエンティティが観察可能な特徴を操作して好ましい分類を作成することを目的とする戦略的分類にも基づいており、その分野にも新たな洞察を提供します。

Fitting Autoregressive Graph Generative Models through Maximum Likelihood Estimation
最尤推定による自己回帰グラフ生成モデルのフィッティング

We consider the problem of fitting autoregressive graph generative models via maximum likelihood estimation (MLE). MLE is intractable for graph autoregressive models because the nodes in a graph can be arbitrarily reordered; thus the exact likelihood involves a sum over all possible node orders leading to the same graph. In this work, we fit the graph models by maximizing a variational bound, which is built by first deriving the joint probability over the graph and the node order of the autoregressive process. This approach avoids the need to specify ad-hoc node orders, since an inference network learns the most likely node sequences that have generated a given graph. We improve the approach by developing a graph generative model based on attention mechanisms and an inference network based on routing search. We demonstrate empirically that fitting autoregressive graph models via variational inference improves their qualitative and quantitative performance, and the improved model and inference network further boost the performance.

私たちは、最大尤度推定(MLE)による自己回帰グラフ生成モデルのフィッティングの問題を考察します。グラフ内のノードは任意に並べ替えられるため、MLEはグラフ自己回帰モデルには扱いにくい。したがって、正確な尤度には、同じグラフにつながるすべての可能なノード順序の合計が含まれます。この研究では、まずグラフ全体と自己回帰プロセスのノード順序の結合確率を導出することによって構築される変分境界を最大化することでグラフモデルをフィッティングします。このアプローチでは、推論ネットワークが特定のグラフを生成した最も可能性の高いノードシーケンスを学習するため、アドホックノード順序を指定する必要がなくなります。私たちは、注意メカニズムに基づくグラフ生成モデルとルーティング検索に基づく推論ネットワークを開発することで、このアプローチを改良します。変分推論による自己回帰グラフモデルのフィッティングにより、定性的および定量的なパフォーマンスが向上し、改良されたモデルと推論ネットワークによってパフォーマンスがさらに向上することを実証します。

Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization
ロバストな行列回復のためのサブグラジエント法のグローバル収束:小さな初期化、ノイズの多い測定、および過剰なパラメータ化

In this work, we study the performance of the sub-gradient method (SubGM) on a natural nonconvex and nonsmooth formulation of low-rank matrix recovery with $\ell_1$-loss, where the goal is to recover a low-rank matrix from a limited number of measurements, a subset of which may be grossly corrupted with noise. We study a scenario where the rank of the true solution is unknown and over-estimated instead. The over-estimation of the rank gives rise to an over-parameterized model in which there are more degrees of freedom than needed. Such over-parameterization may lead to overfitting, or adversely affect the performance of the algorithm. We prove that a simple SubGM with small initialization is agnostic to both over-parameterization and noise in the measurements. In particular, we show that small initialization nullifies the effect of over-parameterization on the performance of SubGM, leading to an exponential improvement in its convergence rate. Moreover, we provide the first unifying framework for analyzing the behavior of SubGM under both outlier and Gaussian noise models, showing that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values, and, perhaps surprisingly, even if the globally optimal solutions do not correspond to the ground truth. At the core of our results is a robust variant of the restricted isometry property, called Sign-RIP, which controls the deviation of the sub-differential of the $\ell_1$-loss from that of an ideal, expected loss. As a byproduct of our results, we consider a subclass of robust low-rank matrix recovery with Gaussian measurements, and show that the number of required samples to guarantee the global convergence of SubGM is independent of the over-parameterized rank.

この研究では、$\ell_1$損失を伴う低ランク行列回復の自然な非凸かつ非滑らかな定式化に対するサブ勾配法(SubGM)のパフォーマンスを調査します。ここでの目標は、限られた数の測定値から低ランク行列を回復することです。測定値のサブセットは、ノイズによって大幅に破損している可能性があります。真の解のランクが不明で、代わりに過大評価されているシナリオを調査します。ランクの過大評価により、必要以上に自由度が高い過剰パラメータ化モデルが生成されます。このような過剰パラメータ化は、過剰適合につながるか、アルゴリズムのパフォーマンスに悪影響を与える可能性があります。初期化が小さい単純なSubGMは、過剰パラメータ化と測定値のノイズの両方に依存しないことを証明します。特に、初期化が小さいと、過剰パラメータ化がSubGMのパフォーマンスに与える影響がなくなり、収束率が指数関数的に向上することを示します。さらに、外れ値モデルとガウスノイズモデルの両方におけるSubGMの挙動を解析するための最初の統一フレームワークを提供し、任意の大きさと密度のノイズ値であっても、またおそらく驚くべきことに、グローバル最適解が真の解と一致しない場合でも、SubGMが真の解に収束することを示しています。私たちの結果の核心は、Sign-RIPと呼ばれる制限付き等長特性の堅牢な変種であり、これは$\ell_1$損失のサブ微分と理想的な期待損失のサブ微分との偏差を制御します。私たちの結果の副産物として、ガウス測定による堅牢な低ランク行列回復のサブクラスを検討し、SubGMのグローバル収束を保証するために必要なサンプル数は、過剰パラメータ化されたランクとは無関係であることを示します。

Statistical Inference for Noisy Incomplete Binary Matrix
ノイズの多い不完全バイナリ行列の統計的推論

We consider statistical inference for a noisy incomplete binary (or 1-bit) matrix. Despite the importance of uncertainty quantification to matrix completion, most of the categorical matrix completion literature focuses on point estimation and prediction. This paper moves one step further toward statistical inference for binary matrix completion. Under a popular nonlinear factor analysis model, we obtain a point estimator and derive its asymptotic normality. Moreover, our analysis adopts a flexible missing-entry design that does not require a random sampling scheme as required by most of the existing asymptotic results for matrix completion. Under reasonable conditions, the proposed estimator is statistically efficient and optimal in the sense that the Cramér-Rao lower bound is achieved asymptotically for the model parameters. Two applications are considered, including (1) linking two forms of an educational test and (2) linking the roll call voting records from multiple years in the United States Senate. The first application enables the comparison between examinees who took different test forms, and the second application allows us to compare the liberal-conservativeness of senators who did not serve in the Senate at the same time.

私たちは、ノイズの多い不完全なバイナリ（または1ビット）行列の統計的推論について考察します。行列補完における不確実性の定量化の重要性にもかかわらず、カテゴリカル行列補完に関する文献のほとんどは、点推定と予測に焦点を当てています。この論文では、バイナリ行列補完の統計的推論に向けてさらに一歩前進します。一般的な非線形因子分析モデルの下で、点推定量を取得し、その漸近正規性を導出します。さらに、私たちの分析では、行列補完の既存の漸近結果のほとんどで必要とされるランダムサンプリング方式を必要としない柔軟な欠損エントリ設計を採用しています。妥当な条件下では、提案された推定量は、モデルパラメータに対してCramer-Rao下限が漸近的に達成されるという意味で、統計的に効率的かつ最適です。2つのアプリケーションが検討されており、(1) 2つの形式の教育テストのリンクと(2)米国上院での複数年の点呼投票記録のリンクです。最初のアプリケーションでは、異なるテスト形式を受けた受験者間の比較が可能になり、2番目のアプリケーションでは、同時に上院に所属していなかった上院議員のリベラル派と保守派を比較できるようになります。

Faith-Shap: The Faithful Shapley Interaction Index
フェイス・シャップ：フェイスフル・シャップリー相互作用指数

Shapley values, which were originally designed to assign attributions to individual players in coalition games, have become a commonly used approach in explainable machine learning to provide attributions to input features for black-box machine learning models. A key attraction of Shapley values is that they uniquely satisfy a very natural set of axiomatic properties. However, extending the Shapley value to assign attributions to interactions rather than individual players, yielding an interaction index, is non-trivial, as the natural set of axioms for the original Shapley values, extended to the context of interactions, no longer specifies a unique interaction index. Many proposals thus introduce additional, possibly stringent axioms, while sacrificing the key axiom of efficiency, in order to obtain unique interaction indices. In this work, rather than introduce additional conflicting axioms, we adopt the viewpoint of Shapley values as coefficients of the most faithful linear approximation to the pseudo-Boolean coalition game value function. By extending linear to higher order polynomial approximations, we can then define the general family of faithful interaction indices. We show that by additionally requiring the faithful interaction indices to satisfy interaction-extensions of the standard individual Shapley axioms (dummy, symmetry, linearity, and efficiency), we obtain a unique Faithful Shapley Interaction index, which we denote Faith-Shap, as a natural generalization of the Shapley value to interactions. We then provide some illustrative contrasts of Faith-Shap with previously proposed interaction indices, and further investigate some of its interesting algebraic properties. We further show the computational efficiency of computing Faith-Shap, together with some additional qualitative insights, via some illustrative experiments.

シャプレー値は、もともと連合ゲームで個々のプレイヤーに属性を割り当てるために設計されたものですが、ブラックボックス機械学習モデルの入力機能に属性を提供する説明可能な機械学習で一般的に使用されるアプローチになっています。シャプレー値の主な魅力は、非常に自然な公理的特性セットを一意に満たすことです。ただし、シャプレー値を拡張して、個々のプレイヤーではなく相互作用に属性を割り当てる相互作用インデックスは簡単ではありません。相互作用のコンテキストに拡張された元のシャプレー値の自然な公理セットは、もはや一意の相互作用インデックスを指定しないためです。そのため、多くの提案では、一意の相互作用インデックスを取得するために、効率という重要な公理を犠牲にして、追加のおそらく厳格な公理を導入しています。この研究では、追加の矛盾する公理を導入するのではなく、疑似ブール連合ゲーム値関数への最も忠実な線形近似の係数としてのシャプレー値という観点を採用します。線形近似を高次の多項式近似に拡張することで、忠実な相互作用指標の一般的なファミリーを定義できます。忠実な相互作用指標に、標準の個々のシャプレー公理(ダミー、対称性、線形性、効率性)の相互作用拡張を満たすことをさらに要求することで、シャプレー値の相互作用への自然な一般化として、一意の忠実なシャプレー相互作用指標(Faith-Shapと表記)が得られることを示します。次に、以前に提案された相互作用指標とFaith-Shapのいくつかの例示的な対比を示し、その興味深い代数的特性のいくつかをさらに調査します。さらに、いくつかの例示的な実験を通じて、Faith-Shapを計算する際の計算効率と、いくつかの追加の定性的な洞察を示します。
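
As background for the axioms named in the abstract, the sketch below computes ordinary (individual) Shapley values exactly by enumerating player orderings. It does not compute Faith-Shap or any interaction index. The three-player game `game`, with a pairwise term between players 0 and 1 and a dummy player 2, is a made-up example.

```python
from itertools import permutations

def shapley_values(value, n_players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n_players
    orders = list(permutations(range(n_players)))
    for order in orders:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            phi[player] += value(with_player) - value(coalition)
            coalition = with_player
    return [p / len(orders) for p in phi]

def game(coalition):
    """Hypothetical game: v(S) = x0 + x1 + 2 * x0 * x1, where player 2 adds nothing."""
    x0, x1 = 0 in coalition, 1 in coalition
    return x0 + x1 + 2 * (x0 and x1)

phi = shapley_values(game, 3)
```

Here the interaction term of size 2 is split evenly between players 0 and 1, so the individual attributions are `[2.0, 2.0, 0.0]`. They sum to the value of the full coalition (efficiency) and give the dummy player zero, but they cannot show that the pairwise term exists at all, which is the gap an interaction index is meant to fill.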

Decentralized Learning: Theoretical Optimality and Practical Improvements
分散型学習:理論的最適性と実践的改善

Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction that this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithmic gap. While a simple version of DeTAG with plain SGD and constant step size suffices for achieving theoretical limits, we additionally provide a convergence bound for DeTAG under general non-increasing step size and momentum. Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. We substantiate our theory and show DeTAG converges faster on unshuffled data and in sparse networks. Furthermore, we study a DeTAG variant, DeTAG*, that practically speeds up data-center-scale model training. This manuscript provides extended contents to its ICML version.

分散化は、並列機械学習システムをスケールアップする有望な方法です。この論文では、確率的非凸設定におけるこのような方法の反復計算量の厳密な下限を示します。この下限は、D-PSGDなどの既存の多くの分散トレーニングアルゴリズムの既知の収束率の理論的なギャップを明らかにします。構築により、この下限が厳密で達成可能であることを証明します。この洞察に動機付けられて、対数ギャップのみで下限を達成する実用的なゴシップスタイルの分散アルゴリズムであるDeTAGをさらに提案します。プレーンなSGDと一定のステップサイズを使用したDeTAGの単純なバージョンは理論的な限界を達成するのに十分ですが、一般的な非増加ステップサイズとモメンタムの下でのDeTAGの収束境界をさらに提供します。経験的に、CIFAR10/100やImageNetなどの複数のビジョンベンチマークでDeTAGを他の分散アルゴリズムと比較します。私たちは理論を実証し、DeTAGがシャッフルされていないデータとスパースネットワークでより速く収束することを示します。さらに、データセンター規模のモデルトレーニングを実質的に高速化するDeTAGのバリアントであるDeTAG*を研究します。この原稿は、ICMLバージョンの拡張コンテンツを提供します。
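
A gossip-style decentralized update of the kind the abstract refers to can be sketched as follows. This is a D-PSGD-style step (average with neighbours, then take a local gradient step), not DeTAG itself. The quadratic local objectives, the four-worker ring, the mixing weights, and the use of exact rather than stochastic gradients are simplifying assumptions chosen so the behaviour is easy to check.

```python
def gossip_sgd(centers, mixing, lr=0.1, n_rounds=500):
    """Each worker i minimises 0.5 * (x - centers[i])**2 using only neighbour averages."""
    n = len(centers)
    x = [0.0] * n
    for _ in range(n_rounds):
        # Local gradients, evaluated at each worker's own model.
        grads = [x[i] - centers[i] for i in range(n)]
        # Gossip step: weighted average with neighbours (doubly stochastic matrix).
        mixed = [sum(mixing[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [mixed[i] - lr * grads[i] for i in range(n)]
    return x

# Heterogeneous local optima, loosely mimicking unshuffled data across workers.
centers = [0.0, 1.0, 2.0, 5.0]
w = 1.0 / 3.0
ring = [
    [w, w, 0.0, w],
    [w, w, w, 0.0],
    [0.0, w, w, w],
    [w, 0.0, w, w],
]
models = gossip_sgd(centers, ring)
average_model = sum(models) / len(models)
spread = max(models) - min(models)
```

With a constant step size the average of the workers' models reaches the global minimiser (the mean of `centers`), while the individual models stay apart because the local objectives differ. Sparser communication graphs mix more slowly, which is the regime where the abstract reports gains.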

Non-Asymptotic Guarantees for Robust Statistical Learning under Infinite Variance Assumption
無限分散仮定下でのロバスト統計学習のための非漸近保証

There has been a surge of interest in developing robust estimators for models with heavy-tailed and bounded variance data in statistics and machine learning, while few works impose unbounded variance. This paper proposes two types of robust estimators, the ridge log-truncated M-estimator and the elastic net log-truncated M-estimator. The first estimator is applied to convex regressions such as quantile regression and generalized linear models, while the other one is applied to high dimensional non-convex learning problems such as regressions via deep neural networks. Simulations and real data analysis demonstrate the robustness of log-truncated estimations over standard estimations.

統計学や機械学習では、ヘビーテール分散データと有界分散データを持つモデルのためのロバストな推定量の開発に関心が高まっている一方で、非有界分散を課す研究はほとんどありません。この論文では、リッジ対数切り捨てM推定量と弾性ネット対数切り捨てM推定量の2種類のロバスト推定量を提案します。最初の推定量は、分位点回帰や一般化線形モデルなどの凸回帰に適用され、もう1つは、ディープニューラルネットワークを介した回帰などの高次元の非凸学習問題に適用されます。シミュレーションと実際のデータ解析により、標準推定値に対する対数切り捨て推定値のロバスト性が実証されています。

Recursive Quantile Estimation: Non-Asymptotic Confidence Bounds
再帰的分位点推定: 非漸近信頼限界

This paper considers the recursive estimation of quantiles using the stochastic gradient descent (SGD) algorithm with Polyak-Ruppert averaging. The algorithm offers a computationally and memory efficient alternative to the usual empirical estimator. Our focus is on studying the non-asymptotic behavior by providing exponentially decreasing tail probability bounds under mild assumptions on the smoothness of the density functions. This novel non-asymptotic result is based on a bound of the moment generating function of the SGD estimate. We apply our result to the problem of best arm identification in a multi-armed stochastic bandit setting under quantile preferences.

この論文では、確率的勾配降下法(SGD)アルゴリズムとPolyak-Ruppert平均法を使用した分位数の再帰的推定について考察します。このアルゴリズムは、通常の経験的推定量に代わる、計算効率とメモリ効率の高い代替手段を提供します。私たちの焦点は、密度関数の滑らかさに関する穏やかな仮定の下で指数関数的に減少するテール確率境界を提供することにより、非漸近的な振る舞いを研究することです。この新しい非漸近的な結果は、SGD推定のモーメント生成関数の境界に基づいています。結果を、分位点の選好の下での多腕確率的バンディット設定における最良の腕の識別の問題に適用します。
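The recursion described in this abstract can be sketched in a few lines. The step-size schedule `c / n**a`, the starting point 0, and the function name `recursive_quantile` are assumptions for illustration; the paper's exact conditions are not stated in the abstract.

```python
import random


def recursive_quantile(stream, tau, c=1.0, a=0.75):
    """Estimate the tau-quantile of a stream with SGD on the pinball loss,
    returning the Polyak-Ruppert average of the iterates."""
    theta = 0.0      # current SGD iterate
    theta_bar = 0.0  # running average of the iterates
    for n, x in enumerate(stream, start=1):
        step = c / n ** a
        # Stochastic gradient of the pinball loss at theta: 1{x <= theta} - tau.
        grad = (1.0 if x <= theta else 0.0) - tau
        theta -= step * grad
        # Online mean update, so memory use stays constant.
        theta_bar += (theta - theta_bar) / n
    return theta_bar


rng = random.Random(0)
median_estimate = recursive_quantile((rng.random() for _ in range(20000)), tau=0.5)
```

Each observation is used once and then discarded, which is why the method needs far less memory than sorting the full sample for the empirical quantile.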

Outlier-Robust Subsampling Techniques for Persistent Homology
持続ホモロジーのための外れ値ロバストなサブサンプリング手法

In recent years, persistent homology has been successfully applied to real-world data in many different settings. Despite significant computational advances, persistent homology algorithms do not yet scale to large datasets preventing interesting applications. One approach to address computational issues posed by persistent homology is to select a set of landmarks by subsampling from the data. Currently, these landmark points are chosen either at random or using the maxmin algorithm. Neither is ideal as random selection tends to favour dense areas of the data while the maxmin algorithm is very sensitive to noise. Here, we propose a novel approach to select landmarks specifically for persistent homology that preserves coarse topological information of the original dataset. Our method is motivated by the Mayer-Vietoris sequence and requires only local persistent homology calculations thus enabling efficient computation. We test our landmarks on artificial data sets which contain different levels of noise and compare them to standard landmark selection techniques. We demonstrate that our landmark selection outperforms standard methods as well as a subsampling technique based on an outlier-robust version of the k-means algorithm for low sampling densities in noisy data with respect to robustness to outliers.

近年、パーシステントホモロジーはさまざまな設定で実世界のデータにうまく適用されてきました。計算能力は大きく進歩しましたが、パーシステントホモロジーアルゴリズムはまだ大規模なデータセットに拡張できず、興味深いアプリケーションを妨げています。パーシステントホモロジーによって生じる計算上の問題に対処する1つの方法は、データからサブサンプリングしてランドマークのセットを選択することです。現在、これらのランドマークポイントはランダムに選択されるか、maxminアルゴリズムを使用して選択されます。ランダム選択はデータの密な領域を優先する傾向があり、maxminアルゴリズムはノイズに非常に敏感であるため、どちらも理想的ではありません。ここでは、元のデータセットの粗いトポロジ情報を保持しながら、パーシステントホモロジー専用のランドマークを選択する新しい方法を提案します。この方法はMayer-Vietorisシーケンスに触発されており、ローカルなパーシステントホモロジー計算のみを必要とするため、効率的な計算が可能です。さまざまなレベルのノイズを含む人工データセットでランドマークをテストし、標準的なランドマーク選択手法と比較します。外れ値に対する堅牢性に関して、ノイズの多いデータ内の低サンプリング密度に対するランドマーク選択は、標準的な方法や、外れ値に堅牢なバージョンのk-meansアルゴリズムに基づくサブサンプリング手法よりも優れていることを実証します。
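For reference, the maxmin baseline mentioned in the abstract (also known as farthest-point sampling) can be sketched as below. This is the standard baseline, not the landmark selection method proposed in the paper; the function name and the toy point set are assumptions.

```python
import math


def maxmin_landmarks(points, k, start=0):
    """Greedy maxmin selection: repeatedly add the point that is farthest
    from the landmarks chosen so far. Returns the selected indices."""
    landmarks = [start]
    # dist_to_set[i] = distance from point i to its nearest landmark.
    dist_to_set = [math.dist(p, points[start]) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=lambda i: dist_to_set[i])
        landmarks.append(nxt)
        for i, p in enumerate(points):
            dist_to_set[i] = min(dist_to_set[i], math.dist(p, points[nxt]))
    return landmarks


# Four clustered points plus one far outlier at index 4.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 1.0), (50.0, 50.0)]
chosen = maxmin_landmarks(pts, k=3)
```

The outlier is picked as the very first additional landmark, which illustrates the noise sensitivity of maxmin that the abstract points out.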

Neural Operator: Learning Maps Between Function Spaces With Applications to PDEs
ニューラルオペレーター: 関数空間間の写像の学習と偏微分方程式への応用

The classical development of neural networks has primarily focused on learning mappings between finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks to learn operators, termed neural operators, that map between infinite dimensional function spaces. We formulate the neural operator as a composition of linear integral operators and nonlinear activation functions. We prove a universal approximation theorem for our proposed neural operator, showing that it can approximate any given nonlinear continuous operator. The proposed neural operators are also discretization-invariant, i.e., they share the same model parameters among different discretization of the underlying function spaces. Furthermore, we introduce four classes of efficient parameterization, viz., graph neural operators, multi-pole graph neural operators, low-rank neural operators, and Fourier neural operators. An important application for neural operators is learning surrogate maps for the solution operators of partial differential equations (PDEs). We consider standard PDEs such as the Burgers, Darcy subsurface flow, and the Navier-Stokes equations, and show that the proposed neural operators have superior performance compared to existing machine learning based methodologies, while being several orders of magnitude faster than conventional PDE solvers.

ニューラルネットワークの従来の開発は、主に有限次元ユークリッド空間または有限集合間のマッピングの学習に重点を置いてきました。私たちは、ニューラルネットワークを一般化して、無限次元関数空間間をマッピングするニューラルオペレータと呼ばれるオペレータを学習することを提案します。私たちは、ニューラルオペレータを線形積分オペレータと非線形活性化関数の組み合わせとして定式化します。私たちは、提案するニューラルオペレータの普遍近似定理を証明し、任意の非線形連続オペレータを近似できることを示します。提案するニューラルオペレータは離散化不変でもあります。つまり、基になる関数空間の異なる離散化間で同じモデルパラメータを共有します。さらに、グラフニューラルオペレータ、多極グラフニューラルオペレータ、低ランクニューラルオペレータ、およびフーリエニューラルオペレータという4つの効率的なパラメータ化クラスを導入します。ニューラルオペレータの重要な用途は、偏微分方程式(PDE)の解オペレータの代理マップの学習です。バーガース方程式、ダルシー地下流方程式、ナビエ・ストークス方程式などの標準的なPDEを考慮し、提案されたニューラル演算子が既存の機械学習ベースの方法論と比較して優れたパフォーマンスを発揮し、従来のPDEソルバーよりも数桁高速であることを示します。
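A single layer of the form "linear integral operator followed by a nonlinearity" can be sketched as follows. The Gaussian kernel, the scalar weight `w`, the domain [0, 1], and the midpoint quadrature are all assumptions chosen for illustration; in an actual neural operator the kernel is a learned, parameterized function.

```python
import math


def kernel(x, y):
    # Hypothetical fixed kernel; a neural operator would learn this.
    return math.exp(-(x - y) ** 2)


def integral_layer(v, x, n, w=0.5):
    """Evaluate tanh(w * v(x) + integral over [0, 1] of kernel(x, y) * v(y) dy)
    at the point x, using an n-point midpoint rule for the integral."""
    h = 1.0 / n
    integral = sum(
        kernel(x, (j + 0.5) * h) * v((j + 0.5) * h) for j in range(n)
    ) * h
    return math.tanh(w * v(x) + integral)


def v(y):
    # Input function to the layer.
    return math.sin(2 * math.pi * y)


coarse = integral_layer(v, 0.3, n=50)
fine = integral_layer(v, 0.3, n=200)
```

The same parameters (`kernel`, `w`) are used on both grids and the two outputs nearly coincide, which is a small-scale picture of the discretization invariance described in the abstract.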

Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data
多変量カテゴリカルデータのためのディメンショングループ化混合メンバーシップモデル

Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M$^3$s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M$^3$s, observed variables are partitioned into groups such that the latent membership is constant for variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we derive transparent identifiability conditions for both the unknown grouping structure and model parameters in general settings. Methodologically, we propose a Bayesian approach for Dirichlet Gro-M$^3$s to inferring the variable grouping structure and estimating model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through applications to a functional disability survey dataset and a personality test dataset.

混合メンバーシップモデル(MMM)は、複雑な多変量データ用の潜在構造モデルの一般的なファミリーです。各被験者を単一のクラスターに所属させる代わりに、MMMは、クラスター間の部分的なメンバーシップを特徴付ける被験者固有の重みのベクトルを組み込みます。この柔軟性により、パラメーターを一意に識別、推定、解釈する上で課題が生じます。この記事では、多変量カテゴリデータ用の新しいクラスの次元グループ化MMM (Gro-M$^3$)を提案します。これにより、簡潔性と解釈可能性が向上します。Gro-M$^3$では、観測変数はグループに分割され、潜在メンバーシップはグループ内の変数に対して一定ですが、グループ間で異なる場合があります。従来の潜在クラスモデルは、すべての変数が1つのグループにある場合に取得され、従来のMMMは各変数が独自のグループにある場合に取得されます。この新しいモデルは、確率テンソルの新しい分解に対応しています。理論的には、一般的な設定で、未知のグループ化構造とモデルパラメーターの両方に対して、透明な識別可能性条件を導出します。方法論的には、変数のグループ化構造を推測し、モデルパラメータを推定するためのディリクレGro-M$^3$のベイズ的アプローチを提案します。シミュレーション結果は、優れた計算性能を示し、識別可能性の結果を実証的に確認します。機能障害調査データセットと性格検査データセットへの適用を通じて、新しい方法論を説明します。

Gaussian Processes with Errors in Variables: Theory and Computation
変数に誤差があるガウス過程:理論と計算

Covariate measurement error in nonparametric regression is a common problem in nutritional epidemiology, geostatistics, and other fields. Over the last two decades, this problem has received substantial attention in the frequentist literature. Bayesian approaches for handling measurement error have only been explored recently and are surprisingly successful, although there still is a lack of a proper theoretical justification regarding the asymptotic performance of the estimators. By specifying a Gaussian process prior on the regression function and a Dirichlet process Gaussian mixture prior on the unknown distribution of the unobserved covariates, we show that the posterior distribution of the regression function and the unknown covariate density attain optimal rates of contraction adaptively over a range of Hölder classes, up to logarithmic terms. We also develop a novel surrogate prior for approximating the Gaussian process prior that leads to efficient computation and preserves the covariance structure, thereby facilitating easy prior elicitation. We demonstrate the empirical performance of our approach and compare it with competitors in a wide range of simulation experiments and a real data example.

ノンパラメトリック回帰における共変量測定誤差は、栄養疫学や地統計学、その他の分野でよく見られる問題です。過去20年間、この問題は頻度論の文献でかなり注目されてきました。測定誤差を処理するためのベイズ的アプローチは最近になってようやく研究され、驚くほど成功していますが、推定量の漸近的パフォーマンスに関する適切な理論的根拠はまだありません。回帰関数にガウス過程事前分布を指定し、観測されていない共変量の未知の分布にディリクレ過程ガウス混合事前分布を指定することにより、回帰関数の事後分布と未知の共変量密度が、ヘルダー(Hölder)クラスの範囲にわたって、対数項を除いて適応的に最適な収縮率を達成することを示します。また、ガウス過程事前分布を近似するための新しい代理事前分布を開発し、効率的な計算を可能にし、共分散構造を保持することで、事前分布の設定を容易にします。幅広いシミュレーション実験と実データ例で、提案手法の経験的な性能を示し、競合手法と比較します。

Learning Partial Differential Equations in Reproducing Kernel Hilbert Spaces
再生核ヒルベルト空間における偏微分方程式の学習

We propose a new data-driven approach for learning the fundamental solutions (Green’s functions) of various linear partial differential equations (PDEs) given sample pairs of input-output functions. Building off the theory of functional linear regression (FLR), we estimate the best-fit Green’s function and bias term of the fundamental solution in a reproducing kernel Hilbert space (RKHS) which allows us to regularize their smoothness and impose various structural constraints. We derive a general representer theorem for operator RKHSs to approximate the original infinite-dimensional regression problem by a finite-dimensional one, reducing the search space to a parametric class of Green’s functions. In order to study the prediction error of our Green’s function estimator, we extend prior results on FLR with scalar outputs to the case with functional outputs. Finally, we demonstrate our method on several linear PDEs including the Poisson, Helmholtz, Schrödinger, Fokker-Planck, and heat equation. We highlight its robustness to noise as well as its ability to generalize to new data with varying degrees of smoothness and mesh discretization without any additional training.

私たちは、入力と出力関数のサンプルペアが与えられた場合に、さまざまな線形偏微分方程式(PDE)の基本解(グリーン関数)を学習するための新しいデータ駆動型アプローチを提案します。関数線形回帰(FLR)の理論を基に、再生核ヒルベルト空間(RKHS)における基本解の最適グリーン関数とバイアス項を推定します。これにより、それらの滑らかさを正則化し、さまざまな構造的制約を課すことができます。私たちは、演算子RKHSの一般的な表現子定理を導出し、元の無限次元回帰問題を有限次元回帰問題で近似し、探索空間をグリーン関数のパラメトリッククラスに縮小します。私たちのグリーン関数推定量の予測誤差を調査するために、スカラー出力のFLRに関する以前の結果を関数出力の場合に拡張します。最後に、ポアソン、ヘルムホルツ、シュレーディンガー、フォッカー・プランク、熱方程式を含むいくつかの線形PDEで私たちの手法を示します。ノイズに対する堅牢性と、追加のトレーニングなしでさまざまなレベルの滑らかさとメッシュの離散化を備えた新しいデータに一般化できる能力を強調します。

Doubly Robust Stein-Kernelized Monte Carlo Estimator: Simultaneous Bias-Variance Reduction and Supercanonical Convergence
二重にロバストなスタインカーネル化モンテカルロ推定量: 同時バイアス分散削減と超正準収束

Standard Monte Carlo computation is widely known to exhibit a canonical square-root convergence speed in terms of sample size. Two recent techniques, one based on control variate and one on importance sampling, both derived from an integration of reproducing kernels and Stein’s identity, have been proposed to reduce the error in Monte Carlo computation to supercanonical convergence. This paper presents a more general framework to encompass both techniques that is especially beneficial when the sample generator is biased and noise-corrupted. We show our general estimator, which we call the doubly robust Stein-kernelized estimator, outperforms both existing methods in terms of mean squared error rates across different scenarios. We also demonstrate the superior performance of our method via numerical examples.

標準的なモンテカルロ計算は、サンプルサイズに関して正準的な平方根収束速度を示すことで広く知られています。最近の2つの手法、1つは制御変量に基づく、もう1つは重要度サンプリングに基づくもので、どちらも再生核とSteinの恒等式の統合から導き出され、モンテカルロ計算の誤差を超正準収束に減らすために提案されています。この論文では、サンプル生成器にバイアスがあり、ノイズで汚染されている場合に特に有益な、両方の手法を網羅するより一般的なフレームワークを示します。私たちは、二重にロバストなStein-kernelized Estimatorと呼ぶ一般推定量が、異なるシナリオでの平均二乗誤差率の点で既存の両方の方法よりも優れていることを示しています。また、数値例を通じて、この手法の優れた性能を実証しています。
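As background for the control-variate idea mentioned in the abstract, here is a minimal sketch of a classical control variate. It is not the Stein-kernelized estimator from the paper and it does not achieve a supercanonical rate; the integrand exp(U), the control U with known mean 1/2, and the function name `mc_estimates` are assumptions for the example.

```python
import math
import random


def mc_estimates(n, seed=0):
    """Estimate E[exp(U)] for U ~ Uniform(0, 1) with plain Monte Carlo and
    with U itself as a control variate (its mean 1/2 is known exactly)."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(n)]
    fs = [math.exp(u) for u in us]
    mean_f = sum(fs) / n
    mean_u = sum(us) / n
    # Estimated optimal coefficient: Cov(f, U) / Var(U).
    cov = sum((f - mean_f) * (u - mean_u) for f, u in zip(fs, us)) / n
    var = sum((u - mean_u) ** 2 for u in us) / n
    beta = cov / var
    # Correct the plain estimate by the observed error in the control's mean.
    with_control = mean_f - beta * (mean_u - 0.5)
    return mean_f, with_control


plain, with_cv = mc_estimates(10000)
exact = math.e - 1.0
```

Both estimators still converge at the square-root rate here; the control variate only shrinks the constant. The techniques discussed in the abstract build the control from reproducing kernels and Stein's identity to improve the rate itself.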

Online Optimization over Riemannian Manifolds
リーマン多様体上のオンライン最適化

Online optimization has witnessed a massive surge of research attention in recent years. In this paper, we propose online gradient descent and online bandit algorithms over Riemannian manifolds in full information and bandit feedback settings respectively, for both geodesically convex and strongly geodesically convex functions. We establish a series of upper bounds on the regrets for the proposed algorithms over Hadamard manifolds. We also find a universal lower bound for achievable regret on Hadamard manifolds. Our analysis shows how time horizon, dimension, and sectional curvature bounds have impact on the regret bounds. When the manifold permits positive sectional curvature, we prove similar regret bound can be established by handling non-constrictive project maps. In addition, numerical studies on problems defined on symmetric positive definite matrix manifold, hyperbolic spaces, and Grassmann manifolds are provided to validate our theoretical findings, using synthetic and real-world data.

オンライン最適化は近年、研究の注目が急増しています。この論文では、測地凸関数と強測地凸関数の両方について、それぞれ完全情報設定とバンディットフィードバック設定でのリーマン多様体上のオンライン勾配降下法とオンラインバンディットアルゴリズムを提案します。アダマール多様体上の提案アルゴリズムのリグレットの上限を設定します。また、アダマール多様体で達成可能なリグレットの普遍的な下限も見つけました。分析では、時間範囲、次元、断面曲率の境界がリグレット境界にどのように影響するかを示します。多様体が正の断面曲率を許容する場合、非狭窄射影マップを処理することで同様のリグレット境界を設定できることを証明します。さらに、対称正定値行列多様体、双曲空間、グラスマン多様体で定義された問題に関する数値研究が提供され、合成データと実世界のデータを使用して理論的発見が検証されます。

Bayes-Newton Methods for Approximate Bayesian Inference with PSD Guarantees
PSD 保証付き近似ベイズ推論のためのベイズ・ニュートン法

We formulate natural gradient variational inference (VI), expectation propagation (EP), and posterior linearisation (PL) as extensions of Newton’s method for optimising the parameters of a Bayesian posterior distribution. This viewpoint explicitly casts inference algorithms under the framework of numerical optimisation. We show that common approximations to Newton’s method from the optimisation literature, namely Gauss-Newton and quasi-Newton methods (e.g., the BFGS algorithm), are still valid under this ‘Bayes-Newton’ framework. This leads to a suite of novel algorithms which are guaranteed to result in positive semi-definite (PSD) covariance matrices, unlike standard VI and EP. Our unifying viewpoint provides new insights into the connections between various inference schemes. All the presented methods apply to any model with a Gaussian prior and non-conjugate likelihood, which we demonstrate with (sparse) Gaussian processes and state space models.

私たちは、ベイズ事後分布のパラメータを最適化するためのニュートンの方法の拡張として、自然勾配変分推論(VI)、期待伝播(EP)、および事後線形化(PL)を定式化します。この視点は、推論アルゴリズムを数値最適化のフレームワークの下に明示的にキャストします。最適化の文献からのニュートン法への一般的な近似、つまりガウス・ニュートン法と準ニュートン法(BFGSアルゴリズムなど)は、この「ベイズ・ニュートン」フレームワークの下でも有効であることを示します。これにより、標準のVIやEPとは異なり、正の半定値(PSD)共分散行列が得られることが保証されている一連の新しいアルゴリズムが生まれます。私たちの統一的な視点は、さまざまな推論スキーム間の接続に関する新しい洞察を提供します。提示されたすべての方法は、ガウスの事前確率と非共役尤度を持つ任意のモデルに適用され、これは(スパース)ガウス過程と状態空間モデルで実証されます。

Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality
高次元パラメータ学習のための反復ブロック粒子フィルタ: 次元の呪いを打ち破る

Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al., 2020) is ineffective and the iterated filtering algorithm (Ionides et al., 2015) suffers from the COD, while our IBPF algorithm beats COD consistently across various experiments with different metrics.

高次元、部分的に観測された、非線形の確率過程のパラメータ学習は、方法論的な課題です。時空間的な疾患伝播システムは、そのようなプロセスが未解決の推論問題を引き起こす例を提供します。私たちは、一般的な状態空間、測度、遷移密度、およびグラフ構造を持つグラフィカルな状態空間モデル上で高次元パラメータを学習するための反復ブロック粒子フィルター(IBPF)アルゴリズムを提案します。次元の呪い(COD)の克服、アルゴリズムの収束、および尤度の最大化について、理論的な性能保証が得られます。麻疹の伝播に関する高度に非線形で非ガウス的な時空間モデルに関する実験では、反復アンサンブルカルマンフィルターアルゴリズム(Liら, 2020)は効果がなく、反復フィルタリングアルゴリズム(Ionidesら, 2015)はCODの影響を受けますが、IBPFアルゴリズムはさまざまな指標を用いたさまざまな実験で一貫してCODを克服します。

Fast Online Changepoint Detection via Functional Pruning CUSUM Statistics
関数的プルーニングCUSUM統計量による高速オンライン変化点検出

Many modern applications of online changepoint detection require the ability to process high-frequency observations, sometimes with limited available computational resources. Online algorithms for detecting a change in mean often involve using a moving window, or specifying the expected size of change. Such choices affect which changes the algorithms have most power to detect. We introduce an algorithm, Functional Online CuSUM (FOCuS), which is equivalent to running these earlier methods simultaneously for all sizes of windows, or all possible values for the size of change. Our theoretical results give tight bounds on the expected computational cost per iteration of FOCuS, with this being logarithmic in the number of observations. We show how FOCuS can be applied to a number of different changes in mean scenarios, and demonstrate its practical utility through its state-of-the-art performance at detecting anomalous behaviour in computer server data.

オンライン変化点検出の多くの最新のアプリケーションでは、高頻度の観測を処理する能力が必要であり、利用可能な計算リソースが限られている場合もあります。平均の変化を検出するためのオンラインアルゴリズムには、多くの場合、移動ウィンドウを使用したり、予想される変化のサイズを指定したりすることが含まれます。このような選択は、アルゴリズムがどの変化を最も高い検出力で検出できるかに影響します。Functional Online CuSUM (FOCuS)というアルゴリズムを導入します。これは、これらの従来の手法をすべてのウィンドウサイズ、または変化の大きさのすべての可能な値に対して同時に実行することと同等です。私たちの理論的な結果は、FOCuSの反復あたりの期待計算コストにタイトな限界を与えており、これは観測数に対して対数的です。FOCuSをさまざまな平均変化のシナリオにどのように適用できるかを示し、コンピュータサーバーデータの異常な振る舞いを検出する最先端のパフォーマンスを通じて、その実用的な有用性を実証します。
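The "specify the expected size of change" approach that FOCuS generalizes can be sketched with Page's classical CUSUM recursion. This is the earlier baseline, not FOCuS; the Gaussian model with pre-change mean 0 and unit variance, the function name `page_cusum`, and the threshold are assumptions for illustration.

```python
def page_cusum(xs, mu, threshold):
    """Return the first index at which the CUSUM statistic exceeds the
    threshold, or None if it never does.

    Assumes pre-change mean 0, unit variance, and a post-change mean mu
    that must be chosen in advance.
    """
    s = 0.0
    for t, x in enumerate(xs):
        # Log-likelihood-ratio increment of N(mu, 1) against N(0, 1),
        # with the running sum reset at zero.
        s = max(0.0, s + mu * (x - mu / 2.0))
        if s > threshold:
            return t
    return None


data = [0.0] * 10 + [2.0] * 10  # mean shifts from 0 to 2 at index 10
alarm = page_cusum(data, mu=1.0, threshold=4.0)
```

The detector needs `mu` up front, and its power depends on how well that guess matches the true change. According to the abstract, FOCuS is equivalent to running such detectors for every possible change size simultaneously.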

Temporal Abstraction in Reinforcement Learning with the Successor Representation
後続表現を用いた強化学習における時間的抽象化

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.

時間的抽象化の複数のレベルで推論することは、知能の重要な属性の1つです。強化学習では、これはオプションと呼ばれる時間的に拡張された一連の行動を通じてモデル化されることがよくあります。オプションにより、エージェントは予測を行い、環境内でさまざまなレベルの抽象化で動作することができます。ただし、オプションフレームワークに基づくアプローチは、多くの場合、合理的な一連のオプションが事前にわかっているという前提から始まります。そうでない場合、どのオプションを検討すべきかについて明確な答えはありません。この論文では、状態をそれに続く状態訪問のパターンに基づいてエンコードする後継表現は、時間的抽象化の発見と使用の自然な基盤と見なすことができると主張します。私たちの主張を裏付けるために、最近の結果を大局的に見て、後継表現を使用して、時間的に拡張された探索または計画のいずれかを容易にするオプションを発見する方法を示します。これらの結果は、エージェントの表現を使用して有用なオプションを特定し、そのオプションを使用してさらに表現を改善するという、オプション発見の一般的なフレームワークのインスタンス化として表現します。その結果、表現とオプションの両方が互いに基づいて絶えず改良される、終わりのない好循環が生まれます。オプションの発見自体を超えて、後継表現によって、追加学習なしでオプションのセットを組み合わせ的に大きな対応物に拡張する方法についても説明します。これは、以前に学習したオプションを組み合わせることで実現されます。私たちの実証的評価は、時間的に拡張された探索のために発見されたオプションと、それらを結合するための後継表現の使用に焦点を当てています。私たちの結果は、オプションの定義に関係する重要な設計上の決定を明らかにし、固有オプションやオプションキーボードなどの後継表現に基づくさまざまな方法の相乗効果を示しています。
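For a fixed policy with state-transition matrix P and discount factor gamma, the successor representation is the matrix of expected discounted future state visitations, (I - gamma P)^{-1}. A minimal tabular sketch follows; the three-state chain and the function name are assumptions, and option discovery itself (for example eigenoptions) is not shown.

```python
def successor_representation(P, gamma, iters=500):
    """Iterate M <- I + gamma * P M, whose fixed point is (I - gamma P)^-1."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [
            [
                identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return M


# Hypothetical 3-state chain under a fixed policy: 0 -> 1 -> 2 -> 2 (absorbing).
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
M = successor_representation(P, gamma=0.9)
```

Row 0 of `M` is approximately `[1.0, 0.9, 8.1]`: state 0 is encoded by the discounted pattern of states that follow it, which is the sense in which the abstract says the representation "encodes states based on the pattern of state visitation that follows them".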

Approximate Post-Selective Inference for Regression with the Group LASSO
グループLASSOによる回帰のための近似ポストセレクティブ推論

After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), inference for the selected parameters is unreliable in the absence of adjustments for selection bias. In the penalized Gaussian regression setup, existing approaches provide adjustments for selection events that can be expressed as linear inequalities in the data variables. Such a representation, however, fails to hold for selection with the Group LASSO and substantially obstructs the scope of subsequent post-selective inference. Key questions of inferential interest, e.g., inference for the effects of selected variables on the outcome, remain unanswered. In the present paper, we develop a consistent, post-selective, Bayesian method to address the existing gaps by deriving a likelihood adjustment factor and an approximation thereof that eliminates bias from the selection of groups. Experiments on simulated data and data from the Human Connectome Project demonstrate that our method recovers the effects of parameters within the selected groups while paying only a small price for bias adjustment.

グループLASSO (または重複、スパース、標準化されたグループLASSOなどの一般化されたバリアント)による選択後、選択バイアスの調整を行わないと、選択されたパラメータの推論は信頼できません。ペナルティ付きガウス回帰設定では、既存のアプローチにより、データ変数の線形不等式として表現できる選択イベントの調整が行われます。ただし、このような表現はグループLASSOによる選択には当てはまらず、その後の選択後推論の範囲を実質的に妨げます。推論上の重要な質問、たとえば、選択された変数が結果に与える影響の推論は、未解決のままです。この論文では、グループの選択からバイアスを排除する尤度調整係数とその近似値を導出することで、既存のギャップに対処するための一貫した選択後ベイズ法を開発します。シミュレートされたデータとヒューマンコネクトームプロジェクトのデータでの実験により、この方法は、バイアス調整のコストをわずかに抑えながら、選択されたグループ内のパラメータの影響を回復できることが実証されています。

Towards Learning to Imitate from a Single Video Demonstration
1つのビデオデモンストレーションから模倣を学ぶために

Agents that can learn to imitate behaviours observed in video — without having direct access to internal state or action information of the observed agent — are more suitable for learning in the natural world. However, formulating a reinforcement learning (RL) agent that facilitates this goal remains a significant challenge. We approach this challenge using contrastive training to learn a reward function by comparing an agent’s behaviour with a single demonstration. We use a Siamese recurrent neural network architecture to learn rewards in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we also find that the inclusion of multi-task data and additional image encoding losses improve the temporal consistency of the learned rewards and, as a result, significantly improve policy learning. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and quadruped and humanoid agents in 3D. We show that our method outperforms current state-of-the-art techniques and can learn to imitate behaviours from a single video demonstration.

観察されたエージェントの内部状態や行動情報に直接アクセスすることなく、ビデオで観察された行動を模倣することを学習できるエージェントは、自然界での学習に適しています。ただし、この目標を促進する強化学習(RL)エージェントを策定することは、依然として大きな課題です。私たちは、エージェントの行動を単一のデモンストレーションと比較することで報酬関数を学習する対照トレーニングを使用してこの課題に取り組みます。私たちは、この距離を最小限に抑えるようにRLポリシーをトレーニングしながら、モーションクリップ間の空間と時間における報酬を学習するために、シャム再帰型ニューラルネットワークアーキテクチャを使用します。実験を通じて、マルチタスクデータと追加の画像エンコーディング損失を含めると、学習した報酬の時間的一貫性が向上し、その結果、ポリシー学習が大幅に改善されることもわかりました。私たちは、2Dでシミュレートされたヒューマノイド、イヌ、およびラプターエージェントと、3Dで四足歩行およびヒューマノイドエージェントでこのアプローチを実証します。私たちの方法は、現在の最先端の技術よりも優れており、単一のビデオデモンストレーションから行動を模倣することを学習できることを示しています。

A Likelihood Approach to Nonparametric Estimation of a Singular Distribution Using Deep Generative Models
深層生成モデルを用いた特異分布のノンパラメトリック推定への尤度アプローチ

We investigate statistical properties of a likelihood approach to nonparametric estimation of a singular distribution using deep generative models. More specifically, a deep generative model is used to model high-dimensional data that are assumed to concentrate around some low-dimensional structure. Estimating the distribution supported on this low-dimensional structure, such as a low-dimensional manifold, is challenging due to its singularity with respect to the Lebesgue measure in the ambient space. In the considered model, a usual likelihood approach can fail to estimate the target distribution consistently due to the singularity. We prove that a novel and effective solution exists by perturbing the data with an instance noise, which leads to consistent estimation of the underlying distribution with desirable convergence rates. We also characterize the class of distributions that can be efficiently estimated via deep generative models. This class is sufficiently general to contain various structured distributions such as product distributions, classically smooth distributions and distributions supported on a low-dimensional manifold. Our analysis provides some insights on how deep generative models can avoid the curse of dimensionality for nonparametric distribution estimation. We conduct a thorough simulation study and real data analysis to empirically demonstrate that the proposed data perturbation technique improves the estimation performance significantly.

私たちは、深層生成モデルを用いた特異分布のノンパラメトリック推定に対する尤度アプローチの統計的特性を調査します。より具体的には、深層生成モデルは、何らかの低次元構造の周囲に集中すると想定される高次元データをモデル化するために使用されます。低次元多様体などのこの低次元構造でサポートされる分布を推定することは、周囲空間におけるルベーグ測度に関する特異性のため困難です。検討中のモデルでは、特異性のため、通常の尤度アプローチではターゲット分布を一貫して推定できない可能性があります。私たちは、インスタンスノイズでデータを摂動させることで、望ましい収束率で基礎分布を一貫して推定できる、新しく効果的なソリューションが存在することを証明します。また、深層生成モデルを介して効率的に推定できる分布のクラスを特徴付けます。このクラスは、積分布、古典的な滑らかな分布、低次元多様体でサポートされる分布など、さまざまな構造化分布を含むのに十分なほど一般的です。我々の分析は、深層生成モデルがノンパラメトリック分布推定の次元の呪いを回避する方法についていくつかの洞察を提供します。徹底的なシミュレーション研究と実際のデータ分析を実施し、提案されたデータ摂動技術により推定性能が大幅に向上することを実証します。

A Randomized Subspace-based Approach for Dimensionality Reduction and Important Variable Selection
次元削減と重要な変数選択のためのランダム化部分空間ベースアプローチ

An analysis of high-dimensional data can offer a detailed description of a system but is often challenged by the curse of dimensionality. General dimensionality reduction techniques can alleviate such difficulty by extracting a few important features, but they are limited due to the lack of interpretability and connectivity to actual decision making associated with each physical variable. Variable selection techniques, as an alternative, can maintain the interpretability, but they often involve a greedy search that is susceptible to failure in capturing important interactions or a metaheuristic search that requires extensive computations. This research proposes a novel method that identifies critical subspaces, reduced-dimensional physical spaces, to achieve dimensionality reduction and variable selection. We apply a randomized search for subspace exploration and leverage ensemble techniques to enhance model performance. When applied to high-dimensional data collected from the failure prediction of a composite/metal hybrid structure exhibiting complex progressive damage failure under loading, the proposed method outperforms the existing and potential alternatives in prediction and important variable selection.

高次元データの分析はシステムの詳細な説明を提供できますが、次元の呪いに悩まされることがよくあります。一般的な次元削減技術は、いくつかの重要な特徴を抽出することでそのような困難を軽減できますが、各物理変数に関連する実際の意思決定への解釈可能性と接続性がないため、限界があります。代替手段としての変数選択技術は解釈可能性を維持できますが、重要な相互作用を捕捉できない可能性のある貪欲な検索や、膨大な計算を必要とするメタヒューリスティック検索を伴うことがよくあります。この研究では、重要なサブスペース、つまり次元削減された物理空間を特定して次元削減と変数選択を実現する新しい方法を提案します。サブスペース探索にはランダム検索を適用し、アンサンブル技術を活用してモデルのパフォーマンスを強化します。負荷下で複雑な進行性損傷破壊を示す複合材/金属ハイブリッド構造の破壊予測から収集された高次元データに適用すると、提案された方法は、予測と重要な変数選択において既存および潜在的な代替方法よりも優れています。

Intrinsic Persistent Homology via Density-based Metric Learning
密度ベース計量学習による固有持続ホモロジー

We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.

私たちは、多様体仮定の下で高次元ユークリッド空間のデータからトポロジカル特徴を推定する問題に取り組みます。私たちのアプローチは、フェルマー距離として知られる標本計量を備えたデータ点の空間のパーシステントホモロジーの計算に基づいています。このような計量空間は、多様体の幾何とサンプルを生成する密度の両方を反映する内在的な計量を備えた多様体自体にほぼ確実に収束することを証明します。この事実は、対応するパーシステンス図の収束を意味します。パーシステントホモロジーを計算するときにこの内在的な距離を使用すると、入力データ内の外れ値の存在に対する堅牢性や、周囲空間への基になる多様体の特定の埋め込みに対する感度の低さなどの有利な特性が得られます。これらの考え方を用いて、時系列でのパターン認識や異常検出の手法を提案・実装し、実データへの応用で評価します。
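The sample Fermat distance can be sketched as a shortest-path computation: paths run through sample points, and each hop costs the Euclidean distance raised to a power p > 1. The sketch below assumes this common definition without any normalizing constant, and uses Floyd-Warshall for simplicity; the paper's exact scaling and implementation may differ.

```python
import math


def fermat_distances(points, p):
    """All-pairs sample Fermat distance: the cheapest path through the sample,
    where a hop between two points costs (Euclidean distance) ** p."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) ** p for j in range(n)] for i in range(n)]
    # Floyd-Warshall shortest paths on the complete weighted graph.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Three closely spaced points and one isolated point.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
D = fermat_distances(pts, p=2)
```

With p = 2, going from the first to the third point through the middle one costs 1 + 1 = 2 instead of 4, so densely sampled regions become "shorter", while the isolated point stays far away (a single hop of cost 64). This is one way to see the density dependence and outlier robustness described in the abstract.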

Privacy-Aware Rejection Sampling
プライバシーに配慮した拒否サンプリング

While differential privacy (DP) offers strong theoretical privacy guarantees, implementations of DP mechanisms may be vulnerable to side-channel attacks, such as timing attacks. When sampling methods such as MCMC or rejection sampling are used to implement a privacy mechanism, the runtime can leak private information. We characterize the additional privacy cost due to the runtime of a rejection sampler in terms of both $(\epsilon,\delta)$-DP as well as $f$-DP. We also show that unless the acceptance probability is constant across databases, the runtime of a rejection sampler does not satisfy $\epsilon$-DP for any $\epsilon$. We show that there is a similar breakdown in privacy with adaptive rejection samplers. We propose three modifications to the rejection sampling algorithm, with varying assumptions, to protect against timing attacks by making the runtime independent of the data. The modification with the weakest assumptions is an approximate sampler, introducing a small increase in the privacy cost, whereas the other modifications give perfect samplers. We also use our techniques to develop an adaptive rejection sampler for log-Hölder densities, which also has data-independent runtime. We give several examples of DP mechanisms that fit the assumptions of our methods and can thus be implemented using our samplers.

差分プライバシー(DP)は理論上は強力なプライバシー保証を提供しますが、DPメカニズムの実装はタイミング攻撃などのサイドチャネル攻撃に対して脆弱である可能性があります。MCMCや拒否サンプリングなどのサンプリング方法を使用してプライバシーメカニズムを実装すると、ランタイムで個人情報が漏洩する可能性があります。拒否サンプラーのランタイムによる追加のプライバシーコストを、$(\epsilon,\delta)$-DPと$f$-DPの両方の観点から特徴付けます。また、受け入れ確率がデータベース間で一定でない限り、拒否サンプラーのランタイムは、任意の$\epsilon$に対して$\epsilon$-DPを満たさないことも示します。適応拒否サンプラーでもプライバシーに同様の問題が発生することを示します。さまざまな仮定を使用して拒否サンプリングアルゴリズムに3つの変更を提案し、ランタイムをデータから独立させることでタイミング攻撃から保護します。最も弱い仮定の変更は近似サンプラーであり、プライバシーコストがわずかに増加しますが、他の変更では完全なサンプラーが得られます。また、この技術を使用して、データに依存しない実行時間を持つlog-Hölder密度の適応型拒否サンプラーも開発しました。私たちの方法の仮定に適合し、私たちのサンプラーを使用して実装できるDPメカニズムの例をいくつか示します。
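The timing side channel can be made concrete with a small sketch: the number of proposals a rejection sampler draws is geometric with the acceptance probability, so if that probability depends on the data, so does the runtime. The two target functions below stand in for two hypothetical databases; this shows the leak only, not the mitigations proposed in the paper.

```python
import random


def rejection_sample(f, bound, rng):
    """Sample from the density proportional to f on [0, 1] using uniform
    proposals; requires 0 <= f(x) <= bound. Returns (sample, proposals_used)."""
    proposals = 0
    while True:
        proposals += 1
        x = rng.random()
        # Accept x with probability f(x) / bound.
        if rng.random() * bound <= f(x):
            return x, proposals


rng = random.Random(0)
# "Database A": flat target, acceptance probability 1.
flat = [rejection_sample(lambda x: 1.0, 1.0, rng)[1] for _ in range(5000)]
# "Database B": target proportional to x, acceptance probability 1/2.
tilted = [rejection_sample(lambda x: x, 1.0, rng)[1] for _ in range(5000)]
mean_flat = sum(flat) / len(flat)
mean_tilted = sum(tilted) / len(tilted)
```

An observer who only measures the running time sees about one proposal per sample for the first target and about two for the second, and can therefore tell the two apart. This is why the abstract calls for samplers whose runtime is independent of the data.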

Inference for a Large Directed Acyclic Graph with Unspecified Interventions
不特定の介入を持つ大規模な有向非巡回グラフの推論

Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interventions of hypothesis-specific primary variables. To this end, we propose a peeling algorithm based on nodewise regressions to establish a topological order of primary variables. Moreover, we prove that the peeling algorithm yields a consistent estimator in low-order polynomial time. Second, we propose a likelihood ratio test integrated with a data perturbation scheme to account for the uncertainty of identifying ancestors and interventions. Also, we show that the distribution of a data perturbation test statistic converges to the target distribution. Numerical examples demonstrate the utility and effectiveness of the proposed methods, including an application to infer gene regulatory networks.

いくつかの不特定の介入（介入対象が不明）を前提とした有向関係の統計的推論は困難です。この記事では、不特定の介入を伴う仮説上の有向関係を検定します。まず、識別可能なモデルを得るための条件を導きます。従来の推論とは異なり、有向関係を検定するには、仮説固有の主要変数の祖先と関連する介入を識別する必要があります。この目的のために、主要変数の位相的順序を確立するためのノードワイズ回帰に基づく剥離アルゴリズムを提案します。さらに、剥離アルゴリズムが低次多項式時間で一致推定量を与えることを証明します。次に、祖先と介入の識別の不確実性を考慮するために、データ摂動スキームと統合された尤度比検定を提案します。また、データ摂動検定統計量の分布がターゲット分布に収束することを示します。数値例は、遺伝子調節ネットワークの推論への応用を含む、提案された方法の有用性と有効性を示しています。

How Do You Want Your Greedy: Simultaneous or Repeated?
貪欲法はどちらがお好みですか: 同時か、反復か?

We present SimultaneousGreedys, a deterministic algorithm for constrained submodular maximization. At a high level, the algorithm maintains $\ell$ solutions and greedily updates them in a simultaneous fashion. SimultaneousGreedys achieves the tightest known approximation guarantees for both $k$-extendible systems and the more general $k$-systems, which are $(k+1)^2/k = k + \mathcal{O}(1)$ and $(1 + \sqrt{k+2})^2 = k + \mathcal{O}(\sqrt{k})$, respectively. We also improve the analysis of RepeatedGreedy, showing that it achieves an approximation ratio of $k + \mathcal{O}(\sqrt{k})$ for $k$-systems when allowed to run for $\mathcal{O}(\sqrt{k})$ iterations, an improvement in both the runtime and approximation over previous analyses. We demonstrate that both algorithms may be modified to run in nearly linear time with an arbitrarily small loss in the approximation. Both SimultaneousGreedys and RepeatedGreedy are flexible enough to incorporate the intersection of $m$ additional knapsack constraints, while retaining similar approximation guarantees: both algorithms yield an approximation guarantee of roughly $k + 2m + \mathcal{O}(\sqrt{k+m})$ for $k$-systems and SimultaneousGreedys enjoys an improved approximation guarantee of $k+2m + \mathcal{O}(\sqrt{m})$ for $k$-extendible systems. To complement our algorithmic contributions, we prove that no algorithm making polynomially many oracle queries can achieve an approximation better than $k + 1/2 - \epsilon$. We also present SubmodularGreedy.jl, a Julia package which implements these algorithms. Finally, we test these algorithms on real datasets.

私たちは、制約付きサブモジュラ最大化のための決定論的アルゴリズムであるSimultaneousGreedysを紹介します。大まかに言うと、このアルゴリズムは$\ell$個の解を保持し、それらを同時に貪欲に更新します。SimultaneousGreedysは、$k$拡張可能システムとより一般的な$k$システムの両方に対して、既知の中で最も厳密な近似保証を実現します。これは、それぞれ$(k+1)^2/k = k + \mathcal{O}(1)$と$(1 + \sqrt{k+2})^2 = k + \mathcal{O}(\sqrt{k})$です。また、RepeatedGreedyの解析も改善し、$\mathcal{O}(\sqrt{k})$回の反復実行を許可した場合に$k$システムに対して$k + \mathcal{O}(\sqrt{k})$の近似比を達成することを示しています。これは、以前の解析と比較して実行時間と近似の両方が改善されています。両方のアルゴリズムを変更して、近似の損失を任意に小さく抑えつつ、ほぼ線形時間で実行できることを実証しています。SimultaneousGreedysとRepeatedGreedyはどちらも、同様の近似保証を維持しながら、$m$個の追加ナップザック制約の交差を組み込むのに十分な柔軟性を備えています。つまり、両方のアルゴリズムは、$k$システムに対しておよそ$k + 2m + \mathcal{O}(\sqrt{k+m})$の近似保証をもたらし、SimultaneousGreedysは、$k$拡張可能システムに対して$k+2m + \mathcal{O}(\sqrt{m})$という改善された近似保証を持ちます。アルゴリズムの貢献を補完するために、多項式回のオラクルクエリを行うアルゴリズムでは、$k + 1/2 - \epsilon$よりも優れた近似を達成できないことを証明します。また、これらのアルゴリズムを実装するJuliaパッケージのSubmodularGreedy.jlも紹介します。最後に、実際のデータセットでこれらのアルゴリズムをテストします。
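
As background for the greedy idea this abstract builds on, here is a minimal sketch of the classical single-solution greedy algorithm for a coverage objective under a cardinality constraint. This is not SimultaneousGreedys or RepeatedGreedy, and it does not use SubmodularGreedy.jl; the function name `greedy_max_coverage` and the toy data are assumptions made for illustration. A cardinality constraint is the simplest special case of the independence systems mentioned in the abstract.

```python
def greedy_max_coverage(sets, k):
    """Pick at most k named sets, each time taking the largest marginal gain.

    `sets` maps a name to a Python set of covered elements. Coverage is a
    monotone submodular objective; ties are broken by sorted name order.
    """
    chosen = []
    covered = set()
    for _ in range(k):
        best, best_gain = None, 0
        for name in sorted(sets):
            if name in chosen:
                continue
            gain = len(sets[name] - covered)  # marginal gain of adding `name`
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining set adds anything new
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


toy_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
chosen, covered = greedy_max_coverage(toy_sets, k=2)
```

On the toy data the first pick is `"c"` (four new elements) and the second is `"a"` (three new elements), covering seven elements in total.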

Kernel-Matrix Determinant Estimates from stopped Cholesky Decomposition
停止したコレスキー分解からのカーネル行列行列式推定

Algorithms involving Gaussian processes or determinantal point processes typically require computing the determinant of a kernel matrix. Frequently, the latter is computed from the Cholesky decomposition, an algorithm of cubic complexity in the size of the matrix. We show that, under mild assumptions, it is possible to estimate the determinant from only a sub-matrix, with probabilistic guarantee on the relative error. We present an augmentation of the Cholesky decomposition that stops under certain conditions before processing the whole matrix. Experiments demonstrate that this can save a considerable amount of time while rarely exceeding an overhead of more than 5% when not stopping early. More generally, we present a probabilistic stopping strategy for the approximation of a sum of known length where addends are revealed sequentially. We do not assume independence between addends, only that they are bounded from below and decrease in conditional expectation.

ガウス過程または行列式点過程を含むアルゴリズムでは、通常、カーネル行列の行列式を計算する必要があります。多くの場合、後者は、行列のサイズの3次計算量のアルゴリズムであるコレスキー分解から計算されます。私たちは、穏やかな仮定の下で、相対誤差の確率的保証とともに、サブ行列のみから行列式を推定することが可能であることを示します。行列全体を処理する前に特定の条件下で停止するコレスキー分解の増強を示します。実験では、これにより時間を大幅に節約でき、早期に停止しない場合に5%を超えるオーバーヘッドを超えることはめったにないことが実証されています。より一般的には、加数が順番に明らかになる既知の長さの合計の近似のための確率的停止戦略を提示します。加数間の独立性を仮定するのではなく、加数が下から制限され、条件付き期待値が減少するというだけを仮定します。
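
The link between the Cholesky decomposition and the determinant is that the log-determinant is a sum of terms, one per processed column, which become available sequentially. The sketch below shows only that decomposition in plain Python on a small matrix; the paper's probabilistic stopping rule is not reproduced, and `cholesky_logdet_terms` is a hypothetical helper name.

```python
import math


def cholesky_logdet_terms(K):
    """Return the addends 2*log(L[j][j]) of log det(K), in column order.

    K is a symmetric positive definite matrix given as a list of lists.
    The j-th addend is available as soon as column j has been processed,
    which is what makes early stopping conceivable.
    """
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    terms = []
    for j in range(n):
        d = K[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (K[i][j] - s) / L[j][j]
        terms.append(2.0 * math.log(L[j][j]))
    return terms


terms = cholesky_logdet_terms([[4.0, 2.0], [2.0, 3.0]])
logdet = sum(terms)  # equals log(4*3 - 2*2) = log(8)
```

Summing all addends recovers the exact log-determinant; an estimator of the kind described in the abstract would instead extrapolate from a prefix of `terms`.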

Optimizing ROC Curves with a Sort-Based Surrogate Loss for Binary Classification and Changepoint Detection
二項分類と変化点検出のためのソートベースの代理損失によるROC曲線の最適化

Receiver Operating Characteristic (ROC) curves are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is a piecewise constant function of predicted values. ROC curves can also be used in other problems with false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose an L1 relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and speed relative to previous baselines.

受信者操作特性(ROC)曲線は、バイナリ分類モデルの評価には便利ですが、曲線下面積(AUC)が予測値の区分定数関数であるため、学習に使用するのは困難です。ROC曲線は、変化点検出など、偽陽性率と真陽性率を伴う他の問題にも使用できます。このより一般的なコンテキストでは、ROC曲線にループ、非常に最適ではないエラー率のポイント、および1を超えるAUCが含まれる可能性があることを示します。この観察により、新しい最適化目標が生まれました。AUCを最大化するのではなく、Min(FP,FN)の値が大きいポイントを回避する、AUC=1の単調なROC曲線が求められます。この目標のL1緩和を提案します。その結果、AUM (Area Under Min(FP, FN)の略)と呼ばれる新しい代理損失関数が生成されます。以前の損失関数は、すべてのラベル付きサンプルまたはペアの合計に基づいていますが、AUMでは、ROC曲線上のポイントのシーケンスの並べ替えと合計が必要です。AUM方向導関数は、勾配降下学習アルゴリズムで効率的に計算および使用できることを示しています。教師ありバイナリ分類および変化点検出問題に関する実証的研究では、新しいAUM最小化学習アルゴリズムによって、以前のベースラインと比較してAUCと速度が向上することが示されています。
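
To make "Area Under Min(FP, FN)" concrete for binary classification, the sketch below integrates min(FP, FN) over a constant `c` added to every predicted score, where an example is predicted positive when its score plus `c` is positive. This is one reading of the abstract, using error counts rather than rates, and it is not the authors' implementation; the function name `aum` is an assumption.

```python
def aum(preds, labels):
    """Area under min(FP, FN) as a function of an additive threshold c.

    preds  : real-valued scores
    labels : 1 for positive, 0 for negative
    FP(c) and FN(c) are piecewise constant with breakpoints at c = -pred,
    so a sort followed by one pass over the breakpoints is enough.
    """
    events = sorted((-f, y) for f, y in zip(preds, labels))
    fp = 0                                      # c -> -inf: nothing predicted positive
    fn = sum(1 for y in labels if y == 1)       # so every positive is missed
    total = 0.0
    for idx, (c, y) in enumerate(events):
        # Moving past breakpoint c flips this example to "predicted positive".
        if y == 1:
            fn -= 1
        else:
            fp += 1
        if idx + 1 < len(events):
            width = events[idx + 1][0] - c
            total += min(fp, fn) * width
    return total


badly_ranked = aum([2.0, -1.0], [0, 1])   # negative scored above positive
well_ranked = aum([-1.0, 2.0], [0, 1])    # positive scored above negative
```

In the badly ranked case there is an interval of thresholds of width 3 on which both one false positive and one false negative occur, so the area is 3; in the well ranked case some threshold is error-free everywhere and the area is 0.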

When Locally Linear Embedding Hits Boundary
局所的な線形埋め込みが境界に当たった場合

Based on the Riemannian manifold model, we study the asymptotic behavior of a widely applied unsupervised learning algorithm, locally linear embedding (LLE), when the point cloud is sampled from a compact, smooth manifold with boundary. We show several peculiar behaviors of LLE near the boundary that are different from those of diffusion-based algorithms. In particular, we show that LLE converges pointwise to a mixed-type differential operator with degeneracy and we calculate the convergence rate. The impact of the hyperbolic part of the operator is discussed and we propose a clipped LLE algorithm which is a potential approach to recover the Dirichlet Laplace-Beltrami operator.

リーマン多様体モデルに基づいて、境界を持つコンパクトで滑らかな多様体から点群がサンプリングされた場合について、広く適用されている教師なし学習アルゴリズムである局所線形埋め込み(LLE)の漸近的な振る舞いを研究します。境界付近でのLLEのいくつかの特異な振る舞いが、拡散ベースのアルゴリズムの振る舞いとは異なることを示します。特に、LLEが退化を伴う混合型微分演算子に各点収束することを示し、収束率を計算します。演算子の双曲型部分の影響について議論し、ディリクレ・ラプラス・ベルトラミ演算子を回復するための潜在的なアプローチであるクリップされたLLEアルゴリズムを提案します。

Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data
大規模データを用いた欠損応答問題に対する分散ノンパラメトリック回帰補完

Nonparametric regression imputation is commonly used in missing data analysis. However, it suffers from the curse of dimensionality. The problem can be alleviated by the explosive sample size in the era of big data, while the large-scale data size presents some challenges in the storage of data and the calculation of estimators. These challenges make the classical nonparametric regression imputation methods no longer applicable. This motivates us to develop two distributed nonparametric regression imputation methods. One is based on kernel smoothing and the other on the sieve method. The kernel-based distributed imputation method has extremely low communication cost, and the sieve-based distributed imputation method can accommodate more local machines. The response mean estimation is considered to illustrate the proposed imputation methods. Two distributed nonparametric regression imputation estimators are proposed for the response mean, which are proved to be asymptotically normal with asymptotic variances achieving the semiparametric efficiency bound. The proposed methods are evaluated through simulation studies and illustrated in real data analysis.

ノンパラメトリック回帰補完は、欠損データ解析でよく使用されます。しかし、次元の呪いに悩まされています。ビッグデータ時代の爆発的なサンプルサイズによってこの問題は緩和できますが、大規模なデータサイズは、データの保存と推定量の計算にいくつかの課題をもたらします。これらの課題により、従来のノンパラメトリック回帰補完方法は適用できなくなりました。これが、2つの分散ノンパラメトリック回帰補完方法を開発する動機となりました。1つはカーネル平滑化に基づき、もう1つはふるい法に基づきます。カーネルベースの分散補完方法は通信コストが非常に低く、ふるいベースの分散補完方法はより多くのローカルマシンに対応できます。提案された補完方法を説明するために、応答平均推定が検討されています。応答平均に対して2つの分散ノンパラメトリック回帰補完推定量が提案されており、これらは漸近的に正規であり、漸近分散がセミパラメトリック効率境界を達成することが証明されています。提案された方法はシミュレーション研究を通じて評価され、実際のデータ分析で実証されています。

Prior Specification for Bayesian Matrix Factorization via Prior Predictive Matching
事前予測マッチングによるベイズ行列因数分解の事前仕様

The behavior of many Bayesian models used in machine learning critically depends on the choice of prior distributions, controlled by some hyperparameters typically selected through Bayesian optimization or cross-validation. This requires repeated, costly, posterior inference. We provide an alternative for selecting good priors without carrying out posterior inference, building on the prior predictive distribution that marginalizes the model parameters. We estimate virtual statistics for data generated by the prior predictive distribution and then optimize over the hyperparameters to learn those for which the virtual statistics match the target values provided by the user or estimated from (a subset of) the observed data. We apply the principle for probabilistic matrix factorization, for which good solutions for prior selection have been missing. We show that for Poisson factorization models we can analytically determine the hyperparameters, including the number of factors, that best replicate the target statistics, and we empirically study the sensitivity of the approach for the model mismatch. We also present a model-independent procedure that determines the hyperparameters for general models by stochastic optimization and demonstrate this extension in the context of hierarchical matrix factorization models.

機械学習で使用される多くのベイジアンモデルの挙動は、事前分布の選択に大きく依存します。事前分布は、通常、ベイジアン最適化または交差検証によって選択されるいくつかのハイパーパラメータによって制御されます。これには、繰り返しのコストのかかる事後推論が必要です。私たちは、モデルパラメータを周辺化する事前予測分布に基づいて、事後推論を実行せずに適切な事前分布を選択するための代替手段を提供します。事前予測分布によって生成されたデータの仮想統計を推定し、ハイパーパラメータを最適化して、仮想統計がユーザーによって提供されたターゲット値または観測データ（のサブセット）から推定されたターゲット値と一致するハイパーパラメータを学習します。私たちは、事前選択の適切なソリューションが欠けていた確率的行列因数分解の原理を適用します。ポアソン因数分解モデルの場合、ターゲット統計を最もよく再現するハイパーパラメータ（因子数を含む）を解析的に決定できることを示し、モデル不一致に対するアプローチの感度を経験的に研究します。また、確率的最適化によって一般的なモデルのハイパーパラメータを決定するモデルに依存しない手順も提示し、階層的行列分解モデルのコンテキストでこの拡張を示します。

Posterior Contraction for Deep Gaussian Process Priors
ディープガウス過程事前分布の事後収縮

We study posterior contraction rates for a class of deep Gaussian process priors in the nonparametric regression setting under a general composition assumption on the regression function. It is shown that the contraction rates can achieve the minimax convergence rate (up to log n factors), while being adaptive to the underlying structure and smoothness of the target function. The proposed framework extends the Bayesian nonparametric theory for Gaussian process priors.

私たちは、回帰関数の一般的な構成仮定の下で、ノンパラメトリック回帰設定の深いガウス過程事前分布のクラスの事後収縮率を研究します。収縮率は、ターゲット関数の基礎となる構造と滑らかさに適応しながら、ミニマックス収束率(log n因子まで)を達成できることが示されています。提案されたフレームワークは、ガウス過程事前確率のベイズノンパラメトリック理論を拡張します。

Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule
広い最小密度仮説と探索・活用学習率スケジュール

Several papers argue that wide minima generalize better than narrow minima. In this paper, through detailed experiments, we not only corroborate the generalization properties of wide minima but also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-exploit learning rate schedule. On a variety of image and natural language datasets, compared to their original hand-tuned learning rate baselines, we show that our explore-exploit schedule can result in either up to 0.84% higher absolute accuracy using the original training budget or up to 57% reduced training time while achieving the original reported accuracy.

いくつかの論文では、幅の広い最小値は狭い最小値よりも一般化が優れていると主張しています。この論文では、ワイドミニマの一般化特性を裏付ける詳細な実験を通じて、ワイドミニマの密度がナローミニマの密度よりも低い可能性が高いという新しい仮説の経験的証拠も提供します。さらに、この仮説に動機付けられて、新しい探索-活用学習率スケジュールを設計します。さまざまな画像データセットと自然言語データセットで、元の手動で調整された学習率ベースラインと比較して、探索-活用スケジュールにより、元のトレーニング予算を使用して絶対精度が最大0.84%向上するか、最初に報告された精度を達成しながらトレーニング時間が最大57%短縮されることを示しています。
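
The abstract does not spell out the shape of the explore-exploit schedule. The sketch below assumes one simple shape that matches the name: a constant, relatively high learning rate for an initial explore phase, followed by a linear decay to zero during the exploit phase. The function name, its parameters, and the linear decay are all assumptions made for illustration, not details confirmed by the abstract.

```python
def explore_exploit_lr(step, total_steps, explore_steps, base_lr):
    """Learning rate at `step` under an assumed explore-then-decay shape.

    Explore phase (step < explore_steps): hold the rate at base_lr.
    Exploit phase: decay linearly from base_lr down to 0 at total_steps.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if step < explore_steps:
        return base_lr
    remaining = total_steps - explore_steps
    if remaining == 0:
        return base_lr
    return base_lr * (total_steps - step) / remaining


schedule = [explore_exploit_lr(s, 100, 40, 0.1) for s in (0, 39, 70, 100)]
```

Holding the rate high for longer is meant to give the optimizer more time to land in a wide minimum before the decay phase refines the solution, which is the intuition the density hypothesis suggests.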

Fundamental limits and algorithms for sparse linear regression with sublinear sparsity
サブリニアスパース性を持つスパース線形回帰の基本制限とアルゴリズム

We establish exact asymptotic expressions for the normalized mutual information and minimum mean-square-error (MMSE) of sparse linear regression in the sub-linear sparsity regime. Our result is achieved by a generalization of the adaptive interpolation method in Bayesian inference for linear regimes to sub-linear ones. A modification of the well-known approximate message passing algorithm to approach the MMSE fundamental limit is also proposed, and its state evolution is rigorously analysed. Our results show that the traditional linear assumption between the signal dimension and number of observations in the replica and adaptive interpolation methods is not necessary for sparse signals. They also show how to modify the existing well-known AMP algorithms for linear regimes to sub-linear ones.

私たちは、正規化された相互情報量と、サブリニアスパース性領域におけるスパース線形回帰の最小平均二乗誤差(MMSE)の正確な漸近式を確立します。私たちの結果は、線形領域からサブ線形領域へのベイズ推論における適応補間法の一般化によって達成されます。よく知られている近似メッセージパッシングアルゴリズムをMMSEの基本限界に近づけるための修正も提案されており、その状態進化が厳密に分析されています。私たちの結果は、レプリカ内挿法と適応内挿法の信号次元と観測値の数の間の従来の線形仮定が、スパース信号には必要ないことを示しています。また、線形領域の既存のよく知られたAMPアルゴリズムをサブ線形領域に変更する方法も示しています。

On the Complexity of SHAP-Score-Based Explanations: Tractability via Knowledge Compilation and Non-Approximability Results
SHAPスコアに基づく説明の複雑さについて:知識のコンパイルと非近似性の結果による扱いやすさ

Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential Shap-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, we prove a strong positive result stating that the Shap-score can be computed in polynomial time over deterministic and decomposable Boolean circuits under the so-called product distributions on entities. Such circuits are studied in the field of Knowledge Compilation and generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). Our positive result extends even beyond binary classifiers, as it continues to hold if each feature is associated with a finite domain of possible values. We also establish the computational limits of the notion of Shap-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the Shap-score intractable (namely, $\#P$-hard). It also implies that computing Shap-scores is $\#P$-hard even over the class of propositional formulas in DNF. Based on this negative result, we look for the existence of fully-polynomial randomized approximation schemes (FPRAS) for computing Shap-scores over such a class. In stark contrast to the model counting problem for DNF formulas, which admits an FPRAS, we prove that no such FPRAS exists (under widely believed complexity assumptions) for the computation of Shap-scores. Surprisingly, this negative result holds even for the class of monotone formulas in DNF. These techniques can be further extended to prove another strong negative result: Under widely believed complexity assumptions, there is no polynomial-time algorithm that checks, given a monotone DNF formula $\varphi$ and features $x,y$, whether the Shap-score of $x$ in $\varphi$ is smaller than the Shap-score of $y$ in $\varphi$.

Shapley値に基づくスコアは、機械学習モデルによる分類結果の説明に広く使用されています。その代表的な例として、影響力のあるShapスコアが挙げられます。これはShapley値のバージョンで、すべての特徴にスコアを割り当てることで、特定のエンティティに対する学習済みモデルの結果を説明するのに役立ちます。一般にShapley値の計算は計算上扱いにくい問題ですが、私たちは、エンティティ上のいわゆる積分布の下で、決定論的かつ分解可能なブール回路に対してShapスコアを多項式時間で計算できるという強い肯定的な結果を証明しました。このような回路は知識コンパイルの分野で研究されており、二分決定木、順序付き二分決定図(OBDD)、自由二分決定図(FBDD)など、幅広いブール回路と二分決定図クラスを一般化しています。肯定的な結果は、各特徴が可能な値の有限ドメインに関連付けられている場合にも維持されるため、二分分類器を超えて拡張されます。また、緩やかな条件下では、ブールモデルのクラスでShapスコアを計算することは、そのクラスのモデルカウント問題と常に多項式的に同じくらい困難であることを観察することにより、Shapスコアの概念の計算限界を確立します。これは、決定性と分解可能性の両方が、検討する回路にとって不可欠な特性であることを意味します。どちらか一方を削除すると、Shapスコアを計算する問題が手に負えなくなる(つまり、$\#P$困難)ためです。また、DNFの命題式のクラスでもShapスコアを計算することは$\#P$困難であることを意味します。この否定的な結果に基づいて、そのようなクラスでShapスコアを計算するための完全多項式ランダム近似スキーム(FPRAS)の存在を探します。FPRASが許容されるDNF式のモデルカウント問題とはまったく対照的に、Shapスコアの計算ではそのようなFPRASは存在しない(広く信じられている複雑さの仮定の下では)ことを証明します。驚くべきことに、この否定的な結果は、DNFの単調な式のクラスにも当てはまります。これらの手法をさらに拡張して、別の強い否定的な結果を証明することができます。広く信じられている複雑さの仮定の下では、単調なDNF式$\varphi$と特徴$x,y$が与えられたときに、$\varphi$の$x$のShapスコアが$\varphi$の$y$のShapスコアよりも小さいかどうかを確認する多項式時間アルゴリズムは存在しません。
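
For intuition about what is being computed, the sketch below evaluates the Shapley-value formula by brute force for a Boolean model under the uniform distribution, which is one particular product distribution. It enumerates all feature subsets and all completions, so it takes exponential time; it is not the polynomial-time circuit algorithm from the paper. The names `expected_value` and `shap_score` are assumptions.

```python
from fractions import Fraction
from itertools import combinations, product
from math import factorial


def expected_value(model, entity, fixed):
    """E[model(e')] when features in `fixed` are set as in `entity`
    and every other feature is an independent fair coin."""
    n = len(entity)
    free = [i for i in range(n) if i not in fixed]
    total = Fraction(0)
    for bits in product((0, 1), repeat=len(free)):
        e = list(entity)
        for i, b in zip(free, bits):
            e[i] = b
        total += model(tuple(e))
    return total / (2 ** len(free))


def shap_score(model, entity, feature):
    """Shapley value of `feature` for the game S -> expected_value(..., S)."""
    n = len(entity)
    others = [i for i in range(n) if i != feature]
    score = Fraction(0)
    for size in range(n):
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for subset in combinations(others, size):
            s = set(subset)
            gain = (expected_value(model, entity, s | {feature})
                    - expected_value(model, entity, s))
            score += weight * gain
    return score


def and_model(e):
    return e[0] & e[1]


score_x0 = shap_score(and_model, (1, 1), 0)
score_x1 = shap_score(and_model, (1, 1), 1)
```

For the two-input AND on the entity (1, 1), each feature receives 3/8, and the two scores add up to the gap between the model's value on the entity (1) and its average value (1/4).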

Monotonic Alpha-divergence Minimisation for Variational Inference
変分推論のための単調α‐発散最小化

In this paper, we introduce a novel family of iterative algorithms which carry out $\alpha$-divergence minimisation in a Variational Inference context. They do so by ensuring a systematic decrease at each step in the $\alpha$-divergence between the variational and the posterior distributions. In its most general form, the variational distribution is a mixture model and our framework allows us to simultaneously optimise the weights and components parameters of this mixture model. Our approach permits us to build on various methods previously proposed for $\alpha$-divergence minimisation such as Gradient or Power Descent schemes and we also shed a new light on an integrated Expectation Maximization algorithm. Lastly, we provide empirical evidence that our methodology yields improved results on several multimodal target distributions and on a real data example.

この論文では、変分推論のコンテキストで$\alpha$-ダイバージェンス最小化を実行する反復アルゴリズムの新しいファミリを紹介します。これらのアルゴリズムは、変分分布と事後分布の間の$\alpha$-ダイバージェンスが各ステップで系統的に減少することを保証します。最も一般的な形式では、変分分布は混合モデルであり、私たちのフレームワークにより、この混合モデルの重みと成分パラメータを同時に最適化できます。私たちのアプローチにより、勾配降下スキームやパワーディセントスキームなど、$\alpha$-ダイバージェンス最小化のために以前に提案されたさまざまな方法に基づいて構築することができ、統合されたExpectation Maximizationアルゴリズムにも新たな光を当てることができます。最後に、私たちの方法論がいくつかのマルチモーダルターゲット分布と実際のデータの例で改善された結果をもたらすという経験的証拠を提供します。

Density estimation on low-dimensional manifolds: an inflation-deflation approach
低次元多様体上の密度推定:インフレーション・デフレアプローチ

Normalizing flows (NFs) are universal density estimators based on neural networks. However, this universality is limited: the density’s support needs to be diffeomorphic to a Euclidean space. In this paper, we propose a novel method to overcome this limitation without sacrificing universality. The proposed method inflates the data manifold by adding noise in the normal space, trains an NF on this inflated manifold, and, finally, deflates the learned density. Our main result provides sufficient conditions on the manifold and the specific choice of noise under which the corresponding estimator is exact. Our method has the same computational complexity as NFs and does not require computing an inverse flow. We also demonstrate theoretically (under certain conditions) and empirically (on a wide range of toy examples) that noise in the normal space can be well approximated by Gaussian noise. This allows using our method for approximating arbitrary densities on unknown manifolds provided that the manifold dimension is known.

正規化フロー(NF)は、ニューラルネットワークに基づく普遍的な密度推定量です。ただし、この普遍性には限界があります。密度のサポートは、ユークリッド空間に微分同相である必要があります。この論文では、普遍性を犠牲にすることなくこの限界を克服する新しい方法を提案します。提案された方法は、法線空間にノイズを追加することでデータ多様体を膨張させ、この膨張した多様体でNFをトレーニングし、最後に学習した密度を収縮させます。主な結果は、対応する推定量が正確になるための、多様体とノイズの具体的な選択に関する十分条件を与えます。この方法は、NFと同じ計算量で、逆フローを計算する必要はありません。また、法線空間のノイズはガウスノイズで十分に近似できることを理論的(特定の条件下)および経験的(さまざまな簡単な例で)に実証します。これにより、多様体の次元が既知であれば、未知の多様体上の任意の密度を近似するためにこの方法を使用できます。

Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints
ピーク制約を持つMDPのための証明可能なサンプル効率のモデルフリーアルゴリズム

In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) for the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(\epsilon,p)$-PAC policy when the number of episodes satisfies $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the number of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. We note that this is the first PAC-type analysis for PCMDPs in which the transition dynamics are not known a priori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.

動的システムの最適化では、変数には通常、制約があります。このような問題は、制約付きマルコフ決定プロセス(CMDP)としてモデル化できます。この論文では、ピーク制約付きマルコフ決定プロセス(PCMDP)を検討します。このプロセスでは、エージェントは、有限期間で総報酬を最大化するポリシーを選択し、各エポックで確率1で制約を満たします。PCMDP問題を制約のない問題に変換するモデルフリーアルゴリズムを提案し、Q学習ベースのアプローチを適用します。提案されたPCMDP問題に対して、おそらく近似的に正しい(PAC)の概念を定義します。提案されたアルゴリズムは、エピソード$K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$のときに$(\epsilon,p)$-PACポリシーを実現することが証明されています。ここで、$S$と$A$は、それぞれ状態とアクションの数です。$H$は、エピソードあたりのエポック数です。$I$は制約関数の数で、$\ell=\log(\frac{SAT}{p})$です。これは、遷移ダイナミクスが事前にわかっていないピーク制約のあるPCMDPのPACタイプの解析に関する最初の結果であることに注意してください。提案されたアルゴリズムをエネルギーハーベスティング問題と単一マシンスケジューリング問題で実証したところ、研究対象の最適化問題の理論上の上限に近いパフォーマンスが得られました。

Topological Convolutional Layers for Deep Learning
深層学習のためのトポロジカル畳み込み層

This work introduces the Topological CNN (TCNN), which encompasses several topologically defined convolutional methods. Manifolds with important relationships to the natural image space are used to parameterize image filters which are used as convolutional weights in a TCNN. These manifolds also parameterize slices in layers of a TCNN across which the weights are localized. We show evidence that TCNNs learn faster, on less data, with fewer learned parameters, and with greater generalizability and interpretability than conventional CNNs. We introduce and explore TCNN layers for both image and video data. We propose extensions to 3D images and 3D video.

この研究では、トポロジカルCNN (TCNN)を紹介し、トポロジカルに定義されたいくつかの畳み込み手法を包含します。自然画像空間と重要な関係を持つ多様体は、TCNNで畳み込み重みとして使用される画像フィルターをパラメータ化するために使用されます。また、これらの多様体は、重みが局在するTCNNの層内のスライスをパラメータ化します。TCNNは、従来のCNNよりも、より少ないデータで、より少ない学習パラメータで、より優れた一般化可能性と解釈可能性で学習するという証拠を示しています。画像データとビデオデータの両方のTCNNレイヤーを紹介し、探索します。3D画像や3D動画の拡張をご提案します。

Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval
任意の初期化によるオンライン確率的勾配降下法による非平滑、非凸位相検索の解法

In recent literature, a general two step procedure has been formulated for solving the problem of phase retrieval. First, a spectral technique is used to obtain a constant-error initial estimate, following which, the estimate is refined to arbitrary precision by first-order optimization of a non-convex loss function. Numerical experiments, however, seem to suggest that simply running the iterative schemes from a random initialization may also lead to convergence, albeit at the cost of slightly higher sample complexity. In this paper, we prove that, in fact, constant step size online stochastic gradient descent (SGD) converges from arbitrary initializations for the non-smooth, non-convex amplitude squared loss objective. In this setting, online SGD is also equivalent to the randomized Kaczmarz algorithm from numerical analysis. Our analysis can easily be generalized to other single index models. It also makes use of new ideas from stochastic process theory, including the notion of a summary state space, which we believe will be of use for the broader field of non-convex optimization.

最近の文献では、位相回復の問題を解決するための一般的な2段階の手順が定式化されています。まず、スペクトル手法を使用して一定誤差の初期推定値を取得し、次に、非凸損失関数の1次最適化によって推定値を任意の精度に調整します。ただし、数値実験では、ランダムな初期化から反復スキームを実行するだけでも、サンプルの複雑さがわずかに高くなるという犠牲はあるものの、収束につながる可能性があることが示唆されています。この論文では、実際に、一定ステップサイズのオンライン確率的勾配降下法(SGD)が、非滑らかで非凸な振幅二乗損失目的関数の任意の初期化から収束することを証明します。この設定では、オンラインSGDは、数値解析からのランダム化Kaczmarzアルゴリズムと同等です。私たちの分析は、他の単一インデックスモデルに簡単に一般化できます。また、要約状態空間の概念を含む確率過程理論の新しいアイデアも活用しており、これは非凸最適化のより広範な分野に役立つと考えています。
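
The equivalence mentioned in this abstract can be made concrete with a single update step. The sketch below writes the randomized Kaczmarz step for one phaseless measurement $y = |\langle a, x^\star \rangle|$: it moves the iterate along `a` by exactly the amount needed to satisfy that measurement, which coincides with an SGD step on the amplitude loss with step size $1/\|a\|^2$. The function names, the Gaussian sampling of `a`, and the sign convention at zero are assumptions; no convergence claim is checked here.

```python
import random


def kaczmarz_phase_step(x, a, y):
    """One Kaczmarz step for the measurement |<a, x>| = y (with y >= 0)."""
    inner = sum(ai * xi for ai, xi in zip(a, x))
    sign = 1.0 if inner >= 0 else -1.0          # convention: sign(0) = +1
    norm_sq = sum(ai * ai for ai in a)
    scale = (inner - sign * y) / norm_sq
    return [xi - scale * ai for xi, ai in zip(x, a)]


def run_online(x0, x_star, steps, rng):
    """Stream fresh Gaussian measurements of x_star and apply one step each."""
    x = list(x0)
    for _ in range(steps):
        a = [rng.gauss(0.0, 1.0) for _ in x]
        y = abs(sum(ai * si for ai, si in zip(a, x_star)))
        x = kaczmarz_phase_step(x, a, y)
    return x


x_new = kaczmarz_phase_step([1.0, 0.0], [3.0, 4.0], 10.0)
fitted = abs(3.0 * x_new[0] + 4.0 * x_new[1])   # measurement after the step
```

After a single step the processed measurement is matched exactly, which is the projection property that characterizes Kaczmarz-type methods; `run_online` shows how such steps would be chained from an arbitrary starting point.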

Tree-AMP: Compositional Inference with Tree Approximate Message Passing
Tree-AMP: ツリー近似メッセージパッシングによる構成推論

We introduce Tree-AMP, standing for Tree Approximate Message Passing, a Python package for compositional inference in high-dimensional tree-structured models. The package provides a unifying framework to study several approximate message passing algorithms previously derived for a variety of machine learning tasks such as generalized linear models, inference in multi-layer networks, matrix factorization, and reconstruction using non-separable penalties. For some models, the asymptotic performance of the algorithm can be theoretically predicted by the state evolution, and the measurements entropy estimated by the free entropy formalism. The implementation is modular by design: each module, which implements a factor, can be composed at will with other modules to solve complex inference tasks. The user only needs to declare the factor graph of the model: the inference algorithm, state evolution and entropy estimation are fully automated.

私たちは、Tree Approximate Message Passingの略であるTree-AMPを紹介します。これは、高次元ツリー構造モデルにおける構成的推論のためのPythonパッケージです。このパッケージは、一般化線形モデル、多層ネットワークでの推論、行列分解、非分離ペナルティを使用した再構成など、さまざまな機械学習タスクに対して以前に導き出されたいくつかの近似メッセージパッシングアルゴリズムを研究するための統一フレームワークを提供します。一部のモデルでは、アルゴリズムの漸近性能は状態進化によって理論的に予測でき、測定値のエントロピーは自由エントロピー形式によって推定されます。実装は設計上モジュール化されており、ファクターを実装する各モジュールは、複雑な推論タスクを解決するために他のモジュールと自由に構成できます。ユーザーはモデルの因子グラフを宣言するだけで済みます:推論アルゴリズム、状態進化、エントロピー推定は完全に自動化されています。

On the geometry of Stein variational gradient descent
スタイン変分勾配降下の幾何学について

Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these in various numerical experiments.

ベイズ推論問題では、高次元の確率分布をサンプリングまたは近似する必要があります。この論文の焦点は、最近導入されたスタイン変分勾配降下法、つまり、再現カーネルヒルベルト空間ノルムに関して反復された最も急な降下ステップに依存するアルゴリズムのクラスにあります。この構造は相互作用する粒子系を導き、その平均場の限界は、特定の幾何学的構造を備えた確率分布の空間上の勾配流れです。この視点を活用して、アルゴリズムの収束特性に光を当て、特に適切な正定値カーネル関数を選択する問題に対処します。私たちの分析は、調整された裾を持つ特定の微分不可能なカーネルを検討することにつながります。これらのパフォーマンスの大幅な向上をさまざまな数値実験で実証しています。

Kernel-based estimation for partially functional linear model: Minimax rates and randomized sketches
部分関数型線形モデルのためのカーネルベースの推定:ミニマックスレートとランダム化スケッチ

This paper considers the partially functional linear model (PFLM) where all predictive features consist of a functional covariate and a high-dimensional scalar vector. Over an infinite-dimensional reproducing kernel Hilbert space, the proposed estimation for PFLM is a least squares approach with two mixed regularizations of a function-norm and an $\ell_1$-norm. Our main task in this paper is to establish the minimax rates for PFLM in the high-dimensional setting, and the optimal minimax rates of estimation are established by using various techniques in empirical process theory for analyzing kernel classes. In addition, we propose an efficient numerical algorithm based on randomized sketches of the kernel matrix. Several numerical experiments are implemented to support our method and optimization strategy.

この論文では、すべての予測特徴が関数共変量と高次元スカラーベクトルで構成される部分関数型線形モデル(PFLM)について考察します。無限次元の再現カーネルヒルベルト空間上で、PFLMの提案された推定は、関数ノルムと$\ell_1$ノルムの2つの混合正則化を持つ最小二乗アプローチです。この論文の主な課題は、高次元設定下でのPFLMのミニマックスレートを確立することであり、カーネルクラスを解析するための経験過程理論における様々な手法を用いて、最適なミニマックス推定レートを確立します。さらに、カーネル行列のランダム化されたスケッチに基づく効率的な数値アルゴリズムを提案します。私たちの方法と最適化戦略をサポートするために、いくつかの数値実験が実装されています。

Contextual Stochastic Block Model: Sharp Thresholds and Contiguity
コンテキスト確率的ブロックモデル: シャープなしきい値と隣接性

We study community detection in the “contextual stochastic block model” (Yan and Sarkar (2020), Deshpande et al. (2018)). Deshpande et al. (2018) studied this problem in the setting of sparse graphs with high-dimensional node-covariates. Using the non-rigorous “cavity method” from statistical physics (Mezard and Montanari (2009)), they calculated the sharp limit for community detection in this setting, and verified that the limit matches the information theoretic threshold when the average degree of the observed graph is large. They conjectured that the limit should hold as soon as the average degree exceeds one. We establish this conjecture, and characterize the sharp threshold for detection and weak recovery.

私たちは、「文脈的確率的ブロックモデル」におけるコミュニティ検出について研究しています(Yan and Sarkar (2020)、Deshpandeら(2018))。Deshpandeら(2018)は、高次元のノード共変量を持つスパースグラフの設定でこの問題を研究しました。彼らは、統計物理学の非厳密な「キャビティ法」(Mezard and Montanari (2009))を用いて、この設定におけるコミュニティ検出のシャープな限界を計算し、観測されたグラフの平均次数が大きい場合に、その限界が情報理論的な閾値と一致することを確認しました。彼らは、平均次数が1を超えればすぐにこの限界が成り立つと予想しました。私たちはこの予想を証明し、検出と弱い回復のシャープな閾値を特徴付けます。

VCG Mechanism Design with Unknown Agent Values under Stochastic Bandit Feedback
確率的バンディットフィードバックの下での未知のエージェント値を持つVCGメカニズム設計

We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values. On each round, a mechanism first assigns an allocation to a set of agents and charges them a price; at the end of the round, the agents provide (stochastic) feedback to the mechanism for the allocation they received. This setting is motivated by applications in cloud markets and online advertising where an agent may know her value for an allocation only after experiencing it. Therefore, the mechanism needs to explore different allocations for each agent so that it can learn their values, while simultaneously attempting to find the socially optimal set of allocations. Our focus is on truthful and individually rational mechanisms which imitate the classical VCG mechanism in the long run. To that end, we first define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism. We show that these three terms are interdependent via an $\Omega(T^{\frac{2}{3}})$ lower bound for the maximum of these three terms after $T$ rounds of allocations, and describe an algorithm which essentially achieves this rate. Our framework also provides flexibility to control the pricing scheme so as to trade-off between the agent and seller regrets. Next, we define asymptotic variants for the truthfulness and individual rationality requirements and provide asymptotic rates to quantify the degree to which both properties are satisfied by the proposed algorithm.

私たちは、エージェントが自分の価値を知らない場合の、複数ラウンドの福祉最大化メカニズム設計問題を研究します。各ラウンドで、メカニズムは最初にエージェントの集合に割り当てを割り当て、価格を請求します。ラウンドの終わりに、エージェントは受け取った割り当てについてメカニズムに（確率的）フィードバックを提供します。この設定は、エージェントが割り当てを体験した後にのみ自分の価値を知る可能性があるクラウド市場やオンライン広告のアプリケーションに触発されています。したがって、メカニズムは各エージェントの異なる割り当てを探索してその価値を学習すると同時に、社会的に最適な割り当てセットを見つけようとします。私たちは、長期的には古典的なVCGメカニズムを模倣する、誠実で個々に合理的なメカニズムに焦点を当てる。そのために、我々はまず、福祉に対する後悔、各エージェントの個々の効用、メカニズムの効用の3つの概念を定義します。これら3つの項は、$T$回の割り当て後のこれら3つの項の最大値の下限$\Omega(T^{\frac{2}{3}})$を介して相互に依存していることを示し、このレートを本質的に達成するアルゴリズムについて説明します。私たちのフレームワークは、エージェントと売り手の後悔の間でトレードオフを行うように価格設定スキームを制御する柔軟性も提供します。次に、真実性と個々の合理性の要件の漸近バリアントを定義し、提案されたアルゴリズムによって両方の特性が満たされる度合いを定量化するための漸近レートを提供します。
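
As a point of reference, the classical one-shot VCG mechanism that the abstract says the proposed mechanism imitates in the long run can be sketched as follows. This is a minimal illustration that assumes the agents' values are known, which is exactly what the paper does not assume; the bandit-feedback algorithm itself is not shown. The names `best_allocation` and `vcg` and the toy value table are hypothetical, and the sketch assumes there are at least as many items as agents and that values are non-negative.

```python
from itertools import permutations

def best_allocation(values, agents):
    """Brute-force welfare-maximising assignment of distinct items to `agents`.

    values[i][j] is agent i's value for item j (assumed known here).
    """
    n_items = len(values[0])
    best_welfare, best_assign = float("-inf"), None
    for items in permutations(range(n_items), len(agents)):
        welfare = sum(values[a][j] for a, j in zip(agents, items))
        if welfare > best_welfare:
            best_welfare, best_assign = welfare, dict(zip(agents, items))
    return best_welfare, best_assign

def vcg(values):
    """Return the VCG assignment and payments for a table of known values."""
    agents = list(range(len(values)))
    welfare, assign = best_allocation(values, agents)
    payments = {}
    for i in agents:
        others = [a for a in agents if a != i]
        # Welfare the other agents could reach if agent i were absent.
        welfare_without_i, _ = best_allocation(values, others)
        # Welfare the other agents actually get in the chosen allocation.
        others_welfare_with_i = welfare - values[i][assign[i]]
        # Agent i pays the externality it imposes on the others.
        payments[i] = welfare_without_i - others_welfare_with_i
    return assign, payments

# Toy instance: two agents, two items.
assignment, payments = vcg([[10, 4], [8, 3]])
```

In the toy instance, agent 0 receives item 0 and pays the loss it causes agent 1, while agent 1 imposes no loss on agent 0 and pays nothing.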

Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems
ベイズ停止時間問題の逆強化学習のための必要十分条件

This paper presents an inverse reinforcement learning (IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality with respect to the observed strategies. Our IRL algorithm identifies optimality and then constructs set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. As a real-world example, we illustrate using a YouTube dataset comprising metadata from 190,000 videos how the proposed IRL method predicts user engagement in online multimedia platforms with high accuracy. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.

この論文では、ベイジアン停止時間問題のための逆強化学習(IRL)フレームワークを紹介します。ベイジアン意思決定者の行動を観察することにより、これらの行動がコスト関数の最適化と一致しているかどうかを識別するための必要十分条件を提供します。ベイジアン(部分的に観察された)設定では、逆学習者は、観察された戦略に関して最適性をせいぜい識別できます。IRLアルゴリズムは最適性を識別し、コスト関数の集合値推定値を構築します。このIRLの目的を達成するために、ミクロ経済学に由来するベイジアン顕示選好の新しいアイデアを使用します。提案されたIRLスキームを、停止時間問題の2つの重要な例、つまり逐次仮説検定とベイジアン検索を使用して説明します。実際の例として、190,000本のビデオのメタデータを含むYouTubeデータセットを使用して、提案されたIRLメソッドがオンラインマルチメディアプラットフォームでのユーザーエンゲージメントを高精度で予測する方法を示します。最後に、有限のデータセットに対して、IRL検出アルゴリズムを提案し、そのエラー確率の有限サンプル境界を示します。

Online Change-Point Detection in High-Dimensional Covariance Structure with Application to Dynamic Networks
高次元共分散構造におけるオンライン変化点検出と動的ネットワークへの応用

In this paper, we develop an online change-point detection procedure in the covariance structure of high-dimensional data. A new stopping rule is proposed to terminate the process as early as possible when a change in covariance structure occurs. The stopping rule allows spatial and temporal dependence and can be applied to non-Gaussian data. An explicit expression for the average run length is derived, so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. We also establish an upper bound for the expected detection delay, the expression of which demonstrates the impact of data dependence and magnitude of change in the covariance structure. Simulation studies are provided to confirm accuracy of the theoretical results. The practical usefulness of the proposed procedure is illustrated by detecting the change of brain’s covariance network in a resting-state fMRI data set. The implementation of the methodology is provided in the R package OnlineCOV.

この論文では、高次元データの共分散構造におけるオンライン変化点検出手順を開発します。共分散構造の変化が発生したときにできるだけ早くプロセスを終了するための新しい停止規則を提案します。停止規則は空間的および時間的依存性を許容し、非ガウスデータに適用できます。平均ランレングスの明示的な式が導出されるため、時間のかかるモンテカルロシミュレーションを実行することなく、停止規則のしきい値レベルを簡単に取得できます。また、予想される検出遅延の上限を確立し、その式はデータ依存性の影響と共分散構造の変化の大きさを示します。理論的結果の精度を確認するためにシミュレーション研究が提供されます。提案された手順の実用的な有用性は、安静時のfMRIデータセットで脳の共分散ネットワークの変化を検出することによって示されます。この方法論の実装は、RパッケージOnlineCOVで提供されます。

Convergence Rates of a Class of Multivariate Density Estimation Methods Based on Adaptive Partitioning
適応分割に基づく多変量密度推定法のクラスの収束率

Density estimation is a building block for many other statistical methods, such as classification, nonparametric testing, and data compression. In this paper, we focus on a non-parametric approach to multivariate density estimation, and study its asymptotic properties under both frequentist and Bayesian settings. The estimated density function is obtained by considering a sequence of approximating spaces to the space of densities. These spaces consist of piecewise constant density functions supported by binary partitions with increasing complexity. To obtain an estimate, the partition is learned by maximizing either the likelihood of the corresponding histogram on that partition, or the marginal posterior probability of the partition under a suitable prior. We analyze the convergence rate of the maximum likelihood estimator and the posterior concentration rate of the Bayesian estimator, and conclude that for a relatively rich class of density functions the rate does not directly depend on the dimension. We also show that the Bayesian method can adapt to the unknown smoothness of the density function. The method is applied to several specific function classes and explicit rates are obtained. These include spatially sparse functions, functions of bounded variation, and Hölder continuous functions. We also introduce an ensemble approach, obtained by aggregating multiple density estimates fit under carefully designed perturbations, and show that for density functions lying in a Hölder space ($\mathcal{H}^{1, \beta}, 0 < \beta \leq 1$), the ensemble method can achieve minimax convergence rate up to a logarithmic term, while the corresponding rate of the density estimator based on a single partition is suboptimal for this function class.

密度推定は、分類、ノンパラメトリック検定、データ圧縮など、他の多くの統計手法の基礎となるものです。この論文では、多変量密度推定に対するノンパラメトリック手法に焦点を当て、頻度主義とベイズ主義の両方の設定における漸近特性について検討します。推定された密度関数は、密度空間への近似空間のシーケンスを考慮することによって得られます。これらの空間は、複雑さが増すバイナリパーティションによってサポートされる区分的に一定の密度関数で構成されます。推定値を取得するには、そのパーティション上の対応するヒストグラムの尤度、または適切な事前確率の下でのパーティションの周辺事後確率のいずれかを最大化することによってパーティションを学習します。最大尤度推定量の収束率とベイズ推定量の事後集中率を分析し、比較的豊富な密度関数のクラスでは、収束率は次元に直接依存しないと結論付けました。また、ベイズ法は密度関数の未知の滑らかさに適応できることも示しています。この方法はいくつかの特定の関数クラスに適用され、明示的な率が得られます。これらには、空間的にスパースな関数、有界変動の関数、ヘルダー連続関数が含まれます。また、慎重に設計された摂動の下で適合された複数の密度推定値を集約することによって得られるアンサンブルアプローチも紹介し、ヘルダー空間（$\mathcal{H}^{1, \beta}, 0 < \beta \leq 1$）にある密度関数の場合、アンサンブル法は対数項までミニマックス収束率を達成できるが、単一のパーティションに基づく密度推定値の対応する率は、この関数クラスに対して最適ではないことを示します。

Reinforcement Learning for Joint Optimization of Multiple Rewards
複数の報酬の同時最適化のための強化学習

Finding optimal policies which maximize long term rewards of Markov Decision Processes requires the use of dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimization of an objective that is non-linear in cumulative rewards for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We notice that when an agent aims to optimize some function of the sum of rewards, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function. Further, using the fairness in cellular base-station scheduling, and queueing system scheduling as examples, the proposed algorithm is shown to significantly outperform the conventional RL approaches.

マルコフ決定過程の長期報酬を最大化する最適なポリシーを見つけるには、動的計画法と後方帰納法を使用してベルマン最適性方程式を解く必要があります。しかし、多くの現実の問題では、累積報酬が非線形である目的の最適化が必要であり、動的計画法を直接適用することはできません。たとえば、リソース割り当て問題では、目的の1つはユーザー間の長期的な公平性を最大化することです。エージェントが報酬の合計の何らかの関数を最適化することを目指す場合、問題はマルコフの性質を失うことがわかります。この論文では、報酬の長期平均の非線形関数を最適化する問題を取り上げ、形式化します。ポリシーを学習するためのモデルベースおよびモデルフリーのアルゴリズムを提案します。モデルベースのポリシーは、凹状のL-リプシッツ関数と組み合わせた$K$個の目的に対して$\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$の後悔を達成することが示されています。さらに、携帯電話基地局のスケジューリングとキューイングシステムのスケジューリングにおける公平性を例に挙げると、提案されたアルゴリズムは従来のRLアプローチよりも大幅に優れていることが示されています。
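
To make the difference between a linear and a non-linear objective concrete, here is a toy single-state sketch, not the paper's algorithm. It assumes a scheduler that serves user 0 with probability `p` (rate 2 when served) and user 1 otherwise (rate 1 when served), so the long-term average rewards are `(2p, 1 - p)`. The rates, the grid over `p`, and the choice of a sum-of-logs (proportional fairness) objective are all illustrative assumptions.

```python
import math

def average_rewards(p):
    """Long-term average reward of each user under serving probability p."""
    return (2.0 * p, 1.0 * (1.0 - p))

def total_reward(p):
    # Linear objective: plain sum of the two average rewards.
    return sum(average_rewards(p))

def proportional_fairness(p):
    # Concave, non-linear objective: sum of logs of the average rewards.
    return sum(math.log(r) for r in average_rewards(p))

# Grid over p in {0.01, 0.02, ..., 0.99}; endpoints are excluded to avoid log(0).
grid = [i / 100 for i in range(1, 100)]
best_p_sum = max(grid, key=total_reward)
best_p_fair = max(grid, key=proportional_fairness)
```

On this grid the linear objective pushes `p` to the largest allowed value, while the fairness objective is maximised by a mixed schedule at `p = 0.5`. This is why an objective of this kind cannot in general be handled by optimising each step's reward separately.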

On the Convergence of Stochastic Gradient Descent with Bandwidth-based Step Size
帯域幅ベースのステップサイズを持つ確率的勾配降下法の収束について

We first propose a general step-size framework for the stochastic gradient descent (SGD) method: bandwidth-based step sizes that are allowed to vary within a banded region. The framework provides efficient and flexible step size selection in optimization, including cyclical and non-monotonic step sizes (e.g., triangular policy and cosine with restart), for which theoretical guarantees are rare. We provide state-of-the-art convergence guarantees for SGD under mild conditions and allow a large constant step size at the beginning of training. Moreover, we investigate the error bounds of SGD under the bandwidth step size where the boundary functions are in the same order and different orders, respectively. Finally, we propose a $1/t$ up-down policy and design novel non-monotonic step sizes. Numerical experiments demonstrate the efficiency and significant potential of these bandwidth-based step sizes in training regularized logistic regression and several large-scale neural network tasks.

私たちは、まず、確率的勾配降下法(SGD)の一般的なステップサイズフレームワーク、つまりバンド領域内で変動することが許容される帯域幅ベースのステップサイズを提案します。このフレームワークは、理論的保証がまれな周期的および非単調なステップサイズ(三角方策やリスタートを伴うコサインなど)を含む、最適化における効率的で柔軟なステップサイズの選択を提供します。私たちは、温和な条件下でのSGDに最先端の収束保証を提供し、トレーニングの開始時に大きな一定のステップサイズを許可します。さらに、境界関数がそれぞれ同じ順序と異なる順序にある帯域幅ステップサイズの下でのSGDの誤差範囲を調査します。最後に、$1/t$のアップダウン方策を提案し、新しい非単調ステップサイズを設計します。数値実験では、これらの帯域幅ベースのステップサイズの効率と、正則化されたロジスティック回帰といくつかの大規模なニューラルネットワークタスクのトレーニングにおける大きな可能性が実証されています。
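
The idea of a step size that may oscillate as long as it stays inside a band can be sketched as below. This is a minimal illustration, not the paper's exact up-down policy. It assumes boundary functions `lower / t` and `upper / t`, a triangular wave between them, and a noisy one-dimensional quadratic as the objective. The constants, the period, and the function names are hypothetical.

```python
import random

def banded_step_size(t, lower=0.5, upper=2.0, period=10):
    """Triangular wave that stays inside the band [lower / t, upper / t]."""
    phase = (t % period) / period            # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * phase - 1.0)       # rises 0 -> 1, then falls 1 -> 0
    return (lower + (upper - lower) * tri) / t

def sgd(num_steps=2000, seed=0):
    """Run SGD on f(x) = x^2 / 2 with unit-variance gradient noise."""
    rng = random.Random(seed)
    x = 5.0
    for t in range(1, num_steps + 1):
        grad = x + rng.gauss(0.0, 1.0)       # stochastic gradient of f at x
        x -= banded_step_size(t) * grad
    return x

final_x = sgd()
```

The step size is non-monotonic within each period, but both of its envelopes decay like `1 / t`, which is the sense in which it is "banded".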

A Group-Theoretic Approach to Computational Abstraction: Symmetry-Driven Hierarchical Clustering
計算抽象化への群論的アプローチ:対称性駆動階層クラスタリング

Humans’ abstraction ability plays a key role in concept learning and knowledge discovery. This theory paper presents the mathematical formulation for computationally emulating human-like abstractions—computational abstraction—and abstraction processes developed hierarchically from innate priors like symmetries. We study the nature of abstraction via a group-theoretic approach, formalizing and practically computing abstractions as symmetry-driven hierarchical clustering. Compared to data-driven clustering like k-means or agglomerative clustering (a chain), our abstraction model is data-free, feature-free, similarity-free, and globally hierarchical (a lattice). This paper also serves as a theoretical generalization of several existing works. These include generalizing Shannon’s information lattice, specialized algorithms for certain symmetry-induced clusterings, as well as formalizing knowledge discovery applications such as learning music theory from scores and chemistry laws from molecules. We consider computational abstraction as a first step towards a principled and cognitive way of achieving human-level concept learning and knowledge discovery.

人間の抽象化能力は、概念学習と知識発見において重要な役割を果たします。この理論論文では、人間のような抽象化、つまり計算抽象化と、対称性などの生来の事前条件から階層的に開発された抽象化プロセスを計算的にエミュレートするための数学的定式化を示します。私たちは、抽象化の性質を群論的アプローチで研究し、対称性駆動型階層的クラスタリングとして抽象化を形式化し、実際に計算します。k-meansや凝集型クラスタリング(チェーン)などのデータ駆動型クラスタリングと比較すると、私たちの抽象化モデルは、データフリー、特徴フリー、類似性フリー、グローバルに階層化されています(格子)。この論文では、いくつかの既存の研究の理論的一般化としても機能します。これには、シャノンの情報格子の一般化、特定の対称性誘導クラスタリング用の特殊なアルゴリズム、楽譜から音楽理論を学習したり分子から化学法則を学習したりする知識発見アプリケーションの形式化が含まれます。私たちは、計算抽象化を、人間レベルの概念学習と知識発見を実現するための原理的かつ認知的な方法への第一歩であると考えています。

The d-Separation Criterion in Categorical Probability
圏論的確率論におけるd-分離基準

The d-separation criterion detects the compatibility of a joint probability distribution with a directed acyclic graph through certain conditional independences. In this work, we study this problem in the context of categorical probability theory by introducing a categorical definition of causal models, a categorical notion of d-separation, and proving an abstract version of the d-separation criterion. This approach has two main benefits. First, categorical d-separation is a very intuitive criterion based on topological connectedness. Second, our results apply both to measure-theoretic probability (with standard Borel spaces) and beyond probability theory, including to deterministic and possibilistic networks. It therefore provides a clean proof of the equivalence of local and global Markov properties with causal compatibility for continuous and mixed random variables as well as deterministic and possibilistic variables.

d-分離基準は、特定の条件付き独立性を通じて、同時確率分布と有向非巡回グラフとの互換性を検出します。この研究では、因果モデルの圏論的定義とd-分離の圏論的概念を導入し、d-分離基準の抽象的なバージョンを証明することにより、この問題を圏論的確率論の文脈で研究します。このアプローチには、主に2つの利点があります。第一に、圏論的d-分離は、位相的な連結性に基づく非常に直感的な基準です。第二に、私たちの結果は、測度論的確率(標準ボレル空間を使用)と、決定論的ネットワークおよび可能性的ネットワークを含む、確率論を超えた範囲の両方に適用されます。したがって、連続確率変数と混合確率変数、および決定論的変数と可能性的変数について、局所マルコフ性および大域マルコフ性と因果的互換性との同値性の明快な証明が得られます。
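
For readers who want the classical graph criterion in concrete form, here is a minimal sketch of a d-separation test on an ordinary DAG using the standard ancestral moral graph construction. It is the textbook criterion, not the categorical (string-diagram) version developed in the paper. The function name `d_separated` and the `parents` dictionary format are hypothetical choices for this example.

```python
def d_separated(parents, xs, ys, zs):
    """Test whether xs and ys are d-separated given zs in a DAG.

    parents maps each node to the set of its parents; nodes without
    parents may be omitted from the dictionary.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Restrict to the ancestors of xs, ys and zs (including themselves).
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralise: link every node to its parents, link co-parents to each
    #    other, and forget edge directions.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Delete zs and check whether any node of ys is reachable from xs.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj[node] - zs)
    return not (seen & ys)

collider = {"C": {"A", "B"}}             # A -> C <- B
chain = {"B": {"A"}, "C": {"B"}}         # A -> B -> C
```

In the collider, `A` and `B` are separated until `C` is conditioned on. In the chain, `A` and `C` are connected until `B` is conditioned on. The final step is a plain connectivity check, in the same spirit as the connectedness-based criterion the abstract describes.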

The multimarginal optimal transport formulation of adversarial multiclass classification
敵対的多クラス分類の多限界最適輸送定式化

We study a family of adversarial multiclass classification problems and provide equivalent reformulations in terms of: 1) a family of generalized barycenter problems introduced in the paper and 2) a family of multimarginal optimal transport problems where the number of marginals is equal to the number of classes in the original classification problem. These new theoretical results reveal a rich geometric structure of adversarial learning problems in multiclass classification and extend recent results restricted to the binary classification setting. A direct computational implication of our results is that by solving either the barycenter problem and its dual, or the MOT problem and its dual, we can recover the optimal robust classification rule and the optimal adversarial strategy for the original adversarial problem. Examples with synthetic and real data illustrate our results.

私たちは、敵対的多クラス分類問題のファミリーを研究し、1)論文で導入する一般化重心問題のファミリー、および2)周辺分布の数が元の分類問題のクラス数と等しい多周辺最適輸送問題のファミリーの観点から、同等の再定式化を提供します。これらの新しい理論的結果は、多クラス分類における敵対的学習問題の豊かな幾何学的構造を明らかにし、二値分類の設定に限定されていた最近の結果を拡張します。私たちの結果の直接的な計算上の意味は、重心問題とその双対、またはMOT問題とその双対のいずれかを解くことによって、元の敵対的問題に対する最適なロバストな分類ルールと最適な敵対的戦略を回復できるということです。合成データと実際のデータを使用した例で、私たちの結果を説明します。

Robust Load Balancing with Machine Learned Advice
機械学習のアドバイスによる堅牢なロードバランシング

Motivated by the exploding growth of web-based services and the importance of efficiently managing the computational resources of such systems, we introduce and study a theoretical model for load balancing of very large databases such as commercial search engines. Our model is a more realistic version of the well-received balls-into-bins model with an additional constraint that limits the number of servers that carry each piece of the data. This additional constraint is necessary when, on one hand, the data is so large that we cannot copy the whole data on each server. On the other hand, the query response time is so limited that we cannot ignore the fact that the number of queries for each piece of the data changes over time, and hence we cannot simply split the data over different machines. In this paper, we develop an almost optimal load balancing algorithm that works given an estimate of the load of each piece of the data. Our algorithm is almost perfectly robust to wrong estimates, to the extent that even when all of the loads are adversarially chosen the performance of our algorithm is $1-1/e$, which is provably optimal. Along the way, we develop various techniques for analyzing the balls-into-bins process under certain correlations and build a novel connection with the multiplicative weights update scheme.

ウェブベースのサービスの爆発的な成長と、そのようなシステムの計算リソースを効率的に管理することの重要性に触発されて、商用検索エンジンなどの非常に大規模なデータベースの負荷分散の理論モデルを紹介し、研究します。私たちのモデルは、各データ片を運ぶサーバーの数を制限する追加の制約を備えた、広く受け入れられているballs-into-binsモデルのより現実的なバージョンです。この追加の制約は、一方では、データが非常に大きいため、各サーバーにデータ全体をコピーできない場合に必要です。他方では、クエリの応答時間が非常に限られているため、各データ片に対するクエリの数が時間とともに変化するという事実を無視できず、したがって、データを異なるマシンに単純に分割することはできません。この論文では、各データ片の負荷の推定値があれば機能する、ほぼ最適な負荷分散アルゴリズムを開発します。私たちのアルゴリズムは、すべての負荷が敵対的に選択された場合でも、アルゴリズムのパフォーマンスが$1-1/e$であるほど、誤った推定に対してほぼ完全に堅牢であり、これは証明可能な最適値です。その過程で、特定の相関関係の下でのballs-into-binsプロセスを分析するためのさまざまな手法を開発し、乗法重み更新スキームとの新しい接続を構築します。
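
The constant $1-1/e$ is familiar from the basic balls-into-bins process: when $n$ balls are thrown uniformly at random into $n$ bins, the expected fraction of non-empty bins is $1-(1-1/n)^n$, which tends to $1-1/e$. The sketch below only illustrates that elementary fact. It does not model the paper's replication constraint or its algorithm, and the abstract does not say that the paper's bound arises in this way. The function names are hypothetical.

```python
import math
import random

def expected_nonempty_fraction(n):
    """Exact expected fraction of non-empty bins for n balls in n bins."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_nonempty_fraction(n, seed=0):
    """Throw n balls uniformly into n bins and count the bins that were hit."""
    rng = random.Random(seed)
    hit = {rng.randrange(n) for _ in range(n)}
    return len(hit) / n

limit = 1.0 - 1.0 / math.e
exact = expected_nonempty_fraction(10_000)
simulated = simulated_nonempty_fraction(100_000)
```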

Benchmarking Graph Neural Networks
グラフニューラルネットワークのベンチマーク

In the last few years, graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. This emerging field has witnessed an extensive growth of promising techniques that have been applied with success to computer science, mathematics, biology, physics and chemistry. But for any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress. This led us in March 2020 to release a benchmark framework that i) comprises a diverse collection of mathematical and real-world graphs, ii) enables fair model comparison with the same parameter budget to identify key architectures, iii) has an open-source, easy-to-use and reproducible code infrastructure, and iv) is flexible for researchers to experiment with new theoretical ideas. As of December 2022, the GitHub repository has reached 2,000 stars and 380 forks, which demonstrates the utility of the proposed open-source framework through the wide usage by the GNN community. In this paper, we present an updated version of our benchmark with a concise presentation of the aforementioned framework characteristics, an additional medium-sized molecular dataset AQSOL, similar to the popular ZINC, but with a real-world measured chemical target, and discuss how this framework can be leveraged to explore new GNN designs and insights. As a proof of value of our benchmark, we study the case of graph positional encoding (PE) in GNNs, which was introduced with this benchmark and has since spurred interest in exploring more powerful PE for Transformers and GNNs in a robust experimental setting.

ここ数年、グラフニューラルネットワーク(GNN)は、グラフ上のデータを分析し、そこから学習するための標準的なツールキットとなっています。この新興分野では、コンピューターサイエンス、数学、生物学、物理学、化学に成功裏に適用されてきた有望な技術が急速に成長しています。しかし、成功した分野が主流になり、信頼できるものになるためには、進捗状況を定量化するためのベンチマークを開発する必要があります。このため、2020年3月に、i)数学的グラフと現実世界のグラフの多様なコレクションで構成され、ii)同じパラメーターバジェットで公平なモデル比較を可能にして主要なアーキテクチャを特定し、iii)オープンソースで使いやすく再現可能なコードインフラストラクチャを備え、iv)研究者が新しい理論的アイデアを実験できる柔軟性を備えたベンチマークフレームワークをリリースしました。2022年12月現在、GitHubリポジトリは2,000個のスターと380個のフォークに達しており、GNNコミュニティによる幅広い使用を通じて、提案されたオープンソースフレームワークの有用性が実証されています。この論文では、前述のフレームワーク特性を簡潔に説明したベンチマークの更新バージョン、人気のZINCに似ているが実際の測定された化学ターゲットを含む追加の中規模分子データセットAQSOLを紹介し、このフレームワークを活用して新しいGNN設計と洞察を探索する方法を説明します。ベンチマークの価値を証明するために、このベンチマークで導入され、それ以来、堅牢な実験設定でトランスフォーマーとGNNのより強力なPEを探索することへの関心を刺激してきた、GNNのグラフ位置エンコーディング(PE)のケースを調査します。
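
The abstract does not spell out which positional encoding is used. One common choice for graph PE is to take the low-frequency eigenvectors of the normalized graph Laplacian as per-node features, and the sketch below illustrates that choice only, under that assumption. The function name `laplacian_pe` is hypothetical, and the sketch assumes a connected, undirected graph without isolated nodes.

```python
import numpy as np

def laplacian_pe(adj, k):
    """Return the k smallest non-trivial eigenpairs of the normalized Laplacian.

    adj is a dense symmetric adjacency matrix. The eigenvector columns can be
    used as k-dimensional positional features, one row per node.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    # Skip the first (trivial, eigenvalue 0) eigenvector.
    return eigvals[1:k + 1], eigvecs[:, 1:k + 1]

# Toy graph: a cycle on 6 nodes.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0

vals, pe = laplacian_pe(adj, k=2)
```

Eigenvectors are only defined up to sign (and up to rotation within repeated eigenvalues, as on this cycle), so in practice such features are usually combined with some form of sign randomisation or invariance.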

A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness
距離認識による単一モデルの深い不確実性を改善するためのシンプルなアプローチ

Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However, their practicality in real-time, industrial-scale applications is limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve the uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model’s ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks and on modern architectures (Wide-ResNet and BERT), SNGP consistently outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning.

不確実性の正確な定量化はディープラーニングにおける大きな課題です。ニューラルネットワークは自信過剰のエラーを起こし、分布外(OOD)入力に高い信頼度の予測を割り当てる可能性があるためです。ディープラーニングで予測の不確実性を推定する最も一般的なアプローチは、ベイジアンニューラルネットワーク(BNN)やディープアンサンブルなど、複数のニューラルネットワークからの予測を組み合わせる方法です。ただし、メモリと計算コストが高いため、リアルタイムの産業規模のアプリケーションでの実用性は限られています。さらに、アンサンブルとBNNは、必ずしも基礎となるメンバーネットワークのすべての問題を修正するわけではありません。この研究では、単一の決定論的表現に基づいて、単一のネットワークの不確実性特性を改善するための原理的なアプローチを研究します。不確実性の定量化をミニマックス学習問題として形式化することにより、まず、DNNが高品質(つまり、ミニマックス最適)の不確実性推定を実現するための必要条件として、距離認識、つまり、モデルがテスト例とトレーニングデータの距離を定量化する能力を特定します。次に、スペクトル正規化ニューラルガウス過程(SNGP)を提案します。これは、2つの簡単な変更((1)隠された重みにスペクトル正規化を適用して表現の双リプシッツ平滑性を強化する、(2)最後の出力層をガウス過程層に置き換える)により、最新のDNNの距離認識能力を向上させる簡単な方法です。一連の視覚および言語理解ベンチマークと最新のアーキテクチャ(Wide-ResNetおよびBERT)において、SNGPは予測、キャリブレーション、およびドメイン外検出において他の単一モデルアプローチよりも一貫して優れています。さらに、SNGPはディープアンサンブルやデータ拡張などの一般的な手法に補完的な利点を提供するため、確率的ディープラーニングのシンプルでスケーラブルな構成要素となります。
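
The first of the two changes, spectral normalization of a hidden weight matrix, can be sketched with plain power iteration as below. This is a generic NumPy illustration, not the authors' implementation, and the Gaussian process output layer is not shown. The bound `0.95`, the iteration count, and the function names are hypothetical.

```python
import numpy as np

def spectral_norm(weight, num_iters=100, seed=0):
    """Estimate the largest singular value of `weight` by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(weight.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = weight @ v
        u /= np.linalg.norm(u)
        v = weight.T @ u
        v /= np.linalg.norm(v)
    return float(u @ weight @ v)

def spectral_normalize(weight, bound=0.95):
    """Rescale `weight` so that its spectral norm is approximately at most `bound`."""
    sigma = spectral_norm(weight)
    if sigma > bound:
        return weight * (bound / sigma)
    return weight                      # already within the bound: leave unchanged

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
W_sn = spectral_normalize(W)
```

Power iteration slightly underestimates the true norm, so the rescaled matrix can exceed `bound` by a small margin. Bounding the norm of each layer limits how much that layer can stretch distances between inputs, which is the upper half of the bi-Lipschitz property mentioned in the abstract.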

Neural Implicit Flow: a mesh-agnostic dimensionality reduction paradigm of spatio-temporal data
ニューラル・インプリシット・フロー: メッシュに依存しない時空間データの次元削減パラダイム

High-dimensional spatio-temporal dynamics can often be encoded in a low-dimensional subspace. Engineering applications for modeling, characterization, design, and control of such large-scale systems often rely on dimensionality reduction to make solutions computationally tractable in real time. Common existing paradigms for dimensionality reduction include linear methods, such as the singular value decomposition (SVD), and nonlinear methods, such as variants of convolutional autoencoders (CAE). However, these encoding techniques lack the ability to efficiently represent the complexity associated with spatio-temporal data, which often requires variable geometry, non-uniform grid resolution, adaptive meshing, and/or parametric dependencies. To resolve these practical engineering challenges, we propose a general framework called Neural Implicit Flow (NIF) that enables a mesh-agnostic, low-rank representation of large-scale, parametric, spatial-temporal data. NIF consists of two modified multilayer perceptrons (MLPs): (i) ShapeNet, which isolates and represents the spatial complexity, and (ii) ParameterNet, which accounts for any other input complexity, including parametric dependencies, time, and sensor measurements. We demonstrate the utility of NIF for parametric surrogate modeling, enabling the interpretable representation and compression of complex spatio-temporal dynamics, efficient many-spatial-query tasks, and improved generalization performance for sparse reconstruction.

高次元の時空間ダイナミクスは、多くの場合、低次元のサブスペースにエンコードできます。このような大規模システムのモデリング、特性評価、設計、制御のためのエンジニアリングアプリケーションでは、ソリューションをリアルタイムで計算可能にするために、次元削減が頻繁に使用されます。次元削減の一般的な既存のパラダイムには、特異値分解(SVD)などの線形手法や、畳み込みオートエンコーダ(CAE)のバリアントなどの非線形手法があります。ただし、これらのエンコード手法では、可変ジオメトリ、不均一なグリッド解像度、適応メッシュ、および/またはパラメトリック依存関係を必要とすることが多い時空間データに関連する複雑さを効率的に表現することができません。これらの実用的なエンジニアリングの課題を解決するために、大規模なパラメトリックな時空間データのメッシュに依存しない低ランク表現を可能にする、Neural Implicit Flow (NIF)と呼ばれる一般的なフレームワークを提案します。NIFは、2つの修正された多層パーセプトロン(MLP)で構成されています。(i)空間の複雑さを分離して表現するShapeNet、および(ii)パラメトリック依存性、時間、センサー測定など、その他の入力の複雑さを考慮したParameterNetです。パラメトリックサロゲートモデリングにおけるNIFの有用性を実証し、複雑な時空間ダイナミクスの解釈可能な表現と圧縮、効率的な多空間クエリタスク、およびスパース再構成の一般化パフォーマンスの向上を実現します。

On Batch Teaching Without Collusion
共謀を伴わないバッチティーチングについて

Formal models of learning from teachers need to respect certain criteria to avoid collusion. The most commonly accepted notion of collusion-avoidance was proposed by Goldman and Mathias (1996), and various teaching models obeying their criterion have been studied. For each model $M$ and each concept class $\mathcal{C}$, a parameter $M$-TD$(\mathcal{C})$ refers to the teaching dimension of concept class $\mathcal{C}$ in model $M$—defined to be the number of examples required for teaching a concept, in the worst case over all concepts in $\mathcal{C}$. This paper introduces a new model of teaching, called no-clash teaching, together with the corresponding parameter NCTD$(\mathcal{C})$. No-clash teaching is provably optimal in the strong sense that, given any concept class $\mathcal{C}$ and any model $M$ obeying Goldman and Mathias’s collusion-avoidance criterion, one obtains NCTD$(\mathcal{C})\le M$-TD$(\mathcal{C})$. We also study a corresponding notion NCTD$^+$ for the case of learning from positive data only, establish useful bounds on NCTD and NCTD$^+$, and discuss relations of these parameters to other complexity parameters of interest in computational learning theory. We further argue that Goldman and Mathias’s collusion-avoidance criterion may in some settings be too weak in that it admits certain forms of interaction between teacher and learner that could be considered collusion in practice. Therefore, we introduce a strictly stronger notion of collusion-avoidance and demonstrate that the well-studied notion of Preference-based Teaching is optimal among all teaching schemes that are strongly collusion-avoiding on all finite subsets of a given concept class.

教師から学ぶ正式なモデルは、共謀を避けるために特定の基準を尊重する必要があります。共謀回避の最も一般的に受け入れられている概念は、GoldmanとMathias (1996)によって提案され、その基準に従うさまざまな教授モデルが研究されてきました。各モデル$M$と各概念クラス$\mathcal{C}$について、パラメーター$M$-TD$(\mathcal{C})$は、モデル$M$の概念クラス$\mathcal{C}$の教授次元を参照します。これは、$\mathcal{C}$内のすべての概念における最悪のケースで、概念を教えるのに必要な例の数として定義されます。この論文では、衝突なし教授と呼ばれる新しい教授モデルと、対応するパラメーターNCTD$(\mathcal{C})$を紹介します。衝突のない教授法は、任意の概念クラス$\mathcal{C}$とGoldmanとMathiasの共謀回避基準に従う任意のモデル$M$が与えられた場合、NCTD$(\mathcal{C})\le M$-TD$(\mathcal{C})$が得られるという強い意味で最適であることが証明できます。また、正のデータのみから学習する場合の対応する概念NCTD$^+$を研究し、NCTDとNCTD$^+$の有用な境界を確立し、これらのパラメータと計算学習理論で重要な他の複雑性パラメータとの関係について議論します。さらに、GoldmanとMathiasの共謀回避基準は、実際には共謀と見なされる可能性のある教師と学習者の間の特定の形式の相互作用を許容するという点で、設定によっては弱すぎる可能性があると主張します。したがって、私たちは、より厳密な共謀回避の概念を導入し、特定の概念クラスのすべての有限サブセット上で強力に共謀を回避するすべての教授スキームの中で、よく研究されている嗜好に基づく教授の概念が最適であることを示します。
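
To make "the number of examples required for teaching a concept, in the worst case over all concepts" concrete, the following brute-force sketch computes the classical teaching dimension of a small finite concept class. It illustrates the classical notion only, not the no-clash model or NCTD introduced in the paper. Concepts are encoded as 0/1 tuples over a finite instance space, and the function names are hypothetical.

```python
from itertools import combinations

def teaching_set_size(concept, concept_class):
    """Smallest number of labelled instances that single out `concept`.

    A set of instance indices teaches `concept` if every other concept in
    the class disagrees with it on at least one of those instances.
    """
    n = len(concept)
    others = [c for c in concept_class if c != concept]
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            if all(any(c[i] != concept[i] for i in idx) for c in others):
                return k
    # Not reached: distinct concepts always differ on the full instance set.

def teaching_dimension(concept_class):
    """Worst case of teaching_set_size over all concepts in the class."""
    return max(teaching_set_size(c, concept_class) for c in concept_class)

# The empty concept plus the three singletons over a 3-element instance space.
C = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
td = teaching_dimension(C)
```

Each singleton can be taught with its one positive example, but the empty concept needs all three negative examples, so the worst case for this class is 3.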

Sensing Theorems for Unsupervised Learning in Linear Inverse Problems
線形逆問題における教師なし学習のためのセンシング定理

Solving an ill-posed linear inverse problem requires knowledge about the underlying signal model. In many applications, this model is a priori unknown and has to be learned from data. However, it is impossible to learn the model using observations obtained via a single incomplete measurement operator, as there is no information about the signal model in the nullspace of the operator, resulting in a chicken-and-egg problem: to learn the model we need reconstructed signals, but to reconstruct the signals we need to know the model. Two ways to overcome this limitation are using multiple measurement operators or assuming that the signal model is invariant to a certain group action. In this paper, we present necessary and sufficient sensing conditions for learning the signal model from measurement data alone which only depend on the dimension of the model and the number of operators or properties of the group action that the model is invariant to. As our results are agnostic of the learning algorithm, they shed light on the fundamental limitations of learning from incomplete data and have implications for a wide range of practical algorithms, such as dictionary learning, matrix completion and deep neural networks.

不適切線形逆問題を解くには、基礎となる信号モデルに関する知識が必要です。多くのアプリケーションでは、このモデルは事前に未知であり、データから学習する必要があります。ただし、単一の不完全な測定演算子を介して取得された観測値を使用してモデルを学習することは不可能です。これは、演算子のヌル空間に信号モデルに関する情報がないためであり、鶏が先か卵が先かという問題が生じます。つまり、モデルを学習するには再構成された信号が必要ですが、信号を再構成するにはモデルを知る必要があります。この制限を克服する2つの方法は、複数の測定演算子を使用するか、信号モデルが特定のグループアクションに対して不変であると仮定することです。この論文では、測定データのみから信号モデルを学習するための必要かつ十分なセンシング条件を提示します。この条件は、モデルの次元と、モデルが不変であるグループアクションの演算子またはプロパティの数にのみ依存します。私たちの結果は学習アルゴリズムに依存しないため、不完全なデータからの学習の基本的な制限に光を当て、辞書学習、行列補完、ディープニューラルネットワークなどの幅広い実用的なアルゴリズムに影響を与えます。
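
The nullspace obstruction described above can be seen in a few lines of NumPy. The sketch builds two random incomplete operators and shows that a perturbation in the nullspace of the first is invisible to it but visible to the second. The dimensions and random operators are hypothetical, and the rank check at the end illustrates only the complementary-information idea, not the paper's actual sensing conditions, which depend on the dimension of the signal model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3                               # signal dimension, measurements per operator
A1 = rng.standard_normal((m, n))          # first incomplete operator (m < n)
A2 = rng.standard_normal((m, n))          # second incomplete operator

# A direction in the nullspace of A1: the last right-singular vector.
_, _, vt = np.linalg.svd(A1)
null_vec = vt[-1]

x = rng.standard_normal(n)
x_alt = x + null_vec                      # a different signal

# A1 cannot tell x and x_alt apart, but A2 can.
same_under_A1 = bool(np.allclose(A1 @ x, A1 @ x_alt))
differ_under_A2 = not np.allclose(A2 @ x, A2 @ x_alt)

# Together the two operators see every direction of the signal space.
stacked_rank = int(np.linalg.matrix_rank(np.vstack([A1, A2])))
```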

First-Order Algorithms for Nonlinear Generalized Nash Equilibrium Problems
非線形一般化ナッシュ均衡問題に対する1次アルゴリズム

We consider the problem of computing an equilibrium in a class of nonlinear generalized Nash equilibrium problems (NGNEPs) in which the strategy sets for each player are defined by the equality and inequality constraints that may depend on the choices of rival players. While the asymptotic global convergence and local convergence rate of certain algorithms have been extensively investigated, the iteration complexity analysis is still in its infancy. This paper provides two first-order algorithms based on quadratic penalty method (QPM) and augmented Lagrangian method (ALM), respectively, with an accelerated mirror-prox algorithm as the solver in each inner loop. We show the nonasymptotic convergence rate for these algorithms. In particular, we establish the global convergence guarantee for solving monotone and strongly monotone NGNEPs and provide the complexity bounds expressed in terms of the number of gradient evaluations. Experimental results demonstrate the efficiency of our algorithms in practice.

私たちは、各プレイヤーの戦略セットがライバルプレイヤーの選択に依存する可能性のある等式制約と不等式制約によって定義される非線形一般化ナッシュ均衡問題(NGNEP)のクラスにおける均衡を計算する問題を考察します。特定のアルゴリズムの漸近的なグローバル収束とローカル収束率は広範に調査されてきましたが、反復計算の複雑さの分析はまだ初期段階にあります。この論文では、加速ミラー近似アルゴリズムを各内部ループのソルバーとして使用する、それぞれ2次ペナルティ法(QPM)と拡張ラグランジュ法(ALM)に基づく2つの1次アルゴリズムを示します。これらのアルゴリズムの非漸近収束率を示します。特に、単調および強単調なNGNEPを解くためのグローバル収束保証を確立し、勾配評価の数で表された複雑さの境界を示します。実験結果は、実際のアルゴリズムの効率を実証しています。

Ridges, Neural Networks, and the Radon Transform
リッジ、ニューラルネットワーク、ラドン変換

A ridge is a function that is characterized by a one-dimensional profile (activation) and a multidimensional direction vector. Ridges appear in the theory of neural networks as functional descriptors of the effect of a neuron, with the direction vector being encoded in the linear weights. In this paper, we investigate properties of the Radon transform in relation to ridges and to the characterization of neural networks. We introduce a broad category of hyper-spherical Banach subspaces (including the relevant subspace of measures) over which the back-projection operator is invertible. We also give conditions under which the back-projection operator is extendable to the full parent space with its null space being identifiable as a Banach complement. Starting from first principles, we then characterize the sampling functionals that are in the range of the filtered Radon transform. Next, we extend the definition of ridges for any distributional profile and determine their (filtered) Radon transform in full generality. Finally, we apply our formalism to clarify and simplify some of the results and proofs on the optimality of ReLU networks that have appeared in the literature.

リッジは、1次元プロファイル(活性化)と多次元方向ベクトルによって特徴付けられる関数です。リッジは、ニューラルネットワークの理論では、ニューロンの効果の機能記述子として登場し、方向ベクトルは線形重みにエンコードされます。この論文では、リッジとニューラルネットワークの特性との関係で、ラドン変換の特性について調査します。バックプロジェクション演算子が可逆である、超球面バナッハ部分空間(関連する測度の部分空間を含む)の広範なカテゴリを紹介します。また、バックプロジェクション演算子が完全な親空間に拡張可能であり、そのヌル空間がバナッハ補集合として識別可能である条件も示します。第一原理から始めて、フィルター処理されたラドン変換の範囲内にあるサンプリング関数を特徴付けます。次に、任意の分布プロファイルに対してリッジの定義を拡張し、それらの(フィルター処理された)ラドン変換を完全な一般性で決定します。最後に、私たちの形式論を適用して、文献に記載されているReLUネットワークの最適性に関する結果と証明のいくつかを明確化し、簡素化します。
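
The definition of a ridge in the first sentence can be written down directly: the function depends on its input only through the projection onto the direction vector. The sketch below uses a ReLU profile and checks that moving the input orthogonally to the direction leaves the value unchanged. The specific vectors and the function name `ridge` are hypothetical.

```python
import numpy as np

def ridge(x, w, profile):
    """Ridge function x -> profile(<w, x>) with direction w and a 1-D profile."""
    return profile(x @ w)

def relu(t):
    return np.maximum(t, 0.0)

w = np.array([1.0, 2.0, -1.0])            # direction vector (a neuron's weights)
x = np.array([0.5, 1.0, 0.25])
v = np.array([2.0, -1.0, 0.0])            # orthogonal to w: w @ v == 0

value = ridge(x, w, relu)
shifted_value = ridge(x + 3.0 * v, w, relu)
```

Both evaluations give the same number, because the input only enters through the inner product with `w`. This is the sense in which a single neuron is constant along every hyperplane orthogonal to its weight vector.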

Label Distribution Changing Learning with Sample Space Expanding
サンプル空間の拡大を伴うラベル分布変化学習

With the evolution of data collection methods, label ambiguity has arisen in various applications. How to reduce its uncertainty and leverage its effectiveness is still a challenging task. As two representative types of label ambiguity, Label Distribution Learning (LDL), which annotates each instance with a label distribution, and Emerging New Class (ENC), which focuses on model reuse with new classes, have attracted extensive attention. Nevertheless, in many applications, such as emotion distribution recognition and facial age estimation, we may face a more complicated label ambiguity scenario, i.e., label distribution changing with sample space expanding owing to the new class. To solve this crucial but rarely studied problem, we propose a new framework named Label Distribution Changing Learning (LDCL) in this paper, together with its theoretical guarantee in the form of a generalization error bound. Our approach expands the sample space by re-scaling the previous distribution and then estimates the emerging label value via a scaling constraint factor. For demonstration, we present two special cases within the framework, together with their optimizations and convergence analyses. Besides evaluating LDCL on most of the existing 13 data sets, we also apply it to emotion distribution recognition. Experimental results demonstrate the effectiveness of our approach in both tackling the label ambiguity problem and estimating facial emotion.

データ収集方法の進化に伴い、さまざまなアプリケーションからラベルの曖昧さが生じています。その不確実性を減らし、その有効性をどのように活用するかは、依然として困難な課題です。代表的なラベルの曖昧さとして、各インスタンスにラベル分布を注釈するラベル分布学習(LDL)と、新しいクラスでのモデルの再利用に重点を置く新興新クラス(ENC)の2種類が大きな注目を集めています。ただし、感情分布認識や顔の年齢推定などの多くのアプリケーションでは、より複雑なラベルの曖昧さのシナリオ、つまり、新しいクラスによるサンプル空間の拡大とともにラベル分布が変化するというシナリオに直面する可能性があります。この重要でありながらほとんど研究されていない問題を解決するために、この論文では、一般化誤差の境界による理論的保証とともに、ラベル分布変更学習(LDCL)という新しいフレームワークを提案します。このアプローチでは、以前の分布を再スケーリングしてサンプル空間を拡大し、スケーリング制約係数を介して出現ラベル値を推定します。デモンストレーションのために、フレームワーク内の2つの特殊なケースと、それらの最適化および収束分析を示します。既存の13データセットのほとんどでLDCLを評価するだけでなく、感情分布認識のアプリケーションにも適用します。実験結果は、ラベルの曖昧さの問題への取り組みと顔の感情の推定の両方において、私たちのアプローチの有効性を実証しています。
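The re-scaling step described in the abstract can be illustrated with a small sketch. It only shows the general idea of shrinking an existing label distribution so that a new class receives the remaining mass. The function name `expand_label_distribution`, the parameter `scale`, and the specific rule are assumptions made for illustration, not the estimator proposed in the paper.

```python
def expand_label_distribution(old_dist, scale):
    """Rescale an existing label distribution and append one new class.

    `scale` in [0, 1] is the total mass kept by the old labels; the
    remaining 1 - scale is assigned to the emerging label. The name and
    the rule are illustrative assumptions, not the paper's method.
    """
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0, 1]")
    total = sum(old_dist)
    if total <= 0:
        raise ValueError("old_dist must have positive total mass")
    # Normalize first so the result sums to one even if old_dist did not.
    rescaled = [scale * p / total for p in old_dist]
    return rescaled + [1.0 - scale]


old = [0.5, 0.3, 0.2]                       # distribution over three known labels
new = expand_label_distribution(old, scale=0.8)
print(new)                                  # four entries; the last is the new class
```

The relative proportions among the old labels are preserved, which is the property a re-scaling of the previous distribution is meant to keep.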

Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopically Rational Followers?
強化学習は、近視眼的に合理的なフォロワーを持つ一般和マルコフゲームでスタックルベルグ・ナッシュ均衡を見つけることができるか?

We study multi-player general-sum Markov games with one of the players designated as the leader and the other players regarded as followers. In particular, we focus on the class of games where the followers are myopically rational; i.e., they aim to maximize their instantaneous rewards. For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$ such that: (i) $\pi^*$ is the optimal policy for the leader when the followers always play their best response, and (ii) $\nu^*$ is the best response policy of the followers, which is a Nash equilibrium of the followers’ game induced by $\pi^*$. We develop sample-efficient reinforcement learning (RL) algorithms for solving for an SNE in both online and offline settings. Our algorithms are optimistic and pessimistic variants of least-squares value iteration, and they are readily able to incorporate function approximation tools in the setting of large state spaces. Furthermore, for the case with linear function approximation, we prove that our algorithms achieve sublinear regret and suboptimality under online and offline setups respectively. To the best of our knowledge, we establish the first provably efficient RL algorithms for solving for SNEs in general-sum Markov games with myopically rational followers.

私たちは、プレイヤーの1人をリーダー、他のプレイヤーをフォロワーとするマルチプレイヤー一般和マルコフゲームを研究します。特に、フォロワーが近視眼的に合理的なゲーム、つまり瞬間報酬の最大化を目指すゲームのクラスに焦点を当てます。このようなゲームでは、スタックルベルクナッシュ均衡(SNE)を見つけることが目標です。スタックルベルクナッシュ均衡とは、(i)フォロワーが常に最善の応答をする場合のリーダーの最適ポリシーが$\pi^*$であり、(ii)フォロワーの最善の応答ポリシーが$\nu^*$であり、これは$\pi^*$によって誘導されるフォロワーのゲームのナッシュ均衡です。私たちは、オンラインとオフラインの両方の設定でSNEを解決するためのサンプル効率の高い強化学習(RL)アルゴリズムを開発します。私たちのアルゴリズムは、最小二乗値反復の楽観的および悲観的な変種であり、大規模な状態空間の設定で関数近似ツールを簡単に組み込むことができます。さらに、線形関数近似の場合、私たちのアルゴリズムは、それぞれオンラインおよびオフラインの設定で線形以下の後悔と準最適性を達成することを証明しています。私たちの知る限りでは、近視眼的に合理的なフォロワーを持つ一般和マルコフゲームでSNEを解決するための、初めて証明可能な効率的なRLアルゴリズムを確立しました。
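To make the leader-follower structure concrete, here is a minimal sketch of a single-stage game with one follower and pure strategies. The follower is myopic in the sense that it maximizes its immediate reward given the leader's action, and the leader optimizes while anticipating that response. The reward matrices, function names, and tie-breaking by lowest index are hypothetical. This is not the paper's least-squares value iteration algorithm, only the equilibrium concept on a toy example.

```python
def follower_best_response(follower_reward, a):
    """Index of the follower action that maximizes its immediate reward."""
    row = follower_reward[a]
    return max(range(len(row)), key=lambda b: row[b])  # ties: lowest index


def stackelberg_pure(leader_reward, follower_reward):
    """Enumerate leader actions, assuming the follower always best-responds."""
    best = None
    for a in range(len(leader_reward)):
        b = follower_best_response(follower_reward, a)
        value = leader_reward[a][b]
        if best is None or value > best[2]:
            best = (a, b, value)
    return best


# Rows: leader actions, columns: follower actions (hypothetical numbers).
leader_reward = [[2, 4], [1, 3]]
follower_reward = [[1, 0], [0, 1]]

a_star, b_star, v_star = stackelberg_pure(leader_reward, follower_reward)
print(a_star, b_star, v_star)
```

In this toy game the leader's action 0 gives a higher reward for every fixed follower action, yet committing to action 1 is better once the follower's best response is taken into account.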

Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations and Beyond
Quantus: ニューラルネットワークの説明などを責任を持って評価するための説明可能なAIツールキット

The evaluation of explanation methods is a research topic that has not yet been explored deeply. However, since explainability is supposed to strengthen trust in artificial intelligence, it is necessary to systematically review and compare explanation methods in order to confirm their correctness. Until now, no tool focused on XAI evaluation has existed that allows researchers to exhaustively and speedily evaluate the performance of explanations of neural network predictions. To increase transparency and reproducibility in the field, we therefore built Quantus, a comprehensive evaluation toolkit in Python that includes a growing, well-organised collection of evaluation metrics and tutorials for evaluating explainable methods. The toolkit has been thoroughly tested and is available under an open-source license on PyPI (or on https://github.com/understandable-machine-intelligence-lab/Quantus/).

説明手法の評価は、まだ深く掘り下げられていない研究テーマですが、説明可能性が人工知能への信頼を強めると考えられているため、その正しさを確認するためには、説明手法を体系的に見直し、比較する必要があります。これまで、XAI評価に着目したツールで、ニューラルネットワーク予測の説明の性能を網羅的かつスピーディーに評価できるツールは存在しませんでした。そこで、現場での透明性と再現性を高めるために、Pythonによる包括的な評価ツールキットであるQuantusを構築しました。これには、評価メトリクスの増え続けるよく整理されたコレクションと、説明可能な方法を評価するためのチュートリアルが含まれています。ツールキットは徹底的にテストされており、PyPi(またはhttps://github.com/understandable-machine-intelligence-lab/Quantus/)のオープンソースライセンスの下で利用できます。

Gap Minimization for Knowledge Sharing and Transfer
知識の共有と伝達のためのギャップの最小化

Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of performance gap, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g., $\mathcal{H}$-divergence or discrepancy distance), we theoretically show that the performance gap can be viewed as a data- and algorithm-dependent regularizer, which controls the model complexity and leads to finer guarantees. More importantly, it also provides new insights and motivates a novel principle for designing strategies for knowledge sharing and transfer: gap minimization. We instantiate this principle with two algorithms: 1. gapBoost, a novel and principled boosting algorithm that explicitly minimizes the performance gap between source and target domains for transfer learning; and 2. gapMTNN, a representation learning algorithm that reformulates gap minimization as semantic conditional matching for multitask learning. Our extensive evaluation on both transfer learning and multitask learning benchmark data sets shows that our methods outperform existing baselines.

過去20年間で、知識の共有と移転による複数の関連タスクからの学習の重要性が高まっています。あるタスクから別のタスクに情報をうまく移転するには、ドメイン間の類似点と相違点を理解することが重要です。この論文では、学習タスク間の距離を測定する直感的で新しい尺度であるパフォーマンスギャップの概念を紹介します。タスク間の予想されるリスクの差を制限するツールとして使用される既存の尺度($\mathcal{H}$ダイバージェンスまたは不一致距離など)とは異なり、パフォーマンスギャップは、モデルの複雑さを制御し、より細かい保証につながるデータおよびアルゴリズムに依存する正則化子と見なすことができることを理論的に示します。さらに重要なことに、これは新しい洞察も提供し、知識の共有と移転の戦略を設計するための新しい原則であるギャップ最小化を促進します。この原則を2つのアルゴリズムでインスタンス化します。1. gapBoost。これは、転移学習のソースドメインとターゲットドメイン間のパフォーマンスギャップを明示的に最小化する、新しい原理的なブースティングアルゴリズムです。2. gapMTNN、ギャップ最小化をマルチタスク学習の意味的条件マッチングとして再定式化する表現学習アルゴリズム。転移学習とマルチタスク学習のベンチマークデータセットの両方で広範囲に評価した結果、当社の手法が既存のベースラインを上回ることが示されました。

Sparse PCA: a Geometric Approach
スパース PCA: 幾何学的アプローチ

We consider the problem of maximizing the variance explained from a data matrix using orthogonal sparse principal components that have a support of fixed cardinality. While most existing methods focus on building principal components (PCs) iteratively through deflation, we propose GeoSPCA, a novel algorithm that builds all PCs at once while satisfying the orthogonality constraints, which brings substantial benefits over deflation. This novel approach is based on the left eigenvalues of the covariance matrix, which helps circumvent the non-convexity of the problem by approximating the optimal solution using a binary linear optimization problem that can find the optimal solution. The resulting approximation can be used to tackle different versions of the sparse PCA problem, including the case in which the principal components share the same support or have disjoint supports, and the Structured Sparse PCA problem. We also propose optimality bounds and illustrate the benefits of GeoSPCA on selected real-world problems in terms of explained variance, sparsity, and tractability. Improvements over the greedy algorithm, which is often on par with state-of-the-art techniques, reach up to 24% in terms of variance while solving real-world problems with 10,000s of variables and support cardinality of 100s in minutes. We also apply GeoSPCA to a face recognition problem, yielding more than 10% improvement over other PCA-based techniques such as structured sparse PCA.

私たちは、固定カーディナリティのサポートを持つ直交スパース主成分を使用して、データマトリックスから説明される分散を最大化する問題を考察します。既存の方法のほとんどは、デフレーションを通じて主成分(PC)を反復的に構築することに重点を置いていますが、私たちは、直交性制約を満たしながらすべてのPCを一度に構築する新しいアルゴリズムであるGeoSPCAを提案します。これはデフレーションに比べて大きな利点をもたらします。この新しいアプローチは、共分散マトリックスの左固有値に基づいています。これは、最適解を見つけることができるバイナリ線形最適化問題を使用して最適解を近似することにより、問題の非凸性を回避するのに役立ちます。結果として得られる近似は、主成分が同じサポートを共有する場合や、サポートが互いに素である場合、および構造化スパースPCA問題など、スパースPCA問題のさまざまなバージョンに取り組むために使用できます。また、最適性境界を提案し、説明される分散、スパース性、扱いやすさの観点から、選択された現実の問題におけるGeoSPCAの利点を示します。最先端の技術と同等であることが多い貪欲アルゴリズムと比べると、分散の点で最大24%の改善が見られ、数万個の変数と数百のサポート基数を持つ現実世界の問題を数分で解くことができます。また、顔認識の問題にGeoSPCAを適用したところ、構造化スパースPCAなどの他のPCAベースの技術と比べて10%以上の改善が見られました。

Labels, Information, and Computation: Efficient Learning Using Sufficient Labels
ラベル、情報、計算:十分なラベルを用いた効率的な学習

In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic “sufficiently-labeled data” and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. And we prove that it is easier to directly obtain sufficiently-labeled data than obtaining fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.

教師あり学習では、完全にラベル付けされたトレーニングデータの大規模なセットを取得するのはコストがかかります。有能な分類器をトレーニングするために、すべてのトレーニング例の完全なラベル情報が必ずしも必要ではないことを示します。具体的には、統計の十分性の原則に着想を得て、分類に関連するほぼすべての情報をキャプチャしながらも、直接取得しやすい完全にラベル付けされたトレーニングセットの統計(要約)を示します。この統計を「十分にラベル付けされたデータ」と呼び、クラスごとにランダムに選択された1つの完全にラベル付けされた例だけを使用して有能な分類器ヘッドをトレーニングできる最適な隠れた表現を見つけるためのその十分性と効率性を証明します。十分にラベル付けされたデータは、最初に完全にラベル付けされたデータを収集しなくても、アノテーターから直接取得できます。また、完全にラベル付けされたデータを取得するよりも、十分にラベル付けされたデータを直接取得する方が簡単であることを証明します。さらに、十分にラベル付けされたデータは、絶対情報ではなく相対情報を格納するため、当然ながらより安全です。私たちの理論を裏付ける広範な実験結果が提供されています。

Attacks against Federated Learning Defense Systems and their Mitigation
フェデレーテッド・ラーニング防御システムに対する攻撃とその緩和策

The susceptibility of federated learning (FL) to attacks from untrustworthy endpoints has led to the design of several defense systems. FL defense systems enhance the federated optimization algorithm using anomaly detection, scaling the updates from endpoints depending on their anomalous behavior. However, the defense systems themselves may be exploited by the endpoints with more sophisticated attacks. First, this paper proposes three categories of attacks and shows that they can effectively deceive some well-known FL defense systems. In the first two categories, referred to as on-off attacks, the adversary toggles between being honest and engaging in attacks. We analyse two such on-off attacks, label flipping and free riding, and show their impact against existing FL defense systems. As a third category, we propose attacks based on “good mouthing” and “bad mouthing”, to boost or diminish influence of the victim endpoints on the global model. Secondly, we propose a new federated optimization algorithm, Viceroy, that can successfully mitigate all the proposed attacks. The proposed attacks and the mitigation strategy have been tested on a number of different experiments establishing their effectiveness in comparison with other contemporary methods. The proposed algorithm has also been made available as open source. Finally, in the appendices, we provide an induction proof for the on-off model poisoning attack, and the proof of convergence and adversarial tolerance for the new federated optimization algorithm.

信頼できないエンドポイントからの攻撃に対する連合学習(FL)の脆弱性により、いくつかの防御システムが設計されました。FL防御システムは、異常検出を使用して連合最適化アルゴリズムを強化し、エンドポイントの異常な動作に応じてエンドポイントからの更新をスケーリングします。ただし、防御システム自体は、エンドポイントによってより洗練された攻撃で悪用される可能性があります。まず、この論文では3つの攻撃カテゴリを提案し、それらがいくつかのよく知られているFL防御システムを効果的に欺くことができることを示します。最初の2つのカテゴリはオンオフ攻撃と呼ばれ、敵対者は正直であることと攻撃を行うことを切り替えます。このような2つのオンオフ攻撃、ラベルフリッピングとフリーライディングを分析し、既存のFL防御システムに対する影響を示します。3番目のカテゴリとして、「良い口調」と「悪い口調」に基づく攻撃を提案し、被害エンドポイントのグローバルモデルへの影響を増減します。次に、提案されたすべての攻撃を正常に軽減できる新しい連合最適化アルゴリズム、Viceroyを提案します。提案された攻撃と緩和戦略は、さまざまな実験でテストされ、他の最新の方法と比較してその有効性が立証されています。提案されたアルゴリズムは、オープンソースとしても公開されています。最後に、付録では、オンオフモデルポイズニング攻撃の帰納的証明と、新しい連合最適化アルゴリズムの収束と敵対的耐性の証明を提供します。

HiClass: a Python Library for Local Hierarchical Classification Compatible with Scikit-learn
HiClass: scikit-learn と互換性のあるローカル階層分類のための Python ライブラリ

HiClass is an open-source Python library for local hierarchical classification entirely compatible with scikit-learn. It contains implementations of the most common design patterns for hierarchical machine learning models found in the literature, that is, the local classifiers per node, per parent node and per level. Additionally, the package contains implementations of hierarchical metrics, which are more appropriate for evaluating classification performance on hierarchical data. The documentation includes installation and usage instructions, examples within tutorials and interactive notebooks, and a complete description of the API. HiClass is released under the simplified BSD license, encouraging its use in both academic and commercial environments. Source code and documentation are available at https://github.com/scikit-learn-contrib/hiclass.

HiClassは、scikit-learnと完全に互換性のあるローカル階層分類用のオープンソースのPythonライブラリです。これには、文献に記載されている階層型機械学習モデルの最も一般的な設計パターン(ノードごと、親ノードごと、レベルごとのローカル分類子)の実装が含まれています。さらに、このパッケージには、階層データの分類パフォーマンスを評価するのにより適した階層メトリックの実装が含まれています。ドキュメントには、インストールと使用手順、チュートリアルやインタラクティブノートブック内の例、APIの詳細な説明が含まれています。HiClassは、簡略化されたBSDライセンスの下でリリースされており、学術環境と商用環境の両方での使用を奨励しています。ソースコードとドキュメントは、https://github.com/scikit-learn-contrib/hiclassで入手できます。
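To illustrate why hierarchical metrics differ from flat ones, the following standalone sketch computes a common set-based formulation of hierarchical precision, recall, and F1, in which each label is replaced by the set of nodes on its root-to-leaf path. It deliberately does not call the HiClass API. The function name `hierarchical_prf` and the toy taxonomy are assumptions, and the sketch assumes node names are unique across the hierarchy.

```python
def hierarchical_prf(true_paths, pred_paths):
    """Set-based hierarchical precision, recall and F1.

    Each label is given as its path from the root to the predicted node,
    so a prediction that is wrong only at the leaf still earns partial
    credit for the ancestors it got right.
    """
    overlap = pred_total = true_total = 0
    for t, p in zip(true_paths, pred_paths):
        t_set, p_set = set(t), set(p)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [["Animal", "Mammal", "Cat"], ["Animal", "Bird", "Eagle"]]
y_pred = [["Animal", "Mammal", "Dog"], ["Animal", "Bird", "Eagle"]]

hp, hr, hf = hierarchical_prf(y_true, y_pred)
print(hp, hr, hf)
```

A flat accuracy would score the first prediction as entirely wrong, whereas the hierarchical scores credit the correct `Animal` and `Mammal` ancestors.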

Impact of classification difficulty on the weight matrices spectra in Deep Learning and application to early-stopping
深層学習における重み行列スペクトルに対する分類難易度の影響と早期停止への応用

Much recent research effort has been devoted to explaining the success of deep learning. Random Matrix Theory (RMT) provides an emerging way to this end by analyzing the spectra of large random matrices involved in a trained deep neural network (DNN), such as weight matrices or Hessian matrices in the stochastic gradient descent algorithm. To better understand the spectra of weight matrices, we conduct extensive experiments on weight matrices under different settings for layers, networks and data sets. Based on the previous work of Martin and Mahoney (2018), spectra of weight matrices at the terminal stage of training are classified into three main types: Light Tail (LT), Bulk Transition period (BT) and Heavy Tail (HT). These different types, especially HT, implicitly indicate some regularization in the DNNs. In this paper, inspired by Martin and Mahoney (2018), we identify the difficulty of the classification problem as an important factor for the appearance of HT in weight matrices spectra. The higher the classification difficulty, the higher the chance for HT to appear. Moreover, the classification difficulty can be affected either by the signal-to-noise ratio of the dataset or by the complexity of the classification problem (complex features, large number of classes). Leveraging this finding, we further propose a spectral criterion to detect the appearance of HT and use it to stop the training process early without testing data. Such early-stopped DNNs have the merit of avoiding overfitting and unnecessary extra training while preserving a comparable generalization ability. These findings are validated in several NNs (LeNet, MiniAlexNet and VGG), using Gaussian synthetic data and real data sets (MNIST and CIFAR10).

ディープラーニングの成功を説明するために、最近多くの研究努力が払われています。ランダム行列理論(RMT)は、重み行列や確率的勾配降下法アルゴリズムのヘッセ行列など、トレーニングされたディープニューラルネットワーク(DNN)に含まれる大きなランダム行列のスペクトルを分析することで、この目的を達成するための新しい方法を提供します。重み行列のスペクトルをよりよく理解するために、レイヤー、ネットワーク、およびデータセットのさまざまな設定で重み行列に関する広範な実験を行います。Martin and Mahoney (2018)の以前の研究に基づいて、トレーニングの最終段階での重み行列のスペクトルは、ライトテール(LT)、バルク遷移期間(BT)、ヘビーテール(HT)の3つの主要なタイプに分類されます。これらの異なるタイプ、特にHTは、DNNの何らかの正則化を暗黙的に示しています。この論文では、Martin and Mahoney (2018)に触発され、分類問題の難しさが重み行列スペクトルにHTが出現する重要な要因であることを確認します。分類の難易度が高いほど、HTが出現する可能性が高くなります。さらに、分類の難易度は、データセットの信号対雑音比、または分類問題の複雑さ(複雑な特徴、多数のクラス)によっても影響を受ける可能性があります。この発見を活用して、HTの出現を検出し、テストデータを使わずにトレーニングプロセスを早期に停止するためのスペクトル基準をさらに提案します。このように早期に停止するDNNには、過剰適合や不要な追加トレーニングを回避しながら、同等の一般化能力を維持できるという利点があります。この論文のこれらの発見は、ガウス合成データと実際のデータセット(MNISTおよびCIFAR10)を使用して、いくつかのNN (LeNet、MiniAlexNet、VGG)で検証されています。

The SKIM-FA Kernel: High-Dimensional Variable Selection and Nonlinear Interaction Discovery in Linear Time
SKIM-FAカーネル:高次元変数選択と線形時間における非線形相互作用の発見

Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can lead to poor estimation and variable selection. Unfortunately, methods that simultaneously express sparsity, nonlinearity, and interactions are computationally intractable — with runtime at least quadratic in the number of covariates, and often worse. In the present work, we solve this computational bottleneck. We show that suitable interaction models have a kernel representation, namely there exists a “kernel trick” to perform variable selection and estimation in $O$(# covariates) time. Our resulting fit corresponds to a sparse orthogonal decomposition of the regression function in a Hilbert space (i.e., a functional ANOVA decomposition), where interaction effects represent all variation that cannot be explained by lower-order effects. On a variety of synthetic and real data sets, our approach outperforms existing methods used for large, high-dimensional data sets while remaining competitive (or being orders of magnitude faster) in runtime.

多くの科学的問題では、ターゲット応答に関連する共変量の小さなセットを特定し、その効果を推定する必要があります。多くの場合、これらの効果は非線形で相互作用を含むため、線形および加法的な方法では、推定と変数選択が不十分になる可能性があります。残念ながら、スパース性、非線形性、および相互作用を同時に表現する方法は、計算上扱いにくく、実行時間は共変量の数の少なくとも2乗、多くの場合はそれよりも悪くなります。この研究では、この計算上のボトルネックを解決します。適切な相互作用モデルにはカーネル表現があること、つまり、変数選択と推定を$O$(#共変量)時間で実行する「カーネルトリック」が存在することを示します。結果として得られる適合は、ヒルベルト空間での回帰関数のスパース直交分解(つまり、機能的ANOVA分解)に対応し、相互作用効果は、低次の効果では説明できないすべての変動を表します。さまざまな合成データセットと実際のデータセットにおいて、当社のアプローチは、実行時に競争力を維持しながら(または桁違いに高速に)、大規模で高次元のデータセットに使用される既存の方法よりも優れたパフォーマンスを発揮します。

Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels
加法性ノイズチャネルの特性を使用したノイズの多い反復アルゴリズムの一般化限界

Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.

異なるデータ分布で異なる最適化アルゴリズムによってトレーニングされた機械学習モデルは、異なる汎化動作を示すことができます。この論文では、ノイズの多い反復アルゴリズムによって訓練されたモデルの一般化を分析します。私たちは、ノイズの多い反復アルゴリズムを通信理論や情報理論に見られる加法的なノイズチャネルに接続することにより、分布依存の一般化境界を導き出します。私たちの一般化限界は、微分プライベート確率的勾配降下法(DP-SGD)、連合学習、確率的勾配ランジュバンダイナミクス(SGLD)など、いくつかのアプリケーションに光を当てます。私たちは、数値実験を通じてその限界を実証し、ニューラルネットワークの一般化現象に関する最近の経験的観測を理解するのに役立つことを示しています。
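The phrase "noisy iterative algorithm" refers to updates of the form gradient step plus additive noise, which is what links them to additive noise channels. A minimal sketch of such an update follows, using the standard SGLD noise scale $\sqrt{2\eta/\beta}$ for step size $\eta$ and inverse temperature $\beta$. The function names, the quadratic objective, and all numeric values are illustrative assumptions. The sketch does not compute any of the paper's bounds.

```python
import math
import random


def noisy_gradient_step(w, grad, step, noise_std, rng):
    """One update w <- w - step * grad + noise_std * N(0, 1) per coordinate."""
    return [wi - step * gi + noise_std * rng.gauss(0.0, 1.0)
            for wi, gi in zip(w, grad)]


def sgld_noise_std(step, inverse_temperature):
    """Standard SGLD noise scale sqrt(2 * step / beta)."""
    return math.sqrt(2.0 * step / inverse_temperature)


rng = random.Random(0)            # fixed seed so the run is reproducible
w = [1.0, -2.0]
step, beta = 0.1, 1e4             # hypothetical hyperparameters
for _ in range(100):
    grad = list(w)                # gradient of the toy loss 0.5 * ||w||^2
    w = noisy_gradient_step(w, grad, step, sgld_noise_std(step, beta), rng)
print(w)                          # stays close to the minimizer at the origin
```

Setting `noise_std` to zero recovers plain gradient descent, so the noise term is the only difference between the two procedures.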

Discrete Variational Calculus for Accelerated Optimization
加速最適化のための離散変分法

Many of the new developments in machine learning are connected with gradient-based optimization methods. Recently, these methods have been studied using a variational perspective (Betancourt et al., 2018). This has opened up the possibility of introducing variational and symplectic methods using geometric integration. In particular, in this paper, we introduce variational integrators (Marsden and West, 2001) which allow us to derive different methods for optimization. Using both Hamilton’s and Lagrange-d’Alembert’s principle, we derive two families of optimization methods in one-to-one correspondence that generalize Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient (Nesterov, 1983), the second of which mimics the behavior of the latter, reducing the oscillations of classical momentum methods. However, since the systems considered are explicitly time-dependent, the preservation of symplecticity of autonomous systems occurs here solely on the fibers. Several experiments exemplify the result.

機械学習における新しい開発の多くは、勾配ベースの最適化手法に関連しています。最近、これらの手法は変分の観点から研究されています(Betancourtら, 2018)。これにより、幾何学的積分を使用して変分法とシンプレクティック法を導入する可能性が開かれました。特に、この論文では、最適化のためのさまざまな手法を導出できる変分積分器(Marsden and West, 2001)を紹介します。ハミルトン原理とラグランジュ・ダランベール原理の両方を使用して、Polyakのヘビーボール法(Polyak, 1964)とNesterovの加速勾配法(Nesterov, 1983)を一般化する、1対1に対応する2つの最適化手法ファミリーを導出します。このうち2番目のファミリーはNesterov法の挙動を模倣し、古典的なモメンタム法の振動を減らします。ただし、検討対象のシステムは明示的に時間依存であるため、自律システムのシンプレクティック性の保存は、ここではファイバー上でのみ発生します。いくつかの実験がその結果を例証しています。
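For reference, the two classical methods that the paper generalizes can be written in a few lines. The sketch below implements Polyak's heavy ball and a constant-momentum form of Nesterov's accelerated gradient on a one-dimensional quadratic. The objective, step size, and momentum value are assumptions chosen for illustration, and the code does not implement the variational integrators derived in the paper.

```python
def heavy_ball(grad, x0, step, momentum, iters):
    """Polyak's heavy ball: gradient evaluated at the current iterate."""
    x_prev, x = x0, x0
    for _ in range(iters):
        x_next = x - step * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x


def nesterov(grad, x0, step, momentum, iters):
    """Nesterov's method: gradient evaluated at the extrapolated point."""
    x_prev, x = x0, x0
    for _ in range(iters):
        y = x + momentum * (x - x_prev)      # look-ahead point
        x_next = y - step * grad(y)
        x_prev, x = x, x_next
    return x


def grad(x):
    """Gradient of the toy objective f(x) = 5 * x**2."""
    return 10.0 * x


hb = heavy_ball(grad, x0=1.0, step=0.05, momentum=0.8, iters=200)
nag = nesterov(grad, x0=1.0, step=0.05, momentum=0.8, iters=200)
print(hb, nag)   # both approach the minimizer x = 0
```

The only difference between the two loops is where the gradient is evaluated. Evaluating it at the look-ahead point is what damps the oscillations of the classical momentum update.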

Calibrated Multiple-Output Quantile Regression with Representation Learning
表現学習による較正された多出力分位点回帰

We develop a method to generate predictive regions that cover a multivariate response variable with a user-specified probability. Our work is composed of two components. First, we use a deep generative model to learn a representation of the response that has a unimodal distribution. Existing multiple-output quantile regression approaches are effective in such cases, so we apply them on the learned representation, and then transform the solution to the original space of the response. This process results in a flexible and informative region that can have an arbitrary shape, a property that existing methods lack. Second, we propose an extension of conformal prediction to the multivariate response setting that modifies any method to return sets with a pre-specified coverage level. The desired coverage is theoretically guaranteed in the finite-sample case for any distribution. Experiments conducted on both real and synthetic data show that our method constructs regions that are significantly smaller compared to existing techniques.

私たちは、ユーザーが指定した確率で多変量応答変数をカバーする予測領域を生成する方法を開発しました。この作業は2つの要素で構成されています。まず、深層生成モデルを使用して、単峰性分布を持つ応答の表現を学習します。既存の多重出力分位回帰アプローチはこのような場合に有効であるため、学習した表現にそれを適用し、ソリューションを応答の元の空間に変換します。このプロセスにより、既存の方法にはない特性である、任意の形状を持つことができる柔軟で有益な領域が得られます。次に、多変量応答設定への適合予測の拡張を提案します。これにより、任意の方法が変更され、事前に指定されたカバレッジレベルのセットが返されます。任意の分布の有限サンプルの場合、目的のカバレッジが理論的に保証されます。実際のデータと合成データの両方で実施した実験では、既存の手法と比較して大幅に小さい領域が構築されることが示されています。
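The finite-sample coverage guarantee mentioned above comes from the way conformal prediction calibrates a threshold on held-out scores. The sketch below shows standard split conformal calibration for scalar nonconformity scores. It is not the multivariate extension proposed in the paper, and the function name and calibration scores are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    Returns the k-th smallest calibration score with
    k = ceil((n + 1) * (1 - alpha)); if k exceeds n, no finite threshold
    guarantees coverage, so infinity is returned.
    """
    n = len(cal_scores)
    # The small offset guards against floating-point round-up of exact integers.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


# Hypothetical nonconformity scores from a held-out calibration set.
cal_scores = [float(i) for i in range(1, 20)]
threshold = conformal_threshold(cal_scores, alpha=0.1)
print(threshold)
```

A prediction region is then the set of candidate responses whose score does not exceed `threshold`. Under exchangeability of calibration and test points, this region covers the true response with probability at least `1 - alpha`.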

Bayesian Data Selection
ベイズ的データ選択

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the “data selection” problem: finding a lower-dimensional statistic – such as a subset of variables – that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining “background” components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the “Stein volume criterion (SVC)”, that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection, and establish consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation.

複雑で高次元のデータに関する洞察は、関心のあるモデルに一致する、または一致しないデータの特徴を発見することで得られます。このタスクを形式化するために、「データ選択」問題を導入します。これは、関心のある特定のパラメトリックモデルによく適合する低次元の統計(変数のサブセットなど)を見つけることです。データ選択に対する完全なベイジアンアプローチは、統計の値をパラメトリックにモデル化し、データの残りの「背景」コンポーネントをノンパラメトリックにモデル化し、統計の選択に対して標準的なベイジアンモデル選択を実行することです。ただし、ノンパラメトリックモデルを高次元データに適合させることは、統計的にも計算的にも非常に非効率的になる傾向があります。私たちは、ノンパラメトリックモデルの適合を必要としない、データ選択を実行するための新しいスコア、「Steinボリューム基準(SVC)」を提案します。SVCは、Kullback-Leiblerダイバージェンスの代わりにカーネル化されたStein不一致を伴う一般化周辺尤度の形式をとります。SVCがデータ選択に対して一貫していることを証明し、パラメータの対応する一般化事後分布の一貫性と漸近正規性を確立します。確率的主成分分析と遺伝子制御のスピングラスモデルを使用して、SVCを単一細胞RNAシーケンスデータセットの分析に適用します。

Lower Bounds and Accelerated Algorithms for Bilevel Optimization
バイレベル最適化のための下限と高速化アルゴリズム

Bilevel optimization has recently attracted growing interest due to its wide applications in modern machine learning problems. Although recent studies have characterized the convergence rate for several such popular algorithms, it is still unclear how much further these convergence rates can be improved. In this paper, we address this fundamental question from two perspectives. First, we provide the first-known lower complexity bounds of $\widetilde \Omega\bigg(\sqrt{\frac{L_y\widetilde L_{xy}^2}{\mu_x\mu_y^2}}\bigg)$ and $\widetilde \Omega\big(\frac{1}{\sqrt{\epsilon}}\min\{\kappa_y,\frac{1}{\sqrt{\epsilon^{3}}}\}\big)$ respectively for strongly-convex-strongly-convex and convex-strongly-convex bilevel optimizations. Second, we propose an accelerated bilevel optimizer named AccBiO, for which we provide the first-known complexity bounds without the gradient boundedness assumption (which was made in existing analyses) under the two aforementioned geometries. We also provide significantly tighter upper bounds than the existing complexity when the bounded gradient assumption does hold. We show that AccBiO achieves the optimal results (i.e., the upper and lower bounds match up to logarithmic factors) when the inner-level problem takes a quadratic form with a constant-level condition number. Interestingly, our lower bounds under both geometries are larger than the corresponding optimal complexities of minimax optimization, establishing that bilevel optimization is provably more challenging than minimax optimization. Our theoretical results are validated by numerical experiments.

2レベル最適化は、最近の機械学習の問題に幅広く応用されているため、最近ますます注目を集めています。最近の研究では、いくつかのこのような一般的なアルゴリズムの収束率を特徴付けていますが、これらの収束率をどれだけ改善できるかはまだ明らかではありません。この論文では、この基本的な問題に2つの観点から取り組みます。まず、強凸-強凸と凸-強凸2レベル最適化について、それぞれ$\widetilde \Omega\bigg(\sqrt{\frac{L_y\widetilde L_{xy}^2}{\mu_x\mu_y^2}}\bigg)$と$\widetilde \Omega\big(\frac{1}{\sqrt{\epsilon}}\min\{\kappa_y,\frac{1}{\sqrt{\epsilon^{3}}}\}\big)$という初めて知られる複雑さの下限を示します。次に、AccBiOという名の高速化された2レベル最適化装置を提案します。この装置では、前述の2つのジオメトリの下で、既存の分析で行われた勾配有界性仮定なしに、初めて知られている複雑さの境界を提供します。また、有界勾配仮定が成立する場合、既存の複雑さよりも大幅に厳しい上限も提供します。内部レベルの問題が一定レベルの条件数を持つ2次形式を取る場合、AccBiOが最適な結果(つまり、上限と下限が対数係数まで一致する)を達成することを示します。興味深いことに、両方のジオメトリの下での下限は、ミニマックス最適化の対応する最適な複雑さよりも大きく、2レベル最適化はミニマックス最適化よりも困難であることが証明されています。理論的結果は数値実験によって検証されます。

Graph-Aided Online Multi-Kernel Learning
グラフ支援オンラインマルチカーネル学習

Multi-kernel learning (MKL) has been widely used in learning problems involving function learning tasks. Compared with the single-kernel learning approach, which relies on a pre-selected kernel, the advantage of MKL is the flexibility that results from combining a dictionary of kernels. However, the inclusion of irrelevant kernels in the dictionary may degrade the accuracy of MKL and increase the computational complexity. Faced with this challenge, a novel graph-aided framework is developed to select a subset of kernels from the dictionary with the assistance of a graph. Different graph construction and refinement schemes are developed based on incurred losses or kernel similarities to assist the adaptive selection process. Moreover, to cope with the scenario where data may be collected in a sequential fashion, or cannot be stored in batch due to the massive scale, random feature approximation is adopted to enable online function learning. It is proved that our proposed algorithms enjoy sub-linear regret bounds. Experiments on a number of real datasets showcase the advantages of our novel graph-aided algorithms compared to state-of-the-art alternatives.

マルチカーネル学習(MKL)は、関数学習タスクを含む学習問題で広く使用されています。事前に選択されたカーネルに依存する単一カーネル学習アプローチと比較して、MKLの利点は、カーネルの辞書を組み合わせることで得られる柔軟性です。ただし、辞書に無関係なカーネルを含めると、MKLの精度が低下し、計算の複雑さが増す可能性があります。この課題に直面して、グラフの支援により辞書からカーネルのサブセットを選択するための新しいグラフ支援フレームワークが開発されました。発生した損失またはカーネルの類似性に基づいて、さまざまなグラフ構築および改良スキームが開発され、適応選択プロセスを支援します。さらに、データが順次収集されるか、大規模であるためバッチで保存できないシナリオに対処するために、ランダムな特徴近似が採用され、オンライン関数学習が可能になります。提案されたアルゴリズムは、線形以下の後悔境界を備えていることが証明されています。いくつかの実際のデータセットでの実験により、最先端の代替手段と比較した新しいグラフ支援アルゴリズムの利点が明らかになりました。

Interpolating Classifiers Make Few Mistakes
補間分類器は間違いをほとんど犯さない

This paper provides elementary analyses of the regret and generalization of minimum-norm interpolating classifiers (MNIC). The MNIC is the function of smallest Reproducing Kernel Hilbert Space norm that perfectly interpolates a label pattern on a finite data set. We derive a mistake bound for MNIC and a regularized variant that holds for all data sets. This bound follows from elementary properties of matrix inverses. Under the assumption that the data is independently and identically distributed, the mistake bound implies that MNIC generalizes at a rate proportional to the norm of the interpolating solution and inversely proportional to the number of data points. This rate matches similar rates derived for margin classifiers and perceptrons. We derive several plausible generative models where the norm of the interpolating classifier is bounded or grows at a rate sublinear in $n$. We also show that as long as the population class conditional distributions are sufficiently separable in total variation, then MNIC generalizes with a fast rate.

この論文では、最小ノルム補間分類器(MNIC)のリグレットと一般化に関する基本的な分析を提供します。MNICは、有限データセット上のラベルパターンを完全に補間する最小の再生カーネルヒルベルト空間ノルムの関数です。MNICのエラー境界と、すべてのデータセットに適用される正規化されたバリアントを導出します。この境界は、逆行列の基本特性に従います。データが独立かつ同一に分布していると仮定すると、エラー境界は、MNICが補間ソリューションのノルムに比例し、データポイントの数に反比例する速度で一般化することを意味します。この速度は、マージン分類器とパーセプトロンに導出された同様の速度と一致します。補間分類器のノルムが制限されるか、nに対して線形以下の速度で増加する、いくつかの妥当な生成モデルを導出します。また、母集団クラス条件付き分布が全体の変動で十分に分離可能である限り、MNICは高速で一般化することも示します。
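For readers who want the object written out, the minimum-norm interpolant has a standard closed form. The block below states it under the assumption that the kernel matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ is invertible. The symbols $k$, $K$, $\alpha$, and $\mathcal{H}$ are notation introduced here for illustration and are not taken from the abstract.

```latex
% Minimum-norm interpolation in an RKHS \mathcal{H} with kernel k,
% given data (x_1, y_1), \dots, (x_n, y_n) and y = (y_1, \dots, y_n)^\top.
\begin{align*}
  \hat f &= \arg\min_{f \in \mathcal{H}} \; \|f\|_{\mathcal{H}}
            \quad \text{subject to } f(x_i) = y_i, \; i = 1, \dots, n, \\
  \hat f(x) &= \sum_{i=1}^{n} \alpha_i \, k(x_i, x),
            \qquad \alpha = K^{-1} y, \\
  \|\hat f\|_{\mathcal{H}}^{2} &= \alpha^\top K \alpha = y^\top K^{-1} y .
\end{align*}
```

The last line shows why properties of matrix inverses are enough to control the norm of the interpolating solution, which is the quantity that enters the generalization rate described in the abstract.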

Regularized Joint Mixture Models
正則化されたジョイント混合モデル

Regularized regression models are well studied and, under appropriate conditions, offer fast and statistically interpretable results. However, large data in many applications are heterogeneous in the sense of harboring distributional differences between latent groups. Then, the assumption that the conditional distribution of response $Y$ given features $X$ is the same for all samples may not hold. Furthermore, in scientific applications, the covariance structure of the features may contain important signals and its learning is also affected by latent group structure. We propose a class of mixture models for paired data $(X,Y)$ that couples together the distribution of $X$ (using sparse graphical models) and the conditional $Y \! \mid \! X$ (using sparse regression models). The regression and graphical models are specific to the latent groups and model parameters are estimated jointly. This allows signals in either or both of the feature distribution and regression model to inform learning of latent structure and provides automatic control of confounding by such structure. Estimation is handled via an expectation-maximization algorithm, whose convergence is established theoretically. We illustrate the key ideas via empirical examples. An R package is available at https://github.com/k-perrakis/regjmix.

正規化回帰モデルは十分に研究されており、適切な条件下では、高速で統計的に解釈可能な結果を提供します。ただし、多くのアプリケーションにおける大規模データは、潜在グループ間の分布の違いを抱えているという意味で異質です。そのため、特徴量$X$が与えられた場合の応答$Y$の条件付き分布がすべてのサンプルで同じであるという仮定は成立しない可能性があります。さらに、科学的アプリケーションでは、特徴量の共分散構造に重要なシグナルが含まれる可能性があり、その学習も潜在グループ構造の影響を受けます。私たちは、分布$X$ (スパースグラフィカルモデルを使用)と条件付き$Y \! \mid \! X$ (スパース回帰モデルを使用)を結合した、ペアデータ$(X,Y)$の混合モデルのクラスを提案します。回帰モデルとグラフィカルモデルは潜在グループに固有であり、モデルパラメーターは共同で推定されます。これにより、特徴量分布と回帰モデルのいずれかまたは両方のシグナルが潜在構造の学習に情報を提供し、そのような構造による交絡を自動的に制御できます。推定は期待最大化アルゴリズムによって処理され、その収束は理論的に確立されています。主要なアイデアを経験的な例で説明します。Rパッケージはhttps://github.com/k-perrakis/regjmixで入手できます。

An Inertial Block Majorization Minimization Framework for Nonsmooth Nonconvex Optimization
非平滑非凸最適化のための慣性ブロック主要化最小化フレームワーク

In this paper, we introduce TITAN, a novel inerTIal block majorizaTion minimizAtioN framework for nonsmooth nonconvex optimization problems. To the best of our knowledge, TITAN is the first framework of block-coordinate update method that relies on the majorization-minimization framework while embedding inertial force to each step of the block updates. The inertial force is obtained via an extrapolation operator that subsumes heavy-ball and Nesterov-type accelerations for block proximal gradient methods as special cases. By choosing various surrogate functions, such as proximal, Lipschitz gradient, Bregman, quadratic, and composite surrogate functions, and by varying the extrapolation operator, TITAN produces a rich set of inertial block-coordinate update methods. We study sub-sequential convergence as well as global convergence for the generated sequence of TITAN. We illustrate the effectiveness of TITAN on two important machine learning problems, namely sparse non-negative matrix factorization and matrix completion.

この論文では、非平滑非凸最適化問題のための新しい慣性ブロック主要化最小化フレームワークであるTITANを紹介します。我々の知る限り、TITANは、主要化最小化フレームワークに基づきつつ、ブロック更新の各ステップに慣性力を組み込んだ最初のブロック座標更新法のフレームワークです。慣性力は、ブロック近接勾配法に対するヘビーボール型およびネステロフ型の加速を特別な場合として包含する外挿演算子によって得られます。近接、リプシッツ勾配、ブレグマン、二次、および複合代理関数などのさまざまな代理関数を選択し、外挿演算子を変更することで、TITANは多様な慣性ブロック座標更新法を生成します。TITANが生成する点列について、部分列収束と大域収束を調べます。スパース非負行列分解と行列補完という2つの重要な機械学習問題において、TITANの有効性を示します。
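
As a rough illustration of the inertial idea in the abstract above, the sketch below runs a single-block proximal gradient iteration with an extrapolation step on a toy one-dimensional problem. It is not TITAN itself: the objective, the step size, the extrapolation weight `beta`, and all function names are assumptions chosen for illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |x| (soft-thresholding)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def inertial_prox_grad(target=3.0, lam=1.0, step=0.5, beta=0.3, iters=100):
    """Minimize 0.5 * (x - target)**2 + lam * |x| (toy problem, assumed here)."""
    x_prev = x = 0.0
    for _ in range(iters):
        # Inertial (extrapolation) step: move along the previous displacement.
        y = x + beta * (x - x_prev)
        # Gradient of the smooth part, evaluated at the extrapolated point.
        grad = y - target
        # Proximal step on the nonsmooth part.
        x_prev, x = x, soft_threshold(y - step * grad, step * lam)
    return x
```

For this toy objective the minimizer is the soft-thresholded target, so `inertial_prox_grad()` should approach 2.0. Setting `beta=0.0` recovers the plain proximal gradient method.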

Learning Mean-Field Games with Discounted and Average Costs
割引コストおよび平均コストを伴う平均場ゲームの学習

We consider learning approximate Nash equilibria for discrete-time mean-field games with stochastic nonlinear state dynamics subject to both average and discounted costs. To this end, we introduce a mean-field equilibrium (MFE) operator, whose fixed point is a mean-field equilibrium, i.e., equilibrium in the infinite population limit. We first prove that this operator is a contraction, and propose a learning algorithm to compute an approximate mean-field equilibrium by approximating the MFE operator with a random one. Moreover, using the contraction property of the MFE operator, we establish the error analysis of the proposed learning algorithm. We then show that the learned mean-field equilibrium constitutes an approximate Nash equilibrium for finite-agent games.

私たちは、平均コストと割引コストの両方のもとで、確率的非線形状態ダイナミクスを持つ離散時間平均場ゲームの近似ナッシュ均衡を学習することを考えます。この目的のために、不動点が平均場均衡、すなわち無限人口極限における均衡となる平均場均衡(MFE)演算子を導入します。まず、この演算子が縮小写像であることを証明し、MFE演算子をランダムな演算子で近似することにより、近似平均場均衡を計算する学習アルゴリズムを提案します。さらに、MFE演算子の縮小性を用いて、提案する学習アルゴリズムの誤差解析を行います。最後に、学習された平均場均衡が有限エージェントゲームの近似ナッシュ均衡を構成することを示します。

Globally-Consistent Rule-Based Summary-Explanations for Machine Learning Models: Application to Credit-Risk Evaluation
機械学習モデルのためのグローバル整合性ルールベースの要約-説明:信用リスク評価への応用

We develop a method for understanding specific predictions made by (global) predictive models by constructing (local) models tailored to each specific observation (these are also called “explanations” in the literature). Unlike existing work that “explains” specific observations by approximating global models in the vicinity of these observations, we fit models that are globally-consistent with predictions made by the global model on past data. We focus on rule-based models (also known as association rules or conjunctions of predicates), which are interpretable and widely used in practice. We design multiple algorithms to extract such rules from discrete and continuous datasets, and study their theoretical properties. Finally, we apply these algorithms to multiple credit-risk models trained on the Explainable Machine Learning Challenge data from FICO and demonstrate that our approach effectively produces sparse summary-explanations of these models in seconds. Our approach is model-agnostic (that is, can be used to explain any predictive model), and solves a minimum set cover problem to construct its summaries.

私たちは、それぞれの特定の観測に合わせた（ローカル）モデルを構築することにより、（グローバル）予測モデルによる特定の予測を理解するための方法を開発しました（これらは文献では「説明」とも呼ばれます）。特定の観測をこれらの観測の近傍でグローバルモデルを近似することによって「説明」する既存の研究とは異なり、我々は過去のデータに対するグローバルモデルによる予測とグローバルに一致するモデルを適合させます。私たちは、解釈可能で実際に広く使用されているルールベースのモデル（関連ルールまたは述語の結合とも呼ばれる）に焦点を当てています。私たちは、離散データセットと連続データセットからそのようなルールを抽出するための複数のアルゴリズムを設計し、それらの理論的特性を研究します。最後に、これらのアルゴリズムを、FICOの説明可能な機械学習チャレンジデータでトレーニングされた複数の信用リスクモデルに適用し、我々のアプローチが数秒でこれらのモデルのスパース要約説明を効果的に生成することを実証します。我々のアプローチはモデルに依存しません（つまり、あらゆる予測モデルを説明するために使用できます）。そして、最小セットカバー問題を解決してその要約を構築します。

Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions
敵対的攻撃を拡張して敵対的クラス確率分布を生成する

Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.

幅広い人工知能タスクにおけるディープラーニングモデルの優れたパフォーマンスと一般化レベルにもかかわらず、これらのモデルは、自然な入力に感知できないが悪意のある摂動を加えることで簡単に騙されることが実証されています。これらの変更された入力は、文献では敵対的サンプルとして知られています。この論文では、多数の入力に攻撃方法を適用したときにクラスに望ましい確率分布を生成するために、敵対的攻撃を一般化および拡張するための新しい確率フレームワークを提案します。この新しい攻撃パラダイムにより、敵対者はターゲットモデルをより細かく制御できるため、従来のパラダイムでは実行できないディープラーニングモデルに対する脅威が、さまざまなシナリオで明らかになります。このような攻撃を効率的に生成するための4つの異なる戦略を紹介し、複数の敵対的攻撃アルゴリズムを拡張することでアプローチを説明します。また、音声コマンド分類タスクとツイート感情分類タスクという、それぞれオーディオとテキストの領域における2つの典型的な機械学習問題に対して、アプローチを実験的に検証します。私たちの結果は、高い欺瞞率を維持しながらクラスのあらゆる確率分布を近似でき、ラベルシフト検出方法によって攻撃が検出されるのを防ぐこともできることを示しています。

Python package for causal discovery based on LiNGAM
LiNGAMに基づく因果関係の発見のためのPythonパッケージ

Causal discovery is a methodology for learning causal graphs from data, and LiNGAM is a well-known model for causal discovery. This paper describes an open-source Python package for causal discovery based on LiNGAM. The package implements various LiNGAM methods under different settings like time series cases, multiple-group cases, mixed data cases, and hidden common cause cases, in addition to evaluation of statistical reliability and model assumptions. The source code is freely available under the MIT license at https://github.com/cdt15/lingam.

因果探索は、データから因果グラフを学習する方法論であり、LiNGAMは因果探索のよく知られたモデルです。この論文では、LiNGAMに基づく因果関係の発見のためのオープンソースのPythonパッケージについて説明します。このパッケージは、統計的信頼性とモデルの仮定の評価に加えて、時系列ケース、複数グループケース、混合データケース、隠れた共通原因ケースなど、さまざまな設定でさまざまなLiNGAMメソッドを実装します。ソースコードは、https://github.com/cdt15/lingamのMITライセンスの下で無料で入手できます。

Adaptation to the Range in K-Armed Bandits
K本腕バンディットにおける範囲への適応

We consider stochastic bandit problems with $K$ arms, each associated with a distribution supported on a given finite range $[m,M]$. We do not assume that the range $[m,M]$ is known and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents simultaneously achieving the typical $\ln T$ and $\sqrt{T}$ bounds. For instance, a $\sqrt{T}$ distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order $\sqrt{T}$. We exhibit a strategy achieving the rates for regret imposed by the new trade-off.

私たちは、$K$本の腕を持つ確率的バンディット問題を考えます。各腕は、与えられた有限範囲$[m,M]$に台を持つ分布に関連付けられています。範囲$[m,M]$が既知であるとは仮定せず、この範囲の学習にはコストがかかることを示します。実際、分布依存のリグレット上界と分布非依存のリグレット上界の間に新たなトレードオフが生じ、典型的な$\ln T$と$\sqrt{T}$の上界を同時に達成することはできなくなります。たとえば、$\sqrt{T}$の分布非依存リグレット上界は、分布依存リグレット上界が少なくとも$\sqrt{T}$のオーダーである場合にのみ達成できます。私たちは、この新たなトレードオフによって課されるリグレットのレートを達成する戦略を示します。

Learning-augmented count-min sketches via Bayesian nonparametrics
ベイズ非パラメトリックによる学習強化カウントミニマムスケッチ

The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens’ frequencies in a data stream of tokens, i.e. point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (NeurIPS 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being obtained as mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of CMS, it has the major drawback of arising from a “constructive” proof that builds upon arguments that are tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a “Bayesian” proof of the CMS-DP that has the main advantage of building upon arguments that are usable under the popular Pitman-Yor process (PYP) prior, which generalizes the DP prior by allowing for a more flexible tail behaviour, ranging from geometric tails to heavy power-law tails. This result leads to the development of a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a PYP prior. Under this more general framework, we apply the arguments of the “Bayesian” proof of the CMS-DP, suitably adapted to the PYP prior, in order to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and it is competitive with respect to a variation of the CMS designed to deal with the estimation of low-frequency tokens. An extension of our BNP approach to more general queries, such as range queries, is also discussed.

カウント・ミニマム・スケッチ(CMS)は、時間とメモリ効率に優れたランダム化データ構造で、ランダムハッシュデータに基づいて、トークンのデータストリーム(つまりポイントクエリ)内のトークンの頻度の推定値を提供します。CMSの学習強化バージョンであるCMS-DPは、Cai、Mitzenmacher、Adams (NeurIPS 2018)によって提案されており、ディリクレ過程(DP)事前分布を介したトークンのデータストリームのベイジアンノンパラメトリック(BNP)モデリングに依存しており、ポイントクエリの推定値は、ハッシュデータが与えられた場合のポイントクエリの事後分布の平均関数として取得されます。CMS-DPはCMSのいくつかの側面を改善することが証明されていますが、DP事前分布に合わせた議論、つまり他のノンパラメトリック事前分布には使用できない議論に基づく「構成的」証明から生じるという大きな欠点があります。この論文では、一般的なPitman-Yorプロセス(PYP)事前分布の下で使用できる議論に基づいて構築されるという主な利点を持つCMS-DPの「ベイジアン」証明を示します。PYP事前分布は、幾何学的テールから重いべき乗則テールまで、より柔軟なテール動作を可能にすることでDP事前分布を一般化します。この結果から、PYP事前分布を介したトークンのデータストリームのBNPモデリングに依存する、CMS-PYPと呼ばれるべき乗則データストリームの下での新しい学習強化CMSの開発につながります。このより一般的なフレームワークでは、PYP事前分布に適切に適応されたCMS-DPの「ベイジアン」証明の議論を適用し、ハッシュデータが与えられた場合のポイントクエリの事後分布を計算します。合成データと実際のテキストデータへの適用により、CMS-PYPは、テキストデータで非常に重要であることが知られている低頻度トークンの推定においてCMSおよびCMS-DPよりも優れており、低頻度トークンの推定を処理するように設計されたCMSのバリエーションに対しても競争力があることがわかります。範囲クエリなどのより一般的なクエリへのBNPアプローチの拡張についても説明します。
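
For readers unfamiliar with the baseline data structure, here is a minimal sketch of a classical count-min sketch (not the learning-augmented CMS-DP or CMS-PYP variants). The class name, the SHA-256 based row hashing, and the table sizes are assumptions made for this illustration.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch: `depth` hash rows of `width` counters each."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, token):
        # Derive one hash function per row by salting the token with the row id.
        digest = hashlib.sha256(f"{row}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, token, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, token)] += count

    def query(self, token):
        # Point query: collisions only inflate counters, so the minimum over
        # rows is an upper bound on the true frequency.
        return min(self.table[row][self._index(row, token)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for tok in ["a"] * 5 + ["b"] * 3 + ["c"]:
    sketch.add(tok)
```

The estimate never falls below the true count, which is why overestimation of low-frequency tokens is the failure mode the learning-augmented variants aim to reduce.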

Optimal Strategies for Reject Option Classifiers
リジェクトオプション分類子の最適な戦略

In classification with a reject option, the classifier is allowed in uncertain cases to abstain from prediction. The classical cost-based model of a reject option classifier requires the rejection cost to be defined explicitly. The alternative bounded-improvement model and the bounded-abstention model avoid the notion of the reject cost. The bounded-improvement model seeks a classifier with a guaranteed selective risk and maximal cover. The bounded-abstention model seeks a classifier with guaranteed cover and minimal selective risk. We prove that despite their different formulations the three rejection models lead to the same prediction strategy: the Bayes classifier endowed with a randomized Bayes selection function. We define the notion of a proper uncertainty score as a scalar summary of the prediction uncertainty sufficient to construct the randomized Bayes selection function. We propose two algorithms to learn the proper uncertainty score from examples for an arbitrary black-box classifier. We prove that both algorithms provide Fisher consistent estimates of the proper uncertainty score and demonstrate their efficiency in different prediction problems, including classification, ordinal regression, and structured output classification.

拒否オプションによる分類では、不確実なケースでは分類器が予測を控えることが許されます。拒否オプション分類器の従来のコストベースモデルでは、拒否コストを明示的に定義する必要があります。代替の制限付き改善モデルと制限付き棄却モデルでは、拒否コストの概念は使用されません。制限付き改善モデルでは、保証された選択リスクと最大のカバーを持つ分類器を求めます。制限付き棄却モデルでは、保証されたカバーと最小の選択リスクを持つ分類器を求めます。異なる定式化にもかかわらず、3つの拒否モデルが同じ予測戦略、つまりランダム化されたベイズ選択関数を備えたベイズ分類器につながることを証明します。適切な不確実性スコアの概念を、ランダム化されたベイズ選択関数を構築するのに十分な予測不確実性のスカラー要約として定義します。任意のブラックボックス分類器の例から適切な不確実性スコアを学習する2つのアルゴリズムを提案します。両方のアルゴリズムが適切な不確実性スコアのフィッシャー整合推定値を提供することを証明し、分類、順序回帰、構造化出力分類などのさまざまな予測問題における効率性を実証します。
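
The classical cost-based model mentioned above can be sketched as follows, assuming 0/1 loss and known class posteriors: the classifier predicts the most probable class unless the expected error exceeds the rejection cost. The function name and the use of `None` for abstention are assumptions of this sketch, and the randomized selection function discussed in the paper is not reproduced here.

```python
def predict_with_reject(posteriors, reject_cost):
    """Return the index of the most probable class, or None to abstain.

    Under 0/1 loss, predicting the top class has conditional risk
    1 - max(posteriors); abstaining costs `reject_cost`.
    """
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if 1.0 - posteriors[best] > reject_cost:
        return None  # abstaining is cheaper than the expected error
    return best
```

For example, with `reject_cost=0.2` a confident input with posteriors `[0.9, 0.1]` is classified, while an ambiguous input with posteriors `[0.55, 0.45]` is rejected.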

A Line-Search Descent Algorithm for Strict Saddle Functions with Complexity Guarantees
計算量保証付きの狭義鞍点関数のための直線探索降下アルゴリズム

We describe a line-search algorithm which achieves the best-known worst-case complexity results for problems with a certain “strict saddle” property that has been observed to hold in low-rank matrix optimization problems. Our algorithm is adaptive, in the sense that it makes use of backtracking line searches and does not require prior knowledge of the parameters that define the strict saddle property.

私たちは、低ランク行列最適化問題で成り立つことが観察されている特定の「狭義鞍点(strict saddle)」性を持つ問題に対して、既知の最良の最悪計算量を達成する直線探索アルゴリズムについて説明します。私たちのアルゴリズムは、バックトラッキング直線探索を利用し、狭義鞍点性を定義するパラメータの事前知識を必要としないという意味で適応的です。

Sampling random graph homomorphisms and applications to network data analysis
ランダムグラフ準同型のサンプリングとネットワークデータ解析への応用

A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also demonstrate the performance of our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels.

グラフ準同型とは、隣接関係を保存する2つのグラフ間の写像です。あるグラフから大規模ネットワークへのランダムなグラフ準同型をサンプリングする問題について考えます。ランダムグラフ準同型をサンプリングするための2つの補完的なMCMCアルゴリズムを提案し、それらの混合時間と時間平均の集中度に関する上界を確立します。このサンプリングアルゴリズムに基づいて、独立サンプリングと近傍サンプリングに基づく方法の欠点の一部を回避する、ネットワークデータ解析の新しいフレームワークを提案します。MCMC軌跡のさまざまな時間平均から、準同型密度や平均クラスタリング係数などのよく知られたものやそれらの一般化を含む、さまざまな計算可能な観測量が得られます。さらに、これらのネットワーク観測量は、ネットワーク間の適切に再正規化されたカット距離に関して安定していることを示します。合成ネットワークを通じてフレームワークを示すさまざまな例とシミュレーションを示します。また、Facebook100データセットと古典小説セットの単語隣接ネットワークにおけるネットワーククラスタリングとサブグラフ分類のタスクで、フレームワークのパフォーマンスを実証します。

A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs
単調包含問題を解くための緩和慣性前進・後退・前進アルゴリズムとGANへの応用

We introduce a relaxed inertial forward-backward-forward (RIFBF) splitting algorithm for approaching the set of zeros of the sum of a maximally monotone operator and a single-valued monotone and Lipschitz continuous operator. This work aims to extend Tseng’s forward-backward-forward method by both using inertial effects as well as relaxation parameters. We formulate first a second order dynamical system that approaches the solution set of the monotone inclusion problem to be solved and provide an asymptotic analysis for its trajectories. We provide for RIFBF, which follows by explicit time discretization, a convergence analysis in the general monotone case as well as when applied to the solving of pseudo-monotone variational inequalities. We illustrate the proposed method by applications to a bilinear saddle point problem, in the context of which we also emphasize the interplay between the inertial and the relaxation parameters, and to the training of Generative Adversarial Networks (GANs).

私たちは、最大単調演算子と単一値単調およびリプシッツ連続演算子の和のゼロ集合に近づくための緩和された慣性前方後方前方(RIFBF)分割アルゴリズムを導入します。本研究は、慣性効果と緩和パラメータの両方を使用して、Tsengの前方後方前方法を拡張することを目的とします。まず、解決すべき単調包含問題の解集合に近づく2次動的システムを定式化し、その軌跡の漸近解析を提供します。明示的な時間離散化に従うRIFBFに対して、一般的な単調ケースおよび擬似単調変分不等式の解決に適用された場合の収束解析を提供します。提案手法を、慣性パラメータと緩和パラメータの相互作用を強調する文脈での双線形鞍点問題への適用、および生成的敵対ネットワーク(GAN)のトレーニングへの適用によって説明します。
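
To make the forward-backward-forward idea concrete, the sketch below applies the basic Tseng iteration, without the inertial and relaxation terms of RIFBF, to the bilinear saddle point problem min_x max_y x*y. It assumes the set-valued operator is zero, so the backward (resolvent) step is the identity; the step size and function names are also assumptions of this illustration.

```python
def saddle_operator(z):
    """Monotone operator F(x, y) = (y, -x) of min_x max_y x*y (Lipschitz constant 1)."""
    x, y = z
    return (y, -x)


def tseng_fbf(z0, step=0.5, iters=200):
    """Plain forward-backward-forward iteration; requires step < 1 / Lipschitz constant."""
    x, y = z0
    for _ in range(iters):
        fx, fy = saddle_operator((x, y))
        # Forward step. The backward step is the identity because the
        # set-valued operator is assumed to be zero in this toy problem.
        xb, yb = x - step * fx, y - step * fy
        gx, gy = saddle_operator((xb, yb))
        # Second forward (correction) step.
        x, y = xb + step * (fx - gx), yb + step * (fy - gy)
    return x, y
```

Starting from `(1.0, 1.0)`, the iterates spiral into the unique saddle point `(0, 0)`, whereas a plain forward (gradient descent-ascent) step with the same step size would diverge on this problem.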

On Distance and Kernel Measures of Conditional Dependence
条件付き依存性の距離とカーネル測度について

Measuring conditional dependence is one of the important tasks in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and others. In this work, we explore the connection between conditional dependence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel pairs, we show the distance-based conditional dependence measures to be equivalent to that of kernel-based measures. On the other hand, we also show that some popular kernel conditional dependence measures based on the Hilbert-Schmidt norm of a certain cross-conditional covariance operator, do not have a simple distance representation, except in some limiting cases.

条件付き依存性の測定は、統計的推論における重要なタスクの1つであり、因果探索、特徴選択、次元削減、ベイジアンネットワーク学習などの基礎となります。この研究では、距離空間上の距離によって誘導される条件付き依存性尺度と、再生カーネルヒルベルト空間(RKHS)に関連付けられた再生カーネルとの間の関係を探ります。特定の距離とカーネルのペアについて、距離ベースの条件付き依存性尺度がカーネルベースの尺度と同等であることを示します。一方、特定の交差条件付き共分散演算子のヒルベルト・シュミットノルムに基づく一般的なカーネル条件付き依存性尺度の一部は、いくつかの極限的な場合を除いて、単純な距離表現を持たないことも示します。

AutoKeras: An AutoML Library for Deep Learning
AutoKeras: ディープラーニング用の AutoML ライブラリ

To use deep learning, one needs to be familiar with various software tools like TensorFlow or Keras, as well as various model architecture and optimization best practices. Despite recent progress in software usability, deep learning remains a highly specialized occupation. To enable people with limited machine learning and programming experience to adopt deep learning, we developed AutoKeras, an Automated Machine Learning (AutoML) library that automates the process of model selection and hyperparameter tuning. AutoKeras encapsulates the complex process of building and training deep neural networks into a very simple and accessible interface, which enables novice users to solve standard machine learning problems with a few lines of code. Designed with practical applications in mind, AutoKeras is built on top of Keras and TensorFlow, and all AutoKeras-created models can be easily exported and deployed with the help of the TensorFlow ecosystem tooling.

ディープラーニングを使用するには、TensorFlowやKerasなどのさまざまなソフトウェアツール、およびさまざまなモデルアーキテクチャと最適化のベストプラクティスに精通している必要があります。近年、ソフトウェアのユーザビリティが進歩しているにもかかわらず、ディープラーニングは依然として高度に専門化された職業です。機械学習やプログラミングの経験が限られている人でもディープラーニングを採用できるように、モデル選択とハイパーパラメータ調整のプロセスを自動化する自動機械学習(AutoML)ライブラリであるAutoKerasを開発しました。AutoKerasは、ディープニューラルネットワークの構築とトレーニングの複雑なプロセスを非常にシンプルでアクセスしやすいインターフェースにカプセル化し、初心者のユーザーが数行のコードで標準的な機械学習の問題を解決できるようにします。実用的なアプリケーションを念頭に置いて設計されたAutoKerasは、KerasとTensorFlowの上に構築されており、AutoKerasで作成されたすべてのモデルは、TensorFlowエコシステムツールの助けを借りて簡単にエクスポートおよびデプロイできます。

Cluster-Specific Predictions with Multi-Task Gaussian Processes
マルチタスクガウスプロセスによるクラスタ固有の予測

A model involving Gaussian processes (GPs) is introduced to simultaneously handle multitask learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors’ estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty in both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances the performance when dealing with group-structured data. The model handles irregular grids of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. The performances on both clustering and prediction tasks are assessed through various simulated scenarios and real data sets. The overall algorithm, called MagmaClust, is publicly available as an R package.

ガウス過程(GP)を含むモデルを導入し、複数の関数データに対するマルチタスク学習、クラスタリング、予測を同時に処理します。この手順は、関数データのモデルベースのクラスタリング手法として、また新しいタスクに対する後続の予測のための学習ステップとして機能します。このモデルは、共通の平均過程を持つマルチタスクGPの混合としてインスタンス化されます。ハイパーパラメータの最適化と、潜在変数および潜在過程のハイパー事後分布の推定を行うために、変分EMアルゴリズムを導出します。平均過程と潜在クラスタリング変数を予測分布内で積分するための明示的な式を確立し、両方の側面の不確実性を考慮します。この分布は、クラスター固有のGP予測の混合として定義され、グループ構造化データを処理する際のパフォーマンスが向上します。このモデルは、不規則な観測グリッドを処理し、タスク間で追加情報を共有するための共分散構造に関するさまざまな仮説を提供します。クラスタリングタスクと予測タスクの両方のパフォーマンスは、さまざまなシミュレートされたシナリオと実際のデータセットを通じて評価されます。MagmaClustと呼ばれる全体的なアルゴリズムは、Rパッケージとして公開されています。

Efficient Structure-preserving Support Tensor Train Machine
効率的な構造保持サポートテンソルトレインマシン

An increasing amount of the collected data are high-dimensional multi-way arrays (tensors), and it is crucial for efficient learning algorithms to exploit this tensorial structure as much as possible. The ever present curse of dimensionality for high dimensional data and the loss of structure when vectorizing the data motivates the use of tailored low-rank tensor classification methods. In the presence of small amounts of training data, kernel methods offer an attractive choice as they provide the possibility for a nonlinear decision boundary. We develop the Tensor Train Multi-way Multi-level Kernel (TT-MMK), which combines the simplicity of the Canonical Polyadic decomposition, the classification power of the Dual Structure-preserving Support Vector Machine, and the reliability of the Tensor Train (TT) approximation. We show by experiments that the TT-MMK method is usually more reliable computationally, less sensitive to tuning parameters, and gives higher prediction accuracy in the SVM classification when benchmarked against other state-of-the-art techniques.

収集されるデータのうち、高次元の多元配列(テンソル)の量は増加しており、効率的な学習アルゴリズムでは、このテンソル構造を可能な限り活用することが重要です。高次元データに常につきまとう次元の呪いと、データをベクトル化する際の構造の喪失が、専用に設計された低ランクテンソル分類法を利用する動機となります。トレーニングデータが少ない場合、カーネル法は非線形決定境界の可能性を提供するため、魅力的な選択肢となります。私たちは、正準多項(CP)分解のシンプルさ、双対構造保存サポートベクターマシンの分類力、およびテンソルトレイン(TT)近似の信頼性を組み合わせた、テンソルトレインマルチウェイマルチレベルカーネル(TT-MMK)を開発しました。私たちは、TT-MMK法が通常、計算上の信頼性が高く、チューニングパラメーターの影響を受けにくく、他の最先端技術と比較した場合、SVM分類でより高い予測精度が得られることを実験で示しています。

Bayesian Spiked Laplacian Graphs
ベイジアンスパイクラプラシアングラフ

In network analysis, it is common to work with a collection of graphs that exhibit heterogeneity. For example, neuroimaging data from patient cohorts are increasingly available. A critical analytical task is to identify communities, and graph Laplacian-based methods are routinely used. However, these methods are currently limited to a single network and also do not provide measures of uncertainty on the community assignment. In this work, we first propose a probabilistic network model called the “Spiked Laplacian Graph” that considers an observed network as a transform of the Laplacian and degree matrices of the network generating process, with the Laplacian eigenvalues modeled by a modified spiked structure. This effectively reduces the number of parameters in the eigenvectors, and their sign patterns allow efficient estimation of the underlying community structure. Further, the posterior distribution of the eigenvectors provides uncertainty quantification for the community estimates. Second, we introduce a Bayesian non-parametric approach to address the issue of heterogeneity in a collection of graphs. Theoretical results are established on the posterior consistency of the procedure and provide insights on the trade-off between model resolution and accuracy. We illustrate the performance of the methodology on synthetic data sets, as well as a neuroscience study related to brain activity in working memory.

ネットワーク分析では、異質性を示すグラフのコレクションを扱うのが一般的です。たとえば、患者コホートの神経画像データがますます利用可能になっています。コミュニティを識別することは重要な分析タスクであり、グラフラプラシアンベースの方法が日常的に使用されています。ただし、これらの方法は現在単一のネットワークに限定されており、コミュニティの割り当てに関する不確実性の尺度も提供していません。この研究では、まず、「スパイクラプラシアングラフ」と呼ばれる確率ネットワークモデルを提案します。これは、観測されたネットワークを、ネットワーク生成プロセスのラプラシアンと次数行列の変換と見なし、ラプラシアンの固有値を修正されたスパイク構造でモデル化します。これにより、固有ベクトルのパラメータ数が効果的に削減され、その符号パターンにより、基礎となるコミュニティ構造を効率的に推定できます。さらに、固有ベクトルの事後分布により、コミュニティ推定の不確実性の定量化が提供されます。次に、グラフのコレクションにおける異質性の問題に対処するために、ベイジアンノンパラメトリックアプローチを導入します。手順の事後一貫性に関する理論的結果が確立され、モデルの解像度と精度のトレードオフに関する洞察が提供されます。合成データセットでのこの方法論のパフォーマンスと、作業記憶における脳の活動に関連する神経科学の研究について説明します。

The Brier Score under Administrative Censoring: Problems and a Solution
管理上の打ち切りのもとでのBrierスコア:問題と解決策

The Brier score is commonly used for evaluating probability predictions. In survival analysis, with right-censored observations of the event times, this score can be weighted by the inverse probability of censoring (IPCW) to retain its original interpretation. It is common practice to estimate the censoring distribution with the Kaplan-Meier estimator, even though it assumes that the censoring distribution is independent of the covariates. This paper investigates problems that may arise for the IPCW weighting scheme when the covariates used in the prediction model contain information about the censoring times. In particular, this may occur for administratively censored data if the distribution of the covariates varies with calendar time. For administratively censored data, we propose an alternative version of the Brier score. This administrative Brier score does not require estimation of the censoring distribution and is valid also when the censoring times can be predicted from the covariates.

Brierスコアは、確率予測の評価によく使用されます。生存分析では、イベント時間の右側打ち切り観測値を使用して、このスコアを逆打ち切り確率(IPCW)で重み付けして、元の解釈を保持できます。打ち切り分布は共変量から独立していると想定されていますが、Kaplan-Meier推定量を使用して打ち切り分布を推定するのが一般的な方法です。この論文では、予測モデルで使用される共変量に打ち切り時間に関する情報が含まれている場合に、IPCW重み付けスキームで発生する可能性のある問題を調査します。特に、共変量の分布が暦時間によって変化する場合に、管理上打ち切りデータでこれが発生する可能性があります。管理上打ち切りデータの場合、Brierスコアの代替バージョンを提案します。この管理上Brierスコアでは、打ち切り分布の推定は必要なく、共変量から打ち切り時間を予測できる場合にも有効です。
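
The standard IPCW-weighted Brier score that the abstract takes as its starting point can be sketched as below, with the censoring distribution estimated by Kaplan-Meier. This is not the paper's administrative Brier score. The function names are assumptions, and the sketch assumes distinct observation times and a positive censoring survival probability at the horizon.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate G(s) = P(C > s), treating censorings as the 'events'."""
    censor_times = [t for t, e in zip(times, events) if e == 0]

    def G(s, left_limit=False):
        prob = 1.0
        for c in censor_times:
            # left_limit=True evaluates G(s-), i.e. just before time s.
            if c < s or (c == s and not left_limit):
                at_risk = sum(1 for t in times if t >= c)
                prob *= 1.0 - 1.0 / at_risk
        return prob

    return G


def ipcw_brier(times, events, surv_pred, horizon):
    """IPCW Brier score at `horizon`.

    times:     observed times (event or censoring)
    events:    1 if the event was observed, 0 if censored
    surv_pred: predicted probabilities of surviving past `horizon`
    """
    G = censoring_survival(times, events)
    total = 0.0
    for t, e, s in zip(times, events, surv_pred):
        if t <= horizon and e == 1:
            total += s ** 2 / G(t, left_limit=True)   # event before horizon
        elif t > horizon:
            total += (1.0 - s) ** 2 / G(horizon)      # still at risk at horizon
        # Observations censored before the horizon contribute zero.
    return total / len(times)
```

Without censoring, all weights equal one and the score reduces to the ordinary Brier score. Because the Kaplan-Meier weights ignore the covariates, they can be misleading when censoring times are predictable from the covariates, which is the situation the paper investigates.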

Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search
階層クラスタリングの近似限界: 平均リンケージ、二等分K平均法、およびローカル探索

Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. Having a well-understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice. This paper considers the dual of a problem framework for hierarchical clustering introduced by Dasgupta. The main result is that one of the most popular algorithms used in practice, average linkage agglomerative clustering, has a small constant approximation ratio for this objective. In contrast, this paper establishes that several other popular algorithms, including bisecting $k$-means divisive clustering, have very poor lower bounds on their approximation ratios for the same objective. However, we show that there are divisive algorithms that perform well with respect to this objective by giving two constant approximation algorithms. This paper is some of the first work to establish guarantees on widely used hierarchical algorithms for a natural objective function. This objective and analysis give insight into what these popular algorithms are optimizing and when they will perform well.

階層的クラスタリングは、数十年にわたって使用されてきたデータ分析手法です。広く使用されているにもかかわらず、この手法の分析基盤は未発達です。十分に理解された基盤があれば、現在使用されている手法をサポートし、将来の改善を導くのに役立ちます。この論文の目的は、実際に見られる観察をよりよく理解するための分析フレームワークを提供することです。この論文では、Dasguptaによって導入された階層的クラスタリングの問題フレームワークの双対を検討します。主な結果は、実際に使用されている最も一般的なアルゴリズムの1つである平均リンク凝集型クラスタリングでは、この目的に対して小さな定数近似比を持つということです。対照的に、この論文では、二分$k$平均分割クラスタリングを含む他のいくつかの一般的なアルゴリズムでは、同じ目的に対する近似比の下限が非常に貧弱であることを示しています。ただし、2つの定数近似アルゴリズムを示すことで、この目的に関して優れたパフォーマンスを発揮する分割アルゴリズムがあることを示します。この論文では、自然な目的関数に対して広く使用されている階層型アルゴリズムの保証を確立した最初の研究の1つです。この目的と分析により、これらの一般的なアルゴリズムが何を最適化しているのか、また、どのような場合に優れたパフォーマンスを発揮するのかについての洞察が得られます。
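
As a reminder of what average linkage agglomerative clustering does, here is a minimal sketch on one-dimensional points. It uses distances (merging the pair of clusters with the smallest average pairwise distance) rather than the similarity-based objective analyzed in the paper, and the function names and tie-breaking rule are assumptions of this illustration.

```python
def average_distance(points, cluster_a, cluster_b):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(abs(points[i] - points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def average_linkage(points):
    """Return the sequence of merges as pairs of index lists."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(points, clusters[a], clusters[b])
                if best is None or d < best[0]:  # ties keep the first pair found
                    best = (d, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


merges = average_linkage([0.0, 1.0, 10.0, 11.0, 50.0])
```

On this input the two tight pairs are merged first, then joined with each other, and the outlier at 50.0 is attached last.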
