Journal of Machine Learning Research Papers: List of Papers in Volume 23

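This page lists the papers published in Journal of Machine Learning Research Papers Volume 23. Each title and abstract is followed by a Japanese rendering produced with the help of machine translation.
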
Metrics of Calibration for Probabilistic Predictions
確率的予測のためのキャリブレーションのメトリクス

Many predictions are probabilistic in nature; for example, a prediction could be for precipitation tomorrow, but with only a 30 percent chance. Given such probabilistic predictions together with the actual outcomes, “reliability diagrams” (also known as “calibration plots”) help detect and diagnose statistically significant discrepancies (so-called “miscalibration”) between the predictions and the outcomes. The canonical reliability diagrams are based on histogramming the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation using smooth convolutional kernels is another common practice. But, which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question, by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram into a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph which deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. In particular, metrics based on binning or kernel density estimation unavoidably must trade-off statistical confidence for the ability to resolve variations as a function of the predicted probability or vice versa. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise. The cumulative methods do not impose such an explicit trade-off. Considering these results, practitioners probably should adopt the cumulative approach as a standard for best practices.

多くの予測は、本質的に確率的です。たとえば、明日は降水があると予測されていても、その確率は30パーセントに過ぎない、などです。このような確率予測と実際の結果を併せて考えると、「信頼性図」(「キャリブレーションプロット」とも呼ばれる)は、予測と結果の間の統計的に有意な食い違い(いわゆる「ミスキャリブレーション」)を検出して診断するのに役立ちます。標準的な信頼性図は、予測の観測値と期待値のヒストグラムに基づいています。ハードヒストグラムビニングを、滑らかな畳み込みカーネルを使用したソフトカーネル密度推定に置き換えることも、一般的な方法です。しかし、どの幅のビンまたはカーネルが最適でしょうか。観測値と期待値の累積差のプロットでは、ミスキャリブレーションをグラフのセカントラインの傾きとして直接表示することで、この問題をほぼ回避できます。セカントラインの一定のオフセットが無関係な場合でも、傾きを定量的な精度で簡単に認識できます。ビン分けやカーネル密度推定を行う必要はありません。既存の標準のミスキャリブレーションメトリックは、信頼性図を1つのスカラー統計にまとめたものです。累積プロットは、累積差のグラフがゼロからどれだけずれているかを示すスカラーメトリックに自然につながります。良好なキャリブレーションは、ゼロからほとんどずれていない水平で平坦なグラフに対応します。累積アプローチは、現在のところ慣例に反していますが、厳密な証明と数値例に裏付けられた数学理論によって保証された多くの好ましい統計特性を提供します。特に、ビン分けまたはカーネル密度推定に基づくメトリックは、予測確率の関数として変動を分析する能力と統計的信頼性をトレードオフする必要があります。またはその逆です。ビンまたはカーネルを広げると、ランダムノイズが平均化されますが、分解能がいくらか低下します。ビンまたはカーネルを狭めると、分解能が向上しますが、平均化されるノイズはそれほど多くありません。累積法では、このような明確なトレードオフはありません。これらの結果を考慮すると、実践者はおそらく累積アプローチをベストプラクティスの標準として採用する必要があります。
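
The cumulative-difference idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: the function names and the two scalar summaries (maximum absolute deviation and range) are assumptions chosen for this example, and NumPy is the only dependency.

```python
import numpy as np

def cumulative_differences(probs, outcomes):
    """Cumulative sums of (outcome - predicted probability) / n,
    with the samples ordered by predicted probability."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    order = np.argsort(probs, kind="stable")
    diffs = (outcomes[order] - probs[order]) / len(probs)
    # Prepend 0 so the graph starts at the origin.
    return np.concatenate([[0.0], np.cumsum(diffs)])

def max_abs_deviation(probs, outcomes):
    """Hypothetical scalar summary: largest distance of the graph from zero."""
    return float(np.max(np.abs(cumulative_differences(probs, outcomes))))

def deviation_range(probs, outcomes):
    """Hypothetical scalar summary: spread between the highest and lowest point."""
    c = cumulative_differences(probs, outcomes)
    return float(c.max() - c.min())

# Tiny example: two predictions, one under-confident and one over-confident.
example_probs = [0.2, 0.8]
example_outcomes = [0, 1]
example_curve = cumulative_differences(example_probs, example_outcomes)
```

No bin width or kernel width appears anywhere in the sketch; a well-calibrated model would produce a curve that stays close to zero, and a steep secant slope over a range of predicted probabilities indicates miscalibration there.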

Approximate Bayesian Computation via Classification
分類による近似ベイズ計算

Approximate Bayesian Computation (ABC) enables statistical inference in simulator-based models whose likelihoods are difficult to calculate but easy to simulate from. ABC constructs a kernel-type approximation to the posterior distribution through an accept/reject mechanism which compares summary statistics of real and simulated data. To obviate the need for summary statistics, we directly compare empirical distributions with a Kullback-Leibler (KL) divergence estimator obtained via contrastive learning. In particular, we blend flexible machine learning classifiers within ABC to automate fake/real data comparisons. We consider the traditional accept/reject kernel as well as an exponential weighting scheme which does not require the ABC acceptance threshold. Our theoretical results show that the rate at which our ABC posterior distributions concentrate around the true parameter depends on the estimation error of the classifier. We derive limiting posterior shape results and find that, with a properly scaled exponential kernel, asymptotic normality holds. We demonstrate the usefulness of our approach on simulated examples as well as real data in the context of stock volatility estimation.

近似ベイズ計算(ABC)は、尤度の計算は難しいがシミュレーションは容易なシミュレータベースのモデルで統計的推論を可能にします。ABCは、実データとシミュレーションデータの要約統計量を比較する受け入れ/拒否メカニズムを通じて、事後分布のカーネル型近似を構築します。要約統計量を不要にするために、対照学習によって得られたKullback-Leibler (KL)ダイバージェンス推定量を用いて経験分布を直接比較します。特に、ABC内に柔軟な機械学習分類器を組み込み、偽データと実データの比較を自動化します。従来の受け入れ/拒否カーネルと、ABC受け入れしきい値を必要としない指数重み付け方式を検討します。理論的な結果から、ABC事後分布が真のパラメータの周囲に集中する速度は、分類器の推定誤差に依存することがわかります。事後分布の極限形状に関する結果を導出し、適切にスケーリングされた指数カーネルでは漸近正規性が成り立つことを示します。株価ボラティリティの推定という文脈で、シミュレーション例と実際のデータを用いて提案手法の有用性を実証します。
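
The two kernels mentioned in the abstract above, accept/reject and exponential weighting, can be illustrated with a toy sketch. The discrepancy used here is a simple difference of sample means, a placeholder assumption standing in for the classifier-based KL estimate the paper proposes; the function names and the Gaussian toy model are likewise hypothetical.

```python
import numpy as np

def abc_rejection(observed, prior_sample, simulate, discrepancy,
                  n_draws, threshold, rng):
    """Keep prior draws whose simulated data lie within `threshold` of the observed data."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        fake = simulate(theta, rng)
        if discrepancy(observed, fake) <= threshold:
            accepted.append(theta)
    return np.array(accepted)

def exponential_weights(discrepancies, scale=1.0):
    """Threshold-free alternative: weight every draw by exp(-scale * discrepancy)."""
    d = np.asarray(discrepancies, dtype=float)
    w = np.exp(-scale * (d - d.min()))  # subtracting the minimum avoids underflow
    return w / w.sum()

# Toy model (assumption): infer the mean of a unit-variance Gaussian.
rng = np.random.default_rng(0)
observed = rng.normal(2.0, 1.0, size=200)
posterior_draws = abc_rejection(
    observed,
    prior_sample=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda theta, r: r.normal(theta, 1.0, size=200),
    # Placeholder discrepancy; the paper would use a classifier-based KL estimate here.
    discrepancy=lambda real, fake: abs(real.mean() - fake.mean()),
    n_draws=2000,
    threshold=0.2,
    rng=rng,
)
toy_weights = exponential_weights([0.0, 1.0, 2.0])
```

The exponential scheme keeps every simulated draw and only down-weights poor matches, which is why it needs no acceptance threshold.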

OMLT: Optimization & Machine Learning Toolkit
OMLT: 最適化 & 機械学習ツールキット

The optimization and machine learning toolkit (OMLT) is an open-source software package incorporating neural network and gradient-boosted tree surrogate models, which have been trained using machine learning, into larger optimization problems. We discuss the advances in optimization technology that made OMLT possible and show how OMLT seamlessly integrates with the algebraic modeling language Pyomo. We demonstrate how to use OMLT for solving decision-making problems in both computer science and engineering.

最適化および機械学習ツールキット(OMLT)は、機械学習を使用してトレーニングされたニューラルネットワークと勾配ブーストツリーサロゲートモデルを大規模な最適化問題に組み込むオープンソースソフトウェアパッケージです。OMLTを可能にした最適化技術の進歩について説明し、OMLTが代数モデリング言語Pyomoとシームレスに統合する方法を示します。コンピュータサイエンスとエンジニアリングの両方で意思決定の問題を解決するためにOMLTを使用する方法を示します。

Scalable Gaussian-process regression and variable selection using Vecchia approximations
ベッキア近似を用いたスケーラブルなガウス過程回帰と変数選択

Gaussian process (GP) regression is a flexible, nonparametric approach to regression that naturally quantifies uncertainty. In many applications, the number of responses and covariates are both large, and a goal is to select covariates that are related to the response. For this setting, we propose a novel, scalable algorithm, coined VGPR, which optimizes a penalized GP log-likelihood based on the Vecchia GP approximation, an ordered conditional approximation from spatial statistics that implies a sparse Cholesky factor of the precision matrix. We traverse the regularization path from strong to weak penalization, sequentially adding candidate covariates based on the gradient of the log-likelihood and deselecting irrelevant covariates via a new quadratic constrained coordinate descent algorithm. We propose Vecchia-based mini-batch subsampling, which provides unbiased gradient estimators. The resulting procedure is scalable to millions of responses and thousands of covariates. Theoretical analysis and numerical studies demonstrate the improved scalability and accuracy relative to existing methods.

ガウス過程(GP)回帰は、不確実性を自然に定量化する、回帰に対する柔軟でノンパラメトリックなアプローチです。多くのアプリケーションでは、応答と共変量の数はどちらも大きく、応答に関連する共変量を選択することが目標となります。この設定に対して、VGPRと名付けられた新しいスケーラブルなアルゴリズムを提案します。このアルゴリズムは、精度行列のスパースコレスキー因子を暗示する空間統計からの順序付き条件付き近似であるVecchia GP近似に基づいて、ペナルティ付きGP対数尤度を最適化します。強いペナルティから弱いペナルティまで正則化パスをたどり、対数尤度の勾配に基づいて候補共変量を順次追加し、新しい2次制約付き座標降下アルゴリズムを使用して無関係な共変量を選択解除します。不偏勾配推定量を提供するVecchiaベースのミニバッチサブサンプリングを提案します。結果として得られる手順は、数百万の応答と数千の共変量にスケーラブルです。理論分析と数値研究により、既存の方法に比べてスケーラビリティと精度が向上していることが実証されています。
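
To make the "ordered conditional approximation" in the abstract above concrete, the sketch below evaluates a Vecchia-style log-likelihood in which each observation is conditioned on at most `m` earlier observations. It is an illustration under stated assumptions, not the VGPR algorithm: the conditioning sets are chosen as the earlier points with the largest covariance, the kernel is a squared exponential with a nugget, and all names are hypothetical.

```python
import numpy as np

def exact_gp_loglik(y, K):
    """Exact zero-mean Gaussian log-likelihood via a Cholesky factor of K."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def vecchia_loglik(y, K, m):
    """Sum of log p(y_i | y_c(i)), where c(i) holds at most m earlier indices."""
    total = 0.0
    for i in range(len(y)):
        prev = np.arange(i)
        # Assumption: condition on the earlier points most correlated with point i.
        c = prev[np.argsort(-K[i, :i], kind="stable")[:m]]
        if c.size:
            w = np.linalg.solve(K[np.ix_(c, c)], K[c, i])
            mean = w @ y[c]
            var = K[i, i] - w @ K[c, i]
        else:
            mean, var = 0.0, K[i, i]
        total += -0.5 * (np.log(2.0 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

# Toy data: 40 points on a line, squared-exponential kernel plus a nugget of 0.1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 40)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 0.1 * np.eye(40)
y = np.linalg.cholesky(K) @ rng.standard_normal(40)
approx_ll = vecchia_loglik(y, K, m=5)
exact_ll = exact_gp_loglik(y, K)
```

Each term only requires solving an `m`-by-`m` system, which is the source of the scalability; when `m` is at least `n - 1` every point conditions on all earlier points and the sum equals the exact log-likelihood.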

Existence, Stability and Scalability of Orthogonal Convolutional Neural Networks
直交畳み込みニューラルネットワークの存在、安定性、スケーラビリティ

Imposing orthogonality on the layers of neural networks is known to facilitate learning by limiting exploding/vanishing gradients, decorrelating the features, and improving robustness. This paper studies the theoretical properties of orthogonal convolutional layers. We establish necessary and sufficient conditions on the layer architecture guaranteeing the existence of an orthogonal convolutional transform. The conditions prove that orthogonal convolutional transforms exist for almost all architectures used in practice for ‘circular’ padding. We also exhibit limitations with ‘valid’ boundary conditions and ‘same’ boundary conditions with zero-padding. Recently, a regularization term imposing the orthogonality of convolutional layers has been proposed, and impressive empirical results have been obtained in different applications (Wang et al., 2020). The second motivation of the present paper is to specify the theory behind this. We make the link between this regularization term and orthogonality measures. In doing so, we show that this regularization strategy is stable with respect to numerical and optimization errors and that, in the presence of small errors and when the size of the signal/image is large, the convolutional layers remain close to isometric. The theoretical results are confirmed with experiments and the landscape of the regularization term is studied. Experiments on real data sets show that when orthogonality is used to enforce robustness, the parameter multiplying the regularization term can be used to tune a tradeoff between accuracy and orthogonality, for the benefit of both accuracy and robustness. Altogether, the study guarantees that the regularization proposed in Wang et al. (2020) is an efficient, flexible and stable numerical strategy to learn orthogonal convolutional layers.

ニューラルネットワークの層に直交性を課すと、勾配の爆発/消失を制限して学習を容易にし、特徴を非相関化し、堅牢性を向上させることが知られています。この論文では、直交畳み込み層の理論的特性について検討します。直交畳み込み変換の存在を保証する層アーキテクチャに関する必要かつ十分な条件を確立します。条件は、直交畳み込み変換が「円形」パディングに実際に使用されるほぼすべてのアーキテクチャに存在することを証明します。また、「有効な」境界条件とゼロパディングの「同じ」境界条件の制限も示します。最近、畳み込み層の直交性を課す正則化項が提案され、さまざまなアプリケーションで印象的な実験結果が得られています(Wangら、2020)。本論文の2番目の目的は、この背後にある理論を特定することです。この正則化項と直交性尺度を結び付けます。そうすることで、この正則化戦略は数値誤差と最適化誤差に関して安定しており、誤差が小さく信号/画像のサイズが大きい場合、畳み込み層は等尺性に近いままであることを示します。理論的結果は実験で確認され、正則化項のランドスケープが研究されています。実際のデータセットでの実験では、直交性を使用して堅牢性を強化する場合、正則化項を乗算するパラメーターを使用して、精度と直交性のトレードオフを調整し、精度と堅牢性の両方にメリットをもたらすことができます。全体として、この研究では、Wangら(2020)で提案された正則化が、直交畳み込み層を学習するための効率的で柔軟かつ安定した数値戦略であることを保証しています。

Minimax optimal approaches to the label shift problem in non-parametric settings
ノンパラメトリック設定におけるラベルシフト問題へのミニマックス最適アプローチ

We study the minimax rates of the label shift problem in non-parametric classification. In addition to the unsupervised setting in which the learner only has access to unlabeled examples from the target domain, we also consider the setting in which a small number of labeled examples from the target domain is available to the learner. Our study reveals a difference in the difficulty of the label shift problem in the two settings, and we attribute this difference to the availability of data from the target domain to estimate the class conditional distributions in the latter setting. We also show that a class proportion estimation approach is minimax rate-optimal in the unsupervised setting.

私たちは、ノンパラメトリック分類におけるラベルシフト問題のミニマックス率を研究します。学習者がターゲットドメインのラベル付けされていない例にのみアクセスできる教師なし設定に加えて、ターゲットドメインのラベル付けされた少数の例を学習者が利用できる設定も考慮します。私たちの研究では、2つの設定でラベルシフト問題の難易度に違いがあることが明らかになり、この違いは、後者の設定でクラスの条件付き分布を推定するためのターゲットドメインからのデータの可用性に起因すると考えています。また、クラス比率推定アプローチが教師なし設定でミニマックスレート最適であることも示します。
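
The "class proportion estimation approach" mentioned in the abstract above can be illustrated with a common confusion-matrix estimator. This sketch is an assumption about what such an estimator might look like, not necessarily the one analysed in the paper: it uses hard predictions of a fixed classifier on labeled source data and unlabeled target data, and the function name is hypothetical.

```python
import numpy as np

def estimate_target_class_proportions(source_labels, source_preds,
                                      target_preds, n_classes):
    """Estimate target class proportions from a classifier's confusion matrix."""
    source_labels = np.asarray(source_labels)
    source_preds = np.asarray(source_preds)
    target_preds = np.asarray(target_preds)
    # C[i, j] = P_source(prediction = i, label = j)
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (source_preds, source_labels), 1.0)
    C /= len(source_labels)
    # mu[i] = P_target(prediction = i)
    mu = np.bincount(target_preds, minlength=n_classes) / len(target_preds)
    # Under label shift, mu = C @ w with w[j] = P_target(label j) / P_source(label j).
    w = np.linalg.solve(C, mu)
    source_prior = np.bincount(source_labels, minlength=n_classes) / len(source_labels)
    q = np.clip(w * source_prior, 0.0, None)
    return q / q.sum()

# Toy example: a classifier that is right 75% of the time on each class.
toy_source_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_source_preds = [0, 0, 0, 1, 1, 1, 1, 0]
toy_target_preds = [0, 0, 0, 1, 1, 1, 1, 1]
toy_proportions = estimate_target_class_proportions(
    toy_source_labels, toy_source_preds, toy_target_preds, n_classes=2)
```

The estimator needs no target labels, which matches the unsupervised setting described in the abstract; it does require the confusion matrix to be invertible.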

Constraint Reasoning Embedded Structured Prediction
制約推論を埋め込んだ構造化予測

Many real-world structured prediction problems need machine learning to capture data distribution and constraint reasoning to ensure structure validity. Nevertheless, constrained structured prediction is still limited in real-world applications because of the lack of tools to bridge constraint satisfaction and machine learning. In this paper, we propose COnstraint REasoning embedded Structured Prediction (Core-Sp), a scalable constraint reasoning and machine learning integrated approach for learning over structured domains. We propose to embed decision diagrams, a popular constraint reasoning tool, as a fully-differentiable module into deep neural networks for structured prediction. We also propose an iterative search algorithm to automate the searching process of the best Core-Sp structure. We evaluate Core-Sp on three applications: vehicle dispatching service planning, if-then program synthesis, and text2SQL generation. The proposed Core-Sp module demonstrates superior performance over state-of-the-art approaches in all three applications. The structures generated with Core-Sp satisfy 100% of the constraints when using exact decision diagrams. In addition, Core-Sp boosts learning performance by reducing the modeling space via constraint satisfaction.

現実世界の構造化予測問題の多くは、データ分布を捉えるための機械学習と、構造の妥当性を確保するための制約推論を必要とします。しかし、制約充足と機械学習を橋渡しするツールがないため、制約付き構造化予測は現実世界のアプリケーションでは依然として制限されています。この論文では、構造化ドメインでの学習のためのスケーラブルな制約推論と機械学習の統合アプローチであるCOnstraint REasoning embedded Structured Prediction (Core-Sp)を提案します。構造化予測のために、一般的な制約推論ツールである決定図を完全に微分可能なモジュールとしてディープニューラルネットワークに埋め込むことを提案します。また、最適なCore-Sp構造の検索プロセスを自動化する反復検索アルゴリズムも提案します。Core-Spを、車両ディスパッチサービス計画、if-thenプログラム合成、text2SQL生成の3つのアプリケーションで評価します。提案されたCore-Spモジュールは、3つのアプリケーションすべてで最先端のアプローチよりも優れたパフォーマンスを発揮します。Core-Spで生成された構造は、正確な決定図を使用する場合、制約を100%満たします。さらに、Core-Spは制約の充足を通じてモデリング空間を縮小することで学習パフォーマンスを向上させます。

Vector-Valued Least-Squares Regression under Output Regularity Assumptions
出力規則性の仮定の下でのベクトル値最小二乗回帰

We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output. We derive learning bounds for our method and study the settings in which its statistical performance improves on that of the full-rank method. Our analysis extends the interest of reduced-rank regression beyond the standard low-rank setting to more general output regularity assumptions. We illustrate our theoretical insights on synthetic least-squares problems. Then, we propose a surrogate structured prediction method derived from this reduced-rank method. We assess its benefits on three different problems: image reconstruction, multi-label classification, and metabolite identification.

私たちは、無限次元出力を持つ最小二乗回帰問題を解くための低ランク法を提案し、分析します。本手法の学習限界を導き出し、フルランク法と比較して統計的パフォーマンスの設定が向上する点について検討します。この分析では、低ランク回帰の関心を標準の低ランク設定を超えて、より一般的な出力規則性の仮定にまで拡張します。合成最小二乗問題に関する理論的な洞察を示します。そこで、この低ランク法から導出される代理構造化予測法を提案します。私たちは、画像再構成、マルチラベル分類、代謝物同定という3つの異なる問題でその利点を評価します。

Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch
状態分布不一致下におけるソフトマックスオフポリシーアクタークリティックの大域的最適性と有限サンプル分析

In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which aligns well with what practitioners do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because at each step the update is done for only the current state action pair. Moreover, we remove several restrictive assumptions from existing works in our analysis. Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains, based on its uniform contraction properties.

この論文では、密度比を使用して動作ポリシーの状態分布とターゲットポリシーの状態分布の不一致を修正することなく、表形式設定でオフポリシーアクタークリティックアルゴリズムのグローバル最適性と収束率を確立します。本研究は、ポリシーパラメータの更新に正確なポリシー勾配を使用する既存の研究を上回っていますが、本研究では近似的かつ確率的な更新ステップを使用しています。本更新ステップは勾配更新ではありません。状態分布を修正するために密度比を使用しないためです。これは、実践者が行っていることとよく一致しています。本更新は近似です。真の値関数の代わりに学習済みクリティックを使用するためです。本更新は確率的です。各ステップで、現在の状態アクションペアに対してのみ更新が行われます。さらに、本分析では、既存の研究からいくつかの制限的な仮定を取り除きます。本研究の中心となるのは、時間不均一なマルコフ連鎖上の時間不均一な更新演算子を使用した一般的な確率的近似アルゴリズムの有限サンプル分析です。これは、その均一な収縮特性に基づいています。

SGD with Coordinate Sampling: Theory and Practice
座標サンプリングによるSGD:理論と実践

While classical forms of stochastic gradient descent algorithm treat the different coordinates in the same way, a framework allowing for adaptive (non uniform) coordinate sampling is developed to leverage structure in data. In a non-convex setting and including zeroth-order gradient estimate, almost sure convergence as well as non-asymptotic bounds are established. Within the proposed framework, we develop an algorithm, MUSKETEER, based on a reinforcement strategy: after collecting information on the noisy gradients, it samples the most promising coordinate (all for one); then it moves along the one direction yielding an important decrease of the objective (one for all). Numerical experiments on both synthetic and real data examples confirm the effectiveness of MUSKETEER in large scale problems.

古典的な形式の確率的勾配降下アルゴリズムは、異なる座標を同じように扱いますが、データの構造を活用するために、適応型(不均一)座標サンプリングを可能にするフレームワークが開発されています。非凸設定で0次勾配推定を含む場合について、ほぼ確実な収束と非漸近的な上界が確立されます。提案されたフレームワーク内で、我々は強化戦略に基づいてアルゴリズムMUSKETEERを開発します。ノイズの多い勾配に関する情報を収集した後、最も有望な座標をサンプリングします(all for one)。次に、目的関数を大きく減少させるその一方向に沿って移動します(one for all)。合成データと実データの両方の例での数値実験により、大規模な問題におけるMUSKETEERの有効性が確認されています。
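
A rough sketch of adaptive coordinate sampling is given below. It is not MUSKETEER itself: the scoring rule (a running average of absolute partial derivatives), the mixing with a uniform distribution, and every parameter name are assumptions made for illustration only.

```python
import numpy as np

def coordinate_sgd(grad, x0, n_steps, lr, rng, mix=0.2, decay=0.9):
    """SGD that updates one coordinate per step, sampled non-uniformly."""
    x = np.array(x0, dtype=float)
    d = x.size
    score = np.ones(d)  # running estimate of how promising each coordinate is
    for _ in range(n_steps):
        # Mix the score-based distribution with a uniform one to keep exploring.
        probs = (1.0 - mix) * score / score.sum() + mix / d
        k = rng.choice(d, p=probs)
        # For simplicity the full noisy gradient is evaluated; in practice
        # only the k-th partial derivative would be needed.
        g = grad(x, rng)
        x[k] -= lr * g[k]
        score[k] = decay * score[k] + (1.0 - decay) * abs(g[k])
    return x

# Toy objective (assumption): f(x) = 0.5 * sum(a * x**2) with noisy gradients.
a = np.array([1.0, 10.0, 0.1])
objective = lambda x: 0.5 * float(np.sum(a * x ** 2))
noisy_grad = lambda x, r: a * x + 0.01 * r.standard_normal(x.size)
x_final = coordinate_sgd(noisy_grad, np.ones(3), n_steps=500, lr=0.05,
                         rng=np.random.default_rng(0))
```

Coordinates whose partial derivatives have recently been large are sampled more often, which is the intuition behind leveraging structure in the data rather than treating all coordinates alike.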

Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification
情報理論的分類精度:多クラス分類におけるあいまいな結果ラベルのデータ駆動型の組み合わせを導く基準

Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there is no principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at https://github.com/JSB-UCLA/ITCA) implements ITCA and the search strategies.

結果ラベルの曖昧さと主観性は、現実世界のデータセットに遍在しています。実務者は通常、すべてのデータポイント(インスタンス)の曖昧な結果ラベルをアドホックに組み合わせてマルチクラス分類の精度を向上させますが、すべてのデータポイントのラベルの組み合わせを最適性基準で導くための原則的なアプローチはありません。この問題に対処するために、情報理論的分類精度(ITCA)を提案します。これは、予測精度(予測されたラベルが実際のラベルとどの程度一致するか)と分類解像度(予測可能なラベルの数)のトレードオフのバランスをとる基準であり、実務者が曖昧な結果ラベルを組み合わせる方法をガイドします。ITCAによって示される最適なラベルの組み合わせを見つけるために、貪欲検索と幅優先検索という2つの検索戦略を提案します。ITCAと2つの検索戦略は、すべての機械学習分類アルゴリズムに適応できます。分類アルゴリズムと検索戦略と組み合わせることで、ITCAには予測精度の向上と曖昧なラベルの識別という2つの用途があります。まず、ITCAが合成データと実データの両方の検索戦略で正しいラベルの組み合わせを見つける際に高い精度を達成することを確認します。次に、医療予後、がんの生存率予測、ユーザーの人口統計予測、細胞タイプの分類など、さまざまなアプリケーションでのITCAの有効性を実証します。また、オラクルと線形判別分析分類アルゴリズムを研究することで、ITCAに関する理論的な洞察も提供します。Pythonパッケージitca (https://github.com/JSB-UCLA/ITCAで入手可能)は、ITCAと検索戦略を実装します。

Fundamental Limits and Tradeoffs in Invariant Representation Learning
不変表現学習における基本的な制限とトレードオフ

A wide range of machine learning applications such as privacy-preserving learning, algorithmic fairness, and domain adaptation/generalization among others, involve learning invariant representations of the data that aim to achieve two competing goals: (a) maximize information or accuracy with respect to a target response, and (b) maximize invariance or independence with respect to a set of protected features (e.g., for fairness, privacy, etc.). Despite their wide applicability, theoretical understanding of the optimal tradeoffs (with respect to accuracy and invariance) achievable by invariant representations is still severely lacking. In this paper, we provide an information theoretic analysis of such tradeoffs under both classification and regression settings. More precisely, we provide a geometric characterization of the accuracy and invariance achievable by any representation of the data; we term this feasible region the information plane. We provide an inner bound for this feasible region for the classification case, and an exact characterization for the regression case, which allows us to either bound or exactly characterize the Pareto optimal frontier between accuracy and invariance. Although our contributions are mainly theoretical, a key practical application of our results is in certifying the potential sub-optimality of any given representation learning algorithm for either classification or regression tasks. Our results shed new light on the fundamental interplay between accuracy and invariance, and may be useful in guiding the design of future representation learning algorithms.

プライバシー保護学習、アルゴリズムの公平性、ドメイン適応/一般化など、さまざまな機械学習アプリケーションでは、データの不変表現を学習して、(a)ターゲット応答に関する情報または精度を最大化すること、および(b)保護された機能セット(公平性、プライバシーなど)に関する不変性または独立性を最大化することという、2つの競合する目標を達成することを目指します。不変表現は幅広く適用できますが、不変表現によって達成できる最適なトレードオフ(精度と不変性に関して)の理論的理解は、まだ大きく欠如しています。この論文では、分類と回帰の両方の設定におけるこのようなトレードオフの情報理論的分析を提供します。より正確には、データの任意の表現によって達成できる精度と不変性の幾何学的特徴付けを提供します。この実現可能な領域を情報平面と呼びます。分類の場合、この実行可能領域の内部境界と回帰の場合の正確な特性を提供します。これにより、精度と不変性の間のパレート最適境界を境界で囲むか、正確に特性付けることができます。私たちの貢献は主に理論的なものです。しかし、私たちの結果の重要な実用的応用は、分類または回帰タスクのいずれかに対する任意の表現学習アルゴリズムの潜在的な準最適性を証明することです。私たちの結果は、精度と不変性の基本的な相互作用に新たな光を当て、将来の表現学習アルゴリズムの設計を導くのに役立つ可能性があります。

Early Stopping for Iterative Regularization with General Loss Functions
一般損失関数による反復正則化の早期停止

In this paper, we investigate the early stopping strategy for the iterative regularization technique, which is based on gradient descent of convex loss functions in reproducing kernel Hilbert spaces without an explicit regularization term. This work shows that projecting the last iterate of the stopping time produces an estimator that can improve the generalization ability. Using the upper bound of the generalization errors, we establish a close link between the iterative regularization and Tikhonov regularization scheme and explain theoretically why the two schemes have similar regularization paths in the existing numerical simulations. We introduce a data-dependent method based on cross-validation to select the stopping time. We prove that this a posteriori selection retains generalization errors comparable to those obtained by our stopping rules with a priori parameters.

この論文では、明示的な正則化項を用いない、再生核ヒルベルト空間における凸損失関数の勾配降下法に基づく反復正則化手法の早期停止戦略を調査します。この研究では、停止時刻における最後の反復を射影すると、汎化能力を向上させることができる推定量が得られることを示しています。汎化誤差の上界を使用して、反復正則化とチホノフ正則化スキームとの間に密接な関連を確立し、既存の数値シミュレーションで2つのスキームが類似した正則化パスを持つ理由を理論的に説明します。停止時刻を選択するために、クロスバリデーションに基づくデータ依存の方法を導入します。私たちは、この事後的な選択方法が、事前に定めたパラメータを持つ停止ルールによって得られるものと同等の汎化誤差を保持できることを証明します。
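
Early stopping as implicit regularization can be sketched with a hold-out set. The example is deliberately simplified and rests on assumptions that differ from the paper's setting: it uses the squared loss with a linear model rather than a general convex loss in a reproducing kernel Hilbert space, a single validation split rather than full cross-validation, and hypothetical names throughout.

```python
import numpy as np

def early_stopped_gd(X_train, y_train, X_val, y_val, lr, max_iters):
    """Gradient descent with no penalty term; the iteration count is the regularizer."""
    w = np.zeros(X_train.shape[1])
    best_w, best_iter = w.copy(), 0
    best_err = np.mean((X_val @ w - y_val) ** 2)
    for t in range(1, max_iters + 1):
        grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
        w = w - lr * grad
        err = np.mean((X_val @ w - y_val) ** 2)
        if err < best_err:  # data-dependent choice of the stopping time
            best_w, best_iter, best_err = w.copy(), t, err
    return best_w, best_iter, best_err

# Toy data (assumption): noisy linear responses, split into train and validation.
rng = np.random.default_rng(0)
X_all = rng.standard_normal((120, 20))
w_true = rng.standard_normal(20)
y_all = X_all @ w_true + rng.standard_normal(120)
X_tr, y_tr, X_va, y_va = X_all[:80], y_all[:80], X_all[80:], y_all[80:]
w_hat, stop_iter, val_err = early_stopped_gd(X_tr, y_tr, X_va, y_va,
                                             lr=0.05, max_iters=300)
```

Stopping after few iterations plays a role similar to a large Tikhonov penalty, and running longer resembles a small one, which is the correspondence between the two regularization paths that the abstract refers to.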

Interval-censored Hawkes processes
インターバル打ち切りホークス過程

Interval-censored data solely records the aggregated counts of events during specific time intervals (such as the number of patients admitted to the hospital or the volume of vehicles passing traffic loop detectors) and not the exact occurrence time of the events. It is currently not understood how to fit the Hawkes point processes to this kind of data. Its typical loss function (the point process log-likelihood) cannot be computed without exact event times. Furthermore, it does not have the independent increments property to use the Poisson likelihood. This work builds a novel point process, a set of tools, and approximations for fitting Hawkes processes within interval-censored data scenarios. First, we define the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. We fit MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function, for when the exogenous events are observed as event time, and the latent homogeneous Poisson process exogenous function, for when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and finally, we connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. Using the connection, we show that the popularity estimation algorithm Hawkes Intensity Process (HIP) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data. We find that our MBPP outperforms HIP on real-world datasets for the task of popularity prediction. This work makes it possible to efficiently fit the Hawkes process to interval-censored data.

区間打ち切りデータは、特定の時間間隔におけるイベントの総計数（入院患者数や交通ループ検出器を通過する車両数など）のみを記録し、イベントの正確な発生時刻は記録しません。現在、この種のデータにホークス点過程を適合させる方法はわかっていません。その典型的な損失関数（点過程対数尤度）は、正確なイベント時刻がなければ計算できません。さらに、ポアソン尤度を使用するための独立増分特性がありません。この研究では、区間打ち切りデータシナリオ内でホークス過程を適合させるための新しい点過程、ツールセット、および近似値を構築します。まず、一般的な自己励起ホークス過程と直接パラメータが一致する新しいポアソン過程である平均行動ポアソン過程（MBPP）を定義します。区間打ち切りポアソン対数尤度（IC-LL）を使用して、区間打ち切り設定でMBPPを適合させます。パラメータ等価性を使用して、関連するホークス過程のパラメータを明らかにします。次に、外生イベントと内生イベントを区別するための2つの新しい外生関数を導入します。外生イベントがイベント時間として観測される場合のマルチインパルス外生関数と、外生イベントが区間打ち切りボリュームとして提示される場合の潜在的同次ポアソン過程外生関数を提案します。3番目に、解析解が存在しない場合にMBPPの強度と補償関数を推定するためのいくつかの近似法を提供します。4番目で最後の方法として、MBPPの区間打ち切り損失を、より広範なBregmanダイバージェンスベースの関数に関連付けます。この関連付けを使用して、人気度推定アルゴリズムであるホークス強度過程(HIP)がMBPPの特殊なケースであることを示します。合成データと実世界のデータでの経験的テストを通じてモデルを検証します。人気度予測のタスクでは、実世界のデータセットでMBPPがHIPよりも優れていることがわかりました。この研究により、区間打ち切りデータにホークス過程を効率的に適合することが可能になりました。
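
For a Poisson process, counts in disjoint intervals are independent Poisson variables whose means are the integrals of the intensity over each interval, so an interval-censored log-likelihood can be written down directly. The sketch below shows that generic computation; it is not the MBPP model, and the exponentially decaying intensity and all function names are assumptions made for illustration.

```python
import math

def interval_censored_poisson_loglik(counts, edges, compensator):
    """Log-likelihood of counts[i] events observed in (edges[i], edges[i+1]].

    `compensator(t)` returns the integral of the intensity over [0, t].
    """
    total = 0.0
    for c, start, end in zip(counts, edges[:-1], edges[1:]):
        mass = compensator(end) - compensator(start)  # expected count, assumed > 0
        total += c * math.log(mass) - mass - math.lgamma(c + 1)
    return total

def exp_decay_compensator(a, b):
    """Compensator of the hypothetical intensity a * exp(-b * t)."""
    return lambda t: a / b * (1.0 - math.exp(-b * t))

# Toy data: event counts in three unit-length intervals.
toy_counts = [5, 3, 1]
toy_edges = [0.0, 1.0, 2.0, 3.0]
toy_ll = interval_censored_poisson_loglik(
    toy_counts, toy_edges, exp_decay_compensator(a=6.0, b=0.7))
```

Only interval totals enter the computation, never individual event times, which is what makes this form of likelihood usable when the data are aggregated counts.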

Statistical Optimality and Computational Efficiency of Nystrom Kernel PCA
NystromカーネルPCAの統計的最適性と計算効率

Kernel methods provide an elegant framework for developing nonlinear learning algorithms from simple linear methods. Though these methods have superior empirical performance in several real data applications, their usefulness is inhibited by the significant computational burden incurred in large sample situations. Various approximation schemes have been proposed in the literature to alleviate these computational issues, and the approximate kernel machines are shown to retain the empirical performance. However, the theoretical properties of these approximate kernel machines are less well understood. In this work, we theoretically study the trade-off between computational complexity and statistical accuracy in Nystrom approximate kernel principal component analysis (KPCA), wherein we show that the Nystrom approximate KPCA matches the statistical performance of (non-approximate) KPCA while remaining computationally beneficial. Additionally, we show that Nystrom approximate KPCA outperforms the statistical behavior of another popular approximation scheme, the random feature approximation, when applied to KPCA.

カーネル法は、単純な線形法から非線形学習アルゴリズムを開発するための優れたフレームワークを提供します。これらの方法は、いくつかの実際のデータアプリケーションで優れた経験的パフォーマンスを発揮しますが、大規模なサンプル状況で発生する大きな計算負荷によってその有用性が制限されます。これらの計算上の問題を軽減するために、さまざまな近似スキームが文献で提案されており、近似カーネルマシンは経験的パフォーマンスを維持することが示されています。ただし、これらの近似カーネルマシンの理論的特性はあまりよく理解されていません。この研究では、Nystrom近似カーネル主成分分析(KPCA)における計算の複雑さと統計的精度のトレードオフを理論的に研究し、Nystrom近似KPCAが(非近似) KPCAの統計的パフォーマンスに匹敵しながら、計算上の利点を維持することを示します。さらに、KPCAに適用した場合、Nystrom近似KPCAは、別の一般的な近似スキームであるランダム特徴近似の統計的動作よりも優れていることを示します。
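
The computational saving behind Nystrom approximate KPCA can be seen in a short sketch: the full `n`-by-`n` kernel matrix is replaced by features built from `m` landmark points, and ordinary PCA is run on those features. This is a generic illustration rather than the exact estimator analysed in the paper; the RBF kernel, the choice of every second point as a landmark, and the function names are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def nystrom_features(X, landmarks, lengthscale=1.0, eps=1e-10):
    """Features Phi with Phi @ Phi.T = K_nm @ pinv(K_mm) @ K_mn."""
    K_mm = rbf_kernel(landmarks, landmarks, lengthscale)
    K_nm = rbf_kernel(X, landmarks, lengthscale)
    evals, evecs = np.linalg.eigh(K_mm)
    keep = evals > eps  # drop numerically null directions
    inv_sqrt = evecs[:, keep] / np.sqrt(evals[keep])
    return K_nm @ inv_sqrt

def nystrom_kpca(X, landmarks, n_components, lengthscale=1.0):
    """Approximate kernel PCA scores: linear PCA on centered Nystrom features."""
    Phi = nystrom_features(X, landmarks, lengthscale)
    Phi = Phi - Phi.mean(axis=0)
    U, S, _ = np.linalg.svd(Phi, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Toy data: ten points on a line, every second point used as a landmark.
X_toy = np.arange(10.0)[:, None]
toy_scores = nystrom_kpca(X_toy, X_toy[::2], n_components=2)
```

With `m` landmarks the dominant cost is an eigendecomposition of an `m`-by-`m` matrix plus an SVD of an `n`-by-`m` matrix; when every data point is a landmark the features reproduce the full kernel matrix.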

Faster Randomized Interior Point Methods for Tall/Wide Linear Programs
縦長/横長の線形計画問題のためのより高速なランダム化内点法

Linear programming (LP) is an extremely useful tool which has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, or even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods to solve LPs both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. In this paper, we consider both feasible and infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with the iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver), converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data.

線形計画法(LP)は、オペレーションズリサーチ、エンジニアリング、経済学、さらには組み合わせ論などのより抽象的な数学の分野を含む、幅広い分野のさまざまな問題を解決するために適用されてきた非常に便利なツールです。また、$\ell_1$正則化SVM、基底追求、非負行列因数分解などの多くの機械学習アプリケーションでも使用されています。内点法(IPM)は、理論と実践の両方でLPを解決するための最も一般的な方法の1つです。その根本的な複雑さは、各反復で線形方程式のシステムを解くコストによって左右されます。この論文では、変数の数が制約の数よりもはるかに大きい特殊なケースについて、実行可能なIPMと実行不可能なIPMの両方を検討します。ランダム化線形代数のツールを使用することで、共役勾配法やチェビシェフ反復法などの反復ソルバーと組み合わせると、反復の複雑さを増大させることなく、IPMアルゴリズム(近似ソルバーによって発生するエラーを考慮して適切に修正)が実行可能な近似最適解に収束することが保証される前処理手法を提示します。私たちの経験的評価は、現実世界のデータと合成データの両方で理論的結果を検証します。

Causal Aggregation: Estimation and Inference of Causal Effects by Constraint-Based Data Fusion
因果集約:制約に基づくデータ融合による因果効果の推定と推論

In causal inference, it is common to estimate the causal effect of a single treatment variable on an outcome. However, practitioners may also be interested in the effect of simultaneous interventions on multiple covariates of a fixed target variable. We propose a novel method that makes it possible to estimate the effect of joint interventions using data from different experiments in which only very few variables are manipulated. If there is only little randomized data or no randomized data at all, one can use observational data sets if certain parental sets are known or instrumental variables are available. If the joint causal effect is linear, the proposed method can be used for estimation and inference of joint causal effects, and we characterize conditions for identifiability. In the overidentified case, we indicate how to leverage all the available causal information across multiple data sets to efficiently estimate the causal effects. If the dimension of the covariate vector is large, we may only have a few samples in each data set. Under a sparsity assumption, we derive an estimator of the causal effects in this high-dimensional scenario. In addition, we show how to deal with the case where a lack of experimental constraints prevents direct estimation of the causal effects. When the joint causal effects are non-linear, we characterize conditions under which identifiability holds, and propose a non-linear causal aggregation methodology for experimental data sets similar to the gradient boosting algorithm where in each iteration we combine weak learners trained on different datasets using only unconfounded samples. We demonstrate the effectiveness of the proposed method on simulated and semi-synthetic data.

因果推論では、単一の治療変数が結果に及ぼす因果効果を推定するのが一般的です。しかし、専門家は、固定されたターゲット変数の複数の共変量に対する同時介入の効果にも関心があるかもしれません。私たちは、非常に少数の変数のみが操作される異なる実験のデータを使用して共同介入の効果を推定できる新しい方法を提案します。ランダム化データがほとんどないかまったくない場合は、特定の親セットがわかっているか、操作変数が利用できる場合は、観察データセットを使用できます。共同因果効果が線形である場合、提案された方法は共同因果効果の推定と推論に使用でき、識別可能性の条件を特徴付けます。過剰識別の場合、複数のデータセットにわたる利用可能なすべての因果情報を活用して、因果効果を効率的に推定する方法を示します。共変量ベクトルの次元が大きい場合、各データセットのサンプル数はわずかになる可能性があります。スパース性の仮定の下で、この高次元シナリオにおける因果効果の推定量を導出します。さらに、実験的制約の欠如により因果効果を直接推定できない場合の対処方法を示します。結合因果効果が非線形である場合、識別可能性が保持される条件を特徴付け、各反復で交絡のないサンプルのみを使用して異なるデータセットでトレーニングされた弱学習者を組み合わせる勾配ブースティングアルゴリズムに似た実験データセットの非線形因果集約方法論を提案します。シミュレートされたデータと半合成データで提案された方法の有効性を実証します。

Fully General Online Imitation Learning
完全に一般的なオンライン模倣学習

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance in how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes, and we allow our imitator to learn online from the demonstrator. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event’s likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency. If any such event qualifies as “dangerous”, our imitator would have the notable distinction of being relatively “safe”.

模倣学習では、模倣者と実証者は、環境との過去の相互作用に基づいて行動を選択するためのポリシーです。模倣者を実行する場合、おそらく、実証者がずっと行動していた場合と同様にイベントが展開されることが望まれます。一般に、学習中に1つの間違いがあると、まったく異なるイベントが発生する可能性があります。環境が再起動する特殊な設定では、既存の研究により、イベントが同じように展開されるように模倣する方法に関する正式なガイダンスが提供されていますが、その設定以外では、正式なガイダンスは存在しません。私たちは、(確率的)環境と実証者がトレーニング目的であってもリセットされない、完全に一般的な設定を扱い、模倣者が実証者からオンラインで学習できるようにします。私たちの新しい保守的なベイジアン模倣学習者は、利用可能な各アクションの確率を過小評価し、残りの確率でより多くのデータを照会します。私たちの主な結果は、実証者がずっと行動していたとすれば起こりにくかったイベントについて、代わりに(最初は完全に無知な)模倣者を実行した場合のそのイベントの確率を上から抑えられるということです。一方、実証者への問い合わせの頻度は急速に減少します。このようなイベントが「危険」とみなされる場合、私たちの模倣者は比較的「安全」であるという注目すべき特徴を持つことになります。

Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression
分布ランダムフォレスト:不均一性調整と多変量分布回帰

Random Forest is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf.

ランダムフォレストは、成功し、広く使用されている回帰および分類アルゴリズムです。その魅力と汎用性の理由の一部は、学習データに対するカーネル型の重み付け関数の(暗黙的な)構築であり、これは元の平均推定以外のターゲットにも使用できます。私たちは、推定対象やデータモデルに依存しない、同時条件付き分布に基づく多変量応答のための新しいフォレスト構築法を提案します。MMD分布メトリクスに基づく新しい分割基準を使用しており、多変量分布の不均一性の検出に適しています。誘導された重みは、完全な条件付き分布の推定値を定義し、これは任意の、場合によっては複雑な関心対象の推定に使用できます。この方法は非常に用途が広く、さまざまな例で説明しているように、使用するのに便利です。このコードは、PythonおよびRパッケージdrfとして利用できます。
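The abstract above names the MMD (maximum mean discrepancy) as the quantity behind the splitting criterion. As a rough illustration only, the following is a minimal sketch of the biased squared-MMD estimate between two samples with a Gaussian kernel. The function names `gaussian_kernel` and `mmd2_biased` and the fixed `bandwidth` are assumptions made for this example; this is not the implementation used in the drf packages.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy


# Hypothetical data: two samples from the same distribution and one shifted sample.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_same = rng.normal(size=(200, 2))
y_shifted = rng.normal(size=(200, 2)) + 3.0

mmd_same = mmd2_biased(x, y_same)
mmd_shifted = mmd2_biased(x, y_shifted)
```

A candidate split that separates responses with different distributions would tend to give a larger value, as `mmd_shifted` does relative to `mmd_same` here.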

Maximum sampled conditional likelihood for informative subsampling
情報的サブサンプリングのための最大サンプリング条件付き尤度

Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited. After a subsample is taken from the full data, most available methods use an inverse probability weighted (IPW) objective function to estimate the model parameters. The IPW estimator does not fully utilize the information in the selected subsample. In this paper, we propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data. We establish the asymptotic normality of the MSCLE and prove that its asymptotic variance covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the IPW estimator. We further discuss the asymptotic results with the L-optimal subsampling probabilities and illustrate the estimation procedure with generalized linear models. Numerical experiments are provided to evaluate the practical performance of the proposed method.

サブサンプリングは、コンピューティングリソースが限られている場合に大量のデータセットから情報を抽出するための計算上効果的なアプローチです。フルデータからサブサンプルを取得した後、利用可能なほとんどの方法では、逆確率加重(IPW)目的関数を使用してモデルパラメータを推定します。IPW推定器は、選択したサブサンプルの情報を十分に活用していません。この論文では、サンプリングされたデータに基づいて、最大サンプリング条件付き尤度推定量(MSCLE)を使用することを提案します。MSCLEの漸近正規性を確立し、その漸近分散共分散行列が、IPW推定量を含む漸近的に不偏な推定量のクラスの中で最小であることを証明します。さらに、L-最適サブサンプリング確率を使用した漸近結果について説明し、一般化線形モデルを使用した推定手順を示します。提案手法の実用性能を評価するために、数値実験を行います。
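To make the IPW baseline mentioned above concrete, here is a minimal sketch of inverse probability weighted least squares on a Poisson subsample. It illustrates only the weighting idea and is not the MSCLE proposed in the paper. The sampling probabilities (proportional to row norms), the target subsample size of 2000, and the function name `ipw_least_squares` are all assumptions chosen for this example, and they are not the L-optimal probabilities discussed in the abstract.

```python
import numpy as np


def ipw_least_squares(x, y, probs, rng):
    """Poisson-subsample the data, then solve least squares weighted by 1 / probs."""
    keep = rng.random(len(y)) < probs       # each row is kept independently
    weights = 1.0 / probs[keep]             # inverse probability weights
    xs, ys = x[keep], y[keep]
    xtwx = xs.T @ (weights[:, None] * xs)   # X^T W X
    xtwy = xs.T @ (weights * ys)            # X^T W y
    return np.linalg.solve(xtwx, xtwy), int(keep.sum())


# Hypothetical full data set from a linear model with intercept 1 and slope -2.
rng = np.random.default_rng(0)
n = 20000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
y = x @ beta_true + rng.normal(size=n)

# Assumed sampling probabilities: proportional to row norms, about 2000 rows expected.
scores = np.linalg.norm(x, axis=1)
probs = np.minimum(1.0, 2000 * scores / scores.sum())

beta_hat, subsample_size = ipw_least_squares(x, y, probs, rng)
```

The weights undo the bias introduced by non-uniform sampling, so `beta_hat` should land close to `beta_true` while using roughly a tenth of the rows.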

The Geometry of Uniqueness, Sparsity and Clustering in Penalized Estimation
ペナルティ付き推定における一意性,スパース性,クラスタリングの幾何学

We provide a necessary and sufficient condition for the uniqueness of penalized least-squares estimators whose penalty term is given by a norm with a polytope unit ball, covering a wide range of methods including SLOPE, PACS, fused, clustered and classical LASSO as well as the related method of basis pursuit. We consider a strong type of uniqueness that is relevant for statistical problems. The uniqueness condition is geometric and involves how the row span of the design matrix intersects the faces of the dual norm unit ball, which for SLOPE is given by the signed permutahedron. Further considerations based on this condition also allow us to derive results on sparsity and clustering features. In particular, we define the notion of a SLOPE pattern to describe both sparsity and clustering properties of this method and also provide a geometric characterization of accessible SLOPE patterns.

私たちは、ペナルティ項がポリトープユニットボールのノルムによって与えられるペナルティ付き最小二乗推定量の一意性に必要十分な条件を提供し、SLOPE、PACS、融合、クラスター化、古典的LASSOなどの幅広い方法、および関連する基底追求の方法をカバーしています。統計問題に関連する強いタイプの一意性を考慮します。一意性条件は幾何学的であり、計画行列の行スパンが双対ノルム単位ボールの面とどのように交差するかを含み、SLOPEの場合、これは符号付き順列面体によって与えられます。この条件に基づくさらなる考慮事項により、スパース性とクラスタリングの特徴に関する結果を導き出すこともできます。特に、この方法のスパース性とクラスタリングの両方のプロパティを記述するために、SLOPEパターンの概念を定義し、アクセス可能なSLOPEパターンの幾何学的特性も提供します。

ALMA: Alternating Minimization Algorithm for Clustering Mixture Multilayer Network
ALMA:混合多層ネットワークのクラスタリングのための交互最小化アルゴリズム

The paper considers a Mixture Multilayer Stochastic Block Model (MMLSBM), where layers can be partitioned into groups of similar networks, and networks in each group are equipped with a distinct Stochastic Block Model. The goal is to partition the multilayer network into clusters of similar layers, and to identify communities in those layers. Jing et al. (2020) introduced the MMLSBM and developed a clustering methodology, TWIST, based on regularized tensor decomposition. The present paper proposes a different technique, an alternating minimization algorithm (ALMA), that aims at simultaneous recovery of the layer partition, together with estimation of the matrices of connection probabilities of the distinct layers. Compared to TWIST, ALMA achieves higher accuracy, both theoretically and numerically.

この論文では、層を類似したネットワークのグループに分割でき、各グループのネットワークに個別の確率的ブロックモデルを装備するMixture Multilayer Stochastic Block Model(MMLSBM)について考察します。目標は、マルチレイヤーネットワークを類似したレイヤーのクラスターに分割し、それらのレイヤー内のコミュニティを特定することです。Jingら(2020)は、MMLSBMを導入し、正則化されたテンソル分解に基づくクラスタリング手法TWISTを開発しました。この論文では、異なる層の接続確率の行列の推定とともに、層分割の同時回復を目指す交互最小化アルゴリズム(ALMA)という別の手法を提案しています。TWISTと比較して、ALMAは理論的にも数値的にも高い精度を達成しています。

Joint Continuous and Discrete Model Selection via Submodularity
サブモジュラ性による連続・離散の同時モデル選択

In model selection problems for machine learning, the desire for a well-performing model with meaningful structure is typically expressed through a regularized optimization problem. In many scenarios, however, the meaningful structure is specified in some discrete space, leading to difficult nonconvex optimization problems. In this paper, we connect the model selection problem with structure-promoting regularizers to submodular function minimization with continuous and discrete arguments. In particular, we leverage the theory of submodular functions to identify a class of these problems that can be solved exactly and efficiently with an agnostic combination of discrete and continuous optimization routines. We show how simple continuous or discrete constraints can also be handled for certain problem classes and extend these ideas to a robust optimization framework. We also show how some problems outside of this class can be embedded into the class, further extending the class of problems our framework can accommodate. Finally, we numerically validate our theoretical results with several proof-of-concept examples with synthetic and real-world data, comparing against state-of-the-art algorithms.

機械学習のモデル選択問題では、意味のある構造を持つ高性能モデルを求める要望は、通常、正則化された最適化問題を通じて表現されます。ただし、多くのシナリオでは、意味のある構造は離散空間で指定されるため、困難な非凸最適化問題につながります。この論文では、構造を促進する正則化項を持つモデル選択問題を、連続引数と離散引数を持つサブモジュラ関数の最小化に結び付けます。特に、サブモジュラ関数の理論を活用して、離散および連続最適化ルーチンの非依存的な組み合わせで正確かつ効率的に解決できるこれらの問題のクラスを特定します。特定の問題クラスに対して単純な連続または離散制約も処理できることを示し、これらのアイデアをロバスト最適化フレームワークに拡張します。また、このクラス以外のいくつかの問題をクラスに埋め込む方法を示し、フレームワークが対応できる問題のクラスをさらに拡張します。最後に、合成データと実際のデータを使用したいくつかの概念実証例を使用して、最先端のアルゴリズムと比較し、理論上の結果を数値的に検証します。

Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima
分散確率的勾配降下法: 非凸性、非平滑性、および局所極小値への収束

Gradient-descent (GD) based algorithms are an indispensable tool for optimizing modern machine learning models. The paper considers distributed stochastic GD (D-SGD)–a network-based variant of GD. Distributed algorithms play an important role in large-scale machine learning problems as well as the Internet of Things (IoT) and related applications. The paper considers two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad range of nonsmooth loss functions including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again assume that loss functions may be nonconvex and nonsmooth, but are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation.

勾配降下法(GD)に基づくアルゴリズムは、現代の機械学習モデルを最適化するために不可欠なツールです。この論文では、GDのネットワークベースのバリエーションである分散確率的GD (D-SGD)について検討します。分散アルゴリズムは、大規模な機械学習の問題だけでなく、モノのインターネット(IoT)や関連アプリケーションでも重要な役割を果たします。この論文では、2つの主な問題について検討します。まず、損失関数が非凸で非滑らかな場合のD-SGDの臨界点への収束を調べます。現代のディープラーニングで実用的な関心のあるものも含め、広範囲の非滑らかな損失関数を検討します。固定された各初期化に対して、D-SGDは確率1で損失の臨界点に収束することが示されています。次に、鞍点を回避する問題について検討します。古典的なGDが鞍点を回避することはよく知られていますが、GDの分散バリアントでは類似の結果がありませんでした。この問題では、損失関数は非凸で非平滑である可能性があるが、鞍点の近傍では平滑であると仮定します。任意の固定初期化に対して、D-SGDは確率1でこのような鞍点を回避することが示されています。結果は、確率近似の常微分方程式(ODE)法を使用して、基礎となる(分散)勾配フローを調べることによって証明されます。
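The abstract does not spell out the D-SGD update, so the following is only a sketch under stated assumptions: each agent averages its neighbors' parameters through a doubly stochastic mixing matrix and then takes a noisy gradient step on its own local loss with a diminishing step size. The quadratic local losses, the ring network, the step-size schedule, and the function name `dsgd` are hypothetical choices made so the result is easy to check. The example is deliberately convex and smooth, so it does not exercise the nonconvex, nonsmooth setting the paper analyzes.

```python
import numpy as np


def dsgd(targets, mixing, steps=2000, noise=0.1, seed=0):
    """Run a simple distributed SGD with local losses 0.5 * ||x_i - c_i||^2."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(targets)                  # one parameter row per agent
    for t in range(1, steps + 1):
        step = 1.0 / (t + 10)                   # diminishing step size
        grads = x - targets                     # exact local gradients
        grads = grads + noise * rng.normal(size=x.shape)  # stochastic perturbation
        x = mixing @ x - step * grads           # consensus averaging + local step
    return x


# Four agents on a ring, each holding a different local target c_i.
targets = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mixing = np.array(
    [
        [0.50, 0.25, 0.00, 0.25],
        [0.25, 0.50, 0.25, 0.00],
        [0.00, 0.25, 0.50, 0.25],
        [0.25, 0.00, 0.25, 0.50],
    ]
)

x_final = dsgd(targets, mixing)
```

In this toy problem the sum of the local losses is minimized at the mean of the targets, so every row of `x_final` should end up near `[1.0, 1.0]` even though no agent ever sees the other agents' targets directly.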

Kernel Autocovariance Operators of Stationary Processes: Estimation and Convergence
定常過程のカーネル自己共分散演算子: 推定と収束

We consider autocovariance operators of a stationary stochastic process on a Polish space that is embedded into a reproducing kernel Hilbert space. We investigate how empirical estimates of these operators converge along realizations of the process under various conditions. In particular, we examine ergodic and strongly mixing processes and obtain several asymptotic results as well as finite sample error bounds. We provide applications of our theory in terms of consistency results for kernel PCA with dependent data and the conditional mean embedding of transition probabilities. Finally, we use our approach to examine the nonparametric estimation of Markov transition operators and highlight how our theory can give a consistency analysis for a large family of spectral analysis methods including kernel-based dynamic mode decomposition.

私たちは、再現カーネルヒルベルト空間に埋め込まれたポーランド空間上の定常確率過程の自己共分散演算子を考えます。これらの演算子の経験的推定値が、さまざまな条件下でのプロセスの実現にどのように収束するかを調査します。特に、エルゴーディックおよび強混合プロセスを調べ、いくつかの漸近結果と有限のサンプル誤差範囲を取得します。私たちは、従属データを持つカーネルPCAの一貫性結果と、遷移確率の条件付き平均埋め込みの観点から、私たちの理論の応用を提供します。最後に、このアプローチを使用して、マルコフ遷移演算子のノンパラメトリック推定を検討し、カーネルベースの動的モード分解を含むスペクトル解析手法の大規模なファミリーに対して、この理論が一貫性解析をどのように提供できるかを強調します。

Project and Forget: Solving Large-Scale Metric Constrained Problems
Project and Forget: メトリクスに制約のある大規模な問題を解決する

Many important machine learning problems can be formulated as highly constrained convex optimization problems. One important example is metric constrained problems. In this paper, we show that standard optimization techniques cannot be used to solve metric constrained problems. To solve such problems, we provide a general active set framework, called Project and Forget, and several variants thereof that use Bregman projections. Project and Forget is a general purpose method that can be used to solve highly constrained convex problems with many (possibly exponentially many) constraints. We provide a theoretical analysis of Project and Forget and prove that our algorithms converge to the global optimal solution and have a linear rate of convergence. We demonstrate that using our method, we can solve large problem instances of general weighted correlation clustering, metric nearness, information theoretic metric learning and quadratically regularized optimal transport; in each case, out-performing the state of the art methods with respect to CPU times and problem sizes.

多くの重要な機械学習の問題は、高度に制約された凸最適化問題として定式化できます。重要な例の1つは、メトリック制約問題です。この論文では、標準的な最適化手法ではメトリック制約問題を解決できないことを示します。このような問題を解決するために、Project and Forgetと呼ばれる一般的なアクティブセットフレームワークと、Bregman射影を使用するそのいくつかのバリエーションを提供します。Project and Forgetは、多数の(場合によっては指数関数的に多くの)制約がある高度に制約された凸問題を解決するために使用できる汎用メソッドです。Project and Forgetの理論的分析を提供し、アルゴリズムがグローバル最適解に収束し、収束率が線形であることを証明します。この方法を使用すると、一般的な加重相関クラスタリング、メトリック近接性、情報理論的メトリック学習、および二次正則化最適輸送の大規模な問題インスタンスを解決できることを実証します。いずれの場合も、CPU時間と問題のサイズに関して最先端の方法よりも優れています。

On Mixup Regularization
Mixup正則化について

Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has been empirically shown to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both accuracy and calibration of the prediction. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions.

Mixupは、トレーニングポイントとラベルの凸結合として新しい例を作成するデータ拡張手法です。この単純な手法は、さまざまな設定やアプリケーションで多くの最先端モデルの精度を向上させることが実証されていますが、この実証的な成功の理由は十分に理解されていません。この論文では、正則化効果を明らかにすることで、Mixupの理論的基礎を説明する上で大きな一歩を踏み出します。Mixupは、データ変換と変換されたデータのランダムな摂動の組み合わせの影響を受ける標準的な経験的リスク最小化推定量として解釈できることを示します。この新しい解釈から、2つの重要な洞察が得られます。まず、データ変換は、テスト時に、Mixupでトレーニングされたモデルを変換されたデータにも適用する必要があることを示唆しています。これは1行のコード変更であり、予測の精度とキャリブレーションの両方を向上させることを実験的に示します。次に、Mixupの新しい解釈のランダムな摂動が、ラベルのスムージングや推定量のLipschitz定数の削減など、複数の既知の正則化スキームを誘導する方法を示します。これらのスキームは互いに相乗的に作用し、過剰適合や過信した予測を防ぐ、自己較正された効果的な正則化効果をもたらします。私たちは、結論を裏付ける実験によって理論分析を裏付けています。
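The first sentence of the abstract describes Mixup as forming convex combinations of training points and labels. A minimal sketch of that step on one batch might look as follows. The function name `mixup_batch`, the `alpha` value, and the choice of a single Beta-distributed mixing weight per batch are assumptions for illustration; the test-time data transformation studied in the paper is not shown here.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing weight in [0, 1]
    perm = rng.permutation(len(x))          # random partner for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam


# Hypothetical batch: 8 examples with 4 features and 3 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]   # one-hot labels

x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
```

Because the same weight is applied to inputs and labels, each mixed label row remains a valid probability vector, which is what ties Mixup to soft-label schemes such as label smoothing.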

Improving Bayesian Network Structure Learning in the Presence of Measurement Error
測定誤差の存在下でのベイジアンネットワーク構造学習の改善

Structure learning algorithms that learn the graph of a Bayesian network from observational data often do so by assuming the data correctly reflect the true distribution of the variables. However, this assumption does not hold in the presence of measurement error, which can lead to spurious edges. This is one of the reasons why the synthetic performance of these algorithms often overestimates real-world performance. This paper describes a heuristic algorithm that can be added as an additional learning phase at the end of any structure learning algorithm, and serves as a correction learning phase that removes potential false positive edges. The results show that the proposed correction algorithm successfully improves the graphical score of five well-established structure learning algorithms spanning different classes of learning in the presence of measurement error.

観測データからベイジアンネットワークのグラフを学習する構造学習アルゴリズムは、多くの場合、データが変数の真の分布を正しく反映していると仮定して学習します。ただし、この仮定は、スプリアスエッジにつながる可能性のある測定誤差が存在する場合には成り立ちません。これが、これらのアルゴリズムの合成性能が実世界の性能を過大評価することが多い理由の1つです。この論文では、任意の構造学習アルゴリズムの最後に追加の学習フェーズとして追加でき、潜在的な誤検出エッジを排除する修正学習フェーズとして機能するヒューリスティックアルゴリズムについて説明します。結果は、提案された修正アルゴリズムが、測定誤差が存在する場合に、異なる学習クラスにまたがる5つの確立された構造学習アルゴリズムのグラフィカルスコアを成功裏に改善することを示しています。

Convergence Rates for Gaussian Mixtures of Experts
ガウス混合エキスパートモデルの収束率

We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of algebraic independence of the expert functions. Drawing on optimal transport, we establish a connection between the algebraic independence of the expert functions and a certain class of partial differential equations (PDEs) with respect to the parameters. Exploiting this connection allows us to derive convergence rates for parameter estimation.

私たちは、共変量に依存しないゲーティングネットワークを持つ、過剰指定されたガウス混合エキスパートモデルの理論的処理を提供します。これらのモデルの最尤推定(MLE)の収束率を確立します。私たちの証明技術は、エキスパート関数の代数的独立性という新しい概念に基づいています。最適輸送を利用して、エキスパート関数の代数的独立性と、パラメータに関する特定のクラスの偏微分方程式(PDE)との間に接続を確立します。この接続を利用することで、パラメータ推定の収束率を導き出すことができます。

Community detection in sparse latent space models
スパース潜在空間モデルにおけるコミュニティ検出

We show that a simple community detection algorithm originating from the stochastic blockmodel literature achieves consistency, and even optimality, for a broad and flexible class of sparse latent space models. The class of models includes latent eigenmodels (Hoff, 2008). The community detection algorithm is based on spectral clustering followed by local refinement via normalized edge counting. It is easy to implement and attains high accuracy with a low computational budget. The proof of its optimality depends on a neat equivalence between the likelihood ratio test and edge counting in a simple vs. simple hypothesis testing problem that underpins the refinement step, which could be of independent interest.

私たちは、確率的ブロックモデルの文献に由来する単純なコミュニティ検出アルゴリズムが、スパース潜在空間モデルの広範で柔軟なクラスに対して一貫性、さらには最適性を達成することを示します。モデルのクラスには、潜在固有モデルが含まれます(Hoff、2008)。コミュニティ検出アルゴリズムは、スペクトルクラスタリングと、それに続く正規化エッジカウントによる局所的な絞り込みに基づいています。実装が簡単で、低い計算バジェットで高精度を実現します。その最適性の証明は、リファインメントステップを支える単純仮説対単純仮説の検定問題における、尤度比検定とエッジカウントの間のきれいな等価性に依存しており、これは独立した関心事になる可能性があります。

On Low-rank Trace Regression under General Sampling Distribution
一般的なサンプリング分布の下での低ランクトレース回帰について

In this paper, we study the trace regression when a matrix of parameters $\mathbf{B}^\star$ is estimated via the convex relaxation of a rank-regularized regression or via regularized non-convex optimization. It is known that these estimators satisfy near-optimal error bounds under assumptions on the rank, coherence, and spikiness of $\mathbf{B}^\star$. We start by introducing a general notion of spikiness for $\mathbf{B}^\star$ that provides a generic recipe to prove the restricted strong convexity of the sampling operator of the trace regression and obtain near-optimal and non-asymptotic error bounds for the estimation error. Similar to the existing literature, these results require the regularization parameter to be above a certain theory-inspired threshold that depends on observation noise that may be unknown in practice. Next, we extend the error bounds to cases where the regularization parameter is chosen via cross-validation. This result is significant in that existing theoretical results on cross-validated estimators (Kale et al., 2011; Kumar et al., 2013; Abou-Moustafa and Szepesvari, 2017) do not apply to our setting since the estimators we study are not known to satisfy their required notion of stability. Finally, using simulations on synthetic and real data, we show that the cross-validated estimator selects a near-optimal penalty parameter and outperforms the theory-inspired approach of selecting the parameter.

この論文では、ランク正則化回帰の凸緩和または正則化非凸最適化によってパラメーターの行列$\mathbf{B}^\star$が推定される場合のトレース回帰について検討します。これらの推定量は、$\mathbf{B}^\star$のランク、コヒーレンス、スパイクネスに関する仮定の下で、ほぼ最適な誤差範囲を満たすことが知られています。まず、トレース回帰のサンプリング演算子の制限された強い凸性を証明し、推定誤差のほぼ最適で非漸近的な誤差範囲を得るための一般的なレシピを提供する、$\mathbf{B}^\star$のスパイクネスの一般的な概念を導入します。既存の文献と同様に、これらの結果では、正則化パラメーターが、実際には未知である可能性のある観測ノイズに依存する特定の理論に触発されたしきい値を超える必要があります。次に、交差検証によって正則化パラメーターが選択される場合に誤差範囲を拡張します。この結果は、交差検証推定量に関する既存の理論的結果(Kaleら, 2011; Kumarら, 2013; Abou-Moustafa and Szepesvari, 2017)が、私たちが研究する推定量がそれらの結果の要求する安定性の概念を満たすことが知られていないため、私たちの設定には当てはまらないという点で重要です。最後に、合成データと実データでのシミュレーションを使用して、交差検証推定量がほぼ最適なペナルティパラメータを選択し、パラメータを選択する理論にヒントを得たアプローチよりも優れていることを示します。

Network Regression with Graph Laplacians
グラフラプラシアンによるネットワーク回帰

Network data are increasingly available in various research fields, motivating statistical analysis for populations of networks, where a network as a whole is viewed as a data point. The study of how a network changes as a function of covariates is often of paramount interest. However, due to the non-Euclidean nature of networks, basic statistical tools available for scalar and vector data are no longer applicable. This motivates an extension of the notion of regression to the case where responses are network data. Here we propose to adopt conditional Fréchet means implemented as M-estimators that depend on weights derived from both global and local least squares regression, extending the Fréchet regression framework to networks that are quantified by their graph Laplacians. The challenge is to characterize the space of graph Laplacians to justify the application of Fréchet regression. This characterization then leads to asymptotic rates of convergence for the corresponding M-estimators by applying empirical process methods. We demonstrate the usefulness and good practical performance of the proposed framework with simulations and with network data arising from resting-state fMRI in neuroimaging, as well as New York taxi records.

ネットワークデータはさまざまな研究分野でますます利用できるようになり、ネットワーク全体をデータポイントと見なして、ネットワークの母集団の統計分析を行うようになりました。共変量の関数としてネットワークがどのように変化するかの研究は、多くの場合、最も重要な関心事です。ただし、ネットワークの非ユークリッド特性のため、スカラーデータとベクトルデータに使用できる基本的な統計ツールは適用できなくなりました。このため、応答がネットワークデータである場合に回帰の概念を拡張することになります。ここでは、グローバル最小二乗回帰とローカル最小二乗回帰の両方から得られる重みに依存するM推定量として実装された条件付きフレシェ平均を採用し、フレシェ回帰フレームワークをグラフラプラシアンによって定量化されるネットワークに拡張することを提案します。課題は、グラフラプラシアンの空間を特徴付けて、フレシェ回帰の適用を正当化することです。この特徴付けにより、経験的プロセスメソッドを適用することで、対応するM推定量の漸近収束率が得られます。私たちは、シミュレーションと、神経画像診断における安静時fMRIから得られるネットワークデータ、およびニューヨークのタクシー記録を使用して、提案されたフレームワークの有用性と優れた実用的パフォーマンスを実証します。

Self-Healing Robust Neural Networks via Closed-Loop Control
閉ループ制御による自己修復ロバストニューラルネットワーク

Despite the wide applications of neural networks, there have been increasing concerns about their vulnerability issue. While numerous attack and defense techniques have been developed, this work investigates the robustness issue from a new angle: can we design a self-healing neural network that can automatically detect and fix the vulnerability issue by itself? A typical self-healing mechanism is the immune system of a human body. This biology-inspired idea has been used in many engineering designs but has rarely been investigated in deep learning. This paper considers the post-training self-healing of a neural network, and proposes a closed-loop control formulation to automatically detect and fix the errors caused by various attacks or perturbations. We provide a margin-based analysis to explain how this formulation can improve the robustness of a classifier. To speed up the inference, we convert the optimal control problem to Pontryagin’s Maximum Principle and solve it via the method of successive approximation. Lastly, we present an error estimation of the proposed framework for neural networks with nonlinear activation functions. We validate the performance of several network architectures against various perturbations. Since the self-healing method does not need a-priori information about data perturbations or attacks, it can handle a broad class of unforeseen perturbations.

ニューラルネットワークは幅広く応用されていますが、その脆弱性の問題に対する懸念が高まっています。数多くの攻撃および防御手法が開発されている一方で、この研究では、脆弱性の問題を自動的に検出して修正できる自己修復ニューラルネットワークを設計できるかどうかという新しい角度から堅牢性の問題を調査しています。典型的な自己修復メカニズムは、人体の免疫システムです。この生物学にヒントを得たアイデアは、多くのエンジニアリング設計で使用されていますが、ディープラーニングではほとんど調査されていません。この論文では、ニューラルネットワークのトレーニング後の自己修復について検討し、さまざまな攻撃や摂動によって引き起こされるエラーを自動的に検出して修正するための閉ループ制御定式化を提案します。この定式化によって分類器の堅牢性がどのように向上するかを説明するために、マージンベースの分析を提供します。推論を高速化するために、最適制御問題をポントリャーギンの最大原理に変換し、逐次近似法で解決します。最後に、非線形活性化関数を持つニューラルネットワークの提案フレームワークの誤差推定を示します。さまざまな摂動に対する複数のネットワークアーキテクチャのパフォーマンスを検証します。自己修復方式では、データの摂動や攻撃に関する事前情報が必要ないため、予期しない摂動の幅広いクラスに対処できます。

Hamilton-Jacobi equations on graphs with applications to semi-supervised learning and data depth
グラフ上のハミルトン・ヤコビ方程式と半教師あり学習およびデータ深度への応用

Shortest path graph distances are widely used in data science and machine learning, since they can approximate the underlying geodesic distance on the data manifold. However, the shortest path distance is highly sensitive to the addition of corrupted edges in the graph, either through noise or an adversarial perturbation. In this paper we study a family of Hamilton-Jacobi equations on graphs that we call the $p$-eikonal equation. We show that the $p$-eikonal equation with $p=1$ is a provably robust distance-type function on a graph, and the $p\to \infty$ limit recovers shortest path distances. While the $p$-eikonal equation does not correspond to a shortest-path graph distance, we nonetheless show that the continuum limit of the $p$-eikonal equation on a random geometric graph recovers a geodesic density weighted distance in the continuum. We consider applications of the $p$-eikonal equation to data depth and semi-supervised learning, and use the continuum limit to prove asymptotic consistency results for both applications. Finally, we show the results of experiments with data depth and semi-supervised learning on real image datasets, including MNIST, FashionMNIST and CIFAR-10, which show that the $p$-eikonal equation offers significantly better results compared to shortest path distances.

最短経路グラフ距離は、データ多様体上の基礎となる測地線距離を近似できるため、データサイエンスや機械学習で広く使用されています。ただし、最短経路距離は、ノイズまたは敵対的摂動によるグラフ内の破損したエッジの追加に非常に敏感です。この論文では、グラフ上のハミルトン-ヤコビ方程式のファミリーを研究し、これを$p$-アイコナール方程式と呼びます。$p=1$の$p$-アイコナール方程式はグラフ上の証明可能な堅牢な距離型関数であり、$p\to \infty$極限で最短経路距離が回復されることを示します。$p$-アイコナール方程式は最短経路グラフ距離に対応していませんが、ランダムな幾何学的グラフ上の$p$-アイコナール方程式の連続体極限で、連続体で測地線密度加重距離が回復されることを示します。$p$-アイコナール方程式をデータ深度と半教師あり学習に適用することを検討し、連続体極限を使用して両方の適用における漸近的一貫性の結果を証明します。最後に、MNIST、FashionMNIST、CIFAR-10などの実際の画像データセットでデータ深度と半教師あり学習を使用した実験の結果を示します。この実験では、$p$-アイコナール方程式が最短経路距離と比較して大幅に優れた結果をもたらすことが示されています。

Nonparametric Neighborhood Selection in Graphical Models
グラフィカルモデルでのノンパラメトリック近傍選択

The neighborhood selection method directly explores the conditional dependence structure and has been widely used to construct undirected graphical models. However, except for some special cases with discrete data, there is little research on nonparametric methods for neighborhood selection with mixed data. This paper develops a fully nonparametric neighborhood selection method under a consolidated smoothing spline ANOVA (SS ANOVA) decomposition framework. The proposed model is flexible and contains many existing models as special cases. The proposed method provides a unified framework for mixed data without any restrictions on the type of each random variable. We detect edges by applying an L1 regularization to interactions in the SS ANOVA decomposition. We propose an iterative procedure to compute the estimates and establish the convergence rates for conditional density and interactions. Simulations indicate that the proposed methods perform well under Gaussian and non-Gaussian settings. We illustrate the proposed methods using two real data examples.

近傍選択法は、条件付き従属構造を直接調査し、無向グラフィカルモデルの構築に広く使用されています。ただし、離散データを使用した一部の特殊なケースを除いて、混合データを使用した近傍選択のノンパラメトリック手法に関する研究はほとんどありません。この論文では、統合された平滑化スプラインANOVA (SS ANOVA)分解フレームワークの下で、完全にノンパラメトリックな近傍選択法を開発します。提案されたモデルは柔軟性があり、多くの既存のモデルを特殊なケースとして含んでいます。提案された方法は、各ランダム変数のタイプに制限のない、混合データ用の統一されたフレームワークを提供します。SS ANOVA分解の相互作用にL1正則化を適用することでエッジを検出します。推定値を計算し、条件付き密度と相互作用の収束率を確立するための反復手順を提案します。シミュレーションでは、提案された方法がガウス設定と非ガウス設定の両方で良好に機能することが示されています。2つの実際のデータ例を使用して、提案された方法を説明します。

WarpDrive: Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU
WarpDrive:GPU上での高速エンドツーエンドの深層マルチエージェント強化学習

WarpDrive is a flexible, lightweight, and easy-to-use open-source framework for end-to-end deep multi-agent reinforcement learning (MARL) on a Graphics Processing Unit (GPU), available at https://github.com/salesforce/warp-drive. It addresses key system bottlenecks when applying MARL to complex environments with high-dimensional state, observation, or action spaces. For example, WarpDrive eliminates data copying between the CPU and GPU and runs thousands of simulations and agents in parallel. It also enables distributed training on multiple GPUs and scales to millions of agents. In all, WarpDrive enables orders-of-magnitude faster MARL compared to common CPU-GPU implementations. For example, WarpDrive yields 2.9 million environment steps/second with 2000 environments and 1000 agents (at least 100× faster than a CPU version) in a 2d-Tag simulation. It is user-friendly: e.g., it provides a lightweight, extendable Python interface and flexible environment wrappers. It is also compatible with PyTorch. In all, WarpDrive offers a platform to significantly accelerate reinforcement learning research and development.

WarpDriveは、グラフィックスプロセッシングユニット(GPU)上のエンドツーエンドのディープマルチエージェント強化学習(MARL)用の柔軟で軽量、使いやすいオープンソースフレームワークで、https://github.com/salesforce/warp-driveから入手できます。高次元の状態、観察、またはアクション空間を持つ複雑な環境にMARLを適用する場合の主要なシステムボトルネックを解決します。たとえば、WarpDriveはCPUとGPU間のデータコピーを排除し、数千のシミュレーションとエージェントを並行して実行します。また、複数のGPUでの分散トレーニングも可能で、数百万のエージェントに拡張できます。全体として、WarpDriveは一般的なCPU-GPU実装と比較して桁違いに高速なMARLを実現します。たとえば、WarpDriveは2D-Tagシミュレーションで2,000の環境と1,000のエージェントで290万の環境ステップ/秒(CPUバージョンよりも少なくとも100倍高速)を実現します。WarpDriveはユーザーフレンドリーです。たとえば、軽量で拡張可能なPythonインターフェースと柔軟な環境ラッパーを提供します。また、PyTorchとも互換性があります。全体として、WarpDriveは強化学習の研究開発を大幅に加速するプラットフォームを提供します。

d3rlpy: An Offline Deep Reinforcement Learning Library
d3rlpy: オフラインの深層強化学習ライブラリ

In this paper, we introduce d3rlpy, an open-sourced offline deep reinforcement learning (RL) library for Python. d3rlpy supports a set of offline deep RL algorithms as well as off-policy online algorithms via a fully documented plug-and-play API. To address a reproducibility issue, we conduct a large-scale benchmark with D4RL and Atari 2600 dataset to ensure implementation quality and provide experimental scripts and full tables of results. The d3rlpy source code can be found on GitHub: https://github.com/takuseno/d3rlpy.

この論文では、Python用のオープンソースのオフライン深層強化学習(RL)ライブラリであるd3rlpyについて紹介します。d3rlpyは、完全に文書化されたプラグアンドプレイAPIを介して、一連のオフラインディープRLアルゴリズムとオフポリシーオンラインアルゴリズムをサポートします。再現性の問題に対処するために、D4RLとAtari 2600データセットを使用して大規模なベンチマークを実施し、実装の品質を確保し、実験スクリプトと結果の完全なテーブルを提供します。d3rlpyのソースコードはGitHub: https://github.com/takuseno/d3rlpyにあります。

Oracle Complexity in Nonsmooth Nonconvex Optimization
非平滑非凸最適化における Oracle の複雑性

It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (with gradient norm less than $\epsilon$) in $\mathcal{O}(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near $\epsilon$-stationary points. This is perhaps the most natural relaxation of finding $\epsilon$-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and $\epsilon$ smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large.

滑らかで下界があり、おそらく非凸な関数が与えられた場合、標準的な勾配ベースの方法は、$\epsilon$-定常点（勾配ノルムが$\epsilon$未満）を$\mathcal{O}(1/\epsilon^2)$回の反復で見つけることができることはよく知られています。しかし、現代のニューラルネットワークのトレーニングに関連する問題など、多くの重要な非凸最適化問題は本質的に滑らかではないため、これらの結果は適用できません。この論文では、アルゴリズムがさまざまなポイントでの関数のローカル情報にのみアクセスできると仮定したオラクル複雑性の観点から、非滑らかな非凸最適化を検討します。主な結果が2つあります。まず、$\epsilon$-定常点に近い点を取得する問題について検討します。これは、滑らかでない非凸の場合は不可能である、$\epsilon$-定常点を見つける最も自然な緩和策である可能性があります。この緩和された目標は、任意の距離と、ある定数より小さい$\epsilon$に対しては、効率的に達成できないことを証明します。2番目の結果は、滑らかな最適化への還元によって、非滑らかな非凸最適化に取り組む可能性を扱っています。つまり、滑らかな最適化手法を目的関数の滑らかな近似に適用します。このアプローチでは、緩やかな仮定の下で、オラクル複雑度と滑らかさの間に固有のトレードオフがあることを証明します。一方では、非滑らかな非凸関数の平滑化は非常に効率的に実行できます(たとえば、ランダム化された平滑化によって)が、滑らかさパラメーターに次元に依存する要因があり、標準的な滑らかな最適化手法にプラグインするときに反復複雑度に強く影響する可能性があります。他方では、これらの次元要因は適切な平滑化方法で排除できますが、平滑化プロセスのオラクル複雑度を指数関数的に大きくすることによってのみ実現できます。
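
The abstract mentions randomized smoothing only by name. As a rough illustration, one common form replaces $f(x)$ by the average of $f$ over a small ball around $x$, estimated by Monte Carlo. The sketch below assumes uniform-ball averaging; the function names, the radius `delta`, and the test function are illustrative choices and are not taken from the paper.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Draw a point uniformly from the d-dimensional unit Euclidean ball."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / d)  # radius law that makes the draw uniform in volume
    return [radius * v / norm for v in g]


def smoothed_value(f, x, delta, num_samples, rng):
    """Monte Carlo estimate of f_delta(x) = E[f(x + delta * u)], u uniform in the unit ball."""
    d = len(x)
    total = 0.0
    for _ in range(num_samples):
        u = sample_unit_ball(d, rng)
        total += f([xi + delta * ui for xi, ui in zip(x, u)])
    return total / num_samples


def l1_norm(x):
    """A simple nonsmooth test function with a kink at the origin."""
    return sum(abs(v) for v in x)


rng = random.Random(0)
# At the kink x = 0 in one dimension the exact smoothed value is delta / 2.
approx = smoothed_value(l1_norm, [0.0], delta=0.1, num_samples=20000, rng=rng)
print(approx)
```

Each smoothed evaluation here costs `num_samples` calls to `f`, which gives a feel for why the oracle cost of the smoothing step matters in the trade-off the abstract describes.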

Intrinsic Dimension Estimation Using Wasserstein Distance
ワッサースタイン距離を用いた内在次元の推定

It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.

多くの実用的な機械学習タスクで遭遇する高次元データは、低次元の構造を持っている、つまり多様体仮説が成り立つと長い間考えられてきました。したがって、自然な問題は、有限のサンプルから特定の母集団分布の本質的な次元を推定することです。固有次元の新しい推定量を導入し、有限サンプル、非漸近保証を提供します。次に、この手法を適用して、データの本質的な次元のみに依存する敵対的生成ネットワーク(GAN)の新しいサンプル複雑性境界を取得します。

Nyström Regularization for Time Series Forecasting
時系列予測のための Nyström 正則化

This paper focuses on learning rate analysis of Nyström regularization with sequential sub-sampling for $\tau$-mixing time series. Using a recently developed Banach-valued Bernstein inequality for $\tau$-mixing sequences and an integral operator approach based on second-order decomposition, we succeed in deriving almost optimal learning rates of Nyström regularization with sequential sub-sampling for $\tau$-mixing time series. A series of numerical experiments are carried out to verify our theoretical results, showing the excellent learning performance of Nyström regularization with sequential sub-sampling in learning massive time series data. All these results extend the applicable range of Nyström regularization from i.i.d. samples to non-i.i.d. sequences.

この論文では、$\tau$-ミキシング時系列の逐次サブサンプリングによるNyström正則化の学習率分析に焦点を当てています。最近開発された$\tau$-ミキシングシーケンスのBanach値Bernstein不等式と2次分解に基づく積分演算子アプローチを使用して、$\tau$-ミキシング時系列の逐次サブサンプリングによるNyström正則化のほぼ最適な学習率を導出することに成功しました。一連の数値実験を行って理論結果を検証し、大量の時系列データの学習における逐次サブサンプリングによるNyström正則化の優れた学習性能を示しています。これらの結果はすべて、Nyström正則化の適用範囲をi.i.d.サンプルから非i.i.d.シーケンスに拡張します。

Toward Understanding Convolutional Neural Networks from Volterra Convolution Perspective
ボルテラ畳み込みの観点から畳み込みニューラルネットワークを理解するために

We make an attempt to understand convolutional neural networks by exploring the relationship between (deep) convolutional neural networks and Volterra convolutions. We propose a novel approach to explain and study the overall characteristics of neural networks without being disturbed by the horribly complex architectures. Specifically, we attempt to convert the basic structures of a convolutional neural network (CNN) and their combinations to the form of Volterra convolutions. The results show that most convolutional neural networks can be approximated in the form of Volterra convolution, where the approximated proxy kernels preserve the characteristics of the original network. Analyzing these proxy kernels may give valuable insight into the original network. Based on this setup, we present methods to approximate the order-zero and order-one proxy kernels, and verify the correctness and effectiveness of our results.

私たちは、(深層)畳み込みニューラルネットワークとVolterra畳み込みとの関係を探ることで、畳み込みニューラルネットワークの理解を試みます。私たちは、ニューラルネットワークの全体的な特性を、恐ろしく複雑なアーキテクチャに邪魔されることなく説明および研究するための新しいアプローチを提案します。具体的には、畳み込みニューラルネットワーク(CNN)の基本構造とその組み合わせをVolterra畳み込みの形式に変換することを試みています。結果は、ほとんどの畳み込みニューラルネットワークがVolterra畳み込みの形で近似できることを示しています。このとき、近似されたプロキシカーネルは元のネットワークの特性を保持します。これらのプロキシカーネルを分析すると、元のネットワークに関する貴重な洞察が得られる可能性があります。この設定に基づいて、order-zeroとorder-oneプロキシカーネルを近似する方法を提示し、結果の正確性と有効性を検証します。

Detecting Latent Communities in Network Formation Models
ネットワーク形成モデルにおける潜在コミュニティの検出

This paper proposes a logistic undirected network formation model which allows for assortative matching on observed individual characteristics and the presence of edge-wise fixed effects. We model the coefficients of observed characteristics to have a latent community structure and the edge-wise fixed effects to be of low rank. We propose a multi-step estimation procedure involving nuclear norm regularization, sample splitting, iterative logistic regression and spectral clustering to detect the latent communities. We show that the latent communities can be exactly recovered when the expected degree of the network is of order $\log n$ or higher, where $n$ is the number of nodes in the network. The finite sample performance of the new estimation and inference methods is illustrated through both simulated and real datasets.

この論文では、観測された個々の特性に関する同類マッチング(assortative matching)とエッジごとの固定効果の存在を許容するロジスティック無向ネットワーク形成モデルを提案します。観測された特性の係数は潜在的なコミュニティ構造を持ち、エッジごとの固定効果は低ランクであるとモデル化します。私たちは、潜在コミュニティを検出するための核ノルム正則化、サンプル分割、反復ロジスティック回帰、およびスペクトルクラスタリングを含む多段階の推定手順を提案します。ネットワークの期待次数が$\log n$のオーダー以上である場合、潜在的なコミュニティを正確に回復できることを示します($n$はネットワーク内のノード数)。新しい推定および推論方法の有限サンプル性能は、シミュレートされたデータセットと実際のデータセットの両方を通じて示されています。

The Separation Capacity of Random Neural Networks
ランダムニューラルネットワークの分離能力

Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. In the present article, we enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under what conditions can a random neural network make two classes $\mathcal{X}^-, \mathcal{X}^+ \subset \mathbb{R}^d$ (with positive distance) linearly separable? We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability. Crucially, the number of required neurons is explicitly linked to geometric properties of the underlying sets $\mathcal{X}^-, \mathcal{X}^+$ and their mutual arrangement. This instance-specific viewpoint allows us to overcome the usual curse of dimensionality (exponential width of the layers) in non-pathological situations where the data carries low-complexity structure. We quantify the relevant structure of the data in terms of a novel notion of mutual complexity (based on a localized version of Gaussian mean width), which leads to sound and informative separation guarantees. We connect our result with related lines of work on approximation, memorization, and generalization.

ランダムな重みを持つニューラルネットワークは、さまざまな機械学習アプリケーションで使用されていますが、最も顕著なのは、多くのディープラーニングアルゴリズムの初期化や、完全に学習されたニューラルネットワークの計算コストの低い代替手段としてです。この記事では、ランダムニューラルネットワークの理論的理解を深めるために、次のデータ分離問題に取り組みます。ランダムニューラルネットワークは、どのような条件下で2つのクラス$\mathcal{X}^-, \mathcal{X}^+ \subset \mathbb{R}^d$ (正の距離)を線形分離可能にできますか。標準的なガウス重みと均一に分散されたバイアスを持つ十分に大きな2層ReLUネットワークが、この問題を高い確率で解決できることを示します。重要なのは、必要なニューロンの数が、基礎となるセット$\mathcal{X}^-, \mathcal{X}^+$の幾何学的特性とそれらの相互配置に明示的にリンクされていることです。このインスタンス固有の観点により、データが低複雑性の構造を持つ非病的な状況で、次元の通常の呪い(層の指数的幅)を克服できます。データの関連構造を、相互複雑性という新しい概念(ガウス平均幅の局所バージョンに基づく)で定量化し、健全で有益な分離保証を実現します。この結果を、近似、記憶、一般化に関する関連研究と結び付けます。

On Regularized Square-root Regression Problems: Distributionally Robust Interpretation and Fast Computations
正則化平方根回帰問題について:分布的にロバストな解釈と高速計算

Square-root (loss) regularized models have recently become popular in linear regression due to their nice statistical properties. Moreover, some of these models can be interpreted as the distributionally robust optimization counterparts of the traditional least-squares regularized models. In this paper, we give a unified proof to show that any square-root regularized model whose penalty function is the sum of a simple norm and a seminorm can be interpreted as the distributionally robust optimization (DRO) formulation of the corresponding least-squares problem. In particular, the optimal transport cost in the DRO formulation is given by a certain dual form of the penalty. To solve the resulting square-root regularized model whose loss function and penalty function are both nonsmooth, we design a proximal point dual semismooth Newton algorithm and demonstrate its efficiency when the penalty is the sparse group Lasso penalty or the fused Lasso penalty. Extensive experiments demonstrate that our algorithm is highly efficient for solving the square-root sparse group Lasso problems and the square-root fused Lasso problems.

平方根（損失）正則化モデルは、その優れた統計特性のため、最近線形回帰で人気が高まっています。さらに、これらのモデルのいくつかは、従来の最小二乗正則化モデルの分布ロバスト最適化対応物として解釈できます。この論文では、ペナルティ関数が単純ノルムとセミノルムの合計である任意の平方根正則化モデルは、対応する最小二乗問題の分布ロバスト最適化（DRO）定式化として解釈できることを示す統一的な証明を示します。特に、DRO定式化における最適な輸送コストは、ペナルティの特定の双対形式によって与えられます。損失関数とペナルティ関数の両方が非平滑である結果として得られる平方根正則化モデルを解くために、近似点双対半平滑ニュートンアルゴリズムを設計し、ペナルティがスパースグループLassoペナルティまたは融合Lassoペナルティである場合の効率性を実証します。広範囲にわたる実験により、私たちのアルゴリズムは平方根スパースグループLasso問題と平方根融合Lasso問題を解くのに非常に効率的であることが実証されています。
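
For orientation, the two model families contrasted in this abstract are commonly written as below. The notation ($X$, $y$, $\beta$, $\lambda$, and the penalty $p$) is assumed here rather than taken from the abstract, and normalization constants such as $1/\sqrt{n}$ differ between references.

$$
\text{least squares:}\quad \min_{\beta}\ \|y - X\beta\|_2^2 + \lambda\, p(\beta),
\qquad
\text{square root:}\quad \min_{\beta}\ \|y - X\beta\|_2 + \lambda\, p(\beta).
$$

The only difference is that the loss term is not squared in the second problem, which is also why both the loss and the penalty are nonsmooth there.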

Learning from Noisy Pairwise Similarity and Unlabeled Data
ノイズの多いペアワイズ類似性とラベルなしデータからの学習

SU classification employs similar (S) data pairs (two examples belong to the same class) and unlabeled (U) data points to build a classifier, which can serve as an alternative to the standard supervised trained classifiers requiring data points with class labels. SU classification is advantageous because in the era of big data, more attention has been paid to data privacy. Datasets with specific class labels are often difficult to obtain in real-world classification applications regarding privacy-sensitive matters, such as politics and religion, which can be a bottleneck in supervised classification. Fortunately, similarity labels do not reveal the explicit information and inherently protect the privacy, e.g., collecting answers to “With whom do you share the same opinion on issue $\mathcal{I}$?” instead of “What is your opinion on issue $\mathcal{I}$?”. Nevertheless, SU classification still has an obvious limitation: respondents might answer these questions in a manner that is viewed favorably by others instead of answering truthfully. Therefore, there exist some dissimilar data pairs labeled as similar, which significantly degenerates the performance of SU classification. In this paper, we study how to learn from noisy similar (nS) data pairs and unlabeled (U) data, which is called nSU classification. Specifically, we carefully model the similarity noise and estimate the noise rate by using the mixture proportion estimation technique. Then, a clean classifier can be learned by minimizing a denoised and unbiased classification risk estimator, which only involves the noisy data. Moreover, we further derive a theoretical generalization error bound for the proposed method. Experimental results demonstrate the effectiveness of the proposed algorithm on several benchmark datasets.

SU分類では、類似(S)データペア(2つの例が同じクラスに属する)とラベルなし(U)データポイントを使用して分類器を構築します。これは、クラスラベル付きのデータポイントを必要とする標準的な教師ありトレーニング済み分類器の代替として使用できます。SU分類が有利なのは、ビッグデータの時代にはデータプライバシーへの注目が高まっているためです。政治や宗教などプライバシーに敏感な問題に関する実際の分類アプリケーションでは、特定のクラスラベルを持つデータセットを入手するのが難しいことが多く、教師あり分類のボトルネックになる可能性があります。幸いなことに、類似ラベルは明示的な情報を明らかにせず、本質的にプライバシーを保護します。たとえば、「問題$\mathcal{I}$についての意見は?」ではなく、「問題$\mathcal{I}$について誰と同じ意見を持っていますか?」という質問に対する回答を収集します。ただし、SU分類には依然として明らかな制限があります。回答者は、これらの質問に正直に答えるのではなく、他の人に好意的に見られる方法で答える可能性があります。そのため、類似としてラベル付けされた非類似データペアがいくつか存在し、SU分類のパフォーマンスが大幅に低下します。この論文では、ノイズの多い類似(nS)データペアとラベルなし(U)データ(nSU分類と呼ばれる)から学習する方法を検討します。具体的には、類似性ノイズを慎重にモデル化し、混合比率推定技術を使用してノイズ率を推定します。次に、ノイズの多いデータのみを含むノイズ除去された偏りのない分類リスク推定値を最小化することで、クリーンな分類器を学習できます。さらに、提案方法の理論的な一般化誤差境界をさらに導出します。実験結果は、いくつかのベンチマークデータセットで提案アルゴリズムの有効性を実証しています。

Pathfinder: Parallel quasi-Newton variational inference
パスファインダー:並列準ニュートン変分推論

We propose Pathfinder, a variational method for approximately sampling from differentiable probability densities. Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the target distribution. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder’s speed advantage with multiple cores.

私たちは、微分可能な確率密度から近似的にサンプリングする変分法であるPathfinderを提案します。ランダムな初期化から始めて、Pathfinderは準ニュートン最適化パスに沿ってターゲット密度への正規近似値を見つけ、局所共分散は最適化プログラムによって生成された逆ヘッセ推定値を使用して推定されます。Pathfinderは、ターゲット分布への推定Kullback-Leibler (KL)ダイバージェンスが最も低い近似値から抽出したものを返す。私たちは、広範囲の事後分布でPathfinderを評価し、その近似抽出値が自動微分変分推論(ADVI)のものより優れており、1-ワッサーシュタイン距離で測定された動的ハミルトンモンテカルロ(HMC)の短いチェーンによって生成されたものと同等であることを実証した。ADVIおよび短い動的HMC実行と比較すると、Pathfinderでは対数密度と勾配の評価が1～2桁少なく、より困難な事後分布では評価が大幅に削減されます。Pathfinderの複数回の実行における重要度の再サンプリングにより、近似値の多様性が向上し、1-ワッサーシュタイン距離がさらに短縮され、プラトー、鞍点、またはマイナーモードでの最適化の失敗に対する堅牢性の尺度が提供されます。モンテカルロKLダイバージェンスの推定は、コアPathfinderアルゴリズムで驚くほど並列化可能であり、再サンプリングバージョンでの複数回の実行も同様であるため、複数のコアを使用したPathfinderの速度の利点がさらに高まります。

Tree-Values: Selective Inference for Regression Trees
tree-values: 回帰ツリーの選択的推論

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.

私たちは、分類と回帰木(CART) (Breimanら, 1984)アルゴリズムの出力に対して推論を行うことを検討します。ツリーがデータから推定されたという事実を考慮しない推論への素朴なアプローチは、タイプ1エラー率制御や名目カバレッジなどの標準的な保証を達成しません。したがって、適合したCARTツリーで推論を行うための選択的推論フレームワークを提案します。一言で言えば、データから木が推定されたという事実を条件とします。選択的タイプ1エラー率を制御するターミナルノードのペア間の平均応答の差と、ノミナル選択的カバレッジを達成する1つのターミナルノード内の平均応答の信頼区間の検定を提案します。必要な条件付けセットを計算するための効率的なアルゴリズムが提供されます。これらの方法をシミュレーションで適用し、ポーションコントロール介入とカロリー摂取量との関連を含むデータセットに適用します。

Variational Inference in high-dimensional linear regression
高次元線形回帰における変分推論

We study high-dimensional Bayesian linear regression with product priors. Using the nascent theory of “non-linear large deviations” (Chatterjee and Dembo, 2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true linear model for the observed data, we derive a limiting infinite dimensional variational formula for the log normalizing constant for the posterior. Furthermore, we establish that under an additional “separation” condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this “separation” condition. Finally, we illustrate our results on concrete examples with specific design matrices.

私たちは、積の事前分布を持つ高次元のベイズ線形回帰を研究します。「非線形大偏差」という新生理論(Chatterjee and Dembo, 2016)を用いて、事後分布の対数正規化定数に対するナイーブ平均場近似の先行順序の正確性のための十分な条件を導出します。続いて、観測データの真の線形モデルを仮定して、事後分布の対数正規化定数の限定無限次元変分式を導出します。さらに、追加の「分離」条件下では、変分問題には固有のオプティマイザーがあり、このオプティマイザーが事後分布の確率的特性を支配することを確立します。この「分離」条件の有効性について、直感的に十分な条件を提供します。最後に、特定の設計行列を使用した具体的な例で結果を示します。

Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning and Graph Neural Networks
強化学習とグラフニューラルネットワークを使用したグラフ分割とスパース行列の順序付け

We present a novel method for graph partitioning, based on reinforcement learning and graph convolutional neural networks. Our approach is to recursively partition coarser representations of a given graph. The neural network is implemented using SAGE graph convolution layers, and trained using an advantage actor critic (A2C) agent. We present two variants, one for finding an edge separator that minimizes the normalized cut or quotient cut, and one that finds a small vertex separator. The vertex separators are then used to construct a nested dissection ordering to permute a sparse matrix so that its triangular factorization will incur less fill-in. The partitioning quality is compared with partitions obtained using METIS and SCOTCH, and the nested dissection ordering is evaluated in the sparse solver SuperLU. Our results show that the proposed method achieves similar partitioning quality as METIS, SCOTCH and spectral partitioning. Furthermore, the method generalizes across different classes of graphs, and works well on a variety of graphs from the SuiteSparse sparse matrix collection.

私たちは、強化学習とグラフ畳み込みニューラルネットワークに基づくグラフ分割の新しい方法を紹介します。このアプローチは、特定のグラフのより粗い表現を再帰的に分割することです。ニューラルネットワークはSAGEグラフ畳み込み層を使用して実装され、アドバンテージアクタークリティック(A2C)エージェントを使用してトレーニングされます。正規化カットまたは商カットを最小化するエッジセパレーターを見つけるバリアントと、小さな頂点セパレーターを見つけるバリアントの2つを紹介します。頂点セパレーターは、ネストされた分解順序を構築するために使用され、スパースマトリックスを並べ替えて、三角因数分解によるフィルインが少なくなるようにするものです。分割の品質は、METISおよびSCOTCHを使用して取得された分割と比較され、ネストされた分解順序はスパースソルバーSuperLUで評価されます。結果は、提案された方法がMETIS、SCOTCH、およびスペクトル分割と同様の分割品質を実現することを示しています。さらに、この方法はさまざまなグラフのクラスに一般化されており、SuiteSparseスパースマトリックスコレクションのさまざまなグラフでうまく機能します。
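
To make the edge-separator objective concrete, the sketch below evaluates the normalized cut of a two-way split using one widely used definition, $\mathrm{cut}(A,B)\,(1/\mathrm{vol}(A) + 1/\mathrm{vol}(B))$. The abstract does not spell out the definition, so this form, the edge-list input, and the function name are assumptions made for illustration.

```python
def normalized_cut(edges, part_a):
    """Normalized cut of the split (part_a, rest) for an undirected weighted graph.

    edges:  iterable of (u, v, weight), each undirected edge listed once.
    part_a: set of vertices on one side of the split.
    """
    cut = 0.0    # total weight of edges crossing the split
    vol_a = 0.0  # sum of weighted degrees inside part_a
    vol_b = 0.0  # sum of weighted degrees outside part_a
    for u, v, w in edges:
        # Every edge adds its weight to the degree of both endpoints.
        for endpoint in (u, v):
            if endpoint in part_a:
                vol_a += w
            else:
                vol_b += w
        if (u in part_a) != (v in part_a):
            cut += w
    if vol_a == 0.0 or vol_b == 0.0:
        raise ValueError("both sides of the split need at least one edge endpoint")
    return cut * (1.0 / vol_a + 1.0 / vol_b)


# Path graph 0 - 1 - 2 - 3 with unit weights.
path_edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
balanced = normalized_cut(path_edges, {0, 1})
unbalanced = normalized_cut(path_edges, {0})
print(balanced, unbalanced)
```

On this small path graph both splits cut a single edge, yet the balanced split scores lower, which is the behavior that motivates normalizing the cut by the volumes of the two sides.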

On Instrumental Variable Regression for Deep Offline Policy Evaluation
ディープオフライン政策評価のための操作変数回帰について

We show that the popular reinforcement learning (RL) strategy of estimating the state-action value (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding, the inputs and output noise being correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well understood trick in the deep RL literature. An alternative approach to address confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). We bring together here the literature on IV and RL by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods such as model-based techniques but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE. We open-source all our code and datasets at https://github.com/liyuan9988/IVOPEwithACME.

私たちは、ベルマン誤差の二乗平均を最小化することで状態行動値(Q関数)を推定する一般的な強化学習(RL)戦略が、入力と出力ノイズが相関しているため交絡を伴う回帰問題につながることを示します。したがって、ベルマン誤差を直接最小化すると、Q関数推定値にかなり偏りが生じる可能性があります。Deep QネットワークとFitted Q EvaluationでターゲットQネットワークを固定するとこの交絡を克服できる理由を説明し、これにより、Deep RL文献で一般的だが十分に理解されていないこのトリックに新たな光を当てます。交絡に対処するための別のアプローチは、因果関係の文献で開発された手法、特に操作変数(IV)を活用することです。ここでは、IVアプローチがQ関数推定値の改善につながるかどうかを調査することにより、IVとRLに関する文献をまとめます。この論文では、ログデータのみを使用してポリシーの価値を推定することを目標とするオフラインポリシー評価(OPE)のコンテキストで、最近のさまざまなIV手法を分析および比較します。OPEにさまざまなIV技術を適用することで、モデルベースの技術など、以前に提案されたOPE手法を回復できるだけでなく、競争力のある新しい技術を取得することもできます。最先端のOPE手法は、OPE用に開発されていないAGMMなどの一部のIV手法とパフォーマンスがほぼ同等であることが経験的にわかっています。すべてのコードとデータセットは、https://github.com/liyuan9988/IVOPEwithACMEでオープンソース化されています。
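
The confounding issue described in this abstract can be seen in a much simpler setting than RL. The sketch below is a generic scalar regression, not the paper's Bellman-error setup: a hidden variable `u` enters both the input `x` and the output noise, so ordinary least squares is biased, while a basic instrumental-variable ratio estimator using the instrument `z` recovers the slope. All variable names and the data-generating process are assumptions made for illustration.

```python
import random


def covariance(a, b):
    """Empirical covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
n = 20000
true_slope = 2.0

z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # instrument: affects y only through x
u = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hidden confounder
x = [zi + ui for zi, ui in zip(z, u)]        # input is correlated with the confounder
y = [true_slope * xi + ui + 0.1 * rng.gauss(0.0, 1.0) for xi, ui in zip(x, u)]

# Ordinary least squares: biased because x and the noise term share u.
ols_slope = covariance(x, y) / covariance(x, x)

# Instrumental-variable estimate: uses only the variation in x explained by z.
iv_slope = covariance(z, y) / covariance(z, x)

print(ols_slope, iv_slope)
```

In this toy model the least-squares slope concentrates near 2.5 rather than 2.0, because the covariance between `x` and the noise is half the variance of `x`, while the instrumental-variable estimate stays close to the true slope.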

Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data
高次元クラスタデータの可視化のためのt-SNEの理論的基礎

This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth the interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications.

この論文では、人気の高い非線形次元削減およびデータ可視化手法であるt分布確率的近傍埋め込み(t-SNE)アルゴリズムの理論的基礎を調査します。勾配降下法に基づくt-SNEの分析のための新しい理論的枠組みを提示します。t-SNEの初期の誇張段階については、基礎となるグラフラプラシアンに基づくべき乗反復に対する漸近的等価性を示し、その限界動作を特徴付け、ラプラシアンスペクトルクラスタリングとの深い関係、および暗黙的な正則化としての早期停止などの基本原理を明らかにします。結果は、このような計算戦略の固有のメカニズムと経験的利点を説明しています。t-SNEの埋め込み段階では、反復全体にわたる低次元マップの運動学を特徴付け、クラスター間反発と低次元マップの拡張動作を特徴とする増幅フェーズと安定化フェーズを特定します。一般理論では、クラスター化されたデータを視覚化するためのt-SNEの高速収束率と優れた経験的パフォーマンスを説明し、t-SNE視覚化の解釈を明らかにし、さまざまなアプリケーションでt-SNEを適用し、そのチューニングパラメータを選択するための理論的なガイダンスを提供します。
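
The link between early exaggeration and power iterations on a graph Laplacian can be pictured with a toy iteration $y \leftarrow (I - hL)\,y$. The sketch below is only that toy iteration on a hand-built affinity matrix with two tight groups. It is not the t-SNE update itself, and the step size `h`, the affinities, and the starting coordinates are illustrative assumptions.

```python
def graph_laplacian(weights):
    """Unnormalized Laplacian L = D - W of a symmetric affinity matrix with zero diagonal."""
    n = len(weights)
    return [
        [sum(weights[i]) if i == j else -weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_step(lap, y, h):
    """One iteration y <- (I - h L) y for a one-dimensional embedding y."""
    n = len(y)
    return [y[i] - h * sum(lap[i][j] * y[j] for j in range(n)) for i in range(n)]


# Two groups of three points: strong affinity inside a group, weak across groups.
groups = [0, 0, 0, 1, 1, 1]
weights = [
    [0.0 if i == j else (1.0 if groups[i] == groups[j] else 0.001) for j in range(6)]
    for i in range(6)
]
lap = graph_laplacian(weights)

y = [0.5, -0.2, 0.3, -0.4, 0.1, -0.6]  # arbitrary starting coordinates
for _ in range(10):
    y = laplacian_step(lap, y, h=0.3)

spread_first = max(y[:3]) - min(y[:3])       # within-group spread after the iterations
spread_second = max(y[3:]) - min(y[3:])
gap = abs(sum(y[:3]) / 3 - sum(y[3:]) / 3)   # distance between the two group means
print(spread_first, spread_second, gap)
```

After a few iterations the points inside each group collapse toward their group mean while the two means stay apart, a small-scale picture of the clustering behavior the abstract attributes to the early exaggeration stage.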

More Powerful Conditional Selective Inference for Generalized Lasso by Parametric Programming
パラメトリック計画法による一般化ラッソのためのより強力な条件付き選択的推論

Conditional selective inference (SI) has been studied intensively as a new statistical inference framework for data-driven hypotheses. The basic concept of conditional SI is to make the inference conditional on the selection event, which enables an exact and valid statistical inference to be conducted even when the hypothesis is selected based on the data. Conditional SI has mainly been studied in the context of model selection, such as vanilla lasso or generalized lasso. The main limitation of existing approaches is the low statistical power owing to over-conditioning, which is required for computational tractability. In this study, we propose a more powerful and general conditional SI method for a class of problems that can be converted into quadratic parametric programming, which includes generalized lasso. The key concept is to compute the continuum path of the optimal solution in the direction of the selected test statistic and to identify the subset of the data space that corresponds to the model selection event by following the solution path. The proposed parametric programming-based method not only avoids the aforementioned major drawback of over-conditioning, but also improves the performance and practicality of SI in various respects. We conducted several experiments to demonstrate the effectiveness and efficiency of our proposed method.

条件付き選択推論(SI)は、データ駆動型仮説の新しい統計的推論フレームワークとして集中的に研究されてきました。条件付きSIの基本概念は、推論を選択イベントに条件付きにすることです。これにより、データに基づいて仮説が選択された場合でも、正確で有効な統計的推論を実行できます。条件付きSIは、主にバニラLassoや一般化Lassoなどのモデル選択のコンテキストで研究されてきました。既存のアプローチの主な制限は、計算の扱いやすさに必要な過剰条件付けによる統計的検出力が低いことです。この研究では、一般化Lassoを含む二次パラメトリック計画法に変換できる問題のクラスに対して、より強力で一般的な条件付きSI法を提案します。重要な概念は、選択された検定統計量の方向への最適解の連続パスを計算し、解のパスをたどってモデル選択イベントに対応するデータ空間のサブセットを特定することです。提案されたパラメトリックプログラミングベースの方法は、前述の過剰条件付けの大きな欠点を回避するだけでなく、さまざまな点でSIのパフォーマンスと実用性を向上させます。提案された方法の有効性と効率性を実証するために、いくつかの実験を実施しました。

Interpretable Classification of Categorical Time Series Using the Spectral Envelope and Optimal Scalings
スペクトルエンベロープと最適スケーリングを使用したカテゴリカル時系列の解釈可能な分類

This article introduces a novel approach to the classification of categorical time series under the supervised learning paradigm. To construct meaningful features for categorical time series classification, we consider two relevant quantities: the spectral envelope and its corresponding set of optimal scalings. These quantities characterize oscillatory patterns in a categorical time series as the largest possible power at each frequency, or spectral envelope, obtained by assigning numerical values, or scalings, to categories that optimally emphasize oscillations at each frequency. Our procedure combines these two quantities to produce an interpretable and parsimonious feature-based classifier that can be used to accurately determine group membership for categorical time series. Classification consistency of the proposed method is investigated, and simulation studies are used to demonstrate accuracy in classifying categorical time series with various underlying group structures. Finally, we use the proposed method to explore key differences in oscillatory patterns of sleep stage time series for patients with different sleep disorders and accurately classify patients accordingly. The code for implementing the proposed method is available at https://github.com/zedali16/envsca.

この記事では、教師あり学習パラダイムの下でカテゴリ時系列を分類する新しいアプローチを紹介します。カテゴリ時系列分類に意味のある特徴を構築するために、スペクトルエンベロープとそれに対応する最適なスケーリングのセットという2つの関連量を検討します。これらの量は、カテゴリ時系列の振動パターンを、各周波数での振動を最適に強調するカテゴリに数値またはスケーリングを割り当てることによって得られる、各周波数での最大可能パワー、またはスペクトルエンベロープとして特徴付けます。私たちの手順では、これら2つの量を組み合わせて、カテゴリ時系列のグループメンバーシップを正確に決定するために使用できる、解釈可能で簡潔な特徴ベースの分類器を作成します。提案された方法の分類の一貫性を調査し、さまざまな基礎グループ構造を持つカテゴリ時系列の分類の精度をシミュレーション研究で実証します。最後に、提案された方法を使用して、さまざまな睡眠障害を持つ患者の睡眠段階時系列の振動パターンの主な違いを調査し、それに応じて患者を正確に分類します。提案された方法を実装するためのコードは、https://github.com/zedali16/envscaで入手できます。

JsonGrinder.jl: automated differentiable neural architecture for embedding arbitrary JSON data
JsonGrinder.jl: 任意のJSONデータを埋め込むための自動微分ニューラルアーキテクチャ

Standard machine learning (ML) problems are formulated on data converted into a suitable tensor representation. However, there are data sources, for example in cybersecurity, that are naturally represented in a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers. Converting this data to a tensor representation is usually done by manual feature engineering, which is laborious, lossy, and prone to bias originating from the human inability to correctly judge the importance of particular features. JsonGrinder.jl is a library automating various ML tasks on these difficult sources. Starting with an arbitrary set of JSON samples, it automatically creates a differentiable ML model (called hmilnet), which embeds raw JSON samples into a fixed-size tensor representation. This embedding network can be naturally extended by an arbitrary ML model expecting tensor inputs in order to perform classification, regression, or clustering.

標準的な機械学習(ML)の問題は、適切なテンソル表現に変換されたデータに基づいて定式化されます。ただし、サイバーセキュリティなど、XML、JSON、プロトコルバッファなどの統一された階層構造で自然に表されるデータソースがあります。このデータをテンソル表現に変換するには、通常、手動の特徴エンジニアリングによって行われますが、これは手間がかかり、損失が多く、特定の特徴の重要性を人間が正確に判断できないことに起因するバイアスが発生しやすくなります。JsonGrinder.jlは、これらの難しいソースに対するさまざまなMLタスクを自動化するライブラリです。任意のJSONサンプルのセットから始めて、微分可能なMLモデル(hmilnetと呼ばれる)を自動的に作成し、生のJSONサンプルを固定サイズのテンソル表現に埋め込みます。この埋め込みネットワークは、分類、回帰、またはクラスタリングを実行するためにテンソル入力を期待する任意のMLモデルによって自然に拡張できます。

Handling Hard Affine SDP Shape Constraints in RKHSs
RKHSでのハードアフィンSDP形状制約の処理

Shape constraints, such as non-negativity, monotonicity, convexity or supermodularity, play a key role in various applications of machine learning and statistics. However, incorporating this side information into predictive models in a hard way (for example at all points of an interval) for rich function classes is a notoriously challenging problem. We propose a unified and modular convex optimization framework, relying on second-order cone (SOC) tightening, to encode hard affine SDP constraints on function derivatives, for models belonging to vector-valued reproducing kernel Hilbert spaces (vRKHSs). The modular nature of the proposed approach makes it possible to handle multiple shape constraints simultaneously, and to tighten an infinite number of constraints into finitely many. We prove the convergence of the proposed scheme and that of its adaptive variant, leveraging geometric properties of vRKHSs. Due to the covering-based construction of the tightening, the method is particularly well-suited to tasks with small to moderate input dimensions. The efficiency of the approach is illustrated in the context of shape optimization, safety-critical control, robotics and econometrics.

非負性、単調性、凸性、超モジュラ性などの形状制約は、機械学習や統計のさまざまなアプリケーションで重要な役割を果たします。ただし、この副次情報を、豊富な関数クラスの予測モデルに難しい方法で(たとえば、区間のすべてのポイントで)組み込むことは、非常に困難な問題です。ベクトル値再生核ヒルベルト空間(vRKHS)に属するモデルに対して、関数導関数に対するハードアフィンSDP制約をエンコードするために、2次円錐(SOC)の引き締めに依存する、統一されたモジュール式の凸最適化フレームワークを提案します。提案されたアプローチのモジュール性により、複数の形状制約を同時に処理し、無限の数の制約を有限の数に引き締めることができます。vRKHSの幾何学的特性を活用して、提案されたスキームとその適応型の収束を証明します。引き締めの被覆ベースの構築により、この方法は、小さい入力次元から中程度の入力次元を持つタスクに特に適しています。このアプローチの効率性は、形状最適化、安全性重視の制御、ロボット工学、計量経済学の文脈で実証されています。

Stable Classification
安定した分類

We address the problem of instability of classification models: small changes in the training data leading to large changes in the resulting model and predictions. This phenomenon is especially well established for single-tree-based methods such as CART; however, it is present in all classification methods. We apply robust optimization to improve the stability of four of the most commonly used classification methods: Random Forests, Logistic Regression, Support Vector Machines, and Optimal Classification Trees. Through experiments on 30 data sets with sizes ranging between 10^2 and 10^4 observations and features, we show that our approach (a) leads to improvements in stability, and in some cases accuracy, compared to the original methods, with the gains in stability being particularly significant (even, surprisingly, for those methods that were previously thought to be stable, such as Random Forests) and (b) has computational times comparable with (and indeed in some cases even faster than) the original methods, allowing the method to be very scalable.

私たちは、分類モデルの不安定性の問題に取り組んでいます。つまり、トレーニングデータの小さな変更が、結果として得られるモデルと予測に大きな変更をもたらすのです。この現象は、CARTなどの単一ツリーベースの方法では特によく知られていますが、すべての分類方法で発生します。私たちは、ロバスト最適化を適用して、ランダムフォレスト、ロジスティック回帰、サポートベクターマシン、最適分類ツリーという、最も一般的に使用される4つの分類方法の安定性を改善しました。観測値と特徴が10^2～10^4のサイズの30のデータセットでの実験により、我々のアプローチは(a)元の方法と比較して安定性、場合によっては精度が向上し、安定性の向上は特に顕著であり(驚いたことに、ランダムフォレストなど、以前は安定していると考えられていた方法であっても)、(b)計算時間が元の方法と同等(場合によってはそれよりも速い)であるため、非常にスケーラブルであることを示しています。

Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables
隠れ変数を持つグラフィカルモデルにおける因果効果のセミパラメトリック推論

Identification theory for causal effects in causal models associated with hidden variable directed acyclic graphs (DAGs) is well studied. However, the corresponding algorithms are underused due to the complexity of estimating the identifying functionals they output. In this work, we bridge the gap between identification and estimation of population-level causal effects involving a single treatment and a single outcome. We derive influence function based estimators that exhibit double robustness for the identified effects in a large class of hidden variable DAGs where the treatment satisfies a simple graphical criterion; this class includes models yielding the adjustment and front-door functionals as special cases. We also provide necessary and sufficient conditions under which the statistical model of a hidden variable DAG is nonparametrically saturated and implies no equality constraints on the observed data distribution. Further, we derive an important class of hidden variable DAGs that imply observed data distributions observationally equivalent (up to equality constraints) to fully observed DAGs. In these classes of DAGs, we derive estimators that achieve the semiparametric efficiency bounds for the target of interest where the treatment satisfies our graphical criterion. Finally, we provide a sound and complete identification algorithm that directly yields a weight based estimation strategy for any identifiable effect in hidden variable causal models.

隠れ変数有向非巡回グラフ(DAG)に関連する因果モデルにおける因果効果の識別理論は、十分に研究されています。しかし、対応するアルゴリズムは、それらが出力する識別関数の推定が複雑であるため、十分に活用されていません。この研究では、単一の処理と単一の結果を含む集団レベルの因果効果の識別と推定の間のギャップを埋めます。処理が単純なグラフィカル基準を満たす大規模な隠れ変数DAGクラスで識別された効果に対して2倍の堅牢性を示す影響関数ベースの推定量を導出します。このクラスには、特別なケースとして調整関数とフロントドア関数を生成するモデルが含まれます。また、隠れ変数DAGの統計モデルがノンパラメトリックに飽和し、観測されたデータ分布に等式制約がないことを意味する必要十分条件も提供します。さらに、観測されたデータ分布が(等式制約まで)完全に観測されたDAGと観測的に同等であることを意味する重要な隠れ変数DAGクラスを導出します。これらのDAGクラスでは、処理がグラフィカル基準を満たす対象に対してセミパラメトリック効率境界を達成する推定量を導出します。最後に、隠れた変数因果モデルで識別可能な効果に対して重みベースの推定戦略を直接生成する、健全で完全な識別アルゴリズムを提供します。

Linearization and Identification of Multiple-Attractor Dynamical Systems through Laplacian Eigenmaps
ラプラシアン固有マップによる多重アトラクタ力学系の線形化と同定

Dynamical Systems (DS) are fundamental to the modeling and understanding of time-evolving phenomena, and have applications in physics, biology and control. As determining an analytical description of the dynamics is often difficult, data-driven approaches are preferred for identifying and controlling nonlinear DS with multiple equilibrium points. Identification of such DS has been treated largely as a supervised learning problem. Instead, we focus on an unsupervised learning scenario where we know neither the number nor the type of dynamics. We propose a graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data points belonging to the same dynamics, while preserving the natural temporal evolution. We study the eigenvectors and eigenvalues of the graph Laplacian and show that they form a set of orthogonal embedding spaces, one for each sub-dynamics. We prove that there always exists a set of 2-dimensional embedding spaces in which the sub-dynamics are linear and n-dimensional embedding spaces where they are quasi-linear. We compare the clustering performance of our algorithm to Kernel K-Means, Spectral Clustering and Gaussian Mixtures and show that, even when these algorithms are provided with the correct number of sub-dynamics, they fail to cluster them correctly. We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time through an exponentially decaying loss compared to the state-of-the-art diffeomorphism-based approaches.

動的システム(DS)は、時間とともに変化する現象をモデル化して理解するための基礎であり、物理学、生物学、制御に応用されています。ダイナミクスの解析的記述を決定することは難しいことが多いため、複数の平衡点を持つ非線形DSを識別して制御するには、データ駆動型のアプローチが好まれます。このようなDSの識別は、主に教師あり学習の問題として扱われてきました。代わりに、ダイナミクスの数も種類もわからない教師なし学習のシナリオに焦点を当てます。速度拡張カーネルを利用して、自然な時間的変化を維持しながら、同じダイナミクスに属するデータポイントを接続するグラフベースのスペクトルクラスタリング手法を提案します。グラフラプラシアンの固有ベクトルと固有値を調べ、それらがサブダイナミクスごとに1つずつ、直交する埋め込み空間のセットを形成することを示します。私たちは、サブダイナミクスが線形である2次元埋め込み空間のセットと、サブダイナミクスが準線形であるn次元埋め込み空間のセットが常に存在することを証明します。我々のアルゴリズムのクラスタリングパフォーマンスをカーネルK平均法、スペクトルクラスタリング、ガウス混合と比較し、これらのアルゴリズムに正しい数のサブダイナミクスが提供された場合でも、正しくクラスタリングできないことを示します。ラプラシアン埋め込み空間から元の空間への微分同相写像を学習し、最先端の微分同相写像ベースのアプローチと比較して、ラプラシアン埋め込みによって、優れた再構築精度と、指数関数的減衰損失によるトレーニング時間の短縮が実現されることを示します。

Expected Regret and Pseudo-Regret are Equivalent When the Optimal Arm is Unique
最適なアームが唯一のものである場合、期待される後悔と疑似後悔は同等である

In online linear optimisation with stochastic losses it is common to bound the pseudo-regret of an algorithm rather than the expected regret. This is attributed to the expected fluctuations for i.i.d. sums making expected regret bounds better than $\Omega(\sqrt T)$ impossible. In this paper we show that when there is a unique optimal action and the action set is a polytope, the difference between pseudo-regret and expected regret is $o(1)$. This means that the existing upper bounds on pseudo-regret in the literature can immediately be extended to also upper bound the expected regret. Our results are independent of the algorithm used to select the actions and apply equally to the bandit and full-information settings.

確率的損失を伴うオンライン線形最適化では、予想される後悔ではなく、アルゴリズムの疑似後悔を限定するのが一般的です。これは、i.i.d.和の予想変動により、予想される後悔の範囲が$\Omega(\sqrt T)$よりも良くなることは不可能であることに起因します。この論文では、一意の最適なアクションがあり、アクションセットがポリトープである場合、疑似後悔と予想される後悔の差は$o(1)$であることを示します。これは、文献における疑似後悔の既存の上限を、予想される後悔の上限まですぐに拡張できることを意味します。私たちの結果は、アクションを選択するために使用されるアルゴリズムとは無関係であり、バンディットとフルインフォメーションの設定に等しく適用されます。
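
For readers who want the two quantities side by side, here is one common way to write them. The notation (action set $\mathcal{A}$, i.i.d. loss vectors $\ell_t$, played actions $a_t$, horizon $T$) is assumed for illustration and is not taken from the paper, whose exact conventions may differ.

```latex
% Expected regret: the comparator is chosen after seeing the realised losses.
\mathbb{E}[R_T]
  = \mathbb{E}\!\left[\sum_{t=1}^{T} \langle a_t, \ell_t \rangle
      - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle a, \ell_t \rangle\right]

% Pseudo-regret: the comparator is chosen against the expected losses.
\bar{R}_T
  = \mathbb{E}\!\left[\sum_{t=1}^{T} \langle a_t, \ell_t \rangle\right]
      - \min_{a \in \mathcal{A}} \mathbb{E}\!\left[\sum_{t=1}^{T} \langle a, \ell_t \rangle\right]

% Since E[min] <= min E, the gap is non-negative:
\mathbb{E}[R_T] - \bar{R}_T
  = \min_{a \in \mathcal{A}} \mathbb{E}\!\left[\sum_{t=1}^{T} \langle a, \ell_t \rangle\right]
      - \mathbb{E}\!\left[\min_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle a, \ell_t \rangle\right]
  \;\ge\; 0
```

Under these definitions, the abstract's claim is that this non-negative gap is $o(1)$ when the optimal action is unique and $\mathcal{A}$ is a polytope.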

Two-mode Networks: Inference with as Many Parameters as Actors and Differential Privacy
2 モードネットワーク: アクターと同じ数のパラメーターと差分プライバシーによる推論

Many network data encountered are two-mode networks. These networks are characterized by having two sets of nodes, with links made only between nodes belonging to different sets. While their two-mode feature triggers interesting interactions, it also increases the risk of privacy exposure, and it is essential to protect sensitive information from being disclosed when releasing these data. In this paper, we introduce a weak notion of edge differential privacy and propose to release the degree sequence of a two-mode network by adding non-negative Laplacian noise that satisfies this privacy definition. Under mild conditions for an exponential-family model for bipartite graphs in which each node is individually parameterized, we establish the consistency and asymptotic normality of two differential privacy estimators, the first based on moment equations and the second after denoising the noisy sequence. For the latter, we develop an efficient algorithm which produces a readily useful synthetic bipartite graph. Numerical simulations and a real data application are carried out to verify our theoretical results and demonstrate the usefulness of our proposal.

遭遇するネットワークデータの多くは2モードネットワークです。これらのネットワークは、2セットのノードを持ち、リンクは異なるセットに属するノード間でのみ作成されるという特徴があります。2モードの特徴は興味深い相互作用をトリガーしますが、プライバシー露出のリスクも増加し、これらのデータを公開する際には機密情報が漏洩しないように保護することが不可欠です。この論文では、エッジ差分プライバシーの弱い概念を導入し、このプライバシー定義を満たす非負のラプラシアンノイズを追加することで、2モードネットワークの次数シーケンスを公開することを提案します。各ノードが個別にパラメーター化される2部グラフの指数族モデルの穏やかな条件下で、2つの差分プライバシー推定値の一貫性と漸近正規性を確立します。1つ目はモーメント方程式に基づき、2つ目はノイズシーケンスのノイズ除去後です。後者については、すぐに使用できる合成2部グラフを生成する効率的なアルゴリズムを開発します。数値シミュレーションと実際のデータアプリケーションを実行して、理論的な結果を検証し、提案の有用性を実証します。

Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays
一般化行列因数分解: 一般化線形潜在変数モデルを大規模なデータ配列に当てはめるための効率的なアルゴリズム

Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large data sets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional data sets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a data set of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithms.

測定されていない変数や潜在変数は、心理学、生態学、医学などさまざまな分野で研究されている多変量測定間の相関関係の原因となることがよくあります。ガウス測定には、確立された理論と高速アルゴリズムを備えた因子分析や主成分分析などの古典的なツールがあります。一般化線形潜在変数モデル(GLLVM)は、このような因子モデルを非ガウス応答に一般化します。ただし、GLLVMでモデルパラメーターを推定する現在のアルゴリズムは集中的な計算を必要とし、数千の観測単位または応答を含む大規模なデータセットには拡張できません。この記事では、ペナルティ付き準尤度を使用してモデルを近似し、ニュートン法とフィッシャースコアリングを使用してモデルパラメーターを学習することに基づいて、高次元データセットにGLLVMを適合させる新しいアプローチを提案します。計算的には、私たちの方法は著しく高速で安定しており、GLLVMはこれまで可能だったよりもはるかに大きな行列に適合できます。私たちは、各ユニットで2,000種を超える観測種を含む48,000の観測ユニットのデータセットにこの手法を適用し、変動の大部分がいくつかの要因で説明できることを発見しました。私たちは、提案したフィッティングアルゴリズムの使いやすい実装を公開しています。

Regularized and Smooth Double Core Tensor Factorization for Heterogeneous Data
異種データのための正則化および滑らかなダブルコアテンソル因数分解

We introduce a general tensor model suitable for data analytic tasks for heterogeneous datasets, wherein there are joint low-rank structures within groups of observations, but also discriminative structures across different groups. To capture such complex structures, a double core tensor (DCOT) factorization model is introduced together with a family of smoothing loss functions. By leveraging the proposed smoothing function, the model accurately estimates the model factors, even in the presence of missing entries. A linearized ADMM method is employed to solve regularized versions of DCOT factorizations, that avoid large tensor operations and large memory storage requirements. Further, we establish theoretically its global convergence, together with consistency of the estimates of the model parameters. The effectiveness of the DCOT model is illustrated on several real-world examples including image completion, recommender systems, subspace clustering, and detecting modules in heterogeneous Omics multi-modal data, since it provides more insightful decompositions than conventional tensor methods.

私たちは、異種データセットのデータ分析タスクに適した一般的なテンソルモデルを導入します。異種データセットでは、観測のグループ内に結合した低ランク構造がありますが、異なるグループ間では識別構造もあります。このような複雑な構造を捉えるために、ダブルコアテンソル(DCOT)因数分解モデルが、平滑化損失関数のファミリーとともに導入されています。提案された平滑化関数を活用することで、モデルは、欠落したエントリが存在する場合でも、モデル因子を正確に推定します。線形化されたADMM法は、大規模なテンソル演算と大規模なメモリストレージ要件を回避する、DCOT因数分解の正規化バージョンを解決するために採用されています。さらに、モデルパラメータの推定値の一貫性とともに、そのグローバル収束を理論的に確立します。DCOTモデルの有効性は、従来のテンソル法よりも洞察に富んだ分解を提供するため、画像補完、レコメンデーションシステム、サブスペースクラスタリング、異種Omicsマルチモーダルデータにおけるモジュールの検出など、いくつかの実際の例で実証されています。

Estimating Causal Effects under Network Interference with Bayesian Generalized Propensity Scores
Bayes一般化傾向スコアによるネットワーク干渉下での因果効果の推定

Real-world systems are often comprised of interconnected units, and can be represented as networks, with nodes and edges. In a social system, for instance, individuals may have social ties and financial relationships. In these settings, when a node (the unit of analysis) is exposed to a treatment, its effects may spill over to connected units; then estimating both the direct effect of the treatment and its spillover effects presents several challenges. First, assumptions about the mechanism through which spillover effects occur along the observed network are required. Second, in observational studies, where the treatment assignment has not been randomized, confounding and homophily are further potential threats to the identification and to the estimation of causal effects on networks. Here, we make two structural assumptions: (i) neighborhood interference, which assumes interference operates only through a function of the immediate neighbors’ treatments, and (ii) unconfoundedness of the individual and neighborhood treatment, which rules out the presence of unmeasured confounding variables, including those driving homophily. Under these assumptions we develop a new covariate-adjustment estimator for direct treatment and spillover effects in observational studies on networks. We propose an estimation strategy based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors’ treatment. Adjustment for the propensity score is performed using a penalized spline regression. Our inference strategy capitalizes on a three-step Bayesian procedure, which allows us to account for the uncertainty in the propensity score estimation, and avoids model feedback. The correlation among connected units is taken into account using a community detection algorithm, and incorporating random effects in the outcome model. All these sources of variability, including variability of treatment assignment, are accounted for in the posterior distribution of the finite-sample causal estimands we target. We design a simulation study to assess the performance of the proposed estimator on different network topologies, both on synthetic networks and on a real friendship network from the Add-Health study.

現実世界のシステムは、多くの場合、相互接続されたユニットで構成されており、ノードとエッジを持つネットワークとして表すことができます。たとえば、社会システムでは、個人は社会的つながりや金銭的な関係を持っている可能性があります。これらの設定では、ノード(ユニット分析)が処理にさらされると、その影響が接続されたユニットに波及する可能性があります。その場合、処理の直接的な影響とその波及効果の両方を推定することは、いくつかの課題を伴います。まず、観測されたネットワークに沿って波及効果が発生するメカニズムに関する仮定が必要です。次に、処理の割り当てがランダム化されていない観察研究では、交絡と同質性が、ネットワークに対する因果効果の特定と推定に対するさらなる潜在的な脅威となります。ここでは、2つの構造的仮定を立てます。(i)近隣干渉。干渉は、すぐ近くの処理の機能を通じてのみ作用すると仮定します。(ii)個人と近隣の処理の非交絡性。同質性を促進するものを含む、測定されていない交絡変数の存在を排除します。これらの仮定の下で、ネットワークの観察研究における直接的な治療効果とスピルオーバー効果の新しい共変量調整推定量を開発しました。私たちは、さまざまなレベルの個人治療と近隣の人々の治療法への曝露の下で、ユニット間で個人と近隣の共変量のバランスをとる一般化傾向スコアに基づく推定戦略を提案しました。傾向スコアの調整は、ペナルティ付きスプライン回帰を使用して実行されます。私たちの推論戦略は、傾向スコア推定の不確実性を考慮し、モデルフィードバックを回避することができる3段階のベイズ手順を活用しています。接続されたユニット間の相関は、コミュニティ検出アルゴリズムを使用して考慮され、結果モデルにランダム効果が組み込まれています。治療割り当ての変動性を含むこれらすべての変動性の原因は、私たちが対象とする有限サンプルの因果推定量の事後分布で考慮されます。私たちは、合成ネットワークとAdd-Health研究からの実際の友人ネットワークの両方で、さまざまなネットワークトポロジで提案された推定量のパフォーマンスを評価するためのシミュレーション研究を設計しました。

ReservoirComputing.jl: An Efficient and Modular Library for Reservoir Computing Models
ReservoirComputing.jl: リザーバーコンピューティングモデルのための効率的なモジュール式ライブラリ

We introduce ReservoirComputing.jl, an open source Julia library for reservoir computing models. It is designed for temporal or sequential tasks such as time series prediction and modeling complex dynamical systems. As such it is suited to process a range of complex spatio-temporal data sets, from mathematical models to climate data. The key ideas of reservoir computing are the model architecture, i.e. the reservoir, which embeds the input into a higher dimensional space, and the learning paradigm, where only the readout layer is trained. As a result the computational resources can be kept low, and only linear optimization is required for the training. Although reservoir computing has proven itself as a successful machine learning algorithm, the software implementations have lagged behind, hindering wide recognition, reproducibility, and uptake by general scientists. ReservoirComputing.jl enhances this field by being intuitive, highly modular, and faster compared to alternative tools. A variety of modular components from the literature are implemented, e.g. two reservoir types (echo state networks and cellular automata models), and multiple training methods including Gaussian and support vector regression. Comprehensive documentation, which includes reproduced experiments from the literature, is provided. The code and documentation are hosted on GitHub under an MIT license: https://github.com/SciML/ReservoirComputing.jl.

私たちは、リザーバーコンピューティングモデル用のオープンソースJuliaライブラリであるReservoirComputing.jlを紹介します。これは、時系列予測や複雑な動的システムのモデリングなどの時間的またはシーケンシャルなタスク用に設計されています。そのため、数学モデルから気候データまで、さまざまな複雑な時空間データセットの処理に適しています。リザーバーコンピューティングの重要なアイデアは、モデルアーキテクチャ、つまり入力を高次元空間に埋め込むリザーバーと、読み取り層のみがトレーニングされる学習パラダイムです。その結果、計算リソースを低く抑えることができ、トレーニングには線形最適化のみが必要になります。リザーバーコンピューティングは優れた機械学習アルゴリズムであることが証明されていますが、ソフトウェア実装が遅れており、一般の科学者による幅広い認識、再現性、採用を妨げています。ReservoirComputing.jlは、直感的で高度にモジュール化されており、代替ツールと比較して高速であるため、この分野を強化します。文献からのさまざまなモジュールコンポーネントが実装されています。2つのリザーバータイプ(エコー状態ネットワークとセルラーオートマトンモデル)、およびガウス回帰やサポートベクター回帰などの複数のトレーニング方法。文献から再現された実験を含む包括的なドキュメントが提供されます。コードとドキュメントは、MITライセンスの下でGithubでホストされています(https://github.com/SciML/ReservoirComputing.jl)。
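
The two ideas named in the abstract (a fixed reservoir that embeds the input into a higher-dimensional space, and a readout trained by linear optimization alone) can be sketched in a few lines. The block below is a generic echo state network in Python/NumPy, not the ReservoirComputing.jl API; the function names, the reservoir size, the spectral radius of 0.9, and the ridge parameter are all illustrative assumptions.

```python
import numpy as np

def make_reservoir(n_res, spectral_radius, rng):
    """Draw fixed (untrained) input and recurrent weights."""
    W_in = rng.uniform(-0.5, 0.5, size=n_res)
    W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    # Rescale so the largest eigenvalue modulus equals spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(u, W_in, W):
    """Drive the reservoir with a scalar input sequence and collect states."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(u), W.shape[0]))
    for t, u_t in enumerate(u):
        x = np.tanh(W_in * u_t + W @ x)
        states[t] = x
    return states

def train_readout(states, targets, ridge):
    """Ridge regression: the only trained part of the model."""
    n_res = states.shape[1]
    A = states.T @ states + ridge * np.eye(n_res)
    return np.linalg.solve(A, states.T @ targets)

# Toy task (an assumption for illustration): one-step-ahead prediction of a sine wave.
rng = np.random.default_rng(0)
signal = np.sin(0.1 * np.arange(1001))
u, y = signal[:-1], signal[1:]

W_in, W = make_reservoir(100, 0.9, rng)
states = run_reservoir(u, W_in, W)

washout = 100  # discard the initial transient before fitting
W_out = train_readout(states[washout:], y[washout:], ridge=1e-6)
pred = states[washout:] @ W_out
mse = float(np.mean((pred - y[washout:]) ** 2))
print("training MSE:", mse)
```

Only `W_out` is learned, and it is obtained from a single linear solve, which is why the training cost stays low compared with backpropagating through the recurrent weights.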

Information-Theoretic Characterization of the Generalization Error for Iterative Semi-Supervised Learning
反復半教師あり学習のための一般化誤差の情報理論的特徴付け

Using information-theoretic principles, we consider the generalization error (gen-error) of iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data to progressively refine the model parameters. In contrast to most previous works that bound the gen-error, we provide an exact expression for the gen-error and particularize it to the binary Gaussian mixture model. Our theoretical results suggest that when the class conditional variances are not too large, the gen-error decreases with the number of iterations, but quickly saturates. On the flip side, if the class conditional variances (and so the amount of overlap between the classes) are large, the gen-error increases with the number of iterations. To mitigate this undesirable effect, we show that regularization can reduce the gen-error. The theoretical results are corroborated by extensive experiments on the MNIST and CIFAR datasets, in which we notice that for easy-to-distinguish classes, the gen-error improves after several pseudo-labelling iterations, but saturates afterwards, and for more difficult-to-distinguish classes, regularization improves the generalization performance.

情報理論の原理を用いて、大量のラベルなしデータに対して疑似ラベルを反復的に生成し、モデルパラメータを徐々に改良する反復半教師あり学習（SSL）アルゴリズムの一般化誤差（gen-error）について考察します。gen-errorを制限したこれまでのほとんどの研究とは対照的に、我々はgen-errorの正確な表現を提供し、それをバイナリガウス混合モデルに特化させる。我々の理論的結果は、クラス条件付き分散がそれほど大きくない場合、gen-errorは反復回数とともに減少するが、すぐに飽和することを示唆しています。逆に、クラス条件付き分散（およびクラス間の重複量）が大きい場合、gen-errorは反復回数とともに増加します。この望ましくない影響を軽減するために、我々は正規化によってgen-errorを減らすことができることを示す。理論的結果は、MNISTおよびCIFARデータセットでの広範な実験によって裏付けられており、区別しやすいクラスの場合、疑似ラベル付けの反復を数回行うとgen-errorは改善しますが、その後は飽和し、区別が難しいクラスの場合、正規化によって一般化のパフォーマンスが向上することがわかります。

Integral Autoencoder Network for Discretization-Invariant Learning
離散化不変量学習のための積分自己符号化器

Discretization invariant learning aims at learning in the infinite-dimensional function spaces with the capacity to process heterogeneous discrete representations of functions as inputs and/or outputs of a learning model. This paper proposes a novel deep learning framework based on integral autoencoders (IAE-Net) for discretization invariant learning. The basic building block of IAE-Net consists of an encoder and a decoder as integral transforms with data-driven kernels, and a fully connected neural network between the encoder and decoder. This basic building block is applied in parallel in a wide multi-channel structure, which is repeatedly composed to form a deep and densely connected neural network with skip connections as IAE-Net. IAE-Net is trained with randomized data augmentation that generates training data with heterogeneous structures to facilitate the performance of discretization invariant learning. The proposed IAE-Net is tested with various applications in predictive data science, solving forward and inverse problems in scientific computing, and signal/image processing. Compared with alternatives in the literature, IAE-Net achieves state-of-the-art performance in existing applications and creates a wide range of new applications where existing methods fail.

離散化不変学習は、学習モデルの入力および/または出力として関数の異種離散表現を処理する能力を備えた無限次元関数空間での学習を目的としています。この論文では、離散化不変学習のための積分オートエンコーダ(IAE-Net)に基づく新しいディープラーニングフレームワークを提案します。IAE-Netの基本的な構成要素は、データ駆動型カーネルを使用した積分変換としてのエンコーダとデコーダ、およびエンコーダとデコーダ間の完全に接続されたニューラルネットワークで構成されます。この基本的な構成要素は、幅広いマルチチャネル構造に並列に適用され、繰り返し構成されて、IAE-Netのようなスキップ接続を持つ深く密に接続されたニューラルネットワークを形成します。IAE-Netは、離散化不変学習のパフォーマンスを容易にするために、異種構造のトレーニングデータを生成するランダムデータ拡張を使用してトレーニングされます。提案されたIAE-Netは、予測データサイエンス、科学計算における順方向および逆方向問題の解決、および信号/画像処理のさまざまなアプリケーションでテストされています。文献に記載されている代替手段と比較すると、IAE-Netは既存のアプリケーションで最先端のパフォーマンスを実現し、既存の方法では実現できない幅広い新しいアプリケーションを生み出します。

Deepchecks: A Library for Testing and Validating Machine Learning Models and Data
Deepchecks: 機械学習モデルとデータをテストおよび検証するためのライブラリ

This paper presents Deepchecks, a Python library for comprehensively validating machine learning models and data. Our goal is to provide an easy-to-use library comprising many checks related to various issues, such as model predictive performance, data integrity, data distribution mismatches, and more. The package is distributed under the GNU Affero General Public License and relies on core libraries from the scientific Python ecosystem: scikit-learn, PyTorch, NumPy, pandas, and SciPy.

この論文では、機械学習モデルとデータを包括的に検証するためのPythonライブラリであるDeepchecksを紹介します。私たちの目標は、モデルの予測パフォーマンス、データの整合性、データ分布の不一致など、さまざまな問題に関連する多くのチェックで構成される使いやすいライブラリを提供することです。このパッケージはGNU Affero General Public Licenseの下で配布され、科学的なPythonエコシステムのコアライブラリ(scikit-learn、PyTorch、NumPy、pandas、SciPy)に依存しています。

Exact Partitioning of High-order Models with a Novel Convex Tensor Cone Relaxation
新しい凸テンソルコーン緩和による高次モデルの厳密分割

In this paper we propose an algorithm for exact partitioning of high-order models. We define a general class of $m$-degree Homogeneous Polynomial Models, which subsumes several examples motivated from prior literature. Exact partitioning can be formulated as a tensor optimization problem. We relax this high-order combinatorial problem to a convex conic form problem. To this end, we carefully define the Carathéodory symmetric tensor cone, and show its convexity, and the convexity of its dual cone. This allows us to construct a primal-dual certificate to show that the solution of the convex relaxation is correct (equal to the unobserved true group assignment) and to analyze the statistical upper bound of exact partitioning.

この論文では、高次モデルの正確な分割のためのアルゴリズムを提案します。私たちは、先行文献から動機付けられたいくつかの例を包含する$m$次数均次多項式モデルの一般クラスを定義します。厳密分割は、テンソル最適化問題として定式化できます。この高次組み合わせ問題を凸円錐形問題に緩和します。この目的のために、Carathéodory対称テンソル円錐を慎重に定義し、その凸性と双対円錐の凸性を示します。これにより、凸緩和の解が正しい(観測されていない真の群割り当てに等しい)ことを示し、厳密分割の統計的上限を分析するための主双対証明書を構築できます。

De-Sequentialized Monte Carlo: a parallel-in-time particle smoother
非逐次化モンテカルロ: 時間的に並列な粒子スムーザー

Particle smoothers are SMC (Sequential Monte Carlo) algorithms designed to approximate the joint distribution of the states given observations from a state-space model. We propose dSMC (de-Sequentialized Monte Carlo), a new particle smoother that is able to process $T$ observations in $\mathcal{O}(\log_2 T)$ time on parallel architectures. This compares favorably with standard particle smoothers, the complexity of which is linear in $T$. We derive $\mathcal{L}_p$ convergence results for dSMC, with an explicit upper bound, polynomial in $T$. We then discuss how to reduce the variance of the smoothing estimates computed by dSMC by (i) designing good proposal distributions for sampling the particles at the initialization of the algorithm, as well as by (ii) using lazy resampling to increase the number of particles used in dSMC. Finally, we design a particle Gibbs sampler based on dSMC, which is able to perform parameter inference in a state-space model at a $\mathcal{O}(\log_2 T)$ cost on parallel hardware.

粒子スムーザーは、状態空間モデルからの観測値に基づいて状態の結合分布を近似するように設計されたSMC (Sequential Monte Carlo)アルゴリズムです。並列アーキテクチャで$T$個の観測値を$\mathcal{O}(\log_2 T)$時間で処理できる新しい粒子スムーザーであるdSMC (de-Sequentialized Monte Carlo)を提案します。これは、複雑さが$T$に線形である標準的な粒子スムーザーと比べても遜色ありません。明示的な上限、$T$の多項式を使用して、dSMCの$\mathcal{L}_p$収束結果を導出します。次に、(i)アルゴリズムの初期化時に粒子をサンプリングするための適切な提案分布を設計し、(ii)遅延再サンプリングを使用してdSMCで使用される粒子の数を増やすことで、dSMCによって計算されるスムージング推定値の分散を減らす方法について説明します。最後に、並列ハードウェア上で$\mathcal{O}(\log_2 T)$コストで状態空間モデルのパラメータ推論を実行できるdSMCに基づく粒子ギブスサンプラーを設計します。

On the Convergence Rates of Policy Gradient Methods
方策勾配法の収束率について

We consider infinite-horizon discounted Markov decision problems with finite state and action spaces and study the convergence rates of the projected policy gradient method and a general class of policy mirror descent methods, all with direct parametrization in the policy space. First, we develop a theory of weak gradient-mapping dominance and use it to prove sharp sublinear convergence rate of the projected policy gradient method. Then we show that with geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.

私たちは、有限状態空間と有限行動空間を持つ無限地平線割引マルコフ決定問題を検討し、射影方策勾配法と一般的なクラスの方策ミラー降下法の収束率を研究します。いずれも方策空間での直接パラメータ化を用います。まず、弱い勾配マッピング優勢の理論を開発し、それを使用して、射影方策勾配法のシャープな劣線形収束率を証明します。次に、幾何学的にステップサイズが増加すると、自然方策勾配法や射影Q降下法などの一般的なクラスの方策ミラー降下法が、エントロピーやその他の強凸正則化に頼らずに線形収束率を享受することを示します。最後に、不正確な方策ミラー降下法の収束率も解析し、単純な生成モデルの下でそのサンプルの複雑さを推定します。

Nonparametric adaptive control and prediction: theory and randomized algorithms
ノンパラメトリック適応制御と予測:理論とランダム化アルゴリズム

A key assumption in the theory of nonlinear adaptive control is that the uncertainty of the system can be expressed in the linear span of a set of known basis functions. While this assumption leads to efficient algorithms, it limits applications to very specific classes of systems. We introduce a novel nonparametric adaptive algorithm that estimates an infinite-dimensional density over parameters online to learn an unknown dynamics in a reproducing kernel Hilbert space. Surprisingly, the resulting control input admits an analytical expression that enables its implementation despite its underlying infinite-dimensional structure. While this adaptive input is rich and expressive — subsuming, for example, traditional linear parameterizations — its computational complexity grows linearly with time, making it comparatively more expensive than its parametric counterparts. Leveraging the theory of random Fourier features, we provide an efficient randomized implementation that recovers the complexity of classical parametric methods while provably retaining the expressivity of the nonparametric input. In particular, our explicit bounds only depend polynomially on the underlying parameters of the system, allowing our proposed algorithms to efficiently scale to high-dimensional systems. As an illustration of the method, we demonstrate the ability of the randomized approximation algorithm to learn a predictive model of a 60-dimensional system consisting of ten point masses interacting through Newtonian gravitation. By reinterpretation as a gradient flow on a specific loss, we conclude with a natural extension of our kernel-based adaptive algorithms to deep neural networks. We show empirically that the extra expressivity afforded by deep representations can lead to improved performance at the expense of the closed-loop stability that is rigorously guaranteed and consistently observed for kernel machines.

非線形適応制御の理論における重要な仮定は、システムの不確実性が既知の基底関数の集合の線形範囲で表現できるというものです。この仮定は効率的なアルゴリズムにつながりますが、非常に特定のクラスのシステムへの適用を制限します。私たちは、オンラインでパラメータの無限次元密度を推定し、再生カーネルヒルベルト空間における未知のダイナミクスを学習する、新しいノンパラメトリック適応アルゴリズムを紹介します。驚くべきことに、結果として得られる制御入力は、その根底にある無限次元構造にもかかわらず、実装を可能にする解析表現を受け入れます。この適応入力は豊かで表現力豊かですが(たとえば、従来の線形パラメータ化を包含)、その計算の複雑さは時間とともに線形に増大し、パラメトリックな対応物よりも比較的高価になります。ランダムフーリエ特徴の理論を活用して、私たちは、ノンパラメトリック入力の表現力を証明可能に保持しながら、古典的なパラメトリック法の複雑さを回復する、効率的なランダム化実装を提供します。特に、明示的な境界はシステムの基礎パラメータに多項式的にのみ依存するため、提案アルゴリズムは高次元システムに効率的に拡張できます。この方法の例として、ランダム化近似アルゴリズムが、ニュートン重力によって相互作用する10個の質点からなる60次元システムの予測モデルを学習できることを示します。特定の損失に対する勾配フローとして再解釈することで、カーネルベースの適応アルゴリズムをディープニューラルネットワークに自然に拡張できます。ディープ表現によって得られる追加の表現力により、カーネルマシンで厳密に保証され、一貫して観察される閉ループ安定性を犠牲にして、パフォーマンスが向上する可能性があることを経験的に示します。
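
The randomized implementation mentioned in the abstract builds on random Fourier features. As a rough illustration of that ingredient only (not the paper's adaptive control algorithm), the sketch below uses the standard construction for a Gaussian kernel: frequencies drawn from a normal distribution, uniform phases, and cosine features whose inner products approximate the kernel. The function names, the bandwidth `sigma`, and the feature count are assumptions chosen for the example.

```python
import numpy as np

def sample_rff_params(dim, n_features, sigma, rng):
    """Sample frequencies and phases for a Gaussian kernel of bandwidth sigma."""
    W = rng.normal(0.0, 1.0 / sigma, size=(n_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b

def rff_features(X, W, b):
    """Map rows of X to random features z(x) with z(x) . z(y) ~ k(x, y)."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

def gaussian_kernel(X, Y, sigma):
    """Exact kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W, b = sample_rff_params(dim=3, n_features=5000, sigma=1.0, rng=rng)

Z = rff_features(X, W, b)
K_approx = Z @ Z.T
K_exact = gaussian_kernel(X, X, sigma=1.0)
max_err = float(np.max(np.abs(K_approx - K_exact)))
print("max abs kernel error:", max_err)
```

The point of the construction is that a model linear in `Z` has a fixed number of parameters, so its cost does not grow with the number of observations the way a kernel expansion does.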

Generalized Resubstitution for Classification Error Estimation
分類誤差推定のための一般化再代入

We propose the family of generalized resubstitution classifier error estimators based on arbitrary empirical probability measures. These error estimators are computationally efficient and do not require retraining of classifiers. The plain resubstitution error estimator corresponds to choosing the standard empirical probability measure. Other choices of empirical probability measure lead to bolstered, posterior-probability, Gaussian-process, and Bayesian error estimators; in addition, we propose here bolstered posterior-probability error estimators, as a new family of generalized resubstitution estimators. In the two-class case, we show that a generalized resubstitution estimator is consistent and asymptotically unbiased, regardless of the distribution of the features and label, if the corresponding empirical probability measure converges uniformly to the standard empirical probability measure and the classification rule has finite VC dimension. A generalized resubstitution estimator typically has hyperparameters that can be tuned to control its bias and variance, which adds flexibility. We conducted extensive numerical experiments with various classification rules trained on synthetic data, which indicate that the new family of error estimators proposed here produces the best results overall, except in the case of very complex, overfitting classifiers, in which semi-bolstered resubstitution should be used instead. In addition, results of an image classification experiment using the LeNet-5 convolutional neural network and the MNIST data set show that naive-Bayes bolstered resubstitution with a simple data-driven calibration procedure produces excellent results, demonstrating the potential of this class of error estimators in deep learning for computer vision.

私たちは、任意の経験確率測度に基づく一般化再代入分類器誤差推定量のファミリーを提案します。これらの誤差推定量は計算効率が良く、分類器の再学習を必要としません。単純な再代入誤差推定量は、標準的な経験確率測度を選択することに相当します。経験確率測度の他の選択は、ボルスター型(bolstered)、事後確率、ガウス過程、およびベイズ誤差推定量につながります。さらに、ここでは、一般化再代入推定量の新しいファミリーとして、ボルスター型事後確率誤差推定量を提案します。2クラスの場合、対応する経験確率測度が標準的な経験確率測度に一様収束し、分類ルールが有限のVC次元を持つならば、特徴とラベルの分布に関係なく、一般化再代入推定量は一致性を持ち、漸近的に不偏であることを示します。一般化再代入推定量には通常、バイアスと分散を制御するために調整可能なハイパーパラメータがあり、これが柔軟性を高めます。合成データで学習したさまざまな分類ルールを用いて広範な数値実験を実施したところ、ここで提案する新しい誤差推定量のファミリーは、非常に複雑で過剰適合する分類器の場合を除き、全体として最良の結果を生み出すことが示されました。そのような分類器の場合は、代わりにセミボルスター型(semi-bolstered)再代入を使用すべきです。さらに、LeNet-5畳み込みニューラルネットワークとMNISTデータセットを使用した画像分類実験の結果は、単純なデータ駆動型キャリブレーション手順を伴うナイーブベイズ・ボルスター型再代入が優れた結果を生み出すことを示しており、コンピュータービジョン向けディープラーニングにおけるこのクラスの誤差推定量の可能性を実証しています。

Convergence Guarantees for the Good-Turing Estimator
グッドチューリング推定量の収束保証

Consider a finite sample from an unknown distribution over a countable alphabet. The occupancy probability (OP) refers to the total probability of symbols that appear exactly k times in the sample. Estimating the OP is a basic problem in large alphabet modeling, with a variety of applications in machine learning, statistics and information theory. The Good-Turing (GT) framework is perhaps the most popular OP estimation scheme. Classical results show that the GT estimator converges to the OP, for every k independently. In this work we introduce new exact convergence guarantees for the GT estimator, based on worst-case mean squared error analysis. Our scheme improves upon currently known results. Further, we introduce a novel simultaneous convergence rate, for any desired set of occupancy probabilities. This allows us to quantify the unified performance of OP estimators, and introduce a novel estimation framework with favorable convergence guarantees.

可算アルファベット上の未知の分布からの有限サンプルを考えます。占有確率(OP)とは、サンプル中にちょうどk回出現するシンボルの確率の総和を指します。OPの推定は、大規模アルファベットモデリングにおける基本的な問題であり、機械学習、統計学、情報理論にさまざまな応用があります。Good-Turing(GT)フレームワークは、おそらく最もよく使われているOP推定の枠組みです。古典的な結果は、GT推定量が各kについて個別にOPに収束することを示しています。本研究では、最悪ケースの平均二乗誤差の分析に基づく、GT推定量の新しい厳密な収束保証を導入します。私たちの手法は、現在知られている結果を改善します。さらに、任意の所望の占有確率の集合に対する、新しい同時収束率を導入します。これにより、OP推定量の統合的な性能を定量化し、良好な収束保証を備えた新しい推定フレームワークを導入することができます。
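To make the quantities in this abstract concrete, here is a minimal sketch of the classical Good-Turing estimate of the occupancy probability, $\hat{M}_k = (k+1) N_{k+1} / n$, where $N_{k+1}$ is the number of distinct symbols seen exactly $k+1$ times in a sample of size $n$. This is the textbook estimator, not the new framework the paper introduces. The helper `occupancy_probability`, which needs the true distribution, is a hypothetical reference function added only for comparison.

```python
from collections import Counter


def good_turing_estimate(sample, k):
    """Classical Good-Turing estimate of the total probability of
    symbols that appear exactly k times in the sample."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    symbol_counts = Counter(sample)
    # freq_of_freq[r] = number of distinct symbols observed exactly r times.
    freq_of_freq = Counter(symbol_counts.values())
    return (k + 1) * freq_of_freq.get(k + 1, 0) / n


def occupancy_probability(sample, k, true_probs):
    """True occupancy probability; only computable when the distribution is known."""
    symbol_counts = Counter(sample)
    return sum(p for s, p in true_probs.items() if symbol_counts.get(s, 0) == k)


if __name__ == "__main__":
    sample = list("aabbbc")
    true_probs = {"a": 0.3, "b": 0.4, "c": 0.2, "d": 0.1}
    for k in range(3):
        print(k, good_turing_estimate(sample, k), occupancy_probability(sample, k, true_probs))
```

For k = 0 the estimate is the familiar "missing mass" rule: the fraction of the sample made up of symbols seen exactly once.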

Jump Gaussian Process Model for Estimating Piecewise Continuous Regression Functions
区分的連続回帰関数を推定するためのジャンプガウス過程モデル

This paper presents a Gaussian process (GP) model for estimating piecewise continuous regression functions. In many scientific and engineering applications of regression analysis, the underlying regression functions are often piecewise continuous in that data follow different continuous regression models for different input regions with discontinuities across regions. However, many conventional GP regression approaches are not designed for piecewise regression analysis. There are piecewise GP models to use explicit domain partitioning and pose independent GP models over partitioned regions. They are not flexible enough to model real datasets where data domains are divided by complex and curvy jump boundaries. We propose a new GP modeling approach to estimate an unknown piecewise continuous regression function. The new GP model seeks a local GP estimate of an unknown regression function at each test location, using local data neighboring the test location. Considering the possibilities of the local data being from different regions, the proposed approach partitions the local data into pieces by a local data partitioning function. It uses only the local data likely from the same region as the test location for the regression estimate. Since we do not know which local data points come from the relevant region, we propose a data-driven approach to split and subset local data by a local partitioning function. We discuss several modeling choices of the local data partitioning function, including a locally linear function and a locally polynomial function. We also investigate an optimization problem to jointly optimize the partitioning function and other covariance parameters using a likelihood maximization criterion. Several advantages of using the proposed approach over the conventional GP and piecewise GP modeling approaches are shown by various simulated experiments and real data studies.

この論文では、区分連続回帰関数を推定するためのガウス過程(GP)モデルを紹介します。回帰分析の多くの科学的および工学的応用では、基礎となる回帰関数は区分連続であることが多く、データは入力領域ごとに異なる連続回帰モデルに従い、領域の境界をまたいで不連続になります。しかし、従来のGP回帰アプローチの多くは、区分回帰分析用に設計されていません。明示的なドメイン分割を使用し、分割された各領域に独立したGPモデルを置く区分GPモデルも存在します。しかしこれらは、データ領域が複雑で曲がったジャンプ境界によって分割されている実際のデータセットをモデル化できるほど柔軟ではありません。私たちは、未知の区分連続回帰関数を推定するための新しいGPモデリングアプローチを提案します。新しいGPモデルは、テスト位置の近傍のローカルデータを使用して、各テスト位置で未知の回帰関数のローカルGP推定値を求めます。ローカルデータが異なる領域に由来する可能性を考慮して、提案アプローチは、ローカルデータ分割関数によってローカルデータをいくつかの部分に分割します。回帰推定には、テスト位置と同じ領域に由来する可能性が高いローカルデータのみを使用します。どのローカルデータ点が該当する領域に由来するかは分からないため、ローカル分割関数によってローカルデータを分割して部分集合を選ぶデータ駆動型アプローチを提案します。ローカルデータ分割関数のモデリング上の選択肢として、局所線形関数や局所多項式関数などをいくつか議論します。また、尤度最大化基準を使用して分割関数とその他の共分散パラメータを同時に最適化する最適化問題も調査します。従来のGPおよび区分GPモデリングアプローチに対する提案アプローチのいくつかの利点が、さまざまなシミュレーション実験と実データ研究によって示されています。

Nonstochastic Bandits with Composite Anonymous Feedback
複合匿名フィードバックを伴う非確率的バンディット

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses as a special case the easier task of bandits with delayed feedback, a well-studied framework where the player observes the delayed losses individually. Our first contribution is a general reduction transforming a standard bandit algorithm into one that can operate in the harder setting: We bound the regret of the transformed algorithm in terms of the stability and regret of the original algorithm. Then, we show that the transformation of a suitably tuned FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where $d$ is the maximum delay, $K$ is the number of arms, and $T$ is the time horizon. Finally, we show that our results cannot be improved in general by exhibiting a matching (up to a log factor) lower bound on the regret of any algorithm operating in this setting.

私たちは、アクションの損失がプレイヤーに即座に課されるのではなく、敵対的な方法で後続のラウンドに分散される非確率的バンディット設定を調査します。このとき、各ラウンドの終わりにプレイヤーが観察する瞬間的な損失は、以前にプレイしたアクションの多数の損失成分の合計になります。この設定は、より簡単なタスクである遅延フィードバック付きバンディットを特殊ケースとして含みます。遅延フィードバック付きバンディットは、プレイヤーが遅延した損失を個別に観察する、よく研究された枠組みです。私たちの最初の貢献は、標準的なバンディットアルゴリズムを、このより困難な設定で動作できるアルゴリズムに変換する一般的な帰着(reduction)です。変換されたアルゴリズムのリグレットを、元のアルゴリズムの安定性とリグレットを用いて上から抑えます。次に、適切に調整されたTsallisエントロピー付きFTRLを変換したアルゴリズムが、$\sqrt{(d+1)KT}$のオーダーのリグレットを持つことを示します。ここで、$d$は最大遅延、$K$はアームの数、$T$は時間ホライズンです。最後に、この設定で動作する任意のアルゴリズムのリグレットに対して、(対数因子を除いて)一致する下界を示すことにより、私たちの結果が一般には改善できないことを示します。

Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons
深層ネットワーク近似:固定数のニューロンで任意の精度を達成

This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple, computable, and continuous activation function $\sigma$ leveraging a triangular-wave function and the softsign function. We first prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, we show that classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$ when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the rectified linear unit (ReLU) activation function by ours would improve the experiment results.

この論文では、固定された有限個のニューロンで、すべての連続関数に対する普遍近似特性を実現するシンプルなフィードフォワードニューラルネットワークを開発します。これらのニューラルネットワークがシンプルなのは、三角波関数とソフトサイン関数を活用した、シンプルで計算可能な連続活性化関数$\sigma$で設計されているためです。まず、幅$36d(2d+1)$、深さ$11$の$\sigma$活性化ネットワークが、$d$次元超立方体上の任意の連続関数を任意に小さい誤差で近似できることを証明します。したがって、教師あり学習とそれに関連する回帰問題では、サイズが$36d(2d+1)\times 11$以上であるこれらのネットワークによって生成される仮説空間は、連続関数空間$C([a,b]^d)$で稠密であり、したがって$p\in [1,\infty)$に対するルベーグ空間$L^p([a,b]^d)$でも稠密です。さらに、同じクラスのサンプルが同じ部分集合に位置するような、$\mathbb{R}^d$の互いに素な有界閉部分集合が存在する場合、画像と信号の分類から生じる分類関数は、幅$36d(2d+1)$、深さ$12$の$\sigma$活性化ネットワークによって生成される仮説空間に含まれることを示します。最後に、数値実験を用いて、ReLU (Rectified Linear Unit)活性化関数を私たちのものに置き換えると実験結果が改善されることを示します。

Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms
連続時間と空間における方策勾配とアクター・クリティック学習:理論とアルゴリズム

We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integration of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated using stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples.

私たちは、Wangら(2020)によって開発された正則化付きの探索的定式化の下で、連続時間と空間における強化学習のための方策勾配(PG)を研究します。与えられたパラメータ化された確率的方策に関する価値関数の勾配を、サンプルと現在の価値関数を使用して評価できる補助的なランニング報酬関数の積分の期待値として表します。この表現は、PGを実質的に方策評価(PE)問題に変換し、JiaとZhou (2022a)によってPE向けに最近開発されたマルチンゲールアプローチを適用してPG問題を解くことを可能にします。この分析に基づいて、RL用の2種類のアクター・クリティックアルゴリズムを提案します。これらのアルゴリズムでは、価値関数と方策を同時に、かつ交互に学習および更新します。1つ目のタイプは、前述の表現に直接基づいており、将来の軌跡を必要とするオフライン型です。オンライン学習用に設計された2つ目のタイプは、方策勾配の1次条件を利用し、それをマルチンゲール直交条件に変換します。これらの条件は、方策の更新時に確率的近似を使用して組み込まれます。最後に、2つの具体的な例でシミュレーションによってアルゴリズムを実証します。

CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms
CleanRL:深層強化学習アルゴリズムの高品質な単一ファイル実装

CleanRL is an open-source library that provides high-quality single-file implementations of Deep Reinforcement Learning (DRL) algorithms. These single-file implementations are self-contained algorithm variant files such as dqn.py, ppo.py, and ppo_atari.py that individually include all algorithm variant’s implementation details. Such a paradigm significantly reduces the complexity and the lines of code (LOC) in each implemented variant, which makes them quicker and easier to understand. This paradigm gives the researchers the most fine-grained control over all aspects of the algorithm in a single file, allowing them to prototype novel features quickly. Despite having succinct implementations, CleanRL’s codebase is thoroughly documented and benchmarked to ensure performance is on par with reputable sources. As a result, CleanRL produces a repository tailor-fit for two purposes: 1) understanding all implementation details of DRL algorithms and 2) quickly prototyping novel features. CleanRL’s source code can be found at https://github.com/vwxyzjn/cleanrl.

CleanRLは、ディープ強化学習(DRL)アルゴリズムの高品質な単一ファイル実装を提供するオープンソースライブラリです。これらの単一ファイル実装は、dqn.py、ppo.py、ppo_atari.pyなどの自己完結型のアルゴリズムバリアントファイルで、アルゴリズムバリアントの実装の詳細がすべて個別に含まれています。このようなパラダイムにより、実装された各バリアントの複雑さとコード行数(LOC)が大幅に削減され、理解が迅速化されます。このパラダイムにより、研究者は単一ファイルでアルゴリズムのすべての側面を最もきめ細かく制御できるため、新しい機能を迅速にプロトタイプ化できます。簡潔な実装にもかかわらず、CleanRLのコードベースは徹底的に文書化され、ベンチマークされているため、パフォーマンスが信頼できるソースと同等であることが保証されています。その結果、CleanRLは、1) DRLアルゴリズムのすべての実装の詳細を理解することと、2)新しい機能を迅速にプロトタイプ化することという2つの目的にぴったりのリポジトリを生成します。CleanRLのソースコードはhttps://github.com/vwxyzjn/cleanrlにあります。

The Weighted Generalised Covariance Measure
重み付き一般化共分散測度

We introduce a new test for conditional independence which is based on what we call the weighted generalised covariance measure (WGCM). It is an extension of the recently introduced generalised covariance measure (GCM). To test the null hypothesis of $X$ and $Y$ being conditionally independent given $Z$, our test statistic is a weighted form of the sample covariance between the residuals of nonlinearly regressing $X$ and $Y$ on $Z$. We propose different variants of the test for both univariate and multivariate $X$ and $Y$. We give conditions under which the tests yield the correct type I error rate. Finally, we compare our novel tests to the original GCM using simulation and on real data sets. Typically, our tests have power against a wider class of alternatives compared to the GCM. This comes at the cost of having less power against alternatives for which the GCM already works well. In the special case of binary or categorical $X$ and $Y$, one of our tests has power against all alternatives.

私たちは、重み付き一般化共分散測度(WGCM)と呼ぶものに基づく、条件付き独立性の新しい検定を導入します。これは、最近導入された一般化共分散測度(GCM)の拡張です。$Z$が与えられたもとで$X$と$Y$が条件付き独立であるという帰無仮説を検定するために、私たちの検定統計量は、$X$と$Y$をそれぞれ$Z$に非線形回帰したときの残差間の標本共分散を重み付けした形をとります。私たちは、単変量と多変量の両方の$X$と$Y$に対して、この検定のさまざまな変種を提案します。また、検定が正しい第一種の過誤率をもたらす条件を示します。最後に、シミュレーションと実データセットを使用して、この新しい検定を元のGCMと比較します。通常、私たちの検定は、GCMと比較して、より広いクラスの対立仮説に対して検出力を持ちます。これは、GCMが既にうまく機能している対立仮説に対しては検出力が低くなるという代償を伴います。$X$と$Y$が二値またはカテゴリ変数である特殊なケースでは、私たちの検定の1つはすべての対立仮説に対して検出力を持ちます。
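The test statistic described above can be sketched for the univariate case as a normalized, weighted mean of residual products. The sketch assumes the regression residuals of $X$ on $Z$ and $Y$ on $Z$ have already been computed by some regression method, and that the weights are fixed values $w(z_i)$ supplied by the caller. How the paper chooses or estimates its weight functions is not stated in the abstract, so `weights` here is purely illustrative. With all weights equal to one, the statistic reduces to the unweighted GCM form.

```python
import math


def weighted_gcm_statistic(res_x, res_y, weights=None):
    """Normalized weighted covariance of residuals (univariate sketch).

    res_x, res_y: residuals from regressing X on Z and Y on Z (assumed given).
    weights: hypothetical fixed weights w(z_i); None means the unweighted GCM form.
    """
    n = len(res_x)
    if n == 0 or len(res_y) != n:
        raise ValueError("residual lists must be non-empty and of equal length")
    if weights is None:
        weights = [1.0] * n
    products = [w * rx * ry for w, rx, ry in zip(weights, res_x, res_y)]
    mean = sum(products) / n
    variance = sum(p * p for p in products) / n - mean ** 2
    if variance <= 0.0:
        raise ValueError("residual products have zero variance")
    return math.sqrt(n) * mean / math.sqrt(variance)


def two_sided_p_value(statistic):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(statistic) / math.sqrt(2.0))


if __name__ == "__main__":
    res_x = [1.0, -1.0, 1.0, -1.0]
    res_y = [1.0, 1.0, -1.0, -1.0]
    print(weighted_gcm_statistic(res_x, res_y))                         # unweighted
    print(weighted_gcm_statistic(res_x, res_y, [2.0, -1.0, -1.0, 1.0]))  # weighted
```

In the toy data the unweighted residual products average to zero, while a sign-changing weight exposes a dependence. That is the kind of alternative the abstract says the weighted test can detect and the plain GCM cannot.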

Communication-Constrained Distributed Quantile Regression with Optimal Statistical Guarantees
最適な統計的保証を備えた通信制約付き分散分位点回帰

We address the problem of how to achieve optimal inference in distributed quantile regression without stringent scaling conditions. This is challenging due to the non-smooth nature of the quantile regression (QR) loss function, which invalidates the use of existing methodology. The difficulties are resolved through a double-smoothing approach that is applied to the local (at each data source) and global objective functions. Despite the reliance on a delicate combination of local and global smoothing parameters, the quantile regression model is fully parametric, thereby facilitating interpretation. In the low-dimensional regime, we establish a finite-sample theoretical framework for the sequentially defined distributed QR estimators. This reveals a trade-off between the communication cost and statistical error. We further discuss and compare several alternative confidence set constructions, based on inversion of Wald and score-type tests and resampling techniques, detailing an improvement that is effective for more extreme quantile coefficients. In high dimensions, a sparse framework is adopted, where the proposed doubly-smoothed objective function is complemented with an $\ell_1$-penalty. We show that the corresponding distributed penalized QR estimator achieves the global convergence rate after a near-constant number of communication rounds. A thorough simulation study further elucidates our findings.

私たちは、厳しいスケーリング条件なしに、分散分位点回帰で最適な推論を達成する方法という問題に取り組みます。これは、分位点回帰(QR)損失関数が滑らかでなく、既存の方法論が使えなくなるために困難です。この困難は、ローカル(各データソースごと)の目的関数とグローバルな目的関数の両方に適用される二重平滑化アプローチによって解決されます。ローカルおよびグローバルの平滑化パラメータの繊細な組み合わせに依存しているにもかかわらず、分位点回帰モデルは完全にパラメトリックであるため、解釈が容易です。低次元の領域では、逐次的に定義される分散QR推定量に対する有限サンプルの理論的枠組みを確立します。これにより、通信コストと統計的誤差のトレードオフが明らかになります。さらに、ワルド型検定とスコア型検定の反転、およびリサンプリング手法に基づくいくつかの信頼集合の構成法について議論・比較し、より極端な分位点の係数に対して有効な改良を詳しく説明します。高次元では、提案する二重平滑化目的関数に$\ell_1$ペナルティを加えたスパースな枠組みを採用します。対応する分散ペナルティ付きQR推定量が、ほぼ一定数の通信ラウンドの後に大域的な収束率を達成することを示します。徹底的なシミュレーション研究により、これらの知見をさらに明確にします。
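The non-smoothness mentioned in this abstract comes from the kink of the quantile check loss $\rho_\tau(u) = u(\tau - \mathbf{1}\{u < 0\})$ at zero. The sketch below shows one common way to smooth it: convolving the loss with a Gaussian kernel of bandwidth $h$, which has a closed form. The Gaussian kernel and the function names are assumptions for illustration. The abstract does not specify which kernel or bandwidths the paper's double-smoothing scheme uses.

```python
import math


def check_loss(u, tau):
    """Standard quantile (pinball) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def gaussian_smoothed_check_loss(u, tau, h):
    """Check loss convolved with a N(0, h^2) kernel (assumed kernel choice).

    Uses rho_tau(v) = |v| / 2 + (tau - 1/2) * v and the closed form
    E|u + h*G| = h * sqrt(2/pi) * exp(-u^2 / (2 h^2)) + u * erf(u / (h * sqrt(2))).
    """
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    mean_abs = (
        h * math.sqrt(2.0 / math.pi) * math.exp(-u * u / (2.0 * h * h))
        + u * math.erf(u / (h * math.sqrt(2.0)))
    )
    return 0.5 * mean_abs + (tau - 0.5) * u


if __name__ == "__main__":
    tau = 0.3
    for u in (-2.0, -0.1, 0.0, 0.1, 2.0):
        print(u, check_loss(u, tau), gaussian_smoothed_check_loss(u, tau, h=0.5))
```

The smoothed loss is differentiable everywhere and approaches the check loss as the bandwidth goes to zero, so gradient-based local and global updates become available at the price of a bandwidth-dependent bias.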

Fast Stagewise Sparse Factor Regression
高速ステージワイズスパース因子回帰

Sparse factorization of a large matrix is fundamental in modern statistical learning. In particular, the sparse singular value decomposition has been utilized in many multivariate regression methods. The appeal of this factorization is owing to its power in discovering a highly-interpretable latent association network. However, many existing methods are either ad hoc without a general performance guarantee, or are computationally intensive. We formulate the statistical problem as a sparse factor regression and tackle it with a two-stage “deflation + stagewise learning” approach. In the first stage, we consider both sequential and parallel approaches for simplifying the task into a set of co-sparse unit-rank estimation (CURE) problems, and establish the statistical underpinnings of these commonly-adopted and yet poorly understood deflation methods. In the second stage, we innovate a contended stagewise learning technique, consisting of a sequence of simple incremental updates, to efficiently trace out the whole solution paths of CURE. Our algorithm achieves a much lower computational complexity than alternating convex search, and it enables a flexible and principled tradeoff between statistical accuracy and computational efficiency. Our work is among the first to enable stagewise learning for non-convex problems, and the idea can be applicable in many multi-convex problems. Extensive simulation studies and an application in genetics demonstrate the effectiveness and scalability of our approach.

大規模行列のスパース分解は、現代の統計的学習の基本です。特に、スパース特異値分解は、多くの多変量回帰法で利用されてきました。この分解の魅力は、解釈性の高い潜在的な関連ネットワークを発見する力にあります。しかし、既存の方法の多くは、一般的な性能保証のないアドホックなものか、計算負荷が高いものかのいずれかです。私たちは、この統計問題をスパース因子回帰として定式化し、2段階の「デフレーション+段階的学習」アプローチで取り組みます。第1段階では、タスクを一連の共スパース単位ランク推定(CURE)問題に単純化するための逐次的アプローチと並列アプローチの両方を検討し、広く採用されているにもかかわらず十分に理解されていないこれらのデフレーション法の統計的基礎を確立します。第2段階では、一連の単純な増分更新からなる「contended」段階的学習手法を考案し、CUREの解パス全体を効率的にたどります。私たちのアルゴリズムは、交互凸探索よりもはるかに低い計算量を実現し、統計的精度と計算効率の間の柔軟で原理的なトレードオフを可能にします。私たちの研究は、非凸問題に対する段階的学習を可能にした最初の研究の1つであり、そのアイデアは多くの多重凸問題に適用できる可能性があります。広範なシミュレーション研究と遺伝学への応用により、私たちのアプローチの有効性とスケーラビリティが実証されています。

Minimax Mixing Time of the Metropolis-Adjusted Langevin Algorithm for Log-Concave Sampling
対数凹サンプリングのためのメトロポリス調整ランジュバンアルゴリズムのミニマックス混合時間

We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. We establish its optimal minimax mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $\kappa$, we show that MALA with a warm start mixes in $\tilde O(\kappa \sqrt{d})$ iterations up to logarithmic factors. This improves upon the previous work on the dependency of either the condition number $\kappa$ or the dimension $d$. Our proof relies on comparing the leapfrog integrator with the continuous Hamiltonian dynamics, where we establish a new concentration bound for the acceptance rate. Second, we prove a spectral gap based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound result to construct a hard distribution for which MALA requires at least $\tilde\Omega(\kappa \sqrt{d})$ steps to mix. The lower bound for MALA matches our upper bound in terms of condition number and dimension. Finally, numerical experiments are included to validate our theoretical results.

私たちは、対数平滑かつ強対数凹な分布からのサンプリングに対するメトロポリス調整ランジュバンアルゴリズム(MALA)の混合時間を調べます。ウォームスタートの下での最適なミニマックス混合時間を確立します。私たちの主な貢献は2つあります。まず、条件数$\kappa$を持つ$d$次元対数凹密度に対して、ウォームスタートのMALAが、対数因子を除いて$\tilde O(\kappa \sqrt{d})$回の反復で混合することを示します。これは、条件数$\kappa$または次元$d$のいずれかへの依存性に関して、先行研究を改善するものです。私たちの証明は、リープフロッグ積分器と連続時間のハミルトン力学との比較に基づいており、そこで採択率に対する新しい集中限界を確立します。次に、一般状態空間上の可逆MCMCアルゴリズムに対して、スペクトルギャップに基づく混合時間の下界を証明します。この下界の結果を適用して、MALAが混合するために少なくとも$\tilde\Omega(\kappa \sqrt{d})$ステップを必要とする困難な分布を構成します。MALAのこの下界は、条件数と次元に関して私たちの上界と一致します。最後に、理論結果を検証するための数値実験を示します。
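For readers unfamiliar with the algorithm whose mixing time is analysed above, the following is a minimal sketch of one standard form of MALA: a Langevin proposal $x' = x + h \nabla \log \pi(x) + \sqrt{2h}\,\xi$ followed by a Metropolis-Hastings accept/reject step. The standard Gaussian target, the step size, and all function names are assumptions chosen for illustration. It makes no attempt to reproduce the paper's step-size choice or warm-start condition.

```python
import math
import random


def mala_step(x, log_density, grad_log_density, step, rng):
    """One MALA transition: Langevin proposal plus Metropolis-Hastings correction."""
    grad_x = grad_log_density(x)
    proposal = [
        xi + step * gi + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        for xi, gi in zip(x, grad_x)
    ]
    grad_prop = grad_log_density(proposal)

    def log_q(target, source, grad_source):
        # Log proposal density q(target | source), up to an additive constant.
        return -sum(
            (t - s - step * g) ** 2 for t, s, g in zip(target, source, grad_source)
        ) / (4.0 * step)

    log_alpha = (
        log_density(proposal) + log_q(x, proposal, grad_prop)
        - log_density(x) - log_q(proposal, x, grad_x)
    )
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal, True
    return x, False


def run_mala(x0, log_density, grad_log_density, step, num_steps, seed=0):
    """Run a chain and return all states together with the acceptance rate."""
    rng = random.Random(seed)
    x, samples, accepted = list(x0), [], 0
    for _ in range(num_steps):
        x, was_accepted = mala_step(x, log_density, grad_log_density, step, rng)
        samples.append(x)
        accepted += int(was_accepted)
    return samples, accepted / num_steps


def std_normal_log_density(x):
    """Log density of a standard Gaussian, up to a constant (illustrative target)."""
    return -0.5 * sum(v * v for v in x)


def std_normal_grad(x):
    return [-v for v in x]


if __name__ == "__main__":
    samples, rate = run_mala([0.0, 0.0], std_normal_log_density, std_normal_grad,
                             step=0.5, num_steps=5000, seed=1)
    mean0 = sum(s[0] for s in samples) / len(samples)
    print("acceptance rate:", rate, "mean of first coordinate:", mean0)
```

The accept/reject step is what distinguishes MALA from the unadjusted Langevin algorithm: it removes the discretization bias, so the chain targets the intended distribution exactly.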

Learning linear non-Gaussian directed acyclic graph with diverging number of nodes
ノード数が発散する場合の線形非ガウス有向非巡回グラフの学習

An acyclic model, often depicted as a directed acyclic graph (DAG), has been widely employed to represent directional causal relations among collected nodes. In this article, we propose an efficient method to learn linear non-Gaussian DAG in high dimensional cases, where the noises can be of any continuous non-Gaussian distribution. The proposed method leverages the concept of topological layer to facilitate the DAG learning, and its theoretical justification in terms of exact DAG recovery is also established under mild conditions. Particularly, we show that the topological layers can be exactly reconstructed in a bottom-up fashion, and the parent-child relations among nodes can also be consistently established. The established asymptotic DAG recovery is in sharp contrast to that of many existing learning methods assuming parental faithfulness or ordered noise variances. The advantage of the proposed method is also supported by the numerical comparison against some popular competitors in various simulated examples as well as a real application on the global spread of COVID-19.

非巡回モデルは、有向非巡回グラフ(DAG)として表現されることが多く、収集されたノード間の方向性のある因果関係を表すために広く採用されています。この記事では、ノイズが任意の連続な非ガウス分布に従いうる高次元の場合に、線形非ガウスDAGを学習するための効率的な方法を提案します。提案手法は、DAG学習を容易にするためにトポロジカルレイヤーの概念を活用し、穏やかな条件の下で、正確なDAG復元という観点での理論的正当性も確立されています。特に、トポロジカルレイヤーをボトムアップ方式で正確に再構築できること、およびノード間の親子関係も一致性をもって確立できることを示します。確立された漸近的なDAG復元は、親の忠実性(parental faithfulness)または順序付けられたノイズ分散を仮定する多くの既存の学習方法とは際立って対照的です。提案手法の利点は、さまざまなシミュレーション例におけるいくつかの一般的な競合手法との数値比較や、COVID-19の世界的な拡大に関する実データへの応用によっても裏付けられています。

A Computationally Efficient Framework for Vector Representation of Persistence Diagrams
パーシステンスダイアグラムのベクトル表現のための計算効率の高いフレームワーク

In Topological Data Analysis, a common way of quantifying the shape of data is to use a persistence diagram (PD). PDs are multisets of points in $R^2$ computed using tools of algebraic topology. However, this multi-set structure limits the utility of PDs in applications. Therefore, in recent years efforts have been directed towards extracting informative and efficient summaries from PDs to broaden the scope of their use for machine learning tasks. We propose a computationally efficient framework to convert a PD into a vector in $R^n$, called a vectorized persistence block (VPB). We show that our representation possesses many of the desired properties of vector-based summaries such as stability with respect to input noise, low computational cost and flexibility. Through simulation studies, we demonstrate the effectiveness of VPBs in terms of performance and computational cost for various learning tasks, namely clustering, classification and change point detection.

トポロジカルデータ解析では、データの形状を定量化する一般的な方法は、パーシステンスダイアグラム(PD)を使用することです。PDは、代数トポロジーのツールを使用して計算される$R^2$内の点の多重集合です。しかし、この多重集合という構造が、応用におけるPDの有用性を制限します。そのため近年では、PDから有益で効率的な要約を抽出し、機械学習タスクでの使用範囲を広げる取り組みが行われています。私たちは、PDを、ベクトル化パーシステンスブロック(VPB)と呼ぶ$R^n$のベクトルに変換するための、計算効率の高いフレームワークを提案します。私たちの表現が、入力ノイズに対する安定性、低い計算コスト、柔軟性など、ベクトルベースの要約に望まれる特性の多くを備えていることを示します。シミュレーション研究を通じて、クラスタリング、分類、変化点検出といったさまざまな学習タスクにおいて、性能と計算コストの観点からVPBの有効性を実証します。

Tianshou: A Highly Modularized Deep Reinforcement Learning Library
Tianshou:高度にモジュール化された深層強化学習ライブラリ

In this paper, we present Tianshou, a highly modularized Python library for deep reinforcement learning (DRL) that uses PyTorch as its backend. Tianshou intends to be research-friendly by providing a flexible and reliable infrastructure of DRL algorithms. It supports online and offline training with more than 20 classic algorithms through a unified interface. To facilitate related research and prove Tianshou’s reliability, we have released Tianshou’s benchmark of MuJoCo environments, covering eight classic algorithms with state-of-the-art performance. We open-sourced Tianshou at https://github.com/thu-ml/tianshou/.

この論文では、PyTorchをバックエンドとして使用する、深層強化学習(DRL)用の高度にモジュール化されたPythonライブラリであるTianshouを紹介します。Tianshouは、DRLアルゴリズムの柔軟で信頼性の高い基盤を提供することにより、研究しやすいライブラリとなることを目指しています。統一されたインターフェースを通じて、20以上の古典的アルゴリズムによるオンラインおよびオフラインのトレーニングをサポートします。関連する研究を促進し、Tianshouの信頼性を示すために、最先端の性能を持つ8つの古典的アルゴリズムをカバーする、MuJoCo環境でのTianshouのベンチマークを公開しました。Tianshouはhttps://github.com/thu-ml/tianshou/でオープンソース化されています。

Functional Linear Regression with Mixed Predictors
混合予測子による関数型線形回帰

We study a functional linear regression model that deals with functional responses and allows for both functional covariates and high-dimensional vector covariates. The proposed model is flexible and nests several functional regression models in the literature as special cases. Based on the theory of reproducing kernel Hilbert spaces (RKHS), we propose a penalized least squares estimator that can accommodate functional variables observed on discrete sample points. Besides a conventional smoothness penalty, a group Lasso-type penalty is further imposed to induce sparsity in the high-dimensional vector predictors. We derive finite sample theoretical guarantees and show that the excess prediction risk of our estimator is minimax optimal. Furthermore, our analysis reveals an interesting phase transition phenomenon that the optimal excess risk is determined jointly by the smoothness and the sparsity of the functional regression coefficients. A novel efficient optimization algorithm based on iterative coordinate descent is devised to handle the smoothness and group penalties simultaneously. Simulation studies and real data applications illustrate the promising performance of the proposed approach compared to the state-of-the-art methods in the literature.

私たちは、関数型の応答を扱い、関数型共変量と高次元ベクトル共変量の両方を許容する関数型線形回帰モデルを研究します。提案モデルは柔軟性があり、文献にあるいくつかの関数型回帰モデルを特殊ケースとして含みます。再生核ヒルベルト空間(RKHS)の理論に基づいて、離散的なサンプル点で観測される関数型変数を扱えるペナルティ付き最小二乗推定量を提案します。従来の平滑性ペナルティに加えて、高次元ベクトル予測子にスパース性を誘導するために、グループLasso型ペナルティをさらに課します。私たちは有限サンプルの理論的保証を導出し、推定量の過剰予測リスクがミニマックス最適であることを示します。さらに、私たちの分析は、最適な過剰リスクが関数型回帰係数の平滑性とスパース性によって共同で決定されるという、興味深い相転移現象を明らかにします。平滑性ペナルティとグループペナルティを同時に扱うために、反復的な座標降下法に基づく新しい効率的な最適化アルゴリズムを考案します。シミュレーション研究と実データへの応用は、文献の最先端の方法と比較して、提案アプローチの有望な性能を示しています。

Stochastic subgradient for composite convex optimization with functional constraints
関数制約を持つ複合凸最適化のための確率的劣勾配法

In this paper we consider optimization problems with stochastic composite objective function subject to (possibly) infinite intersection of constraints. The objective function is expressed in terms of expectation operator over a sum of two terms satisfying a stochastic bounded gradient condition, with or without strong convexity type properties. In contrast to the classical approach, where the constraints are usually represented as intersection of simple sets, in this paper we consider that each constraint set is given as the level set of a convex but not necessarily differentiable function. Based on the flexibility offered by our general optimization model we consider a stochastic subgradient method with random feasibility updates. At each iteration, our algorithm takes a stochastic proximal (sub)gradient step aimed at minimizing the objective function and then a subsequent subgradient step minimizing the feasibility violation of the observed random constraint. We analyze the convergence behavior of the proposed algorithm for diminishing stepsizes and for the case when the objective function is convex or has a quadratic functional growth, unifying the nonsmooth and smooth cases. We prove sublinear convergence rates for this stochastic subgradient algorithm, which are known to be optimal for subgradient methods on this class of problems. When the objective function has a linear least-square form and the constraints are polyhedral, it is shown that the algorithm converges linearly. Numerical evidence supports the effectiveness of our method in real problems.

この論文では、(場合によっては)無限個の制約の共通部分のもとで、確率的な複合目的関数を持つ最適化問題を考察します。目的関数は、確率的有界勾配条件を満たす2つの項の和に対する期待値演算子で表現され、強凸性タイプの性質は持っていても持っていなくても構いません。制約が通常は単純な集合の共通部分として表現される古典的なアプローチとは対照的に、この論文では、各制約集合が、凸だが必ずしも微分可能とは限らない関数のレベル集合として与えられるものとします。この一般的な最適化モデルが提供する柔軟性に基づいて、ランダムな実行可能性更新を伴う確率的劣勾配法を考察します。各反復で、アルゴリズムは、目的関数の最小化を目的とした確率的近接(劣)勾配ステップを実行し、その後、観測されたランダムな制約に対する実行可能性違反を最小化する劣勾配ステップを実行します。提案アルゴリズムの収束挙動を、減少するステップサイズのもとで、目的関数が凸であるか2次の関数的成長を持つ場合について分析し、非平滑な場合と平滑な場合を統一的に扱います。この確率的劣勾配アルゴリズムについて劣線形の収束率を証明します。この収束率は、このクラスの問題に対する劣勾配法として最適であることが知られています。目的関数が線形最小二乗の形式を持ち、制約が多面体である場合、アルゴリズムは線形収束することが示されます。数値的な証拠は、実際の問題におけるこの方法の有効性を裏付けています。

A Random Matrix Perspective on Random Tensors
ランダムテンソルに対するランダム行列の視点

Several machine learning problems such as latent variable model learning and community detection can be addressed by estimating a low-rank signal from a noisy tensor. Despite recent substantial progress on the fundamental limits of the corresponding estimators in the large-dimensional setting, some of the most significant results are based on spin glass theory, which is not easily accessible to non-experts. We propose a sharply distinct and more elementary approach, relying on tools from random matrix theory. The key idea is to study random matrices arising from contractions of a random tensor, which give access to its spectral properties. In particular, for a symmetric $d$th-order rank-one model with Gaussian noise, our approach yields a novel characterization of maximum likelihood (ML) estimation performance in terms of a fixed-point equation valid in the regime where weak recovery is possible. For $d=3$, the solution to this equation matches the existing results. We conjecture that the same holds for any order $d$, based on numerical evidence for $d \in \{4,5\}$. Moreover, our analysis illuminates certain properties of the large-dimensional ML landscape. Our approach can be extended to other models, including asymmetric and non-Gaussian ones.

潜在変数モデルの学習やコミュニティ検出などのいくつかの機械学習の問題は、ノイズの多いテンソルから低ランク信号を推定することで対処できます。高次元の設定における対応する推定量の基本的限界に関しては最近大きな進歩がありましたが、最も重要な結果のいくつかはスピングラス理論に基づいており、専門家以外には容易に理解できません。私たちは、ランダム行列理論のツールに基づく、明確に異なる、より初等的なアプローチを提案します。重要なアイデアは、ランダムテンソルの縮約から生じるランダム行列を研究することであり、これによりテンソルのスペクトル特性を調べることができます。特に、ガウスノイズを伴う対称な$d$階ランク1モデルに対して、私たちのアプローチは、弱回復が可能な領域で有効な不動点方程式を用いた、最尤(ML)推定の性能の新しい特徴付けをもたらします。$d = 3$の場合、この方程式の解は既存の結果と一致します。私たちは、$d \in \{4,5\}$に対する数値的証拠に基づいて、任意の階数$d$に対して同じことが成り立つと予想します。さらに、私たちの分析は、高次元におけるMLランドスケープのいくつかの性質を明らかにします。私たちのアプローチは、非対称モデルや非ガウスモデルを含む他のモデルにも拡張できます。

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks
2層線形ネットワークにおける暗黙的バイアスと良性過学習の相互作用

The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of benign overfitting has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk.

ニューラルネットワークモデルの最近の成功は、ノイズの多いデータに完全に適合する統計モデルが、未知のテストデータにうまく汎化できるという、かなり驚くべき統計現象に光を当てています。この良性過学習という現象の理解は、理論と実証の両面で精力的な研究を引き付けてきました。この論文では、二乗損失に対する勾配フローで訓練された、データを補間する2層線形ニューラルネットワークを考察し、共変量がサブガウス性と反集中性を満たし、ノイズが独立かつサブガウスである場合に、過剰リスクのバウンドを導出します。この推定量の暗黙的バイアスを特徴付ける最近の結果を活用することにより、私たちのバウンドは、低い過剰リスクを達成する上で、初期化の質とデータ共分散行列の性質の両方が果たす役割を浮き彫りにします。

Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score
分割・プールされた非相関スコアを用いた観測データにおける高次元個別化治療ルールの推定と推論

With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules, which recommend treatments according to patients’ characteristics, from large observational data. However, there is a lack of valid inference procedures for such rules developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal individualized treatment rule from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal adopts the data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.

電子カルテの導入が進むにつれ、大規模な観察データから患者の特性に応じた治療を推奨する個別治療ルールを開発することへの関心が高まっています。しかし、高次元の共変量が存在する場合、この種のデータから作成されたそのようなルールに対する有効な推論手順が不足しています。この研究では、高次元データから最適な個別治療ルールを推定するためのペナルティ付き二重ロバスト法を開発します。仮説検定と信頼区間を構築するために、分割およびプールされた無相関スコアを提案します。私たちの提案では、アウトカム回帰や傾向モデルのノンパラメトリック法などの厄介なパラメータ推定の収束速度が遅いという問題を克服するために、データ分割を採用しています。分割およびプールされた無相関スコア検定の限界分布と、高次元設定での対応するワンステップ推定量を確立します。シミュレーションと実際のデータ分析を実施して、提案方法の優位性を実証します。

Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning
Auto-Sklearn 2.0: メタ学習によるハンズフリー AutoML

Automated Machine Learning (AutoML) supports practitioners and researchers with the tedious task of designing machine learning pipelines and has recently achieved substantial success. In this paper, we introduce new AutoML approaches motivated by our winning submission to the second ChaLearn AutoML challenge. We develop PoSH Auto-sklearn, which enables AutoML systems to work well on large datasets under rigid time limits by using a new, simple and meta-feature-free meta-learning technique and by employing a successful bandit strategy for budget allocation. However, PoSH Auto-sklearn introduces even more ways of running AutoML and might make it harder for users to set it up correctly. Therefore, we also go one step further and study the design space of AutoML itself, proposing a solution towards truly hands-free AutoML. Together, these changes give rise to the next generation of our AutoML system, Auto-sklearn 2.0. We verify the improvements by these additions in an extensive experimental study on 39 AutoML benchmark datasets. We conclude the paper by comparing to other popular AutoML frameworks and Auto-sklearn 1.0, reducing the relative error by up to a factor of 4.5, and yielding a performance in 10 minutes that is substantially better than what Auto-sklearn 1.0 achieves within an hour.

自動機械学習(AutoML)は、機械学習パイプラインの設計という面倒な作業で実務家や研究者をサポートし、最近大きな成功を収めています。この論文では、第2回ChaLearn AutoMLチャレンジで優勝した作品に触発されて、新しいAutoMLアプローチを紹介します。PoSH Auto-sklearnを開発しました。これは、新しいシンプルでメタ機能のないメタ学習手法を使用し、予算配分に成功したバンディット戦略を採用することで、厳格な時間制限の下でAutoMLシステムが大規模なデータセットで適切に機能することを可能にします。ただし、PoSH Auto-sklearnではAutoMLを実行する方法がさらに増え、ユーザーが正しく設定するのが難しくなる可能性があります。そのため、さらに一歩進んでAutoML自体の設計空間を研究し、真にハンズフリーのAutoMLに向けたソリューションを提案します。これらの変更を組み合わせることで、次世代のAutoMLシステムであるAuto-sklearn 2.0が誕生しました。39のAutoMLベンチマークデータセットを使用した大規模な実験研究で、これらの追加による改善点を検証しました。最後に、他の一般的なAutoMLフレームワークとAuto-sklearn 1.0を比較し、相対誤差を最大4.5倍削減し、Auto-sklearn 1.0が1時間以内に達成するパフォーマンスよりも大幅に優れたパフォーマンスを10分で実現しました。

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions
区分線形ターゲット関数に対するReLU活性化を用いたニューラルネットワークの学習におけるランダム初期化による勾配降下最適化法の収束証明

Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains – even in the simplest situation of the plain vanilla GD optimization method and ANNs with one hidden layer – an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero. In this article we establish in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval, where the probability distribution for the random initialization of the ANN parameters is the standard normal distribution, and where the target function under consideration is continuous and piecewise affine linear that the risk of the considered GD process converges exponentially fast to zero with a positive probability. Roughly speaking, the key ingredients in our mathematical convergence analysis are (i) to prove that suitable sets of global minima of the risk functions are twice continuously differentiable submanifolds of the ANN parameter spaces, (ii) to prove that the Hessians of the risk functions on these sets of global minima satisfy an appropriate maximal rank condition, and, thereafter, (iii) to apply the machinery in [Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21(136): 1-48, 2020] to establish local convergence of the GD optimization method. As a consequence, we obtain convergence of the risk to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity.

勾配降下法(GD)タイプの最適化手法は、正規化線形ユニット(ReLU)活性化による人工ニューラルネットワーク(ANN)のトレーニングに標準的な手段です。ReLU活性化によるANNのトレーニングの数値シミュレーションではGDタイプの最適化手法が大きな成功を収めていますが、プレーンバニラGD最適化手法と1つの隠れ層を持つANNの最も単純な状況でさえ、GD最適化手法のリスクがそのようなANNのトレーニングでゼロに収束するという推測を証明(または反証)することは未解決の問題です。この記事では、入力データの確率分布がコンパクトな区間の連続一様分布に等しく、ANNパラメーターのランダム初期化の確率分布が標準正規分布であり、検討中のターゲット関数が連続かつ区分的にアフィン線形である状況で、検討中のGDプロセスのリスクが正の確率で指数関数的に速くゼロに収束することを確立します。大まかに言えば、私たちの数学的収束解析の重要な要素は、(i)リスク関数の大域的最小値の適切な集合がANNパラメータ空間の2回連続微分可能な部分多様体であることを証明すること、(ii)これらの大域的最小値集合上のリスク関数のヘッセ行列が適切な最大ランク条件を満たすことを証明すること、そしてその後、(iii)[Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21(136): 1-48, 2020]の仕組みを適用してGD最適化法の局所収束を確立することです。その結果、ANNの幅、独立したランダム初期化の数、およびGDステップの数が無限大に増加するにつれて、リスクがゼロに収束します。

Learning Temporal Evolution of Spatial Dependence with Generalized Spatiotemporal Gaussian Process Models
一般化時空間ガウス過程モデルによる空間依存性の時間的進化の学習

A large number of scientific studies involve high-dimensional spatiotemporal data with complicated relationships. In this paper, we focus on a type of space-time interaction named temporal evolution of spatial dependence (TESD), which is a zero time-lag spatiotemporal covariance. For this purpose, we propose a novel Bayesian nonparametric method based on non-stationary spatiotemporal Gaussian process (STGP). The classic STGP has a covariance kernel separable in space and time, and therefore fails to characterize TESD. More recent works on non-separable STGP treat location and time together as a joint variable, which is unnecessarily inefficient. We generalize STGP (gSTGP) to introduce time-dependence to the spatial kernel by varying its eigenvalues over time in Mercer's representation. The resulting non-stationary non-separable covariance model bears a quasi Kronecker sum structure. Finally, a hierarchical Bayesian model for the joint covariance is proposed to allow for full flexibility in learning TESD. A simulation study and a longitudinal neuroimaging analysis on Alzheimer's patients demonstrate that the proposed methodology is (statistically) effective and (computationally) efficient in characterizing TESD. Theoretical properties of gSTGP including posterior contraction (for covariance) are also studied.

科学研究の多くは、複雑な関係を持つ高次元の時空間データを扱っています。この論文では、ゼロ時間遅れの時空間共分散である、空間依存性の時間的発展(TESD)と呼ばれる一種の時空間相互作用に焦点を当てます。この目的のために、非定常時空間ガウス過程(STGP)に基づく新しいベイズ非パラメトリック法を提案します。従来のSTGPは、空間と時間で分離可能な共分散カーネルを持つため、TESDを特徴付けることができません。非分離STGPに関する最近の研究では、場所と時間を一緒に結合変数として扱っていますが、これは不必要に非効率的です。STGPを一般化し(gSTGP)、マーサー表現で時間の経過と共に固有値を変化させることで、空間カーネルに時間依存性を導入します。結果として得られる非定常非分離共分散モデルは、準クロネッカー和構造を持ちます。最後に、TESDの学習に完全な柔軟性を持たせるために、結合共分散の階層的ベイズモデルが提案されています。シミュレーション研究とアルツハイマー病患者に対する縦断的神経画像分析により、提案された方法論がTESDの特性評価において(統計的に)効果的かつ(計算的に)効率的であることが実証されています。また、(共分散の)事後収縮を含むgSTGPの理論的特性も研究されています。

Tree-Based Models for Correlated Data
相関データのツリーベースモデル

This paper presents a new approach for regression tree-based models, such as simple regression tree, random forest and gradient boosting, in settings involving correlated data. We show the problems that arise when implementing standard regression tree-based models, which ignore the correlation structure. Our new approach explicitly takes the correlation structure into account in the splitting criterion, stopping rules and fitted values in the leaves, which induces some major modifications of standard methodology. The superiority of our new approach over tree-based models that do not account for the correlation, and over previous work that integrated some aspects of our approach, is supported by simulation experiments and real data analyses.

この論文では、相関データを含む設定で、単純な回帰ツリー、ランダムフォレスト、勾配ブースティングなどの回帰ツリーベースのモデルに対する新しいアプローチを紹介します。相関構造を無視する標準的な回帰ツリーベースのモデルを実装するときに発生する問題を示します。私たちの新しいアプローチは、分割基準、停止ルール、葉の適合値で相関構造を明示的に考慮しており、標準的な方法論のいくつかの大きな変更を誘発します。相関を考慮しないツリーベースのモデルや、アプローチの一部の側面を統合した以前の研究に対する新しいアプローチの優位性は、シミュレーション実験と実際のデータ分析によって裏付けられています。

Sparse Continuous Distributions and Fenchel-Young Losses
疎な連続分布とフェンチェル・ヤング損失

Exponential families are widely used in machine learning, including many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, recent work on sparse alternatives to softmax (e.g., sparsemax, $\alpha$-entmax, and fusedmax), has led to distributions with varying support. This paper develops sparse alternatives to continuous distributions, based on several technical contributions: First, we define $\Omega$-regularized prediction maps and Fenchel-Young losses for arbitrary domains (possibly countably infinite or continuous). For linearly parametrized families, we show that minimization of Fenchel-Young losses is equivalent to moment matching of the statistics, generalizing a fundamental property of exponential families. When $\Omega$ is a Tsallis negentropy with parameter $\alpha$, we obtain “deformed exponential families,” which include $\alpha$-entmax and sparsemax ($\alpha=2$) as particular cases. For quadratic energy functions, the resulting densities are $\beta$-Gaussians, an instance of elliptical distributions that contain as particular cases the Gaussian, biweight, triweight, and Epanechnikov densities, and for which we derive closed-form expressions for the variance, Tsallis entropy, and Fenchel-Young loss. When $\Omega$ is a total variation or Sobolev regularizer, we obtain a continuous version of the fusedmax. Finally, we introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for $\alpha \in \{1,\frac{4}{3}, \frac{3}{2}, 2\}$. Using these algorithms, we demonstrate our sparse continuous distributions for attention-based audio classification and visual question answering, showing that they allow attending to time intervals and compact regions.

指数族は機械学習で広く使用されており、連続領域と離散領域の多くの分布（ガウス分布、ディリクレ分布、ポアソン分布、ソフトマックス変換によるカテゴリ分布など）が含まれます。これらの各族の分布のサポートは固定されています。対照的に、有限領域では、ソフトマックスのスパースな代替物（スパースマックス、$\alpha$-entmax、fusedmaxなど）に関する最近の研究により、サポートが変化する分布が生まれました。この論文では、いくつかの技術的貢献に基づいて、連続分布のスパースな代替物を開発しています。まず、任意の領域（可算無限または連続）に対して、$\Omega$-正規化予測マップとフェンチェル-ヤング損失を定義します。線形パラメータ化された族の場合、フェンチェル-ヤング損失の最小化は統計のモーメントマッチングと同等であり、指数族の基本特性を一般化していることを示します。$\Omega$がパラメータ$\alpha$を持つTsallisネゲントロピーである場合、「変形指数族」が得られます。これには、特殊なケースとして$\alpha$-entmaxとsparsemax ($\alpha=2$)が含まれます。2次エネルギー関数の場合、結果として得られる密度は$\beta$-ガウス分布です。これは、特殊なケースとしてガウス、biweight、triweight、およびEpanechnikov密度を含む楕円分布のインスタンスであり、分散、Tsallisエントロピー、およびFenchel-Young損失の閉じた形式の式を導出します。$\Omega$が全変分またはSobolev正則化子である場合、fusedmaxの連続バージョンが得られます。最後に、連続領域アテンションメカニズムを導入し、$\alpha \in \{1,\frac{4}{3}, \frac{3}{2}, 2\}$の効率的な勾配バックプロパゲーションアルゴリズムを導出します。これらのアルゴリズムを使用して、注意ベースのオーディオ分類と視覚的な質問応答のためのスパース連続分布を実証し、時間間隔とコンパクトな領域に注意を向けることができることを示します。
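
As a point of reference for the finite-domain case mentioned above, the following is a minimal sketch of sparsemax (the $\alpha=2$ case), written as the Euclidean projection of a score vector onto the probability simplex. It is not the paper's continuous-domain construction; the function names `softmax` and `sparsemax` are illustrative choices, and the sort-based threshold search is the standard textbook algorithm rather than anything taken from the paper's code.

```python
import math


def softmax(scores):
    """Dense baseline: every entry receives strictly positive probability."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex.

    Entries far below the largest scores receive exactly zero probability,
    so the support of the output varies with the input.
    """
    z_sorted = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumulative += z
        # j belongs to the support while 1 + j * z_(j) > sum of the top-j scores
        if 1.0 + j * z > cumulative:
            support_size = j
            support_sum = cumulative
    tau = (support_sum - 1.0) / support_size  # threshold shared by the support
    return [max(z - tau, 0.0) for z in scores]


if __name__ == "__main__":
    print(softmax([2.0, 0.0]))    # both entries positive
    print(sparsemax([2.0, 0.0]))  # second entry is exactly zero
```

With scores `[2.0, 0.0]` the softmax output keeps both entries positive, while sparsemax puts all mass on the first entry, which is the "varying support" behavior the abstract contrasts with exponential families.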

On Constraints in First-Order Optimization: A View from Non-Smooth Dynamical Systems
1次最適化における制約条件について:非平滑力学系からの視点

We introduce a class of first-order methods for smooth constrained optimization that are based on an analogy to non-smooth dynamical systems. Two distinctive features of our approach are that (i) projections or optimizations over the entire feasible set are avoided, in stark contrast to projected gradient methods or the Frank-Wolfe method, and (ii) iterates are allowed to become infeasible, which differs from active set or feasible direction methods, where the descent motion stops as soon as a new constraint is encountered. The resulting algorithmic procedure is simple to implement even when constraints are nonlinear, and is suitable for large-scale constrained optimization problems in which the feasible set fails to have a simple structure. The key underlying idea is that constraints are expressed in terms of velocities instead of positions, which has the algorithmic consequence that optimizations over feasible sets at each iteration are replaced with optimizations over local, sparse convex approximations. In particular, this means that at each iteration only constraints that are violated are taken into account. The result is a simplified suite of algorithms and an expanded range of possible applications in machine learning.

私たちは、滑らかでない動的システムとの類似性に基づく、滑らかな制約付き最適化のための一階手法のクラスを導入します。我々のアプローチの2つの際立った特徴は、(i)射影勾配法やFrank-Wolfe法とは対照的に、実行可能セット全体にわたる射影や最適化が回避されること、および(ii)反復が実行不可能になることが許容されることです。これは、新しい制約に遭遇するとすぐに降下動作が停止するアクティブセット法や実行可能方向法とは異なります。結果として得られるアルゴリズム手順は、制約が非線形であっても実装が簡単で、実行可能セットが単純な構造を持たない大規模な制約付き最適化問題に適しています。鍵となる根底にある考え方は、制約が位置ではなく速度で表現されるということであり、これにより、各反復での実行可能セットの最適化が、ローカルでスパースな凸近似の最適化に置き換えられるというアルゴリズム上の帰結が得られます。特に、これは各反復で違反された制約のみが考慮されることを意味します。その結果、アルゴリズムのスイートが簡素化され、機械学習における適用範囲が広がります。

Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States
単純なエージェント、複雑な環境:エージェント状態による効率的な強化学習

We design a simple reinforcement learning (RL) agent that implements an optimistic version of $Q$-learning and establish through regret analysis that this agent can operate with some level of competence in any environment. While we leverage concepts from the literature on provably efficient RL, we consider a general agent-environment interface and provide a novel agent design and analysis. This level of generality positions our results to inform the design of future agents for operation in complex real environments. We establish that, as time progresses, our agent performs competitively relative to policies that require longer times to evaluate. The time it takes to approach asymptotic performance is polynomial in the complexity of the agent’s state representation and the time required to evaluate the best policy that the agent can represent. Notably, there is no dependence on the complexity of the environment. The ultimate per-period performance loss of the agent is bounded by a constant multiple of a measure of distortion introduced by the agent’s state representation. This work is the first to establish that an algorithm approaches this asymptotic condition within a tractable time frame.

私たちは、楽観的なバージョンの$Q$学習を実装する単純な強化学習(RL)エージェントを設計し、後悔分析を通じて、このエージェントがあらゆる環境で一定レベルの能力で動作できることを確立しました。私たちは、証明可能な効率的なRLに関する文献の概念を活用しながら、一般的なエージェント環境インターフェイスを考慮し、新しいエージェント設計と分析を提供します。このレベルの一般性により、我々の結果は、複雑な実際の環境で動作する将来のエージェントの設計に情報を提供するものとなります。時間の経過とともに、我々のエージェントは、評価に長い時間を要するポリシーと比較して競争力のあるパフォーマンスを発揮することを確立しました。漸近的なパフォーマンスに近づくのにかかる時間は、エージェントの状態表現の複雑さと、エージェントが表現できる最良のポリシーを評価するのに必要な時間の多項式です。特に、環境の複雑さには依存しません。エージェントの最終的な期間ごとのパフォーマンス損失は、エージェントの状態表現によってもたらされる歪みの尺度の定数倍によって制限されます。この研究では、アルゴリズムが扱いやすい時間枠内でこの漸近的な状態に近づくことを初めて確立したものです。

Adaptive Greedy Algorithm for Moderately Large Dimensions in Kernel Conditional Density Estimation
カーネル条件付き密度推定における適度な大次元のための適応貪欲アルゴリズム

This paper studies the estimation of the conditional density $f(x,\cdot)$ of $Y_i$ given $X_i=x$, from the observation of an i.i.d. sample $(X_i,Y_i)\in \mathbb R^d$, $i\in \{1,\dots,n\}.$ We assume that $f$ depends only on $r$ unknown components with typically $r\ll d$. We provide an adaptive fully-nonparametric strategy based on kernel rules to estimate $f$. To select the bandwidth of our kernel rule, we propose a new fast iterative algorithm inspired by the Rodeo algorithm (Wasserman and Lafferty, 2006) to detect the sparsity structure of $f$. More precisely, in the minimax setting, our pointwise estimator, which is adaptive to both the regularity and the sparsity, achieves the quasi-optimal rate of convergence. Our results also hold for (unconditional) density estimation. The computational complexity of our method is only $O(dn \log n)$. A deep numerical study shows nice performances of our approach.

この論文では、i.i.d.サンプル$(X_i,Y_i)\in \mathbb R^d$, $i\in \{1,\dots,n\}$の観測から、$X_i=x$が与えられた$Y_i$の条件付き密度$f(x,\cdot)$の推定について研究します。$f$は、通常は$r\ll d$の$r$個の未知の成分にのみ依存すると仮定します。カーネルルールに基づいて適応性のある完全ノンパラメトリック戦略を提供し、$f$を推定します。カーネルルールの帯域幅を選択するために、Rodeoアルゴリズム(Wasserman and Lafferty, 2006)に触発された新しい高速反復アルゴリズムを提案し、$f$のスパース性構造を検出します。より正確には、ミニマックス設定では、規則性とスパース性の両方に適応する点ごとの推定量は、準最適収束率を達成します。私たちの結果は、(無条件の)密度推定にも当てはまります。私たちの方法の計算の複雑さは、わずか$O(dn \log n)$です。詳細な数値研究は、私たちのアプローチの優れたパフォーマンスを示しています。
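
To make the phrase "kernel rule" concrete, here is a minimal sketch of a plain kernel conditional density estimate for scalar $x$ and $y$ with fixed bandwidths. It does not implement the paper's greedy, Rodeo-inspired bandwidth selection; the Gaussian kernel, the names `gaussian_kernel` and `conditional_density`, and the bandwidths `h_x` and `h_y` are assumptions made only for illustration.

```python
import math


def gaussian_kernel(u, bandwidth):
    """Gaussian kernel rescaled by `bandwidth`; integrates to one in `u`."""
    scaled = u / bandwidth
    return math.exp(-0.5 * scaled * scaled) / (bandwidth * math.sqrt(2.0 * math.pi))


def conditional_density(x, y, xs, ys, h_x, h_y):
    """Estimate f(x, y), the density of Y at y given X = x.

    Each sample is weighted by how close its covariate is to x; the
    weighted average of kernels centred at the responses is returned.
    """
    weights = [gaussian_kernel(x - xi, h_x) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no sample is close enough to x to inform the estimate
    weighted = sum(w * gaussian_kernel(y - yi, h_y) for w, yi in zip(weights, ys))
    return weighted / total


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0]
    ys = [0.0, 1.0, 4.0]
    print(conditional_density(1.0, 1.0, xs, ys, h_x=0.5, h_y=0.5))
```

In this fixed-bandwidth form, choosing `h_x` and `h_y` per coordinate is exactly the step that an adaptive procedure would automate.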

Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences
方策最適化のための貪欲化演算子:順方向および逆方向のKLダイバージェンスの調査

Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization; these are chosen because they underlie many existing policy optimization approaches, as we highlight in this work. We show that the reverse KL has stronger policy improvement guarantees, and that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. This work provides novel theoretical and empirical insights about the forward KL and reverse KL for greedification, and clear next steps for understanding and improving our policy optimization algorithms.

近似ポリシー反復(API)アルゴリズムは、(近似)ポリシー評価と(近似)貪欲化を交互に実行します。近似ポリシー評価にはさまざまなアプローチが検討されてきましたが、近似貪欲化と、どのような選択がポリシーの改善を保証するかについては、あまり理解されていません。この研究では、パラメーター化されたポリシーとアクション値上のボルツマン分布との間のKLダイバージェンスを削減する際の近似貪欲化を調査します。特に、さまざまな程度のエントロピー正則化による順方向KLダイバージェンスと逆方向KLダイバージェンスの違いを調査します。これらは、この研究で強調しているように、多くの既存のポリシー最適化アプローチの基礎となっているため選択されています。逆方向KLの方がポリシー改善の保証が強く、順方向KLを削減するとポリシーが悪くなる可能性があることを示します。ただし、順方向KLを十分に削減すると、追加の仮定の下で改善がもたらされる可能性があることも示します。経験的には、単純な連続アクション環境で、順方向KLはより多くの探索を誘発できるが、より最適ではないポリシーを犠牲にすることが示されます。離散アクション設定や一連のベンチマーク問題では、大きな違いは見られませんでした。この研究では、貪欲化の順方向KLと逆方向KLに関する新しい理論的および経験的洞察と、ポリシー最適化アルゴリズムを理解して改善するための明確な次のステップを提供します。
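
The two objectives compared above can be written down for a single state with finitely many actions. The sketch below assumes the convention that the reverse KL is $\mathrm{KL}(\pi \,\|\, \mathcal{B}Q)$ and the forward KL is $\mathrm{KL}(\mathcal{B}Q \,\|\, \pi)$, where $\mathcal{B}Q$ is the Boltzmann distribution over action values; the function names and the `temperature` parameter are illustrative and not taken from the paper.

```python
import math


def boltzmann(action_values, temperature):
    """Boltzmann (softmax) distribution over action values."""
    shift = max(action_values)  # subtract the max for numerical stability
    exps = [math.exp((q - shift) / temperature) for q in action_values]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q lacks mass where p has it."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def reverse_kl(policy, action_values, temperature):
    """KL(policy || Boltzmann): expectation taken under the policy."""
    return kl(policy, boltzmann(action_values, temperature))


def forward_kl(policy, action_values, temperature):
    """KL(Boltzmann || policy): expectation taken under the target."""
    return kl(boltzmann(action_values, temperature), policy)


if __name__ == "__main__":
    q_values = [1.0, 0.0]
    greedy = [1.0, 0.0]  # deterministic policy on the best action
    print(reverse_kl(greedy, q_values, temperature=1.0))  # finite
    print(forward_kl(greedy, q_values, temperature=1.0))  # inf
```

A deterministic policy has a finite reverse KL but an infinite forward KL here, because the forward direction penalizes any action the target covers and the policy ignores. This is one intuition for why the forward KL can encourage more exploration.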

A Closer Look at Embedding Propagation for Manifold Smoothing
多様体平滑化のための埋め込み伝播の詳細

Supervised training of neural networks requires a large amount of manually annotated data and the resulting networks tend to be sensitive to out-of-distribution (OOD) data. Self- and semi-supervised training schemes reduce the amount of annotated data required during the training process. However, OOD generalization remains a major challenge for most methods. Strategies that promote smoother decision boundaries play an important role in out-of-distribution generalization. For example, embedding propagation (EP) for manifold smoothing has recently been shown to considerably improve the OOD performance for few-shot classification. EP achieves smoother class manifolds by building a graph from sample embeddings and propagating information through the nodes in an unsupervised manner. In this work, we extend the original EP paper providing additional evidence and experiments showing that it attains smoother class embedding manifolds and improves results in settings beyond few-shot classification. Concretely, we show that EP improves the robustness of neural networks against multiple adversarial attacks as well as semi- and self-supervised learning performance.

ニューラルネットワークの教師ありトレーニングには、手動で注釈を付けられた大量のデータが必要であり、結果として得られるネットワークは分布外(OOD)データに敏感になる傾向があります。自己教師ありおよび半教師ありトレーニングスキームは、トレーニングプロセス中に必要な注釈データの量を削減します。ただし、OODの一般化は、ほとんどの方法で依然として大きな課題です。より滑らかな決定境界を促進する戦略は、分布外の一般化で重要な役割を果たします。たとえば、マニホールドスムージングの埋め込み伝播(EP)は、最近、少数ショット分類のOODパフォーマンスを大幅に向上させることが示されています。EPは、サンプル埋め込みからグラフを構築し、ノードを通じて教師なしで情報を伝播することにより、より滑らかなクラスマニホールドを実現します。この研究では、元のEP論文を拡張して、より滑らかなクラス埋め込みマニホールドを実現し、少数ショット分類を超えた設定で結果を改善することを示す追加の証拠と実験を提供します。具体的には、EPによって、複数の敵対的攻撃に対するニューラルネットワークの堅牢性が向上するだけでなく、半教師あり学習と自己教師あり学習のパフォーマンスも向上することを示します。

Using Active Queries to Infer Symmetric Node Functions of Graph Dynamical Systems
アクティブクエリを使用したグラフ動的システムの対称ノード関数の推論

Developing techniques to infer the behavior of networked social systems has attracted a lot of attention in the literature. Using a discrete dynamical system to model a networked social system, the problem of inferring the behavior of the system can be formulated as the problem of learning the local functions of the dynamical system. We investigate the problem assuming an active form of interaction with the system through queries. We consider two classes of local functions (namely, symmetric and threshold functions) and two interaction modes, namely batch (where all the queries must be submitted together) and adaptive (where the set of queries submitted at a stage may rely on the answers to previous queries). We establish bounds on the number of queries under both batch and adaptive query modes using vertex coloring and probabilistic methods. Our results show that a small number of appropriately chosen queries are provably sufficient to correctly learn all the local functions. We develop complexity results which suggest that, in general, the problem of generating query sets of minimum size is computationally intractable. We present efficient heuristics that produce query sets under both batch and adaptive query modes. Also, we present a query compaction algorithm that identifies and removes redundant queries from a given query set. Our algorithms were evaluated through experiments on over 20 well-known networks.

ネットワーク化された社会システムの振る舞いを推測する技術の開発は、文献で多くの注目を集めています。離散動的システムを使用してネットワーク化された社会システムをモデル化すると、システムの動作を推測する問題は、動的システムのローカル関数を学習する問題として定式化できます。クエリを介してシステムと能動的にやり取りすることを前提として、この問題を調査します。ローカル関数の2つのクラス(対称関数としきい値関数)と、2つの対話モード(バッチ(すべてのクエリをまとめて送信する必要がある)と適応型(ある段階で送信されたクエリのセットが以前のクエリの回答に依存する可能性がある))を検討します。頂点カラーリングと確率的方法を使用して、バッチモードと適応型クエリモードの両方でクエリの数の上限を設定します。結果は、適切に選択された少数のクエリで、すべてのローカル関数を正しく学習できることを示しています。最小サイズのクエリセットを生成する問題は、一般に計算上扱いにくいことを示唆する複雑性の結果を示します。バッチクエリモードとアダプティブクエリモードの両方でクエリセットを生成する効率的なヒューリスティックを紹介します。また、指定されたクエリセットから冗長なクエリを識別して削除するクエリ圧縮アルゴリズムも紹介します。このアルゴリズムは、20を超えるよく知られたネットワークでの実験を通じて評価されました。

Non-asymptotic Properties of Individualized Treatment Rules from Sequentially Rule-Adaptive Trials
逐次ルール適応試験からの個別化治療ルールの非漸近特性

Learning optimal individualized treatment rules (ITRs) has become increasingly important in the modern era of precision medicine. Many statistical and machine learning methods for learning optimal ITRs have been developed in the literature. However, most existing methods are based on data collected from traditional randomized controlled trials and thus cannot take advantage of the accumulative evidence when patients enter the trials sequentially. It is also ethically important that future patients should have a high probability to be treated optimally based on the updated knowledge so far. In this work, we propose a new design called sequentially rule-adaptive trials to learn optimal ITRs based on the contextual bandit framework, in contrast to the response-adaptive design in traditional adaptive trials. In our design, each entering patient will be allocated with a high probability to the current best treatment for this patient, which is estimated using the past data based on some machine learning algorithm (for example, outcome weighted learning in our implementation). We explore the tradeoff between training and test values of the estimated ITR in single-stage problems by proving theoretically that for a higher probability of following the estimated ITR, the training value converges to the optimal value at a faster rate, while the test value converges at a slower rate. This problem is different from traditional decision problems in the sense that the training data are generated sequentially and are dependent. We also develop a tool that combines martingale with empirical process to tackle the problem that cannot be solved by previous techniques for i.i.d. data. We show by numerical examples that without much loss of the test value, our proposed algorithm can improve the training value significantly as compared to existing methods. Finally, we use a real data study to illustrate the performance of the proposed method.

最適な個別治療ルール(ITR)の学習は、精密医療の現代においてますます重要になっています。文献では、最適なITRを学習するための多くの統計的および機械学習的方法が開発されています。ただし、既存の方法のほとんどは、従来のランダム化比較試験から収集されたデータに基づいているため、患者が試験に順番に参加する際に蓄積された証拠を活用することができません。また、将来の患者が、これまでに更新された知識に基づいて最適な治療を受ける可能性が高くなることが倫理的に重要です。この研究では、従来の適応型試験の応答適応型設計とは対照的に、コンテキストバンディットフレームワークに基づいて最適なITRを学習するための、順次ルール適応型試験と呼ばれる新しい設計を提案します。この設計では、参加する各患者は、何らかの機械学習アルゴリズム(たとえば、実装の結果重み付け学習)に基づく過去のデータを使用して推定された、その患者に対する現在の最善の治療に高い確率で割り当てられます。私たちは、単一段階の問題における推定ITRのトレーニング値とテスト値のトレードオフを調査し、推定ITRに従う確率が高いほど、トレーニング値はより速く最適値に収束するが、テスト値はより遅く収束することを理論的に証明します。この問題は、トレーニングデータが順次生成され、依存しているという点で、従来の決定問題とは異なります。また、従来のi.i.d.データ手法では解決できない問題に取り組むために、マルチンゲールと経験的プロセスを組み合わせたツールも開発します。数値例により、テスト値をあまり失うことなく、提案アルゴリズムが既存の方法と比較してトレーニング値を大幅に改善できることを示します。最後に、実際のデータスタディを使用して、提案方法のパフォーマンスを示します。

Projected Robust PCA with Application to Smooth Image Recovery
投影ロバストPCAと滑らかな画像復元への応用

Most high-dimensional matrix recovery problems are studied under the assumption that the target matrix has certain intrinsic structures. For image data related matrix recovery problems, approximate low-rankness and smoothness are the two most commonly imposed structures. For approximately low-rank matrix recovery, the robust principal component analysis (PCA) is well-studied and proved to be effective. For smooth matrix problem, 2d fused Lasso and other total variation based approaches have played a fundamental role. Although both low-rankness and smoothness are key assumptions for image data analysis, the two lines of research, however, have very limited interaction. Motivated by taking advantage of both features, we in this paper develop a framework named projected robust PCA (PRPCA), under which the low-rank matrices are projected onto a space of smooth matrices. Consequently, a large class of image matrices can be decomposed as a low-rank and smooth component plus a sparse component. A key advantage of this decomposition is that the dimension of the core low-rank component can be significantly reduced. Consequently, our framework is able to address a problematic bottleneck of many low-rank matrix problems: singular value decomposition (SVD) on large matrices. Theoretically, we provide explicit statistical recovery guarantees of PRPCA and include classical robust PCA as a special case.

高次元行列回復問題のほとんどは、対象行列が特定の固有構造を持つという仮定の下で研究されています。画像データ関連の行列回復問題の場合、近似低ランク性と平滑性は、最も一般的に課される2つの構造です。近似低ランク行列回復の場合、ロバスト主成分分析(PCA)は十分に研究されており、効果的であることが証明されています。平滑行列問題の場合、2D融合Lassoおよびその他の全変動ベースのアプローチが基本的な役割を果たしてきました。低ランク性と平滑性はどちらも画像データ解析の重要な仮定ですが、2つの研究ラインの相互作用は非常に限られています。両方の機能を利用することを目的に、この論文では、低ランク行列を平滑行列の空間に投影する、投影ロバストPCA (PRPCA)というフレームワークを開発します。その結果、多くの画像行列を、低ランクで平滑なコンポーネントとスパースコンポーネントに分解できます。この分解の主な利点は、コアとなる低ランクコンポーネントの次元を大幅に削減できることです。その結果、私たちのフレームワークは、多くの低ランク行列問題におけるボトルネックである、大規模行列の特異値分解(SVD)に対処することができます。理論的には、PRPCAの明示的な統計的回復保証を提供し、古典的な堅牢なPCAを特別なケースとして含めます。

Double Spike Dirichlet Priors for Structured Weighting
構造化重み付けのためのダブルスパイクディリクレ事前確率

Assigning weights to a large pool of objects is a fundamental task in a wide variety of applications. In this article, we introduce the concept of structured high-dimensional probability simplexes, in which most components are zero or near zero and the remaining ones are close to each other. Such structure is well motivated by (i) high-dimensional weights that are common in modern applications, and (ii) ubiquitous examples in which equal weights—despite their simplicity—often achieve favorable or even state-of-the-art predictive performance. This particular structure, however, presents unique challenges partly because, unlike high-dimensional linear regression, the parameter space is a simplex and pattern switching between partial constancy and sparsity is unknown. To address these challenges, we propose a new class of double spike Dirichlet priors to shrink a probability simplex to one with the desired structure. When applied to ensemble learning, such priors lead to a Bayesian method for structured high-dimensional ensembles that is useful for forecast combination and improving random forests, while enabling uncertainty quantification. We design efficient Markov chain Monte Carlo algorithms for implementation. Posterior contraction rates are established to study large sample behaviors of the posterior distribution. We demonstrate the wide applicability and competitive performance of the proposed methods through simulations and two real data applications using the European Central Bank Survey of Professional Forecasters data set and a data set from the UC Irvine Machine Learning Repository (UCI).

多数のオブジェクトに重みを割り当てることは、さまざまなアプリケーションで基本的なタスクです。この記事では、ほとんどのコンポーネントがゼロまたはゼロに近く、残りのコンポーネントが互いに近い、構造化された高次元確率単体の概念を紹介します。このような構造は、(i)最新のアプリケーションで一般的な高次元の重み、および(ii)等しい重みが単純であるにもかかわらず、好ましい、または最先端の予測パフォーマンスを達成することが多いという普遍的な例によって十分に動機付けられています。ただし、この特定の構造は、高次元線形回帰とは異なり、パラメーター空間が単体であり、部分的な恒常性とスパース性の間のパターンの切り替えが不明であるため、独特の課題を提示します。これらの課題に対処するために、確率単体を目的の構造に縮小する新しいクラスの二重スパイクディリクレ事前分布を提案します。アンサンブル学習に適用すると、このような事前分布は、構造化された高次元アンサンブルのベイズ法につながり、予測の組み合わせやランダムフォレストの改善に役立ち、不確実性の定量化も可能になります。実装のために、効率的なマルコフ連鎖モンテカルロアルゴリズムを設計しました。事後分布の大規模サンプルの動作を調査するために、事後収縮率が確立されています。欧州中央銀行の専門予測者調査データセットとカリフォルニア大学アーバイン校機械学習リポジトリ(UCI)のデータセットを使用したシミュレーションと2つの実際のデータアプリケーションを通じて、提案された方法の幅広い適用性と競争力のあるパフォーマンスを実証しました。

Quantile regression with ReLU Networks: Estimators and minimax rates
ReLUネットワークによる分位点回帰:推定量とミニマックスレート

Quantile regression is the task of estimating a specified percentile response, such as the median (50th percentile), from a collection of known covariates. We study quantile regression with rectified linear unit (ReLU) neural networks as the chosen model class. We derive an upper bound on the expected mean squared error of a ReLU network used to estimate any quantile conditioning on a set of covariates. This upper bound only depends on the best possible approximation error, the number of layers in the network, and the number of nodes per layer. We further show upper bounds that are tight for two large classes of functions: compositions of Hölder functions and members of a Besov space. These tight bounds imply ReLU networks with quantile regression achieve minimax rates for broad collections of function types. Unlike existing work, the theoretical results hold under minimal assumptions and apply to general error distributions, including heavy-tailed distributions. Empirical simulations on a suite of synthetic response functions demonstrate the theoretical results translate to practical implementations of ReLU networks. Overall, the theoretical and empirical results provide insight into the strong performance of ReLU neural networks for quantile regression across a broad range of function classes and error distributions. All code for this paper is publicly available at https://github.com/tansey/quantile-regression.

分位回帰は、既知の共変量の集合から、中央値(50パーセンタイル)などの指定されたパーセンタイル応答を推定するタスクです。私たちは、選択されたモデルクラスとしてReLU (Rectified Linear Unit)ニューラルネットワークを使用して、分位回帰を研究します。私たちは、共変量の集合を条件とする任意の分位点を推定するために使用されるReLUネットワークの期待平均二乗誤差の上限を導出します。この上限は、可能な限り最良の近似誤差、ネットワーク内の層の数、および層あたりのノードの数にのみ依存します。さらに、Hölder関数の合成とBesov空間のメンバーという2つの大きな関数クラスに対して厳密な上限を示します。これらの厳密な上限は、分位回帰を使用するReLUネットワークが、幅広い関数タイプの集合に対してミニマックスレートを達成することを意味します。既存の研究とは異なり、理論的結果は最小限の仮定の下で成り立ち、ヘビーテール分布を含む一般的な誤差分布に適用されます。一連の合成応答関数に関する実験的シミュレーションは、理論的結果がReLUネットワークの実際の実装に応用できることを実証しています。全体として、理論的および実験的結果は、広範囲の関数クラスとエラー分布にわたる分位回帰に対するReLUニューラルネットワークの優れたパフォーマンスについての洞察を提供します。この論文のすべてのコードは、https://github.com/tansey/quantile-regressionで公開されています。
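
For readers unfamiliar with how a quantile is turned into a training objective, the sketch below shows the standard quantile ("pinball") loss and checks on a toy sample that the constant minimizing it is an empirical quantile. We assume this is the loss a quantile-regression network would minimize; the code is independent of the linked repository, and `pinball_loss` and `best_constant` are hypothetical helper names.

```python
def pinball_loss(targets, predictions, tau):
    """Average quantile (pinball) loss at level `tau` in (0, 1).

    Under-predictions are weighted by tau and over-predictions by 1 - tau,
    so the minimizer is pulled towards the tau-th quantile of the targets.
    """
    total = 0.0
    for y, prediction in zip(targets, predictions):
        residual = y - prediction
        total += max(tau * residual, (tau - 1.0) * residual)
    return total / len(targets)


def best_constant(targets, tau):
    """Search the observed values for the constant with the smallest loss."""
    candidates = sorted(set(targets))
    n = len(targets)
    return min(candidates, key=lambda c: pinball_loss(targets, [c] * n, tau))


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 100.0]  # one heavy outlier
    print(best_constant(sample, tau=0.5))  # median, unaffected by the outlier
    print(best_constant(sample, tau=0.9))  # upper quantile
```

The median estimate is unaffected by the outlier in `sample`, which illustrates why this objective remains meaningful under heavy-tailed errors.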

Multivariate Boosted Trees and Applications to Forecasting and Control
多変量ブーストツリーと予測と制御への応用

Gradient boosted trees are competition-winning, general-purpose, non-parametric regressors, which exploit sequential model fitting and gradient descent to minimize a specific loss function. The most popular implementations are tailored to univariate regression and classification tasks, precluding the possibility of capturing multivariate target cross-correlations and applying structured penalties to the predictions. In this paper, we present a computationally efficient algorithm for fitting multivariate boosted trees. We show that multivariate trees can outperform their univariate counterparts when the predictions are correlated. Furthermore, the algorithm allows the predictions to be regularized arbitrarily, so that properties like smoothness, consistency and functional relations can be enforced. We present applications and numerical results related to forecasting and control.

勾配ブースティングツリーは、コンペティションで優勝した汎用のノンパラメトリック回帰子であり、逐次モデルのフィッティングと勾配降下法を利用して特定の損失関数を最小限に抑えます。最も一般的な実装は、単変量回帰および分類タスクに合わせて調整されているため、多変量ターゲットの相互相関を捕捉し、予測に構造化されたペナルティを適用する可能性が排除されます。この論文では、多変量ブーストツリーを当てはめるための計算効率の高いアルゴリズムを紹介します。私たちは、予測が相関している場合、多変量木が単変量木よりも優れたパフォーマンスを発揮できることを示します。さらに、このアルゴリズムでは、予測を任意に正則化できるため、滑らかさ、一貫性、機能関係などのプロパティを適用できます。予測と制御に関連するアプリケーションと数値結果を提示します。

Mappings for Marginal Probabilities with Applications to Models in Statistical Physics
統計物理学におけるモデルへの応用による周辺確率のマッピング

We present local mappings that relate the marginal probabilities of a global probability mass function represented by its primal normal factor graph to the corresponding marginal probabilities in its dual normal factor graph. The mapping is based on the Fourier transform of the local factors of the models. Details of the mapping are provided for the Ising model, where it is proved that the local extrema of the fixed points are attained at the phase transition of the two-dimensional nearest-neighbor Ising model. The results are further extended to the Potts model, to the clock model, and to Gaussian Markov random fields. By employing the mapping, we can transform simultaneously all the estimated marginal probabilities from the dual domain to the primal domain (and vice versa), which is advantageous if estimating the marginals can be carried out more efficiently in the dual domain. An example of particular significance is the ferromagnetic Ising model in a positive external magnetic field. For this model, there exists a rapidly mixing Markov chain (called the subgraphs–world process) to generate configurations in the dual normal factor graph of the model. Our numerical experiments illustrate that the proposed procedure can provide more accurate estimates of marginal probabilities of a global probability mass function in various settings.

私たちは、主正規因子グラフによって表されるグローバル確率質量関数の周辺確率を、その双対正規因子グラフの対応する周辺確率に関連付けるローカルマッピングを提示します。マッピングは、モデルのローカル因子のフーリエ変換に基づいています。マッピングの詳細は、イジングモデルに対して提供され、そこでは、固定点のローカル極値は、2次元の最近傍イジングモデルの位相遷移で達成されることが証明されています。結果は、ポッツモデル、クロックモデル、およびガウスマルコフランダムフィールドにさらに拡張されます。マッピングを使用することで、推定されたすべての周辺確率を双対領域から主領域に(およびその逆に)同時に変換できます。これは、周辺確率の推定を双対領域でより効率的に実行できる場合に有利です。特に重要な例としては、正の外部磁場における強磁性イジングモデルがあります。このモデルでは、モデルのデュアル正規因子グラフの構成を生成するために、急速に混合するマルコフ連鎖（サブグラフ-ワールドプロセスと呼ばれる）が存在します。数値実験では、提案された手順により、さまざまな設定でグローバル確率質量関数の周辺確率をより正確に推定できることが示されています。

Mitigating the Effects of Non-Identifiability on Inference for Bayesian Neural Networks with Latent Variables
潜在変数を持つベイジアンニューラルネットワークの推論に対する非識別可能性の影響の軽減

Bayesian Neural Networks with Latent Variables (BNN+LVs) capture predictive uncertainty by explicitly modeling model uncertainty (via priors on network weights) and environmental stochasticity (via a latent input noise variable). In this work, we first show that BNN+LV suffers from a serious form of non-identifiability: explanatory power can be transferred between the model parameters and latent variables while fitting the data equally well. We demonstrate that as a result, in the limit of infinite data, the posterior mode over the network weights and latent variables is asymptotically biased away from the ground-truth. Due to this asymptotic bias, traditional inference methods may in practice yield parameters that generalize poorly and misestimate uncertainty. Next, we develop a novel inference procedure that explicitly mitigates the effects of likelihood non-identifiability during training and yields high-quality predictions as well as uncertainty estimates. We demonstrate that our inference method improves upon benchmark methods across a range of synthetic and real data-sets.

潜在変数付きベイジアンニューラルネットワーク(BNN+LV)は、モデルの不確実性(ネットワークの重みの事前分布による)と環境の確率性(潜在的な入力ノイズ変数による)を明示的にモデル化することで、予測の不確実性を捕捉します。この研究では、まず、BNN+LVが深刻な非識別可能性に悩まされていることを示します。つまり、データへの適合度を同じに保ったまま、モデルパラメーターと潜在変数の間で説明力を移し替えることができます。その結果、無限データの極限では、ネットワークの重みと潜在変数の事後モードが、グラウンドトゥルースから漸近的に偏っていることを実証します。この漸近的偏りのため、従来の推論方法では、実際には汎化が不十分で不確実性を誤って推定するパラメーターが生成される可能性があります。次に、トレーニング中に尤度の非識別可能性の影響を明示的に緩和し、高品質の予測と不確実性の推定を生成する新しい推論手順を開発します。この推論方法が、さまざまな合成データセットと実際のデータセットにわたってベンチマークメソッドよりも優れていることを実証します。

Tree-based Node Aggregation in Sparse Graphical Models
スパースグラフィカルモデルにおけるツリーベースのノード集約

High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated. The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal’s practical advantages in simulation and in applications in finance and biology.

高次元のグラフィカルモデルは、多くの場合、ネットワーク内のエッジの数を減らすことを目的とした正則化を使用して推定されます。この研究では、グラフィカルモデルのノードを集約することで、さらに単純なネットワークを生成する方法を示します。私たちは、エッジスパースかつノード集約されたグラフィカルモデルを推定する、ツリー集約グラフィカルラッソ(tag-lasso)と呼ばれる新しい凸正則化方法を開発しています。集約は、ノードの類似性をエンコードし、結果として集約されたノードの解釈を容易にするツリーの形式でサイド情報を活用することにより、データ駆動型の方法で実行されます。私たちは、局所適応型の交互方向乗数法(ADMM)を使用してtag-lassoの効率的な実装を提供し、シミュレーションと金融および生物学の応用における提案の実用的な利点を示します。

Bayesian Covariate-Dependent Gaussian Graphical Models with Varying Structure
さまざまな構造を持つベイズ共変量依存ガウスグラフィカルモデル

We introduce Bayesian Gaussian graphical models with covariates (GGMx), a class of multivariate Gaussian distributions with covariate-dependent sparse precision matrix. We propose a general construction of a functional mapping from the covariate space to the cone of sparse positive definite matrices, which encompasses many existing graphical models for heterogeneous settings. Our methodology is based on a novel mixture prior for precision matrices with a non-local component that admits attractive theoretical and empirical properties. The flexible formulation of GGMx allows both the strength and the sparsity pattern of the precision matrix (hence the graph structure) to change with the covariates. Posterior inference is carried out with a carefully designed Markov chain Monte Carlo algorithm, which ensures the positive definiteness of sparse precision matrices at any given covariate values. Extensive simulations and a case study in cancer genomics demonstrate the utility of the proposed model.

私たちは、共変量依存のスパース精度行列を持つ多変量ガウス分布のクラスである、共変量付きベイズガウスグラフィカルモデル(GGMx)を紹介します。私たちは、共変量空間からスパース正定値行列の円錐への機能マッピングの一般的な構築を提案します。これは、異種設定の既存の多くのグラフィカルモデルを網羅しています。我々の方法論は、魅力的な理論的および経験的特性を認める非局所的要素を持つ、精度行列の新しい混合事前分布に基づいています。GGMxの柔軟な定式化により、精度行列の強度とスパースパターン(したがってグラフ構造)の両方が共変量とともに変化します。事後推論は、任意の共変量値でのスパース精度行列の正定値性を保証する、慎重に設計されたマルコフ連鎖モンテカルロアルゴリズムを使用して実行されます。広範なシミュレーションと癌ゲノミクスのケーススタディにより、提案モデルの有用性が実証されています。

Weakly Supervised Disentangled Generative Causal Representation Learning
弱教師あり解きもつれ生成的因果表現学習

This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method under appropriate supervised information. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interest can be causally related. We show that previous methods with independent priors fail to disentangle causally related factors even under supervision. Motivated by this finding, we propose a new disentangled learning method called DEAR that enables causal controllable generation and causal representation learning. The key ingredient of this new formulation is to use a structural causal model (SCM) as the prior distribution for a bidirectional generative model. The prior is then trained jointly with a generator and an encoder using a suitable GAN algorithm that incorporates supervised information on the ground-truth factors and their underlying causal structure. We provide theoretical justification for the identifiability and asymptotic convergence of the proposed method. We conduct extensive experiments on both synthesized and real data sets to demonstrate the effectiveness of DEAR in causal controllable generation, and the benefits of the learned representations for downstream tasks in terms of sample efficiency and distributional robustness.

この論文では、適切な教師あり情報に基づく、分離生成因果表現(DEAR)学習法を提案します。潜在変数の独立性を強制する既存の分離法とは異なり、関心の根底にある要因が因果関係にある可能性がある一般的なケースを検討します。独立した事前分布を使用する以前の方法では、教師あり学習であっても因果関係にある要因を分離できないことを示す。この発見に動機づけられて、因果制御可能な生成と因果表現学習を可能にする、DEARと呼ばれる新しい分離学習法を提案します。この新しい定式化の重要な要素は、双方向生成モデルの事前分布として構造因果モデル(SCM)を使用することです。次に、事前分布は、グラウンドトゥルース要因とその根底にある因果構造に関する教師あり情報を組み込んだ適切なGANアルゴリズムを使用して、ジェネレーターとエンコーダーとともにトレーニングされます。提案方法の識別可能性と漸近収束について理論的根拠を示す。私たちは、合成データセットと実際のデータセットの両方で広範な実験を実施し、因果制御可能な生成におけるDEARの有効性と、サンプル効率と分布の堅牢性の観点から見た下流のタスクに対する学習された表現の利点を実証します。

MALTS: Matching After Learning to Stretch
MALTS:ストレッチを学んだ後のマッチング

We introduce a flexible framework that produces high-quality almost-exact matches for causal inference. Most prior work in matching uses ad-hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates. In this work, we learn an interpretable distance metric for matching, which leads to substantially higher quality matches. The learned distance metric stretches the covariate space according to each covariate’s contribution to outcome prediction: this stretching means that mismatches on important covariates carry a larger penalty than mismatches on irrelevant covariates. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for the estimation of conditional average treatment effects.

私たちは、因果推論のための高品質でほぼ完全一致を生成する柔軟なフレームワークを導入します。マッチングにおける先行研究のほとんどは、アドホックな距離メトリクスを使用しており、特に無関係な共変量がある場合、マッチングの品質が低下することがよくあります。この研究では、マッチングのための解釈可能な距離メトリクスを学習し、これによりマッチングの品質が大幅に向上します。学習距離メトリックは、結果予測に対する各共変量の寄与度に従って共変量空間を引き伸ばします:この引き伸ばしは、重要な共変量の不一致が無関係な共変量の不一致よりも大きなペナルティを伴うことを意味します。柔軟な距離メトリクスを学習する能力は、条件付き平均治療効果の推定に解釈可能で有用な一致につながります。

Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization
非平滑非凸最適化のための単純で最適な確率的勾配法

We propose and analyze several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction called ProxSVRG+. We provide a clean and tight analysis of ProxSVRG+, which shows that it outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, hence solves an open problem proposed in Reddi et al. 2016. Also, ProxSVRG+ uses far fewer proximal oracle calls than ProxSVRG (Reddi et al. 2016) and extends to the online setting by avoiding full gradient computations. Then, we further propose an optimal algorithm, called SSRGD, based on SARAH (Nguyen et al. 2017) and show that SSRGD further improves the gradient complexity of ProxSVRG+ and achieves the optimal upper bound, matching the known lower bound. Moreover, we show that both ProxSVRG+ and SSRGD adapt automatically to local structure of the objective function such as the Polyak-Lojasiewicz (PL) condition for nonconvex functions in the finite-sum case, i.e., we prove that both of them can automatically switch to faster global linear convergence without any of the restarts performed in prior work. Finally, we focus on the more challenging problem of finding an $(\epsilon, \delta)$-local minimum instead of just finding an $\epsilon$-approximate (first-order) stationary point (which may be a bad unstable saddle point). We show that SSRGD can find an $(\epsilon, \delta)$-local minimum by simply adding some random perturbations. Our algorithm is almost as simple as its counterpart for finding stationary points, and achieves similar optimal rates.

私たちは、非滑らかな正則化項を含みうる非凸の有限和およびオンライン最適化問題において、定常点または局所最小値を見つけるためのいくつかの確率的勾配アルゴリズムを提案し、分析します。まず、分散削減に基づくProxSVRG+と呼ばれる単純な近接確率的勾配アルゴリズムを提案します。我々はProxSVRG+の明確でタイトな分析を提供し、幅広いミニバッチサイズで決定論的近接勾配降下法(ProxGD)よりも優れたパフォーマンスを示し、Reddiら2016で提案された未解決の問題を解決します。また、ProxSVRG+はProxSVRG (Reddiら2016)よりも近接オラクル呼び出しをはるかに少なくし、完全な勾配計算を回避することでオンライン設定に拡張します。次に、SARAH (Nguyenら2017)に基づくSSRGDと呼ばれる最適アルゴリズムをさらに提案し、SSRGDがProxSVRG+の勾配複雑度をさらに改善し、既知の下限と一致する最適な上限を達成することを示します。さらに、ProxSVRG+とSSRGDはどちらも、有限和の場合の非凸関数のPolyak-Lojasiewicz (PL)条件などの目的関数のローカル構造に自動的に適応できることを示します。つまり、以前の研究で行われていた再起動なしで、どちらもより高速なグローバル線形収束に自動的に切り替えることができることを証明します。最後に、$\epsilon$近似(1次)定常点(悪い不安定鞍点である可能性があります)を見つけるだけでなく、$(\epsilon, \delta)$局所最小値を見つけるというより困難な問題に焦点を当てます。SSRGDは、ランダムな摂動をいくつか追加するだけで、$(\epsilon, \delta)$局所最小値を見つけることができることを示します。私たちのアルゴリズムは、定常点を見つけるための対応するアルゴリズムとほぼ同じくらい単純で、同様の最適レートを実現します。

A Wasserstein Distance Approach for Concentration of Empirical Risk Estimates
経験的リスク推定値の集中のためのワッサースタイン距離アプローチ

This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases well known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper.

この論文では、ワッサーシュタイン距離に基づく統一的なアプローチを提示し、論文で定義される2つの広範なリスク尺度のクラスについて、経験的推定値の集中限界を導出します。紹介するリスク尺度のクラスには、条件付きリスク値(CVaR)、最適化された確実性等価リスク、スペクトルリスク尺度、効用ベースの不足リスク、累積プロスペクト理論(CPT)値、順位依存の期待効用、歪んだリスク尺度など、金融文献でよく知られているリスク尺度が特殊なケースとして含まれます。リスク尺度のクラスごとに1つずつ、2つの推定方式が検討されます。1つの推定方式では、ランダム変数(r.v.)のi.i.d.サンプルのコレクションから形成される経験的分布関数にリスク尺度を適用し、2つ目の方式では、同じ手順を切り捨てサンプルに適用します。提供される限界は、サブガウス分布、サブ指数分布、およびヘビーテール分布という3つの一般的な分布クラスに適用されます。境界は、まず推定誤差を真の分布と経験的分布の間のワッサーシュタイン距離に関連付け、次に後者の最近の集中境界を使用して導き出されます。以前の集中境界は、CVaRやCPT値などの特定のリスク尺度に対してのみ使用できます。この論文で導き出された境界は、以前の境界が使用可能な場合、以前の境界と一致するか、それよりも優れていることが示されています。境界の有用性は、論文で紹介された2つのクラスのそれぞれからの一般的なリスク尺度を含む確率的バンディット問題に対するアルゴリズムと対応する後悔境界によって示されます。
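The first estimation scheme in this abstract applies a risk measure to the empirical distribution of i.i.d. samples. A minimal sketch for one of the listed special cases, CVaR, is given below. It assumes the samples are losses (larger is worse) and uses the Rockafellar-Uryasev form of CVaR. The function name and toy data are illustrative and not taken from the paper.

```python
import math


def empirical_cvar(samples, alpha):
    """Plug-in CVaR at level alpha, computed on the empirical distribution.

    Uses the Rockafellar-Uryasev form  VaR + E[(X - VaR)^+] / (1 - alpha),
    with VaR taken as the ceil(n * alpha)-th order statistic.
    """
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil(n * alpha))   # 1-based index of the empirical VaR
    var = ordered[k - 1]
    excess = sum(max(x - var, 0.0) for x in ordered)
    return var + excess / ((1.0 - alpha) * n)


losses = [float(v) for v in range(1, 11)]    # toy sample: 1, 2, ..., 10
cvar_80 = empirical_cvar(losses, alpha=0.8)  # here: mean of the worst 20%
```

On this toy sample the estimate is the average of the two largest losses, 9.5 up to floating-point rounding.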

Nonparametric Principal Subspace Regression
ノンパラメトリック主部分空間回帰

In scientific applications, multivariate observations often come in tandem with temporal or spatial covariates, with which the underlying signals vary smoothly. The standard approaches such as principal component analysis and factor analysis neglect the smoothness of the data, while multivariate linear or nonparametric regression fails to leverage the correlation information among multivariate response variables. We propose a novel approach named nonparametric principal subspace regression to overcome these issues. By decoupling the model discrepancy, a simple two-step estimation procedure is introduced, which takes advantage of the low-rank approximation while keeping smooth dynamics. The theoretical property of the proposed procedure is established under an increasing-dimension framework. We demonstrate the favorable performance of our method in comparison with its counterpart, the conventional nonparametric regression, from both theoretical and numerical perspectives.

科学的なアプリケーションでは、多変量観測は多くの場合、時間的または空間的な共変量と並行して行われ、基礎となる信号は滑らかに変化します。主成分分析や因子分析などの標準的なアプローチでは、データの滑らかさが無視されますが、多変量線形回帰またはノンパラメトリック回帰では、多変量応答変数間の相関情報を活用できません。これらの問題を克服するために、ノンパラメトリック主部分空間回帰という新しいアプローチを提案します。モデルの不一致を分離することにより、単純な2ステップ推定手順が導入され、滑らかなダイナミクスを維持しながら低ランク近似を利用します。提案された手順の理論的特性は、増加次元のフレームワークの下で確立されます。私たちは、理論的および数値的観点から、従来のノンパラメトリック回帰と比較して、私たちの方法の良好な性能を実証します。

KoPA: Automated Kronecker Product Approximation
KoPA: 自動クロネッカー積近似

We consider the problem of matrix approximation and denoising induced by the Kronecker product decomposition. Specifically, we propose to approximate a given matrix by the sum of a few Kronecker products of matrices, which we refer to as the Kronecker product approximation (KoPA). Because the Kronecker product is an extension of the outer product from vectors to matrices, KoPA extends the low rank matrix approximation, and includes it as a special case. Compared with the latter, KoPA also offers greater flexibility, since it allows the user to choose the configuration, that is, the dimensions of the two smaller matrices forming the Kronecker product. On the other hand, the configuration to be used is usually unknown, and needs to be determined from the data in order to achieve the optimal balance between accuracy and parsimony. We propose to use extended information criteria to select the configuration. Under the paradigm of high dimensional analysis, we show that the proposed procedure is able to select the true configuration with probability tending to one, under suitable conditions on the signal-to-noise ratio. We demonstrate the superiority of KoPA over the low rank approximations through numerical studies, and several benchmark image examples.

私たちは、クロネッカー積分解によって生じる行列近似とノイズ除去の問題を考察します。具体的には、与えられた行列をいくつかのクロネッカー積の和で近似することを提案します。これをクロネッカー積近似(KoPA)と呼びます。クロネッカー積はベクトルから行列への外積の拡張であるため、KoPAは低ランク行列近似を拡張し、それを特殊なケースとして含めます。後者と比較して、KoPAは、クロネッカー積を形成する2つの小さな行列の次元である構成をユーザーが選択できるため、より高い柔軟性も提供します。一方、使用する構成は通常は不明であり、精度と節約の最適なバランスを実現するためにデータから決定する必要があります。構成を選択するために、拡張情報基準を使用することを提案します。高次元解析のパラダイムの下で、提案された手順は、信号対雑音比に関する適切な条件下で、確率が1に近づくように真の構成を選択できることを示します。数値研究といくつかのベンチマーク画像の例を通じて、低ランク近似に対するKoPAの優位性を実証します。
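The claim that the Kronecker product extends the outer product can be checked directly. The sketch below is a plain-Python Kronecker product on lists of rows. The helper name `kron` and the example matrices are assumptions for illustration. It shows that a column vector times a row vector reproduces the rank-one outer product, while a (2x2, 2x2) configuration yields a 4x4 matrix.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


# Special case: (m x 1) Kronecker (1 x n) is the outer product u v^T.
u_col = [[1], [2]]
v_row = [[3, 4, 5]]
outer = kron(u_col, v_row)

# A different configuration: two 2 x 2 factors give a 4 x 4 matrix.
a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
block = kron(a, b)
```

Here `outer` is the 2 x 3 matrix with entries `u[i] * v[j]`. The configuration is simply the pair of factor shapes, which is what the abstract proposes to select from data.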

Bounding the Error of Discretized Langevin Algorithms for Non-Strongly Log-Concave Targets
非強対数凹ターゲットに対する離散化ランジュバンアルゴリズムの誤差の制限

In this paper, we provide non-asymptotic upper bounds on the error of sampling from a target density over $\mathbb{R}^p$ using three schemes of discretized Langevin diffusions. The first scheme is the Langevin Monte Carlo (LMC) algorithm, the Euler discretization of the Langevin diffusion. The second and the third schemes are, respectively, the kinetic Langevin Monte Carlo (KLMC) for differentiable potentials and the kinetic Langevin Monte Carlo for twice-differentiable potentials (KLMC2). The main focus is on the target densities that are smooth and log-concave on $\mathbb{R}^p$, but not necessarily strongly log-concave. Bounds on the computational complexity are obtained under two types of smoothness assumption: the potential has a Lipschitz-continuous gradient and the potential has a Lipschitz-continuous Hessian matrix. The error of sampling is measured by Wasserstein-$q$ distances. We advocate for the use of a new dimension-adapted scaling in the definition of the computational complexity, when Wasserstein-$q$ distances are considered. The obtained results show that the number of iterations to achieve a scaled error smaller than a prescribed value depends only polynomially on the dimension.

この論文では、離散化ランジュバン拡散の3つのスキームを使用して、$\mathbb{R}^p$上のターゲット密度からのサンプリング誤差の非漸近的上限を示します。最初のスキームは、ランジュバン拡散のオイラー離散化であるランジュバンモンテカルロ(LMC)アルゴリズムです。2番目と3番目のスキームは、それぞれ微分可能ポテンシャルの運動論的ランジュバンモンテカルロ(KLMC)と2度微分可能ポテンシャルの運動論的ランジュバンモンテカルロ(KLMC2)です。主な焦点は、$\mathbb{R}^p$上で滑らかで対数凹であるが、必ずしも強く対数凹であるわけではないターゲット密度にあります。計算の複雑さの上限は、ポテンシャルがリップシッツ連続勾配を持つ、およびポテンシャルがリップシッツ連続ヘッセ行列を持つという2種類の滑らかさの仮定の下で得られます。サンプリングの誤差は、Wasserstein-$q$距離によって測定されます。Wasserstein-$q$距離を考慮する場合、計算の複雑さの定義に新しい次元適応スケーリングを使用することを推奨します。得られた結果は、規定値よりも小さいスケール誤差を達成するための反復回数が次元の多項式のみに依存することを示しています。
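The first scheme, LMC, is the Euler discretization of the Langevin diffusion: each step moves against the gradient of the potential and adds Gaussian noise scaled by the square root of twice the step size. A minimal one-dimensional sketch follows. The toy potential $f(x) = x^2/2$ (a standard normal target, which is strongly log-concave and therefore simpler than the setting the paper focuses on), the step size, the burn-in length, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def lmc_chain(grad_f, x0, step, n_steps, rng):
    """Langevin Monte Carlo: Euler discretization of dX = -grad f(X) dt + sqrt(2) dW."""
    x = x0
    path = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_f(x) + math.sqrt(2.0 * step) * noise
        path.append(x)
    return path


# Toy target: standard normal, i.e. potential f(x) = x^2 / 2 with gradient x.
rng = random.Random(0)
samples = lmc_chain(lambda x: x, x0=5.0, step=0.05, n_steps=50_000, rng=rng)
kept = samples[1_000:]  # drop an arbitrary burn-in
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
```

The empirical mean and variance should land near 0 and 1, with a small bias from the discretization. That bias is the kind of sampling error the paper's bounds quantify.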

Change point localization in dependent dynamic nonparametric random dot product graphs
従属動的ノンパラメトリックランダム内積グラフにおける変化点局在化

In this paper, we study the offline change point localization problem in a sequence of dependent nonparametric random dot product graphs. To be specific, assume that at every time point, a network is generated from a nonparametric random dot product graph model (see e.g. Athreya et al., 2018), where the latent positions are generated from unknown underlying distributions. The underlying distributions are piecewise constant in time and change at unknown locations, called change points. Most importantly, we allow for dependence among networks generated between two consecutive change points. This setting incorporates edge-dependence within networks and temporal dependence between networks, which is the most flexible setting in the published literature. To accomplish the task of consistently localizing change points, we propose a novel change point detection algorithm, consisting of two steps. First, we estimate the latent positions of the random dot product model, our theoretical result being a refined version of the state-of-the-art results, allowing the dimension of the latent positions to diverge. Subsequently, we construct a nonparametric version of the CUSUM statistic (e.g. Page, 1954; Padilla et al., 2019a) that allows for temporal dependence. Consistent localization is proved theoretically and supported by extensive numerical experiments, which illustrate state-of-the-art performance. We also provide an in-depth discussion of possible extensions to give more understanding and insights.

この論文では、従属ノンパラメトリックランダムドット積グラフのシーケンスにおけるオフライン変化点の特定問題を研究します。具体的には、すべての時点で、ネットワークがノンパラメトリックランダムドット積グラフモデル（Athreyaら、2018などを参照）から生成され、潜在的な位置が未知の基礎分布から生成されると仮定します。基礎分布は時間的に区分的に一定であり、変化点と呼ばれる未知の場所で変化します。最も重要なのは、2つの連続する変化点間で生成されるネットワーク間の依存関係を許容することです。この設定には、ネットワーク内のエッジ依存性とネットワーク間の時間依存性が組み込まれており、これは公開された文献の中で最も柔軟な設定です。変化点を一貫して特定するというタスクを達成するために、2つのステップで構成される新しい変化点検出アルゴリズムを提案します。まず、ランダムドット積モデルの潜在的な位置を推定します。理論的な結果は、最先端の結果を改良したもので、潜在的な位置の次元を発散させることができます。次に、時間依存性を考慮したCUSUM統計のノンパラメトリックバージョン(例: Page、1954、Padillaら、2019a)を構築します。一貫性のあるローカリゼーションは理論的に証明され、最先端のパフォーマンスを示す広範な数値実験によってサポートされています。また、理解と洞察を深めるために、可能な拡張についても詳細に説明します。
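For orientation, the classical univariate CUSUM statistic that the paper's nonparametric version builds on can be sketched as follows. This is the textbook mean-shift form for a scalar sequence, not the paper's network statistic. The function names and the noise-free toy series are assumptions for illustration.

```python
import math


def cusum_statistic(xs, t):
    """Classical CUSUM contrast between xs[:t] and xs[t:] (1 <= t < len(xs))."""
    n = len(xs)
    left_mean = sum(xs[:t]) / t
    right_mean = sum(xs[t:]) / (n - t)
    return math.sqrt(t * (n - t) / n) * abs(left_mean - right_mean)


def localize_change_point(xs):
    """Return the split t that maximizes the CUSUM statistic."""
    return max(range(1, len(xs)), key=lambda t: cusum_statistic(xs, t))


series = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]  # mean shifts after 4 points
estimated = localize_change_point(series)
```

The maximizer is the split after the fourth observation, where the two segment means differ the most once the statistic's weighting by segment lengths is taken into account.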

An Efficient Sampling Algorithm for Non-smooth Composite Potentials
非平滑複合ポテンシャルの効率的なサンプリングアルゴリズム

We consider the problem of sampling from a density of the form $p(x) \propto \exp(-f(x)- g(x))$, where $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a smooth function and $g: \mathbb{R}^d \rightarrow \mathbb{R}$ is a convex and Lipschitz function. We propose a new algorithm based on the Metropolis–Hastings framework. Under certain isoperimetric inequalities on the target density, we prove that the algorithm mixes to within total variation (TV) distance $\varepsilon$ of the target density in at most $O(d \log (d/\varepsilon))$ iterations. This guarantee extends previous results on sampling from distributions with smooth log densities ($g = 0$) to the more general composite non-smooth case, with the same mixing time up to a multiple of the condition number. Our method is based on a novel proximal-based proposal distribution that can be efficiently computed for a large class of non-smooth functions $g$. Simulation results on posterior sampling problems that arise from the Bayesian Lasso show empirical advantage over previous proposal distributions.

私たちは、形式$p(x) \propto \exp(-f(x)- g(x))$の密度からのサンプリングの問題を考察します。ここで、$f: \mathbb{R}^d \rightarrow \mathbb{R}$は滑らかな関数、$g: \mathbb{R}^d \rightarrow \mathbb{R}$は凸Lipschitz関数です。私たちは、メトロポリス-ヘイスティングスフレームワークに基づく新しいアルゴリズムを提案します。ターゲット密度に関する特定の等周不等式の下で、アルゴリズムが最大$O(d \log (d/\varepsilon))$回の反復でターゲット密度の全変動(TV)距離$\varepsilon$内で混合することを証明します。この保証は、滑らかな対数密度($g = 0$)を持つ分布からのサンプリングに関する以前の結果を、条件数の倍数まで同じ混合時間で、より一般的な複合非滑らかなケースに拡張します。私たちの方法は、大規模な非滑らかな関数$g$のクラスに対して効率的に計算できる、新しい近似ベースの提案分布に基づいています。ベイジアンLassoから生じる事後サンプリング問題に関するシミュレーション結果は、以前の提案分布よりも経験的に優れていることを示しています。
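The abstract does not spell out the proximal-based proposal, so the sketch below shows only the building block such methods rely on: a proximal operator that has a closed form for a non-smooth $g$. It assumes $g(x) = \|x\|_1$, the penalty behind the Bayesian Lasso, for which the proximal operator is coordinate-wise soft-thresholding. The function name and inputs are illustrative.

```python
def soft_threshold(x, step):
    """Proximal operator of step * ||.||_1, applied coordinate-wise.

    Each coordinate is shrunk toward zero by `step`; coordinates whose
    magnitude is at most `step` are set exactly to zero.
    """
    return [
        (abs(v) - step) * (1.0 if v > 0 else -1.0) if abs(v) > step else 0.0
        for v in x
    ]


point = [3.0, -0.5, 1.5, -2.0]
shrunk = soft_threshold(point, step=1.0)
```

Because this map costs one pass over the coordinates, a proposal built around it stays cheap to evaluate, which matches the abstract's remark that the proposal can be computed efficiently for a large class of $g$.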

Gaussian Process Boosting
ガウスプロセスブースティング

We introduce a novel way to combine boosting with Gaussian process and mixed effects models. This allows for relaxing, first, the zero or linearity assumption for the prior mean function in Gaussian process and grouped random effects models in a flexible non-parametric way and, second, the independence assumption made in most boosting algorithms. The former is advantageous for prediction accuracy and for avoiding model misspecifications. The latter is important for efficient learning of the fixed effects predictor function and for obtaining probabilistic predictions. Our proposed algorithm is also a novel solution for handling high-cardinality categorical variables in tree-boosting. In addition, we present an extension that scales to large data using a Vecchia approximation for the Gaussian process model relying on novel results for covariance parameter inference. We obtain increased prediction accuracy compared to existing approaches on multiple simulated and real-world data sets.

私たちは、ブースティングをガウスプロセスモデルおよび混合効果モデルと組み合わせる新しい方法を紹介します。これにより、第一に、ガウス過程における事前平均関数のゼロまたは線形性の仮定と、柔軟なノンパラメトリックな方法でグループ化されたランダム効果モデルの仮定、そして第二に、ほとんどのブースティングアルゴリズムで行われる独立性の仮定を緩和することができます。前者は、予測精度とモデルの誤指定の回避に有利です。後者は、固定効果予測関数を効率的に学習し、確率的予測を得るために重要です。私たちが提案するアルゴリズムは、ツリーブースティングで高カーディナリティのカテゴリ変数を処理するための新しいソリューションでもあります。さらに、共分散パラメーター推論の新しい結果に依存するガウス過程モデルのVecchia近似を使用して、大規模なデータにスケーリングする拡張を示します。複数のシミュレーションデータセットと実世界のデータセットに対する既存のアプローチと比較して、予測精度が向上しています。

Representation Learning for Maximization of MI, Nonlinear ICA and Nonlinear Subspaces with Robust Density Ratio Estimation
ロバスト密度比推定によるMI、非線形ICA、および非線形部分空間の最大化のための表現学習

Unsupervised representation learning is one of the most important problems in machine learning. A recent promising approach is contrastive learning: A feature representation of data is learned by solving a pseudo classification problem where class labels are automatically generated from unlabelled data. However, it is not straightforward to understand what representation contrastive learning yields through the classification problem. In addition, most practical methods for contrastive learning are based on maximum likelihood estimation, which is often vulnerable to contamination by outliers. In order to promote the understanding of contrastive learning, this paper first theoretically shows a connection to maximization of mutual information (MI). Our result indicates that density ratio estimation is necessary and sufficient for maximization of MI under some conditions. Since popular objective functions for classification can be regarded as estimating density ratios, contrastive learning related to density ratio estimation can be interpreted as maximizing MI. Next, in terms of density ratio estimation, we establish new recovery conditions for the latent source components in nonlinear independent component analysis (ICA). In contrast with existing work, the established conditions include a novel insight for the dimensionality of data, which is clearly supported by numerical experiments. Furthermore, inspired by nonlinear ICA, we propose a novel framework to estimate a nonlinear subspace for lower-dimensional latent source components, and some theoretical conditions for the subspace estimation are established with density ratio estimation. Motivated by the theoretical results, we propose a practical method through outlier-robust density ratio estimation, which can be seen as performing maximization of MI, nonlinear ICA or nonlinear subspace estimation. Moreover, a sample-efficient nonlinear ICA method is also proposed based on a variational lower-bound of MI. Then, we theoretically investigate outlier-robustness of the proposed methods. Finally, we numerically demonstrate usefulness of the proposed methods in nonlinear ICA and through application to a downstream task for linear classification.

教師なし表現学習は、機械学習における最も重要な問題の一つです。最近の有望なアプローチは、対照学習です。データの特徴表現は、ラベルなしデータからクラスラベルが自動的に生成される疑似分類問題を解くことによって学習されます。しかし、分類問題を通して対照学習がどのような表現を生み出すかを理解するのは簡単ではありません。さらに、対照学習の実用的な方法のほとんどは最大尤度推定に基づいており、外れ値による汚染に対して脆弱であることがよくあります。対照学習への理解を促進するために、本論文ではまず、相互情報量(MI)の最大化との関連を理論的に示します。私たちの結果は、いくつかの条件下では、密度比推定がMIの最大化に必要かつ十分であることを示しています。分類の一般的な目的関数は密度比を推定するものと見なすことができるため、密度比推定に関連する対照学習はMIの最大化と解釈できます。次に、密度比推定の観点から、非線形独立成分分析(ICA)における潜在的なソース成分の新しい回復条件を確立します。既存の研究とは対照的に、確立された条件には、数値実験によって明確にサポートされているデータの次元に関する新しい洞察が含まれています。さらに、非線形ICAに触発されて、低次元の潜在ソースコンポーネントの非線形サブスペースを推定する新しいフレームワークを提案し、サブスペース推定のいくつかの理論的条件が密度比推定によって確立されています。理論的結果に動機付けられて、外れ値に強い密度比推定による実用的な方法を提案します。これは、MI、非線形ICA、または非線形サブスペース推定の最大化を実行するものと見なすことができます。さらに、MIの変分下限に基づいて、サンプル効率の高い非線形ICA方法も提案されています。次に、提案された方法の外れ値に対する堅牢性を理論的に調査します。最後に、非線形ICAでの提案方法の有用性を数値的に実証し、線形分類のダウンストリームタスクへの適用を通じて実証します。

Multi-Task Dynamical Systems
マルチタスクダイナミックシステム

Time series datasets are often composed of a variety of sequences from the same domain, but from different entities, such as individuals, products, or organizations. We are interested in how time series models can be specialized to individual sequences (capturing the specific characteristics) while still retaining statistical power by sharing commonalities across the sequences. This paper describes the multi-task dynamical system (MTDS), a general methodology for extending multi-task learning (MTL) to time series models. Our approach endows dynamical systems with a set of hierarchical latent variables which can modulate all model parameters. To our knowledge, this is a novel development of MTL, and applies to time series both with and without control inputs. We apply the MTDS to motion-capture data of people walking in various styles using a multi-task recurrent neural network (RNN), and to patient drug-response data using a multi-task pharmacodynamic model.

時系列データセットは、多くの場合、同じドメインから、しかし個人、製品、組織など、異なるエンティティからのさまざまなシーケンスで構成されています。私たちは、時系列モデルを個々のシーケンスに特化し(特定の特性をキャプチャする)、シーケンス間で共通性を共有することで統計的な検出力を保持する方法に関心があります。この論文では、マルチタスク動的システム(MTDS)について説明します。マルチタスク学習(MTL)を時系列モデルに拡張するための一般的な方法論。私たちのアプローチは、すべてのモデルパラメータを調節できる一連の階層的潜在変数を動的システムに付与します。私たちの知る限り、これはMTLの斬新な開発であり、制御入力がある場合とない場合の両方の時系列に適用されます。MTDSは、マルチタスク再帰型ニューラルネットワーク(RNN)を使用してさまざまなスタイルで歩く人々のモーションキャプチャデータに適用され、マルチタスク薬力学モデルを使用して患者の薬物反応データに適用されます。

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration
実践的アダムに向けて:非凸性、収束理論、ミニバッチ加速

Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been shown to diverge even in the simple convex setting via a few simple counterexamples. Many remedies, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been tried to make Adam-type algorithms converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization. This observation, coupled with this sufficient condition, gives much deeper interpretations of the divergence of Adam. On the other hand, in practice, mini-batch Adam and distributed Adam are widely used without any theoretical guarantee. We further give an analysis on how the batch size or the number of nodes in the distributed system affects the convergence of Adam, which theoretically shows that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or a larger number of nodes. Finally, we apply the generic Adam and mini-batch Adam with the sufficient condition for solving the counterexample and training several neural networks on various real-world datasets. Experimental results are exactly in accord with our theoretical analysis.

Adamは、ディープニューラルネットワークをトレーニングするための最も影響力のある適応型確率アルゴリズムの1つですが、いくつかの簡単な反例により、単純な凸設定でも発散することが指摘されています。適応学習率を下げる、大きなバッチサイズを採用する、時間的非相関化手法を組み込む、類似の代理を探すなど、Adam型アルゴリズムの収束を促進するために多くの試みが行われてきました。既存のアプローチとは対照的に、大規模な非凸確率最適化を解くための汎用Adamの大域的収束を保証するために、基本学習率のパラメーターと過去の2次モーメントの組み合わせのみに依存する、簡単に確認できる代替の十分条件を導入します。この観察とこの十分条件を組み合わせることで、Adamの発散に関するより深い解釈が得られます。一方、実際には、ミニAdamと分散Adamは理論的な保証なしに広く使用されています。さらに、バッチサイズまたは分散システム内のノード数がAdamの収束にどのように影響するかを分析します。この分析は、より大きなミニバッチサイズまたはより多くのノードを使用することで、ミニバッチAdamと分散Adamを線形に加速できることを理論的に示しています。最後に、十分条件を満たす汎用AdamとミニバッチAdamを、反例の求解と、さまざまな実データセットでのいくつかのニューラルネットワークのトレーニングに適用します。実験結果は、理論分析と完全に一致しています。
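As a point of reference for the abstract above, here is a minimal sketch of the textbook Adam update driven by mini-batch gradients. It is not the paper's "generic Adam" and it does not check the paper's sufficient condition; the toy least-squares problem, the hyperparameter defaults, and the helper names `adam_step`, `minibatch_grad`, and `run_adam` are all assumptions made for illustration.

```python
import math
import random


def adam_step(x, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update on lists of floats; t is the 1-based step count."""
    new_x, new_m, new_v = [], [], []
    for xi, gi, mi, vi in zip(x, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)             # bias corrections
        v_hat = vi / (1 - beta2 ** t)
        new_x.append(xi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_x, new_m, new_v


def minibatch_grad(x, rows, targets, batch):
    """Average gradient of 0.5 * (a.x - b)^2 over the sampled row indices."""
    grad = [0.0] * len(x)
    for i in batch:
        residual = sum(a * xi for a, xi in zip(rows[i], x)) - targets[i]
        for j, a in enumerate(rows[i]):
            grad[j] += residual * a / len(batch)
    return grad


def run_adam(batch_size, steps=2000, seed=0, n=200, dim=3):
    """Fit a noiseless toy least-squares problem; return (initial, final) error."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(dim)]
    rows = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    targets = [sum(a * w for a, w in zip(row, x_true)) for row in rows]
    x, m, v = [0.0] * dim, [0.0] * dim, [0.0] * dim
    initial = math.dist(x, x_true)
    for t in range(1, steps + 1):
        batch = rng.sample(range(n), batch_size)
        x, m, v = adam_step(x, minibatch_grad(x, rows, targets, batch), m, v, t)
    return initial, math.dist(x, x_true)


if __name__ == "__main__":
    for size in (4, 16, 64):
        print(size, run_adam(size))
```

Running the script prints the distance to the true parameters before and after training for a few batch sizes. This toy problem only shows where the batch size enters the update; it says nothing about the linear speed-up that the paper proves.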

Asymptotic Study of Stochastic Adaptive Algorithms in Non-convex Landscape
非凸ランドスケープにおける確率的適応アルゴリズムの漸近的研究

This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning, and among them Adagrad and Rmsprop, which are involved in most of the blackbox deep learning algorithms. Our setup is the non-convex landscape optimization point of view, we consider a one time scale parametrization and the situation where these algorithms may or may not be used with mini-batches. We adopt the point of view of stochastic algorithms and establish the almost sure convergence of these methods when using a decreasing step-size towards the set of critical points of the target function. With a mild extra assumption on the noise, we also obtain the convergence towards the set of minimizers of the function. Along our study, we also obtain a “convergence rate” of the methods, namely a bound on the expected value of the gradient of the cost function along a finite number of iterations.

この論文では、最適化と機械学習で広く使用されている適応アルゴリズムのいくつかの漸近特性を研究します。その中には、ほとんどのブラックボックス深層学習アルゴリズムに関与しているAdagradとRmspropが含まれます。私たちの設定は非凸ランドスケープ最適化の観点であり、単一の時間スケールによるパラメータ化と、これらのアルゴリズムがミニバッチとともに使用される場合と使用されない場合の両方の状況を考慮します。確率的アルゴリズムの視点を採用し、減少するステップサイズを用いる場合に、これらの手法が目的関数の臨界点の集合へほぼ確実に収束することを確立します。ノイズに関する緩やかな追加の仮定の下で、関数の最小化点の集合への収束も得られます。また、研究の過程で、これらの手法の「収束率」、つまり有限回の反復にわたるコスト関数の勾配の期待値に対する上界も得られます。
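The two methods named in the abstract above can be sketched as follows, using their commonly stated update rules and a decreasing step size. The step-size schedule `lr0 / sqrt(t)`, the decay factor `rho`, and the helper `minimize` are assumptions for illustration; the paper's exact parametrization is not given in the abstract.

```python
import math


def adagrad_step(x, grad, accum, lr, eps=1e-8):
    """Adagrad: accumulate all past squared gradients per coordinate."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    x = [xi - lr * g / (math.sqrt(a) + eps) for xi, g, a in zip(x, grad, accum)]
    return x, accum


def rmsprop_step(x, grad, accum, lr, rho=0.9, eps=1e-8):
    """RMSprop: exponential moving average of squared gradients."""
    accum = [rho * a + (1 - rho) * g * g for a, g in zip(accum, grad)]
    x = [xi - lr * g / (math.sqrt(a) + eps) for xi, g, a in zip(x, grad, accum)]
    return x, accum


def minimize(step_fn, x0, grad_fn, steps=500, lr0=0.5):
    """Run step_fn with the decreasing step size lr0 / sqrt(t)."""
    x, accum = list(x0), [0.0] * len(x0)
    for t in range(1, steps + 1):
        x, accum = step_fn(x, grad_fn(x), accum, lr0 / math.sqrt(t))
    return x


if __name__ == "__main__":
    quadratic_grad = lambda x: list(x)  # gradient of 0.5 * ||x||^2
    print(minimize(adagrad_step, [3.0, -2.0], quadratic_grad))
    print(minimize(rmsprop_step, [3.0, -2.0], quadratic_grad))
```

The only difference between the two updates is how the squared gradients are aggregated: a running sum for Adagrad and an exponential moving average for RMSprop.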

Gaussian Process Parameter Estimation Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits
ミニバッチ確率的勾配降下法を用いたガウス過程パラメータ推定:収束保証と経験的利点

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.

確率的勾配降下法(SGD)とその変種は、一般化性能と固有の計算上の利点により、独立したサンプルを持つ大規模な機械学習の問題に対する頼りになるアルゴリズムとしての地位を確立しています。しかし、確率的勾配は相関サンプルを持つ完全な勾配の偏った推定値であるという事実により、相関設定下でのSGDの動作に関する理論的理解が不足し、そのような場合の使用が妨げられてきました。この論文では、ガウス過程(GP)のハイパーパラメータ推定に焦点を当て、ミニバッチSGDが完全な対数尤度損失関数の臨界点に収束し、ミニバッチサイズに依存する統計的誤差項まで、$K$回の反復で$O(\frac{1}{K})$の速度でモデルのハイパーパラメータを回復することを証明することで、障壁を打破するための一歩を踏み出します。我々の理論的な保証は、カーネル関数が指数関数的または多項式的な固有減衰を示し、GPで一般的に使用されるさまざまなカーネルがこれを満たすことを条件に成立します。シミュレーションと実際のデータセットの両方に関する数値的研究により、ミニバッチSGDは最先端のGP手法よりも一般化が優れていると同時に、計算負荷が軽減され、GPに対してこれまで未開拓だった新しいデータサイズ領域が開かれることが実証されています。

Underspecification Presents Challenges for Credibility in Modern Machine Learning
仕様不足は、現代の機械学習における信頼性に課題をもたらします

Machine learning (ML) systems often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification in ML pipelines as a key reason for these failures. An ML pipeline is the full procedure followed to train and validate a predictor. Such a pipeline is underspecified when it can return many distinct predictors with equivalently strong test performance. Underspecification is common in modern ML pipelines that primarily validate predictors on held-out data that follow the same distribution as the training data. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We provide evidence that underspecification has substantive implications for practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

機械学習(ML)システムは、実世界のドメインに展開されると、予期せず性能の低い挙動を示すことがよくあります。私たちは、こうした失敗の主な原因として、MLパイプラインにおける仕様不足(underspecification)を特定しました。MLパイプラインとは、予測子をトレーニングおよび検証するために従う手順全体のことです。そのようなパイプラインは、同等に高いテスト性能を持つ多数の異なる予測子を返しうる場合に、仕様不足であるといいます。仕様不足は、主にトレーニングデータと同じ分布に従うホールドアウトデータで予測子を検証する最新のMLパイプラインでよく見られます。仕様不足のパイプラインによって返される予測子は、トレーニングドメインでの性能に基づいて同等として扱われることがよくありますが、ここでは、そのような予測子がデプロイメントドメインで非常に異なる挙動を示しうることを示します。このあいまいさは、実際には不安定性とモデルの挙動不良につながる可能性があり、トレーニングドメインとデプロイメントドメインの構造的不一致から生じる既知の問題とは異なる障害モードです。コンピュータービジョン、医療画像処理、自然言語処理、電子健康記録に基づく臨床リスク予測、医療ゲノミクスの例を使用して、仕様不足が実用的なMLパイプラインに実質的な影響を与えるという証拠を示します。私たちの結果は、あらゆるドメインでの実世界への展開を目的としたモデリングパイプラインにおいて、仕様不足を明示的に考慮する必要があることを示しています。

Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions
ランク付けされた選好データに対する2サンプル検定とモデリング仮定の役割

A number of applications require two-sample testing on ranked preference data. For instance, in crowdsourcing, there is a long-standing question of whether pairwise-comparison data provided by people is distributed identically to ratings-converted-to-comparisons. Other applications include sports data analysis and peer grading. In this paper, we design two-sample tests for pairwise-comparison data and ranking data. For our two-sample test for pairwise-comparison data, we establish an upper bound on the sample complexity required to correctly test whether the distributions of the two sets of samples are identical. Our test requires essentially no assumptions on the distributions. We then prove complementary lower bounds showing that our results are tight (in the minimax sense) up to constant factors. We investigate the role of modeling assumptions by proving lower bounds for a range of pairwise-comparison models (WST, MST, SST, parameter-based such as BTL and Thurstone). We also provide tests and associated sample complexity bounds for partial (or total) ranking data. Furthermore, we empirically evaluate our results via extensive simulations as well as three real-world data sets consisting of pairwise-comparisons and rankings. By applying our two-sample test on real-world pairwise-comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.

多くのアプリケーションでは、順位付けされた選好データに対する2サンプル検定が必要です。たとえば、クラウドソーシングでは、人々が提供するペアワイズ比較データが、評価を比較に変換したデータと同一に分布しているかどうかという長年の疑問があります。その他のアプリケーションには、スポーツデータ分析やピアグレーディングがあります。この論文では、ペアワイズ比較データとランキングデータに対する2サンプル検定を設計します。ペアワイズ比較データの2サンプル検定では、2セットのサンプルの分布が同一であるかどうかを正しく検定するために必要なサンプル複雑度の上界を確立します。この検定では、分布に関する仮定は基本的に必要ありません。次に、私たちの結果が定数因子を除いて(ミニマックスの意味で)タイトであることを示す、補完的な下界を証明します。さまざまなペアワイズ比較モデル(WST、MST、SST、BTLやThurstoneなどのパラメーターベース)の下界を証明することで、モデリングの仮定の役割を調査します。また、部分的(または全体的)ランキングデータに対する検定と、関連するサンプル複雑度のバウンドも提供します。さらに、広範なシミュレーションと、ペアワイズ比較とランキングからなる3つの実データセットを通じて、結果を経験的に評価します。実際のペアワイズ比較データに2サンプル検定を適用することで、人々が提供する評価とランキングは実際に異なる分布に従うという結論に達しました。

Getting Better from Worse: Augmented Bagging and A Cautionary Tale of Variable Importance
悪いものから良いものへ:拡張バギングと変数重要度に関する教訓

As the size, complexity, and availability of data continues to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specifications. Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships among variables. Here, motivated by recent insights into random forest behavior, we introduce the simple idea of augmented bagging (AugBagg), a procedure that operates in an identical fashion to classical bagging and random forests, but which operates on a larger, augmented space containing additional randomly generated noise features. Surprisingly, we demonstrate that this simple act of including extra noise variables in the model can lead to dramatic improvements in out-of-sample predictive accuracy, sometimes outperforming even an optimally tuned traditional random forest. As a result, intuitive notions of variable importance based on improved model accuracy may be deeply flawed, as even purely random noise can routinely register as statistically significant. Numerous demonstrations on both real and synthetic data are provided along with a proposed solution.

データのサイズ、複雑さ、および可用性が増大し続けるにつれて、科学者は、最小限の事前のモデル指定で正確な予測を提供できることが多いブラックボックス学習アルゴリズムにますます依存するようになっています。ランダムフォレストなどのツールは、そのまま使って成功するという実績が確立されており、変数間の根本的な関係を分析するためのさまざまな戦略も提供しています。ここでは、ランダムフォレストの挙動に関する最近の洞察に触発されて、拡張バギング(AugBagg)というシンプルなアイデアを紹介します。これは、従来のバギングやランダムフォレストと同じように機能しますが、ランダムに生成された追加のノイズ特徴量を含む、より大きな拡張空間上で動作する手順です。驚くべきことに、モデルに追加のノイズ変数を含めるというこの単純な行為により、サンプル外の予測精度が劇的に向上し、最適に調整された従来のランダムフォレストさえ上回る場合があることを示します。その結果、モデル精度の改善に基づく変数重要度の直感的な概念には、深刻な欠陥がある可能性があります。純粋にランダムなノイズでさえ、日常的に統計的に有意と判定されうるためです。実データと合成データの両方に関する多数のデモンストレーションを、提案する解決策とともに示します。
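The augmentation step described in the abstract above can be sketched as follows. The abstract only says that "additional randomly generated noise features" are appended; drawing them from a standard normal distribution, the row-list data layout, and the function name `augment_with_noise` are assumptions for illustration. Any bagging or random forest learner would then be trained on the augmented rows instead of the original ones.

```python
import random


def augment_with_noise(X, n_noise, seed=0):
    """Return a copy of X (a list of rows) with n_noise N(0, 1) columns appended."""
    rng = random.Random(seed)
    return [list(row) + [rng.gauss(0.0, 1.0) for _ in range(n_noise)] for row in X]


if __name__ == "__main__":
    X = [[0.2, 1.5], [0.9, -0.3], [1.1, 0.4]]
    for row in augment_with_noise(X, n_noise=2):
        print(row)
```

The original columns are left untouched, so a fitted model's importance scores for the appended columns measure how much "importance" pure noise can receive, which is the cautionary point of the abstract.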

On Acceleration for Convex Composite Minimization with Noise-Corrupted Gradients and Approximate Proximal Mapping
ノイズを含む勾配と近似的な近接写像による凸複合最小化の加速について

The accelerated proximal methods (APM) have become one of the most important optimization tools for large-scale convex composite minimization problems, due to their wide range of applications and the optimal convergence rate in first-order algorithms. However, most existing theoretical results of APM are obtained by assuming that the gradient oracle is exact and the proximal mapping must be exactly solved, which may not hold in practice. This work presents a theoretical study of APM by allowing to use inexact gradient oracle and approximate proximal mapping. Specifically, we analyze inexact APM by improving the approximate duality gap technique (ADGT) which was originally designed for convergence analysis for first-order methods with both exact gradient oracle and proximal mapping. Our approach has several advantages: 1) we provide a unified convergence analysis that allows both inexact gradient oracle and approximate proximal mapping; 2) our proof is generic that naturally recovers the convergence rates of both accelerated and non-accelerated proximal methods, on top of which the advantages and the disadvantages of acceleration can be easily derived; 3) we derive the same convergence bound as previous methods in terms of inexact gradient oracle, but a tighter convergence bound in terms of approximate proximal mapping.

加速近接法(APM)は、その幅広い応用範囲と一次アルゴリズムにおける最適な収束率により、大規模な凸複合最小化問題に対する最も重要な最適化ツールの1つとなっています。しかし、APMに関する既存の理論的結果のほとんどは、勾配オラクルが正確であり、近接写像が厳密に解かれるという仮定の下で得られており、この仮定は実際には成り立たない可能性があります。この研究では、不正確な勾配オラクルと近似的な近接写像の使用を許容した上で、APMの理論的研究を提示します。具体的には、もともと正確な勾配オラクルと正確な近接写像の両方を備えた一次法の収束解析のために設計された近似双対性ギャップ手法(ADGT)を改良することで、不正確なAPMを解析します。このアプローチには、いくつかの利点があります。1)不正確な勾配オラクルと近似的な近接写像の両方を許容する、統一された収束解析を提供します。2)私たちの証明は汎用的で、加速近接法と非加速近接法の両方の収束率を自然に復元し、その上で加速の利点と欠点を簡単に導くことができます。3)不正確な勾配オラクルに関しては以前の方法と同じ収束境界を導きますが、近似的な近接写像に関してはよりタイトな収束境界を導きます。

Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization
分散低減(VR)EXTRAとDIGing、および強凸非集中型最適化のためのそれらの最適な加速

We study stochastic decentralized optimization for the problem of training machine learning models with large-scale distributed data. We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires the time of $O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds to reach precision $\epsilon$, which are the best complexities among the non-accelerated gradient-type methods, where $\kappa_s$ and $\kappa_b$ are the stochastic condition number and batch condition number for strongly convex and smooth problems, respectively, $\kappa_c$ is the condition number of the communication network, and $n$ is the sample size on each distributed node. The proposed VR-DIGing has a little higher communication cost of $O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$. Our stochastic gradient computation complexities are the same as the ones of single-machine VR methods, such as SAG, SAGA, and SVRG, and our communication complexities keep the same as those of EXTRA and DIGing, respectively. To further speed up the convergence, we also propose the accelerated VR-EXTRA and VR-DIGing with both the optimal $O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient computation complexity and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$ communication complexity. Our stochastic gradient computation complexity is also the same as the ones of single-machine accelerated VR methods, such as Katyusha, and our communication complexity keeps the same as those of accelerated full batch decentralized methods, such as MSDA. To the best of our knowledge, our accelerated methods are the first to achieve both the optimal stochastic gradient computation complexity and communication complexity in the class of gradient-type methods.

私たちは、大規模な分散データで機械学習モデルをトレーニングする問題に対する、確率的な非集中型(decentralized)最適化を研究します。広く使用されているEXTRA法とDIGing法を分散低減(variance reduction、VR)で拡張し、VR-EXTRAとVR-DIGingという2つの方法を提案します。提案するVR-EXTRAは、精度$\epsilon$に到達するために、$O((\kappa_s+n)\log\frac{1}{\epsilon})$回の確率的勾配評価と$O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$回の通信ラウンドを必要とし、これらは非加速勾配型手法の中で最良の計算量です。ここで、$\kappa_s$と$\kappa_b$は、強凸かつ滑らかな問題に対する確率的条件数とバッチ条件数、$\kappa_c$は通信ネットワークの条件数、$n$は各分散ノード上のサンプルサイズです。提案するVR-DIGingの通信コストは$O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$と少し高くなります。私たちの確率的勾配計算の計算量は、SAG、SAGA、SVRGなどの単一マシンVR手法と同じで、通信計算量はそれぞれEXTRAとDIGingと同じです。収束をさらに高速化するために、最適な$O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$の確率的勾配計算量と$O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$の通信計算量の両方を備えた、加速版のVR-EXTRAとVR-DIGingも提案します。この確率的勾配計算量もKatyushaなどの単一マシン加速VR手法と同じであり、通信計算量はMSDAなどの加速フルバッチ非集中型手法と同じです。私たちの知る限り、私たちの加速手法は、勾配型手法のクラスにおいて、最適な確率的勾配計算量と最適な通信計算量の両方を達成した最初のものです。

Behavior Priors for Efficient Reinforcement Learning
効率的な強化学習のための行動事前確率

As we deploy reinforcement learning agents to solve increasingly challenging problems, methods that allow us to inject prior knowledge about the structure of the world and effective solution strategies become increasingly important. In this work we consider how information and architectural constraints can be combined with ideas from the probabilistic modeling literature to learn behavior priors that capture the common movement and interaction patterns that are shared across a set of related tasks or contexts. For example, the day-to-day behavior of humans comprises distinctive locomotion and manipulation patterns that recur across many different situations and goals. We discuss how such behavior patterns can be captured using probabilistic trajectory models and how these can be integrated effectively into reinforcement learning schemes, e.g. to facilitate multi-task and transfer learning. We then extend these ideas to latent variable models and consider a formulation to learn hierarchical priors that capture different aspects of the behavior in reusable modules. We discuss how such latent variable formulations connect to related work on hierarchical reinforcement learning (HRL) and mutual information and curiosity based objectives, thereby offering an alternative perspective on existing ideas. We demonstrate the effectiveness of our framework by applying it to a range of simulated continuous control domains, videos of which can be found at the following url: https://sites.google.com/view/behavior-priors.

強化学習エージェントを展開してますます困難な問題を解決するにつれて、世界の構造と効果的な解決戦略に関する事前知識を注入できる方法がますます重要になります。この研究では、情報とアーキテクチャの制約を確率モデリングの文献のアイデアと組み合わせて、一連の関連するタスクまたはコンテキスト全体で共有される共通の動きと相互作用のパターンをキャプチャする動作の事前確率を学習する方法を検討します。たとえば、人間の日常の行動は、さまざまな状況や目標にわたって繰り返される独特の移動と操作のパターンで構成されています。確率的軌道モデルを使用してこのような動作パターンをキャプチャする方法と、マルチタスクや転移学習を容易にするためにこれらを強化学習スキームに効果的に統合する方法について説明します。次に、これらのアイデアを潜在変数モデルに拡張し、再利用可能なモジュールで動作のさまざまな側面をキャプチャする階層的な事前確率を学習するための定式化を検討します。このような潜在変数の定式化が、階層的強化学習(HRL)や相互情報量、好奇心に基づく目標に関する関連研究とどのように関連しているかを説明し、既存のアイデアに対する別の視点を提示します。私たちは、さまざまなシミュレートされた連続制御ドメインにフレームワークを適用することで、その有効性を実証します。そのビデオは次のURLでご覧いただけます: https://sites.google.com/view/behavior-priors。

Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks
マルチエージェントネットワークのためのロバストな分散加速確率勾配法

We study distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework which allows to choose the stepsize and the momentum parameters of these algorithms in a way to optimize performance by systematically trading off the bias, variance and dependence to network effects. When gradients do not contain noise, we also prove that D-ASG can achieve acceleration, in the sense that it requires $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ communications to converge to the same fixed point with the non-accelerated variant where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact convergence to the optimal solution. It achieves optimal and accelerated $\mathcal{O}(-k/\sqrt{\kappa})$ linear decay in the bias term as well as optimal $\mathcal{O}(\sigma^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in accelerated practical algorithms that are robust to gradient noise and that can outperform existing methods.

私たちは、非集中型の強凸確率最適化問題を解くための分散確率勾配法(D-SG)とその加速版(D-ASG)を研究します。この問題では、目的関数が複数の計算ユニットに分散されており、計算ユニットは固定された任意の連結通信グラフ上に配置され、局所的な通信制約を受け、勾配のノイズを含む推定値が利用可能です。私たちは、バイアス、分散、ネットワーク効果への依存性を体系的にトレードオフすることで性能を最適化するように、これらのアルゴリズムのステップサイズとモーメンタムパラメータを選択できるフレームワークを開発します。勾配にノイズが含まれていない場合、D-ASGが加速を達成できることも証明します。つまり、$\kappa$を条件数、$\varepsilon$を目標精度として、非加速版と同じ不動点に収束するために、$\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$回の勾配評価と$\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$回の通信を必要とします。2次関数については、バイアス項と分散項に関してタイトな、より精密な性能境界も提供します。最後に、最適解への厳密な収束を保証するために、段階ごとにパラメータを慎重に変化させるD-ASGの多段階版を検討します。これは、バイアス項で最適かつ加速された$\mathcal{O}(-k/\sqrt{\kappa})$の線形減衰と、分散項で最適な$\mathcal{O}(\sigma^2/k)$を達成します。数値実験を通じて、私たちのアプローチにより、勾配ノイズに対して頑健で、既存の方法を上回りうる、加速された実用的なアルゴリズムが得られることを示します。

Structural Agnostic Modeling: Adversarial Learning of Causal Graphs
構造にとらわれないモデリング:因果グラフの敵対的学習

A new causal discovery method, Structural Agnostic Modeling (SAM), is presented in this paper. Leveraging both conditional independencies and distributional asymmetries, SAM aims to find the underlying causal structure from observational data. The approach is based on a game between different players estimating each variable distribution conditionally to the others as a neural net, and an adversary aimed at discriminating the generated data against the original data. A learning criterion combining distribution estimation, sparsity and acyclicity constraints is used to enforce the optimization of the graph structure and parameters through stochastic gradient descent. SAM is extensively experimentally validated on synthetic and real data.

この論文では、新しい因果探索手法であるStructural Agnostic Modeling(SAM)を紹介します。SAMは、条件付き独立性と分布の非対称性の両方を活用して、観測データから根底にある因果構造を見つけることを目指します。このアプローチは、他の変数を条件とする各変数の分布をニューラルネットとして推定する複数のプレイヤーと、生成されたデータを元のデータから識別することを目的とした敵対者との間のゲームに基づいています。分布推定、スパース性、非巡回性制約を組み合わせた学習基準を使用し、確率的勾配降下法によってグラフ構造とパラメータの最適化を行います。SAMは、合成データと実データで広範に実験的検証が行われています。

Learning Green’s functions associated with time-dependent partial differential equations
時間依存偏微分方程式に関連するグリーン関数の学習

Neural operators are a popular technique in scientific machine learning to learn a mathematical model of the behavior of unknown physical systems from data. Neural operators are especially useful to learn solution operators associated with partial differential equations (PDEs) from pairs of forcing functions and solutions when numerical solvers are not available or the underlying physics is poorly understood. In this work, we attempt to provide theoretical foundations to understand the amount of training data needed to learn time-dependent PDEs. Given input-output pairs from a parabolic PDE in any spatial dimension $n\geq 1$, we derive the first theoretically rigorous scheme for learning the associated solution operator, which takes the form of a convolution with a Green’s function $G$. Until now, rigorously learning Green’s functions associated with time-dependent PDEs has been a major challenge in the field of scientific machine learning because $G$ may not be square-integrable when $n>1$, and time-dependent PDEs have transient dynamics. By combining the hierarchical low-rank structure of $G$ together with randomized numerical linear algebra, we construct an approximant to $G$ that achieves a relative error of $\smash{\mathcal{O}(\Gamma_\epsilon^{-1/2}\epsilon)}$ in the $L^1$-norm with high probability by using at most $\smash{\mathcal{O}(\epsilon^{-\frac{n+2}{2}}\log(1/\epsilon))}$ input-output training pairs, where $\Gamma_\epsilon$ is a measure of the quality of the training dataset for learning $G$, and $\epsilon>0$ is sufficiently small.

ニューラル演算子は、データから未知の物理システムの挙動の数学的モデルを学習するための科学的機械学習で一般的な手法です。ニューラル演算子は、数値ソルバーが利用できない場合や基礎となる物理が十分に理解されていない場合に、強制関数と解のペアから偏微分方程式(PDE)に関連付けられた解演算子を学習するのに特に役立ちます。この研究では、時間依存PDEを学習するために必要なトレーニングデータの量を理解するための理論的基礎を提供します。任意の空間次元$n\geq 1$の放物型PDEからの入出力ペアが与えられた場合、関連付けられた解演算子を学習するための理論的に厳密な最初のスキームを導出します。このスキームは、グリーン関数$G$との畳み込みの形をとります。これまで、時間依存PDEに関連付けられたグリーン関数を厳密に学習することは、科学的機械学習の分野で大きな課題でした。これは、$n>1$の場合に$G$が二乗積分可能でない可能性があり、時間依存PDEが過渡的なダイナミクスを持つためです。$G$の階層的低ランク構造とランダム化数値線形代数を組み合わせることで、最大で$\smash{\mathcal{O}(\epsilon^{-\frac{n+2}{2}}\log(1/\epsilon))}$個の入力-出力トレーニングペアを使用して、$L^1$ノルムで相対誤差$\smash{\mathcal{O}(\Gamma_\epsilon^{-1/2}\epsilon)}$を高い確率で達成する$G$の近似値を構築します。ここで、$\Gamma_\epsilon$は$G$を学習するためのトレーニングデータセットの品質の尺度であり、$\epsilon>0$は十分に小さい値です。

Smooth Robust Tensor Completion for Background/Foreground Separation with Missing Pixels: Novel Algorithm with Convergence Guarantee
欠落ピクセルによる背景/前景分離のための滑らかでロバストなテンソル補完:収束保証のある新しいアルゴリズム

Robust PCA (RPCA) and its tensor extension, namely, Robust Tensor PCA (RTPCA), provide an effective framework for background/foreground separation by decomposing the data into low-rank and sparse components, which contain the background and the foreground (moving objects), respectively. However, in real-world applications, the presence of missing pixels is a very common and challenging issue due to errors in the acquisition process or manufacturer defects. RPCA and RTPCA are not able to recover the background and foreground simultaneously with missing pixels. This study aims to address the problem of background/foreground separation with missing pixels by combining video recovery and background/foreground separation into a single framework. To achieve this goal, a smooth robust tensor completion (SRTC) model is proposed to recover the data and decompose it into the static background and smooth foreground, respectively. An efficient algorithm based on tensor proximal alternating minimization (tenPAM) is implemented to solve the proposed model with a global convergence guarantee under very mild conditions. Extensive experiments on actual data demonstrate that the proposed method significantly outperforms the state-of-the-art approaches for background/foreground separation with missing pixels.

ロバストPCA(RPCA)とそのテンソル拡張であるロバストテンソルPCA(RTPCA)は、データを低ランク成分とスパース成分に分解することで、背景/前景分離のための効果的なフレームワークを提供します。低ランク成分は背景を、スパース成分は前景(移動オブジェクト)をそれぞれ含みます。ただし、実際のアプリケーションでは、取得プロセスのエラーや製造上の欠陥により、欠落ピクセルの存在は非常に一般的で困難な問題です。RPCAとRTPCAは、欠落ピクセルがある場合に背景と前景を同時に復元することができません。この研究では、ビデオ復元と背景/前景分離を1つのフレームワークに組み合わせることで、欠落ピクセルがある場合の背景/前景分離の問題に対処することを目的としています。この目標を達成するために、データを復元し、静的な背景と滑らかな前景に分解する、滑らかなロバストテンソル補完(SRTC)モデルを提案します。提案モデルを解くために、テンソル近接交互最小化(tenPAM)に基づく効率的なアルゴリズムを実装し、非常に緩やかな条件下での大域的収束を保証します。実データに対する広範な実験により、提案手法が、欠落ピクセルがある場合の背景/前景分離において最先端のアプローチを大幅に上回ることが実証されています。

Kernel Partial Correlation Coefficient — a Measure of Conditional Dependence
カーネル偏相関係数 — 条件付き依存性の尺度

We propose and study a class of simple, nonparametric, yet interpretable measures of conditional dependence, which we call kernel partial correlation (KPC) coefficient, between two random variables $Y$ and $Z$ given a third variable $X$, all taking values in general topological spaces. The population KPC captures the strength of conditional dependence and it is 0 if and only if $Y$ is conditionally independent of $Z$ given $X$, and 1 if and only if $Y$ is a measurable function of $Z$ and $X$. We describe two consistent methods of estimating KPC. Our first method is based on the general framework of geometric graphs, including $K$-nearest neighbor graphs and minimum spanning trees. A sub-class of these estimators can be computed in near linear time and converges at a rate that adapts automatically to the intrinsic dimensionality of the underlying distributions. The second strategy involves direct estimation of conditional mean embeddings in the RKHS framework. Using these empirical measures we develop a fully model-free variable selection algorithm, and formally prove the consistency of the procedure under suitable sparsity assumptions. Extensive simulation and real-data examples illustrate the superior performance of our methods compared to existing procedures.

私たちは、第3の変数$X$が与えられたときの2つの確率変数$Y$と$Z$の間の条件付き依存性について、単純でノンパラメトリックでありながら解釈可能な尺度のクラスを提案し、研究します。これをカーネル偏相関(KPC)係数と呼びます。変数はすべて一般的な位相空間に値を取ります。母集団KPCは条件付き依存性の強さを捉え、$X$が与えられたときに$Y$が$Z$と条件付き独立である場合に限り0になり、$Y$が$Z$と$X$の可測関数である場合に限り1になります。私たちは、KPCを推定する2つの一致性を持つ方法を説明します。最初の方法は、$K$近傍グラフや最小全域木などの幾何学的グラフの一般的なフレームワークに基づいています。これらの推定量のサブクラスは、ほぼ線形時間で計算でき、基礎となる分布の固有次元に自動的に適応する速度で収束します。2番目の戦略は、RKHSフレームワークでの条件付き平均埋め込みの直接推定を伴います。これらの経験的尺度を使用して、完全にモデルフリーの変数選択アルゴリズムを開発し、適切なスパース性の仮定の下で手順の一致性を形式的に証明します。広範なシミュレーションと実データの例により、既存の手順と比較して、私たちの手法が優れた性能を発揮することが示されています。

Learning Operators with Coupled Attention
結合注意による演算子の学習

Supervised operator learning is an emerging machine learning paradigm with applications to modeling the evolution of spatio-temporal dynamical systems and approximating general black-box relationships between functional data. We propose a novel operator learning method, LOCA (Learning Operators with Coupled Attention), motivated from the recent success of the attention mechanism. In our architecture, the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations. By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions, enabling us to approximate nonlinear operators even when the number of output function measurements in the training set is very small. Our formulation is accompanied by rigorous approximation theoretic guarantees on the universal expressiveness of the proposed model. Empirically, we evaluate the performance of LOCA on several operator learning scenarios involving systems governed by ordinary and partial differential equations, as well as a black-box climate prediction problem. Through these scenarios we demonstrate state of the art accuracy, robustness with respect to noisy input data, and a consistently small spread of errors over testing data sets, even for out-of-distribution prediction tasks.

教師あり演算子学習は、時空間動的システムの時間発展のモデル化や、関数データ間の一般的なブラックボックス関係の近似に応用される、新興の機械学習パラダイムです。私たちは、注意機構の最近の成功に触発されて、新しい演算子学習法LOCA (Learning Operators with Coupled Attention)を提案します。私たちのアーキテクチャでは、入力関数は有限個の特徴の集合にマッピングされ、それらは出力クエリ位置に依存する注意重みで平均化されます。これらの注意重みを積分変換によって結合することで、LOCAはターゲット出力関数における相関を明示的に学習でき、トレーニングセット内の出力関数の測定数が非常に少ない場合でも、非線形演算子を近似できます。私たちの定式化には、提案モデルの普遍的な表現力に関する厳密な近似理論的保証が伴います。実験的には、常微分方程式と偏微分方程式に支配されるシステムや、ブラックボックスの気候予測問題を含む、いくつかの演算子学習シナリオでLOCAの性能を評価します。これらのシナリオを通じて、分布外予測タスクの場合でも、最先端の精度、ノイズの多い入力データに対する頑健性、およびテストデータセット全体にわたる一貫して小さい誤差のばらつきを実証します。
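The averaging step mentioned in the abstract above ("features which are then averaged with attention weights that depend on the output query locations") can be sketched as a softmax-weighted mean. This is only a toy illustration: the scores are passed in directly rather than computed from a query location, the coupling through an integral transform is not modeled, and the names `softmax` and `attention_average` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_average(features, scores):
    """Average feature vectors using softmax(scores) as the weights."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # In a full model the scores would be a learned function of the query location.
    print(attention_average(features, [0.1, 2.0, -1.0]))
```

With equal scores the result is the plain mean of the feature vectors; a single dominant score makes the output approach the corresponding feature vector.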

When is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint
ランジュバンアルゴリズムの収束時間が次元に依存しないのはいつか？複合最適化の観点

There has been a surge of works bridging MCMC sampling and optimization, with a specific focus on translating non-asymptotic convergence guarantees for optimization problems into the analysis of Langevin algorithms in MCMC sampling. A conspicuous distinction between the convergence analysis of Langevin sampling and that of optimization is that all known convergence rates for Langevin algorithms depend on the dimensionality of the problem, whereas the convergence rates for optimization are dimension-free for convex problems. Whether a dimension independent convergence rate can be achieved by the Langevin algorithm is thus a long-standing open problem. This paper provides an affirmative answer to this problem for the case of either Lipschitz or smooth convex functions with normal priors. By viewing Langevin algorithm as composite optimization, we develop a new analysis technique that leads to dimension independent convergence rates for such problems.

MCMCサンプリングと最適化を橋渡しする研究が急増しており、特に、最適化問題に対する非漸近的な収束保証を、MCMCサンプリングにおけるランジュバンアルゴリズムの解析に変換することに重点が置かれています。ランジュバンサンプリングの収束解析と最適化の収束解析の顕著な違いは、ランジュバンアルゴリズムの既知の収束率がすべて問題の次元に依存するのに対し、最適化の収束率は凸問題では次元に依存しないことです。したがって、次元に依存しない収束率をランジュバンアルゴリズムで達成できるかどうかは、長年の未解決問題です。この論文では、正規事前分布を伴う、リプシッツ連続または滑らかな凸関数の場合について、この問題に肯定的な答えを与えます。ランジュバンアルゴリズムを複合最適化と見なすことで、このような問題に対して次元に依存しない収束率をもたらす新しい解析手法を開発します。
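For readers unfamiliar with Langevin algorithms, here is a minimal one-dimensional sketch of the standard unadjusted Langevin update, `x <- x - step * grad_f(x) + sqrt(2 * step) * noise`, which targets a density proportional to `exp(-f)`. The abstract does not state which variant the paper analyzes, so the update form, the step size, and the toy target `f(x) = x^2 / 2` (a standard normal) are assumptions for illustration.

```python
import math
import random


def ula_samples(grad_f, x0, step, n_steps, seed=0):
    """Unadjusted Langevin: x <- x - step * grad_f(x) + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    x, out = x0, []
    for _ in range(n_steps):
        x = x - step * grad_f(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


if __name__ == "__main__":
    # Toy target: f(x) = x^2 / 2, so grad_f(x) = x and exp(-f) is a standard normal.
    xs = ula_samples(lambda x: x, x0=0.0, step=0.1, n_steps=20000)[1000:]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    print(mean, var)
```

Each iteration is a gradient step plus Gaussian noise, which is why the convergence analysis of such samplers can borrow tools from optimization. Because no accept/reject correction is applied, the sample variance is slightly biased for a fixed step size.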

Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features
シャープレイ値と変分オートエンコーダを使用した、従属混合特徴を持つ予測モデルの説明

Shapley values are today extensively used as a model-agnostic explanation framework to explain complex predictive machine learning models. Shapley values have desirable theoretical properties and a sound mathematical foundation in the field of cooperative game theory. Precise Shapley value estimates for dependent data rely on accurate modeling of the dependencies between all feature combinations. In this paper, we use a variational autoencoder with arbitrary conditioning (VAEAC) to model all feature dependencies simultaneously. We demonstrate through comprehensive simulation studies that our VAEAC approach to Shapley value estimation outperforms the state-of-the-art methods for a wide range of settings for both continuous and mixed dependent features. For high-dimensional settings, our VAEAC approach with a non-uniform masking scheme significantly outperforms competing methods. Finally, we apply our VAEAC approach to estimate Shapley value explanations for the Abalone data set from the UCI Machine Learning Repository.

Shapley値は現在、複雑な予測機械学習モデルを説明するためのモデルに依存しない説明フレームワークとして広く使用されています。Shapley値は、協力ゲーム理論の分野で望ましい理論的特性と健全な数学的基礎を備えています。従属データの正確なShapley値の推定は、すべての特徴の組み合わせ間の依存関係の正確なモデル化に依存します。この論文では、任意の条件付けによる変分オートエンコーダ(VAEAC)を使用して、すべての特徴の依存関係を同時にモデル化します。包括的なシミュレーション研究を通じて、Shapley値推定に対するVAEACアプローチが、連続および混合従属特徴の両方の幅広い設定で最先端の方法よりも優れていることを実証します。高次元設定では、非均一マスキング方式を使用したVAEACアプローチが、競合方法よりも大幅に優れています。最後に、VAEACアプローチを適用して、UCI機械学習リポジトリのAbaloneデータセットのShapley値の説明を推定します。
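
To make the Shapley value definition concrete, here is a minimal sketch that computes exact Shapley values for a three-feature toy model by enumerating all coalitions. The model, the input, and the baseline are hypothetical. Note that absent features are simply replaced by a fixed baseline, which ignores feature dependence; the paper instead models the conditional distributions with VAEAC, which this sketch does not attempt.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    phi = [0.0] * n_features
    for j in range(n_features):
        others = [i for i in range(n_features) if i != j]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (M - |S| - 1)! / M! for coalitions of this size.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def model(x):
    """Hypothetical predictive model with one interaction term."""
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def value(subset):
    """Model output when features outside `subset` are set to the baseline."""
    z = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(z)


phi = shapley_values(value, len(x))
```

The values sum to the difference between the prediction at `x` and at the baseline (the efficiency property), and the interaction term is split equally between the two features involved.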

Multi-Agent Multi-Armed Bandits with Limited Communication
通信が制限されたマルチエージェント、マルチアームドバンディット

We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$ arm bandit problem for $K \gg N$. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round. We present Limited Communication Collaboration – Upper Confidence Bound (LCC-UCB), a doubling-epoch based algorithm where each agent communicates only after the end of the epoch and shares the index of the best arm it knows. With our algorithm, LCC-UCB, each agent enjoys a regret of $\tilde{O}\left(\sqrt{({K/N}+ N)T}\right)$, communicates for $O(\log T)$ steps and broadcasts $O(\log K)$ bits in each communication step. We extend the work to sparse graphs with maximum degree $K_G$ and diameter $D$ to propose LCC-UCB-GRAPH which enjoys a regret bound of $\tilde{O}\left(D\sqrt{(K/N+ K_G)DT}\right)$. Finally, we empirically show that the LCC-UCB and the LCC-UCB-GRAPH algorithms perform well and outperform strategies that communicate through a central node.

私たちは、$N$エージェントが、$K \gg N$の確率的$K$アームバンディット問題のインスタンスと協調して対話する問題を考察します。エージェントは、合計$T$タイムステップ、通信ラウンドの数、および各通信ラウンドのビット数について、すべてのエージェントの累積後悔を同時に最小化することを目指します。ここでは、Limited Communication Collaboration – Upper Confidence Bound (LCC-UCB)を紹介します。これは、各エージェントがエポックの終了後にのみ通信し、自分が知っている最良のアームのインデックスを共有する、倍増エポックベースのアルゴリズムです。私たちのアルゴリズムLCC-UCBでは、各エージェントは$\tilde{O}\left(\sqrt{({K/N}+ N)T}\right)$の後悔を享受し、$O(\log T)$ステップ通信し、各通信ステップで$O(\log K)$ビットをブロードキャストします。私たちは、最大次数$K_G$と直径$D$を持つスパースグラフに研究を拡張し、$\tilde{O}\left(D\sqrt{(K/N+ K_G)DT}\right)$の後悔境界を持つLCC-UCB-GRAPHを提案します。最後に、LCC-UCBアルゴリズムとLCC-UCB-GRAPHアルゴリズムが適切に機能し、中央ノードを介して通信する戦略よりも優れていることを経験的に示します。
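
The communication costs quoted in the abstract can be sanity-checked with a small sketch. It assumes epoch lengths that double (1, 2, 4, ...) and one message per epoch carrying a single arm index; the function names are hypothetical and this is not the LCC-UCB algorithm itself, only the bookkeeping behind the $O(\log T)$ rounds and $O(\log K)$ bits, plus the standard UCB1 index for reference.

```python
import math


def doubling_epoch_ends(horizon):
    """Return the last time step of each epoch when epoch lengths are 1, 2, 4, ..."""
    ends = []
    t, length = 0, 1
    while t < horizon:
        t = min(t + length, horizon)
        ends.append(t)
        length *= 2
    return ends


def bits_for_arm_index(n_arms):
    """Bits needed to encode one arm index in {0, ..., n_arms - 1}."""
    return max(1, (n_arms - 1).bit_length())


def ucb_index(mean_reward, pulls, t):
    """Standard UCB1 index: empirical mean plus an exploration bonus."""
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)


# With T = 1000 steps, communicating once per epoch gives about log2(T) rounds.
ends = doubling_epoch_ends(1000)
rounds = len(ends)
bits = bits_for_arm_index(100)
```

For a horizon of 1000 steps this gives 10 communication rounds, and an index among 100 arms fits in 7 bits.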

Efficient Inference for Dynamic Flexible Interactions of Neural Populations
神経集団の動的で柔軟な相互作用のための効率的な推論

Hawkes process provides an effective statistical framework for analyzing the interactions of neural spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modeling inhibitory interactions among neural population. Instead, the nonlinear Hawkes process allows for modeling a more flexible influence pattern with excitatory or inhibitory interactions. This work proposes a flexible nonlinear Hawkes process variant based on sigmoid nonlinearity. To ease inference, three sets of auxiliary latent variables (Polya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make functional connection weights appear in a Gaussian form, which enables simple iterative algorithms with analytical updates. As a result, the efficient Gibbs sampler, expectation-maximization algorithm and mean-field approximation are derived to estimate the interactions among neural populations. Furthermore, to reconcile with time-varying neural systems, the proposed time-invariant model is extended to a dynamic version by introducing a Markov state process. Similarly, three analytical iterative inference algorithms: Gibbs sampler, EM algorithm and mean-field approximation are derived. We compare the accuracy and efficiency of these inference algorithms on synthetic data, and further experiment on real neural recordings to demonstrate that the developed models achieve superior performance over the state-of-the-art competitors.

ホークス過程は、神経スパイク活動の相互作用を分析するための効果的な統計フレームワークを提供します。多くの実際のアプリケーションで利用されていますが、古典的なホークス過程は、神経集団間の抑制性相互作用をモデル化することができません。代わりに、非線形ホークス過程は、興奮性または抑制性相互作用によるより柔軟な影響パターンのモデル化を可能にします。この研究では、シグモイド非線形性に基づく柔軟な非線形ホークス過程のバリアントを提案します。推論を容易にするために、3セットの補助潜在変数（ポリアガンマ変数、潜在マーク付きポアソン過程、スパース変数）が拡張され、機能接続重みがガウス形式で表示されます。これにより、分析更新による単純な反復アルゴリズムが可能になります。その結果、効率的なギブスサンプラー、期待値最大化アルゴリズム、平均場近似が導出され、神経集団間の相互作用を推定します。さらに、時間変動ニューラルシステムと調和させるために、提案された時間不変モデルは、マルコフ状態プロセスを導入することで動的バージョンに拡張されます。同様に、ギブスサンプラー、EMアルゴリズム、平均場近似の3つの分析反復推論アルゴリズムが導出されます。合成データでこれらの推論アルゴリズムの精度と効率を比較し、実際のニューラル記録でさらに実験して、開発されたモデルが最先端の競合モデルよりも優れたパフォーマンスを実現することを実証します。
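
The role of the sigmoid nonlinearity can be sketched as follows: passing the summed influence through a sigmoid keeps the intensity positive and bounded even when some interaction weights are negative (inhibitory). The exponential kernel, the parameter names, and the numbers below are assumptions for illustration; the paper's exact parameterization and its augmented inference schemes are not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def intensity(t, history, base_activation, weights, decay, max_rate):
    """Sigmoid-link intensity: max_rate * sigmoid(base + weighted exponential kernels).

    `history` is a list of (source_neuron, spike_time) pairs; only spikes
    strictly before time t contribute.
    """
    activation = base_activation
    for source, spike_time in history:
        if spike_time < t:
            activation += weights[source] * math.exp(-decay * (t - spike_time))
    return max_rate * sigmoid(activation)


# Hypothetical setup: neuron 0 excites the target, neuron 1 inhibits it.
weights = {0: 1.5, 1: -2.0}
history = [(0, 0.5), (1, 0.8)]

rate_now = intensity(1.0, history, 0.0, weights, decay=1.0, max_rate=10.0)
rate_without_inhibition = intensity(1.0, [(0, 0.5)], 0.0, weights, decay=1.0, max_rate=10.0)
rate_baseline = intensity(1.0, [], 0.0, weights, decay=1.0, max_rate=10.0)
```

A classic linear Hawkes intensity would risk becoming negative with the inhibitory weight; here the inhibitory spike only lowers the rate within the interval from 0 to `max_rate`.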

A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review
グラントパネルレビューへの適用によるランキングとスコアの統一統計的学習モデル

Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus among judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.

ランキングとスコアは、審査員がオブジェクトのコレクションにおける好みや品質の認識を表現するために使用する2つの一般的なデータタイプです。各タイプのデータを個別に調査するモデルは多数存在しますが、最初にデータ変換を実行せずに両方のデータタイプを同時に取得する統合統計モデルはありません。このギャップを埋めるために、Mallows-Binomialモデルを提案します。このモデルは、オブジェクトの品質、コンセンサスランキング、審査員間のコンセンサスのレベルを定量化する共有パラメーターを介して、Mallows $\phi$ランキングモデルとBinomialスコアモデルを組み合わせます。モデルパラメーターの正確なMLEを計算する効率的なツリー検索アルゴリズムを提案し、モデルの統計的特性を分析的およびシミュレーションの両方で調査し、スコアと部分的なランキングの両方を収集した助成金パネルレビューのインスタンスからの実際のデータにモデルを適用します。さらに、モデルの出力を使用してオブジェクトを自信を持ってランク付けする方法を示します。提案されたモデルは、スコアとランキングの両方からの情報を賢明に組み合わせて、オブジェクトの品質を定量化し、適切なレベルの統計的不確実性でコンセンサスを測定することが示されています。
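
For readers unfamiliar with the ranking half of the model, the sketch below evaluates a Mallows-type distribution over rankings of three objects: each ranking gets probability proportional to exp(-theta × Kendall tau distance to the consensus). The parameter name `theta` and the brute-force normalization are illustrative assumptions; the Binomial score component and the paper's tree-search MLE are not included.

```python
from itertools import combinations, permutations
import math


def kendall_tau_distance(ranking, consensus):
    """Number of item pairs ordered differently in the two rankings."""
    position = {item: i for i, item in enumerate(ranking)}
    return sum(
        1
        for a, b in combinations(consensus, 2)  # a precedes b in the consensus
        if position[a] > position[b]
    )


def mallows_probabilities(consensus, theta):
    """Probability of every ranking under a Mallows model (by enumeration)."""
    rankings = list(permutations(consensus))
    weights = [
        math.exp(-theta * kendall_tau_distance(r, consensus)) for r in rankings
    ]
    total = sum(weights)
    return {r: w / total for r, w in zip(rankings, weights)}


probs = mallows_probabilities(("A", "B", "C"), theta=1.0)
most_likely = max(probs, key=probs.get)
```

Larger `theta` concentrates mass on the consensus ranking, which is how a single parameter can express the level of agreement among judges. Enumeration is only feasible for a handful of objects.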

Ranking and Tuning Pre-trained Models: A New Paradigm for Exploiting Model Hubs
事前学習済みモデルのランク付けとチューニング: モデルハブを活用するための新しいパラダイム

Model hubs with many pre-trained models (PTMs) have become a cornerstone of deep learning. Although built at a high cost, they remain under-exploited: practitioners usually pick one PTM from the provided model hub by popularity and then fine-tune the PTM to solve the target task. This naïve but common practice poses two obstacles to full exploitation of pre-trained model hubs: first, the PTM selection by popularity has no optimality guarantee, and second, only one PTM is used while the remaining PTMs are ignored. An alternative might be to consider all possible combinations of PTMs and extensively fine-tune each combination, but this would not only be prohibitive computationally but may also lead to statistical over-fitting. In this paper, we propose a new paradigm for exploiting model hubs that is intermediate between these extremes. The paradigm is characterized by two aspects: (1) We use an evidence maximization procedure to estimate the maximum value of label evidence given features extracted by pre-trained models. This procedure can rank all the PTMs in a model hub for various types of PTMs and tasks before fine-tuning. (2) The best ranked PTM can either be fine-tuned and deployed if we have no preference for the model’s architecture or the target PTM can be tuned by the top $K$ ranked PTMs via a Bayesian procedure that we propose. This procedure, which we refer to as B-Tuning, not only improves upon specialized methods designed for tuning homogeneous PTMs, but also applies to the challenging problem of tuning heterogeneous PTMs where it yields a new level of benchmark performance.

多数の事前学習済みモデル(PTM)を備えたモデルハブは、ディープラーニングの基礎となっています。高コストで構築されているにもかかわらず、十分に活用されていません。実践者は通常、提供されているモデルハブから人気度に応じて1つのPTMを選択し、そのPTMを微調整してターゲットタスクを解決します。この単純だが一般的な方法は、事前学習済みモデルハブの完全な活用に2つの障害をもたらします。1つは、人気度によるPTMの選択には最適性の保証がなく、2つ目は、1つのPTMのみが使用され、残りのPTMは無視されることです。代替案としては、PTMのすべての可能な組み合わせを検討し、各組み合わせを徹底的に微調整することが考えられますが、これは計算上法外なだけでなく、統計的過剰適合につながる可能性もあります。この論文では、これらの両極端の中間に位置する、モデルハブを活用するための新しいパラダイムを提案します。このパラダイムは、2つの側面で特徴付けられます。(1)事前学習済みモデルによって抽出された特徴に基づいて、ラベル証拠の最大値を推定するために証拠最大化手順を使用します。この手順では、微調整の前に、モデルハブ内のすべてのPTMをさまざまな種類のPTMとタスクについてランク付けできます。(2)モデルのアーキテクチャに好みがない場合は、最上位にランク付けされたPTMを微調整して展開できます。あるいは、私たちが提案するベイジアン手順により、上位$K$位にランク付けされたPTMを用いてターゲットPTMをチューニングすることもできます。B-Tuningと呼ばれるこの手順は、同種のPTMをチューニングするために設計された特殊な方法を改善するだけでなく、異種のPTMをチューニングするという困難な問題にも適用され、新しいレベルのベンチマークパフォーマンスをもたらします。

tntorch: Tensor Network Learning with PyTorch
tntorch: PyTorch によるテンソルネットワーク学習

We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch’s API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, batch processing, comprehensive tensor arithmetics, and more.

私たちは、統一されたインターフェースの下で複数の分解(Candecomp/Parafac、Tucker、Tensor Trainなど)をサポートするテンソル学習フレームワークであるtntorchを紹介します。このライブラリを使用すると、ユーザーは自動微分、シームレスなGPUサポート、およびPyTorchのAPIの利便性により、低ランクのテンソルを学習して処理できます。分解アルゴリズムの他に、tntorchは微分可能テンソル代数、ランク切り捨て、交差近似、バッチ処理、包括的なテンソル演算などを実装しています。

Nonconvex Matrix Completion with Linearly Parameterized Factors
線形パラメーター化因子による非凸行列補完

Techniques of matrix completion aim to impute a large portion of missing entries in a data matrix through a small portion of observed ones. In practice, prior information and special structures are usually employed in order to improve the accuracy of matrix completion. In this paper, we propose a unified nonconvex optimization framework for matrix completion with linearly parameterized factors. In particular, by introducing a condition referred to as Correlated Parametric Factorization, we conduct a unified geometric analysis for the nonconvex objective by establishing uniform upper bounds for low-rank estimation resulting from any local minimizer. Perhaps surprisingly, the condition of Correlated Parametric Factorization holds for important examples including subspace-constrained matrix completion and skew-symmetric matrix completion. The effectiveness of our unified nonconvex optimization method is also empirically illustrated by extensive numerical simulations.

行列補完の手法は、データ行列の欠落しているエントリの大部分を、観測されたエントリのごく一部を通じて補完することを目的としています。実際には、行列補完の精度を向上させるために、通常、事前情報と特別な構造が使用されます。この論文では、線形パラメータ化された因子を使用した行列補完のための統一された非凸最適化フレームワークを提案します。特に、相関パラメトリック因数分解と呼ばれる条件を導入することにより、任意の局所最小化器から得られる低ランク推定の均一な上限を確立することにより、非凸目的に対して統一的な幾何学的解析を行います。意外かもしれませんが、相関パラメトリック因数分解の条件は、部分空間制約行列の完成やスキュー対称の行列の完成などの重要な例に当てはまります。私たちの統一された非凸最適化法の有効性は、広範な数値シミュレーションによっても経験的に示されています。

Stochastic DCA with Variance Reduction and Applications in Machine Learning
分散削減による確率的DCAと機械学習への応用

We design stochastic Difference-of-Convex-functions Algorithms (DCA) for solving a class of structured Difference-of-Convex-functions (DC) problems. As the standard DCA requires the full information of (sub)gradients which could be expensive in large-scale settings, stochastic approaches rely upon stochastic information instead. However, stochastic estimations generate additional variance terms making stochastic algorithms unstable. Therefore, we integrate some novel variance reduction techniques including SVRG and SAGA into our design. The almost sure convergence to critical points of the proposed algorithms is established and the algorithms’ complexities are analyzed. To study the efficiency of our algorithms, we apply them to three important problems in machine learning: nonnegative principal component analysis, group variable selection in multiclass logistic regression, and sparse linear regression. Numerical experiments have shown the merits of our proposed algorithms in comparison with other state-of-the-art stochastic methods for solving nonconvex large-sum problems.

私たちは、構造化された凸関数の差(DC)問題のクラスを解決するための確率的凸関数の差アルゴリズム(DCA)を設計します。標準的なDCAは大規模な設定ではコストがかかる可能性がある(サブ)勾配の完全な情報を必要とするため、確率的アプローチは代わりに確率的情報に依存します。しかし、確率的推定は追加の分散項を生成するため、確率的アルゴリズムは不安定になります。そのため、我々はSVRGやSAGAなどのいくつかの新しい分散削減手法を設計に統合します。提案されたアルゴリズムの臨界点へのほぼ確実な収束が確立され、アルゴリズムの複雑さが分析されます。アルゴリズムの効率を調査するために、機械学習における3つの重要な問題、つまり非負主成分分析、多クラスロジスティック回帰におけるグループ変数選択、およびスパース線形回帰にアルゴリズムを適用します。数値実験により、非凸大和問題を解くための他の最先端の確率的方法と比較して、私たちが提案したアルゴリズムのメリットが示されました。
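
The variance-reduction idea mentioned in the abstract can be illustrated with the SVRG gradient estimator on a tiny least-squares problem. The data and the scalar parameter are made up, and the sketch shows only the estimator, not the stochastic DCA scheme built around it in the paper.

```python
def grad_i(w, x_i, y_i):
    """Gradient of the i-th term 0.5 * (w * x_i - y_i)^2 with respect to w."""
    return (w * x_i - y_i) * x_i


def full_grad(w, xs, ys):
    """Average gradient over all terms."""
    return sum(grad_i(w, x, y) for x, y in zip(xs, ys)) / len(xs)


def svrg_estimate(w, w_snapshot, snapshot_grad, i, xs, ys):
    """SVRG estimator: stochastic gradient corrected by a snapshot control variate."""
    return grad_i(w, xs[i], ys[i]) - grad_i(w_snapshot, xs[i], ys[i]) + snapshot_grad


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical data and iterates.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w, w_snapshot = 1.1, 1.0
snapshot_grad = full_grad(w_snapshot, xs, ys)

# Enumerate every possible sampled index to compare the two estimators.
svrg = [svrg_estimate(w, w_snapshot, snapshot_grad, i, xs, ys) for i in range(len(xs))]
plain = [grad_i(w, x, y) for x, y in zip(xs, ys)]
```

Averaged over a uniformly sampled index, both estimators equal the full gradient, but the SVRG version has much smaller variance when the current iterate is close to the snapshot.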

Contraction rates for sparse variational approximations in Gaussian process regression
ガウス過程回帰におけるスパース変分近似の収縮率

We study the theoretical properties of a variational Bayes method in the Gaussian Process regression model. We consider the inducing variables method and derive sufficient conditions for obtaining contraction rates for the corresponding variational Bayes (VB) posterior. As examples we show that for three particular covariance kernels (Matérn, squared exponential, random series prior) the VB approach can achieve optimal, minimax contraction rates for a sufficiently large number of appropriately chosen inducing variables. The theoretical findings are demonstrated by numerical experiments.

私たちは、ガウス過程回帰モデルにおける変分ベイズ法の理論的特性を研究します。誘導変数法を検討し、対応する変分ベイズ(VB)事後分布の収縮率を得るための十分条件を導出します。例として、3つの特定の共分散カーネル(Matérn、二乗指数、ランダム級数事前分布)について、VBアプローチが、適切に選択された十分に多数の誘導変数に対して最適なミニマックス収縮率を達成できることを示します。理論的な発見は、数値実験によって実証されます。

Selective Machine Learning of the Average Treatment Effect with an Invalid Instrumental Variable
無効な操作変数による平均治療効果の選択的機械学習

Instrumental variable methods have been widely used to identify causal effects in the presence of unmeasured confounding. A key identification condition known as the exclusion restriction states that the instrument cannot have a direct effect on the outcome which is not mediated by the exposure in view. In the health and social sciences, such an assumption is often not credible. To address this concern, we consider identification conditions of the population average treatment effect with an invalid instrumental variable which does not satisfy the exclusion restriction, and derive the efficient influence function targeting the identifying functional under a nonparametric observed data model. We propose a novel multiply robust locally efficient estimator of the average treatment effect that is consistent in the union of multiple parametric nuisance models, as well as a multiply debiased machine learning estimator for which the nuisance parameters are estimated using generic machine learning methods, that effectively exploit various forms of linear or nonlinear structured sparsity in the nuisance parameter space. When one cannot be confident that any of these machine learners is consistent at sufficiently fast rates to ensure $\surd{n}$-consistency for the average treatment effect, we introduce new criteria for selective machine learning which leverage the multiple robustness property in order to ensure small bias. The proposed methods are illustrated through extensive simulations and a data analysis evaluating the causal effect of 401(k) participation on savings.

測定されていない交絡がある場合に因果効果を識別するために、操作変数法が広く使用されています。排除制約として知られる重要な識別条件は、対象とする曝露によって媒介されない結果に対して、操作変数が直接影響を与えることはできないというものです。健康科学や社会科学では、このような仮定はしばしば信用できません。この懸念に対処するために、私たちは、排除制約を満たさない無効な操作変数による母集団平均治療効果の識別条件を考慮し、ノンパラメトリック観測データモデルの下で識別関数をターゲットとする効率的な影響関数を導出します。私たちは、複数のパラメトリックなニューサンスモデルの結合と整合する、平均治療効果の新しい多重ロバストな局所的に効率的な推定量、およびニューサンスパラメータが汎用的な機械学習方法を使用して推定され、ニューサンスパラメータ空間におけるさまざまな形式の線形または非線形構造化スパース性を効果的に活用する多重にバイアスを取り除いた機械学習推定量を提案します。これらの機械学習のいずれかが平均治療効果の$\surd{n}$一貫性を保証するのに十分な速さで一貫性があるかどうか確信が持てない場合、小さなバイアスを保証するために多重堅牢性プロパティを活用する選択的機械学習の新しい基準を導入します。提案された方法は、広範なシミュレーションと、401(k)参加による貯蓄への因果効果を評価するデータ分析を通じて説明されます。

Testing Whether a Learning Procedure is Calibrated
学習手順がキャリブレーションされているかどうかのテスト

A learning procedure takes as input a dataset and performs inference for the parameters $\theta$ of a model that is assumed to have given rise to the dataset. Here we consider learning procedures whose output is a probability distribution, representing uncertainty about $\theta$ after seeing the dataset. Bayesian inference is a prime example of such a procedure, but one can also construct other learning procedures that return distributional output. This paper studies conditions for a learning procedure to be considered calibrated, in the sense that the true data-generating parameters are plausible as samples from its distributional output. A learning procedure whose inferences and predictions are systematically over- or under-confident will fail to be calibrated. On the other hand, a learning procedure that is calibrated need not be statistically efficient. A hypothesis-testing framework is developed in order to assess, using simulation, whether a learning procedure is calibrated. Several vignettes are presented to illustrate different aspects of the framework.

学習手順は、データセットを入力として受け取り、データセットの生成元と想定されるモデルのパラメータ$\theta$の推論を実行します。ここでは、データセットを確認した後の$\theta$に関する不確実性を表す確率分布を出力する学習手順を検討します。ベイズ推論はこのような手順の代表的な例ですが、分布出力を返す他の学習手順を構築することもできます。この論文では、真のデータ生成パラメータがその分布出力からのサンプルとして妥当であるという意味で、学習手順が較正されていると見なされるための条件を検討します。推論と予測が体系的に過信または過小評価されている学習手順は、較正されません。一方、較正された学習手順は、統計的に効率的である必要はありません。シミュレーションを使用して学習手順が較正されているかどうかを評価するために、仮説検定フレームワークが開発されています。フレームワークのさまざまな側面を説明するために、いくつかのビネットが提示されています。
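
The notion of calibration described here can be probed by simulation, in the spirit of the abstract though not with the paper's actual test. The sketch below assumes a conjugate toy model (theta ~ N(0, 1), y | theta ~ N(theta, 1)), draws parameters and data repeatedly, and records how often the true parameter falls in the central 50% interval of the returned distribution. All names and the deliberately overconfident variant are hypothetical.

```python
import math
import random


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def posterior(y):
    """Exact posterior for theta ~ N(0, 1), y | theta ~ N(theta, 1): N(y / 2, 1 / 2)."""
    return y / 2.0, math.sqrt(0.5)


def overconfident(y):
    """A miscalibrated procedure: same mean, standard deviation shrunk threefold."""
    mean, sd = posterior(y)
    return mean, sd / 3.0


def coverage(procedure, n_sim, rng):
    """Fraction of simulations where the true theta lies in the central 50% interval."""
    hits = 0
    for _ in range(n_sim):
        theta = rng.gauss(0.0, 1.0)      # true parameter drawn from the prior
        y = rng.gauss(theta, 1.0)        # one observation from the model
        mean, sd = procedure(y)
        u = normal_cdf(theta, mean, sd)  # quantile of theta in the output
        hits += 0.25 <= u <= 0.75
    return hits / n_sim


rng = random.Random(1)
calibrated_cov = coverage(posterior, 4000, rng)
overconfident_cov = coverage(overconfident, 4000, rng)
```

For a calibrated procedure the coverage should be close to 0.5, while the overconfident variant covers the truth far less often; a formal test would quantify whether such a deviation is statistically significant.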

abess: A Fast Best-Subset Selection Library in Python and R
abess: Python と R の高速ベストサブセット選択ライブラリ

We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. Particularly, abess certifiably gets the optimal solution within polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best subset of groups selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for convenient integration with scikit-learn, and it can be installed from the Python Package Index (PyPI). In addition, a user-friendly R library is available at the Comprehensive R Archive Network (CRAN). The source code is available at: https://github.com/abess-team/abess.

私たちは、線形回帰、分類、主成分分析などの多様な機械学習問題を解決するためのベストサブセット選択の統一フレームワークを実装するabessという新しいライブラリを紹介します。特に、abessは線形モデルの下で多項式時間内に高い確率で最適解を得ることが保証されています。我々の効率的な実装により、abessは既存の競合する変数(モデル)選択ツールボックスと同等か、あるいは最大20倍速くベストサブセット選択問題の解を得ることができます。さらに、グループのベストサブセット選択や$\ell_2$正則化ベストサブセット選択などの一般的なバリアントをサポートしています。ライブラリのコアはC++でプログラムされています。使いやすさを考慮して、Pythonライブラリはscikit-learnとの便利な統合のために設計されており、Pythonパッケージインデックス(PyPI)からインストールできます。さらに、ユーザーフレンドリーなRライブラリは、Comprehensive R Archive Network (CRAN)で入手できます。ソースコードはhttps://github.com/abess-team/abessから入手できます。

Three rates of convergence or separation via U-statistics in a dependent framework
従属フレームワークにおけるU統計による収束または分離の3つの速度

Despite the ubiquity of U-statistics in modern Probability and Statistics, their non-asymptotic analysis in a dependent framework may have been overlooked. In a recent work, a new concentration inequality for U-statistics of order two for uniformly ergodic discrete time Markov chains has been proved. In this paper, we put this theoretical breakthrough into action by pushing further the current state of knowledge in three different active fields of research. First, we establish a new exponential inequality for the estimation of spectra of integral operators with MCMC methods. The novelty is that this result holds for kernels with positive and negative eigenvalues, which is new as far as we know. In addition, we investigate generalization performance of online algorithms working with pairwise loss functions and Markov chain samples. We provide an online-to-batch conversion result by showing how we can extract a low risk hypothesis from the sequence of hypotheses generated by any online learner. We finally give a non-asymptotic analysis of a goodness-of-fit test on the density of the stationary measure of a Markov chain. We identify some classes of alternatives over which our test based on the $L^2$ distance has a prescribed power.

現代の確率と統計ではU統計が広く使われているにもかかわらず、従属フレームワークにおけるそれらの非漸近的解析は見過ごされてきた可能性があります。最近の研究では、一様エルゴード離散時間マルコフ連鎖の2次U統計の新しい集中不等式が証明されました。この論文では、3つの異なる活発な研究分野の現在の知識をさらに推し進めることで、この理論的ブレークスルーを実践します。まず、MCMC法による積分演算子のスペクトルの推定に対する新しい指数不等式を確立します。目新しいのは、この結果が正と負の固有値を持つカーネルに当てはまることです。これは、私たちが知る限りでは新しいことです。さらに、ペアワイズ損失関数とマルコフ連鎖サンプルを使用するオンラインアルゴリズムの一般化パフォーマンスを調査します。任意のオンライン学習者によって生成された仮説のシーケンスから低リスクの仮説を抽出する方法を示すことにより、オンラインからバッチへの変換結果を提供します。最後に、マルコフ連鎖の定常測度の密度に対する適合度検定の非漸近分析を示します。$L^2$距離に基づく検定が規定の検出力を持ついくつかの選択肢のクラスを特定します。

A Nonconvex Framework for Structured Dynamic Covariance Recovery
構造化動的共分散回復のための非凸フレームワーク

We propose a flexible, yet interpretable model for high-dimensional data with time-varying second-order statistics, motivated by and applied to functional neuroimaging data. Our approach implements the neuroscientific hypothesis of discrete cognitive processes by factorizing covariances into sparse spatial and smooth temporal components. Although this factorization results in parsimony and domain interpretability, the resulting estimation problem is nonconvex. We design a two-stage optimization scheme with a tailored spectral initialization, combined with iteratively refined alternating projected gradient descent. We prove a linear convergence rate up to a nontrivial statistical error for the proposed descent scheme and establish sample complexity guarantees for the estimator. Empirical results using simulated data and brain imaging data illustrate that our approach outperforms existing baselines.

私たちは、機能的神経画像データを動機として、時間的に変化する二次統計量を持つ高次元データのための柔軟でありながら解釈可能なモデルを提案し、同データに適用します。私たちのアプローチは、共分散をスパースな空間成分と滑らかな時間成分に因数分解することにより、離散的な認知プロセスという神経科学的な仮説を実装します。この因数分解により、倹約性と領域の解釈可能性が得られますが、結果として得られる推定問題は非凸型です。私たちは、調整されたスペクトル初期化と、反復的に改良される交互射影勾配降下法を組み合わせた2段階の最適化スキームを設計します。提案された降下スキームの非自明な統計誤差までの線形収束率を証明し、推定器のサンプル複雑さの保証を確立します。シミュレーションデータと脳画像データを用いた経験的結果は、私たちのアプローチが既存のベースラインを凌駕していることを示しています。

A Forward Approach for Sufficient Dimension Reduction in Binary Classification
二項分類における十分な次元削減のための前方アプローチ

Since the proposal of the seminal sliced inverse regression (SIR), inverse-type methods have proved to be canonical in sufficient dimension reduction (SDR). However, they often underperform in binary classification because the binary responses yield two slices at most. In this article, we develop a forward SDR approach in binary classification based on weighted large-margin classifiers. First, we show that the gradient of a large-margin classifier is unbiased for SDR as long as the corresponding loss function is Fisher consistent. This leads us to propose the weighted outer-product of gradients (wOPG) estimator. The wOPG estimator can recover the central subspace exhaustively without linearity (or constant variance) conditions, which, despite being routinely required, are untestable assumptions. We propose the gradient-based formulation for the large-margin classifier to estimate the gradient function of the classifier directly. We also establish the consistency of the proposed wOPG estimator and demonstrate its promising finite-sample performance through both simulated and real data examples.

画期的なスライス逆回帰(SIR)の提案以来、逆タイプの方法は十分な次元削減(SDR)の標準であることが証明されています。ただし、バイナリ応答では最大2つのスライスしか生成されないため、バイナリ分類ではパフォーマンスが低下することがよくあります。この記事では、重み付けされた大マージン分類器に基づくバイナリ分類における順方向SDRアプローチを開発します。まず、対応する損失関数がフィッシャー整合である限り、大マージン分類器の勾配はSDRに対して不偏であることを示します。これにより、重み付け勾配外積(wOPG)推定量を提案します。wOPG推定量は、日常的に要求されるにもかかわらず、検証不可能な仮定である線形性(または一定分散)条件なしに、中心部分空間を網羅的に回復できます。分類器の勾配関数を直接推定するために、大マージン分類器の勾配ベースの定式化を提案します。また、提案されたwOPG推定量の一貫性を確立し、シミュレーションと実際のデータの両方の例を通じて、その有望な有限サンプルのパフォーマンスを実証します。

Meta-analysis of heterogeneous data: integrative sparse regression in high-dimensions
異種データのメタアナリシス:高次元における統合スパース回帰

We consider the task of meta-analysis in high-dimensional settings in which the data sources are similar but non-identical. To borrow strength across such heterogeneous datasets, we introduce a global parameter that emphasizes interpretability and statistical efficiency in the presence of heterogeneity. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset. For high-dimensional linear model settings, we demonstrate the superiority of our identification restrictions in adapting to a previously seen data distribution as well as predicting for a new/unseen data distribution. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset involving several different cancer cell-lines.

私たちは、データソースが類似しているが同一ではない高次元の設定でのメタアナリシスのタスクを検討します。このような異種データセット全体の強度を借りるために、異質性が存在する場合の解釈可能性と統計効率を強調するグローバルパラメータを導入します。また、データソースの匿名性を保持し、結合されたデータセットのサイズに依存する速度で収束するグローバルパラメーターの1回限りの推定器も提案します。高次元線形モデルの設定では、以前に見たデータ分布への適応と、新しい/未知のデータ分布の予測における同定制限の優位性を示します。最後に、いくつかの異なるがん細胞株を含む大規模な薬物治療データセットで、このアプローチの利点を実証します。

InterpretDL: Explaining Deep Models in PaddlePaddle
InterpretDL: PaddlePaddle のディープモデルの説明

Techniques to explain the predictions of deep neural networks (DNNs) have been in great demand for gaining insights into the black boxes. We introduce InterpretDL, a toolkit of explanation algorithms based on PaddlePaddle, with unified programming interfaces and “plug-and-play” designs. Only a few lines of code are needed to obtain the explanation results without modifying the structure of the model. InterpretDL currently contains 16 algorithms, explaining training phases, datasets, and global and local behaviors of post-trained deep models. InterpretDL also provides a number of tutorial examples and showcases to demonstrate the capability of InterpretDL working on a wide range of deep learning models, e.g., Convolutional Neural Networks (CNNs), Multi-Layer Perceptrons (MLPs), Transformers, etc., for various tasks in both Computer Vision (CV) and Natural Language Processing (NLP). Furthermore, InterpretDL modularizes the implementations, making efforts to support compatibility across frameworks. The project is available at https://github.com/PaddlePaddle/InterpretDL.

ブラックボックスについての洞察を得るために、ディープニューラルネットワーク(DNN)の予測を説明する手法が広く求められてきました。ここでは、統一されたプログラミングインターフェイスと「プラグアンドプレイ」設計を備えた、PaddlePaddleに基づく説明アルゴリズムのツールキットであるInterpretDLを紹介します。数行のコードだけで、モデルの構造を変更せずに説明結果を取得できます。InterpretDLには現在、トレーニングフェーズ、データセット、トレーニング後のディープモデルのグローバルおよびローカル動作を説明する16のアルゴリズムが含まれています。InterpretDLには、コンピュータービジョン(CV)と自然言語処理(NLP)の両方のさまざまなタスクで、畳み込みニューラルネットワーク(CNN)、多層パーセプトロン(MLP)、トランスフォーマーなど、幅広いディープラーニングモデルで動作するInterpretDLの機能を示すチュートリアル例とショーケースも多数用意されています。さらに、InterpretDLは実装をモジュール化し、フレームワーク間の互換性をサポートするよう努めています。このプロジェクトはhttps://github.com/PaddlePaddle/InterpretDLで入手できます。

Universal Approximation Theorems for Differentiable Geometric Deep Learning
微分可能幾何学的深層学習のためのユニバーサル近似定理

This paper addresses the growing need to process non-Euclidean data, by introducing a geometric deep learning (GDL) framework for building universal feedforward-type models compatible with differentiable manifold geometries. We show that our GDL models can approximate any continuous target function uniformly on compact sets of a controlled maximum diameter. We obtain curvature-dependent lower-bounds on this maximum diameter and upper-bounds on the depth of our approximating GDL models. Conversely, we find that there is always a continuous function between any two non-degenerate compact manifolds that any “locally-defined” GDL model cannot uniformly approximate. Our last main result identifies data-dependent conditions guaranteeing that the GDL model implementing our approximation breaks “the curse of dimensionality.” We find that any “real-world” (i.e. finite) dataset always satisfies our condition and, conversely, any dataset satisfies our requirement if the target function is smooth. As applications, we confirm the universal approximation capabilities of the following GDL models: Ganea et al. (2018)’s hyperbolic feedforward networks, the architecture implementing Krishnan et al. (2015)’s deep Kalman-Filter, and deep softmax classifiers. We build universal extensions/variants of: the SPD-matrix regressor of Meyer et al. (2011), and Fletcher (2003)’s Procrustean regressor. In the Euclidean setting, our results imply a quantitative version of Kidger and Lyons (2020)’s approximation theorem and a data-dependent version of Yarotsky and Zhevnerchuk (2019)’s uncursed approximation rates.

この論文では、微分可能な多様体ジオメトリと互換性のあるユニバーサルなフィードフォワード型モデルを構築するための幾何学的ディープラーニング(GDL)フレームワークを紹介することで、非ユークリッドデータを処理する必要性の高まりに対処します。GDLモデルは、制御された最大直径のコンパクトセット上で任意の連続ターゲット関数を均一に近似できることを示します。この最大直径の曲率依存の下限と、近似GDLモデルの深さの上限を取得します。逆に、任意の2つの非退化コンパクト多様体間には、任意の「ローカル定義」GDLモデルで均一に近似できない連続関数が常に存在することがわかります。最後の主要な結果では、近似を実装するGDLモデルが「次元の呪い」を破ることを保証するデータ依存の条件を特定します。任意の「現実世界」(つまり有限)データセットは常に条件を満たし、逆にターゲット関数が滑らかであれば任意のデータセットが要件を満たすことがわかります。アプリケーションとして、Ganeaら(2018)の双曲フィードフォワードネットワーク、Krishnanら(2015)のディープカルマンフィルターを実装するアーキテクチャ、ディープソフトマックス分類器などのGDLモデルの普遍的な近似機能を確認します。Meyerら(2011)のSPD行列回帰器とFletcher (2003)のプロクラステ回帰器の普遍的な拡張/変形を構築します。ユークリッド設定では、KidgerとLyons (2020)の近似定理の定量的バージョンと、YarotskyとZhevnerchuk (2019)のuncursed近似率のデータ依存バージョンが示唆されます。

Distributed Bootstrap for Simultaneous Inference Under High Dimensionality
高次元下での同時推論のための分散ブートストラップ

We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available in Supplementary Material.

私たちは、多くのマシンで保存・処理される高次元の大量データに対して同時推論を行う分散ブートストラップ法を提案します。この手法は、通信効率の高いバイアス除去lassoに基づいて$\ell_\infty$ノルム信頼領域を生成します。また、反復ごとに手法を調整するための効率的な交差検証アプローチも提案します。理論的には、統計的な精度と効率を保証する通信ラウンド数$\tau_{\min}$の下限を証明します。さらに、$\tau_{\min}$は、ワーカー数と固有次元に対して対数的にのみ増加しますが、名目次元に対してはほぼ不変です。私たちは、広範なシミュレーション研究と、US Airline On-Time Performanceデータセットに基づく半合成データセットに対する変数スクリーニングタスクによって、理論を検証します。数値結果を再現するコードは、補足資料で入手できます。

Uniform deconvolution for Poisson Point Processes
ポアソン点過程の一様なデコンボリューション

We focus on the estimation of the intensity of a Poisson process in the presence of a uniform noise. We propose a kernel-based procedure fully calibrated in theory and practice. We show that our adaptive estimator is optimal from the oracle and minimax points of view, and provide new lower bounds when the intensity belongs to a Sobolev ball. By developing the Goldenshluger-Lepski methodology in the case of deconvolution for Poisson processes, we propose an optimal data-driven selection of the kernel bandwidth. Our method is illustrated on the spatial distribution of replication origins and sequence motifs along the human genome.

私たちは、一様ノイズの存在下でのポアソン過程の強度の推定に焦点を当てています。私たちは、理論的にも実践的にも完全に較正されたカーネルベースの手順を提案します。適応推定量がオラクルとミニマックスの観点から最適であることを示し、強度がソボレフボールに属するときに新しい下限を提供します。ポアソン過程のデコンボリューションの場合のGoldenshluger-Lepski方法論を開発することにより、カーネル帯域幅の最適なデータ駆動型選択を提案します。私たちの方法を、ヒトゲノムに沿った複製起点と配列モチーフの空間分布に適用して例示します。

Gaussian process regression: Optimality, robustness, and relationship with kernel ridge regression
ガウス過程回帰:最適性、ロバスト性、およびカーネルリッジ回帰との関係

Gaussian process regression is widely used in many fields, for example, machine learning, reinforcement learning and uncertainty quantification. One key component of Gaussian process regression is the unknown correlation function, which needs to be specified. In this paper, we investigate what would happen if the correlation function is misspecified. We derive upper and lower error bounds for Gaussian process regression with possibly misspecified correlation functions. We find that when the sampling scheme is quasi-uniform, the optimal convergence rate can be attained even if the smoothness of the imposed correlation function exceeds that of the true correlation function. We also obtain convergence rates of kernel ridge regression with misspecified kernel function, where the underlying truth is a deterministic function. Our study reveals a close connection between the convergence rates of Gaussian process regression and kernel ridge regression, which is aligned with the relationship between sample paths of Gaussian process and the corresponding reproducing kernel Hilbert space. This work establishes a bridge between Bayesian learning based on Gaussian process and frequentist kernel methods with reproducing kernel Hilbert space.

ガウス過程回帰は、機械学習、強化学習、不確実性定量化など、多くの分野で広く使用されています。ガウス過程回帰の重要な要素の1つは、指定する必要がある未知の相関関数です。この論文では、相関関数が誤って指定された場合に何が起こるかを調査します。相関関数が誤って指定されている可能性のあるガウス過程回帰について、誤差の上界と下界を導出します。サンプリングスキームが準一様である場合、課された相関関数の滑らかさが真の相関関数の滑らかさを超えていても、最適な収束率を達成できることがわかりました。また、基礎となる真の関数が決定論的関数である場合に、カーネル関数を誤って指定したカーネルリッジ回帰の収束率も導出します。この研究は、ガウス過程回帰とカーネルリッジ回帰の収束率の間に密接な関係があることを明らかにしており、これはガウス過程のサンプルパスと対応する再生核ヒルベルト空間との関係と整合しています。この研究は、ガウス過程に基づくベイズ学習と、再生核ヒルベルト空間を用いる頻度主義的カーネル法との間に橋を架けます。
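
The connection between the two estimators can be made concrete with a standard textbook identity. The block below is not taken from the paper; it assumes a Gaussian process prior with kernel $k$, i.i.d. Gaussian observation noise of variance $\sigma^2$, and kernel ridge regression with a $1/n$-normalized squared loss. The symbols $m$, $\hat f$, $K$ and $\lambda$ are introduced here for illustration only.

```latex
% GP regression posterior mean (prior kernel k, noise variance \sigma^2)
m(x) = k(x, X)\,\bigl(K + \sigma^2 I_n\bigr)^{-1} y,
\qquad K_{ij} = k(x_i, x_j).

% Kernel ridge regression in the RKHS \mathcal{H}_k
\hat f = \arg\min_{f \in \mathcal{H}_k}
  \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  + \lambda \|f\|_{\mathcal{H}_k}^2,
\qquad
\hat f(x) = k(x, X)\,\bigl(K + n\lambda I_n\bigr)^{-1} y.

% The two predictors coincide when \lambda = \sigma^2 / n.
```

Under these assumptions the point predictions agree exactly, which is one reason convergence rates for the two methods can be expected to be related.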

A Bregman Learning Framework for Sparse Neural Networks
スパースニューラルネットワークのためのブレグマン学習フレームワーク

We propose a learning framework based on stochastic Bregman iterations, also known as mirror descent, to train sparse neural networks with an inverse scale space approach. We derive a baseline algorithm called LinBreg, an accelerated version using momentum, and AdaBreg, which is a Bregmanized generalization of the Adam algorithm. In contrast to established methods for sparse training the proposed family of algorithms constitutes a regrowth strategy for neural networks that is solely optimization-based without additional heuristics. Our Bregman learning framework starts the training with very few initial parameters, successively adding only significant ones to obtain a sparse and expressive network. The proposed approach is extremely easy and efficient, yet supported by the rich mathematical theory of inverse scale space methods. We derive a statistically profound sparse parameter initialization strategy and provide a rigorous stochastic convergence analysis of the loss decay and additional convergence proofs in the convex regime. Using only $3.4\%$ of the parameters of ResNet-18 we achieve $90.2\%$ test accuracy on CIFAR-10, compared to $93.6\%$ using the dense network. Our algorithm also unveils an autoencoder architecture for a denoising task. The proposed framework also has a huge potential for integrating sparse backpropagation and resource-friendly training. Code is available at https://github.com/TimRoith/BregmanLearning.

私たちは、逆スケール空間アプローチでスパースニューラルネットワークをトレーニングするための、確率的ブレグマン反復法（ミラー降下法とも呼ばれる）に基づく学習フレームワークを提案します。私たちは、LinBregと呼ばれるベースラインアルゴリズム、モメンタムを用いたその加速版、そしてAdamアルゴリズムのブレグマン化された一般化であるAdaBregを導出します。スパーストレーニングの確立された方法とは対照的に、提案されたアルゴリズムファミリーは、追加のヒューリスティックなしで完全に最適化に基づいたニューラルネットワークの再成長戦略を構成します。我々のブレグマン学習フレームワークは、非常に少ない初期パラメータでトレーニングを開始し、スパースで表現力豊かなネットワークを得るために重要なパラメータのみを順次追加します。提案されたアプローチは非常に簡単で効率的でありながら、逆スケール空間法の豊富な数学理論によって裏付けられています。私たちは、統計的根拠に基づくスパースパラメータ初期化戦略を導出し、損失減衰の厳密な確率的収束分析と凸領域での追加の収束証明を提供します。ResNet-18のパラメータのわずか$3.4\%$を使用して、CIFAR-10で$90.2\%$のテスト精度を達成しました。これに対し、密なネットワークを使用した場合は$93.6\%$です。私たちのアルゴリズムは、ノイズ除去タスク用のオートエンコーダアーキテクチャも明らかにします。提案されたフレームワークには、スパースバックプロパゲーションや省リソースなトレーニングと統合できる大きな可能性もあります。コードはhttps://github.com/TimRoith/BregmanLearningで入手できます。
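
The inverse scale space behaviour (start sparse, activate only significant parameters) can be seen on a toy problem. The abstract does not state the LinBreg update, so the sketch below assumes the usual linearized Bregman form with an $\ell_1$ regularizer: a gradient step on a dual variable `v` followed by soft-thresholding. The toy loss $\tfrac12\|\theta - y\|^2$ and the names `shrink`, `linbreg_denoise`, `lam`, `delta` and `tau` are illustrative choices, not the repository's API.

```python
def shrink(v, lam):
    """Soft-thresholding: the proximal map of lam * |.|."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def linbreg_denoise(y, lam=1.0, delta=1.0, tau=0.1, iters=300):
    """Linearized Bregman iteration for the toy loss 0.5 * ||theta - y||^2."""
    v = [0.0] * len(y)              # dual (subgradient) variable
    theta = [0.0] * len(y)          # parameters start fully sparse
    first_active = [None] * len(y)  # iteration at which each entry turns nonzero
    for k in range(1, iters + 1):
        grad = [t - yi for t, yi in zip(theta, y)]
        v = [vi - tau * g for vi, g in zip(v, grad)]    # gradient step on v
        theta = [delta * shrink(vi, lam) for vi in v]   # sparsifying proximal step
        for i, t in enumerate(theta):
            if t != 0.0 and first_active[i] is None:
                first_active[i] = k
    return theta, first_active


theta, first_active = linbreg_denoise([3.0, 0.0, -2.0, 0.0])
print(theta)
print(first_active)
```

In this toy run the entry with the largest signal becomes nonzero first, and entries with no signal never activate, which is the qualitative behaviour the abstract describes.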

Deep Limits and a Cut-Off Phenomenon for Neural Networks
ニューラルネットワークの深層極限とカットオフ現象

We consider dynamical and geometrical aspects of deep learning. For many standard choices of layer maps we display semi-invariant metrics which quantify differences between data or decision functions. This allows us, when considering random layer maps and using non-commutative ergodic theorems, to deduce that certain limits exist when letting the number of layers tend to infinity. We also examine the random initialization of standard networks where we observe a surprising cut-off phenomenon in terms of the number of layers, the depth of the network. This could be a relevant parameter when choosing an appropriate number of layers for a given learning task, or for selecting a good initialization procedure. More generally, we hope that the notions and results in this paper can provide a framework, in particular a geometric one, for a part of the theoretical understanding of deep neural networks.

私たちは、深層学習の力学的および幾何学的側面を考察します。層写像の多くの標準的な選択肢に対して、データ間または決定関数間の違いを定量化する半不変な距離を示します。これにより、ランダムな層写像を考え、非可換エルゴード定理を用いることで、層の数を無限大に近づけたときにある種の極限が存在することを導けます。また、標準的なネットワークのランダム初期化についても調べ、層の数、すなわちネットワークの深さに関して驚くべきカットオフ現象を観察します。これは、特定の学習タスクに適切な層数を選択する場合、または適切な初期化手順を選択する場合に関連するパラメータになる可能性があります。より一般的には、この論文の概念と結果が、ディープニューラルネットワークの理論的理解の一部に対して枠組み、特に幾何学的な枠組みを提供できることを願っています。

Clustering with Semidefinite Programming and Fixed Point Iteration
半定値計画法と不動点反復によるクラスタリング

We introduce a novel method for clustering using a semidefinite programming (SDP) relaxation of the Max k-Cut problem. The approach is based on a new methodology for rounding the solution of an SDP relaxation using iterated linear optimization. We show the vertices of the Max k-Cut relaxation correspond to partitions of the data into at most k sets. We also show the vertices are attractive fixed points of iterated linear optimization. Each step of this iterative process solves a relaxation of the closest vertex problem and leads to a new clustering problem where the underlying clusters are more clearly defined. Our experiments show that using fixed point iteration for rounding the Max k-Cut SDP relaxation leads to significantly better results when compared to randomized rounding.

私たちは、Max k-Cut問題の半定値計画(SDP)緩和を使用したクラスタリングの新しい方法を紹介します。このアプローチは、反復線形最適化を使用してSDP緩和の解を丸めるための新しい手法に基づいています。Max k-Cut緩和の頂点が、データを高々k個の集合に分ける分割に対応することを示します。また、これらの頂点が反復線形最適化の吸引的な不動点であることも示します。この反復プロセスの各ステップは、最近接頂点問題の緩和を解き、基になるクラスターがより明確に定義された新しいクラスタリング問題を導きます。私たちの実験では、Max k-Cut SDP緩和の丸めに不動点反復を使用すると、ランダム化丸めと比較して大幅に良い結果が得られることが示されています。

Learning to Optimize: A Primer and A Benchmark
最適化の学習:入門書とベンチマーク

Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods, aiming at reducing the laborious iterations of hand engineering. It automates the design of an optimization method based on its performance on a set of training problems. This data-driven procedure generates methods that can efficiently solve problems similar to those in training. In sharp contrast, the typical and traditional designs of optimization methods are theory-driven, so they obtain performance guarantees over the classes of problems specified by the theory. The difference makes L2O suitable for repeatedly solving a particular optimization problem over a specific distribution of data, while it typically fails on out-of-distribution problems. The practicality of L2O depends on the type of target optimization, the chosen architecture of the method to learn, and the training procedure. This new paradigm has motivated a community of researchers to explore L2O and report their findings. This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization. We set up taxonomies, categorize existing works and research directions, present insights, and identify open challenges. We benchmarked many existing L2O approaches on a few representative optimization problems. For reproducible research and fair benchmarking purposes, we released our software implementation and data in the package Open-L2O at https://github.com/VITA-Group/Open-L2O.

最適化学習(L2O)は、機械学習を活用して最適化手法を開発する新しいアプローチであり、手作業による設計の面倒な反復を減らすことを目的としています。一連のトレーニング問題に対するパフォーマンスに基づいて、最適化手法の設計を自動化します。このデータ駆動型の手順により、トレーニングの問題に類似した問題を効率的に解決できる手法が生成されます。対照的に、最適化手法の典型的な従来の設計は理論駆動型であるため、理論によって指定された問題のクラスでパフォーマンスが保証されます。この違いにより、L2Oは特定のデータ分布上で特定の最適化問題を繰り返し解くのに適していますが、分布外の問題では通常失敗します。L2Oの実用性は、対象とする最適化のタイプ、学習する手法のアーキテクチャの選択、およびトレーニング手順に依存します。この新しいパラダイムにより、研究者のコミュニティがL2Oを探究し、その成果を報告するようになりました。この記事は、連続最適化のためのL2Oに関する最初の包括的なサーベイおよびベンチマークとなることを目指しています。私たちは分類体系を整備し、既存の研究と研究の方向性を分類し、洞察を提示し、未解決の課題を特定します。いくつかの代表的な最適化問題で、多くの既存のL2Oアプローチをベンチマークしました。再現可能な研究と公正なベンチマークを目的として、ソフトウェア実装とデータをOpen-L2Oパッケージとしてhttps://github.com/VITA-Group/Open-L2Oで公開しました。

Active Structure Learning of Bayesian Networks in an Observational Setting
観測環境におけるベイジアンネットワークの能動的構造学習

We study active structure learning of Bayesian networks in an observational setting, in which there are external limitations on the number of variable values that can be observed from the same sample. Random samples are drawn from the joint distribution of the network variables, and the algorithm iteratively selects which variables to observe in the next sample. We propose a new active learning algorithm for this setting, that finds with a high probability a structure with a score that is $\epsilon$-close to the optimal score. We show that for a class of distributions that we term stable, a sample complexity reduction of up to a factor of $\widetilde{\Omega}(d^3)$ can be obtained, where $d$ is the number of network variables. We further show that in the worst case, the sample complexity of the active algorithm is guaranteed to be almost the same as that of a naive baseline algorithm. To supplement the theoretical results, we report experiments that compare the performance of the new active algorithm to the naive baseline and demonstrate the sample complexity improvements. Code for the algorithm and for the experiments is provided at https://github.com/noabdavid/activeBNSL.

私たちは、同じサンプルから観測できる変数値の数に外的な制限がある観測設定でのベイジアンネットワークの能動的構造学習を研究します。ランダムサンプルはネットワーク変数の同時分布から抽出され、アルゴリズムは次のサンプルでどの変数を観測するかを反復的に選択します。私たちは、この設定のための新しい能動学習アルゴリズムを提案します。このアルゴリズムは、最適スコアに$\epsilon$近いスコアを持つ構造を高い確率で見つけます。安定と呼ぶ分布のクラスでは、サンプル複雑度を最大で$\widetilde{\Omega}(d^3)$倍削減できることを示します。ここで、$d$はネットワーク変数の数です。さらに、最悪の場合でも、能動アルゴリズムのサンプル複雑度は、単純なベースラインアルゴリズムのサンプル複雑度とほぼ同じになることが保証されることを示します。理論的な結果を補足するために、新しい能動アルゴリズムのパフォーマンスを単純なベースラインと比較し、サンプル複雑度の改善を示す実験を報告します。アルゴリズムと実験のコードはhttps://github.com/noabdavid/activeBNSLで提供されています。

Adversarial Classification: Necessary Conditions and Geometric Flows
敵対的分類:必要条件と幾何学的流れ

We study a version of adversarial classification where an adversary is empowered to corrupt data inputs up to some distance $\varepsilon$, using tools from variational analysis. In particular, we describe necessary conditions associated with the optimal classifier subject to such an adversary. Using the necessary conditions, we derive a geometric evolution equation which can be used to track the change in classification boundaries as $\varepsilon$ varies. This evolution equation may be described as an uncoupled system of differential equations in one dimension, or as a mean curvature type equation in higher dimension. In one dimension, and under mild assumptions on the data distribution, we rigorously prove that one can use the initial value problem starting from $\varepsilon=0$, which is simply the Bayes classifier, in order to solve for the global minimizer of the adversarial problem for small values of $\varepsilon$. In higher dimensions we provide a similar result, albeit conditional to the existence of regular solutions of the initial value problem. In the process of proving our main results we obtain a result of independent interest connecting the original adversarial problem with an optimal transport problem under no assumptions on whether classes are balanced or not. Numerical examples illustrating these ideas are also presented.

私たちは、変分解析のツールを用いて、敵対者がある距離$\varepsilon$までデータ入力を改ざんできる敵対的分類のバージョンを研究します。特に、そのような敵対者に対する最適な分類器に関連する必要条件を説明します。必要条件を用いて、$\varepsilon$が変化するにつれて分類境界がどのように変化するかを追跡するために使用できる幾何進化方程式を導出します。この進化方程式は、1次元では非結合微分方程式系として、または高次元では平均曲率型方程式として記述することができます。1次元では、データ分布に関する軽い仮定の下で、$\varepsilon=0$から始まる初期値問題(これは単にベイズ分類器である)を使用して、$\varepsilon$の値が小さい場合の敵対的問題のグローバル最小化を解くことができることを厳密に証明します。高次元では、初期値問題の正規解が存在するという条件付きではあるが、同様の結果を提供します。主な結果を証明する過程で、クラスがバランスしているかどうかという仮定を置かずに、元の敵対的問題と最適輸送問題を結び付ける独立した興味深い結果が得られました。これらのアイデアを示す数値例も示します。

Estimating Density Models with Truncation Boundaries using Score Matching
スコアマッチングを使用した打ち切り境界を持つ密度モデルの推定

Truncated densities are probability density functions defined on truncated domains. They share the same parametric form with their non-truncated counterparts up to a normalizing constant. Since the computation of their normalizing constants is usually infeasible, Maximum Likelihood Estimation cannot be easily applied to estimate truncated density models. Score Matching (SM) is a powerful tool for fitting parameters using only unnormalized models. However, it cannot be directly applied here as boundary conditions that derive a tractable SM objective are not satisfied by truncated densities. This paper studies parameter estimation for truncated probability densities using SM. The estimator minimizes a weighted Fisher divergence. The weight function is simply the shortest distance from a data point to the domain’s boundary. We show this choice of weight function naturally arises from minimizing the Stein discrepancy and upper bounding the finite-sample estimation error. We demonstrate the usefulness of our method via numerical experiments and a study on the Chicago crime data set. We also show that the proposed density estimation can correct the outlier-trimming bias caused by aggressive outlier detection methods.

切り捨て密度は、切り捨てられた領域で定義される確率密度関数です。正規化定数を除けば、切り捨てられていないものと同じパラメータ形式を共有します。正規化定数の計算は通常実行不可能なので、最大尤度推定法は切り捨てられた密度モデルの推定に簡単には適用できません。スコアマッチング(SM)は、正規化されていないモデルのみを使用してパラメータをフィッティングするための強力なツールです。ただし、扱いやすいSM目的関数を導出する境界条件が切り捨てられた密度では満たされないため、ここでは直接適用できません。この論文では、SMを使用した切り捨てられた確率密度のパラメータ推定について検討します。推定量は、重み付きフィッシャーダイバージェンスを最小化します。重み関数は、データポイントから領域の境界までの最短距離です。この重み関数の選択は、Stein不一致を最小化し、有限サンプル推定誤差の上限を設定することから自然に生じることを示します。数値実験とシカゴの犯罪データセットの研究を通じて、この方法の有用性を実証します。また、提案された密度推定により、積極的な外れ値検出方法によって引き起こされる外れ値トリミングバイアスを修正できることも示します。

A Primer for Neural Arithmetic Logic Modules
ニューラル算術論理モジュールの入門書

Neural Arithmetic Logic Modules have become a growing area of interest, though remain a niche field. These modules are neural networks which aim to achieve systematic generalisation in learning arithmetic and/or logic operations such as $\{+, -, \times, \div, \leq, \textrm{AND}\}$ while also being interpretable. This paper is the first in discussing the current state of progress of this field, explaining key works, starting with the Neural Arithmetic Logic Unit (NALU). Focusing on the shortcomings of the NALU, we provide an in-depth analysis to reason about design choices of recent modules. A cross-comparison between modules is made on experiment setups and findings, where we highlight inconsistencies in a fundamental experiment causing the inability to directly compare across papers. To alleviate the existing inconsistencies, we create a benchmark which compares all existing arithmetic NALMs. We finish by providing a novel discussion of existing applications for NALU and research directions requiring further exploration.

ニューラル算術論理モジュールは、ニッチな分野ではありますが、関心が高まっている分野です。これらのモジュールは、解釈可能でありながら、$\{+, -, \times, \div, \leq, \textrm{AND}\}$などの算術および/または論理演算の学習における体系的な汎化を達成することを目的としたニューラルネットワークです。この論文は、ニューラル算術論理ユニット(NALU)から始めて主要な研究を説明し、この分野の進歩の現状を論じる最初の論文です。NALUの欠点に焦点を当て、最近のモジュールの設計上の選択を考察するための詳細な分析を提供します。実験設定と結果についてモジュール間の相互比較を行い、論文間の直接比較を不可能にしている基本的な実験上の不一致を明らかにします。既存の不一致を軽減するために、既存のすべての算術NALMを比較するベンチマークを作成します。最後に、NALUの既存のアプリケーションと、さらなる調査が必要な研究の方向性について新たな議論を提供します。

Statistical Optimality and Stability of Tangent Transform Algorithms in Logit Models
ロジットモデルにおける接線変換アルゴリズムの統計的最適性と安定性

A systematic approach to finding variational approximation in an otherwise intractable non-conjugate model is to exploit the general principle of convex duality by minorizing the marginal likelihood that renders the problem tractable. While such approaches are popular in the context of variational inference in non-conjugate Bayesian models, theoretical guarantees on statistical optimality and algorithmic convergence are lacking. Focusing on logistic regression models, we provide mild conditions on the data generating process to derive non-asymptotic upper bounds to the risk incurred by the variational optima. We demonstrate that these assumptions can be completely relaxed if one considers a slight variation of the algorithm by raising the likelihood to a fractional power. Next, we utilize the theory of dynamical systems to provide convergence guarantees for such algorithms in logistic and multinomial logit regression. In particular, we establish local asymptotic stability of the algorithm without any assumptions on the data-generating process. We explore a special case involving a semi-orthogonal design under which a global convergence is obtained. The theory is further illustrated using several numerical studies.

本来は扱いにくい非共役モデルで変分近似を見つけるための体系的なアプローチは、凸双対性の一般原理を利用して周辺尤度を下から抑え(minorize)、問題を扱いやすくすることです。このようなアプローチは非共役ベイズモデルの変分推論の文脈では一般的ですが、統計的最適性とアルゴリズムの収束に関する理論的保証が欠けています。ロジスティック回帰モデルに焦点を当て、データ生成過程に関する緩やかな条件の下で、変分最適解が被るリスクの非漸近的な上界を導出します。尤度を分数べきに上げるというアルゴリズムのわずかな変形を考えると、これらの仮定を完全に外せることを示します。次に、力学系の理論を利用して、ロジスティック回帰および多項ロジット回帰におけるこのようなアルゴリズムの収束保証を提供します。特に、データ生成過程に関する仮定なしで、アルゴリズムの局所漸近安定性を確立します。さらに、大域的収束が得られる半直交計画という特殊なケースを検討します。この理論は、いくつかの数値実験によってさらに例示されます。
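
As a small numerical illustration of minorizing a logistic likelihood term, the sketch below assumes that "tangent transform" refers to the Jaakkola-Jordan quadratic lower bound on $\log\sigma(x)$; the abstract itself does not state the formula. The function names `log_sigmoid`, `jj_lambda` and `jj_lower_bound` are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def jj_lambda(xi):
    """lambda(xi) = tanh(xi / 2) / (4 xi), with its limit 1/8 at xi = 0."""
    return 0.125 if xi == 0 else math.tanh(xi / 2.0) / (4.0 * xi)


def jj_lower_bound(x, xi):
    """Quadratic-in-x lower bound on log_sigmoid(x), tight at x = +xi and x = -xi."""
    return log_sigmoid(xi) + (x - xi) / 2.0 - jj_lambda(xi) * (x * x - xi * xi)


for x in (-3.0, 0.0, 1.5, 4.0):
    print(x, log_sigmoid(x), jj_lower_bound(x, xi=1.5))
```

Because the bound is quadratic in `x`, plugging it into a logistic likelihood with a Gaussian prior yields a Gaussian approximation, and the variational parameter `xi` is then updated to tighten the bound.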

Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution
Boulevard:正則化確率的勾配ブースティング木とその極限分布

This paper examines a novel gradient boosting framework for regression. We regularize gradient boosted trees by introducing subsampling and employ a modified shrinkage algorithm so that at every boosting stage the estimate is given by an average of trees. The resulting algorithm, titled “Boulevard”, is shown to converge as the number of trees grows. This construction allows us to demonstrate a central limit theorem for this limit, providing a characterization of uncertainty for predictions. A simulation study and real world examples provide support for both the predictive accuracy of the model and its limiting behavior.

この論文では、回帰のための新しい勾配ブースティングフレームワークを検討します。サブサンプリングを導入することで勾配ブースティング木を正則化し、修正された縮小アルゴリズムを採用して、すべてのブースティングステージで推定値が木の平均によって与えられるようにします。得られたアルゴリズム「Boulevard」は、木の数が増えるにつれて収束することが示されます。この構成により、この極限に対する中心極限定理を示し、予測の不確実性を特徴付けることができます。シミュレーション研究と実世界の例は、モデルの予測精度とその極限挙動の両方を裏付けています。
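
The "average of trees" idea can be sketched on a one-dimensional toy problem. The abstract does not give the update rule, so the block below assumes the running-average form $\hat f_b = \frac{b-1}{b}\hat f_{b-1} + \frac{\lambda}{b} t_b$, where $t_b$ is fit to the current residuals on a random subsample, and assumes that predictions are rescaled by $(1+\lambda)/\lambda$ at the end. Regression stumps stand in for trees, and all function names are hypothetical.

```python
import random


def fit_stump(xs, rs):
    """Least-squares regression stump on 1-D inputs: (threshold, left, right)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(rs)
    mean = total / n
    best_gain, best = -1.0, (float("inf"), mean, mean)
    left_sum = 0.0
    for pos in range(1, n):
        left_sum += rs[order[pos - 1]]
        lo, hi = xs[order[pos - 1]], xs[order[pos]]
        if lo == hi:
            continue  # cannot split between identical inputs
        left_mean = left_sum / pos
        right_mean = (total - left_sum) / (n - pos)
        # Maximizing this gain is equivalent to minimizing the squared error.
        gain = pos * left_mean ** 2 + (n - pos) * right_mean ** 2
        if gain > best_gain:
            best_gain, best = gain, ((lo + hi) / 2.0, left_mean, right_mean)
    return best


def stump_predict(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right


def boulevard_fit(xs, ys, n_trees=100, lam=0.8, subsample=0.5, seed=0):
    """Assumed Boulevard-style loop: the fit is a running average of shrunken stumps."""
    rng = random.Random(seed)
    n = len(xs)
    m = max(2, int(subsample * n))
    fitted = [0.0] * n
    stumps = []
    for b in range(1, n_trees + 1):
        idx = rng.sample(range(n), m)  # subsample without replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] - fitted[i] for i in idx])
        stumps.append(stump)
        fitted = [((b - 1) * f + lam * stump_predict(stump, x)) / b
                  for f, x in zip(fitted, xs)]
    return stumps


def boulevard_predict(stumps, x, lam=0.8):
    avg = sum(lam * stump_predict(s, x) for s in stumps) / len(stumps)
    return (1.0 + lam) / lam * avg  # undo the shrinkage of the averaged fit


xs = [(i + 0.5) / 100 for i in range(100)]
ys = [1.0 if x > 0.5 else 0.0 for x in xs]
stumps = boulevard_fit(xs, ys)
mse = sum((boulevard_predict(stumps, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(mse)
```

Under the assumed update, the averaged fit settles near $\frac{\lambda}{1+\lambda}$ times the target rather than the target itself, which is why the sketch rescales at prediction time.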

Extensions to the Proximal Distance Method of Constrained Optimization
制約付き最適化の近位距離法の拡張

The current paper studies the problem of minimizing a loss $f(\boldsymbol{x})$ subject to constraints of the form $\boldsymbol{D}\boldsymbol{x} \in S$, where $S$ is a closed set, convex or not, and $\boldsymbol{D}$ is a matrix that fuses parameters. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns. To tackle this generic class of problems, we combine the Beltrami-Courant penalty method of optimization with the proximal distance principle. The latter is driven by minimization of penalized objectives $f(\boldsymbol{x})+\frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x},S)^2$ involving large tuning constants $\rho$ and the squared Euclidean distance of $\boldsymbol{D}\boldsymbol{x}$ from $S$. The next iterate $\boldsymbol{x}_{n+1}$ of the corresponding proximal distance algorithm is constructed from the current iterate $\boldsymbol{x}_n$ by minimizing the majorizing surrogate function $f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x}-\mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$. For fixed $\rho$ and a subanalytic loss $f(\boldsymbol{x})$ and a subanalytic constraint set $S$, we prove convergence to a stationary point. Under stronger assumptions, we provide convergence rates and demonstrate linear local convergence. We also construct a steepest descent variant to avoid costly linear system solves. To benchmark our algorithms, we compare their results to those delivered by the alternating direction method of multipliers. Our extensive numerical tests include problems on metric projection, convex regression, convex clustering, total variation image denoising, and projection of a matrix to a good condition number. These experiments demonstrate the superior speed and acceptable accuracy of our steepest variant on high-dimensional problems.

この論文では、$\boldsymbol{D}\boldsymbol{x} \in S$という形の制約の下で損失$f(\boldsymbol{x})$を最小化する問題を研究します。ここで、$S$は凸または非凸の閉集合であり、$\boldsymbol{D}$はパラメータを融合する行列です。融合制約は、滑らかさ、スパース性、またはより一般的な制約パターンを捉えることができます。この一般的な問題クラスに対処するために、最適化のBeltrami-Courantペナルティ法と近位距離原理を組み合わせます。後者は、大きな調整定数$\rho$と$\boldsymbol{D}\boldsymbol{x}$から$S$までのユークリッド距離の二乗を含むペナルティ付き目的関数$f(\boldsymbol{x})+\frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x},S)^2$の最小化によって駆動されます。対応する近位距離アルゴリズムの次の反復$\boldsymbol{x}_{n+1}$は、現在の反復$\boldsymbol{x}_n$から、優関数となる代理関数$f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x}-\mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$を最小化することによって構築されます。固定された$\rho$、劣解析的な損失$f(\boldsymbol{x})$、および劣解析的な制約集合$S$に対して、停留点への収束を証明します。より強い仮定の下で、収束率を与え、局所線形収束を示します。また、コストのかかる線形システムの求解を回避するために、最急降下法によるバリアントを構築します。私たちのアルゴリズムをベンチマークするために、その結果を交互方向乗数法(ADMM)によって得られる結果と比較します。私たちの広範な数値テストには、距離射影、凸回帰、凸クラスタリング、全変動画像ノイズ除去、および良好な条件数を持つ行列への射影に関する問題が含まれます。これらの実験は、高次元の問題に対する最急降下法バリアントの優れた速度と許容可能な精度を実証しています。
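
The surrogate update in the abstract has a closed form in the simplest setting. The sketch below assumes $f(\boldsymbol{x}) = \tfrac12\|\boldsymbol{x}-\boldsymbol{y}\|^2$, $\boldsymbol{D} = \boldsymbol{I}$ and a fixed $\rho$, so that minimizing the surrogate gives $\boldsymbol{x}_{n+1} = (\boldsymbol{y} + \rho\,\mathcal{P}_S(\boldsymbol{x}_n))/(1+\rho)$. The constraint set (vectors with at most `k` nonzero entries) and the function names are illustrative choices, not the authors' code.

```python
def project_sparse(z, k):
    """Project z onto S = {vectors with at most k nonzero entries}."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]


def proximal_distance(y, project, rho=1e3, iters=50):
    """Iterate x <- argmin_x 0.5*||x - y||^2 + (rho/2)*||x - P_S(x_n)||^2."""
    x = list(y)
    for _ in range(iters):
        p = project(x)  # P_S(x_n), the anchor of the majorizing surrogate
        x = [(yi + rho * pi) / (1.0 + rho) for yi, pi in zip(y, p)]
    return x


y = [3.0, -0.5, 2.0, 0.1]
x_hat = proximal_distance(y, lambda z: project_sparse(z, 2))
print(x_hat)
```

With a finite $\rho$ the iterate only approximately satisfies the constraint: entries outside the retained support are shrunk by a factor $1/(1+\rho)$ rather than set exactly to zero, which is why large tuning constants are used.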

Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent
勾配降下法による2層ReLUネットワークの学習は一致性を持たない

We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.

私たちは、Heら(2015)が提案し広く使われている方法などで初期化され、最小二乗損失に対する勾配降下法で訓練された2層(Leaky)ReLUネットワークが、普遍一致性を持たないことを証明します。具体的には、1次元データ生成分布の大きなクラスを記述し、それらに対しては高い確率で、勾配降下法が最適化ランドスケープの悪い局所最小値しか見つけられないことを示します。これは、バイアスをゼロという初期値から遠くへ動かすことができないためです。これらの場合、得られたネットワークは、ターゲット関数が非線形であっても、本質的に線形回帰を実行することがわかります。さらに、これが実用的な状況や一部の多次元分布でも起こること、および確率的勾配降下法が同様の振る舞いを示すことの数値的証拠を提供します。また、初期化とオプティマイザの選択がこの挙動にどのように影響するかについての経験的な結果も提供します。

Matrix Completion with Covariate Information and Informative Missingness
共変量情報と情報欠損による行列補完

We study the problem of matrix completion when the missingness of the matrix entries is dependent on the unobserved response values themselves and hence the missingness itself is informative. Furthermore, we allow to take into account the covariate information to establish its relation with the response and hence enable prediction. We devise a novel procedure to simultaneously complete the partially observed matrix and assess the covariate effect. Allowing the matrix dimensions as well as the number of covariates to grow ultra-high, under the classic low-rank matrix and sparse covariate effect assumptions, we rigorously establish the statistical guarantee of our procedure and the algorithmic convergence. The method is demonstrated via simulation studies and is used to analyze a Yelp data set and a MovieLens data set.

私たちは、行列エントリの欠損が観測されていない応答値自体に依存しているため、欠損自体が情報を提供する場合の行列補完の問題を研究します。さらに、共変量情報を考慮に入れて応答との関係を確立し、予測を可能にすることができます。私たちは、部分的に観察されたマトリックスを完成させると同時に共変量効果を評価するための新しい手順を考案します。行列の次元と共変量の数が非常に大きくなることを許容し、古典的な低ランク行列とスパース共変量効果の仮定の下で、手順の統計的保証とアルゴリズムの収束を厳密に確立します。この手法は、シミュレーション研究を通じて実証され、YelpデータセットとMovieLensデータセットの分析に使用されます。

KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints
KL-UCB-Switch:確率的バンディットに対する分布依存性と分布非依存性の両方の観点からの最適後悔限界

We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa \ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). Ménard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Cappé et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits.

私たちは、$K$アームの確率的バンディットを考え、時間$T$までの累積後悔境界を考察します。我々が関心を持っているのは、最適オーダー$\sqrt{KT}$の分布に依存しない後悔境界と、漸近的に最適な分布依存の後悔境界、つまりLaiとRobbins (1985)およびBurnetasとKatehakis (1996)による$\kappa \ln T$下限に一致する後悔境界を同時に達成する戦略です。ここで、$\kappa$は問題に依存する最適な定数です。この定数$\kappa$は、検討対象のモデル$\mathcal{D}$ (アーム上の可能な分布の族)に依存します。MénardとGarivier (2017)は、1次元指数型分布族によって与えられるモデルのパラメトリックケースでこのような双最適性を達成する戦略を提供し、一方Lattimore (2016, 2018)は、分散が$1$未満の(サブ)ガウス分布族に対してこれを行いました。我々はこの結果を、$[0,1]$上のすべての分布のノンパラメトリックケースに拡張します。これは、最適オーダー$\sqrt{KT}$の分布フリーの後悔境界を持つAudibertとBubeck (2009)によるMOSS戦略と、Cappéら(2013)によるKL-UCB戦略を組み合わせることによって行います。後者については、$[0,1]$上のすべての分布のモデルにおける最適な分布依存の$\kappa\ln T$後悔境界の最初の分析も併せて提供します。私たちは、証明(既知の後悔境界の証明、ひいては新たに行った分析の証明)を簡潔にするよう努めながら、このノンパラメトリックな双最適性の結果を得ました。したがって、本貢献の2つ目のメリットは、$K$アームの確率的バンディットのインデックスベース戦略に対する古典的な後悔境界の証明のレビューを提供することです。
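
To make the combination of the two index strategies more tangible, here is a minimal sketch under several assumptions that are not in the abstract: rewards are Bernoulli (so the Bernoulli KL divergence stands in for the nonparametric divergence needed for general distributions on $[0,1]$), both indices share the exploration term $\ln_+(T/(K n))$, and the rule for switching between them is a caller-supplied `threshold` on the pull count. Constants differ between presentations of MOSS and KL-UCB, so the exact forms below are illustrative.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def exploration_budget(n_pulls, horizon, n_arms):
    """ln_+(T / (K n)): positive part of the log term shared by both indices."""
    return max(math.log(horizon / (n_arms * n_pulls)), 0.0)


def moss_index(mean, n_pulls, horizon, n_arms):
    return mean + math.sqrt(exploration_budget(n_pulls, horizon, n_arms) / n_pulls)


def kl_ucb_index(mean, n_pulls, horizon, n_arms, iters=60):
    """Largest q >= mean with n * kl(mean, q) <= budget, found by bisection."""
    budget = exploration_budget(n_pulls, horizon, n_arms) / n_pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def switch_index(mean, n_pulls, horizon, n_arms, threshold):
    """Hypothetical switch: KL-UCB for rarely pulled arms, MOSS afterwards."""
    if n_pulls <= threshold:
        return kl_ucb_index(mean, n_pulls, horizon, n_arms)
    return moss_index(mean, n_pulls, horizon, n_arms)


mean, pulls, horizon, arms = 0.3, 10, 1000, 2
kl_idx = kl_ucb_index(mean, pulls, horizon, arms)
moss_idx = moss_index(mean, pulls, horizon, arms)
print(kl_idx, moss_idx)
```

A bandit loop would pull, at each round, the arm with the largest `switch_index`. By Pinsker's inequality the KL-based index in this sketch never exceeds the MOSS index built from the same exploration term.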

Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning over a Finite-Time Horizon
有限時間ホライゾン上のエピソード連続時間線形-二次強化学習に対する対数後悔

We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of magnitude $\mathcal{O}((\ln M)(\ln\ln M) )$, with $M$ being the number of learning episodes. The analysis consists of two components: perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.

私たちは、状態係数と制御係数の両方がコントローラーにとって未知であるエピソード設定において、有限時間ホライズンの連続時間線形二次強化学習問題を研究します。まず、連続時間の観測と制御に基づく最小二乗アルゴリズムを提案し、$M$を学習エピソード数として、大きさ$\mathcal{O}((\ln M)(\ln\ln M) )$の対数後悔上界を確立します。この解析は2つの要素で構成されています。1つは、関連するリッカチ微分方程式の正則性とロバスト性を利用する摂動解析であり、もう1つは、連続時間最小二乗推定量の劣指数的性質に依拠するパラメータ推定誤差の解析です。さらに、離散時間観測と区分定数制御に基づく実用的に実装可能な最小二乗アルゴリズムを提案します。このアルゴリズムは、使用する時間ステップ幅に明示的に依存する追加項を伴いつつ、同様の対数後悔を達成します。

Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms
ランダム化数値線形代数アルゴリズムのためのサンプリング推定量の漸近解析

The statistical analysis of Randomized Numerical Linear Algebra (RandNLA) algorithms within the past few years has mostly focused on their performance as point estimators. However, this is insufficient for conducting statistical inference, e.g., constructing confidence intervals and hypothesis testing, since the distribution of the estimator is lacking. In this article, we develop an asymptotic analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem. In particular, we derive the asymptotic distribution of a general sampling estimator with arbitrary sampling probabilities in a fixed design setting. The analysis is conducted in two complementary settings, i.e., when the objective of interest is to approximate the full sample estimator, and when it is to infer the underlying ground truth model parameters. For each setting, we show that the sampling estimator is asymptotically normally distributed under mild regularity conditions. Moreover, the sampling estimator is asymptotically unbiased in both settings. Based on our asymptotic analysis, we use two criteria, the Asymptotic Mean Squared Error (AMSE) and the Expected Asymptotic Mean Squared Error (EAMSE), to identify optimal sampling probabilities. Several of these optimal sampling probability distributions are new to the literature, e.g., the root leverage sampling estimator and the predictor length sampling estimator. Our theoretical results clarify the role of leverage in the sampling process, and our empirical results demonstrate improvements over existing methods.

過去数年間のランダム化数値線形代数(RandNLA)アルゴリズムの統計分析は、主に点推定量としてのパフォーマンスに焦点を当ててきました。しかし、推定量の分布が欠如しているため、信頼区間の構築や仮説検定などの統計的推論を行うには不十分です。この記事では、最小二乗問題に対するRandNLAサンプリング推定量の分布を導出するための漸近分析を開発します。特に、固定された設計設定で任意のサンプリング確率を持つ一般的なサンプリング推定量の漸近分布を導出します。分析は、2つの補完的な設定、つまり、関心のある目的が完全なサンプル推定量を近似することである場合と、基礎となるグラウンドトゥルースモデルパラメーターを推論する場合で実行されます。各設定について、サンプリング推定量が軽度の規則性条件下で漸近的に正規分布していることを示します。さらに、サンプリング推定量は、どちらの設定でも漸近的に偏りがありません。漸近分析に基づいて、漸近平均二乗誤差(AMSE)と期待漸近平均二乗誤差(EAMSE)という2つの基準を使用して、最適なサンプリング確率を特定します。これらの最適なサンプリング確率分布のいくつかは、たとえばルートレバレッジサンプリング推定量や予測子長さサンプリング推定量など、文献では新しいものです。理論的結果は、サンプリングプロセスにおけるレバレッジの役割を明らかにし、経験的結果は既存の方法よりも改善されていることを示しています。
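
A sampling estimator for least squares can be sketched in a few lines. The block below is an illustration under assumptions, not the authors' code: it uses a two-column design (intercept and one predictor) so the normal equations can be solved by hand, draws `r` rows with replacement, reweights each drawn row by $1/(r\pi_i)$, and takes "root leverage" to mean probabilities proportional to the square root of the leverage scores. All function names are hypothetical.

```python
import math
import random


def weighted_ls(rows, ys, weights):
    """Solve the 2x2 weighted normal equations (X^T W X) b = X^T W y."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (u, v), y, w in zip(rows, ys, weights):
        a11 += w * u * u
        a12 += w * u * v
        a22 += w * v * v
        b1 += w * u * y
        b2 += w * v * y
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def leverage_scores(rows):
    """Diagonal of the hat matrix X (X^T X)^{-1} X^T for a two-column X."""
    a11 = sum(u * u for u, _ in rows)
    a12 = sum(u * v for u, v in rows)
    a22 = sum(v * v for _, v in rows)
    det = a11 * a22 - a12 * a12
    return [(a22 * u * u - 2.0 * a12 * u * v + a11 * v * v) / det for u, v in rows]


def sampling_estimator(rows, ys, probs, r, rng):
    """Draw r rows with replacement and solve the reweighted subproblem."""
    idx = rng.choices(range(len(rows)), weights=probs, k=r)
    weights = [1.0 / (r * probs[i]) for i in idx]
    return weighted_ls([rows[i] for i in idx], [ys[i] for i in idx], weights)


rng = random.Random(0)
n = 200
xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
rows = [(1.0, x) for x in xs]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

h = leverage_scores(rows)
root = [math.sqrt(hi) for hi in h]
probs = [ri / sum(root) for ri in root]  # assumed root leverage probabilities

beta_full = weighted_ls(rows, ys, [1.0] * n)
beta_sub = sampling_estimator(rows, ys, probs, r=50, rng=rng)
print(beta_full, beta_sub)
```

Repeating the last step over many random draws gives an empirical distribution of `beta_sub` around `beta_full`, which is the kind of object whose asymptotic normality the abstract describes.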

Signature Moments to Characterize Laws of Stochastic Processes
確率過程の法則を特徴付けるシグネチャーモーメント

The sequence of moments of a vector-valued random variable can characterize its law. We study the analogous problem for path-valued random variables, that is stochastic processes, by using so-called robust signature moments. This allows us to derive a metric of maximum mean discrepancy type for laws of stochastic processes and study the topology it induces on the space of laws of stochastic processes. This metric can be kernelized using the signature kernel which allows to efficiently compute it. As an application, we provide a non-parametric two-sample hypothesis test for laws of stochastic processes.

ベクトル値確率変数のモーメントのシーケンスは、その法則を特徴付けることができます。パス値確率変数の類似の問題、つまり確率過程について、いわゆるロバストシグネチャモーメントを使用して研究します。これにより、確率過程の法則の最大平均不一致タイプの計量を導出し、それが確率過程の法則の空間に誘導するトポロジーを研究することができます。このメトリックは、シグネチャカーネルを使用してカーネル化できるため、効率的に計算できます。アプリケーションとして、確率過程の法則に対するノンパラメトリックな2サンプル仮説検定を提供します。

Improved Generalization Bounds for Adversarially Robust Learning
敵対的にロバストな学習のための汎化範囲の改善

We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner’s goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta}))\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.

私たちは、敵対的環境における堅牢な学習モデルを検討します。学習者は、テスト中に敵対者によって影響を受ける可能性のある破損にアクセスできる、破損していないトレーニングデータを取得します。学習者の目標は、将来の敵対的サンプルでテストされる堅牢な分類器を構築することです。敵対者は、入力ごとに$k$個の破損の可能性に制限されます。学習者と敵対者の相互作用をゼロサムゲームとしてモデル化します。このモデルは、Schmidtら(2018)、Madryら(2017)の敵対的サンプルモデルと密接に関連しています。主な結果は、バイナリ分類とマルチクラス分類、および実数値の場合(回帰)に対する汎化境界です。バイナリ分類設定では、Feigeら(2015)の汎化境界をより厳密にし、無限の仮説クラスも扱えるようにします。サンプル複雑度は、任意の$\alpha > 0$に対して、$O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$から$O\big(\frac{1}{\epsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta}))\big)$に改善されます。さらに、バイナリからマルチクラスおよび実数値のケースにアルゴリズムと汎化境界を拡張します。その過程で、関数クラス上の$k$重最大値のfat-shattering次元とRademacher複雑度に関する結果を得ます。これらはそれ自体で興味深い可能性があります。バイナリ分類の場合、Feigeら(2015)のアルゴリズムは、後悔最小化アルゴリズムとERMオラクルをブラックボックスとして使用します。これをマルチクラスおよび回帰設定に適応させます。このアルゴリズムは、与えられたトレーニングサンプル上で、両プレーヤーにとってほぼ最適な方策を提供します。

Training and Evaluation of Deep Policies Using Reinforcement Learning and Generative Models
強化学習と生成モデルを用いた深層方策の学習と評価

We present a data-efficient framework for solving sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models. The framework, called GenRL, trains deep policies by introducing an action latent variable such that the feed-forward policy search can be divided into two parts: (i) training a sub-policy that outputs a distribution over the action latent variable given a state of the system, and (ii) unsupervised training of a generative model that outputs a sequence of motor actions conditioned on the latent action variable. GenRL enables safe exploration and alleviates the data-inefficiency problem as it exploits prior knowledge about valid sequences of motor actions. Moreover, we provide a set of measures for evaluation of generative models such that we are able to predict the performance of the RL policy training prior to the actual training on a physical robot. We experimentally determine the characteristics of generative models that have most influence on the performance of the final policy training on two robotics tasks: shooting a hockey puck and throwing a basketball. Furthermore, we empirically demonstrate that GenRL is the only method which can safely and efficiently solve the robotics tasks compared to two state-of-the-art RL methods.

私たちは、強化学習(RL)と潜在変数生成モデルの組み合わせを活用した、順次意思決定問題を解決するためのデータ効率の高いフレームワークを紹介します。GenRLと呼ばれるこのフレームワークは、アクション潜在変数を導入することでディープポリシーをトレーニングします。これにより、フィードフォワードポリシー検索は、(i)システムの状態を与えられたアクション潜在変数上の分布を出力するサブポリシーのトレーニング、および(ii)潜在アクション変数を条件とする一連のモーターアクションを出力する生成モデルの教師なしトレーニングの2つの部分に分割できます。GenRLは、有効なモーターアクションのシーケンスに関する事前知識を活用するため、安全な探索が可能になり、データの非効率性の問題が軽減されます。さらに、生成モデルを評価するための一連の尺度も提供しているため、実際の物理ロボットでのトレーニングの前にRLポリシートレーニングのパフォーマンスを予測できます。私たちは、ホッケーのパックを撃つこととバスケットボールを投げることという2つのロボットタスクにおける最終的なポリシートレーニングのパフォーマンスに最も影響を与える生成モデルの特性を実験的に決定します。さらに、最先端の2つのRL手法と比較して、GenRLがロボットタスクを安全かつ効率的に解決できる唯一の方法であることを実証します。

Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training
バッチサイズの関数としての学習率:ニューラルネットワーク学習へのランダム行列理論アプローチ

We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the magnitudes of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive analytical expressions for the maximal learning rates as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms, such as Adam (square root scaling), for smooth, non-convex deep neural networks. Whilst the linear scaling for stochastic gradient descent has been derived under more restrictive conditions, which we generalise, the square root scaling rule for adaptive optimisers is, to our knowledge, completely novel. We validate our claims on the VGG/WideResNet architectures on the CIFAR-100 and ImageNet data sets. Based on our investigations of the sub-sampled Hessian, we develop a stochastic Lanczos quadrature-based, on-the-fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations for these key hyper-parameters and shows good preliminary results on the Pre-Residual Architecture for CIFAR-100. We further investigate the similarity between the Hessian spectrum of a multi-layer perceptron, trained on Gaussian mixture data, and that of deep neural networks trained on natural images. We find striking similarities, with both exhibiting rank degeneracy, a bulk spectrum and outliers to that spectrum. Furthermore, we show that ZCA whitening can remove such outliers early on in training before class separation occurs, but that outliers persist in later training.

私たちは、スパイクされたフィールド依存ランダム行列理論を用いて、ミニバッチ処理がディープニューラルネットワークの損失ランドスケープに与える影響を研究しました。バッチヘッセ行列の極値の大きさは、経験的ヘッセ行列の極値より大きいことを実証しました。また、ヘッセ行列の一般化ガウスニュートン行列近似についても同様の結果を導きました。我々の定理の結果として、バッチサイズの関数としての最大学習率の解析式を導き、滑らかな非凸ディープニューラルネットワークに対する確率的勾配降下法(線形スケーリング)とAdam (平方根スケーリング)などの適応アルゴリズムの両方の実用的なトレーニングレジメンに関する情報を提供します。確率的勾配降下法の線形スケーリングは、より制限された条件下で導き出されており、我々はそれを一般化しましたが、適応型オプティマイザーの平方根スケーリング規則は、我々の知る限り、まったく新しいものです。CIFAR-100およびImageNetデータセットで、VGG/WideResNetアーキテクチャに関する主張を検証しました。サブサンプリングされたヘッシアンの調査に基づいて、確率的Lanczos求積法に基づき学習率とモメンタムをオンザフライで学習する手法を開発しました。これにより、これらの重要なハイパーパラメータのコストのかかる多重評価の必要性が回避され、CIFAR-100のPre-Residualアーキテクチャに関する良好な予備結果が示されました。さらに、ガウス混合データでトレーニングされた多層パーセプトロンのヘッシアンスペクトルと、自然画像でトレーニングされたディープニューラルネットワークのヘッシアンスペクトルの類似性を調査しました。両方ともランクの退化、バルクスペクトル、およびそのスペクトルの外れ値を示すという驚くべき類似点が見つかりました。さらに、ZCAホワイトニングにより、クラス分離が発生する前のトレーニングの早い段階でこのような外れ値を削除できますが、外れ値は後のトレーニングでも残ることを示しています。
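The two scaling rules named in the abstract (linear for stochastic gradient descent, square root for adaptive methods such as Adam) can be illustrated with a small helper. This is only a sketch of the proportionality: the function name, its parameters, and the example values are hypothetical, and constant factors and the conditions under which the rules hold are left to the paper.

```python
import math


def scaled_learning_rate(base_lr: float, base_batch: int, new_batch: int, rule: str) -> float:
    """Rescale a learning rate when the batch size changes.

    rule="linear": learning rate proportional to the batch size (SGD-style).
    rule="sqrt":   learning rate proportional to the square root of the
                   batch size (Adam-style).
    Illustrative only; this is not code from the paper.
    """
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Hypothetical example: growing the batch from 128 to 512 (a factor of 4)
sgd_lr = scaled_learning_rate(0.1, 128, 512, "linear")   # 4x the base rate
adam_lr = scaled_learning_rate(1e-3, 128, 512, "sqrt")   # 2x the base rate
```

Under these assumptions, quadrupling the batch size quadruples the SGD learning rate but only doubles the Adam learning rate.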

Projection-free Distributed Online Learning with Sublinear Communication Complexity
サブリニア通信の複雑さによるプロジェクションフリーの分散型オンライン学習

To deal with complicated constraints via locally light computations in distributed online learning, a recent study has presented a projection-free algorithm called distributed online conditional gradient (D-OCG), and achieved an $O(T^{3/4})$ regret bound for convex losses, where $T$ is the number of total rounds. However, it requires $T$ communication rounds, and cannot utilize the strong convexity of losses. In this paper, we propose an improved variant of D-OCG, namely D-BOCG, which can attain the same $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication rounds for convex losses, and a better regret bound of $O(T^{2/3}(\log T)^{1/3})$ with fewer $O(T^{1/3}(\log T)^{2/3})$ communication rounds for strongly convex losses. The key idea is to adopt a delayed update mechanism that reduces the communication complexity, and redefine the surrogate loss function in D-OCG for exploiting the strong convexity. Furthermore, we provide lower bounds to demonstrate that the $O(\sqrt{T})$ communication rounds required by D-BOCG are optimal (in terms of $T$) for achieving the $O(T^{3/4})$ regret with convex losses, and the $O(T^{1/3}(\log T)^{2/3})$ communication rounds required by D-BOCG are near-optimal (in terms of $T$) for achieving the $O(T^{2/3}(\log T)^{1/3})$ regret with strongly convex losses up to polylogarithmic factors. Finally, to handle the more challenging bandit setting, in which only the loss value is available, we incorporate the classical one-point gradient estimator into D-BOCG, and obtain similar theoretical guarantees.

分散オンライン学習における局所的に軽量な計算を介して複雑な制約に対処するために、最近の研究では、分散オンライン条件付き勾配(D-OCG)と呼ばれる射影のないアルゴリズムが提示され、凸損失に対して$O(T^{3/4})$の後悔境界を達成しました。ここで、$T$は合計ラウンド数です。ただし、$T$回の通信ラウンドが必要であり、損失の強い凸性は利用できません。この論文では、D-OCGの改良版であるD-BOCGを提案します。これは、凸損失に対してわずか$O(\sqrt{T})$回の通信ラウンドで同じ$O(T^{3/4})$の後悔境界を達成し、強い凸損失に対してより少ない$O(T^{1/3}(\log T)^{2/3})$回の通信ラウンドでより良い$O(T^{2/3}(\log T)^{1/3})$の後悔境界を達成できます。鍵となるアイデアは、通信の複雑さを軽減する遅延更新メカニズムを採用し、強い凸性を利用するためにD-OCGの代理損失関数を再定義することです。さらに、D-BOCGで必要な$O(\sqrt{T})$通信ラウンドが凸損失で$O(T^{3/4})$リグレットを達成するために最適($T$の観点から)であること、およびD-BOCGで必要な$O(T^{1/3}(\log T)^{2/3})$通信ラウンドが多重対数因数までの強い凸損失で$O(T^{2/3}(\log T)^{1/3})$リグレットを達成するためにほぼ最適($T$の観点から)であることを示す下限を提供します。最後に、損失値のみが利用可能な、より困難なバンディット設定を処理するために、古典的な1点勾配推定量をD-BOCGに組み込み、同様の理論的保証を取得します。
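To get a feel for the communication savings, the orders of growth quoted in the abstract can be evaluated for a concrete horizon $T$. The sketch below drops all constants hidden by the $O(\cdot)$ notation, so the numbers are only indicative; the function name and dictionary keys are hypothetical.

```python
import math


def communication_rounds(T: int) -> dict:
    """Evaluate the growth orders from the abstract with constants dropped."""
    if T < 2:
        raise ValueError("T must be at least 2 so that log T > 0")
    return {
        # D-OCG communicates in every round
        "D-OCG": float(T),
        # D-BOCG with convex losses: order sqrt(T)
        "D-BOCG (convex)": math.sqrt(T),
        # D-BOCG with strongly convex losses: order T^(1/3) (log T)^(2/3)
        "D-BOCG (strongly convex)": T ** (1 / 3) * math.log(T) ** (2 / 3),
    }


for name, rounds in communication_rounds(10**6).items():
    print(f"{name}: about {rounds:,.0f} rounds (constants ignored)")
```

For $T = 10^6$ the convex case gives $\sqrt{T} = 1000$ rounds instead of $10^6$, and the strongly convex order is smaller still.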

Interlocking Backpropagation: Improving depthwise model-parallelism
インターロッキングバックプロパゲーション:深さ方向のモデル並列性の改善

The number of parameters in state of the art neural networks has drastically increased in recent years. This surge of interest in large scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism can suffer from poor resource utilisation, which leads to wasted resources. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation in the global setting and poor task performance in the local setting, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation, while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image classification ResNets and Transformer language models, finding that our strategy consistently out-performs local learning in terms of task performance, and out-performs global learning in training efficiency.

最先端のニューラルネットワークのパラメーターの数は、近年大幅に増加しています。大規模ニューラルネットワークへの関心の高まりにより、このようなモデルを可能にする新しい分散トレーニング戦略の開発が促進されています。そのような戦略の1つが、モデル並列分散トレーニングです。残念ながら、モデル並列ではリソースの利用率が低く、リソースの無駄が生じる可能性があります。この研究では、理想的なモデル並列最適化設定における最近の開発であるローカル学習を改善します。グローバル設定でのリソースの利用率の低さとローカル設定でのタスクパフォーマンスの低さに着目し、インターロックバックプロパゲーションと呼ばれる、ローカル学習とグローバル学習の中間戦略のクラスを導入します。これらの戦略は、ローカル最適化の計算効率の利点の多くを維持しながら、グローバル最適化によって達成されるタスクパフォーマンスの多くを回復します。画像分類ResNetとTransformer言語モデルの両方で戦略を評価したところ、タスクパフォーマンスの点ではローカル学習を一貫して上回り、トレーニング効率ではグローバル学習を上回ることがわかりました。

Scalable and Efficient Hypothesis Testing with Random Forests
ランダムフォレストによるスケーラブルで効率的な仮説検定

Throughout the last decade, random forests have established themselves as among the most accurate and popular supervised learning methods. While their black-box nature has made their mathematical analysis difficult, recent work has established important statistical properties like consistency and asymptotic normality by considering subsampling in lieu of bootstrapping. Though such results open the door to traditional inference procedures, all formal methods suggested thus far place severe restrictions on the testing framework and their computational overhead often precludes their practical scientific use. Here we propose a hypothesis test to formally assess feature significance, which uses permutation tests to circumvent computationally infeasible estimates of nuisance parameters. This test is intended to be analogous to the F-test for linear regression. We establish asymptotic validity of the test via exchangeability arguments and show that the test maintains high power with orders of magnitude fewer computations. Importantly, the procedure scales easily to big data settings where large training and testing sets may be employed, conducting statistically valid inference without the need to construct additional models. Simulations and applications to ecological data, where random forests have recently shown promise, are provided.

過去10年間、ランダムフォレストは最も正確で人気のある教師あり学習法の1つとして定着してきました。ブラックボックスの性質により数学的分析は困難でしたが、最近の研究では、ブートストラップの代わりにサブサンプリングを考慮することで、一致性や漸近正規性などの重要な統計特性を確立しました。このような結果は従来の推論手順への扉を開きますが、これまでに提案されたすべての形式的方法は検定の枠組みに厳しい制限を課し、計算オーバーヘッドにより実際の科学的使用が妨げられることがよくあります。ここでは、特徴の有意性を形式的に評価するための仮説検定を提案します。この検定では、計算上実行不可能な局外パラメータ(nuisance parameter)の推定を回避するために順列検定を使用します。この検定は、線形回帰のF検定に類似することを目的としています。交換可能性の議論を通じて検定の漸近的妥当性を確立し、検定が桁違いに少ない計算で高い検出力を維持することを示します。重要なことは、この手順が、大規模なトレーニングセットとテストセットを使用できるビッグデータ設定に簡単に拡張でき、追加のモデルを構築することなく統計的に有効な推論を実行できることです。ランダムフォレストが最近有望性を示している生態学的データへの応用とシミュレーションを示します。
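The general idea of a permutation test, which the abstract uses in place of estimating nuisance parameters, can be sketched with a generic one-sided two-sample test. This is not the paper's procedure: the inputs are hypothetical lists of numbers (for example, per-model test errors with a feature intact versus shuffled), and the statistic is simply a difference in means.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=999, seed=0):
    """One-sided permutation p-value for mean(group_b) > mean(group_a).

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose statistic is at least as large as the observed one,
    with the usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Hypothetical errors: intact feature (a) versus shuffled feature (b)
errors_intact = [1.00, 1.10, 0.90, 1.05, 0.95]
errors_shuffled = [2.00, 2.10, 1.90, 2.05, 1.95]
p = permutation_p_value(errors_intact, errors_shuffled)
```

A small p-value here would suggest that shuffling the feature genuinely worsens the error, which is the kind of evidence a feature-significance test looks for.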

D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data
D-GCCA: 多視点高次元データのための分解ベース一般化正準相関分析

Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis is to decompose each view’s data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). The D-GCCA rigorously defines the decomposition on the L2 space of random variables in contrast to the Euclidean dot product space used by most existing methods, thereby being able to provide the estimation consistency for the low-rank matrix recovery. Moreover, to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer from substantial loss of undetected common-source variation. Our D-GCCA takes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors for selecting the variables most influenced. Consistent estimators of our D-GCCA method are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.

現代の生物医学研究では、マルチビューデータ、つまり同じオブジェクトセットで測定された複数のタイプのデータを収集することがよくあります。高次元マルチビューデータ分析の一般的なモデルは、各ビューのデータマトリックスを、すべてのデータビューに共通する潜在的要因によって生成される低ランクの共通ソースマトリックス、各ビューに対応する低ランクの固有ソースマトリックス、および加法ノイズマトリックスに分解することです。私たちは、このモデルに対して、分解ベースの一般化正準相関分析(D-GCCA)と呼ばれる新しい分解方法を提案します。D-GCCAは、ほとんどの既存の方法で使用されているユークリッドドット積空間とは対照的に、ランダム変数のL2空間での分解を厳密に定義し、低ランクマトリックス回復の推定値の一貫性を提供できます。さらに、共通潜在的要因を適切に較正するために、固有潜在的要因に望ましい直交性制約を課します。しかし、既存の方法では、このような直交性が適切に考慮されていないため、検出されていない共通ソース変動が大幅に失われる可能性があります。D-GCCAは、正準変数間の共通成分と固有成分を分離することで、一般化正準相関分析よりも一歩進んでおり、主成分分析の観点から魅力的な解釈を享受しています。さらに、最も影響を受ける変数を選択するために、共通または固有潜在因子によって説明される信号分散の変数レベルの割合を使用することを提案します。D-GCCA法の一貫した推定量は、優れた有限サンプル数値パフォーマンスで確立されており、特に大規模データに対して効率的な計算につながる閉じた形式の表現を備えています。最先端の方法に対するD-GCCAの優位性は、シミュレーションと実際のデータ例でも裏付けられています。

A Worst Case Analysis of Calibrated Label Ranking Multi-label Classification Method
較正ラベルランキングマルチラベル分類法の最悪ケース分析

Most multi-label classification methods are evaluated on real datasets, which is good practice for comparing the performance of methods in the average scenario. Because of the large number of factors to consider, this empirical approach neither explains nor reveals the factors that affect performance. A reasonable way to understand some of the factors behind the performance of multi-label methods, independently of context, is to find mathematical proofs about them. In this paper, mathematical proofs are given for the multi-label method ranking by pairwise comparison and for its extension to classification, called calibrated label ranking, showing their performance in a worst-case scenario for five multi-label metrics. The pairwise approach adopted by ranking by pairwise comparison enables the algorithm to achieve the optimal performance on Spearman rank correlation. However, the findings presented in this paper clearly show that the same pairwise approach adopted by the algorithm is also a crucial factor contributing to very poor performance on other multi-label metrics.

ほとんどのマルチラベル分類法は実際のデータセットで評価されます。これは、平均的なシナリオで方法間のパフォーマンスを比較するための良い方法です。考慮すべき要素が多数あるため、この経験的アプローチでは、パフォーマンスに影響を与える要素を説明したり示したりすることはできません。コンテキストとは関係なく、マルチラベル法のパフォーマンスの要素のいくつかを理解する合理的な方法は、それらについて数学的な証明を見つけることです。この論文では、マルチラベル手法であるペアワイズ比較によるランキング(ranking by pairwise comparison)と、その分類向け拡張である較正ラベルランキング(calibrated label ranking)について数学的な証明を与え、5つのマルチラベルメトリックに対する最悪ケースでのパフォーマンスを示します。ペアワイズ比較によるランキングで採用されたペアワイズアプローチにより、アルゴリズムはスピアマン順位相関で最適なパフォーマンスを達成できます。ただし、この論文で提示された調査結果は、アルゴリズムで採用された同じペアワイズアプローチが、他のマルチラベルメトリックで非常に悪いパフォーマンスに寄与する重要な要因でもあることを明確に示しています。
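The voting scheme behind ranking by pairwise comparison, and the virtual label that calibrated label ranking adds to split the ranking into relevant and irrelevant labels, can be sketched as follows. The `prefers` callback is a hypothetical stand-in for the trained pairwise classifiers, and the toy scores exist only to make the example runnable.

```python
from itertools import combinations


def rank_by_pairwise_comparison(labels, prefers):
    """Rank labels by counting wins over all label pairs.

    prefers(a, b) returns True when the pairwise model votes for a over b.
    Ties in the vote count keep the original label order (sorted is stable).
    """
    votes = {label: 0 for label in labels}
    for a, b in combinations(labels, 2):
        winner = a if prefers(a, b) else b
        votes[winner] += 1
    ranking = sorted(labels, key=lambda label: -votes[label])
    return ranking, votes


def calibrated_label_ranking(labels, prefers, virtual="__virtual__"):
    """Return the labels ranked above an added virtual label."""
    ranking, _ = rank_by_pairwise_comparison(list(labels) + [virtual], prefers)
    cut = ranking.index(virtual)
    return ranking[:cut]


# Toy stand-in for pairwise classifiers: compare fixed scores
scores = {"a": 0.9, "b": 0.2, "c": 0.6, "__virtual__": 0.5}
prefers = lambda x, y: scores[x] > scores[y]
relevant = calibrated_label_ranking(["a", "b", "c"], prefers)  # ["a", "c"]
```

Every label takes part in a vote against every other label, so a prediction needs a number of comparisons that is quadratic in the number of labels.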

Unbiased estimators for random design regression
ランダム設計回帰の不偏推定量

In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for constructing such unbiased estimators in a number of practical settings. In one such setting, we let the input distribution be uniform over a large dataset of $n\gg d$ points. Here, we obtain the first unbiased least squares estimator that can be constructed in time nearly linear in the data size, resulting in strong guarantees for model averaging. We achieve these computational gains by introducing a new algorithmic technique, called distortion-free intermediate sampling, which is the first method to enable sampling from determinantal point processes in time polynomial in the sample size.

線形回帰では、小さなサンプルに基づいて、$d$次元の入力ポイントと実数値の応答にわたる分布の最適な線形最小二乗予測子を推定します。サンプルが入力分布からi.i.d.に抽出される標準ランダム設計分析では、そのサンプルの最小二乗解は最適値の自然な推定値と見なすことができます。残念ながら、この推定値は、ほとんどの場合、入力ポイントのランダム性から生じる望ましくないバイアスを招き、これがモデル平均化の大きなボトルネックとなります。この論文では、応答モデルに関係なく、最小二乗解が最適値の不偏推定値となるように、入力ポイントの非i.i.d.サンプルを抽出できることを示します。さらに、このサンプルは、以前に抽出されたi.i.d.サンプルに、ポイントが張られる体積の二乗で再スケールされた入力分布から構築された特定の決定的ポイントプロセスに従って共同で抽出された追加の$d$ポイントセットを追加することで、効率的に生成できます。これをきっかけに、我々はボリューム再スケールサンプリングを研究するための理論的枠組みを開発し、その過程でいくつかの新しい行列期待値恒等式を証明した。我々はそれらを用いて、任意の入力分布と$\epsilon>0$に対して、分布全体にわたる期待二乗損失が最適値の損失の$1+\epsilon$倍で制限される不偏推定量を構築できる$O(d\log d+ d/\epsilon)$ポイントからなるランダム設計が存在することを示す。私たちは、いくつかの実用的な設定でそのような不偏推定量を構築するための効率的なアルゴリズムを提供します。そのような設定の1つでは、入力分布を$n\gg d$ポイントの大規模なデータセットにわたって均一にします。ここでは、データサイズにほぼ線形の時間で構築できる最初の不偏最小二乗推定量を取得し、モデルの平均化が強力に保証されます。私たちは、歪みのない中間サンプリングと呼ばれる新しいアルゴリズム手法を導入することで、これらの計算上の利点を実現しました。これは、サンプルサイズの時間多項式における行列式点過程からのサンプリングを可能にする最初の方法です。

Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects
潜在的結果と因果効果の推定のための一般化限界と表現学習

Practitioners in diverse fields such as healthcare, economics and education are eager to apply machine learning to improve decision making. The cost and impracticality of performing experiments and a recent monumental increase in electronic record keeping has brought attention to the problem of evaluating decisions based on non-experimental observational data. This is the setting of this work. In particular, we study estimation of individual-level potential outcomes and causal effects—such as a single patient’s response to alternative medication—from recorded contexts, decisions and outcomes. We give generalization bounds on the error in estimated outcomes based on distributional distance measures between re-weighted samples of groups receiving different treatments. We provide conditions under which our bounds are tight and show how they relate to results for unsupervised domain adaptation. Led by our theoretical results, we devise algorithms which learn representations and weighting functions that minimize our bounds by regularizing the representation’s induced treatment group distance, and encourage sharing of information between treatment groups. Finally, an experimental evaluation on real and synthetic data shows the value of our proposed representation architecture and regularization scheme.

医療、経済、教育など、さまざまな分野の実務家は、意思決定を改善するために機械学習を適用したいと熱望しています。実験を行うコストと非実用性、および最近の電子記録の大幅な増加により、非実験的観察データに基づく決定の評価の問題が注目されています。これが本研究の背景です。特に、記録されたコンテキスト、決定、結果から、代替薬に対する単一の患者の反応など、個人レベルの潜在的な結果と因果効果の推定を研究します。異なる治療を受けているグループの再重み付けされたサンプル間の分布距離測定に基づいて、推定結果の誤差の一般化境界を示します。境界が厳密である条件を示し、それが教師なしドメイン適応の結果とどのように関係するかを示します。理論的結果に基づいて、表現によって誘導された治療グループ距離を正規化することで境界を最小化し、治療グループ間での情報共有を促進する表現と重み関数を学習するアルゴリズムを考案します。最後に、実際のデータと合成データに対する実験的評価により、提案された表現アーキテクチャと正規化スキームの価値が示されます。

Improved Classification Rates for Localized SVMs
ローカライズされた SVM の分類率の向上

Localized support vector machines solve SVMs on many spatially defined small chunks. Besides their computational benefit compared to global SVMs, one of their main characteristics is the freedom to choose an arbitrary kernel and regularization parameter on each cell. We take advantage of this observation to derive global learning rates for localized SVMs with Gaussian kernels and hinge loss. It turns out that, under suitable sets of assumptions, our rates outperform known classification rates for localized SVMs, for global SVMs, and for other learning algorithms based on, e.g., plug-in rules or trees. The localized SVM rates are achieved under a set of margin conditions, which describe the behavior of the data-generating distribution, and no assumption on the existence of a density is made. Moreover, we show that our rates are obtained adaptively, that is, without knowing the margin parameters in advance. The statistical analysis of the excess risk relies on a simple partitioning-based technique, which splits the input space into a subset that is close to the decision boundary and a subset that is sufficiently far away. A crucial condition for deriving the improved global rates is a margin condition that relates the distance to the decision boundary to the amount of noise.

ローカライズされたサポートベクターマシンは、空間的に定義された多数の小さなチャンクでSVMを解きます。グローバルSVMと比較した計算上の利点に加えて、各セルで任意のカーネルと正則化パラメータを自由に選択できるという主な特徴があります。この観察を利用して、ガウスカーネルとヒンジ損失を持つローカライズされたSVMのグローバル学習率を導きます。適切な一連の仮定の下では、ローカライズされたSVM、グローバルSVM、およびプラグインルールやツリーなどに基づくその他の学習アルゴリズムの既知の分類率よりも、私たちの学習率が優れていることがわかりました。ローカライズされたSVMの学習率は、データ生成分布の動作を記述するマージン条件のセットの下で達成され、密度の存在に関する仮定は行われません。さらに、私たちの学習率は適応的に、つまりマージンパラメータを事前に知らなくても得られることを示しています。過剰リスクの統計分析は、入力空間を決定境界に近いサブセットと十分に離れたサブセットに分割する、単純なパーティションベースの手法に依存しています。改善されたグローバルレートを導出するための重要な条件は、決定境界までの距離とノイズの量を関連付けるマージン条件です。

Solving L1-regularized SVMs and Related Linear Programs: Revisiting the Effectiveness of Column and Constraint Generation
L1正則化SVMと関連線形プログラムの解法:列生成と制約生成の有効性の再検討

The linear Support Vector Machine (SVM) is a classic classification technique in machine learning. Motivated by applications in high dimensional statistics, we consider penalized SVM problems involving the minimization of a hinge-loss function with a convex sparsity-inducing regularizer such as: the L1-norm on the coefficients, its grouped generalization and the sorted L1-penalty (aka Slope). Each problem can be expressed as a Linear Program (LP) and is computationally challenging when the number of features and/or samples is large—the current state of algorithms for these problems is rather nascent when compared to the usual L2-regularized linear SVM. To this end, we propose new computational algorithms for these LPs by bringing together techniques from (a) classical column (and constraint) generation methods and (b) first-order methods for non-smooth convex optimization—techniques that appear to be rarely used together for solving large scale LPs. These components have their respective strengths; and while they are found to be useful as separate entities, they appear to be more powerful in practice when used together in the context of solving large-scale LPs such as the ones studied herein. Our approach complements the strengths of (a) and (b)—leading to a scheme that seems to significantly outperform commercial solvers as well as specialized implementations for these problems. We present numerical results on a series of real and synthetic data sets demonstrating the surprising effectiveness of classic column/constraint generation methods in the context of challenging LP-based machine learning tasks.

線形サポートベクターマシン(SVM)は、機械学習における古典的な分類手法です。高次元統計への応用に着目し、係数のL1ノルム、グループ化された一般化、ソートされたL1ペナルティ(別名スロープ)などの凸スパース性誘導正則化子を使用したヒンジ損失関数の最小化を伴うペナルティ付きSVM問題を考察します。各問題は線形計画(LP)として表現でき、特徴やサンプルの数が多い場合は計算が困難です。これらの問題に対するアルゴリズムの現状は、通常のL2正則化線形SVMと比較するとかなり初期段階です。このため、(a)古典的な列(および制約)生成法と(b)非滑らかな凸最適化のための一次手法の手法を組み合わせて、これらのLP用の新しい計算アルゴリズムを提案します。これらの手法は、大規模なLPを解決するために一緒に使用されることはめったにないようです。これらのコンポーネントにはそれぞれ長所があります。これらは別々に使用しても有用であることが判明していますが、ここで研究されているような大規模なLPを解決するコンテキストで一緒に使用すると、実際にはより強力になるようです。私たちのアプローチは、(a)と(b)の長所を補完し、これらの問題に対する市販のソルバーや特殊な実装を大幅に上回るスキームにつながります。一連の実際のデータセットと合成データセットに関する数値結果を提示し、困難なLPベースの機械学習タスクのコンテキストで、従来の列/制約生成方法が驚くほど有効であることを示しています。

Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements
スケーリングとスケーラビリティ:不完全測定からの証明可能な非凸低ランクテンソル推定

Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across various fields in science and engineering. A fundamental task is to faithfully recover the tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems — tensor completion and tensor regression — as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in terms of memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.

テンソルは、多属性データと多方向相互作用を表す強力で柔軟なモデルを提供し、科学と工学のさまざまな分野にわたる現代のデータサイエンスで不可欠な役割を果たしています。基本的なタスクは、統計的かつ計算的に効率的な方法で、非常に不完全な測定からテンソルを忠実に回復することです。この論文では、タッカー分解におけるテンソルの低ランク構造を利用して、調整されたスペクトル初期化を使用してテンソル因子を直接回復するスケール勾配降下法(ScaledGD)アルゴリズムを開発し、サンプルサイズが他のパラメーターの依存性を無視して$n^{3/2}$のオーダーを超えるとすぐに、2つの標準的な問題(テンソル補完とテンソル回帰)に対して、グラウンドトゥルーステンソルの条件数に依存しない線形速度で収束することが証明されていることを示します。ここで、$n$はテンソルの次元です。これにより、悪条件に対する極度の敏感性、メモリと計算に関する反復あたりのコストの高さ、サンプルの複雑さの保証の悪さなど、少なくとも1つの欠点を抱える従来技術と比較して、低ランクテンソル推定に対する極めてスケーラブルなアプローチが可能になります。私たちが知る限り、ScaledGDは、タッカー分解による低ランクテンソル補完に対して、ほぼ最適な統計的および計算的複雑さを同時に実現する最初のアルゴリズムです。私たちのアルゴリズムは、非凸統計的推定を加速するための適切な前処理の威力を強調しており、反復によって変化する前処理により、低ランクテンソル因数分解の基礎となる対称性に関する軌跡の望ましい不変性特性が促進されます。

Explicit Convergence Rates of Greedy and Random Quasi-Newton Methods
貪欲法とランダム準ニュートン法の陽的収束率

Optimization is important in machine learning problems, and quasi-Newton methods have a reputation as the most efficient numerical methods for smooth unconstrained optimization. In this paper, we study the explicit superlinear convergence rates of quasi-Newton methods and address two open problems mentioned by Rodomanov and Nesterov (2021b). First, we extend Rodomanov and Nesterov (2021b)’s results to random quasi-Newton methods, which include the common DFP, BFGS, and SR1 methods. Such random methods employ a random direction for updating the approximate Hessian matrix in each iteration. Second, we focus on two specific quasi-Newton methods: SR1 and BFGS. We provide improved versions of greedy and random methods with provably better explicit (local) superlinear convergence rates. Our analysis is closely related to the approximation of a given Hessian matrix, unconstrained quadratic objective, as well as the general strongly convex, smooth, and strongly self-concordant functions.

機械学習の問題では最適化が重要であり、準ニュートン法は、制約のないスムーズな最適化のための最も効率的な数値手法として評価されています。この論文では、準ニュートン法の陽的超線形収束率を研究し、RodomanovとNesterov (2021b)が言及した2つの未解決の問題を取り上げます。まず、RodomanovとNesterov (2021b)の結果を、一般的なDFP、BFGS、SR1法を含むランダムな準ニュートン法に拡張します。このようなランダム法では、各反復で近似ヘッセ行列を更新するためにランダムな方向が使用されます。次に、特定の準ニュートン法であるSR1法とBFGS法に焦点を当てます。私たちは、証明可能なより優れた陽的な(局所)超線形収束率を備えた、貪欲法およびランダム法の改良版を提供します。私たちの分析は、特定のヘッセ行列、制約のない二次目的関数、および一般的な強凸関数、平滑関数、および強自己一致関数の近似と密接に関連しています。
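For orientation, the standard SR1 and BFGS updates of a Hessian approximation along a direction $u$ can be written as below. The notation is an assumption made here, not taken from the abstract: $A$ is the target Hessian, $G$ the current approximation, and $G_+$ the updated one. In a random method $u$ would be drawn at random; in a greedy method it would be chosen to maximize a progress measure.

$$
\text{SR1:}\quad G_+ = G - \frac{(G-A)\,u\,u^{\top}(G-A)}{u^{\top}(G-A)\,u},
\qquad
\text{BFGS:}\quad G_+ = G - \frac{G\,u\,u^{\top}G}{u^{\top}G\,u} + \frac{A\,u\,u^{\top}A}{u^{\top}A\,u}.
$$

Both updates satisfy $G_+ u = A u$, so the new approximation agrees with the target Hessian along the chosen direction.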

Topologically penalized regression on manifolds
多様体上のトポロジカルペナルティ付き回帰

We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, that are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both its prediction error and its smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is “topologically smooth”.

私たちは、コンパクト多様体M上の回帰問題を研究します。データの基礎となる幾何学とトポロジーを利用するために、回帰タスクは、多様体のラプラス・ベルトラミ演算子の最初のいくつかの固有関数に基づいて実行されます。これはトポロジカルペナルティで正則化されます。提案されたペナルティは、固有関数または推定関数のいずれかのサブレベルセットのトポロジーに基づいています。全体的なアプローチは、合成データセットと実際のデータセットの両方に対するさまざまなアプリケーションで有望で競争力のあるパフォーマンスをもたらすことが示されています。また、回帰関数の推定値について、その予測誤差と(トポロジカルな意味での)滑らかさの両方について、理論的な保証を提供します。まとめると、これらの結果は、ターゲット関数が「トポロジカルに滑らか」である場合のアプローチの関連性を裏付けています。

Fairness-Aware PAC Learning from Corrupted Data
破損データからの公平性を考慮したPAC学習

Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups frequencies in the large data limit.

機械学習モデルに関する公平性の懸念に対処することは、実世界の自動化システムでモデルを長期的に採用するための重要なステップです。データから公平なモデルをトレーニングするための多くのアプローチが開発されていますが、これらの方法がデータ破損に対してどの程度堅牢であるかはほとんどわかっていません。この研究では、最悪ケースのデータ操作の下での公平性を考慮した学習を検討します。状況によっては、サンプルサイズにかかわらず、また精度の低下を伴うかどうかにかかわらず、敵対者が任意の学習者に過度に偏った分類器を返させることができること、および保護グループがデータ内で過小に代表されている学習問題では過剰なバイアスの強さが増すことを示します。また、私たちの困難性の結果が定数因子を除いてタイトであることを証明します。この目的のために、精度と公平性の両方を最適化する2つの自然な学習アルゴリズムを研究し、これらのアルゴリズムが、大規模データの極限において、破損率と保護グループの頻度に関してオーダー最適な保証を持つことを示します。

Structure Learning for Directed Trees
有向木の構造学習

Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large, continuous, and nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu–Liu–Edmonds’ algorithm we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce two methods for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods.

システムの因果構造を知ることは、科学の多くの分野で基本的な関心事であり、システムへの操作の下でうまく機能する予測アルゴリズムの設計に役立ちます。因果構造は、特定の制約の下で観測分布から識別可能になります。データから構造を学習するために、スコアベースの方法は、適合の品質に応じてさまざまなグラフを評価します。ただし、大規模で連続的で非線形なモデルの場合、これらは真の因果構造を回復するという一般的な保証のないヒューリスティックな最適化アプローチに依存しています。この論文では、有向木の構造学習について検討します。Chu-Liu-Edmondsアルゴリズムに基づく高速でスケーラブルな方法を提案し、これを因果加法木(CAT)と呼びます。ガウス誤差の場合、識別可能性ギャップが消失する漸近的レジームでの一貫性を証明します。また、選択後および未識別設定で有効な漸近的なファミリーワイズエラー率制御を使用して、サブ構造仮説をテストするための2つの方法を紹介します。さらに、真の因果モデルが観測分布にどれだけ適合しているかを定量化する識別可能性ギャップを研究し、それが因果モデルのローカル特性によって下限が定められていることを証明します。シミュレーション研究では、競合する構造学習方法と比較して、CATの優れたパフォーマンスが実証されています。

ktrain: A Low-Code Library for Augmented Machine Learning
ktrain: 拡張機械学習のためのローコードライブラリ

We present ktrain, a low-code Python library that makes machine learning more accessible and easier to apply. As a wrapper to TensorFlow and many other libraries (e.g., transformers, scikit-learn, stellargraph), it is designed to make sophisticated, state-of-the-art machine learning models simple to build, train, inspect, and apply by both beginners and experienced practitioners. Featuring modules that support text data (e.g., text classification, sequence tagging, open-domain question-answering), vision data (e.g., image classification), graph data (e.g., node classification, link prediction), and tabular data, ktrain presents a simple unified interface enabling one to quickly solve a wide range of tasks in as little as three or four “commands” or lines of code.

私たちは、機械学習をよりアクセスしやすく、適用しやすくするローコードのPythonライブラリであるktrainを紹介します。TensorFlowや他の多くのライブラリ(transformer、scikit-learn、stellargraphなど)のラッパーとして、初心者から経験豊富な実務者まで、洗練された最先端の機械学習モデルを簡単に構築、トレーニング、検査、適用できるように設計されています。テキストデータ(テキスト分類、シーケンスタグ付け、オープンドメイン質問応答など)、ビジョンデータ(画像分類など)、グラフデータ(ノード分類、リンク予測など)、表形式データをサポートするモジュールを備えたktrainは、シンプルな統一インターフェースを提供し、わずか3〜4行の「コマンド」またはコード行で幅広いタスクを迅速に解決できます。

A universally consistent learning rule with a universally monotone error
普遍的に単調な誤差を持つ、普遍的に一貫した学習ルール

We present a universally consistent learning rule whose expected error is monotone non-increasing with the sample size under every data distribution. The question of existence of such rules was brought up in 1996 by Devroye, Györfi and Lugosi (who called them “smart”). Our rule is fully deterministic, a data-dependent partitioning rule constructed in an arbitrary domain (a standard Borel space) using a cyclic order. The central idea is to only partition at each step those cyclic intervals that exhibit a sufficient empirical diversity of labels, thus avoiding a region where the error function is convex.

私たちは、あらゆるデータ分布の下で、期待誤差がサンプルサイズに対して単調非増加となる、普遍的に一貫した学習ルールを提示します。そのようなルールが存在するかという問題は、1996年にDevroye、Györfi、Lugosi(彼らはそのようなルールを「smart」と呼びました)によって提起されました。私たちのルールは完全に決定論的であり、巡回順序を使用して任意の領域(標準ボレル空間)上に構築されたデータ依存の分割ルールです。中心的な考え方は、ラベルの経験的多様性が十分にある巡回区間のみを各ステップで分割し、それによって誤差関数が凸である領域を回避することです。

Statistical Rates of Convergence for Functional Partially Linear Support Vector Machines for Classification
分類のための関数型部分線形サポートベクターマシンの統計的収束率

In this paper, we consider the learning rate of support vector machines with both a functional predictor and a high-dimensional multivariate vectorial predictor. Similar to the literature on learning in reproducing kernel Hilbert spaces, a source condition and a capacity condition are used to characterize the convergence rate of the estimator. It is highly non-trivial to establish the possibly faster rate of the linear part. Using a key basic inequality comparing losses at two carefully constructed points, we establish the learning rate of the linear part which is the same as if the functional part is known. The proof relies on empirical processes and the Rademacher complexity bound in the semi-nonparametric setting as analytic tools, Young’s inequality for operators, as well as a novel “approximate convexity” assumption.

この論文では、関数予測子と高次元多変量ベクトル予測子の両方を持つサポートベクターマシンの学習率について考察します。再生核ヒルベルト空間における学習に関する文献と同様に、ソース条件と容量条件を使用して、推定器の収束率を特徴付けます。線形部分について、より速い可能性のある収束率を確立することは極めて非自明です。慎重に構築された2つの点での損失を比較する鍵となる基本不等式を使用して、関数部分が既知である場合と同じ線形部分の学習率を確立します。この証明は、解析ツールとしての経験過程とセミノンパラメトリック設定におけるラデマッハ複雑度の上界、演算子に対するヤングの不等式、および新しい「近似凸性」の仮定に依存しています。

Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks
過剰パラメータ化された線形モデルにおける主成分バイアスとその深層ニューラルネットワークにおけるその発現

Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model’s parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stopping and its connection to PCA, and why deep networks converge more slowly with random labels.

最近の研究によると、異なるアーキテクチャの畳み込みニューラルネットワークは、同じ順序で画像を分類することを学習します。この現象を理解するために、過剰パラメータ化された深層線形ネットワークモデルを再検討します。分析により、隠れ層が十分に広い場合、このモデルのパラメータの収束率は、対応する特異値によって決まる速度で、データのより大きな主成分の方向に沿って指数関数的に速くなります。この収束パターンを主成分バイアス(PCバイアス)と呼びます。経験的に、PCバイアスが線形ネットワークと非線形ネットワークの両方の学習順序を効率化する方法を示します。これは、学習の初期段階でより顕著になります。次に、結果を単純性バイアスと比較し、両方のバイアスが独立して見られ、学習順序にさまざまな方法で影響することを示します。最後に、PCバイアスが早期停止の利点とPCAとの関連性をどのように説明できるか、およびランダムラベルで深層ネットワークが収束するのがなぜ遅いのかについて説明します。

Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach
連続時間・連続空間における方策評価と時間差分学習:マルチンゲールアプローチ

We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a “martingale loss function”, whose solution is proved to be the best approximation of the true value function in the mean–square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the “martingale orthogonality conditions” with test functions. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero, and we provide the convergence rate. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.

私たちは、連続時間・連続空間における強化学習のための方策評価(PE)と、関連する時間差分(TD)法を研究するための統一フレームワークを提案します。PEは、ある過程のマルチンゲール条件を維持することと同等であることを示します。この観点から、平均二乗TD誤差はマルチンゲールの2次変分を近似するため、PEの適切な目的関数ではないことがわかります。PEアルゴリズムの設計にマルチンゲールによる特徴付けを使用する2つの方法を示します。最初の方法は「マルチンゲール損失関数」を最小化するもので、その解は平均二乗の意味で真の価値関数の最良近似であることが証明されます。この方法は、古典的な勾配モンテカルロアルゴリズムに解釈を与えます。2番目の方法は、テスト関数を伴う「マルチンゲール直交条件」と呼ばれる連立方程式に基づきます。これらの方程式をさまざまな方法で解くと、TD($\lambda$)、LSTD、GTDなどのさまざまな古典的なTDアルゴリズムが復元されます。テスト関数の選択によって、得られる解が真の価値関数をどのような意味で近似するかが決まります。さらに、メッシュサイズが0に近づくにつれて、収束する任意の時間離散化アルゴリズムが対応する連続時間アルゴリズムに収束することを証明し、その収束率を示します。数値実験と応用例を用いて、理論的な結果と対応するアルゴリズムを実証します。

Truncated Emphatic Temporal Difference Methods for Prediction and Control
予測と制御のための切り捨て強調時間差分法

Emphatic Temporal Difference (TD) methods are a class of off-policy Reinforcement Learning (RL) methods involving the use of followon traces. Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems. First, followon traces typically suffer from large variance, making them hard to use in practice. Second, though Yu (2015) confirms the asymptotic convergence of some emphatic TD methods for prediction problems, there is still no finite sample analysis for any emphatic TD method for prediction, much less control. In this paper, we address those two open problems simultaneously via using truncated followon traces in emphatic TD methods. Unlike the original followon traces, which depend on all previous history, truncated followon traces depend on only finite history, reducing variance and enabling the finite sample analysis of our proposed emphatic TD methods for both prediction and control.

強調時間差分(TD)手法は、フォローオントレースの使用を伴うオフポリシー強化学習(RL)手法のクラスです。オフポリシーRLの悪名高い「致命的な三つ組」(deadly triad)に対処する上で、強調TD法は理論的に成功していますが、まだ2つの未解決の問題があります。第一に、フォローオントレースは通常、分散が大きく、実際に使用するのが困難です。第二に、Yu (2015)は予測問題に対するいくつかの強調TD法の漸近収束を確認していますが、予測のための強調TD法の有限サンプル解析はまだなく、制御についてはなおさらです。この論文では、強調TD法で切り捨てフォローオントレースを使用することで、これら2つの未解決の問題に同時に対処します。過去のすべての履歴に依存する元のフォローオントレースとは異なり、切り捨てフォローオントレースは有限の履歴のみに依存するため、分散が減少し、予測と制御の両方について、提案する強調TD法の有限サンプル解析が可能になります。
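The contrast between full and truncated followon traces can be illustrated with a small sketch. It assumes the standard emphatic TD recursion $F_t = i + \gamma \rho_{t-1} F_{t-1}$ with a constant interest $i$. The truncation shown here (restarting the recursion from the interest value $n$ steps back) is one plausible reading of "depend on only finite history" and is not taken from the paper. The function names and the toy importance ratios are hypothetical.

```python
def followon_trace(rhos, gamma, interest=1.0):
    """Full followon trace: F_0 = interest, F_t = interest + gamma * rho_{t-1} * F_{t-1}.

    rhos[t - 1] is the importance sampling ratio rho_{t-1}.
    """
    f = interest
    traces = [f]
    for rho in rhos:
        f = interest + gamma * rho * f
        traces.append(f)
    return traces


def truncated_followon_trace(rhos, gamma, n, interest=1.0):
    """Truncated trace (assumed form): at each time t, rerun the same
    recursion over only the n most recent importance ratios."""
    traces = []
    for t in range(len(rhos) + 1):
        f = interest
        for rho in rhos[max(0, t - n):t]:
            f = interest + gamma * rho * f
        traces.append(f)
    return traces


if __name__ == "__main__":
    ratios = [2.0, 2.0, 2.0, 2.0]  # toy ratios chosen so the full trace grows quickly
    print(followon_trace(ratios, gamma=1.0))
    print(truncated_followon_trace(ratios, gamma=1.0, n=2))
```

With these toy ratios the full trace keeps growing, while the truncated trace stops growing once $t \ge n$, which is the variance-limiting effect the abstract refers to.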

Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning
自動カリキュラム学習による内発的動機付けの目標探索プロセス

Intrinsically motivated spontaneous exploration is a key enabler of autonomous developmental learning in human children. It enables the discovery of skill repertoires through autotelic learning, i.e. the self-generation, self-selection, self-ordering and self-experimentation of learning goals. We present an algorithmic approach called Intrinsically Motivated Goal Exploration Processes (IMGEP) to enable similar properties of autonomous learning in machines. The IMGEP architecture relies on several principles: 1) self-generation of goals, generalized as parameterized fitness functions; 2) selection of goals based on intrinsic rewards; 3) exploration with incremental goal-parameterized policy search and exploitation with a batch learning algorithm; 4) systematic reuse of information acquired when targeting a goal for improving towards other goals. We present a particularly efficient form of IMGEP, called AMB, that uses a population-based policy and an object-centered spatio-temporal modularity. We provide several implementations of this architecture and demonstrate their ability to automatically generate a learning curriculum within several experimental setups. One of these experiments includes a real humanoid robot exploring multiple spaces of goals with several hundred continuous dimensions and with distractors. While no particular target goal is provided to these autotelic agents, this curriculum allows the discovery of diverse skills that act as stepping stones for learning more complex skills, e.g. nested tool use.

内発的動機による自発的な探索は、人間の子供の自律的な発達学習を可能にする重要な要素です。これは、学習目標の自己生成、自己選択、自己順序付け、自己実験などのオートテリック学習を通じて、スキルレパートリーの発見を可能にします。私たちは、機械における自律学習の同様の特性を可能にするために、内発的動機による目標探索プロセス(IMGEP)と呼ばれるアルゴリズムアプローチを提示します。IMGEPアーキテクチャは、いくつかの原則に依存しています。1)パラメーター化された適応度関数として一般化された目標の自己生成、2)内発的報酬に基づく目標の選択、3)バッチ学習アルゴリズムによる増分目標パラメーター化ポリシー検索と活用による探索、4)目標をターゲットにする際に取得した情報を他の目標に向けて改善するために体系的に再利用すること。私たちは、集団ベースのポリシーとオブジェクト中心の時空間モジュール性を使用する、特に効率的な形式のIMGEPであるAMBを提示します。私たちはこのアーキテクチャの実装をいくつか提供し、いくつかの実験設定内で学習カリキュラムを自動的に生成する能力を実証しています。これらの実験の1つには、数百の連続次元と妨害要素を持つ複数の目標空間を探索する実際のヒューマノイドロボットが含まれています。これらのオートテリックエージェントには特定の目標は提供されませんが、このカリキュラムにより、ネストされたツールの使用など、より複雑なスキルを学習するための足がかりとなる多様なスキルを発見できます。

Universal Approximation of Functions on Sets
集合上の関数の普遍近似

Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model’s latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a naïve constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general.

集合の関数、すなわち順列不変関数のモデル化は、機械学習における長年の課題です。Deep Setsは、連続集合関数の万能近似器として知られている一般的な手法です。私たちはDeep Setsの理論分析を提供し、この万能近似特性は、モデルの潜在空間が十分に高次元である場合にのみ保証されることを示します。潜在空間が必要な次元より1次元でも低い場合、最悪ケース誤差で判断すると、Deep Setsが単純な定数ベースラインと同程度の性能しか発揮できない区分的アフィン関数が存在します。Deep Setsは、Janossyプーリングパラダイムの最も効率的な具現化と見なすことができます。私たちは、このパラダイムが現在普及している集合学習手法のほとんどを包含することを明らかにします。この関連性に基づいて、集合学習全般に対する私たちの結果の含意を議論し、Janossyプーリング一般の普遍性に関するいくつかの未解決の問題を特定します。
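The sum-decomposition form behind Deep Sets, $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$, can be sketched in a few lines. In the sketch below, the encoder `phi`, the decoder `rho`, and the target function (the population variance of a set of scalars) are hand-written, hypothetical stand-ins for learned networks. A three-dimensional latent space happens to suffice for this toy target; the sketch does not reproduce the paper's result about latent spaces that are too small.

```python
import math
from itertools import permutations


def phi(x):
    """Per-element encoder into a 3-dimensional latent space: (1, x, x^2)."""
    return (1.0, x, x * x)


def rho(z):
    """Decoder on the pooled latent vector: recovers the population variance."""
    count, total, total_sq = z
    mean = total / count
    return total_sq / count - mean * mean


def deep_sets(xs):
    """Sum-pool the encoded elements, then decode: rho(sum of phi(x))."""
    pooled = [0.0, 0.0, 0.0]
    for x in xs:
        for k, value in enumerate(phi(x)):
            pooled[k] += value
    return rho(pooled)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    outputs = [deep_sets(list(p)) for p in permutations(data)]
    # Sum pooling makes the output independent of element order.
    print(all(math.isclose(o, outputs[0]) for o in outputs))
```

Because the only interaction between elements is the sum, every reordering of the input produces the same pooled vector, which is what makes the model permutation-invariant by construction.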

EV-GAN: Simulation of extreme events with ReLU neural networks
EV-GAN:ReLUニューラルネットワークによる極端事象のシミュレーション

Feedforward neural networks based on Rectified linear units (ReLU) cannot efficiently approximate quantile functions which are not bounded, especially in the case of heavy-tailed distributions. We thus propose a new parametrization for the generator of a Generative adversarial network (GAN) adapted to this framework, based on extreme-value theory. An analysis of the uniform error between the extreme quantile and its GAN approximation is provided: we establish that the rate of convergence of the error is mainly driven by the second-order parameter of the data distribution. The above results are illustrated on simulated data and real financial data. It appears that our approach outperforms the classical GAN in a wide range of situations including high-dimensional and dependent data.

整流線形ユニット(ReLU)に基づくフィードフォワードニューラルネットワークは、特にヘビーテール分布の場合、有界でない分位点関数を効率的に近似できません。したがって、このフレームワークに適応した敵対的生成ネットワーク(GAN)の生成のための新しいパラメータ化を、極値理論に基づいて提案します。極限分位数とそのGAN近似との間の一様誤差の分析が提供されます:誤差の収束率は主にデータ分布の2次パラメータによって駆動されることを確立します。上記の結果は、シミュレーションデータと実際の財務データに示されています。私たちのアプローチは、高次元データや従属データを含む幅広い状況で従来のGANよりも優れているようです。

Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning
非平滑凸学習における高速ハイパーパラメータ選択のための陰的微分

Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yield sequences of Jacobians converging toward the exact Jacobian. Using implicit differentiation, we show it is possible to leverage the non-smoothness of the inner problem to speed up the computation. Finally, we provide a bound on the error made on the hypergradient when the inner optimization problem is solved approximately. Results on regression and classification problems reveal computational benefits for hyperparameter optimization, especially when multiple hyperparameters are required.

モデルの最適なハイパーパラメータを見つけることは、二レベル最適化問題として定式化でき、通常はゼロ次手法を使用して解かれます。この研究では、内部最適化問題が凸であるが滑らかでない場合の一次法を研究します。近接勾配降下法と近接座標降下法の順モード微分によって、正確なヤコビアンに収束するヤコビアンの列が得られることを示します。陰的微分を使用すると、内部問題の非平滑性を利用して計算を高速化できることを示します。最後に、内部最適化問題が近似的に解かれた場合にハイパーグラディエントに生じる誤差の上界を提供します。回帰問題と分類問題の結果から、特に複数のハイパーパラメータが必要な場合に、ハイパーパラメータ最適化における計算上の利点が明らかになります。
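The claim that forward-mode differentiation of proximal gradient descent yields Jacobians converging to the exact one can be checked by hand on a one-dimensional Lasso problem, $\min_\beta \tfrac{1}{2}(y - x\beta)^2 + \lambda|\beta|$. This is a minimal sketch and not the paper's implementation: the scalar setting, the step size of $0.5/x^2$, and all function names are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.| applied to z."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def prox_grad_with_jacobian(x, y, lam, n_iter=100):
    """Proximal gradient descent on the 1-D Lasso, propagating
    jac = d(beta)/d(lam) alongside the iterate (forward mode)."""
    step = 0.5 / (x * x)  # deliberately conservative so several iterations are needed
    beta, jac = 0.0, 0.0
    for _ in range(n_iter):
        z = beta - step * x * (x * beta - y)   # gradient step on the smooth part
        dz = (1.0 - step * x * x) * jac        # derivative of z with respect to lam
        thresh = step * lam
        beta = soft_threshold(z, thresh)
        if abs(z) > thresh:
            sign = 1.0 if z > 0 else -1.0
            jac = dz - step * sign             # chain rule through the soft-threshold
        else:
            jac = 0.0                          # coefficient is clamped at zero
    return beta, jac


def closed_form(x, y, lam):
    """Exact 1-D Lasso solution and its derivative with respect to lam."""
    c = x * y
    if abs(c) <= lam:
        return 0.0, 0.0
    sign = 1.0 if c > 0 else -1.0
    return (c - sign * lam) / (x * x), -sign / (x * x)


if __name__ == "__main__":
    print(prox_grad_with_jacobian(2.0, 3.0, 1.0))
    print(closed_form(2.0, 3.0, 1.0))
```

Once the iterate settles on a sign pattern, the Jacobian update only involves the active coefficient; in higher dimensions this restriction to the support is the kind of structure the abstract describes as leveraging non-smoothness.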

Online Nonnegative CP-dictionary Learning for Markovian Data
マルコフデータのためのオンライン非負CP辞書学習

Online Tensor Factorization (OTF) is a fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis for CP-decompositions. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways.

オンラインテンソル分解(OTF)は、ストリーミングマルチモーダルデータから低次元の解釈可能な特徴を学習するための基本的なツールです。最近、OTFのさまざまなアルゴリズム的および理論的側面が調査されていますが、非一貫性やスパース性の仮定なしに目的関数の定常点に収束する一般的な保証は、i.i.d.の場合でもまだ欠けています。この研究では、学習したCP基底の解釈可能性を誘導する非負性制約を含む一般的な制約の下で、テンソル値データの特定のストリームからCANDECOMP/PARAFAC (CP)基底を学習する新しいアルゴリズムを紹介します。データテンソルのシーケンスが基礎となるマルコフ連鎖によって生成されるという仮定の下で、このアルゴリズムが目的関数の定常点のセットにほぼ確実に収束することを証明します。この設定は、古典的なi.i.d.の場合だけでなく、独立サンプリングまたはMCMCサンプリングによって生成されたデータストリームを含む幅広いアプリケーションコンテキストをカバーします。私たちの結果は、CP分解のグローバル収束解析におけるOTFとオンライン行列因数分解の間のギャップを埋めるものです。実験的に、私たちのアルゴリズムは、合成データと実世界のデータの両方で非負テンソル因数分解タスクの標準アルゴリズムよりもはるかに速く収束することを示しています。また、画像、ビデオ、時系列データからのさまざまな例で私たちのアルゴリズムの有用性を実証し、テンソル構造を複数の方法で利用することで、同じテンソルデータから質的に異なるCP辞書を学習する方法を示します。

On the Robustness to Misspecification of α-posteriors and Their Variational Approximations
α事後変数の誤指定に対するロバスト性と変分近似について

$\alpha$-posteriors and their variational approximations distort standard posterior inference by downweighting the likelihood and introducing variational approximation errors. We show that such distortions, if tuned appropriately, reduce the Kullback–Leibler (KL) divergence from the true, but perhaps infeasible, posterior distribution when there is potential parametric model misspecification. To make this point, we derive a Bernstein–von Mises theorem showing convergence in total variation distance of $\alpha$-posteriors and their variational approximations to limiting Gaussian distributions. We use these limiting distributions to evaluate the KL divergence between true and reported posteriors. We show that the KL divergence is minimized by choosing $\alpha$ strictly smaller than one, assuming there is a vanishingly small probability of model misspecification. The optimized value of $\alpha$ becomes smaller as the misspecification becomes more severe. The optimized KL divergence increases logarithmically in the magnitude of misspecification and not linearly as with the usual posterior. Moreover, the optimized variational approximations of $\alpha$-posteriors can induce additional robustness to model misspecification beyond that obtained by optimally downweighting the likelihood.

$\alpha$-事後分布とその変分近似は、尤度の重み付けを下げ、変分近似誤差を導入することで、標準的な事後推論を歪めます。このような歪みは、適切に調整されていれば、潜在的なパラメトリックモデル誤指定がある場合に、真の、しかしおそらくは実現不可能な事後分布からのKullback–Leibler (KL)ダイバージェンスを減少させることを示します。この点を明らかにするために、$\alpha$-事後分布とその変分近似の極限ガウス分布への総変動距離の収束を示すBernstein–von Mises定理を導きます。これらの極限分布を使用して、真の事後分布と報告された事後分布間のKLダイバージェンスを評価します。モデルの誤指定の確率が極めて小さいと仮定すると、$\alpha$を1より厳密に小さく選択することでKLダイバージェンスが最小化されることを示します。誤指定が深刻になるにつれて、$\alpha$の最適化値は小さくなります。最適化されたKLダイバージェンスは、通常の事後分布のように線形ではなく、誤指定の大きさに応じて対数的に増加します。さらに、$\alpha$事後分布の最適化された変分近似により、尤度を最適に重み付けすることで得られるものを超えて、誤指定をモデル化するための追加の堅牢性をもたらすことができます。

Adversarial Robustness Guarantees for Gaussian Processes
ガウス過程に対する敵対的ロバスト性の保証

Gaussian processes (GPs) enable principled computation of model uncertainty, making them attractive for safety-critical applications. Such scenarios demand that GP decisions are not only accurate, but also robust to perturbations. In this paper we present a framework to analyse adversarial robustness of GPs, defined as invariance of the model’s decision to bounded perturbations. Given a compact subset of the input space $T\subseteq \mathbb{R}^d$, a point $x^*$ and a GP, we provide provable guarantees of adversarial robustness of the GP by computing lower and upper bounds on its prediction range in $T$. We develop a branch-and-bound scheme to refine the bounds and show, for any $\epsilon > 0$, that our algorithm is guaranteed to converge to values $\epsilon$-close to the actual values in finitely many iterations. The algorithm is anytime and can handle both regression and classification tasks, with analytical formulation for most kernels used in practice. We evaluate our methods on a collection of synthetic and standard benchmark data sets, including SPAM, MNIST and FashionMNIST. We study the effect of approximate inference techniques on robustness and demonstrate how our method can be used for interpretability. Our empirical results suggest that the adversarial robustness of GPs increases with accurate posterior estimation.

ガウス過程(GP)は、モデルの不確実性の原理的な計算を可能にするため、安全性が重要なアプリケーションにとって魅力的です。このようなシナリオでは、GPの決定が正確であるだけでなく、摂動に対して堅牢であることも求められます。この論文では、モデルの決定の制限された摂動に対する不変性として定義される、GPの敵対的堅牢性を分析するためのフレームワークを紹介します。入力空間のコンパクトなサブセット$T\subseteq \mathbb{R}^d$、点$x^*$、およびGPが与えられた場合、$T$での予測範囲の下限と上限を計算することにより、GPの敵対的堅牢性の証明可能な保証を提供します。境界を精緻化するための分岐限定スキームを開発し、任意の$\epsilon > 0$に対して、有限回の反復でアルゴリズムが実際の値に$\epsilon$近い値に収束することが保証されることを示します。このアルゴリズムはいつでも使用でき、回帰と分類の両方のタスクを処理できます。実際に使用されているほとんどのカーネルの解析的定式化が採用されています。SPAM、MNIST、FashionMNISTなどの合成データセットと標準ベンチマークデータセットのコレクションで、この手法を評価します。近似推論手法が堅牢性に与える影響を調査し、この手法を解釈可能性にどのように使用できるかを示します。実験結果から、GPの敵対的堅牢性は正確な事後推定によって向上することが示唆されています。

A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
強化学習におけるオフポリシー値推定のための一般化予測ベルマン誤差

Many reinforcement learning algorithms rely on value estimation; however, the most widely used algorithms, namely temporal difference algorithms, can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean squared projected Bellman error (MSPBE) and are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, the mean-squared Bellman error (MSBE), which naturally facilitate nonlinear approximation. In this work, we build on these insights and introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We show how this generalized objective unifies previous work and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective, and show that it is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.

多くの強化学習アルゴリズムは値推定に依存していますが、最も広く使用されているアルゴリズム、つまり時間差分アルゴリズムは、オフポリシーサンプリングと非線形関数近似の両方で発散する可能性があります。線形平均二乗投影ベルマン誤差(MSPBE)に基づくオフポリシー値推定用のアルゴリズムが多数開発されており、線形関数近似では健全です。これらの方法を非線形ケースに拡張することは、ほとんど成功していません。最近、異なる目的関数(平均二乗ベルマン誤差(MSBE))を近似するいくつかの方法が導入されました。これにより、非線形近似が自然に容易になります。この研究では、これらの洞察に基づいて、線形MSPBEを非線形設定に拡張する新しい一般化MSPBEを導入します。この一般化目的関数が以前の研究を統合し、一般化目的関数のソリューションの値誤差の新しい境界を取得する方法を示します。一般化目的関数を最小化するための、使いやすく健全なアルゴリズムを導出し、実行全体でより安定し、ハイパーパラメータの影響を受けにくく、ニューラルネットワーク関数近似による4つの制御ドメイン全体で良好なパフォーマンスを発揮することを示します。

A Momentumized, Adaptive, Dual Averaged Gradient Method
モメンタム化された適応型双対平均化勾配法

We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.

私たちは、AdaGrad適応勾配法ファミリーの新しい最適化手法であるMADGRADを紹介します。MADGRADは、視覚における分類タスクや画像間タスク、自然言語処理における反復モデルや双方向マスクモデルなど、複数の分野からの深層学習最適化問題で優れたパフォーマンスを発揮します。これらの各タスクについて、MADGRADは、適応法が通常パフォーマンスが低い問題でも、テストセットのパフォーマンスでSGDとADAMの両方と同等またはそれ以上のパフォーマンスを発揮します。

Reverse-mode differentiation in arbitrary tensor network format: with application to supervised learning
任意テンソルネットワーク形式での逆モード微分:教師あり学習への応用

This paper describes an efficient reverse-mode differentiation algorithm for contraction operations for arbitrary and unconventional tensor network topologies. The approach leverages the tensor contraction tree of Evenbly and Pfeifer (2014), which provides an instruction set for the contraction sequence of a network. We show that this tree can be efficiently leveraged for differentiation of a full tensor network contraction using a recursive scheme that exploits (1) the bilinear property of contraction and (2) the property that trees have a single path from root to leaves. While differentiation of tensor-tensor contraction is already possible in most automatic differentiation packages, we show that exploiting these two additional properties in the specific context of contraction sequences can improve efficiency. Following a description of the algorithm and computational complexity analysis, we investigate its utility for gradient-based supervised learning for low-rank function recovery and for fitting real-world unstructured datasets. We demonstrate improved performance over alternating least-squares optimization approaches and the capability to handle heterogeneous and arbitrary tensor network formats. When compared to alternating minimization algorithms, we find that the gradient-based approach requires a smaller oversampling ratio (number of samples compared to number of model parameters) for recovery. This increased efficiency extends to fitting unstructured data of varying dimensionality and when employing a variety of tensor network formats. Here, we show improved learning using the hierarchical Tucker method over the tensor-train in high-dimensional settings on a number of benchmark problems.

この論文では、任意の非従来型テンソルネットワークトポロジーの縮約操作のための効率的な逆モード微分化アルゴリズムについて説明します。このアプローチでは、ネットワークの縮約シーケンスの命令セットを提供するEvenblyとPfeifer (2014)のテンソル縮約ツリーを活用します。このツリーは、(1)縮約の双線形特性と(2)ツリーがルートからリーフまで単一のパスを持つという特性を利用する再帰スキームを使用して、完全なテンソルネットワーク縮約の微分化に効率的に活用できることを示します。テンソルテンソル縮約の微分化は、ほとんどの自動微分化パッケージですでに可能ですが、縮約シーケンスの特定のコンテキストでこれら2つの追加特性を利用すると、効率が向上することを示します。アルゴリズムの説明と計算の複雑さの分析に続いて、低ランク関数回復および実際の非構造化データセットのフィッティングのための勾配ベースの教師あり学習でのその有用性を調査します。交互最小二乗最適化アプローチよりもパフォーマンスが向上し、異種および任意のテンソルネットワーク形式を処理できることを実証します。交互最小化アルゴリズムと比較すると、勾配ベースのアプローチでは回復に必要なオーバーサンプリング比(サンプル数とモデルパラメータ数の比較)が小さくなることがわかります。この効率性の向上は、さまざまな次元の非構造化データのフィッティングや、さまざまなテンソルネットワーク形式の使用にも適用されます。ここでは、いくつかのベンチマーク問題で高次元設定のテンソルトレインに対して階層型Tucker法を使用して学習が改善されたことを示します。

A Perturbation-Based Kernel Approximation Framework
摂動ベースのカーネル近似フレームワーク

Kernel methods are powerful tools in various data analysis tasks. Yet, in many cases, their time and space complexity render them impractical for large datasets. Various kernel approximation methods have been proposed to overcome this issue, with the most prominent method being the Nyström method. In this paper, we derive a perturbation-based kernel approximation framework building upon results from classical perturbation theory. We provide an error analysis for this framework, and prove that in fact, it generalizes the Nyström method and several of its variants. Furthermore, we show that our framework gives rise to new kernel approximation schemes, that can be tuned to take advantage of the structure of the approximated kernel matrix. We support our theoretical results numerically and demonstrate the advantages of our approximation framework on both synthetic and real-world data.

カーネル法は、さまざまなデータ分析タスクにおいて強力なツールです。しかし、多くの場合、その時間計算量と空間計算量のために、大規模なデータセットには実用的ではありません。この問題を解決するために、さまざまなカーネル近似手法が提案されており、最も代表的な手法はNyström法です。この論文では、古典的な摂動理論の結果に基づいて、摂動ベースのカーネル近似フレームワークを導出します。このフレームワークの誤差解析を提供し、実際にNyström法とそのいくつかの変種を一般化することを証明します。さらに、私たちのフレームワークが、近似されるカーネル行列の構造を利用するように調整できる新しいカーネル近似スキームを生み出すことを示します。私たちは、理論的な結果を数値的に裏付け、合成データと実世界データの両方で近似フレームワークの利点を実証します。

The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks
相関することの重要性:複数のネットワークにわたる共同スペクトル推論における依存性の意味

Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation that can be a consequence of such joint embeddings. In this paper, we present a generalized omnibus embedding methodology and we provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures, and we describe how this omnibus embedding can itself induce correlation. This leads us to distinguish between inherent correlation (that is, the correlation that arises naturally in multisample network data) and induced correlation, which is an artifice of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and we prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative. By allowing for and deconstructing both forms of correlation, our methodology widens the scope of spectral techniques for network inference, with import in theory and practice.

複数のネットワークにおけるスペクトル推論は、グラフ統計の急速に発展しているサブフィールドです。最近の研究では、複数の独立したネットワークの共同または同時スペクトル埋め込みは、同じネットワークの個別のスペクトル分解よりも正確な推定値を提供できることが実証されています。このような推論手順は、通常、複数のネットワーク実現にわたる独立性の仮定に大きく依存しており、この場合でさえ、このような共同埋め込みの結果として生じる可能性のある誘導ネットワーク相関にはほとんど注意が払われていません。この論文では、一般化されたオムニバス埋め込み方法論を提示し、独立したネットワークと相関ネットワークの両方にわたるこの埋め込みの詳細な分析を提供します。後者は、このような手順の範囲を大幅に拡張します。また、このオムニバス埋め込み自体がどのように相関を誘導できるかについて説明します。これにより、固有の相関、つまりマルチサンプルネットワークデータで自然に発生する相関と、共同埋め込み方法論の策略である誘導相関を区別できます。一般化オムニバス埋め込み手順が柔軟かつ堅牢であることを示し、埋め込まれたポイントの一貫性と中心極限定理の両方を証明します。誘導相関と固有相関がネットワーク時系列データの推論にどのように影響するかを調べ、より一般的に相関するデータの有効サンプルサイズなどの古典的な質問のネットワーク類似物を提供します。さらに、適切に較正された一般化オムニバス埋め込みが、以前の埋め込み手順では識別できなかった実際の生物学的ネットワークの変化を検出できることを示し、固有相関と誘導相関の影響が微妙で変革的である可能性があることを確認します。両方の形式の相関を考慮し、分解することで、私たちの方法論はネットワーク推論のスペクトル手法の範囲を広げ、理論と実践に重要です。

Non-asymptotic and Accurate Learning of Nonlinear Dynamical Systems
非線形力学系の非漸近的かつ高精度な学習

We consider the problem of learning a nonlinear dynamical system governed by a nonlinear state equation $h_{t+1}=\phi(h_t,u_t;\theta)+w_t$. Here $\theta$ is the unknown system dynamics, $h_t$ is the state, $u_t$ is the input and $w_t$ is the additive noise vector. We study gradient based algorithms to learn the system dynamics $\theta$ from samples obtained from a single finite trajectory. If the system is run by a stabilizing input policy, then using a mixing-time argument we show that temporally-dependent samples can be approximated by i.i.d. samples. We then develop new guarantees for the uniform convergence of the gradient of the empirical loss induced by these i.i.d. samples. Unlike existing works, our bounds are noise sensitive which allows for learning the ground-truth dynamics with high accuracy and small sample complexity. When combined, our results facilitate efficient learning of a broader class of nonlinear dynamical systems as compared to the prior works. We specialize our guarantees to entrywise nonlinear activations and verify our theory in various numerical experiments.

私たちは、非線形状態方程式$h_{t+1}=\phi(h_t,u_t;\theta)+w_t$によって支配される非線形動的システムを学習する問題について考えます。ここで、$\theta$は未知のシステムダイナミクス、$h_t$は状態、$u_t$は入力、$w_t$は加法ノイズベクトルです。単一の有限軌道から取得したサンプルからシステムダイナミクス$\theta$を学習するための勾配ベースのアルゴリズムを研究します。システムが安定化入力ポリシーによって実行される場合、混合時間引数を使用して、時間依存サンプルをi.i.d.サンプルで近似できることを示します。次に、これらのi.i.d.サンプルによって誘発される経験的損失の勾配が均一に収束するための新しい保証を開発します。既存の研究とは異なり、私たちの境界はノイズに敏感であるため、高精度で小さなサンプル複雑度で真のダイナミクスを学習できます。これらを組み合わせることで、私たちの研究結果は、以前の研究と比較して、より広範なクラスの非線形動的システムの効率的な学習を可能にします。私たちは、エントリワイズ非線形活性化に対する保証を特化し、さまざまな数値実験で理論を検証します。

No Weighted-Regret Learning in Adversarial Bandits with Delays
遅延のある敵対的バンディットにおける重み付き後悔なし学習

Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have “no weighted-regret” in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.

プレイヤーが$T$ラウンドのうち各ラウンド$t$でアクションを選択し、$d_{t}$ラウンドの遅延後に発生したコストを観察するシナリオを考えます。コスト関数と遅延シーケンスは敵対者によって選択されます。非協力ゲームでは、プレイヤーが上記のシナリオで「加重後悔なし」のアルゴリズムを使用する場合、遅延が大きすぎるために線形後悔がある場合でも、期待される加重エルゴード分布のプレイは、粗い相関均衡の集合に収束することを示します。2人ゼロ和ゲームでは、プレイの加重エルゴード平均がナッシュ均衡の集合に収束するためには、加重後悔なしであれば十分であることを示します。$n$次元のFKMアルゴリズムは、$D=\sum_{t=1}^{T}d_{t}$および$T$が不明な場合でも、期待される後悔$O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$を達成し、$K$アームのEXP3アルゴリズムは期待される後悔$O\left(\sqrt{\log K\left(KT+D\right)}\right)$を達成することを証明します。これらの境界は、新しい倍増トリックを使用しており、軽い仮定の下で、$D$および$T$が既知の場合の後悔境界を証明可能に保持します。これらの境界を使用して、FKMおよびEXP3には、$d_{t}=O\left(t\log t\right)$の場合でも重み付き後悔がないことを示します。したがって、重み付き後悔のないアルゴリズムは、シミュレーションに大幅な遅延が伴う場合でも、バンディットフィードバックでのみシミュレートできる有限または凸状の未知のゲームのCCEを近似するために使用できます。
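The delayed-feedback setting in this abstract can be made concrete with a small sketch of EXP3 in which the cost of round $t$ only reaches the learner $d_t$ rounds later. This is a simplified illustration, not the paper's algorithm: it assumes $T$ and $D$ are known (so no doubling trick), takes a learning rate of order $\sqrt{\log K/(KT+D)}$ as suggested by the stated regret bound, and uses the convention that `delays[t] >= 1`, with a delay of 1 corresponding to the usual bandit setting. The function name `exp3_delayed` and its arguments are hypothetical.

```python
import math
import random


def exp3_delayed(costs, delays, eta, seed=0):
    """EXP3 where the cost of round t becomes usable at round t + delays[t].

    costs[t][k] is the cost in [0, 1] of arm k at round t.
    delays[t] >= 1 is the delay of round t (assumed convention: 1 = no extra delay).
    Returns the arms played and the final sampling distribution.
    """
    rng = random.Random(seed)
    T, K = len(costs), len(costs[0])
    cum_loss = [0.0] * K   # importance-weighted loss estimates received so far
    pending = {}           # arrival round -> list of (arm, cost, prob at play time)
    played = []

    def distribution():
        # Exponential weights, shifted by the minimum loss for numerical stability.
        m = min(cum_loss)
        w = [math.exp(-eta * (loss - m)) for loss in cum_loss]
        s = sum(w)
        return [x / s for x in w]

    for t in range(T):
        # Use every piece of feedback whose delay has elapsed by round t.
        for arm, cost, prob in pending.pop(t, []):
            cum_loss[arm] += cost / prob
        p = distribution()
        arm = rng.choices(range(K), weights=p)[0]
        played.append(arm)
        # Only the cost of the played arm is ever observed (bandit feedback).
        pending.setdefault(t + delays[t], []).append((arm, costs[t][arm], p[arm]))
    return played, distribution()
```

Each loss estimate is divided by the probability the arm had when it was played, not when the feedback arrives, which keeps the estimate unbiased despite the delay.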

Exact simulation of diffusion first exit times: algorithm acceleration
拡散の最初の出口時間の正確なシミュレーション:アルゴリズムの加速

In order to describe or estimate different quantities related to a specific random variable, it is of prime interest to numerically generate such a variate. In specific situations, the exact generation of random variables might be either momentarily unavailable or too expensive in terms of computation time. It therefore needs to be replaced by an approximation procedure. As was previously the case, the ambitious exact simulation of first exit times for diffusion processes was unreachable though it concerns many applications in different fields like mathematical finance, neuroscience or reliability. The usual way to describe first exit times was to use discretization schemes, that are of course approximation procedures. Recently, Herrmann and Zucca proposed a new algorithm, the so-called GDET-algorithm (General Diffusion Exit Time), which permits to simulate exactly the first exit time for one-dimensional diffusions. The only drawback of exact simulation methods using an acceptance-rejection sampling is their time consumption. In this paper the authors highlight an acceleration procedure for the GDET-algorithm based on a multi-armed bandit model. The efficiency of this acceleration is pointed out through numerical examples.

特定のランダム変数に関連するさまざまな量を記述または推定するためには、そのような変量を数値的に生成することが非常に重要です。特定の状況では、ランダム変数の正確な生成が一時的に利用できないか、計算時間の点でコストがかかりすぎる可能性があります。したがって、近似手順に置き換える必要があります。以前と同様に、拡散プロセスの最初の出口時間の野心的な正確なシミュレーションは、数理ファイナンス、神経科学、信頼性などのさまざまな分野の多くのアプリケーションに関係しているにもかかわらず、達成できませんでした。最初の出口時間を記述する通常の方法は、もちろん近似手順である離散化スキームを使用することでした。最近、HerrmannとZuccaは、1次元拡散の最初の出口時間を正確にシミュレートできる、いわゆるGDETアルゴリズム(General Diffusion Exit Time)という新しいアルゴリズムを提案しました。受け入れ拒否サンプリングを使用した正確なシミュレーション方法の唯一の欠点は、時間の消費です。この論文では、著者は、マルチアームバンディットモデルに基づくGDETアルゴリズムの加速手順に焦点を当てています。この加速の効率は数値例を通じて示されます。

On the Efficiency of Entropic Regularized Algorithms for Optimal Transport
最適輸送のためのエントロピー正則化アルゴリズムの効率について

We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as Greenkhorn, from $\tilde{O}(n^2\varepsilon^{-3})$ to $\tilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by Altschuler et al. (2017). Second, we propose a new algorithm, which we refer to as APDAMD and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm (Dvurechensky et al., 2018) with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\tilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\tilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\tilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a deterministic accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\tilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM) (Guminov et al., 2021) in terms of $n$. Finally, we conduct the experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.

私たちは、最大$n$個の原子を持つ2つの離散確率測度間の最適輸送(OT)問題を近似的に解くエントロピー正規化アルゴリズムについて、いくつかの新しい計算量結果を示します。まず、Greenkhornとして知られるSinkhornの貪欲な変種の計算量境界を$\tilde{O}(n^2\varepsilon^{-3})$から$\tilde{O}(n^2\varepsilon^{-2})$に改善します。特に、私たちの結果はSinkhornの最も優れた既知の計算量境界と一致し、Altschulerら(2017)が観察した行/列更新の点でGreenkhornが実際にはSinkhornを大幅に上回る理由を明らかにするのに役立ちます。次に、適応型主双対加速勾配降下法(APDAGD)アルゴリズム(Dvurechenskyら、2018年)を、事前指定されたミラーマッピング$\phi$で一般化する、APDAMDと呼ぶ新しいアルゴリズムを提案します。APDAMDが、$\tilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$の複雑性境界を達成することを証明します。ここで、$\delta>0$は$\phi$の正則性を表します。さらに、反例によって、APDAGDに対して以前に証明された複雑性境界$\tilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$は無効であることを示し、洗練された複雑性境界$\tilde{O}(n^{5/2}\varepsilon^{-1})$を示します。さらに、推定シーケンスを利用してSinkhornの決定論的高速化バリアントを開発し、$\tilde{O}(n^{7/3}\varepsilon^{-4/3})$の複雑性境界を証明します。このように、Sinkhornの高速化バリアントは、$1/\varepsilon$に関してはSinkhornやGreenkhornよりも優れており、$n$に関してはAPDAGDや高速交互最小化(AAM) (Guminovら、2021)よりも優れていることがわかります。最後に、合成データと実際のデータで実験を行い、数値結果からGreenkhorn、APDAMD、高速化Sinkhornの実際の効率が示されています。
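For readers unfamiliar with the entropic-regularized OT problem discussed above, the following is a minimal sketch of the plain Sinkhorn iteration, which alternately rescales the rows and columns of the Gibbs kernel $e^{-C/\eta}$ until the marginals match. It is the textbook baseline only, not Greenkhorn, APDAMD, or the accelerated variant analyzed in the paper. The function name `sinkhorn`, the parameter `reg` (the regularization strength), and the stopping rule are illustrative assumptions.

```python
import numpy as np


def sinkhorn(C, r, c, reg=0.5, n_iter=1000, tol=1e-9):
    """Approximate an entropic-regularized OT plan between marginals r and c.

    C   : (n, m) cost matrix
    r, c: probability vectors of length n and m (float arrays, strictly positive)
    reg : entropic regularization strength (larger = smoother plan)
    """
    K = np.exp(-C / reg)          # Gibbs kernel
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)           # rescale rows to match r
        v = c / (K.T @ u)         # rescale columns to match c
        # After the column update, only the row marginals can still be off.
        row_err = np.abs(u * (K @ v) - r).sum()
        if row_err < tol:
            break
    return u[:, None] * K * v[None, :]
```

Greenkhorn, as described in the abstract, differs in that it greedily updates a single row or column per step instead of all of them at once.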

Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization
非凸オーバーラップ核ノルム正則化による低ランクテンソル学習

Nonconvex regularization has been popularly used in low-rank matrix learning. However, extending it for low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for use with a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm can avoid expensive tensor folding/unfolding operations. A special “sparse plus low-rank” structure is maintained throughout the iterations, and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-Lojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world data sets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state-of-the-art.

非凸正則化は、低ランク行列学習で広く使用されています。しかし、低ランクテンソル学習に拡張すると、依然として計算コストが高くなります。この問題に対処するために、オーバーラップ核ノルム正則化の非凸拡張で使用するための効率的なソルバーを開発します。近似平均アルゴリズムに基づいて、提案されたアルゴリズムは、コストのかかるテンソルの折り畳み/展開操作を回避できます。反復処理中は特別な「スパースプラス低ランク」構造が維持され、個々の近似ステップの高速計算が可能になります。経験的収束は、適応モーメントの使用によってさらに改善されます。滑らかな損失と、Kurdyka-Lojasiewicz条件を満たす目的関数の臨界点に対する収束保証を提供します。最適化問題は非凸で滑らかではありませんが、その臨界点はテンソル補完問題に対して依然として良好な統計的パフォーマンスを発揮することを示します。さまざまな合成データセットと実世界のデータセットでの実験により、提案されたアルゴリズムは時間と空間の両方で効率的であり、既存の最先端技術よりも正確であることが示されました。

Recovery and Generalization in Over-Realized Dictionary Learning
過大実現辞書学習における回復と一般化

In over two decades of research, the field of dictionary learning has gathered a large collection of successful applications, and theoretical guarantees for model recovery are known only whenever optimization is carried out in the same model class as that of the underlying dictionary. This work characterizes the surprising phenomenon that dictionary recovery can be facilitated by searching over the space of larger over-realized models. This observation is general and independent of the specific dictionary learning algorithm used. We thoroughly demonstrate this observation in practice and provide an analysis of this phenomenon by tying recovery measures to generalization bounds. In particular, we show that model recovery can be upper-bounded by the empirical risk, a model-dependent quantity and the generalization gap, reflecting our empirical findings. We further show that an efficient and provably correct distillation approach can be employed to recover the correct atoms from the over-realized model. As a result, our meta-algorithm provides dictionary estimates with consistently better recovery of the ground-truth model.

20年以上にわたる研究で、辞書学習の分野では成功したアプリケーションの大規模なコレクションが集められており、モデル回復の理論的保証は、基礎となる辞書と同じモデルクラスで最適化が実行される場合にのみ知られています。この研究では、より大きな過剰実現モデルの空間を検索することで辞書回復が容易になるという驚くべき現象を特徴付けています。この観察は一般的なものであり、使用される特定の辞書学習アルゴリズムに依存しません。私たちはこの観察を実際に徹底的に実証し、回復尺度を一般化境界に結び付けてこの現象の分析を提供します。特に、モデル回復は、経験的リスク、モデル依存量、および一般化ギャップによって上限が定められる可能性があることを示し、これは私たちの経験的発見を反映しています。さらに、効率的で証明可能な正しい蒸留アプローチを使用して、過剰実現モデルから正しいアトムを回復できることも示しています。その結果、私たちのメタアルゴリズムは、グラウンドトゥルースモデルの回復が一貫して優れている辞書推定を提供します。

Transfer Learning in Information Criteria-based Feature Selection
情報基準に基づく特徴選択における転移学習

This paper investigates the effectiveness of transfer learning based on information criteria. We propose a procedure that combines transfer learning with Mallows’ Cp (TLCp) and prove that it outperforms the conventional Mallows’ Cp criterion in terms of accuracy and stability. Our theoretical results indicate that, for any sample size in the target domain, the proposed TLCp estimator performs better than the Cp estimator by the mean squared error (MSE) metric {in the case of orthogonal predictors}, provided that i) the dissimilarity between the tasks from source domain and target domain is small, and ii) the procedure parameters (complexity penalties) are tuned according to certain explicit rules. Moreover, we show that our transfer learning framework can be extended to other feature selection criteria, such as the Bayesian information criterion. By analyzing the solution of the orthogonalized Cp, we identify an estimator that asymptotically approximates the solution of the Cp criterion in the case of non-orthogonal predictors. Similar results are obtained for the non-orthogonal TLCp. Finally, simulation studies and applications with real data demonstrate the usefulness of the TLCp scheme.

この論文では、情報基準に基づく転移学習の有効性について調査します。転移学習とMallowsのCp (TLCp)を組み合わせた手順を提案し、精度と安定性の点で従来のMallowsのCp基準よりも優れていることを証明します。理論的な結果によると、i)ソースドメインとターゲットドメインのタスク間の相違が小さく、ii)手順パラメータ(複雑性ペナルティ)が特定の明示的なルールに従って調整されている場合、ターゲットドメインの任意のサンプルサイズで、提案されたTLCp推定量は平均二乗誤差(MSE)メトリックでCp推定量よりも優れています(直交予測子の場合)。さらに、転移学習フレームワークは、ベイズ情報基準などの他の特徴選択基準に拡張できることを示しています。直交化されたCpの解を分析することにより、非直交予測子の場合にCp基準の解を漸近的に近似する推定量を特定します。非直交TLCpについても同様の結果が得られます。最後に、シミュレーション研究と実際のデータを使用したアプリケーションにより、TLCpスキームの有用性が実証されます。

Manifold Coordinates with Physical Meaning
物理的意味を持つ多様体座標

Manifold embedding algorithms map high-dimensional data down to coordinates in a much lower-dimensional space. One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. The coordinates returned by the embedding algorithm are abstract, and finding their physical or domain-related meaning is not formalized and often left to domain experts. This paper studies the problem of recovering the meaning of the new low-dimensional representation in an automatic, principled fashion. We propose a method to explain embedding coordinates of a manifold as non-linear compositions of functions from a user-defined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data.

多様体埋め込みアルゴリズムは、高次元データをはるかに低次元空間の座標にマッピングします。次元削減の目的の1つは、データ多様体を記述する固有座標を見つけることです。埋め込みアルゴリズムによって返される座標は抽象的であり、その物理的またはドメインに関連する意味を見つけることは形式化されておらず、多くの場合、ドメインの専門家に任されています。この論文では、新しい低次元表現の意味を自動的かつ原則的な方法で回復する問題を研究します。多様体の埋め込み座標を、ユーザー定義辞書からの関数の非線形合成として説明する方法を提案します。この問題をスパース線形のGroup Lasso回復問題として設定し、十分な回復条件を見つけ、データでその有効性を実証できることを示します。

An Optimization-centric View on Bayes’ Rule: Reviewing and Generalizing Variational Inference
ベイズの法則の最適化中心の見方:変分推論のレビューと一般化

We advocate an optimization-centric view of Bayesian inference. Our inspiration is the representation of Bayes’ rule as infinite-dimensional optimization (Csiszar, 1975; Donsker and Varadhan, 1975; Zellner, 1988). Equipped with this perspective, we study Bayesian inference when one does not have access to (1) well-specified priors, (2) well-specified likelihoods, (3) infinite computing power. While these three assumptions underlie the standard Bayesian paradigm, they are typically inappropriate for modern Machine Learning applications. We propose addressing this through an optimization-centric generalization of Bayesian posteriors that we call the Rule of Three (RoT). The RoT can be justified axiomatically and recovers Bayesian, PAC-Bayesian and VI posteriors as special cases. While the RoT is primarily a conceptual and theoretical device, it also encompasses a novel sub-class of tractable posteriors which we call Generalized Variational Inference (GVI) posteriors. Just as the RoT, GVI posteriors are specified by three arguments: a loss, a divergence and a variational family. They also possess a number of desirable properties, including modularity, Frequentist consistency and an interpretation as approximate ELBO. We explore applications of GVI posteriors, and show that they can be used to improve robustness and posterior marginals on Bayesian Neural Networks and Deep Gaussian Processes.

私たちは、ベイズ推論の最適化中心の見方を提唱しています。我々のインスピレーションは、ベイズの定理を無限次元最適化として表現することです(Csiszar、1975年、DonskerとVaradhan、1975年、Zellner、1988年)。この観点を備え、(1)適切に指定された事前分布、(2)適切に指定された尤度、(3)無限の計算能力にアクセスできない場合のベイズ推論を研究します。これらの3つの仮定は標準的なベイズパラダイムの基礎となっていますが、現代の機械学習アプリケーションには通常不適切です。私たちは、ベイズ事後分布の最適化中心の一般化、つまりRule of Three (RoT)を通じてこの問題に対処することを提案します。RoTは公理的に正当化でき、ベイズ、PAC-ベイズ、およびVI事後分布を特殊なケースとして回復します。RoTは主に概念的かつ理論的なデバイスですが、一般化変分推論(GVI)事後分布と呼ばれる扱いやすい事後分布の新しいサブクラスも含んでいます。RoTと同様に、GVI事後分布は、損失、発散、変分族という3つの引数によって指定されます。また、モジュール性、頻度論的一貫性、近似ELBOとしての解釈など、いくつかの望ましい特性も備えています。GVI事後分布のアプリケーションを調査し、ベイジアンニューラルネットワークと深層ガウス過程の堅牢性と事後周辺を改善するために使用できることを示します。

Let’s Make Block Coordinate Descent Converge Faster: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence
ブロック座標降下法の収束を高速化しよう: 貪欲ルールの高速化、メッセージパッシング、アクティブセットの複雑性、超線形収束の高速化

Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper we explore all three of these building blocks and propose variations for each that can significantly improve the progress made by each BCD iteration. We (i) propose new greedy block-selection strategies that guarantee more progress per iteration than the Gauss-Southwell rule; (ii) explore practical issues like how to implement the new rules when using “variable” blocks; (iii) explore the use of message-passing to compute matrix or Newton updates efficiently on huge blocks for problems with sparse dependencies between variables; and (iv) consider optimal active manifold identification, which leads to bounds on the “active-set complexity” of BCD methods and leads to superlinear convergence for certain problems with sparse solutions (and in some cases finite termination at an optimal solution). We support all of our findings with numerical results for the classic machine learning problems of least squares, logistic regression, multi-class logistic regression, label propagation, and L1-regularization.

ブロック座標降下法(BCD)は、反復コストが安く、メモリ要件が低く、並列化が容易で、問題の構造を活用できるため、大規模な数値最適化に広く使用されています。BCD法のパフォーマンスに影響を与える主なアルゴリズムの選択は、ブロック分割戦略、ブロック選択ルール、ブロック更新ルールの3つです。この論文では、これら3つの構成要素すべてを検討し、各BCD反復による進捗を大幅に改善できるバリエーションをそれぞれ提案します。(i)反復ごとにガウスサウスウェルルールよりも多くの進捗を保証する新しい貪欲ブロック選択戦略を提案します。(ii)「変数」ブロックを使用する場合に新しいルールを実装する方法などの実用的な問題を検討します。(iii)変数間の依存関係がまばらな問題に対して、巨大なブロックでマトリックスまたはニュートン更新を効率的に計算するためのメッセージパッシングの使用を検討します。(iv)最適なアクティブ多様体識別を考慮します。これにより、BCD法の「アクティブセット複雑度」の上限が導き出され、スパースソリューション(場合によっては最適ソリューションでの有限終了)を伴う特定の問題に対する超線形収束が実現します。最小二乗法、ロジスティック回帰、多クラスロジスティック回帰、ラベル伝播、L1正則化といった古典的な機械学習の問題に対する数値結果で、すべての発見を裏付けています。
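The three choices named in the abstract (block partitioning, block selection, block update) can be seen in a small sketch for least squares. The sketch below uses the classic Gauss-Southwell rule as the baseline selection rule, picking the coordinates with the largest gradient magnitude as a "variable" block, and an exact Newton-type update on that block. It does not implement the paper's improved greedy rules or message passing. The function name and parameters are hypothetical.

```python
import numpy as np


def greedy_bcd_least_squares(A, b, block_size=2, n_iter=5000, tol=1e-10):
    """Minimize 0.5 * ||A x - b||^2 by greedy block coordinate descent."""
    n = A.shape[1]
    x = np.zeros(n)
    H = A.T @ A                   # Hessian of the quadratic objective
    g = -A.T @ b                  # gradient at x = 0
    for _ in range(n_iter):
        if np.linalg.norm(g, np.inf) < tol:
            break
        # Gauss-Southwell selection: coordinates with the largest |gradient|.
        block = np.argsort(-np.abs(g))[:block_size]
        # Newton update restricted to the block (exact block minimization here).
        step = np.linalg.solve(H[np.ix_(block, block)], g[block])
        x[block] -= step
        # Cheap gradient maintenance: only the block's columns of H are needed.
        g -= H[:, block] @ step
    return x
```

Because only `block_size` columns of `H` are touched per iteration, the per-iteration cost stays low, which is the property the abstract cites as the main appeal of BCD.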

Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks
ワイド ReLU ネットワークの区分線形解の平均場解析

Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via noisy-SGD for a univariate regularized regression problem. Our main result is that SGD with vanishingly small noise injected in the gradients is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of “knot” points — i.e., points where the tangent of the ReLU network estimator changes — between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the “knot” points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.

確率的勾配降下法(SGD)でトレーニングされたニューラルネットワークの特性を理解することは、ディープラーニング理論の核心です。この研究では、平均場の観点から、一変量正規化回帰問題に対してノイズSGDでトレーニングされた2層ReLUネットワークを検討します。主な結果は、勾配に挿入されたノイズが極めて小さいSGDは、単純なソリューションに偏っていることです。つまり、収束時に、ReLUネットワークは入力の区分線形マップを実装し、2つの連続するトレーニング入力間の「結び目」ポイント(つまり、ReLUネットワーク推定値の接線が変化するポイント)の数は最大3です。特に、ネットワークのニューロンの数が増えると、SGDダイナミクスは勾配フローのソリューションによって捕捉され、収束時に、重みの分布は、ギブス形式を持つ関連する自由エネルギーの一意の最小化に近づきます。私たちの主な技術的貢献は、この最小化器から得られる推定値の分析にあります。私たちは、その2次導関数が、「結び目」ポイントを表す特定の場所を除いて、どこでも消えることを示しています。また、私たちの理論で予測されているように、データポイントとは異なる場所に結び目が発生する可能性があるという経験的証拠も提供しています。

On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) using Mean Field Control (MFC)
平均場制御(MFC)を用いた協調異種マルチエージェント強化学習(MARL)の近似について

Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given as $e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$, $e_2=\mathcal{O}(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\sum_{k}\frac{1}{\sqrt{N_k}})$ and $e_3=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|,|\mathcal{U}|$ are the sizes of state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively.

平均場制御（MFC）は、協力型マルチエージェント強化学習（MARL）問題の次元の呪いを軽減する効果的な方法です。この研究では、$k$番目のクラスに$N_k$個の同質エージェントが含まれるように$K$クラスに分離できる$N_{\mathrm{pop}}$個の異種エージェントのコレクションを検討します。私たちの目的は、この異種システムに対するMARL問題の近似保証を、対応するMFC問題によって証明することです。すべてのエージェントの報酬と遷移ダイナミクスがそれぞれ、$(1)$すべてのクラスにわたる結合状態とアクション分布、$(2)$各クラスの個別分布、および$(3)$全体の集団の周辺分布の関数であるとする3つのシナリオを検討します。これらの場合、$K$クラスのMARL問題は、それぞれ次の誤差でMFCによって近似できることを示します。$e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$、$e_2=\mathcal{O}(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\sum_{k}\frac{1}{\sqrt{N_k}})$、および$e_3=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$。ここで、$A, B$は定数、$|\mathcal{X}|, |\mathcal{U}|$は各エージェントの状態空間と行動空間のサイズです。最後に、上記の3つのケースで、それぞれサンプル複雑度$\mathcal{O}(e_j^{-3})$、$j\in\{1,2,3\}$で、$\mathcal{O}(e_j)$誤差以内の最適なMARLポリシーに収束できる自然ポリシー勾配(NPG)ベースのアルゴリズムを設計します。

Power Iteration for Tensor PCA
テンソル PCA の累乗反復

In this paper, we study the power iteration algorithm for the asymmetric spiked tensor model, as introduced in Richard and Montanari (2014). We give necessary and sufficient conditions for the convergence of the power iteration algorithm. When the power iteration algorithm converges, for the rank one spiked tensor model, we show the estimators for the spike strength and linear functionals of the signal are asymptotically Gaussian; for the multi-rank spiked tensor model, we show the estimators are asymptotically mixtures of Gaussian. This new phenomenon is different from the spiked matrix model. Using these asymptotic results of our estimators, we construct valid and efficient confidence intervals for spike strengths and linear functionals of the signals.

この論文では、Richard and Montanari (2014)で紹介された非対称スパイクテンソルモデルのパワー反復アルゴリズムについて研究します。パワー反復アルゴリズムの収束に対する必要十分条件を与えます。パワー反復アルゴリズムが収束すると、ランク1のスパイクテンソルモデルについて、信号のスパイク強度と線形汎関数の推定量が漸近的にガウス分布であることを示します。マルチランクスパイクテンソルモデルでは、推定量が漸近的にガウス分布の混合であることを示します。この新しい現象は、スパイクマトリックスモデルとは異なります。推定量のこれらの漸近結果を使用して、信号のスパイク強度と線形汎関数の有効で効率的な信頼区間を構築します。
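As a rough illustration of the power iteration described above, the sketch below runs it on an order-3 rank-one spiked tensor $X=\beta\,u\otimes v\otimes w+Z$. Several choices are assumptions made for the example and are not taken from the paper: the noise $Z$ has i.i.d. standard normal entries, the iteration is warm-started from the leading singular vectors of the unfoldings, and the iteration count is fixed. The function name `tensor_power_iteration` is hypothetical.

```python
import numpy as np


def tensor_power_iteration(X, n_iter=50):
    """Estimate (beta, u, v, w) for X ~ beta * u (x) v (x) w + noise."""
    n1, n2, n3 = X.shape
    # Warm start (an assumption of this sketch): top left singular vectors
    # of the mode-2 and mode-3 unfoldings of X.
    v = np.linalg.svd(X.transpose(1, 0, 2).reshape(n2, -1),
                      full_matrices=False)[0][:, 0]
    w = np.linalg.svd(X.transpose(2, 0, 1).reshape(n3, -1),
                      full_matrices=False)[0][:, 0]
    for _ in range(n_iter):
        # Contract X against the other two factors, then normalize.
        u = np.einsum("ijk,j,k->i", X, v, w)
        u /= np.linalg.norm(u)
        v = np.einsum("ijk,i,k->j", X, u, w)
        v /= np.linalg.norm(v)
        w = np.einsum("ijk,i,j->k", X, u, v)
        w /= np.linalg.norm(w)
    # Spike strength estimate: the tensor evaluated at the estimated factors.
    beta_hat = float(np.einsum("ijk,i,j,k->", X, u, v, w))
    return beta_hat, u, v, w
```

The factors are only identified up to sign flips whose product is positive, so comparisons with a planted signal should use the absolute value of the inner product.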

Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Matérn Correlations
カーネルパケット: マテルン相関を用いたガウス過程回帰の正確でスケーラブルなアルゴリズム

We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Matérn correlations whose smoothness parameter $\nu$ is a half-integer. The proposed algorithm only requires $\mathcal{O}(\nu^3 n)$ operations and $\mathcal{O}(\nu n)$ storage. This leads to a linear-cost solver since $\nu$ is chosen to be fixed and usually very small in most applications. The proposed method can be applied to multi-dimensional problems if a full grid or a sparse grid design is used. The proposed method is based on a novel theory for Matérn correlation functions. We find that a suitable rearrangement of these correlation functions can produce a compactly supported function, called a “kernel packet”. Using a set of kernel packets as basis functions leads to a sparse representation of the covariance matrix that results in the proposed algorithm. Simulation studies show that the proposed algorithm, when applicable, is significantly superior to the existing alternatives in both the computational time and predictive accuracy.

私たちは、滑らかさのパラメータ$\nu$が半整数であるMatérn相関を持つ1次元ガウス過程回帰の正確でスケーラブルなアルゴリズムを開発しました。提案されたアルゴリズムは、$\mathcal{O}(\nu^3 n)$の演算と$\mathcal{O}(\nu n)$のストレージのみを必要とします。$\nu$は固定値として選択され、ほとんどのアプリケーションでは通常非常に小さいため、線形コストのソルバーになります。提案された方法は、フルグリッドまたはスパースグリッド設計が使用されている場合、多次元問題に適用できます。提案された方法は、Matérn相関関数の新しい理論に基づいています。これらの相関関数を適切に再配置すると、「カーネルパケット」と呼ばれるコンパクトにサポートされた関数が生成されることが分かりました。カーネルパケットのセットを基底関数として使用すると、共分散行列のスパース表現が得られ、提案されたアルゴリズムになります。シミュレーション研究は、提案されたアルゴリズムが適用可能な場合、計算時間と予測精度の両方で既存の代替手段よりも大幅に優れていることを示しています。
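The idea of a compactly supported combination of Matérn kernels can be checked by hand in the simplest case $\nu=1/2$, where the correlation is $k(x,y)=e^{-|x-y|}$ (unit scale assumed here). For three knots $x_1<x_2<x_3$, coefficients $a_j$ satisfying $\sum_j a_j e^{x_j}=0$ and $\sum_j a_j e^{-x_j}=0$ make $\sum_j a_j k(x,x_j)$ vanish for every $x$ outside $[x_1,x_3]$. The sketch below is a direct verification of that elementary fact, not the paper's general construction, and the function names are hypothetical.

```python
import numpy as np


def exp_kernel(x, y):
    """Matern-1/2 (exponential) correlation with unit scale (assumed)."""
    return np.exp(-np.abs(x - y))


def kernel_packet_coeffs(knots):
    """Coefficients a with sum a_j e^{x_j} = 0 and sum a_j e^{-x_j} = 0.

    For three distinct knots the two constraint rows span a plane in R^3,
    so their cross product gives the (one-dimensional) null space.
    """
    knots = np.asarray(knots, dtype=float)
    a = np.cross(np.exp(knots), np.exp(-knots))
    return a / np.abs(a).max()


def kernel_packet(x, knots, coeffs):
    """Evaluate the combination sum_j a_j k(x, x_j) at the points x."""
    x = np.asarray(x, dtype=float)
    return sum(a * exp_kernel(x, t) for a, t in zip(coeffs, knots))
```

Evaluating `kernel_packet` at points to the left of the first knot or to the right of the last knot returns values at floating-point zero, while points strictly between the knots give nonzero values. Basis functions with such local support are what make the covariance representation sparse.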

Neural Estimation of Statistical Divergences
統計的発散の神経推定

Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular $\mathsf{f}$-divergences—Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.

統計的ダイバージェンス(SD)は、確率分布間の相違を定量化し、統計的推論と機械学習の基本的な構成要素です。これらのダイバージェンスを推定する最新の方法は、ニューラルネットワーク(NN)によって経験的変分形式をパラメーター化し、パラメーター空間を最適化することです。このようなニューラル推定器は実際には広く使用されていますが、対応するパフォーマンスの保証は部分的であり、さらなる調査が必要です。私たちは、浅いNNによって実現されるニューラル推定器の非漸近的な絶対誤差境界を確立し、4つの一般的な$\mathsf{f}$ダイバージェンス(Kullback-Leibler、カイ2乗、2乗Hellinger、および全変動)に焦点を当てます。私たちの分析は、非漸近的な関数近似定理と経験的過程理論のツールに依存して、関数近似と経験的推定という2つの誤差源を制限します。境界は、NNサイズとサンプル数の観点から有効な誤差を特徴付け、一貫性を保証するスケーリングレートを明らかにします。コンパクトにサポートされた分布の場合、適切なNN成長率を持つ上記の最初の3つのダイバージェンスのニューラル推定器は、ミニマックスレート最適であり、パラメトリック収束率を達成することがさらに示されます。

Foolish Crowds Support Benign Overfitting
愚かな群衆は良性の過剰適合を支持しています

We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to the “wisdom of the crowd”, except here the harm arising from fitting the noise is ameliorated by spreading it among many directions—the variance reduction arises from a foolish crowd.

私たちは、過剰パラメータ化領域におけるガウスデータを用いた線形回帰について、スパースな補間手順の過剰リスクの下限を証明します。この結果を適用して基底追跡(最小$\ell_1$ノルム補間)の下限を導き、グラウンドトゥルースがスパースな場合でも、その過剰リスクがOLS(最小$\ell_2$ノルム補間)よりも指数関数的に遅い速度で収束しうることを示します。私たちの分析は、「群衆の知恵」に類似した効果の利点を明らかにします。ただしここでは、ノイズに適合することから生じる害が、それを多くの方向に分散させることによって軽減されます。つまり、分散の減少は愚かな群衆から生じます。

Darts: User-Friendly Modern Machine Learning for Time Series
Darts:時系列のためのユーザーフレンドリーな最新の機械学習

We present Darts, a Python machine learning library for time series, with a focus on forecasting. Darts offers a variety of models, from classics such as ARIMA to state-of-the-art deep neural networks. The emphasis of the library is on offering modern machine learning functionalities, such as supporting multidimensional series, fitting models on multiple series, training on large datasets, incorporating external data, ensembling models, and providing a rich support for probabilistic forecasting. At the same time, great care goes into the API design to make it user-friendly and easy to use. For instance, all models can be used using fit()/predict(), similar to scikit-learn.

私たちは、時系列のPython機械学習ライブラリであるDartsを、予測に重点を置いて紹介します。Dartsは、ARIMAなどのクラシックなモデルから最先端のディープニューラルネットワークまで、さまざまなモデルを提供しています。このライブラリは、多次元系列のサポート、複数の系列へのモデルの適合、大規模なデータセットでのトレーニング、外部データの組み込み、モデルのアンサンブル、確率的予測の豊富なサポートの提供など、最新の機械学習機能を提供することに重点を置いています。同時に、API設計には細心の注意が払われており、ユーザーフレンドリーで使いやすいものにしています。例えば、すべてのモデルはscikit-learnと同様にfit()/predict()を使用して使用できます。

Provable Tensor-Train Format Tensor Completion by Riemannian Optimization
リーマン最適化による証明可能なテンソル列形式テンソル補完

The tensor train (TT) format enjoys appealing advantages in handling structural high-order tensors. The recent decade has witnessed the wide applications of TT-format tensors from diverse disciplines, among which tensor completion has drawn considerable attention. Numerous fast algorithms, including the Riemannian gradient descent (RGrad), have been proposed for the TT-format tensor completion. However, the theoretical guarantees of these algorithms are largely missing or sub-optimal, partly due to the complicated and recursive algebraic operations in TT-format decomposition. Moreover, existing results established for the tensors of other formats, for example, Tucker and CP, are inapplicable because the algorithms treating TT-format tensors are substantially different and more involved. In this paper, we provide, to our best knowledge, the first theoretical guarantees of the convergence of RGrad algorithm for TT-format tensor completion, under a nearly optimal sample size condition. The RGrad algorithm converges linearly with a constant contraction rate that is free of tensor condition number without the necessity of re-conditioning. We also propose a novel approach, referred to as the sequential second-order moment method, to attain a warm initialization under a similar sample size requirement. As a byproduct, our result even significantly refines the prior investigation of RGrad algorithm for matrix completion. Lastly, statistically (near) optimal rate is derived for RGrad algorithm if the observed entries consist of random sub-Gaussian noise. Numerical experiments confirm our theoretical discovery and showcase the computational speedup gained by the TT-format decomposition.

テンソル列(TT)形式は、構造的な高次テンソルの処理において魅力的な利点があります。ここ10年間、さまざまな分野でTT形式のテンソルが幅広く応用されてきましたが、その中でもテンソル補完が大きな注目を集めています。TT形式のテンソル補完には、リーマン勾配降下法(RGrad)を含む多数の高速アルゴリズムが提案されています。ただし、これらのアルゴリズムの理論的な保証は、TT形式の分解における複雑で再帰的な代数演算のせいもあって、ほとんど欠落しているか最適ではありません。さらに、他の形式のテンソル(たとえばTuckerやCP)に対して確立された既存の結果は、TT形式のテンソルを処理するアルゴリズムが大幅に異なり、より複雑であるため適用できません。この論文では、ほぼ最適なサンプルサイズ条件下で、TT形式のテンソル補完に対するRGradアルゴリズムの収束に関する、私たちの知る限り初の理論的保証を示します。RGradアルゴリズムは、テンソル条件数に左右されない一定の収縮率で線形に収束し、再調整の必要がありません。また、同様のサンプルサイズ要件の下でウォーム初期化を達成するために、シーケンシャル2次モーメント法と呼ばれる新しいアプローチも提案します。副産物として、私たちの結果は、行列補完のためのRGradアルゴリズムの以前の調査を大幅に改良しています。最後に、観測されたエントリがランダムなサブガウスノイズで構成されている場合、RGradアルゴリズムの統計的に(ほぼ)最適なレートが導出されます。数値実験により、私たちの理論的発見が確認され、TT形式の分解によって得られる計算の高速化が実証されています。

Depth separation beyond radial functions
放射状関数を超えた深さの分離

High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high-dimensions. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this paper is to extend such results to a more general class of functions, namely functions with piece-wise oscillatory structure, by building on the proof strategy of (Eldan and Shamir, 2016). We complement these results by showing that, if the domain radius and the rate of oscillation of the objective function are constant, then approximation by one-hidden-layer networks holds at a $\mathrm{poly}(d)$ rate for any fixed error threshold. The mentioned results show that one-hidden-layer networks fail to approximate high-energy functions whose Fourier representation is spread in the frequency domain, while they succeed at approximating functions having a sparse Fourier representation. However, the choice of the domain represents a source of gaps between these positive and negative approximation results. We conclude the paper focusing on a compact approximation domain, namely the sphere $\S$ in dimension $d$, where we provide a characterization of both functions which are efficiently approximable by one-hidden-layer networks and of functions which are provably not, in terms of their Fourier expansion.

ニューラルネットワークの高次元深度分離の結果から、高次元では特定の関数は2隠れ層ネットワークで効率的に近似できるが、1隠れ層ネットワークでは近似できないことがわかります。このタイプの既存の結果は主に、実際には通常は遭遇しない、基礎となる放射状または1次元構造を持つ関数に焦点を当てています。この論文の最初の貢献は、(Eldan and Shamir, 2016)の証明戦略に基づいて、このような結果をより一般的なクラスの関数、つまり区分振動構造を持つ関数に拡張することです。これらの結果を補完するために、目的関数の領域半径と振動率が一定であれば、1隠れ層ネットワークによる近似は、任意の固定エラーしきい値に対して$\mathrm{poly}(d)$率で成立することを示します。前述の結果は、1隠れ層ネットワークは、フーリエ表現が周波数領域に広がっている高エネルギー関数を近似できないが、スパースフーリエ表現を持つ関数を近似することに成功していることを示しています。しかし、領域の選択は、これらの正の近似結果と負の近似結果の間にギャップが生じる原因となります。この論文では、コンパクトな近似領域、つまり次元$d$の球$\S$に焦点を当てて結論付け、1つの隠れ層ネットワークによって効率的に近似できる関数と、近似できないことが証明できる関数の両方について、フーリエ展開の観点から特徴付けを行います。

Online Mirror Descent and Dual Averaging: Keeping Pace in the Dynamic Case
オンラインミラーディセントとデュアルアベレージング:ダイナミックなケースで歩調を合わせる

Online mirror descent (OMD) and dual averaging (DA)—two fundamental algorithms for online convex optimization—are known to have very similar (and sometimes identical) performance guarantees when used with a fixed learning rate. Under dynamic learning rates, however, OMD is provably inferior to DA and suffers linear regret, even in common settings such as prediction with expert advice. We modify the OMD algorithm through a simple technique that we call stabilization. We give essentially the same abstract regret bound for OMD with stabilization and for DA by modifying the classical OMD convergence analysis in a careful and modular way that allows for straightforward and flexible proofs. Simple corollaries of these bounds show that OMD with stabilization and DA enjoy the same performance guarantees in many applications—even under dynamic learning rates. We also shed light on the similarities between OMD and DA and show simple conditions under which stabilized-OMD and DA generate the same iterates. Finally, we show how to effectively use dual-stabilization with composite cost functions with simple adaptations to both the algorithm and its analysis.

オンラインミラーディセント(OMD)とデュアルアベレージ(DA)は、オンライン凸最適化の2つの基本アルゴリズムであり、固定学習率で使用した場合、非常に類似した(場合によっては同一の)パフォーマンス保証を持つことが知られています。ただし、動的学習率では、OMDはDAより劣ることが証明されており、専門家のアドバイスによる予測などの一般的な設定でも線形リグレットが発生します。私たちは、安定化と呼ぶ単純な手法でOMDアルゴリズムを変更します。古典的なOMD収束解析を、単純で柔軟な証明を可能にする慎重かつモジュール化された方法で変更することにより、安定化ありのOMDとDAに本質的に同じ抽象的なリグレット境界を与えます。これらの境界の単純な帰結から、安定化ありのOMDとDAは、動的学習率でも、多くのアプリケーションで同じパフォーマンス保証を享受できることがわかります。また、OMDとDAの類似点を明らかにし、安定化されたOMDとDAが同じ反復を生成する単純な条件を示します。最後に、アルゴリズムとその分析の両方に簡単な適応を行うことで、複合コスト関数によるデュアル安定化を効果的に使用する方法を示します。
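
To make the fixed-rate versus dynamic-rate contrast concrete, here is a minimal sketch of both algorithms on the probability simplex with the entropic regularizer (the experts setting). The loss vectors, the learning-rate schedules, and the indexing convention for the dual-averaging rate are assumptions chosen for illustration; the stabilization technique from the paper is not implemented here.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def omd_entropic(losses, rates):
    """Online mirror descent: x_{t+1,i} is proportional to x_{t,i} * exp(-eta_t * g_{t,i})."""
    d = len(losses[0])
    x = [1.0 / d] * d
    iterates = [x]
    for g, eta in zip(losses, rates):
        x = normalize([xi * math.exp(-eta * gi) for xi, gi in zip(x, g)])
        iterates.append(x)
    return iterates


def da_entropic(losses, rates):
    """Dual averaging: x_{t+1,i} is proportional to exp(-eta_t * sum_{s<=t} g_{s,i})."""
    d = len(losses[0])
    cumulative = [0.0] * d
    iterates = [[1.0 / d] * d]
    for g, eta in zip(losses, rates):
        cumulative = [c + gi for c, gi in zip(cumulative, g)]
        iterates.append(normalize([math.exp(-eta * c) for c in cumulative]))
    return iterates


def max_gap(a, b):
    # Largest coordinate-wise difference between two iterate sequences.
    return max(abs(u - v) for xa, xb in zip(a, b) for u, v in zip(xa, xb))


losses = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 0.0, 0.5], [1.0, 0.0, 0.0]]
fixed = [0.5] * len(losses)
dynamic = [1.0 / math.sqrt(t) for t in range(1, len(losses) + 1)]

gap_fixed = max_gap(omd_entropic(losses, fixed), da_entropic(losses, fixed))
gap_dynamic = max_gap(omd_entropic(losses, dynamic), da_entropic(losses, dynamic))
print(gap_fixed, gap_dynamic)
```

With a fixed rate and a uniform starting point the two sequences coincide up to rounding, while a decaying rate makes them diverge, because OMD keeps weighting old losses by the larger rates that were in force when they arrived.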

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
スイッチトランスフォーマー: シンプルで効率的なスパース性を備えた 1 兆パラメータモデルへのスケーリング

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model—with an outrageous number of parameters—but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability. We address these with the introduction of the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques mitigate the instabilities, and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.

ディープラーニングでは、モデルは通常、すべての入力に対して同じパラメータを再利用します。Mixture of Experts (MoE)モデルはこれに反し、代わりに各入力例に対して異なるパラメータを選択します。その結果、パラメータの数が途方もないほど多いが計算コストが一定である、スパースにアクティブ化されたモデルが生まれます。しかし、MoEはいくつかの注目すべき成功を収めているにもかかわらず、複雑さ、通信コスト、トレーニングの不安定性により、広範な採用が妨げられてきました。私たちは、Switch Transformerを導入することでこれらの問題に対処します。MoEルーティングアルゴリズムを簡素化し、通信コストと計算コストを削減した直感的に改良されたモデルを設計します。私たちが提案するトレーニング手法は不安定性を軽減し、大規模なスパースモデルを、初めて、より低い精度(bfloat16)形式でトレーニングできることを示しました。私たちは、T5-BaseとT5-Largeに基づいてモデルを設計し、同じ計算リソースで事前トレーニング速度を最大7倍向上させます。これらの改善は多言語設定にも適用され、101言語すべてでmT5-Baseバージョンに対する改善が測定されています。最後に、「Colossal Clean Crawled Corpus」で最大1兆個のパラメータモデルを事前トレーニングすることで、現在の言語モデルの規模を向上し、T5-XXLモデルに比べて4倍の高速化を実現しました。
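
The simplified routing idea, sending each token to a single expert, can be sketched as follows. This is a toy illustration under stated assumptions: the router weights and the three "experts" are hypothetical stand-ins (real experts are feed-forward networks), and expert capacity limits and the load-balancing loss are omitted.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_layer(token, router_weights, experts):
    """Route a token to its single highest-probability expert (top-1 routing)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)
    # Only the chosen expert runs; its output is scaled by the router probability.
    expert_out = experts[chosen](token)
    return chosen, [probs[chosen] * v for v in expert_out]


# Hypothetical router (one row per expert) and toy experts.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [
    lambda t: [2.0 * v for v in t],
    lambda t: [v + 1.0 for v in t],
    lambda t: [-v for v in t],
]

chosen_a, out_a = switch_layer([3.0, 0.5], router_weights, experts)
chosen_b, out_b = switch_layer([0.1, 2.0], router_weights, experts)
print(chosen_a, chosen_b)  # 0 1
```

Because each token triggers exactly one expert call, the per-token compute stays constant no matter how many experts (and therefore parameters) are added.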

A spectral-based analysis of the separation between two-layer neural networks and linear methods
2層ニューラルネットワークと線形法の間の分離のスペクトルベース解析

We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights.

私たちは、高次元関数の近似に関して、2層ニューラルネットワークが線形手法とどのように異なるかを分析するためのスペクトルベースのアプローチを提案します。この分離を定量化することは、2層ニューラルネットワークのコルモゴロフ幅の推定に還元でき、後者は関連するカーネルのスペクトルを使用してさらに特徴付けることができることを示します。以前の研究とは異なり、私たちのアプローチでは、上限、下限を取得し、明示的なハード関数を統一的に識別できます。活性化関数の選択が分離、特に入力次元への依存性にどのように影響するかについて、体系的な研究を提供します。具体的には、滑らかでない活性化関数の場合、既知の結果を、よりシャープな境界を持つより多くの活性化関数に拡張します。具体的な例として、任意の単一のニューロンがニューラルネットワークとランダムフィーチャモデル間の分離をインスタンス化できることを証明します。滑らかな活性化関数の場合、内部層の重みのノルムが入力次元に対して多項式的に大きくない限り、分離は無視できるという驚くべき発見があります。対照的に、滑らかでない活性化関数の分離は、内部層の重みのノルムとは無関係です。

Under-bagging Nearest Neighbors for Imbalanced Classification
不均衡な分類のための最近傍のアンダーバギング

In this paper, we propose an ensemble learning algorithm called under-bagging $k$-nearest neighbors (under-bagging $k$-NN) for imbalanced classification problems. On the theoretical side, by developing a new learning theory analysis, we show that with properly chosen parameters, i.e., the number of nearest neighbors $k$, the expected sub-sample size $s$, and the bagging rounds $B$, optimal convergence rates for under-bagging $k$-NN can be achieved under mild assumptions w.r.t. the arithmetic mean (AM) of recalls. Moreover, we show that with a relatively small $B$, the expected sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k$ can be reduced simultaneously, especially when the data are highly imbalanced, which leads to substantially lower time complexity and roughly the same space complexity. On the practical side, we conduct numerical experiments to verify the theoretical results on the benefits of the under-bagging technique by the promising AM performance and efficiency of our proposed algorithm.

この論文では、不均衡な分類問題に対するアンダーバギング$k$-最近傍(アンダーバギング$k$-NN)と呼ばれるアンサンブル学習アルゴリズムを提案します。理論面では、新しい学習理論分析を開発することにより、適切に選択されたパラメータ、すなわち最近傍の数$k$、期待されるサブサンプルサイズ$s$、およびバギングラウンド$B$を使用すると、リコールの算術平均(AM)に関する緩やかな仮定の下で、アンダーバギング$k$-NNの最適な収束率を達成できることを示します。さらに、比較的小さな$B$を使用すると、期待されるサブサンプルサイズ$s$は各バギングラウンドでのトレーニングデータ数$n$よりもはるかに小さくなり、特にデータが非常に不均衡な場合は、最近傍の数$k$を同時に減らすことができ、これにより時間計算量が大幅に低減し、空間計算量はほぼ同じになることを示します。実践面では、提案アルゴリズムの有望なAMパフォーマンスと効率性によって、アンダーバギング手法の利点に関する理論的結果を検証するための数値実験を実施します。
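
A rough sketch of the under-bagging idea for binary labels follows. It makes simplifying assumptions that are not taken from the paper: each round draws a balanced subsample whose per-class size equals the minority class size (rather than the expected sub-sample size $s$), and the rounds are aggregated by averaging the $k$-NN vote fractions. The synthetic data are invented for the example.

```python
import random


def knn_posterior(train, query, k):
    """Fraction of the k nearest neighbours (squared Euclidean distance) with label 1."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query))
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def under_bagging_knn(data, query, k, rounds, rng):
    """Average k-NN votes over several balanced subsamples of the training data."""
    by_class = {0: [], 1: []}
    for point in data:
        by_class[point[1]].append(point)
    per_class = min(len(by_class[0]), len(by_class[1]))
    score = 0.0
    for _ in range(rounds):
        # Under-sample every class down to the minority size, without replacement.
        sample = rng.sample(by_class[0], per_class) + rng.sample(by_class[1], per_class)
        score += knn_posterior(sample, query, k)
    return 1 if score / rounds > 0.5 else 0


rng = random.Random(0)
majority = [((rng.gauss(0.0, 0.7), rng.gauss(0.0, 0.7)), 0) for _ in range(200)]
minority = [((rng.gauss(3.0, 0.3), rng.gauss(3.0, 0.3)), 1) for _ in range(10)]
data = majority + minority

pred_minority = under_bagging_knn(data, (3.0, 3.0), k=3, rounds=5, rng=rng)
pred_majority = under_bagging_knn(data, (0.0, 0.0), k=3, rounds=5, rng=rng)
print(pred_minority, pred_majority)
```

Each round searches only 20 points instead of 210, which hints at why the abstract reports lower time complexity when the data are highly imbalanced.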

OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems
OVERT:非線形システムのためのニューラルネットワーク制御方策の安全性検証のためのアルゴリズム

Deep learning methods can be used to produce control policies, but certifying their safety is challenging. The resulting networks are nonlinear and often very large. In response to this challenge, we present OVERT: a sound algorithm for safety verification of nonlinear discrete-time closed loop dynamical systems with neural network control policies. The novelty of OVERT lies in combining ideas from the classical formal methods literature with ideas from the newer neural network verification literature. The central concept of OVERT is to abstract nonlinear functions with a set of optimally tight piecewise linear bounds. Such piecewise linear bounds are designed for seamless integration into ReLU neural network verification tools. OVERT can be used to prove bounded-time safety properties by either computing reachable sets or solving feasibility queries directly. We demonstrate various examples of safety verification for several classical benchmark examples. OVERT compares favorably to existing methods both in computation time and in tightness of the reachable set.

ディープラーニング手法を使用して制御ポリシーを作成できますが、その安全性を証明するのは困難です。結果として得られるネットワークは非線形であり、多くの場合非常に大規模です。この課題に対応するために、ニューラルネットワーク制御ポリシーを備えた非線形離散時間閉ループ動的システムの安全性検証のための健全なアルゴリズムであるOVERTを紹介します。OVERTの目新しさは、古典的な形式手法の文献のアイデアと、より新しいニューラルネットワーク検証の文献のアイデアを組み合わせたところにあります。OVERTの中心となる概念は、最適にタイトな区分線形境界のセットを使用して非線形関数を抽象化することです。このような区分線形境界は、ReLUニューラルネットワーク検証ツールにシームレスに統合できるように設計されています。OVERTは、到達可能集合を計算するか、実現可能性クエリを直接解くことにより、有界時間の安全性プロパティを証明するために使用できます。いくつかの古典的なベンチマーク例について、安全性検証のさまざまな例を示します。OVERTは、計算時間と到達可能集合のタイトさの両方において、既存の方法よりも優れています。
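
The idea of enclosing a nonlinear function between piecewise linear bounds can be sketched for the simplest case, a convex function. This is not OVERT's bounding procedure, which handles general nonlinear functions and computes optimally tight bounds; here the breakpoints are uniformly spaced by assumption, secant lines give the upper bound, and tangent lines give the lower bound.

```python
def secant_upper(f, breakpoints, x):
    """Piecewise linear interpolation of f; an upper bound when f is convex."""
    for left, right in zip(breakpoints, breakpoints[1:]):
        if left <= x <= right:
            slope = (f(right) - f(left)) / (right - left)
            return f(left) + slope * (x - left)
    raise ValueError("x lies outside the bounded interval")


def tangent_lower(f, df, breakpoints, x):
    """Maximum of tangent lines at the breakpoints; a lower bound when f is convex."""
    return max(f(p) + df(p) * (x - p) for p in breakpoints)


def f(x):
    return x * x


def df(x):
    return 2.0 * x


breakpoints = [-2.0 + 0.5 * i for i in range(9)]   # -2.0, -1.5, ..., 2.0
grid = [-2.0 + 4.0 * i / 400 for i in range(401)]  # dense check points in [-2, 2]

sound = all(
    tangent_lower(f, df, breakpoints, x) - 1e-12
    <= f(x)
    <= secant_upper(f, breakpoints, x) + 1e-12
    for x in grid
)
max_gap = max(
    secant_upper(f, breakpoints, x) - tangent_lower(f, df, breakpoints, x)
    for x in grid
)
print(sound, max_gap)
```

Both bounds are maxima or interpolants of linear pieces, which is the kind of expression that ReLU-based verification tools can encode; adding breakpoints tightens the gap at the cost of more pieces.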

An Error Analysis of Generative Adversarial Networks for Learning Distributions
学習分布のための敵対的生成ネットワークのエラー解析

This paper studies how well generative adversarial networks (GANs) learn probability distributions from finite samples. Our main results establish the convergence rates of GANs under a collection of integral probability metrics defined through Hölder classes, including the Wasserstein distance as a special case. We also show that GANs are able to adaptively learn data distributions with low-dimensional structures or have Hölder densities, when the network architectures are chosen properly. In particular, for distributions concentrated around a low-dimensional set, we show that the learning rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension. Our analysis is based on a new oracle inequality decomposing the estimation error into the generator and discriminator approximation error and the statistical error, which may be of independent interest.

この論文では、敵対的生成ネットワーク(GAN)が有限のサンプルから確率分布をどの程度学習するかを研究します。私たちの主な結果は、Hölderクラスによって定義される積分確率メトリックのコレクションの下でGANの収束率を確立します(特殊なケースとしてのWasserstein距離を含む)。また、ネットワークアーキテクチャが適切に選択されている場合、GANは低次元構造を持つ、あるいはHölder密度を持つデータ分布を適応的に学習できることを示しています。特に、低次元セットの周りに集中する分布の場合、GANの学習率は高い周囲次元ではなく、より低い固有次元に依存することを示します。私たちの分析は、推定誤差をジェネレータとディスクリミネーターの近似誤差と統計的誤差に分解する新しいオラクル不等式に基づいています。この不等式は、それ自体独立した関心の対象となり得ます。

Cauchy–Schwarz Regularized Autoencoder
コーシー・シュワルツ正則化オートエンコーダ

Recent work in unsupervised learning has focused on efficient inference and learning in latent variables models. Training these models by maximizing the evidence (marginal likelihood) is typically intractable. Thus, a common approximation is to maximize the Evidence Lower BOund (ELBO) instead. Variational autoencoders (VAE) are a powerful and widely-used class of generative models that optimize the ELBO efficiently for large datasets. However, the VAE’s default Gaussian choice for the prior imposes a strong constraint on its ability to represent the true posterior, thereby degrading overall performance. A Gaussian mixture model (GMM) would be a richer prior but cannot be handled efficiently within the VAE framework because of the intractability of the Kullback-Leibler divergence for GMMs. We deviate from the common VAE framework in favor of one with an analytical solution for Gaussian mixture prior. To perform efficient inference for GMM priors, we introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs. This new objective allows us to incorporate richer, multi-modal priors into the autoencoding framework. We provide empirical studies on a range of datasets and show that our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.

教師なし学習における最近の研究は、潜在変数モデルにおける効率的な推論と学習に焦点を当てています。証拠(周辺尤度)を最大化することでこれらのモデルをトレーニングすることは、通常、扱いにくいものです。したがって、一般的な近似は、代わりに証拠下限(ELBO)を最大化することです。変分オートエンコーダー(VAE)は、大規模データセットに対してELBOを効率的に最適化する、強力で広く使用されている生成モデルのクラスです。ただし、事前分布としてVAEがデフォルトで選択するガウス分布は、真の事後分布を表す能力に強い制約を課し、全体的なパフォーマンスを低下させます。ガウス混合モデル(GMM)はより豊富な事前分布ですが、GMMのKullback-Leiblerダイバージェンスが扱いにくいため、VAEフレームワーク内で効率的に処理することはできません。私たちは、一般的なVAEフレームワークから逸脱し、ガウス混合事前分布の解析ソリューションを備えたフレームワークを採用します。GMM事前分布の効率的な推論を実行するために、GMMに対して解析的に計算できるCauchy-Schwarzダイバージェンスに基づく新しい制約付き目的関数を導入します。この新しい目的関数により、より豊富なマルチモーダル事前分布をオートエンコーディングフレームワークに組み込むことができます。さまざまなデータセットに関する実証的研究を提供し、この目的関数が密度推定、教師なしクラスタリング、半教師あり学習、顔分析における変分オートエンコーディングモデルを改善することを示しています。
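
The claim that the Cauchy-Schwarz divergence is analytically computable for Gaussian mixtures can be checked in one dimension. The sketch below assumes the standard definition $D_{CS}(p,q) = -\log\big(\int pq \,/\, \sqrt{\int p^2 \int q^2}\big)$ and the Gaussian product identity $\int \mathcal{N}(x;\mu_1,\sigma_1^2)\,\mathcal{N}(x;\mu_2,\sigma_2^2)\,dx = \mathcal{N}(\mu_1;\mu_2,\sigma_1^2+\sigma_2^2)$; the mixture parameters are made up, and this is not the paper's constrained training objective.

```python
import math


def gauss_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def overlap(p, q):
    """Closed-form integral of p(x) * q(x) for 1-D mixtures of (weight, mean, variance)."""
    return sum(
        wp * wq * gauss_pdf(mp, mq, vp + vq)
        for wp, mp, vp in p
        for wq, mq, vq in q
    )


def cs_divergence(p, q):
    """Cauchy-Schwarz divergence between two 1-D Gaussian mixtures."""
    return -math.log(overlap(p, q) / math.sqrt(overlap(p, p) * overlap(q, q)))


# Hypothetical mixtures: a bimodal one and a single standard Gaussian.
bimodal = [(0.5, -2.0, 0.5), (0.5, 2.0, 0.5)]
unimodal = [(1.0, 0.0, 1.0)]

print(cs_divergence(bimodal, bimodal))
print(cs_divergence(bimodal, unimodal))
```

Every term is a finite sum over component pairs, so no sampling or numerical integration is needed, in contrast to the Kullback-Leibler divergence between mixtures.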

ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction
ReduNet:レート削減の最大化の原理に基づくホワイトボックスディープネットワーク

This work attempts to provide a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation. We argue that for high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding rate difference between the whole dataset and the average of all the subsets. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep network, named ReduNet, which shares common characteristics of modern deep networks. The deep layered architectures, linear and nonlinear operators, and even parameters of the network are all explicitly constructed layer-by-layer via forward propagation, although they are amenable to fine-tuning via back propagation. All components of so-obtained “white-box” network have precise optimization, statistical, and geometric interpretation. Moreover, all linear operators of the so-derived network naturally become multi-channel convolutions when we enforce classification to be rigorously shift-invariant. The derivation in the invariant setting suggests a trade-off between sparsity and invariance, and also indicates that such a deep convolution network is significantly more efficient to construct and learn in the spectral domain. Our preliminary simulations and experiments clearly verify the effectiveness of both the rate reduction objective and the associated ReduNet. All code and data are available at https://github.com/Ma-Lab-Berkeley.

この研究では、データ圧縮と識別表現の原理から現代の深層（畳み込み）ネットワークを解釈することを目的とした、もっともらしい理論的枠組みを提供することを試みています。高次元の多クラスデータの場合、最適な線形識別表現は、データセット全体とすべてのサブセットの平均との間のコーディングレートの差を最大化すると主張します。レート削減目標を最適化するための基本的な反復勾配上昇スキームは、現代の深層ネットワークの共通の特性を共有するReduNetという多層深層ネットワークに自然につながることを示します。ネットワークの深層アーキテクチャ、線形および非線形演算子、さらにはパラメーターはすべて、順方向伝播によって層ごとに明示的に構築されますが、逆方向伝播による微調整も可能です。このようにして得られた「ホワイトボックス」ネットワークのすべてのコンポーネントには、正確な最適化、統計、および幾何学的解釈があります。さらに、分類が厳密にシフト不変になるように強制すると、このようにして導出されたネットワークのすべての線形演算子は自然にマルチチャネル畳み込みになります。不変設定での導出は、スパース性と不変性の間のトレードオフを示唆しており、また、このような深い畳み込みネットワークは、スペクトル領域で構築および学習する方がはるかに効率的であることを示しています。私たちの予備的なシミュレーションと実験は、レート削減目標と関連するReduNetの両方の有効性を明確に検証しています。すべてのコードとデータは、https://github.com/Ma-Lab-Berkeleyで入手できます。

The Two-Sided Game of Googol
グーゴルの両面ゲーム

The secretary problem or game of Googol are classic models for online selection problems. In this paper we consider a variant of the problem and explore its connections to data-driven online selection. Specifically, we are given $n$ cards with arbitrary non-negative numbers written on both sides. The cards are randomly placed on $n$ consecutive positions on a table, and for each card, the visible side is also selected at random. The player sees the visible side of all cards and wants to select the card with the maximum hidden value. To this end, the player flips the first card, sees its hidden value and decides whether to pick it or drop it and continue with the next card. We study algorithms for two natural objectives: maximizing the probability of selecting the maximum hidden value, and maximizing the expectation of the selected hidden value. For the former objective we obtain a simple $0.45292$-competitive algorithm. For the latter, we obtain a $0.63518$-competitive algorithm. Our main contribution is to set up a model allowing to transform probabilistic optimal stopping problems into purely combinatorial ones. For instance, we can apply our results to obtain lower bounds for the single sample prophet secretary problem.

秘書問題やグーゴルゲームは、オンライン選択問題の古典的なモデルです。この論文では、問題の変種を検討し、データ駆動型オンライン選択との関連を探ります。具体的には、両面に任意の非負の数字が書かれた$n$枚のカードが与えられます。カードはテーブル上の$n$個の連続した位置にランダムに配置され、各カードの見える面もランダムに選択されます。プレーヤーはすべてのカードの見える面を見て、最大の隠し値を持つカードを選択したいと考えます。このために、プレーヤーは最初のカードをめくり、その隠し値を見て、それを選ぶか、捨てて次のカードに進むかを決定します。最大の隠し値を選択する確率を最大化することと、選択された隠し値の期待値を最大化することという2つの自然な目的のためのアルゴリズムを検討します。前者の目的については、単純な$0.45292$競合アルゴリズムが得られます。後者については、$0.63518$競合アルゴリズムが得られます。私たちの主な貢献は、確率的最適停止問題を純粋に組み合わせの問題に変換できるモデルを構築したことです。たとえば、私たちの結果を適用して、単一サンプルの預言者秘書問題の下限値を取得できます。
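
The game setup described above is easy to simulate. The sketch below only illustrates the rules: the stopping rule (accept the first hidden value that exceeds every initially visible value) is a hypothetical baseline chosen for simplicity, not the paper's $0.45292$-competitive algorithm, and the card values are invented.

```python
import random


def play_once(cards, rng):
    """Play one round; return True if the selected card has the maximum hidden value."""
    order = cards[:]
    rng.shuffle(order)  # random positions on the table
    table = []
    for a, b in order:
        # Each card shows one side chosen uniformly at random: (visible, hidden).
        table.append((a, b) if rng.random() < 0.5 else (b, a))
    threshold = max(visible for visible, _ in table)
    best_hidden = max(hidden for _, hidden in table)
    for _, hidden in table:  # flip the cards one at a time, left to right
        if hidden > threshold:
            return hidden == best_hidden
    return False  # never stopped, so nothing was selected


def success_rate_of(cards, trials, rng):
    return sum(play_once(cards, rng) for _ in range(trials)) / trials


cards = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
success_rate = success_rate_of(cards, trials=2000, rng=random.Random(0))
print(success_rate)
```

This baseline can never win when the largest number overall lands face up, which shows why a good algorithm has to use the visible sides more carefully than a single threshold.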

Sum of Ranked Range Loss for Supervised Learning
教師あり学習のランク付け範囲損失の合計

In forming learning objectives, one oftentimes needs to aggregate a set of individual values to a single output. Such cases occur in the aggregate loss, which combines individual losses of a learning model over each training sample, and in the individual loss for multi-label learning, which combines prediction scores over all class labels. In this work, we introduce the sum of ranked range (SoRR) as a general approach to form learning objectives. A ranked range is a consecutive sequence of sorted values of a set of real numbers. The minimization of SoRR is solved with the difference of convex algorithm (DCA). We explore two applications in machine learning of the minimization of the SoRR framework, namely the AoRR aggregate loss for binary/multi-class classification at the sample level and the TKML individual loss for multi-label/multi-class classification at the label level. A combination loss of AoRR and TKML is proposed as a new learning objective for improving the robustness of multi-label learning in the face of outliers in sample and labels alike. Our empirical results highlight the effectiveness of the proposed optimization frameworks and demonstrate the applicability of proposed losses using synthetic and real data sets.

学習目標を形成する際には、多くの場合、個々の値のセットを単一の出力に集約する必要があります。このようなケースは、各トレーニングサンプルの学習モデルの個々の損失を結合する集約損失と、すべてのクラスラベルの予測スコアを結合するマルチラベル学習の個々の損失で発生します。この研究では、学習目標を形成する一般的なアプローチとして、ランク付けされた範囲の合計(SoRR)を紹介します。ランク付けされた範囲は、一連の実数のソートされた値の連続したシーケンスです。SoRRの最小化は、凸差アルゴリズム(DCA)で解決されます。機械学習におけるSoRRフレームワークの最小化の2つのアプリケーション、つまりサンプルレベルでのバイナリ/マルチクラス分類のAoRR集約損失と、ラベルレベルでのマルチラベル/マルチクラス分類のTKML個別損失を検討します。サンプルとラベルの両方で外れ値がある場合のマルチラベル学習の堅牢性を向上させるための新しい学習目標として、AoRRとTKMLの組み合わせ損失が提案されています。私たちの実験結果は、提案された最適化フレームワークの有効性を強調し、合成データセットと実際のデータセットを使用して提案された損失の適用可能性を実証しています。
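
A ranked range sum can be computed as the difference of two top-$k$ sums, which is also what makes the difference-of-convex algorithm applicable (each top-$k$ sum is convex in the values). In this sketch the parameter names `skip` and `upto` are this example's own, not the paper's notation, and the loss values are made up.

```python
def top_sum(values, k):
    """Sum of the k largest values."""
    return sum(sorted(values, reverse=True)[:k])


def sorr(values, skip, upto):
    """Sum of the values ranked skip+1 through upto in descending order.

    Written as a difference of two top-k sums; requires 0 <= skip < upto <= len(values).
    """
    if not 0 <= skip < upto <= len(values):
        raise ValueError("need 0 <= skip < upto <= len(values)")
    return top_sum(values, upto) - top_sum(values, skip)


def aorr(values, skip, upto):
    """Average of the ranked range, usable as an aggregate loss."""
    return sorr(values, skip, upto) / (upto - skip)


losses = [0.2, 9.0, 0.5, 1.5, 0.1, 1.0]   # 9.0 plays the role of an outlier
print(sorr(losses, skip=1, upto=4))        # 1.5 + 1.0 + 0.5 = 3.0
print(aorr(losses, skip=1, upto=4))        # 1.0
```

Skipping the largest loss keeps a single outlier from dominating the objective, while `skip=0, upto=len(losses)` recovers the ordinary sum and `skip=0, upto=1` recovers the maximum.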

Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces
超曲面上の特異点を持つ関数の推定のための深層ニューラルネットワークの利点

We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure.

私たちは、ディープニューラルネットワーク（DNN）が他の標準的な方法よりも優れている理由を説明するために、ミニマックス率分析を開発しました。ノンパラメトリック回帰問題の場合、多くの標準的な方法が滑らかな関数の推定誤差のミニマックス最適率を達成することはよく知られており、そのため、DNNの理論的な利点を特定するのは簡単ではありません。この研究では、超曲面上に特異点を持つ非滑らかな関数のクラスの推定を考慮することで、このギャップを埋めようとします。私たちの発見は次のとおりです。（i）DNN推定量の一般化誤差を導出し、その収束率がほぼ最適であることを証明します。（ii）推定問題の位相図を明らかにし、カーネル法、ガウス過程法などを含む一般的なクラスの推定量よりもDNNが優れている状況を説明します。さらに、DNNが調和解析ベースの推定量よりも優れていることを示します。DNNのこの利点は、特異点の形状を多層構造によってうまく処理できるという事実から生まれます。

EiGLasso for Scalable Sparse Kronecker-Sum Inverse Covariance Estimation
スケーラブルなスパースクロネッカー和逆共分散推定のための EiGLasso

In many real-world data, complex dependencies are present both among samples and among features. The Kronecker sum or the Cartesian product of two graphs, each modeling dependencies across features and across samples, has been used as an inverse covariance matrix for a matrix-variate Gaussian distribution as an alternative to Kronecker-product inverse covariance matrix due to its more intuitive sparse structure. However, the existing methods for sparse Kronecker-sum inverse covariance estimation are limited in that they do not scale to more than a few hundred features and samples and that unidentifiable parameters pose challenges in estimation. In this paper, we introduce EiGLasso, a highly scalable method for sparse Kronecker-sum inverse covariance estimation, based on Newton’s method combined with eigendecomposition of the sample and feature graphs to exploit the Kronecker-sum structure. EiGLasso further reduces computation time by approximating the Hessian matrix, based on the eigendecomposition of the two graphs. EiGLasso achieves quadratic convergence with the exact Hessian and linear convergence with the approximate Hessian. We describe a simple new approach to estimating the unidentifiable parameters that generalizes the existing methods. On simulated and real-world data, we demonstrate that EiGLasso achieves two to three orders-of-magnitude speed-up, compared to the existing methods.

現実世界のデータの多くでは、サンプル間および特徴間の両方に複雑な依存関係が存在します。特徴間およびサンプル間の依存関係をそれぞれモデル化する2つのグラフのクロネッカー和または直積は、より直感的なスパース構造のため、クロネッカー積逆共分散行列の代わりとして、行列変量ガウス分布の逆共分散行列として使用されてきました。しかし、スパースクロネッカー和逆共分散推定の既存の方法は、数百以上の特徴とサンプルには拡張できず、識別できないパラメーターが推定に課題をもたらすという点で制限があります。この論文では、クロネッカー和構造を活用するためにサンプルと特徴グラフの固有分解と組み合わせたニュートン法に基づく、スパースクロネッカー和逆共分散推定の高度にスケーラブルな方法であるEiGLassoを紹介します。EiGLassoは、2つのグラフの固有値分解に基づいてヘッセ行列を近似することで、計算時間をさらに短縮します。EiGLassoは、正確なヘッセ行列では2次収束を、近似ヘッセ行列では線形収束を実現します。ここでは、既存の方法を一般化する、識別不可能なパラメータを推定する新しいシンプルなアプローチについて説明します。シミュレーションデータと実際のデータで、EiGLassoは既存の方法と比較して2～3桁の高速化を実現することを実証します。

Conditions and Assumptions for Constraint-based Causal Structure Learning
制約に基づく因果構造学習のための条件と仮定

We formalize constraint-based structure learning of the “true” causal graph from observed data when unobserved variables are also existent. We provide conditions for a “natural” family of constraint-based structure-learning algorithms that output graphs that are Markov equivalent to the causal graph. Under the faithfulness assumption, this natural family contains all exact structure-learning algorithms. We also provide a set of assumptions, under which any natural structure-learning algorithm outputs Markov equivalent graphs to the causal graph. These assumptions can be thought of as a relaxation of faithfulness, and most of them can be directly tested from (the underlying distribution) of the data, particularly when one focuses on structural causal models. We specialize the definitions and results for structural causal models.

私たちは、観測されていない変数も存在する場合に、観測されたデータから「真の」因果グラフの制約ベースの構造学習を形式化します。因果グラフとマルコフ同値なグラフを出力する制約ベースの構造学習アルゴリズムの「自然な」ファミリのための条件を提供します。忠実性の仮定の下では、この自然なファミリにはすべての正確な構造学習アルゴリズムが含まれています。また、任意の自然な構造学習アルゴリズムが因果グラフとマルコフ同値なグラフを出力するための一連の仮定も提供します。これらの仮定は、忠実性の緩和と考えることができ、それらのほとんどは、特に構造的因果モデルに焦点を当てる場合に、データの(基礎となる分布)から直接テストできます。定義と結果を構造的因果モデルに特化して述べます。

Bayesian subset selection and variable importance for interpretable prediction and classification
解釈可能な予測と分類のためのベイズのサブセット選択と変数の重要度

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model M, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single “best” subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via M. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.

サブセット選択は、解釈可能な学習、科学的発見、およびデータ圧縮のための貴重なツールです。ただし、選択の不安定性、正則化の欠如、および選択後の推論の難しさのため、従来のサブセット選択は回避されることがよくあります。私たちは、ベイズの観点からこれらの課題に取り組みます。任意のベイズ予測モデルMが与えられた場合、線形予測または分類のための変数のほぼ最適なサブセットのファミリーを抽出します。この戦略は、単一の「最良の」サブセットの役割を軽視し、代わりに、多くのサブセットが非常に競争的であることが多いというより広い視点を推進します。許容可能なサブセットのファミリーは、モデル解釈の新しい経路を提供し、最小の許容可能なサブセットなどの主要メンバーと、変数が許容可能なサブセットのすべてに(共に)出現するか、一部に出現するか、あるいはどれにも出現しないかに基づく新しい(共)変数重要度メトリックによって簡潔にまとめられます。より広くは、ベイズ決定分析を適用して、任意の変数のサブセットの最適な線形係数を導出します。これらの係数は、Mを介して正則化と予測不確実性の定量化の両方を継承します。シミュレートされたデータと実際のデータの両方で、提案されたアプローチは、競合するベイズおよび頻度論的選択方法よりも優れた予測、区間推定、および変数選択を示します。これらのツールは、相関性の高い共変量を持つ大規模な教育データセットに適用されます。私たちの分析は、教育の成果を予測する環境、社会経済、および人口統計学的要因の組み合わせに関する独自の洞察を提供し、ほぼ最適なサンプル外予測精度を提供する200を超える変数の異なるサブセットを特定します。

IALE: Imitating Active Learner Ensembles
IALE:アクティブラーナーアンサンブルの模倣

Active learning prioritizes the labeling of the most informative data samples. However, the performance of active learning heuristics depends on both the structure of the underlying model architecture and the data. We propose IALE, an imitation learning scheme that imitates the selection of the best-performing expert heuristic at each stage of the learning cycle in a batch-mode pool-based setting. We use Dagger to train a transferable policy on a dataset and later apply it to different datasets and deep classifier architectures. The policy reflects on the best choices from multiple expert heuristics given the current state of the active learning process, and learns to select samples in a complementary way that unifies the expert strategies. Our experiments on well-known image datasets show that we outperform state of the art imitation learners and heuristics.

アクティブラーニングでは、最も情報量の多いデータサンプルのラベリングを優先します。ただし、アクティブラーニングヒューリスティックのパフォーマンスは、基になるモデルアーキテクチャの構造とデータの両方に依存します。私たちは、バッチモードのプールベースの設定で、学習サイクルの各段階で最もパフォーマンスの高い専門家ヒューリスティックの選択を模倣する模倣学習スキームであるIALEを提案します。Daggerを使用して、あるデータセット上で転移可能なポリシーを訓練し、後でそれを別のデータセットやディープ分類器アーキテクチャに適用します。このポリシーは、アクティブラーニングプロセスの現在の状態を踏まえて、複数の専門家ヒューリスティックの中から最良の選択を検討し、専門家の戦略を統合する補完的な方法でサンプルを選択することを学習します。よく知られた画像データセットでの実験では、最先端の模倣学習器やヒューリスティックよりも優れたパフォーマンスを発揮することが示されています。

Riemannian Stochastic Proximal Gradient Methods for Nonsmooth Optimization over the Stiefel Manifold
シュティーフェル多様体上の非平滑最適化のためのリーマン確率的近接勾配法

Riemannian optimization has drawn a lot of attention due to its wide applications in practice. Riemannian stochastic first-order algorithms have been studied in the literature to solve large-scale machine learning problems over Riemannian manifolds. However, most of the existing Riemannian stochastic algorithms require the objective function to be differentiable, and they do not apply to the case where the objective function is nonsmooth. In this paper, we present two Riemannian stochastic proximal gradient methods for minimizing nonsmooth function over the Stiefel manifold. The two methods, named R-ProxSGD and R-ProxSPB, are generalizations of proximal SGD and proximal SpiderBoost in Euclidean setting to the Riemannian setting. Analysis on the incremental first-order oracle (IFO) complexity of the proposed algorithms is provided. Specifically, the R-ProxSPB algorithm finds an $\epsilon$-stationary point with $O(\epsilon^{-3})$ IFOs in the online case, and $O(n+\sqrt{n}\epsilon^{-2})$ IFOs in the finite-sum case with $n$ being the number of summands in the objective. Experimental results on online sparse PCA and robust low-rank matrix completion show that our proposed methods significantly outperform the existing methods that use Riemannian subgradient information.

リーマン最適化は、実践での幅広い応用により、多くの注目を集めています。リーマン確率的一次アルゴリズムは、リーマン多様体上の大規模機械学習問題を解決するために文献で研究されてきました。しかし、既存のリーマン確率的アルゴリズムのほとんどは、目的関数が微分可能であることを必要とし、目的関数が滑らかでない場合には適用されません。この論文では、シュティーフェル多様体上の滑らかでない関数を最小化する2つのリーマン確率的近接勾配法を紹介します。R-ProxSGDとR-ProxSPBと呼ばれる2つの方法は、ユークリッド設定の近接SGDと近接SpiderBoostをリーマン設定に一般化したものです。提案されたアルゴリズムの増分一次オラクル(IFO)複雑度の分析が提供されています。具体的には、R-ProxSPBアルゴリズムは、オンラインの場合は$O(\epsilon^{-3})$個のIFOで、有限和の場合は$O(n+\sqrt{n}\epsilon^{-2})$個のIFOで$\epsilon$定常点を見つけます(ここで$n$は目的関数の項数です)。オンラインスパースPCAと堅牢な低ランク行列補完に関する実験結果から、提案手法がリーマン部分勾配情報を使用する既存の手法よりも大幅に優れていることがわかります。

Globally Injective ReLU Networks
大域的に単射なReLUネットワーク

Injectivity plays an important role in generative models where it enables inference; in inverse problems and compressed sensing with generative priors it is a precursor to well posedness. We establish sharp characterizations of injectivity of fully-connected and convolutional ReLU layers and networks. First, through a layerwise analysis, we show that an expansivity factor of two is necessary and sufficient for injectivity by constructing appropriate weight matrices. We show that global injectivity with iid Gaussian matrices, a commonly used tractable model, requires larger expansivity between 3.4 and 10.5. We also characterize the stability of inverting an injective network via worst-case Lipschitz constants of the inverse. We then use arguments from differential topology to study injectivity of deep networks and prove that any Lipschitz map can be approximated by an injective ReLU network. Finally, using an argument based on random projections, we show that an end-to-end—rather than layerwise—doubling of the dimension suffices for injectivity. Our results establish a theoretical basis for the study of nonlinear inverse and inference problems using neural networks.

単射性は、推論を可能にする生成モデルで重要な役割を果たします。また、逆問題や生成事前確率による圧縮センシングでは、適切性の前提条件となります。完全結合および畳み込みReLU層とネットワークの単射性の明確な特性を確立します。まず、層ごとの分析により、適切な重み行列を構築することで、単射性には拡張係数2が必要かつ十分であることを示します。一般的に使用される扱いやすいモデルであるiidガウス行列によるグローバル単射性には、3.4から10.5の間のより大きな拡張性が必要であることを示します。また、逆の最悪ケースのLipschitz定数を介して、単射ネットワークの反転の安定性を特性評価します。次に、微分位相幾何学からの議論を使用してディープネットワークの単射性を調べ、任意のLipschitzマップを単射ReLUネットワークで近似できることを証明します。最後に、ランダム投影に基づく議論を使用して、次元を層ごとではなくエンドツーエンドで2倍にすれば、単射性を実現できることを示します。私たちの結果は、ニューラルネットワークを使用した非線形逆問題と推論問題の研究の理論的基礎を確立します。
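The expansivity factor of two in the layerwise result can be made concrete with a small example. The following is a minimal sketch, assuming the weight matrix W = [I; -I] (an illustrative choice, not necessarily the construction used in the paper). The layer maps R^n to R^(2n), and the input is recovered exactly because x = ReLU(x) - ReLU(-x). The function names `relu_layer` and `invert_layer` are hypothetical.

```python
def relu_layer(x):
    """Apply ReLU(W x) with the assumed W = [I; -I], mapping R^n to R^(2n)."""
    pos = [max(v, 0.0) for v in x]    # ReLU(x)
    neg = [max(-v, 0.0) for v in x]   # ReLU(-x)
    return pos + neg


def invert_layer(y):
    """Recover x from ReLU([I; -I] x) using x = ReLU(x) - ReLU(-x)."""
    n = len(y) // 2
    return [y[i] - y[n + i] for i in range(n)]


x = [1.5, -2.0, 0.0, 3.25]
y = relu_layer(x)           # output dimension is twice the input dimension
x_rec = invert_layer(y)     # the layer is injective, so x is recovered
```

A plain ReLU(x) layer with W = I would lose all negative coordinates, which is why some expansion of the dimension is needed.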

Efficient Least Squares for Estimating Total Effects under Linearity and Causal Sufficiency
線形性と因果的充足性の下での全効果を推定するための効率的な最小二乗法

Recursive linear structural equation models are widely used to postulate causal mechanisms underlying observational data. In these models, each variable equals a linear combination of a subset of the remaining variables plus an error term. When there is no unobserved confounding or selection bias, the error terms are assumed to be independent. We consider estimating a total causal effect in this setting. The causal structure is assumed to be known only up to a maximally oriented partially directed acyclic graph (MPDAG), a general class of graphs that can represent a Markov equivalence class of directed acyclic graphs (DAGs) with added background knowledge. We propose a simple estimator based on recursive least squares, which can consistently estimate any identified total causal effect, under point or joint intervention. We show that this estimator is the most efficient among all regular estimators that are based on the sample covariance, which includes covariate adjustment and the estimators employed by the joint-IDA algorithm. Notably, our result holds without assuming Gaussian errors.

再帰的線形構造方程式モデルは、観察データの根底にある因果メカニズムを仮定するために広く使用されています。これらのモデルでは、各変数は、残りの変数のサブセットと誤差項の線形結合に等しくなります。観測されていない交絡や選択バイアスがない場合、誤差項は独立していると想定されます。この設定で全体の因果効果を推定することを検討します。因果構造は、最大指向部分有向非巡回グラフ(MPDAG)までしかわかっていないと想定されます。MPDAGは、背景知識を追加した有向非巡回グラフ(DAG)のマルコフ同値クラスを表すことができるグラフの一般的なクラスです。私たちは、再帰最小二乗法に基づく単純な推定量を提案します。この推定量は、点介入または結合介入の下で、特定されたすべての全体の因果効果を一貫して推定できます。この推定量は、共変量調整と結合IDAアルゴリズムで使用される推定量を含むサンプル共分散に基づくすべての通常の推定量の中で最も効率的であることを示します。注目すべきことに、我々の結果はガウス誤差を仮定しなくても成り立ちます。

The EM Algorithm is Adaptively-Optimal for Unbalanced Symmetric Gaussian Mixtures
EMアルゴリズムは、不均衡な対称ガウス混合に対して適応最適です

This paper studies the problem of estimating the means $\pm\theta_{*}\in\mathbb{R}^{d}$ of a symmetric two-component Gaussian mixture $\delta_{*}\cdot N(\theta_{*},I)+(1-\delta_{*})\cdot N(-\theta_{*},I)$, where the weights $\delta_{*}$ and $1-\delta_{*}$ are unequal. Assuming that $\delta_{*}$ is known, we show that the population version of the EM algorithm globally converges if the initial estimate has non-negative inner product with the mean of the larger weight component. This can be achieved by the trivial initialization $\theta_{0}=0$. For the empirical iteration based on $n$ samples, we show that when initialized at $\theta_{0}=0$, the EM algorithm adaptively achieves the minimax error rate $\tilde{O}\Big(\min\Big\{\frac{1}{(1-2\delta_{*})}\sqrt{\frac{d}{n}},\frac{1}{\|\theta_{*}\|}\sqrt{\frac{d}{n}},\left(\frac{d}{n}\right)^{1/4}\Big\}\Big)$ in no more than $O\Big(\frac{1}{\|\theta_{*}\|(1-2\delta_{*})}\Big)$ iterations (with high probability). We also consider the EM iteration for estimating the weight $\delta_{*}$, assuming a fixed mean $\theta$ (which is possibly mismatched to $\theta_{*}$). For the empirical iteration of $n$ samples, we show that the minimax error rate $\tilde{O}\Big(\frac{1}{\|\theta_{*}\|}\sqrt{\frac{d}{n}}\Big)$ is achieved in no more than $O\Big(\frac{1}{\|\theta_{*}\|^{2}}\Big)$ iterations. These results robustify and complement recent results of Wu and Zhou (2019) obtained for the equal weights case $\delta_{*}=1/2$.

この論文では、対称的な2成分ガウス混合分布$\delta_{*}\cdot N(\theta_{*},I)+(1-\delta_{*})\cdot N(-\theta_{*},I)$の平均$\pm\theta_{*}\in\mathbb{R}^{d}$を推定する問題について検討します。ここで、重み$\delta_{*}$と$1-\delta_{*}$は等しくありません。$\delta_{*}$が既知であると仮定すると、初期推定値と重みの大きい方の成分の平均との内積が負でない場合、EMアルゴリズムの母集団バージョンが大域的に収束することを示します。これは、単純な初期化$\theta_{0}=0$で実現できます。$n$個のサンプルに基づく経験的反復では、$\theta_{0}=0$で初期化すると、EMアルゴリズムは、ミニマックス誤差率$\tilde{O}\Big(\min\Big\{\frac{1}{(1-2\delta_{*})}\sqrt{\frac{d}{n}},\frac{1}{\|\theta_{*}\|}\sqrt{\frac{d}{n}},\left(\frac{d}{n}\right)^{1/4}\Big\}\Big)$を、最大$O\Big(\frac{1}{\|\theta_{*}\|(1-2\delta_{*})}\Big)$回の反復で（高い確率で）適応的に達成することを示します。また、平均$\theta$を固定値($\theta_{*}$と一致しない可能性がある)と仮定して、重み$\delta_{*}$を推定するためのEM反復も検討します。$n$個のサンプルに基づく経験的反復では、ミニマックス誤差率$\tilde{O}\Big(\frac{1}{\|\theta_{*}\|}\sqrt{\frac{d}{n}}\Big)$が最大$O\Big(\frac{1}{\|\theta_{*}\|^{2}}\Big)$回の反復で達成されることを示します。これらの結果は、等しい重みの場合$\delta_{*}=1/2$で得られたWuとZhou (2019)の最近の結果を堅牢にし、補完します。
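The EM iteration with known weight can be sketched in a few lines. This is a minimal illustration derived from the mixture model stated in the abstract, not code from the paper. For this model the E-step and M-step combine into one update, because twice the posterior probability of the $+\theta$ component minus one equals $\tanh(\langle x,\theta\rangle+\beta)$ with $\beta=\frac{1}{2}\log\frac{\delta}{1-\delta}$. The sample size, seed, and true parameters below are arbitrary assumptions.

```python
import math
import random


def em_step(theta, xs, delta):
    """One EM update for delta*N(theta, I) + (1 - delta)*N(-theta, I), delta known."""
    beta = 0.5 * math.log(delta / (1.0 - delta))
    d = len(theta)
    new_theta = [0.0] * d
    for x in xs:
        # 2 * P(component +theta | x) - 1 equals tanh(<x, theta> + beta)
        s = math.tanh(sum(a * b for a, b in zip(x, theta)) + beta)
        for j in range(d):
            new_theta[j] += s * x[j]
    return [v / len(xs) for v in new_theta]


def sample(n, theta, delta, rng):
    """Draw n points from the two-component mixture."""
    xs = []
    for _ in range(n):
        sign = 1.0 if rng.random() < delta else -1.0
        xs.append([sign * t + rng.gauss(0.0, 1.0) for t in theta])
    return xs


rng = random.Random(0)                      # arbitrary seed
theta_star, delta_star = [2.0, 0.0], 0.3    # assumed ground truth for the demo
xs = sample(5000, theta_star, delta_star, rng)

theta = [0.0, 0.0]                          # the trivial initialization theta_0 = 0
for _ in range(50):
    theta = em_step(theta, xs, delta_star)
err = math.dist(theta, theta_star)          # estimation error after 50 iterations
```

With unequal weights the first step from zero already moves in a useful direction, since the update reduces to $(2\delta-1)$ times the sample mean. With $\delta=1/2$ the same initialization would stay at zero.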

Sufficient reductions in regression with mixed predictors
混合予測子による回帰の十分な削減

Most data sets consist of measurements on continuous and categorical variables. Yet, modeling high-dimensional mixed predictors has received limited attention in the regression and classification statistical literature. We study the general regression problem of inferring on a variable of interest based on high dimensional mixed continuous and binary predictors. The aim is to find a lower dimensional function of the mixed predictor vector that contains all the modeling information in the mixed predictors for the response, which can be either continuous or categorical. The approach we propose identifies sufficient reductions by reversing the regression and modeling the mixed predictors conditional on the response. We derive the maximum likelihood estimator of the sufficient reductions, asymptotic tests for dimension, and a regularized estimator, which simultaneously achieves variable (feature) selection and dimension reduction (feature extraction). We study the performance of the proposed method and compare it with other approaches through simulations and real data examples.

ほとんどのデータセットは、連続変数とカテゴリ変数の測定値で構成されています。しかし、高次元の混合予測子のモデリングは、回帰および分類に関する統計学の文献ではあまり注目されていません。私たちは、連続予測子とバイナリ予測子が混在する高次元の混合予測子に基づいて、関心のある変数を推測する一般的な回帰問題を研究します。目的は、応答(連続でもカテゴリでもよい)に関して混合予測子が持つすべてのモデリング情報を含む、混合予測子ベクトルの低次元関数を見つけることです。私たちが提案するアプローチは、回帰を逆転させ、応答を条件として混合予測子をモデリングすることにより、十分な削減を特定します。十分な削減の最尤推定量、次元の漸近検定、および変数（特徴）選択と次元削減（特徴抽出）を同時に達成する正則化推定量を導出します。私たちは、シミュレーションと実際のデータ例を通じて、提案された方法のパフォーマンスを研究し、他のアプローチと比較します。

Towards An Efficient Approach for the Nonconvex lp Ball Projection: Algorithm and Analysis
非凸型lp球射影の効率的なアプローチに向けて:アルゴリズムと解析

This paper primarily focuses on computing the Euclidean projection of a vector onto the lp ball in which p ∈ (0,1). Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote the sparsity of the desired solution. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem. Based on this characterization, we develop a novel numerical approach for computing the stationary point by solving a sequence of projections onto the reweighted l1-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case $O(1/\sqrt{k})$ convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm.

この論文は主に、p∈(0,1)であるlp球へのベクトルのユークリッド射影の計算に焦点を当てています。このような問題は、望ましい解のスパース性を促進する能力があるため、統計的機械学習および信号処理タスクのコア構成要素として浮上しています。しかし、射影を見つけるための効率的な数値アルゴリズムは、特に大規模な最適化では、まだ利用できません。この課題に対処するために、まずこの問題の1次の必要最適条件を導出します。この特性に基づいて、再重み付けされたl1球への一連の射影を解くことにより、定常点を計算するための新しい数値アプローチを開発します。この方法は、実装が実質的に簡単で、計算効率に優れています。さらに、提案されたアルゴリズムは、穏やかな条件下で一意に収束することが示されており、最悪の場合の収束率は$O(1/\sqrt{k})$です。数値実験により、提案されたアルゴリズムの効率が実証されています。
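The subproblem named in the abstract, a Euclidean projection onto a weighted l1 ball, can be sketched as follows. This is not the paper's algorithm: it shows only a single projection, and the bisection solver, the function name `project_weighted_l1`, and the example weights are assumptions made for illustration. In an iteratively reweighted scheme the weights would be updated between successive calls.

```python
import math


def project_weighted_l1(y, w, radius, iters=100):
    """Euclidean projection of y onto {x : sum_i w[i] * |x[i]| <= radius}, w[i] > 0."""

    def shrink(lam):
        # Soft-threshold each coordinate by lam * w[i], keeping the sign of y[i]
        return [math.copysign(max(abs(v) - lam * wi, 0.0), v) for v, wi in zip(y, w)]

    def weighted_norm(x):
        return sum(wi * abs(v) for v, wi in zip(x, w))

    if weighted_norm(y) <= radius:
        return list(y)                      # already inside the ball

    # The weighted norm of shrink(lam) decreases in lam, so bisect on the multiplier
    lo, hi = 0.0, max(abs(v) / wi for v, wi in zip(y, w))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_norm(shrink(mid)) > radius:
            lo = mid
        else:
            hi = mid
    return shrink(hi)                       # feasible by construction


x = project_weighted_l1([3.0, 1.0], [1.0, 1.0], 2.0)
```

Soft-thresholding sets small coordinates exactly to zero, which is the source of the sparsity mentioned in the abstract.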

Total Stability of SVMs and Localized SVMs
SVM とローカライズされた SVM の総合的な安定性

Regularized kernel-based methods such as support vector machines (SVMs) typically depend on the underlying probability measure $\mathrm{P}$ (respectively an empirical measure $\mathrm{D}_n$ in applications) as well as on the regularization parameter $\lambda$ and the kernel $k$. Whereas classical statistical robustness only considers the effect of small perturbations in $\mathrm{P}$, the present paper investigates the influence of simultaneous slight variations in the whole triple $(\mathrm{P},\lambda,k)$, respectively $(\mathrm{D}_n,\lambda_n,k)$, on the resulting predictor. Existing results from the literature are considerably generalized and improved. In order to also make them applicable to big data, where regular SVMs suffer from their super-linear computational requirements, we show how our results can be transferred to the context of localized learning. Here, the effect of slight variations in the applied regionalization, which might for example stem from changes in $\mathrm{P}$ respectively $\mathrm{D}_n$, is considered as well.

サポートベクターマシン(SVM)などの正則化カーネルベースの方法は、通常、基礎となる確率測度$\mathrm{P}$ (応用では経験的測度$\mathrm{D}_n$)に加えて、正則化パラメータ$\lambda$とカーネル$k$に依存します。従来の統計的堅牢性は$\mathrm{P}$の小さな摂動の影響のみを考慮しますが、この論文では、3つ組全体$(\mathrm{P},\lambda,k)$ (または$(\mathrm{D}_n,\lambda_n,k)$)の同時のわずかな変化が、結果として得られる予測子に与える影響を調査します。文献の既存の結果は、かなり一般化され、改善されています。通常のSVMが超線形の計算要件に悩まされるビッグデータにも適用できるようにするために、結果を局所化学習のコンテキストに移す方法を示します。ここでは、適用された領域分割のわずかな変化（たとえば、$\mathrm{P}$や$\mathrm{D}_n$の変化に起因するもの）の影響も考慮されます。

Distributed Learning of Finite Gaussian Mixtures
有限ガウス混合の分散学習

Advances in information technology have led to extremely large datasets that are often kept in different storage centers. Existing statistical methods must be adapted to overcome the resulting computational obstacles while retaining statistical validity and efficiency. In this situation, the split-and-conquer strategy is among the most effective solutions to many statistical problems, including quantile processes, regression analysis, principal eigenspaces, and exponential families. This paper applies this strategy to develop a distributed learning procedure of finite Gaussian mixtures. We recommend a reduction strategy and invent an effective majorization-minimization algorithm. The new estimator is consistent and retains root-n consistency under some general conditions. Experiments based on simulated and real-world datasets show that the proposed estimator has comparable statistical performance with the global estimator based on the full dataset, if the latter is feasible. It can even outperform the global estimator for the purpose of clustering if the model assumption does not fully match the real-world data. It also has better statistical and computational performance than some existing split-and-conquer approaches.

情報技術の進歩により、非常に大規模なデータセットが生まれ、多くの場合、異なるストレージセンターに保存されています。既存の統計手法は、統計的妥当性と効率性を維持しながら、結果として生じる計算上の障害を克服するために適応させる必要があります。このような状況では、分割統治戦略は、分位プロセス、回帰分析、主固有空間、指数族など、多くの統計的問題に対する最も効果的なソリューションの1つです。この論文では、この戦略を適用して、有限ガウス混合の分散学習手順を開発します。削減戦略を推奨し、効果的な主要化最小化アルゴリズムを発明します。新しい推定量は一貫性があり、いくつかの一般的な条件下でルートn一貫性を維持します。シミュレートされたデータセットと実際のデータセットに基づく実験では、提案された推定量は、完全なデータセットに基づくグローバル推定量(後者が実行可能であれば)と同等の統計パフォーマンスを持つことが示されています。モデルの仮定が実際のデータと完全に一致しない場合は、クラスタリングの目的でグローバル推定量よりも優れたパフォーマンスを発揮することさえあります。また、既存の分割統治法よりも統計的および計算的なパフォーマンスが優れています。

PECOS: Prediction for Enormous and Correlated Output Spaces
PECOS:巨大で相関のある出力空間の予測

Many large-scale applications amount to finding relevant results from an enormous output space of potential candidates. For example, finding the best matching product from a large catalog or suggesting related search phrases on a search engine. The size of the output space for these problems can range from millions to billions, and can even be infinite in some applications. Moreover, training data is often limited for the “long-tail” items in the output space. Fortunately, items in the output space are often correlated thereby presenting an opportunity to alleviate the data sparsity issue. In this paper, we propose the Prediction for Enormous and Correlated Output Spaces (PECOS) framework, a versatile and modular machine learning framework for solving prediction problems for very large output spaces, and apply it to the eXtreme Multilabel Ranking (XMR) problem: given an input instance, find and rank the most relevant items from an enormous but fixed and finite output space. We propose a three phase framework for PECOS: (i) in the first phase, PECOS organizes the output space using a semantic indexing scheme, (ii) in the second phase, PECOS uses the indexing to narrow down the output space by orders of magnitude using a machine learned matching scheme, and (iii) in the third phase, PECOS ranks the matched items using a final ranking scheme. The versatility and modularity of PECOS allows for easy plug-and-play of various choices for the indexing, matching, and ranking phases. The indexing and matching phases alleviate the data sparsity issue by leveraging correlations across different items in the output space. For the critical matching phase, we develop a recursive machine learned matching strategy with both linear and neural matchers. 
When applied to eXtreme Multilabel Ranking where the input instances are in textual form, we find that the recursive Transformer matcher gives state-of-the-art accuracy results, at the cost of two orders of magnitude increased training time compared to the recursive linear matcher. For example, on a dataset where the output space is of size 2.8 million, the recursive Transformer matcher results in a 6% increase in precision@1 (from 48.6% to 54.2%) over the recursive linear matcher but takes 100x more time to train. Thus it is up to the practitioner to evaluate the trade-offs and decide whether the increased training time and infrastructure cost is warranted for their application; indeed, the flexibility of the PECOS framework seamlessly allows different strategies to be used. We also develop very fast inference procedures which allow us to perform XMR predictions in real time; for example, inference takes less than 1 millisecond per input on the dataset with 2.8 million labels. The PECOS software is available at https://libpecos.org.

多くの大規模アプリケーションは、潜在的な候補の膨大な出力空間から関連する結果を見つけることに相当します。たとえば、大規模なカタログから最も一致する製品を見つけたり、検索エンジンで関連する検索フレーズを提案したりします。これらの問題の出力空間のサイズは、数百万から数十億に及ぶことがあり、一部のアプリケーションでは無限になることもあります。さらに、出力空間の「ロングテール」アイテムのトレーニングデータは制限されることがよくあります。幸いなことに、出力空間のアイテムは相関していることが多いため、データのスパース性の問題を軽減する機会が提供されます。この論文では、非常に大規模な出力空間の予測問題を解決するための多用途でモジュール化された機械学習フレームワークであるPrediction for Enormous and Correlated Output Spaces (PECOS)フレームワークを提案し、それをeXtreme Multilabel Ranking (XMR)問題に適用します。入力インスタンスが与えられた場合、巨大だが固定された有限の出力空間から最も関連性の高いアイテムを見つけてランク付けします。PECOSの3フェーズフレームワークを提案します。(i)最初のフェーズでは、PECOSはセマンティックインデックススキームを使用して出力空間を整理します。(ii) 2番目のフェーズでは、PECOSはインデックスを使用して、機械学習によるマッチングスキームを使用して出力空間を桁違いに絞り込みます。(iii) 3番目のフェーズでは、PECOSは最終ランキングスキームを使用して一致したアイテムをランク付けします。PECOSの汎用性とモジュール性により、インデックス、マッチング、ランキングフェーズのさまざまな選択肢を簡単にプラグアンドプレイできます。インデックスとマッチングフェーズでは、出力空間内のさまざまなアイテム間の相関関係を活用することで、データのスパース性の問題を軽減します。重要なマッチングフェーズでは、線形マッチャーとニューラルマッチャーの両方を備えた再帰的な機械学習によるマッチング戦略を開発します。入力インスタンスがテキスト形式であるeXtreme Multilabel Rankingに適用した場合、再帰Transformerマッチャーは最先端の精度結果をもたらしますが、再帰線形マッチャーと比較してトレーニング時間が2桁長くなります。たとえば、出力スペースのサイズが280万のデータセットでは、再帰Transformerマッチャーは再帰線形マッチャーよりもprecision@1が6% (48.6%から54.2%)増加しますが、トレーニングには100倍の時間がかかります。したがって、トレードオフを評価し、トレーニング時間とインフラストラクチャコストの増加がアプリケーションに正当化されるかどうかを判断するのは実践者次第です。実際、PECOSフレームワークの柔軟性により、さまざまな戦略をシームレスに使用できます。また、XMR予測をリアルタイムで実行できる非常に高速な推論手順も開発しています。たとえば、280万ラベルのデータセットでは、推論には入力ごとに1ミリ秒未満しかかかりません。PECOSソフトウェアはhttps://libpecos.orgから入手できます。

Unlabeled Data Help in Graph-Based Semi-Supervised Learning: A Bayesian Nonparametrics Perspective
グラフベースの半教師あり学習においてラベルなしデータは役に立つ: ベイズノンパラメトリクスの視点

In this paper we analyze the graph-based approach to semi-supervised learning under a manifold assumption. We adopt a Bayesian perspective and demonstrate that, for a suitable choice of prior constructed with sufficiently many unlabeled data, the posterior contracts around the truth at a rate that is minimax optimal up to a logarithmic factor. Our theory covers both regression and classification.

この論文では、多様体を仮定した半教師あり学習へのグラフベースのアプローチを分析します。ベイズの視点を採用し、十分な数のラベル付けされていないデータを使用して構築された事前分布の適切な選択に対して、事後分布は対数因子までのミニマックス最適速度で真実の周りを収縮することを示します。私たちの理論は、回帰と分類の両方をカバーしています。

Rethinking Nonlinear Instrumental Variable Models through Prediction Validity
予測妥当性による非線形操作変数モデルの再考

Instrumental variables (IV) are widely used in the social and health sciences in situations where a researcher would like to measure a causal effect but cannot perform an experiment. For valid causal inference in an IV model, there must be external (exogenous) variation that (i) has a sufficiently large impact on the variable of interest (called the relevance assumption) and where (ii) the only pathway through which the external variation impacts the outcome is via the variable of interest (called the exclusion restriction). For statistical inference, researchers must also make assumptions about the functional form of the relationship between the three variables. Current practice assumes (i) and (ii) are met, then postulates a functional form with limited input from the data. In this paper, we describe a framework that leverages machine learning to validate these typically unchecked but consequential assumptions in the IV framework, providing the researcher empirical evidence about the quality of the instrument given the data at hand. Central to the proposed approach is the idea of prediction validity. Prediction validity checks that error terms — which should be independent from the instrument — cannot be modeled with machine learning any better than a model that is identically zero. We use prediction validity to develop both one-stage and two-stage approaches for IV, and demonstrate their performance on an example relevant to climate change policy.

操作変数(IV)は、研究者が因果効果を測定したいが実験を実行できない状況で、社会科学や健康科学で広く使用されています。IVモデルで有効な因果推論を行うには、(i)対象変数に十分大きな影響を与え(関連性仮定と呼ばれる)、かつ(ii)結果に影響を与える唯一の経路が対象変数を経由するものである(除外制約と呼ばれる)ような外部(外生)変動が存在する必要があります。統計的推論では、研究者は3つの変数間の関係の関数形についても仮定する必要があります。現在の慣行では、(i)と(ii)が満たされていると仮定したうえで、データからの情報をほとんど使わずに関数形を仮定します。この論文では、機械学習を活用して、IVフレームワークで通常チェックされないが重大な影響を持つこれらの仮定を検証し、手元にあるデータに基づいて操作変数の品質に関する経験的証拠を研究者に提供するフレームワークについて説明します。提案されたアプローチの中心となるのは、予測妥当性という考え方です。予測妥当性は、誤差項（操作変数から独立しているはず）を、機械学習によって恒等的にゼロのモデルよりもうまくモデル化できないことを確認します。予測妥当性を使用して、IVの1段階アプローチと2段階アプローチの両方を開発し、気候変動政策に関連する例でそのパフォーマンスを実証します。

Attraction-Repulsion Spectrum in Neighbor Embeddings
隣接埋め込みにおける引力-反発スペクトル

Neighbor embeddings are a family of methods for visualizing complex high-dimensional data sets using kNN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between the attractive and the repulsive forces in t-SNE using the exaggeration parameter yields a spectrum of embeddings, which is characterized by a simple trade-off: stronger attraction can better represent continuous manifold structures, while stronger repulsion can better represent discrete cluster structures and yields higher kNN recall. We find that UMAP embeddings correspond to t-SNE with increased attraction; mathematical analysis shows that this is because the negative sampling optimization strategy employed by UMAP strongly lowers the effective repulsion. Likewise, ForceAtlas2, commonly used for visualizing developmental single-cell transcriptomic data, yields embeddings corresponding to t-SNE with the attraction increased even more. At the extreme of this spectrum lie Laplacian eigenmaps. Our results demonstrate that many prominent neighbor embedding algorithms can be placed onto the attraction-repulsion spectrum, and highlight the inherent trade-offs between them.

近傍埋め込みは、kNNグラフを使用して複雑な高次元データセットを視覚化する一連の手法です。低次元埋め込みを見つけるために、これらのアルゴリズムは、隣接するポイントのペア間の引力とすべてのポイント間の斥力とを組み合わせます。このようなアルゴリズムの最も一般的な例の1つがt-SNEです。ここでは、誇張パラメーターを使用してt-SNEの引力と斥力のバランスを変更すると、単純なトレードオフを特徴とする埋め込みのスペクトルが得られることを経験的に示します。つまり、引力が強いほど連続的な多様体構造をより適切に表現でき、斥力が強いほど離散的なクラスター構造をより適切に表現でき、kNNリコールが高くなります。UMAP埋め込みは、引力が増したt-SNEに対応することがわかりました。数学的分析により、これはUMAPが採用しているネガティブサンプリング最適化戦略によって有効斥力が大幅に低下するためであることが示されています。同様に、発達中の単一細胞トランスクリプトームデータを視覚化するために一般的に使用されるForceAtlas2は、t-SNEに対応する埋め込みを生成し、その引力はさらに増大します。このスペクトルの極限には、ラプラシアン固有マップがあります。私たちの結果は、多くの著名な近傍埋め込みアルゴリズムが引力-反発スペクトル上に配置できることを示し、それらの間の固有のトレードオフを強調しています。
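The role of the exaggeration parameter can be written out explicitly. The block below is a sketch in the usual t-SNE implementation convention, not a formula quoted from the paper. The symbols $\rho$ (exaggeration factor), $w_{ij}$ (Student-t similarity), and $F_i$ (update direction for point $i$) are notation assumed here. Exaggeration multiplies only the attractive term, so $\rho>1$ shifts the balance toward attraction and $\rho=1$ recovers the standard t-SNE gradient.

```latex
F_i \;=\; 4\Big[\,\rho \sum_{j} p_{ij}\, w_{ij}\,(y_i - y_j)
      \;-\; \sum_{j} q_{ij}\, w_{ij}\,(y_i - y_j)\Big],
\qquad
w_{ij} = \frac{1}{1+\|y_i-y_j\|^{2}},
\qquad
q_{ij} = \frac{w_{ij}}{\sum_{k\neq l} w_{kl}}
```

The first sum runs effectively over kNN-graph neighbors, where $p_{ij}>0$, and the second runs over all pairs. This matches the description of an attractive force between neighboring pairs and a repulsive force between all points.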

Multiple Testing in Nonparametric Hidden Markov Models: An Empirical Bayes Approach
ノンパラメトリック隠れマルコフモデルにおける多重検定:経験的ベイズアプローチ

Given a nonparametric Hidden Markov Model (HMM) with two states, the question of constructing efficient multiple testing procedures is considered, treating the states as unknown null and alternative hypotheses. A procedure is introduced, based on nonparametric empirical Bayes ideas, that controls the False Discovery Rate (FDR) at a user-specified level. Guarantees on power are also provided, in the form of a control of the true positive rate. One of the key steps in the construction requires supremum-norm convergence of preliminary estimators of the emission densities of the HMM. We provide the existence of such estimators, with convergence at the optimal minimax rate, for the case of a HMM with $J\ge 2$ states, which is of independent interest.

2つの状態を持つノンパラメトリック隠れマルコフモデル(HMM)が与えられたとき、状態を未知の帰無仮説と対立仮説として扱い、効率的な多重検定手順を構築する問題を考えます。ノンパラメトリックな経験的ベイズの考え方に基づいて、ユーザー指定のレベルで偽発見率(FDR)を制御する手順を導入します。また、真陽性率の制御という形で、検出力の保証も提供します。この構築における重要なステップの1つでは、HMMの出力密度(emission density)の予備的な推定量が上限ノルム(supremum norm)で収束することが必要です。$J\ge 2$個の状態を持つHMMの場合について、最適なミニマックスレートで収束するそのような推定量の存在を示します。この結果はそれ自体でも興味深いものです。
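A common empirical-Bayes way to target a given FDR level is to threshold posterior null probabilities. The sketch below shows that generic rule and is not necessarily the exact procedure of the paper. It assumes the posterior probabilities `ell[i]` that observation `i` comes from the null state have already been estimated (for an HMM, for instance, by a forward-backward pass with plug-in parameters). The function name is hypothetical.

```python
def reject_by_posterior_null(ell, alpha):
    """Reject the hypotheses with the smallest posterior null probabilities
    for as long as their running average stays at or below alpha."""
    order = sorted(range(len(ell)), key=lambda i: ell[i])
    rejected, running_sum = [], 0.0
    for k, i in enumerate(order, start=1):
        running_sum += ell[i]
        # The running average estimates the FDR of rejecting the first k hypotheses
        if running_sum / k > alpha:
            break
        rejected.append(i)
    return rejected


ell = [0.01, 0.9, 0.05, 0.3, 0.02]          # assumed posterior null probabilities
rejected = reject_by_posterior_null(ell, alpha=0.1)
```

Because the values are visited in increasing order, the running average never decreases, so stopping at the first violation yields the largest admissible rejection set.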

Regularized K-means Through Hard-Thresholding
ハードスレッショルディングによる正則化されたK平均法

We study a framework for performing regularized K-means, based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared in a theoretical analysis and an extensive Monte Carlo simulation study. Based on the results, we propose a new method called hard-threshold K-means (HTK-means), which uses an ℓ0 penalty to induce sparsity. HTK-means is a fast and competitive sparse clustering method which is easily interpretable, as is illustrated on several real data examples. In this context, new graphical displays are presented and used to gain further insight into the data sets.

私たちは、クラスター中心のサイズの直接的なペナルティに基づいて、正則化されたK平均法を実行するためのフレームワークを研究します。さまざまなペナルティ戦略が検討され、理論分析と広範なモンテカルロシミュレーション研究で比較されます。この結果に基づいて、l0ペナルティを使用してスパース性を誘導するハードスレッショルドK平均法(HTK-means)と呼ばれる新しい方法を提案します。HTK平均法は、いくつかの実際のデータの例に示されているように、簡単に解釈できる高速で競争力のあるスパースクラスタリング手法です。このコンテキストでは、新しいグラフィカル表示が表示され、データセットに関するさらなる洞察を得るために使用されます。

Gauss-Legendre Features for Gaussian Process Regression
ガウス過程回帰のためのガウス・ルジャンドル特徴

Gaussian processes provide a powerful probabilistic kernel learning framework, which allows learning high quality nonparametric regression models via methods such as Gaussian process regression. Nevertheless, the learning phase of Gaussian process regression requires massive computations which are not realistic for large datasets. In this paper, we present a Gauss-Legendre quadrature based approach for scaling up Gaussian process regression via a low rank approximation of the kernel matrix. We utilize the structure of the low rank approximation to achieve effective hyperparameter learning, training and prediction. Our method is very much inspired by the well-known random Fourier features approach, which also builds low-rank approximations via numerical integration. However, our method is capable of generating high quality approximation to the kernel using an amount of features which is poly-logarithmic in the number of training points, while similar guarantees will require an amount that is at the very least linear in the number of training points when using random Fourier features. Furthermore, the structure of the low-rank approximation that our method builds is subtly different from the one generated by random Fourier features, and this enables much more efficient hyperparameter learning. The utility of our method for learning with low-dimensional datasets is demonstrated using numerical experiments.

ガウス過程は強力な確率的カーネル学習フレームワークを提供し、ガウス過程回帰などの方法を介して高品質のノンパラメトリック回帰モデルを学習できます。ただし、ガウス過程回帰の学習フェーズでは膨大な計算が必要であり、大規模なデータセットでは現実的ではありません。この論文では、カーネル行列の低ランク近似を介してガウス過程回帰をスケールアップするためのガウス-ルジャンドル求積法ベースのアプローチを紹介します。低ランク近似の構造を利用して、効果的なハイパーパラメータ学習、トレーニング、および予測を実現します。私たちの方法は、数値積分を介して低ランク近似を構築する、よく知られているランダムフーリエ特徴アプローチに大きく影響を受けています。ただし、私たちの方法は、トレーニングポイントの数に対して多重対数である特徴量を使用してカーネルに高品質の近似を生成することができますが、ランダムフーリエ特徴を使用する場合、同様の保証には、トレーニングポイントの数に対して少なくとも線形である量が必要です。さらに、私たちの方法が構築する低ランク近似の構造は、ランダムフーリエ特徴によって生成されるものとは微妙に異なり、これにより、はるかに効率的なハイパーパラメータ学習が可能になります。低次元データセットでの学習における私たちの方法の有用性は、数値実験を使用して実証されています。
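The idea of building kernel features by numerical integration can be illustrated in one dimension. This is a minimal sketch, not the paper's construction: it assumes the unit-lengthscale Gaussian kernel $k(x,y)=e^{-(x-y)^2/2}$, writes it as the Fourier integral $\int \mathcal{N}(\omega;0,1)\cos(\omega(x-y))\,d\omega$, truncates the integral to an assumed interval `[-cutoff, cutoff]`, and applies a Gauss-Legendre rule. The node count, the cutoff, and all function names are illustrative choices.

```python
import math


def gauss_legendre(n, tol=1e-14):
    """Nodes and weights of the n-point Gauss-Legendre rule on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        t = math.cos(math.pi * (i - 0.25) / (n + 0.5))    # initial guess for the root
        for _ in range(100):                              # Newton iteration on P_n
            p_prev, p = 1.0, t                            # P_0(t), P_1(t)
            for k in range(2, n + 1):                     # three-term recurrence
                p_prev, p = p, ((2 * k - 1) * t * p - (k - 1) * p_prev) / k
            dp = n * (t * p - p_prev) / (t * t - 1.0)     # derivative P_n'(t)
            step = p / dp
            t -= step
            if abs(step) < tol:
                break
        nodes.append(t)
        weights.append(2.0 / ((1.0 - t * t) * dp * dp))
    return nodes, weights


def gl_features(x, n=40, cutoff=8.0):
    """Feature map phi with phi(x) . phi(y) close to exp(-(x - y)**2 / 2), scalar x."""
    nodes, weights = gauss_legendre(n)
    phi = []
    for t, wt in zip(nodes, weights):
        freq = cutoff * t                                 # map [-1, 1] to [-cutoff, cutoff]
        density = math.exp(-0.5 * freq * freq) / math.sqrt(2.0 * math.pi)
        c = math.sqrt(cutoff * wt * density)              # square root of quadrature weight
        phi.extend([c * math.cos(freq * x), c * math.sin(freq * x)])
    return phi


x, y = 0.3, -0.5
approx = sum(a * b for a, b in zip(gl_features(x), gl_features(y)))
exact = math.exp(-0.5 * (x - y) ** 2)
```

Unlike random Fourier features, the frequencies here are deterministic quadrature nodes, so the approximation error decays quickly with the number of features for smooth low-dimensional kernels.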

When Hardness of Approximation Meets Hardness of Learning
近似の硬さが学習の難しさと出会うとき

A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples. The hypothesis of the learner is taken from some fixed class of functions (e.g., linear classifiers, neural networks etc.). A failure of the learning algorithm can occur due to two possible reasons: wrong choice of hypothesis class (hardness of approximation), or failure to find the best function within the hypothesis class (hardness of learning). Although both approximation and learnability are important for the success of the algorithm, they are typically studied separately. In this work, we show a single hardness property that implies both hardness of approximation using linear classes and shallow networks, and hardness of learning using correlation queries and gradient-descent. This allows us to obtain new results on hardness of approximation and learnability of parity functions, DNF formulas and $AC^0$ circuits.

教師あり学習アルゴリズムは、ラベル付けされた例の分布にアクセスでき、例に正しくラベルを付ける関数(仮説)を返す必要があります。学習器の仮説は、ある固定された関数クラス(線形分類器、ニューラルネットワークなど)から選ばれます。学習アルゴリズムの失敗には2つの原因が考えられます。仮説クラスの選択が誤っている場合(近似の困難性)と、仮説クラス内で最適な関数を見つけられない場合(学習の困難性)です。近似と学習可能性はどちらもアルゴリズムの成功にとって重要ですが、通常は別々に研究されます。この研究では、線形クラスと浅いネットワークを使用した近似の困難性と、相関クエリと勾配降下法を使用した学習の困難性の両方を導く単一の困難性の性質を示します。これにより、パリティ関数、DNF式、$AC^0$回路の近似の困難性と学習可能性に関する新しい結果を得ることができます。

Accelerating Adaptive Cubic Regularization of Newton’s Method via Random Sampling
ランダムサンプリングによるニュートン法の適応3次正則化の加速

In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though overall the objective is assumed to be smooth and convex. Our approach to solving such a model uses the framework of cubic regularization of Newton’s method. As is well known, the crux in cubic regularization is its utilization of the Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we resort to approximating the Hessian matrix via sub-sampling. In particular, we propose to compute an approximated Hessian matrix by either uniformly or non-uniformly sub-sampling the components of the objective. Based upon such a sampling strategy, we develop accelerated adaptive cubic regularization approaches and provide theoretical guarantees on global iteration complexity of $O(\epsilon^{-1/3})$ with high probability, which matches that of the original accelerated cubic regularization methods of Jiang et al. (2020) using the full Hessian information. Interestingly, we also show that in the worst case scenario our algorithm still achieves an $O(\epsilon^{-5/6}\log(\epsilon^{-1}))$ iteration complexity bound. The proof techniques are new to our knowledge and can be of independent interest. Experimental results on the regularized logistic regression problems demonstrate a clear effect of acceleration on several real data sets.

この論文では、目的関数が多数の(非凸でもよい)関数の和であり、全体としては滑らかで凸であると仮定される制約なし最適化モデルを検討します。このようなモデルを解くために、ニュートン法の3次正則化のフレームワークを使用します。よく知られているように、3次正則化の要点はヘッセ情報の利用であり、大規模な問題では計算コストが高くなる可能性があります。これに対処するために、サブサンプリングによってヘッセ行列を近似します。特に、目的関数の成分を均一または不均一にサブサンプリングすることにより、近似ヘッセ行列を計算することを提案します。このようなサンプリング戦略に基づいて、加速された適応型3次正則化アプローチを開発し、高い確率で$O(\epsilon^{-1/3})$のグローバル反復計算量を理論的に保証します。これは、完全なヘッセ情報を使用するJiangら(2020)のオリジナルの加速3次正則化法の計算量と一致します。興味深いことに、最悪のシナリオでも、アルゴリズムは$O(\epsilon^{-5/6}\log(\epsilon^{-1}))$の反復計算量の上界を達成することも示します。証明手法は私たちの知る限り新しいものであり、それ自体でも興味深い可能性があります。正則化ロジスティック回帰問題に関する実験結果は、いくつかの実際のデータセットで加速の明確な効果を示しています。

Machine Learning on Graphs: A Model and Comprehensive Taxonomy
グラフでの機械学習:モデルと包括的な分類法

There has been a surge of recent interest in graph representation learning (GRL). GRL methods have generally fallen into three main categories, based on the availability of labeled data. The first, network embedding, focuses on learning unsupervised representations of relational structure. The second, graph regularized neural networks, leverages graphs to augment neural network losses with a regularization objective for semi-supervised learning. The third, graph neural networks, aims to learn differentiable functions over discrete topologies with arbitrary structure. However, despite the popularity of these areas there has been surprisingly little work on unifying the three paradigms. Here, we aim to bridge the gap between network embedding, graph regularization and graph neural networks. We propose a comprehensive taxonomy of GRL methods, aiming to unify several disparate bodies of work. Specifically, we propose the GraphEDM framework, which generalizes popular algorithms for semi-supervised learning (e.g. GraphSage, GCN, GAT), and unsupervised learning (e.g. DeepWalk, node2vec) of graph representations into a single consistent approach. To illustrate the generality of GraphEDM, we fit over thirty existing methods into this framework. We believe that this unifying view both provides a solid foundation for understanding the intuition behind these methods, and enables future research in the area.

グラフ表現学習(GRL)への関心が最近急増しています。GRL手法は、ラベル付きデータの可用性に基づいて、一般的に3つの主なカテゴリに分類されます。1つ目はネットワーク埋め込みで、関係構造の教師なし表現の学習に焦点を当てています。2つ目はグラフ正規化ニューラルネットワークで、半教師あり学習の正規化目的でグラフを利用してニューラルネットワークの損失を増強します。3つ目はグラフニューラルネットワークで、任意の構造を持つ離散トポロジー上の微分可能関数の学習を目指します。ただし、これらの分野が人気であるにもかかわらず、3つのパラダイムを統合する作業は驚くほど少ないです。ここでは、ネットワーク埋め込み、グラフ正規化、グラフニューラルネットワーク間のギャップを埋めることを目指します。GRL手法の包括的な分類を提案し、いくつかの異なる作業の統合を目指します。具体的には、グラフ表現の半教師あり学習(GraphSage、GCN、GATなど)と教師なし学習(DeepWalk、node2vecなど)の一般的なアルゴリズムを単一の一貫したアプローチに一般化するGraphEDMフレームワークを提案します。GraphEDMの一般性を示すために、30を超える既存のメソッドをこのフレームワークに組み込みました。この統一的な視点は、これらのメソッドの背後にある直感を理解するための強固な基盤を提供し、この分野での将来の研究を可能にすると考えています。

Generalized Ambiguity Decomposition for Ranking Ensemble Learning
ランク付けアンサンブル学習のための一般化曖昧性分解

Error decomposition analysis is a key problem for ensemble learning, which indicates that proper combination of multiple models can achieve better performance than any individual one. Existing theoretical research of ensemble learning focuses on regression or classification tasks. There is limited theoretical research for ranking ensemble. In this paper, we first generalize the ambiguity decomposition theory from regression ensemble to ranking ensemble, which proves the effectiveness of ranking ensemble with consideration of list-wise ranking information. According to the generalized theory, we propose an explicit diversity measure for ranking ensemble, which can be used to enhance the diversity of ensemble and improve the performance of ensemble model. Furthermore, we adopt an adaptive learning scheme to learn query-dependent ensemble weights, which can fit into the generalized theory and help to further improve the performance of ensemble model. Extensive experiments on recommendation and information retrieval tasks demonstrate the effectiveness and theoretical advantages of the proposed method compared with several state-of-the-art methods.

誤差分解分析はアンサンブル学習の重要な問題であり、複数のモデルを適切に組み合わせることで、個々のモデルよりも優れたパフォーマンスを実現できることを示しています。既存のアンサンブル学習の理論的研究は、回帰または分類タスクに焦点を当てています。ランキングアンサンブルの理論的研究は限られています。この論文では、まず、回帰アンサンブルからランキングアンサンブルに曖昧さ分解理論を一般化し、リストごとのランキング情報を考慮したランキングアンサンブルの有効性を証明します。一般化された理論によれば、ランキングアンサンブルの明示的な多様性尺度を提案します。これは、アンサンブルの多様性を高め、アンサンブルモデルのパフォーマンスを向上させるために使用できます。さらに、適応学習スキームを採用してクエリ依存のアンサンブル重みを学習します。これは一般化理論に適合し、アンサンブルモデルのパフォーマンスをさらに向上させるのに役立ちます。推奨および情報検索タスクに関する広範な実験により、いくつかの最先端の方法と比較して、提案された方法の有効性と理論的利点が実証されています。
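
For orientation, the classical ambiguity decomposition for a regression ensemble, which the abstract says is being generalized to ranking, can be written as follows. This is only the well-known squared-error form; the symbols $M$, $w_i$, $f_i$ and $\bar f$ are notation chosen here for illustration, and the paper's ranking version is not reproduced.

$$
\bar f(x) = \sum_{i=1}^{M} w_i f_i(x), \qquad w_i \ge 0, \quad \sum_{i=1}^{M} w_i = 1,
$$

$$
\bigl(\bar f(x) - y\bigr)^2 = \sum_{i=1}^{M} w_i \bigl(f_i(x) - y\bigr)^2 - \sum_{i=1}^{M} w_i \bigl(f_i(x) - \bar f(x)\bigr)^2 .
$$

The first sum is the weighted average error of the individual models and the second is the ambiguity (diversity) term. Because the second sum is non-negative, the ensemble error never exceeds the weighted average individual error.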

CD-split and HPD-split: Efficient Conformal Regions in High Dimensions
CD分割とHPD分割:高次元における効率的な共形領域

Conformal methods create prediction bands that control average coverage assuming solely i.i.d. data. Although the literature has mostly focused on prediction intervals, more general regions can often better represent uncertainty. For instance, a bimodal target is better represented by the union of two intervals. Such prediction regions are obtained by CD-split, which combines the split method and a data-driven partition of the feature space which scales to high dimensions. CD-split however contains many tuning parameters, and their role is not clear. In this paper, we provide new insights on CD-split by exploring its theoretical properties. In particular, we show that CD-split converges asymptotically to the oracle highest predictive density set and satisfies local and asymptotic conditional validity. We also present simulations that show how to tune CD-split. Finally, we introduce HPD-split, a variation of CD-split that requires less tuning, and show that it shares the same theoretical guarantees as CD-split. In a wide variety of our simulations, CD-split and HPD-split have better conditional coverage and yield smaller prediction regions than other methods.

コンフォーマル法は、i.i.d.データのみを想定して平均カバレッジを制御する予測バンドを作成します。文献では主に予測区間に焦点を当てていますが、より一般的な領域の方が不確実性をより適切に表すことができる場合がよくあります。たとえば、バイモーダルターゲットは、2つの区間の結合によってより適切に表されます。このような予測領域は、分割法と、高次元にスケーリングされる特徴空間のデータ駆動型分割を組み合わせたCD分割によって取得されます。ただし、CD分割には多くの調整パラメーターが含まれており、その役割は明確ではありません。この論文では、CD分割の理論的特性を検討することで、CD分割に関する新しい洞察を提供します。特に、CD分割がオラクル最高予測密度セットに漸近的に収束し、ローカルおよび漸近条件付き妥当性を満たすことを示します。また、CD分割を調整する方法を示すシミュレーションも示します。最後に、CD分割のバリエーションで調整が少なくて済むHPD分割を紹介し、CD分割と同じ理論的保証を共有することを示します。私たちのさまざまなシミュレーションでは、CD分割とHPD分割は他の方法よりも条件付きカバレッジが優れており、予測領域が小さくなります。
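
The "split method" mentioned in the abstract can be illustrated with a minimal sketch of plain split conformal prediction. This is not CD-split or HPD-split: it uses absolute residuals around a least-squares line as conformity scores, whereas the paper uses density-based scores and a data-driven partition of the feature space. The function names and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def split_conformal_interval(x_train, y_train, x_cal, y_cal, x_new, alpha=0.1):
    """Split conformal intervals around a least-squares line (illustrative base model)."""
    # Fit the base predictor on the training part only.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    # Conformity scores on the held-out calibration part.
    scores = np.sort(np.abs(y_cal - (slope * x_cal + intercept)))
    n = len(scores)
    # Finite-sample-corrected rank of the calibration quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else scores[k - 1]
    center = slope * x_new + intercept
    return center - q, center + q

def demo_coverage(seed=0, n=2000, alpha=0.1):
    """Empirical coverage on synthetic i.i.d. data (hypothetical example)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=3 * n)
    y = 2 * x + rng.normal(scale=0.5, size=3 * n)
    lo, hi = split_conformal_interval(
        x[:n], y[:n], x[n:2 * n], y[n:2 * n], x[2 * n:], alpha
    )
    y_test = y[2 * n:]
    return np.mean((lo <= y_test) & (y_test <= hi))
```

Calling `demo_coverage()` should give an empirical coverage close to the nominal level `1 - alpha`, which is the marginal guarantee that split conformal methods provide under i.i.d. data.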

Robust and scalable manifold learning via landmark diffusion for long-term medical signal processing
長期の医療信号処理のためのランドマーク拡散による堅牢でスケーラブルな多様体学習

Motivated by analyzing long-term physiological time series, we design a robust and scalable spectral embedding algorithm that we refer to as RObust and Scalable Embedding via LANdmark Diffusion (Roseland). The key is designing a diffusion process on the dataset where the diffusion is done via a small subset called the landmark set. Roseland is theoretically justified under the manifold model, and its computational complexity is comparable with commonly applied subsampling scheme such as the Nyström extension. Specifically, when there are $n$ data points in $\mathbb{R}^q$ and $n^\beta$ points in the landmark set, where $\beta\in (0,1)$, the computational complexity of Roseland is $O(n^{1+2\beta}+qn^{1+\beta})$, while that of Nyström is $O(n^{2.81\beta}+qn^{1+2\beta})$. To demonstrate the potential of Roseland, we apply it to three datasets and compare it with several other existing algorithms. First, we apply Roseland to the task of spectral clustering using the MNIST dataset (70,000 images), achieving 85% accuracy when the dataset is clean and 78% accuracy when the dataset is noisy. Compared with other subsampling schemes, overall Roseland achieves a better performance. Second, we apply Roseland to the task of image segmentation using images from COCO. Finally, we demonstrate how to apply Roseland to explore long-term arterial blood pressure waveform dynamics during a liver transplant operation lasting for 12 hours. In conclusion, Roseland is scalable and robust, and it has a potential for analyzing large datasets.

長期にわたる生理学的時系列を分析することに着目し、私たちは堅牢でスケーラブルなスペクトル埋め込みアルゴリズムを設計しました。このアルゴリズムは、ランドマーク拡散によるロバストかつスケーラブルな埋め込み(Roseland)と呼ばれています。重要なのは、ランドマークセットと呼ばれる小さなサブセットを介して拡散が行われるデータセット上の拡散プロセスを設計することです。Roselandは多様体モデルの下で理論的に正当化されており、その計算量はNyström拡張などの一般的に適用されるサブサンプリング方式と同等です。具体的には、$\mathbb{R}^q$に$n$個のデータポイントがあり、ランドマークセットに$n^\beta$個のポイントがある場合($\beta\in (0,1)$)、Roselandの計算量は$O(n^{1+2\beta}+qn^{1+\beta})$ですが、Nyströmの計算量は$O(n^{2.81\beta}+qn^{1+2\beta})$です。Roselandの可能性を示すために、3つのデータセットに適用し、他のいくつかの既存のアルゴリズムと比較します。まず、MNISTデータセット(70,000枚の画像)を使用してRoselandをスペクトルクラスタリングのタスクに適用し、データセットがクリーンな場合は85%の精度、データセットがノイズの多い場合は78%の精度を達成しました。他のサブサンプリング方式と比較して、Roselandは全体的に優れたパフォーマンスを発揮します。次に、COCOの画像を使用してRoselandを画像セグメンテーションのタスクに適用します。最後に、12時間続く肝臓移植手術中の長期的な動脈血圧波形のダイナミクスを調査するためにRoselandを適用する方法を示します。結論として、Roselandはスケーラブルで堅牢であり、大規模なデータセットを分析する可能性を秘めています。
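
As a worked instance of the two complexity bounds quoted in the abstract, take the illustrative choice $\beta = 1/2$, so that the landmark set has $\sqrt{n}$ points. This value of $\beta$ is an assumption made here for the arithmetic and is not singled out in the abstract.

$$
\text{Roseland: } O\bigl(n^{1+2\beta} + q\,n^{1+\beta}\bigr)\Big|_{\beta=1/2} = O\bigl(n^{2} + q\,n^{3/2}\bigr),
$$

$$
\text{Nyström: } O\bigl(n^{2.81\beta} + q\,n^{1+2\beta}\bigr)\Big|_{\beta=1/2} = O\bigl(n^{1.405} + q\,n^{2}\bigr).
$$

For this $\beta$ both bounds are at most quadratic in $n$ up to the factor $q$, which agrees with the abstract's statement that the two costs are comparable.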

A Distribution Free Conditional Independence Test with Applications to Causal Discovery
分布によらない条件付き独立性検定と因果探索への応用

This paper is concerned with test of the conditional independence. We first establish an equivalence between the conditional independence and the mutual independence. Based on the equivalence, we propose an index to measure the conditional dependence by quantifying the mutual dependence among the transformed variables. The proposed index has several appealing properties. (a) It is distribution free since the limiting null distribution of the proposed index does not depend on the population distributions of the data. Hence the critical values can be tabulated by simulations. (b) The proposed index ranges from zero to one, and equals zero if and only if the conditional independence holds. Thus, it has nontrivial power under the alternative hypothesis. (c) It is robust to outliers and heavy-tailed data since it is invariant to conditional strictly monotone transformations. (d) It has low computational cost since it incorporates a simple closed-form expression and can be implemented in quadratic time. (e) It is insensitive to tuning parameters involved in the calculation of the proposed index. (f) The new index is applicable for multivariate random vectors as well as for discrete data. All these properties enable us to use the new index as statistical inference tools for various data. The effectiveness of the method is illustrated through extensive simulations and a real application on causal discovery.

この論文は、条件付き独立性の検定を扱います。まず、条件付き独立性と相互独立性の間に同等性を確立します。この同等性に基づいて、変換された変数間の相互依存性を定量化することにより、条件付き依存性を測定するための指標を提案します。提案された指標には、いくつかの魅力的な特性があります。(a)提案された指標の極限帰無分布はデータの母集団分布に依存しないため、分布フリーです。したがって、臨界値はシミュレーションによって表にすることができます。(b)提案された指標の範囲は0から1であり、条件付き独立性が成り立つ場合にのみ0になります。したがって、対立仮説の下で自明でない検出力を持ちます。(c)条件付きの厳密に単調な変換に対して不変であるため、外れ値や裾の重いデータに対して堅牢です。(d)単純な閉形式の式を組み込んでおり、2次時間で実装できるため、計算コストが低くなります。(e)提案された指標の計算に含まれるチューニングパラメーターの影響を受けにくいです。(f)新しい指標は、離散データだけでなく多変量ランダムベクトルにも適用できます。これらすべての特性により、新しい指標をさまざまなデータの統計的推論ツールとして使用できます。この方法の有効性は、広範なシミュレーションと因果探索への実際の応用を通じて実証されています。

Distributed Bayesian Varying Coefficient Modeling Using a Gaussian Process Prior
ガウス過程事前分布を使用した分散ベイズ変動係数モデリング

Varying coefficient models (VCMs) are widely used for estimating nonlinear regression functions for functional data. Their Bayesian variants using Gaussian process priors on the functional coefficients, however, have received limited attention in massive data applications, mainly due to the prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We address this problem using a divide-and-conquer Bayesian approach. We first create a large number of data subsamples with much smaller sizes. Then, we formulate the VCM as a linear mixed-effects model and develop a data augmentation algorithm for obtaining MCMC draws on all the subsets in parallel. Finally, we aggregate the MCMC-based estimates of subset posteriors into a single Aggregated Monte Carlo (AMC) posterior, which is used as a computationally efficient alternative to the true posterior distribution. Theoretically, we derive minimax optimal posterior convergence rates for the AMC posteriors of both the varying coefficients and the mean regression function. We provide quantification on the orders of subset sample sizes and the number of subsets. The empirical results show that the combination schemes that satisfy our theoretical assumptions, including the AMC posterior, have better estimation performance than their main competitors across diverse simulations and in a real data analysis.

変動係数モデル(VCM)は、関数データの非線形回帰関数の推定に広く使用されています。しかし、関数係数にガウス過程事前分布を使用するベイズ変種は、主にマルコフ連鎖モンテカルロ(MCMC)アルゴリズムを使用した事後計算が非常に遅いため、大規模データアプリケーションではあまり注目されていません。私たちは、分割統治ベイズアプローチを使用してこの問題に対処します。まず、はるかに小さいサイズのデータサブサンプルを多数作成します。次に、VCMを線形混合効果モデルとして定式化し、すべてのサブセットでMCMC抽出を並列に取得するためのデータ拡張アルゴリズムを開発します。最後に、サブセット事後分布のMCMCベースの推定値を単一の集約モンテカルロ(AMC)事後分布に集約します。これは、真の事後分布に代わる計算効率の高い方法として使用されます。理論的には、変動係数と平均回帰関数の両方のAMC事後分布のミニマックス最適事後収束率を導出します。サブセットのサンプルサイズとサブセットの数の順序について定量化を提供します。実験結果では、AMC事後分布を含む理論的仮定を満たす組み合わせスキームが、さまざまなシミュレーションと実際のデータ分析で主要な競合相手よりも優れた推定パフォーマンスを発揮することを示しています。

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping
EHR表現型への応用による事前適応的半教師あり学習

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold-standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.

電子健康記録(EHR)データは、生物医学研究の豊富な情報源であり、さまざまな疾患に対する新たな洞察を得るために効果的に使用されてきました。その可能性にもかかわらず、EHRは現在、正確な表現型情報の欠如という大きな制限のため、発見研究に十分に活用されていません。このような困難を克服するために、最近の取り組みでは、カルテのレビューによって抽出されたゴールドスタンダードのラベルを含む比較的小規模なトレーニングデータセットに基づいて表現型を正確に予測する教師ありアルゴリズムの開発に注力してきました。ただし、教師あり方法では、特に候補となる特徴の数が多い場合は、一般化可能なアルゴリズムを生成するために通常、かなりの規模のトレーニングセットが必要です。この論文では、ラベルYと特徴セットXの両方が観察される小規模なラベル付きデータセットと、特徴セットXにすべての患者が利用できる代理ラベルSのみが付随する、はるかに大規模で弱いラベル付きデータセットの両方から情報を借用する半教師あり(SS) EHR表現型解析法を提案します。SはYを介してのみXと関連し、近似的に成立するという作業上の事前仮定の下で、事前知識を組み込む事前適応型半教師あり(PASS)推定量を提案します。この推定量は、事前知識に基づいて導出された方向に向かって推定量を縮小します。提案された推定量の漸近理論を導出し、質の低い事前情報に対するその効率性と堅牢性を正当化します。また、シミュレーション研究および大規模な三次医療機関での3つの実際のEHR表現型研究を通じて、さまざまなシナリオで既存の推定量よりも優れていることを実証します。

FuDGE: A Method to Estimate a Functional Differential Graph in a High-Dimensional Setting
FuDGE:高次元設定における関数微分グラフの推定手法

We consider the problem of estimating the difference between two undirected functional graphical models with shared structures. In many applications, data are naturally regarded as a vector of random functions rather than as a vector of scalars. For example, electroencephalography (EEG) data are treated more appropriately as functions of time. In such a problem, not only can the number of functions measured per sample be large, but each function is itself an infinite dimensional object, making estimation of model parameters challenging. This is further complicated by the fact that curves are usually observed only at discrete time points. We first define a functional differential graph that captures the differences between two functional graphical models and formally characterize when the functional differential graph is well defined. We then propose a method, FuDGE, that directly estimates the functional differential graph without first estimating each individual graph. This is particularly beneficial in settings where the individual graphs are dense but the differential graph is sparse. We show that FuDGE consistently estimates the functional differential graph even in a high-dimensional setting for both fully observed and discretely observed function paths. We illustrate the finite sample properties of our method through simulation studies. We also propose a competing method, the Joint Functional Graphical Lasso, which generalizes the Joint Graphical Lasso to the functional setting. Finally, we apply our method to EEG data to uncover differences in functional brain connectivity between a group of individuals with alcohol use disorder and a control group.

私たちは、共通の構造を持つ2つの無向関数グラフィカルモデル間の差を推定する問題について検討します。多くのアプリケーションでは、データはスカラーのベクトルではなく、ランダム関数のベクトルとして自然にみなされます。たとえば、脳波(EEG)データは、時間の関数としてより適切に扱われます。このような問題では、サンプルごとに測定される関数の数が多くなるだけでなく、各関数自体が無限次元オブジェクトであるため、モデルパラメーターの推定が困難になります。曲線は通常、離散的な時点でのみ観察されるという事実によって、この問題はさらに複雑になります。まず、2つの関数グラフィカルモデル間の差を捉える関数微分グラフを定義し、関数微分グラフが適切に定義される条件を正式に特徴付けます。次に、最初に個々のグラフを推定せずに関数微分グラフを直接推定する手法FuDGEを提案します。これは、個々のグラフは密であるが微分グラフは疎である設定で特に有益です。FuDGEは、高次元設定でも、完全に観測された関数パスと離散的に観測された関数パスの両方において、関数微分グラフを一致推定することを示します。シミュレーション研究を通じて、この方法の有限サンプル特性を示します。また、Joint Graphical Lassoを関数設定に一般化した、競合方法であるJoint Functional Graphical Lassoも提案します。最後に、この方法をEEGデータに適用して、アルコール使用障害を持つグループと対照群の間の機能的脳接続の違いを明らかにします。

Dependent randomized rounding for clustering and partition systems with knapsack constraints
ナップザック制約を持つクラスタリングシステムとパーティションシステムのための従属ランダム化丸め

Clustering problems are fundamental to unsupervised learning. There is an increased emphasis on fairness in machine learning and AI; one representative notion of fairness is that no single group should be over-represented among the cluster-centers. This, and much more general clustering problems, can be formulated with “knapsack” and “partition” constraints. We develop new randomized algorithms targeting such problems, and study two in particular: multi-knapsack median and multi-knapsack center. Our rounding algorithms give new approximation and pseudo-approximation algorithms for these problems. One key technical tool, which may be of independent interest, is a new tail bound analogous to Feige (2006) for sums of random variables with unbounded variances. Such bounds can be useful in inferring properties of large networks using few samples.

クラスタリングの問題は、教師なし学習の基本です。機械学習とAIでは公平性がますます重視されています。公平性の代表的な概念の一つは、クラスターセンターの中で単一のグループが過度に代表されるべきではないというものです。これと、より一般的なクラスタリングの問題は、「ナップザック」制約と「パーティション」制約を使用して定式化できます。このような問題を対象とした新しいランダム化アルゴリズムを開発し、特にマルチナップザック中央値とマルチナップザックセンターの2つを研究しています。私たちの丸めアルゴリズムは、これらの問題に対して新しい近似アルゴリズムと擬似近似アルゴリズムを提供します。独立した関心事となる可能性のある重要な技術ツールの1つは、Feige (2006)に類似した、無制限の分散を持つ確率変数の合計に対する新しいテールバウンドです。このような境界は、少数のサンプルを使用して大規模なネットワークの特性を推論するのに役立ちます。

Posterior Asymptotics for Boosted Hierarchical Dirichlet Process Mixtures
ブースト階層的ディリクレ過程混合のための事後漸近

Bayesian hierarchical models are powerful tools for learning common latent features across multiple data sources. The Hierarchical Dirichlet Process (HDP) is invoked when the number of latent components is a priori unknown. While there is a rich literature on finite sample properties and performance of hierarchical processes, the analysis of their frequentist posterior asymptotic properties is still at an early stage. Here we establish theoretical guarantees for recovering the true data generating process when the data are modeled as mixtures over the HDP or a generalization of the HDP, which we term boosted because of the faster growth in the number of discovered latent features. By extending Schwartz’s theory to partially exchangeable sequences we show that posterior contraction rates are crucially affected by the relationship between the sample sizes corresponding to the different groups. The effect varies according to the smoothness level of the true data distributions. In the supersmooth case, when the generating densities are Gaussian mixtures, we recover the parametric rate up to a logarithmic factor, provided that the sample sizes are related in a polynomial fashion. Under ordinary smoothness assumptions more caution is needed as a polynomial deviation in the sample sizes could drastically deteriorate the convergence to the truth.

ベイズ階層モデルは、複数のデータソースに共通する潜在的特徴を学習するための強力なツールです。潜在的コンポーネントの数が事前に不明な場合は、階層ディリクレ過程(HDP)が呼び出されます。階層的プロセスの有限サンプル特性とパフォーマンスに関する文献は豊富にありますが、頻度主義事後漸近特性の分析はまだ初期段階にあります。ここでは、データがHDP上の混合として、またはHDPの一般化としてモデル化される場合に、真のデータ生成プロセスを回復するための理論的保証を確立します。これは、発見された潜在的特徴の数の増加が速いため、ブーストと呼ばれます。シュワルツの理論を部分的に交換可能なシーケンスに拡張することにより、事後収縮率が、異なるグループに対応するサンプルサイズ間の関係によって決定的に影響を受けることを示します。効果は、真のデータ分布の滑らかさのレベルによって異なります。超滑らかなケースでは、生成密度がガウス混合の場合、サンプルサイズが多項式で関連している限り、対数係数までのパラメトリックレートを回復します。通常の滑らかさの仮定では、サンプルサイズの多項式偏差によって真実への収束が大幅に悪化する可能性があるため、より注意が必要です。

Stacking for Non-mixing Bayesian Computations: The Curse and Blessing of Multimodal Posteriors
非混合ベイズ計算のためのスタッキング:多峰性事後分布の呪いと祝福

When working with multimodal Bayesian posterior distributions, Markov chain Monte Carlo (MCMC) algorithms have difficulty moving between modes, and default variational or mode-based approximate inferences will understate posterior uncertainty. And, even if the most important modes can be found, it is difficult to evaluate their relative weights in the posterior. Here we propose an approach using parallel runs of MCMC, variational, or mode-based inference to hit as many modes or separated regions as possible and then combine these using Bayesian stacking, a scalable method for constructing a weighted average of distributions. The result from stacking efficiently samples from multimodal posterior distribution, minimizes cross validation prediction error, and represents the posterior uncertainty better than variational inference, but it is not necessarily equivalent, even asymptotically, to fully Bayesian inference. We present theoretical consistency with an example where the stacked inference approximates the true data generating process from the misspecified model and a non-mixing sampler, from which the predictive performance is better than full Bayesian inference, hence the multimodality can be considered a blessing rather than a curse under model misspecification. We demonstrate practical implementation in several model families: latent Dirichlet allocation, Gaussian process regression, hierarchical regression, horseshoe variable selection, and neural networks.

マルチモーダルベイジアン事後分布を扱う場合、マルコフ連鎖モンテカルロ(MCMC)アルゴリズムではモード間の移動が困難で、デフォルトの変分またはモードベースの近似推論では事後不確実性が過小評価されます。また、最も重要なモードが見つかったとしても、事後分布におけるそれらの相対的な重みを評価することは困難です。ここでは、MCMC、変分、またはモードベースの推論を並列実行して、できるだけ多くのモードまたは分離された領域に到達し、次に、分布の加重平均を構築するためのスケーラブルな方法であるベイジアンスタッキングを使用してこれらを組み合わせるアプローチを提案します。スタッキングの結果は、マルチモーダル事後分布から効率的にサンプルを抽出し、クロスバリデーション予測誤差を最小化し、変分推論よりも事後不確実性を適切に表しますが、漸近的にも完全なベイジアン推論と必ずしも同等であるとは限りません。スタックされた推論が、誤って指定されたモデルと非混合サンプラーからの真のデータ生成プロセスを近似する例で理論的な一貫性を示します。この例では、予測パフォーマンスが完全なベイズ推論よりも優れているため、モデルの誤った指定の下では、マルチモーダル性は呪いではなく祝福であると考えられます。潜在ディリクレ配分、ガウス過程回帰、階層的回帰、馬蹄形変数選択、ニューラルネットワークなど、いくつかのモデルファミリでの実用的な実装を示します。
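
The combination step described above, a weighted average of the distributions found by parallel runs, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `pred_density[i, k]` is a hypothetical input holding the held-out (for example leave-one-out) predictive density of data point `i` under run `k`, the weights maximize the average log predictive density of the mixture, and the solver is a simple EM-style fixed-point iteration chosen here for brevity.

```python
import numpy as np

def stacking_weights(pred_density, n_iter=500):
    """Simplex weights w maximizing sum_i log(sum_k w_k * pred_density[i, k])."""
    n_points, n_runs = pred_density.shape
    w = np.full(n_runs, 1.0 / n_runs)  # start from uniform weights
    for _ in range(n_iter):
        mix = pred_density @ w                   # mixture density at each held-out point
        resp = pred_density * w / mix[:, None]   # share of each run in that density
        w = resp.mean(axis=0)                    # update; weights stay non-negative and sum to 1
    return w
```

A run whose held-out predictive densities are uniformly higher than the others' receives nearly all of the weight, while runs that explain different parts of the data share it.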

Multi-Agent Online Optimization with Delays: Asynchronicity, Adaptivity, and Optimism
遅延を伴うマルチエージェントオンライン最適化:非同期性、適応性、および楽観主義

In this paper, we provide a general framework for studying multi-agent online learning problems in the presence of delays and asynchronicities. Specifically, we propose and analyze a class of adaptive dual averaging schemes in which agents only need to accumulate gradient feedback received from the whole system, without requiring any between-agent coordination. In the single-agent case, the adaptivity of the proposed method allows us to extend a range of existing results to problems with potentially unbounded delays between playing an action and receiving the corresponding feedback. In the multi-agent case, the situation is significantly more complicated because agents may not have access to a global clock to use as a reference point; to overcome this, we focus on the information that is available for producing each prediction rather than the actual delay associated with each feedback. This allows us to derive adaptive learning strategies with optimal regret bounds, even in a fully decentralized, asynchronous environment. Finally, we also analyze an “optimistic” variant of the proposed algorithm which is capable of exploiting the predictability of problems with a slower variation and leads to improved regret bounds.

この論文では、遅延と非同期性が存在するマルチエージェントオンライン学習問題を研究するための一般的なフレームワークを提供します。具体的には、エージェントがシステム全体から受信した勾配フィードバックを蓄積するだけでよく、エージェント間の調整を必要としない、適応型デュアル平均化スキームのクラスを提案し、分析します。単一エージェントの場合、提案された方法の適応性により、アクションの実行と対応するフィードバックの受信の間に潜在的に無制限の遅延がある問題に、既存の結果の範囲を拡張できます。マルチエージェントの場合、エージェントが参照ポイントとして使用するグローバルクロックにアクセスできない可能性があるため、状況は大幅に複雑になります。これを克服するために、各フィードバックに関連する実際の遅延ではなく、各予測を生成するために利用できる情報に焦点を当てます。これにより、完全に分散化された非同期環境でも、最適な後悔境界を持つ適応型学習戦略を導き出すことができます。最後に、提案されたアルゴリズムの「楽観的」バリアントも分析します。これは、より遅い変動の問題の予測可能性を活用し、改善された後悔境界につながります。

Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits
区分定常バンディットに対処するための効率的な変化点検出

We introduce GLRklUCB, a novel algorithm for the piecewise iid non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, klUCB, with an efficient, parameter-free, change-point detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLRklUCB does not need to be calibrated based on prior knowledge on the arms’ means. We prove that this algorithm can attain a $O(\sqrt{TA\Upsilon_T\log(T)})$ regret in $T$ rounds on some “easy” instances in which there is sufficient delay between two change-points, where $A$ is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLRklUCB is also very efficient in practice, beyond easy instances.

私たちは、報酬が制限された区分的iid非定常バンディット問題に対する新しいアルゴリズムGLRklUCBを紹介します。このアルゴリズムは、効率的なバンディットアルゴリズムklUCBと、効率的でパラメーター不要の変化点検出器であるベルヌーイ一般化尤度比検定を組み合わせたもので、独立関心の新しい理論的保証を提供します。変化点検出器を使用する以前の非定常バンディットアルゴリズムとは異なり、GLRklUCBは、アームの平均に関する事前知識に基づいて調整する必要がありません。このアルゴリズムは、2つの変化点の間に十分な遅延があるいくつかの「簡単な」インスタンスで、$T$ラウンドで$O(\sqrt{TA\Upsilon_T\log(T)})$の後悔を達成できることを証明します。ここで、$A$はアームの数、$\Upsilon_T$は変化点の数で、$\Upsilon_T$の事前知識は必要ありません。$\Upsilon_T$に依存しない最近提案されたアルゴリズムとは対照的に、GLRklUCBも簡単な例を超えて実際には非常に効率的であることを示す数値研究を実行します。
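
The change-point detector named in the abstract, a Bernoulli generalized likelihood ratio test, can be sketched as a scan over all split points of a reward stream. This is an illustrative sketch only: the threshold `log(3 * n**1.5 / delta)` is a simplified choice assumed here, not the calibrated threshold analyzed in the paper, and the function names are hypothetical. The coupling with klUCB is not shown.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_change_detected(rewards, delta=0.05):
    """Return True if some split of `rewards` (values in [0, 1]) looks like a mean shift."""
    n = len(rewards)
    if n < 2:
        return False
    total = sum(rewards)
    mean_all = total / n
    threshold = math.log(3 * n ** 1.5 / delta)  # assumed, simplified threshold
    prefix = 0.0
    for s in range(1, n):
        prefix += rewards[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        # GLR statistic for "one mean" versus "a change after sample s".
        stat = (s * bernoulli_kl(mean_left, mean_all)
                + (n - s) * bernoulli_kl(mean_right, mean_all))
        if stat >= threshold:
            return True
    return False
```

The scan uses no knowledge of the means before or after the change, which mirrors the abstract's point that the detector needs no calibration from prior information on the arms' means.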

Joint Inference of Multiple Graphs from Matrix Polynomials
行列多項式からの複数グラフの同時推論

Inferring graph structure from observations on the nodes is an important and popular network science task. Departing from the more common inference of a single graph, we study the problem of jointly inferring multiple graphs from the observation of signals at their nodes (graph signals), which are assumed to be stationary in the sought graphs. Graph stationarity implies that the mapping between the covariance of the signals and the sparse matrix representing the underlying graph is given by a matrix polynomial. A prominent example is that of Markov random fields, where the inverse of the covariance yields the sparse matrix of interest. From a modeling perspective, stationary graph signals can be used to model linear network processes evolving on a set of (not necessarily known) networks. Leveraging that matrix polynomials commute, a convex optimization method along with sufficient conditions that guarantee the recovery of the true graphs are provided when perfect covariance information is available. Particularly important from an empirical viewpoint, we provide high-probability bounds on the recovery error as a function of the number of signals observed and other key problem parameters. Numerical experiments demonstrate the effectiveness of the proposed method with perfect covariance information as well as its robustness in the noisy regime.

ノードの観測からグラフ構造を推論することは、ネットワークサイエンスの重要かつ一般的なタスクです。より一般的な単一のグラフの推論から離れて、ノードでの信号(グラフ信号)の観測から複数のグラフを共同で推論する問題を研究します。信号は、探しているグラフで定常であると想定されます。グラフの定常性は、信号の共分散と基礎となるグラフを表す疎行列との間のマッピングが行列多項式によって与えられることを意味します。顕著な例はマルコフランダムフィールドで、共分散の逆関数が対象の疎行列を生成します。モデリングの観点からは、定常グラフ信号を使用して、一連の(必ずしも既知ではない)ネットワーク上で展開する線形ネットワークプロセスをモデル化できます。行列多項式の可換性を活用して、完全な共分散情報が利用可能な場合に、真のグラフの回復を保証する十分な条件とともに凸最適化手法が提供されます。経験的観点から特に重要なのは、観測された信号の数やその他の重要な問題パラメータの関数として、回復エラーの高確率境界を提供することです。数値実験により、完全な共分散情報を持つ提案手法の有効性と、ノイズの多い状況での堅牢性が実証されています。
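
The structural fact the method leverages, that a covariance given by a matrix polynomial of the graph matrix commutes with that matrix, can be checked numerically in a few lines. The random graph, its size, and the coefficients `h` below are arbitrary values assumed for illustration; the paper's convex recovery method is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Symmetric 0/1 adjacency matrix S of a random undirected graph (no self-loops).
upper = np.triu(rng.random((n, n)) < 0.4, k=1).astype(float)
S = upper + upper.T

# Covariance modeled as a matrix polynomial C = h_0 I + h_1 S + h_2 S^2.
h = [1.0, 0.5, 0.25]
C = sum(h_l * np.linalg.matrix_power(S, l) for l, h_l in enumerate(h))

# Stationarity implies C S = S C, so the commutator vanishes up to rounding.
commutator_norm = np.linalg.norm(C @ S - S @ C)
```

Because the commutation constraint is linear in `S` once `C` is known, it can serve as a constraint in a convex program, which is the property the abstract refers to.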

Mutual Information Constraints for Monte-Carlo Objectives to Prevent Posterior Collapse Especially in Language Modelling
特に言語モデリングにおける事後崩壊を防ぐためのモンテカルロ目的関数に対する相互情報量制約

Posterior collapse is a common failure mode of density models trained as variational autoencoders, wherein they model the data without relying on their latent variables, rendering these variables useless. We focus on two factors contributing to posterior collapse, that have been studied separately in the literature. First, the underspecification of the model, which in an extreme but common case allows posterior collapse to be the theoretical optimum. Second, the looseness of the variational lower bound and the related underestimation of the utility of the latents. We weave these two strands of research together, specifically the tighter bounds of multi-sample Monte-Carlo objectives and constraints on the mutual information between the observable and the latent variables. The main obstacle is that the usual method of estimating the mutual information as the average Kullback-Leibler divergence between the easily available variational posterior q(z|x) and the prior does not work with Monte-Carlo objectives because their q(z|x) is not a direct approximation to the model’s true posterior p(z|x). Hence, we construct estimators of the Kullback-Leibler divergence of the true posterior from the prior by recycling samples used in the objective, with which we train models of continuous and discrete latents at much improved rate-distortion and no posterior collapse. While alleviated, the tradeoff between modelling the data and using the latents still remains, and we urge for evaluating inference methods across a range of mutual information values.

事後崩壊は、変分オートエンコーダとしてトレーニングされた密度モデルの一般的な失敗モードです。このモードでは、潜在変数に依存せずにデータをモデル化し、これらの変数を役に立たなくします。私たちは、文献で個別に研究されてきた事後崩壊に寄与する2つの要因に焦点を当てます。1つ目は、モデルの指定不足です。これは、極端ですが一般的なケースでは、事後崩壊が理論的最適値になることがあります。2つ目は、変分下限の緩さおよび関連する潜在変数の有用性の過小評価です。私たちは、マルチサンプルモンテカルロ目的関数のより厳しい境界と、観測可能な変数と潜在変数間の相互情報量に対する制約という、この2つの研究の流れを組み合わせます。主な障害は、簡単に入手できる変分事後分布q(z|x)と事前分布の間の平均Kullback-Leiblerダイバージェンスとして相互情報量を推定する通常の方法が、モンテカルロ目的関数では機能しないことです。これは、q(z|x)がモデルの真の事後分布p(z|x)の直接近似ではないためです。したがって、目的関数で使用したサンプルをリサイクルして、事前分布から真の事後分布のKullback-Leiblerダイバージェンスの推定量を構築し、これにより、レート歪みが大幅に改善され、事後分布の崩壊がない状態で、連続潜在変数と離散潜在変数のモデルをトレーニングします。軽減されたとはいえ、データのモデル化と潜在変数の使用のトレードオフは依然として残っているため、相互情報値の範囲にわたって推論方法を評価することを強くお勧めします。
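
The quantities the abstract contrasts can be written out as follows. Under the model's joint distribution $p(x, z) = p(z)\,p(x|z)$ with marginal $p(x)$, the mutual information is the average Kullback-Leibler divergence of the true posterior from the prior, and the "usual method" substitutes the variational posterior and averages over data points. The symbols $N$, $x_n$ and $\hat I_q$ are notation assumed here; the paper's sample-recycling estimators are not reproduced.

$$
I(x; z) = \mathbb{E}_{p(x)}\Bigl[\mathrm{KL}\bigl(p(z|x)\,\|\,p(z)\bigr)\Bigr],
\qquad
\hat I_q = \frac{1}{N}\sum_{n=1}^{N} \mathrm{KL}\bigl(q(z|x_n)\,\|\,p(z)\bigr).
$$

The second expression is informative about the first only to the extent that $q(z|x)$ approximates $p(z|x)$, which is the assumption the abstract says fails for multi-sample Monte-Carlo objectives.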

All You Need is a Good Functional Prior for Bayesian Deep Learning
必要なのは、ベイジアン深層学習の優れた関数型事前分布だけです

The Bayesian treatment of neural networks dictates that a prior distribution is specified over their weight and bias parameters. This poses a challenge because modern neural networks are characterized by a large number of parameters, and the choice of these priors has an uncontrolled effect on the induced functional prior, which is the distribution of the functions obtained by sampling the parameters from their prior distribution. We argue that this is a hugely limiting aspect of Bayesian deep learning, and this work tackles this limitation in a practical and effective way. Our proposal is to reason in terms of functional priors, which are easier to elicit, and to “tune” the priors of neural network parameters in a way that they reflect such functional priors. Gaussian processes offer a rigorous framework to define prior distributions over functions, and we propose a novel and robust framework to match their prior with the functional prior of neural networks based on the minimization of their Wasserstein distance. We provide vast experimental evidence that coupling these priors with scalable Markov chain Monte Carlo sampling offers systematically large performance improvements over alternative choices of priors and state-of-the-art approximate Bayesian deep learning approaches. We consider this work a considerable step in the direction of making the long-standing challenge of carrying out a fully Bayesian treatment of neural networks, including convolutional neural networks, a concrete possibility.

ニューラルネットワークのベイズ的処理では、重みとバイアスパラメータに対して事前分布が指定される必要があります。これは、現代のニューラルネットワークが多数のパラメータによって特徴付けられ、これらの事前分布の選択が、事前分布からパラメータをサンプリングすることによって取得される関数の分布である誘導機能事前分布に制御不能な影響を及ぼすため、課題となります。私たちは、これがベイズ深層学習の非常に制限的な側面であると主張し、この研究では、この制限に実用的かつ効果的な方法で取り組みます。私たちの提案は、より簡単に引き出すことができる機能事前分布の観点から推論し、そのような機能事前分布を反映するようにニューラルネットワークパラメータの事前分布を「調整」することです。ガウス過程は、関数に対する事前分布を定義するための厳密なフレームワークを提供します。私たちは、ワッサーシュタイン距離の最小化に基づいて、その事前分布をニューラルネットワークの機能事前分布と一致させる新しい堅牢なフレームワークを提案します。これらの事前分布をスケーラブルなマルコフ連鎖モンテカルロサンプリングと組み合わせると、事前分布の代替選択や最先端の近似ベイジアン深層学習アプローチに比べて、体系的に大きなパフォーマンス向上が得られるという膨大な実験的証拠を提供します。この研究では、畳み込みニューラルネットワークを含むニューラルネットワークの完全なベイジアン処理を実行するという長年の課題を具体的な可能性にするための大きな一歩であると考えています。

A Kernel Two-Sample Test for Functional Data
関数データのためのカーネル2標本検定

We propose a nonparametric two-sample test procedure based on Maximum Mean Discrepancy (MMD) for testing the hypothesis that two samples of functions have the same underlying distribution, using kernels defined on function spaces. This construction is motivated by a scaling analysis of the efficiency of MMD-based tests for datasets of increasing dimension. Theoretical properties of kernels on function spaces and their associated MMD are established and employed to ascertain the efficacy of the newly proposed test, as well as to assess the effects of using functional reconstructions based on discretised function samples. The theoretical results are demonstrated over a range of synthetic and real world datasets.

私たちは、関数空間で定義されたカーネルを使用して、関数の2つのサンプルが同じ基本分布を持つという仮説を検定するための、最大平均不一致(MMD)に基づくノンパラメトリック2サンプル検定手順を提案します。この構成は、次元が増加するデータセットに対するMMDベースのテストの効率のスケーリング分析によって動機付けられています。関数空間上のカーネルとそれに関連するMMDの理論的特性を確立し、新たに提案されたテストの有効性を確認するため、および離散化された関数サンプルに基づく機能再構成を使用した影響を評価するために採用されています。理論的な結果は、さまざまな合成データセットと実世界のデータセットで実証されています。
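
An MMD-based two-sample test on discretized curves can be sketched as below. This is a generic illustration, not the paper's procedure: it assumes curves sampled on a common uniform grid of the unit interval, a Gaussian kernel on an approximate $L^2$ distance with a hand-picked `bandwidth`, and a permutation test for calibration. All function names and defaults are hypothetical.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth):
    """Unbiased estimate of squared MMD between curve samples X (m x T) and Y (n x T)."""
    def gram(A, B):
        # Mean over grid points approximates the squared L2 distance on [0, 1].
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).mean(axis=2)
        return np.exp(-sq_dist / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    m, n = len(X), len(Y)
    # Exclude diagonal terms so the within-sample averages are unbiased.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

def permutation_pvalue(X, Y, bandwidth=1.0, n_perm=200, seed=0):
    """Permutation p-value for the null hypothesis that X and Y share a distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, bandwidth)
    pooled = np.vstack([X, Y])
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        Xp, Yp = pooled[idx[:len(X)]], pooled[idx[len(X):]]
        count += int(mmd2_unbiased(Xp, Yp, bandwidth) >= observed)
    return (count + 1) / (n_perm + 1)
```

Averaging squared differences over the grid, rather than summing them, keeps the kernel argument on a comparable scale as the grid is refined. That is one simple way to respect the function-space viewpoint, although the paper's own kernels may differ.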

Batch Normalization Preconditioning for Neural Network Training
ニューラルネットワーク学習のためのバッチ正規化前処理

Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve generalization performance of neural networks. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this paper, we propose a new method called Batch Normalization Preconditioning (BNP). Instead of applying normalization explicitly through a batch normalization layer as is done in BN, BNP applies normalization by conditioning the parameter gradients directly during training. This is designed to improve the Hessian matrix of the loss function and hence convergence during training. One benefit is that BNP is not constrained on the mini-batch size and works in the online learning setting. Furthermore, its connection to BN provides theoretical insights on how BN improves training and how BN is applied to special architectures such as convolutional neural networks. For a theoretical foundation, we also present a novel Hessian condition number based convergence theory for a locally convex but not strong-convex loss, which is applicable to networks with a scale-invariant property.

バッチ正規化(BN)は、ディープラーニングで人気があり、広く普及している手法であり、ニューラルネットワークのトレーニング時間を短縮し、一般化パフォーマンスを向上させることが示されています。その成功にもかかわらず、BNは理論的に十分に理解されていません。非常に小さなミニバッチサイズやオンライン学習での使用には適していません。この論文では、バッチ正規化前処理(BNP)と呼ばれる新しい手法を提案します。BNで行われるようにバッチ正規化レイヤーを介して明示的に正規化を適用する代わりに、BNPはトレーニング中に直接パラメーター勾配を調整することによって正規化を適用します。これは、損失関数のヘッセ行列を改善し、トレーニング中の収束を改善するように設計されています。1つの利点は、BNPがミニバッチサイズに制約されず、オンライン学習設定で機能することです。さらに、BNとの関連により、BNがどのようにトレーニングを改善するか、およびBNが畳み込みニューラルネットワークなどの特殊なアーキテクチャにどのように適用されるかについての理論的な洞察が得られます。理論的基礎として、スケール不変特性を持つネットワークに適用可能な、局所凸だが強凸ではない損失に対する新しいヘッセ条件数ベースの収束理論も提示します。

Multiple-Splitting Projection Test for High-Dimensional Mean Vectors
高次元平均ベクトルのための多重分割射影検定

We propose a multiple-splitting projection test (MPT) for one-sample mean vectors in high-dimensional settings. The idea of projection test is to project high-dimensional samples to a 1-dimensional space using an optimal projection direction such that traditional tests can be carried out with projected samples. However, estimation of the optimal projection direction has not been systematically studied in the literature. In this work, we bridge the gap by proposing a consistent estimation via regularized quadratic optimization. To retain type I error rate, we adopt a data-splitting strategy when constructing test statistics. To mitigate the power loss due to data-splitting, we further propose a test via multiple splits to enhance the testing power. We show that the $p$-values resulted from multiple splits are exchangeable. Unlike existing methods which tend to conservatively combine dependent $p$-values, we develop an exact level $\alpha$ test that explicitly utilizes the exchangeability structure to achieve better power. Numerical studies show that the proposed test well retains the type I error rate and is more powerful than state-of-the-art tests.

私たちは、高次元設定における1標本平均ベクトルのための多重分割射影検定(MPT)を提案します。射影検定の考え方は、最適な射影方向を使用して高次元サンプルを1次元空間に射影し、射影後のサンプルに対して従来の検定を実行できるようにすることです。しかし、最適な射影方向の推定は、これまで文献で体系的に研究されていません。この研究では、正則化二次最適化による一致推定を提案することで、このギャップを埋めます。第一種の過誤率を維持するために、検定統計量を構築する際にデータ分割戦略を採用します。さらに、データ分割による検出力の低下を軽減するために、多重分割による検定を提案して検出力を高めます。多重分割から得られる$p$値は交換可能であることを示します。従属する$p$値を保守的に結合する傾向がある既存の方法とは異なり、交換可能性の構造を明示的に利用してより高い検出力を実現する、厳密な水準$\alpha$の検定を開発します。数値実験により、提案する検定は第一種の過誤率を十分に維持し、最先端の検定よりも検出力が高いことが示されています。

Generalized Sparse Additive Models
一般化スパース加法モデル

We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework.

私たちは、高次元の一般化加法モデルの推定と分析のための統一されたフレームワークを提示します。このフレームワークは、多くの既存の方法を包含する、ペナルティ付き回帰推定量の大きなクラスを定義します。このクラスの効率的な計算アルゴリズムは、数千の観測値と特徴に簡単にスケーリングできます。このクラスのミニマックス最適収束限界を弱い互換性条件下で証明します。さらに、この互換性条件が満たされない場合の収束率を特徴付けます。最後に、フレームワークの構造ペナルティとスパース性ペナルティの最適なペナルティパラメータがリンクされていることも示し、1つのチューニングパラメータのみでクロスバリデーションを実行できることも示しています。私たちは、このフレームワーク内のいくつかの既存の方法を比較する実証的研究によって、理論的結果を補完します。

Asymptotic Network Independence and Step-Size for a Distributed Subgradient Method
分散サブグラディエント法のための漸近的ネットワーク独立性とステップサイズ

We consider whether distributed subgradient methods can achieve a linear speedup over a centralized subgradient method. While it might be hoped that distributed network of $n$ nodes that can compute $n$ times more subgradients in parallel compared to a single node might, as a result, be $n$ times faster, existing bounds for distributed optimization methods are often consistent with a slowdown rather than speedup compared to a single node. We show that a distributed subgradient method has this “linear speedup” property when using a class of square-summable-but-not-summable step-sizes which include $1/t^{\beta}$ when $\beta \in (1/2,1)$; for such step-sizes, we show that after a transient period whose size depends on the spectral gap of the network, the method achieves a performance guarantee that does not depend on the network or the number of nodes. We also show that the same method can fail to have this “asymptotic network independence” property under the optimally decaying step-size $1/\sqrt{t}$ and, as a consequence, can fail to provide a linear speedup compared to a single node with $1/\sqrt{t}$ step-size.

私たちは、分散サブグラディエント法が集中型サブグラディエント法に対して線形の高速化を達成できるかどうかを考察します。単一ノードと比較して並列に$n$倍のサブグラディエントを計算できる$n$ノードの分散ネットワークは、結果として$n$倍高速になると期待されるかもしれませんが、分散最適化法の既存の上界は、単一ノードと比較して高速化ではなく速度低下と整合することがよくあります。分散サブグラディエント法は、$\beta \in (1/2,1)$のときの$1/t^{\beta}$を含む、二乗総和可能だが総和可能ではないステップサイズのクラスを使用する場合に、この「線形の高速化」特性を持つことを示します。このようなステップサイズでは、ネットワークのスペクトルギャップに依存する長さの過渡期間の後、この方法はネットワークやノードの数に依存しない性能保証を達成します。また、同じ方法が、最適に減衰するステップサイズ$1/\sqrt{t}$の下ではこの「漸近的ネットワーク独立性」特性を持たない場合があり、その結果、ステップサイズ$1/\sqrt{t}$の単一ノードと比較して線形の高速化を実現できない可能性があることも示します。
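
As a worked check of the step-size condition named in the abstract, the following elementary $p$-series facts show why $1/t^{\beta}$ with $\beta \in (1/2,1)$ is square-summable but not summable, while $1/\sqrt{t}$ fails square-summability. This is a standard calculus argument added for illustration, not a result taken from the paper.

```latex
\[
\sum_{t=1}^{\infty} \frac{1}{t^{\beta}} = \infty
\quad\text{and}\quad
\sum_{t=1}^{\infty} \Bigl(\frac{1}{t^{\beta}}\Bigr)^{2}
  = \sum_{t=1}^{\infty} \frac{1}{t^{2\beta}} < \infty
\qquad \text{for } \beta \in (1/2, 1),
\]
since $\beta < 1$ and $2\beta > 1$. For $\beta = 1/2$, in contrast,
\[
\sum_{t=1}^{\infty} \Bigl(\frac{1}{\sqrt{t}}\Bigr)^{2}
  = \sum_{t=1}^{\infty} \frac{1}{t} = \infty ,
\]
so the step-size $1/\sqrt{t}$ is not square-summable.
```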

Scaling-Translation-Equivariant Networks with Decomposed Convolutional Filters
分解畳み込みフィルタによるスケーリング-平行移動同変ネットワーク

Encoding the scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many computer vision tasks especially when dealing with multiscale inputs. We study, in this paper, a scaling-translation-equivariant ($\mathcal{ST}$-equivariant) CNN with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve equivariance for the regular representation of the scaling-translation group $\mathcal{ST}$. To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation, a property which is theoretically analyzed and empirically verified. Numerical experiments demonstrate that the proposed scaling-translation-equivariant network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.

畳み込みニューラルネットワーク(CNN)によって学習された表現にスケール情報を明示的にエンコードすることは、特にマルチスケール入力を扱う場合に、多くのコンピュータービジョンタスクにとって有益です。この論文では、空間とスケーリング群にわたる結合畳み込みを備えた、スケーリング-平行移動同変($\mathcal{ST}$-同変) CNNについて検討します。この結合畳み込みは、スケーリング-平行移動群$\mathcal{ST}$の正則表現に対する同変性を実現するための必要十分条件であることが示されます。モデルの複雑さと計算負荷を軽減するために、畳み込みフィルターを2つの事前に固定された分離可能な基底の下で分解し、展開を低周波成分までで打ち切ります。打ち切られたフィルター展開のさらなる利点は、同変表現の変形に対する頑健性が向上することです。この特性は理論的に分析され、経験的に検証されています。数値実験により、提案する分解畳み込みフィルタを備えたスケーリング-平行移動同変ネットワーク(ScDCFNet)は、モデルサイズを縮小しながら、通常のCNNよりもマルチスケール画像分類の性能が大幅に向上し、解釈可能性も向上することが実証されています。

Are All Layers Created Equal?
すべての層は平等に作られているか?

Understanding deep neural networks is a major research objective with notable experimental and theoretical attention in recent years. The practical success of excessively large networks underscores the need for better theoretical analyses and justifications. In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers’ robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either “robust” or “critical”. Resetting the robust layers to their initial values does not result in adverse decline in performance. In many cases, robust layers hardly change throughout training. In contrast, re-initializing critical layers vastly degrades the performance of the network with test error essentially dropping to random guesses. Our study provides further evidence that mere parameter counting or norm calculations are too coarse in studying generalization of deep models, and “flatness” and robustness analysis of trained models need to be examined while taking into account the respective network architectures.

ディープニューラルネットワークを理解することは、近年、実験と理論の両面から大きな注目を集めている主要な研究目標です。過度に大規模なネットワークの実用上の成功は、より優れた理論的分析と正当化の必要性を浮き彫りにしています。この論文では、過剰パラメータ化されたディープモデルにおける層ごとの機能的な構造と振る舞いに焦点を当てます。そのために、トレーニング後のパラメータの再初期化と再ランダム化に対する層の頑健性を経験的に調査します。層の不均一性を裏付ける実験結果を示します。大まかに言えば、大規模なディープニューラルネットワークの層は、「頑健」な層と「クリティカル」な層のいずれかに分類できます。頑健な層を初期値にリセットしても、性能が著しく低下することはありません。多くの場合、頑健な層はトレーニングを通してほとんど変化しません。対照的に、クリティカルな層を再初期化すると、ネットワークの性能が大幅に低下し、テスト性能は実質的にランダムな推測と同程度まで落ち込みます。私たちの研究は、深層モデルの汎化を研究するうえで、単なるパラメータ数のカウントやノルムの計算では粗すぎること、そしてトレーニングされたモデルの「平坦性」と頑健性の分析は、それぞれのネットワークアーキテクチャを考慮しながら検討する必要があることを示す、さらなる証拠を提供します。

New Insights for the Multivariate Square-Root Lasso
多変量平方根ラッソに関する新しい洞察

We study the multivariate square-root lasso, a method for fitting the multivariate response linear regression model with dependent errors. This estimator minimizes the nuclear norm of the residual matrix plus a convex penalty. Unlike existing methods that require explicit estimates of the error precision (inverse covariance) matrix, the multivariate square-root lasso implicitly accounts for error dependence and is the solution to a convex optimization problem. We establish error bounds which reveal that like the univariate square-root lasso, the multivariate square-root lasso is pivotal with respect to the unknown error covariance matrix. In addition, we propose a variation of the alternating direction method of multipliers algorithm to compute the estimator and discuss an accelerated first order algorithm that can be applied in certain cases. In both simulation studies and a genomic data application, we show that the multivariate square-root lasso can outperform more computationally intensive methods that require explicit estimation of the error precision matrix.

私たちは、従属誤差を持つ多変量応答線形回帰モデルを当てはめる方法である多変量平方根Lassoを研究します。この推定量は、残差行列の核ノルムに凸ペナルティを加えたものを最小化します。誤差精度(逆共分散)行列の明示的な推定を必要とする既存の方法とは異なり、多変量平方根Lassoは誤差の依存性を暗黙的に考慮し、凸最適化問題の解として得られます。私たちは誤差限界を確立し、単変量平方根Lassoと同様に、多変量平方根Lassoが未知の誤差共分散行列に関してピボタルであることを明らかにします。さらに、推定量を計算するための交互方向乗数法アルゴリズムの変種を提案し、特定のケースに適用できる加速一次アルゴリズムについて説明します。シミュレーション研究とゲノムデータへの応用の両方において、多変量平方根Lassoが、誤差精度行列の明示的な推定を必要とする、より計算量の多い方法を上回りうることを示します。
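
The objective described in the abstract (nuclear norm of the residual matrix plus a convex penalty) can be evaluated in a few lines. The sketch below is only an illustration: the function name `msrl_objective`, the $1/\sqrt{n}$ scaling, and the entrywise $\ell_1$ penalty are assumptions chosen for the example, not details confirmed by the abstract, and the code evaluates the criterion rather than solving for the estimator.

```python
import numpy as np

def msrl_objective(Y, X, B, lam):
    """Evaluate a square-root-lasso-style objective for multivariate regression.

    Assumed form (hypothetical, for illustration only):
        ||Y - X B||_* / sqrt(n) + lam * sum_ij |B_ij|
    where ||.||_* is the nuclear norm (sum of singular values).
    """
    n = Y.shape[0]
    residual = Y - X @ B
    nuclear = np.linalg.norm(residual, ord="nuc")  # sum of singular values
    penalty = np.abs(B).sum()                      # assumed convex penalty (l1)
    return nuclear / np.sqrt(n) + lam * penalty

# Tiny usage example with a 2x2 design.
X = np.eye(2)
Y = np.diag([3.0, 4.0])
value_at_zero = msrl_objective(Y, X, np.zeros((2, 2)), lam=0.1)
value_at_fit = msrl_objective(Y, X, Y.copy(), lam=0.1)
```

With `B` set to zero only the residual term contributes, and with a perfect fit only the penalty term remains, which makes the two pieces of the criterion easy to inspect separately.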

On the Complexity of Approximating Multimarginal Optimal Transport
多周辺最適輸送の近似の複雑さについて

We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{\mathcal{O}}(n^{3m})$. We then propose two simple and deterministic algorithms for approximating the MOT problem. The first algorithm, which we refer to as multimarginal Sinkhorn algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{\mathcal{O}}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first near-linear time complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as accelerated multimarginal Sinkhorn algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{\mathcal{O}}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm (Tupitsa et al., 2020) in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver Gurobi. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms.

私たちは、多周辺最適輸送(MOT)距離の近似の計算量について研究します。MOT距離は古典的な最適輸送距離の一般化であり、ここでは、それぞれが$n$個のサポート点を持つ$m$個の離散確率分布の間で考えます。まず、MOT問題の標準的な線形計画(LP)表現は、$m \geq 3$の場合、最小コストフロー問題ではないことを示します。この否定的な結果は、ネットワークシンプレックス法などの一部の組み合わせアルゴリズムがMOT問題の近似に適していないことを意味し、一方で、決定論的な内点法の最悪計算量の上界は依然として$\tilde{\mathcal{O}}(n^{3m})$です。次に、MOT問題を近似するための2つの単純な決定論的アルゴリズムを提案します。多周辺Sinkhornアルゴリズムと呼ぶ最初のアルゴリズムは、Sinkhornアルゴリズムの、効率性が証明可能な多周辺への一般化です。このアルゴリズムが、許容誤差$\varepsilon \in (0, 1)$に対して$\tilde{\mathcal{O}}(m^3n^m\varepsilon^{-2})$の計算量上界を達成することを示します。これは、MOT問題を近似するための最初のほぼ線形時間の計算量上界の保証を与え、$m = 2$の場合には、古典的なOT設定におけるSinkhornアルゴリズムの既知の最良の計算量上界と一致します。加速多周辺Sinkhornアルゴリズムと呼ぶ2番目のアルゴリズムは、推定列(estimate sequence)を組み込むことで加速を実現し、その計算量上界は$\tilde{\mathcal{O}}(m^3n^{m+1/3}\varepsilon^{-4/3})$です。この上界は、$1/\varepsilon$の点では最初のアルゴリズムよりも優れており、$n$の点では加速交互最小化アルゴリズム(Tupitsa et al., 2020)よりも優れています。最後に、新しいアルゴリズムを商用LPソルバーGurobiと比較します。合成データと実画像に関する予備的な結果により、アルゴリズムの有効性と効率性が実証されています。
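
For orientation, the classical two-marginal case ($m = 2$) that the abstract generalizes can be sketched as follows. This is the textbook Sinkhorn matrix-scaling iteration for entropically regularized optimal transport, not the paper's multimarginal or accelerated algorithm; the function name `sinkhorn`, the regularization parameter `eta`, and the fixed iteration count are assumptions made for the example.

```python
import numpy as np

def sinkhorn(C, r, c, eta=0.5, n_iter=500):
    """Classical Sinkhorn iteration for two marginals r and c.

    C   : (n, n) cost matrix
    r, c: target row and column marginals (each sums to 1)
    eta : entropic regularization strength (assumed value, for illustration)
    Returns an approximate transport plan with marginals close to (r, c).
    """
    K = np.exp(-C / eta)           # Gibbs kernel
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)            # rescale rows to match r
        v = c / (K.T @ u)          # rescale columns to match c
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 1.0], [1.0, 0.0]])
r = np.array([0.3, 0.7])
c = np.array([0.6, 0.4])
P = sinkhorn(C, r, c)
```

In the multimarginal setting the plan becomes an order-$m$ tensor with $n^m$ entries, which is consistent with the $n^m$ factor appearing in the complexity bounds quoted above.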

Stochastic Zeroth-Order Optimization under Nonstationarity and Nonconvexity
非定常性と非凸性の下での確率的ゼロ次最適化

Stochastic zeroth-order optimization algorithms have been predominantly analyzed under the assumption that the objective function being optimized is time-invariant. Motivated by dynamic matrix sensing and completion problems, and online reinforcement learning problems, in this work, we propose and analyze stochastic zeroth-order optimization algorithms when the objective being optimized changes with time. Considering general nonconvex functions, we propose nonstationary versions of regret measures based on first-order and second-order optimal solutions, and provide the corresponding regret bounds. For the case of first-order optimal solution based regret measures, we provide regret bounds in both the low- and high-dimensional settings. For the case of second-order optimal solution based regret, we propose zeroth-order versions of the stochastic cubic-regularized Newton’s method based on estimating the Hessian matrices in the bandit setting via second-order Gaussian Stein’s identity. Our nonstationary regret bounds in terms of second-order optimal solutions have interesting consequences for avoiding saddle points in the nonstationary setting.

確率的ゼロ次最適化アルゴリズムは、主に、最適化される目的関数が時間不変であるという仮定の下で分析されてきました。動的行列検知および補完問題、およびオンライン強化学習問題に動機付けられて、この研究では、最適化される目的が時間とともに変化するときの確率的ゼロ次最適化アルゴリズムを提案し、分析します。一般的な非凸関数を考慮して、1次および2次の最適解に基づく後悔尺度の非定常バージョンを提案し、対応する後悔境界を提供します。1次最適解に基づく後悔尺度の場合、低次元設定と高次元設定の両方で後悔境界を提供します。2次最適解に基づく後悔の場合、2次ガウスシュタイン恒等式を介してバンディット設定でヘッセ行列を推定することに基づく確率的3次正則化ニュートン法のゼロ次バージョンを提案します。2次最適解に関する非定常後悔境界は、非定常設定での鞍点の回避に関して興味深い結果をもたらします。
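
To make "zeroth-order" concrete, the sketch below estimates a gradient from function values only, using the standard Gaussian-smoothing two-point estimator. It is a generic illustration under stated assumptions: the names `zo_gradient` and `quadratic`, the smoothing radius `nu`, and the sample count are hypothetical, and the estimator shown is the common first-order one, not the Stein-identity Hessian estimator or the cubic-regularized Newton variants proposed in the paper.

```python
import numpy as np

def zo_gradient(f, x, nu=1e-3, n_samples=20000, seed=0):
    """Estimate the gradient of f at x using only function evaluations.

    Averages (f(x + nu*u) - f(x)) / nu * u over Gaussian directions u.
    """
    rng = np.random.default_rng(seed)
    fx = f(x)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (f(x + nu * u) - fx) / nu * u
    return grad / n_samples

def quadratic(x):
    # Test function with known gradient: grad f(x) = x.
    return 0.5 * float(x @ x)

x0 = np.array([1.0, -2.0, 0.5])
estimate = zo_gradient(quadratic, x0)
```

For this quadratic the true gradient equals `x0`, so the quality of the estimate can be judged directly; the estimate is noisy and improves as `n_samples` grows.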

Additive Nonlinear Quantile Regression in Ultra-high Dimension
超高次元における加法非線形分位点回帰

We propose a method for simultaneous estimation and variable selection of an additive quantile regression model that can be used with high dimensional data. Quantile regression is an appealing method for analyzing high dimensional data because it can correctly model heteroscedastic relationships, is robust to outliers in the response, sparsity levels can change with quantiles, and it provides a thorough analysis of the conditional distribution of the response. An additive nonlinear model can capture more complex relationships, while avoiding the curse of dimensionality. The additive nonlinear model is fit using B-splines and a nonconvex group penalty is used for simultaneous estimation and variable selection. We derive the asymptotic properties of the estimator, including an oracle property, under general conditions that allow for the number of covariates, $p_n$, and the number of true covariates, $q_n$, to increase with the sample size, $n$. In addition, we propose a coordinate descent algorithm that reduces the computational cost compared to the linear programming approach typically used for solving quantile regression problems. The performance of the method is tested using Monte Carlo simulations, an analysis of fat content of meat conditional on a 100 channel spectrum of absorbances and predicting TRIM32 expression using gene expression data from the eyes of rats.

私たちは、高次元データで使用できる加法分位回帰モデルの同時推定と変数選択の方法を提案します。分位回帰は、不均一な分散関係を正しくモデル化でき、応答の外れ値に対して堅牢で、スパースレベルを分位数に応じて変更でき、応答の条件付き分布の徹底的な分析を提供できるため、高次元データを分析するための魅力的な方法です。加法非線形モデルは、次元の呪いを回避しながら、より複雑な関係を捉えることができます。加法非線形モデルはBスプラインを使用して適合され、同時推定と変数選択には非凸グループペナルティが使用されます。共変量数$p_n$と真の共変量数$q_n$がサンプルサイズ$n$とともに増加することを許容する一般的な条件下で、推定量の漸近特性(オラクル特性を含む)を導出します。さらに、私たちは、分位回帰問題を解くために通常使用される線形計画法に比べて計算コストを削減する座標降下アルゴリズムを提案します。この方法のパフォーマンスは、モンテカルロシミュレーション、100チャネルの吸光度スペクトルを条件とする肉の脂肪含有量の分析、およびラットの眼からの遺伝子発現データを使用したTRIM32発現の予測を使用してテストされています。

The AIM and EM Algorithms for Learning from Coarse Data
粗いデータから学習するためのAIMアルゴリズムとEMアルゴリズム

Statistical learning from incomplete data is typically performed under an assumption of ignorability for the mechanism that causes missing values. Notably, the expectation maximization (EM) algorithm is based on the assumption that values are missing at random. Most approaches that tackle non-ignorable mechanisms are based on specific modeling assumptions for these mechanisms. The adaptive imputation and maximization (AIM) algorithm has been introduced in earlier work as a general paradigm for learning from incomplete data without any assumptions on the process that causes observations to be incomplete. In this paper we give a thorough analysis of the theoretical properties of the AIM algorithm, and its relationship with EM. We identify conditions under which EM and AIM are in fact equivalent, and show that when these conditions are not met, then AIM can produce consistent estimates in non-ignorable incomplete data scenarios where EM becomes inconsistent. Convergence results for AIM are obtained that closely mirror the available convergence guarantees for EM. We develop the general theory of the AIM algorithm for discrete data settings, and then develop a general discretization approach that allows to apply the method also to incomplete continuous data. We demonstrate the practical usability of the AIM algorithm by prototype implementations for parameter learning from continuous Gaussian data, and from discrete Bayesian network data. Extensive experiments show that the theoretical differences between AIM and EM can be observed in practice, and that a combination of the two methods leads to robust performance for both ignorable and non-ignorable mechanisms.

不完全なデータからの統計学習は、通常、欠損値を生じさせるメカニズムが無視可能であるという仮定の下で実行されます。特に、期待値最大化(EM)アルゴリズムは、値がランダムに欠損する(missing at random)という仮定に基づいています。無視可能でないメカニズムに対処するアプローチのほとんどは、これらのメカニズムに関する特定のモデリング仮定に基づいています。適応型補完および最大化(AIM)アルゴリズムは、観測が不完全になる過程について何の仮定も置かずに不完全なデータから学習するための一般的なパラダイムとして、以前の研究で導入されました。この論文では、AIMアルゴリズムの理論的特性とEMとの関係を徹底的に分析します。EMとAIMが実際に同等となる条件を特定し、これらの条件が満たされない場合には、EMが一致性を失うような無視可能でない不完全データのシナリオにおいても、AIMが一致推定値を生成できることを示します。EMで利用可能な収束保証とよく対応する、AIMの収束結果が得られます。まず離散データ設定に対するAIMアルゴリズムの一般理論を構築し、次に、不完全な連続データにもこの方法を適用できるようにする一般的な離散化アプローチを開発します。連続ガウスデータと離散ベイジアンネットワークデータからのパラメータ学習のプロトタイプ実装によって、AIMアルゴリズムの実用性を実証します。広範な実験により、AIMとEMの理論的な違いが実際に観察できること、および2つの方法を組み合わせると、無視可能なメカニズムと無視可能でないメカニズムの両方で頑健な性能が得られることが示されました。

Sparse Additive Gaussian Process Regression
スパース加法ガウス過程回帰

In this paper we introduce a novel model for Gaussian process (GP) regression in the fully Bayesian setting. Motivated by the ideas of sparsification, localization and Bayesian additive modeling, our model is built around a recursive partitioning (RP) scheme. Within each RP partition, a sparse GP (SGP) regression model is fitted. A Bayesian additive framework then combines multiple layers of partitioned SGPs, capturing both global trends and local refinements with efficient computations. The model addresses both the problem of efficiency in fitting a full Gaussian process regression model and the problem of prediction performance associated with a single SGP. Our approach mitigates the issue of pseudo-input selection and avoids the need for complex inter-block correlations in existing methods. The crucial trade-off becomes choosing between many simpler local model components or fewer complex global model components, which the practitioner can sensibly tune. Implementation is via a Metropolis-Hastings Markov chain Monte-Carlo algorithm with Bayesian back-fitting. We compare our model against popular alternatives on simulated and real datasets, and find the performance is competitive, while the fully Bayesian procedure enables the quantification of model uncertainties.

この論文では、完全なベイジアン設定におけるガウス過程(GP)回帰の新しいモデルを紹介します。スパース化、局所化、ベイジアン加法モデリングのアイデアに着想を得たこのモデルは、再帰分割(RP)スキームに基づいて構築されています。各RPパーティション内で、スパースGP (SGP)回帰モデルが適合されます。次に、ベイジアン加法フレームワークが、パーティション化されたSGPの複数のレイヤーを結合し、効率的な計算でグローバルトレンドとローカルリファインメントの両方を捕捉します。このモデルは、完全なガウス過程回帰モデルを適合する際の効率の問題と、単一のSGPに関連する予測性能の問題の両方に対処します。このアプローチは、疑似入力選択の問題を軽減し、既存の方法で必要とされる複雑なブロック間相関を不要にします。重要なトレードオフは、多数のより単純なローカルモデルコンポーネントと、より少数の複雑なグローバルモデルコンポーネントのどちらを選択するかになり、これは実務者が適切に調整できます。実装は、ベイジアンバックフィッティングを使用したメトロポリス-ヘイスティングス・マルコフ連鎖モンテカルロアルゴリズムによって行われます。私たちは、シミュレーションデータセットと実データセット上で私たちのモデルを一般的な代替モデルと比較し、その性能が競争力のあるものであることを確認しました。また、完全なベイズ手順により、モデルの不確実性の定量化が可能になります。

A Unifying Framework for Variance-Reduced Algorithms for Findings Zeroes of Monotone operators
単調作用素の零点を求めるための分散縮小アルゴリズムの統一フレームワーク

It is common to encounter large-scale monotone inclusion problems where the objective has a finite sum structure. We develop a general framework for variance-reduced forward-backward splitting algorithms for this problem. This framework includes a number of existing deterministic and variance-reduced algorithms for function minimization as special cases, and it is also applicable to more general problems such as saddle-point problems and variational inequalities. With a carefully constructed Lyapunov function, we show that the algorithms covered by our framework enjoy a linear convergence rate in expectation under mild assumptions. We further consider Catalyst acceleration and asynchronous implementation to reduce the algorithmic complexity and computation time. We apply our proposed framework to a policy evaluation problem and a strongly monotone two-player game, both of which fall outside the realm of function minimization.

目的関数が有限和構造を持つ大規模な単調包含問題に遭遇することはよくあります。私たちは、この問題に対する分散縮小型の前進後退分割アルゴリズムの一般的なフレームワークを開発します。このフレームワークは、関数最小化のための既存の決定論的アルゴリズムおよび分散縮小アルゴリズムの多くを特殊ケースとして含み、鞍点問題や変分不等式などのより一般的な問題にも適用できます。慎重に構築されたリアプノフ関数を使用して、フレームワークに含まれるアルゴリズムが、穏やかな仮定の下で期待値の意味で線形収束することを示します。さらに、アルゴリズムの計算量と計算時間を削減するために、Catalyst加速と非同期実装を検討します。提案するフレームワークを、いずれも関数最小化の範囲外にある、方策評価問題と強単調な2人ゲームに適用します。

Causal Classification: Treatment Effect Estimation vs. Outcome Prediction
因果分類:治療効果の推定と結果の予測

The goal of causal classification is to identify individuals whose outcome would be positively changed by a treatment. Examples include targeting advertisements and targeting retention incentives to reduce churn. Causal classification is challenging because we observe individuals under only one condition (treated or untreated), so we do not know who was influenced by the treatment, but we may estimate the potential outcomes under each condition to decide whom to treat by estimating treatment effects. Curiously, we often see practitioners using simple outcome prediction instead, for example, predicting if someone will purchase if shown the ad. Rather than disregarding this as naive behavior, we present a theoretical analysis comparing treatment effect estimation and outcome prediction when addressing causal classification. We focus on the key question: “When (if ever) is simple outcome prediction preferable to treatment effect estimation for causal classification?” The analysis reveals a causal bias–variance tradeoff. First, when the treatment effect estimation depends on two outcome predictions, larger sampling variance may lead to more errors than the (biased) outcome prediction approach. Second, a stronger signal-to-noise ratio in outcome prediction implies that the bias can help with intervention decisions when outcomes are informative of effects. The theoretical results, as well as simulations, illustrate settings where outcome prediction should actually be better, including cases where (1) the bias may be partially corrected by choosing a different threshold, (2) outcomes and treatment effects are correlated, and (3) data to estimate counterfactuals are limited. A major practical implication is that, for some applications, it might be feasible to make good intervention decisions without any data on how individuals actually behave when intervened. Finally, we show that for a real online advertising application, outcome prediction models indeed excel at causal classification.

因果分類の目標は、治療によって結果がプラスに変化する個人を特定することです。例としては、広告のターゲティングや、解約を減らすためのリテンションインセンティブのターゲティングなどがあります。因果分類は、個人を1つの条件(治療済みまたは未治療)でのみ観察するため、誰が治療の影響を受けたかはわかりませんが、各条件での潜在的な結果を推定し、治療効果を推定することで誰を治療するかを決定することができるため、困難です。興味深いことに、代わりに、たとえば広告が表示された場合に誰かが購入するかどうかを予測するなど、単純な結果予測を使用する専門家をよく見かけます。これを素朴な行動として無視するのではなく、因果分類に対処する際に治療効果の推定と結果予測を比較する理論的分析を示します。「因果分類では、単純な結果予測が治療効果の推定よりも望ましいのはいつですか(ある場合)?」という重要な質問に焦点を当てます。分析により、因果バイアスと分散のトレードオフが明らかになります。まず、治療効果の推定が2つの結果予測に依存する場合、サンプリング分散が大きいと、(バイアスのある)結果予測アプローチよりもエラーが多くなる可能性があります。第二に、結果予測における信号対雑音比が強いということは、結果が効果に関する情報となる場合、バイアスが介入決定に役立つ可能性があることを意味します。理論的な結果とシミュレーションは、結果予測が実際にはより良くなるはずの設定を示しています。これには、(1)異なるしきい値を選択することでバイアスが部分的に修正される可能性がある場合、(2)結果と治療効果が相関している場合、(3)反事実を推定するデータが限られている場合が含まれます。主な実用的な意味合いは、一部のアプリケーションでは、介入時に個人が実際にどのように行動するかに関するデータがなくても、適切な介入決定を下すことが可能である可能性があるということです。最後に、実際のオンライン広告アプリケーションでは、結果予測モデルが因果分類に優れていることを示します。

A Statistical Approach for Optimal Topic Model Identification
最適なトピックモデル同定のための統計的アプローチ

Latent Dirichlet Allocation is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index.

Latent Dirichlet Allocationは、ドキュメントのコーパス内の潜在的な構造を特定する一般的な機械学習手法です。この論文では、最適なLDA構成を決定するための正式な手順が存在しないという継続的な懸念に対処するために、元のLDAモデルの基礎となる仮定された多項分布仕様に依存する一連のパラメトリックテストを導入します。私たちの方法論は、最適なトピックモデルを特定して評価するための一連の厳密な統計手順を定義します。米国大統領就任演説コーパスは、数値結果を示すためのケーススタディとして使用されます。92のトピックがコーパスを最もよく表していることがわかりました。さらに、シミュレーション研究を通じてこの手法を検証し、パープレキシティインデックスなどの他の標準的なヒューリスティックメトリクスと比較したアプローチの優位性を確認しています。

Inherent Tradeoffs in Learning Fair Representations
公正な表現を学ぶための固有のトレードオフ

Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy is not entirely clear, even for the basic paradigm of classification problems. In this paper, we characterize an inherent tradeoff between statistical parity and accuracy in the classification setting by providing a lower bound on the sum of group-wise errors of any fair classifiers. Our impossibility theorem could be interpreted as a certain uncertainty principle in fairness: if the base rates differ among groups, then any fair classifier satisfying statistical parity has to incur a large error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair classifiers, from the perspective of learning fair representations. To show that our lower bound is tight, assuming oracle access to Bayes (potentially unfair) classifiers, we also construct an algorithm that returns a randomized classifier which is both optimal (in terms of accuracy) and fair. Interestingly, when the protected attribute can take more than two values, an extension of this lower bound does not admit an analytic solution. Nevertheless, in this case, we show that the lower bound can be efficiently computed by solving a linear program, which we term as the TV-Barycenter problem, a barycenter problem under the TV-distance. On the upside, we prove that if the group-wise Bayes optimal classifiers are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. Finally, we also conduct experiments on real-world datasets to confirm our theoretical findings.

機械学習ツールをハイステークスな領域に実際に適用する場合、予測されるターゲットが保護属性に関して何らかの定量的なパリティの概念を満たす必要があるという意味で、公平であるように規制されることがよくあります。ただし、分類問題という基本的なパラダイムにおいてさえ、公平性と精度の正確なトレードオフは完全には明らかではありません。この論文では、任意の公平な分類器のグループごとの誤差の合計に対する下界を与えることで、分類設定における統計的パリティと精度の間の固有のトレードオフを特徴付けます。私たちの不可能性定理は、公平性におけるある種の不確定性原理として解釈できます。つまり、ベースレートがグループ間で異なる場合、統計的パリティを満たす公平な分類器は、少なくとも1つのグループで大きな誤差を被らなければなりません。さらに、公平な表現を学習するという観点から、この結果を拡張して、任意の(近似的に)公平な分類器の結合誤差に対する下界を与えます。この下界がタイトであることを示すために、ベイズ分類器(不公平でありうる)へのオラクルアクセスを仮定して、(精度の点で)最適かつ公平なランダム化分類器を返すアルゴリズムも構築します。興味深いことに、保護属性が3つ以上の値をとりうる場合、この下界の拡張は解析解を持ちません。それでも、この場合には、TV距離の下での重心問題であり、私たちがTV重心問題と呼ぶ線形計画を解くことで、下界を効率的に計算できることを示します。良い面として、グループごとのベイズ最適分類器が近い場合、公平な表現の学習は、グループ間で誤り率が近いことを意味する精度パリティと呼ばれる別の公平性の概念につながることを証明します。最後に、実世界のデータセットで実験を行い、理論的な知見を確認します。
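
The flavor of the tradeoff can be seen from a short triangle-inequality argument in the binary case. The notation below ($p_a$, $q$, $\varepsilon_a$) is introduced here for illustration only, and the derivation is an elementary sketch consistent with the abstract's claim, not necessarily the paper's exact theorem or proof.

```latex
Let $Y, \hat{Y} \in \{0,1\}$ and let $A \in \{0,1\}$ be the protected attribute.
Write $p_a = \Pr(Y = 1 \mid A = a)$ for the base rates and
$\varepsilon_a = \Pr(\hat{Y} \neq Y \mid A = a)$ for the group-wise errors.
Statistical parity means
$\Pr(\hat{Y} = 1 \mid A = 0) = \Pr(\hat{Y} = 1 \mid A = 1) = q$. Then
\[
\varepsilon_a
  = \mathbb{E}\bigl[\,|\hat{Y} - Y| \;\big|\; A = a\bigr]
  \;\ge\; \bigl|\,\mathbb{E}[\hat{Y} \mid A = a] - \mathbb{E}[Y \mid A = a]\,\bigr|
  = |q - p_a| ,
\]
and therefore
\[
\varepsilon_0 + \varepsilon_1
  \;\ge\; |q - p_0| + |q - p_1|
  \;\ge\; |p_0 - p_1| .
\]
```

When the base rates differ, the right-hand side is strictly positive, so at least one group must incur an error of at least $|p_0 - p_1|/2$.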

solo-learn: A Library of Self-supervised Methods for Visual Representation Learning
solo-learn: 視覚表現学習のための自己教師あり手法のライブラリ

This paper presents solo-learn, a library of self-supervised methods for visual representation learning. Implemented in Python, using Pytorch and Pytorch lightning, the library fits both research and industry needs by featuring distributed training pipelines with mixed-precision, faster data loading via Nvidia DALI, online linear evaluation for better prototyping, and many additional training tricks. Our goal is to provide an easy-to-use library comprising a large amount of Self-supervised Learning (SSL) methods, that can be easily extended and fine-tuned by the community. solo-learn opens up avenues for exploiting large-budget SSL solutions on inexpensive smaller infrastructures and seeks to democratize SSL by making it accessible to all. The source code is available at https://github.com/vturrisi/solo-learn.

この論文では、視覚表現学習のための自己教師ありメソッドのライブラリであるsolo-learnを紹介します。PytorchとPytorch lightningを使用してPythonで実装されたこのライブラリは、混合精度の分散トレーニングパイプライン、Nvidia DALIによる高速データ読み込み、プロトタイピングを改善するためのオンライン線形評価、および多くの追加のトレーニングトリックを備えているため、研究と業界の両方のニーズに適合します。私たちの目標は、コミュニティが簡単に拡張および微調整できる、大量の自己教師あり学習(SSL)メソッドで構成される使いやすいライブラリを提供することです。solo-learnは、安価で小規模なインフラストラクチャで大規模な予算のSSLソリューションを活用する道を開き、SSLをすべての人がアクセスできるようにすることで民主化を目指しています。ソースコードはhttps://github.com/vturrisi/solo-learnで入手できます。

Bayesian Pseudo Posterior Mechanism under Asymptotic Differential Privacy
漸近差分プライバシーにおけるベイズ擬似事後メカニズム

We propose a Bayesian pseudo posterior mechanism to generate record-level synthetic databases equipped with an $(\epsilon,\pi)-$ probabilistic differential privacy (pDP) guarantee, where $\pi$ denotes the probability that any observed database exceeds $\epsilon$. The pseudo posterior mechanism employs a data record-indexed, risk-based weight vector with weight values $\in [0, 1]$ that surgically downweight the likelihood contributions for high-risk records for model estimation and the generation of record-level synthetic data for public release. The pseudo posterior synthesizer constructs a weight for each datum record by using the Lipschitz bound for that record under a log-pseudo likelihood utility function that generalizes the exponential mechanism (EM) used to construct a formally private data generating mechanism. By selecting weights to remove likelihood contributions with non-finite log-likelihood values, we guarantee a finite local privacy guarantee for our pseudo posterior mechanism at every sample size. Our results may be applied to any synthesizing model envisioned by the data disseminator in a computationally tractable way that only involves estimation of a pseudo posterior distribution for parameters, $\theta$, unlike recent approaches that use naturally-bounded utility functions implemented through the EM. We specify conditions that guarantee the asymptotic contraction of $\pi$ to $0$ over the space of databases, such that the form of the guarantee provided by our method is asymptotic. We illustrate our pseudo posterior mechanism on the sensitive family income variable from the Consumer Expenditure Surveys database published by the U.S. Bureau of Labor Statistics. We show that utility is better preserved in the synthetic data for our pseudo posterior mechanism as compared to the EM, both estimated using the same non-private synthesizer, due to our use of targeted downweighting.

私たちは、$(\epsilon,\pi)-$確率的差分プライバシー(pDP)保証を備えたレコードレベルの合成データベースを生成するベイズ擬似事後メカニズムを提案します。ここで、$\pi$は、観測されたデータベースが$\epsilon$を超える確率を表す。擬似事後メカニズムは、データレコードインデックス付き、リスクベースの重みベクトルを使用します。重みベクトルは、重み値$\in [0, 1]$を持ち、モデル推定と公開用のレコードレベルの合成データの生成のために、高リスクレコードの尤度寄与を外科的に軽減します。擬似事後合成器は、形式的にプライベートなデータ生成メカニズムの構築に使用される指数メカニズム(EM)を一般化する対数擬似尤度ユーティリティ関数の下で、そのレコードのLipschitz境界を使用して、各データレコードの重みを構築します。非有限の対数尤度値を持つ尤度寄与を除去する重みを選択することで、あらゆるサンプルサイズで擬似事後メカニズムの有限のローカルプライバシー保証を保証します。私たちの結果は、EMを通じて実装された自然に有界な効用関数を使用する最近のアプローチとは異なり、パラメータ$\theta$の擬似事後分布の推定のみを伴う計算的に扱いやすい方法で、データ配信者が想定する任意の合成モデルに適用できます。データベースの空間で$\pi$から$0$への漸近的収縮を保証する条件を指定し、私たちの方法によって提供される保証の形式が漸近的になるようにします。米国労働統計局が発行する消費者支出調査データベースの敏感な家族収入変数で擬似事後メカニズムを示します。私たちは、ターゲットを絞ったダウンウェイトの使用により、同じ非プライベート合成装置を使用して推定されたEMと比較して、擬似事後メカニズムの合成データで効用がよりよく保持されることを示します。
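
The record-level downweighting described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `pseudo_log_likelihood`, the example log-likelihood values, and the example weights are all hypothetical, and the weights are supplied by hand here, whereas the paper derives them from per-record Lipschitz bounds.

```python
import math


def pseudo_log_likelihood(log_liks, weights):
    """Weighted sum of per-record log-likelihood contributions.

    Each weight lies in [0, 1]. A weight of 0 removes the record entirely,
    so a record with a non-finite log-likelihood and weight 0 contributes
    nothing to the total.
    """
    if len(log_liks) != len(weights):
        raise ValueError("log_liks and weights must have the same length")
    total = 0.0
    for ll, w in zip(log_liks, weights):
        if not 0.0 <= w <= 1.0:
            raise ValueError("weights must lie in [0, 1]")
        if w == 0.0:
            # Skip explicitly: 0 * (-inf) evaluates to NaN in floating point.
            continue
        total += w * ll
    return total


# Hypothetical per-record log-likelihoods; the last record is treated as
# high risk and has a non-finite contribution.
example_log_liks = [-1.2, -0.7, -3.5, -math.inf]
example_weights = [1.0, 1.0, 0.4, 0.0]
example_total = pseudo_log_likelihood(example_log_liks, example_weights)
print(example_total)  # about -3.3
```

Skipping zero-weight records explicitly, rather than multiplying, is what keeps the total finite when a removed record has an infinite log-likelihood.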

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
SMAC3: ハイパーパラメータ最適化のための汎用性の高いベイズ最適化パッケージ

Algorithm parameters, in particular hyperparameters of machine learning algorithms, can substantially impact their performance. To support users in determining well-performing hyperparameter configurations for their algorithms, datasets and applications at hand, SMAC3 offers a robust and flexible framework for Bayesian Optimization, which can improve performance within a few evaluations. It offers several facades and pre-sets for typical use cases, such as optimizing hyperparameters, solving low dimensional continuous (artificial) global optimization problems and configuring algorithms to perform well across multiple problem instances. The SMAC3 package is available under a permissive BSD-license at https://github.com/automl/SMAC3.

アルゴリズムパラメーター、特に機械学習アルゴリズムのハイパーパラメーターは、そのパフォーマンスに大きな影響を与える可能性があります。SMAC3は、アルゴリズム、データセット、アプリケーションの高性能なハイパーパラメータ設定を決定する際にユーザーをサポートするために、ベイズ最適化のための堅牢で柔軟なフレームワークを提供し、数回の評価でパフォーマンスを向上させることができます。ハイパーパラメータの最適化、低次元の連続(人工)グローバル最適化問題の解法、複数の問題インスタンス間で適切に動作するためのアルゴリズムの構成など、一般的なユースケースに対応するいくつかのファサードとプリセットを提供します。SMAC3パッケージは、https://github.com/automl/SMAC3で寛容なBSDライセンスの下で入手可能です。

DoubleML – An Object-Oriented Implementation of Double Machine Learning in Python
DoubleML – Pythonでのダブル機械学習のオブジェクト指向実装

DoubleML is an open-source Python library implementing the double machine learning framework of Chernozhukov et al. (2018) for a variety of causal models. It contains functionalities for valid statistical inference on causal parameters when the estimation of nuisance parameters is based on machine learning methods. The object-oriented implementation of DoubleML provides a high flexibility in terms of model specifications and makes it easily extendable. The package is distributed under the MIT license and relies on core libraries from the scientific Python ecosystem: scikit-learn, numpy, pandas, scipy, statsmodels and joblib. Source code, documentation and an extensive user guide can be found at https://github.com/DoubleML/doubleml-for-py and https://docs.doubleml.org.

DoubleMLは、Chernozhukovら(2018)のdouble機械学習フレームワークをさまざまな因果モデルに実装したオープンソースのPythonライブラリです。これには、迷惑パラメータの推定が機械学習手法に基づいている場合に、因果関係パラメータに関する有効な統計的推論のための機能が含まれています。DoubleMLのオブジェクト指向実装は、モデル仕様の柔軟性が高く、拡張が容易です。このパッケージはMITライセンスの下で配布され、科学的なPythonエコシステムのコアライブラリ(scikit-learn、numpy、pandas、scipy、statsmodels、joblib)に依存しています。ソースコード、ドキュメント、および広範なユーザーガイドは、https://github.com/DoubleML/doubleml-for-pyとhttps://docs.doubleml.orgにあります。

LinCDE: Conditional Density Estimation via Lindsey’s Method
LinCDE: Lindsey 法による条件付き密度推定

Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey’s method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and non-negative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE’s efficacy through extensive simulations and three real data examples.

条件付き密度推定は、生物学、経済学、金融学、環境研究など、科学的かつ実用的な応用が可能な統計学の基本的な問題です。この論文では、勾配ブースティングとLindseyの方法(LinCDE)に基づく条件付き密度推定量を提案します。LinCDEは、密度ファミリーの柔軟なモデリングを認めており、モダリティや形状などの分布特性をキャプチャできます。特に、適切にパラメータ化されると、LinCDEは滑らかで負でない密度推定値を生成します。さらに、ブースト回帰木と同様に、LinCDEは自動機能選択を行います。LinCDEの有効性を、広範なシミュレーションと3つの実データ例を通じて実証します。

Toolbox for Multimodal Learn (scikit-multimodallearn)
マルチモーダル学習用ツールボックス (scikit-multimodallearn)

scikit-multimodallearn is a Python library for multimodal supervised learning, licensed under Free BSD, and compatible with the well-known scikit-learn toolbox (Fabian Pedregosa, 2011). This paper details the content of the library, including a specific multimodal data formatting and classification and regression algorithms. Use cases and examples are also provided.

scikit-multimodallearnは、マルチモーダル教師あり学習用のPythonライブラリで、Free BSDの下でライセンスされており、有名なscikit-learnツールボックスと互換性があります(Fabian Pedregosa、2011年)。この論文では、特定のマルチモーダルデータフォーマット、分類および回帰アルゴリズムなど、ライブラリの内容について詳しく説明します。また、ユースケースと例も紹介します。

Analytically Tractable Hidden-States Inference in Bayesian Neural Networks
ベイジアンニューラルネットワークにおける解析的に扱いやすい隠れ状態推論

With few exceptions, neural networks have been relying on backpropagation and gradient descent as the inference engine in order to learn the model parameters, because closed-form Bayesian inference for neural networks has been considered to be intractable. In this paper, we show how we can leverage the capabilities of tractable approximate Gaussian inference (TAGI) to infer hidden states, rather than only using it for inferring the network’s parameters. One novel aspect is that it allows inferring hidden states through the imposition of constraints designed to achieve specific objectives, as illustrated through three examples: (1) the generation of adversarial-attack examples, (2) the usage of a neural network as a black-box optimization method, and (3) the application of inference on continuous-action reinforcement learning. In these three examples, the constraints are, in (1), a target label chosen to fool a neural network, and, in (2) and (3), the derivative of the network with respect to its input, which is set to zero in order to infer the optimal input values that either maximize or minimize it. These applications showcase how tasks that were previously reserved to gradient-based optimization approaches can now be approached with analytically tractable inference.

いくつかの例外を除き、ニューラルネットワークは、モデルパラメータを学習するための推論エンジンとしてバックプロパゲーションと勾配降下法に依存してきました。これは、ニューラルネットワークの閉形式ベイズ推論が扱いにくいと考えられてきたためです。この論文では、扱いやすい近似ガウス推論(TAGI)の機能を、ネットワークのパラメータを推論するためだけに使うのではなく、隠れた状態を推論するためにどのように活用できるかを示します。新しい点の1つは、特定の目的を達成するために設計された制約を課すことで隠れた状態を推論できることです。これは、(1)敵対的攻撃例の生成、(2)ブラックボックス最適化手法としてのニューラルネットワークの使用、(3)連続アクション強化学習への推論の適用という3つの例で示されます。これら3つの例では、制約は(1)ではニューラルネットワークを騙すために選択されたターゲットラベル、(2と3)ではネットワークの入力に対する導関数で、ネットワークを最大化または最小化する最適な入力値を推論するためにゼロに設定されています。これらのアプリケーションは、以前は勾配ベースの最適化アプローチに限定されていたタスクに、解析的に扱いやすい推論でアプローチできることを示しています。

Innovations Autoencoder and its Application in One-class Anomalous Sequence Detection
イノベーションオートエンコーダと1クラス異常シーケンス検出への応用

An innovations sequence of a time series is a sequence of independent and identically distributed random variables with which the original time series has a causal representation. The innovation at a time is statistically independent of the history of the time series. As such, it represents the new information contained at present but not in the past. Because of its simple probability structure, the innovations sequence is the most efficient signature of the original. Unlike the principal or independent component representations, an innovations sequence preserves not only the complete statistical properties but also the temporal order of the original time series. A long-standing open problem is to find a computationally tractable way to extract an innovations sequence of non-Gaussian processes. This paper presents a deep learning approach, referred to as Innovations Autoencoder (IAE), that extracts innovations sequences using a causal convolutional neural network. An application of IAE to the one-class anomalous sequence detection problem with unknown anomaly and anomaly-free models is also presented.

時系列のイノベーションシーケンスは、元の時系列が因果表現を持つ、独立かつ同一に分布するランダム変数のシーケンスです。ある時点のイノベーションは、時系列の履歴から統計的に独立しています。そのため、イノベーションは現在含まれているが過去に含まれていない新しい情報を表します。イノベーションシーケンスは確率構造が単純なため、元の最も効率的なシグネチャです。主成分表現や独立成分表現とは異なり、イノベーションシーケンスは完全な統計特性だけでなく、元の時系列の時間的順序も保持します。長年の未解決の問題は、非ガウス過程のイノベーションシーケンスを抽出するための計算的に扱いやすい方法を見つけることです。この論文では、因果畳み込みニューラルネットワークを使用してイノベーションシーケンスを抽出する、イノベーションオートエンコーダ(IAE)と呼ばれるディープラーニングアプローチを紹介します。未知の異常および異常のないモデルを使用した1クラス異常シーケンス検出問題へのIAEの適用についても説明します。
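
To make the definition of an innovations sequence concrete, here is a minimal sketch for the simplest case, a first-order autoregressive process with a known coefficient. This is not the Innovations Autoencoder from the paper, which targets non-Gaussian processes with a causal convolutional network. The function names, the coefficient `0.8`, and the zero initial state are assumptions chosen for illustration.

```python
import random


def simulate_ar1(a, innovations, x0=0.0):
    """Causal representation: x_t = a * x_{t-1} + nu_t."""
    series, prev = [], x0
    for nu in innovations:
        prev = a * prev + nu
        series.append(prev)
    return series


def extract_innovations(a, series, x0=0.0):
    """Invert the recursion causally: nu_t = x_t - a * x_{t-1}."""
    recovered, prev = [], x0
    for x in series:
        recovered.append(x - a * prev)
        prev = x
    return recovered


rng = random.Random(0)
nu = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # i.i.d. innovations
x = simulate_ar1(0.8, nu)                        # observed time series
nu_hat = extract_innovations(0.8, x)             # recovered innovations
max_err = max(abs(u - v) for u, v in zip(nu, nu_hat))
print(max_err)  # tiny, limited only by floating-point rounding
```

Both maps use only present and past samples, which is the causal property the abstract refers to: the series can be rebuilt from the innovations, and the innovations can be recovered from the series.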

Overparameterization of Deep ResNet: Zero Loss and Mean-field Analysis
Deep ResNet のオーバーパラメータ化: ゼロ損失と平均場解析

Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, but a basic first-order optimization method (gradient descent) finds a global optimizer with perfect fit (zero-loss) in many practical situations. We examine this phenomenon for the case of Residual Neural Networks (ResNet) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of weights in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that the gradient descent for parameter training becomes a gradient flow for a probability distribution that is characterized by a partial differential equation (PDE) in the large-NN limit. Next, we show that under certain assumptions, the solution to the PDE converges in the training time to a zero-loss solution. Together, these results suggest that the training of the ResNet gives a near-zero loss if the ResNet is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.

トレーニングデータに適合するディープニューラルネットワーク(NN)のパラメーターを見つけることは非凸最適化問題ですが、基本的な1次最適化手法(勾配降下法)では、多くの実用的な状況で完全に適合する(損失ゼロ)グローバルオプティマイザーが見つかります。層の数(深さ)と各層の重みの数(幅)の両方が無限大になる極限領域で、滑らかな活性化関数を持つ残差ニューラルネットワーク(ResNet)の場合について、この現象を調べます。まず、平均場極限の議論を使用して、パラメータートレーニングの勾配降下法が、大規模NN極限で偏微分方程式(PDE)によって特徴付けられる確率分布の勾配フローになることを証明します。次に、特定の仮定の下で、PDEの解がトレーニング時間内に損失ゼロの解に収束することを示します。これらの結果を合わせると、ResNetが十分に大きい場合、ResNetのトレーニングで損失がほぼゼロになることが示唆されます。損失を所定のしきい値以下に減らすために必要な深さと幅を高い確率で推定します。

Cascaded Diffusion Models for High Fidelity Image Generation
高忠実度画像生成のためのカスケード拡散モデル

We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64×64, 3.52 at 128×128 and 4.88 at 256×256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256×256, outperforming VQ-VAE-2.

私たちは、カスケード拡散モデルは、サンプル品質を高めるための補助画像分類器の支援なしに、クラス条件付きImageNet生成ベンチマークで忠実度の高い画像を生成できることを示しています。カスケード拡散モデルは、最低解像度の標準拡散モデルから始まり、その後に1つ以上の超解像度拡散モデルが続き、画像をアップサンプリングして高解像度の詳細を追加する、解像度が増加する画像を生成する複数の拡散モデルのパイプラインで構成されています。カスケードパイプラインのサンプル品質は、超解像度モデルへの低解像度コンディショニング入力のデータ拡張として提案されているコンディショニング拡張に大きく依存していることがわかりました。私たちの実験では、コンディショニング拡張により、カスケードモデルでのサンプリング中にエラーが複合的に発生するのを防ぎ、カスケードパイプラインのトレーニングに役立ち、64×64で1.48、128×128で3.52、256×256の解像度で4.88のFIDスコアを達成してBigGAN-deepを上回り、256×256での分類精度スコアは63.02% (トップ1)と84.06% (トップ5)で、VQ-VAE-2を上回ることが示されました。

Beyond Sub-Gaussian Noises: Sharp Concentration Analysis for Stochastic Gradient Descent
サブガウスノイズを超えて:確率的勾配降下法のためのシャープな集中解析

In this paper, we study the concentration property of stochastic gradient descent (SGD) solutions. In existing concentration analyses, researchers impose restrictive requirements on the gradient noise, such as boundedness or sub-Gaussianity. We consider a much richer class of noise where only finitely-many moments are required, thus allowing heavy-tailed noises. In particular, we obtain Nagaev type high-probability upper bounds for the estimation errors of averaged stochastic gradient descent (ASGD) in a linear model. Specifically, we prove that, after $T$ steps of SGD, the ASGD estimate achieves an $O(\sqrt{\log(1/\delta)/T} + (\delta T^{q-1})^{-1/q})$ error rate with probability at least $1-\delta$, where $q>2$ controls the tail of the gradient noise. In comparison, one has the $O(\sqrt{\log(1/\delta)/T})$ error rate for sub-Gaussian noises. We also show that the Nagaev type upper bound is almost tight through an example, where the exact asymptotic form of the tail probability can be derived. Our concentration analysis indicates that, in the case of heavy-tailed noises, the polynomial dependence on the failure probability $\delta$ is generally unavoidable for the error rate of SGD.

この論文では、確率的勾配降下法(SGD)ソリューションの集中特性について検討します。既存の集中解析では、研究者は有界性やサブガウス性などの勾配ノイズに制限的な要件を課しています。私たちは、有限個のモーメントのみが必要な、より豊富なノイズクラスを検討し、これにより裾の重いノイズを許容します。特に、線形モデルにおける平均確率的勾配降下法(ASGD)の推定誤差に対するNagaevタイプの高確率上限を取得します。具体的には、SGDの$T$ステップ後、ASGD推定が少なくとも$1-\delta$の確率で$O(\sqrt{\log(1/\delta)/T} + (\delta T^{q-1})^{-1/q})$の誤差率を達成することを証明します。ここで、$q>2$は勾配ノイズの裾を制御します。比較すると、サブガウスノイズのエラー率は$O(\sqrt{\log(1/\delta)/T})$です。また、ナガエフ型の上限がほぼ厳密であることも例を通して示し、テール確率の正確な漸近形を導出できます。集中分析により、テールが重いノイズの場合、SGDのエラー率に対する失敗確率$\delta$への多項式依存性は一般に避けられないことがわかります。
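
The setting above, averaged SGD in a linear model with heavy-tailed gradient noise, can be simulated in a few lines. This is a rough sketch under stated assumptions rather than the paper's experiment: the one-dimensional model, the symmetric Pareto noise with tail index 3, the step size `0.1 / sqrt(t)`, and the function name `asgd_linear` are all illustrative choices.

```python
import math
import random


def asgd_linear(theta_true=2.0, steps=20000, tail_index=3.0, seed=0):
    """Run SGD on y = theta * x + noise and return (last iterate, average).

    The noise is a symmetric Pareto variable, so only its moments of order
    below `tail_index` are finite (heavy-tailed, not sub-Gaussian).
    """
    rng = random.Random(seed)
    theta = 0.0
    running_mean = 0.0
    for t in range(1, steps + 1):
        x = rng.gauss(0.0, 1.0)
        noise = rng.choice((-1.0, 1.0)) * rng.paretovariate(tail_index)
        y = theta_true * x + noise
        grad = (theta * x - y) * x            # gradient of 0.5 * (theta*x - y)**2
        theta -= 0.1 / math.sqrt(t) * grad    # decaying step size (assumed)
        running_mean += (theta - running_mean) / t  # average of the iterates
    return theta, running_mean


last_iterate, averaged = asgd_linear()
print(last_iterate, averaged)
```

Repeating the run over many seeds and inspecting the upper quantiles of `abs(averaged - theta_true)` is one way to observe how the error depends on the failure probability $\delta$.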

Optimal Transport for Stationary Markov Chains via Policy Iteration
方策反復による定常マルコフ連鎖の最適輸送

We study the optimal transport problem for pairs of stationary finite-state Markov chains, with an emphasis on the computation of optimal transition couplings. Transition couplings are a constrained family of transport plans that capture the dynamics of Markov chains. Solutions of the optimal transition coupling (OTC) problem correspond to alignments of the two chains that minimize long-term average cost. We establish a connection between the OTC problem and Markov decision processes, and show that solutions of the OTC problem can be obtained via an adaptation of policy iteration. For settings with large state spaces, we develop a fast approximate algorithm based on an entropy-regularized version of the OTC problem, and provide bounds on its per-iteration complexity. We establish a stability result for both the regularized and unregularized algorithms, from which a statistical consistency result follows as a corollary. We validate our theoretical results empirically through a simulation study, demonstrating that the approximate algorithm exhibits faster overall runtime with low error. Finally, we extend the setting and application of our methods to hidden Markov models, and illustrate the potential use of the proposed algorithms in practice with an application to computer-generated music.

私たちは、定常有限状態マルコフ連鎖のペアの最適輸送問題を研究し、最適遷移カップリングの計算に重点を置いています。遷移カップリングは、マルコフ連鎖のダイナミクスを捉える制約付きの輸送計画ファミリーです。最適遷移カップリング(OTC)問題の解は、長期平均コストを最小化する2つの連鎖の配置に対応します。私たちは、OTC問題とマルコフ決定プロセスの関係を確立し、OTC問題の解がポリシー反復の適応によって得られることを示します。大規模な状態空間の設定では、OTC問題のエントロピー正規化バージョンに基づく高速近似アルゴリズムを開発し、反復ごとの複雑さの境界を提供します。正規化アルゴリズムと非正規化アルゴリズムの両方の安定性の結果を確立し、そこから統計的一貫性の結果が帰結します。私たちは、シミュレーション研究を通じて理論的結果を経験的に検証し、近似アルゴリズムの方が全体的な実行時間が短く、エラーが少ないことを実証します。最後に、私たちの方法の設定と適用を隠れマルコフモデルに拡張し、コンピューター生成音楽への適用によって、提案されたアルゴリズムの実際の使用可能性を示します。

PAC Guarantees and Effective Algorithms for Detecting Novel Categories
PACの保証と新しいカテゴリを検出するための効果的なアルゴリズム

Open category detection is the problem of detecting “alien” test instances that belong to categories or classes that were not present in the training data. In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under general assumptions. Further, while there are algorithms for open category detection, there are few empirical results that directly report alien detection rates. Thus, there are significant theoretical and empirical gaps in our understanding of open category detection. In this paper, we take a step toward addressing this gap by studying a simple, but practically-relevant variant of open category detection. In our setting, we are provided with a “clean” training set that contains only the target categories of interest and an unlabeled “contaminated” training set that contains a fraction $\alpha$ of alien examples. Under the assumption that we know an upper bound on $\alpha$, we develop an algorithm that gives PAC-style guarantees on the alien detection rate, while aiming to minimize false alarms. Given an overall budget on the amount of training data, we also derive the optimal allocation of samples between the mixture and the clean data sets. Experiments on synthetic and standard benchmark datasets evaluate the regimes in which the algorithm can be effective and provide a baseline for further advancements. In addition, for the situation when an upper bound for $\alpha$ is not available, we employ nine different anomaly proportion estimators, and run experiments on both synthetic and standard benchmark data sets to compare their performance.

オープンカテゴリ検出は、トレーニングデータに存在しないカテゴリまたはクラスに属する「エイリアン」テストインスタンスを検出する問題です。多くのアプリケーションでは、このようなエイリアンを確実に検出することが、テストセット予測の安全性と精度を確保する上で重要です。残念ながら、一般的な仮定の下でエイリアンを検出する能力について理論的な保証を提供するアルゴリズムはありません。さらに、オープンカテゴリ検出のアルゴリズムはありますが、エイリアン検出率を直接報告する実証結果はほとんどありません。したがって、オープンカテゴリ検出の理解には、大きな理論的および実証的ギャップがあります。この論文では、シンプルだが実用的に関連するオープンカテゴリ検出のバリアントを研究することで、このギャップに対処するための一歩を踏み出します。私たちの設定では、関心のあるターゲットカテゴリのみを含む「クリーン」なトレーニングセットと、エイリアンの例の小数$\alpha$を含むラベルなしの「汚染された」トレーニングセットが提供されます。$\alpha$の上限がわかっているという仮定の下で、エイリアン検出率についてPACスタイルの保証を提供し、誤報を最小限に抑えることを目指すアルゴリズムを開発します。トレーニングデータ量の全体的な予算が与えられれば、混合データセットとクリーンデータセット間のサンプルの最適な割り当ても導き出されます。合成データセットと標準ベンチマークデータセットでの実験により、アルゴリズムが効果を発揮できる領域が評価され、さらなる進歩のためのベースラインが提供されます。さらに、$\alpha$の上限が利用できない状況では、9つの異なる異常比率推定量を使用し、合成データセットと標準ベンチマークデータセットの両方で実験を実行して、パフォーマンスを比較します。

Sampling Permutations for Shapley Value Estimation
シャプレー値推定のための順列サンプリング

Game-theoretic attribution techniques based on Shapley values are used to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation. Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi-Monte Carlo methods have not yet been applied to the space of permutations. To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically. First, we demonstrate quadrature techniques in a RKHS containing functions of permutations, using the Mallows kernel in combination with kernel herding and sequential Bayesian quadrature. The RKHS perspective also leads to quasi-Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations. Second, we exploit connections between the hypersphere $\mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties. Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations.

シャプレー値に基づくゲーム理論的帰属手法は、ブラックボックス機械学習モデルの解釈に使用されますが、その正確な計算は一般にNP困難であり、非自明なモデルには近似法が必要です。シャプレー値の計算は、一連の順列の合計として表現できるため、一般的なアプローチは、近似のためにこれらの順列のサブセットをサンプリングすることです。残念ながら、標準的なモンテカルロサンプリング法は収束が遅い場合があり、より洗練された準モンテカルロ法はまだ順列の空間に適用されていません。この問題に対処するために、2種類の近似法に基づく新しいアプローチを調査し、それらを経験的に比較します。まず、Mallowsカーネルをカーネルハーディングおよびシーケンシャルベイズ求積法と組み合わせて使用し、順列の関数を含むRKHSでの求積法を示します。RKHSの観点では、順列で定義された扱いやすい不一致尺度を使用して、準モンテカルロ型の誤差境界も得られます。次に、超球面$\mathbb{S}^{d-2}$と順列の関係を利用して、優れた特性を持つ順列サンプルを生成する実用的なアルゴリズムを作成します。実験では、上記の手法により、既存の方法よりもシャプレー値の推定値が大幅に改善され、同じ数のモデル評価でRMSEが小さくなることが示されています。
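
The baseline that the paper improves on, plain Monte Carlo sampling of permutations, can be sketched as follows. This shows only the standard estimator, not the kernel herding, Bayesian quadrature, or hypersphere-based samplers proposed in the paper. The function name, the toy value function, and its weights are hypothetical.

```python
import random


def shapley_permutation_estimate(value_fn, n_players, n_samples, seed=0):
    """Estimate Shapley values by averaging marginal contributions over
    uniformly sampled permutations of the players."""
    rng = random.Random(seed)
    players = list(range(n_players))
    phi = [0.0] * n_players
    for _ in range(n_samples):
        rng.shuffle(players)                 # one uniform random permutation
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in players:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            phi[p] += cur - prev             # marginal contribution of p
            prev = cur
    return [v / n_samples for v in phi]


toy_weights = [1.0, 2.0, 3.0]


def toy_value(coalition):
    """Additive game plus a bonus of 1 when players 0 and 1 cooperate."""
    base = sum(toy_weights[i] for i in coalition)
    bonus = 1.0 if {0, 1} <= coalition else 0.0
    return base + bonus


estimate = shapley_permutation_estimate(toy_value, 3, n_samples=2000)
print(estimate)  # close to the exact values [1.5, 2.5, 3.0]
```

For this toy game the exact Shapley values are 1.5, 2.5 and 3.0, because the bonus is split equally between players 0 and 1. Each sampled permutation costs `n_players + 1` evaluations of the value function, which is why the number of model evaluations is the natural budget when comparing samplers.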

Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks
線形連続時間再帰型ニューラルネットワークのための近似と最適化理論

We perform a systematic study of the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. On the approximation side, we prove a direct and an inverse approximation theorem of linear functionals using RNNs, which reveal the intricate connections between memory structures in the target and the corresponding approximation efficiency. In particular, we show that temporal relationships can be effectively approximated by RNNs if and only if the former possesses sufficient memory decay. On the optimization front, we perform detailed analysis of the optimization dynamics, including a precise understanding of the difficulty that may arise in learning relationships with long-term memory. The term “curse of memory” is coined to describe the uncovered phenomena, akin to the “curse of dimension” that plagues high-dimensional function approximation. These results form a relatively complete picture of the interaction of memory and recurrent structures in the linear dynamical setting.

私たちは、時系列データの入出力関係の学習に適用した場合の再帰型ニューラルネットワーク(RNN)の近似特性と最適化ダイナミクスの体系的な研究を行います。私たちは、線形関係によって生成されたデータから学習するために連続時間線形RNNを使用するという、単純だが代表的な設定を検討します。近似の面では、RNNを使用した線形汎関数の直接近似定理と逆近似定理を証明し、ターゲットのメモリ構造とそれに対応する近似効率との間の複雑な関係を明らかにします。特に、時間的関係は、それが十分なメモリ減衰を持つ場合にのみ、RNNによって効果的に近似できることを示します。最適化の面では、長期記憶を伴う関係を学習する際に生じる可能性のある困難さの正確な理解を含む、最適化ダイナミクスの詳細な分析を行います。「記憶の呪い」という用語は、高次元関数近似を悩ませる「次元の呪い」に似た、今回明らかになった現象を説明するために作られた造語です。これらの結果は、線形動的設定におけるメモリと再帰構造の相互作用の比較的完全な全体像を形成します。

The Correlation-assisted Missing Data Estimator
相関支援欠損データ推定量

We introduce a novel approach to estimation problems in settings with missing data. Our proposal — the Correlation-Assisted Missing data (CAM) estimator — works by exploiting the relationship between the observations with missing features and those without missing features in order to obtain improved prediction accuracy. In particular, our theoretical results elucidate general conditions under which the proposed CAM estimator has lower mean squared error than the widely used complete-case approach in a range of estimation problems. We showcase in detail how the CAM estimator can be applied to $U$-Statistics to obtain an unbiased, asymptotically Gaussian estimator that has lower variance than the complete-case $U$-Statistic. Further, in nonparametric density estimation and regression problems, we construct our CAM estimator using kernel functions, and show it has lower asymptotic mean squared error than the corresponding complete-case kernel estimator. We also include practical demonstrations throughout the paper using simulated data and the Terneuzen birth cohort and Brandsma datasets available from CRAN.

私たちは、欠損データのある設定での推定問題に対する新しいアプローチを紹介します。私たちの提案である相関支援欠損データ(CAM)推定量は、欠損した特徴のある観測値と欠損していない観測値の関係を利用して、予測精度を向上させます。特に、私たちの理論的結果は、提案されたCAM推定量が、さまざまな推定問題で広く使用されている完全ケースアプローチよりも平均二乗誤差が低い一般的な条件を明らかにします。CAM推定量を$U$統計に適用して、完全ケース$U$統計よりも分散が低い、偏りのない漸近ガウス推定量を取得する方法を詳細に示します。さらに、ノンパラメトリック密度推定および回帰問題では、カーネル関数を使用してCAM推定量を構築し、対応する完全ケースカーネル推定量よりも漸近平均二乗誤差が低いことを示します。また、この論文では、シミュレーションデータと、CRANから入手できるTerneuzen出生コホートおよびBrandsmaデータセットを使用した実践的なデモンストレーションも取り上げます。

Structure-adaptive Manifold Estimation
構造適応多様体推定

We consider a problem of manifold estimation from noisy observations. Many manifold learning procedures locally approximate a manifold by a weighted average over a small neighborhood. However, in the presence of large noise, the assigned weights become so corrupted that the averaged estimate shows very poor performance. We suggest a structure-adaptive procedure, which simultaneously reconstructs a smooth manifold and estimates projections of the point cloud onto this manifold. The proposed approach iteratively refines the weights on each step, using the structural information obtained at previous steps. After several iterations, we obtain nearly “oracle” weights, so that the final estimates are nearly efficient even in the presence of relatively large noise. In our theoretical study, we establish tight lower and upper bounds proving asymptotic optimality of the method for manifold estimation under the Hausdorff loss, provided that the noise degrades to zero fast enough.

私たちは、ノイズの多い観測値からの多様体推定の問題を考えます。多くの多様体学習手順は、小さな近傍の加重平均で多様体を局所的に近似します。ただし、ノイズが大きい場合、割り当てられた重みが非常に破損するため、平均化された推定値のパフォーマンスは非常に低下します。私たちは、滑らかな多様体を同時に再構築し、この多様体への点群の投影を推定する構造適応手順を提案します。提案されたアプローチでは、前のステップで得られた構造情報を使用して、各ステップの重みを反復的に改良します。数回の反復の後、ほぼ「オラクル」重みが得られるため、比較的大きなノイズが存在する場合でも、最終的な推定値はほぼ効率的です。私たちの理論的研究では、ハウスドルフ損失の下での多様体推定法の漸近最適性を証明する厳密な下限と上限を確立します。これは、ノイズが十分に速くゼロに劣化することを条件とします。

(f,Gamma)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics
(f,ガンマ)-ダイバージェンス: f-ダイバージェンスと積分確率メトリクス間の補間

We develop a rigorous and general framework for constructing information-theoretic divergences that subsume both $f$-divergences and integral probability metrics (IPMs), such as the $1$-Wasserstein distance. We prove under which assumptions these divergences, hereafter referred to as $(f,\Gamma)$-divergences, provide a notion of “distance” between probability measures and show that they can be expressed as a two-stage mass-redistribution/mass-transport process. The $(f,\Gamma)$-divergences inherit features from IPMs, such as the ability to compare distributions which are not absolutely continuous, as well as from $f$-divergences, namely the strict concavity of their variational representations and the ability to control heavy-tailed distributions for particular choices of $f$. When combined, these features establish a divergence with improved properties for estimation, statistical learning, and uncertainty quantification applications. Using statistical learning as an example, we demonstrate their advantage in training generative adversarial networks (GANs) for heavy-tailed, not-absolutely continuous sample distributions. We also show improved performance and stability over gradient-penalized Wasserstein GAN in image generation.

私たちは、$f$ダイバージェンスと$1$-ワッサーシュタイン距離などの積分確率メトリクス(IPM)の両方を包含する情報理論的ダイバージェンスを構築するための厳密で一般的なフレームワークを開発しました。私たちは、これらのダイバージェンス(以降$(f,\Gamma)$ダイバージェンスと呼ぶ)がどのような仮定の下で確率尺度間の「距離」の概念を提供するかを証明し、これらが2段階の質量再分配/質量輸送プロセスとして表現できることを示します。$(f,\Gamma)$ダイバージェンスは、絶対的に連続していない分布を比較する機能などのIPMからの特徴と、変分表現の厳密な凹面性と特定の$f$選択に対するヘビーテール分布を制御する機能などの$f$ダイバージェンスからの特徴を継承しています。これらの機能を組み合わせると、推定、統計学習、および不確実性定量化アプリケーション向けの特性が向上したダイバージェンスが確立されます。統計学習を例に、ヘビーテールで絶対的に連続ではないサンプル分布に対して生成的敵対的ネットワーク(GAN)をトレーニングする際の利点を示します。また、画像生成において、勾配ペナルティ付きWasserstein GANよりもパフォーマンスと安定性が向上していることも示します。

Score Matched Neural Exponential Families for Likelihood-Free Inference
尤度フリー推論のためのスコアマッチングによるニューラル指数型分布族

Bayesian Likelihood-Free Inference (LFI) approaches make it possible to obtain posterior distributions for stochastic models with intractable likelihood, by relying on model simulations. In Approximate Bayesian Computation (ABC), a popular LFI method, summary statistics are used to reduce data dimensionality. ABC algorithms adaptively tailor simulations to the observation in order to sample from an approximate posterior, whose form depends on the chosen statistics. In this work, we introduce a new way to learn ABC statistics: we first generate parameter-simulation pairs from the model independently of the observation; then, we use Score Matching to train a neural conditional exponential family to approximate the likelihood. The exponential family is the largest class of distributions with fixed-size sufficient statistics; thus, we use them in ABC, which is intuitively appealing and has state-of-the-art performance. In parallel, we insert our likelihood approximation in an MCMC for doubly intractable distributions to draw posterior samples. We can repeat that for any number of observations with no additional model simulations, with performance comparable to related approaches. We validate our methods on toy models with known likelihood and a large-dimensional time-series model.

ベイジアン尤度フリー推論(LFI)アプローチでは、モデルシミュレーションに頼ることで、扱いにくい尤度を持つ確率モデルの事後分布を取得できます。一般的なLFI手法である近似ベイジアン計算(ABC)では、要約統計を使用してデータの次元を削減します。ABCアルゴリズムは、選択された統計量に依存する形式を持つ近似事後分布からサンプリングするために、シミュレーションを観測に合わせて適応的に調整します。この研究では、ABC統計量を学習する新しい方法を紹介します。まず、観測量とは無関係にモデルからパラメーターとシミュレーションのペアを生成します。次に、スコアマッチングを使用して、尤度を近似するニューラル条件付き指数族をトレーニングします。指数族は、固定サイズの十分な統計量を持つ分布の最大のクラスであるため、直感的に魅力的で最先端のパフォーマンスを持つABCで使用します。並行して、事後サンプルを抽出できるように、二重に扱いにくい分布のMCMCに尤度近似を挿入します。追加のモデルシミュレーションなしで、任意の数の観測に対してこれを繰り返すことができ、パフォーマンスは関連アプローチに匹敵します。私たちは、尤度が既知のトイモデルと大規模次元の時系列モデルで私たちの方法を検証します。

Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric
Wasserstein計量を用いた実直線上の分布データのための投影統計手法

We present a novel class of projected methods to perform statistical analysis on a data set of probability distributions on the real line, with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure by mapping the data to a suitable linear space and using a metric projection operator to constrain the results in the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods, exploiting a constrained B-spline approximation. As a byproduct of our approach, we are also able to derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance for a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated, and asymptotic consistency is proven. Two real world applications to Covid-19 mortality in the US and wind speed forecasting are discussed.

私たちは、2-ワッサーシュタイン距離を持つ実数直線上の確率分布のデータセットに対して統計分析を実行するための新しいクラスの投影法を提示します。特に、主成分分析(PCA)と回帰に焦点を当てる。これらのモデルを定義するために、データを適切な線形空間にマッピングし、距離投影演算子を使用して結果をワッサーシュタイン空間に制約することにより、その弱リーマン構造に密接に関連するワッサーシュタイン空間の表現を利用します。接点を慎重に選択することで、制約付きBスプライン近似を利用した高速な経験的手法を導出することができます。我々のアプローチの副産物として、分布のPCAに関する以前の研究のより高速なルーチンを導出することもできます。シミュレーション研究によって、我々は我々のアプローチを以前に提案された方法と比較し、我々の投影PCAが計算コストのほんの一部で同様のパフォーマンスを示し、投影回帰が誤った指定下でも極めて柔軟であることを示す。モデルのいくつかの理論的特性が調査され、漸近的一貫性が証明されています。米国におけるCovid-19による死亡率と風速予測への2つの実際の応用について説明します。

Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization
Mini最適化からミニマックス最適化までの加速ゼロ次および1次運動量法

In this paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box mini-optimization where only function values can be obtained. Moreover, we prove that our Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$ where $d$ denotes the variable dimension. In particular, our Acc-ZOM does not need the large batches required in the existing zeroth-order stochastic algorithms. Meanwhile, we propose an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method for black-box minimax optimization, where only function values can be obtained. Our Acc-ZOMDA obtains a low query complexity of $\tilde{O}((d_1+d_2)^{3/4}\kappa_y^{4.5}\epsilon^{-3})$ without requiring large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote variable dimensions and $\kappa_y$ is the condition number. Moreover, we propose an accelerated first-order momentum descent ascent (Acc-MDA) method for minimax optimization, whose explicit gradients are accessible. Our Acc-MDA achieves a low gradient complexity of $\tilde{O}(\kappa_y^{4.5}\epsilon^{-3})$ without requiring large batches for finding an $\epsilon$-stationary point. In particular, our Acc-MDA can obtain a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ with a batch size $O(\kappa_y^4)$, which improves the best known result by a factor of $O(\kappa_y^{1/2})$. Extensive experimental results on black-box adversarial attack to deep neural networks and poisoning attack to logistic regression demonstrate the efficiency of our algorithms.

この論文では、非凸ミニ最適化とミニマックス最適化の両方に対して、加速ゼロ次および一次運動量法のクラスを提案します。具体的には、関数値のみを取得できるブラックボックスミニ最適化に対して、新しい加速ゼロ次運動量(Acc-ZOM)法を提案します。さらに、Acc-ZOM法は、$\epsilon$定常点を見つけるためのクエリ複雑度が$\tilde{O}(d^{3/4}\epsilon^{-3})$と低く、既知の最良の結果を$O(d^{1/4})$倍改善することを証明します(ここで、$d$は変数の次元を表します)。特に、Acc-ZOMは、既存のゼロ次確率的アルゴリズムで必要とされる大きなバッチを必要としません。一方、関数値のみを取得できるブラックボックスのミニマックス最適化に対して、加速ゼロ次運動量降下上昇法(Acc-ZOMDA)を提案します。Acc-ZOMDAは、$\epsilon$定常点を見つけるために大きなバッチを必要とせずに、$\tilde{O}((d_1+d_2)^{3/4}\kappa_y^{4.5}\epsilon^{-3})$という低いクエリ複雑度を実現します。ここで、$d_1$と$d_2$は変数の次元、$\kappa_y$は条件数を表します。さらに、明示的な勾配にアクセスできるミニマックス最適化のための加速一次運動量降下上昇法(Acc-MDA)を提案します。Acc-MDAは、$\epsilon$定常点を見つけるために大きなバッチを必要とせずに、$\tilde{O}(\kappa_y^{4.5}\epsilon^{-3})$という低い勾配複雑度を実現します。特に、Acc-MDAは、バッチサイズ$O(\kappa_y^4)$で$\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$というさらに低い勾配複雑度を実現でき、これは既知の最良の結果を$O(\kappa_y^{1/2})$倍改善します。ディープニューラルネットワークに対するブラックボックス敵対的攻撃とロジスティック回帰に対するポイズニング攻撃に関する広範な実験結果が、私たちのアルゴリズムの効率性を実証しています。
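To make the black-box setting concrete, here is a minimal sketch of a zeroth-order momentum loop that only queries function values. It is not the paper's Acc-ZOM method: it assumes deterministic coordinate-wise forward differences in place of random-direction estimators, and a plain exponential-average momentum. The names `fd_gradient`, `zeroth_order_momentum`, and `black_box`, as well as all parameter values, are hypothetical.

```python
def fd_gradient(f, x, mu=1e-4):
    """Estimate the gradient from function values only (forward differences)."""
    fx = f(x)
    grad = []
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += mu
        grad.append((f(shifted) - fx) / mu)
    return grad


def zeroth_order_momentum(f, x0, step=0.1, beta=0.5, iters=200):
    """Descend along an exponential average of zeroth-order gradient estimates."""
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(iters):
        g = fd_gradient(f, x)
        # Momentum: blend the previous direction with the new estimate.
        m = [(1.0 - beta) * mj + beta * gj for mj, gj in zip(m, g)]
        x = [xj - step * mj for xj, mj in zip(x, m)]
    return x


def black_box(x):
    """Toy objective (an assumption for illustration); minimiser is (1, 1, 1)."""
    return sum((xj - 1.0) ** 2 for xj in x)


x_final = zeroth_order_momentum(black_box, [5.0, -3.0, 0.0])
```

Each iteration costs `d + 1` function queries here, which is the kind of dimension dependence that query-complexity bounds such as $\tilde{O}(d^{3/4}\epsilon^{-3})$ quantify.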

Optimality and Stability in Non-Convex Smooth Games
非凸スムーズゲームにおける最適性と安定性

Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years have seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases. This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games.

凸凹関数の鞍点への収束は数十年にわたって研究されてきましたが、近年では、最近の幅広い応用に動機づけられて、非凸(ゼロ和)滑らかなゲームへの関心が高まっています。局所最適点がどのように定義され、どのアルゴリズムがそのような点に収束できるかは、依然として興味深い研究課題です。興味深い概念の一つが局所ミニマックス点であり、これは広く知られている勾配降下上昇法と強く関連しています。この論文では、他の解概念との関係や最適性条件など、局所ミニマックス点の包括的な分析を提供することを目的とします。局所鞍点は、緩やかな連続性の仮定の下で、一様局所ミニマックス点と呼ばれる特別なタイプの局所ミニマックス点と見なせることを示します。(非凸)二次ゲームでは、局所ミニマックス点は(ある意味で)大域ミニマックス点と同等であることを示します。最後に、局所ミニマックス点付近での勾配アルゴリズムの安定性を検討します。勾配アルゴリズムは、非退化の場合には局所/大域ミニマックス点に収束できますが、一般的な場合には失敗することがよくあります。これは、非凸滑らかなゲームにおいて、新しいアルゴリズム、または鞍点とミニマックス点を超える新しい概念のいずれかが必要であることを意味します。
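The contrast between the non-degenerate and the general case can be illustrated with simultaneous gradient descent ascent on two toy quadratic games. This is only a sketch: the two objectives, the step size, and the helper name `gda` are assumptions chosen for illustration and are not taken from the paper.

```python
def gda(grad_x, grad_y, x, y, step=0.1, iters=200):
    """Simultaneous gradient descent (in x) and ascent (in y)."""
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - step * gx, y + step * gy
    return x, y


# Non-degenerate game f(x, y) = 0.5*x**2 + x*y - 0.5*y**2:
# strongly convex in x and strongly concave in y, with solution (0, 0).
x1, y1 = gda(lambda x, y: x + y, lambda x, y: x - y, 1.0, 1.0)

# Degenerate bilinear game f(x, y) = x*y: the iterates spiral away from
# (0, 0) because every step multiplies x**2 + y**2 by (1 + step**2).
x2, y2 = gda(lambda x, y: y, lambda x, y: x, 1.0, 1.0)
```

In the first game the iterates contract towards the origin, while in the bilinear game the same update rule moves away from it, even though the origin is a saddle point in both.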

SODEN: A Scalable Continuous-Time Survival Model through Ordinary Differential Equation Networks
SODEN:常微分方程式ネットワークによるスケーラブルな連続時間生存モデル

In this paper, we propose a flexible model for survival analysis using neural networks along with scalable optimization algorithms. One key technical challenge for directly applying maximum likelihood estimation (MLE) to censored data is that evaluating the objective function and its gradients with respect to model parameters requires the calculation of integrals. To address this challenge, we recognize from a novel perspective that the MLE for censored data can be viewed as a differential-equation constrained optimization problem. Following this connection, we model the distribution of event time through an ordinary differential equation and utilize efficient ODE solvers and adjoint sensitivity analysis to numerically evaluate the likelihood and the gradients. Using this approach, we are able to 1) provide a broad family of continuous-time survival distributions without strong structural assumptions, 2) obtain powerful feature representations using neural networks, and 3) allow efficient estimation of the model in large-scale applications using stochastic gradient descent. Through both simulation studies and real-world data examples, we demonstrate the effectiveness of the proposed method in comparison to existing state-of-the-art deep learning survival analysis models. The implementation of the proposed SODEN approach has been made publicly available at https://github.com/jiaqima/SODEN.

この論文では、ニューラルネットワークとスケーラブルな最適化アルゴリズムを使用した柔軟な生存分析モデルを提案します。最尤推定(MLE)を打ち切りデータに直接適用する際の重要な技術的課題の1つは、目的関数とそのモデルパラメーターに関する勾配を評価するために積分の計算が必要になることです。この課題に対処するために、打ち切りデータのMLEは微分方程式制約付き最適化問題として捉えられるという新しい観点を導入します。この関係に従って、イベント時間の分布を常微分方程式でモデル化し、効率的なODEソルバーと随伴感度分析を使用して尤度と勾配を数値的に評価します。このアプローチを使用すると、1)強い構造仮定なしで連続時間生存分布の広範なファミリーを提供し、2)ニューラルネットワークを使用して強力な特徴表現を取得し、3)確率的勾配降下法を使用して大規模なアプリケーションでモデルを効率的に推定できるようになります。シミュレーション研究と実際のデータ例の両方を通じて、既存の最先端のディープラーニング生存分析モデルと比較して、提案された方法の有効性を実証します。提案されたSODENアプローチの実装は、https://github.com/jiaqima/SODENで公開されています。
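The link between the censored likelihood and an ODE can be sketched as follows. The cumulative hazard $\Lambda(t)$ solves $\Lambda'(t) = h(t)$ with $\Lambda(0) = 0$, and an observation $(t, \delta)$ contributes $\delta \log h(t) - \Lambda(t)$ to the log-likelihood. This sketch is not the SODEN implementation: it assumes a hazard that depends on time only (a Weibull hazard with hypothetical parameters `SHAPE` and `SCALE`) rather than a neural network, and it uses a hand-written RK4 integrator instead of an ODE solver library.

```python
import math


def cumulative_hazard(hazard, t_end, steps=1000):
    """Integrate dLambda/dt = hazard(t), Lambda(0) = 0, with classical RK4."""
    dt = t_end / steps
    lam, t = 0.0, 0.0
    for _ in range(steps):
        k1 = hazard(t)
        k2 = hazard(t + 0.5 * dt)  # equals k3 because hazard ignores Lambda
        k4 = hazard(t + dt)
        lam += dt * (k1 + 4.0 * k2 + k4) / 6.0
        t += dt
    return lam


def censored_log_likelihood(hazard, times, events):
    """Sum of event * log h(t) - Lambda(t) over right-censored observations."""
    total = 0.0
    for t, observed in zip(times, events):
        total -= cumulative_hazard(hazard, t)
        if observed:
            total += math.log(hazard(t))
    return total


SHAPE, SCALE = 1.5, 2.0  # hypothetical Weibull parameters


def weibull_hazard(t):
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1.0)


times = [0.5, 1.2, 3.0]
events = [1, 0, 1]  # 1 = event observed, 0 = right-censored
ll_ode = censored_log_likelihood(weibull_hazard, times, events)
# Closed form for comparison: Lambda(t) = (t / SCALE) ** SHAPE.
ll_exact = sum(
    e * math.log(weibull_hazard(t)) - (t / SCALE) ** SHAPE
    for t, e in zip(times, events)
)
```

The Weibull case has a closed-form cumulative hazard, so it serves as a check on the numerical integration; the ODE route matters when the hazard has no such closed form.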

Model Averaging Is Asymptotically Better Than Model Selection For Prediction
モデルの平均化は、予測のためのモデル選択よりも漸近的に優れています

We compare the performance of six model average predictors (Mallows' model averaging, stacking, Bayes model averaging, bagging, random forests, and boosting) to the components used to form them. In all six cases we identify conditions under which the model average predictor is consistent for its intended limit and performs as well as or better than any of its components asymptotically. This is well known empirically, especially for complex problems, although theoretical results do not seem to have been formally established. We have focused our attention on the regression context since that is where model averaging techniques differ most often from current practice.

私たちは、6つのモデル平均予測子(Mallowsのモデル平均化、スタッキング、ベイズモデル平均化、バギング、ランダムフォレスト、ブースティング)のパフォーマンスを、それらを形成するために使用されるコンポーネントと比較します。6つのケースすべてで、モデル平均予測子が意図した極限に対して一致性を持ち、漸近的にそのコンポーネントのいずれと比べても同等以上のパフォーマンスを発揮する条件を特定します。これは、特に複雑な問題については経験的によく知られていますが、理論的な結果は正式には確立されていないようです。回帰の文脈に注目したのは、モデル平均化手法が現在の実務と最も頻繁に異なるのがこの文脈であるためです。

Active Learning for Nonlinear System Identification with Guarantees
保証付き非線形システム同定のためのアクティブラーニング

While the identification of nonlinear dynamical systems is a fundamental building block of model-based reinforcement learning and feedback control, its sample complexity is only understood for systems that either have discrete states and actions or can be identified from data generated by i.i.d. random inputs. Nonetheless, many interesting dynamical systems have continuous states and actions and can only be identified through a judicious choice of inputs. Motivated by practical settings, we study a class of nonlinear dynamical systems whose state transitions depend linearly on a known feature embedding of state-action pairs. To estimate such systems in finite time, identification methods must explore all directions in feature space. We propose an active learning approach that achieves this by repeating three steps: trajectory planning, trajectory tracking, and re-estimation of the system from all available data. We show that our method estimates nonlinear dynamical systems at a parametric rate, similar to the statistical rate of standard linear regression.

非線形動的システムの識別は、モデルベースの強化学習とフィードバック制御の基本的な構成要素ですが、そのサンプルの複雑さは、離散的な状態とアクションを持つシステム、またはi.i.d.ランダム入力によって生成されたデータから識別できるシステムについてのみ理解されています。とはいえ、多くの興味深い動的システムは連続的な状態とアクションを持ち、入力を慎重に選択することによってのみ識別できます。実用的な設定に動機付けられて、状態遷移が状態とアクションのペアの既知の特徴埋め込みに線形に依存する非線形動的システムのクラスを研究します。このようなシステムを有限時間で推定するには、識別方法で特徴空間のすべての方向を探索する必要があります。軌道計画、軌道追跡、および利用可能なすべてのデータからのシステムの再推定という3つのステップを繰り返すことでこれを実現するアクティブラーニングアプローチを提案します。この方法では、標準の線形回帰の統計的速度と同様のパラメトリック速度で非線形動的システムを推定できることを示します。

An Improper Estimator with Optimal Excess Risk in Misspecified Density Estimation and Logistic Regression
誤指定密度推定とロジスティック回帰における最適過剰リスクを持つ不適切な推定量

We introduce a procedure for conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for statistical learning. On standard examples, this bound scales as $d/n$ with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches reducing to the sequential problem, our bounds remove suboptimal $\log n$ factors and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by leverage scores of covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ bounds the norm of features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve better rate than $\min({B R}/{\sqrt{n}}, {d e^{BR}}/{n} )$ in general. This provides a more practical alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly addressing a question raised by Foster et al. (2018).

私たちは、対数損失の下での条件付き密度推定の手順を導入し、これをSMP (Sample Minmax Predictor)と呼ぶ。この推定量は、統計学習のための新しい一般的な過剰リスク境界を最小化します。標準的な例では、この境界は$d/n$に比例し、$d$はモデル次元、$n$はサンプルサイズであり、モデルが誤って指定された場合でも依然として有効です。不適切な(モデル外の)手順であるSMPは、誤って指定された場合に過剰リスクが悪化する最大尤度推定量などのモデル内推定量よりも優れています。逐次問題に還元するアプローチと比較して、我々の境界は最適でない$\log n$因子を除去し、無制限のクラスを処理できます。ガウス線形モデルの場合、SMPの予測値とリスク境界は共変量のてこ比スコアによって決まり、ノイズ分散や線形モデルの近似誤差の条件なしに、適切に指定されたケースで最適なリスクにほぼ一致します。ロジスティック回帰の場合、SMPは仮想サンプルに依存する確率予測の較正に対する非ベイジアン手法を提供し、2つのロジスティック回帰を解くことで計算できます。これは$O((d + B^2R^2)/n)$の非漸近的過剰リスクを実現します。ここで、$R$は特徴のノルムを、$B$は比較パラメータのノルムを境界とします。対照的に、モデル内推定量では一般に$\min({B R}/{\sqrt{n}}, {d e^{BR}}/{n} )$よりも優れたレートを実現することはできません。これは、近似事後サンプリングを必要とするベイジアン手法のより実用的な代替手段を提供し、Fosterら(2018)が提起した疑問に部分的に対処します。
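One reading of the "virtual samples" and "two logistic regressions" in the abstract is the following: for a query point, fit the model once with the query labelled 1 and once labelled 0, read off the probability each fit gives to its own virtual label, and normalise. The sketch below follows that reading in one dimension. It is an illustration under assumptions, not the paper's algorithm: the ridge penalty, the gradient-descent fitting routine, the toy data, and all names (`fit_logistic`, `smp_predict`, `XS`, `YS`) are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, ridge=0.1, step=0.2, iters=1000):
    """1-D logistic regression (slope w, intercept b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        gw, gb = ridge * w, 0.0  # ridge on the slope only (an assumption)
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x / n
            gb += err / n
        w -= step * gw
        b -= step * gb
    return w, b


def smp_predict(xs, ys, x_new):
    """Probability of label 1 at x_new from two fits on augmented samples."""
    own_label_prob = {}
    for label in (0, 1):
        # Add the query point as a virtual sample with the candidate label.
        w, b = fit_logistic(xs + [x_new], ys + [label])
        p1 = sigmoid(w * x_new + b)
        own_label_prob[label] = p1 if label == 1 else 1.0 - p1
    return own_label_prob[1] / (own_label_prob[0] + own_label_prob[1])


XS = [-2.0, -1.0, 0.5, -0.5, 1.0, 2.0]
YS = [0, 0, 0, 1, 1, 1]
p_hi = smp_predict(XS, YS, 2.0)
p_lo = smp_predict(XS, YS, -2.0)
```

Because each candidate label gets to influence its own fit, the normalised prediction is pulled towards 1/2 relative to a single plug-in fit, which is the calibration effect the abstract alludes to.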

A Class of Conjugate Priors for Multinomial Probit Models which Includes the Multivariate Normal One
多項プロビットモデルのための共役事前確率のクラス (多変量正規プロビットモデルを含む)

Multinomial probit models are routinely-implemented representations for learning how the class probabilities of categorical response data change with $p$ observed predictors. Although several frequentist methods have been developed for estimation, inference and classification within such a class of models, Bayesian inference is still lagging behind. This is due to the apparent absence of a tractable class of conjugate priors, that may facilitate posterior inference on the multinomial probit coefficients. Such an issue has motivated increasing efforts toward the development of effective Markov chain Monte Carlo methods, but state-of-the-art solutions still face severe computational bottlenecks, especially in high dimensions. In this article, we show that the entire class of unified skew-normal (SUN) distributions is conjugate to several multinomial probit models. Leveraging this result and the SUN properties, we improve upon state-of-the-art solutions for posterior inference and classification both in terms of closed-form results for several functionals of interest, and also by developing novel computational methods relying either on independent and identically distributed samples from the exact posterior or on scalable and accurate variational approximations based on blocked partially-factorized representations. As illustrated in simulations and in a gastrointestinal lesions application, the magnitude of the improvements relative to current methods is particularly evident, in practice, when the focus is on high-dimensional studies.

多項プロビットモデルは、カテゴリ応答データのクラス確率が$p$個の観測予測子によってどのように変化するかを学習するための、日常的に実装されている表現です。このようなモデルのクラス内で推定、推論、分類を行うための頻度主義的手法がいくつか開発されていますが、ベイズ推論はまだ遅れています。これは、多項プロビット係数の事後推論を容易にする可能性のある扱いやすい共役事前分布のクラスが明らかに存在しないためです。このような問題により、効果的なマルコフ連鎖モンテカルロ法の開発に向けた取り組みが強化されていますが、最先端のソリューションは、特に高次元で深刻な計算上のボトルネックに直面しています。この記事では、統一歪正規分布(SUN)のクラス全体が、いくつかの多項プロビットモデルと共役であることを示します。この結果とSUNの特性を活用して、いくつかの関心関数の閉形式結果の点でも、また、正確な事後分布からの独立かつ同一に分布するサンプル、またはブロックされた部分因数分解表現に基づくスケーラブルで正確な変分近似のいずれかに依存する新しい計算方法を開発することで、事後推論および分類の最先端のソリューションを改善します。シミュレーションと胃腸病変アプリケーションで示されているように、現在の方法と比較した改善の大きさは、実際には高次元の研究に重点が置かれている場合に特に明らかです。

Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning
マルチステップモデルに依存しないメタ学習の理論的収束

As a popular meta-learning approach, the model-agnostic meta-learning (MAML) algorithm has been widely used due to its simplicity and effectiveness. However, the convergence of the general multi-step MAML still remains unexplored. In this paper, we develop a new theoretical framework to provide such convergence guarantee for two types of objective functions that are of interest in practice: (a) resampling case (e.g., reinforcement learning), where loss functions take the form in expectation and new data are sampled as the algorithm runs; and (b) finite-sum case (e.g., supervised learning), where loss functions take the finite-sum form with given samples. For both cases, we characterize the convergence rate and the computational complexity to attain an $\epsilon$-accurate solution for multi-step MAML in the general nonconvex setting. In particular, our results suggest that an inner-stage stepsize needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence. From the technical perspective, we develop novel techniques to deal with the nested structure of the meta gradient for multi-step MAML, which can be of independent interest.

人気のメタ学習アプローチとして、モデルに依存しないメタ学習(MAML)アルゴリズムは、そのシンプルさと有効性から広く使用されています。しかし、一般的なマルチステップMAMLの収束性は依然として解明されていません。この論文では、実用上重要な2種類の目的関数に対してそのような収束保証を提供するための新しい理論的枠組みを開発します。(a)再サンプリングケース(強化学習など)では、損失関数は期待値の形を取り、アルゴリズムの実行中に新しいデータがサンプリングされます。(b)有限和ケース(教師あり学習など)では、損失関数は与えられたサンプルに対する有限和の形を取ります。両方のケースについて、一般的な非凸設定でマルチステップMAMLの$\epsilon$精度の解を得るための収束率と計算量を特徴付けます。特に、私たちの結果は、$N$ステップMAMLの収束を保証するためには、内部ステージのステップサイズを内部ステージのステップ数$N$に反比例するように選択する必要があることを示唆しています。技術的な観点からは、マルチステップMAMLのメタ勾配の入れ子構造を扱うための新しい手法を開発しており、これはそれ自体として独立した関心の対象になり得ます。
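The nested structure of the multi-step meta-gradient, and the choice of an inner step size proportional to $1/N$, can be sketched on scalar quadratic tasks $L_i(w) = \tfrac{1}{2}(w - c_i)^2$. This is a toy illustration under assumptions, not the paper's analysis: the task family, the constant `scale`, and the function names are hypothetical, and the Jacobian of the inner loop is accumulated by hand rather than by automatic differentiation.

```python
def maml_meta_gradient(w, center, inner_steps, inner_step):
    """Meta-gradient of L(w_N) for the task loss L(w) = 0.5 * (w - center)**2."""
    adapted = w
    jacobian = 1.0  # d(adapted)/dw, accumulated through the inner steps
    for _ in range(inner_steps):
        grad = adapted - center  # L'(adapted)
        hess = 1.0               # L''(adapted) for this quadratic
        adapted = adapted - inner_step * grad
        jacobian *= 1.0 - inner_step * hess
    # Chain rule: dL(w_N)/dw = (dw_N/dw) * L'(w_N).
    return jacobian * (adapted - center)


def train_maml(centers, inner_steps, scale=0.5, meta_step=0.5, meta_iters=500):
    """Meta-train with the inner step size set to scale / inner_steps."""
    inner_step = scale / inner_steps
    w = 0.0
    for _ in range(meta_iters):
        g = sum(
            maml_meta_gradient(w, c, inner_steps, inner_step) for c in centers
        ) / len(centers)
        w -= meta_step * g
    return w


w_meta = train_maml([-1.0, 2.0, 5.0], inner_steps=5)
```

For these symmetric quadratic tasks the meta-objective is minimised at the mean of the task centres, so `w_meta` should approach 2.0. The product of per-step factors in `jacobian` is the nested term that grows or shrinks with $N$, which is why the inner step size has to be tied to $N$.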

Novel Min-Max Reformulations of Linear Inverse Problems
線形逆問題の新規最小‐最大再定式化

In this article, we delve into the class of so-called ill-posed Linear Inverse Problems (LIP), which simply refers to the task of recovering the entire signal from relatively few random linear measurements. Such problems arise in a variety of settings, with applications ranging from medical image processing to recommender systems. We propose a slightly generalized version of the error-constrained linear inverse problem and obtain a novel and equivalent convex-concave min-max reformulation by providing an exposition of its convex geometry. Saddle points of the min-max problem are completely characterized in terms of a solution to the LIP, and vice versa. Applying simple saddle-point-seeking ascent-descent type algorithms to solve the min-max problems provides novel and simple algorithms to find a solution to the LIP. Moreover, the reformulation of an LIP as the min-max problem provided in this article is crucial in developing methods to solve the dictionary learning problem with almost sure recovery constraints.

この記事では、いわゆる不良設定線形逆問題(LIP)のクラスについて詳しく説明します。これは、比較的少数のランダムな線形測定から信号全体を復元するタスクを指します。このような問題は、医療画像処理、推奨システムなど、さまざまな設定で発生します。私たちは、誤差制約線形逆問題の若干一般化されたバージョンを提案し、その凸形状の説明を提供することで、新しい同等の凸凹最小最大再定式化を取得します。最小最大問題の鞍点は、LIPの解の観点から完全に特徴付けられ、逆もまた同様です。最小最大問題を解決するために、単純な鞍点探索の上昇下降型アルゴリズムを適用すると、LIPの解を見つけるための新しい単純なアルゴリズムが得られます。さらに、この記事で提供されるLIPの最小最大問題としての再定式化は、ほぼ確実な復元制約を持つ辞書学習問題を解決する方法の開発に不可欠です。

Data-Derived Weak Universal Consistency
データ派生の弱いユニバーサル整合性

Many current applications in data science need rich model classes to adequately represent the statistics that may be driving the observations. Such rich model classes may be too complex to admit uniformly consistent estimators. In such cases, it is conventional to settle for estimators with guarantees on convergence rate where the performance can be bounded in a model-dependent way, i.e. pointwise consistent estimators. But this viewpoint has the practical drawback that estimator performance is a function of the unknown model within the model class that is being estimated. Even if an estimator is consistent, how well it is doing at any given time may not be clear, no matter what the sample size of the observations. In these cases, a line of analysis favors sample dependent guarantees. We explore this framework by studying rich model classes that may only admit pointwise consistency guarantees, yet enough information about the unknown model driving the observations needed to gauge estimator accuracy can be inferred from the sample at hand. In this paper we obtain a novel characterization of lossless compression problems over a countable alphabet in the data-derived framework in terms of what we term deceptive distributions. We also show that the ability to estimate the redundancy of compressing memoryless sources is equivalent to learning the underlying single-letter marginal in a data-derived fashion. We expect that the methodology underlying such characterizations in a data-derived estimation framework will be broadly applicable to a wide range of estimation problems, enabling a more systematic approach to data-derived guarantees.

データサイエンスの現在の多くのアプリケーションでは、観測の原動力となっている統計を適切に表現するために、豊富なモデルクラスが必要です。このような豊富なモデルクラスは、一様に一貫性のある推定量を受け入れるには複雑すぎる場合があります。このような場合、収束率の保証があり、パフォーマンスがモデル依存的に制限される推定量、つまり点ごとに一貫性のある推定量で妥協するのが一般的です。ただし、この観点には、推定量のパフォーマンスが、推定されているモデルクラス内の未知のモデルの関数であるという実際的な欠点があります。推定量が一貫している場合でも、観測のサンプルサイズに関係なく、特定の時点での推定量のパフォーマンスが明確でない場合があります。このような場合、分析の方向性としてサンプル依存の保証が優先されます。このフレームワークを調査するために、点ごとの一貫性の保証のみを受け入れる可能性がある豊富なモデルクラスを調査しますが、推定量の精度を測定するために必要な、観測を駆動している未知のモデルに関する十分な情報は、手元のサンプルから推測できます。この論文では、データ導出フレームワークにおける可算アルファベット上のロスレス圧縮問題の新しい特徴付けを、私たちが欺瞞分布と呼ぶものの観点から得ています。また、圧縮するメモリレスソースの冗長性を推定する能力は、基礎となる単一文字の周辺をデータ導出方式で学習することと同等であることを示しています。データ導出推定フレームワークにおけるこのような特徴付けの基礎となる方法論は、さまざまな推定問題に広く適用でき、データ導出保証に対するより体系的なアプローチが可能になると期待しています。

MurTree: Optimal Decision Trees via Dynamic Programming and Search
MurTree:動的プログラミングと検索による最適な決定木

Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical use of optimal decision trees.

決定木学習は機械学習で広く使用されているアプローチであり、簡潔で解釈可能なモデルを必要とするアプリケーションで好まれています。ヒューリスティックな方法は、従来、適度に高い精度のモデルを迅速に作成するために使用されています。しかし、よく批判される点は、結果として得られるツリーが、精度とサイズの点で必ずしもデータの最適な表現ではない可能性があるということです。近年、これが、一連の局所的に最適な決定を実行するヒューリスティックな方法とは対照的に、決定木を全体的に最適化する最適分類ツリーアルゴリズムの開発の動機となっています。私たちはこの研究の流れに従い、動的プログラミングと検索に基づいて最適な分類ツリーを学習するための新しいアルゴリズムを提供します。私たちのアルゴリズムは、ツリーの深さとノードの数に対する制約をサポートしています。私たちのアプローチの成功は、分類ツリーに固有の特性を利用する一連の特殊な手法によるものです。従来、最適分類ツリーのアルゴリズムは実行時間が長く、スケーラビリティが限られているという問題がありましたが、詳細な実験的研究により、私たちのアプローチでは最先端のアルゴリズムに必要な時間のほんの一部しかかからず、数万のインスタンスを含むデータセットを処理できることが示され、数桁の改善がもたらされ、最適決定ツリーの実用化に大きく貢献しています。
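The difference between globally optimal and greedy tree construction can be sketched with a plain recursion over binary features: the best tree of depth $d$ on a data set is the best split whose two children are themselves optimal trees of depth $d-1$. The sketch below is an exhaustive version of that recursion only. It is not MurTree, and it omits the caching, bounding, and specialised techniques the abstract credits for its speed; the function name and the XOR data are assumptions for illustration.

```python
def min_misclassifications(rows, depth):
    """Fewest training errors of any tree with at most `depth` levels of splits.

    rows: list of (features, label), with features a tuple of 0/1 values
    and label equal to 0 or 1.
    """
    positives = sum(label for _, label in rows)
    best = min(positives, len(rows) - positives)  # cost of a single leaf
    if depth == 0 or best == 0:
        return best
    num_features = len(rows[0][0])
    for f in range(num_features):
        left = [r for r in rows if r[0][f] == 0]
        right = [r for r in rows if r[0][f] == 1]
        if not left or not right:
            continue  # the split does not separate anything
        cost = (min_misclassifications(left, depth - 1)
                + min_misclassifications(right, depth - 1))
        best = min(best, cost)
    return best


# XOR: no single split helps, so a greedy depth-1 choice looks useless,
# yet a depth-2 tree classifies every row correctly.
xor_rows = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
errors_depth1 = min_misclassifications(xor_rows, 1)
errors_depth2 = min_misclassifications(xor_rows, 2)
```

The running time of this naive recursion grows exponentially with depth, which is the scalability problem that dedicated optimal-tree algorithms aim to reduce.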

Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting
ADMM型スプリッティングを用いた次元フリー収束率による効率的なMCMCサンプリング

Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large data sets and high-dimensional models. A standard approach to mitigate this complexity consists in using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on a recent alternative class of MCMC schemes exploiting a splitting strategy akin to the one used by the celebrated alternating direction method of multipliers (ADMM) optimization algorithm. These methods appear to provide empirically state-of-the-art performance but their theoretical behavior in high dimension is currently unknown. In this paper, we propose a detailed theoretical study of one of these algorithms known as the split Gibbs sampler. Under regularity conditions, we establish explicit convergence rates for this scheme using Ricci curvature and coupling ideas. We support our theory with numerical illustrations.

複雑なモデルに対して正確なベイズ推論を実行することは、計算上困難です。マルコフ連鎖モンテカルロ(MCMC)アルゴリズムは、事後分布の信頼性の高い近似値を提供できますが、大規模なデータセットや高次元モデルではコストがかかります。この複雑さを軽減する標準的な方法は、サブサンプリング手法を使用するか、データをクラスター全体に分散することです。ただし、これらの方法は、高次元のシナリオでは通常、信頼性がありません。ここでは、有名な交互方向乗数法(ADMM)最適化アルゴリズムで使用されるものと類似した分割戦略を利用する、最近の代替クラスのMCMCスキームに焦点を当てます。これらの方法は、経験的には最先端のパフォーマンスを提供するように見えますが、高次元での理論的な動作は現在不明です。この論文では、分割ギブスサンプラーとして知られるこれらのアルゴリズムの1つについて、詳細な理論的研究を提案します。規則性条件下で、リッチ曲率と結合のアイデアを使用して、このスキームの明確な収束率を確立します。数値例で理論をサポートします。
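The splitting idea can be sketched in one dimension. A target density proportional to $\exp(-f(x) - g(x))$ is replaced by an augmented density proportional to $\exp(-f(x) - g(z) - (x - z)^2 / (2\rho^2))$, and a Gibbs sampler alternates between the two conditionals. The sketch assumes Gaussian potentials $f$ and $g$ so that both conditionals are Gaussian in closed form; the parameter values and the function name `split_gibbs` are hypothetical, and the paper's setting is far more general.

```python
import random


def split_gibbs(a, s1, b, s2, rho, n_samples, seed=0):
    """Gibbs sampling of exp(-(x-a)^2/(2 s1^2) - (z-b)^2/(2 s2^2) - (x-z)^2/(2 rho^2))."""
    rng = random.Random(seed)
    x, z = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        # x | z is Gaussian: precisions add, means are precision-weighted.
        prec_x = 1.0 / s1 ** 2 + 1.0 / rho ** 2
        mean_x = (a / s1 ** 2 + z / rho ** 2) / prec_x
        x = rng.gauss(mean_x, prec_x ** -0.5)
        # z | x has the same structure with the roles of f and g swapped.
        prec_z = 1.0 / s2 ** 2 + 1.0 / rho ** 2
        mean_z = (b / s2 ** 2 + x / rho ** 2) / prec_z
        z = rng.gauss(mean_z, prec_z ** -0.5)
        samples.append(x)
    return samples


samples = split_gibbs(a=0.0, s1=1.0, b=2.0, s2=1.0, rho=0.5, n_samples=20000)
sample_mean = sum(samples) / len(samples)
# Marginal of x under the augmented density: N(a, s1^2) times N(b, s2^2 + rho^2),
# whose mean is (0/1 + 2/1.25) / (1/1 + 1/1.25) = 8/9 for these values.
expected_mean = 8.0 / 9.0
```

The exact target (the limit of $\rho$ going to 0) has mean 1.0 for these values, so the gap to 8/9 shows the bias introduced by the splitting parameter $\rho$, while larger $\rho$ makes the two conditionals less coupled and the chain mix faster.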

On Biased Stochastic Gradient Estimation
バイアス確率的勾配推定について

We present a uniform analysis of biased stochastic gradient methods for minimizing convex, strongly convex, and non-convex composite objectives, and identify settings where bias is useful in stochastic gradient estimation. The framework we present allows us to extend proximal support to biased algorithms, including SAG and SARAH, for the first time in the convex setting. We also use our framework to develop a new algorithm, Stochastic Average Recursive GradiEnt (SARGE), that achieves the oracle complexity lower-bound for non-convex, finite-sum objectives and requires strictly fewer calls to a stochastic gradient oracle per iteration than SVRG and SARAH. We support our theoretical results with numerical experiments that demonstrate the benefits of certain biased gradient estimators.

私たちは、凸、強凸、および非凸の複合目的関数を最小化するためのバイアス付き確率的勾配法の統一的な分析を提示し、確率的勾配推定においてバイアスが有用となる設定を特定します。ここで紹介するフレームワークにより、凸設定では初めて、SAGやSARAHなどのバイアス付きアルゴリズムに近接(proximal)サポートを拡張できます。また、このフレームワークを使用して、非凸有限和目的関数に対するオラクル複雑性の下限を達成し、SVRGやSARAHよりも反復ごとの確率的勾配オラクルの呼び出し回数が厳密に少ない新しいアルゴリズム、Stochastic Average Recursive GradiEnt(SARGE)を開発します。特定のバイアス付き勾配推定量の利点を実証する数値実験によって、理論的な結果を裏付けます。
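As a concrete instance of a biased gradient estimator, the sketch below implements the SARAH recursion $v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x_{k-1}) + v_{k-1}$ on a scalar least-squares finite sum. It illustrates SARAH, one of the existing estimators the abstract mentions, not the new SARGE algorithm, and it omits the proximal step. The data, step size, loop lengths, and function name are assumptions for illustration.

```python
import random


def sarah(a, b, x0, step=0.05, inner=5, outer=30, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i]*x - b[i])**2 with the SARAH estimator."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        return a[i] * (a[i] * x - b[i])

    x = x0
    for _ in range(outer):
        # Each outer loop restarts the estimator from the full gradient.
        v = sum(grad_i(i, x) for i in range(n)) / n
        x_prev, x = x, x - step * v
        for _ in range(inner - 1):
            i = rng.randrange(n)
            # Recursive update: v is biased given the past iterates.
            v = grad_i(i, x) - grad_i(i, x_prev) + v
            x_prev, x = x, x - step * v
    return x


A = [1.0, 2.0, 3.0]
B = [1.0, 4.0, 3.0]
x_sarah = sarah(A, B, x0=0.0)
# Closed-form minimiser: sum(a*b) / sum(a*a) = 18/14.
x_star = sum(ai * bi for ai, bi in zip(A, B)) / sum(ai * ai for ai in A)
```

Unlike an unbiased estimator such as SVRG's, the SARAH estimator depends on its own previous value, so its conditional expectation is not the full gradient; the abstract's point is that this bias can be useful rather than harmful.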

Fast and Robust Rank Aggregation against Model Misspecification
モデルの誤指定に対する高速で堅牢なランク集約

In rank aggregation (RA), a collection of preferences from different users is summarized into a total order under the assumption of homogeneity of users. Model misspecification in RA arises since the homogeneity assumption fails to be satisfied in complex real-world situations. Existing robust RAs usually resort to an augmentation of the ranking model to account for additional noise, where the collected preferences can be treated as a noisy perturbation of idealized preferences. Since the majority of robust RAs rely on certain perturbation assumptions, they cannot generalize well to agnostic noise-corrupted preferences in the real world. In this paper, we propose CoarsenRank, which possesses robustness against model misspecification. Specifically, the properties of our CoarsenRank are summarized as follows: (1) CoarsenRank is designed for mild model misspecification, which assumes that ideal preferences (consistent with the model assumption) exist and lie in a neighborhood of the actual preferences. (2) CoarsenRank then performs regular RAs over a neighborhood of the preferences instead of the original data set directly. Therefore, CoarsenRank enjoys robustness against model misspecification within a neighborhood. (3) The neighborhood of the data set is defined via their empirical data distributions. Further, we put an exponential prior on the unknown size of the neighborhood and derive a much-simplified posterior formula for CoarsenRank under particular divergence measures. (4) CoarsenRank is further instantiated to Coarsened Thurstone, Coarsened Bradley-Terry, and Coarsened Plackett-Luce with three popular probability ranking models. Meanwhile, tractable optimization strategies are introduced for each instantiation. In the end, we apply CoarsenRank on four real-world data sets. Experiments show that CoarsenRank is fast and robust, achieving consistent improvements over baseline methods.

ランク集約(RA)では、異なるユーザーからの選好のコレクションが、ユーザーの同質性を仮定して全順序にまとめられます。RAにおけるモデルの誤指定は、複雑な現実世界の状況で同質性の仮定が満たされないために発生します。既存の堅牢なRAは通常、追加のノイズを考慮するためにランキングモデルの拡張に頼っており、収集された選好は、理想的な選好のノイズの多い摂動として扱うことができます。堅牢なRAの大部分は特定の摂動仮定に依存しているため、現実世界の不可知論的なノイズで破損した選好にうまく一般化できません。この論文では、モデルの誤指定に対して堅牢なCoarsenRankを提案します。具体的には、CoarsenRankの特性は次のようにまとめられます。(1) CoarsenRankは、実際の選好の近傍に位置する理想的な選好(モデル仮定と一致する)が存在すると仮定する、軽度のモデル誤指定向けに設計されています。(2) CoarsenRankは、元のデータセットを直接処理するのではなく、選好の近傍に対して通常のRAを実行します。そのため、CoarsenRankは近傍内のモデル誤指定に対して堅牢です。(3)データセットの近傍は、経験的データ分布によって定義されます。さらに、近傍の未知のサイズに指数事前分布を適用し、特定のダイバージェンス尺度の下でのCoarsenRankの大幅に簡略化された事後式を導出します。(4) CoarsenRankは、3つの一般的な確率ランキングモデルを使用して、Coarsened Thurstone、Coarsened Bradley-Terry、およびCoarsened Plackett-Luceにさらにインスタンス化されます。また、各インスタンス化について、扱いやすい最適化戦略が導入されます。最後に、CoarsenRankを4つの実際のデータセットに適用します。実験では、CoarsenRankが高速かつ堅牢であり、ベースラインメソッドに対して一貫した改善が達成されることが示されています。

LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis of Big Time Series Data
LSAR:大規模な時系列データの分析のための効率的なレバレッジスコアサンプリングアルゴリズム

We apply methods from randomized numerical linear algebra (RandNLA) to develop improved algorithms for the analysis of large-scale time series data. We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes. We show that the accuracy of approximations lies within $(1+\mathcal{O}({\varepsilon}))$ of the true leverage scores with high probability. These theoretical results are subsequently exploited to develop an efficient algorithm, called LSAR, for fitting an appropriate AR model to big time series data. Our proposed algorithm is guaranteed, with high probability, to find the maximum likelihood estimates of the parameters of the underlying true AR model and has a worst case running time that significantly improves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale synthetic as well as real data highly support the theoretical results and reveal the efficacy of this new approach.

私たちは、ランダム化数値線形代数(RandNLA)の手法を適用して、大規模な時系列データの解析のための改良型アルゴリズムを開発します。まず、ビッグデータレジームにおける自己回帰(AR)モデルのレバレッジスコアを推定するための新しい高速アルゴリズムを開発します。近似の精度が、高い確率で真のレバレッジスコアの$(1+\mathcal{O}({\varepsilon}))$内にあることを示します。これらの理論的結果は、その後、大規模な時系列データに適切なARモデルを適合させるためのLSARと呼ばれる効率的なアルゴリズムを開発するために利用されます。私たちが提案するアルゴリズムは、高い確率で、基礎となる真のARモデルのパラメータの最尤推定値を見つけることが保証されており、ビッグデータレジームの最先端の代替手段の実行時間を大幅に改善する最悪の場合の実行時間を持っています。大規模な合成データと実際のデータに関する実証結果は、理論的結果を強く支持し、この新しいアプローチの有効性を明らかにしています。
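For reference, the quantity being approximated can be computed exactly on a small example. The leverage score of row $x_t$ of a design matrix $X$ is $x_t^\top (X^\top X)^{-1} x_t$, and for an AR(2) model the rows are the lagged pairs $(y_{t-1}, y_{t-2})$. The sketch below computes exact scores with a hand-written 2x2 inverse. It is not the fast LSAR estimator, and the toy series and the function name are assumptions for illustration.

```python
def leverage_scores_ar2(series):
    """Exact leverage scores of the AR(2) design matrix with rows (y[t-1], y[t-2])."""
    rows = [(series[t - 1], series[t - 2]) for t in range(2, len(series))]
    # Entries of the 2x2 Gram matrix X^T X.
    s11 = sum(r[0] * r[0] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s22 = sum(r[1] * r[1] for r in rows)
    det = s11 * s22 - s12 * s12
    # Inverse of the Gram matrix via the 2x2 adjugate formula.
    i11, i12, i22 = s22 / det, -s12 / det, s11 / det
    return [
        r[0] * (i11 * r[0] + i12 * r[1]) + r[1] * (i12 * r[0] + i22 * r[1])
        for r in rows
    ]


toy_series = [1.0, 0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.9, -0.2, 0.3]
scores = leverage_scores_ar2(toy_series)
```

The scores sum to the number of columns (2 here) because they are the diagonal of a projection matrix. Computing them exactly requires forming and inverting the Gram matrix over all rows, which is the cost that fast approximation schemes try to avoid on very long series.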

Evolutionary Variational Optimization of Generative Models
生成モデルの進化的変分最適化

We combine two popular optimization approaches to derive learning algorithms for generative models: variational optimization and evolutionary algorithms. The combination is realized for generative models with discrete latents by using truncated posteriors as the family of variational distributions. The variational parameters of truncated posteriors are sets of latent states. By interpreting these states as genomes of individuals and by using the variational lower bound to define a fitness, we can apply evolutionary algorithms to realize the variational loop. The used variational distributions are very flexible and we show that evolutionary algorithms can effectively and efficiently optimize the variational bound. Furthermore, the variational loop is generally applicable (“black box”) with no analytical derivations required. To show general applicability, we apply the approach to three generative models (we use Noisy-OR Bayes Nets, Binary Sparse Coding, and Spike-and-Slab Sparse Coding). To demonstrate effectiveness and efficiency of the novel variational approach, we use the standard competitive benchmarks of image denoising and inpainting. The benchmarks allow quantitative comparisons to a wide range of methods including probabilistic approaches, deep deterministic and generative networks, and non-local image processing methods. In the category of “zero-shot” learning (when only the corrupted image is used for training), we observed the evolutionary variational algorithm to significantly improve the state-of-the-art in many benchmark settings. For one well-known inpainting benchmark, we also observed state-of-the-art performance across all categories of algorithms although we only train on the corrupted image. In general, our investigations highlight the importance of research on optimization methods for generative models to achieve performance improvements.

私たちは、変分最適化と進化的アルゴリズムという2つの一般的な最適化手法を組み合わせて、生成モデルの学習アルゴリズムを導出します。この組み合わせは、離散潜在変数を持つ生成モデルに対して、変分分布の族として切り捨て事後分布を使用することで実現されます。切り捨て事後分布の変分パラメータは、潜在状態のセットです。これらの状態を個体のゲノムとして解釈し、変分下限を使用して適応度を定義することで、進化的アルゴリズムを適用して変分ループを実現できます。使用される変分分布は非常に柔軟であり、進化的アルゴリズムが変分下限を効果的かつ効率的に最適化できることを示しています。さらに、変分ループは一般に適用可能(「ブラックボックス」)であり、解析的導出は必要ありません。一般的な適用可能性を示すために、このアプローチを3つの生成モデルに適用します(Noisy-OR Bayes Nets、Binary Sparse Coding、Spike-and-Slab Sparse Codingを使用)。新しい変分アプローチの有効性と効率性を実証するために、画像のノイズ除去と修復の標準的な競合ベンチマークを使用します。ベンチマークでは、確率的アプローチ、ディープ決定論的および生成的ネットワーク、非ローカル画像処理方法など、さまざまな方法と定量的に比較できます。「ゼロショット」学習のカテゴリ(破損した画像のみをトレーニングに使用する場合)では、進化的変分アルゴリズムにより、多くのベンチマーク設定で最先端のパフォーマンスが大幅に向上しました。よく知られている1つの修復ベンチマークでは、破損した画像のみでトレーニングしているにもかかわらず、すべてのカテゴリのアルゴリズムの中で最先端のパフォーマンスを確認しました。総じて、私たちの調査は、パフォーマンスの向上を達成するための生成モデルの最適化方法の研究の重要性を強調しています。

Supervised Dimensionality Reduction and Visualization using Centroid-Encoder
セントロイド・エンコーダを使用した教師あり次元削減と可視化

We propose a new tool for visualizing complex, and potentially large and high-dimensional, data sets called Centroid-Encoder (CE). The architecture of the Centroid-Encoder is similar to the autoencoder neural network but it has a modified target, i.e., the class centroid in the ambient space. As such, CE incorporates label information and performs a supervised data visualization. The training of CE is done in the usual way with a training set whose parameters are tuned using a validation set. The evaluation of the resulting CE visualization is performed on a sequestered test set where the generalization of the model is assessed both visually and quantitatively. We present a detailed comparative analysis of the method using a wide variety of data sets and techniques, both supervised and unsupervised, including NCA, non-linear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. An analysis of variance using PCA demonstrates that a non-linear preprocessing by the CE transformation of the data captures more variance than PCA by dimension.

私たちは、複雑で、潜在的に大規模で高次元のデータセットを視覚化するための新しいツールとして、Centroid-Encoder (CE)を提案します。Centroid-Encoderのアーキテクチャはオートエンコーダニューラルネットワークに似ていますが、ターゲットが変更されています(つまり、周囲空間のクラスセントロイド)。そのため、CEはラベル情報を組み込み、教師ありデータ視覚化を実行します。CEのトレーニングは、検証セットを使用してパラメーターが調整されたトレーニングセットを使用して通常の方法で行われます。結果として得られるCE視覚化の評価は、隔離されたテストセットで実行され、モデルの一般化が視覚的および定量的に評価されます。NCA、非線形NCA、t分布NCA、t分布MCML、教師ありUMAP、教師ありPCA、カラー最大分散展開、教師ありIsomap、パラメトリック埋め込み、教師あり近傍検索ビジュアライザー、多重関係埋め込みなど、教師ありと教師なしの両方を含むさまざまなデータセットと手法を使用して、この方法の詳細な比較分析を示します。PCAを使用した分散分析により、データのCE変換による非線形前処理では、次元によるPCAよりも多くの分散を捕捉できることが実証されています。

Universal Approximation in Dropout Neural Networks
ドロップアウトニューラルネットワークにおけるユニバーサル近似

We prove two universal approximation theorems for a range of dropout neural networks. These are feed-forward neural networks in which each edge is given a random $\{0,1\}$-valued filter, that have two modes of operation: in the first each edge output is multiplied by its random filter, resulting in a random output, while in the second each edge output is multiplied by the expectation of its filter, leading to a deterministic output. It is common to use the random mode during training and the deterministic mode during testing and prediction. Both theorems are of the following form: Given a function to approximate and a threshold $\varepsilon>0$, there exists a dropout network that is $\varepsilon$-close in probability and in $L^q$. The first theorem applies to dropout networks in the random mode. It assumes little on the activation function, applies to a wide class of networks, and can even be applied to approximation schemes other than neural networks. The core is an algebraic property that shows that deterministic networks can be exactly matched in expectation by random networks. The second theorem makes stronger assumptions and gives a stronger result. Given a function to approximate, it provides existence of a network that approximates in both modes simultaneously. Proof components are a recursive replacement of edges by independent copies, and a special first-layer replacement that couples the resulting larger network to the input. The functions to be approximated are assumed to be elements of general normed spaces, and the approximations are measured in the corresponding norms. The networks are constructed explicitly. Because of the different methods of proof, the two results give independent insight into the approximation properties of random dropout networks. With this, we establish that dropout neural networks broadly satisfy a universal-approximation property.

私たちは、さまざまなドロップアウトニューラルネットワークについて、2つの普遍近似定理を証明します。これらは、各エッジにランダムな$\{0,1\}$値のフィルターが与えられるフィードフォワードニューラルネットワークで、2つの動作モードがあります。最初のモードでは、各エッジ出力にそのランダムフィルターが掛けられ、ランダム出力が生成されます。2番目のモードでは、各エッジ出力にそのフィルターの期待値が掛けられ、決定論的出力が生成されます。トレーニング中はランダムモードを使用し、テストおよび予測中は決定論的モードを使用するのが一般的です。両方の定理は次の形式です。近似する関数としきい値$\varepsilon>0$が与えられた場合、確率の意味でも$L^q$の意味でも$\varepsilon$以内に近いドロップアウトネットワークが存在します。最初の定理は、ランダムモードのドロップアウトネットワークに適用されます。活性化関数についてはほとんど仮定せず、幅広いネットワーククラスに適用され、ニューラルネットワーク以外の近似スキームにも適用できます。核となるのは、決定論的ネットワークがランダムネットワークによって期待値において正確に一致できることを示す代数的特性です。2番目の定理は、より強い仮定を行い、より強い結果をもたらします。近似する関数が与えられると、両方のモードで同時に近似するネットワークの存在が示されます。証明の構成要素は、独立したコピーによるエッジの再帰的な置き換えと、結果として得られるより大きなネットワークを入力に結合する特別な第1層の置き換えです。近似する関数は、一般的なノルム空間の要素であると仮定され、近似は対応するノルムで測定されます。ネットワークは明示的に構築されます。証明方法が異なるため、2つの結果は、ランダムドロップアウトネットワークの近似特性について独立した洞察を提供します。これにより、ドロップアウトニューラルネットワークが普遍近似特性を広く満たすことが確立されます。
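
The two modes of operation can be illustrated on a single linear layer. This is a minimal sketch under stated assumptions, not the paper's construction: the function names and the keep probability `keep_prob` are hypothetical, and a single layer without activation is used so that the random mode matches the deterministic mode exactly in expectation.

```python
import numpy as np

def random_mode(W, x, keep_prob, rng):
    """Random mode: each edge output W[i, j] * x[j] is multiplied by an
    independent {0,1}-valued filter that equals 1 with probability keep_prob."""
    mask = rng.random(W.shape) < keep_prob
    return (W * mask) @ x

def deterministic_mode(W, x, keep_prob):
    """Deterministic mode: each edge output is multiplied by the
    expectation of its filter, which is keep_prob."""
    return (keep_prob * W) @ x
```

Averaging `random_mode` over many independent draws should approach `deterministic_mode` for this linear layer. With nonlinear activations and several layers the two modes generally differ, which is the situation the theorems address.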

Decimated Framelet System on Graphs and Fast G-Framelet Transforms
グラフ上の間引きフレームレットシステムと高速Gフレームレット変換

Graph representation learning has many real-world applications, from self-driving LiDAR, 3D computer vision to drug repurposing, protein classification, social networks analysis. An adequate representation of graph data is vital to the learning performance of a statistical or machine learning model for graph-structured data. This paper proposes a novel multiscale representation system for graph data, called decimated framelets, which form a localized tight frame on the graph. The decimated framelet system allows storage of the graph data representation on a coarse-grained chain and processes the graph data at multi scales where at each scale, the data is stored on a subgraph. Based on this, we establish decimated G-framelet transforms for the decomposition and reconstruction of the graph data at multi resolutions via a constructive data-driven filter bank. The graph framelets are built on a chain-based orthonormal basis that supports fast graph Fourier transforms. From this, we give a fast algorithm for the decimated G-framelet transforms, or FGT, that has linear computational complexity O(N) for a graph of size N. The effectiveness for constructing the decimated framelet system and the FGT is demonstrated by a simulated example of random graphs and real-world applications, including multiresolution analysis for traffic network and representation learning of graph neural networks for graph classification tasks.

グラフ表現学習は、自動運転LiDAR、3Dコンピュータービジョンから薬物の再利用、タンパク質分類、ソーシャルネットワーク分析まで、多くの実際のアプリケーションに使用されています。グラフ構造化データに対する統計モデルまたは機械学習モデルの学習パフォーマンスには、グラフデータの適切な表現が不可欠です。この論文では、グラフ上に局所的なタイトフレームを形成する、デシメートフレームレットと呼ばれるグラフデータの新しいマルチスケール表現システムを提案します。デシメートフレームレットシステムでは、グラフデータ表現を粗粒度のチェーン上に保存し、グラフデータを複数のスケールで処理します。各スケールでは、データはサブグラフ上に格納されます。これに基づいて、構成的データ駆動型フィルタバンクを介して、複数の解像度でグラフデータを分解および再構築するためのデシメートGフレームレット変換を確立します。グラフフレームレットは、高速グラフフーリエ変換をサポートするチェーンベースの正規直交基底に基づいて構築されます。これにより、サイズNのグラフに対して線形計算複雑度がO(N)である、デシメーションGフレームレット変換(FGT)の高速アルゴリズムが提供されます。デシメーションフレームレットシステムとFGTを構築することの有効性は、ランダムグラフのシミュレーション例と、交通ネットワークの多重解像度解析やグラフ分類タスクのグラフニューラルネットワークの表現学習などの実際のアプリケーションによって実証されます。

Spatial Multivariate Trees for Big Data Bayesian Regression
ビッグデータベイズ回帰のための空間多変量木

High resolution geospatial data are challenging because standard geostatistical models based on Gaussian processes are known to not scale to large data sizes. While progress has been made towards methods that can be computed more efficiently, considerably less attention has been devoted to methods for large scale data that allow the description of complex relationships between several outcomes recorded at high resolutions by different sensors. Our Bayesian multivariate regression models based on spatial multivariate trees (SpamTrees) achieve scalability via conditional independence assumptions on latent random effects following a treed directed acyclic graph. Information-theoretic arguments and considerations on computational efficiency guide the construction of the tree and the related efficient sampling algorithms in imbalanced multivariate settings. In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data. Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree.

高解像度の地理空間データは、ガウス過程に基づく標準的な地理統計モデルが大規模なデータサイズに拡張できないことが知られているため、困難です。より効率的に計算できる方法に向けて進歩が遂げられている一方で、異なるセンサーによって高解像度で記録された複数の結果間の複雑な関係を記述できる大規模データ向けの方法にはほとんど注目が集まっていません。空間多変量ツリー(SpamTrees)に基づくベイズ多変量回帰モデルは、ツリー化された有向非巡回グラフに従う潜在的なランダム効果に対する条件付き独立性の仮定を介してスケーラビリティを実現します。情報理論的議論と計算効率に関する考慮事項は、不均衡な多変量設定におけるツリーと関連する効率的なサンプリングアルゴリズムの構築を導きます。シミュレートされたデータの例に加えて、衛星データと地上ステーションデータを組み合わせた大規模な気候データセットを使用してSpamTreesを説明します。ソフトウェアとソースコードは、CRAN (https://CRAN.R-project.org/package=spamtree)で入手できます。

TFPnP: Tuning-free Plug-and-Play Proximal Algorithms with Applications to Inverse Imaging Problems
TFPnP: チューニング不要のプラグアンドプレイ近接アルゴリズムと逆イメージング問題への応用

Plug-and-Play (PnP) is a non-convex optimization framework that combines proximal algorithms, for example, the alternating direction method of multipliers (ADMM), with advanced denoising priors. Over the past few years, great empirical success has been obtained by PnP algorithms, especially for the ones that integrate deep learning-based denoisers. However, a key problem of PnP approaches is the need for manual parameter tweaking which is essential to obtain high-quality results across the high discrepancy in imaging conditions and varying scene content. In this work, we present a class of tuning-free PnP proximal algorithms that can determine parameters such as denoising strength, termination time, and other optimization-specific parameters automatically. A core part of our approach is a policy network for automated parameter search which can be effectively learned via a mixture of model-free and model-based deep reinforcement learning strategies. We demonstrate, through rigorous numerical and visual experiments, that the learned policy can customize parameters to different settings, and is often more efficient and effective than existing handcrafted criteria. Moreover, we discuss several practical considerations of PnP denoisers, which together with our learned policy yield state-of-the-art results. This advanced performance is prevalent on both linear and nonlinear exemplar inverse imaging problems, and in particular shows promising results on compressed sensing MRI, sparse-view CT, single-photon imaging, and phase retrieval.

プラグアンドプレイ(PnP)は、交互方向乗数法(ADMM)などの近接アルゴリズムと高度なノイズ除去事前分布を組み合わせた非凸最適化フレームワークです。過去数年間、PnPアルゴリズム、特にディープラーニングベースのノイズ除去器を統合したアルゴリズムは、大きな実験的成功を収めてきました。しかし、PnPアプローチの主な問題は、撮像条件の大きな相違やシーンの内容の変化にわたって高品質の結果を得るために不可欠な、手動のパラメータ調整が必要なことです。この研究では、ノイズ除去の強度、終了時間、その他の最適化固有のパラメータなどを自動的に決定できる、チューニング不要のPnP近接アルゴリズムのクラスを紹介します。私たちのアプローチの中核部分は、モデルフリーとモデルベースの深層強化学習戦略を組み合わせて効果的に学習できる、自動パラメータ探索用のポリシーネットワークです。厳密な数値実験と視覚実験を通じて、学習したポリシーはさまざまな設定に合わせてパラメータをカスタマイズでき、既存の手作りの基準よりも効率的で効果的であることが多いことを実証しました。さらに、学習したポリシーと組み合わせることで最先端の結果を生み出す、PnPノイズ除去器のいくつかの実用的な考慮事項について説明します。この高い性能は、線形および非線形の両方の代表的な逆イメージング問題で広く見られ、特に圧縮センシングMRI、スパースビューCT、単一光子イメージング、および位相回復で有望な結果を示しています。

A Stochastic Bundle Method for Interpolation
補間のための確率的バンドル法

We propose a novel method for training deep neural networks that are capable of interpolation, that is, driving the empirical loss to zero. At each iteration, our method constructs a stochastic approximation of the learning objective. The approximation, known as a bundle, is a pointwise maximum of linear functions. Our bundle contains a constant function that lower bounds the empirical loss. This enables us to compute an automatic adaptive learning rate, thereby providing an accurate solution. In addition, our bundle includes linear approximations computed at the current iterate and other linear estimates of the DNN parameters. The use of these additional approximations makes our method significantly more robust to its hyperparameters. Based on its desirable empirical properties, we term our method Bundle Optimisation for Robust and Accurate Training (BORAT). In order to operationalise BORAT, we design a novel algorithm for optimising the bundle approximation efficiently at each iteration. We establish the theoretical convergence of BORAT in both convex and non-convex settings. Using standard publicly available data sets, we provide a thorough comparison of BORAT to other single hyperparameter optimisation algorithms. Our experiments demonstrate BORAT matches the state-of-the-art generalisation performance for these methods and is the most robust.

私たちは、補間が可能な、つまり経験的損失をゼロにすることができるディープニューラルネットワークをトレーニングするための新しい方法を提案します。各反復で、我々の方法は学習目標の確率的近似を構築します。バンドルと呼ばれる近似は、線形関数の点ごとの最大値です。我々のバンドルには、経験的損失の下限となる定数関数が含まれています。これにより、自動適応学習率を計算できるため、正確なソリューションを提供できます。さらに、我々のバンドルには、現在の反復で計算された線形近似と、DNNパラメーターのその他の線形推定値が含まれています。これらの追加の近似を使用することで、我々の方法はハイパーパラメーターに対して大幅に堅牢になります。望ましい経験的特性に基づいて、我々はこの方法を堅牢で正確なトレーニングのためのバンドル最適化(BORAT)と名付けました。BORATを運用可能にするために、各反復でバンドル近似を効率的に最適化する新しいアルゴリズムを設計しました。凸設定と非凸設定の両方でBORATの理論的収束を確立しました。公開されている標準データセットを使用して、BORATと他の単一ハイパーパラメータ最適化アルゴリズムを徹底的に比較しました。実験により、BORATはこれらの方法の最先端の一般化パフォーマンスに匹敵し、最も堅牢であることが実証されました。

On Generalizations of Some Distance Based Classifiers for HDLSS Data
HDLSSデータのためのいくつかの距離に基づく分類器の一般化について

In high dimension, low sample size (HDLSS) settings, classifiers based on Euclidean distances like the nearest neighbor classifier and the average distance classifier perform quite poorly if differences between locations of the underlying populations get masked by scale differences. To rectify this problem, several modifications of these classifiers have been proposed in the literature. However, existing methods are confined to location and scale differences only, and they often fail to discriminate among populations differing outside of the first two moments. In this article, we propose some simple transformations of these classifiers resulting in improved performance even when the underlying populations have the same location and scale. We further propose a generalization of these classifiers based on the idea of grouping of variables. High-dimensional behavior of the proposed classifiers is studied theoretically. Numerical experiments with a variety of simulated examples as well as an extensive analysis of benchmark data sets from three different databases exhibit advantages of the proposed methods.

高次元、低サンプルサイズ(HDLSS)設定では、最近傍分類器や平均距離分類器などのユークリッド距離に基づく分類器は、基礎となる集団の位置の違いがスケールの違いによって隠されると、パフォーマンスがかなり低下します。この問題を修正するために、これらの分類器のいくつかの修正が文献で提案されています。ただし、既存の方法は位置とスケールの違いのみに限定されており、最初の2つのモーメント以外で異なる集団を区別できないことがよくあります。この記事では、基礎となる集団の位置とスケールが同じ場合でもパフォーマンスが向上する、これらの分類器のいくつかの簡単な変換を提案します。さらに、変数のグループ化のアイデアに基づいて、これらの分類器の一般化を提案します。提案された分類器の高次元動作は理論的に研究されています。さまざまなシミュレーション例を使用した数値実験と、3つの異なるデータベースからのベンチマークデータセットの広範な分析により、提案された方法の利点が明らかになりました。

Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality
大規模スパースPCAを証明可能な(準)最適性まで解く

Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than $p=100$s of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting $k=5$ covariates from $p=300$ variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of $1-2\%$ in practice within minutes for $p=100$s or hours for $p=1,000$s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical data sets, we illustrate our approach’s ability to derive interpretable principal components tractably at scale.

スパース主成分分析(PCA)は、元の特徴の小さなサブセットの線形結合である主成分を取得するための一般的な次元削減手法です。既存のアプローチでは、変数の数が数百($p=100$s)を超えると、最適性が証明可能な主成分を得ることはできません。スパースPCAを凸混合整数半定値最適化問題として再定式化することにより、$p=300$変数から$k=5$共変量を選択するスケールで問題を証明可能な最適性まで解く切除平面法を設計し、より大きなスケールでも小さな境界ギャップを提供します。また、実際には$p=100$sの場合は数分以内、$p=1,000$sの場合は数時間で$1-2\%$の境界ギャップを提供する凸緩和と貪欲な丸めスキームを提案します。したがって、これは大規模な問題において厳密解法の実行可能な代替手段となります。現実世界の金融および医療データセットを使用して、解釈可能な主成分を大規模かつ扱いやすく導き出すアプローチの能力を示します。

Approximate Information State for Approximate Planning and Reinforcement Learning in Partially Observed Systems
部分的に観測されたシステムにおける近似計画と強化学習のための近似情報状態

We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two definitions of information state—i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called AIS) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based multi-time scale policy gradient algorithms and detailed numerical experiments with low, moderate and high dimensional environments.

私たちは、部分的に観測されたシステムにおける近似計画と学習のための理論的枠組みを提案します。我々の枠組みは、情報状態という基本的な概念に基づいています。私たちは、情報状態の2つの定義を提供します。i)期待される報酬を計算し、その次の値を予測するのに十分な履歴の関数、ii)再帰的に更新でき、期待される報酬を計算し、次の観測を予測するのに十分な履歴の関数です。情報状態は常に動的計画法分解につながる。我々の主要な結果は、履歴の関数(AISと呼ばれる)が情報状態の特性を近似的に満たす場合、対応する近似動的プログラムが存在することを示すことです。私たちは、これを使用して計算されたポリシーが、最適性の限界損失を伴う近似最適であることを示す。私たちは、文献における状態、観測、およびアクション空間のいくつかの近似がAISのインスタンスとみなせることを示す。これらのケースのいくつかでは、より厳しい限界が得られます。AISの顕著な特徴は、データから学習できることです。AISベースのマルチタイムスケールポリシー勾配アルゴリズムと、低次元、中次元、高次元環境での詳細な数値実験を紹介します。

Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes
部分的に観測されたマルコフ決定過程における有限記憶フィードバック方策の近最適性

In the theory of Partially Observed Markov Decision Processes (POMDPs), existence of optimal policies have in general been established via converting the original partially observed stochastic control problem to a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and so for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as regularity conditions needed often require a tedious study involving the spaces of probability measures leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge.

部分観測マルコフ決定過程(POMDP)の理論では、最適ポリシーの存在は、一般的に、元の部分観測確率制御問題を信念空間上の完全観測問題に変換して信念MDPに導くことによって確立されています。しかし、この完全観測モデル、つまり元のPOMDPの最適ポリシーを古典的な動的計画法または線形計画法を使用して計算することは、たとえ元のシステムが有限の状態空間と行動空間を持っていたとしても困難です。これは、完全観測信念MDPモデルの状態空間が常に非可算であるためです。さらに、必要な正則性条件は、フェラー連続性などの性質につながる確率測度の空間を含む面倒な研究を必要とすることが多いため、厳密な価値関数近似と最適ポリシー近似の結果はほとんど存在しません。この論文では、システムダイナミクスと測定チャネルモデルが既知であると仮定したPOMDPの計画問題を研究します。有限のウィンドウ情報変数のみを使用して信念空間を離散化することにより、近似的な信念モデルを構築します。次に、近似モデルに対する最適ポリシーを見つけ、緩やかな非線形フィルタ安定性条件と、測定セットとアクションセットが有限である(状態空間は実数ベクトル値である)という仮定の下で、構築された有限ウィンドウ制御ポリシーがPOMDPにおいてほぼ最適であることを厳密に確立します。また、有限ウィンドウメモリサイズと近似誤差の上界を関連付ける収束率の結果を確立します。収束率は、明示的でテスト可能な指数フィルタ安定性条件下では指数的です。実験結果は多数存在し、厳密な漸近収束結果も少数ながら存在しますが、明示的な収束率の結果は、私たちの知る限り、文献では新しいものです。

Interpolating Predictors in High-Dimensional Factor Regression
高次元因子回帰における補間予測子

This work studies finite-sample properties of the risk of the minimum-norm interpolating predictor in high-dimensional regression models. If the effective rank of the covariance matrix $\Sigma$ of the $p$ regression features is much larger than the sample size $n$, we show that the min-norm interpolating predictor is not desirable, as its risk approaches the risk of trivially predicting the response by 0. However, our detailed finite-sample analysis reveals, surprisingly, that this behavior is not present when the regression response and the features are jointly low-dimensional, following a widely used factor regression model. Within this popular model class, and when the effective rank of $\Sigma$ is smaller than $n$, while still allowing for $p \gg n$, both the bias and the variance terms of the excess risk can be controlled, and the risk of the minimum-norm interpolating predictor approaches optimal benchmarks. Moreover, through a detailed analysis of the bias term, we exhibit model classes under which our upper bound on the excess risk approaches zero, while the corresponding upper bound in the recent work arXiv:1906.11300 diverges. Furthermore, we show that the minimum-norm interpolating predictor analyzed under the factor regression model, despite being model-agnostic and devoid of tuning parameters, can have similar risk to predictors based on principal components regression and ridge regression, and can improve over LASSO based predictors, in the high-dimensional regime.

この研究では、高次元回帰モデルにおける最小ノルム補間予測子のリスクの有限サンプル特性を研究します。$p$回帰特徴の共分散行列$\Sigma$の有効ランクがサンプルサイズ$n$よりはるかに大きい場合、最小ノルム補間予測子のリスクが応答を0で自明に予測するリスクに近づくため、最小ノルム補間予測子は望ましくないことを示す。しかし、詳細な有限サンプル分析により、驚くべきことに、広く使用されている因子回帰モデルに従い、回帰応答と特徴が共に低次元である場合、この動作は存在しないことが明らかになった。この一般的なモデルクラス内で、$\Sigma$の有効ランクが$n$より小さい場合、$p \gg n$を許容しながら、過剰リスクのバイアスと分散の両方の項を制御でき、最小ノルム補間予測子のリスクは最適なベンチマークに近づく。さらに、バイアス項の詳細な分析を通じて、過剰リスクの上限がゼロに近づくモデルクラスを示しますが、最近の研究arXiv:1906.11300の対応する上限は発散します。さらに、因子回帰モデルで分析された最小ノルム補間予測子は、モデルに依存せず、チューニングパラメータがないにもかかわらず、主成分回帰およびリッジ回帰に基づく予測子と同様のリスクを持つ可能性があり、高次元領域ではLASSOベースの予測子よりも優れていることを示しています。

Scaling Laws from the Data Manifold Dimension
データ多様体次元からのスケーリング則

When data is plentiful, the test loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.

データが豊富な場合、十分に訓練されたニューラルネットワークが達成するテスト損失は、ネットワークパラメータ数$N$に対してべき乗則$L \propto N^{-\alpha}$に従ってスケーリングされます。この経験的スケーリング則は、さまざまなデータモダリティに当てはまり、何桁にもわたって持続する可能性があります。このスケーリング則は、ニューラルモデルが実質的に固有次元$d$のデータ多様体上で回帰を実行しているだけだと考えれば説明できます。この単純な理論は、クロスエントロピー損失と平均二乗誤差損失に対してスケーリング指数が$\alpha \approx 4/d$になると予測します。私たちは、教師/生徒のフレームワークで固有次元とスケーリング指数を独立に測定することでこの理論を確認します。このフレームワークでは、ランダムな教師ネットワークの特性を調整することで、さまざまな$d$と$\alpha$を調べることができます。また、いくつかのデータセットに対するCNN画像分類器とGPTタイプの言語モデルを用いて、理論を検証します。
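
The relation $L \propto N^{-\alpha}$ with $\alpha \approx 4/d$ can be checked numerically with a log-log fit. The following is a minimal sketch with hypothetical function names, using synthetic numbers. It only illustrates how an exponent could be estimated from (parameter count, loss) pairs and compared with the prediction; it does not reproduce the paper's measurements.

```python
import numpy as np

def predicted_exponent(d):
    """Scaling exponent predicted by the theory for intrinsic dimension d."""
    return 4.0 / d

def fit_exponent(n_params, losses):
    """Estimate alpha in L = C * N**(-alpha) by least squares in log-log space."""
    slope, _intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    return -slope

# Synthetic example (made-up values): a data manifold of intrinsic dimension 8.
n_params = np.array([1e3, 1e4, 1e5, 1e6])
losses = 3.0 * n_params ** (-predicted_exponent(8))
alpha_hat = fit_exponent(n_params, losses)
```

For real measurements the fit would only be meaningful in the regime where data is plentiful and the networks are well trained, as the abstract states.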

Deep Learning in Target Space
ターゲット空間での深層学習

Deep learning uses neural networks which are parameterised by their weights. The neural networks are usually trained by tuning the weights to directly minimise a given loss function. In this paper we propose to re-parameterise the weights into targets for the firing strengths of the individual nodes in the network. Given a set of targets, it is possible to calculate the weights which make the firing strengths best meet those targets. It is argued that using targets for training addresses the problem of exploding gradients, by a process which we call cascade untangling, and makes the loss-function surface smoother to traverse, and so leads to easier, faster training, and also potentially better generalisation, of the neural network. It also allows for easier learning of deeper and recurrent network structures. The necessary conversion of targets to weights comes at an extra computational expense, which is in many cases manageable. Learning in target space can be combined with existing neural-network optimisers, for extra gain. Experimental results show the speed of using target space, and examples of improved generalisation, for fully-connected networks and convolutional networks, and the ability to recall and process long time sequences and perform natural-language processing with recurrent networks.

ディープラーニングでは、重みによってパラメータ化されたニューラルネットワークを使用します。ニューラルネットワークは通常、特定の損失関数を直接最小化するように重みを調整することでトレーニングされます。この論文では、重みをネットワーク内の個々のノードの発火強度のターゲットに再パラメータ化することを提案します。ターゲットのセットが与えられれば、発火強度がそれらのターゲットに最もよく適合する重みを計算することができます。トレーニングにターゲットを使用すると、カスケードアンタングルと呼ばれるプロセスによって勾配爆発の問題に対処し、損失関数の表面をスムーズに横断できるため、ニューラルネットワークのトレーニングが簡単かつ高速になり、一般化も改善される可能性があると主張されています。また、より深く再帰的なネットワーク構造の学習も容易になります。ターゲットを重みに変換するには追加の計算コストがかかりますが、多くの場合、管理可能です。ターゲット空間での学習は、既存のニューラルネットワーク最適化装置と組み合わせて、さらなる利益を得ることができます。実験結果は、完全接続ネットワークと畳み込みネットワークのターゲット空間の使用速度と一般化の改善例、および長時間シーケンスの呼び出しと処理能力と再帰型ネットワークによる自然言語処理の実行能力を示しています。

Bayesian Multinomial Logistic Normal Models through Marginally Latent Matrix-T Processes
限界潜在行列T過程によるベイズ多項ロジスティック正規モデル

Bayesian multinomial logistic-normal (MLN) models are popular for the analysis of sequence count data (e.g., microbiome or gene expression data) due to their ability to model multivariate count data with complex covariance structure. However, existing implementations of MLN models are limited to small datasets due to the non-conjugacy of the multinomial and logistic-normal distributions. Motivated by the need to develop efficient inference for Bayesian MLN models, we develop two key ideas. First, we develop the class of Marginally Latent Matrix-T Process (Marginally LTP) models. We demonstrate that many popular MLN models, including those with latent linear, non-linear, and dynamic linear structure, are special cases of this class. Second, we develop an efficient inference scheme for Marginally LTP models with specific accelerations for the MLN subclass. Through application to MLN models, we demonstrate that our inference scheme is both highly accurate and often 4-5 orders of magnitude faster than MCMC.

ベイジアン多項ロジスティック正規(MLN)モデルは、複雑な共分散構造を持つ多変量カウントデータをモデル化できるため、シーケンスカウントデータ(マイクロバイオームや遺伝子発現データなど)の分析によく使用されます。ただし、既存のMLNモデルの実装は、多項分布とロジスティック正規分布の非共役性のため、小規模なデータセットに限定されています。ベイジアンMLNモデルの効率的な推論を開発する必要性から、2つの重要なアイデアを開発しました。まず、限界潜在マトリックスTプロセス(限界LTP)モデルのクラスを開発します。潜在線形、非線形、動的線形構造を持つモデルを含む多くの一般的なMLNモデルがこのクラスの特殊なケースであることを示します。次に、MLNサブクラスに特定の加速を備えた限界LTPモデルの効率的な推論スキームを開発します。MLNモデルへの適用を通じて、推論スキームが非常に正確であり、MCMCよりも4～5桁高速であることが多いことを実証します。

XAI Beyond Classification: Interpretable Neural Clustering
分類を超えたXAI: 解釈可能なニューラルクラスタリング

In this paper, we study two challenging problems in explainable AI (XAI) and data clustering. The first is how to directly design a neural network with inherent interpretability, rather than giving post-hoc explanations of a black-box model. The second is implementing discrete $k$-means with a differentiable neural network that embraces the advantages of parallel computing, online clustering, and clustering-favorable representation learning. To address these two challenges, we design a novel neural network, which is a differentiable reformulation of the vanilla $k$-means, called inTerpretable nEuraL cLustering (TELL). Our contributions are threefold. First, to the best of our knowledge, most existing XAI works focus on supervised learning paradigms. This work is one of the few XAI studies on unsupervised learning, in particular, data clustering. Second, TELL is an interpretable, or the so-called intrinsically explainable and transparent model. In contrast, most existing XAI studies resort to various means for understanding a black-box model with post-hoc explanations. Third, from the view of data clustering, TELL possesses many properties highly desired by $k$-means, including but not limited to online clustering, plug-and-play module, parallel computing, and provable convergence. Extensive experiments show that our method achieves superior performance comparing with 14 clustering approaches on three challenging data sets. The source code could be accessed at www.pengxi.me.

この論文では、説明可能なAI (XAI)とデータクラスタリングにおける2つの困難な問題について検討します。1つ目は、ブラックボックスモデルの事後的な説明を行うのではなく、本質的に解釈可能なニューラルネットワークを直接設計する方法です。2つ目は、並列コンピューティング、オンラインクラスタリング、クラスタリングに適した表現学習の利点を取り入れた微分可能なニューラルネットワークを使用して、離散$k$-meansを実装することです。この2つの課題に対処するために、inTerpretable nEuraL clustering (TELL)と呼ばれる、バニラ$k$-meansを微分可能に再定式化した新しいニューラルネットワークを設計します。私たちの貢献は3つあります。まず、私たちの知る限り、既存のXAI研究のほとんどは、教師あり学習パラダイムに焦点を当てています。本研究は、教師なし学習、特にデータクラスタリングに関する数少ないXAI研究の1つです。2つ目は、TELLは解釈可能な、つまり、いわゆる本質的に説明可能で透明なモデルであるということです。対照的に、既存のXAI研究のほとんどは、事後的な説明を伴うブラックボックスモデルを理解するためにさまざまな手段に頼っています。3番目に、データクラスタリングの観点から、TELLは、オンラインクラスタリング、プラグアンドプレイモジュール、並列コンピューティング、証明可能な収束など、k平均法で非常に望まれる多くの特性を備えています。広範な実験により、3つの困難なデータセットで14のクラスタリング手法と比較して、当社の方法が優れたパフォーマンスを達成することが示されています。ソースコードはwww.pengxi.meでアクセスできます。

Empirical Risk Minimization under Random Censorship
ランダム打ち切り下における経験的リスク最小化

We consider the classic supervised learning problem where a continuous non-negative random label $Y$ (e.g. a random duration) is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d\geq 1$ by means of a regression rule with minimum least square error. In various applications, ranging from industrial quality control to public health through credit risk analysis for instance, training observations can be right censored, meaning that, rather than on independent copies of $(X,Y)$, statistical learning relies on a collection of $n\geq 1$ independent realizations of the triplet $(X, \; \min\{Y,\; C\},\; \delta)$, where $C$ is a nonnegative random variable with unknown distribution, modelling censoring and $\delta=\mathbb{I}\{Y\leq C\}$ indicates whether the duration is right censored or not. As ignoring censoring in the risk computation may clearly lead to a severe underestimation of the target duration and jeopardize prediction, we consider a plug-in estimate of the true risk based on a Kaplan-Meier estimator of the conditional survival function of the censoring $C$ given $X$, referred to as Beran risk, in order to perform empirical risk minimization. It is established, under mild conditions, that the learning rate of minimizers of this biased/weighted empirical risk functional is of order $O_{\mathbb{P}}(\sqrt{\log(n)/n})$ when ignoring model bias issues inherent to plug-in estimation, as can be attained in absence of censoring. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

私たちは、最小二乗誤差を伴う回帰規則を用いて、$\mathbb{R}^d$で$d\geq 1$の値を持つランダムベクトル$X$の観測に基づいて、連続した非負のランダムラベル$Y$ (ランダムな期間など)を予測するという古典的な教師あり学習の問題を考察します。例えば、産業品質管理から公衆衛生、信用リスク分析に至るまで、さまざまなアプリケーションで、トレーニング観測は右打ち切りになることがあります。つまり、$(X,Y)$の独立したコピーではなく、統計学習は3つの要素$(X, \; \min\{Y,\; C\},\; \delta)$の$n\geq 1$個の独立した実現のコレクションに依存します。ここで、$C$は分布が不明な非負のランダム変数で、打ち切りをモデル化し、$\delta=\mathbb{I}\{Y\leq C\}$は期間が右打ち切りかどうかを示します。リスク計算で打ち切りを無視すると、明らかに目標期間が大幅に過小評価され、予測が危うくなる可能性があるため、経験的リスク最小化を実行するために、ベランリスクと呼ばれる、$X$が与えられたときの打ち切り$C$の条件付き生存関数のKaplan-Meier推定量に基づく真のリスクのプラグイン推定を検討します。プラグイン推定に固有のモデルバイアスの問題を無視すると、穏やかな条件下では、このバイアス/重み付けされた経験的リスク関数の最小化者の学習率は、打ち切りがない場合に達成できるのと同様に、$O_{\mathbb{P}}(\sqrt{\log(n)/n})$のオーダーであることが確立されています。理論的な結果以外に、開発されたアプローチの妥当性を示すために数値実験が提示されています。
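
The idea of reweighting uncensored observations by an estimate of the censoring survival function can be sketched as follows. This is a simplified illustration and not the Beran risk from the paper: it assumes that the censoring variable $C$ is independent of $X$ and therefore uses an unconditional Kaplan-Meier estimate, whereas the paper conditions on $X$. It also assumes no tied observation times, and all function names are hypothetical.

```python
import numpy as np

def censoring_survival_left_limit(t, delta):
    """Kaplan-Meier estimate of P(C >= t_i) at each observed time t_i.

    t     : observed times min(Y, C)
    delta : 1.0 if the duration is observed (Y <= C), 0.0 if censored
    Assumption: no ties among the observed times.
    """
    order = np.argsort(t)
    n = len(t)
    at_risk = n - np.arange(n)                 # subjects still at risk at each sorted time
    censor_event = 1.0 - delta[order]          # for C, the "events" are the censored points
    surv = np.cumprod(1.0 - censor_event / at_risk)
    left = np.concatenate(([1.0], surv[:-1]))  # value just before each sorted time
    out = np.empty(n)
    out[order] = left
    return out

def weighted_squared_risk(pred, t, delta):
    """Inverse-probability-of-censoring weighted empirical squared error."""
    weights = delta / censoring_survival_left_limit(t, delta)
    return np.mean(weights * (t - pred) ** 2)
```

Censored points receive weight zero and uncensored points are up-weighted, which compensates for the fact that long durations are more likely to be censored. Without censoring the expression reduces to the ordinary mean squared error.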

Exploiting locality in high-dimensional Factorial hidden Markov models
高次元因子隠れマルコフモデルにおける局所性の活用

We propose algorithms for approximate filtering and smoothing in high-dimensional Factorial hidden Markov models. The approximation involves discarding, in a principled way, likelihood factors according to a notion of locality in a factor graph associated with the emission distribution. This allows the exponential-in-dimension cost of exact filtering and smoothing to be avoided. We prove that the approximation accuracy, measured in a local total variation norm, is “dimension-free” in the sense that as the overall dimension of the model increases the error bounds we derive do not necessarily degrade. A key step in the analysis is to quantify the error introduced by localizing the likelihood function in a Bayes’ rule update. The factorial structure of the likelihood function which we exploit arises naturally when data have known spatial or network structure. We demonstrate the new algorithms on synthetic examples and a London Underground passenger flow problem, where the factor graph is effectively given by the train network.

私たちは、高次元因子隠れマルコフモデルにおける近似フィルタリングおよびスムージングのアルゴリズムを提案します。近似では、出力(emission)分布に関連付けられた因子グラフの局所性の概念に従って、原理に基づいた方法で尤度因子を破棄します。これにより、正確なフィルタリングおよびスムージングにかかる、次元に対して指数的なコストを回避できます。局所的な全変動ノルムで測定された近似精度は、モデルの全体的な次元が増加しても、導出される誤差限界が必ずしも悪化しないという意味で「次元フリー」であることを証明します。分析の重要なステップは、ベイズ規則の更新で尤度関数を局所化することによって生じる誤差を定量化することです。私たちが利用する尤度関数の因子構造は、データが既知の空間構造またはネットワーク構造を持つ場合に自然に発生します。私たちは、合成例とロンドン地下鉄の乗客フロー問題で新しいアルゴリズムを示します。この場合、因子グラフは実質的に列車ネットワークによって与えられます。

Recovering shared structure from multiple networks with unknown edge distributions

In increasingly many settings, data sets consist of multiple samples from a population of networks, with vertices aligned across networks; for example, brain connectivity networks in neuroscience. We consider the setting where the observed networks have a shared expectation, but may differ in the noise structure on their edges. Our approach exploits the shared mean structure to denoise edge-level measurements of the observed networks and estimate the underlying population-level parameters. We also explore the extent to which edge-level errors influence estimation and downstream inference. In the process, we establish a finite-sample concentration inequality for the low-rank eigenvalue truncation of a random weighted adjacency matrix, which may be of independent interest. The proposed approach is illustrated on synthetic networks and on data from an fMRI study of schizophrenia.
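
As a rough illustration of how a shared mean can be used for denoising, the sketch below averages vertex-aligned adjacency matrices and keeps only the leading eigen-components of the average. It rests on several assumptions not taken from the abstract: a plain (unweighted) average, a rank chosen by the user, and a basic power iteration with deflation that presumes the starting vector is not orthogonal to the leading eigenvector. All function names are hypothetical.

```python
import math


def average_networks(networks):
    """Entrywise mean of a list of equally sized square matrices."""
    count = len(networks)
    size = len(networks[0])
    return [[sum(a[i][j] for a in networks) / count for j in range(size)]
            for i in range(size)]


def top_eigenpair(mat, iters=500):
    """Power iteration for the eigenpair of largest magnitude (symmetric input)."""
    size = len(mat)
    v = [1.0 / math.sqrt(size)] * size
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:  # matrix is numerically zero along v
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue, including its sign.
    lam = sum(v[i] * mat[i][j] * v[j] for i in range(size) for j in range(size))
    return lam, v


def low_rank_truncation(mat, rank):
    """Sum of the `rank` leading eigen-components, found by deflation."""
    size = len(mat)
    residual = [row[:] for row in mat]
    approx = [[0.0] * size for _ in range(size)]
    for _ in range(rank):
        lam, v = top_eigenpair(residual)
        for i in range(size):
            for j in range(size):
                term = lam * v[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term
    return approx
```

Calling `low_rank_truncation(average_networks(networks), rank)` gives an estimate of the shared expectation matrix under these assumptions.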

Debiased Distributed Learning for Sparse Partial Linear Models in High Dimensions

Although various distributed machine learning schemes have been proposed recently for purely linear models and fully nonparametric models, little attention has been paid to distributed optimization for semi-parametric models with multiple structures (e.g. sparsity, linearity and nonlinearity). To address these issues, the current paper proposes a new communication-efficient distributed learning algorithm for sparse partially linear models with an increasing number of features. The proposed method is based on the classical divide and conquer strategy for handling big data and the computation on each subsample consists of a debiased estimation of the doubly regularized least squares approach. With the proposed method, we theoretically prove that our global parametric estimator can achieve the optimal parametric rate in our semi-parametric model given an appropriate partition on the total data. Specifically, the choice of data partition relies on the underlying smoothness of the nonparametric component, and it is adaptive to the sparsity parameter. Finally, some simulated experiments are carried out to illustrate the empirical performances of our debiased technique under the distributed setting.
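
The divide-and-conquer pattern mentioned above can be sketched in a few lines: split the data, compute an estimator on each subsample, and average the local estimates. The local estimator here is only a stand-in (a one-dimensional least-squares slope through the origin), not the debiased doubly regularized estimator of the paper, and the names `local_slope` and `divide_and_conquer_slope` are hypothetical.

```python
def local_slope(xs, ys):
    """Least-squares slope through the origin on one subsample.

    Stand-in for the local estimator; the paper's debiased, doubly
    regularized estimator is not reproduced here.
    """
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def divide_and_conquer_slope(xs, ys, num_machines):
    """Average of the local estimates over `num_machines` subsamples."""
    estimates = []
    for k in range(num_machines):
        # Strided split: machine k receives observations k, k + m, k + 2m, ...
        sub_x = xs[k::num_machines]
        sub_y = ys[k::num_machines]
        estimates.append(local_slope(sub_x, sub_y))
    return sum(estimates) / len(estimates)
```

Only one number per machine has to be communicated in this sketch, which is what makes the averaging step cheap; how many subsamples can be used without losing statistical accuracy is the question the paper's theory addresses.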

Joint Estimation and Inference for Data Integration Problems based on Multiple Multi-layered Gaussian Graphical Models

The rapid development of high-throughput technologies has enabled the generation of data from biological or disease processes that span multiple layers, like genomic, proteomic or metabolomic data, and further pertain to multiple sources, like disease subtypes or experimental conditions. In this work, we propose a general statistical framework based on Gaussian graphical models for horizontal (i.e. across conditions or subtypes) and vertical (i.e. across different layers containing data on molecular compartments) integration of information in such datasets. We start with decomposing the multi-layer problem into a series of two-layer problems. For each two-layer problem, we model the outcomes at a node in the lower layer as dependent on those of other nodes in that layer, as well as all nodes in the upper layer. We use a combination of neighborhood selection and group-penalized regression to obtain sparse estimates of all model parameters. Following this, we develop a debiasing technique and asymptotic distributions of inter-layer directed edge weights that utilize already computed neighborhood selection coefficients for nodes in the upper layer. Subsequently, we establish global and simultaneous testing procedures for these edge weights. Performance of the proposed methodology is evaluated on synthetic and real data.
