Journal of Machine Learning Research Papers: List of Papers in Volume 21

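This page compiles the contents of Journal of Machine Learning Research Papers Volume 21 into a list. Each title and abstract is followed by a Japanese version produced with the help of machine translation.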
Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra
深層学習スペクトルに浸透するクラス/クラス間構造の痕跡

Numerous researchers have recently applied empirical spectral analysis to the study of modern deep learning classifiers. We identify and discuss an important formal class/cross-class structure and show how it lies at the origin of the many visually striking features observed in deep neural network spectra, some of which were reported in recent articles, while others are unveiled here for the first time. These include spectral outliers, “spikes”, and small but distinct continuous distributions, “bumps”, often seen beyond the edge of a “main bulk”.

最近、多くの研究者が経験的スペクトル解析を最新の深層学習分類器の研究に適用しました。私たちは、重要な形式クラス/クロスクラス構造を特定し、それがディープニューラルネットワークスペクトルで観察される多くの視覚的に印象的な特徴の起源にどのように位置しているかを示します。これらには、スペクトルの外れ値、「スパイク」、および小さいながらも明確な連続分布「バンプ」が含まれ、多くの場合、「メインバルク」のエッジを超えて見られます。

Online matrix factorization for Markovian data and applications to Network Dictionary Learning
マルコフデータのオンライン行列分解とネットワーク辞書学習への応用

Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this paper, we show that a non-convex generalization of the well-known OMF algorithm for i.i.d. stream of data in Mairal et al. converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. This allows one to extract features more efficiently from dependent data streams, as there is no need to subsample the data sequence to approximately satisfy the independence assumption. As the main application, by combining online non-negative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning, which extracts “network dictionary patches” from a given network in an online manner that encodes main features of the network. We demonstrate this technique and its application to network denoising problems on real-world network data.

オンライン行列分解(OMF)は辞書学習問題のための基本的なツールであり、抽出された特徴の数を減らして複雑なデータセットを近似的に表現します。文献にあるほとんどのOMFアルゴリズムの収束保証は、データマトリックス間の独立性を前提としており、従属データストリームの場合はほとんど未調査のままです。この論文では、Mairalらによるi.i.d.データストリーム用のよく知られたOMFアルゴリズムの非凸一般化が、データマトリックスが軽度の混合条件を満たす何らかの基礎マルコフ連鎖の関数である場合でも、期待損失関数の臨界点のセットにほぼ確実に収束することを示します。これにより、独立性の仮定を近似的に満たすためにデータシーケンスをサブサンプリングする必要がないため、従属データストリームからより効率的に特徴を抽出できます。主な応用として、オンライン非負行列分解とネットワークからモチーフをサンプリングする最新のMCMCアルゴリズムを組み合わせることで、ネットワーク辞書学習の新しいフレームワークを提案します。これは、ネットワークの主要な特徴をエンコードする「ネットワーク辞書パッチ」を、特定のネットワークからオンラインで抽出します。この手法と、実際のネットワークデータに対するネットワークノイズ除去問題への応用を示します。

High-dimensional quantile tensor regression
高次元分位点テンソル回帰

Quantile regression is an indispensable tool for statistical learning. Traditional quantile regression methods consider vector-valued covariates and estimate the corresponding coefficient vector. Many modern applications involve data with a tensor structure. In this paper, we propose a quantile regression model which takes tensors as covariates, and present an estimation approach based on Tucker decomposition. It effectively reduces the number of parameters, leading to efficient estimation and feasible computation. We also use a sparse Tucker decomposition, which is a popular approach in the literature, to further reduce the number of parameters when the dimension of the tensor is large. We propose an alternating update algorithm combined with alternating direction method of multipliers (ADMM). The asymptotic properties of the estimators are established under suitable conditions. The numerical performances are demonstrated via simulations and an application to a crowd density estimation problem.

分位回帰は、統計学習に欠かせないツールです。従来の分位回帰法では、ベクトル値の共変量を考慮し、対応する係数ベクトルを推定します。多くの最新のアプリケーションでは、テンソル構造を持つデータが関係します。この論文では、共変量としてテンソルを取る分位回帰モデルを提案し、タッカー分解に基づく推定アプローチを示します。これにより、パラメータ数が効果的に削減され、効率的な推定と実行可能な計算が可能になります。また、文献で一般的なアプローチであるスパースタッカー分解を使用して、テンソルの次元が大きい場合にパラメータ数をさらに削減します。交互方向乗数法(ADMM)と組み合わせた交互更新アルゴリズムを提案します。推定量の漸近特性は、適切な条件下で確立されます。数値パフォーマンスは、シミュレーションと群衆密度推定問題へのアプリケーションによって実証されています。
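As a brief illustration of the loss behind quantile regression in general (not of the tensor estimator proposed in the paper), the check or pinball loss is rho_tau(u) = u * (tau - 1{u < 0}). Minimizing its empirical average over a constant recovers a sample quantile. The function names and toy data below are hypothetical.

```python
def pinball_loss(residual, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def best_constant(data, tau):
    """Return the data point that minimizes the average pinball loss."""
    def risk(c):
        return sum(pinball_loss(y - c, tau) for y in data) / len(data)
    return min(data, key=risk)


# Toy data with one large outlier (made up for illustration).
toy = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(toy, tau=0.5)
print(median_fit)
```

With tau = 0.5 the minimizer is the sample median, which is why the outlier at 100 has no influence on the fitted value.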

Learning Mixed Latent Tree Models
混合潜在木モデルの学習

Latent structural learning has attracted increasing attention in recent years. However, most related work focuses only on purely continuous or purely discrete data. In this paper, we consider mixed latent tree models for mixed data mining. We address latent structural learning and parameter estimation for these mixed models. For structural learning, we propose a consistent bottom-up algorithm, and give a finite sample bound guarantee for exact structural recovery. For parameter estimation, we suggest a moment estimator that exploits matrix decomposition, and prove asymptotic normality of the estimator. Experiments on simulated and real data support the validity of our method for mining hierarchical structure and latent information.

近年、潜在的構造学習が注目されています。しかし、ほとんどの関連研究は、純粋な連続データまたは純粋な離散データにのみ焦点を当てています。この論文では、混合データマイニングのための混合潜在木モデルについて考察します。これらの混合モデルに対する潜在構造学習とパラメータ推定について取り組みます。構造学習については、一貫性のあるボトムアップアルゴリズムを提案し、正確な構造回復に対して有限のサンプル範囲を保証します。パラメータ推定については、行列分解を利用してモーメント推定量を提案し、推定量の漸近正規性を証明します。シミュレーションデータと実際のデータでの実験は、私たちの方法が階層構造と潜在情報のマイニングに有効であることを裏付けています。

Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning
機械学習のエネルギーとカーボンフットプリントの体系的な報告に向けて

Accurate reporting of energy and carbon usage is essential for understanding the potential climate impacts of machine learning research. We introduce a framework that makes this easier by providing a simple interface for tracking realtime energy consumption and carbon emissions, as well as generating standardized online appendices. Utilizing this framework, we create a leaderboard for energy efficient reinforcement learning algorithms to incentivize responsible research in this area as an example for other areas of machine learning. Finally, based on case studies using our framework, we propose strategies for mitigation of carbon emissions and reduction of energy consumption. By making accounting easier, we hope to further the sustainable development of machine learning experiments and spur more research into energy efficient algorithms.

エネルギーと炭素の使用量を正確に報告することは、機械学習研究の潜在的な気候への影響を理解するために不可欠です。私たちは、リアルタイムのエネルギー消費と炭素排出量を追跡するためのシンプルなインターフェースを提供し、標準化されたオンライン付録を生成することで、これを容易にするフレームワークを導入しています。このフレームワークを利用して、エネルギー効率の高い強化学習アルゴリズムのリーダーボードを作成し、機械学習の他の分野の例として、この分野での責任ある研究を奨励します。最後に、私たちのフレームワークを使用したケーススタディに基づいて、炭素排出量の軽減とエネルギー消費の削減のための戦略を提案します。会計処理を容易にすることで、機械学習実験の持続可能な発展を促進し、エネルギー効率の高いアルゴリズムの研究をさらに促進したいと考えています。
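The accounting described above rests on simple arithmetic: energy is average power multiplied by time, and emissions are energy multiplied by the carbon intensity of the electricity supply. The sketch below shows only that arithmetic. It is not the interface of the framework in the paper, and the function names, the optional PUE (power usage effectiveness) multiplier, and all numbers are assumptions chosen for illustration.

```python
def energy_kwh(avg_power_watts, hours, pue=1.0):
    """Energy in kWh from average power draw and duration (pue is an assumed overhead factor)."""
    return avg_power_watts * hours / 1000.0 * pue


def emissions_kg(energy, intensity_g_per_kwh):
    """Emissions in kg CO2-equivalent from energy (kWh) and grid carbon intensity (g/kWh)."""
    return energy * intensity_g_per_kwh / 1000.0


# Hypothetical run: 250 W average draw for 8 hours on a 400 gCO2eq/kWh grid.
run_energy = energy_kwh(avg_power_watts=250.0, hours=8.0)
run_emissions = emissions_kg(run_energy, intensity_g_per_kwh=400.0)
print(run_energy, run_emissions)
```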

Adaptive Rates for Total Variation Image Denoising
総変動画像ノイズ除去の適応率

We study the theoretical properties of image denoising via total variation penalized least-squares. We define the total variation in terms of the two-dimensional total discrete derivative of the image and show that it gives rise to denoised images that are piecewise constant on rectangular sets. We prove that, if the true image is piecewise constant on just a few rectangular sets, the denoised image converges to the true image at a parametric rate, up to a log factor. More generally, we show that the denoised image enjoys oracle properties, that is, it is almost as good as if some aspects of the true image were known. In other words, image denoising with total variation regularization leads to an adaptive reconstruction of the true image.

私たちは、全変動ペナルティを受けた最小二乗法による画像のノイズ除去の理論的特性を研究します。イメージの2次元全離散微分の観点から全変動を定義し、それが長方形のセットで区分的に一定のノイズ除去されたイメージを生じさせることを示します。真のイメージが少数の長方形のセットで区分的に一定である場合、ノイズ除去されたイメージは、対数因子を除いてパラメトリックレートで真のイメージに収束することを証明します。より一般的には、ノイズ除去されたイメージがオラクルプロパティを享受している、つまり、真のイメージの一部の側面がわかっている場合とほぼ同程度に良いことを示します。言い換えれば、全変動正則化による画像のノイズ除去は、真の画像の適応的再構築につながります。
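One plausible reading of the "two-dimensional total discrete derivative" is the mixed second-order difference x[i][j] - x[i-1][j] - x[i][j-1] + x[i-1][j-1]. The sketch below sums its absolute value over the image under that assumption. The paper's exact definition may differ, for example in boundary terms or normalization. The function name and test images are hypothetical.

```python
def mixed_difference_tv(image):
    """Sum of |x[i][j] - x[i-1][j] - x[i][j-1] + x[i-1][j-1]| over i, j >= 1."""
    total = 0.0
    for i in range(1, len(image)):
        for j in range(1, len(image[0])):
            total += abs(
                image[i][j] - image[i - 1][j] - image[i][j - 1] + image[i - 1][j - 1]
            )
    return total


# A 4x4 image that is constant on a single rectangle, and a checkerboard.
block = [[1.0 if i < 2 and j < 2 else 0.0 for j in range(4)] for i in range(4)]
checker = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
print(mixed_difference_tv(block), mixed_difference_tv(checker))
```

Under this definition the rectangle image only contributes at the corner of the rectangle, whereas the checkerboard contributes at every 2x2 cell, which matches the intuition that the penalty favors images that are piecewise constant on few rectangles.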

On Efficient Adjustment in Causal Graphs
因果グラフの効率的な調整について

We consider estimation of a total causal effect from observational data via covariate adjustment. Ideally, adjustment sets are selected based on a given causal graph, reflecting knowledge of the underlying causal structure. Valid adjustment sets are, however, not unique. Recent research has introduced a graphical criterion for an ‘optimal’ valid adjustment set (O-set). For a given graph, adjustment by the O-set yields the smallest asymptotic variance compared to other adjustment sets in certain parametric and non-parametric models. In this paper, we provide three new results on the O-set. First, we give a novel, more intuitive graphical characterisation: We show that the O-set is the parent set of the outcome node(s) in a suitable latent projection graph, which we call the forbidden projection. An important property is that the forbidden projection preserves all information relevant to total causal effect estimation via covariate adjustment, making it a useful methodological tool in its own right. Second, we extend the existing IDA algorithm to use the O-set, and argue that the algorithm remains semi-local. This is implemented in the R-package pcalg. Third, we present assumptions under which the O-set can be viewed as the target set of popular non-graphical variable selection algorithms such as stepwise backward selection.

私たちは、共変量調整による観測データからの総因果効果の推定について検討します。理想的には、調整セットは、基礎となる因果構造に関する知識を反映して、与えられた因果グラフに基づいて選択されます。しかし、有効な調整セットは一意ではない。最近の研究では、「最適な」有効な調整セット(Oセット)のグラフィカル基準が導入されています。与えられたグラフでは、Oセットによる調整は、特定のパラメトリックおよびノンパラメトリックモデルにおける他の調整セットと比較して、最小の漸近分散をもたらす。この論文では、Oセットに関する3つの新しい結果を提供します。まず、新しい、より直感的なグラフィカルな特徴付けを示す。Oセットは、適切な潜在投影グラフ(禁制投影と呼ぶ)における結果ノードの親セットであることを示す。重要な特性は、禁制投影が共変量調整による総因果効果の推定に関連するすべての情報を保持するため、それ自体が有用な方法論的ツールとなることです。次に、既存のIDAアルゴリズムを拡張してOセットを使用し、アルゴリズムが半ローカルのままであると主張します。これは、Rパッケージpcalgに実装されています。3番目に、Oセットを、段階的後方選択などの一般的な非グラフィカル変数選択アルゴリズムのターゲットセットとして見ることができるという仮定を示します。

A Group-Theoretic Framework for Data Augmentation
データ拡張のための群論的フレームワーク

Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).

データ拡張は、ディープニューラルネットワークのトレーニング時に広く使用されているトリックです:元のデータに加えて、適切に変換されたデータもトレーニングセットに追加されます。しかし、私たちの知る限りでは、データ拡張のパフォーマンス上の利点を説明する明確な数学的フレームワークは利用できません。この論文では、このような理論的枠組みを発展させます。データ拡張は、データ分布をほぼ不変に保つ特定の群の軌道上の平均化操作と同等であることを示します。それが分散の縮小につながることを証明します。経験的リスクの最小化、指数型分布族、線形回帰、および特定の2層ニューラルネットワークの例を研究します。また、クライオ電子顕微鏡法(cryo-EM)など、他のアプローチが普及している対称性の問題でデータ拡張をどのように使用できるかについても説明します。
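The orbit-averaging view can be sketched on a toy case. Here the group is assumed to be the cyclic shifts acting on a tuple, and the function being averaged simply returns the first coordinate. Both choices, and all names, are hypothetical and are not taken from the paper.

```python
def cyclic_shifts(x):
    """All elements of the orbit of x under the cyclic shift group."""
    return [x[k:] + x[:k] for k in range(len(x))]


def orbit_average(f, x):
    """Average f over the orbit of x, as augmentation with every shift would do."""
    orbit = cyclic_shifts(x)
    return sum(f(g_x) for g_x in orbit) / len(orbit)


def first_coordinate(x):
    return x[0]


sample = (3.0, 1.0, 2.0)
shifted = (1.0, 2.0, 3.0)  # a cyclic shift of sample

# The raw function changes under the group action; its orbit average does not.
print(first_coordinate(sample), first_coordinate(shifted))
print(orbit_average(first_coordinate, sample), orbit_average(first_coordinate, shifted))
```

The averaged function takes the same value on every point of an orbit, which is the invariance that the variance-reduction argument in the abstract builds on.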

Rank-based Lasso – efficient methods for high-dimensional robust model selection
ランクベースのLasso – 高次元のロバストなモデル選択のための効率的な方法

We consider the problem of identifying significant predictors in large databases, where the response variable depends on a linear combination of explanatory variables through an unknown monotonic link function, corrupted with noise from an unknown distribution. We utilize a natural, robust and efficient approach, which relies on replacing values of the response variables by their ranks and then identifying significant predictors by using the well-known Lasso. We provide new consistency results for the proposed procedure (called “RankLasso”) and extend the scope of its applications by proposing its thresholded and adaptive versions. Our theoretical results show that these modifications can identify the set of relevant predictors under a wide range of data generating scenarios. Theoretical results are supported by the simulation study and the real data analysis, which show that our methods can properly identify relevant predictors, even when the error terms come from the Cauchy distribution and the link function is nonlinear. They also demonstrate the superiority of the modified versions of RankLasso over its regular version when predictors are substantially correlated. The numerical study also shows that RankLasso performs substantially better in model selection than LADLasso, which is a well-established methodology for robust model selection.

私たちは、応答変数が未知の分布からのノイズで損なわれた未知の単調リンク関数を介した説明変数の線形結合に依存する大規模データベースで重要な予測子を識別する問題を考察します。私たちは、応答変数の値をその順位で置き換え、次によく知られているLassoを使用して重要な予測子を識別することに依存する、自然で堅牢かつ効率的なアプローチを利用します。私たちは、提案された手順(「RankLasso」と呼ばれる)の新しい一貫性の結果を提供し、しきい値バージョンと適応バージョンを提案することで、その適用範囲を拡大します。我々の理論的結果は、これらの修正により、幅広いデータ生成シナリオの下で関連する予測子のセットを識別できることを示しています。理論的結果は、シミュレーション研究と実際のデータ分析によって裏付けられており、誤差項がコーシー分布から来ていてリンク関数が非線形である場合でも、我々の方法が関連する予測子を適切に識別できることを示しています。また、予測子が実質的に相関している場合、RankLassoの修正バージョンが通常のバージョンよりも優れていることも示しています。数値的研究では、RankLassoが、堅牢なモデル選択のための確立された方法論であるLADLassoよりもモデル選択において大幅に優れたパフォーマンスを発揮することも示されています。
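The two steps named above (replace the responses by their ranks, then run the Lasso) can be sketched as follows. This is a minimal sketch under stated assumptions: the centering rank/n - 0.5, the plain coordinate-descent solver, the penalty level, and the toy data are all illustrative choices. The thresholded and adaptive variants are not shown.

```python
import math


def centered_ranks(y):
    """Replace each response by rank/n - 0.5 (ties are not handled here)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    ranks = [0.0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position / n - 0.5
    return ranks


def soft_threshold(z, t):
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_cd(X, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam) / scale
            resid = [r + c * (beta[j] - new_bj) for c, r in zip(col, resid)]
            beta[j] = new_bj
    return beta


# Toy data: the response depends on x1 only, through a nonlinear monotone link.
x1 = [i - 4.5 for i in range(1, 9)]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]  # orthogonal to x1
X = [[a, b] for a, b in zip(x1, x2)]
y = [math.exp(a) for a in x1]
beta_hat = lasso_cd(X, centered_ranks(y), lam=0.05)
print(beta_hat)
```

Because ranks are unchanged by any increasing transformation of the response, the fit on `centered_ranks(y)` is identical to the fit that would be obtained if the link were the identity.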

Best Practices for Scientific Research on Neural Architecture Search
ニューラルアーキテクチャ検索に関する科学研究のベストプラクティス

Finding a well-performing architecture is often tedious for both deep learning practitioners and researchers, leading to tremendous interest in the automation of this task by means of neural architecture search (NAS). Although the community has made major strides in developing better NAS methods, the quality of scientific empirical evaluations in the young field of NAS is still lagging behind that of other areas of machine learning. To address this issue, we describe a set of possible issues and ways to avoid them, leading to the NAS best practices checklist available at http://automl.org/nas_checklist.pdf.

優れたパフォーマンスを発揮するアーキテクチャを見つけることは、ディープラーニングの実践者と研究者の両方にとって退屈なことが多く、ニューラルアーキテクチャ検索(NAS)によるこのタスクの自動化に大きな関心が寄せられています。コミュニティはより優れたNAS手法の開発において大きな進歩を遂げましたが、NASの若い分野における科学的な実証的評価の質は、機械学習の他の分野にはまだ及ばない。この問題に対処するために、考えられる一連の問題とその回避方法について説明し、http://automl.org/nas_checklist.pdfで利用可能なNASベストプラクティスチェックリストを示します。

Fair Data Adaptation with Quantile Preservation
分位点保存による公正なデータ適応

Fairness of classification and regression has received much attention recently and various, partially incompatible, criteria have been proposed. The fairness criteria can be enforced for a given classifier or, alternatively, the data can be adapted to ensure that every classifier trained on the data will adhere to desired fairness criteria. We present a practical data adaptation method based on quantile preservation in causal structural equation models. The data adaptation is based on a presumed counterfactual model for the data. While the counterfactual model itself cannot be verified experimentally, we show that certain population notions of fairness are still guaranteed even if the counterfactual model is misspecified. The nature of the fulfilled observational non-causal fairness notion (such as demographic parity, separation or sufficiency) depends on the structure of the underlying causal model and the choice of resolving variables. We describe an implementation of the proposed data adaptation procedure based on Random Forests and demonstrate its practical use on simulated and real-world data.

分類と回帰の公平性は最近多くの注目を集めており、部分的に互換性のないさまざまな基準が提案されています。公平性の基準は、特定の分類器に適用することも、データでトレーニングされたすべての分類器が望ましい公平性の基準に準拠するようにデータを適応させることもできます。因果構造方程式モデルにおける分位保存に基づく実用的なデータ適応方法を紹介します。データ適応は、データの想定された反事実モデルに基づいています。反事実モデル自体は実験的に検証できませんが、反事実モデルが誤って指定されている場合でも、公平性の特定の人口概念が保証されることを示します。満たされる観測非因果的公平性の概念(人口統計的パリティ、分離、十分性など)の性質は、基礎となる因果モデルの構造と解決変数の選択によって異なります。ランダムフォレストに基づく提案されたデータ適応手順の実装について説明し、シミュレーションデータと実際のデータでの実際の使用を示します。
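The idea of quantile preservation can be illustrated on a deliberately simplified case: a single variable with no causal parents, where a value is moved from one group to the other group's value at the same empirical quantile. This is a hypothetical marginal sketch only. The method in the paper works with conditional quantiles in a causal structural equation model and is implemented with Random Forests, neither of which is reproduced here.

```python
import math


def quantile_preserving_map(x, source, target):
    """Map x from `source` to the value of `target` at the same empirical quantile."""
    src, tgt = sorted(source), sorted(target)
    q = sum(1 for v in src if v <= x) / len(src)  # empirical CDF of x in source
    index = max(math.ceil(q * len(tgt)) - 1, 0)  # matching order statistic in target
    return tgt[index]


# Made-up values for two groups.
group_a = [10.0, 20.0, 30.0, 40.0]
group_b = [1.0, 2.0, 3.0, 4.0]
adapted = [quantile_preserving_map(x, group_a, group_b) for x in group_a]
print(adapted)
```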

Efficient Inference for Nonparametric Hawkes Processes Using Auxiliary Latent Variables
補助潜在変数を用いたノンパラメトリックホークス過程の効率的な推論

The expressive ability of classic Hawkes processes is limited due to the parametric assumption on the baseline intensity and triggering kernel. Therefore, it is desirable to perform inference in a data-driven, nonparametric approach. Many recent works have proposed nonparametric Hawkes process models based on Gaussian processes (GP). However, the likelihood is non-conjugate to the prior, resulting in a complicated and time-consuming inference procedure. To address the problem, we present the sigmoid Gaussian Hawkes process model in this paper: the baseline intensity and triggering kernel are both modeled as the sigmoid transformation of random trajectories drawn from a GP. By introducing auxiliary latent random variables (branching structure, Pólya-Gamma random variables and latent marked Poisson processes), the likelihood is converted to two decoupled components with a Gaussian form, which allows for efficient conjugate analytical inference. Using the augmented likelihood, we derive an efficient Gibbs sampling algorithm to sample from the posterior, an efficient expectation-maximization (EM) algorithm to obtain the maximum a posteriori (MAP) estimate, and furthermore an efficient mean-field variational inference algorithm to approximate the posterior. To further accelerate the inference, a sparse GP approximation is introduced to reduce complexity. We demonstrate the performance of our three algorithms on both simulated and real data. The experiments show that our proposed inference algorithms can efficiently recover the underlying prompting characteristics.

古典的なホークス過程の表現能力は、ベースライン強度とトリガーカーネルに対するパラメトリック仮定のために制限されています。したがって、データ駆動型のノンパラメトリックアプローチで推論を実行することが望ましいです。最近の多くの研究では、ガウス過程(GP)に基づくノンパラメトリックホークス過程モデルが提案されています。しかし、尤度は事前分布と非共役であるため、推論手順が複雑で時間がかかります。この問題に対処するために、本稿ではシグモイドガウスホークス過程モデルを提示します。ベースライン強度とトリガーカーネルは、どちらもGPから抽出されたランダム軌道のシグモイド変換としてモデル化されます。補助的な潜在的ランダム変数(分岐構造、Pólya-Gammaランダム変数、潜在的マーク付きポアソン過程)を導入することで、尤度はガウス形式の2つの分離されたコンポーネントに変換され、効率的な共役分析推論が可能になります。拡張尤度を使用して、事後分布からサンプリングする効率的なギブスサンプリングアルゴリズム、最大事後確率(MAP)推定値を取得する効率的な期待値最大化(EM)アルゴリズム、さらに事後分布を近似する効率的な平均場変分推論アルゴリズムを導出します。推論をさらに高速化するために、スパースGP近似を導入して複雑さを軽減します。シミュレーションデータと実際のデータの両方で、3つのアルゴリズムのパフォーマンスを実証します。実験により、提案された推論アルゴリズムが、基礎となるプロンプト特性を効率的に回復できることが示されました。
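For context, the parametric assumption mentioned at the start of the abstract can be made concrete with a classic Hawkes process that has a constant baseline and an exponential triggering kernel. The sketch below evaluates that textbook intensity only. It does not implement the sigmoid Gaussian Hawkes model or any of the three inference algorithms, and the parameter names are assumptions.

```python
import math


def hawkes_intensity(t, events, mu, alpha, beta):
    """lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)


# Each past event raises the intensity, and the effect decays at rate beta.
print(hawkes_intensity(1.0, [], mu=0.5, alpha=0.8, beta=1.0))
print(hawkes_intensity(1.0, [0.0], mu=0.5, alpha=0.8, beta=1.0))
```

In the nonparametric model described above, both the constant `mu` and the fixed exponential shape would be replaced by functions learned from data.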

Risk Bounds for Reservoir Computing
リザバーコンピューティングのリスク限界

We analyze the practices of reservoir computing in the framework of statistical learning theory. In particular, we derive finite sample upper bounds for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure. Non-asymptotic bounds are explicitly written down in terms of the multivariate Rademacher complexities of the reservoir systems and the weak dependence structure of the signals that are being handled. This makes it possible, in particular, to determine the minimal number of observations needed in order to guarantee a prescribed estimation accuracy with high probability for a given reservoir family. At the same time, the asymptotic behavior of the devised bounds guarantees the consistency of the empirical risk minimization procedure for various hypothesis classes of reservoir functionals.

私たちは、統計学習理論の枠組みの中でリザバーコンピューティングの実践を分析します。特に、特定のリザバーコンピューティングシステムのファミリーが、その依存構造に関するさまざまな仮説の下で離散時間入力を処理するときに犯す汎化誤差の有限サンプル上限を導出します。非漸近的な上界は、リザバーシステムの多変量Rademacher複雑度と、処理される信号の弱い依存構造の観点から明示的に記述されます。これにより、特に、特定のリザバーファミリーに対して高い確率で所定の推定精度を保証するために必要な観測値の最小数を決定できます。同時に、考案された上界の漸近的な振る舞いは、リザバー汎関数のさまざまな仮説クラスに対する経験的リスク最小化手順の一致性を保証します。

Minimal Learning Machine: Theoretical Results and Clustering-Based Reference Point Selection
最小学習マシン:理論結果とクラスタリングに基づく基準点選択

The Minimal Learning Machine (MLM) is a nonlinear, supervised approach based on learning linear mapping between distance matrices computed in input and output data spaces, where distances are calculated using a subset of points called reference points. Its simple formulation has attracted several recent works on extensions and applications. In this paper, we aim to address some open questions related to the MLM. First, we detail the theoretical aspects that assure the MLM’s interpolation and universal approximation capabilities, which had previously only been empirically verified. Second, we identify the major importance of the task of selecting reference points for the MLM’s generalization capability. Several clustering-based methods for reference point selection in regression scenarios are then proposed and analyzed. Based on an extensive empirical evaluation, we conclude that the evaluated methods are both scalable and useful. Specifically, for a small number of reference points, the clustering-based methods outperform the standard random selection of the original MLM formulation.

最小学習マシン(MLM)は、入力データ空間と出力データ空間で計算された距離行列間の線形マッピングを学習することに基づく非線形の教師ありアプローチであり、距離は参照点と呼ばれる点のサブセットを使用して計算されます。そのシンプルな定式化は、拡張とアプリケーションに関する最近のいくつかの研究の注目を集めています。この論文では、MLMに関連するいくつかの未解決の問題に対処することを目的としています。まず、これまで経験的にのみ検証されていたMLMの補間機能と普遍的近似機能を保証する理論的側面を詳しく説明します。次に、MLMの一般化機能にとって参照点を選択するタスクが非常に重要であることを明らかにします。次に、回帰シナリオでの参照点選択のためのクラスタリングベースの方法をいくつか提案し、分析します。広範な経験的評価に基づいて、評価された方法はスケーラブルで有用であると結論付けています。具体的には、少数の参照点の場合、クラスタリングベースの方法は、元のMLM定式化の標準的なランダム選択よりも優れています。

algcomparison: Comparing the Performance of Graphical Structure Learning Algorithms with TETRAD
algcomparison: TETRAD によるグラフィカル構造学習アルゴリズムの性能の比較

In this report we describe a tool for comparing the performance of graphical causal structure learning algorithms implemented in the TETRAD freeware suite of causal analysis methods. Currently the tool is available as a package in the TETRAD source code (written in Java). Simulations can be done varying the number of runs, sample sizes, and data modalities. Performance on this simulated data can then be compared for a number of algorithms, with parameters varied and with performance statistics as selected, producing a publishable report. The package presented here may also be used to compare structure learning methods across platforms and programming languages, i.e., to compare algorithms implemented in TETRAD with those implemented in MATLAB, Python, or R.

このレポートでは、TETRADフリーウェアの因果分析手法スイートに実装されたグラフィカルな因果構造学習アルゴリズムのパフォーマンスを比較するためのツールについて説明します。現在、このツールはTETRADソースコード(Javaで記述)のパッケージとして利用できます。シミュレーションは、実行回数、サンプルサイズ、およびデータモダリティを変化させて実行できます。このシミュレーションデータのパフォーマンスは、さまざまなパラメーターと選択したパフォーマンス統計を使用して、いくつかのアルゴリズムで比較でき、パブリッシュ可能なレポートを作成できます。ここで紹介するパッケージは、プラットフォームやプログラミング言語間で構造学習方法を比較するため、つまり、TETRADに実装されたアルゴリズムとMATLAB、Python、またはRに実装されたアルゴリズムを比較するためにも使用できます。

The Error-Feedback framework: SGD with Delayed Gradients
エラーフィードバックフレームワーク:遅延勾配のあるSGD

We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate forms of delayed gradients: compressed gradients under error compensation, and local SGD, where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are the key to achieving linear speedups for optimization with multiple devices.

私たちは、滑らかな準凸関数および非凸関数に対する遅延更新を伴う（確率的）勾配降下法（SGD）を分析し、簡潔で非漸近的な収束率を導出します。私たちは、すべてのケースにおいて収束率が2つの項、すなわち（i）遅延の影響を受けない確率的項、および（ii）遅延によって線形にのみ遅くなる高次の決定論的項、で構成されていることを示す。したがって、ノイズが存在する場合、遅延の影響は数回の反復後に無視できるようになり、アルゴリズムは標準SGDと同じ最適速度で収束します。この結果は、漸近的領域または強凸2次関数のみで同様の結果を示した一連の研究を拡張するものです。我々はさらに、より複雑な形式の遅延勾配（エラー補正下の圧縮勾配）を伴うSGD、および複数のワーカーが互いに通信する前にローカルステップを実行するローカルSGDについても同様の結果を示す。これらすべての設定で、既知の最高レートよりも改善されています。これらの結果は、SGDが圧縮および/または遅延された確率的勾配更新に対して堅牢であることを示しています。これは、複数のデバイスでの最適化の線形高速化を実現するために、非同期および通信効率の高い方法が鍵となる分散並列実装にとって特に重要です。
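The delayed update rule can be sketched on a toy problem: at step t the iterate is moved using a gradient evaluated at the stale point x_{t - delay}. The example below is deterministic (no gradient noise) and uses the quadratic f(x) = x^2 / 2, so it only illustrates the update itself, not the rates proved in the paper. The step size, delay, and function names are hypothetical.

```python
def delayed_gradient_descent(grad, x0, step, delay, n_steps):
    """Iterate x_{t+1} = x_t - step * grad(x_{t-delay}); early steps reuse x_0."""
    history = [x0]
    for t in range(n_steps):
        stale = history[max(t - delay, 0)]  # iterate from `delay` steps ago
        history.append(history[-1] - step * grad(stale))
    return history


def grad_quadratic(x):
    return x  # gradient of f(x) = x^2 / 2


no_delay = delayed_gradient_descent(grad_quadratic, 1.0, step=0.05, delay=0, n_steps=300)
with_delay = delayed_gradient_descent(grad_quadratic, 1.0, step=0.05, delay=5, n_steps=300)
print(abs(no_delay[-1]), abs(with_delay[-1]))
```

With this small step size both runs approach the minimizer at 0. A larger product of step size and delay can make the delayed recursion oscillate or diverge, which is why the step size has to be chosen with the delay in mind.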

Lower Bounds for Learning Distributions under Communication Constraints via Fisher Information
フィッシャー情報による通信制約下での学習分布の下限

We consider the problem of learning high-dimensional, nonparametric and structured (e.g., Gaussian) distributions in distributed networks, where each node in the network observes an independent sample from the underlying distribution and can use $k$ bits to communicate its sample to a central processor. We consider three different models for communication. Under the independent model, each node communicates its sample to a central processor by independently encoding it into $k$ bits. Under the more general sequential or blackboard communication models, nodes can share information interactively but each node is restricted to write at most $k$ bits on the final transcript. We characterize the impact of the communication constraint $k$ on the minimax risk of estimating the underlying distribution under $\ell^2$ loss. We develop minimax lower bounds that apply in a unified way to many common statistical models and reveal that the impact of the communication constraint can be qualitatively different depending on the tail behavior of the score function associated with each model. A key ingredient in our proofs is a geometric characterization of Fisher information from quantized samples.

私たちは、分散ネットワークにおける高次元、ノンパラメトリック、構造化（例えばガウス分布）分布の学習の問題を考察します。このネットワークでは、ネットワーク内の各ノードが基礎分布から独立したサンプルを観測し、そのサンプルを中央プロセッサに通信するために$k$ビットを使用できます。私たちは、通信に関して3つの異なるモデルを考察します。独立モデルでは、各ノードは、独立して$k$ビットにエンコードすることにより、そのサンプルを中央プロセッサに通信します。より一般的な順次通信モデルまたは黒板通信モデルでは、ノードは情報を対話的に共有できるが、各ノードは最終トランスクリプトに最大$k$ビットしか書き込めないという制限があります。私たちは、通信制約$k$が、$\ell^2$損失の下で基礎分布を推定するミニマックスリスクに与える影響を特徴付ける。私たちは、多くの一般的な統計モデルに統一的に適用されるミニマックス下限を開発し、通信制約の影響が、各モデルに関連付けられたスコア関数のテール動作に応じて質的に異なる可能性があることを明らかにします。我々の証明の重要な要素は、量子化されたサンプルからのフィッシャー情報の幾何学的特徴付けです。

Convex Programming for Estimation in Nonlinear Recurrent Models
非線形リカレントモデルにおける推定のための凸計画法

We propose a formulation for nonlinear recurrent models that includes simple parametric models of recurrent neural networks as a special case. The proposed formulation leads to a natural estimator in the form of a convex program. We provide a sample complexity for this estimator in the case of stable dynamics, where the nonlinear recursion has a certain contraction property, and under certain regularity conditions on the input distribution. We evaluate the performance of the estimator by simulation on synthetic data. These numerical experiments also suggest the extent to which the imposed theoretical assumptions may be relaxed.

私たちは、特殊なケースとして、リカレントニューラルネットワークの単純なパラメトリックモデルを含む非線形リカレントモデルの定式化を提案します。提案された定式化は、凸計画問題の形をとる自然な推定量を導きます。非線形再帰が特定の収縮特性を持つ安定したダイナミクスの場合に、入力分布に関する特定の正則性条件の下で、この推定量のサンプル複雑度を示します。推定量の性能は、合成データに対するシミュレーションによって評価します。これらの数値実験は、課せられた理論的仮定をどの程度緩和できるかも示唆しています。

Dual Extrapolation for Sparse GLMs
スパースGLMの双対外挿

Generalized Linear Models (GLM) form a wide class of regression and classification models, where prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity inducing regularizations have proven to be useful while offering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets diminish the size of the optimization problem at hand, either by progressively removing variables, or by solving a growing sequence of smaller problems. For both techniques, significant variables are identified thanks to convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit a Vector AutoRegressive (VAR) behavior after sign identification, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. Exploiting this regularity, one can construct dual points that offer tighter certificates of optimality, enhancing the performance of screening rules and working set algorithms.

一般化線形モデル(GLM)は、回帰モデルと分類モデルの幅広いクラスを形成し、予測は入力変数の線形結合の関数です。高次元の統計的推論では、スパース性を誘導する正則化が統計的保証を提供しながら有用であることが証明されています。ただし、結果として生じる最適化問題を解決することは困難な場合があります。座標降下法などの一般的な反復アルゴリズムであっても、多数の変数をループする必要があります。これを緩和するために、スクリーニングルールとワーキングセットと呼ばれる手法では、変数を段階的に削除するか、小さな問題を段階的に解決することで、手元の最適化問題のサイズを縮小します。両方の手法では、凸双対性引数により重要な変数が識別されます。この論文では、主問題が近似勾配降下法または巡回座標降下法で解決される場合、GLMの双対反復が符号識別後にベクトル自己回帰(VAR)動作を示すことを示します。この規則性を利用すると、より厳密な最適性の証明書を提供するデュアルポイントを構築でき、スクリーニングルールとワーキングセットアルゴリズムのパフォーマンスが向上します。

Robust high dimensional learning for Lipschitz and convex losses
リプシッツ損失と凸損失に対するロバストな高次元学習

We establish risk bounds for Regularized Empirical Risk Minimizers (RERM) when the loss is Lipschitz and convex and the regularization function is a norm. In the first part, we obtain these results in the i.i.d. setup under subgaussian assumptions on the design. In the second part, we consider a more general framework where the design might have heavier tails and data may be corrupted by outliers both in the design and the response variables. In this situation, RERM performs poorly in general. We analyse an alternative procedure based on median-of-means principles, called “minmax MOM”. We show optimal subgaussian deviation rates for these estimators in the relaxed setting. The main results are meta-theorems allowing a wide range of applications to various problems in learning theory. To show a non-exhaustive sample of these potential applications, we apply them to classification problems with logistic loss functions regularized by LASSO and SLOPE, and to regression problems with Huber loss regularized by Group LASSO and Total Variation. Another advantage of the minmax MOM formulation is that it suggests a systematic way to slightly modify descent-based algorithms used in high-dimensional statistics to make them robust to outliers. We illustrate this principle in a Simulations section where “minmax MOM” versions of classical proximal descent algorithms turn them into algorithms that are robust to outliers.

私たちは、損失がリプシッツかつ凸で、正則化関数がノルムである場合に、正則化経験的リスク最小化器(RERM)のリスク境界を確立します。最初の部分では、設計に対するサブガウス仮定の下でi.i.d.設定でこれらの結果を取得します。2番目の部分では、設計の裾がより重くなり、設計と応答変数の両方で外れ値によってデータが破損する可能性がある、より一般的なフレームワークを検討します。この状況では、RERMは一般にパフォーマンスが低下します。私たちは、平均の中央値の原則に基づく「minmax MOM」と呼ばれる代替手順を分析します。私たちは、緩和された設定でこれらの推定量の最適なサブガウス偏差率を示します。主な結果は、学習理論のさまざまな問題に幅広く応用できるメタ定理です。これらの潜在的な応用例の一部を示すために、LASSOとSLOPEによって正規化されたロジスティック損失関数による分類問題、およびGroup LASSOとTotal Variationによって正規化されたHuber損失による回帰問題に適用します。minmax MOM定式化のもう1つの利点は、高次元統計で使用される降下ベースのアルゴリズムをわずかに変更して外れ値に対して堅牢にする体系的な方法を提案することです。この原理は、古典的な近似降下アルゴリズムの「minmax MOM」バージョンを外れ値に対して堅牢なアルゴリズムに変換するシミュレーションセクションで説明します。
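The median-of-means principle that the procedure builds on can be shown in its simplest form, as an estimator of a mean: split the sample into blocks, average each block, and take the median of the block averages. This sketch is not the minmax MOM estimator analysed in the paper. The consecutive block split, the function name, and the toy data are assumptions for illustration.

```python
from statistics import mean, median


def median_of_means(data, n_blocks):
    """Split data into n_blocks consecutive blocks and return the median of block means.

    Points left over when len(data) is not divisible by n_blocks are dropped.
    """
    size = len(data) // n_blocks
    blocks = [data[k * size:(k + 1) * size] for k in range(n_blocks)]
    return median(mean(block) for block in blocks)


# Made-up sample with a single gross outlier.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]
print(mean(sample), median_of_means(sample, n_blocks=3))
```

The outlier corrupts only one block mean, so the median over blocks ignores it, while the plain mean is pulled far away from the bulk of the data.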

Spectral Deconfounding via Perturbed Sparse Linear Models
摂動スパース線形モデルによるスペクトル交絡解除

Standard high-dimensional regression methods assume that the underlying coefficient vector is sparse. This might not be true in some cases, in particular in the presence of hidden, confounding variables. Such hidden confounding can be represented as a high-dimensional linear model where the sparse coefficient vector is perturbed. For this model, we develop and investigate a class of methods that are based on running the Lasso on preprocessed data. The preprocessing step consists of applying certain spectral transformations that change the singular values of the design matrix. We show that, under some assumptions, one can achieve the usual Lasso $\ell_1$-error rate for estimating the underlying sparse coefficient vector, despite the presence of confounding. Our theory also covers the Lava estimator (Chernozhukov et al., 2017) for a special model class. The performance of the methodology is illustrated on simulated data and a genomic dataset.

標準的な高次元回帰手法では、基になる係数ベクトルがスパースであると仮定します。これは、特に隠れた交絡変数が存在する場合など、当てはまらない場合があります。このような隠れた交絡は、スパース係数ベクトルが摂動される高次元線形モデルとして表すことができます。このモデルに対して、前処理されたデータにLassoを適用することに基づく手法のクラスを開発および調査します。前処理ステップでは、計画行列の特異値を変更する特定のスペクトル変換を適用します。いくつかの仮定の下で、交絡が存在するにもかかわらず、基礎となるスパース係数ベクトルの推定において通常のLassoの$\ell_1$誤差率を達成できることを示します。私たちの理論は、特別なモデルクラスに対するLava推定量(Chernozhukovら, 2017)もカバーしています。この方法論の性能は、シミュレーションデータとゲノムデータセットで示されています。

Fast Exact Matrix Completion: A Unified Optimization Framework for Matrix Completion
高速厳密行列補完:行列補完のための統一最適化フレームワーク

We formulate the problem of matrix completion with and without side information as a non-convex optimization problem. We design fastImpute based on non-convex gradient descent and show it converges to a global minimum that is guaranteed to recover closely the underlying matrix while it scales to matrices of sizes beyond $10^5 \times 10^5$. We report experiments on both synthetic and real-world datasets that show fastImpute is competitive in both the accuracy of the matrix recovered and the time needed across all cases. Furthermore, when a high number of entries are missing, fastImpute is over $75\%$ lower in MAPE and $15$ times faster than current state-of-the-art matrix completion methods in both the case with side information and without.

私たちは、サイド情報ありとなしの行列補完問題を非凸最適化問題として定式化します。非凸勾配降下法に基づいてfastImputeを設計し、$10^5 \times 10^5$を超えるサイズの行列にスケーリングしながら、基になる行列を精度よく復元することが保証されている大域的最小値に収束することを示します。合成データセットと実世界のデータセットの両方で、fastImputeが復元された行列の精度と必要な時間の両方においてすべてのケースで競争力があることを示す実験を報告しています。さらに、多数のエントリが欠落している場合、fastImputeは、サイド情報がある場合とない場合の両方で、現在の最先端の行列補完方法よりもMAPEが$75\%$以上低く、$15$倍高速です。

Stable Regression: On the Power of Optimization over Randomization
安定回帰:ランダム化に対する最適化の力について

We investigate and ultimately suggest remediation to the widely held belief that the best way to train regression models is via random assignment of our data to training and validation sets. In particular, we show that taking a robust optimization approach, and optimally selecting such training and validation sets, leads to models that not only perform significantly better than their randomly constructed counterparts in terms of prediction error, but more importantly, are considerably more stable in the sense that the standard deviation of the resulting predictions, as well as of the model coefficients, is greatly reduced. Moreover, we show that this optimization approach to training is far more effective at recovering the true support of a given data set, i.e., correctly identifying important features while simultaneously excluding spurious ones. We further compare the robust optimization approach to cross validation and find that optimization continues to have a performance edge albeit smaller. Finally, we show that this optimization approach to training is equivalent to building models that are robust to all subpopulations in the data, and thus in particular are robust to the hardest subpopulation, which leads to interesting domain specific interpretations through the use of optimal classification trees. The proposed robust optimization algorithm is efficient and scales training to essentially any desired size.

私たちは、回帰モデルをトレーニングする最良の方法は、データをトレーニングセットと検証セットにランダムに割り当てることであるという広く信じられている考えを調査し、最終的にその改善策を提案します。特に、堅牢な最適化アプローチを採用し、そのようなトレーニングセットと検証セットを最適に選択すると、予測誤差の点でランダムに構築されたモデルよりも大幅に優れたパフォーマンスを発揮するだけでなく、より重要なことに、結果として得られる予測の標準偏差とモデル係数が大幅に減少するという意味で、かなり安定したモデルが得られることを示します。さらに、このトレーニングの最適化アプローチは、特定のデータセットの真のサポートを回復する、つまり、重要な特徴を正しく識別すると同時に誤った特徴を排除する点で、はるかに効果的であることを示します。さらに、堅牢な最適化アプローチをクロス検証と比較すると、最適化の方が、小さいながらもパフォーマンスの優位性を維持していることがわかります。最後に、このトレーニングの最適化アプローチは、データ内のすべてのサブポピュレーションに対して堅牢なモデルを構築することと同等であり、特に最も困難なサブポピュレーションに対して堅牢であり、最適な分類ツリーを使用することで興味深いドメイン固有の解釈につながることを示します。提案された堅牢な最適化アルゴリズムは効率的であり、トレーニングを基本的に任意のサイズに拡張できます。

Nonparametric graphical model for counts
カウントのノンパラメトリックグラフィカルモデル

Although multivariate count data are routinely collected in many application areas, there is surprisingly little work developing flexible models for characterizing their dependence structure. This is particularly true when interest focuses on inferring the conditional independence graph. In this article, we propose a new class of pairwise Markov random field-type models for the joint distribution of a multivariate count vector. By employing a novel type of transformation, we avoid restricting to non-negative dependence structures or inducing other restrictions through truncations. Taking a Bayesian approach to inference, we choose a Dirichlet process prior for the distribution of a random effect to induce great flexibility in the specification. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for posterior computation. We prove various theoretical properties, including posterior consistency, and show that our COunt Nonparametric Graphical Analysis (CONGA) approach has good performance relative to competitors in simulation studies. The methods are motivated by an application to neuron spike count data in mice.

多変量カウントデータは多くの応用分野で日常的に収集されていますが、その依存構造を特徴付ける柔軟なモデルを開発する研究は驚くほど少ないです。これは、条件付き独立グラフの推論に関心が集中している場合に特に当てはまります。この記事では、多変量カウントベクトルの結合分布に対する新しいクラスのペアワイズマルコフランダムフィールド型モデルを提案します。新しいタイプの変換を使用することで、非負の依存構造に制限したり、切り捨てによって他の制限を誘発したりすることを回避します。ベイジアンアプローチを推論に採用し、ランダム効果の分布にディリクレ過程事前分布を選択して、仕様に大きな柔軟性をもたらします。効率的なマルコフ連鎖モンテカルロ(MCMC)アルゴリズムを事後計算用に開発しました。事後一貫性を含むさまざまな理論的特性を証明し、シミュレーション研究において、当社のカウントノンパラメトリックグラフィカル分析(CONGA)アプローチが競合製品と比較して優れたパフォーマンスを発揮することを示します。この方法は、マウスのニューロンスパイクカウントデータへの応用をきっかけに生まれました。

Posterior sampling strategies based on discretized stochastic differential equations for machine learning applications
機械学習アプリケーションのための離散化確率微分方程式に基づく事後サンプリング戦略

With the advent of GPU-assisted hardware and maturing high-efficiency software platforms such as TensorFlow and PyTorch, Bayesian posterior sampling for neural networks becomes plausible. In this article we discuss Bayesian parametrization in machine learning based on Markov Chain Monte Carlo methods, specifically discretized stochastic differential equations such as Langevin dynamics and extended system methods in which an ensemble of walkers is employed to enhance sampling. We provide a glimpse of the potential of the sampling-intensive approach by studying (and visualizing) the loss landscape of a neural network applied to the MNIST data set. Moreover, we investigate how the sampling efficiency itself can be significantly enhanced through an ensemble quasi-Newton preconditioning method. This article accompanies the release of a new TensorFlow software package, the Thermodynamic Analytics ToolkIt, which is used in the computational experiments.

GPU支援ハードウェアの出現と、TensorFlowやPyTorchなどの高効率ソフトウェアプラットフォームの成熟により、ニューラルネットワークのベイジアン事後サンプリングが現実味を帯びてきました。この記事では、マルコフ連鎖モンテカルロ法に基づく機械学習のベイズパラメータ化、特にランジュバンダイナミクスなどの離散化された確率微分方程式と、ウォーカーのアンサンブルを使用してサンプリングを強化する拡張システム法について説明します。MNISTデータセットに適用されたニューラルネットワークの損失ランドスケープを研究(および視覚化)することにより、サンプリング集約型アプローチの可能性を垣間見ることができます。さらに、アンサンブル準ニュートン前処理法により、サンプリング効率自体をいかにして大幅に向上させることができるかを検討しています。この記事は、計算実験で使用される新しいTensorFlowソフトウェアパッケージであるThermodynamic Analytics ToolkItのリリースに伴うものです。
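
As a rough illustration of what "discretized stochastic differential equations such as Langevin dynamics" means in practice, the sketch below applies the Euler-Maruyama discretization of overdamped Langevin dynamics to a one-dimensional toy target. It is not taken from the paper or from the Thermodynamic Analytics ToolkIt; the function names, the Gaussian target N(2, 1), and the step size are assumptions chosen for illustration.

```python
import math
import random


def langevin_samples(grad_potential, x0, step, n_steps, rng):
    """Unadjusted Langevin iteration:
    x_{k+1} = x_k - step * grad U(x_k) + sqrt(2 * step) * xi_k,  xi_k ~ N(0, 1),
    which approximately samples from the density proportional to exp(-U(x))."""
    x = x0
    samples = []
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        x = x - step * grad_potential(x) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def grad_potential(x):
    # Toy potential U(x) = (x - 2)^2 / 2, i.e. a Gaussian target N(2, 1).
    return x - 2.0


rng = random.Random(0)
chain = langevin_samples(grad_potential, x0=0.0, step=0.1, n_steps=50000, rng=rng)
kept = chain[1000:]  # discard an initial burn-in segment
sample_mean = sum(kept) / len(kept)
sample_var = sum((x - sample_mean) ** 2 for x in kept) / len(kept)
```

Because no accept/reject correction is applied, the finite step size introduces a small bias (here the stationary variance is slightly above 1); for a neural network the scalar `x` would be replaced by the parameter vector and `grad_potential` by the gradient of the negative log-posterior.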

Significance Tests for Neural Networks
ニューラルネットワークの有意性検定

We develop a pivotal test to assess the statistical significance of the feature variables in a single-layer feed-forward neural network regression model. We propose a gradient-based test statistic and study its asymptotics using non-parametric techniques. Under technical conditions, the limiting distribution is given by a mixture of chi-square distributions. The tests enable one to discern the impact of individual variables on the prediction of a neural network. The test statistic can be used to rank variables according to their influence. Simulation results illustrate the computational efficiency and the performance of the test. An empirical application to house price valuation highlights the behavior of the test using actual data.

私たちは、単層フィードフォワードニューラルネットワーク回帰モデルの特徴変数の統計的有意性を評価するためのピボタル検定を開発します。勾配ベースの検定統計量を提案し、ノンパラメトリック手法を使用してその漸近的性質を研究します。技術的な条件下では、極限分布はカイ二乗分布の混合によって与えられます。この検定により、ニューラルネットワークの予測に対する個々の変数の影響を識別できます。検定統計量を使用して、その影響度に応じて変数をランク付けできます。シミュレーション結果は、検定の計算効率と性能を示しています。住宅価格評価への実証的応用は、実際のデータを使用した検定の挙動を明らかにしています。

A Sparse Semismooth Newton Based Proximal Majorization-Minimization Algorithm for Nonconvex Square-Root-Loss Regression Problems
非凸平方根損失回帰問題に対するスパース半平滑ニュートン法に基づく近接MM(Majorization-Minimization)アルゴリズム

In this paper, we consider high-dimensional nonconvex square-root-loss regression problems and introduce a proximal majorization-minimization (PMM) algorithm for solving these problems. Our key idea for making the proposed PMM efficient is to develop a sparse semismooth Newton method to solve the corresponding subproblems. By using the Kurdyka-Łojasiewicz property exhibited in the underlying problems, we prove that the PMM algorithm converges to a d-stationary point. We also analyze the oracle property of the initial subproblem used in our algorithm. Extensive numerical experiments are presented to demonstrate the high efficiency of the proposed PMM algorithm.

この論文では、高次元の非凸平方根損失回帰問題を検討し、これらの問題を解くための近接MM(proximal majorization-minimization, PMM)アルゴリズムを紹介します。提案されたPMMを効率的にするための重要なアイデアは、対応する部分問題を解くためのスパース半平滑ニュートン法を開発することです。対象とする問題が持つKurdyka-Łojasiewicz性を使用して、PMMアルゴリズムがd-停留点に収束することを証明します。また、アルゴリズムで使用される初期部分問題のオラクル性を分析します。提案されたPMMアルゴリズムの高い効率を実証するために、広範な数値実験を提示します。

Recovery of a Mixture of Gaussians by Sum-of-Norms Clustering
ノルム和クラスタリングによるガウス分布の混合分布の復元

Sum-of-norms clustering is a method for assigning $n$ points in $\mathbf{R}^d$ to $K$ clusters, $1\le K\le n$, using convex optimization. Recently, Panahi (2017) proved that sum-of-norms clustering is guaranteed to recover a mixture of Gaussians under the restriction that the number of samples is not too large. The purpose of this note is to lift this restriction, that is, show that sum-of-norms clustering can recover a mixture of Gaussians even as the number of samples tends to infinity. Our proof relies on an interesting characterization of clusters computed by sum-of-norms clustering that was developed inside a proof of the agglomeration conjecture by Chiquet et al. (2017). Because we believe this theorem has independent interest, we restate and reprove the Chiquet et al. (2017) result herein.

Sum-of-norms(ノルム和)クラスタリングは、凸最適化を使用して、$\mathbf{R}^d$の$n$個の点を$K$個($1\le K\le n$)のクラスターに割り当てる方法です。最近、Panahi(2017)は、サンプル数が多すぎないという制限の下で、ノルム和クラスタリングがガウス混合分布を復元することが保証されることを証明しました。このノートの目的は、この制限を取り除くこと、つまり、サンプル数が無限大に近づく場合でも、ノルム和クラスタリングがガウス混合分布を復元できることを示すことです。私たちの証明は、Chiquetら(2017)による凝集予想の証明の中で開発された、ノルム和クラスタリングによって計算されるクラスターの興味深い特徴付けに依存しています。この定理はそれ自体として興味深いと考えるため、ここでChiquetら(2017)の結果を改めて述べ、再証明します。
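
For reference, sum-of-norms clustering is usually written as the convex program below. This is the commonly cited formulation rather than a quotation from the paper, and the symbols $a_i$ (the data points), $x_i$ (one centroid variable per point), and $\lambda \ge 0$ (the fusion penalty) are notation introduced here.

$$
\min_{x_1,\ldots,x_n \in \mathbf{R}^d} \; \frac{1}{2}\sum_{i=1}^{n} \|x_i - a_i\|_2^2 \;+\; \lambda \sum_{1 \le i < j \le n} \|x_i - x_j\|_2
$$

Points $i$ and $j$ are placed in the same cluster when $x_i = x_j$ at the optimum, so increasing $\lambda$ fuses more centroids and yields fewer clusters.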

Ultra-High Dimensional Single-Index Quantile Regression
超高次元単一インデックス分位点回帰

We consider a flexible semiparametric single-index quantile regression model where the number of covariates may be ultra-high dimensional, and the number of the relevant covariates is potentially diverging. The approach is particularly appealing to uncover the complex heterogeneity in high-dimensional data, incorporate nonlinearity and potential interaction, avoid the curse of dimensionality, and allow different variables to be included at different quantile levels. We estimate the unknown function via polynomial splines nonparametrically and adopt a nonconvex penalty function to identify the sparse variable set. We further extend it to partially linear single-index quantile model where both the single-index components in the nonparametric term and the partially linear components can be in ultra-high dimension. However, a number of major challenges arise in developing both theory and computation: (a) The model is highly nonlinear in single-index coefficients because the high-dimensional single-index covariates are embedded inside the unknown flexible function. (b) The data are ultra-high dimensional where the dimension of the single-index covariates ($p_n$) is diverging or even in the exponential order of sample size $n$. (c) The objective function is non-smooth for quantile regression. (d) Nonconvex variable selection such as SCAD is adopted for regularization. (e) The extended partially linear single-index quantile models may include both ultra-high dimensional ($p_n$) single-index covariates and ultra-high dimensional ($q_n$) partially linear covariates. We develop a novel approach using empirical process techniques in establishing the theoretical properties of the nonconvex penalized estimators for partially linear single-index quantile models and show those estimators indeed possess the oracle property in ultra-high dimensional setting. We propose an efficient algorithm to circumvent the computational challenges. The results of Monte Carlo simulations and an application to gene expression data demonstrate the effectiveness of the proposed models and estimation method.

私たちは、共変量の数が超高次元になる可能性があり、関連する共変量の数が潜在的に発散する可能性がある柔軟なセミパラメトリック単一インデックス分位回帰モデルを検討します。このアプローチは、高次元データの複雑な異質性を明らかにし、非線形性と潜在的な相互作用を組み込み、次元の呪いを回避し、異なる変数を異なる分位レベルに含めることを可能にするために特に魅力的です。私たちは、多項式スプラインを介してノンパラメトリックに未知の関数を推定し、非凸ペナルティ関数を採用してスパース変数セットを識別します。我々はさらに、ノンパラメトリック項の単一インデックス成分と部分的に線形な成分の両方が超高次元になる可能性がある部分的に線形の単一インデックス分位モデルにそれを拡張します。しかし、理論と計算の両方を開発する際には、いくつかの大きな課題が生じます。(a)高次元の単一インデックス共変量が未知の柔軟な関数内に埋め込まれているため、モデルは単一インデックス係数で高度に非線形です。(b)データは超高次元であり、単一インデックス共変量($p_n$)の次元は発散しているか、サンプルサイズ$n$の指数オーダーですらあります。(c)目的関数は、分位回帰に対して非滑らかです。(d)正則化には、SCADなどの非凸変数選択が採用されています。(e)拡張された部分線形単一インデックス分位モデルには、超高次元($p_n$)単一インデックス共変量と超高次元($q_n$)部分線形共変量の両方が含まれる場合があります。部分線形単一インデックス分位モデルに対する非凸ペナルティ付き推定量の理論的特性を確立するために、経験的プロセス手法を使用する新しいアプローチを開発し、それらの推定量が超高次元設定で実際にオラクル特性を備えていることを示します。計算上の課題を回避するための効率的なアルゴリズムを提案します。モンテカルロシミュレーションの結果と遺伝子発現データへの適用により、提案されたモデルと推定方法の有効性が実証されています。

Geomstats: A Python Package for Riemannian Geometry in Machine Learning
Geomstats: 機械学習におけるリーマン幾何学の Python パッケージ

We introduce Geomstats, an open-source Python package for computations and statistics on nonlinear manifolds such as hyperbolic spaces, spaces of symmetric positive definite matrices, Lie groups of transformations, and many more. We provide object-oriented and extensively unit-tested implementations. Manifolds come equipped with families of Riemannian metrics with associated exponential and logarithmic maps, geodesics, and parallel transport. Statistics and learning algorithms provide methods for estimation, clustering, and dimension reduction on manifolds. All associated operations are vectorized for batch computation and provide support for different execution backends—namely NumPy, PyTorch, and TensorFlow. This paper presents the package, compares it with related libraries, and provides relevant code examples. We show that Geomstats provides reliable building blocks to both foster research in differential geometry and statistics and democratize the use of Riemannian geometry in machine learning applications. The source code is freely available under the MIT license at geomstats.ai.

私たちは、双曲空間、対称正定値行列の空間、変換のリー群など、非線形多様体に関する計算と統計のためのオープンソースのPythonパッケージ、Geomstatsを紹介します。オブジェクト指向で、広範囲にユニットテストされた実装を提供します。多様体には、指数マップ、対数マップ、測地線、並列転送に関連するリーマン計量のファミリーが装備されています。統計と学習アルゴリズムは、多様体での推定、クラスタリング、次元削減の方法を提供します。関連するすべての操作は、バッチ計算用にベクトル化されており、NumPy、PyTorch、TensorFlowなどのさまざまな実行バックエンドをサポートしています。この論文では、このパッケージを紹介し、関連ライブラリと比較し、関連するコード例を示します。Geomstatsは、微分幾何学と統計の研究を促進し、機械学習アプリケーションでのリーマン幾何学の使用を民主化するための信頼性の高いビルディングブロックを提供することを示します。ソースコードは、MITライセンスの下でgeomstats.aiから無料で入手できます。

Theory of Curriculum Learning, with Convex Loss Functions
凸損失関数を用いたカリキュラム学習の理論

Curriculum Learning is motivated by human cognition, where teaching often involves gradually exposing the learner to examples in a meaningful order, from easy to hard. Although methods based on this concept have been empirically shown to improve performance of several machine learning algorithms, no theoretical analysis has been provided even for simple cases. To address this shortfall, we start by formulating an ideal definition of difficulty score – the loss of the optimal hypothesis at a given datapoint. We analyze the possible contribution of curriculum learning based on this score in two convex problems – linear regression, and binary classification by hinge loss minimization. We show that in both cases, the convergence rate of SGD optimization decreases monotonically with the difficulty score, in accordance with earlier empirical results. We also prove that when the difficulty score is fixed, the convergence rate of SGD optimization is monotonically increasing with respect to the loss of the current hypothesis at each point. We discuss how these results settle some confusion in the literature where two apparently opposing heuristics are reported to improve performance: curriculum learning in which easier points are given priority, vs hard data mining where the more difficult points are sought out.

カリキュラム学習は人間の認知によって動機付けられており、教える際には多くの場合、学習者に簡単なものから難しいものへと意味のある順序で徐々に例を見せていくことが含まれます。この概念に基づく方法は、いくつかの機械学習アルゴリズムのパフォーマンスを向上させることが経験的に示されていますが、単純なケースであっても理論的分析は提供されていません。この不足に対処するために、難易度スコア（特定のデータポイントでの最適仮説の損失）の理想的な定義を定式化することから始めます。このスコアに基づくカリキュラム学習の可能性のある貢献を、線形回帰とヒンジ損失最小化によるバイナリ分類という2つの凸問題で分析します。両方のケースで、SGD最適化の収束率は、以前の経験的結果に従って、難易度スコアとともに単調に減少することを示します。また、難易度スコアが固定されている場合、SGD最適化の収束率は、各ポイントでの現在の仮説の損失に対して単調に増加することを証明します。パフォーマンスを向上させるために、より簡単なポイントを優先するカリキュラム学習と、より難しいポイントを探し出すハードデータマイニングという、一見相反する2つのヒューリスティックが報告されており、これらの結果が文献における混乱をどのように解決するかについて説明します。
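
The difficulty score defined in the abstract (the loss of the optimal hypothesis at a given datapoint) can be sketched for linear regression as follows. The optimal weight vector is unknown in practice, so `w_opt` and the toy data here are hypothetical values used only to show how the score orders examples from easy to hard.

```python
def squared_loss(w, x, y):
    """Squared error of the linear hypothesis w on the single example (x, y)."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    return (prediction - y) ** 2


def order_by_difficulty(data, w_opt):
    """Sort examples from easy to hard, where difficulty is the loss that the
    (assumed known) optimal hypothesis w_opt incurs on each example."""
    return sorted(data, key=lambda example: squared_loss(w_opt, example[0], example[1]))


# Hypothetical optimal hypothesis and toy dataset of (features, target) pairs.
w_opt = [2.0, -1.0]
data = [
    ([1.0, 0.0], 2.5),   # prediction 2.0  -> difficulty 0.25
    ([0.0, 1.0], -1.0),  # prediction -1.0 -> difficulty 0.0
    ([1.0, 1.0], 3.0),   # prediction 1.0  -> difficulty 4.0
]

curriculum = order_by_difficulty(data, w_opt)
curriculum_targets = [y for _, y in curriculum]
```

A curriculum sampler would then favour the front of `curriculum`, while hard data mining would instead rank points by the loss of the current hypothesis, which is the other quantity analysed in the abstract.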

Learning Sums of Independent Random Variables with Sparse Collective Support
スパース集団サポートによる独立確率変数の和の学習

We study the learnability of sums of independent integer random variables given a bound on the size of the union of their supports. For $\mathcal{A} \subset \mathbb{Z}_{+}$, a sum of independent random variables with collective support $\mathcal{A}$ (called an $\mathcal{A}$-sum in this paper) is a distribution $\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_N$ where the $\mathbf{X}_i$’s are mutually independent (but not necessarily identically distributed) integer random variables with $\cup_i \mathrm{supp}(\mathbf{X}_i) \subseteq \mathcal{A}$. We give two main algorithmic results for learning such distributions. First, for the case $| \mathcal{A} | = 3$, we give an algorithm for learning an unknown $\mathcal{A}$-sum to accuracy $\epsilon$ using $\mathrm{poly}(1/\epsilon)$ samples and running in time $\mathrm{poly}(1/\epsilon)$, independent of $N$ and of the elements of $\mathcal{A}$. Second, for an arbitrary constant $k \geq 4$, if $\mathcal{A} = \{ a_1,\ldots,a_k\}$ with $0 \leq a_1 < \ldots < a_k$, we give an algorithm that uses $\mathrm{poly}(1/\epsilon) \cdot \log \log a_k$ samples (independent of $N$) and runs in time $\mathrm{poly}(1/\epsilon, \log a_k)$. We prove an essentially matching lower bound: if $|\mathcal{A}| = 4$, then any algorithm must use $\Omega(\log \log a_4)$ samples even for learning to constant accuracy. We also give similar-in-spirit (but quantitatively very different) algorithmic results, and essentially matching lower bounds, for the case in which $\mathcal{A}$ is not known to the learner. Our algorithms and lower bounds together settle the question of how the sample complexity of learning sums of independent integer random variables scales with the elements in the union of their supports, both in the known-support and unknown-support settings. Finally, all our algorithms easily extend to the “semi-agnostic” learning model, in which training data is generated from a distribution that is only $c \epsilon$-close to some $\mathcal{A}$-sum for a constant $c>0$.

私たちは、サポートの和集合のサイズに上限がある、独立した整数ランダム変数の和の学習可能性について調べます。$\mathcal{A} \subset \mathbb{Z}_{+}$に対して、集合サポート$\mathcal{A}$を持つ独立したランダム変数の和(この論文では$\mathcal{A}$和と呼びます)は分布$\mathbf{S} = \mathbf{X}_1 + \cdots + \mathbf{X}_N$です。ここで、$\mathbf{X}_i$は、$\cup_i \mathrm{supp}(\mathbf{X}_i) \subseteq \mathcal{A}$を満たす、相互に独立した(ただし必ずしも同一に分布するとは限らない)整数ランダム変数です。このような分布を学習するための主なアルゴリズムの結果が2つあります。まず、$| \mathcal{A} | = 3$の場合、未知の$\mathcal{A}$和を$\mathrm{poly}(1/\epsilon)$サンプルを使用して精度$\epsilon$で学習し、$N$および$\mathcal{A}$の要素とは無関係に、時間$\mathrm{poly}(1/\epsilon)$で実行するアルゴリズムを示します。次に、任意の定数$k \geq 4$に対して、$\mathcal{A} = \{a_1,\ldots,a_k\}$で$0 \leq a_1 < \ldots < a_k$の場合、$\mathrm{poly}(1/\epsilon) \cdot \log \log a_k$サンプル($N$とは無関係)を使用し、時間$\mathrm{poly}(1/\epsilon, \log a_k)$で実行するアルゴリズムを示します。本質的に一致する下限を証明します。$|\mathcal{A}| = 4$の場合、一定の精度で学習する場合であっても、どのアルゴリズムでも$\Omega(\log \log a_4)$サンプルを使用する必要があります。また、学習者が$\mathcal{A}$を知らない場合についても、考え方としては同様の(ただし量的には非常に異なる)アルゴリズム結果と、本質的に一致する下限を示します。アルゴリズムと下限を組み合わせることで、既知のサポートと未知のサポートの両方の設定において、独立した整数ランダム変数の和を学習する際のサンプル複雑性が、それらのサポートの和集合内の要素によってどのように変化するかという問題を解決します。最後に、すべてのアルゴリズムは、定数$c>0$に対して、ある$\mathcal{A}$和に$c \epsilon$だけ近い分布からトレーニングデータが生成される「半不可知論的」学習モデルに簡単に拡張できます。

Diffeomorphic Learning
微分同相学習

We introduce in this paper a learning paradigm in which training data is transformed by a diffeomorphic transformation before prediction. The learning algorithm minimizes a cost function evaluating the prediction error on the training set penalized by the distance between the diffeomorphism and the identity. The approach borrows ideas from shape analysis, where diffeomorphisms are estimated for shape and image alignment, and brings them to a previously unexplored setting, estimating, in particular, diffeomorphisms in much higher dimensions. After introducing the concept and describing a learning algorithm, we present diverse applications, mostly with synthetic examples, demonstrating the potential of the approach, as well as some insight on how it can be improved.

私たちは、この論文で、予測の前に学習データを微分同相変換によって変換する学習パラダイムを紹介します。学習アルゴリズムは、微分同相写像と恒等写像の間の距離によってペナルティを課した、学習セット上の予測誤差を評価するコスト関数を最小化します。このアプローチは、形状や画像の位置合わせのために微分同相写像を推定する形状解析からアイデアを借用し、それらをこれまで探索されていなかった設定に持ち込み、特にはるかに高い次元の微分同相写像を推定します。概念を紹介し、学習アルゴリズムについて説明した後、主に合成例を用いたさまざまな応用を示し、このアプローチの可能性と、それをどのように改善できるかについての洞察を示します。

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes
AdaGrad ステップサイズ: 非凸地形上のシャープな収束

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, nonconvex functions. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the $\mathcal{O}(\log(N)/\sqrt{N})$ rate in the stochastic setting, and at the optimal $\mathcal{O}(1/N)$ rate in the batch (non-stochastic) setting — in this sense, our convergence guarantees are “sharp”. In particular, the convergence of AdaGrad-Norm is robust to the choice of all hyper-parameters of the algorithm, in contrast to stochastic gradient descent whose convergence depends crucially on tuning the step-size to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient. Extensive numerical experiments are provided to corroborate our theoretical findings; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to the models in deep learning.

AdaGradやその変種などの適応勾配法は、途中で受け取った勾配に応じて確率的勾配降下法のステップサイズをその場で更新します。このような方法は、ステップサイズスケジュールを微調整する必要なく、堅牢に収束できるため、大規模最適化で広く使用されています。しかし、AdaGradのこれまでの理論的保証は、オンラインおよび凸最適化に関するものです。私たちは、滑らかな非凸関数に対するAdaGradの収束に対する理論的保証を提供することで、このギャップを埋めます。私たちは、AdaGradのノルムバージョン(AdaGrad-Norm)が、確率的設定では$\mathcal{O}(\log(N)/\sqrt{N})$の速度で定常点に収束し、バッチ(非確率的)設定では最適な$\mathcal{O}(1/N)$の速度で収束することを示します。この意味で、私たちの収束保証は「鋭い」ものです。特に、AdaGrad-Normの収束は、アルゴリズムのすべてのハイパーパラメータの選択に対して堅牢です。これは、収束がステップサイズを(一般には未知である)リプシッツ平滑定数と勾配の確率的ノイズのレベルに合わせて調整することに大きく依存する確率的勾配降下法とは対照的です。私たちの理論的発見を裏付けるために、広範な数値実験が行われています。さらに、実験は、AdaGrad-Normの堅牢性がディープラーニングのモデルにまで及ぶことを示唆しています。
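
A minimal sketch of the norm version of AdaGrad may help make the abstract concrete. It assumes the usual form of the update, in which a single scalar accumulates squared gradient norms and divides the stepsize; the function names, the default values of `eta` and `b0`, and the one-dimensional nonconvex test function are illustrative assumptions, not the paper's experimental setup.

```python
import math


def adagrad_norm(grad, x0, eta=1.0, b0=0.1, n_steps=200):
    """AdaGrad-Norm sketch with one scalar stepsize for all coordinates:
        b_{j+1}^2 = b_j^2 + ||g_j||^2
        x_{j+1}   = x_j - (eta / b_{j+1}) * g_j
    """
    x = list(x0)
    b_sq = b0 ** 2  # b0 > 0 keeps the first division well defined
    for _ in range(n_steps):
        g = grad(x)
        b_sq += sum(gi * gi for gi in g)
        step = eta / math.sqrt(b_sq)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def grad_f(x):
    # Gradient of the smooth nonconvex toy function
    # f(x) = sum_i (x_i^2 + 3 * sin(x_i)^2), whose only stationary point is 0.
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]


x_final = adagrad_norm(grad_f, [2.0])
grad_norm = math.sqrt(sum(g * g for g in grad_f(x_final)))
```

No Lipschitz constant is supplied anywhere: if the initial stepsize is too large, the resulting large gradients inflate the accumulator and shrink the stepsize automatically, which is the robustness to hyper-parameters that the abstract refers to.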

Spectral bandits
スペクトルバンディット

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node of an undirected graph and its expected rating is similar to the one of its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret with respect to the optimal policy would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose three algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.

グラフ上の滑らかな関数は、多様体学習や半教師あり学習で幅広く応用されています。この研究では、アームの報酬がグラフ上で滑らかであるバンディット問題を研究します。このフレームワークは、コンテンツベースの推薦など、グラフを含むオンライン学習問題を解決するのに適しています。この問題では、推薦できる各アイテムは無向グラフのノードであり、その期待評価は隣接ノードの期待評価と似ています。目標は、高い期待評価を持つアイテムを推薦することです。最適なポリシーに対する累積リグレットがノードの数に応じて悪化しないアルゴリズムを目指します。特に、実世界のグラフでは小さくなる有効次元の概念を導入し、この次元に対して線形および劣線形にスケーリングする、問題を解決するための3つのアルゴリズムを提案します。コンテンツ推薦問題に関する実験では、数千のアイテムに対するユーザーの好みの優れた推定値を、わずか数十のノード評価から学習できることが示されています。

On the Theoretical Guarantees for Parameter Estimation of Gaussian Random Field Models: A Sparse Precision Matrix Approach
ガウスランダムフィールドモデルのパラメータ推定の理論的保証について:スパース精度行列アプローチ

Iterative methods for fitting a Gaussian Random Field (GRF) model via maximum likelihood (ML) estimation require solving a nonconvex optimization problem. The problem is aggravated for anisotropic GRFs where the number of covariance function parameters increases with the dimension. Even evaluation of the likelihood function requires $O(n^3)$ floating point operations, where $n$ denotes the number of data locations. In this paper, we propose a new two-stage procedure to estimate the parameters of second-order stationary GRFs. First, a convex likelihood problem regularized with a weighted $\ell_1$-norm, utilizing the available distance information between observation locations, is solved to fit a sparse precision (inverse covariance) matrix to the observed data. Second, the parameters of the covariance function are estimated by solving a least squares problem. Theoretical error bounds for the solutions of stage I and II problems are provided, and their tightness is investigated.

最尤(ML)推定によってガウス確率場(GRF)モデルを当てはめる反復法では、非凸最適化問題を解く必要があります。この問題は、共分散関数パラメータの数が次元とともに増加する異方性GRFでは悪化します。尤度関数の評価でさえ、$O(n^3)$の浮動小数点演算が必要であり、ここで$n$はデータ位置の数を示します。この論文では、2次定常GRFのパラメータを推定するための新しい2段階の手順を提案します。まず、観測位置間の利用可能な距離情報を利用した重み付き$\ell_1$ノルムで正則化された凸尤度問題を解き、観測データにスパース精度(逆共分散)行列を適合させます。次に、共分散関数のパラメータを、最小二乗問題を解くことによって推定します。ステージIおよびIIの問題の解に対する理論的誤差限界を示し、そのタイトさを調査します。

Dynamic Assortment Optimization with Changing Contextual Information
変化するコンテキスト情報による動的な品揃えの最適化

In this paper, we study the dynamic assortment optimization problem over a finite selling season of length $T$. At each time period, the seller offers an arriving customer an assortment of substitutable products under a cardinality constraint, and the customer makes the purchase among offered products according to a discrete choice model. Most existing work associates each product with a real-valued fixed mean utility and assumes a multinomial logit choice (MNL) model. In many practical applications, feature/contextual information of products is readily available. In this paper, we incorporate the feature information by assuming a linear relationship between the mean utility and the feature. In addition, we allow the feature information of products to change over time so that the underlying choice model can also be non-stationary. To solve the dynamic assortment optimization under this changing contextual MNL model, we need to simultaneously learn the underlying unknown coefficient and make the decision on the assortment. To this end, we develop an upper confidence bound (UCB) based policy and establish the regret bound on the order of $\tilde{O}(d\sqrt{T})$, where $d$ is the dimension of the feature and $\tilde{O}$ suppresses logarithmic dependence. We further establish a lower bound $\Omega(d\sqrt{T}/{K})$, where $K$ is the cardinality constraint of an offered assortment, which is usually small. When $K$ is a constant, our policy is optimal up to logarithmic factors. In the exploitation phase of the UCB algorithm, we need to solve a combinatorial optimization problem for assortment optimization based on the learned information. We further develop an approximation algorithm and an efficient greedy heuristic. The effectiveness of the proposed policy is further demonstrated by our numerical studies.

この論文では、長さ$T$の有限な販売シーズンにおける動的品揃え最適化問題を研究します。各期間において、販売者は到着した顧客に対して、カーディナリティ制約の下で代替可能な製品の品揃えを提供し、顧客は離散選択モデルに従って提供された製品の中から購入します。既存の研究のほとんどは、各製品を実数値の固定平均効用と関連付け、多項ロジット選択(MNL)モデルを想定しています。多くの実際のアプリケーションでは、製品の機能/コンテキスト情報は容易に入手できます。この論文では、平均効用と機能の間に線形関係があると想定して、機能情報を組み込みます。さらに、製品の機能情報が時間の経過とともに変化することを許容し、基礎となる選択モデルも非定常になるようにします。この変化するコンテキストMNLモデルの下で動的品揃え最適化を解くには、基礎となる未知の係数を学習すると同時に、品揃えに関する決定を行う必要があります。この目的のために、信頼上限(UCB)ベースのポリシーを開発し、$\tilde{O}(d\sqrt{T})$のオーダーで後悔境界を確立します。ここで、$d$は機能の次元であり、$\tilde{O}$は対数依存性を抑制します。さらに、下限$\Omega(d\sqrt{T}/{K})$を確立します。ここで、$K$は、提供される品揃えの基数制約であり、通常は小さいです。$K$が定数の場合、ポリシーは対数係数まで最適です。UCBアルゴリズムの活用フェーズでは、学習した情報に基づいて、品揃えの最適化のための組み合わせ最適化問題を解決する必要があります。さらに、近似アルゴリズムと効率的な貪欲ヒューリスティックを開発します。提案されたポリシーの有効性は、数値研究によってさらに実証されています。
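
The contextual MNL model described above, with mean utility linear in the product features, can be sketched as a small choice-probability function. The inclusion of a no-purchase option with utility normalized to zero is a common convention that is assumed here, and the function name, feature vectors, and coefficient vector `theta` are hypothetical.

```python
import math


def mnl_choice_probabilities(features, theta):
    """Choice probabilities under a multinomial logit model whose mean utility
    is linear in the features: u_i = <theta, x_i>.

    Assumes a no-purchase option with utility 0, so that
        P(i | S) = exp(u_i) / (1 + sum_{j in S} exp(u_j)).
    Returns (probabilities for the offered products, no-purchase probability).
    """
    utilities = [sum(t * f for t, f in zip(theta, x)) for x in features]
    weights = [math.exp(u) for u in utilities]
    denominator = 1.0 + sum(weights)
    return [w / denominator for w in weights], 1.0 / denominator


# Hypothetical assortment of three products with two features each.
assortment = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.2]  # unknown in practice; the policy must learn it online

purchase_probs, no_purchase_prob = mnl_choice_probabilities(assortment, theta)
```

In the dynamic setting the feature vectors may change every period, so these probabilities have to be recomputed for each arriving customer, and a UCB-style policy would plug an optimistic estimate of `theta` into the same formula when choosing the assortment.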

Mining Topological Structure in Graphs through Forest Representations
森林表現によるグラフにおけるトポロジカル構造のマイニング

We consider the problem of inferring simplified topological substructures—which we term backbones—in metric and non-metric graphs. Intuitively, these are subgraphs with ‘few’ nodes, multifurcations, and cycles, that model the topology of the original graph well. We present a multistep procedure for inferring these backbones. First, we encode local (geometric) information of each vertex in the original graph by means of the boundary coefficient (BC) to identify ‘core’ nodes in the graph. Next, we construct a forest representation of the graph, termed an f-pine, that connects every node of the graph to a local ‘core’ node. The final backbone is then inferred from the f-pine through CLOF (Constrained Leaves Optimal subForest), a novel graph optimization problem we introduce in this paper. On a theoretical level, we show that CLOF is NP-hard for general graphs. However, we prove that CLOF can be efficiently solved for forest graphs, a surprising fact given that CLOF induces a nontrivial monotone submodular set function maximization problem on tree graphs. This result is the basis of our method for mining backbones in graphs through forest representation. We qualitatively and quantitatively confirm the applicability, effectiveness, and scalability of our method for discovering backbones in a variety of graph-structured data, such as social networks, earthquake locations scattered across the Earth, and high-dimensional cell trajectory data.

私たちは、メトリックグラフと非メトリックグラフにおける簡略化された位相的サブ構造（バックボーンと呼ぶ）を推論する問題について考える。直感的には、これらは「少数の」ノード、多重分岐、およびサイクルを持つサブグラフであり、元のグラフのトポロジーをうまくモデル化します。これらのバックボーンを推論するための多段階の手順を示す。まず、境界係数（BC）を使用して元のグラフの各頂点のローカル（幾何学的）情報をエンコードし、グラフの「コア」ノードを識別します。次に、グラフのすべてのノードをローカル「コア」ノードに接続する、f-パインと呼ばれるグラフのフォレスト表現を構築します。最終的なバックボーンは、この論文で紹介する新しいグラフ最適化問題であるCLOF（制約付き葉最適サブフォレスト）を通じてf-パインから推論されます。理論レベルでは、CLOFは一般的なグラフに対してNP困難であることを示す。しかし、CLOFはフォレストグラフに対して効率的に解くことができることを証明しました。これは、CLOFがツリーグラフ上で非自明な単調サブモジュラー集合関数最大化問題を誘導することを考えると、驚くべき事実です。この結果は、フォレスト表現を通じてグラフ内のバックボーンをマイニングする私たちの方法の基礎となっています。ソーシャルネットワーク、地球上に散らばる地震の発生場所、高次元の細胞軌跡データなど、さまざまなグラフ構造化データでバックボーンを発見するための私たちの方法の適用性、有効性、スケーラビリティを定性的および定量的に確認しました。

Provable Convex Co-clustering of Tensors
テンソルの証明可能な凸共クラスタリング

Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulated studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.

クラスター分析は、複雑で異質なデータのパターン発見のための基本的なツールです。一般的なクラスタリング手法は、主にベクトルまたは行列変数データに焦点を当てており、現代の科学およびビジネスアプリケーションで頻繁に発生する一般次数テンソルには適用できません。さらに、既存のテンソルクラスタリングソリューションでは、非凸定式化の性質上、統計的保証と計算効率の間にギャップがあります。この研究では、テンソル共クラスタリングの証明可能な凸定式化を開発することで、このギャップを埋めます。凸共クラスタリング(CoCo)推定量は安定性が保証されており、その計算コストとストレージコストはデータサイズの多項式です。さらに、CoCo推定量の非漸近誤差境界を確立し、ベクトルまたは行列変数クラスター分析には存在しない驚くべき「次元の恵み」現象を明らかにしました。私たちの理論的発見は、広範なシミュレーション研究によって裏付けられています。最後に、大手オンライン企業の広告クリックテンソルデータのクラスター分析にCoCo推定量を適用します。当社のクラスタリング結果は、広告効果を向上させるための有意義なビジネス洞察を提供します。

Multiclass Anomaly Detector: the CS++ Support Vector Machine
マルチクラス異常検出器: CS++ サポートベクターマシン

A new support vector machine (SVM) variant, called CS++-SVM, is presented. It combines multiclass classification and anomaly detection in a single-step process, producing a trained machine that simultaneously classifies test data belonging to classes represented in the training set and labels as anomalous any test data belonging to classes not represented in the training set. A theoretical analysis of the new method is given, showing how it combines properties inherited from both the conic-segmentation SVM (CS-SVM) and the $1$-class SVM (to which the method reduces in the case of unlabelled training data). Finally, experimental results are presented to demonstrate the effectiveness of the algorithm for both simulated and real-world data.

CS++-SVMと呼ばれる新しいサポートベクターマシン(SVM)バリアントは、マルチクラス分類と異常検出を1つのステッププロセスで組み合わせて、トレーニングセットで表されるクラスに属するテストデータを同時に分類し、トレーニングセットで表されないクラスに属する異常なテストデータとしてラベル付けできるトレーニングマシンを作成します。新しい方法の特性の理論的分析は、円錐セグメンテーションSVM(CS-SVM)と$1$クラスSVM(説明されている方法がラベルのない訓練データの場合に減少する)の両方から継承されたプロパティをどのように組み合わせるかを示しています。最後に、シミュレーションデータと実世界データの両方に対するアルゴリズムの有効性を実証するための実験結果を示します。

scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn
scikit-survival: scikit-learn の上に構築されたイベントまでの時間分析のためのライブラリ

scikit-survival is an open-source Python package for time-to-event analysis fully compatible with scikit-learn. It provides implementations of many popular machine learning techniques for time-to-event analysis, including penalized Cox model, Random Survival Forest, and Survival Support Vector Machine. In addition, the library includes tools to evaluate model performance on censored time-to-event data. The documentation contains installation instructions, interactive notebooks, and a full description of the API. scikit-survival is distributed under the GPL-3 license with the source code and detailed instructions available at https://github.com/sebp/scikit-survival

scikit-survivalは、scikit-learnと完全に互換性のあるイベントまでの時間分析用のオープンソースのPythonパッケージです。これは、ペナルティ付きCoxモデル、ランダムサバイバルフォレスト、サバイバルサポートベクターマシンなど、イベントまでの時間分析のための多くの一般的な機械学習手法の実装を提供します。さらに、ライブラリには、打ち切られたイベントまでの時間データでモデルのパフォーマンスを評価するツールが含まれています。ドキュメントには、インストール手順、対話型ノートブック、APIの詳細な説明が含まれています。scikit-survivalはGPL-3ライセンスの下で配布されており、ソースコードと詳細な手順はhttps://github.com/sebp/scikit-survivalで入手できます。
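To illustrate what "evaluating model performance on censored time-to-event data" involves, here is a minimal pure-Python sketch of Harrell's concordance index. It does not use the scikit-survival API; the function name `concordance_index` and the toy data are assumptions made for this example only.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    times[i]        observed time (event or censoring) of subject i
    events[i]       True if the event was observed, False if censored
    risk_scores[i]  higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the earlier
        # subject actually had an event (otherwise the order is unknown).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): subject 1 is censored at time 4.
times = [2, 4, 6, 8]
events = [True, False, True, True]
print(concordance_index(times, events, [0.9, 0.8, 0.5, 0.1]))  # prints 1.0
```

Pairs whose earlier member is censored are skipped, which is why censoring reduces the number of usable comparisons rather than biasing them.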

Random Smoothing Might be Unable to Certify l_∞ Robustness for High-Dimensional Images
ランダムスムージングでは高次元画像の l_∞ の堅牢性を証明できない可能性がある

We show a hardness result for random smoothing to achieve certified adversarial robustness against attacks in the $\ell_p$ ball of radius $\epsilon$ when $p>2$. Although random smoothing has been well understood for the $\ell_2$ case using the Gaussian distribution, much remains unknown concerning the existence of a noise distribution that works for the case of $p>2$. This has been posed as an open problem by Cohen et al. (2019) and includes many significant paradigms such as the $\ell_\infty$ threat model. In this work, we show that any noise distribution $\mathcal{D}$ over $\mathbb{R}^d$ that provides $\ell_p$ robustness for all base classifiers with $p>2$ must satisfy $\mathbb{E} \eta_i^2=\Omega(d^{1-2/p}\epsilon^2(1-\delta)/\delta^2)$ for 99% of the features (pixels) of vector $\eta\sim\mathcal{D}$, where $\epsilon$ is the robust radius and $\delta$ is the score gap between the highest-scored class and the runner-up. Therefore, for high-dimensional images with pixel values bounded in $[0,255]$, the required noise will eventually dominate the useful information in the images, leading to trivial smoothed classifiers.

私たちは、半径$\epsilon$の$\ell_p$球における攻撃に対して、$p>2$の場合に、ランダムスムージングによって認定された敵対的堅牢性を達成するための困難性の結果を示します。ガウス分布を使用した$\ell_2$の場合のランダムスムージングはよく理解されていますが、$p>2$の場合に機能するノイズ分布の存在については多くのことが分かっていません。これはCohenら(2019)によって未解決の問題として提起されており、$\ell_\infty$脅威モデルなどの多くの重要なパラダイムが含まれています。この研究では、$p>2$であるすべての基本分類器に対して$\ell_p$の堅牢性を提供する$\mathbb{R}^d$上の任意のノイズ分布$\mathcal{D}$は、ベクトル$\eta\sim\mathcal{D}$の特徴（ピクセル）の99%に対して$\mathbb{E} \eta_i^2=\Omega(d^{1-2/p}\epsilon^2(1-\delta)/\delta^2)$を満たす必要があることを示します。ここで、$\epsilon$は堅牢な半径、$\delta$は最高スコアのクラスと次点のクラスとのスコアギャップです。したがって、ピクセル値が$[0,255]$の範囲内にある高次元画像の場合、必要なノイズが最終的に画像内の有用な情報を支配し、単純な平滑化分類器につながります。
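The scaling in the bound above can be made concrete with a little arithmetic. The sketch below only evaluates the expression inside the $\Omega(\cdot)$, so the unknown constant is omitted and the numbers indicate growth with $d$, not exact thresholds. The image size, $\epsilon$, and $\delta$ are assumptions chosen for illustration, and the $p=2$ row is shown only as a boundary case for comparison.

```python
import math


def noise_second_moment_scale(d, p, eps, delta):
    """Evaluate d^(1-2/p) * eps^2 * (1-delta) / delta^2.

    This is the quantity inside Omega(.) in the abstract; the hidden
    constant is not known here, so treat the result as a scale only.
    """
    exponent = 1.0 if math.isinf(p) else 1.0 - 2.0 / p
    return d ** exponent * eps ** 2 * (1.0 - delta) / delta ** 2


# Assumed setting: a 32x32 RGB image, radius 8 on the [0, 255] pixel scale,
# score gap delta = 0.5.
d = 3 * 32 * 32
for p in (2, 4, math.inf):
    scale = noise_second_moment_scale(d, p, eps=8.0, delta=0.5)
    print(p, scale, math.sqrt(scale))
```

For $p=\infty$ the exponent $1-2/p$ equals 1, so the per-pixel second moment grows linearly in $d$; in this assumed setting its square root already exceeds the pixel range of 255.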

ProtoAttend: Attention-Based Prototypical Learning
ProtoAttend:アテンションベースのプロトタイプ学習

We propose a novel, inherently interpretable machine learning method that bases decisions on a few relevant examples that we call prototypes. Our method, ProtoAttend, can be integrated into a wide range of neural network architectures, including pre-trained models. It utilizes an attention mechanism that relates the encoded representations to samples in order to determine prototypes. ProtoAttend yields superior results in three high-impact problems without sacrificing the accuracy of the original model: (1) it enables high-quality interpretability that outputs the samples most relevant to the decision-making (i.e., a sample-based interpretability method); (2) it achieves state-of-the-art confidence estimation by quantifying the mismatch across prototype labels; and (3) it obtains state-of-the-art results in distribution mismatch detection. All of these can be achieved with minimal additional test time and a practically viable training-time computational cost.

私たちは、プロトタイプと呼ばれるいくつかの関連例に基づいて決定を下す、本質的に解釈可能な新しい機械学習方法を提案します。私たちの方法であるProtoAttendは、事前学習済みモデルを含む幅広いニューラルネットワークアーキテクチャに統合できます。これは、プロトタイプを決定するために、エンコードされた表現をサンプルに関連付けるアテンションメカニズムを利用します。Protoattendは、元のモデルの精度を犠牲にすることなく、3つの影響の大きい問題で優れた結果をもたらします:(1)意思決定に最も関連性の高いサンプルを出力する高品質の解釈可能性を可能にします(つまり、サンプルベースの解釈可能性法)。(2)プロトタイプラベル間のミスマッチを定量化することにより、最先端の信頼性推定を達成します。(3)分布の不一致検出の最先端を取得します。これらはすべて、最小限の追加テスト時間と実質的に実行可能なトレーニング時間の計算コストで達成できます。
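The idea of relating an encoded input to candidate samples through attention can be sketched as follows. This is not the ProtoAttend implementation: the scaled dot-product scoring, the softmax normalization, the function name `attend_to_prototypes`, and the toy encodings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_prototypes(query, candidates, labels, top_k=2):
    """Return (predicted label, confidence, indices of the top_k prototypes).

    query       encoded representation of the input (list of floats)
    candidates  encoded representations of candidate training samples
    labels      label of each candidate
    """
    dim = len(query)
    # Scaled dot-product relevance of every candidate to the query.
    scores = [sum(q * c for q, c in zip(query, cand)) / math.sqrt(dim)
              for cand in candidates]
    weights = softmax(scores)
    # Attention mass per label; disagreement among highly weighted
    # candidates lowers the confidence.
    label_mass = {}
    for w, y in zip(weights, labels):
        label_mass[y] = label_mass.get(y, 0.0) + w
    predicted = max(label_mass, key=label_mass.get)
    confidence = label_mass[predicted]
    prototypes = sorted(range(len(candidates)),
                        key=lambda j: weights[j], reverse=True)[:top_k]
    return predicted, confidence, prototypes


# Hypothetical 2-D encodings.
candidates = [[2.0, 0.0], [1.5, 0.1], [0.0, 2.0], [-1.0, 0.0]]
labels = ["cat", "cat", "dog", "dog"]
print(attend_to_prototypes([1.0, 0.0], candidates, labels))
```

The returned indices are the samples that would be shown to a user as the explanation, and the confidence is simply the attention mass that agrees with the predicted label.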

A Sober Look at the Unsupervised Learning of Disentangled Representations and their Evaluation
解きほぐされた表象の教師なし学習とその評価を冷静に見る

The idea behind the unsupervised learning of disentangled representations is that real-world data is generated by a few explanatory factors of variation which can be recovered by unsupervised learning algorithms. In this paper, we provide a sober look at recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train over $14000$ models covering most prominent methods and evaluation metrics in a reproducible large-scale experimental study on eight data sets. We observe that while the different methods successfully enforce properties “encouraged” by the corresponding losses, well-disentangled models seemingly cannot be identified without supervision. Furthermore, different evaluation metrics do not always agree on what should be considered “disentangled” and exhibit systematic differences in the estimation. Finally, increased disentanglement does not seem to necessarily lead to a decreased sample complexity of learning for downstream tasks. Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.

分離表現の教師なし学習の背後にある考え方は、現実世界のデータは、教師なし学習アルゴリズムによって回復できるいくつかの説明的な変動要因によって生成されるというものです。この論文では、この分野における最近の進歩を冷静に検討し、いくつかの一般的な仮定に異議を唱えます。まず、分離表現の教師なし学習は、モデルとデータの両方に帰納的バイアスがなければ基本的に不可能であることを理論的に示します。次に、8つのデータセットで再現可能な大規模な実験研究で、最も著名な方法と評価基準をカバーする14,000を超えるモデルをトレーニングします。さまざまな方法で、対応する損失によって「促進される」特性をうまく強化できる一方で、よく分離したモデルは、教師なしでは識別できないようです。さらに、さまざまな評価基準は、何を「分離」と見なすべきかについて常に一致するわけではなく、推定に体系的な違いを示します。最後に、分離の増加は、下流のタスクの学習のサンプル複雑性の減少に必ずしもつながらないようです。私たちの結果は、分離学習に関する今後の研究では、帰納的バイアスと（暗黙の）監督の役割を明確にし、学習した表現の分離を強制することの具体的な利点を調査し、複数のデータセットをカバーする再現可能な実験設定を検討する必要があることを示唆しています。

Learning Data-adaptive Non-parametric Kernels
学習データ適応型ノンパラメトリックカーネル

In this paper, we propose a data-adaptive non-parametric kernel learning framework in margin based kernel methods. In model formulation, given an initial kernel matrix, a data-adaptive matrix with two constraints is imposed in an entry-wise scheme. Learning this data-adaptive matrix in a formulation-free strategy enlarges the margin between classes and thus improves the model flexibility. The introduced two constraints are imposed either exactly (on small data sets) or approximately (on large data sets) in our model, which provides a controllable trade-off between model flexibility and complexity with theoretical demonstration. In algorithm optimization, the objective function of our learning framework is proven to be gradient-Lipschitz continuous. Thereby, kernel and classifier/regressor learning can be efficiently optimized in a unified framework via Nesterov’s acceleration. For the scalability issue, we study a decomposition-based approach to our model in the large sample case. The effectiveness of this approximation is illustrated by both empirical studies and theoretical guarantees. Experimental results on various classification and regression benchmark data sets demonstrate that our non-parametric kernel learning framework achieves good performance when compared with other representative kernel learning based algorithms.

この論文では、マージンベースのカーネル法におけるデータ適応型ノンパラメトリックカーネル学習フレームワークを提案します。モデル定式化では、初期カーネル行列が与えられた場合、2つの制約を持つデータ適応型行列がエントリワイズ方式で課されます。定式化のない戦略でこのデータ適応型行列を学習すると、クラス間のマージンが拡大し、モデルの柔軟性が向上します。導入された2つの制約は、モデル内で正確に(小さなデータセットの場合)または近似的に(大きなデータセットの場合)課され、理論的な実証により、モデルの柔軟性と複雑さの間の制御可能なトレードオフが提供されます。アルゴリズムの最適化では、学習フレームワークの目的関数が勾配リプシッツ連続であることが証明されています。これにより、カーネルと分類器/回帰器の学習は、ネステロフの加速を介して統合フレームワークで効率的に最適化できます。スケーラビリティの問題については、大規模なサンプルの場合のモデルに対する分解ベースのアプローチを研究します。この近似の有効性は、経験的研究と理論的保証の両方によって示されています。さまざまな分類および回帰ベンチマークデータセットでの実験結果は、当社のノンパラメトリックカーネル学習フレームワークが、他の代表的なカーネル学習ベースのアルゴリズムと比較して優れたパフォーマンスを実現することを示しています。

Functional Martingale Residual Process for High-Dimensional Cox Regression with Model Averaging
モデル平均化による高次元Cox回帰のための機能的マーチンゲール残差過程

Regularization methods for the Cox proportional hazards regression with high-dimensional survival data have been studied extensively in the literature. However, if the model is misspecified, this would result in misleading statistical inference and prediction. To enhance the prediction accuracy for the relative risk and the survival probability, we propose three model averaging approaches for the high-dimensional Cox proportional hazards regression. Based on the martingale residual process, we define the delete-one cross-validation (CV) process, and further propose three novel CV functionals, including the end-time CV, integrated CV, and supremum CV, to achieve more accurate prediction for the risk quantities of clinical interest. The optimal weights for candidate models, without the constraint of summing up to one, can be obtained by minimizing these functionals, respectively. The proposed model averaging approach can attain the lowest possible prediction loss asymptotically. Furthermore, we develop a greedy model averaging algorithm to overcome the computational obstacle when the dimension is high. The performances of the proposed model averaging procedures are evaluated via extensive simulation studies, demonstrating that our methods achieve superior prediction accuracy over the existing regularization methods. As an illustration, we apply the proposed methods to the mantle cell lymphoma study.

高次元生存データを用いたCox比例ハザード回帰の正則化法は、文献で広く研究されてきました。しかし、モデルが誤って指定されている場合、統計的推論と予測が誤解を招く可能性があります。相対リスクと生存確率の予測精度を高めるために、高次元Cox比例ハザード回帰の3つのモデル平均化アプローチを提案します。マルチンゲール残差プロセスに基づいて、削除1交差検証(CV)プロセスを定義し、さらに、終了時CV、統合CV、および上限CVを含む3つの新しいCV機能を提案して、臨床的に関心のあるリスク量のより正確な予測を実現します。候補モデルの最適な重みは、合計が1になるという制約なしに、これらの機能をそれぞれ最小化することで取得できます。提案されたモデル平均化アプローチは、漸近的に可能な限り低い予測損失を達成できます。さらに、次元が高い場合の計算障害を克服するために、貪欲モデル平均化アルゴリズムを開発します。提案されたモデル平均化手順のパフォーマンスは、広範なシミュレーション研究によって評価され、私たちの方法が既存の正規化方法よりも優れた予測精度を達成することが実証されています。例として、提案された方法をマントル細胞リンパ腫の研究に適用します。

On Convergence of Distributed Approximate Newton Methods: Globalization, Sharper Bounds and Beyond
分散近似ニュートン法の収束について:グローバリゼーション、シャープバウンド、そしてその先へ

The DANE algorithm is an approximate Newton method popularly used for communication-efficient distributed machine learning. Reasons for the interest in DANE include scalability and efficiency. Convergence of DANE, however, can be tricky; its appealing convergence rate is only rigorous for quadratic objective functions, and for more general convex functions the known results are no stronger than those of the classic first-order methods. To remedy these drawbacks, we propose in this article some new alternatives to DANE which are more suitable for analysis. We first introduce a simple variant of DANE equipped with backtracking line search, for which global asymptotic convergence and sharper local non-asymptotic convergence guarantees can be proved for both quadratic and non-quadratic strongly convex functions. Then we propose a heavy-ball method to accelerate the convergence of DANE, showing that the near-tight local rate of convergence can be established for strongly convex functions, and that, with proper modification of the algorithm, about the same result applies globally to linear prediction models. Numerical evidence is provided to confirm the theoretical and practical advantages of our methods.

DANEアルゴリズムは、通信効率の高い分散機械学習に広く使用されている近似ニュートン法です。DANEが注目されている理由には、スケーラビリティと効率性があります。しかし、DANEの収束は扱いが難しい場合があります。その魅力的な収束率は、2次目的関数に対してのみ厳密であり、より一般的な凸関数に対しては、既知の結果は古典的な1次法よりも強力ではありません。これらの欠点を改善するために、この記事では、分析に適したDANEの新しい代替案をいくつか提案します。最初に、バックトラッキングラインサーチを備えたDANEの単純なバリアントを紹介します。これにより、2次および非2次強凸関数の両方に対して、グローバルな漸近収束とよりシャープなローカルな非漸近収束の保証が証明されます。次に、DANEの収束を加速するためのヘビーボール法を提案し、強凸関数に対してほぼタイトなローカル収束率を確立できることを示し、アルゴリズムを適切に変更すると、ほぼ同じ結果が線形予測モデルにグローバルに適用されます。私たちの方法の理論的および実用的な利点を確認するために数値的な証拠が提供されます。
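To show what heavy-ball acceleration does in the simplest possible setting, the sketch below compares plain gradient descent with Polyak's heavy-ball momentum on a two-dimensional quadratic. It is a single-machine toy, not DANE or the authors' distributed variant; the objective, the starting point, and the classical step-size choices for quadratics are assumptions made for this example.

```python
import math

MU, L = 1.0, 100.0  # smallest and largest curvature of the toy quadratic


def grad(x):
    """Gradient of f(x) = 0.5 * (MU * x[0]**2 + L * x[1]**2)."""
    return [MU * x[0], L * x[1]]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def gradient_descent(x0, step, iters):
    x = list(x0)
    for _ in range(iters):
        x = [xi - step * gi for xi, gi in zip(x, grad(x))]
    return x


def heavy_ball(x0, step, momentum, iters):
    x_prev, x = list(x0), list(x0)
    for _ in range(iters):
        g = grad(x)
        # Gradient step plus a multiple of the previous displacement.
        x_next = [xi - step * gi + momentum * (xi - xpi)
                  for xi, gi, xpi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


# Classical tunings for a quadratic with curvatures in [MU, L].
gd_step = 2.0 / (MU + L)
hb_step = 4.0 / (math.sqrt(L) + math.sqrt(MU)) ** 2
hb_momentum = ((math.sqrt(L) - math.sqrt(MU)) / (math.sqrt(L) + math.sqrt(MU))) ** 2

gd_x = gradient_descent([1.0, 1.0], gd_step, iters=100)
hb_x = heavy_ball([1.0, 1.0], hb_step, hb_momentum, iters=100)
print(norm(gd_x), norm(hb_x))
```

With condition number 100, gradient descent contracts by roughly 0.98 per step while the momentum iteration contracts by roughly 0.82, which is the kind of gap the accelerated rate refers to.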

Sobolev Norm Learning Rates for Regularized Least-Squares Algorithms
正則化最小二乗アルゴリズムのソボレフノルム学習率

Learning rates for least-squares regression are typically expressed in terms of $L_2$-norms. In this paper we extend these rates to norms stronger than the $L_2$-norm without requiring the regression function to be contained in the hypothesis space. In the special case of Sobolev reproducing kernel Hilbert spaces used as hypotheses spaces, these stronger norms coincide with fractional Sobolev norms between the used Sobolev space and $L_2$. As a consequence, not only the target function but also some of its derivatives can be estimated without changing the algorithm. From a technical point of view, we combine the well-known integral operator techniques with an embedding property, which so far has only been used in combination with empirical process arguments. This combination results in new finite sample bounds with respect to the stronger norms. From these finite sample bounds our rates easily follow. Finally, we prove the asymptotic optimality of our results in many cases.

最小二乗回帰の学習率は、通常、$L_2$-normsで表されます。この論文では、回帰関数が仮説空間に含まれることなく、これらのレートを$L_2$-normよりも強いノルムに拡張します。仮説空間として使用されるカーネルヒルベルト空間を再現するソボレフの特殊なケースでは、これらの強いノルムは、使用されるソボレフ空間と$L_2$の間の分数ソボレフノルムと一致します。その結果、アルゴリズムを変更せずに、ターゲット関数だけでなく、その導関数の一部も推定できます。技術的な観点からは、よく知られている積分演算子の手法と、これまで経験的なプロセス引数と組み合わせてのみ使用されてきた埋め込みプロパティを組み合わせています。この組み合わせにより、より強いノルムに関して新しい有限サンプル境界が得られます。これらの有限サンプルの限界から、私たちのレートは簡単に追従します。最後に、多くの場合、結果の漸近最適性を証明します。

Two-Stage Approach to Multivariate Linear Regression with Sparsely Mismatched Data
まばらに不一致のデータを持つ多変量線形回帰への 2 段階アプローチ

A tacit assumption in linear regression is that (response, predictor)-pairs correspond to identical observational units. A series of recent works have studied scenarios in which this assumption is violated under terms such as “Unlabeled Sensing” and “Regression with Unknown Permutation”. In this paper, we study the setup of multiple response variables and a notion of mismatches that generalizes permutations in order to allow for missing matches as well as for one-to-many matches. A two-stage method is proposed under the assumption that most pairs are correctly matched. In the first stage, the regression parameter is estimated by handling mismatches as contaminations, and subsequently the generalized permutation is estimated by a basic variant of matching. The approach is both computationally convenient and equipped with favorable statistical guarantees. Specifically, it is shown that the conditions for permutation recovery become considerably less stringent as the number of responses $m$ per observation increases. Particularly, for $m = \Omega(\log n)$, the required signal-to-noise ratio no longer depends on the sample size $n$. Numerical results on synthetic and real data are presented to support the main findings of our analysis.

線形回帰における暗黙の仮定は、(応答、予測)ペアが同一の観測単位に対応するというものです。最近の一連の研究では、「ラベルなしセンシング」や「未知の順列による回帰」などの用語で、この仮定が破られるシナリオが研究されています。この論文では、複数の応答変数の設定と、1対多の一致だけでなく欠落した一致も考慮するために順列を一般化するミスマッチの概念について研究します。ほとんどのペアが正しく一致しているという仮定の下で、2段階法が提案されています。第1段階では、ミスマッチを汚染として扱うことで回帰パラメータを推定し、続いてマッチングの基本的な変形によって一般化された順列を推定します。このアプローチは計算上便利であり、好ましい統計的保証を備えています。具体的には、観測あたりの応答数$m$が増加すると、順列回復の条件がかなり緩くなることが示されています。特に、$m = \Omega(\log n)$の場合、必要な信号対雑音比はサンプルサイズ$n$に依存しなくなります。合成データと実際のデータに関する数値結果は、分析の主な結果を裏付けるために提示されています。

Dynamic Control of Stochastic Evolution: A Deep Reinforcement Learning Approach to Adaptively Targeting Emergent Drug Resistance
確率的進化の動的制御:創発的薬剤耐性を適応的に標的とするための深層強化学習アプローチ

The challenge in controlling stochastic systems in which low-probability events can set the system on catastrophic trajectories is to develop a robust ability to respond to such events without significantly compromising the optimality of the baseline control policy. This paper presents CelluDose, a stochastic simulation-trained deep reinforcement learning adaptive feedback control prototype for automated precision drug dosing targeting stochastic and heterogeneous cell proliferation. Drug resistance can emerge from random and variable mutations in targeted cell populations; in the absence of an appropriate dosing policy, emergent resistant subpopulations can proliferate and lead to treatment failure. Dynamic feedback dosage control holds promise in combatting this phenomenon, but the application of traditional control approaches to such systems is fraught with challenges due to the complexity of cell dynamics, uncertainty in model parameters, and the need in medical applications for a robust controller that can be trusted to properly handle unexpected outcomes. Here, training on a sample biological scenario identified single-drug and combination therapy policies that exhibit a $100\%$ success rate at suppressing cell proliferation and responding to diverse system perturbations while establishing low-dose no-event baselines. These policies were found to be highly robust to variations in a key model parameter subject to significant uncertainty and unpredictable dynamical changes.

低確率のイベントがシステムを破滅的な軌道に乗せる可能性がある確率的システムを制御する際の課題は、ベースライン制御ポリシーの最適性を大幅に損なうことなく、そのようなイベントに対応する堅牢な機能を開発することです。この論文では、確率的かつ不均質な細胞増殖をターゲットとした自動精密薬物投与のための、確率シミュレーションでトレーニングされた深層強化学習適応型フィードバック制御プロトタイプであるCelluDoseを紹介します。薬剤耐性は、標的細胞集団のランダムで変動する突然変異から発生する可能性があります。適切な投与ポリシーがない場合、耐性のあるサブ集団が増殖し、治療の失敗につながる可能性があります。動的フィードバック投与量制御は、この現象に対抗する上で有望ですが、従来の制御アプローチをそのようなシステムに適用することは、細胞ダイナミクスの複雑さ、モデルパラメータの不確実性、および予期しない結果を適切に処理できる信頼できる堅牢なコントローラに対する医療用途の必要性のために、多くの課題を伴います。ここでは、サンプルの生物学的シナリオのトレーニングにより、低用量の無イベントベースラインを確立しながら、細胞増殖を抑制し、さまざまなシステム摂動に対応する際に100%の成功率を示す単剤療法および併用療法ポリシーが特定されました。これらのポリシーは、大きな不確実性と予測不可能な動的変化の影響を受ける主要なモデルパラメータの変動に対して非常に堅牢であることがわかりました。

A Numerical Measure of the Instability of Mapper-Type Algorithms
マッパー型アルゴリズムの不安定性の数値尺度

Mapper is an unsupervised machine learning algorithm generalising the notion of clustering to obtain a geometric description of a dataset. The procedure splits the data into possibly overlapping bins which are then clustered. The output of the algorithm is a graph where nodes represent clusters and edges represent the sharing of data points between two clusters. However, several parameters must be selected before applying Mapper, and the resulting graph may vary dramatically with the choice of parameters. We define an intrinsic notion of Mapper instability that measures the variability of the output as a function of the choice of parameters required to construct a Mapper output. Our results and discussion are general and apply to all Mapper-type algorithms. We derive theoretical results that provide estimates for the instability and suggest practical ways to control it. We also provide experiments to illustrate our results; in particular, we demonstrate that a reliable candidate Mapper output can be identified as a local minimum of instability regarded as a function of Mapper input parameters.

Mapperは、データセットの幾何学的記述を取得するためにクラスタリングの概念を一般化する教師なし機械学習アルゴリズムです。この手順では、データを重複する可能性のあるビンに分割し、その後クラスタリングします。アルゴリズムの出力はグラフであり、ノードはクラスターを表し、エッジは2つのクラスター間のデータポイントの共有を表します。ただし、Mapperを適用する前にいくつかのパラメーターを選択する必要があり、結果のグラフはパラメーターの選択によって大幅に異なる場合があります。Mapper出力の構築に必要なパラメーターの選択の関数として出力の変動性を測定するMapper不安定性の本質的な概念を定義します。結果と考察は一般的なものであり、すべてのMapperタイプのアルゴリズムに適用されます。不安定性の推定値を提供する理論的結果を導き、それを制御する実用的な方法を提案します。また、結果を説明するための実験も提供し、特に、Mapper入力パラメーターの関数として見なされる不安定性の局所的最小値として信頼できる候補Mapper出力を識別できることを実証します。
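The three steps described above (overlapping bins, clustering within each bin, edges for shared points) can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: the interval cover of a one-dimensional filter, the threshold-based single-linkage clustering, and every parameter value are assumptions chosen for this example.

```python
import math


def interval_cover(lo, hi, n_intervals, overlap):
    """Equal-length intervals covering [lo, hi]; neighbours share a
    fraction `overlap` of their length."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(points, indices, eps):
    """Connected components of the graph joining points closer than eps."""
    clusters, unvisited = [], set(indices)
    while unvisited:
        stack = [unvisited.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, filter_values, n_intervals, overlap, eps):
    """Nodes are clusters found inside each bin; edges join clusters
    that share at least one data point."""
    lo, hi = min(filter_values), max(filter_values)
    nodes = []
    for a, b in interval_cover(lo, hi, n_intervals, overlap):
        # Small tolerance so rounding does not drop boundary points.
        in_bin = [i for i, f in enumerate(filter_values)
                  if a - 1e-12 <= f <= b + 1e-12]
        nodes.extend(single_linkage(points, in_bin, eps))
    edges = {(u, v) for u in range(len(nodes))
             for v in range(u + 1, len(nodes)) if nodes[u] & nodes[v]}
    return nodes, edges


# 40 points on the unit circle, filtered by their x-coordinate.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
xs = [p[0] for p in circle]
nodes, edges = mapper_graph(circle, xs, n_intervals=4, overlap=0.3, eps=0.3)
print(len(nodes), len(edges))  # prints 6 6
```

With these settings the output is a six-node cycle, which reflects the shape of the circle. Lowering `eps` to 0.1 (below the spacing between neighbouring points) turns every point into its own cluster, a small instance of the parameter sensitivity the abstract describes.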

Continuous-Time Birth-Death MCMC for Bayesian Regression Tree Models
ベイズ回帰木モデルのための連続時間出生-死亡MCMC

Decision trees are flexible models that are well suited for many statistical regression problems. In the Bayesian framework for regression trees, Markov Chain Monte Carlo (MCMC) search algorithms are required to generate samples of tree models according to their posterior probabilities. The critical component of such MCMC algorithms is to construct “good” Metropolis-Hastings steps to update the tree topology. Such algorithms frequently suffer from poor mixing and local mode stickiness; therefore, the algorithms are slow to converge. Hitherto, authors have primarily used discrete-time birth/death mechanisms for Bayesian (sums of) regression tree models to explore the tree-model space. These algorithms are efficient, in terms of computation and convergence, only if the rejection rate is low which is not always the case. We overcome this issue by developing a novel search algorithm which is based on a continuous-time birth-death Markov process. The search algorithm explores the tree-model space by jumping between parameter spaces corresponding to different tree structures. The jumps occur in continuous time corresponding to the birth-death events which are modeled as independent Poisson processes. In the proposed algorithm, the moves between models are always accepted which can dramatically improve the convergence and mixing properties of the search algorithm. We provide theoretical support of the algorithm for Bayesian regression tree models and demonstrate its performance in a simulated example.

決定木は、多くの統計的回帰問題に適した柔軟なモデルです。回帰木のベイズフレームワークでは、事後確率に従ってツリーモデルのサンプルを生成するために、マルコフ連鎖モンテカルロ(MCMC)検索アルゴリズムが必要です。このようなMCMCアルゴリズムの重要なコンポーネントは、ツリートポロジを更新するための「適切な」メトロポリス-ヘイスティングステップを構築することです。このようなアルゴリズムは、混合が不十分でローカルモードの粘着性に悩まされることが多く、そのためアルゴリズムの収束が遅くなります。これまで、著者は主に、ベイズ(合計)回帰ツリーモデルの離散時間誕生/消滅メカニズムを使用してツリーモデル空間を探索してきました。これらのアルゴリズムは、計算と収束の点で効率的ですが、拒否率が低い場合に限られますが、常にそうであるとは限りません。この問題を克服するために、連続時間誕生-消滅マルコフ過程に基づく新しい検索アルゴリズムを開発しました。検索アルゴリズムは、異なるツリー構造に対応するパラメーター空間間を移動してツリーモデル空間を探索します。ジャンプは、独立したポアソン過程としてモデル化される誕生と死亡のイベントに対応する連続時間で発生します。提案されたアルゴリズムでは、モデル間の移動は常に受け入れられるため、検索アルゴリズムの収束と混合特性が大幅に改善されます。ベイジアン回帰ツリーモデルに対するアルゴリズムの理論的サポートを提供し、シミュレーション例でそのパフォーマンスを実証します。
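The mechanism of always-accepted jumps in continuous time can be seen on a toy state space. The sketch below is not the tree-model sampler: it runs a birth-death chain on the integers 0 to K-1 standing in for model indices, with rates chosen by detailed balance so that a given target distribution is stationary. The rate choice, the function name, and the target values are assumptions made for illustration.

```python
import random


def birth_death_sampler(target, total_time, seed=0):
    """Simulate a continuous-time birth-death chain with stationary law `target`.

    Returns the fraction of time spent in each state. No move is ever
    rejected; the target enters through the rates and hence through
    how long the chain waits in each state.
    """
    rng = random.Random(seed)
    k_max = len(target) - 1
    state, clock = 0, 0.0
    occupancy = [0.0] * len(target)
    while clock < total_time:
        # Birth rate 1 (when possible); death rate fixed by detailed balance:
        # target[k-1] * birth = target[k] * death.
        birth = 1.0 if state < k_max else 0.0
        death = target[state - 1] / target[state] if state > 0 else 0.0
        total_rate = birth + death
        # Exponential holding time, truncated at the end of the run.
        hold = min(rng.expovariate(total_rate), total_time - clock)
        occupancy[state] += hold
        clock += hold
        # Jump up or down in proportion to the competing rates.
        state += 1 if rng.random() < birth / total_rate else -1
    return [t / total_time for t in occupancy]


target = [0.1, 0.2, 0.4, 0.2, 0.1]  # hypothetical posterior over 5 models
print(birth_death_sampler(target, total_time=50000.0))
```

Estimates come from time-weighted occupancy instead of a count of accepted proposals, which is the structural difference from a discrete-time Metropolis-Hastings sampler.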

Empirical Risk Minimization in the Non-interactive Local Model of Differential Privacy
差分プライバシーの非対話型局所モデルにおける経験的リスク最小化

In this paper, we study the Empirical Risk Minimization (ERM) problem in the non-interactive Local Differential Privacy (LDP) model. Previous research on this problem (Smith et al., 2017) indicates that the sample complexity needed to achieve error $\alpha$ must depend exponentially on the dimensionality $p$ for general loss functions. In this paper, we make two attempts to resolve this issue by investigating conditions on the loss functions that allow us to remove such a limit. In our first attempt, we show that if the loss function is $(\infty, T)$-smooth, then by using the Bernstein polynomial approximation we can avoid the exponential dependency in the $\alpha$ term. We then propose player-efficient algorithms with $1$-bit communication complexity and $O(1)$ computation cost for each player. The error bound of these algorithms is asymptotically the same as the original one. With some additional assumptions, we also give an algorithm which is more efficient for the server. In our second attempt, we show that for any $1$-Lipschitz generalized linear convex loss function, there is an $(\epsilon, \delta)$-LDP algorithm whose sample complexity for achieving error $\alpha$ is only linear in the dimensionality $p$. Our results use a polynomial of inner product approximation technique. Finally, motivated by the idea of using polynomial approximation and based on different types of polynomial approximations, we propose (efficient) non-interactive locally differentially private algorithms for learning the set of $k$-way marginal queries and the set of smooth queries.

この論文では、非対話型ローカル差分プライバシー(LDP)モデルにおける経験的リスク最小化(ERM)問題を研究します。この問題に関する以前の研究(Smithら, 2017)によると、一般的な損失関数では、誤差$\alpha$を達成するためのサンプル複雑度は次元$p$に指数的に依存する必要があります。この論文では、このような制限を取り除くことができる損失関数の条件を調査することで、この問題を解決するための2つの試みを行います。最初の試みでは、損失関数が$(\infty, T)$滑らかな場合、Bernstein多項式近似を使用することで、$\alpha$の項における指数依存性を回避できることを示します。次に、各プレーヤーの通信複雑度が$1$ビットで計算コストが$O(1)$である、プレーヤー効率の高いアルゴリズムを提案します。これらのアルゴリズムの誤差境界は、漸近的に元のアルゴリズムと同じです。いくつかの追加の仮定により、サーバーにとってより効率的なアルゴリズムも提供します。2回目の試みでは、任意の$1$-Lipschitz一般化線形凸損失関数に対して、誤差$\alpha$を達成するためのサンプル複雑度が次元$p$でのみ線形である$(\epsilon, \delta)$-LDPアルゴリズムが存在することを示します。結果では、内積の多項式近似手法を使用しています。最後に、多項式近似を使用するというアイデアに動機付けられ、さまざまな種類の多項式近似に基づいて、k方向の周辺クエリのセットと滑らかなクエリのセットを学習するための(効率的な)非対話型ローカル差分プライバシーアルゴリズムを提案します。
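The Bernstein polynomial approximation mentioned above is easy to state in one dimension: $B_n f(x)=\sum_{k=0}^{n} f(k/n)\binom{n}{k}x^k(1-x)^{n-k}$ on $[0,1]$. The sketch below shows only this approximation step, with no privacy mechanism; the test function $e^x$ and the degrees are assumptions chosen for illustration.

```python
from math import comb, exp


def bernstein_approx(f, degree, x):
    """Degree-n Bernstein polynomial of f on [0, 1], evaluated at x.

    Only the n + 1 values f(0), f(1/n), ..., f(1) are needed, which is
    what makes the construction attractive when f can be queried only
    at a fixed grid chosen in advance.
    """
    return sum(f(k / degree) * comb(degree, k)
               * x ** k * (1 - x) ** (degree - k)
               for k in range(degree + 1))


def max_error(f, degree, grid=201):
    """Largest absolute approximation error on an evenly spaced grid."""
    xs = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(x) - bernstein_approx(f, degree, x)) for x in xs)


for n in (5, 50):
    print(n, max_error(exp, n))
```

For a twice-differentiable function the uniform error decays like $1/n$, so smoother losses need fewer grid evaluations for a given accuracy.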

Asymptotic Analysis via Stochastic Differential Equations of Gradient Descent Algorithms in Statistical and Computational Paradigms
統計的・計算パラダイムにおける勾配降下法の確率的微分方程式による漸近解析

This paper investigates the asymptotic behaviors of gradient descent algorithms (particularly accelerated gradient descent and stochastic gradient descent) in the context of stochastic optimization arising in statistics and machine learning, where objective functions are estimated from available data. We show that these algorithms can be computationally modeled by continuous-time ordinary or stochastic differential equations. We establish gradient flow central limit theorems to describe the limiting dynamic behaviors of these computational algorithms and the large-sample performances of the related statistical procedures, as the number of algorithm iterations and data size both go to infinity, where the gradient flow central limit theorems are governed by some linear ordinary or stochastic differential equations, like time-dependent Ornstein-Uhlenbeck processes. We illustrate that our study can provide a novel unified framework for a joint computational and statistical asymptotic analysis, where the computational asymptotic analysis studies the dynamic behaviors of these algorithms with time (or the number of iterations in the algorithms), the statistical asymptotic analysis investigates the large-sample behaviors of the statistical procedures (like estimators and classifiers) that are computed by applying the algorithms; in fact, the statistical procedures are equal to the limits of the random sequences generated from these iterative algorithms, as the number of iterations goes to infinity. The joint analysis results based on the obtained gradient flow central limit theorems lead to the identification of four factors—learning rate, batch size, gradient covariance, and Hessian—to derive new theories regarding the local minima found by stochastic gradient descent for solving non-convex optimization problems.

この論文では、統計学や機械学習で生じる確率的最適化の文脈で、利用可能なデータから目的関数を推定する勾配降下法アルゴリズム（特に加速勾配降下法と確率的勾配降下法）の漸近挙動を調査します。これらのアルゴリズムは、連続時間の常微分方程式または確率微分方程式によって計算モデル化できることを示します。アルゴリズムの反復回数とデータサイズが両方とも無限大になるにつれて、これらの計算アルゴリズムの極限的な動的挙動と、関連する統計手順の大規模サンプルのパフォーマンスを説明する勾配フロー中心極限定理を確立します。ここで、勾配フロー中心極限定理は、時間依存オルンシュタイン-ウーレンベック過程などのいくつかの線形常微分方程式または確率微分方程式によって制御されます。この研究では、計算漸近解析と統計漸近解析を組み合わせた新しい統合フレームワークを提供できることを示しています。計算漸近解析では、これらのアルゴリズムの時間経過に伴う動的動作（またはアルゴリズムの反復回数）を調査し、統計漸近解析では、アルゴリズムを適用して計算される統計手順（推定器や分類器など）の大規模サンプル動作を調査します。実際、反復回数が無限大になると、統計手順はこれらの反復アルゴリズムから生成されるランダムシーケンスの限界に等しくなります。得られた勾配フロー中心極限定理に基づく共同解析の結果から、学習率、バッチサイズ、勾配共分散、ヘッセ行列の4つの要素が特定され、非凸最適化問題を解決するための確率的勾配降下法で見つかる局所最小値に関する新しい理論が導き出されます。
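A one-dimensional quadratic gives a worked instance of how learning rate, batch size, gradient variance, and Hessian interact. Assume the objective $f(x)=hx^2/2$ with stochastic gradient $hx+\xi$, where $\xi$ has variance $\sigma^2/B$ for batch size $B$. These modelling choices and the parameter values are assumptions for this example, not the paper's general setting. The SGD iterate variance then obeys an exact recursion, and the matching Ornstein-Uhlenbeck process $dX=-hX\,dt+\sqrt{\eta\sigma^2/B}\,dW$ has stationary variance $\eta\sigma^2/(2hB)$.

```python
def sgd_variance_after(steps, lr, hessian, grad_var, batch_size, v0=0.0):
    """Exact variance of the SGD iterate on f(x) = hessian * x^2 / 2.

    One step is x <- x - lr * (hessian * x + noise), where the noise has
    variance grad_var / batch_size and is independent of x.
    """
    v = v0
    for _ in range(steps):
        v = (1 - lr * hessian) ** 2 * v + lr ** 2 * grad_var / batch_size
    return v


def ou_stationary_variance(lr, hessian, grad_var, batch_size):
    """Stationary variance of the approximating Ornstein-Uhlenbeck process."""
    return lr * grad_var / (2 * hessian * batch_size)


lr, h, sigma2, B = 0.01, 2.0, 4.0, 8
exact = sgd_variance_after(5000, lr, h, sigma2, B)
approx = ou_stationary_variance(lr, h, sigma2, B)
print(exact, approx)
```

The two values agree up to a relative error of order $\eta h$, and the closed form makes the roles of the four factors explicit: the variance grows with the learning rate and gradient variance, and shrinks with batch size and curvature.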

Reinforcement Learning in Continuous Time and Space: A Stochastic Control Approach
連続時間と空間における強化学習:確率的制御アプローチ

We consider reinforcement learning (RL) in continuous time with continuous feature and action spaces. We motivate and devise an exploratory formulation for the feature dynamics that captures learning under exploration, with the resulting optimization problem being a revitalization of the classical relaxed stochastic control. We then study the problem of achieving the best trade-off between exploration and exploitation by considering an entropy-regularized reward function. We carry out a complete analysis of the problem in the linear–quadratic (LQ) setting and deduce that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, the exploitation and exploration are captured respectively by the mean and variance of the Gaussian distribution. We characterize the cost of exploration, which, for the LQ case, is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate. Finally, as the weight of exploration decays to zero, we prove the convergence of the solution of the entropy-regularized LQ problem to the one of the classical LQ problem.

私たちは、連続的な特徴空間と行動空間を持つ連続時間における強化学習(RL)について考察します。私たちは、探索下での学習を捉える特徴ダイナミクスの探索的定式化を動機付け、考案します。その結果生じる最適化問題は、古典的な緩和確率制御の復活となります。次に、エントロピー正則化報酬関数を考慮することにより、探索と活用の間の最良のトレードオフを達成する問題を研究します。私たちは、線形-二次(LQ)設定でこの問題の完全な分析を実行し、活用と探索のバランスをとるための最適なフィードバック制御分布はガウス分布であると推論します。これは、RLで広く採用されているガウス探索を、サンプリングの単純さを超えて解釈します。さらに、活用と探索は、それぞれガウス分布の平均と分散によって捉えられます。私たちは、探索のコストを特徴付ける。これは、LQの場合、エントロピー正則化の重みに比例し、割引率に反比例することが示されます。最後に、探索の重みがゼロに減少するにつれて、エントロピー正規化LQ問題の解が古典的なLQ問題の解に収束することを証明します。
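Why a Gaussian appears can be seen in a static, one-step analogue of the problem. Maximizing $\int r(u)\pi(u)\,du+\lambda H(\pi)$ over densities $\pi$ gives the Gibbs form $\pi(u)\propto\exp(r(u)/\lambda)$, and for a quadratic reward $r(u)=-\tfrac{a}{2}(u-m)^2$ this is a Gaussian with mean $m$ and variance $\lambda/a$. The sketch below checks this numerically on a grid. It is not the continuous-time LQ analysis of the paper; the reward, grid, and parameter values are assumptions for illustration.

```python
import math


def gibbs_policy(reward, temperature, grid):
    """Discretized maximizer of expected reward plus temperature * entropy."""
    weights = [math.exp(reward(u) / temperature) for u in grid]
    total = sum(weights)
    return [w / total for w in weights]


def mean_and_variance(grid, probs):
    mean = sum(u * p for u, p in zip(grid, probs))
    var = sum((u - mean) ** 2 * p for u, p in zip(grid, probs))
    return mean, var


a, m, lam = 2.0, 1.0, 0.5  # curvature, best action, exploration weight
grid = [-10.0 + 0.01 * i for i in range(2001)]
policy = gibbs_policy(lambda u: -0.5 * a * (u - m) ** 2, lam, grid)
print(mean_and_variance(grid, policy))  # close to (1.0, 0.25)
```

The mean sits at the reward-maximizing action (exploitation) and the variance is set by the exploration weight (exploration); letting `lam` shrink toward zero concentrates the policy on `m`.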

A determinantal point process for column subset selection
カラムサブセット選択のための決定的ポイントプロセス

Two popular approaches to dimensionality reduction are principal component analysis, which projects onto a small number of well-chosen but non-interpretable directions, and feature selection, which selects a small number of the original features. Feature selection can be abstracted as selecting the subset of columns of a matrix $X \in \mathbb{R}^{N \times d}$ which minimize the approximation error, i.e., the norm of the residual after projecting $X$ onto the space spanned by the selected columns. Such a combinatorial optimization is usually impractical, and there has been interest in polynomial-cost, random subset selection algorithms that favour small values of this approximation error. We propose sampling from a projection determinantal point process, a repulsive distribution over column indices that favours diversity among the selected columns. We bound the ratio of the expected approximation error over the optimal error of PCA. These bounds improve over the state-of-the-art bounds of volume sampling when some realistic structural assumptions are satisfied for $X$. Numerical experiments suggest that our bounds are tight, and that our algorithms have comparable performance with the double phase algorithm, often considered the practical state-of-the-art.

次元削減の一般的な2つのアプローチは、少数の適切に選択されたが解釈不可能な方向に射影する主成分分析と、少数の元の特徴を選択する特徴選択です。特徴選択は、近似誤差、つまり選択された列によって張られる空間に$X$を射影した後の残差のノルムを最小化する行列$X \in \mathbb{R}^{N \times d}$の列のサブセットを選択することとして抽象化できます。このような組み合わせ最適化は通常非実用的であり、この近似誤差の値が小さいことを優先する多項式コストのランダムサブセット選択アルゴリズムに関心が寄せられています。私たちは、射影行列式の点過程、つまり選択された列間の多様性を優先する列インデックス上の反発分布からのサンプリングを提案します。私たちは、PCAの最適誤差に対する予想される近似誤差の比率を制限します。これらの境界は、$X$に対していくつかの現実的な構造仮定が満たされている場合、ボリュームサンプリングの最先端の境界よりも改善されます。数値実験では、境界が厳密であること、および当社のアルゴリズムが、実用的な最先端のアルゴリズムとよく考えられている二重位相アルゴリズムと同等のパフォーマンスを持つことが示唆されています。
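The approximation error described above can be illustrated with a short NumPy sketch. It assumes the Frobenius norm (the abstract only says "the norm of the residual") and picks columns uniformly at random, which is not the projection DPP the paper proposes. The function names `column_subset_error` and `pca_error` are hypothetical.

```python
import numpy as np

def column_subset_error(X, cols):
    """Frobenius norm of the residual after projecting X onto span(X[:, cols])."""
    C = X[:, cols]
    # Least-squares coefficients give the orthogonal projection of X onto span(C).
    coef, *_ = np.linalg.lstsq(C, X, rcond=None)
    return np.linalg.norm(X - C @ coef, ord="fro")

def pca_error(X, k):
    """Best possible rank-k Frobenius error (Eckart-Young theorem)."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))   # toy data: N = 50 rows, d = 8 columns
k = 3

# Assumption: uniform random column choice, used here only as a stand-in sampler.
cols = rng.choice(X.shape[1], size=k, replace=False)

# Ratio of the column-subset error to the optimal PCA error; it is never below 1.
ratio = column_subset_error(X, cols) / pca_error(X, k)
print(f"selected columns {sorted(cols.tolist())}, error ratio {ratio:.3f}")
```

The ratio computed here is the kind of quantity the paper bounds in expectation. A sampler that favours diverse columns would be expected to push it closer to 1 than uniform sampling does.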

Successor Features Combine Elements of Model-Free and Model-based Reinforcement Learning
後継特徴は、モデルフリー強化学習とモデルベース強化学習の要素を組み合わせたものです

A key question in reinforcement learning is how an intelligent agent can generalize knowledge across different inputs. By generalizing across different inputs, information learned for one input can be immediately reused for improving predictions for another input. Reusing information allows an agent to compute an optimal decision-making strategy using less data. State representation is a key element of the generalization process, compressing a high-dimensional input space into a low-dimensional latent state space. This article analyzes properties of different latent state spaces, leading to new connections between model-based and model-free reinforcement learning. Successor features, which predict frequencies of future observations, form a link between model-based and model-free learning: Learning to predict future expected reward outcomes, a key characteristic of model-based agents, is equivalent to learning successor features. Learning successor features is a form of temporal difference learning and is equivalent to learning to predict a single policy’s utility, which is a characteristic of model-free agents. Drawing on the connection between model-based reinforcement learning and successor features, we demonstrate that representations that are predictive of future reward outcomes generalize across variations in both transitions and rewards. This result extends previous work on successor features, which is constrained to fixed transitions and assumes re-learning of the transferred state representation.

強化学習における重要な問題は、インテリジェントエージェントがどのようにして異なる入力にわたって知識を一般化できるかということです。異なる入力にわたって一般化することで、1つの入力について学習した情報を、別の入力の予測を改善するためにすぐに再利用できます。情報を再利用すると、エージェントはより少ないデータを使用して最適な意思決定戦略を計算できます。状態表現は一般化プロセスの重要な要素であり、高次元の入力空間を低次元の潜在状態空間に圧縮します。この記事では、さまざまな潜在状態空間の特性を分析し、モデルベースの強化学習とモデルフリーの強化学習の新しいつながりを導きます。将来の観測の頻度を予測する後継機能は、モデルベースの学習とモデルフリーの学習を結び付けます。モデルベースのエージェントの主要な特性である将来の期待報酬結果を予測することを学習することは、後継機能の学習と同等です。後継機能の学習は、時間差学習の一種であり、モデルフリーエージェントの特性である単一のポリシーの効用を予測することを学習することと同等です。モデルベースの強化学習と後継機能の関係を利用して、将来の報酬結果を予測する表現が、遷移と報酬の両方のバリエーションにわたって一般化されることを実証します。この結果は、固定遷移に制限され、転送された状態表現の再学習を前提とする後継機能に関する以前の研究を拡張したものです。
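The link between successor features and value prediction can be sketched in the simplest tabular case. The example assumes one-hot state features, a fixed policy with a known transition matrix `P`, and a discount factor `gamma`. All of these are illustrative choices, not the paper's setup. It shows only the classical fixed-transition reuse, not the transfer across transitions that the article studies.

```python
import numpy as np

gamma = 0.9  # discount factor (illustrative value)

# Hypothetical 3-state Markov chain induced by some fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 0.0, 1.0])  # reward received in each state

n = P.shape[0]

# With one-hot features, the successor features are the discounted expected
# future state-visitation counts: Psi = sum_t gamma^t P^t = (I - gamma P)^{-1}.
Psi = np.linalg.inv(np.eye(n) - gamma * P)

# The value function is linear in the successor features.
V = Psi @ r

# If only the reward changes, Psi can be reused without re-learning.
r_new = np.array([1.0, 0.0, 0.0])
V_new = Psi @ r_new

print("V     =", np.round(V, 3))
print("V_new =", np.round(V_new, 3))
```

The identity `Psi = I + gamma * P @ Psi` is the fixed point that a temporal-difference update on successor features converges to, which is why learning them is a form of temporal difference learning.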

Conic Optimization for Quadratic Regression Under Sparse Noise
スパースノイズ下での二次回帰のための円錐最適化

This paper is concerned with the quadratic regression problem, where the goal is to find the unknown state (numerical parameters) of a system modeled by a set of equations that are quadratic in the state. We focus on the setting when a subset of equations of fixed cardinality is subject to errors of arbitrary magnitudes (potentially adversarial). We develop two methods to address this problem, which are both based on conic optimization and are able to accept any available prior knowledge on the solution as an input. We derive sufficient conditions for guaranteeing the correct recovery of the unknown state for each method and show that one method provides a better accuracy while the other one scales better to large-scale systems. The obtained conditions consist in bounds on the number of bad measurements each method can tolerate without producing a nonzero estimation error. In the case when no prior knowledge is available, we develop an iterative-based conic optimization technique. It is proved that the proposed methods allow up to half of the total number of measurements to be grossly erroneous. The efficacy of the developed methods is demonstrated in different case studies, including data analytics for a European power grid.

この論文では、2次回帰問題に関するもので、その目的は、状態が2次である方程式のセットによってモデル化されたシステムの未知の状態(数値パラメーター)を見つけることです。固定基数の方程式のサブセットが任意の大きさのエラー(潜在的に敵対的)の影響を受ける設定に焦点を当てます。この問題に対処するために、円錐最適化に基づいており、入力としてソリューションに関する利用可能な事前知識を受け入れることができる2つの方法を開発します。各方法について、未知の状態を正しく回復することを保証する十分な条件を導出し、一方の方法はより優れた精度を提供し、もう一方の方法は大規模システムによりよく拡張できることを示します。得られた条件は、各方法が非ゼロの推定誤差を生成せずに許容できる不良測定の数の境界で構成されます。事前知識が利用できない場合は、反復ベースの円錐最適化手法を開発します。提案された方法では、測定値の総数の最大半分が重大な誤りとなる可能性があることが証明されています。開発された方法の有効性は、ヨーロッパの電力網のデータ分析を含むさまざまなケーススタディで実証されています。

Contextual Explanation Networks
コンテキスト説明ネットワーク

Modern learning algorithms excel at producing accurate but complex models of the data. However, deploying such models in the real world requires extra care: we must ensure their reliability, robustness, and absence of undesired biases. This motivates the development of models that are equally accurate but can also be easily inspected and assessed beyond their predictive performance. To this end, we introduce contextual explanation networks (CENs), a class of architectures that learn to predict by generating and utilizing intermediate, simplified probabilistic models. Specifically, CENs generate parameters for intermediate graphical models which are further used for prediction and play the role of explanations. Contrary to the existing post-hoc model-explanation tools, CENs learn to predict and to explain simultaneously. Our approach offers two major advantages: (i) for each prediction, a valid, instance-specific explanation is generated with no computational overhead and (ii) prediction via explanation acts as a regularizer and boosts performance in data-scarce settings. We analyze the proposed framework theoretically and experimentally. Our results on image and text classification and survival analysis tasks demonstrate that CENs are not only competitive with the state-of-the-art methods but also offer additional insights behind each prediction that can be valuable for decision support. We also show that while post-hoc methods may produce misleading explanations in certain cases, CENs are consistent and allow such cases to be detected systematically.

現代の学習アルゴリズムは、正確だが複雑なデータモデルの作成に優れています。しかし、そのようなモデルを現実世界に展開するには特別な注意が必要です。信頼性、堅牢性、望ましくないバイアスがないことを確認する必要があります。これが、予測性能を超えて、同様に正確でありながら簡単に検査および評価できるモデルの開発を動機付けます。この目的のために、コンテキスト説明ネットワーク(CEN)を導入します。これは、中間の単純化された確率モデルを生成して使用することで予測を学習するアーキテクチャのクラスです。具体的には、CENは中間のグラフィカルモデルのパラメーターを生成し、それがさらに予測に使用され、説明の役割を果たします。既存の事後モデル説明ツールとは異なり、CENは予測と説明を同時に学習します。私たちのアプローチには、2つの大きな利点があります。(i)各予測に対して、有効なインスタンス固有の説明が計算オーバーヘッドなしで生成されること、(ii)説明による予測が正規化子として機能し、データが不足している設定でパフォーマンスが向上することです。提案されたフレームワークを理論的および実験的に分析します。画像とテキストの分類および生存分析タスクに関する私たちの結果は、CENが最先端の方法と競合できるだけでなく、各予測の背後にある追加の洞察も提供し、意思決定のサポートに役立つことを示しています。また、事後手法では特定のケースで誤解を招くような説明が生じる可能性がありますが、CENは一貫性があり、そのようなケースを体系的に検出できることも示しています。

Learning and Interpreting Multi-Multi-Instance Learning Networks
マルチマルチインスタンス学習ネットワークの学習と解釈

We introduce an extension of the multi-instance learning problem where examples are organized as nested bags of instances (e.g., a document could be represented as a bag of sentences, which in turn are bags of words). This framework can be useful in various scenarios, such as text and image classification, but also supervised learning over graphs. As a further advantage, multi-multi instance learning enables a particular way of interpreting predictions and the decision function. Our approach is based on a special neural network layer, called bag-layer, whose units aggregate bags of inputs of arbitrary size. We prove theoretically that the associated class of functions contains all Boolean functions over sets of sets of instances and we provide empirical evidence that functions of this kind can be actually learned on semi-synthetic datasets. We finally present experiments on text classification, on citation graphs, and social graph data, which show that our model obtains competitive results with respect to accuracy when compared to other approaches such as convolutional networks on graphs, while at the same time it supports a general approach to interpret the learnt model, as well as explain individual predictions.

私たちは、例がネストされたインスタンスのバッグとして整理されるマルチインスタンス学習問題の拡張を導入します(たとえば、ドキュメントは文のバッグとして表すことができ、文のバッグは単語のバッグです)。このフレームワークは、テキストや画像の分類などのさまざまなシナリオで役立ちますが、グラフ上の教師あり学習にも役立ちます。さらなる利点として、マルチマルチインスタンス学習では、予測と決定関数を解釈する特定の方法が可能になります。私たちのアプローチは、バッグレイヤーと呼ばれる特別なニューラルネットワークレイヤーに基づいています。このレイヤーのユニットは、任意のサイズの入力バッグを集約します。関連する関数のクラスには、インスタンスのセットのセット上のすべてのブール関数が含まれることを理論的に証明し、この種の関数が半合成データセットで実際に学習できることを実証的に証明します。最後に、テキスト分類、引用グラフ、ソーシャルグラフデータに関する実験を紹介します。これらの実験では、グラフ上の畳み込みネットワークなどの他のアプローチと比較した場合、私たちのモデルが精度に関して競争力のある結果を獲得すると同時に、学習したモデルを解釈し、個々の予測を説明する一般的なアプローチをサポートしていることを示しています。

Semi-parametric Learning of Structured Temporal Point Processes
構造化された時間点過程のセミパラメトリック学習

We propose a general framework of using a multi-level log-Gaussian Cox process to model repeatedly observed point processes with complex structures; such type of data has become increasingly available in various areas including medical research, social sciences, economics, and finance due to technological advances. A novel nonparametric approach is developed to efficiently and consistently estimate the covariance functions of the latent Gaussian processes at all levels. To predict the functional principal component scores, we propose a consistent estimation procedure by maximizing the conditional likelihood of super-positions of point processes. We further extend our procedure to the bivariate point process case in which potential correlations between the processes can be assessed. Asymptotic properties of the proposed estimators are investigated, and the effectiveness of our procedures is illustrated through a simulation study and an application to a stock trading dataset.

私たちは、マルチレベルのlog-Gaussian Coxプロセスを使用して、複雑な構造を持つ繰り返し観測される点プロセスをモデル化する一般的なフレームワークを提案します。このようなタイプのデータは、技術の進歩により、医学研究、社会科学、経済学、金融など、さまざまな分野でますます利用可能になっています。新しいノンパラメトリックなアプローチが開発され、すべてのレベルで潜在的なガウス過程の共分散関数を効率的かつ一貫して推定します。機能主成分スコアを予測するために、点プロセスの重ね合わせの条件付き尤度を最大化することにより、一貫した推定手順を提案します。さらに、プロセス間の潜在的な相関関係を評価できる二変量点過程のケースに手順を拡張します。提案された推定量の漸近特性が調査され、シミュレーション研究と株式取引データセットへの適用を通じて、手順の有効性が示されています。

Adaptive Smoothing for Path Integral Control
パス積分制御のための適応平滑化

In Path Integral control problems, a representation of an optimally controlled dynamical system can be formally computed and serve as a guidepost to learn a parametrized policy. The Path Integral Cross-Entropy (PICE) method tries to exploit this, but is hampered by poor sample efficiency. We propose a model-free algorithm called ASPIC (Adaptive Smoothing of Path Integral Control) that applies an inf-convolution to the cost function to speed up convergence of policy optimization. We identify PICE as the infinite smoothing limit of this technique and show that the sample efficiency problems that PICE suffers from disappear for finite levels of smoothing. For zero smoothing, ASPIC becomes a greedy optimization of the cost, which is the standard approach in current reinforcement learning. ASPIC adapts the smoothness parameter to keep the variance of the gradient estimator at a predefined level, independently of the number of samples. We show analytically and empirically that intermediate levels of smoothing are optimal, which renders the new method superior to both PICE and direct cost optimization.

経路積分制御問題では、最適に制御された動的システムの表現を形式的に計算し、パラメータ化されたポリシーを学習するためのガイドポストとして使用できます。経路積分クロスエントロピー(PICE)法はこれを利用しようとしますが、サンプル効率が悪いという問題があります。私たちは、ポリシー最適化の収束を高速化するためにコスト関数にinf畳み込み(下限畳み込み)を適用するASPIC (経路積分制御の適応型平滑化)と呼ばれるモデルフリーアルゴリズムを提案します。私たちは、PICEをこのような手法の無限平滑化限界として特定し、有限レベルの平滑化ではPICEが抱えるサンプル効率の問題がなくなることを示します。平滑化がゼロの場合、ASPICはコストの貪欲最適化になります。これは、現在の強化学習の標準的なアプローチです。ASPICは、サンプル数に関係なく、平滑化パラメータを適応させて勾配推定器の分散を定義済みレベルに保ちます。私たちは、中間レベルの平滑化が最適であり、これにより新しい方法がPICEと直接コスト最適化の両方よりも優れていることを分析的かつ経験的に示します。

A Unified q-Memorization Framework for Asynchronous Stochastic Optimization
非同期確率最適化のための統一q記憶フレームワーク

Asynchronous stochastic algorithms with various variance reduction techniques (such as SVRG, S2GD, SAGA and q-SAGA) are popular in solving large scale learning problems. Recently, Reddi et al. (2015) proposed a unified variance reduction framework (i.e., HSAG) to analyze asynchronous stochastic gradient optimization. However, the HSAG framework cannot incorporate the S2GD technique, and its analysis is limited to the SVRG and SAGA techniques for smooth convex optimization. They did not analyze other important variance reduction techniques (e.g., S2GD and q-SAGA) or other important optimization problems (e.g., convex optimization with non-smooth regularization and non-convex optimization with cardinality constraint). In this paper, we bridge this gap by using a unified q-memorization framework for various variance reduction techniques (including SVRG, S2GD, SAGA, q-SAGA) to analyze asynchronous stochastic algorithms for three important optimization problems. Specifically, based on the q-memorization framework, 1) we propose an asynchronous stochastic gradient hard thresholding algorithm with q-memorization (AsySGHT-qM) for the non-convex optimization with cardinality constraint, and prove that the convergence rate of AsySGHT-qM before reaching the inherent error induced by gradient hard thresholding methods is geometric. 2) We propose an asynchronous stochastic proximal gradient algorithm (AsySPG-qM) for the convex optimization with non-smooth regularization, and prove that AsySPG-qM can achieve a linear convergence rate. 3) We propose an asynchronous stochastic gradient descent algorithm (AsySGD-qM) for the general non-convex optimization problem, and prove that AsySGD-qM can achieve a sublinear convergence rate to stationary points. The experimental results on various large-scale datasets confirm the fast convergence of our AsySGHT-qM, AsySPG-qM and AsySGD-qM through concrete realizations of SVRG and SAGA.

さまざまな分散削減手法(SVRG、S2GD、SAGA、q-SAGAなど)を備えた非同期確率アルゴリズムは、大規模な学習問題を解決する際によく使用されます。最近、Reddiら(2015)は、非同期確率勾配最適化を分析するための統合分散削減フレームワーク(HSAG)を提案しました。ただし、HSAGフレームワークはS2GD手法を組み込むことができず、HSAGフレームワークの分析は滑らかな凸最適化に関するSVRGおよびSAGA手法に限定されています。彼らは、他の重要な分散削減手法(S2GDやq-SAGAなど)やその他の重要な最適化問題(非滑らかな正則化を伴う凸最適化や、カーディナリティ制約のある非凸最適化など)を分析していませんでした。この論文では、さまざまな分散削減手法(SVRG、S2GD、SAGA、q-SAGAを含む)に統一されたq記憶フレームワークを使用してこのギャップを埋め、3つの重要な最適化問題に対する非同期確率アルゴリズムを分析します。具体的には、q記憶フレームワークに基づいて、1)カーディナリティ制約付きの非凸最適化に対して、q記憶を備えた非同期確率勾配ハードしきい値アルゴリズム(AsySGHT-qM)を提案し、勾配ハードしきい値法によって誘発される固有の誤差に達する前のAsySGHT-qMの収束率が幾何的であることを証明します。2)非滑らかな正則化による凸最適化に対して、非同期確率近似勾配アルゴリズム(AsySPG-qM)を提案し、AsySPG-qMが線形収束率を達成できることを証明します。3)一般的な非凸最適化問題に対する非同期確率的勾配降下法アルゴリズム(AsySGD-qM)を提案し、AsySGD-qMが定常点への準線形収束率を達成できることを証明します。さまざまな大規模データセットでの実験結果により、SVRGとSAGAの具体的な実現を通じて、AsySGHT-qM、AsySPG-qM、およびAsySGD-qMの高速収束が確認されます。

Beyond Trees: Classification with Sparse Pairwise Dependencies
Beyond Trees: スパースペアワイズ依存関係による分類

Several classification methods assume that the underlying distributions follow tree-structured graphical models. Indeed, trees capture statistical dependencies between pairs of variables, which may be crucial to attaining low classification errors. In this setting, the optimal classifier is linear in the log-transformed univariate and bivariate densities that correspond to the tree edges. In practice, observed data may not be well approximated by trees. Yet, motivated by the importance of pairwise dependencies for accurate classification, here we propose to approximate the optimal decision boundary by a sparse linear combination of the univariate and bivariate log-transformed densities. Our proposed approach is semi-parametric in nature: we non-parametrically estimate the univariate and bivariate densities, remove pairs of variables that are nearly independent using the Hilbert-Schmidt independence criterion, and finally construct a linear SVM using the retained log-transformed densities. We demonstrate on synthetic and real data sets, that our classifier, named SLB (sparse log-bivariate density), is competitive with other popular classification methods.

いくつかの分類法では、基礎となる分布がツリー構造のグラフィカルモデルに従うことを前提としています。実際、ツリーは変数のペア間の統計的依存関係を捉えており、これは低い分類エラーを達成するために非常に重要となる可能性があります。この設定では、最適な分類器は、ツリーのエッジに対応する対数変換された単変量および二変量密度において線形です。実際には、観測されたデータはツリーではうまく近似されない可能性があります。しかし、正確な分類にはペアごとの依存関係が重要であることから、ここでは、単変量および二変量の対数変換された密度のスパース線形結合によって最適な決定境界を近似することを提案します。提案されたアプローチは、本質的にセミパラメトリックです。つまり、単変量および二変量密度をノンパラメトリックに推定し、ヒルベルトシュミット独立基準を使用してほぼ独立した変数のペアを削除し、最後に保持された対数変換された密度を使用して線形SVMを構築します。私たちは、合成データセットと実際のデータセットで、SLB (スパース対数二変量密度)と呼ばれる私たちの分類器が他の一般的な分類方法と競合できることを証明しました。

Efficient Adjustment Sets for Population Average Causal Treatment Effect Estimation in Graphical Models
グラフィカルモデルにおける母集団平均因果治療効果推定のための効率的な調整セット

The method of covariate adjustment is often used for estimation of total treatment effects from observational studies. Restricting attention to causal linear models, a recent article (Henckel et al., 2019) derived two novel graphical criteria: one to compare the asymptotic variance of linear regression treatment effect estimators that control for certain distinct adjustment sets and another to identify the optimal adjustment set that yields the least squares estimator with the smallest asymptotic variance. In this paper we show that the same graphical criteria can be used in non-parametric causal graphical models when treatment effects are estimated using non-parametrically adjusted estimators of the interventional means. We also provide a new graphical criterion for determining the optimal adjustment set among the minimal adjustment sets and another novel graphical criterion for comparing time dependent adjustment sets. We show that uniformly optimal time dependent adjustment sets do not always exist. For point interventions, we provide a sound and complete graphical criterion for determining when a non-parametric optimally adjusted estimator of an interventional mean, or of a contrast of interventional means, is semiparametric efficient under the non-parametric causal graphical model. In addition, when the criterion is not met, we provide a sound algorithm that checks for possible simplifications of the efficient influence function of the parameter. Finally, we find an interesting connection between identification and efficient covariate adjustment estimation. Specifically, we show that if there exists an identifying formula for an interventional mean that depends only on treatment, outcome and mediators, then the non-parametric optimally adjusted estimator can never be globally efficient under the causal graphical model.

共変量調整法は、観察研究からの総治療効果の推定によく使用されます。因果線形モデルに焦点を絞った最近の論文(Henckelら, 2019)では、2つの新しいグラフィカル基準が導き出されました。1つは、特定の異なる調整セットを制御する線形回帰治療効果推定量の漸近分散を比較するための基準で、もう1つは、漸近分散が最小の最小二乗推定量をもたらす最適な調整セットを識別するための基準です。この論文では、介入平均のノンパラメトリック調整推定量を使用して治療効果を推定する場合、ノンパラメトリック因果グラフィカルモデルで同じグラフィカル基準を使用できることを示します。また、最小調整セットの中から最適な調整セットを決定するための新しいグラフィカル基準と、時間依存調整セットを比較するための別の新しいグラフィカル基準も提供します。均一に最適な時間依存調整セットが常に存在するとは限らないことを示します。ポイント介入については、介入平均または介入平均の対比のノンパラメトリック最適調整推定量がノンパラメトリック因果グラフィカルモデル下でセミパラメトリック効率的であるかどうかを判断するための、適切で完全なグラフィカル基準を提供します。さらに、基準が満たされない場合、パラメーターの効率的な影響関数の可能な簡略化をチェックする適切なアルゴリズムを提供します。最後に、識別と効率的な共変量調整推定の間に興味深い関係があることを発見しました。具体的には、治療、結果、および媒介因子のみに依存する介入平均の識別式が存在する場合、ノンパラメトリック最適調整推定量は因果グラフィカルモデル下でグローバルに効率的になることは決してないことを示します。

Kriging Prediction with Isotropic Matern Correlations: Robustness and Experimental Designs
等方性Matern相関によるクリギング予測:ロバスト性と実験計画

This work investigates the prediction performance of kriging predictors. We derive error bounds for the prediction error in terms of non-asymptotic probability under the uniform metric and $L_p$ metrics when the spectral densities of both the true and the imposed correlation functions decay algebraically. The Matern family is a prominent class of correlation functions of this kind. Our analysis shows that, when the smoothness of the imposed correlation function exceeds that of the true correlation function, the prediction error becomes more sensitive to the space-filling property of the design points. In particular, the kriging predictor can still reach the optimal rate of convergence if the experimental design scheme is quasi-uniform. Lower bounds of the kriging prediction error are also derived under the uniform metric and $L_p$ metrics. An accurate characterization of this error is obtained when an oversmoothed correlation function and a space-filling design are used.

この研究では、クリギング予測子の予測性能を調査します。一様メトリックと$L_p$メトリックの下で、真相関関数と強制相関関数の両方のスペクトル密度が代数的に減衰する場合の非漸近確率の観点から、予測誤差の誤差範囲を導き出します。Maternファミリーは、この種の相関関数の著名なクラスです。この分析では、課せられた相関関数の滑らかさが真の相関関数の滑らかさを超えると、予測誤差は設計点の空間充填特性に対してより敏感になることが示されています。特に、クリギング予測子は、実験計画スキームが準一様である場合、最適な収束率に達することができます。クリギング予測誤差の下限は、uniform metricと$L_p$メトリクスでも導出されます。この誤差の正確な特性評価は、過度に平滑化された相関関数と空間を埋める設計を使用すると得られます。

Consistency of Semi-Supervised Learning Algorithms on Graphs: Probit and One-Hot Methods
グラフ上の半教師あり学習アルゴリズムの一貫性:プロビット法とワンホット法

Graph-based semi-supervised learning is the problem of propagating labels from a small number of labelled data points to a larger set of unlabelled data. This paper is concerned with the consistency of optimization-based techniques for such problems, in the limit where the labels have small noise and the underlying unlabelled data is well clustered. We study graph-based probit for binary classification, and a natural generalization of this method to multi-class classification using one-hot encoding. The resulting objective function to be optimized comprises the sum of a quadratic form defined through a rational function of the graph Laplacian, involving only the unlabelled data, and a fidelity term involving only the labelled data. The consistency analysis sheds light on the choice of the rational function defining the optimization.

グラフベースの半教師あり学習は、少数のラベル付きデータポイントから、より大きなラベルなしデータの集合へラベルを伝播する問題です。この論文は、ラベルのノイズが小さく、ラベル付けされていないデータが十分にクラスター化されている極限において、このような問題に対する最適化ベースの手法の一貫性を扱います。私たちは、二項分類のためのグラフベースのプロビットと、この方法をワンホットエンコーディングを使用したマルチクラス分類への自然な一般化について研究しています。最適化される結果の目的関数は、ラベル付けされていないデータのみを含むグラフLaplacianの有理関数によって定義される2次形式と、ラベル付けされたデータのみを含む忠実度項の合計で構成されます。一貫性分析は、最適化を定義する有理関数の選択に光を当てます。

Scikit-network: Graph Analysis in Python
scikit-network:Pythonでのグラフ分析

Scikit-network is a Python package inspired by scikit-learn for the analysis of large graphs. Graphs are represented by their adjacency matrix in the sparse CSR format of SciPy. The package provides state-of-the-art algorithms for ranking, clustering, classifying, embedding and visualizing the nodes of a graph. High performance is achieved through a mix of fast matrix-vector products (using SciPy), compiled code (using Cython) and parallel processing. The package is distributed under the BSD license, with dependencies limited to NumPy and SciPy. It is compatible with Python 3.6 and newer. Source code, documentation and installation instructions are available online.

scikit-networkは、scikit-learnに触発されたPythonパッケージで、大きなグラフを解析するためのものです。グラフは、SciPyのスパースCSR形式の隣接行列で表されます。このパッケージは、グラフのノードをランク付け、クラスタリング、分類、埋め込み、視覚化するための最先端のアルゴリズムを提供します。高速な行列ベクトル積(SciPyを使用)、コンパイル済みコード(Cythonを使用)、並列処理の組み合わせにより、高いパフォーマンスが実現されます。このパッケージはBSDライセンスの下で配布されており、依存関係はNumPyとSciPyに限定されています。Python 3.6以降と互換性があります。ソースコード、ドキュメント、インストール手順はオンラインで入手できます。
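The data representation described above, an adjacency matrix in SciPy's sparse CSR format, can be sketched with SciPy alone. The snippet deliberately does not call scikit-network. The `pagerank` function is a hypothetical plain power-iteration stand-in for a ranking algorithm, and the toy edge list is made up for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical directed toy graph given as (source, target) pairs.
edges = np.array([[0, 1], [1, 2], [2, 0], [2, 3], [3, 0]])
n = 4

# Adjacency matrix in CSR format: entry (i, j) is 1 if there is an edge i -> j.
weights = np.ones(len(edges))
adjacency = sparse.csr_matrix((weights, (edges[:, 0], edges[:, 1])), shape=(n, n))

def pagerank(adjacency, damping=0.85, n_iter=100):
    """Power-iteration PageRank built on sparse matrix-vector products.

    Assumes every node has at least one outgoing edge (true for the toy graph).
    """
    n = adjacency.shape[0]
    out_degree = np.asarray(adjacency.sum(axis=1)).ravel()
    # Row-normalise to get transition probabilities, then transpose for mat-vec.
    transition_T = (sparse.diags(1.0 / out_degree) @ adjacency).T.tocsr()
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        scores = damping * (transition_T @ scores) + (1.0 - damping) / n
    return scores

scores = pagerank(adjacency)
print(np.round(scores, 3))
```

Each iteration costs one sparse matrix-vector product, which is the kind of operation the abstract credits for the package's performance on large graphs.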

Topology of Deep Neural Networks
ディープニューラルネットワークのトポロジー

We study how the topology of a data set $M = M_a \cup M_b \subseteq \mathbb{R}^d$, representing two classes $a$ and $b$ in a binary classification problem, changes as it passes through the layers of a well-trained neural network, i.e., one with perfect accuracy on training set and near-zero generalization error ($\approx 0.01\%$). The goal is to shed light on two mysteries in deep neural networks: (i) a nonsmooth activation function like ReLU outperforms a smooth one like hyperbolic tangent; (ii) successful neural network architectures rely on having many layers, even though a shallow network can approximate any function arbitrarily well. We performed extensive experiments on the persistent homology of a wide range of point cloud data sets, both real and simulated. The results consistently demonstrate the following: (1) Neural networks operate by changing topology, transforming a topologically complicated data set into a topologically simple one as it passes through the layers. No matter how complicated the topology of $M$ we begin with, when passed through a well-trained neural network $f : \mathbb{R}^d \to \mathbb{R}^p$, there is a vast reduction in the Betti numbers of both components $M_a$ and $M_b$; in fact they nearly always reduce to their lowest possible values: $\beta_k\bigl(f(M_i)\bigr) = 0$ for $k \ge 1$ and $\beta_0\bigl(f(M_i)\bigr) = 1$, $i =a, b$. (2) The reduction in Betti numbers is significantly faster for ReLU activation than for hyperbolic tangent activation as the former defines nonhomeomorphic maps that change topology, whereas the latter defines homeomorphic maps that preserve topology. (3) Shallow and deep networks transform data sets differently — a shallow network operates mainly through changing geometry and changes topology only in its final layers, a deep one spreads topological changes more evenly across all layers.

私たちは、バイナリ分類問題における2つのクラス$a$と$b$を表すデータセット$M = M_a \cup M_b \subseteq \mathbb{R}^d$のトポロジーが、十分にトレーニングされたニューラルネットワーク(トレーニングセットで完全な精度を持ち、一般化エラーがほぼゼロ($\approx 0.01\%$))のレイヤーを通過するときにどのように変化するかを調べます。目標は、ディープニューラルネットワークの2つの謎を解明することです。(i) ReLUのような滑らかでない活性化関数は、双曲正接のような滑らかな関数よりも優れています。(ii)浅いネットワークは任意の関数を任意に適切に近似できますが、成功したニューラルネットワークアーキテクチャは多くのレイヤーを持つことに依存しています。私たちは、実際のデータセットとシミュレートされたデータセットの両方を含む、さまざまなポイントクラウドデータセットの永続的なホモロジーに関する広範な実験を行いました。結果は一貫して次のことを示しています: (1)ニューラルネットワークはトポロジーを変更することで動作し、レイヤーを通過するときにトポロジー的に複雑なデータセットをトポロジー的に単純なものに変換します。$M$のトポロジーが最初にどれほど複雑であっても、十分にトレーニングされたニューラルネットワーク$f : \mathbb{R}^d \to \mathbb{R}^p$を通過すると、両方のコンポーネント$M_a$と$M_b$のベッティ数が大幅に減少します。実際、それらはほぼ常に可能な限り低い値まで減少します: $k \ge 1$の場合、$\beta_k\bigl(f(M_i)\bigr) = 0$、$i =a, b$の場合、$\beta_0\bigl(f(M_i)\bigr) = 1$。(2) ReLU活性化は双曲正接活性化よりもベッティ数の減少が大幅に速くなります。これは、前者がトポロジーを変更する非同相マップを定義するのに対し、後者はトポロジーを保存する同相マップを定義するためです。(3)浅いネットワークと深いネットワークはデータセットを異なる方法で変換します。浅いネットワークは主にジオメトリを変更することで動作し、最終層でのみトポロジーを変更しますが、深いネットワークはすべての層にわたってトポロジーの変更をより均等に広げます。
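The zeroth Betti number $\beta_0$ counts connected components. A rough sketch of how it can be estimated for a point cloud is given below. It is a simplification: it builds a neighbourhood graph at one fixed scale `eps` rather than computing persistent homology across scales as the paper does, and it ignores higher Betti numbers entirely. The function name `betti_0` and the synthetic blobs are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def betti_0(points, eps):
    """Number of connected components of the eps-neighbourhood graph."""
    dist = squareform(pdist(points))
    # Connect two points whenever they lie within distance eps of each other.
    graph = (dist <= eps).astype(float)
    n_components, _ = connected_components(graph, directed=False)
    return n_components

rng = np.random.default_rng(0)
# Hypothetical point cloud: two well-separated Gaussian blobs in the plane.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(40, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(40, 2))
cloud = np.vstack([blob_a, blob_b])

print(betti_0(cloud, eps=1.0))   # the two blobs are seen as separate components
print(betti_0(cloud, eps=10.0))  # at a large scale everything merges into one
```

Applying such a function to the image of each class after every layer would give a crude view of the trend the abstract reports, namely $\beta_0$ of each class dropping towards 1.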

Near-optimal Individualized Treatment Recommendations
最適に近い個別化治療の推奨事項

The individualized treatment recommendation (ITR) is an important analytic framework for precision medicine. The goal of ITR is to assign the best treatments to patients based on their individual characteristics. From the machine learning perspective, the solution to the ITR problem can be formulated as a weighted classification problem to maximize the mean benefit from the recommended treatments given patients’ characteristics. Several ITR methods have been proposed in both the binary setting and the multicategory setting. In practice, one may prefer a more flexible recommendation that includes multiple treatment options. This motivates us to develop methods to obtain a set of near-optimal individualized treatment recommendations alternative to each other, called alternative individualized treatment recommendations (A-ITR). We propose two methods to estimate the optimal A-ITR within the outcome weighted learning (OWL) framework. Simulation studies and a real data analysis for Type 2 diabetic patients with injectable antidiabetic treatments are conducted to show the usefulness of the proposed A-ITR framework. We also show the consistency of these methods and obtain an upper bound for the risk between the theoretically optimal recommendation and the estimated one. An R package “aitr” has been developed, found at https://github.com/menghaomiao/aitr.

個別治療推奨(ITR)は、精密医療にとって重要な分析フレームワークです。ITRの目標は、患者の個々の特性に基づいて、患者に最適な治療を割り当てることです。機械学習の観点から、ITR問題の解決策は、患者の特性を考慮して推奨される治療からの平均利益を最大化する重み付け分類問題として定式化できます。バイナリ設定とマルチカテゴリ設定の両方で、いくつかのITR方法が提案されています。実際には、複数の治療オプションを含むより柔軟な推奨が好まれる場合があります。これが、互いに代替可能な、ほぼ最適な個別治療推奨のセットを取得する方法の開発の動機となります。これは、代替個別治療推奨(A-ITR)と呼ばれます。私たちは、アウトカム加重学習(OWL)フレームワーク内で最適なA-ITRを推定する2つの方法を提案します。提案されたA-ITRフレームワークの有用性を示すために、注射による抗糖尿病治療を受けている2型糖尿病患者に対するシミュレーション研究と実際のデータ分析を実施しました。また、これらの方法の一貫性を示し、理論的に最適な推奨と推定された推奨の間のリスクの上限を取得します。Rパッケージ「aitr」が開発されました。https://github.com/menghaomiao/aitrで見つかります。

Distributed High-dimensional Regression Under a Quantile Loss Function
分位点損失関数の下での分散高次元回帰

This paper studies distributed estimation and support recovery for high-dimensional linear regression model with heavy-tailed noise. To deal with heavy-tailed noise whose variance can be infinite, we adopt the quantile regression loss function instead of the commonly used squared loss. However, the non-smooth quantile loss poses new challenges to high-dimensional distributed estimation in both computation and theoretical development. To address the challenge, we transform the response variable and establish a new connection between quantile regression and ordinary linear regression. Then, we provide a distributed estimator that is both computationally and communicationally efficient, where only the gradient information is communicated at each iteration. Theoretically, we show that, after a constant number of iterations, the proposed estimator achieves a near-oracle convergence rate without any restriction on the number of machines. Moreover, we establish the theoretical guarantee for the support recovery. The simulation analysis is provided to demonstrate the effectiveness of our method.

この論文では、裾が重いノイズのある高次元線形回帰モデルの分散推定とサポート回復について検討します。分散が無限大になる可能性がある裾が重いノイズに対処するために、一般的に使用される二乗損失の代わりに、分位点回帰損失関数を採用します。しかし、滑らかでない分位点損失は、計算と理論開発の両方において、高次元分散推定に新たな課題をもたらします。この課題に対処するために、応答変数を変換し、分位点回帰と通常の線形回帰の間に新しい接続を確立します。次に、各反復で勾配情報のみが通信される、計算と通信の両方で効率的な分散推定量を提供します。理論的には、一定数の反復の後、提案された推定量は、マシンの数に制限なく、オラクルに近い収束率を達成することを示します。さらに、サポート回復の理論的保証を確立します。シミュレーション分析は、この方法の有効性を実証するために提供されます。
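The quantile (pinball) loss that replaces the squared loss can be written in a few lines. The sketch below shows only the loss and why it suits heavy-tailed noise. It does not implement the paper's distributed estimator or its response transformation, and the Cauchy sample and the brute-force search are illustrative assumptions.

```python
import numpy as np

def quantile_loss(residual, tau):
    """Mean pinball loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    residual = np.asarray(residual, dtype=float)
    return np.mean(residual * (tau - (residual < 0)))

rng = np.random.default_rng(0)
# Cauchy noise has infinite variance, so the sample mean is unreliable,
# while the minimiser of the quantile loss remains well behaved.
y = rng.standard_cauchy(501)

# Brute-force search over the observed values for the best constant fit at tau = 0.5.
losses = [quantile_loss(y - c, tau=0.5) for c in y]
best_constant = y[int(np.argmin(losses))]

print("minimiser of the quantile loss:", round(float(best_constant), 4))
print("sample median:                 ", round(float(np.median(y)), 4))
print("sample mean:                   ", round(float(np.mean(y)), 4))
```

For `tau = 0.5` the loss is half the mean absolute error, so its minimiser is the sample median. The kink at zero is the non-smoothness that the abstract identifies as the main difficulty.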

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey
強化学習領域のためのカリキュラム学習:フレームワークと調査

Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.

強化学習(RL)は、エージェントが限られた環境フィードバックしか受けない連続的な意思決定タスクに対処するための一般的なパラダイムです。過去30年間で多くの進歩があったにもかかわらず、多くのドメインでの学習には依然として環境との大量のインタラクションが必要であり、現実的なシナリオでは法外なコストがかかる可能性があります。この問題に対処するために、強化学習に転移学習が適用され、1つのタスクで得た経験を次のより難しいタスクの学習を開始するときに活用できるようになりました。最近では、いくつかの研究ラインで、ゼロから学習するには難しすぎる可能性のある問題を学習するために、タスクまたはデータサンプル自体をカリキュラムに順序付ける方法が検討されています。この記事では、強化学習におけるカリキュラム学習(CL)のフレームワークを紹介し、それを使用して、既存のCL手法をその前提、機能、および目標の観点から調査および分類します。最後に、フレームワークを使用して未解決の問題を見つけ、将来のRLカリキュラム学習研究の方向性を提案します。

Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction
勾配追従と分散縮小によるネットワークにおける通信効率分散最適化

There is growing interest in large-scale machine learning and optimization over decentralized networks, e.g., in the context of multi-agent learning and federated learning. Due to the imminent need to alleviate the communication burden, the investigation of communication-efficient distributed optimization algorithms, particularly for empirical risk minimization, has flourished in recent years. A large fraction of these algorithms have been developed for the master/slave setting, relying on the presence of a central parameter server that can communicate with all agents. This paper focuses on distributed optimization over networks, or decentralized optimization, where each agent is only allowed to aggregate information from its neighbors over a network (namely, no centralized coordination is present). By properly adjusting the global gradient estimate via local averaging in conjunction with proper correction, we develop a communication-efficient approximate Newton-type method, called Network-DANE, which generalizes DANE to accommodate decentralized scenarios. Our key ideas can be applied, in a systematic manner, to obtain decentralized versions of other master/slave distributed algorithms. A notable development is Network-SVRG/SARAH, which employs variance reduction at each agent to further accelerate local computation. We establish linear convergence of Network-DANE and Network-SVRG for strongly convex losses, and Network-SARAH for quadratic losses, which sheds light on the impacts of data homogeneity, network connectivity, and local averaging upon the rate of convergence. We further extend Network-DANE to composite optimization by allowing a nonsmooth penalty term. Numerical evidence is provided to demonstrate the appealing performance of our algorithms over competitive baselines, in terms of both communication and computation efficiency. Our work suggests that by performing a judiciously chosen amount of local communication and computation per iteration, the overall efficiency can be substantially improved.

分散ネットワーク上での大規模な機械学習と最適化、たとえばマルチエージェント学習や連合学習のコンテキストでの関心が高まっています。通信負荷を軽減する必要性が差し迫っているため、通信効率の高い分散最適化アルゴリズムの研究が近年盛んになっています。特に経験的リスク最小化のためのアルゴリズムです。これらのアルゴリズムの大部分は、すべてのエージェントと通信できる中央パラメータサーバーの存在を前提としたマスター/スレーブ設定用に開発されています。この論文では、ネットワーク上の分散最適化、つまり各エージェントがネットワーク経由で近隣のエージェントからのみ情報を集約できる(つまり、中央調整が存在しない)分散最適化に焦点を当てています。適切な補正と組み合わせてローカル平均化によってグローバル勾配推定を適切に調整することで、分散シナリオに対応するようにDANEを一般化する、通信効率の高い近似ニュートン型手法(Network-DANE)を開発します。私たちの主要なアイデアは、他のマスター/スレーブ分散アルゴリズムの分散バージョンを取得するために体系的に適用できます。注目すべき開発は、各エージェントで分散削減を採用してローカル計算をさらに高速化するNetwork-SVRG/SARAHです。私たちは、強い凸損失に対してNetwork-DANEとNetwork-SVRGの線形収束を確立し、2次損失に対してNetwork-SARAHを確立しました。これにより、データの均質性、ネットワーク接続性、およびローカル平均化が収束率に与える影響が明らかになりました。さらに、非滑らかなペナルティ項を許可することで、Network-DANEを複合最適化に拡張しました。数値的証拠は、通信と計算の効率の両方の点で、競合ベースラインよりも優れたアルゴリズムのパフォーマンスを実証しています。私たちの研究は、反復ごとに慎重に選択された量のローカル通信と計算を実行することで、全体的な効率を大幅に向上できることを示唆しています。
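
The idea of correcting a local gradient estimate through neighbor averaging can be illustrated with plain gradient tracking. The sketch below is not Network-DANE or Network-SVRG/SARAH; it is a minimal, generic gradient-tracking loop on scalar quadratics, and the ring topology, mixing weights `W`, step size, and the names `gradient_tracking`, `a`, `b` are all assumptions made for illustration.

```python
# Generic decentralized gradient tracking on a ring of four agents (illustrative only).
# Agent i holds f_i(x) = 0.5 * a[i] * (x - b[i])**2; the network minimizes sum_i f_i(x).

def gradient_tracking(a, b, W, step=0.05, iters=2000):
    n = len(a)

    def grad(i, x):
        return a[i] * (x - b[i])

    x = [0.0] * n                            # local iterates, one per agent
    s = [grad(i, x[i]) for i in range(n)]    # local estimates of the average gradient
    for _ in range(iters):
        # Average with neighbors, then step along the tracked gradient estimate.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - step * s[i]
                 for i in range(n)]
        # Average the trackers and add the correction: new local gradient minus old one.
        s = [sum(W[i][j] * s[j] for j in range(n)) + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x

# Hypothetical problem data and a doubly stochastic mixing matrix for a 4-agent ring.
a = [1.0, 1.5, 2.0, 1.2]
b = [0.0, 1.0, -1.0, 3.0]
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]

x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)   # minimizer of the global sum
x_final = gradient_tracking(a, b, W)
```

Each agent only reads the entries of `x` and `s` held by its ring neighbors, yet every local iterate in `x_final` should approach the global minimizer `x_star`.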

Variational Inference for Computational Imaging Inverse Problems
コンピュテーショナルイメージング逆問題のための変分推論

Machine learning methods for computational imaging require uncertainty estimation to be reliable in real settings. While Bayesian models offer a computationally tractable way of recovering uncertainty, they need large data volumes to be trained, which in imaging applications implies prohibitively expensive collections with specific imaging instruments. This paper introduces a novel framework to train variational inference for inverse problems that exploits, in combination, a small amount of experimentally collected data, domain expertise, and existing image data sets. In such a way, Bayesian machine learning models can solve imaging inverse problems with minimal data collection efforts. Extensive simulated experiments show the advantages of the proposed framework. The approach is then applied to two real experimental optics settings: holographic image reconstruction and imaging through highly scattering media. In both settings, state-of-the-art reconstructions are achieved with little collection of training data.

計算イメージングのための機械学習法では、実際の設定で不確実性の推定が信頼できることが必要です。ベイズモデルは不確実性を回復するための計算的に扱いやすい方法を提供しますが、トレーニングには大量のデータが必要であり、イメージングアプリケーションでは、特定のイメージング機器を使用した法外に高価なコレクションが必要になります。この論文では、実験的に収集された少数のデータ、ドメインの専門知識、および既存の画像データセットを組み合わせて活用し、逆問題の変分推論をトレーニングするための新しいフレームワークを紹介します。このようにして、ベイズ機械学習モデルは、最小限のデータ収集労力でイメージング逆問題を解決できます。広範囲にわたるシミュレーション実験により、提案されたフレームワークの利点が示されています。次に、このアプローチを、ホログラフィック画像再構成と高度散乱媒体を介したイメージングという2つの実際の実験光学設定に適用します。両方の設定で、最先端の再構成がトレーニングデータをほとんど収集せずに達成されます。

Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
深層マルチエージェント強化学習のための単調値関数分解

In many real-world settings, a team of agents must coordinate its behaviour while acting in a decentralised fashion. At the same time, it is often possible to train the agents in a centralised fashion where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a mixing network that estimates joint action-values as a monotonic combination of per-agent values. We structurally enforce that the joint-action value is monotonic in the per-agent values, through the use of non-negative weights in the mixing network, which guarantees consistency between the centralised and decentralised policies. To evaluate the performance of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a challenging set of SMAC scenarios and show that it significantly outperforms existing multi-agent reinforcement learning methods.

多くの現実世界では、エージェントのチームは分散型で行動しながら、その行動を調整する必要があります。同時に、グローバルな状態情報が利用可能で、通信の制約が解除された集中型でエージェントをトレーニングすることもしばしば可能です。追加の状態情報に条件付けられた共同アクション値を学習することは、集中型学習を活用する魅力的な方法ですが、分散型ポリシーを抽出するための最善の戦略は不明です。私たちの解決策は、分散型ポリシーを集中型のエンドツーエンドでトレーニングできる新しい価値ベースの方法であるQMIXです。QMIXは、共同アクション値をエージェントごとの値の単調な組み合わせとして推定する混合ネットワークを採用しています。混合ネットワークで非負の重みを使用することで、共同アクション値がエージェントごとの値に対して単調であることを構造的に強制し、集中型ポリシーと分散型ポリシー間の一貫性を保証します。QMIXのパフォーマンスを評価するために、深層マルチエージェント強化学習の新しいベンチマークとしてStarCraft Multi-Agent Challenge (SMAC)を提案します。私たちは、一連の困難なSMACシナリオでQMIXを評価し、それが既存のマルチエージェント強化学習手法よりも大幅に優れていることを示します。
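
The monotonicity constraint described above can be checked on a toy example. The following is a minimal sketch, not the QMIX implementation: the fixed weights, biases, and per-agent value tables are hypothetical, and the mixer is reduced to two small layers. In the method as described, the only structural requirement is that the mixing weights are non-negative, which is imitated here with `abs()`.

```python
import itertools
import math

def elu(z):
    # Strictly increasing activation, so it preserves monotonicity.
    return z if z > 0 else math.exp(z) - 1.0

def mix(agent_qs, w1, b1, w2, b2):
    """Two-layer mixer. abs() makes every weight non-negative, so the output
    is non-decreasing in each per-agent value."""
    hidden = [elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + bias)
              for row, bias in zip(w1, b1)]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2

# Hypothetical per-agent action values: two agents, three actions each.
q_tables = [[0.2, 1.3, -0.5], [0.7, -0.1, 0.4]]
# Hypothetical raw mixer parameters (signs are removed inside mix()).
w1, b1 = [[0.5, -1.2], [-0.3, 0.8]], [0.1, -0.2]
w2, b2 = [1.0, -0.7], 0.05

def q_tot(joint_action):
    return mix([q_tables[i][a] for i, a in enumerate(joint_action)], w1, b1, w2, b2)

# Decentralised choice: each agent maximises its own value table.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)
# Centralised choice: brute-force maximisation of the mixed joint value.
best_joint = max(itertools.product(*(range(len(q)) for q in q_tables)), key=q_tot)
```

Because the mixer is increasing in every input, `greedy` and `best_joint` coincide, which is the consistency between decentralised and centralised policies that the abstract refers to.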

Optimal Estimation of Sparse Topic Models
スパーストピックモデルの最適推定

Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consists of the observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$ that is possibly element-wise sparse, and the number of topics $K$ is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of $A$ and our analysis is valid for any finite $n$, $p$, $K$ and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse $A$, and has superior performance in many scenarios of interest.

トピックモデルは、$p$語の語彙の観測頻度を$n$文書で$p\times n$行列に格納したもので構成されるテキストデータの次元削減と探索的分析のための一般的なツールになっています。主な前提は、このデータ行列の平均が、2つの非負行列、つまり$p\times K$語トピック行列$A$と$K\times n$トピック文書行列$W$の積に因数分解できることです。この論文では、要素ごとにスパースである可能性があり、トピックの数$K$が不明な$A$の推定について検討します。このあまり調査されていないコンテキストで、このような$A$の推定の新しいミニマックス下限を導出し、その回復のための計算効率の高い新しいアルゴリズムを提案します。推定量の有限サンプル上限を導出し、多くのシナリオでそれがミニマックス下限と一致することを示します。私たちの推定は、$A$の未知のスパース性に適応し、私たちの分析は有限の$n$、$p$、$K$、および文書の長さに対して有効です。合成データと半合成データの両方に関する実験結果は、提案された推定器が、非スパース$A$とスパース$A$の両方に対して既存の最先端アルゴリズムの強力な競合相手であり、多くの興味深いシナリオで優れたパフォーマンスを発揮することを示しています。

Breaking the Curse of Nonregularity with Subagging — Inference of the Mean Outcome under Optimal Treatment Regimes
サブアギングによる非規則性の呪いの打破:最適治療レジメン下での平均結果の推論

Precision medicine is an emerging medical approach that allows physicians to select the treatment options based on individual patient information. The goal of precision medicine is to identify the optimal treatment regime (OTR) that yields the most favorable clinical outcome. Prior to adopting any OTR in clinical practice, it is crucial to know the impact of implementing such a policy. Although considerable research has been devoted to estimating the OTR in the literature, less attention has been paid to statistical inference of the OTR. Challenges arise in the nonregular cases where the OTR is not uniquely defined. To deal with nonregularity, we develop a novel inference method for the mean outcome under an OTR (the optimal value function) based on subsample aggregating (subagging). The proposed method can be applied to multi-stage studies where treatments are sequentially assigned over time. Bootstrap aggregating (bagging) and subagging have been recognized as effective variance reduction techniques to improve unstable estimators or classifiers (Buhlmann and Yu, 2002). However, it remains unknown whether these approaches can yield valid inference results. We show the proposed confidence interval (CI) for the optimal value function achieves nominal coverage. In addition, due to the variance reduction effect of subagging, our method enjoys certain statistical optimality. Specifically, we show that the mean squared error of the proposed value estimator is strictly smaller than that based on the simple sample-splitting estimator in the nonregular cases. Moreover, under certain conditions, the length of our proposed CI is shown to be on average shorter than CIs constructed based on the existing state-of-the-art method (Luedtke and van der Laan, 2016) and the ‘oracle’ method which works as well as if an OTR were known. Extensive numerical studies are conducted to back up our theoretical findings.

精密医療は、医師が個々の患者情報に基づいて治療オプションを選択できるようにする新しい医療アプローチです。精密医療の目標は、最も好ましい臨床結果をもたらす最適治療計画(OTR)を特定することです。臨床診療でOTRを採用する前に、このようなポリシーを実装した場合の影響を知ることが重要です。文献ではOTRの推定にかなりの研究が費やされてきましたが、OTRの統計的推論にはあまり注意が払われていません。OTRが一意に定義されない非規則的なケースでは課題が生じます。非規則性に対処するために、サブサンプル集約(サブアギング)に基づくOTR (最適値関数)の下での平均結果の新しい推論方法を開発しました。提案された方法は、時間の経過とともに治療が順次割り当てられる多段階研究に適用できます。ブートストラップ集約(バギング)とサブアギングは、不安定な推定量または分類器を改善するための効果的な分散削減手法として認識されています(BuhlmannおよびYu、2002)。しかし、これらのアプローチが有効な推論結果をもたらすことができるかどうかは不明です。最適値関数の提案された信頼区間(CI)が名目上のカバレッジを達成することを示します。さらに、サブアギングの分散削減効果により、私たちの方法は一定の統計的最適性を享受します。具体的には、提案された値推定量の平均二乗誤差は、非正規のケースにおける単純なサンプル分割推定量に基づくものよりも確実に小さいことを示します。さらに、特定の条件下では、提案されたCIの長さは、既存の最先端の方法(Luedtkeおよびvan der Laan、2016)およびOTRが既知である場合と同様に機能する「オラクル」方法に基づいて構築されたCIよりも平均して短いことが示されています。理論的な発見を裏付けるために、広範な数値研究が行われています。
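
Subsample aggregating itself is simple to state in code. The sketch below shows only the generic averaging step, not the paper's inference procedure or its confidence interval; the plug-in estimator `max_arm_mean`, the simulated data, the subsample size, and the number of subsamples are hypothetical choices. Both simulated treatments have the same true mean, which mimics a nonregular case in which the optimal treatment is not unique.

```python
import random
import statistics

def max_arm_mean(sample):
    """Plug-in value estimate: the larger of the per-treatment mean outcomes."""
    arms = {}
    for treatment, outcome in sample:
        arms.setdefault(treatment, []).append(outcome)
    return max(statistics.fmean(values) for values in arms.values())

def subagging(data, estimator, subsample_size, n_subsamples, seed=0):
    """Average an estimator over subsamples drawn without replacement."""
    rng = random.Random(seed)
    estimates = [estimator(rng.sample(data, subsample_size))
                 for _ in range(n_subsamples)]
    return statistics.fmean(estimates)

# Hypothetical data: 40 (treatment, outcome) pairs, both treatments with true mean 0.
data_rng = random.Random(1)
data = [(i % 2, data_rng.gauss(0.0, 1.0)) for i in range(40)]

plain = max_arm_mean(data)                      # estimator on the full sample
subagged = subagging(data, max_arm_mean,        # averaged over 200 half-size subsamples
                     subsample_size=20, n_subsamples=200)
```

Fixing `seed` makes the subsample draws reproducible; with `subsample_size` equal to the full sample size, every subsample is a permutation of the data and the subagged value reduces to `plain`.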

Wide Neural Networks with Bottlenecks are Deep Gaussian Processes
ボトルネックのある広範なニューラルネットワークは深層ガウス過程

There has recently been much work on the “wide limit” of neural networks, where Bayesian neural networks (BNNs) are shown to converge to a Gaussian process (GP) as all hidden layers are sent to infinite width. However, these results do not apply to architectures that require one or more of the hidden layers to remain narrow. In this paper, we consider the wide limit of BNNs where some hidden layers, called “bottlenecks”, are held at finite width. The result is a composition of GPs that we term a “bottleneck neural network Gaussian process” (bottleneck NNGP). Although intuitive, the subtlety of the proof is in showing that the wide limit of a composition of networks is in fact the composition of the limiting GPs. We also analyze theoretically a single-bottleneck NNGP, finding that the bottleneck induces dependence between the outputs of a multi-output network that persists through extreme post-bottleneck depths, and prevents the kernel of the network from losing discriminative power at extreme post-bottleneck depths.

最近、ニューラルネットワークの「広い限界」に関する研究が盛んに行われており、ベイジアンニューラルネットワーク(BNN)は、すべての隠し層が無限の幅に送られると、ガウス過程(GP)に収束することが示されています。ただし、これらの結果は、1つ以上の隠し層を狭いままにする必要があるアーキテクチャには適用されません。この論文では、一部の隠し層(「ボトルネック」と呼ばれる)が有限の幅に保持されるBNNの広い限界について検討します。その結果、GPの構成が生まれ、これを「ボトルネックニューラルネットワークガウス過程」(ボトルネックNNGP)と呼びます。直感的ではありますが、証明の巧妙な点は、ネットワークの構成の広い限界が、実際には制限GPの構成であることを示すことです。また、単一ボトルネックのNNGPを理論的に分析し、ボトルネックによって、ボトルネック後の極度深度まで持続するマルチ出力ネットワークの出力間の依存関係が誘発され、ボトルネック後の極度深度でネットワークのカーネルが識別力を失うのを防ぐことがわかりました。

Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality
固有次元を持つ深層ニューラルネットワークの適応近似と一般化

In this study, we prove that an intrinsic low dimensionality of covariates is the main factor that determines the performance of deep neural networks (DNNs). DNNs generally provide outstanding empirical performance. Hence, numerous studies have actively investigated the theoretical properties of DNNs to understand their underlying mechanisms. In particular, the behavior of DNNs in terms of high-dimensional data is one of the most critical questions. However, this issue has not been sufficiently investigated from the aspect of covariates, although high-dimensional data have practically low intrinsic dimensionality. In this study, we derive bounds for an approximation error and a generalization error regarding DNNs with intrinsically low dimensional covariates. We apply the notion of the Minkowski dimension and develop a novel proof technique. Consequently, we show that convergence rates of the errors by DNNs do not depend on the nominal high dimensionality of data, but on its lower intrinsic dimension. We further prove that the rate is optimal in the minimax sense. We identify an advantage of DNNs by showing that DNNs can handle a broader class of intrinsic low dimensional data than other adaptive estimators. Finally, we conduct a numerical simulation to validate the theoretical results.

この研究では、共変量の本質的な低次元性がディープニューラルネットワーク（DNN）のパフォーマンスを決定する主な要因であることを証明します。DNNは一般的に優れた実証的パフォーマンスを提供します。そのため、多くの研究がDNNの理論的特性を調査し、その基礎となるメカニズムを理解しようとしてきました。特に、高次元データに関するDNNの動作は最も重要な問題の1つです。しかし、高次元データは実質的に低い本質的次元を持っているにもかかわらず、この問題は共変量の観点から十分に調査されていません。この研究では、本質的に低次元の共変量を持つDNNに関する近似誤差と一般化誤差の境界を導出します。ミンコフスキー次元の概念を適用し、新しい証明手法を開発します。その結果、DNNによる誤差の収束率は、データの名目上の高次元性ではなく、その低い本質的次元に依存することを示します。さらに、その率はミニマックスの意味で最適であることを証明します。DNNが他の適応推定器よりも広範なクラスの固有低次元データを処理できることを示すことにより、DNNの利点を特定します。最後に、理論的な結果を検証するための数値シミュレーションを実行します。

Doubly Distributed Supervised Learning and Inference with High-Dimensional Correlated Outcomes
高次元相関結果を伴う二重分散教師あり学習と推論

This paper presents a unified framework for supervised learning and inference procedures using the divide-and-conquer approach for high-dimensional correlated outcomes. We propose a general class of estimators that can be implemented in a fully distributed and parallelized computational scheme. Modeling, computational and theoretical challenges related to high-dimensional correlated outcomes are overcome by dividing data at both outcome and subject levels, estimating the parameter of interest from blocks of data using a broad class of supervised learning procedures, and combining block estimators in a closed-form meta-estimator asymptotically equivalent to estimates obtained by Hansen (1982)’s generalized method of moments (GMM) that does not require the entire data to be reloaded on a common server. We provide rigorous theoretical justifications for the use of distributed estimators with correlated outcomes by studying the asymptotic behaviour of the combined estimator with fixed and diverging number of data divisions. Simulations illustrate the finite sample performance of the proposed method, and we provide an R package for ease of implementation.

この論文では、高次元の相関結果に対して分割統治法を用いた教師あり学習および推論手順の統一フレームワークを提示します。私たちは、完全に分散され並列化された計算スキームで実装できる推定量の一般クラスを提案します。高次元の相関結果に関連するモデリング、計算、理論上の課題は、結果レベルと被験者レベルの両方でデータを分割し、幅広い教師あり学習手順を使用してデータブロックから対象パラメータを推定し、ブロック推定量を、Hansen (1982)の一般化モーメント法(GMM)によって得られた推定値と漸近的に同等な閉形式メタ推定量で結合することにより克服します。これにより、データ全体を共通サーバーに再ロードする必要がなくなります。私たちは、固定数および発散数のデータ分割による結合推定量の漸近的動作を研究することにより、相関結果を持つ分散推定量の使用に対する厳密な理論的正当性を提供します。シミュレーションは提案された方法の有限サンプルのパフォーマンスを示しており、実装を容易にするためにRパッケージを提供しています。

Krylov Subspace Method for Nonlinear Dynamical Systems with Random Noise
ランダムノイズをもつ非線形力学系のためのクリロフ部分空間法

Operator-theoretic analysis of nonlinear dynamical systems has attracted much attention in a variety of engineering and scientific fields, endowed with practical estimation methods using data such as dynamic mode decomposition. In this paper, we address a lifted representation of nonlinear dynamical systems with random noise based on transfer operators, and develop a novel Krylov subspace method for estimating the operators using finite data, with consideration of the unboundedness of operators. For this purpose, we first consider Perron-Frobenius operators with kernel-mean embeddings for such systems. We then extend the Arnoldi method, which is the most classical type of Krylov subspace methods, so that it can be applied to the current case. Meanwhile, the Arnoldi method requires the assumption that the operator is bounded, which is not necessarily satisfied for transfer operators on nonlinear systems. We accordingly develop the shift-invert Arnoldi method for Perron-Frobenius operators to avoid this problem. Also, we describe an approach for evaluating predictive accuracy by estimated operators on the basis of the maximum mean discrepancy, which is applicable, for example, to anomaly detection in complex systems. The empirical performance of our methods is investigated using synthetic and real-world healthcare data.

非線形動的システムの作用素理論的解析は、動的モード分解などのデータを使用した実用的な推定法を備え、さまざまな工学および科学分野で大きな注目を集めています。この論文では、転送作用素に基づくランダムノイズを含む非線形動的システムの持ち上げられた表現を取り上げ、作用素の非有界性を考慮しながら有限データを使用して作用素を推定するための新しいクリロフ部分空間法を開発します。この目的のために、まず、そのようなシステムに対してカーネル平均埋め込みを持つペロン-フロベニウス作用素を検討します。次に、クリロフ部分空間法の最も古典的なタイプであるアーノルディ法を拡張して、現在のケースに適用できるようにします。一方、アーノルディ法では、作用素が有界であるという仮定が必要ですが、これは非線形システム上の転送作用素に対して必ずしも満たされません。そこで、この問題を回避するために、ペロン-フロベニウス作用素のシフト反転アーノルディ法を開発します。また、最大平均差異に基づいて推定演算子による予測精度を評価するアプローチについても説明します。これは、たとえば複雑なシステムにおける異常検出に適用できます。合成データと実際の医療データを使用して、私たちの方法の実証的なパフォーマンスを調査します。
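
The maximum mean discrepancy mentioned above compares two samples through their kernel mean embeddings. As a minimal sketch (not the paper's evaluation pipeline), the code below computes the standard biased estimate of the squared MMD for one-dimensional samples; the Gaussian kernel, its bandwidth, and the sample values are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (len(xs) * len(xs))
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (len(ys) * len(ys))
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Hypothetical samples: observed values and the same values shifted by 3.
observed = [0.0, 0.5, 1.0, 1.5]
shifted = [x + 3.0 for x in observed]

same_score = mmd_squared(observed, observed)    # identical samples
shift_score = mmd_squared(observed, shifted)    # clearly different samples
```

A score near zero indicates that predicted and observed samples are hard to distinguish under the chosen kernel, while a large score flags a mismatch, which is how such a quantity could serve as an anomaly indicator.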

Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success
正則化としてのランダム化:ランダムフォレストの成功に対する自由度の説明

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.

ランダムフォレストは、回帰と分類の両方の設定で予測精度の実績が確立された、最も人気のある市販の教師あり機械学習ツールの1つです。実験的な成功と、統計的特性を調査する最近の多数の研究にもかかわらず、その成功の完全で満足のいく説明はまだ提示されていません。ここでは、個々のツリーに注入された追加のランダム性が暗黙的な正則化の形式として機能し、ランダムフォレストが低い信号対雑音比(SNR)の設定で理想的なモデルになることを示すことで、この方向への一歩を踏み出すことを目指します。具体的には、モデルの複雑さの観点から、ランダムフォレストのmtryパラメーターが、Lassoやリッジ回帰などの明示的に正則化された回帰手順の縮小ペナルティとほぼ同じ目的を果たすことを示します。この点を強調するために、ツリーベースのランダムフォレストの類似物として意図されたランダム化線形モデルベースの前方選択手順を設計し、その驚くほど強力な実験的パフォーマンスを示します。実際のデータと合成データの両方に関する多数のデモンストレーションが提供されます。

Rationally Inattentive Inverse Reinforcement Learning Explains YouTube Commenting Behavior
合理的に不注意な逆強化学習がYouTubeのコメント行動を説明

We consider a novel application of inverse reinforcement learning with behavioral economics constraints to model, learn and predict the commenting behavior of YouTube viewers. Each group of users is modeled as a rationally inattentive Bayesian agent which solves a contextual bandit problem. Our methodology integrates three key components. First, to identify distinct commenting patterns, we use deep embedded clustering to estimate framing information (essential extrinsic features) that clusters users into distinct groups. Second, we present an inverse reinforcement learning algorithm that uses Bayesian revealed preferences to test for rationality: does there exist a utility function that rationalizes the given data, and if yes, can it be used to predict commenting behavior? Finally, we impose behavioral economics constraints stemming from rational inattention to characterize the attention span of groups of users. The test imposes a Renyi mutual information cost constraint which impacts how the agent can select attention strategies to maximize their expected utility. After a careful analysis of a massive YouTube dataset, our surprising result is that in most YouTube user groups, the commenting behavior is consistent with optimizing a Bayesian utility with rationally inattentive constraints. The paper also highlights how the rational inattention model can accurately predict commenting behavior. The massive YouTube dataset and analysis used in this paper are available on GitHub and completely reproducible.

私たちは、YouTube視聴者のコメント行動をモデル化し、学習し、予測するために、行動経済学の制約を伴う逆強化学習の新しいアプリケーションを検討します。各ユーザーグループは、コンテキストバンディット問題を解決する合理的に不注意なベイジアンエージェントとしてモデル化されます。私たちの方法論は、3つの主要なコンポーネントを統合します。まず、明確なコメントパターンを識別するために、ディープエンベデッドクラスタリングを使用して、ユーザーを明確なグループにクラスタ化するフレーミング情報(必須の外的特徴)を推定します。次に、ベイジアン顕示選好を使用して合理性をテストする逆強化学習アルゴリズムを紹介します。つまり、特定のデータを合理化する効用関数が存在するかどうか、存在する場合、コメント行動の予測に使用できるかどうかです。最後に、合理的不注意から生じる行動経済学の制約を課して、ユーザーグループの注意持続時間を特徴付けます。このテストでは、エージェントが期待効用を最大化するために注意戦略を選択する方法に影響を与える、Renyi相互情報量コスト制約を課します。膨大なYouTubeデータセットを注意深く分析した結果、ほとんどのYouTubeユーザーグループで、コメント行動が合理的無注意制約によるベイズ効用を最適化することと一致しているという驚くべき結果が得られました。この論文では、合理的無注意モデルがコメント行動を正確に予測する方法についても強調しています。この論文で使用されている膨大なYouTubeデータセットと分析はGitHubで入手でき、完全に再現可能です。

The Optimal Ridge Penalty for Real-world High-dimensional Data Can Be Zero or Negative due to the Implicit Ridge Regularization
実世界の高次元データに対する最適なリッジペナルティは、暗黙的なリッジ正則化により、ゼロまたは負になる可能性があります

A conventional wisdom in statistical learning is that large models require strong regularization to prevent overfitting. Here we show that this rule can be violated by linear regression in the underdetermined $n\ll p$ situation under realistic conditions. Using simulations and real-life high-dimensional datasets, we demonstrate that an explicit positive ridge penalty can fail to provide any improvement over the minimum-norm least squares estimator. Moreover, the optimal value of ridge penalty in this situation can be negative. This happens when the high-variance directions in the predictor space can predict the response variable, which is often the case in the real-world high-dimensional data. In this regime, low-variance directions provide an implicit ridge regularization and can make any further positive ridge penalty detrimental. We prove that augmenting any linear model with random covariates and using minimum-norm estimator is asymptotically equivalent to adding the ridge penalty. We use a spiked covariance model as an analytically tractable example and prove that the optimal ridge penalty in this case is negative when $n\ll p$.

統計学習における常識は、大規模モデルでは過剰適合を防ぐために強力な正則化が必要であるということです。ここでは、現実的な条件下では、不確定な$n\ll p$状況で線形回帰によってこのルールが破られる可能性があることを示します。シミュレーションと実際の高次元データセットを使用して、明示的な正のリッジペナルティでは最小ノルム最小二乗推定値よりも改善されない可能性があることを実証します。さらに、この状況でのリッジペナルティの最適値は負になる可能性があります。これは、予測子空間の高分散方向が応答変数を予測できる場合に発生し、実際の高次元データではよくあるケースです。この状況では、低分散方向が暗黙的なリッジ正則化を提供し、それ以上の正のリッジペナルティが有害になる可能性があります。任意の線形モデルをランダムな共変量で拡張し、最小ノルム推定値を使用することは、リッジペナルティを追加することと漸近的に同等であることを証明します。解析的に扱いやすい例としてスパイク共分散モデルを使用し、この場合の最適なリッジペナルティは$n\ll p$のときに負であることを証明します。
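
In the underdetermined case the ridge estimator can be written in its dual form, $\hat\beta_\lambda = X^\top (XX^\top + \lambda I)^{-1} y$, which stays well defined for $\lambda = 0$ (the minimum-norm least squares estimator) and for negative $\lambda$ greater than minus the smallest eigenvalue of $XX^\top$. The sketch below only evaluates this formula on a tiny hypothetical design matrix; it does not reproduce the paper's experiments or show when a negative penalty is optimal. The helper names `solve` and `ridge_dual` are illustrative.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def ridge_dual(X, y, lam):
    """beta = X^T (X X^T + lam * I)^{-1} y; lam = 0 gives the minimum-norm interpolator."""
    n, p = len(X), len(X[0])
    gram = [[sum(X[i][k] * X[j][k] for k in range(p)) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, y)
    return [sum(X[i][k] * alpha[i] for i in range(n)) for k in range(p)]

def norm(v):
    return sum(c * c for c in v) ** 0.5

# Hypothetical data with n = 2 samples and p = 4 predictors; X X^T = diag(5, 10).
X = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 3.0]]
y = [1.0, 2.0]

beta_min_norm = ridge_dual(X, y, 0.0)
beta_positive = ridge_dual(X, y, 1.0)
beta_negative = ridge_dual(X, y, -1.0)   # valid here because -1 > -5
fitted = [sum(xk * bk for xk, bk in zip(row, beta_min_norm)) for row in X]
```

The minimum-norm solution interpolates the training responses, a positive penalty shrinks the coefficient norm, and a negative penalty inflates it.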

Convex and Non-Convex Approaches for Statistical Inference with Class-Conditional Noisy Labels
クラス条件付きノイズラベルによる統計的推論のための凸型および非凸型アプローチ

We study the problem of estimation and testing in logistic regression with class-conditional noise in the observed labels, which has an important implication in the Positive-Unlabeled (PU) learning setting. With the key observation that the label noise problem belongs to a special sub-class of generalized linear models (GLM), we discuss convex and non-convex approaches that address this problem. A non-convex approach based on the maximum likelihood estimation produces an estimator with several optimal properties, but a convex approach has an obvious advantage in optimization. We demonstrate that in the low-dimensional setting, both estimators are consistent and asymptotically normal, where the asymptotic variance of the non-convex estimator is smaller than the convex counterpart. We also quantify the efficiency gap which provides insight into when the two methods are comparable. In the high-dimensional setting, we show that both estimation procedures achieve $\ell_2$-consistency at the minimax optimal $\sqrt{s\log p/n}$ rates under mild conditions. Finally, we propose an inference procedure using a de-biasing approach. We validate our theoretical findings through simulations and a real-data example.

私たちは、観測ラベルにクラス条件付きノイズがあるロジスティック回帰における推定と検定の問題を研究します。これは、Positive-Unlabeled (PU)学習設定において重要な意味を持ちます。ラベルノイズ問題は一般化線形モデル(GLM)の特別なサブクラスに属するという重要な観察に基づき、この問題に対処する凸アプローチと非凸アプローチについて説明します。最大尤度推定に基づく非凸アプローチは、いくつかの最適な特性を持つ推定量を生成しますが、凸アプローチには最適化において明らかな利点があります。低次元設定では、両方の推定量が一致しており、漸近的に正規であり、非凸推定量の漸近分散は凸推定量よりも小さいことを示します。また、2つの方法が比較可能な場合についての洞察を提供する効率ギャップを定量化します。高次元設定では、両方の推定手順が、穏やかな条件下でミニマックス最適$\sqrt{s\log p/n}$レートで$\ell_2$一致を達成することを示します。最後に、バイアス除去アプローチを使用した推論手順を提案します。シミュレーションと実際のデータの例を通じて、理論的発見を検証します。

Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes
マルコフ決定過程における効率的なオフポリシー評価のための二重強化学習

Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness.

強化学習におけるオフポリシー評価(OPE)は、コストがかかるか、実現不可能なことが多い探索を行うことなく、新しい意思決定ポリシーを評価することを可能にします。私たちは初めて、行動、報酬、および状態が記憶がないマルコフ決定過程(MDP)におけるOPEのセミパラメトリック効率限界を検討します。既存のOPE推定器は、この設定では効率的ではない可能性があることを示しています。私たちは、$q$関数のクロスフォールド推定と周辺化密度比に基づく新しい推定量を開発し、これを二重強化学習(DRL)と名付けます。DRLは、両方の成分が4乗根率で推定される場合に効率的であり、一方の成分のみが一貫している場合に二重に堅牢であることを示します。これらの特性を経験的に調査し、無記憶を利用することによるパフォーマンス上の利点を実証します。

High Dimensional Forecasting via Interpretable Vector Autoregression
解釈可能ベクトル自己回帰による高次元予測

Vector autoregression (VAR) is a fundamental tool for modeling multivariate time series. However, as the number of component series is increased, the VAR model becomes overparameterized. Several authors have addressed this issue by incorporating regularized approaches, such as the lasso in VAR estimation. Traditional approaches address overparameterization by selecting a low lag order, based on the assumption of short range dependence, assuming that a universal lag order applies to all components. Such an approach constrains the relationship between the components and impedes forecast performance. The lasso-based approaches perform much better in high-dimensional situations but do not incorporate the notion of lag order selection. We propose a new class of hierarchical lag structures (HLag) that embed the notion of lag selection into a convex regularizer. The key modeling tool is a group lasso with nested groups which guarantees that the sparsity pattern of lag coefficients honors the VAR’s ordered structure. The proposed HLag framework offers three basic structures, which allow for varying levels of flexibility, with many possible generalizations. A simulation study demonstrates improved performance in forecasting and lag order selection over previous approaches, and macroeconomic, financial, and energy applications further highlight forecasting improvements as well as HLag’s convenient, interpretable output.

ベクトル自己回帰(VAR)は、多変量時系列をモデル化するための基本的なツールです。ただし、コンポーネント系列の数が増えると、VARモデルは過剰パラメータ化されます。何人かの著者は、VAR推定にLassoなどの正規化アプローチを組み込むことでこの問題に対処してきました。従来のアプローチでは、短距離依存性の仮定に基づいて低いラグ次数を選択し、すべてのコンポーネントに普遍的なラグ次数が適用されると想定することで、過剰パラメータ化に対処しています。このようなアプローチは、コンポーネント間の関係を制約し、予測のパフォーマンスを妨げます。Lassoベースのアプローチは、高次元の状況ではるかに優れたパフォーマンスを発揮しますが、ラグ次数選択の概念が組み込まれていません。ラグ選択の概念を凸正規化子に組み込んだ新しいクラスの階層ラグ構造(HLag)を提案します。主要なモデリングツールは、ラグ係数のスパースパターンがVARの順序構造を尊重することを保証する、ネストされたグループを持つグループLassoです。提案されたHLagフレームワークは、さまざまなレベルの柔軟性と多くの一般化を可能にする3つの基本構造を提供します。シミュレーション研究では、以前のアプローチよりも予測とラグ順序の選択のパフォーマンスが向上していることが実証されており、マクロ経済、金融、エネルギーのアプリケーションでは、予測の改善とHLagの便利で解釈可能な出力がさらに強調されています。
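
A group lasso with nested groups can be sketched for a single VAR equation. The abstract does not give the exact grouping or weighting, so the form below is an assumption made for illustration: group $\ell$ collects all coefficients at lags $\ell, \ell+1, \dots, p$, and the penalty is the sum of the Euclidean norms of these nested groups. The function name and the coefficient values are hypothetical.

```python
import math

def nested_group_penalty(lag_coefs):
    """lag_coefs[l] holds one equation's coefficients at lag l + 1.
    Group l contains lags l + 1, ..., p, so each group is nested in the previous one."""
    total = 0.0
    for start in range(len(lag_coefs)):
        tail = [c for block in lag_coefs[start:] for c in block]
        total += math.sqrt(sum(c * c for c in tail))
    return total

# Two component series, maximum lag p = 3 (hypothetical coefficients).
early = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]   # all weight on lag 1
late = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]    # same weight moved to lag 3

penalty_early = nested_group_penalty(early)    # only the first group is non-zero
penalty_late = nested_group_penalty(late)      # all three groups contain lag 3
```

The same coefficients cost more at a higher lag because they appear in every group that reaches that lag, so a penalty of this shape tends to zero out high lags first, consistent with the ordered sparsity pattern described in the abstract.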

Complete Dictionary Learning via L4-Norm Maximization over the Orthogonal Group
直交群上のL4ノルム最大化による完全辞書学習

This paper considers the fundamental problem of learning a complete (orthogonal) dictionary from samples of sparsely generated signals. Most existing methods solve the dictionary (and sparse representations) based on heuristic algorithms, usually without theoretical guarantees for either optimality or complexity. The recent $\ell^1$-minimization based methods do provide such guarantees but the associated algorithms recover the dictionary one column at a time. In this work, we propose a new formulation that maximizes the $\ell^4$-norm over the orthogonal group, to learn the entire dictionary. We prove that under a random data model, with nearly minimum sample complexity, the global optima of the $\ell^4$-norm are very close to signed permutations of the ground truth. Inspired by this observation, we give a conceptually simple and yet effective algorithm based on matching, stretching, and projection (MSP). The algorithm provably converges locally and the cost per iteration is merely an SVD. In addition to strong theoretical guarantees, experiments show that the new algorithm is significantly more efficient and effective than existing methods, including KSVD and $\ell^1$-based methods. Preliminary experimental results on mixed real imagery data clearly demonstrate advantages of the so-learned dictionary over classic PCA bases.

この論文では、スパースに生成された信号のサンプルから完全な(直交)辞書を学習するという基本的な問題について検討します。既存の方法のほとんどは、通常、最適性または複雑性の理論的な保証なしに、ヒューリスティックアルゴリズムに基づいて辞書(およびスパース表現)を解決します。最近の$\ell^1$最小化に基づく方法は、そのような保証を提供しますが、関連するアルゴリズムは一度に1列ずつ辞書を復元します。この研究では、辞書全体を学習するために、直交グループで$\ell^4$ノルムを最大化する新しい定式化を提案します。サンプル複雑性がほぼ最小のランダムデータモデルでは、$\ell^4$ノルムのグローバル最適値は、グラウンドトゥルースの符号付き順列に非常に近いことを証明します。この観察に触発されて、マッチング、ストレッチ、および投影(MSP)に基づく概念的に単純でありながら効果的なアルゴリズムを示します。このアルゴリズムは、証明可能なように局所的に収束し、反復あたりのコストは単なるSVDです。強力な理論的保証に加えて、実験では、新しいアルゴリズムがKSVDや$\ell^1$ベースの方法を含む既存の方法よりも大幅に効率的で効果的であることが示されています。混合実画像データに対する予備実験の結果は、このように学習された辞書が従来のPCAベースよりも優れていることを明確に示しています。

Cramer-Wold Auto-Encoder
Cramer-Wold オートエンコーダ

The computation of the distance to the true distribution is a key component of most state-of-the-art generative models. Inspired by prior works on the Sliced-Wasserstein Auto-Encoders (SWAE) and the Wasserstein Auto-Encoders with MMD-based penalty (WAE-MMD), we propose a new generative model – a Cramer-Wold Auto-Encoder (CWAE). A fundamental component of CWAE is the characteristic kernel, the construction of which is one of the goals of this paper, from here on referred to as the Cramer-Wold kernel. Its main distinguishing feature is that it has a closed form for the kernel product of radial Gaussians. Consequently, the CWAE model has a closed form for the distance between the posterior and the normal prior, which simplifies the optimization procedure by removing the need to sample in order to compute the loss function. At the same time, CWAE performance often improves upon WAE-MMD and SWAE on standard benchmarks.

真の分布までの距離の計算は、ほとんどの最先端の生成モデルの重要な要素です。Sliced-Wasserstein Auto-Encoders (SWAE)とWasserstein Auto-Encoders with MMD-based penalty (WAE-MMD)に関する先行研究に触発されて、新しい生成モデルであるCramer-Wold Auto-Encoder (CWAE)を提案します。CWAEの基本的な構成要素は特徴的なカーネルであり、その構築は本論文の目標の1つであり、以下、Cramer-Woldカーネルと呼びます。その主な際立った特徴は、放射状ガウスのカーネル積の閉形式を持っていることです。その結果、CWAEモデルは、事後と正規の事前分布との間の距離に対して閉形式を持ち、損失関数を計算するためのサンプリングの必要性を排除し、最適化手順を簡素化します。同時に、CWAEのパフォーマンスは、標準ベンチマークでWAE-MMDとSWAEよりも向上することがよくあります。

Trust-Region Variational Inference with Gaussian Mixture Models
ガウス混合モデルによる信頼領域変分推論

Many methods for machine learning rely on approximate inference from intractable probability distributions. Variational inference approximates such distributions by tractable models that can be subsequently used for approximate inference. Learning sufficiently accurate approximations requires a rich model family and careful exploration of the relevant modes of the target distribution. We propose a method for learning accurate GMM approximations of intractable probability distributions based on insights from policy search by using information-geometric trust regions for principled exploration. For efficient improvement of the GMM approximation, we derive a lower bound on the corresponding optimization objective enabling us to update the components independently. Our use of the lower bound ensures convergence to a stationary point of the original objective. The number of components is adapted online by adding new components in promising regions and by deleting components with negligible weight. We demonstrate on several domains that we can learn approximations of complex, multimodal distributions with a quality that is unmet by previous variational inference methods, and that the GMM approximation can be used for drawing samples that are on par with samples created by state-of-the-art MCMC samplers while requiring up to three orders of magnitude less computational resources.

機械学習の多くの方法は、扱いにくい確率分布からの近似推論に依存しています。変分推論は、そのような分布を扱いやすいモデルで近似し、その後、近似推論に使用できます。十分に正確な近似を学習するには、豊富なモデルファミリと、対象分布の関連モードの注意深い探索が必要です。私たちは、情報幾何学的信頼領域を原理的な探索に使用して、ポリシー検索からの洞察に基づいて扱いにくい確率分布の正確なGMM近似を学習する方法を提案します。GMM近似を効率的に改善するために、対応する最適化目的の下限を導出し、コンポーネントを個別に更新できるようにします。下限を使用すると、元の目的の定常点に収束することが保証されます。コンポーネントの数は、有望な領域に新しいコンポーネントを追加し、無視できる重みを持つコンポーネントを削除することで、オンラインで調整されます。私たちは、いくつかの領域において、従来の変分推論法では達成できなかった品質で、複雑で多峰性の分布の近似値を学習できること、そして、GMM近似値を使用することで、最先端のMCMCサンプラーによって作成されたサンプルと同等のサンプルを、最大3桁少ない計算リソースで抽出できることを実証しました。

Regression with Comparisons: Escaping the Curse of Dimensionality with Ordinal Information
比較による回帰: 順序情報による次元の呪いからの脱出

In supervised learning, we typically leverage a fully labeled dataset to design methods for function estimation or prediction. In many practical situations, we are able to obtain alternative feedback, possibly at a low cost. A broad goal is to understand the usefulness of, and to design algorithms to exploit, this alternative feedback. In this paper, we consider a semi-supervised regression setting, where we obtain additional ordinal (or comparison) information for the unlabeled samples. We consider ordinal feedback of varying qualities where we have either a perfect ordering of the samples, a noisy ordering of the samples or noisy pairwise comparisons between the samples. We provide a precise quantification of the usefulness of these types of ordinal feedback in both nonparametric and linear regression, showing that in many cases it is possible to accurately estimate an underlying function with a very small labeled set, effectively escaping the curse of dimensionality. We also present lower bounds, that establish fundamental limits for the task and show that our algorithms are optimal in a variety of settings. Finally, we present extensive experiments on new datasets that demonstrate the efficacy and practicality of our algorithms and investigate their robustness to various sources of noise and model misspecification.

教師あり学習では、通常、完全にラベル付けされたデータセットを利用して、関数の推定や予測の方法を設計します。多くの実際の状況では、代替フィードバックを低コストで取得できます。大まかな目標は、この代替フィードバックの有用性を理解し、それを活用するアルゴリズムを設計することです。この論文では、ラベル付けされていないサンプルの追加の順序(または比較)情報を取得する半教師あり回帰設定を検討します。サンプルの完全な順序、サンプルのノイズの多い順序、またはサンプル間のノイズの多いペアワイズ比較のいずれかがあるさまざまな品質の順序フィードバックを検討します。ノンパラメトリック回帰と線形回帰の両方でこれらのタイプの順序フィードバックの有用性を正確に定量化し、多くの場合、非常に小さなラベル付きセットで基礎となる関数を正確に推定し、次元の呪いから効果的に逃れることができることを示します。また、タスクの基本的な制限を確立し、さまざまな設定でアルゴリズムが最適であることを示す下限も示します。最後に、私たちのアルゴリズムの有効性と実用性を実証し、さまざまなノイズ源やモデルの誤指定に対する堅牢性を調査する新しいデータセットでの広範な実験を紹介します。

apricot: Submodular selection for data summarization in Python
アプリコット:Pythonでのデータ要約のためのサブモジュラー選択

We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements several efficient greedy selection algorithms that offer strong theoretical guarantees on the quality of the selected set. Additionally, several submodular set functions are implemented, including facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, using both algorithmic speedups such as the lazy greedy algorithm and memoization as well as code optimization using numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of submodular selection, an overview of the features in apricot, and applications to two data sets.

私たちは、サブモジュラ最適化を使用して大規模なデータセットから代表的なサブセットを選択するオープンソースのPythonパッケージ、apricotを紹介します。このパッケージは、選択されたセットの品質について強力な理論的保証を提供する、いくつかの効率的な貪欲選択アルゴリズムを実装しています。さらに、広く適用できるがデータセット内の例の数の2乗のメモリを必要とするfacility locationや、適用範囲は狭いが数百万の例に拡張できる機能ベースの関数など、いくつかのサブモジュラセット関数が実装されています。Apricotは、遅延貪欲アルゴリズムやメモ化などのアルゴリズムによる高速化と、numbaを使用したコード最適化の両方を使用して、非常に効率的です。私たちは、データセット全体またはその代表的なサブセットを使用して、機械学習モデルを同等の精度でトレーニングすることにより、サブセット選択の使用方法を示します。この論文では、サブモジュラ選択の説明、apricotの機能の概要、および2つのデータセットへの適用を示します。
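
The facility location objective mentioned in the abstract can be illustrated with a naive greedy loop. The sketch below is a minimal pure-Python illustration and does not use the apricot API. The function names, the similarity measure `1 / (1 + squared distance)`, and the toy points are assumptions chosen for the example. It also omits the lazy greedy speedup the abstract refers to.

```python
def similarity_matrix(points):
    """Pairwise similarities 1 / (1 + squared Euclidean distance); an assumed choice."""
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = 1.0 / (1.0 + sq)
    return sim


def facility_location_greedy(sim, k):
    """Greedily pick k indices to maximize sum_i max_{j in S} sim[i][j]."""
    n = len(sim)
    best = [0.0] * n   # best[i]: highest similarity of example i to the selected set
    selected = []
    for _ in range(k):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters of three points each (indices 0-2 and 3-5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
sim = similarity_matrix(points)   # n x n matrix: memory is quadratic in n
subset = facility_location_greedy(sim, 2)
print(subset)
```

With two well-separated clusters, the greedy rule picks one representative from each, since a second point from an already covered cluster adds almost no gain. The full `n x n` similarity matrix is what makes the memory cost quadratic in the number of examples.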

Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective
ロバストな散布推定のための敵対的生成ネット:適切なスコアリングルールの観点

Robust covariance matrix estimation is a fundamental task in statistics. The recent discovery on the connection between robust estimation and generative adversarial nets (GANs) suggests that it is possible to compute depth-like robust estimators using similar techniques that optimize GANs. In this paper, we introduce a general learning via classification framework based on the notion of proper scoring rules. This framework allows us to understand both matrix depth function, a technique of rate-optimal robust estimation, and various GANs through the lens of variational approximations of $f$-divergences induced by proper scoring rules. We then propose a new class of robust covariance matrix estimators in this framework by carefully constructing discriminators with appropriate neural network structures. These estimators are proved to achieve the minimax rate of covariance matrix estimation under Huber’s contamination model. The results are also extended to robust scatter estimation for elliptical distributions. Our numerical results demonstrate the good performance of the proposed procedures under various settings against competitors in the literature.

ロバストな共分散行列推定は、統計学における基本的なタスクです。ロバスト推定と生成的敵対的ネット(GAN)の関係に関する最近の発見は、GANを最適化する同様の手法を使用して、深さのようなロバスト推定量を計算できることを示唆しています。この論文では、適切なスコアリングルールの概念に基づく分類フレームワークを介した一般的な学習を紹介します。このフレームワークにより、適切なスコアリングルールによって誘導される$f$ダイバージェンスの変分近似の観点から、行列深度関数、レート最適ロバスト推定の手法、およびさまざまなGANの両方を理解できます。次に、適切なニューラルネットワーク構造を持つ識別器を慎重に構築することにより、このフレームワークで新しいクラスのロバストな共分散行列推定量を提案します。これらの推定量は、Huberの汚染モデルの下で共分散行列推定のミニマックスレートを達成することが証明されています。結果は、楕円分布のロバストな散布推定にも拡張されます。数値結果は、さまざまな設定の下で、文献の競合相手に対して提案された手順の優れたパフォーマンスを示しています。

Generating Weighted MAX-2-SAT Instances with Frustrated Loops: an RBM Case Study
フラストレーションループによる重み付き MAX-2-SAT インスタンスの生成: RBM ケーススタディ

Many optimization problems can be cast into the maximum satisfiability (MAX-SAT) form, and many solvers have been developed for tackling such problems. To evaluate a MAX-SAT solver, it is convenient to generate hard MAX-SAT instances with known solutions. Here, we propose a method of generating weighted MAX-2-SAT instances inspired by the frustrated-loop algorithm used by the quantum annealing community. We extend the algorithm for instances of general bipartite couplings, with the associated optimization problem being the minimization of the restricted Boltzmann machine (RBM) energy over the nodal values, which is useful for effectively pre-training the RBM. The hardness of the generated instances can be tuned through a central parameter known as the frustration index. Two versions of the algorithm are presented: the random- and structured-loop algorithms. For the random-loop algorithm, we provide a thorough theoretical and empirical analysis of its mathematical properties from the perspective of frustration, and observe empirically a double phase transition in the hardness scaling driven by the frustration index. For the structured-loop algorithm, we show that it offers an improvement in hardness over the random-loop algorithm in the regime of high loop density, with the variation of hardness tunable through the concentration of frustrated weights.

多くの最適化問題は、最大満足性(MAX-SAT)形式に変換することができ、そのような問題に取り組むための多くのソルバーが開発されています。MAXSATソルバーを評価するには、既知の解を持つ難しいMAX-SATインスタンスを生成すると便利です。ここでは、量子アニーリングコミュニティで使用されているフラストレーションループアルゴリズムにヒントを得た、重み付きMAX-2-SATインスタンスを生成する方法を提案します。このアルゴリズムを一般的な二部結合のインスタンスに拡張します。関連する最適化問題は、ノード値に対する制限付きボルツマンマシン(RBM)エネルギーの最小化であり、RBMを効果的に事前トレーニングするのに役立ちます。生成されたインスタンスの難しさは、フラストレーションインデックスと呼ばれる中心パラメーターによって調整できます。このアルゴリズムには、ランダムループアルゴリズムと構造化ループアルゴリズムの2つのバージョンがあります。ランダムループアルゴリズムについては、フラストレーションの観点からその数学的特性について徹底的な理論的および経験的分析を行い、フラストレーションインデックスによって駆動される困難さのスケーリング動作における二重相転移動作を経験的に観察します。構造化ループアルゴリズムについては、高ループ密度の領域でランダムループアルゴリズムよりも困難さが向上し、困難さの変化はフラストレーションウェイトの集中によって調整可能であることを示します。

Learning Big Gaussian Bayesian Networks: Partition, Estimation and Fusion
ビッグガウスベイジアンネットワークの学習: 分割、推定、融合

Structure learning of Bayesian networks has always been a challenging problem. Nowadays, massive-size networks with thousands or more of nodes but fewer samples frequently appear in many areas. We develop a divide-and-conquer framework, called partition-estimation-fusion (PEF), for structure learning of such big networks. The proposed method first partitions nodes into clusters, then learns a subgraph on each cluster of nodes, and finally fuses all learned subgraphs into one Bayesian network. The PEF method is designed in a flexible way so that any structure learning method may be used in the second step to learn a subgraph structure as either a DAG or a CPDAG. In the clustering step, we adapt hierarchical clustering to automatically choose a proper number of clusters. In the fusion step, we propose a novel hybrid method that sequentially adds edges between subgraphs. Extensive numerical experiments demonstrate the competitive performance of our PEF method, in terms of both speed and accuracy compared to existing methods. Our method can improve the accuracy of structure learning by 20% or more, while reducing running time up to two orders-of-magnitude.

ベイジアンネットワークの構造学習は、常に難しい問題でした。今日では、数千以上のノードがあり、サンプルが少ない大規模なネットワークが多くの分野で頻繁に登場しています。私たちは、このような大規模なネットワークの構造学習のために、分割統治フレームワークであるパーティション推定融合(PEF)を開発しました。提案された方法は、最初にノードをクラスターに分割し、次に各ノードのクラスターでサブグラフを学習し、最後に学習したすべてのサブグラフを1つのベイジアンネットワークに融合します。PEF法は、2番目のステップで任意の構造学習方法を使用して、サブグラフ構造をDAGまたはCPDAGとして学習できるように、柔軟に設計されています。クラスタリングステップでは、適切な数のクラスターを自動的に選択するために階層的クラスタリングを適用します。融合ステップでは、サブグラフ間にエッジを順次追加する新しいハイブリッド方法を提案します。広範な数値実験により、既存の方法と比較して、速度と精度の両方の点で、私たちのPEF法の競争力のあるパフォーマンスが実証されています。私たちの方法は、構造学習の精度を20%以上向上させながら、実行時間を最大2桁短縮することができます。

Streamlined Variational Inference with Higher Level Random Effects
高水準の変量効果による合理化された変分推論

We derive and present explicit algorithms to facilitate streamlined computing for variational inference for models containing higher level random effects. Existing literature is such that streamlined variational inference is restricted to mean field variational Bayes algorithms for two-level random effects models. Here we provide the following extensions: (1) explicit Gaussian response mean field variational Bayes algorithms for three-level models, (2) explicit algorithms for the alternative variational message passing approach in the case of two-level and three-level models, and (3) an explanation of how arbitrarily high levels of nesting can be handled based on the recently published matrix algebraic results of the authors. A pay-off from (2) is simple extension to non-Gaussian response models. In summary, we remove barriers for streamlining variational inference algorithms based on either the mean field variational Bayes approach or the variational message passing approach when higher level random effects are present.

私たちは、高レベルのランダム効果を含むモデルに対する変分推論の効率的な計算を容易にする明示的なアルゴリズムを導出し、提示します。既存の文献によると、効率的な変分推論は、2レベルランダム効果モデルの平均場変分ベイズアルゴリズムに限定されています。ここでは、次の拡張を提供する: (1) 3レベルモデルに対する明示的なガウス応答平均場変分ベイズアルゴリズム、(2) 2レベルおよび3レベルモデルの場合の代替変分メッセージパッシングアプローチに対する明示的なアルゴリズム、(3)著者らが最近発表した行列代数の結果に基づいて、任意の高レベルのネストを処理する方法の説明。(2)のメリットは、非ガウス応答モデルへの単純な拡張です。要約すると、高レベルのランダム効果が存在する場合に、平均場変分ベイズアプローチまたは変分メッセージパッシングアプローチのいずれかに基づく変分推論アルゴリズムを効率化するための障壁を取り除くことができます。

Asymptotic Consistency of α-Renyi-Approximate Posteriors
α-Renyi近似事後分布の漸近的一貫性

We study the asymptotic consistency properties of $\alpha$-Rényi approximate posteriors, a class of variational Bayesian methods that approximate an intractable Bayesian posterior with a member of a tractable family of distributions, the member chosen to minimize the $\alpha$-Rényi divergence from the true posterior. Unique to our work is that we consider settings with $\alpha > 1$, resulting in approximations that upper-bound the log-likelihood, and consequently have wider spread than traditional variational approaches that minimize the Kullback-Leibler (KL) divergence from the posterior. Our primary result identifies sufficient conditions under which consistency holds, centering around the existence of a 'good' sequence of distributions in the approximating family that possesses, among other properties, the right rate of convergence to a limit distribution. We further characterize the good sequence by demonstrating that a sequence of distributions that converges too quickly cannot be a good sequence. We also extend our analysis to the setting where $\alpha$ equals one, corresponding to the minimizer of the reverse KL divergence, and to models with local latent variables. We also illustrate the existence of a good sequence with a number of examples. Our results complement a growing body of work focused on the frequentist properties of variational Bayesian methods.

私たちは、変分ベイズ法の一種である$\alpha$-Rényi近似事後分布の漸近的一貫性特性について研究します。この変分ベイズ法は扱いにくいベイズ事後分布を扱いやすい分布族のメンバーで近似するものであり、そのメンバーは真の事後分布からの$\alpha$-Rényiダイバージェンスを最小化するように選択されます。我々の研究のユニークな点は、$\alpha > 1$の設定を考慮し、対数尤度の上限となる近似値が得られる点であり、その結果、事後分布からのKullback-Leibler (KL)ダイバージェンスを最小化する従来の変分アプローチよりも広がりの大きい近似になります。我々の主な結果は、近似族内に、他の特性の中でも特に極限分布への適切な収束率を持つ「良い」分布列が存在することを中心に、一貫性が保たれる十分な条件を特定します。我々はさらに、収束が速すぎる分布列は良い列ではないことを実証することで、良い列を特徴付けます。また、逆KLダイバージェンスの最小値に対応する$\alpha$が1に等しい設定や、ローカル潜在変数を持つモデルに分析を拡張します。また、いくつかの例を使用して、良い列の存在を示します。私たちの結果は、変分ベイズ法の頻度主義的特性に焦点を当てた、増え続ける研究を補完するものです。

Estimate Sequences for Stochastic Composite Optimization: Variance Reduction, Acceleration, and Robustness to Noise
確率的複合最適化のための推定シーケンス: 分散低減、加速、ノイズに対するロバスト性

In this paper, we propose a unified view of gradient-based algorithms for stochastic convex composite optimization by extending the concept of estimate sequence introduced by Nesterov. More precisely, we interpret a large class of stochastic optimization methods as procedures that iteratively minimize a surrogate of the objective, which covers the stochastic gradient descent method and variants of the incremental approaches SAGA, SVRG, and MISO/Finito/SDCA. This point of view has several advantages: (i) we provide a simple generic proof of convergence for all of the aforementioned methods; (ii) we naturally obtain new algorithms with the same guarantees; (iii) we derive generic strategies to make these algorithms robust to stochastic noise, which is useful when data is corrupted by small random perturbations. Finally, we propose a new accelerated stochastic gradient descent algorithm and a new accelerated SVRG algorithm that is robust to stochastic noise.

この論文では、Nesterovによって導入された推定シーケンスの概念を拡張することにより、確率的凸複合最適化のための勾配ベースのアルゴリズムの統一的なビューを提案します。より正確には、確率的最適化手法の大規模なクラスを、確率的勾配降下法とインクリメンタルアプローチSAGA、SVRG、およびMISO/Finito/SDCAのバリアントをカバーする、目的の代理を反復的に最小化する手順として解釈します。この観点にはいくつかの利点があります:(i)前述のすべての方法に対して、収束の単純な一般的な証明を提供します。(ii)私たちは自然に同じ保証で新しいアルゴリズムを取得します。(iii)これらのアルゴリズムを確率的ノイズに対して堅牢にするための一般的な戦略を導き出し、これはデータが小さなランダム摂動によって破損する場合に役立ちます。最後に、新しい加速確率的勾配降下アルゴリズムと、確率的ノイズに対してロバストな新しい加速SVRGアルゴリズムを提案します。

Learning from Binary Multiway Data: Probabilistic Tensor Decomposition and its Statistical Optimality
バイナリ多元データからの学習:確率的テンソル分解とその統計的最適性

We consider the problem of decomposing a higher-order tensor with binary entries. Such data problems arise frequently in applications such as neuroimaging, recommendation system, topic modeling, and sensor network localization. We propose a multilinear Bernoulli model, develop a rank-constrained likelihood-based estimation method, and obtain the theoretical accuracy guarantees. In contrast to continuous-valued problems, the binary tensor problem exhibits an interesting phase transition phenomenon according to the signal-to-noise ratio. The error bound for the parameter tensor estimation is established, and we show that the obtained rate is minimax optimal under the considered model. Furthermore, we develop an alternating optimization algorithm with convergence guarantees. The efficacy of our approach is demonstrated through both simulations and analyses of multiple data sets on the tasks of tensor completion and clustering.

私たちは、バイナリエントリを持つ高次テンソルを分解する問題を考えます。このようなデータの問題は、ニューロイメージング、レコメンデーションシステム、トピックモデリング、センサーネットワークのローカリゼーションなどのアプリケーションで頻繁に発生します。多重線形ベルヌーイモデルを提案し、ランク制約付き尤度ベースの推定法を開発し、理論的な精度保証を取得します。連続値の問題とは対照的に、バイナリテンソル問題は、信号対雑音比に応じて興味深い相転移現象を示します。パラメータテンソル推定の誤差範囲が確立され、考慮されたモデルの下で得られたレートがミニマックス最適であることを示します。さらに、収束保証付きの交互最適化アルゴリズムを開発します。私たちのアプローチの有効性は、テンソル補完とクラスタリングのタスクに関する複数のデータセットのシミュレーションと分析の両方を通じて実証されています。

Spectral Algorithms for Community Detection in Directed Networks
有向ネットワークにおけるコミュニティ検出のためのスペクトルアルゴリズム

Community detection in large social networks is affected by degree heterogeneity of nodes. The D-SCORE algorithm for directed networks was introduced to reduce this effect by taking the element-wise ratios of the singular vectors of the adjacency matrix before clustering. Meaningful results were obtained for the statistician citation network, but a rigorous analysis of its performance was missing. First, this paper establishes theoretical guarantees for this algorithm and its variants for the directed degree-corrected block model (Directed-DCBM). Second, this paper provides significant improvements for the original D-SCORE algorithms by attaching the nodes outside of the community cores using the information of the original network instead of the singular vectors.

大規模なソーシャルネットワークでのコミュニティ検出は、ノードの次数不均一性の影響を受けます。有向ネットワークのD-SCOREアルゴリズムは、クラスタリングの前に隣接行列の特異ベクトルの要素ごとの比率を取ることにより、この影響を減らすために導入されました。統計学者引用ネットワークについては意味のある結果が得られましたが、そのパフォーマンスに関する厳密な分析は欠落していました。まず、この論文では、このアルゴリズムとそのバリアントについて、有向度補正ブロックモデル(Directed-DCBM)の理論的保証を確立します。次に、この論文では、特異ベクトルの代わりに元のネットワークの情報を使用してコミュニティコアの外側にあるノードをアタッチすることにより、元のD-SCOREアルゴリズムを大幅に改善します。

Dual Iterative Hard Thresholding
デュアル反復ハードしきい値処理

Iterative Hard Thresholding (IHT) is a popular class of first-order greedy selection methods for loss minimization under cardinality constraint. The existing IHT-style algorithms, however, are proposed for minimizing the primal formulation. It is still an open issue to explore duality theory and algorithms for such a non-convex and NP-hard combinatorial optimization problem. To address this issue, we develop in this article a novel duality theory for $\ell_2$-regularized empirical risk minimization under cardinality constraint, along with an IHT-style algorithm for dual optimization. Our sparse duality theory establishes a set of sufficient and/or necessary conditions under which the original non-convex problem can be equivalently or approximately solved in a concave dual formulation. In view of this theory, we propose the Dual IHT (DIHT) algorithm as a super-gradient ascent method to solve the non-smooth dual problem with provable guarantees on primal-dual gap convergence and sparsity recovery. Numerical results confirm our theoretical predictions and demonstrate the superiority of DIHT to the state-of-the-art primal IHT-style algorithms in model estimation accuracy and computational efficiency.

反復ハード閾値法(IHT)は、基数制約下での損失最小化のための1次貪欲選択法の一般的なクラスです。ただし、既存のIHTスタイルのアルゴリズムは、主定式化を最小化するために提案されています。このような非凸でNP困難な組み合わせ最適化問題に対する双対性理論とアルゴリズムの探求は、まだ未解決の問題です。この問題に対処するために、この論文では、基数制約下での$\ell_2$正則化経験的リスク最小化の新しい双対性理論と、双対最適化のためのIHTスタイルのアルゴリズムを開発します。私たちのスパース双対性理論は、元の非凸問題を凹双対定式化で同等または近似的に解決できる十分な条件と/または必要な条件のセットを確立します。この理論を考慮して、主双対ギャップ収束とスパース回復の証明可能な保証を備えた非滑らかな双対問題を解くための超勾配上昇法として、デュアルIHT (DIHT)アルゴリズムを提案します。数値結果は理論的予測を裏付け、モデル推定精度と計算効率の点でDIHTが最先端のプライマルIHTスタイルアルゴリズムよりも優れていることを示しています。
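
For orientation, the primal IHT baseline that the abstract contrasts with can be sketched as a gradient step followed by hard thresholding. This is a minimal sketch of plain primal IHT for least squares, not the paper's Dual IHT (DIHT). The function names, the step size, and the toy design matrix with orthogonal columns are assumptions made so that the example converges cleanly.

```python
def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and set the rest to zero."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:k])
    return [w[i] if i in keep else 0.0 for i in range(len(w))]


def iht_least_squares(X, y, k, step, iters=100):
    """Primal IHT for min_w (1/2n)||Xw - y||^2 subject to ||w||_0 <= k."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(d)]
        # Gradient step, then projection onto the set of k-sparse vectors.
        w = hard_threshold([w[j] - step * grad[j] for j in range(d)], k)
    return w


# Toy design with orthogonal +/-1 columns (assumed for the example).
X = [[1.0, 1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0, -1.0],
     [1.0, 1.0, -1.0, -1.0],
     [1.0, -1.0, -1.0, 1.0]]
w_true = [2.0, 0.0, -1.0, 0.0]
y = [sum(row[j] * w_true[j] for j in range(4)) for row in X]

w_hat = iht_least_squares(X, y, k=2, step=0.5)
print(w_hat)
```

On this noiseless, well-conditioned toy problem the iterates recover the 2-sparse vector. In general, recovery guarantees for IHT depend on conditions on the design that this sketch does not check.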

Robust Reinforcement Learning with Bayesian Optimisation and Quadrature
ベイズ最適化と求積法によるロバストな強化学習

Bayesian optimisation has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are unobservable and randomly determined by the environment in a physical setting but are controllable in a simulator. This article considers the problem of finding a robust policy while taking into account the impact of environment variables. We present Alternating Optimisation and Quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to address such settings. We also present Transferable ALOQ (TALOQ), for settings where simulator inaccuracies lead to difficulty in transferring the learnt policy to the physical system. We show that our algorithms are robust to the presence of significant rare events, which may not be observable under random sampling but play a substantial role in determining the optimal policy. Experimental results across different domains show that our algorithms learn robust policies efficiently.

ベイズ最適化は、さまざまな強化学習の問題にうまく適用されてきました。しかし、シミュレータで最適なポリシーを学習する従来のアプローチでは、特定の環境変数を調整することで学習を改善する機会が活用されていません。特定の環境変数とは、物理的な設定では観測できず、環境によってランダムに決定されるが、シミュレータでは制御可能な状態特性です。この記事では、環境変数の影響を考慮しながら堅牢なポリシーを見つける問題について検討します。このような設定に対処するためにベイズ最適化とベイズ求積法を使用する交互最適化および求積法(ALOQ)を紹介します。また、シミュレータの不正確さにより学習したポリシーを物理システムに転送することが困難な設定に対しては、転送可能なALOQ (TALOQ)を紹介します。私たちのアルゴリズムは、ランダムサンプリングでは観測できないが、最適なポリシーを決定する上で重要な役割を果たす、重要なまれなイベントの存在に対して堅牢であることを示します。さまざまなドメインにわたる実験結果は、私たちのアルゴリズムが堅牢なポリシーを効率的に学習することを示しています。

The Kalai-Smorodinsky solution for many-objective Bayesian optimization
多目的ベイズ最適化のためのカライ・スモロジンスキー解

An ongoing aim of research in multiobjective Bayesian optimization is to extend its applicability to a large number of objectives. While coping with a limited budget of evaluations, recovering the set of optimal compromise solutions generally requires numerous observations and is less interpretable since this set tends to grow larger with the number of objectives. We thus propose to focus on a specific solution originating from game theory, the Kalai-Smorodinsky solution, which possesses attractive properties. In particular, it ensures equal marginal gains over all objectives. We further make it insensitive to a monotonic transformation of the objectives by considering the objectives in the copula space. A novel tailored algorithm is proposed to search for the solution, in the form of a Bayesian optimization algorithm: sequential sampling decisions are made based on acquisition functions that derive from an instrumental Gaussian process prior. Our approach is tested on four problems with respectively four, six, eight, and nine objectives. The method is available in the R package GPGame.

多目的ベイズ最適化の研究の継続的な目標は、その適用範囲を多数の目的に拡張することです。限られた評価予算で対処しながら、最適な妥協解のセットを回復するには、通常、多数の観察が必要であり、このセットは目的の数とともに大きくなる傾向があるため、解釈が困難になります。そこで、ゲーム理論に由来する特定のソリューション、魅力的な特性を持つKalai-Smorodinskyソリューションに焦点を当てることを提案します。特に、このソリューションはすべての目的に対して均等な限界利益を保証します。さらに、コピュラ空間内の目的を考慮することで、目的の単調な変換に対して鈍感にします。ベイズ最適化アルゴリズムの形でソリューションを検索するための新しいカスタマイズされたアルゴリズムが提案されています。これは、道具的ガウス過程事前分布から派生した取得関数に基づいて、順次サンプリングの決定が行われます。私たちのアプローチは、それぞれ4、6、8、9の目的を持つ4つの問題でテストされています。この方法は、RパッケージGPGameで利用できます。
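
On a finite set of candidate objective vectors, a Kalai-Smorodinsky-style choice can be approximated by maximizing the smallest normalized gain across objectives. The sketch below assumes minimization, takes the component-wise best values as the ideal point, and takes the component-wise worst values among the candidates as the disagreement point. These conventions and the function name are illustrative assumptions. The paper's Bayesian optimization algorithm and copula transformation are not reproduced here.

```python
def kalai_smorodinsky_index(points):
    """Index of the candidate maximizing the minimum normalized gain (minimization)."""
    m = len(points[0])
    ideal = [min(p[i] for p in points) for i in range(m)]
    disagreement = [max(p[i] for p in points) for i in range(m)]

    def worst_gain_ratio(p):
        ratios = []
        for i in range(m):
            span = disagreement[i] - ideal[i]
            if span > 0:  # skip objectives that are constant over the candidates
                ratios.append((disagreement[i] - p[i]) / span)
        return min(ratios) if ratios else 0.0

    return max(range(len(points)), key=lambda j: worst_gain_ratio(points[j]))


# Five candidate trade-offs between two objectives to be minimized.
candidates = [(0.0, 1.0), (0.1, 0.6), (0.4, 0.4), (0.7, 0.1), (1.0, 0.0)]
ks_index = kalai_smorodinsky_index(candidates)
print(ks_index, candidates[ks_index])
```

The selected point is the one whose gains relative to the disagreement point are most balanced across objectives, which is the discrete analogue of the equal-gains property described in the abstract.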

Distributionally Ambiguous Optimization for Batch Bayesian Optimization
バッチ・ベイズ最適化のための分布的にあいまいな最適化

We propose a novel, theoretically-grounded, acquisition function for Batch Bayesian Optimization informed by insights from distributionally ambiguous optimization. Our acquisition function is a lower bound on the well-known Expected Improvement function, which requires evaluation of a Gaussian expectation over a multivariate piecewise affine function. Our bound is computed instead by evaluating the best-case expectation over all probability distributions consistent with the same mean and variance as the original Gaussian distribution. Unlike alternative approaches, including Expected Improvement, our proposed acquisition function avoids multi-dimensional integrations entirely, and can be computed exactly – even on large batch sizes – as the solution of a tractable convex optimization problem. Our suggested acquisition function can also be optimized efficiently, since first and second derivative information can be calculated inexpensively as by-products of the acquisition function calculation itself. We derive various novel theorems that ground our work theoretically and we demonstrate superior performance via simple motivating examples, benchmark functions and real-world problems.

私たちは、分布的に曖昧な最適化からの洞察に基づく、バッチベイズ最適化のための新しい理論的根拠のある獲得関数を提案します。我々の獲得関数は、よく知られている期待改善関数の下限であり、多変量区分アフィン関数に対するガウス期待値の評価を必要とします。代わりに、我々の境界は、元のガウス分布と同じ平均と分散と一致するすべての確率分布に対する最善の期待値を評価することによって計算されます。期待改善などの代替アプローチとは異なり、我々が提案する獲得関数は、多次元積分を完全に回避し、扱いやすい凸最適化問題の解として、バッチサイズが大きい場合でも正確に計算できます。我々が提案する獲得関数は、獲得関数計算自体の副産物として1次および2次導関数情報を安価に計算できるため、効率的に最適化することもできます。私たちは、我々の研究を理論的に根拠付けるさまざまな新しい定理を導き出し、簡単な動機付けの例、ベンチマーク関数、および実際の問題を通じて優れたパフォーマンスを実証します。
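
For context, the classical single-point Expected Improvement has a closed form under a Gaussian predictive distribution. The difficulty discussed in the abstract arises in the batch case, where a multivariate expectation is needed. The sketch below shows only the single-point formula under a maximization convention, which is an assumption of this example. It is not the acquisition function proposed in the paper.

```python
import math


def expected_improvement(mu, sigma, best):
    """E[max(Y - best, 0)] for Y ~ N(mu, sigma^2), maximization convention."""
    if sigma <= 0.0:
        # Degenerate predictive distribution: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


ei_at_incumbent = expected_improvement(mu=0.0, sigma=1.0, best=0.0)
print(ei_at_incumbent)
```

When the predictive mean equals the incumbent and the standard deviation is 1, the value reduces to the standard normal density at zero, `1 / sqrt(2 * pi)`.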

Local Causal Network Learning for Finding Pairs of Total and Direct Effects
全体効果と直接効果のペアを見つけるための局所因果ネットワーク学習

In observational studies, it is important to evaluate not only the total effect but also the direct and indirect effects of a treatment variable on a response variable. In terms of local structural learning of causal networks, we try to find all possible pairs of total and direct causal effects, which can further be used to calculate indirect causal effects. An intuitive global learning approach is first to find an essential graph over all variables representing all Markov equivalent causal networks, and then enumerate all equivalent networks and estimate a pair of the total and direct effects for each of them. However, it could be inefficient to learn an essential graph and enumerate equivalent networks when the true causal graph is large. In this paper, we propose a local learning approach instead. In the local learning approach, we first learn locally a chain component containing the treatment. Then, if necessary, we learn locally a chain component containing the response. Next, we locally enumerate all possible pairs of the treatment’s parents and the response’s parents. Finally based on these pairs, we find all possible pairs of total and direct effects of the treatment on the response.

観察研究では、総効果だけでなく、治療変数が応答変数に及ぼす直接的および間接的な効果も評価することが重要です。因果ネットワークのローカル構造学習の観点から、総因果効果と直接因果効果のすべての可能なペアを見つけようとします。これは、間接因果効果を計算するためにさらに使用できます。直感的なグローバル学習アプローチは、最初にすべてのマルコフ等価因果ネットワークを表すすべての変数にわたる必須グラフを見つけ、次にすべての等価ネットワークを列挙し、それらのそれぞれについて総効果と直接効果のペアを推定することです。ただし、真の因果グラフが大きい場合、必須グラフを学習して等価ネットワークを列挙することは非効率的である可能性があります。この論文では、代わりにローカル学習アプローチを提案します。ローカル学習アプローチでは、最初に治療を含むチェーンコンポーネントをローカルに学習します。次に、必要に応じて、応答を含むチェーンコンポーネントをローカルに学習します。次に、治療の親と応答の親のすべての可能なペアをローカルに列挙します。最後に、これらのペアに基づいて、治療が反応に及ぼす総合的かつ直接的な影響のすべての可能なペアを見つけます。

Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms
確率的勾配法とスペクトルアルゴリズムによる分散学習のための最適収束

We study generalization properties of distributed algorithms in the setting of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We first investigate distributed stochastic gradient methods (SGM), with mini-batches and multi-passes over the data. We show that optimal generalization error bounds (up to logarithmic factor) can be retained for distributed SGM provided that the partition level is not too large. We then extend our results to spectral algorithms (SA), including kernel ridge regression (KRR), kernel principal component regression, and gradient methods. Our results show that distributed SGM has a smaller theoretical computational complexity, compared with distributed KRR and classic SGM. Moreover, even for a general non-distributed SA, they provide optimal, capacity-dependent convergence rates, for the case that the regression function may not be in the RKHS in the well-conditioned regimes.

私たちは、再生カーネルヒルベルト空間(RKHS)上のノンパラメトリック回帰の設定における分散アルゴリズムの一般化特性を研究します。まず、データのミニバッチとマルチパスを使用した分散確率的勾配法(SGM)を調査します。パーティションレベルが大きすぎない限り、分散SGMに対して最適な汎化誤差範囲(対数係数まで)を保持できることを示します。次に、結果をカーネルリッジ回帰(KRR)、カーネル主成分回帰、勾配法などのスペクトルアルゴリズム(SA)に拡張します。私たちの結果は、分散型SGMは、分散型KRRや従来のSGMと比較して、理論上の計算量が小さいことを示しています。さらに、一般的な非分散SAの場合でも、回帰関数が十分に条件付けられた領域のRKHSにない場合に備えて、最適な容量依存の収束率を提供します。

New Insights and Perspectives on the Natural Gradient Method
自然勾配法に関する新たな洞察と展望

Natural gradient descent is an optimization method traditionally motivated from the perspective of information geometry, and works well for many applications as an alternative to stochastic gradient descent. In this paper we critically analyze this method and its properties, and show how it can be viewed as a type of 2nd-order optimization method, with the Fisher information matrix acting as a substitute for the Hessian. In many important cases, the Fisher information matrix is shown to be equivalent to the Generalized Gauss-Newton matrix, which not only approximates the Hessian but also has certain properties that favor its use over the Hessian. This perspective turns out to have significant implications for the design of a practical and robust natural gradient optimizer, as it motivates the use of techniques like trust regions and Tikhonov regularization. Additionally, we make a series of contributions to the understanding of natural gradient and 2nd-order methods, including: a thorough analysis of the convergence speed of stochastic natural gradient descent (and more general stochastic 2nd-order methods) as applied to convex quadratics, a critical examination of the oft-used ‘empirical’ approximation of the Fisher matrix, and an analysis of the (approximate) parameterization invariance property possessed by natural gradient methods (which we show also holds for certain other curvature matrices, but notably not the Hessian).

自然勾配降下法は、情報幾何学の観点から伝統的に動機付けられている最適化手法であり、多くのアプリケーションで確率的勾配降下法の代替としてうまく機能します。この論文では、この手法とその特性を批判的に分析し、フィッシャー情報行列がヘッセ行列の代わりとして機能する、一種の2次最適化手法としてどのように見ることができるかを示します。多くの重要なケースで、フィッシャー情報行列は一般化ガウス-ニュートン行列と同等であることが示されています。これは、ヘッセ行列を近似するだけでなく、ヘッセ行列よりもフィッシャー情報行列を使用する方がよい特定の特性も持っています。この観点は、信頼領域やTikhonov正則化などの手法の使用を促すため、実用的で堅牢な自然勾配最適化装置の設計に重要な意味を持つことが判明しています。さらに、私たちは、自然勾配法と2次法の理解に、次のような一連の貢献をしています。凸二次方程式に適用された確率的自然勾配降下法(およびより一般的な確率的2次法)の収束速度の徹底的な分析、フィッシャー行列の頻繁に使用される「経験的」近似の批判的検討、自然勾配法が持つ(近似)パラメータ化不変性プロパティの分析(これは他の特定の曲率行列にも当てはまるが、ヘッシアン行列には当てはまらないことを示しています)。
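
The view of the Fisher information as a stand-in for the Hessian can be seen in a one-parameter example. The sketch below fits a Bernoulli model with logit parameter `theta` by natural gradient ascent on the average log-likelihood. The model, the data, the step size, and the function names are assumptions chosen for illustration. In this natural parameterization the Fisher information `p * (1 - p)` coincides with the negative Hessian of the log-likelihood, so a step size of 1 would be a Newton step.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def natural_gradient_fit(data, step=0.5, iters=60):
    """Natural gradient ascent for a Bernoulli model parameterized by its logit."""
    mean = sum(data) / len(data)
    theta = 0.0
    for _ in range(iters):
        p = sigmoid(theta)
        grad = mean - p            # gradient of the average log-likelihood in theta
        fisher = p * (1.0 - p)     # Fisher information of Bernoulli in the logit
        theta += step * grad / fisher   # precondition the gradient by F^{-1}
    return theta


data = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]   # 8 successes out of 10
theta_hat = natural_gradient_fit(data)
print(sigmoid(theta_hat))
```

The fitted success probability approaches the sample mean 0.8. Dividing by the Fisher information rescales the step so that it does not shrink as `p` moves toward 0 or 1, where the plain gradient in `theta` becomes small.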

Orlicz Random Fourier Features
オルリツ・ランダム・フーリエ特徴

Kernel techniques are among the most widely-applied and influential tools in machine learning, with applications in virtually all areas of the field. To combine this expressive power with computational efficiency, numerous randomized schemes have been proposed in the literature, among which random Fourier features (RFF) are probably the simplest and most popular. While RFFs were originally designed for the approximation of kernel values, recently they have been adapted to kernel derivatives, and hence to the solution of large-scale tasks involving function derivatives. Unfortunately, the understanding of the RFF scheme for the approximation of higher-order kernel derivatives is quite limited due to the challenging polynomially growing nature of the underlying function class in the empirical process. To tackle this difficulty, we establish a finite-sample deviation bound for a general class of polynomial-growth functions under $\alpha$-exponential Orlicz condition on the distribution of the sample. Instantiating this result for RFFs, our finite-sample uniform guarantee implies almost sure (a.s.) convergence with tight rate for arbitrary kernel with $\alpha$-exponential Orlicz spectrum and any order of derivative.

カーネル手法は、機械学習において最も広く適用され、影響力のあるツールの1つであり、この分野のほぼすべての領域で応用されています。この表現力と計算効率を組み合わせるために、文献では多数のランダム化スキームが提案されていますが、その中でもおそらくランダムフーリエ特徴(RFF)が最も単純で人気があります。RFFはもともとカーネル値の近似用に設計されていましたが、最近ではカーネル導関数に適応され、関数導関数を含む大規模なタスクの解決にも利用されています。残念ながら、高次カーネル導関数の近似に対するRFFスキームの理解は、経験的プロセスにおける基礎となる関数クラスの困難な多項式増加の性質のためにかなり限られています。この困難に取り組むために、サンプルの分布に関する$\alpha$指数オルリッツ条件の下で、多項式増加関数の一般的なクラスに対して有限サンプル偏差境界を確立します。この結果をRFFに具体化すると、有限サンプル一様保証は、$\alpha$指数オルリッツスペクトルを持つ任意のカーネルと任意の次数の導関数に対して、タイトな速度でのほぼ確実な収束を意味します。
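
The basic RFF construction for kernel values, which the paper extends to kernel derivatives, can be sketched in a few lines. The example below approximates a Gaussian kernel with cosine features. The function names, the bandwidth, the number of features, and the fixed seed are assumptions made for illustration. Derivative approximation and the Orlicz-type guarantees from the abstract are not shown.

```python
import math
import random


def sample_rff(dim, num_features, bandwidth, rng):
    """Draw frequencies from N(0, bandwidth^-2 I) and phases from U[0, 2*pi]."""
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases


def rff_map(x, freqs, phases):
    """Feature map z(x) with z(x) . z(y) approximating the Gaussian kernel."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(freqs, phases)]


def gaussian_kernel(x, y, bandwidth):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


rng = random.Random(0)
freqs, phases = sample_rff(dim=2, num_features=2000, bandwidth=1.0, rng=rng)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(rff_map(x, freqs, phases), rff_map(y, freqs, phases)))
exact = gaussian_kernel(x, y, bandwidth=1.0)
print(approx, exact)
```

The inner product of the two feature vectors is a Monte Carlo estimate of the kernel value, so its error shrinks roughly like one over the square root of the number of features.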

Empirical Priors for Prediction in Sparse High-dimensional Linear Regression
スパース高次元線形回帰における予測のための経験的事前確率

In this paper we adopt the familiar sparse, high-dimensional linear regression model and focus on the important but often overlooked task of prediction. In particular, we consider a new empirical Bayes framework that incorporates data in the prior in two ways: one is to center the prior for the non-zero regression coefficients and the other is to provide some additional regularization. We show that, in certain settings, the asymptotic concentration of the proposed empirical Bayes posterior predictive distribution is very fast, and we establish a Bernstein–von Mises theorem which ensures that the derived empirical Bayes prediction intervals achieve the targeted frequentist coverage probability. The empirical prior has a convenient conjugate form, so posterior computations are relatively simple and fast. Finally, our numerical results demonstrate the proposed method’s strong finite-sample performance in terms of prediction accuracy, uncertainty quantification, and computation time compared to existing Bayesian methods.

この論文では、よく知られているスパースな高次元線形回帰モデルを採用し、重要だが見落とされがちな予測タスクに焦点を当てています。特に、事前分布にデータを組み込む2つの方法、つまり、非ゼロ回帰係数の事前分布を中心化する方法と、追加の正則化を提供する方法という2つの方法で新しい経験的ベイズフレームワークを検討します。特定の設定では、提案された経験的ベイズ事後予測分布の漸近集中が非常に高速であることを示し、導出された経験的ベイズ予測区間が目標の頻度主義的カバレッジ確率を達成することを保証するBernstein-von Mises定理を確立します。経験的事前分布は便利な共役形式であるため、事後計算は比較的単純かつ高速です。最後に、数値結果により、既存のベイズ法と比較して、予測精度、不確実性の定量化、計算時間に関して、提案された方法が強力な有限サンプルパフォーマンスを発揮することがわかります。

A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints
期待制約条件を用いた確率的凸最適化のためのデータ効率的で実行可能なレベルセット法

Stochastic convex optimization problems with expectation constraints (SOECs) are encountered in statistics and machine learning, business, and engineering. The SOEC objective and constraints contain expectations defined with respect to complex distributions or large data sets, leading to high computational complexity when solved by the algorithms that use exact functions and their gradients. Recent stochastic first order methods exhibit low computational complexity when handling SOECs but guarantee near-feasibility and near-optimality only at convergence. These methods may thus return highly infeasible solutions when heuristically terminated, as is often the case, due to theoretical convergence criteria being highly conservative. This issue limits the use of first order methods in several applications where the SOEC constraints encode implementation requirements. We design a stochastic feasible level set method (SFLS) for SOECs that has low complexity and emphasizes feasibility before convergence. Specifically, our level-set method solves a root-finding problem by calling a novel first order oracle that computes a stochastic upper bound on the level-set function by extending mirror descent and online validation techniques. We establish that SFLS maintains a high-probability feasible solution at each root-finding iteration and exhibits favorable complexity compared to state-of-the-art deterministic feasible level set and stochastic subgradient methods. Numerical experiments on three diverse applications highlight how SFLS finds feasible solutions with small optimality gaps with lower complexity than the former approaches.

期待値制約付き確率的凸最適化問題(SOEC)は、統計学や機械学習、ビジネス、エンジニアリングの分野で発生します。SOECの目的と制約には、複雑な分布や大規模なデータセットに関して定義された期待値が含まれており、正確な関数とその勾配を使用するアルゴリズムで解決すると、計算の複雑性が高くなります。最近の確率的一次法は、SOECの処理時に計算の複雑性が低くなりますが、収束時にのみほぼ実現可能でほぼ最適であることが保証されます。そのため、これらの方法は、理論的な収束基準が非常に保守的であるため、よくあることですが、ヒューリスティックに終了すると、非常に実行不可能なソリューションを返す可能性があります。この問題により、SOEC制約が実装要件をエンコードするいくつかのアプリケーションで一次法の使用が制限されます。私たちは、収束前に実現可能性を重視し、複雑性が低いSOEC用の確率的実行可能レベルセット法(SFLS)を設計します。具体的には、レベルセット法は、ミラー降下法とオンライン検証技術を拡張してレベルセット関数の確率的上限を計算する新しい一次オラクルを呼び出すことで、ルート探索問題を解決します。SFLSは、ルート探索の各反復で高確率の実行可能ソリューションを維持し、最先端の決定論的実行可能レベルセット法と確率的サブグラディエント法と比較して好ましい複雑性を示すことを確立しました。3つの異なるアプリケーションでの数値実験により、SFLSが以前のアプローチよりも低い複雑性で、小さな最適性ギャップを持つ実行可能ソリューションをどのように見つけるかが明らかになりました。

Nesterov’s Acceleration for Approximate Newton
近似ニュートン法に対するネステロフの加速

Optimization plays a key role in machine learning. Recently, stochastic second-order methods have attracted considerable attention because of their low computational cost in each iteration. However, these methods might suffer from poor performance when the Hessian is hard to approximate well in a computationally efficient way. To overcome this dilemma, we resort to Nesterov’s acceleration to improve the convergence performance of these second-order methods and propose accelerated approximate Newton. We give the theoretical convergence analysis of accelerated approximate Newton and show that Nesterov’s acceleration can improve the convergence rate. Accordingly, we propose an accelerated regularized sub-sampled Newton (ARSSN) which performs much better than the conventional regularized sub-sampled Newton empirically and theoretically. Moreover, we show that ARSSN has better performance than classical first-order methods empirically.

最適化は、機械学習において重要な役割を果たします。近年、確率的二次法は、各反復での計算コストが低いため、大きな注目を集めています。ただし、これらの方法は、ヘッセ行列を計算効率の高い方法で適切に近似するのが難しい場合、パフォーマンスが低下する可能性があります。このジレンマを克服するために、ネステロフの加速に頼って、これらの2次法の収束性能を向上させ、加速近似ニュートン法を提案します。加速近似ニュートン法の理論収束解析を行い、ネステロフの加速が収束率を改善できることを示します。したがって、従来の正則化サブサンプリングニュートン法よりもはるかに優れた性能を経験的および理論的に発揮する加速正則化サブサンプリングニュートン法(ARSSN)を提案します。さらに、ARSSNが従来の一次法よりも優れた性能を持つことを経験的に示します。
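The effect of adding Nesterov-style momentum to an approximate Newton step can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the paper's ARSSN algorithm: the objective is a diagonal quadratic, the "approximate Hessian" is the true curvature plus a regularizer `alpha`, and the update rule and momentum value are one common way to combine the two ideas. All names are hypothetical.

```python
def approx_newton_step(y, curvatures, alpha):
    """One step y - H^{-1} grad f(y) for f(x) = 0.5 * sum(a_i * x_i^2).

    The gradient is a_i * y_i, and H = diag(a_i + alpha) plays the role of
    a regularized (hence inexact) Hessian approximation.
    """
    return [y_i - (a_i * y_i) / (a_i + alpha)
            for y_i, a_i in zip(y, curvatures)]


def run_approx_newton(x0, curvatures, alpha, momentum, steps):
    """Iterate the approximate Newton step with Nesterov-style extrapolation.

    momentum = 0.0 recovers the plain (unaccelerated) iteration.
    """
    x_prev = list(x0)
    y = list(x0)
    for _ in range(steps):
        x = approx_newton_step(y, curvatures, alpha)
        # Extrapolate along the direction of the most recent progress.
        y = [x_i + momentum * (x_i - xp_i) for x_i, xp_i in zip(x, x_prev)]
        x_prev = x
    return x_prev


def norm(v):
    return sum(t * t for t in v) ** 0.5


if __name__ == "__main__":
    curvatures = [0.1, 1.0, 10.0]  # the minimizer is the origin
    x0 = [1.0, 1.0, 1.0]
    plain = run_approx_newton(x0, curvatures, alpha=1.0, momentum=0.0, steps=30)
    accel = run_approx_newton(x0, curvatures, alpha=1.0, momentum=0.54, steps=30)
    print("plain distance to optimum:", norm(plain))
    print("accel distance to optimum:", norm(accel))
```

In this toy setting the regularizer slows the plain iteration along the low-curvature coordinate, and the extrapolation step recovers much of the lost speed. The momentum value 0.54 was picked by hand for this particular problem.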

Importance Sampling Techniques for Policy Optimization
ポリシー最適化のための重要度サンプリング手法

How can we effectively exploit the collected samples when solving a continuous control task with Reinforcement Learning? Recent results have empirically demonstrated that multiple policy optimization steps can be performed with the same batch by using off-distribution techniques based on importance sampling. However, when dealing with off-distribution optimization, it is essential to take into account the uncertainty introduced by the importance sampling process. In this paper, we propose and analyze a class of model-free, policy search algorithms that extend the recent Policy Optimization via Importance Sampling (Metelli et al., 2018) by incorporating two advanced variance reduction techniques: per-decision and multiple importance sampling. For both of them, we derive a high-probability bound, of independent interest, and then we show how to employ it to define a suitable surrogate objective function that can be used for both action-based and parameter-based settings. The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods.

強化学習を使用して連続制御タスクを解決する際に、収集したサンプルを効果的に活用するにはどうすればよいでしょうか。最近の結果では、重要度サンプリングに基づくオフディストリビューション手法を使用することで、同じバッチで複数のポリシー最適化ステップを実行できることが実証されています。ただし、オフディストリビューション最適化を扱う場合、重要度サンプリングプロセスによって導入される不確実性を考慮することが不可欠です。この論文では、最近の重要度サンプリングによるポリシー最適化(Metelliら、2018年)を拡張し、2つの高度な分散削減手法(決定ごとと複数の重要度サンプリング)を組み込んだ、モデルフリーのポリシー検索アルゴリズムのクラスを提案し、分析します。これら2つについて、独立した関心のある高確率境界を導出し、それを使用してアクションベースとパラメーターベースの両方の設定に使用できる適切な代理目的関数を定義する方法を示します。結果として得られるアルゴリズムは、最終的に、線形ポリシーとディープポリシーの両方を使用して、一連の連続制御タスクで評価され、最新のポリシー最適化手法と比較されます。
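The core mechanism behind reusing a batch for off-distribution evaluation is the importance weight. The following is a minimal sketch of plain importance sampling with one-dimensional Gaussian policies and a toy reward; it does not implement the per-decision or multiple importance sampling variants, nor the high-probability bound, discussed in the abstract. All names and the reward function are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def collect_batch(num_samples, behavior_mean, std=1.0, seed=0):
    """Sample actions from the behavior policy and record toy rewards."""
    rng = random.Random(seed)
    actions = [rng.gauss(behavior_mean, std) for _ in range(num_samples)]
    rewards = [-(a - 1.0) ** 2 for a in actions]  # toy reward, peak at a = 1
    return actions, rewards


def importance_sampling_estimate(actions, rewards, behavior_pdf, target_pdf):
    """Estimate the expected reward under the target policy.

    Each reward is reweighted by target_pdf(a) / behavior_pdf(a), so samples
    drawn from the behavior policy can be reused for a different policy.
    """
    weights = [target_pdf(a) / behavior_pdf(a) for a in actions]
    return sum(w * r for w, r in zip(weights, rewards)) / len(actions)


if __name__ == "__main__":
    actions, rewards = collect_batch(20000, behavior_mean=0.0, seed=3)
    estimate = importance_sampling_estimate(
        actions,
        rewards,
        behavior_pdf=lambda a: gaussian_pdf(a, 0.0, 1.0),
        target_pdf=lambda a: gaussian_pdf(a, 0.5, 1.0),
    )
    print("estimated expected reward under target policy:", estimate)
```

The variance of this estimator grows as the target policy moves away from the behavior policy, which is the uncertainty the abstract says must be accounted for.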

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
統一されたテキスト-テキスト変換器による転移学習の限界の探求

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

転移学習は、モデルを最初にデータが豊富なタスクで事前トレーニングしてから、下流のタスクで微調整するもので、自然言語処理(NLP)の強力な手法として登場しました。転移学習の有効性により、さまざまなアプローチ、方法論、実践が生まれました。この論文では、すべてのテキストベースの言語問題をテキストからテキストへの形式に変換する統一フレームワークを紹介することで、NLPの転移学習手法の概要を説明します。私たちの体系的な研究では、事前トレーニングの目的、アーキテクチャ、ラベルなしデータセット、転移アプローチ、およびその他の要素を数十の言語理解タスクで比較します。規模と新しい「Colossal Clean Crawled Corpus」を備えた探索から得られた洞察を組み合わせることで、要約、質問応答、テキスト分類などを含む多くのベンチマークで最先端の結果を達成しました。NLPの転移学習に関する今後の作業を容易にするために、データセット、事前トレーニング済みモデル、コードを公開します。

Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Networks
連鎖と連鎖則の出会い:ニューラルネットワークの多重レベルエントロピー正則化と訓練

We derive generalization and excess risk bounds for neural networks using a family of complexity measures based on a multilevel relative entropy. The bounds are obtained by introducing the notion of generated hierarchical coverings of neural networks and by using the technique of chaining mutual information introduced by Asadi et al. ’18. The resulting bounds are algorithm-dependent and multiscale: they exploit the multilevel structure of neural networks. This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multiscale extension of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum. This leads to a new training procedure for neural networks with performance guarantees, which exploits the chain rule of relative entropy rather than the chain rule of derivatives (as in backpropagation), and which takes into account the interactions between different scales of the hypothesis sets of neural networks corresponding to different depths of the hidden layers. To obtain an efficient implementation of the latter, we further develop a multilevel Metropolis algorithm simulating the multiscale Gibbs distribution, with an experiment for a two-layer neural network on the MNIST data set.

私たちは、マルチレベル相対エントロピーに基づく複雑性尺度群を使用して、ニューラルネットワークの一般化および過剰リスクの境界を導出します。境界は、ニューラルネットワークの生成された階層的被覆の概念を導入し、Asadiら’18が導入した相互情報量の連鎖の手法を使用することで得られます。結果として得られる境界はアルゴリズムに依存し、マルチスケールです。つまり、ニューラルネットワークのマルチレベル構造を利用します。これは、マルチレベルエントロピー正則化を伴う経験的リスク最小化問題につながる。最小化問題は、有名なギブス事後分布のマルチスケール拡張を導入することで解決され、導出された分布が唯一の最小値を達成することが証明されます。これは、パフォーマンス保証付きのニューラルネットワークの新しいトレーニング手順につながる。この手順では、微分の連鎖則(バックプロパゲーションの場合など)ではなく、相対エントロピーの連鎖則を利用し、隠れ層の異なる深さに対応するニューラルネットワークの仮説セットの異なるスケール間の相互作用を考慮に入れます。後者の効率的な実装を得るために、MNISTデータセット上の2層ニューラルネットワークの実験を使用して、マルチスケールギブス分布をシミュレートするマルチレベルメトロポリスアルゴリズムをさらに開発します。

metric-learn: Metric Learning Algorithms in Python
metric-learn: Python のメトリクス学習アルゴリズム

metric-learn is an open source Python package implementing supervised and weakly-supervised distance metric learning algorithms. As part of scikit-learn-contrib, it provides a unified interface compatible with scikit-learn, which makes it easy to perform cross-validation, model selection, and pipelining with other machine learning estimators. metric-learn is thoroughly tested and available on PyPI under the MIT license.

metric-learnは、教師ありおよび弱教師付きの距離メトリック学習アルゴリズムを実装するオープンソースのPythonパッケージです。scikit-learn-contribの一部として、scikit-learnと互換性のある統一インターフェースを提供し、他の機械学習推定器との交差検証、モデル選択、パイプライン化を簡単に実行できます。metric-learnは徹底的にテストされ、MITライセンスの下でPyPiで利用できます。

Contextual Bandits with Continuous Actions: Smoothing, Zooming, and Adapting
連続アクションを持つコンテキストバンディット: スムージング、ズーム、適応

We study contextual bandit learning with an abstract policy class and continuous action space. We obtain two qualitatively different regret bounds: one competes with a smoothed version of the policy class under no continuity assumptions, while the other requires standard Lipschitz assumptions. Both bounds exhibit data-dependent “zooming” behavior and, with no tuning, yield improved guarantees for benign problems. We also study adapting to unknown smoothness parameters, establishing a price-of-adaptivity and deriving optimal adaptive algorithms that require no additional information.

私たちは、抽象的なポリシークラスと連続的な行動空間を使用して、文脈上のバンディット学習を研究します。質的に異なる2つの後悔限界が得られます:1つは連続性の仮定なしでポリシークラスの平滑化されたバージョンと競合し、もう1つは標準的なリプシッツの仮定を必要とします。どちらの境界もデータ依存の”ズーム”動作を示し、チューニングがない場合、良性の問題に対する保証が向上します。また、未知の平滑性パラメータへの適応、適応性の価格の確立、追加情報を必要としない最適な適応アルゴリズムの導出についても研究しています。

Convergence Rates for the Stochastic Gradient Descent Method for Non-Convex Objective Functions
非凸目的関数の確率的勾配降下法の収束率

We prove the convergence to minima and estimates on the rate of convergence for the stochastic gradient descent method in the case of not necessarily locally convex nor contracting objective functions. In particular, the analysis relies on a quantitative use of mini-batches to control the loss of iterates to non-attracted regions. The applicability of the results to simple objective functions arising in machine learning is shown.

私たちは、目的関数が必ずしも局所的に凸でも縮小的でもない場合について、確率的勾配降下法の極小値への収束を証明し、その収束率を評価します。特に、この分析は、引き寄せられない領域へ反復点が流出することを制御するために、ミニバッチを定量的に利用することに依拠しています。機械学習で生じる単純な目的関数に対する結果の適用性を示します。

A Unified Framework of Online Learning Algorithms for Training Recurrent Neural Networks
リカレントニューラルネットワークを訓練するためのオンライン学習アルゴリズムの統一フレームワーク

We present a framework for compactly summarizing many recent results in efficient and/or biologically plausible online training of recurrent neural networks (RNN). The framework organizes algorithms according to several criteria: (a) past vs. future facing, (b) tensor structure, (c) stochastic vs. deterministic, and (d) closed form vs. numerical. These axes reveal latent conceptual connections among several recent advances in online learning. Furthermore, we provide novel mathematical intuitions for their degree of success. Testing these algorithms on two parametric task families shows that performances cluster according to our criteria. Although a similar clustering is also observed for pairwise gradient alignment, alignment with exact methods does not explain ultimate performance. This suggests the need for better comparison metrics.

私たちは、リカレントニューラルネットワーク(RNN)の効率的および/または生物学的にもっともらしいオンライントレーニングにおける多くの最近の結果をコンパクトに要約するためのフレームワークを提示します。このフレームワークは、(a)過去と未来、(b)テンソル構造、(c)確率論的対決定論的、(d)閉形式対数値的など、いくつかの基準に従ってアルゴリズムを編成します。これらの軸は、オンライン学習における最近のいくつかの進歩の間に潜在的な概念的なつながりを明らかにしています。さらに、その成功の度合いについて、新しい数学的直感を提供します。これらのアルゴリズムを2つのパラメトリックタスクファミリでテストすると、パフォーマンスが基準に従ってクラスタリングされることがわかります。ペアワイズ・グラディエント・アライメントでも同様のクラスタリングが観察されますが、厳密法によるアライメントでは最終的なパフォーマンスは説明できません。これは、より優れた比較指標の必要性を示唆しています。

Probabilistic Learning on Graphs via Contextual Architectures
コンテキストアーキテクチャによるグラフ上の確率的学習

We propose a novel methodology for representation learning on graph-structured data, in which a stack of Bayesian Networks learns different distributions of a vertex’s neighbourhood. Through an incremental construction policy and layer-wise training, we can build deeper architectures with respect to typical graph convolutional neural networks, with benefits in terms of context spreading between vertices. First, the model learns from graphs via maximum likelihood estimation without using target labels. Then, a supervised readout is applied to the learned graph embeddings to deal with graph classification and vertex classification tasks, showing competitive results against neural models for graphs. The computational complexity is linear in the number of edges, facilitating learning on large scale data sets. By studying how depth affects the performances of our model, we discover that a broader context generally improves performances. In turn, this leads to a critical analysis of some benchmarks used in literature.

私たちは、グラフ構造化データに対する表現学習の新しい方法論を提案します。この方法では、ベイジアンネットワークのスタックが頂点の近傍のさまざまな分布を学習します。増分構築ポリシーとレイヤーごとのトレーニングを通じて、一般的なグラフ畳み込みニューラルネットワークに関してより深いアーキテクチャを構築でき、頂点間のコンテキストの広がりという利点があります。まず、モデルはターゲットラベルを使用せずに最大尤度推定によってグラフから学習します。次に、学習したグラフ埋め込みに教師あり読み出しを適用して、グラフ分類および頂点分類タスクを処理し、グラフのニューラルモデルに対して競争力のある結果を示します。計算の複雑さはエッジの数に比例するため、大規模なデータセットでの学習が容易になります。深さがモデルのパフォーマンスにどのように影響するかを研究することで、コンテキストが広いほどパフォーマンスが向上することがわかりました。次に、これは文献で使用されているいくつかのベンチマークの批判的分析につながります。

Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers
スパースに相互作用するワーカーのクラウドソーシング集約に向けた、スパースランク1行列補完のための勾配降下法

We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model. In practice, skill estimation is challenging because worker assignments are sparse and irregular due to the arbitrary and uncontrolled availability of workers. We formulate skill estimation as a rank-one correlation-matrix completion problem, where the observed components correspond to observed label correlation between workers. We show that the correlation matrix can be successfully recovered and skills are identifiable if and only if the sampling matrix (observed components) does not have a bipartite connected component. We then propose a projected gradient descent scheme and show that skill estimates converge to the desired global optima for such sampling matrices. Our proof is original and the results are surprising in light of the fact that even the weighted rank-one matrix factorization problem is NP-hard in general. Next, we derive sample complexity bounds in terms of spectral properties of the signless Laplacian of the sampling matrix. Our proposed scheme achieves state-of-the-art performance on a number of real-world datasets.

私たちは、シングルコインDawid-Skeneクラウドソーシングモデルにおけるワーカーのスキル推定について検討します。実際には、ワーカーの割り当ては、ワーカーの可用性が任意で制御されていないため、まばらで不規則であるため、スキル推定は困難です。私たちは、ランク1相関行列補完問題としてスキル推定を定式化します。ここで、観測されたコンポーネントは、ワーカー間の観測されたラベル相関に対応します。私たちは、サンプリング行列(観測されたコンポーネント)に二部接続コンポーネントがない場合に限り、相関行列を正常に復元でき、スキルを識別できることを示します。次に、投影勾配降下法を提案し、スキル推定が、そのようなサンプリング行列の望ましいグローバル最適値に収束することを示します。我々の証明は独創的であり、重み付きランク1行列因数分解問題でさえ一般にNP困難であるという事実を考慮すると、結果は驚くべきものです。次に、サンプリング行列の符号なしラプラシアンのスペクトル特性の観点から、サンプル複雑度の境界を導出します。我々の提案する方式は、多くの実際のデータセットで最先端のパフォーマンスを実現します。

Monte Carlo Gradient Estimation in Machine Learning
機械学習におけるモンテカルロ勾配推定

This paper is a broad and accessible survey of the methods we have at our disposal for Monte Carlo gradient estimation in machine learning and across the statistical sciences: the problem of computing the gradient of an expectation of a function with respect to parameters defining the distribution that is integrated; the problem of sensitivity analysis. In machine learning research, this gradient problem lies at the core of many learning problems, in supervised, unsupervised and reinforcement learning. We will generally seek to rewrite such gradients in a form that allows for Monte Carlo estimation, allowing them to be easily and efficiently used and analysed. We explore three strategies—the pathwise, score function, and measure-valued gradient estimators—exploring their historical development, derivation, and underlying assumptions. We describe their use in other fields, show how they are related and can be combined, and expand on their possible generalisations. Wherever Monte Carlo gradient estimators have been derived and deployed in the past, important advances have followed. A deeper and more widely-held understanding of this problem will lead to further advances, and it is these advances that we wish to support.

この論文は、機械学習や統計科学全般におけるモンテカルロ勾配推定に利用できる手法を幅広くわかりやすく概説したものです。ここで扱うのは、積分される分布を定義するパラメータに関して関数の期待値の勾配を計算する問題、すなわち感度分析の問題です。機械学習の研究では、この勾配の問題は、教師あり学習、教師なし学習、強化学習など、多くの学習問題の核心にあります。私たちは通常、このような勾配をモンテカルロ推定を可能にする形式で書き直し、簡単に効率的に使用および分析できるようにします。私たちは、パスワイズ、スコア関数、および測度値勾配推定量の3つの戦略を検討し、それらの歴史的発展、導出、および基礎となる仮定を探ります。他の分野での使用について説明し、それらがどのように関連し、組み合わせられるかを示し、それらの可能な一般化について詳しく説明します。モンテカルロ勾配推定量が導出され利用されてきたところでは、常に重要な進歩が続いてきました。この問題に対する理解がより深く、より広く共有されることで、さらなる進歩がもたらされます。そして、私たちがサポートしたいのは、まさにこうした進歩なのです。
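Two of the three strategies named in the abstract, the score function and pathwise estimators, can be compared on a toy problem. The sketch below estimates the gradient of $E_{x \sim N(\mu, \sigma^2)}[x^2]$ with respect to $\mu$, whose exact value is $2\mu$. The test function, sample sizes, and all names are assumptions made for illustration, and the measure-valued estimator is omitted.

```python
import random


def f(x):
    """Toy integrand; E[f(x)] = mu^2 + sigma^2 under N(mu, sigma^2)."""
    return x * x


def f_prime(x):
    """Derivative of the toy integrand, needed by the pathwise estimator."""
    return 2.0 * x


def score_function_gradient(mu, sigma, num_samples, rng):
    """Average f(x) * d/dmu log N(x; mu, sigma^2) over samples x."""
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(mu, sigma)
        total += f(x) * (x - mu) / sigma ** 2
    return total / num_samples


def pathwise_gradient(mu, sigma, num_samples, rng):
    """Average f'(x) * dx/dmu with the reparameterization x = mu + sigma * eps."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        total += f_prime(x)  # dx/dmu = 1 for this reparameterization
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = 1.5, 1.0
    print("exact gradient :", 2.0 * mu)
    print("score function :", score_function_gradient(mu, sigma, 50000, rng))
    print("pathwise       :", pathwise_gradient(mu, sigma, 50000, rng))
```

Both estimators are unbiased here, but on this example the score function estimate fluctuates noticeably more between runs than the pathwise one. The pathwise estimator, in turn, needs the integrand to be differentiable.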

Convergence of Sparse Variational Inference in Gaussian Processes Regression
ガウス過程回帰におけるスパース変分推論の収束

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, $N$, due to the cubic (in $N$) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on $M \ll N$ inducing variables to form an approximation at a cost of $\mathcal{O}\left(NM^2\right)$. While the computational cost appears linear in $N$, the true complexity depends on how $M$ must scale with $N$ to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how $M$ needs to grow with $N$ to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with $M \ll N$. Specifically, for the popular squared exponential kernel and $D$-dimensional Gaussian distributed covariates, $M = \mathcal{O}((\log N)^D)$ suffice and a method with an overall computational cost of $\mathcal{O}\left(N(\log N)^{2D}(\log \log N)^2\right)$ can be used to perform inference.

ガウス過程は関数上の分布であり、ベイズモデリングにおいて多用途で数学的に便利な事前分布です。しかし、正確な推論で使用される行列演算のコストが$N$の3乗であるため、観測値の数$N$が多いデータではその使用が妨げられることがよくあります。$M \ll N$個の誘導変数に依拠し、コスト$\mathcal{O}\left(NM^2\right)$で近似を形成する多くのソリューションが提案されています。計算コストは$N$に関して線形に見えますが、実際の複雑さは、近似の一定の品質を確保するために$M$が$N$とともにどの程度拡大する必要があるかによって異なります。この研究では、高品質の近似を確保するために$M$が$N$とともにどの程度増大する必要があるかの上限と下限を調査します。$M \ll N$のガウスノイズ回帰モデルでは、近似モデルと正確な事後分布の間のKLダイバージェンスを任意に小さくできることを示します。具体的には、一般的な二乗指数カーネルと$D$次元ガウス分布共変量の場合、$M = \mathcal{O}((\log N)^D)$で十分であり、全体的な計算コストが$\mathcal{O}\left(N(\log N)^{2D}(\log \log N)^2\right)$の方法で推論を実行できます。
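As a rough check of how the two rates quoted in the abstract fit together, substituting the stated number of inducing variables into the approximation cost gives the polynomial-logarithmic factor of the overall bound. This is only a sketch based on the quantities in the abstract; the origin of the remaining $(\log \log N)^2$ factor is not stated there and is left unexplained here.

$$
M = \mathcal{O}\left((\log N)^D\right) \;\Longrightarrow\; \mathcal{O}\left(NM^2\right) = \mathcal{O}\left(N(\log N)^{2D}\right)
$$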

AI Explainability 360: An Extensible Toolkit for Understanding Data and Machine Learning Models
AI Explainability 360: データと機械学習モデルを理解するための拡張可能なツールキット

As artificial intelligence algorithms make further inroads in high-stakes societal applications, there are increasing calls from multiple stakeholders for these algorithms to explain their outputs. To make matters more challenging, different personas of consumers of explanations have different requirements for explanations. Toward addressing these needs, we introduce AI Explainability 360, an open-source Python toolkit featuring ten diverse and state-of-the-art explainability methods and two evaluation metrics. Equally important, we provide a taxonomy to help entities requiring explanations to navigate the space of interpretation and explanation methods, not only those in the toolkit but also in the broader literature on explainability. For data scientists and other users of the toolkit, we have implemented an extensible software architecture that organizes methods according to their place in the AI modeling pipeline. The toolkit is not only the software, but also guidance material, tutorials, and an interactive web demo to introduce AI explainability to different audiences. Together, our toolkit and taxonomy can help identify gaps where more explainability methods are needed and provide a platform to incorporate them as they are developed.

人工知能アルゴリズムが社会的に重要なアプリケーションにさらに浸透するにつれて、さまざまな利害関係者から、これらのアルゴリズムに出力を説明するよう求める声が高まっています。さらに困難なことに、説明の消費者のペルソナが異なれば、説明に対する要件も異なります。これらのニーズに対処するために、AI Explainability 360を紹介します。これは、10種類の多様で最先端の説明可能性手法と2つの評価指標を備えたオープンソースのPythonツールキットです。同様に重要なのは、説明を必要とするエンティティが、ツールキット内の手法だけでなく、説明可能性に関するより広範な文献にある解釈および説明手法の領域をナビゲートするのに役立つ分類法を提供することです。データサイエンティストやツールキットの他のユーザー向けに、AIモデリングパイプラインにおける位置に応じて手法を整理する拡張可能なソフトウェアアーキテクチャを実装しました。ツールキットはソフトウェアだけでなく、さまざまな対象者にAIの説明可能性を紹介するためのガイダンスマテリアル、チュートリアル、インタラクティブなWebデモも含まれています。当社のツールキットと分類法を組み合わせることで、より多くの説明方法が必要とされるギャップを特定し、開発時にそれらを組み込むためのプラットフォームを提供することができます。

A General System of Differential Equations to Model First-Order Adaptive Algorithms
1次適応アルゴリズムをモデル化するための微分方程式の一般系

First-order optimization algorithms play a major role in large-scale machine learning. A new class of methods, called adaptive algorithms, was recently introduced to iteratively adjust the learning rate for each coordinate. Despite great practical success in deep learning, their behavior and performance on more general loss functions are not well understood. In this paper, we derive a non-autonomous system of differential equations, which is the continuous-time limit of adaptive optimization methods. We study the convergence of its trajectories and give conditions under which the differential system, underlying all adaptive algorithms, is suitable for optimization. We discuss convergence to a critical point in the non-convex case and give conditions for the dynamics to avoid saddle points and local maxima. For convex loss functions, we introduce a suitable Lyapunov functional which allows us to study its rate of convergence. Several other properties of both the continuous and discrete systems are briefly discussed. The differential system studied in the paper is general enough to encompass many other classical algorithms (such as Heavy Ball and Nesterov’s accelerated method) and allows us to recover several known results for these algorithms.

一次最適化アルゴリズムは、大規模機械学習で重要な役割を果たします。最近、各座標の学習率を反復的に調整する適応アルゴリズムと呼ばれる新しいクラスの手法が導入されました。ディープラーニングでは実用上大きな成功を収めていますが、より一般的な損失関数でのその動作とパフォーマンスは十分に理解されていません。この論文では、適応最適化手法の連続時間限界である非自律微分方程式系を導出します。その軌跡の収束を調べ、すべての適応アルゴリズムの基礎となる微分系が最適化に適している条件を示します。非凸の場合の臨界点への収束について説明し、ダイナミクスが鞍点と極大値を回避するための条件を示します。凸損失関数については、収束率を調べることができる適切なリアプノフ関数を導入します。連続システムと離散システムの他のいくつかの特性についても簡単に説明します。この論文で研究されている微分システムは、他の多くの古典的なアルゴリズム（Heavy BallやNesterovの加速法など）を包含できるほど汎用性が高く、これらのアルゴリズムのいくつかの既知の結果を回復することができます。

A Regularization-Based Adaptive Test for High-Dimensional GLMs
高次元GLMの正則化に基づく適応検定

In spite of its urgent importance in the era of big data, testing high-dimensional parameters in generalized linear models (GLMs) in the presence of high-dimensional nuisance parameters has been largely under-studied, especially with regard to constructing powerful tests for general (and unknown) alternatives. Most existing tests are powerful only against certain alternatives and may yield incorrect Type 1 error rates under high-dimensional nuisance parameter situations. In this paper, we propose the adaptive interaction sum of powered score (aiSPU) test in the framework of penalized regression with a non-convex penalty, called truncated Lasso penalty (TLP), which can maintain correct Type 1 error rates while yielding high statistical power across a wide range of alternatives. To calculate its p-values analytically, we derive its asymptotic null distribution. Via simulations, its superior finite-sample performance is demonstrated over several representative existing methods. In addition, we apply it and other representative tests to an Alzheimer’s Disease Neuroimaging Initiative (ADNI) data set, detecting possible gene-gender interactions for Alzheimer’s disease. We also make the R package “aispu”, which implements the proposed test, available on GitHub.

ビッグデータ時代における緊急の重要性にもかかわらず、高次元のニューサンスパラメータが存在する場合の一般化線形モデル(GLM)の高次元パラメータのテストは、特に一般的な(および未知の)代替案に対する強力なテストの構築に関しては、ほとんど研究されていません。既存のテストのほとんどは、特定の代替案に対してのみ強力であり、高次元のニューサンスパラメータの状況では誤ったタイプ1エラー率をもたらす可能性があります。この論文では、切り捨てLassoペナルティ(TLP)と呼ばれる非凸ペナルティを伴うペナルティ付き回帰のフレームワークで、適応型相互作用パワードスコア合計(aiSPU)テストを提案します。このテストは、正しいタイプ1エラー率を維持しながら、幅広い代替案にわたって高い統計的検出力をもたらします。そのp値を解析的に計算するために、その漸近的ヌル分布を導出します。シミュレーションにより、いくつかの代表的な既存の方法よりも優れた有限サンプル性能が実証されています。さらに、このテストと他の代表的なテストをアルツハイマー病神経画像化イニシアチブ(ADNI)データセットに適用し、アルツハイマー病の遺伝子と性別の相互作用の可能性を検出しました。また、提案されたテストを実装したRパッケージ「aispu」をGitHubに公開しました。

Apache Mahout: Machine Learning on Distributed Dataflow Systems
Apache Mahout: 分散データフローシステムでの機械学習

Apache Mahout is a library for scalable machine learning (ML) on distributed dataflow systems, offering various implementations of classification, clustering, dimensionality reduction and recommendation algorithms. Mahout was a pioneer in large-scale machine learning in 2008, when it started and targeted MapReduce, which was the predominant abstraction for scalable computing in industry at that time. Mahout has been widely used by leading web companies and is part of several commercial cloud offerings. In recent years, Mahout migrated to a general framework enabling a mix of dataflow programming and linear algebraic computations on backends such as Apache Spark and Apache Flink. This design allows users to execute data preprocessing and model training in a single, unified dataflow system, instead of requiring a complex integration of several specialized systems. Mahout is maintained as a community-driven open source project at the Apache Software Foundation, and is available under https://mahout.apache.org.

Apache Mahoutは、分散データフローシステム上のスケーラブルな機械学習(ML)用ライブラリであり、分類、クラスタリング、次元削減、推奨アルゴリズムのさまざまな実装を提供します。Mahoutは、2008年に開始され、当時業界でスケーラブルコンピューティングの主要な抽象化であったMapReduceをターゲットにしたとき、大規模機械学習の先駆者でした。Mahoutは、大手Web企業で広く使用されており、いくつかの商用クラウドサービスの一部となっています。近年、Mahoutは、Apache SparkやApache Flinkなどのバックエンドでデータフロープログラミングと線形代数計算を組み合わせることができる汎用フレームワークに移行しました。この設計により、ユーザーは、複数の専用システムを複雑に統合する代わりに、単一の統合されたデータフローシステムでデータの前処理とモデルトレーニングを実行できます。Mahoutは、Apache Software Foundationでコミュニティ主導のオープンソースプロジェクトとして維持されており、https://mahout.apache.orgで入手できます。

Distributed Minimum Error Entropy Algorithms
分散最小誤差エントロピーアルゴリズム

The Minimum Error Entropy (MEE) principle is an important approach in Information Theoretical Learning (ITL). It is widely applied and studied in various fields for its robustness to noise. In this paper, we study a reproducing kernel-based distributed MEE algorithm, DMEE, which is designed to work with both fully supervised data and semi-supervised data. The divide-and-conquer approach is employed, so there is no inter-node communication overhead. Like other distributed algorithms, DMEE significantly reduces the computational complexity and memory requirement on single computing nodes. With fully supervised data, our proved learning rates equal the minimax optimal learning rates of the classical pointwise kernel-based regressions. Under the semi-supervised learning scenarios, we show that DMEE exploits unlabeled data effectively, in two senses. First, under the settings with weak regularity assumptions, additional unlabeled data significantly improves the learning rates of DMEE. Second, with sufficient unlabeled data, labeled data can be distributed to many more computing nodes, so that each node takes only O(1) labels, without spoiling the learning rates in terms of the number of labels. This conclusion overcomes the saturation phenomenon in unlabeled data size. It parallels a recent result for regularized least squares (Lin and Zhou, 2018), and suggests that an inflation of unlabeled data is a solution to the MEE learning problems with decentralized data sources where privacy protection is a concern. Our work involves pairwise learning and non-convex loss. The theoretical analysis is achieved by distributed U-statistics and error decomposition techniques in integral operators.

最小エラーエントロピー(MEE)原理は、情報理論的学習(ITL)における重要なアプローチです。ノイズに対する堅牢性のため、さまざまな分野で広く適用され、研究されています。この論文では、完全教師ありデータと半教師ありデータの両方で動作するように設計された、再現カーネルベースの分散MEEアルゴリズムDMEEについて検討します。分割統治法が採用されているため、ノード間通信のオーバーヘッドはありません。他の分散アルゴリズムと同様に、DMEEは単一のコンピューティングノードでの計算の複雑さとメモリ要件を大幅に削減します。完全教師ありデータでは、証明された学習率は、従来のポイントワイズカーネルベース回帰のミニマックス最適学習率に等しくなります。半教師あり学習シナリオでは、DMEEがラベルなしデータを効果的に活用することを示します。つまり、まず、弱い規則性の仮定を伴う設定では、追加のラベルなしデータによってDMEEの学習率が大幅に向上します。第二に、十分なラベルなしデータがあれば、ラベル付きデータをより多くのコンピューティングノードに分散することができ、各ノードはラベルの数に関して学習率を損なうことなくO(1)ラベルのみを取得します。この結論は、ラベルなしデータサイズの飽和現象を克服します。これは、正規化最小二乗法(Lin and Zhou、2018)の最近の結果と並行しており、プライバシー保護の懸念に対する分散データソースを使用したMEE学習問題に対する解決策として、ラベルなしデータのインフレーションが考えられることを示唆しています。私たちの研究は、ペアワイズ学習と非凸損失について言及しています。理論分析は、分散U統計と積分演算子の誤差分解手法によって実現されます。

Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization
連続非モノトーンサブモジュラおよびDRサブモジュラ最大化のための最適アルゴリズム

In this paper we study the fundamental problems of maximizing a continuous non-monotone submodular function over the hypercube, both with and without coordinate-wise concavity. This family of optimization problems has several applications in machine learning, economics, and communication systems. Our main result is the first $\frac{1}{2}$-approximation algorithm for continuous submodular function maximization; this approximation factor of $\frac{1}{2}$ is the best possible for algorithms that only query the objective function at polynomially many points. For the special case of DR-submodular maximization, i.e. when the submodular function is also coordinate-wise concave along all coordinates, we provide a different $\frac{1}{2}$-approximation algorithm that runs in quasi-linear time. Both these results improve upon prior work (Bian et al. 2017; Soma and Yoshida, 2017). Our first algorithm uses novel ideas such as reducing the guaranteed approximation problem to analyzing a zero-sum game for each coordinate, and incorporates the geometry of this zero-sum game to fix the value at this coordinate. Our second algorithm exploits coordinate-wise concavity to identify a monotone equilibrium condition sufficient for getting the required approximation guarantee, and hunts for the equilibrium point using binary search. We further run experiments to verify the performance of our proposed algorithms in related machine learning applications.

この論文では、座標方向の凹面がある場合とない場合の両方で、ハイパーキューブ上の連続非単調サブモジュラ関数を最大化する基本的な問題を研究します。この最適化問題群は、機械学習、経済学、通信システムなど、さまざまな分野で応用されています。私たちの主な結果は、連続サブモジュラ関数の最大化に対する最初の$\frac{1}{2}$近似アルゴリズムです。この近似係数$\frac{1}{2}$は、多項式的に多くの点で目的関数を照会するだけのアルゴリズムにとって最適なものです。DRサブモジュラ最大化の特殊なケース、つまりサブモジュラ関数がすべての座標に沿って座標方向の凹面でもある場合、準線形時間で実行される別の$\frac{1}{2}$近似アルゴリズムを提供します。これらの結果は両方とも、以前の研究(Bianら2017; Soma and Yoshida, 2017)を改善しています。最初のアルゴリズムでは、保証された近似問題を各座標のゼロサムゲームの分析に縮小するなどの斬新なアイデアを使用し、このゼロサムゲームの幾何学を組み込んでこの座標の値を固定します。2番目のアルゴリズムでは、座標ごとの凹面を利用して、必要な近似保証を得るのに十分な単調な平衡条件を特定し、バイナリ検索を使用して平衡点を探します。さらに、関連する機械学習アプリケーションで提案されたアルゴリズムのパフォーマンスを検証するための実験を実行します。

Fast Bayesian Inference of Sparse Networks with Automatic Sparsity Determination
自動スパース性決定によるスパースネットワークの高速ベイズ推論

Structure learning of Gaussian graphical models typically involves careful tuning of penalty parameters, which balance the tradeoff between data fidelity and graph sparsity. Unfortunately, this tuning is often a “black art” requiring expert experience or brute-force search. It is therefore tempting to develop tuning-free algorithms that can determine the sparsity of the graph adaptively from the observed data in an automatic fashion. In this paper, we propose a novel approach, named BISN (Bayesian inference of Sparse Networks), for automatic Gaussian graphical model selection. Specifically, we regard the off-diagonal entries in the precision matrix as random variables and impose sparse-promoting horseshoe priors on them, resulting in automatic sparsity determination. With the help of stochastic gradients, an efficient variational Bayes algorithm is derived to learn the model. We further propose a decaying recursive stochastic gradient (DRSG) method to reduce the variance of the stochastic gradients and to accelerate the convergence. Our theoretical analysis shows that the time complexity of BISN scales only quadratically with the dimension, whereas the theoretical time complexity of the state-of-the-art methods for automatic graphical model selection is typically a third-order function of the dimension. Furthermore, numerical results show that BISN can achieve comparable or better performance than the state-of-the-art methods in terms of structure recovery, and yet its computational time is several orders of magnitude shorter, especially for large dimensions.

ガウスグラフィカルモデルの構造学習では、通常、データの忠実性とグラフのスパース性のトレードオフのバランスをとるペナルティパラメータを慎重に調整する必要があります。残念ながら、この調整は多くの場合、専門家の経験や力ずくの探索を必要とする「黒魔術」です。そのため、自動的に観測データからグラフのスパース性を適応的に決定できる、調整不要のアルゴリズムを開発することは魅力的です。この論文では、ガウスグラフィカルモデルの自動選択のためのBISN (スパースネットワークのベイズ推論)という新しいアプローチを提案します。具体的には、精度行列の非対角エントリをランダム変数と見なし、スパースを促進する馬蹄型事前分布を適用することで、スパース性を自動的に決定します。確率的勾配を利用して、モデルを学習するための効率的な変分ベイズアルゴリズムを導出します。さらに、確率的勾配の分散を減らして収束を加速するための減衰再帰確率的勾配(DRSG)法を提案します。私たちの理論分析によると、BISNの時間計算量は次元の2乗にしか比例しませんが、自動グラフィカルモデル選択の最先端の方法の理論的な時間計算量は、通常、次元の3次関数です。さらに、数値結果によると、BISNは構造回復の点で最先端の方法と同等かそれ以上のパフォーマンスを達成でき、計算時間は特に大きな次元では数桁短くなります。
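
The sparsity-promoting behaviour of the horseshoe prior mentioned above can be seen with a small simulation. This is only a minimal sketch, not the BISN algorithm: it draws scalar values from a standard horseshoe prior (a half-Cauchy local scale times a Gaussian) and compares them with standard Gaussian draws. The function names, the global scale of 1, and the thresholds 0.1 and 3.0 are illustrative assumptions, not taken from the paper.

```python
import math
import random


def sample_horseshoe(rng, global_scale=1.0):
    """Draw theta ~ N(0, (global_scale * lam)^2) with lam ~ half-Cauchy(0, 1)."""
    # Inverse-CDF sampling of a half-Cauchy(0, 1) local scale.
    local_scale = math.tan(math.pi * rng.random() / 2.0)
    return rng.gauss(0.0, 1.0) * global_scale * local_scale


def spike_and_tail_fractions(samples, small=0.1, large=3.0):
    """Fraction of draws very close to zero and fraction far from zero."""
    n = len(samples)
    near_zero = sum(abs(s) < small for s in samples) / n
    far_out = sum(abs(s) > large for s in samples) / n
    return near_zero, far_out


rng = random.Random(0)
N = 20000  # hypothetical sample size
horseshoe = [sample_horseshoe(rng) for _ in range(N)]
gaussian = [rng.gauss(0.0, 1.0) for _ in range(N)]

hs_near_zero, hs_far_out = spike_and_tail_fractions(horseshoe)
g_near_zero, g_far_out = spike_and_tail_fractions(gaussian)
print(f"horseshoe: near zero {hs_near_zero:.3f}, far out {hs_far_out:.3f}")
print(f"gaussian : near zero {g_near_zero:.3f}, far out {g_far_out:.3f}")
```

Compared with the Gaussian, the horseshoe draws put more mass both very near zero and far in the tails, which is why such a prior can shrink most off-diagonal precision entries strongly while leaving a few large ones almost untouched.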

Tensor Regression Networks
テンソル回帰ネットワーク

Convolutional neural networks typically consist of many convolutional layers followed by one or more fully connected layers. While convolutional layers map between high-order activation tensors, the fully connected layers operate on flattened activation vectors. Despite empirical success, this approach has notable drawbacks. Flattening followed by fully connected layers discards multilinear structure in the activations and requires many parameters. We address these problems by incorporating tensor algebraic operations that preserve multilinear structure at every layer. First, we introduce Tensor Contraction Layers (TCLs) that reduce the dimensionality of their input while preserving their multilinear structure using tensor contraction. Next, we introduce Tensor Regression Layers (TRLs), which express outputs through a low-rank multilinear mapping from a high-order activation tensor to an output tensor of arbitrary order. We learn the contraction and regression factors end-to-end, and produce accurate nets with fewer parameters. Additionally, our layers regularize networks by imposing low-rank constraints on the activations (TCL) and regression weights (TRL). Experiments on ImageNet show that, applied to VGG and ResNet architectures, TCLs and TRLs reduce the number of parameters compared to fully connected layers by more than 65% while maintaining or increasing accuracy. In addition to the space savings, our approach’s ability to leverage topological structure can be crucial for structured data such as MRI. In particular, we demonstrate significant performance improvements over comparable architectures on three tasks associated with the UK Biobank dataset.

畳み込みニューラルネットワークは、通常、多数の畳み込み層と、それに続く1つ以上の完全接続層で構成されます。畳み込み層は高次の活性化テンソル間をマッピングしますが、完全接続層は平坦化された活性化ベクトルに対して動作します。実験的には成功していますが、このアプローチには顕著な欠点があります。平坦化の後に完全接続層が続くと、活性化の多重線形構造が破棄され、多くのパラメーターが必要になります。私たちは、すべての層で多重線形構造を維持するテンソル代数演算を組み込むことで、これらの問題に対処します。まず、テンソル収縮層(TCL)を導入します。これは、テンソル収縮を使用して多重線形構造を維持しながら、入力の次元を削減します。次に、テンソル回帰層(TRL)を導入します。これは、高次の活性化テンソルから任意の階数の出力テンソルへの低ランク多重線形マッピングを通じて出力を表現します。収縮係数と回帰係数をエンドツーエンドで学習し、より少ないパラメーターで正確なネットを生成します。さらに、私たちのレイヤーは、活性化(TCL)と回帰重み(TRL)に低ランク制約を課すことでネットワークを正則化します。ImageNetでの実験では、VGGおよびResNetアーキテクチャに適用されたTCLとTRLは、完全接続層と比較して、精度を維持または向上させながら、パラメーターの数を65%以上削減することが示されています。スペースの節約に加えて、私たちのアプローチのトポロジカル構造を活用する能力は、MRIなどの構造化データにとって非常に重要になり得ます。特に、UK Biobankデータセットに関連する3つのタスクで、同等のアーキテクチャよりも大幅なパフォーマンスの向上が実証されています。
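
The parameter savings described above can be illustrated with simple arithmetic. The sketch below compares the weight count of a flatten-plus-fully-connected layer with that of a low-rank regression weight stored in Tucker format (one core tensor plus one factor matrix per mode). The Tucker parametrization, the activation shape `(512, 7, 7)`, the 1000 outputs, and the ranks are all assumptions chosen for illustration; they are not the configurations reported in the paper, and biases are ignored.

```python
from math import prod


def fully_connected_params(activation_shape, n_outputs):
    """Weights of a dense layer applied to the flattened activation tensor."""
    return prod(activation_shape) * n_outputs


def tucker_regression_params(activation_shape, n_outputs, ranks):
    """Weights of a regression tensor stored as a Tucker core plus factors."""
    dims = list(activation_shape) + [n_outputs]
    if len(ranks) != len(dims):
        raise ValueError("need one rank per tensor mode (including the output mode)")
    core = prod(ranks)  # core tensor of shape ranks
    factors = sum(d * r for d, r in zip(dims, ranks))  # one d x r matrix per mode
    return core + factors


shape = (512, 7, 7)          # hypothetical activation tensor (channels, height, width)
n_classes = 1000             # hypothetical number of outputs
ranks = (256, 5, 5, 500)     # hypothetical Tucker ranks

fc = fully_connected_params(shape, n_classes)
trl = tucker_regression_params(shape, n_classes, ranks)
print(f"fully connected: {fc:,} weights")
print(f"Tucker low-rank: {trl:,} weights ({100 * (1 - trl / fc):.1f}% fewer)")
```

Lower ranks shrink the layer further at the cost of a more restrictive low-rank constraint, which is the trade-off the abstract refers to as regularization.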

Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering
カーネル推定ノンパラメトリックオーバーラップベースの合胞体クラスタリング

Commonly-used clustering algorithms usually find ellipsoidal, spherical or other regular-structured clusters, but are more challenged when the underlying groups lack formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free fully-automated syncytial clustering algorithm that can be used with $k$-means and other algorithms. Our approach estimates the cumulative distribution function of the normed residuals from an appropriately fit $k$-groups model and calculates the estimated nonparametric overlap between each pair of clusters. Groups with high pairwise overlap are merged as long as the estimated generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets and can be applied to datasets with scatter or incomplete records. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study.

一般的に使用されるクラスタリングアルゴリズムは、通常、楕円体、球体、またはその他の規則的な構造のクラスターを見つけますが、基礎となるグループに正式な構造や定義がない場合は、より困難になります。シンシチアルクラスタリングは、標準的なクラスタリングアルゴリズムから取得されたグループをマージして、データ内の複雑なグループ構造を明らかにする方法に導入された名前です。ここでは、$k$平均法やその他のアルゴリズムで使用できる、分布フリーの完全に自動化されたシンシチアルクラスタリングアルゴリズムを開発します。私たちのアプローチは、適切に適合された$k$グループモデルから標準化残差の累積分布関数を推定し、各クラスターペア間の推定ノンパラメトリックオーバーラップを計算します。推定された一般化オーバーラップが減少する限り、ペアワイズオーバーラップが高いグループはマージされます。私たちの方法論は、複数のデータセットで規則的および不規則な構造を持つグループを識別する際に常に最高のパフォーマンスを発揮し、散在レコードや不完全なレコードを含むデータセットに適用できます。このアプローチは、バーストおよび過渡的放射源実験4Brカタログ内の異なる種類のガンマ線バーストや、機能的磁気共鳴イメージング研究における異なる種類の活性化を識別するためにも使用されます。

Agnostic Estimation for Phase Retrieval
位相回復のための非依存的推定

The goal of noisy high-dimensional phase retrieval is to estimate an $s$-sparse parameter $\boldsymbol{\beta}^*\in \mathbb{R}^d$ from $n$ realizations of the model $Y = (\mathbf{X}^T \boldsymbol{\beta}^*)^2 + \varepsilon$. Based on this model, we propose a significant semi-parametric generalization called misspecified phase retrieval (MPR), in which $Y = f(\mathbf{X}^T \boldsymbol{\beta}^*, \varepsilon)$ with unknown $f$ and $\operatorname{Cov}(Y, (\mathbf{X}^T \boldsymbol{\beta}^*)^2) > 0$. For example, MPR encompasses $Y = h(|\mathbf{X}^T \boldsymbol{\beta}^*|) + \varepsilon$ with increasing $h$ as a special case. Despite the generality of the MPR model, it eludes the reach of most existing semi-parametric estimators. In this paper, we propose an estimation procedure, which consists of solving a cascade of two convex programs and provably recovers the direction of $\boldsymbol{\beta}^*$. Furthermore, we prove that our procedure is minimax optimal over the class of MPR models. Interestingly, our minimax analysis characterizes the statistical price of misspecifying the link function in phase retrieval models. Our theory is backed up by thorough numerical results.

ノイズの多い高次元位相回復の目標は、モデル$Y = (\mathbf{X}^T \boldsymbol{\beta}^*)^2 + \varepsilon$の$n$個の実現から$s$スパースパラメータ$\boldsymbol{\beta}^*\in \mathbb{R}^d$を推定することです。このモデルに基づいて、$Y = f(\mathbf{X}^T \boldsymbol{\beta}^*, \varepsilon)$で、$f$が不明で、$\operatorname{Cov}(Y, (\mathbf{X}^T \boldsymbol{\beta}^*)^2) > 0$である、誤指定位相回復(MPR)と呼ばれる重要なセミパラメトリック一般化を提案します。たとえば、MPRは、$h$が増加する$Y = h(|\mathbf{X}^T \boldsymbol{\beta}^*|) + \varepsilon$を特別なケースとして包含します。MPRモデルの一般性にもかかわらず、既存のほとんどのセミパラメトリック推定器では対応できません。この論文では、2つの凸計画のカスケードを解くことで$\boldsymbol{\beta}^*$の方向を証明できる推定手順を提案します。さらに、この手順がMPRモデルのクラスでミニマックス最適であることを証明します。興味深いことに、ミニマックス分析は、位相回復モデルでリンク関数を誤って指定することの統計的コストを特徴付けます。私たちの理論は、徹底した数値結果によって裏付けられています。
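
The covariance condition that defines the MPR model can be checked numerically for the special case $Y = h(|\mathbf{X}^T \boldsymbol{\beta}^*|) + \varepsilon$. The following is a minimal simulation sketch, not the paper's estimation procedure: it assumes standard Gaussian covariates, Gaussian noise, the increasing link $h(u) = \sqrt{u}$, and a hypothetical 3-sparse unit vector in dimension 20, then estimates $\operatorname{Cov}(Y, (\mathbf{X}^T \boldsymbol{\beta}^*)^2)$ from samples.

```python
import math
import random


def simulate_mpr(n, beta, link, noise_sd, rng):
    """Draw n pairs ((x . beta)^2, y) with y = link(|x . beta|) + Gaussian noise."""
    squared_index, responses = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]  # assumed standard Gaussian design
        index = sum(xj * bj for xj, bj in zip(x, beta))
        squared_index.append(index ** 2)
        responses.append(link(abs(index)) + rng.gauss(0.0, noise_sd))
    return squared_index, responses


def sample_covariance(a, b):
    """Unbiased sample covariance of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


rng = random.Random(0)
d, s = 20, 3                                        # hypothetical dimension and sparsity
beta = [1.0 / math.sqrt(s)] * s + [0.0] * (d - s)   # s-sparse unit vector
squared_index, y = simulate_mpr(5000, beta, math.sqrt, 0.5, rng)

cov = sample_covariance(y, squared_index)
print(f"sample Cov(Y, (X^T beta)^2) = {cov:.3f}")
```

A clearly positive estimate indicates that this link falls inside the MPR class even though it is not the quadratic link of classical phase retrieval.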

A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning
大規模学習のための並列二重確率的アルゴリズムのクラス

We consider learning problems over training sets in which both the number of training examples and the dimension of the feature vectors are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple parallel processors to operate on a randomly chosen subset of blocks of the feature vector. RAPSA is doubly stochastic since each processor utilizes a random set of functions to compute the stochastic gradient associated with a randomly chosen set of variable coordinates. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is strongly convex. Moreover, we present an accelerated version of RAPSA (ARAPSA) that incorporates the objective function curvature information by premultiplying the descent direction by a Hessian approximation matrix. We further extend the results to asynchronous settings and show that if the processors perform their updates without any coordination, the algorithms still converge to the optimal argument. RAPSA and its extensions are then numerically evaluated on a linear estimation problem and a binary image classification task using the MNIST handwritten digit dataset.

私たちは、トレーニング例の数と特徴ベクトルの次元の両方が大きいトレーニングセットでの学習問題について考えます。これらの問題を解決するために、ランダム並列確率アルゴリズム(RAPSA)を提案します。このアルゴリズムをランダム並列と呼ぶのは、複数の並列プロセッサを使用して、特徴ベクトルのブロックのランダムに選択されたサブセットを操作するためです。RAPSAは、各プロセッサが関数のランダムセットを使用して、変数座標のランダムに選択されたセットに関連付けられた確率的勾配を計算するため、二重確率的です。これらの次元のいずれかで並列なアルゴリズムは存在しますが、RAPSAは、ブロックの選択とトレーニングセットの要素の選択の両方で並列な方法論の最初の試みです。RAPSAでは、プロセッサはランダムに選択された関数を使用して、ランダムに選択されたブロックに関連付けられた確率的勾配コンポーネントを計算します。この論文の技術的な貢献は、トレーニング目標が強く凸である場合に、この最小限に調整されたアルゴリズムが最適な分類器に収束することを示すことです。さらに、降下方向をヘッセ近似行列で事前に乗算することで目的関数の曲率情報を組み込んだRAPSAの高速バージョン(ARAPSA)を紹介します。さらに非同期設定の結果も拡張し、プロセッサが調整なしで更新を実行する場合でも、アルゴリズムが最適な引数に収束することを示します。次に、RAPSAとその拡張を、MNIST手書き数字データセットを使用して線形推定問題とバイナリ画像分類タスクで数値的に評価します。
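
The doubly stochastic update described above can be sketched on a small least-squares problem. This is a sequential simulation under stated assumptions, not the authors' implementation: the "processors" are simulated in a loop, each one picks a distinct random block of coordinates and its own random mini-batch, and all of them read the same snapshot of the weights. The function name, block layout, step size, and problem sizes are hypothetical.

```python
import random


def rapsa_least_squares(X, y, n_blocks, n_processors, batch_size, step, n_iters, rng):
    """Simulate random-block, random-sample gradient steps on 0.5 * mean squared error."""
    n, d = len(X), len(X[0])
    block_size = d // n_blocks  # assumes d is divisible by n_blocks
    w = [0.0] * d
    for _ in range(n_iters):
        snapshot = list(w)  # every simulated processor reads the same iterate
        # Each processor gets a distinct random block of coordinates.
        for block in rng.sample(range(n_blocks), n_processors):
            coords = range(block * block_size, (block + 1) * block_size)
            batch = rng.sample(range(n), batch_size)  # random subset of training examples
            residuals = [
                sum(X[i][j] * snapshot[j] for j in range(d)) - y[i] for i in batch
            ]
            for j in coords:
                grad_j = sum(r * X[i][j] for r, i in zip(residuals, batch)) / batch_size
                w[j] = snapshot[j] - step * grad_j
    return w


rng = random.Random(0)
w_true = [1.0, -2.0, 0.5, 0.0, 3.0, -1.0]  # hypothetical ground truth
X = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(200)]
y = [sum(xj * wj for xj, wj in zip(row, w_true)) for row in X]  # noiseless targets

w_hat = rapsa_least_squares(
    X, y, n_blocks=3, n_processors=2, batch_size=10, step=0.1, n_iters=2000, rng=rng
)
max_error = max(abs(a - b) for a, b in zip(w_hat, w_true))
print(f"largest coordinate error: {max_error:.2e}")
```

Because the blocks are disjoint, the simulated processors never write to the same coordinate, which is what lets the updates proceed with minimal coordination.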

Bayesian Closed Surface Fitting Through Tensor Products
テンソル積によるベイジアン閉曲面フィッティング

Closed surfaces provide a useful model for $3$-d shapes, with the data typically consisting of a cloud of points in $\mathbb{R}^3$. The existing literature on closed surface modeling focuses on frequentist point estimation methods that join surface patches along the edges, with surface patches created via Bézier surfaces or tensor products of B-splines. However, the resulting surfaces are not smooth along the edges and the geometric constraints required to join the surface patches lead to computational drawbacks. In this article, we develop a Bayesian model for closed surfaces based on tensor products of a cyclic basis resulting in infinitely smooth surface realizations. We impose sparsity on the control points through a double-shrinkage prior. Theoretical properties of the support of our proposed prior are studied and it is shown that the posterior achieves the optimal rate of convergence under reasonable assumptions on the prior. The proposed approach is illustrated with some examples.

閉曲面は、通常$\mathbb{R}^3$内の点群で構成されるデータを持つ$3$次元形状の便利なモデルを提供します。閉曲面モデリングに関する既存の文献は、エッジに沿って表面パッチを結合する頻度主義的な点推定法に焦点を当てており、表面パッチはベジェ曲面またはBスプラインのテンソル積によって作成されます。ただし、結果として得られる表面はエッジに沿って滑らかではなく、表面パッチを結合するために必要な幾何学的制約により計算上の欠点が生じます。この記事では、無限に滑らかな表面を実現する巡回基底のテンソル積に基づく閉曲面のベイズモデルを開発します。二重収縮事前分布によって制御点にスパース性を課します。提案された事前分布のサポートの理論的特性について研究し、事前分布に関する合理的な仮定の下で事後分布が最適な収束率を達成することを示します。提案されたアプローチは、いくつかの例で説明されます。

Tslearn, A Machine Learning Toolkit for Time Series Data
tslearn、時系列データ用の機械学習ツールキット

tslearn is a general-purpose Python machine learning library for time series that offers tools for pre-processing and feature extraction as well as dedicated models for clustering, classification and regression. It follows scikit-learn’s Application Programming Interface for transformers and estimators, allowing the use of standard pipelines and model selection tools on top of tslearn objects. It is distributed under the BSD-2-Clause license, and its source code is available at https://github.com/tslearn-team/tslearn.

tslearnは、時系列用の汎用Python機械学習ライブラリであり、前処理と特徴抽出のためのツールと、クラスタリング、分類、回帰のための専用モデルを提供します。これは、scikit-learnのトランスフォーマーとエスティメータ用のアプリケーションプログラミングインターフェイスに従っており、tslearnオブジェクト上で標準のパイプラインとモデル選択ツールを使用できます。BSD-2-Clauseライセンスの下で配布されており、そのソースコードはhttps://github.com/tslearn-team/tslearnで入手できます。

Regularized Estimation of High-dimensional Factor-Augmented Vector Autoregressive (FAVAR) Models
高次元因子増強ベクトル自己回帰(FAVAR)モデルの正則化推定

A factor-augmented vector autoregressive (FAVAR) model is defined by a VAR equation that captures lead-lag correlations amongst a set of observed variables $X$ and latent factors $F$, and a calibration equation that relates another set of observed variables $Y$ with $F$ and $X$. The latter equation is used to estimate the factors that are subsequently used in estimating the parameters of the VAR system. The FAVAR model has become popular in applied economic research, since it can summarize a large number of variables of interest as a few factors through the calibration equation and subsequently examine their influence on core variables of primary interest through the VAR equation. However, there is increasing need for examining lead-lag relationships between a large number of time series, while incorporating information from another high-dimensional set of variables. Hence, in this paper we investigate the FAVAR model under high-dimensional scaling. We introduce an appropriate identification constraint for the model parameters, which when incorporated into the formulated optimization problem yields estimates with good statistical properties. Further, we address a number of technical challenges introduced by the fact that estimates of the VAR system model parameters are based on estimated rather than directly observed quantities. The performance of the proposed estimators is evaluated on synthetic data. Further, the model is applied to commodity prices and reveals interesting and interpretable relationships between the prices and the factors extracted from a set of global macroeconomic indicators.

因子拡張ベクトル自己回帰(FAVAR)モデルは、観測変数$X$と潜在因子$F$の一連の間のリードラグ相関を捉えるVAR方程式と、別の観測変数$Y$を$F$および$X$と関連付ける較正方程式によって定義されます。後者の方程式は、VARシステムのパラメータを推定する際に後で使われる因子を推定するために使用されます。FAVARモデルは、較正方程式によって多数の関心変数を少数の因子としてまとめ、その後VAR方程式によって主要な関心の中心となる変数への影響を調べることができるため、応用経済研究で人気が高まっています。しかし、別の高次元変数セットからの情報を取り込みながら、多数の時系列間のリードラグ関係を調べる必要性が高まっています。そのため、この論文では、高次元スケーリングの下でのFAVARモデルを調査します。モデルパラメータに適切な識別制約を導入します。これを定式化された最適化問題に組み込むと、統計特性の優れた推定値が得られます。さらに、VARシステムモデルパラメータの推定が直接観測された量ではなく推定された量に基づいているという事実によって生じるいくつかの技術的課題に対処します。提案された推定値のパフォーマンスは合成データで評価されます。さらに、モデルは商品価格に適用され、価格と一連のグローバルマクロ経済指標から抽出された要因との間の興味深く解釈可能な関係を明らかにします。

GluonTS: Probabilistic and Neural Time Series Modeling in Python
GluonTS:Pythonでの確率的およびニューラル時系列モデリング

We introduce the Gluon Time Series Toolkit (GluonTS), a Python library for deep learning based time series modeling for ubiquitous tasks, such as forecasting and anomaly detection. GluonTS simplifies the time series modeling pipeline by providing the necessary components and tools for quick model development, efficient experimentation and evaluation. In addition, it contains reference implementations of state-of-the-art time series models that enable simple benchmarking of new algorithms.

私たちは、予測や異常検出などのユビキタスなタスクのための、ディープラーニングベースの時系列モデリング用Pythonライブラリである Gluon Time Series Toolkit(GluonTS)を紹介します。GluonTSは、迅速なモデル開発、効率的な実験、評価に必要なコンポーネントとツールを提供することで、時系列モデリングパイプラインを簡素化します。さらに、新しいアルゴリズムの簡単なベンチマークを可能にする最先端の時系列モデルのリファレンス実装が含まれています。

Identifiability and Consistent Estimation of Nonparametric Translation Hidden Markov Models with General State Space
一般状態空間を持つノンパラメトリック平行移動隠れマルコフモデルの識別可能性と一貫性推定

This paper considers hidden Markov models where the observations are given as the sum of a latent state which lies in a general state space and some independent noise with unknown distribution. It is shown that these fully nonparametric translation models are identifiable with respect to both the distribution of the latent variables and the distribution of the noise, under mostly a light tail assumption on the latent variables. Two nonparametric estimation methods are proposed and we prove that the corresponding estimators are consistent for the weak convergence topology. These results are illustrated with numerical experiments.

この論文では、観測値が一般的な状態空間にある潜在状態と未知の分布を持ついくつかの独立したノイズの合計として与えられる隠れマルコフモデルを検討します。これらの完全ノンパラメトリック変換モデルは、主に潜在変数のライトテール仮定の下で、潜在変数の分布とノイズの分布の両方に関して識別可能であることが示されています。2つのノンパラメトリック推定方法が提案され、対応する推定量が弱い収束トポロジーに対して一貫していることを証明します。これらの結果は、数値実験で示されています。

NEVAE: A Deep Generative Model for Molecular Graphs
NEVAE:分子グラフのための深層生成モデル

Deep generative models have been praised for their ability to learn smooth latent representations of images, text, and audio, which can then be used to generate new, plausible data. Motivated by these success stories, there has been a surge of interest in developing deep generative models for automated molecule design. However, these models face several difficulties due to the unique characteristics of molecular graphs—their underlying structure is not Euclidean or grid-like, they remain isomorphic under permutation of the nodes’ labels, and they come with a different number of nodes and edges. In this paper, we first propose a novel variational autoencoder for molecular graphs, whose encoder and decoder are specially designed to account for the above properties by means of several technical innovations. Moreover, in contrast with the state of the art, our decoder is able to provide the spatial coordinates of the atoms of the molecules it generates. Then, we develop a gradient-based algorithm to optimize the decoder of our model so that it learns to generate molecules that maximize the value of certain property of interest and, given any arbitrary molecule, it is able to optimize the spatial configuration of its atoms for greater stability. Experiments reveal that our variational autoencoder can discover plausible, diverse and novel molecules more effectively than several state of the art models. Moreover, for several properties of interest, our optimized decoder is able to identify molecules with property values 121% higher than those identified by several state of the art methods based on Bayesian optimization and reinforcement learning.

深層生成モデルは、画像、テキスト、音声の滑らかな潜在表現を学習し、それを使用して新しい妥当なデータを生成する能力が高く評価されています。これらの成功事例に刺激されて、自動分子設計用の深層生成モデルの開発への関心が高まっています。しかし、これらのモデルは、分子グラフのユニークな特性のためにいくつかの困難に直面しています。つまり、基礎となる構造がユークリッドやグリッド状ではなく、ノードのラベルの順列の下で同型のままであり、ノードとエッジの数が異なります。この論文では、まず分子グラフ用の新しい変分オートエンコーダを提案します。このエンコーダとデコーダは、いくつかの技術革新によって上記の特性を考慮して特別に設計されています。さらに、最先端のものとは対照的に、私たちのデコーダは、生成する分子の原子の空間座標を提供することができます。次に、モデルのデコーダーを最適化する勾配ベースのアルゴリズムを開発し、特定の対象プロパティの値を最大化する分子を生成することを学習し、任意の分子が与えられた場合に、その原子の空間構成を最適化して安定性を高めることができるようにします。実験により、変分オートエンコーダーは、いくつかの最先端モデルよりも効果的に、妥当で多様で新しい分子を発見できることが明らかになりました。さらに、いくつかの対象プロパティについて、最適化されたデコーダーは、ベイズ最適化と強化学習に基づくいくつかの最先端手法によって識別されるものよりも121%高いプロパティ値を持つ分子を識別できます。

Prediction regions through Inverse Regression
逆回帰による予測領域

Predicting a new response from a covariate is a challenging task in regression, and it has raised new questions since the advent of high-dimensional data. In this paper, we are interested in the inverse regression method from a theoretical viewpoint. Theoretical results for the Gaussian linear model are well known, but the curse of dimensionality has increased the interest of practitioners and theoreticians in generalizing those results to various estimators calibrated for the high-dimensional context. We propose to focus on inverse regression. It is known to be a reliable and efficient approach when the number of features exceeds the number of observations. Indeed, under some conditions, dealing with the inverse regression problem associated with a forward regression problem drastically reduces the number of parameters to estimate, makes the problem tractable, and allows one to consider more general distributions, such as elliptical distributions. When both the responses and the covariates are multivariate, estimators constructed by inverse regression are studied in this paper, the main result being explicit asymptotic prediction regions for the response. The performance of the proposed estimators and prediction regions is also analyzed through a simulation study and compared with usual estimators.

共変量からの新しい応答を予測することは回帰分析における困難なタスクであり、高次元データの時代以来、新たな問題を提起しています。この論文では、理論的観点から逆回帰法に注目しています。ガウス線形モデルの理論的結果はよく知られていますが、次元の呪いにより、高次元コンテキストに合わせて調整されたさまざまな推定量に対するそれらの結果を一般化することに実践者や理論家の関心が高まっています。私たちは逆回帰に焦点を当てることを提案します。逆回帰は、特徴の数が観測数を超える場合に信頼性が高く効率的なアプローチであることが知られています。実際、いくつかの条件下では、順方向回帰問題に関連する逆回帰問題に対処すると、推定するパラメーターの数が大幅に削減され、問題が扱いやすくなり、楕円分布などのより一般的な分布を考慮できるようになります。この論文では、応答と共変量の両方が多変量である場合に、逆回帰によって構築された推定量を研究し、主な結果として応答の明示的な漸近予測領域が得られることを示します。提案された推定量と予測領域のパフォーマンスは、シミュレーション研究を通じて分析され、通常の推定量と比較されます。

High-dimensional Linear Discriminant Analysis Classifier for Spiked Covariance Model
スパイク共分散モデルのための高次元線形判別分析分類器

Linear discriminant analysis (LDA) is a popular classifier that is built on the assumption of a common population covariance matrix across classes. The performance of LDA depends heavily on the quality of the estimates of the mean vectors and the population covariance matrix. This issue becomes more challenging in high-dimensional settings where the number of features is of the same order as the number of training samples. Several techniques for estimating the covariance matrix can be found in the literature. One of the most popular approaches uses estimators based on a regularized sample covariance matrix, giving the name regularized LDA (R-LDA) to the corresponding classifier. These estimators are known to be more resilient to sample noise than the traditional sample covariance matrix estimator. However, the main challenge of the regularization approach is the choice of the optimal regularization parameter, as an arbitrary choice could lead to severe degradation of the classifier performance. In this work, we propose an improved LDA classifier based on the assumption that the covariance matrix follows a spiked covariance model. The main principle of our proposed technique is the design of a parametrized inverse covariance matrix estimator, the parameters of which are shown to be easily optimized. Numerical simulations, using both real and synthetic data, show that the proposed classifier yields better classification performance than the classical R-LDA while requiring lower computational complexity.

線形判別分析(LDA)は、クラス間で共通の母共分散行列を仮定して構築された一般的な分類器です。LDAのパフォーマンスは、平均ベクトルと母共分散行列の推定品質に大きく依存します。この問題は、特徴の数がトレーニングサンプルの数と同じオーダーである高次元設定ではさらに困難になります。共分散行列を推定するいくつかの手法が文献に記載されています。最も一般的なアプローチの1つは、正則化されたサンプル共分散行列を使用した推定器であり、対応する分類器は正則化LDA (R-LDA)と呼ばれます。これらの推定器は、従来のサンプル共分散行列推定器よりもサンプルノイズに対して耐性があることが知られています。ただし、正則化アプローチの主な課題は、最適な正則化パラメーターの選択です。恣意的な選択は、分類器のパフォーマンスの大幅な低下につながる可能性があるためです。この研究では、共分散行列がスパイク共分散モデルに従うという仮定に基づいて、改良されたLDA分類器を提案します。提案する手法の主な原理は、パラメータ化された逆共分散行列推定器の設計であり、そのパラメータは簡単に最適化できることが示されています。実際のデータと合成データの両方を使用した数値シミュレーションでは、提案された分類器は、従来のR-LDAよりも計算の複雑さが少なく、より優れた分類パフォーマンスを発揮することが示されています。

MFE: Towards reproducible meta-feature extraction
MFE:再現性のあるメタフィーチャー抽出に向けて

Automated recommendation of machine learning algorithms is receiving a great deal of attention, not only because it can recommend the most suitable algorithms for a new task, but also because it can support efficient hyper-parameter tuning, leading to better machine learning solutions. The automated recommendation can be implemented using meta-learning, learning from previous learning experiences, to create a meta-model able to associate a data set with the predictive performance of machine learning algorithms. Although a large number of publications report the use of meta-learning, reproduction and comparison of meta-learning experiments is a difficult task. The literature lacks extensive and comprehensive public tools that enable the reproducible investigation of the different meta-learning approaches. An alternative to deal with this difficulty is to develop a meta-feature extractor package with the main characterization measures, following uniform guidelines that facilitate the use and inclusion of new meta-features. In this paper, we propose two Meta-Feature Extractor (MFE) packages, written in Python and R respectively, to fill this gap. The packages follow recent frameworks for meta-feature extraction, aiming to facilitate the reproducibility of meta-learning experiments.

機械学習アルゴリズムの自動推奨は、新しいタスクに最適なアルゴリズムを推奨できるだけでなく、効率的なハイパーパラメータ調整をサポートして、より優れた機械学習ソリューションにつながるため、大きな注目を集めています。自動推奨は、以前の学習経験から学習するメタ学習を使用して実装でき、データセットを機械学習アルゴリズムの予測パフォーマンスに関連付けることができるメタモデルを作成できます。多数の出版物でメタ学習の使用が報告されていますが、メタ学習実験の再現と比較は困難な作業です。文献には、さまざまなメタ学習アプローチの再現可能な調査を可能にする広範かつ包括的な公開ツールが欠けています。この困難に対処するための代替手段は、新しいメタ特徴の使用と組み込みを容易にする統一されたガイドラインに従って、主要な特性評価基準を備えたメタ特徴抽出パッケージを開発することです。この論文では、この不足を補うために、PythonとRでそれぞれ記述された2つのメタ特徴抽出(MFE)パッケージを提案します。これらのパッケージは、メタ学習実験の再現性を高めることを目的として、メタ特徴抽出の最新のフレームワークに従っています。
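
To make the notion of a meta-feature concrete, here is a minimal standard-library sketch that computes a few simple characterization measures of a labelled data set. It does not use or reproduce the API of the MFE packages; the function name, the dictionary keys, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def extract_meta_features(X, y):
    """Return a few simple, general and information-theoretic meta-features."""
    n = len(y)
    counts = Counter(y)
    # Shannon entropy of the class distribution, in bits.
    class_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    columns = list(zip(*X))  # transpose rows into attribute columns
    return {
        "n_instances": n,
        "n_attributes": len(columns),
        "n_classes": len(counts),
        "class_entropy": class_entropy,
        "mean_attribute_sd": mean([pstdev(col) for col in columns]),
    }


# Toy data set: four instances, two numeric attributes, two balanced classes.
X = [[1.0, 10.0], [2.0, 10.0], [3.0, 30.0], [4.0, 30.0]]
y = ["a", "a", "b", "b"]

meta = extract_meta_features(X, y)
for name, value in meta.items():
    print(f"{name}: {value}")
```

A meta-learning system would compute such a vector for many data sets and pair each vector with the observed performance of candidate algorithms, which is why a fixed, documented set of measures matters for reproducibility.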

ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization
ProxSARAH:確率的複合非凸最適化のための効率的なアルゴリズムフレームワーク

We propose a new stochastic first-order algorithmic framework to solve stochastic composite nonconvex optimization problems that covers both finite-sum and expectation settings. Our algorithms rely on the SARAH estimator and consist of two steps, a proximal gradient step and an averaging step, which makes them different from existing nonconvex proximal-type algorithms. The algorithms only require an average smoothness assumption on the nonconvex objective term and an additional bounded variance assumption if applied to expectation problems. They work with both constant and dynamic step-sizes, while allowing single samples and mini-batches. In all these cases, we prove that our algorithms can achieve the best-known complexity bounds in terms of the stochastic first-order oracle. One key step of our methods is the new constant and dynamic step-sizes, which result in the desired complexity bounds while improving practical performance. Our constant step-size is much larger than that of existing methods, including the proximal SVRG scheme, in the single sample case. We also specialize our framework to the non-composite case, which covers existing state-of-the-art methods in terms of oracle complexity bounds. Our update also allows one to trade off between step-sizes and mini-batch sizes to improve performance. We test the proposed algorithms on two composite nonconvex problems and neural networks using several well-known data sets.

私たちは、有限和と期待値の両方の設定をカバーする確率的複合非凸最適化問題を解決するための新しい確率的一次アルゴリズムフレームワークを提案します。我々のアルゴリズムはSARAH推定量に依存し、近似勾配と平均化ステップの2つのステップで構成されており、既存の非凸近似型アルゴリズムとは異なります。アルゴリズムは、非凸目的項の平均平滑性仮定と、期待値問題に適用する場合の追加の有界分散仮定のみを必要とします。これらは、単一サンプルとミニバッチを許可しながら、定数と動的ステップサイズの両方で機能します。これらすべてのケースで、私たちは、我々のアルゴリズムが確率的一次オラクルに関して最もよく知られている複雑さの境界を達成できることを証明しています。我々の方法の重要なステップの1つは、実用的なパフォーマンスを改善しながら望ましい複雑さの境界をもたらす新しい定数と動的ステップサイズです。我々の定数ステップサイズは、単一サンプルの場合の近似SVRGスキームを含む既存の方法よりもはるかに大きい。また、オラクルの複雑性境界に関して既存の最先端技術をカバーする非複合ケースにフレームワークを指定します。この更新により、ステップサイズとミニバッチサイズの間でトレードオフしてパフォーマンスを向上させることもできます。提案されたアルゴリズムを、いくつかのよく知られたデータセットを使用して、2つの複合非凸問題とニューラルネットワークでテストします。
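
The two-step update described above (a proximal gradient step on the SARAH estimator, followed by averaging) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the smooth term is a plain least-squares loss (convex, chosen only so the run is easy to check, whereas the paper targets nonconvex objectives), the nonsmooth term is an $\ell_1$ penalty, and the step-size `eta`, averaging weight `gamma`, epoch length, and batch size are hypothetical constants rather than the step-sizes derived in the paper.

```python
import random


def dot(a, b):
    return sum(x * z for x, z in zip(a, b))


def batch_grad(X, y, w, idx):
    """Gradient of 0.5 * mean squared error over the samples in idx."""
    g = [0.0] * len(w)
    for i in idx:
        r = dot(X[i], w) - y[i]
        for j, xj in enumerate(X[i]):
            g[j] += r * xj / len(idx)
    return g


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def objective(X, y, w, lam):
    n = len(X)
    loss = sum((dot(X[i], w) - y[i]) ** 2 for i in range(n)) / (2 * n)
    return loss + lam * sum(abs(x) for x in w)


def prox_average_step(w, v, eta, gamma, lam):
    """Proximal gradient step on estimator v, then averaging with the old iterate."""
    w_hat = soft_threshold([wj - eta * vj for wj, vj in zip(w, v)], eta * lam)
    return [(1 - gamma) * wj + gamma * hj for wj, hj in zip(w, w_hat)]


def prox_sarah(X, y, lam, eta, gamma, epochs, inner_steps, batch_size, rng):
    n = len(X)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        v = batch_grad(X, y, w, range(n))  # full gradient at the epoch snapshot
        w_prev, w = w, prox_average_step(w, v, eta, gamma, lam)
        for _ in range(inner_steps):
            idx = rng.sample(range(n), batch_size)
            g_new = batch_grad(X, y, w, idx)
            g_old = batch_grad(X, y, w_prev, idx)
            # SARAH recursion: correct the previous estimator with a gradient difference.
            v = [vj + a - b for vj, a, b in zip(v, g_new, g_old)]
            w_prev, w = w, prox_average_step(w, v, eta, gamma, lam)
    return w


rng = random.Random(0)
w_true = [2.0, -1.0, 0.0, 0.0, 0.0]  # hypothetical sparse ground truth
X = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(100)]
y = [dot(row, w_true) for row in X]
lam = 0.01

initial_obj = objective(X, y, [0.0] * len(w_true), lam)
w_hat = prox_sarah(X, y, lam, eta=0.05, gamma=0.5, epochs=60,
                   inner_steps=10, batch_size=10, rng=rng)
final_obj = objective(X, y, w_hat, lam)
print(f"objective: {initial_obj:.4f} -> {final_obj:.4f}")
```

The averaging weight `gamma` and the step-size `eta` act together, so a larger proximal step can be combined with a smaller averaging weight, which is one way to read the step-size and mini-batch trade-off mentioned in the abstract.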

Bayesian Model Selection with Graph Structured Sparsity
グラフ構造のスパース性によるベイジアンモデルの選択

We propose a general algorithmic framework for Bayesian model selection. A spike-and-slab Laplacian prior is introduced to model the underlying structural assumption. Using the notion of effective resistance, we derive an EM-type algorithm with closed-form iterations to efficiently explore possible candidates for Bayesian model selection. The deterministic nature of the proposed algorithm makes it more scalable to large-scale and high-dimensional data sets compared with existing stochastic search algorithms. When applied to sparse linear regression, our framework recovers the EMVS algorithm by Ročková and George (2014) as a special case. We also discuss extensions of our framework using tools from graph algebra to incorporate complex Bayesian models such as biclustering and submatrix localization. Extensive simulation studies and real data applications are conducted to demonstrate the superior performance of our methods over their frequentist competitors such as $\ell_0$ or $\ell_1$ penalization.

私たちは、ベイズモデル選択のための一般的なアルゴリズムフレームワークを提案します。スパイクアンドスラブラプラシアン事前分布を導入して、基礎となる構造仮定をモデル化します。有効抵抗の概念を使用して、ベイズモデル選択の候補を効率的に探索するための閉形式の反復を含むEMタイプのアルゴリズムを導出します。提案されたアルゴリズムの決定論的な性質により、既存の確率的検索アルゴリズムと比較して、大規模で高次元のデータセットへの拡張性が向上します。スパース線形回帰に適用すると、フレームワークは、Ročková とGeorge (2014)によるEMVSアルゴリズムを特殊なケースとして復元します。また、グラフ代数のツールを使用して、バイクラスタリングやサブマトリックスローカリゼーションなどの複雑なベイズモデルを組み込むフレームワークの拡張についても説明します。広範囲にわたるシミュレーション研究と実際のデータへの適用により、$\ell_0$や$\ell_1$ペナルティなどの頻度主義の競合方法よりも優れたパフォーマンスを発揮する方法を実証します。

ThunderGBM: Fast GBDTs and Random Forests on GPUs
ThunderGBM:GPU上の高速GBDTとランダムフォレスト

Gradient Boosting Decision Trees (GBDTs) and Random Forests (RFs) have been used in many real-world applications. They are often a standard recipe for building state-of-the-art solutions to machine learning and data mining problems. However, training and prediction are very expensive computationally for large and high dimensional problems. This article presents an efficient and open source software toolkit called ThunderGBM which exploits the high-performance Graphics Processing Units (GPUs) for GBDTs and RFs. ThunderGBM supports classification, regression and ranking, and can run on single or multiple GPUs of a machine. Our experimental results show that ThunderGBM outperforms the existing libraries while producing similar models, and can handle high dimensional problems where existing GPU-based libraries fail. Documentation, examples, and more details about ThunderGBM are available at https://github.com/xtra-computing/thundergbm.

勾配ブースティング決定木(GBDT)とランダムフォレスト(RF)は、多くの実世界のアプリケーションで使用されています。多くの場合、機械学習やデータマイニングの問題に対する最先端のソリューションを構築するための標準的なレシピです。ただし、トレーニングと予測は、大規模で高次元の問題に対して計算的に非常にコストがかかります。この記事では、GBDTおよびRF用の高性能グラフィックスプロセッシングユニット(GPU)を活用するThunderGBMと呼ばれる効率的でオープンソースのソフトウェアツールキットを紹介します。ThunderGBMは、分類、回帰、およびランク付けをサポートし、マシンの単一または複数のGPUで実行できます。私たちの実験結果は、ThunderGBMが同様のモデルを生成しながら既存のライブラリよりも優れており、既存のGPUベースのライブラリが失敗する高次元の問題を処理できることを示しています。ThunderGBMに関するドキュメント、例、および詳細については、https://github.com/xtra-computing/thundergbmで入手できます。

Change Point Estimation in a Dynamic Stochastic Block Model
動的確率的ブロックモデルにおける変化点推定

We consider the problem of estimating the location of a single change point in a network generated by a dynamic stochastic block model mechanism. This model produces community structure in the network that exhibits change at a single time epoch. We propose two methods of estimating the change point, together with the model parameters, before and after its occurrence. The first employs a least-squares criterion function, takes into consideration the full structure of the stochastic block model, and is evaluated at each point in time. Hence, as an intermediate step, it requires estimating the community structure based on a clustering algorithm at every time point. The second method comprises the following two steps: in the first one, a least-squares function is used and evaluated at each time point, but ignoring the community structure and only considering a random graph generating mechanism exhibiting a change point. Once the change point is identified, in the second step, all network data before and after it are used together with a clustering algorithm to obtain the corresponding community structures and subsequently estimate the generating stochastic block model parameters. The first method, since it requires knowledge of the community structure and hence clustering at every point in time, is significantly more computationally expensive than the second one. On the other hand, it requires a significantly less stringent identifiability condition for consistent estimation of the change point and the model parameters than the second method; however, it also requires a condition on the rate at which network nodes are misallocated to communities, which may fail to hold in many realistic settings. Despite the apparent stringency of the identifiability condition for the second method, we show that networks generated by a stochastic block mechanism exhibiting a change in their structure can easily satisfy this condition under a multitude of scenarios, including merging/splitting communities, nodes joining another community, etc. Further, for both methods under their respective identifiability and certain additional regularity conditions, we establish rates of convergence and derive the asymptotic distributions of the change point estimators. The results are illustrated on synthetic data. In summary, this work provides an in-depth investigation of the novel problem of change point analysis for networks generated by stochastic block models, identifies key conditions for the consistent estimation of the change point, and proposes a computationally fast algorithm that solves the problem in many settings that occur in applications. Finally, it discusses challenges posed by employing clustering algorithms in this problem, which require additional investigation for their full resolution.

私たちは、動的確率ブロックモデルメカニズムによって生成されたネットワーク内の単一の変化点の位置を推定する問題について考えます。このモデルは、単一の時間エポックで変化を示すネットワーク内のコミュニティ構造を生成します。変化点の発生前と発生後にモデルパラメータとともに変化点を推定する2つの方法を提案します。最初の方法は、最小二乗基準関数を使用し、確率ブロックモデルの完全な構造を考慮し、各時点で評価します。したがって、中間ステップとして、各時点でクラスタリングアルゴリズムに基づいてコミュニティ構造を推定する必要があります。2番目の方法は、次の2つのステップで構成されます。最初のステップでは、最小二乗関数が使用され、各時点で評価されますが、コミュニティ構造は無視され、変化点を示すランダムグラフ生成メカニズムのみが考慮されます。変化点が特定されると、2番目のステップでは、変化点の前後のすべてのネットワークデータがクラスタリングアルゴリズムとともに使用され、対応するコミュニティ構造が取得され、その後、生成する確率ブロックモデルパラメータが推定されます。最初の方法は、コミュニティ構造の知識、したがってすべての時点でのクラスタリングを必要とするため、2番目の方法よりも計算コストが大幅に高くなります。一方、変化点とモデルパラメーターの一貫した推定に必要な識別可能性条件は、2番目の方法よりも大幅に緩やかです。ただし、ネットワークノードをそれぞれのコミュニティに誤って割り当てる誤分類率に関する条件も必要であり、この条件は多くの現実的な設定では成立しない可能性があります。2番目の方法の識別可能性条件は明らかに厳格ですが、構造の変化を示す確率的ブロックメカニズムによって生成されたネットワークは、コミュニティの合併/分割、ノードの別のコミュニティへの参加など、さまざまなシナリオでこの条件を簡単に満たすことができることを示しています。さらに、それぞれの識別可能性条件と特定の追加の規則性条件の下で、両方の方法について収束率を確立し、変化点推定量の漸近分布を導出します。結果は合成データで示されます。要約すると、この研究では、確率的ブロックモデルによって生成されたネットワークの変化点分析という新しい問題を詳細に調査し、変化点の一貫した推定のための重要な条件を特定し、アプリケーションで発生する多くの設定で問題を解決する計算速度の速いアルゴリズムを提案しています。最後に、この問題でクラスタリングアルゴリズムを使用することで生じる課題について説明します。これらの課題を完全に解決するには、追加の調査が必要です。
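
The first step of the second method described above (a least-squares scan that ignores community structure) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each edge is given its own mean before and after a candidate change point, and the helper names (`sample_sbm`, `estimate_change_point`) and all parameter values are hypothetical.

```python
import numpy as np

def sample_sbm(rng, labels, block_probs):
    """Draw one undirected adjacency matrix from a stochastic block model."""
    probs = block_probs[np.ix_(labels, labels)]          # n x n edge probabilities
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)               # symmetric, empty diagonal

def estimate_change_point(adj_seq):
    """Least-squares scan over candidate change points.

    adj_seq has shape (T, n, n). The returned tau is the number of
    snapshots assigned to the pre-change regime (1 <= tau <= T - 1).
    Community structure is ignored: every edge gets its own mean.
    """
    n_snapshots = adj_seq.shape[0]
    best_tau, best_loss = None, np.inf
    for tau in range(1, n_snapshots):
        before, after = adj_seq[:tau], adj_seq[tau:]
        loss = ((before - before.mean(axis=0)) ** 2).sum() \
             + ((after - after.mean(axis=0)) ** 2).sum()
        if loss < best_loss:
            best_tau, best_loss = tau, loss
    return best_tau

# Synthetic example (assumed values): ten nodes switch community at time 18.
rng = np.random.default_rng(0)
block_probs = np.array([[0.6, 0.1], [0.1, 0.6]])
labels_before = np.repeat([0, 1], [20, 20])
labels_after = np.repeat([0, 1], [30, 10])
adj_seq = np.stack([
    sample_sbm(rng, labels_before if t < 18 else labels_after, block_probs)
    for t in range(30)
])
tau_hat = estimate_change_point(adj_seq)
```

Once `tau_hat` is available, the second step would cluster the pooled snapshots on each side of it to recover communities; that part is omitted here.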

Quadratic Decomposable Submodular Function Minimization: Theory and Practice
二次分解可能部分モジュラー関数の最小化:理論と実践

We introduce a new convex optimization problem, termed quadratic decomposable submodular function minimization (QDSFM), which makes it possible to model a number of learning tasks on graphs and hypergraphs. The problem exhibits close ties to decomposable submodular function minimization (DSFM) yet is much more challenging to solve. We approach the problem via a new dual strategy and formulate an objective that can be optimized through a number of double-loop algorithms. The outer-loop uses either random coordinate descent (RCD) or alternative projection (AP) methods, for both of which we prove linear convergence rates. The inner-loop computes projections onto cones generated by base polytopes of the submodular functions via the modified min-norm-point or Frank-Wolfe algorithms. We also describe two new applications of QDSFM: hypergraph-adapted PageRank and semi-supervised learning. The proposed hypergraph-based PageRank algorithm can be used for local hypergraph partitioning and comes with provable performance guarantees. For hypergraph-adapted semi-supervised learning, we provide numerical experiments demonstrating the efficiency of our QDSFM solvers and their significant improvements in prediction accuracy when compared to state-of-the-art methods.

私たちは、2次分解可能サブモジュラ関数最小化(QDSFM)と呼ばれる新しい凸最適化問題を導入します。これにより、グラフとハイパーグラフ上の多くの学習タスクをモデル化できます。この問題は、分解可能サブモジュラ関数最小化(DSFM)と密接な関係がありますが、解決がはるかに困難です。私たちは、新しいデュアル戦略でこの問題に取り組み、いくつかの二重ループアルゴリズムで最適化できる目的を定式化します。外側のループは、ランダム座標降下法(RCD)または代替射影法(AP)のいずれかを使用し、どちらの方法も線形収束率を証明します。内側のループは、修正された最小ノルム点アルゴリズムまたはFrank-Wolfeアルゴリズムを使用して、サブモジュラ関数のベース多面体によって生成された円錐への射影を計算します。また、QDSFMの2つの新しいアプリケーション、ハイパーグラフ適応型PageRankと半教師あり学習についても説明します。提案されたハイパーグラフベースのPageRankアルゴリズムは、ローカルハイパーグラフの分割に使用でき、証明可能なパフォーマンス保証が付属しています。ハイパーグラフ適応型半教師あり学習については、QDSFMソルバーの効率性と、最先端の方法と比較した場合の予測精度の大幅な改善を示す数値実験を提供します。

Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization
確率的条件付き勾配法:凸最小化からサブモジュラ最大化へ

This paper considers stochastic optimization problems for a large class of objective functions, including convex and continuous submodular. Stochastic proximal gradient methods have been widely used to solve such problems; however, their applicability remains limited when the problem dimension is large and the projection onto a convex set is computationally costly. Instead, stochastic conditional gradient algorithms are proposed as an alternative solution which rely on (i) approximating gradients via a simple averaging technique requiring a single stochastic gradient evaluation per iteration; (ii) solving a linear program to compute the descent/ascent direction. The gradient averaging technique reduces the noise of gradient approximations as time progresses, and replacing the projection step in proximal methods by a linear program lowers the computational complexity of each iteration. We show that under convexity and smoothness assumptions, our proposed stochastic conditional gradient method converges to the optimal objective function value at a sublinear rate of $\mathcal{O}(1/t^{1/3})$. Further, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that our proposed method achieves a $((1-1/e)\text{OPT} -\epsilon)$ guarantee (in expectation) with $\mathcal{O}(1/\epsilon^3)$ stochastic gradient computations. This guarantee matches the known hardness results and closes the gap between deterministic and stochastic continuous submodular maximization. Additionally, we achieve a $((1/e)\text{OPT} -\epsilon)$ guarantee after operating on $\mathcal{O}(1/\epsilon^3)$ stochastic gradients for the case that the objective function is continuous DR-submodular but non-monotone and the constraint set is a down-closed convex body. By using stochastic continuous optimization as an interface, we also provide the first $(1-1/e)$ tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a general matroid constraint and a $(1/e)$ approximation guarantee for the non-monotone case.

この論文では、凸および連続サブモジュラーを含む、大規模な目的関数の確率的最適化問題を考察します。確率的近似勾配法は、このような問題を解決するために広く使用されていますが、問題の次元が大きく、凸集合への射影が計算コストが高い場合、その適用範囲は依然として限られています。代わりに、(i)反復ごとに1回の確率的勾配評価を必要とする単純な平均化手法による勾配の近似、(ii)線形計画法を解いて降下/上昇方向を計算する、という2つの方法を使用する代替ソリューションとして、確率的条件付き勾配アルゴリズムが提案されています。勾配平均化手法により、時間の経過に伴う勾配近似のノイズが軽減され、近似法の射影ステップを線形計画法に置き換えることで、各反復の計算の複雑さが軽減されます。凸性と滑らかさの仮定の下で、提案する確率的条件付き勾配法は、サブ線形速度$\mathcal{O}(1/t^{1/3})$で最適な目的関数値に収束することを示します。さらに、単調で連続的なDRサブモジュラ関数で、一般的な凸体制約に従う場合、提案手法が$\mathcal{O}{(1/\epsilon^3)}$の確率的勾配計算で$((1-1/e)\text{OPT} -\epsilon)$保証(期待値)を達成することを証明します。この保証は既知の困難性の結果と一致し、決定論的連続サブモジュラ最大化と確率的連続サブモジュラ最大化の間のギャップを埋めます。さらに、目的関数が連続DRサブモジュラだが非単調であり、制約セットが下向きに閉じた凸体である場合に、$\mathcal{O}{(1/\epsilon^3)}$の確率的勾配を操作した後で$((1/e)\text{OPT} -\epsilon)$保証を達成します。確率的連続最適化をインターフェースとして使用することにより、一般的なマトロイド制約に従う単調だが確率的なサブモジュラ集合関数を最大化するための最初の$(1-1/e)$厳密な近似保証と、非単調なケースに対する$(1/e)$近似保証も提供します。
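
The two ingredients named in the abstract, gradient averaging and a linear program in place of a projection, can be sketched on a toy convex problem. This is a minimal sketch under assumptions: the constraint set is the probability simplex (so the linear program has a closed-form vertex solution), and the averaging weight `rho` and step size `gamma` are illustrative schedules, not necessarily those analysed in the paper.

```python
import numpy as np

def lmo_simplex(direction):
    """Linear minimization oracle: argmin over the simplex of <v, direction>."""
    vertex = np.zeros_like(direction)
    vertex[np.argmin(direction)] = 1.0
    return vertex

def stochastic_conditional_gradient(stoch_grad, x0, n_iters, rng):
    """Projection-free stochastic minimization over the probability simplex."""
    x = x0.copy()
    d = np.zeros_like(x0)                     # running average of stochastic gradients
    for t in range(n_iters):
        rho = 4.0 / (t + 8) ** (2.0 / 3.0)    # assumed averaging weight (equals 1 at t = 0)
        gamma = 2.0 / (t + 8)                 # assumed step size
        d = (1.0 - rho) * d + rho * stoch_grad(x, rng)   # one stochastic gradient per step
        v = lmo_simplex(d)                    # linear program replaces the projection
        x = (1.0 - gamma) * x + gamma * v     # convex combination keeps x feasible
    return x

# Toy objective (assumed): f(x) = 0.5 * ||x - target||^2 with noisy gradients.
target = np.array([0.2, 0.3, 0.5])

def noisy_grad(x, rng):
    return (x - target) + 0.05 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_hat = stochastic_conditional_gradient(noisy_grad, np.array([1.0, 0.0, 0.0]), 20000, rng)
```

Because every iterate is a convex combination of simplex vertices, feasibility holds without any projection step, which is the practical appeal of the conditional gradient family.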

Sparse Projection Oblique Randomer Forests
スパース投影斜めランダマーフォレスト

Decision forests, including Random Forests and Gradient Boosting Trees, have recently demonstrated state-of-the-art performance in a variety of machine learning settings. Decision forests are typically ensembles of axis-aligned decision trees; that is, trees that split only along feature dimensions. In contrast, many recent extensions to decision forests are based on axis-oblique splits. Unfortunately, these extensions forfeit one or more of the favorable properties of decision forests based on axis-aligned splits, such as robustness to many noise dimensions, interpretability, or computational efficiency. We introduce yet another decision forest, called “Sparse Projection Oblique Randomer Forests” (SPORF). SPORF trees recursively split along very sparse random projections. Our method significantly improves accuracy over existing state-of-the-art algorithms on a standard benchmark suite for classification with $>100$ problems of varying dimension, sample size, and number of classes. To illustrate how SPORF addresses the limitations of both axis-aligned and existing oblique decision forest methods, we conduct extensive simulated experiments. SPORF typically yields improved performance over existing decision forest methods, while preserving computational efficiency and scalability and maintaining interpretability. Very sparse random projections can be incorporated into gradient boosted trees to obtain potentially similar gains.

ランダムフォレストや勾配ブースティングツリーなどの決定フォレストは、最近、さまざまな機械学習設定で最先端のパフォーマンスを発揮しています。決定フォレストは通常、軸に沿った決定ツリーの集合体です。つまり、特徴次元に沿ってのみ分割されるツリーです。対照的に、決定フォレストの最近の拡張の多くは、軸斜め分割に基づいています。残念ながら、これらの拡張では、多くのノイズ次元に対する堅牢性、解釈可能性、計算効率など、軸に沿った分割に基づく決定フォレストの1つ以上の好ましい特性が失われています。ここでは、「スパース投影斜めランダマーフォレスト」(SPORF)と呼ばれる、さらに別の決定フォレストを紹介します。SPORFツリーは、非常にスパースなランダム投影に沿って再帰的に分割します。私たちの方法は、さまざまな次元、サンプルサイズ、クラス数の100を超える問題での分類の標準ベンチマークスイートで、既存の最先端のアルゴリズムよりも精度を大幅に向上させます。SPORFが軸整列型および既存の斜め決定フォレスト法の両方の限界にどのように対処するかを示すために、広範なシミュレーション実験を実施しました。SPORFは通常、既存の決定フォレスト法よりも優れたパフォーマンスを発揮しつつ、計算効率とスケーラビリティを保ち、解釈可能性を維持します。非常にスパースなランダム投影を勾配ブーストツリーに組み込むことで、同様の利点を得られる可能性があります。
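
To make "very sparse random projections" concrete, the sketch below draws a projection matrix whose entries are mostly zero and otherwise +1 or -1, then forms candidate split features as sparse linear combinations of the original features. It is an illustration under assumptions: the function name, the `density` parameter, and the matrix sizes are hypothetical, and no tree-growing logic is included.

```python
import numpy as np

def sparse_random_projection(n_features, n_projections, density, rng):
    """Return a matrix with entries in {-1, 0, +1}.

    Each entry is nonzero with probability `density`; nonzero entries
    are +1 or -1 with equal probability.
    """
    signs = rng.choice([-1.0, 1.0], size=(n_features, n_projections))
    mask = rng.random((n_features, n_projections)) < density
    return np.where(mask, signs, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))            # 100 samples, 20 features
A = sparse_random_projection(20, 8, density=0.1, rng=rng)
candidates = X @ A                            # each column is one candidate split feature
```

A split node would then search for the best threshold over the columns of `candidates` rather than over single coordinates of `X`; with a low `density`, each candidate involves only a few original features, which helps keep splits interpretable.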

Stochastic Nested Variance Reduction for Nonconvex Optimization
非凸最適化のための確率的枝分かれ分散縮小

We study nonconvex optimization problems, where the objective function is either an average of $n$ nonconvex functions or the expectation of some stochastic function. We propose a new stochastic gradient descent algorithm based on nested variance reduction, namely, Stochastic Nested Variance-Reduced Gradient descent ($\text{SNVRG}$). Compared with the conventional stochastic variance reduced gradient ($\text{SVRG}$) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, $\text{SNVRG}$ converges to an $\epsilon$-approximate first-order stationary point within $\tilde O(n\land\epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$ number of stochastic gradient evaluations. This improves the best known gradient complexity of $\text{SVRG}$, $O(n+n^{2/3}\epsilon^{-2})$, and that of $\text{SCSG}$, $O(n\land \epsilon^{-2}+\epsilon^{-10/3}\land n^{2/3}\epsilon^{-2})$. For gradient dominated functions, $\text{SNVRG}$ also achieves better gradient complexity than the state-of-the-art algorithms. Based on $\text{SNVRG}$, we further propose two algorithms that can find local minima faster than state-of-the-art algorithms in both finite-sum and general stochastic (online) nonconvex optimization. In particular, for finite-sum optimization problems, the proposed $\text{SNVRG}+\text{Neon2}^{\text{finite}}$ algorithm achieves $\tilde{O}(n^{1/2}\epsilon^{-2}+n\epsilon_H^{-3}+n^{3/4}\epsilon_H^{-7/2})$ gradient complexity to converge to an $(\epsilon, \epsilon_H)$-second-order stationary point, which outperforms $\text{SVRG}+\text{Neon2}^{\text{finite}}$ (Allen-Zhu and Li, 2018), the best existing algorithm, in a wide regime. For general stochastic optimization problems, the proposed $\text{SNVRG}+\text{Neon2}^{\text{online}}$ achieves $\tilde{O}(\epsilon^{-3}+\epsilon_H^{-5}+\epsilon^{-2}\epsilon_H^{-3})$ gradient complexity, which is better than both $\text{SVRG}+\text{Neon2}^{\text{online}}$ (Allen-Zhu and Li, 2018) and $\text{Natasha2}$ (Allen-Zhu, 2018a) in certain regimes. Thorough experimental results on different nonconvex optimization problems back up our theory.

私たちは、目的関数が$n$個の非凸関数の平均か、何らかの確率関数の期待値のいずれかである非凸最適化問題を研究しています。私たちは、ネストされた分散減少に基づく新しい確率的勾配降下アルゴリズム、すなわち、確率的ネストされた分散減少勾配降下法($\text{SNVRG}$)を提案します。2つの参照点を使用して各反復で減少する分散を持つ半確率的勾配を構築する従来の確率的分散減少勾配($\text{SVRG}$)アルゴリズムと比較して、我々のアルゴリズムは、各反復でその分散をさらに減らすために、$K+1$個のネストされた参照点を使用して半確率的勾配を構築します。滑らかな非凸関数の場合、$\text{SNVRG}$は、確率的勾配評価回数$\tilde O(n\land\epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$以内で$\epsilon$近似の1次定常点に収束します。これにより、$\text{SVRG}$の既知の最高勾配複雑度$O(n+n^{2/3}\epsilon^{-2})$と$\text{SCSG}$の最高勾配複雑度$O(n\land \epsilon^{-2}+\epsilon^{-10/3}\land n^{2/3}\epsilon^{-2})$が向上します。勾配が支配的な関数の場合、$\text{SNVRG}$は最先端のアルゴリズムよりも優れた勾配複雑度も実現します。さらに、$\text{SNVRG}$に基づいて、有限和および一般確率的(オンライン)非凸最適化の両方において最先端のアルゴリズムよりも高速に局所最小値を見つけることができる2つのアルゴリズムを提案します。特に、有限和最適化問題の場合、提案された$\text{SNVRG}+\text{Neon2}^{\text{finite}}$アルゴリズムは、$(\epsilon, \epsilon_H)$ 2次定常点に収束するのに$\tilde{O}(n^{1/2}\epsilon^{-2}+n\epsilon_H^{-3}+n^{3/4}\epsilon_H^{-7/2})$の勾配複雑度を達成し、広い領域で既存の最良のアルゴリズムである$\text{SVRG}+\text{Neon2}^{\text{finite}}$ (Allen-Zhu and Li, 2018)よりも優れています。一般的な確率的最適化問題の場合、提案された$\text{SNVRG}+\text{Neon2}^{\text{online}}$は$\tilde{O}(\epsilon^{-3}+\epsilon_H^{-5}+\epsilon^{-2}\epsilon_H^{-3})$の勾配複雑度を達成します。これは、特定の状況では$\text{SVRG}+\text{Neon2}^{\text{online}}$ (Allen-Zhu and Li, 2018)および$\text{Natasha2}$ (Allen-Zhu, 2018a)の両方よりも優れています。さまざまな非凸最適化問題に関する徹底的な実験結果が、私たちの理論を裏付けています。
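
The semi-stochastic gradient that SVRG builds from a single reference point can be sketched as below; SNVRG nests several such reference points, which this sketch does not attempt. The component functions, dimensions, and variable names are assumptions chosen only to give a smooth nonconvex finite sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 50, 5
A = rng.standard_normal((n, dim))
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of the assumed component f_i(x) = 0.5 * (tanh(a_i . x) - b_i)^2."""
    z = np.tanh(A[i] @ x)
    return (z - b[i]) * (1.0 - z ** 2) * A[i]

def full_grad(x):
    """Gradient of the average of all n components."""
    return np.mean([grad_i(x, i) for i in range(n)], axis=0)

def svrg_estimate(x, x_ref, g_ref, i):
    """Semi-stochastic gradient: correct a stored full gradient with one sample."""
    return grad_i(x, i) - grad_i(x_ref, i) + g_ref

x_ref = rng.standard_normal(dim)                  # reference point
g_ref = full_grad(x_ref)                          # full gradient, computed once
x = x_ref + 0.01 * rng.standard_normal(dim)       # current iterate near the reference

estimates = np.array([svrg_estimate(x, x_ref, g_ref, i) for i in range(n)])
plain = np.array([grad_i(x, i) for i in range(n)])
```

Averaged over a uniformly random index `i`, the estimator equals the full gradient at `x`, and its variance shrinks as `x` approaches `x_ref`. Comparing `estimates.var(axis=0).sum()` with `plain.var(axis=0).sum()` shows the reduction relative to plain stochastic gradients.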

AI-Toolbox: A C++ library for Reinforcement Learning and Planning (with Python Bindings)
AI-Toolbox: 強化学習とプランニングのための C++ ライブラリ (Python バインディング付き)

This paper describes AI-Toolbox, a C++ software library that contains reinforcement learning and planning algorithms, and supports both single-agent and multi-agent problems, as well as partial observability. It is designed for simplicity and clarity, and contains extensive documentation of its API and code. It supports Python to enable users not comfortable with C++ to take advantage of the library’s speed and functionality. AI-Toolbox is free software, and is hosted online at https://github.com/Svalorzen/AI-Toolbox.

この論文では、強化学習と計画アルゴリズムを含み、シングルエージェント問題とマルチエージェント問題の両方、および部分的な可観測性をサポートするC ++ソフトウェアライブラリであるAI-Toolboxについて説明します。シンプルさと明瞭さを追求して設計されており、APIとコードに関する広範なドキュメントが含まれています。Pythonをサポートしているため、C ++に慣れていないユーザーがライブラリの速度と機能を利用できるようになります。AI-Toolboxはフリーソフトウェアであり、https://github.com/Svalorzen/AI-Toolboxでオンラインホストされています。

Regularized Gaussian Belief Propagation with Nodes of Arbitrary Size
任意のサイズのノードによる正則化ガウス信念伝播

Gaussian belief propagation (GaBP) is a message-passing algorithm that can be used to perform approximate inference on a pairwise Markov graph (MG) constructed from a multivariate Gaussian distribution in canonical parameterization. The output of GaBP is a set of approximate univariate marginals for each variable in the pairwise MG. An extension of GaBP (labeled GaBP-m), allowing for the approximation of higher-dimensional marginal distributions, was explored by Kamper et al. (2019). The idea is to create an MG in which each node is allowed to receive more than one variable. As in the univariate case, the multivariate extension does not necessarily converge in loopy graphs and, even if convergence occurs, is not guaranteed to provide exact inference. To address the problem of convergence, we consider a multivariate extension of the principle of node regularization proposed by Kamper et al. (2018). We label this algorithm slow GaBP-m (sGaBP-m), where the term ‘slow’ relates to the damping effect of the regularization on the message passing. We prove that, given sufficient regularization, this algorithm will converge and provide the exact marginal means at convergence, regardless of the way variables are assigned to nodes. The selection of the degree of regularization is addressed through the use of a heuristic, which is based on a tree representation of sGaBP-m. As a further contribution, we extend other GaBP variants in the literature to allow for higher-dimensional marginalization. We show that our algorithm compares favorably with these variants, both in terms of convergence speed and inference quality.

ガウス確信伝播法(GaBP)は、正準パラメータ化の多変量ガウス分布から構築されたペアワイズマルコフグラフ(MG)で近似推論を実行するために使用できるメッセージパッシングアルゴリズムです。GaBPの出力は、ペアワイズMG内の各変数の近似単変量周辺分布のセットです。高次元周辺分布の近似を可能にするGaBPの拡張(GaBP-mとラベル付け)は、Kamperら(2019)によって検討されました。アイデアは、各ノードが複数の変数を受け取ることができるMGを作成することです。単変量の場合と同様に、多変量拡張はループグラフで必ずしも収束するわけではなく、収束が起こったとしても正確な推論を提供することは保証されません。収束の問題に対処するために、Kamperら(2018)によって提案されたノード正則化の原理の多変量拡張を検討します。このアルゴリズムを低速GaBP-m (sGaBP-m)と名付けます。ここで「低速」という用語は、メッセージパッシングに対する正則化の減衰効果に関連しています。十分な正則化が与えられれば、このアルゴリズムは収束し、変数がノードに割り当てられる方法に関係なく、収束時に正確な限界平均を提供することを証明します。正則化の度合いの選択は、sGaBP-mのツリー表現に基づくヒューリスティックの使用によって対処されます。さらなる貢献として、文献にある他のGaBPバリアントを拡張して、より高次元の限界化を可能にします。収束速度と推論品質の両方の点で、私たちのアルゴリズムがこれらのバリアントに匹敵することを示します。
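
For orientation, the plain univariate GaBP message-passing scheme (without the node regularization or multivariate nodes that the paper studies) can be sketched as follows. The system matrix is an assumed diagonally dominant example, for which the iteration is known to converge; the function name and iteration count are hypothetical.

```python
import numpy as np

def gabp(A, b, n_iters=100):
    """Synchronous scalar Gaussian belief propagation for precision matrix A.

    P[i, j] and h[i, j] hold the precision and potential messages sent
    from node i to node j. Returns approximate marginal means and precisions.
    """
    n = len(b)
    P = np.zeros((n, n))
    h = np.zeros((n, n))
    for _ in range(n_iters):
        P_new = np.zeros_like(P)
        h_new = np.zeros_like(h)
        for i in range(n):
            for j in range(n):
                if i == j or A[i, j] == 0.0:
                    continue
                # Aggregate everything node i knows, excluding what j told it.
                P_cavity = A[i, i] + P[:, i].sum() - P[j, i]
                h_cavity = b[i] + h[:, i].sum() - h[j, i]
                P_new[i, j] = -A[i, j] ** 2 / P_cavity
                h_new[i, j] = -A[i, j] * h_cavity / P_cavity
        P, h = P_new, h_new
    precisions = np.diag(A) + P.sum(axis=0)
    means = (b + h.sum(axis=0)) / precisions
    return means, precisions

# Assumed example: a loopy, diagonally dominant 3-node graph.
A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
means, precisions = gabp(A, b)
```

On this example the means agree with `np.linalg.solve(A, b)`, while the marginal precisions on a loopy graph are in general only approximate. When `A` is not well conditioned for message passing, the iteration may diverge, which is the situation the regularized variant is designed to handle.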

General Latent Feature Models for Heterogeneous Datasets
異種データセットの一般的な潜在特徴モデル

Latent variable models allow capturing the hidden structure underlying the data. In particular, feature allocation models represent each observation by a linear combination of latent variables. These models are often used to make predictions either for new observations or for missing information in the original data, as well as to perform exploratory data analysis. Although there is an extensive literature on latent feature allocation models for homogeneous datasets, where all the attributes that describe each object are of the same (continuous or discrete) type, there is no general framework for practical latent feature modeling for heterogeneous datasets. In this paper, we introduce a general Bayesian nonparametric latent feature allocation model suitable for heterogeneous datasets, where the attributes describing each object can be arbitrary combinations of real-valued, positive real-valued, categorical, ordinal and count variables. The proposed model presents several important properties. First, it is suitable for heterogeneous data while keeping the properties of conjugate models, which enables us to develop an inference algorithm that presents linear complexity with respect to the number of objects and attributes per MCMC iteration. Second, the Bayesian nonparametric component allows us to place a prior distribution on the number of features required to capture the latent structure in the data. Third, the latent features in the model are binary-valued, which facilitates the interpretability of the obtained latent features in exploratory data analysis. Finally, a software package, called GLFM toolbox, is made publicly available for other researchers to use and extend. It is available at https://ivaleram.github.io/GLFM/. We show the flexibility of the proposed model by solving both prediction and data analysis tasks on several real-world datasets.

潜在変数モデルは、データの背後にある隠れた構造を捉えることを可能にします。特に、特徴割り当てモデルは、各観測値を潜在変数の線形結合で表します。これらのモデルは、新しい観測値または元のデータに欠けている情報の予測や、探索的データ分析の実行によく使用されます。各オブジェクトを記述するすべての属性が同じ(連続または離散)タイプである同質データセットの潜在特徴割り当てモデルに関する広範な文献がありますが、異質データセットの実用的な潜在特徴モデリングのための一般的なフレームワークはありません。この論文では、異質データセットに適した一般的なベイジアンノンパラメトリック潜在特徴割り当てモデルを紹介します。異質データセットでは、各オブジェクトを記述する属性は、実数値、正の実数値、カテゴリ、順序、およびカウント変数の任意の組み合わせにすることができます。提案されたモデルは、いくつかの重要な特性を示します。まず、共役モデルの特性を維持しながら異種データに適しているため、MCMC反復ごとのオブジェクトと属性の数に関して線形複雑性を示す推論アルゴリズムを開発できます。次に、ベイジアンノンパラメトリックコンポーネントを使用すると、データの潜在構造をキャプチャするために必要な特徴の数に事前分布を設定できます。3番目に、モデル内の潜在的な特徴はバイナリ値であるため、探索的データ分析で取得された潜在的な特徴の解釈が容易になります。最後に、GLFMツールボックスと呼ばれるソフトウェアパッケージが、他の研究者が使用および拡張できるように公開されています。https://ivaleram.github.io/GLFM/で入手できます。いくつかの実際のデータセットで予測タスクとデータ分析タスクの両方を解決することで、提案モデルの柔軟性を示します。

Joint Causal Inference from Multiple Contexts
複数のコンテキストからの共同因果推論

The gold standard for discovering causal relations is by means of experimentation. Over the last decades, alternative methods have been proposed that can infer causal relations between variables from certain statistical patterns in purely observational data. We introduce Joint Causal Inference (JCI), a novel approach to causal discovery from multiple data sets from different contexts that elegantly unifies both approaches. JCI is a causal modeling framework rather than a specific algorithm, and it can be implemented using any causal discovery algorithm that can take into account certain background knowledge. JCI can deal with different types of interventions (e.g., perfect, imperfect, stochastic, etc.) in a unified fashion, and does not require knowledge of intervention targets or types in case of interventional data. We explain how several well-known causal discovery algorithms can be seen as addressing special cases of the JCI framework, and we also propose novel implementations that extend existing causal discovery methods for purely observational data to the JCI setting. We evaluate different JCI implementations on synthetic data and on flow cytometry protein expression data and conclude that JCI implementations can considerably outperform state-of-the-art causal discovery algorithms.

因果関係を発見するためのゴールドスタンダードは、実験によるものです。過去数十年にわたり、純粋に観察されたデータの特定の統計パターンから変数間の因果関係を推測できる代替方法が提案されてきました。ここでは、異なるコンテキストの複数のデータセットから因果関係を発見する新しいアプローチであるJoint Causal Inference (JCI)を紹介します。これは、両方のアプローチをエレガントに統合したものです。JCIは、特定のアルゴリズムではなく因果モデリングフレームワークであり、特定の背景知識を考慮できる因果関係発見アルゴリズムを使用して実装できます。JCIは、さまざまな種類の介入(完全、不完全、確率的など)を統一的に処理でき、介入データの場合は介入対象やタイプに関する知識を必要としません。いくつかのよく知られている因果関係発見アルゴリズムが、JCIフレームワークの特殊なケースに対応していると見なすことができる理由を説明し、純粋に観察されたデータに対する既存の因果関係発見方法をJCI設定に拡張する新しい実装も提案します。私たちは、合成データとフローサイトメトリータンパク質発現データでさまざまなJCI実装を評価し、JCI実装が最先端の因果発見アルゴリズムを大幅に上回る性能を発揮できるという結論に達しました。

A General Framework for Consistent Structured Prediction with Implicit Loss Embeddings
暗黙的な損失埋め込みによる一貫性のある構造化予測のための一般的なフレームワーク

We propose and analyze a novel theoretical and algorithmic framework for structured prediction. While so far the term has referred to discrete output spaces, here we consider more general settings, such as manifolds or spaces of probability measures. We define structured prediction as a problem where the output space lacks a vectorial structure. We identify and study a large class of loss functions that implicitly defines a suitable geometry on the problem. The latter is the key to developing an algorithmic framework amenable to a sharp statistical analysis and yielding efficient computations. When dealing with output spaces with infinite cardinality, a suitable implicit formulation of the estimator is shown to be crucial.

私たちは、構造化された予測のための新しい理論的およびアルゴリズム的フレームワークを提案し、分析します。これまで、この用語は離散出力空間を指していましたが、ここでは、多様体や確率測度の空間など、より一般的な設定について考察します。構造化予測とは、出力空間にベクトル構造がない問題と定義します。問題上の適切な幾何学を暗黙的に定義する損失関数の大きなクラスを特定し、研究します。後者は、鋭い統計分析に対応し、効率的な計算を行うアルゴリズムフレームワークを開発するための鍵です。無限のカーディナリティを持つ出力空間を扱う場合、推定量の適切な暗黙的な定式化が重要であることが示されています。

Loss Control with Rank-one Covariance Estimate for Short-term Portfolio Optimization
短期ポートフォリオ最適化のためのランク1共分散推定による損失制御

In short-term portfolio optimization (SPO), some financial characteristics, such as the expected return and the true covariance, might be dynamic. Then only a small window of $w$ observations is sufficiently close to the current moment and reliable enough for estimation. $w$ is usually much smaller than the number of assets $d$, which leads to a typical undersampled problem. Worse still, the asset price relatives are unlikely to follow any proper distribution. These facts violate the statistical assumptions of the traditional covariance estimates and invalidate their statistical efficiency and consistency in risk measurement. In this paper, we propose to reconsider the function of covariance estimates from the perspective of operators, and establish a rank-one covariance estimate in the principal rank-one tangent space at the observation matrix. Moreover, we propose a loss control scheme with this estimate, which effectively captures the instantaneous risk structure and avoids extreme losses. We conduct extensive experiments on $7$ real-world benchmark daily or monthly data sets with stocks, funds and portfolios from diverse regional markets to show that the proposed method achieves state-of-the-art performance in comprehensive downside risk metrics and earns good investment returns as well. It offers a novel perspective of rank-related approaches for undersampled estimations in SPO.

短期ポートフォリオ最適化(SPO)では、期待収益や真の共分散などの一部の財務特性が動的である可能性があります。その場合、現在の瞬間に十分近く、推定を行うのに信頼できる観測のウィンドウサイズ$w$は小さくなります。$w$は通常、資産数$d$よりもはるかに小さいため、典型的なサンプル不足の問題が発生します。さらに悪いことに、資産価格の相対値は適切な分布に従う可能性は低いです。これらの事実は、従来の共分散推定値の統計的仮定に違反し、リスク測定における統計的効率と一貫性を無効にします。この論文では、オペレーターの観点から共分散推定値の機能を再考し、観測行列の主ランク1接線空間でランク1共分散推定値を確立することを提案します。さらに、この推定値を使用して、瞬間的なリスク構造を効果的に捉え、極端な損失を回避する損失制御スキームを提案します。私たちは、多様な地域市場の株式、ファンド、ポートフォリオを含む7つの実際のベンチマーク日次または月次データセットで広範な実験を行い、提案された方法が包括的な下落リスク指標で最先端のパフォーマンスを達成し、良好な投資収益も得られることを実証しました。これは、SPOにおけるアンダーサンプル推定に対するランク関連アプローチの新しい視点を提供します。
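
The undersampling problem mentioned above ($w \ll d$) can be seen directly in the rank of the sample covariance. The sketch below uses synthetic Gaussian data purely as a stand-in for asset returns; the window size and asset count are assumed values, and the paper's rank-one estimate itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
w, d = 5, 30                                  # window of 5 observations, 30 assets
returns = rng.standard_normal((w, d))         # synthetic stand-in for price relatives

cov = np.cov(returns, rowvar=False)           # d x d sample covariance
rank = np.linalg.matrix_rank(cov)             # at most w - 1 after centering
```

With only `w` observations, the centered data span at most `w - 1` directions, so the `d x d` sample covariance is singular and cannot be inverted as classical mean-variance formulas require.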

pyDML: A Python Library for Distance Metric Learning
pyDML: 距離メトリック学習のためのPythonライブラリ

pyDML is an open-source Python library that provides a wide range of distance metric learning algorithms. Distance metric learning can be useful to improve similarity learning algorithms, such as the nearest neighbors classifier, and also has other applications, like dimensionality reduction. The pyDML package currently provides more than 20 algorithms, which can be categorized, according to their purpose, as: dimensionality reduction algorithms, algorithms to improve nearest neighbors or nearest centroids classifiers, information theory based algorithms or kernel based algorithms, among others. In addition, the library also provides some utilities for the visualization of classifier regions, parameter tuning and a stats website with the performance of the implemented algorithms. The package relies on the scipy ecosystem, is fully compatible with scikit-learn, and is distributed under the GPLv3 license. Source code and documentation can be found at https://github.com/jlsuarezdiaz/pyDML.

pyDMLは、幅広い距離メトリック学習アルゴリズムを提供するオープンソースのPythonライブラリです。距離メトリック学習は、最近傍分類器などの類似性学習アルゴリズムを改善するのに役立ち、次元削減などの他のアプリケーションもあります。pyDMLパッケージは現在、20以上のアルゴリズムを提供しており、その目的に応じて、次元削減アルゴリズム、最近傍または最近重心分類器を改善するアルゴリズム、情報理論ベースのアルゴリズム、カーネルベースのアルゴリズムなどに分類できます。さらに、ライブラリは、分類器領域の視覚化、パラメーターの調整、および実装されたアルゴリズムのパフォーマンスに関する統計Webサイトのためのいくつかのユーティリティも提供します。このパッケージはscipyエコシステムに依存しており、scikit-learnと完全に互換性があり、GPLv3ライセンスの下で配布されています。ソースコードとドキュメントはhttps://github.com/jlsuarezdiaz/pyDMLにあります。

Cornac: A Comparative Framework for Multimodal Recommender Systems
Cornac:マルチモーダルレコメンダーシステムの比較フレームワーク

Cornac is an open-source Python framework for multimodal recommender systems. In addition to core utilities for accessing, building, evaluating, and comparing recommender models, Cornac is distinctive in putting emphasis on recommendation models that leverage auxiliary information in the form of a social network, item textual descriptions, product images, etc. Such multimodal auxiliary data supplement user-item interactions (e.g., ratings, clicks), which tend to be sparse in practice. To facilitate broad adoption and community contribution, Cornac is publicly available at https://github.com/PreferredAI/cornac, and it can be installed via Anaconda or the Python Package Index (pip). Not only is it well-covered by unit tests to ensure code quality, but it is also accompanied by detailed documentation, tutorials, examples, and several built-in benchmarking data sets.

Cornacは、マルチモーダルレコメンダーシステム用のオープンソースのPythonフレームワークです。Cornacは、レコメンダーモデルへのアクセス、構築、評価、比較のためのコアユーティリティに加えて、ソーシャルネットワーク、アイテムのテキスト説明、製品画像などの形式の補助情報を活用するレコメンデーションモデルに重点を置いている点が特徴です。このようなマルチモーダルな補助データは、実際にはまばらになりがちなユーザーとアイテムのインタラクション(評価、クリックなど)を補完します。幅広い採用とコミュニティの貢献を促進するために、Cornacはhttps://github.com/PreferredAI/cornacで公開されており、AnacondaまたはPython Package Index (pip)を介してインストールできます。コードの品質を確保するための単体テストで十分にカバーされているだけでなく、詳細なドキュメント、チュートリアル、例、およびいくつかの組み込みのベンチマークデータセットも付属しています。

Minimax Nonparametric Parallelism Test
ミニマックスノンパラメトリック平行度検定

Testing the hypothesis of parallelism is a fundamental statistical problem arising in many applied sciences. In this paper, we develop a nonparametric parallelism test for inferring whether the trends are parallel in treatment and control groups. In particular, the proposed nonparametric parallelism test is a Wald type test based on a smoothing spline ANOVA (SSANOVA) model which can characterize the complex patterns of the data. We show that the asymptotic null distribution of the test statistic is a Chi-square distribution, unveiling a new version of the Wilks phenomenon. Notably, we establish the minimax sharp lower bound of the distinguishable rate for the nonparametric parallelism test using information theory, and further prove that the proposed test is minimax optimal. Simulation studies are conducted to investigate the empirical performance of the proposed test. DNA methylation and neuroimaging studies are presented to illustrate potential applications of the test. The software is available at https://github.com/BioAlgs/Parallelism.

平行性の仮説を検定することは、多くの応用科学から生じる基本的な統計的問題です。この論文では、治療群と対照群の傾向が平行であるかどうかを推測するためのノンパラメトリック平行性検定を開発します。特に、提案されたノンパラメトリック平行性検定は、データの複雑なパターンを特徴付けることができる平滑化スプライン分散分析(SSANOVA)モデルに基づくWald型検定です。検定統計量の漸近的帰無分布がカイ二乗分布であることを導き、ウィルクス現象の新しいバージョンを明らかにします。特に、情報理論を使用してノンパラメトリック平行性検定の識別率のミニマックスシャープ下限を確立し、さらに提案された検定がミニマックス最適であることを証明します。提案された検定の実証的性能を調査するために、シミュレーション研究が行われます。検定の潜在的な用途を示すために、DNAメチル化と神経画像研究が提示されます。ソフトウェアはhttps://github.com/BioAlgs/Parallelismで入手できます。

Distributed Kernel Ridge Regression with Communications
通信による分散カーネルリッジ回帰

This paper focuses on generalization performance analysis for distributed algorithms in the framework of learning theory. Taking distributed kernel ridge regression (DKRR) as an example, we succeed in deriving its optimal learning rates in expectation and providing theoretically optimal ranges of the number of local processors. Due to the gap between theory and experiments, we also deduce optimal learning rates for DKRR in probability to essentially reflect the generalization performance and limitations of DKRR. Furthermore, we propose a communication strategy to improve the learning performance of DKRR and demonstrate the power of communications in DKRR via both theoretical assessments and numerical experiments.

この論文では、学習理論のフレームワークにおける分散アルゴリズムの一般化パフォーマンス分析に焦点を当てています。分散カーネルリッジ回帰(DKRR)を例にとると、期待値で最適な学習率を導き出し、理論的には最適なローカルプロセッサの数の範囲を提供することに成功しました。理論と実験の間にはギャップがあるため、DKRRの一般化性能と限界を本質的に反映する確率におけるDKRRの最適な学習率も推論します。さらに、DKRRの学習性能を向上させるためのコミュニケーション戦略を提案し、理論的評価と数値実験の両面からDKRRにおけるコミュニケーションの力を実証します。
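
The basic divide-and-conquer form of DKRR (fit kernel ridge regression on each local data block, then average the local predictors) can be sketched as follows. The communication strategy proposed in the paper is not included. The Gaussian kernel, its bandwidth, the regularization value, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def fit_krr(x, y, lam, bandwidth):
    """Kernel ridge regression on one data block; returns a predictor."""
    K = gaussian_kernel(x, x, bandwidth)
    alpha = np.linalg.solve(K + lam * len(x) * np.eye(len(x)), y)
    return lambda x_new: gaussian_kernel(x_new, x, bandwidth) @ alpha

def fit_dkrr(x, y, n_machines, lam, bandwidth):
    """Fit KRR independently on each block and average the local predictors."""
    blocks = np.array_split(np.arange(len(x)), n_machines)
    local = [fit_krr(x[idx], y[idx], lam, bandwidth) for idx in blocks]
    return lambda x_new: np.mean([f(x_new) for f in local], axis=0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 400)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(400)

predict = fit_dkrr(x, y, n_machines=4, lam=1e-3, bandwidth=0.2)
grid = np.linspace(0.0, 1.0, 101)
mse = np.mean((predict(grid) - np.sin(2 * np.pi * grid)) ** 2)
```

Each machine solves a linear system of size `n / n_machines` instead of `n`, which is where the computational saving comes from. The abstract's question of how many local processors can be used without losing the optimal rate corresponds to how large `n_machines` may grow here.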

Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients
メトロポリゼーションハミルトニアンモンテカルロの高速混合:マルチステップグラジエントの利点

Hamiltonian Monte Carlo (HMC) is a state-of-the-art Markov chain Monte Carlo sampling algorithm for drawing samples from smooth probability densities over continuous spaces. We study the variant most widely used in practice, Metropolized HMC with the Störmer-Verlet or leapfrog integrator, and make two primary contributions. First, we provide a non-asymptotic upper bound on the mixing time of the Metropolized HMC with explicit choices of step-size and number of leapfrog steps. This bound gives a precise quantification of the faster convergence of Metropolized HMC relative to simpler MCMC algorithms such as the Metropolized random walk, or Metropolized Langevin algorithm. Second, we provide a general framework for sharpening mixing time bounds of Markov chains initialized at a substantial distance from the target distribution over continuous spaces. We apply this sharpening device to the Metropolized random walk and Langevin algorithms, thereby obtaining improved mixing time bounds from a non-warm initial distribution.

ハミルトニアンモンテカルロ(HMC)は、連続空間上の滑らかな確率密度からサンプルを抽出する最先端のマルコフ連鎖モンテカルロサンプリングアルゴリズムです。私たちは、実際に最も広く使用されている変種である、ストーマーヴェルレ積分器またはリープフロッグ積分器を使用したメトロポリ化HMCを研究し、2つの主要な貢献をしました。まず、ステップサイズとリープフロッグステップの数を明示的に選択したメトロポリ化HMCの混合時間の非漸近的な上限を提供します。この上限は、メトロポリ化ランダムウォークやメトロポリ化ランジュバンアルゴリズムなどのより単純なMCMCアルゴリズムと比較して、メトロポリ化HMCの収束が速いことを正確に定量化します。次に、連続空間上のターゲット分布からかなり離れた位置に初期化されたマルコフ連鎖の混合時間境界をシャープにする一般的なフレームワークを提供します。このシャープ化デバイスをメトロポリ化ランダムウォークとランジュバンアルゴリズムに適用し、非ウォーム初期分布から改善された混合時間境界を取得します。
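
A minimal sketch of Metropolized HMC with the leapfrog integrator is given below, applied to a standard Gaussian target. The step size and number of leapfrog steps are illustrative assumptions, not the tuned choices derived in the paper.

```python
import numpy as np

def leapfrog(q, p, grad_U, step, n_steps):
    """Leapfrog (Stormer-Verlet) integration of Hamiltonian dynamics."""
    p = p - 0.5 * step * grad_U(q)            # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + step * p                      # full step for position
        p = p - step * grad_U(q)              # full step for momentum
    q = q + step * p                          # last position step
    p = p - 0.5 * step * grad_U(q)            # final half step for momentum
    return q, p

def metropolized_hmc(U, grad_U, q0, step, n_steps, n_samples, rng):
    """HMC with a Metropolis accept/reject correction after each trajectory."""
    q = np.array(q0, dtype=float)
    samples = np.empty((n_samples, q.size))
    for k in range(n_samples):
        p0 = rng.standard_normal(q.size)      # resample momentum
        q_new, p_new = leapfrog(q, p0, grad_U, step, n_steps)
        h_old = U(q) + 0.5 * p0 @ p0
        h_new = U(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.random()) < h_old - h_new:   # accept with prob min(1, e^{-dH})
            q = q_new
        samples[k] = q
    return samples

def U(q):
    """Negative log-density of a standard Gaussian, up to a constant."""
    return 0.5 * q @ q

def grad_U(q):
    return q

rng = np.random.default_rng(0)
samples = metropolized_hmc(U, grad_U, np.zeros(2), step=0.2, n_steps=10,
                           n_samples=5000, rng=rng)
```

The accept/reject step removes the discretization bias of the integrator. The paper's mixing-time bounds concern how `step` and `n_steps` should scale with dimension and conditioning, which this toy setting does not exercise.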

Simultaneous Inference for Pairwise Graphical Models with Generalized Score Matching
一般化スコアマッチングによるペアワイズグラフィカルモデルのための同時推論

Probabilistic graphical models provide a flexible yet parsimonious framework for modeling dependencies among nodes in networks. There is a vast literature on parameter estimation and consistent model selection for graphical models. However, in many applications, scientists are also interested in quantifying the uncertainty associated with the estimated parameters and selected models, which the current literature has not addressed thoroughly. In this paper, we propose a novel estimator for statistical inference on edge parameters in pairwise graphical models based on the generalized Hyvärinen scoring rule. The Hyvärinen scoring rule is especially useful in cases where the normalizing constant cannot be obtained efficiently in a closed form, which is a common problem for graphical models, including Ising models and truncated Gaussian graphical models. Our estimator allows us to perform statistical inference for general graphical models whereas the existing works mostly focus on statistical inference for Gaussian graphical models where finding the normalizing constant is computationally tractable. Under mild conditions that are typically assumed in the literature for consistent estimation, we prove that our proposed estimator is $\sqrt{n}$-consistent and asymptotically normal, which allows us to construct confidence intervals and build hypothesis tests for edge parameters. Moreover, we show how our proposed method can be applied to test hypotheses that involve a large number of model parameters simultaneously. We illustrate the validity of our estimator through extensive simulation studies on a diverse collection of data-generating processes.

確率的グラフィカルモデルは、ネットワーク内のノード間の依存関係をモデル化するための柔軟かつ簡潔なフレームワークを提供します。グラフィカルモデルのパラメータ推定と一致性のあるモデル選択に関する文献は膨大にあります。しかし、多くのアプリケーションにおいて、科学者は推定されたパラメータと選択されたモデルに関連する不確実性の定量化にも関心があり、現在の文献ではこの点が十分に扱われていません。この論文では、一般化ヒュヴァリネンスコアリングルールに基づく、ペアワイズグラフィカルモデルのエッジパラメータの統計的推論のための新しい推定量を提案します。ヒュヴァリネンスコアリングルールは、正規化定数を閉形式で効率的に取得できない場合に特に役立ちます。これは、イジングモデルや切断ガウスグラフィカルモデルなどのグラフィカルモデルに共通する問題です。既存の研究は主に、正規化定数を求めることが計算的に扱いやすいガウスグラフィカルモデルの統計的推論に焦点を当てていますが、私たちの推定量を使用すると、一般的なグラフィカルモデルの統計的推論を実行できます。一致推定のために文献で一般的に想定されている穏やかな条件下で、提案する推定量が$\sqrt{n}$一致性を持ち、漸近的に正規であることを証明します。これにより、エッジパラメータの信頼区間と仮説検定を構築できます。さらに、多数のモデルパラメータを同時に含む仮説の検定に提案手法を適用する方法を示します。さまざまなデータ生成プロセスに関する広範なシミュレーション研究を通じて、推定量の妥当性を示します。

Probabilistic Symmetries and Invariant Neural Networks
確率的対称性と不変ニューラルネットワーク

Treating neural network inputs and outputs as random variables, we characterize the structure of neural networks that can be used to model data that are invariant or equivariant under the action of a compact group. Much recent research has been devoted to encoding invariance under symmetry transformations into neural network architectures, in an effort to improve the performance of deep neural networks in data-scarce, non-i.i.d., or unsupervised settings. By considering group invariance from the perspective of probabilistic symmetry, we establish a link between functional and probabilistic symmetry, and obtain generative functional representations of probability distributions that are invariant or equivariant under the action of a compact group. Our representations completely characterize the structure of neural networks that can be used to model such distributions and yield a general program for constructing invariant stochastic or deterministic neural networks. We demonstrate that examples from the recent literature are special cases, and develop the details of the general program for exchangeable sequences and arrays.

ニューラルネットワークの入力と出力を確率変数として扱い、コンパクト群の作用下で不変または等変であるデータをモデル化するために使用できるニューラルネットワークの構造を特徴付けます。最近の多くの研究は、データが乏しい、非i.i.d.、または教師なしの設定でのディープニューラルネットワークのパフォーマンスを向上させるために、対称変換下での不変性をニューラルネットワークアーキテクチャにエンコードすることに注力しています。確率的対称性の観点から群不変性を考慮することにより、関数的対称性と確率的対称性の間のリンクを確立し、コンパクト群の作用下で不変または等変である確率分布の生成的な関数表現を取得します。私たちの表現は、そのような分布をモデル化するために使用できるニューラルネットワークの構造を完全に特徴付け、不変な確率的または決定論的ニューラルネットワークを構築するための一般的なプログラムを与えます。最近の文献の例が特殊なケースであることを示し、交換可能な列と配列について一般的なプログラムの詳細を展開します。

Causal Discovery from Heterogeneous/Nonstationary Data
異種/非定常データからの因果関係の発見

It is commonplace to encounter heterogeneous or nonstationary data, of which the underlying generating process changes across domains or over time. Such a distribution shift feature presents both challenges and opportunities for causal discovery. In this paper, we develop a framework for causal discovery from such data, called Constraint-based causal Discovery from heterogeneous/NOnstationary Data (CD-NOD), to find causal skeleton and directions and estimate the properties of mechanism changes. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and recover the skeleton of the causal structure over observed variables. Second, we present a method to determine causal orientations by making use of independent changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. After learning the causal structure, next, we investigate how to efficiently estimate the “driving force” of the nonstationarity of a causal mechanism. That is, we aim to extract from data a low-dimensional representation of changes. The proposed methods are nonparametric, with no hard restrictions on data distributions and causal mechanisms, and do not rely on window segmentation. Furthermore, we find that data heterogeneity benefits causal structure identification even with particular types of confounders. Finally, we show the connection between heterogeneity/nonstationarity and soft intervention in causal discovery. Experimental results on various synthetic and real-world data sets (task-fMRI and stock market data) are presented to demonstrate the efficacy of the proposed methods.

異質または非定常のデータに遭遇することは珍しくなく、その基礎となる生成プロセスはドメイン間または時間の経過とともに変化します。このような分布シフトという特徴は、因果発見に課題と機会の両方をもたらします。この論文では、そのようなデータから因果発見を行うためのフレームワークであるCD-NOD (Constraint-based causal Discovery from heterogeneous/NOnstationary Data)を開発し、因果スケルトンと因果の方向を見つけ、メカニズムの変化の特性を推定します。まず、局所的なメカニズムが変化する変数を検出し、観測された変数上の因果構造のスケルトンを復元するための、強化された制約ベースの手順を提案します。次に、変化する分布によって運ばれる情報の恩恵を受けて、基礎となる因果モデルによって暗示されるデータ分布の独立した変化を利用して因果の方向性を決定する方法を示します。因果構造を学習した後、因果メカニズムの非定常性の「駆動力」を効率的に推定する方法を調査します。つまり、データから変化の低次元表現を抽出することを目指しています。提案された方法はノンパラメトリックであり、データ分布と因果メカニズムに厳しい制限はなく、ウィンドウセグメンテーションに依存しません。さらに、データの異質性は、特定の種類の交絡因子がある場合でも因果構造の特定に役立つことがわかりました。最後に、異質性/非定常性と因果発見におけるソフト介入の関係を示します。提案された方法の有効性を示すために、さまざまな合成データセットと実世界のデータセット（タスクfMRIと株式市場データ）での実験結果を示します。

Target–Aware Bayesian Inference: How to Beat Optimal Conventional Estimators
ターゲットを意識したベイズ推論: 最適な従来型推定量を打ち負かす方法

Standard approaches for Bayesian inference focus solely on approximating the posterior distribution. Typically, this approximation is, in turn, used to calculate expectations for one or more target functions—a computational pipeline that is inefficient when the target function(s) are known upfront. We address this inefficiency by introducing a framework for target–aware Bayesian inference (TABI) that estimates these expectations directly. While conventional Monte Carlo estimators have a fundamental limit on the error they can achieve for a given sample size, our TABI framework is able to breach this limit; it can theoretically produce arbitrarily accurate estimators using only three samples, while we show empirically that it can also breach this limit in practice. We utilize our TABI framework by combining it with adaptive importance sampling approaches and show both theoretically and empirically that the resulting estimators are capable of converging faster than the standard $\mathcal{O}(1/N)$ Monte Carlo rate, potentially producing rates as fast as $\mathcal{O}(1/N^2)$. We further combine our TABI framework with amortized inference methods, to produce a method for amortizing the cost of calculating expectations. Finally, we show how TABI can be used to convert any marginal likelihood estimator into a target aware inference scheme and demonstrate the substantial benefits this can yield.

ベイズ推論の標準的なアプローチは、事後分布の近似のみに焦点を当てています。通常、この近似は、1つ以上のターゲット関数の期待値を計算するために使用されます。これは、ターゲット関数が事前にわかっている場合には非効率的な計算パイプラインです。私たちは、これらの期待値を直接推定するターゲット認識ベイズ推論(TABI)のフレームワークを導入することで、この非効率性に対処します。従来のモンテカルロ推定量には、特定のサンプルサイズで達成できる誤差に基本的な制限がありますが、私たちのTABIフレームワークはこの制限を破ることができます。理論的には3つのサンプルのみを使用して任意の精度の推定量を作成できますが、実際にこの制限を破ることができることを経験的に示しています。私たちは、TABIフレームワークを適応型重要度サンプリング手法と組み合わせて使用し、結果として得られる推定量が標準的な$\mathcal{O}(1/N)$モンテカルロレートよりも速く収束し、潜在的に$\mathcal{O}(1/N^2)$ほどの速さのレートを生成できることを理論的にも経験的にも示しています。さらに、TABIフレームワークを償却推論法と組み合わせて、期待値を計算するコストを償却する方法を作成します。最後に、TABIを使用して、あらゆる周辺尤度推定量をターゲット認識推論スキームに変換する方法を示し、これによって得られる大きな利点を実証します。

Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data
ゲノムデータにおけるピーク検出のための制約付き動的計画法と教師ありペナルティ学習アルゴリズム

Peak detection in genomic data involves segmenting counts of DNA sequence reads aligned to different locations of a chromosome. The goal is to detect peaks with higher counts, and filter out background noise with lower counts. Most existing algorithms for this problem are unsupervised heuristics tailored to patterns in specific data types. We propose a supervised framework for this problem, using optimal changepoint detection models with learned penalty functions. We propose the first dynamic programming algorithm that is guaranteed to compute the optimal solution to changepoint detection problems with constraints between adjacent segment mean parameters. Implementing this algorithm requires the choice of penalty parameter that determines the number of segments that are estimated. We show how the supervised learning ideas of Rigaill et al. (2013) can be used to choose this penalty. We compare the resulting implementation of our algorithm to several baselines in a benchmark of labeled ChIP-seq data sets with two different patterns (broad H3K36me3 data and sharp H3K4me3 data). Whereas baseline unsupervised methods only provide accurate peak detection for a single pattern, our supervised method achieves state-of-the-art accuracy in all data sets. The log-linear timings of our proposed dynamic programming algorithm make it scalable to the large genomic data sets that are now common. Our implementation is available in the PeakSegOptimal R package on CRAN.

ゲノムデータにおけるピーク検出には、染色体の異なる位置に整列されたDNAシーケンスリードのカウントをセグメント化することが含まれます。目標は、カウントが高いピークを検出し、カウントが低いバックグラウンドノイズを除去することです。この問題に対する既存のアルゴリズムのほとんどは、特定のデータタイプのパターンに合わせた教師なしヒューリスティックです。私たちは、学習されたペナルティ関数を使用した最適な変化点検出モデルを使用して、この問題に対する教師ありフレームワークを提案します。私たちは、隣接するセグメント平均パラメータ間の制約を伴う変化点検出問題に対する最適解を計算することが保証された最初の動的計画法アルゴリズムを提案します。このアルゴリズムを実装するには、推定されるセグメント数を決定するペナルティパラメータを選択する必要があります。私たちは、Rigaillら(2013)の教師あり学習のアイデアを使用してこのペナルティを選択する方法を示します。私たちは、2つの異なるパターン(ブロードH3K36me3データとシャープH3K4me3データ)を持つラベル付きChIP-seqデータセットのベンチマークで、アルゴリズムの結果として得られた実装をいくつかのベースラインと比較します。ベースラインの教師なし手法では、単一のパターンに対してのみ正確なピーク検出が行われますが、私たちの教師あり手法では、すべてのデータセットで最先端の精度が実現されます。私たちが提案する動的計画法アルゴリズムは計算時間が対数線形であるため、現在一般的になっている大規模なゲノムデータセットにもスケールします。私たちの実装は、CRANのPeakSegOptimal Rパッケージで入手できます。

Convergence Rate of Optimal Quantization and Application to the Clustering Performance of the Empirical Measure
最適量子化の収束率と経験測度のクラスタリング性能への応用

We study the convergence rate of the optimal quantization for a probability measure sequence $(\mu_{n})_{n\in\mathbb{N}^{*}}$ on $\mathbb{R}^{d}$ converging in the Wasserstein distance in two aspects: the first one is the convergence rate of optimal quantizer $x^{(n)}\in(\mathbb{R}^{d})^{K}$ of $\mu_{n}$ at level $K$; the other one is the convergence rate of the distortion function valued at $x^{(n)}$, called the “performance” of $x^{(n)}$. Moreover, we also study the mean performance of the optimal quantization for the empirical measure of a distribution $\mu$ with finite second moment but possibly unbounded support. As an application, we show an upper bound with a convergence rate $\mathcal{O}(\frac{\log n}{\sqrt{n}})$ of the mean performance for the empirical measure of the multidimensional normal distribution $\mathcal{N}(m, \Sigma)$ and of distributions with hyper-exponential tails. This extends the results from Biau et al. (2008) obtained for compactly supported distribution. We also derive an upper bound which is sharper in the quantization level $K$ but suboptimal in $n$ by applying results in Fournier and Guillin (2015).

私たちは、ワッサーシュタイン距離で収束する$\mathbb{R}^{d}$上の確率測度列$(\mu_{n})_{n\in\mathbb{N}^{*}}$の最適量子化の収束率を2つの側面から調べます。1つ目は、レベル$K$での$\mu_{n}$の最適量子化子$x^{(n)}\in(\mathbb{R}^{d})^{K}$の収束率です。もう1つは、$x^{(n)}$で評価した歪み関数の収束率で、これは$x^{(n)}$の「パフォーマンス」と呼ばれます。さらに、有限の二次モーメントを持つが台が非有界でありうる分布$\mu$の経験測度に対する最適量子化の平均パフォーマンスも調べます。応用として、多次元正規分布$\mathcal{N}(m, \Sigma)$と超指数的な裾を持つ分布の経験測度について、平均パフォーマンスに対する収束率$\mathcal{O}(\frac{\log n}{\sqrt{n}})$の上界を示します。これは、コンパクトな台を持つ分布に対して得られたBiauら(2008)の結果を拡張したものです。また、FournierとGuillin(2015)の結果を適用して、量子化レベル$K$についてはより鋭いが、$n$については最適ではない上界も導出します。

Effective Ways to Build and Evaluate Individual Survival Distributions
個々の生存分布を構築および評価する効果的な方法

An accurate model of a patient’s individual survival distribution can help determine the appropriate treatment for terminal patients. Unfortunately, risk scores (for example from Cox Proportional Hazard models) do not provide survival probabilities, single-time probability models (for instance the Gail model, predicting 5 year probability) only provide for a single time point, and standard Kaplan-Meier survival curves provide only population averages for a large class of patients, meaning they are not specific to individual patients. This motivates an alternative class of tools that can learn a model that provides an individual survival distribution for each subject, which gives survival probabilities across all times, such as extensions to the Cox model, Accelerated Failure Time, an extension to Random Survival Forests, and Multi-Task Logistic Regression. This paper first motivates such ‘individual survival distribution’ (ISD) models, and explains how they differ from standard models. It then discusses ways to evaluate such models — namely Concordance, 1-Calibration, Integrated Brier score, and versions of L1-loss — then motivates and defines a novel approach, ‘D-Calibration’, which determines whether a model’s probability estimates are meaningful. We also discuss how these measures differ, and use them to evaluate several ISD prediction tools over a range of survival data sets. We also provide a code base for all of these survival models and evaluation measures, at https://github.com/haiderstats/ISDEvaluation.

患者の個々の生存分布の正確なモデルは、末期患者に対する適切な治療を決定するのに役立ちます。残念ながら、リスクスコア(たとえば、Cox比例ハザードモデル)は生存確率を提供しません。単一時点の確率モデル(たとえば、5年確率を予測するGailモデル)は単一の時点のみを提供し、標準的なKaplan-Meier生存曲線は大規模な患者の集団平均のみを提供するため、個々の患者に固有のものではありません。このため、Coxモデルの拡張、加速故障時間、ランダム生存フォレストの拡張、マルチタスクロジスティック回帰など、すべての時点にわたる生存確率を与える、各被験者の個々の生存分布を提供するモデルを学習できる別のクラスのツールが必要になります。この論文では、まずこのような「個々の生存分布」(ISD)モデルを説明し、標準モデルとの違いを説明します。次に、このようなモデルを評価する方法(つまり、コンコーダンス、1キャリブレーション、統合ブライアスコア、およびL1損失のバージョン)について説明し、モデルの確率推定が意味があるかどうかを判断する新しいアプローチである「Dキャリブレーション」の動機と定義を示します。また、これらの測定方法の違いについても説明し、さまざまな生存データセットで複数のISD予測ツールを評価するために使用します。また、これらの生存モデルと評価測定方法のコードベースもhttps://github.com/haiderstats/ISDEvaluationで提供しています。
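
Of the evaluation measures listed above, Concordance is the simplest to sketch. The example below is a minimal pairwise concordance index for right-censored data in the style of Harrell's C. It assumes that each subject is summarized by a scalar risk score where a higher score means shorter expected survival (for an ISD this could be, for example, the negative predicted median survival time). The function name `concordance_index` is hypothetical and is not taken from the ISDEvaluation code base.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs that the risk scores order correctly.

    times  : observed time for each subject (event or censoring time)
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : scalar score, higher meaning shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5      # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Concordance only checks the ordering of subjects, which is one reason the abstract pairs it with calibration measures that look at the predicted probabilities themselves.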

Model-Preserving Sensitivity Analysis for Families of Gaussian Distributions
ガウス分布の族に対するモデル保持感度解析

The accuracy of probability distributions inferred using machine-learning algorithms heavily depends on data availability and quality. In practical applications it is therefore fundamental to investigate the robustness of a statistical model to misspecification of some of its underlying probabilities. In the context of graphical models, investigations of robustness fall under the notion of sensitivity analyses. These analyses consist in varying some of the model’s probabilities or parameters and then assessing how far apart the original and the varied distributions are. However, for Gaussian graphical models, such variations usually make the original graph an incoherent representation of the model’s conditional independence structure. Here we develop an approach to sensitivity analysis which guarantees the original graph remains valid after any probability variation and we quantify the effect of such variations using different measures. To achieve this we take advantage of algebraic techniques to both concisely represent conditional independence and to provide a straightforward way of checking the validity of such relationships. Our methods are demonstrated to be robust and comparable to standard ones, which can break the conditional independence structure of the model, using an artificial example and a medical real-world application.

機械学習アルゴリズムを使用して推定される確率分布の精度は、データの可用性と品質に大きく依存します。したがって、実際のアプリケーションでは、統計モデルの基礎となる確率の一部が誤って指定された場合の堅牢性を調査することが重要です。グラフィカルモデルのコンテキストでは、堅牢性の調査は感度分析の概念に該当します。これらの分析は、モデルの確率またはパラメーターの一部を変更し、元の分布と変更された分布がどれだけ離れているかを評価することから構成されます。ただし、ガウスグラフィカルモデルの場合、このような変更により、通常、元のグラフはモデルの条件付き独立構造の一貫性のない表現になります。ここでは、元のグラフが確率の変化後も有効であることを保証する感度分析のアプローチを開発し、さまざまな尺度を使用してこのような変化の影響を定量化します。これを実現するために、条件付き独立性を簡潔に表現し、このような関係の有効性を確認する簡単な方法を提供するために、代数的手法を活用します。私たちの方法は、人工的な例と医療の実際のアプリケーションを使用して、モデルの条件付き独立構造を破ることができる標準的な方法と堅牢で同等であることが実証されています。

Discerning the Linear Convergence of ADMM for Structured Convex Optimization through the Lens of Variational Analysis
変分解析のレンズによる構造化凸最適化のためのADMMの線形収束の識別

Despite the rich literature, the linear convergence of alternating direction method of multipliers (ADMM) has not been fully understood even for the convex case. For example, the linear convergence of ADMM can be empirically observed in a wide range of applications arising in statistics, machine learning, and related areas, while existing theoretical results seem to be too stringent to be satisfied or too ambiguous to be checked and thus why the ADMM performs linear convergence for these applications still seems to be unclear. In this paper, we systematically study the local linear convergence of ADMM in the context of convex optimization through the lens of variational analysis. We show that the local linear convergence of ADMM can be guaranteed without the strong convexity of objective functions together with the full rank assumption of the coefficient matrices, or the full polyhedricity assumption of their subdifferential; and it is possible to discern the local linear convergence for various concrete applications, especially for some representative models arising in statistical learning. We use some variational analysis techniques sophisticatedly; and our analysis is conducted in the most general proximal version of ADMM with Fortin and Glowinski’s larger step size so that all major variants of the ADMM known in the literature are covered.

豊富な文献があるにもかかわらず、交互方向乗数法(ADMM)の線形収束は、凸の場合であっても完全には理解されていません。たとえば、ADMMの線形収束は、統計、機械学習、および関連分野で生じる幅広いアプリケーションで経験的に観察できますが、既存の理論的結果は、満たすには厳しすぎるか、確認するには曖昧すぎるように思われます。そのため、ADMMがこれらのアプリケーションで線形収束を示す理由は依然として不明瞭です。この論文では、変分解析の観点から、凸最適化のコンテキストにおけるADMMの局所線形収束を体系的に研究します。ADMMの局所線形収束は、目的関数の強凸性と係数行列のフルランク仮定の組み合わせ、またはそれらの劣微分の完全な多面体性の仮定がなくても保証できること、およびさまざまな具体的なアプリケーション、特に統計学習で生じるいくつかの代表的なモデルで局所線形収束を識別できることを示します。私たちは変分解析の手法をいくつか駆使します。私たちの分析は、文献で知られているADMMの主要な変種がすべてカバーされるように、FortinとGlowinskiのより大きなステップサイズを用いた、ADMMの最も一般的な近接(proximal)版で行われます。

Sequential change-point detection in high-dimensional Gaussian graphical models
高次元ガウスグラフモデルにおける逐次変化点検出

High dimensional piecewise stationary graphical models represent a versatile class for modelling time varying networks arising in diverse application areas, including biology, economics, and social sciences. There has been recent work in offline detection and estimation of regime changes in the topology of sparse graphical models. However, the online setting remains largely unexplored, despite its high relevance to applications in sensor networks and other engineering monitoring systems, as well as financial markets. To that end, this work introduces a novel scalable online algorithm for detecting an unknown number of abrupt changes in the inverse covariance matrix of sparse Gaussian graphical models with small delay. The proposed algorithm is based upon monitoring the conditional log-likelihood of all nodes in the network and can be extended to a large class of continuous and discrete graphical models. We also investigate asymptotic properties of our procedure under certain mild regularity conditions on the graph size, sparsity level, number of samples, and pre- and post-changes in the topology of the network. Numerical works on both synthetic and real data illustrate the good performance of the proposed methodology both in terms of computational and statistical efficiency across numerous experimental settings.

高次元の区分定常グラフィカルモデルは、生物学、経済学、社会科学など、さまざまな応用分野で発生する時間変動ネットワークをモデル化するための多用途のクラスです。スパースグラフィカルモデルのトポロジーにおけるレジームチェンジのオフライン検出と推定に関する最近の研究が行われています。ただし、オンライン設定は、センサーネットワークやその他のエンジニアリングモニタリングシステム、金融市場への応用との関連性が高いにもかかわらず、ほとんど研究されていません。この目的のために、この研究では、スパースガウスグラフィカルモデルの逆共分散行列における未知の数の突然の変化を小さな遅延で検出するための、新しいスケーラブルなオンラインアルゴリズムを紹介します。提案されたアルゴリズムは、ネットワーク内のすべてのノードの条件付き対数尤度を監視することに基づいており、連続および離散グラフィカルモデルの大規模なクラスに拡張できます。また、グラフサイズ、スパースレベル、サンプル数、およびネットワークのトポロジーの前後の変更に関する特定の軽度の規則性条件下での手順の漸近特性を調査します。合成データと実データの両方に対する数値実験では、多数の実験設定にわたって計算効率と統計効率の両方の点で提案された方法論の優れたパフォーマンスが実証されています。

Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly
大学院生なしでハイパーパラメータを調整する: Dragonfly によるスケーラブルでロバストなベイズ最適化

Bayesian Optimisation (BO) refers to a suite of techniques for global optimisation of expensive black box functions, which use introspective Bayesian models of the function to efficiently search for the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real world settings; these include better methods for handling higher dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces, such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at dragonfly.github.io.

ベイズ最適化(BO)とは、高価なブラックボックス関数のグローバル最適化のための一連の手法のことで、関数の内省的なベイズモデルを使用して最適値を効率的に検索します。BOは多くのアプリケーションでうまく適用されてきましたが、現代の最適化タスクでは、従来の方法が見事に失敗する新しい課題が生じています。この研究では、スケーラブルで堅牢なBO用のオープンソースPythonライブラリであるDragonflyを紹介します。Dragonflyには、BOを困難な現実世界の環境に適用できるようにする最近開発された複数の方法が組み込まれています。これには、高次元ドメインを処理するためのより優れた方法、高価な関数の安価な近似が利用可能な場合にマルチフィデリティ評価を処理する方法、ニューラルネットワークアーキテクチャの空間などの構造化された組み合わせ空間を最適化する方法、並列評価を処理する方法が含まれます。さらに、ベイズモデルの選択、獲得関数の選択、さまざまな変数タイプと追加の制約を持つ複雑なドメインの最適化について、BOの新しい方法論的改善を開発します。Dragonflyをグローバル最適化のための他の一連のパッケージおよびアルゴリズムと比較し、上記の方法を統合するとBOのパフォーマンスが大幅に向上することを実証します。Dragonflyライブラリはdragonfly.github.ioで入手できます。

Memoryless Sequences for General Losses
一般的な損失に対するメモリレスシーケンス

One way to define the randomness of a fixed individual sequence is to ask how hard it is to predict relative to a given loss function. A sequence is memoryless if, with respect to average loss, no continuous function can predict the next entry of the sequence from a finite window of previous entries better than a constant prediction. For squared loss, memoryless sequences are known to have stochastic attributes analogous to those of truly random sequences. In this paper, we address the question of how changing the loss function changes the set of memoryless sequences, and in particular, the stochastic attributes they possess. For convex differentiable losses we establish that the statistic or property elicited by the loss determines the identity and stochastic attributes of the corresponding memoryless sequences. We generalize these results to convex non-differentiable losses, under additional assumptions, and to non-convex Bregman divergences. In particular, our results show that any Bregman divergence has the same set of memoryless sequences as squared loss. We apply our results to price calibration in prediction markets.

固定された個々のシーケンスのランダム性を定義する1つの方法は、与えられた損失関数に対して予測するのがどれほど難しいかを問うことです。平均損失に関して、直前の有限個のエントリのウィンドウからシーケンスの次のエントリを予測する際に、どの連続関数も定数予測より良く予測できない場合、そのシーケンスはメモリレスであるといいます。二乗損失の場合、メモリレスシーケンスは、真にランダムなシーケンスに類似した確率的属性を持つことが知られています。この論文では、損失関数を変更すると、メモリレスシーケンスのセット、特にそれらが持つ確率的属性がどのように変化するかという問題に取り組みます。凸微分可能損失の場合、損失によって引き出される統計量または性質が、対応するメモリレスシーケンスのアイデンティティと確率的属性を決定することを確立します。これらの結果を、追加の仮定の下で凸微分不可能損失に、および非凸ブレグマンダイバージェンスに一般化します。特に、私たちの結果は、どのブレグマンダイバージェンスも二乗損失と同じメモリレスシーケンスのセットを持つことを示しています。私たちは、予測市場における価格の較正(キャリブレーション)に結果を応用します。

Quantile Graphical Models: a Bayesian Approach
分位点グラフィカルモデル:ベイズアプローチ

Graphical models are ubiquitous tools to describe the interdependence between variables measured simultaneously such as large-scale gene or protein expression data. Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices and they are generated under a multivariate normal joint distribution. However, they suffer from several shortcomings since they are based on Gaussian distribution assumptions. In this article, we propose a Bayesian quantile based approach for sparse estimation of graphs. We demonstrate that the resulting graph estimation is robust to outliers and applicable under general distributional assumptions. Furthermore, we develop efficient variational Bayes approximations to scale the methods for large data sets. Our methods are applied to a novel cancer proteomics dataset wherein multiple proteomic antibodies are simultaneously assessed on tumor samples using reverse-phase protein arrays (RPPA) technology.

グラフィカルモデルは、大規模な遺伝子またはタンパク質発現データなど、同時に測定された変数間の相互依存性を記述するために広く使われているツールです。ガウスグラフィカルモデル(GGM)は、精度行列を使用して依存構造を確率的に探索するための確立されたツールであり、多変量正規分布に基づいて生成されます。ただし、ガウス分布の仮定に基づいているため、いくつかの欠点があります。この記事では、グラフのスパース推定のためのベイジアン分位数ベースのアプローチを提案します。結果として得られるグラフ推定は外れ値に対して堅牢であり、一般的な分布の仮定の下で適用可能であることを示します。さらに、大規模なデータセットにこの方法を拡張するための効率的な変分ベイズ近似を開発します。私たちの方法は、逆相タンパク質アレイ(RPPA)技術を使用して腫瘍サンプルで複数のプロテオーム抗体を同時に評価する、新しい癌プロテオミクスデータセットに適用されます。

Harmless Overfitting: Using Denoising Autoencoders in Estimation of Distribution Algorithms
無害なオーバーフィット:分布推定アルゴリズムにおけるノイズ除去オートエンコーダの使用

Estimation of Distribution Algorithms (EDAs) are metaheuristics where learning a model and sampling new solutions replaces the variation operators recombination and mutation used in standard Genetic Algorithms. The choice of these models as well as the corresponding training processes are subject to the bias/variance tradeoff, also known as under- and overfitting: simple models cannot capture complex interactions between problem variables, whereas complex models are susceptible to modeling random noise. This paper suggests using Denoising Autoencoders (DAEs) as generative models within EDAs (DAE-EDA). The resulting DAE-EDA is able to model complex probability distributions. Furthermore, overfitting is less harmful, since DAEs overfit by learning the identity function. This overfitting behavior introduces unbiased random noise into the samples, which is no major problem for the EDA but just leads to higher population diversity. As a result, DAE-EDA runs for more generations before convergence and searches promising parts of the solution space more thoroughly. We study the performance of DAE-EDA on several combinatorial single-objective optimization problems. In comparison to the Bayesian Optimization Algorithm, DAE-EDA requires a similar number of evaluations of the objective function but is much faster and can be parallelized efficiently, making it the preferred choice especially for large and difficult optimization problems.

分布推定アルゴリズム(EDA)は、標準的な遺伝的アルゴリズムで使用される変化演算子である組み換えと突然変異を、モデルの学習と新しい解のサンプリングに置き換えたメタヒューリスティックです。これらのモデルの選択と対応するトレーニングプロセスは、バイアスと分散のトレードオフ(アンダーフィッティングとオーバーフィッティングとも呼ばれる)の影響を受けます。単純なモデルでは、問題の変数間の複雑な相互作用を捉えることができませんが、複雑なモデルはランダムノイズをモデル化してしまう可能性があります。この論文では、EDA内の生成モデルとしてノイズ除去オートエンコーダ(DAE)を使用することを提案します(DAE-EDA)。結果として得られるDAE-EDAは、複雑な確率分布をモデル化できます。さらに、DAEは恒等関数を学習することでオーバーフィッティングするため、オーバーフィッティングによる害は少なくなります。このオーバーフィッティングの動作により、サンプルに偏りのないランダムノイズが導入されますが、これはEDAにとって大きな問題ではなく、集団の多様性が高まるだけです。その結果、DAE-EDAは収束するまでにより多くの世代を実行し、解空間の有望な部分をより徹底的に探索します。私たちは、いくつかの組み合わせ単一目的最適化問題におけるDAE-EDAのパフォーマンスを研究しています。ベイズ最適化アルゴリズムと比較すると、DAE-EDAは目的関数の評価回数は同程度ですが、はるかに高速で効率的に並列化できるため、特に大規模で困難な最適化問題では好ましい選択肢となります。
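
The generic EDA loop described in this abstract (select good solutions, fit a model, sample a new population from the model) can be sketched as follows. Note the substitution: instead of a Denoising Autoencoder, this sketch uses the simplest possible model, independent per-bit probabilities (a univariate marginal model), and the OneMax toy objective. The function name `umda_onemax` and all parameter values are assumptions for illustration only.

```python
import random


def umda_onemax(n_bits, pop_size, n_select, n_generations, rng):
    """Minimal EDA with a univariate marginal model on the OneMax problem.

    Returns the best fitness seen and the final per-bit probabilities.
    """
    probs = [0.5] * n_bits                     # initial model: uniform bits
    lower, upper = 1.0 / n_bits, 1.0 - 1.0 / n_bits
    best = 0
    for _ in range(n_generations):
        # Sampling from the model replaces recombination and mutation.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        population.sort(key=sum, reverse=True)  # OneMax fitness = number of ones
        best = max(best, sum(population[0]))
        selected = population[:n_select]        # truncation selection
        # Learn the model: per-bit frequency among the selected solutions.
        probs = [sum(ind[i] for ind in selected) / n_select
                 for i in range(n_bits)]
        # Clamp the probabilities so that no bit becomes fixed forever.
        probs = [min(upper, max(lower, p)) for p in probs]
    return best, probs
```

A univariate model like this cannot represent interactions between problem variables, which is the limitation of simple models that the abstract points to and that motivates a more expressive model such as a DAE.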

Multi-Player Bandits: The Adversarial Case
マルチプレイヤーバンディット:敵対的なケース

We consider a setting where multiple players sequentially choose among a common set of actions (arms). Motivated by an application to cognitive radio networks, we assume that players incur a loss upon colliding, and that communication between players is not possible. Existing approaches assume that the system is stationary. Yet this assumption is often violated in practice, e.g., due to signal strength fluctuations. In this work, we design the first multi-player Bandit algorithm that provably works in arbitrarily changing environments, where the losses of the arms may even be chosen by an adversary. This resolves an open problem posed by Rosenski et al. (2016).

私たちは、複数のプレイヤーが共通のアクション(アーム)の集合から逐次的に選択する設定を検討します。コグニティブ無線ネットワークへの応用に動機づけられて、プレイヤーは衝突時に損失を被り、プレイヤー間の通信は不可能であると仮定します。既存のアプローチでは、システムが定常であることを前提としています。しかし、この仮定は、信号強度の変動などにより、実際にはしばしば破られます。この研究では、アームの損失が敵対者によって選択される可能性さえある、任意に変化する環境で証明可能に機能する最初のマルチプレイヤーバンディットアルゴリズムを設計します。これにより、Rosenskiら(2016)によって提起された未解決の問題が解決されます。

GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning
GADMM:分散機械学習のための高速で通信効率の高いフレームワーク

When the data is distributed across multiple servers, lowering the communication cost between the servers (or workers) while solving the distributed learning problem is an important problem and is the focus of this paper. In particular, we propose a fast and communication-efficient decentralized framework to solve the distributed machine learning (DML) problem. The proposed algorithm, Group Alternating Direction Method of Multipliers (GADMM), is based on the Alternating Direction Method of Multipliers (ADMM) framework. The key novelty in GADMM is that it solves the problem in a decentralized topology where at most half of the workers are competing for the limited communication resources at any given time. Moreover, each worker exchanges the locally trained model only with two neighboring workers, thereby training a global model with a lower amount of communication overhead in each exchange. We prove that GADMM converges to the optimal solution for convex loss functions, and numerically show that it converges faster and is more communication-efficient than the state-of-the-art communication-efficient algorithms such as the Lazily Aggregated Gradient (LAG) and dual averaging, in linear and logistic regression tasks on synthetic and real datasets. Furthermore, we propose Dynamic GADMM (D-GADMM), a variant of GADMM, and prove its convergence under the time-varying network topology of the workers.

データが複数のサーバーに分散されている場合、分散学習の問題を解決しながらサーバー(またはワーカー)間の通信コストを下げることは重要な問題であり、この論文の焦点です。特に、分散機械学習(DML)の問題を解決するための、高速で通信効率の高い非集中型(decentralized)フレームワークを提案します。提案されたアルゴリズムであるGroup Alternating Direction Method of Multipliers (GADMM)は、Alternating Direction Method of Multipliers (ADMM)フレームワークに基づいています。GADMMの重要な新規性は、任意の時点で限られた通信リソースを競合するワーカーが最大でも半数にとどまる非集中型トポロジーで問題を解決することです。さらに、各ワーカーは、ローカルにトレーニングされたモデルを2つの隣接するワーカーとのみ交換するため、各交換での通信オーバーヘッドを抑えながらグローバルモデルをトレーニングできます。GADMMが凸損失関数の最適解に収束することを証明し、合成データセットと実データセットの線形回帰タスクとロジスティック回帰タスクにおいて、Lazily Aggregated Gradient (LAG)や双対平均法などの最先端の通信効率の高いアルゴリズムよりも収束が速く、通信効率が高いことを数値的に示します。さらに、GADMMのバリエーションであるDynamic GADMM (D-GADMM)を提案し、ワーカーの時間変動ネットワークトポロジーの下での収束を証明します。

Identifiability of Additive Noise Models Using Conditional Variances
条件付き分散を使用した加法性ノイズモデルの識別可能性

This paper considers a new identifiability condition for additive noise models (ANMs) in which each variable is determined by an arbitrary Borel measurable function of its parents plus an independent error. It has been shown that ANMs are fully recoverable under some identifiability conditions, such as when all error variances are equal. However, this identifiability condition could be restrictive, and hence, this paper focuses on a relaxed identifiability condition that involves not only error variances, but also the influence of parents. This new class of identifiable ANMs does not put any constraints on the form of dependencies, or distributions of errors, and allows different error variances. It further provides a statistically consistent and computationally feasible structure learning algorithm for the identifiable ANMs based on the new identifiability condition. The proposed algorithm assumes that all relevant variables are observed, while it does not assume faithfulness or a sparse graph. Extensive experiments on simulated and real multivariate data demonstrate that the proposed algorithm successfully recovers directed acyclic graphs.

この論文では、各変数がその親の任意のボレル可測関数と独立な誤差の和によって決定される加法性ノイズモデル(ANM)の新しい識別可能性条件について検討します。すべての誤差分散が等しい場合など、いくつかの識別可能性条件下ではANMが完全に復元可能であることが示されています。ただし、この識別可能性条件は制限的である可能性があるため、この論文では、誤差分散だけでなく親の影響も考慮する緩和された識別可能性条件に焦点を当てています。この新しいクラスの識別可能なANMは、依存関係の形式や誤差の分布に制約を課さず、異なる誤差分散を許容します。さらに、新しい識別可能性条件に基づいて、識別可能なANMのための、統計的に一致性があり計算上実行可能な構造学習アルゴリズムを提供します。提案されたアルゴリズムは、すべての関連変数が観測されることを前提としていますが、忠実性やスパースグラフは想定していません。広範囲にわたるシミュレーションデータと実際の多変量データを通じて、提案されたアルゴリズムが有向非巡回グラフを正しく復元できることが実証されています。

High-dimensional Gaussian graphical models on network-linked data
ネットワークリンクデータ上の高次元ガウスグラフモデル

Graphical models are commonly used to represent conditional dependence relationships between variables. There are multiple methods available for exploring them from high-dimensional data, but almost all of them rely on the assumption that the observations are independent and identically distributed. At the same time, observations connected by a network are becoming increasingly common, and tend to violate these assumptions. Here we develop a Gaussian graphical model for observations connected by a network with potentially different mean vectors, varying smoothly over the network. We propose an efficient estimation algorithm and demonstrate its effectiveness on both simulated and real data, obtaining meaningful and interpretable results on a statistics coauthorship network. We also prove that our method estimates both the inverse covariance matrix and the corresponding graph structure correctly under the assumption of network “cohesion”, which refers to the empirically observed phenomenon of network neighbors sharing similar traits.

グラフィカルモデルは、変数間の条件付き依存関係を表すためによく使用されます。高次元データからそれらを探索する方法は複数ありますが、ほとんどすべては、観測が独立しており、同一に分布しているという仮定に依存しています。同時に、ネットワークで接続された観測がますます一般的になり、これらの仮定に違反する傾向があります。ここでは、ネットワーク上で滑らかに変化する可能性のある異なる平均ベクトルを持つネットワークで接続された観測のガウスグラフィカルモデルを開発します。効率的な推定アルゴリズムを提案し、シミュレーションデータと実際のデータの両方でその有効性を実証し、統計共著ネットワークで有意義で解釈可能な結果を得ました。また、ネットワークの「凝集性」の仮定の下で、逆共分散行列と対応するグラフ構造の両方を、私たちの方法が正確に推定することを証明します。これは、ネットワークの隣人が同様の特性を共有するという経験的に観察された現象を指します。

Scalable Approximate MCMC Algorithms for the Horseshoe Prior
馬蹄形事前分布のためのスケーラブルな近似 MCMC アルゴリズム

The horseshoe prior is frequently employed in Bayesian analysis of high-dimensional models, and has been shown to achieve minimax optimal risk properties when the truth is sparse. While optimization-based algorithms for the extremely popular Lasso and elastic net procedures can scale to dimension in the hundreds of thousands, algorithms for the horseshoe that use Markov chain Monte Carlo (MCMC) for computation are limited to problems an order of magnitude smaller. This is due to high computational cost per step and growth of the variance of time-averaging estimators as a function of dimension. We propose two new MCMC algorithms for computation in these models that have significantly improved performance compared to existing alternatives. One of the algorithms also approximates an expensive matrix product to give orders of magnitude speedup in high-dimensional applications. We prove guarantees for the accuracy of the approximate algorithm, and show that gradually decreasing the approximation error as the chain extends results in an exact algorithm. The scalability of the algorithm is illustrated in simulations with problem size as large as $N=5,000$ observations and $p=50,000$ predictors, and an application to a genome-wide association study with $N=2,267$ and $p=98,385$. The empirical results also show that the new algorithm yields estimates with lower mean squared error, intervals with better coverage, and elucidates features of the posterior that were often missed by previous algorithms in high dimensions, including bimodality of posterior marginals indicating uncertainty about which covariates belong in the model.

馬蹄事前分布は、高次元モデルのベイズ分析で頻繁に使用され、真のパラメータがスパースである場合に、ミニマックス最適リスク特性を実現することが示されています。非常に人気のあるLassoおよびエラスティックネット手順の最適化ベースのアルゴリズムは、数十万の次元まで拡張できますが、計算にマルコフ連鎖モンテカルロ(MCMC)を使用する馬蹄事前分布のアルゴリズムは、1桁小さい問題に限定されます。これは、ステップあたりの計算コストが高く、時間平均推定量の分散が次元の関数として増加するためです。これらのモデルの計算用に、既存の代替方法と比較して大幅にパフォーマンスが向上した2つの新しいMCMCアルゴリズムを提案します。アルゴリズムの1つは、高価な行列積を近似して、高次元アプリケーションで桁違いの高速化を実現します。近似アルゴリズムの精度に関する保証を証明し、連鎖が長くなるにつれて近似誤差を徐々に減らすことで、正確なアルゴリズムが得られることを示します。このアルゴリズムのスケーラビリティは、問題サイズが$N=5,000$観測値と$p=50,000$予測子という大規模なシミュレーションと、$N=2,267$および$p=98,385$の全ゲノム関連研究への適用で実証されています。実験結果では、新しいアルゴリズムによって、平均二乗誤差が低い推定値と、被覆率のより良い区間が得られ、高次元で以前のアルゴリズムでは見逃されがちな事後分布の特徴(どの共変量がモデルに属するかが不確実であることを示す事後周辺分布の二峰性など)が解明されることも示されています。

(1 + epsilon)-class Classification: an Anomaly Detection Method for Highly Imbalanced or Incomplete Data Sets
(1 + ε) クラス分類: 非常に不均衡または不完全なデータセットに対する異常検出方法

Anomaly detection is not an easy problem since distribution of anomalous samples is unknown a priori. We explore a novel method that gives a trade-off possibility between one-class and two-class approaches, and leads to a better performance on anomaly detection problems with small or non-representative anomalous samples. The method is evaluated using several data sets and compared to a set of conventional one-class and two-class approaches.

異常サンプルの分布は先験的に不明であるため、異常検出は簡単な問題ではありません。私たちは、1クラスアプローチと2クラスアプローチの間にトレードオフの可能性を与え、小さなまたは代表的でない異常サンプルの異常検出問題に対するパフォーマンスの向上につながる新しい方法を探求しています。この方法は、いくつかのデータセットを使用して評価され、従来の1クラスおよび2クラスのアプローチのセットと比較されます。

Estimation of a Low-rank Topic-Based Model for Information Cascades
情報カスケードのための低ランクトピックベースモデルの推定

We consider the problem of estimating the latent structure of a social network based on the observed information diffusion events, or cascades, where the observations for a given cascade consist of only the timestamps of infection for infected nodes but not the source of the infection. Most of the existing work on this problem has focused on estimating a diffusion matrix without any structural assumptions on it. In this paper, we propose a novel model based on the intuition that an information is more likely to propagate among two nodes if they are interested in similar topics which are also prominent in the information content. In particular, our model endows each node with an influence vector (which measures how authoritative the node is on each topic) and a receptivity vector (which measures how susceptible the node is for each topic). We show how this node-topic structure can be estimated from the observed cascades, and prove the consistency of the estimator. Experiments on synthetic and real data demonstrate the improved performance and better interpretability of our model compared to existing state-of-the-art methods.

私たちは、観測された情報拡散イベント、つまりカスケードに基づいてソーシャルネットワークの潜在的構造を推定する問題について検討します。この場合、特定のカスケードの観測は、感染したノードの感染のタイムスタンプのみで構成され、感染源は含まれません。この問題に関する既存の研究のほとんどは、構造上の仮定なしに拡散行列を推定することに焦点を当てています。この論文では、2つのノードが情報コンテンツでも目立つ類似のトピックに興味を持っている場合、情報は2つのノード間で伝播する可能性が高くなるという直感に基づいた新しいモデルを提案します。特に、我々のモデルは、各ノードに影響ベクトル(各トピックに対するノードの権威を測定)と受容ベクトル(各トピックに対するノードの感受性を測定)を付与します。私たちは、このノードトピック構造を観測されたカスケードから推定する方法を示し、推定値の一貫性を証明します。合成データと実際のデータでの実験により、既存の最先端の方法と比較して、我々のモデルのパフォーマンスが向上し、解釈可能性が向上することが実証されています。

Representation Learning for Dynamic Graphs: A Survey
動的グラフの表現学習:調査

Graphs arise naturally in many real-world applications including social networks, recommender systems, ontologies, biology, and computational finance. Traditionally, machine learning models for graphs have been mostly designed for static graphs. However, many applications involve evolving graphs. This introduces important challenges for learning and inference since nodes, attributes, and edges change over time. In this survey, we review the recent advances in representation learning for dynamic graphs, including dynamic knowledge graphs. We describe existing models from an encoder-decoder perspective, categorize these encoders and decoders based on the techniques they employ, and analyze the approaches in each category. We also review several prominent applications and widely used datasets and highlight directions for future research.

グラフは、ソーシャルネットワーク、レコメンダーシステム、オントロジー、生物学、コンピュテーショナルファイナンスなど、多くの実世界のアプリケーションで自然に発生します。従来、グラフの機械学習モデルは、ほとんどが静的グラフ用に設計されてきました。ただし、多くのアプリケーションでは、グラフの進化が伴います。これにより、ノード、属性、およびエッジが時間とともに変化するため、学習と推論に重要な課題が生じます。本調査では、動的知識グラフを含む動的グラフの表現学習における最近の進歩を概観します。既存のモデルをエンコーダー・デコーダーの観点から記述し、これらのエンコーダーとデコーダーを、それらが採用している手法に基づいて分類し、各カテゴリーのアプローチを分析します。また、いくつかの著名なアプリケーションと広く使用されているデータセットをレビューし、将来の研究の方向性を強調します。

Union of Low-Rank Tensor Spaces: Clustering and Completion
低ランクテンソル空間の和集合: クラスタリングと補完

We consider the problem of clustering and completing a set of tensors with missing data that are drawn from a union of low-rank tensor spaces. In the clustering problem, given a partially sampled tensor data that is composed of a number of subtensors, each chosen from one of a certain number of unknown tensor spaces, we need to group the subtensors that belong to the same tensor space. We provide a geometrical analysis on the sampling pattern and subsequently derive the sampling rate that guarantees the correct clustering under some assumptions with high probability. Moreover, we investigate the fundamental conditions for finite/unique completability for the union of tensor spaces completion problem. Both deterministic and probabilistic conditions on the sampling pattern to ensure finite/unique completability are obtained. For both the clustering and completion problems, our tensor analysis provides significantly better bound than the bound given by the matrix analysis applied to any unfolding of the tensor data.

私たちは、低ランクのテンソル空間の和集合から抽出された欠損データを持つテンソルのセットをクラスタリングおよび補完する問題について検討します。クラスタリング問題では、それぞれが一定数の未知のテンソル空間の1つから選択された多数のサブテンソルで構成される部分的にサンプリングされたテンソルデータが与えられた場合、同じテンソル空間に属するサブテンソルをグループ化する必要があります。サンプリングパターンの幾何学的解析を提供し、その後、いくつかの仮定の下で高い確率で正しいクラスタリングを保証するサンプリングレートを導出します。さらに、テンソル空間の和集合の補完問題に対する有限/一意の補完可能性の基本条件を調査します。有限/一意の補完可能性を保証するためのサンプリングパターンの決定論的条件と確率的条件の両方が得られます。クラスタリング問題と補完問題の両方で、テンソルデータの展開に適用される行列解析によって与えられる境界よりも、テンソル解析によって大幅に優れた境界が得られます。

On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics
確率勾配ランジュバン動力学の停留点到達時間とエルゴード性について

Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. (2017) presents an analysis for the hitting time of SGLD for the first and second order stationary points. The proof in Zhang et al. (2017) is a two-stage procedure through bounding the Cheeger’s constant, which is rather complicated and leads to loose bounds. In this paper, using intuitions from stochastic differential equations, we provide a direct analysis for the hitting times of SGLD to the first and second order stationary points. Our analysis is straightforward. It only relies on basic linear algebra and probability theory tools. Our direct analysis also leads to tighter bounds comparing to Zhang et al. (2017) and shows the explicit dependence of the hitting time on different factors, including dimensionality, smoothness, noise strength, and step size effects. Under suitable conditions, we show that the hitting time of SGLD to first-order stationary points can be dimension-independent. Moreover, we apply our analysis to study several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.

確率的勾配ランジュバン動力学(SGLD)は、確率的最適化における基本的なアルゴリズムです。Zhangら(2017)による最近の研究では、1次および2次の定常点に対するSGLDの到達時間の分析が示されています。Zhangら(2017)の証明は、Cheeger定数を制限する2段階の手順であり、かなり複雑で、境界が緩くなります。この論文では、確率微分方程式からの直観を使用して、1次および2次の定常点に対するSGLDの到達時間を直接分析します。私たちの分析は単純です。基本的な線形代数と確率論のツールのみに依存しています。私たちの直接分析は、Zhangら(2017)と比較してより厳しい境界にもつながり、次元、滑らかさ、ノイズ強度、ステップサイズ効果などのさまざまな要因に対する到達時間の明示的な依存性を示しています。適切な条件下では、SGLDの1次定常点への到達時間は次元に依存しないことを示します。さらに、線形回帰、行列分解、オンラインPCAなど、機械学習におけるいくつかの重要なオンライン推定問題を研究するために、この分析を適用します。
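
The hitting time discussed in this abstract can be made concrete with a small simulation. The sketch below is not the paper's analysis: it assumes the commonly used SGLD update x ← x − η·g(x) + sqrt(2η/β)·ξ, a one-dimensional quadratic objective f(x) = x²/2, and hypothetical names (`sgld_hitting_time`, `grad_noise`, `tol`). It simply records the first iteration at which the gradient norm falls below a tolerance, i.e. an approximate first-order stationary point.

```python
import math
import random


def sgld_hitting_time(grad, x0, step, beta, grad_noise, tol, max_steps, seed=0):
    """Return the first iteration k with |grad(x_k)| <= tol, or None if never reached."""
    rng = random.Random(seed)
    x = x0
    for k in range(1, max_steps + 1):
        # Stochastic gradient: exact gradient plus zero-mean Gaussian noise (assumed model).
        g = grad(x) + grad_noise * rng.gauss(0.0, 1.0)
        # Langevin step: gradient descent plus injected Gaussian noise of scale sqrt(2*step/beta).
        x = x - step * g + math.sqrt(2.0 * step / beta) * rng.gauss(0.0, 1.0)
        if abs(grad(x)) <= tol:
            return k
    return None


# f(x) = x^2 / 2, so grad f(x) = x and the only stationary point is x = 0.
hit = sgld_hitting_time(grad=lambda x: x, x0=5.0, step=0.1, beta=1e4,
                        grad_noise=0.01, tol=0.1, max_steps=1000)
print("first-order stationary point reached at iteration:", hit)
```

Varying `step`, `beta`, or `grad_noise` in this toy gives a rough feel for the dependence on step size and noise strength that the abstract refers to.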

The weight function in the subtree kernel is decisive
サブツリーカーネルの重み関数が決定的です

Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is difficult per se. In this paper, we focus on the subtree kernel that is a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000’s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. We establish on a 2-classes stochastic model that the performance of the subtree kernel is improved when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data and not fixed by the user as usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees, that is particularly suitable for tuning parameters. We show through eight real data classification problems the great efficiency of our approach, in particular for small data sets, which also states the high importance of the weight function. Finally, a visualization tool of the significant features is derived.

ツリーデータは、植物の構造、RNAの二次構造、XMLファイルの階層など、さまざまな状況をモデル化するため、どこにでもあります。ただし、これらの非ユークリッドデータの分析は、それ自体が困難です。この論文では、2000年代初頭にVishwanathanとSmolaによって導入されたツリーデータ用の畳み込みカーネルであるサブツリーカーネルに焦点を当てます。より正確には、重み関数の影響を理論的観点と実際のデータアプリケーションから調査します。2クラス確率モデルに基づいて、葉の重みがなくなるとサブツリーカーネルのパフォーマンスが向上することを確立しました。これにより、データから学習され、通常のようにユーザーが固定するものではない新しい重み関数を定義することになりました。この目的のために、順序付きまたは順序なしのツリーからサブツリーカーネルを計算するための統一されたフレームワークを定義します。これは、特にパラメーターの調整に適しています。8つの実際のデータ分類問題を通じて、特に小規模なデータセットの場合にこのアプローチが非常に効率的であることを示します。これは、重み関数の重要性の高さも示しています。最後に、重要な特徴を視覚化するツールを導出します。
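
To illustrate the role of the weight function, here is a minimal sketch of a subtree kernel on unordered trees. It assumes the usual form K(T1, T2) = Σ_τ w(τ)·N_τ(T1)·N_τ(T2), where N_τ(T) counts the nodes of T whose complete subtree is τ. The nested-tuple tree encoding, the function names, and the 0/1 weight that vanishes on leaves are assumptions made for illustration; the paper learns its weight function from data.

```python
from collections import Counter


def canonical(tree):
    """Canonical form of an unordered tree given as nested tuples (a leaf is ())."""
    return tuple(sorted(canonical(child) for child in tree))


def height(tree):
    """Height of a tree; a leaf has height 0."""
    return 0 if not tree else 1 + max(height(child) for child in tree)


def subtree_counts(tree):
    """Count, for every node, the canonical form of the complete subtree rooted there."""
    counts = Counter()

    def visit(node):
        counts[canonical(node)] += 1
        for child in node:
            visit(child)

    visit(tree)
    return counts


def subtree_kernel(t1, t2, weight):
    """Weighted sum over the subtrees that occur in both trees."""
    c1, c2 = subtree_counts(t1), subtree_counts(t2)
    return sum(weight(s) * c1[s] * c2[s] for s in c1.keys() & c2.keys())


t1 = ((), ((), ()))             # root with one leaf and one two-leaf child
t2 = (((), ()), ((), ()))       # root with two two-leaf children

uniform = subtree_kernel(t1, t2, weight=lambda s: 1.0)
no_leaves = subtree_kernel(t1, t2, weight=lambda s: 0.0 if height(s) == 0 else 1.0)
print(uniform, no_leaves)
```

With the uniform weight the leaves dominate the kernel value (3 × 4 of the total), while the leaf-free weight keeps only the shared two-leaf subtree. This is the kind of effect the abstract attributes to the weight of leaves.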

WONDER: Weighted One-shot Distributed Ridge Regression in High Dimensions
WONDER:高次元での重み付きワンショット分散リッジ回帰

In many areas, practitioners need to analyze large data sets that challenge conventional single-machine computing. To scale up data analysis, distributed and parallel computing approaches are increasingly needed. Here we study a fundamental and highly important problem in this area: How to do ridge regression in a distributed computing environment? Ridge regression is an extremely popular method for supervised learning, and has several optimality properties, thus it is important to study. We study one-shot methods that construct weighted combinations of ridge regression estimators computed on each machine. By analyzing the mean squared error in a high-dimensional random-effects model where each predictor has a small effect, we discover several new phenomena. Infinite-worker limit: The distributed estimator works well for very large numbers of machines, a phenomenon we call ‘infinite-worker limit’. Optimal weights: The optimal weights for combining local estimators sum to more than unity, due to the downward bias of ridge. Thus, all averaging methods are suboptimal. We also propose a new Weighted ONe-shot DistributEd Ridge regression algorithm (WONDER). We test WONDER in simulation studies and using the Million Song Dataset as an example. There it can save at least 100x in computation time, while nearly preserving test accuracy.

多くの分野で、実務者は従来の単一マシンコンピューティングでは困難な大規模なデータセットを分析する必要があります。データ分析をスケールアップするには、分散および並列コンピューティングアプローチがますます必要になっています。ここでは、この分野の基本的で非常に重要な問題、つまり、分散コンピューティング環境でリッジ回帰を実行する方法について検討します。リッジ回帰は、教師あり学習の非常に一般的な方法であり、いくつかの最適性特性があるため、検討することが重要です。各マシンで計算されたリッジ回帰推定量の重み付き組み合わせを構築するワンショット法を検討します。各予測子の影響が小さい高次元ランダム効果モデルの平均二乗誤差を分析することで、いくつかの新しい現象を発見しました。無限ワーカー極限:分散推定量は非常に多数のマシンでも適切に機能します。この現象を「無限ワーカー極限」と呼びます。最適な重み:リッジの下方バイアスにより、ローカル推定量を組み合わせるための最適な重みの合計は1を超えます。したがって、すべての平均化方法は最適ではありません。また、新しい重み付きワンショット分散リッジ回帰アルゴリズム(WONDER)も提案しています。シミュレーション研究と、Million Song Datasetを例として用いた実験でWONDERをテストしました。そこでは、テストの精度をほぼ維持しながら、計算時間を少なくとも100倍節約できます。
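
The claim that good combination weights sum to more than one can be seen in a deliberately simple setting. The sketch below is not the WONDER algorithm and does not compute the paper's MSE-optimal weights: it assumes a single predictor, noise-free responses y = βx, and equal weights chosen only to undo the shrinkage of each local ridge estimate. All names are hypothetical.

```python
def local_ridge(xs, ys, lam):
    """One-dimensional ridge fit on one machine: returns (estimate, shrinkage factor)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam), sxx / (sxx + lam)


beta = 2.0                                    # true coefficient (toy value)
machines = [[1.0, 2.0], [0.5, 1.5, 1.0], [2.0, 1.0]]   # predictor values held by each worker
lam = 1.0                                     # ridge penalty

fits = [local_ridge(xs, [beta * x for x in xs], lam) for xs in machines]
estimates = [est for est, _ in fits]
shrinkages = [s for _, s in fits]             # each is < 1: ridge shrinks toward zero

# Plain averaging inherits the downward bias of every local estimate.
plain_average = sum(estimates) / len(estimates)

# Equal weights rescaled so that sum(w_i * s_i) = 1, which removes the bias in this toy.
weights = [1.0 / sum(shrinkages)] * len(machines)
combined = sum(w * est for w, est in zip(weights, estimates))

print(plain_average, combined, sum(weights))
```

Because every shrinkage factor is below one, the rescaled weights necessarily add up to more than one, which mirrors the phenomenon described in the abstract.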

Smoothed Nonparametric Derivative Estimation using Weighted Difference Quotients
加重差分商を使用した平滑化ノンパラメトリック微分推定

Derivatives play an important role in bandwidth selection methods (e.g., plug-ins), data analysis and bias-corrected confidence intervals. Therefore, obtaining accurate derivative information is crucial. Although many derivative estimation methods exist, the majority require a fixed design assumption. In this paper, we propose an effective and fully data-driven framework to estimate the first and second order derivative in random design. We establish the asymptotic properties of the proposed derivative estimator, and also propose a fast selection method for the tuning parameters. The performance and flexibility of the method is illustrated via an extensive simulation study.

導関数は、帯域幅選択法(プラグイン法など)、データ分析、バイアス補正信頼区間において重要な役割を果たします。したがって、正確な導関数の情報を得ることが重要です。多くの導関数推定法が存在しますが、その大部分は固定デザインの仮定を必要とします。この論文では、ランダムデザインにおいて1次および2次導関数を推定するための、効果的で完全にデータ駆動型のフレームワークを提案します。提案された導関数推定量の漸近特性を確立し、調整パラメータの高速な選択方法も提案します。この手法の性能と柔軟性は、広範なシミュレーション研究によって示されています。
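
A weighted difference quotient can be sketched in a few lines. The version below is an assumption-laden illustration rather than the paper's estimator: it averages k symmetric difference quotients around a design point with weights proportional to the squared gap x[i+j] − x[i−j], which is the inverse-variance choice when the observations carry independent noise of equal variance. The function name and the example data are hypothetical.

```python
def weighted_difference_quotient(x, y, i, k):
    """Estimate f'(x[i]) from k symmetric difference quotients (x sorted, k <= i < len(x) - k)."""
    gaps = [x[i + j] - x[i - j] for j in range(1, k + 1)]
    quotients = [(y[i + j] - y[i - j]) / gaps[j - 1] for j in range(1, k + 1)]
    # Var(quotient_j) is proportional to 1 / gap_j^2, so inverse-variance weights scale as gap_j^2.
    total = sum(g * g for g in gaps)
    weights = [g * g / total for g in gaps]
    return sum(w * q for w, q in zip(weights, quotients))


# Irregularly spaced (random-design style) points and a noise-free linear signal.
x = [0.0, 0.3, 0.4, 0.9, 1.1, 1.6, 2.0]
y = [2.0 * xi + 1.0 for xi in x]
slope = weighted_difference_quotient(x, y, i=3, k=3)
print(slope)
```

On an equispaced design these weights reduce to w_j proportional to j², and the number of quotients k plays the role of a tuning parameter.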

Community-Based Group Graphical Lasso
コミュニティベースのグループグラフィカルラッソ

A new strategy for probabilistic graphical modeling is developed that draws parallels to community detection analysis. The method jointly estimates an undirected graph and homogeneous communities of nodes. The structure of the communities is taken into account when estimating the graph and at the same time, the structure of the graph is accounted for when estimating communities of nodes. The procedure uses a joint group graphical lasso approach with community detection-based grouping, such that some groups of edges co-occur in the estimated graph. The grouping structure is unknown and is estimated based on community detection algorithms. Theoretical derivations regarding graph convergence and sparsistency, as well as accuracy of community recovery are included, while the method’s empirical performance is illustrated in an fMRI context, as well as with simulated examples.

コミュニティ検出分析との類似性に着想を得た、確率的グラフィカルモデリングの新しい戦略が開発されました。この方法では、無向グラフとノードの均質なコミュニティを同時に推定します。グラフを推定する際にはコミュニティの構造が考慮され、同時にノードのコミュニティを推定する際にはグラフの構造が考慮されます。この手順では、コミュニティ検出に基づくグループ化を伴う結合グループグラフィカルラッソのアプローチを使用し、推定されたグラフでエッジの一部のグループが共起するようにします。グループ化構造は未知であり、コミュニティ検出アルゴリズムに基づいて推定されます。グラフの収束性とスパース性の一致性(sparsistency)、およびコミュニティ回復の精度に関する理論的導出が含まれており、この手法の経験的パフォーマンスは、fMRIのコンテキストとシミュレートされた例で示されています。

Unique Sharp Local Minimum in L1-minimization Complete Dictionary Learning
L1最小化における一意なシャープな局所最小値完全辞書学習

We study the problem of globally recovering a dictionary from a set of signals via $\ell_1$-minimization. We assume that the signals are generated as i.i.d. random linear combinations of the $K$ atoms from a complete reference dictionary $D^*\in \mathbb R^{K\times K}$, where the linear combination coefficients are from either a Bernoulli type model or exact sparse model. First, we obtain a necessary and sufficient norm condition for the reference dictionary $D^*$ to be a sharp local minimum of the expected $\ell_1$ objective function. Our result substantially extends that of Wu and Yu (2015) and allows the combination coefficient to be non-negative. Secondly, we obtain an explicit bound on the region within which the objective value of the reference dictionary is minimal. Thirdly, we show that the reference dictionary is the unique sharp local minimum, thus establishing the first known global property of $\ell_1$-minimization dictionary learning. Motivated by the theoretical results, we introduce a perturbation based test to determine whether a dictionary is a sharp local minimum of the objective function. In addition, we also propose a new dictionary learning algorithm based on Block Coordinate Descent, called DL-BCD, which is guaranteed to decrease the objective function monotonically. Simulation studies show that DL-BCD has competitive performance in terms of recovery rate compared to other state-of-the-art dictionary learning algorithms when the reference dictionary is generated from random Gaussian matrices.

私たちは、$\ell_1$最小化を介して信号セットから辞書をグローバルに復元する問題を研究します。信号は、完全な参照辞書$D^*\in \mathbb R^{K\times K}$からの$K$原子のi.i.d.ランダム線形結合として生成されるものと仮定します。ここで、線形結合係数は、ベルヌーイ型モデルまたは正確なスパースモデルのいずれかから得られます。まず、参照辞書$D^*$が期待される$\ell_1$目的関数の鋭い局所最小値となるための必要かつ十分なノルム条件を取得します。我々の結果は、WuとYu (2015)の結果を大幅に拡張し、結合係数が非負になることを可能にします。次に、参照辞書の目的値が最小となる領域の明示的な境界を取得します。最後に、参照辞書が唯一の鋭い局所最小値であることを示し、$\ell_1$最小化辞書学習の最初の既知のグローバル特性を確立します。理論的な結果に基づいて、辞書が目的関数の鋭い局所最小値であるかどうかを判断するための摂動ベースのテストを導入します。さらに、ブロック座標降下法に基づく新しい辞書学習アルゴリズム(DL-BCD)も提案します。これは、目的関数を単調に減少させることが保証されています。シミュレーション研究では、参照辞書がランダムなガウス行列から生成される場合、DL-BCDは他の最先端の辞書学習アルゴリズムと比較して回復率の点で競争力のあるパフォーマンスを発揮することが示されています。

Generalized Optimal Matching Methods for Causal Inference
因果推論のための一般化最適マッチング法

We develop an encompassing framework for matching, covariate balancing, and doubly-robust methods for causal inference from observational data called generalized optimal matching (GOM). The framework is given by generalizing a new functional-analytical formulation of optimal matching, giving rise to the class of GOM methods, for which we provide a single unified theory to analyze tractability and consistency. Many commonly used existing methods are included in GOM and, using their GOM interpretation, can be extended to optimally and automatically trade off balance for variance and outperform their standard counterparts. As a subclass, GOM gives rise to kernel optimal matching (KOM), which, as supported by new theoretical and empirical results, is notable for combining many of the positive properties of other methods in one. KOM, which is solved as a linearly-constrained convex-quadratic optimization problem, inherits both the interpretability and model-free consistency of matching but can also achieve the $\sqrt{n}$-consistency of well-specified regression and the bias reduction and robustness of doubly robust methods. In settings of limited overlap, KOM enables a very transparent method for interval estimation for partial identification and robust coverage. We demonstrate this in examples with both synthetic and real data.

私たちは、一般化最適マッチング(GOM)と呼ばれる、観測データからの因果推論のためのマッチング、共変量バランシング、および二重に堅牢な方法のための包括的なフレームワークを開発しました。このフレームワークは、最適マッチングの新しい機能分析定式化を一般化することによって提供され、GOM方法のクラスを生み出しました。これに対して、扱いやすさと一貫性を分析するための単一の統一理論を提供します。多くの一般的に使用されている既存の方法がGOMに含まれており、それらのGOM解釈を使用して、分散とバランスを最適かつ自動的にトレードオフし、標準的な方法よりも優れたパフォーマンスを発揮するように拡張できます。サブクラスとして、GOMはカーネル最適マッチング(KOM)を生み出します。これは、新しい理論的および経験的結果によってサポートされているように、他の方法の多くの肯定的な特性を1つに組み合わせていることで注目に値します。KOMは線形制約付き凸二次最適化問題として解かれ、マッチングの解釈可能性とモデルフリーの一貫性の両方を継承しますが、適切に指定された回帰の$\sqrt{n}$一貫性と、二重に堅牢な方法のバイアス削減と堅牢性も実現できます。オーバーラップが制限された設定では、KOMは部分識別と堅牢なカバレッジの区間推定のための非常に透過的な方法を可能にします。これを合成データと実際のデータの両方の例で実証します。

Multiparameter Persistence Landscapes
マルチパラメータ永続性ランドスケープ

An important problem in the field of Topological Data Analysis is defining topological summaries which can be combined with traditional data analytic tools. In recent work Bubenik introduced the persistence landscape, a stable representation of persistence diagrams amenable to statistical analysis and machine learning tools. In this paper we generalise the persistence landscape to multiparameter persistence modules providing a stable representation of the rank invariant. We show that multiparameter landscapes are stable with respect to the interleaving distance and persistence weighted Wasserstein distance, and that the collection of multiparameter landscapes faithfully represents the rank invariant. Finally we provide example calculations and statistical tests to demonstrate a range of potential applications and how one can interpret the landscapes associated to a multiparameter module.

トポロジカルデータ分析の分野における重要な問題は、従来のデータ分析ツールと組み合わせることができるトポロジカルサマリーを定義することです。最近の研究で、Bubenikは、統計分析や機械学習ツールに適した永続性図の安定した表現である永続性ランドスケープを導入しました。この論文では、永続性ランドスケープをマルチパラメータ永続性加群に一般化し、ランク不変量の安定した表現を与えます。マルチパラメータランドスケープは、インターリーブ距離と永続性加重Wasserstein距離に関して安定しており、マルチパラメータランドスケープの集まりがランク不変量を忠実に表していることを示します。最後に、さまざまな潜在的なアプリケーションと、マルチパラメータ加群に関連付けられたランドスケープをどのように解釈できるかを示すために、計算例と統計的検定を提供します。
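
For orientation, the single-parameter persistence landscape that this paper generalises can be evaluated directly from a persistence diagram. The sketch below uses the standard definition: λ_k(t) is the k-th largest value of max(0, min(t − b, d − t)) over the points (b, d) of the diagram. It covers only the one-parameter case, and the function name and example diagram are hypothetical.

```python
def landscape(diagram, k, t):
    """k-th persistence landscape function at t for a diagram of (birth, death) pairs."""
    # Each point contributes a tent function peaking at the midpoint of its interval.
    tents = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram), reverse=True)
    return tents[k - 1] if k <= len(tents) else 0.0


diagram = [(0.0, 4.0), (1.0, 3.0)]
values_at_2 = [landscape(diagram, k, 2.0) for k in (1, 2, 3)]
print(values_at_2)
```

Because each λ_k is an ordinary real-valued function, landscapes can be averaged and compared pointwise, which is what makes them convenient for statistical analysis.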

Kymatio: Scattering Transforms in Python
Kymatio: Python での散乱変換

The wavelet scattering transform is an invariant and stable signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks, including PyTorch and TensorFlow/Keras. The transforms are implemented on both CPUs and GPUs, the latter offering a significant speedup over the former. The package also has a small memory footprint. Source code, documentation, and examples are available under a BSD license at https://www.kymat.io.

ウェーブレット散乱変換は、多くの信号処理および機械学習アプリケーションに適した不変で安定した信号表現です。PyTorchやTensorFlow/Kerasなどの最新のディープラーニングフレームワークと互換性のある、1D、2D、および3Dの散乱変換の使いやすく高性能なPython実装であるKymatioソフトウェアパッケージを紹介します。変換はCPUとGPUの両方に実装され、後者は前者よりも大幅に高速化されます。また、このパッケージはメモリフットプリントが小さいです。ソースコード、ドキュメント、および例は、https://www.kymat.ioのBSDライセンスの下で入手できます。

Exact Guarantees on the Absence of Spurious Local Minima for Non-negative Rank-1 Robust Principal Component Analysis
非負のRank-1ロバスト主成分分析のためのスプリアス局所最小値の不在に関する厳密な保証

This work is concerned with the non-negative rank-1 robust principal component analysis (RPCA), where the goal is to recover the dominant non-negative principal components of a data matrix precisely, where a number of measurements could be grossly corrupted with sparse and arbitrary large noise. Most of the known techniques for solving the RPCA rely on convex relaxation methods by lifting the problem to a higher dimension, which significantly increase the number of variables. As an alternative, the well-known Burer-Monteiro approach can be used to cast the RPCA as a non-convex and non-smooth $\ell_1$ optimization problem with a significantly smaller number of variables. In this work, we show that the low-dimensional formulation of the symmetric and asymmetric positive rank-1 RPCA based on the Burer-Monteiro approach has benign landscape, i.e., 1) it does not have any spurious local solution, 2) has a unique global solution, and 3) its unique global solution coincides with the true components. An implication of this result is that simple local search algorithms are guaranteed to achieve a zero global optimality gap when directly applied to the low-dimensional formulation. Furthermore, we provide strong deterministic and probabilistic guarantees for the exact recovery of the true principal components. In particular, it is shown that a constant fraction of the measurements could be grossly corrupted and yet they would not create any spurious local solution.

この研究では、非負ランク1ロバスト主成分分析(RPCA)に関するもので、その目的は、データマトリックスの主要な非負主成分を正確に復元することですが、スパースで任意の大きなノイズによって測定値が大幅に破損している可能性があります。RPCAを解決するための既知の手法のほとんどは、問題をより高い次元に持ち上げる凸緩和法に依存しており、これにより変数の数が大幅に増加します。別の方法として、よく知られているBurer-Monteiroアプローチを使用して、RPCAを、大幅に少ない変数の数を持つ非凸で非滑らかな$\ell_1$最適化問題としてキャストできます。この研究では、Burer-Monteiroアプローチに基づく対称および非対称の正ランク1 RPCAの低次元定式化が良性ランドスケープを持つこと、つまり1)偽のローカルソリューションがないこと、2)一意のグローバルソリューションがあること、3)その一意のグローバルソリューションが真のコンポーネントと一致することを示します。この結果から、単純なローカルサーチアルゴリズムを低次元定式化に直接適用すると、グローバル最適性ギャップがゼロになることが保証されます。さらに、真の主成分の正確な回復に対する強力な決定論的および確率的保証を提供します。特に、測定値の一定の割合が大幅に破損しても、偽のローカルソリューションは作成されないことが示されています。

Robust Asynchronous Stochastic Gradient-Push: Asymptotically Optimal and Network-Independent Performance for Strongly Convex Functions
ロバストな非同期確率的勾配プッシュ:強凸関数に対する漸近最適でネットワークに依存しない性能

We consider the standard model of distributed optimization of a sum of functions $F(\mathbf z) = \sum_{i=1}^n f_i(\mathbf z)$, where node $i$ in a network holds the function $f_i(\mathbf z)$. We allow for a harsh network model characterized by asynchronous updates, message delays, unpredictable message losses, and directed communication among nodes. In this setting, we analyze a modification of the Gradient-Push method for distributed optimization, assuming that (i) node $i$ is capable of generating gradients of its function $f_i(\mathbf z)$ corrupted by zero-mean bounded-support additive noise at each step, (ii) $F(\mathbf z)$ is strongly convex, and (iii) each $f_i(\mathbf z)$ has Lipschitz gradients. We show that our proposed method asymptotically performs as well as the best bounds on centralized gradient descent that takes steps in the direction of the sum of the noisy gradients of all the functions $f_1(\mathbf z), \ldots, f_n(\mathbf z)$ at each step.

私たちは、関数の和$F(\mathbf z) = \sum_{i=1}^n f_i(\mathbf z)$の分散最適化の標準モデルを考えます。ここで、ネットワーク内のノード$i$は関数$f_i(\mathbf z)$を保持します。非同期更新、メッセージの遅延、予測不可能なメッセージ損失、ノード間の有向通信を特徴とする過酷なネットワークモデルを許容します。この設定では、(i)ノード$i$が、各ステップでゼロ平均かつ有界な台を持つ加法ノイズによって乱された関数$f_i(\mathbf z)$の勾配を生成できること、(ii) $F(\mathbf z)$が強凸であること、(iii)各$f_i(\mathbf z)$がリプシッツ連続な勾配を持つことを前提として、分散最適化のためのGradient-Push法の修正版を分析します。提案手法が、各ステップですべての関数$f_1(\mathbf z), \ldots, f_n(\mathbf z)$のノイズを含む勾配の和の方向に進む集中型勾配降下法に対する最良の上界と、漸近的に同等の性能を示すことを明らかにします。
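
The basic Gradient-Push idea that this paper modifies can be sketched for a small directed network. The code below is a synchronous, noise-free, delay-free toy and therefore not the robust asynchronous method of the paper. It assumes scalar quadratics f_i(z) = (z − c_i)²/2, a step size of 1/t, and a hypothetical four-node graph in which every node splits its values equally among its out-neighbours (itself included).

```python
def gradient_push(c, out_neighbors, steps):
    """Synchronous gradient-push for f_i(z) = (z - c_i)^2 / 2 on a directed graph."""
    n = len(c)
    x = [0.0] * n          # push-sum numerators
    y = [1.0] * n          # push-sum weights, correcting for unbalanced out-degrees
    z = list(x)            # de-biased local estimates
    for t in range(1, steps + 1):
        w = [0.0] * n
        y_new = [0.0] * n
        for j in range(n):
            share = 1.0 / len(out_neighbors[j])   # column-stochastic mixing
            for i in out_neighbors[j]:
                w[i] += share * x[j]
                y_new[i] += share * y[j]
        y = y_new
        z = [w[i] / y[i] for i in range(n)]
        alpha = 1.0 / t                            # diminishing step size
        # Local gradient step, using grad f_i(z_i) = z_i - c_i.
        x = [w[i] - alpha * (z[i] - c[i]) for i in range(n)]
    return z


c = [1.0, 2.0, 3.0, 6.0]                           # minimiser of the sum is mean(c) = 3
out_neighbors = {0: [0, 1, 2], 1: [1, 2], 2: [2, 3], 3: [3, 0]}
z = gradient_push(c, out_neighbors, steps=20000)
print(z)
```

Dividing by the push-sum weights `y` is what lets the nodes agree on the unweighted average even though the graph is directed and node 0 has a larger out-degree than the others.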

Self-paced Multi-view Co-training
自己ペース型マルチビュー共同トレーニング

Co-training is a well-known semi-supervised learning approach which trains classifiers on two or more different views and exchanges pseudo labels of unlabeled instances in an iterative way. During the co-training process, pseudo labels of unlabeled instances are very likely to be false especially in the initial training, while the standard co-training algorithm adopts a ‘draw without replacement’ strategy and does not remove these wrongly labeled instances from training stages. Besides, most of the traditional co-training approaches are implemented for two-view cases, and their extensions in multi-view scenarios are not intuitive. These issues not only degenerate their performance as well as available application range but also hamper their fundamental theory. Moreover, there is no optimization model to explain the objective a co-training process manages to optimize. To address these issues, in this study we design a unified self-paced multi-view co-training (SPamCo) framework which draws unlabeled instances with replacement. Two specified co-regularization terms are formulated to develop different strategies for selecting pseudo-labeled instances during training. Both forms share the same optimization strategy which is consistent with the iteration process in co-training and can be naturally extended to multi-view scenarios. A distributed optimization strategy is also introduced to train the classifier of each view in parallel to further improve the efficiency of the algorithm. Furthermore, the SPamCo algorithm is proved to be PAC learnable, supporting its theoretical soundness. Experiments conducted on synthetic, text categorization, person re-identification, image recognition and object detection data sets substantiate the superiority of the proposed method.

共同トレーニングは、2つ以上の異なるビューで分類器をトレーニングし、ラベルのないインスタンスの疑似ラベルを反復的に交換する、よく知られた半教師あり学習アプローチです。共同トレーニングプロセス中、ラベルのないインスタンスの疑似ラベルは、特に初期トレーニングでは誤りである可能性が非常に高くなりますが、標準的な共同トレーニングアルゴリズムは「非復元抽出」戦略を採用し、これらの誤ってラベル付けされたインスタンスをトレーニングステージから削除しません。さらに、従来の共同トレーニングアプローチのほとんどは2つのビューの場合に実装されており、マルチビューシナリオでの拡張は直感的ではありません。これらの問題は、パフォーマンスと利用可能なアプリケーション範囲を低下させるだけでなく、基礎理論の構築も妨げます。さらに、共同トレーニングプロセスが最適化しようとする目的を説明する最適化モデルはありません。これらの問題に対処するために、この研究では、ラベルのないインスタンスを復元抽出する、統一的な自己ペース型マルチビュー共同トレーニング(SPamCo)フレームワークを設計します。トレーニング中に疑似ラベル付きインスタンスを選択するための異なる戦略を開発するために、2つの共正則化項を定式化します。両方の形式は、共同トレーニングの反復プロセスと整合し、マルチビューシナリオに自然に拡張できる同じ最適化戦略を共有しています。分散最適化戦略も導入され、各ビューの分類器を並行してトレーニングして、アルゴリズムの効率をさらに向上させます。さらに、SPamCoアルゴリズムはPAC学習可能であることが証明されており、その理論的な健全性を裏付けています。合成、テキスト分類、人物再識別、画像認識、およびオブジェクト検出データセットで実施された実験により、提案された方法の優位性が実証されています。

Fast Rates for General Unbounded Loss Functions: From ERM to Generalized Bayes
一般非有界損失関数の高速レート:ERMから一般化ベイズへ

We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $\eta$-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate $\eta$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ in combination with small model complexity leads to $\tilde{O}(1/n)$ rates. The witness condition allows us to connect the excess risk to an ‘annealed’ version thereof, by which we generalize several previous results connecting Hellinger and Rényi divergence to KL divergence.

私たちは、損失の分布がヘビーテールになる可能性がある、対数損失や二乗損失などの一般的な非有界損失関数に対する新しい超過リスク境界を提示します。境界は一般的な推定量に当てはまりますが、$\eta$一般化ベイズ推定量、MDL推定量、および経験的リスク最小化推定量に適用すると最適化されます。対数損失の場合、学習率$\eta$が正しく設定されている限り、境界は、ヘリンガーメトリックの一般化の観点から、誤指定の下での一般化ベイズ推論の収束率を意味します。一般的な損失関数の場合、境界は2つの別々の条件に依存します。1つは超過損失の下側を制御する$v$-GRIP (一般化逆情報投影)条件、もう1つは上側を制御する新しく導入された証人条件です。$v$-GRIP条件のパラメータ$v$は達成可能なレートを決定し、Tsybakovマージン条件および有界損失のBernstein条件の指数に類似しており、$v$-GRIP条件はこれらを一般化します。つまり、好ましい$v$と小さいモデル複雑性の組み合わせにより、レートは$\tilde{O}(1/n)$になります。証人条件により、超過リスクをその「焼きなまし」バージョンに関連付けることができます。これにより、HellingerおよびRényiダイバージェンスをKLダイバージェンスに関連付ける以前のいくつかの結果を一般化できます。
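
For readers unfamiliar with the term, the $\eta$-generalized Bayesian estimator mentioned in the abstract is usually defined through a tempered posterior. The display below is a standard textbook form, not quoted from the paper; the symbols $\Pi_0$ (prior), $\ell_\theta$ (loss of parameter $\theta$) and $z_1, \dots, z_n$ (data) are notation assumed here for illustration.

```latex
\[
  \Pi_n^{\eta}(\mathrm{d}\theta \mid z_1, \dots, z_n)
  \;\propto\;
  \exp\!\Bigl(-\eta \sum_{i=1}^{n} \ell_\theta(z_i)\Bigr)\, \Pi_0(\mathrm{d}\theta),
  \qquad \eta > 0 .
\]
```

With log loss $\ell_\theta(z) = -\log p_\theta(z)$ and $\eta = 1$ this reduces to the ordinary Bayesian posterior; smaller $\eta$ downweights the data relative to the prior.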

Conjugate Gradients for Kernel Machines
カーネルマシンの共役勾配

Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning.

正則化最小二乗法(カーネルリッジ/ガウス過程)回帰は、統計と機械学習の基本的なアルゴリズムです。厳密解の汎用アルゴリズムでは、データポイントの数に3次的な複雑さがあるため、大規模なデータセットでは近似に頼る必要があります。この研究では、最小二乗予測の計算自体が確率的推論問題として扱われます。カーネル行列の射影を使用してカーネルと行列の低ランク近似を取得するカーネル関数上の構造化ガウス回帰モデルを提案します。中心的な結果は、機械学習で遭遇する最小二乗回帰の特定の設定に共役勾配の方法を使用する強化された方法です。
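
As background for this abstract, the following is a minimal sketch of plain conjugate gradients applied to the kernel ridge system $(K + \sigma^2 I)\alpha = y$. It is the textbook baseline, not the enhanced probabilistic method the paper proposes. The RBF kernel, the `noise` value and all function names are assumptions chosen for illustration.

```python
import math

def rbf_kernel(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel on scalar inputs (an assumed kernel choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradients(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A."""
    n = len(b)
    max_iter = 10 * n if max_iter is None else max_iter
    x = [0.0] * n
    r = list(b)            # residual b - A x for the starting point x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def fit_kernel_ridge(xs, ys, noise=0.1):
    """Return the regularized kernel matrix and the dual weights alpha."""
    n = len(xs)
    A = [[rbf_kernel(xs[i], xs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return A, conjugate_gradients(A, ys)

def predict(xs, alpha, x_new):
    return sum(a * rbf_kernel(x, x_new) for a, x in zip(alpha, xs))
```

Each iteration costs one matrix-vector product, so stopping early gives an approximate solution at quadratic rather than cubic cost in the number of datapoints.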

GraKeL: A Graph Kernel Library in Python
GraKeL:Pythonのグラフカーネルライブラリ

The problem of accurately measuring the similarity between graphs is at the core of many applications in a variety of disciplines. Graph kernels have recently emerged as a promising approach to this problem. There are now many kernels, each focusing on different structural aspects of graphs. Here, we present GraKeL, a library that unifies several graph kernels into a common framework. The library is written in Python and adheres to the scikit-learn interface. It is simple to use and can be naturally combined with scikit-learn’s modules to build a complete machine learning pipeline for tasks such as graph classification and clustering. The code is BSD licensed and is available at: https://github.com/ysig/GraKeL.

グラフ間の類似性を正確に測定するという問題は、さまざまな分野の多くのアプリケーションの中核をなすものです。グラフカーネルは、この問題に対する有望なアプローチとして最近登場しました。現在、多くのカーネルがあり、それぞれがグラフの異なる構造的側面に焦点を当てています。ここでは、いくつかのグラフカーネルを共通のフレームワークに統合するライブラリであるGraKeLを紹介します。ライブラリはPythonで書かれており、scikit-learnインターフェースに準拠しています。使い方は簡単で、scikit-learnのモジュールと自然に組み合わせて、グラフ分類やクラスタリングなどのタスクのための完全な機械学習パイプラインを構築できます。このコードはBSDライセンスで、https://github.com/ysig/GraKeLで入手できます。

High-Dimensional Inference for Cluster-Based Graphical Models
クラスタベースのグラフィカルモデルのための高次元推論

Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models, in which variable clustering is applied as an initial step for reducing the dimension of the feature space. We employ model-assisted clustering, in which the clusters contain features that are similar to the same unobserved latent variable. Two different cluster-based Gaussian graphical models are considered: the latent variable graph, corresponding to the graphical model associated with the unobserved latent variables, and the cluster-average graph, corresponding to the vector of features averaged over clusters. Our study reveals that likelihood-based inference for the latent graph, not analyzed previously, is analytically intractable. Our main contribution is the development and analysis of alternative estimation and inference strategies for the precision matrix of an unobservable latent vector $Z$. We replace the likelihood of the data by an appropriate class of empirical risk functions that can be specialized to the latent graphical model and to the simpler, but under-analyzed, cluster-average graphical model. The estimators thus derived can be used for inference on the graph structure, for instance on edge strength or pattern recovery. Inference is based on the asymptotic limits of the entry-wise estimates of the precision matrices associated with the conditional independence graphs under consideration. While taking the uncertainty induced by the clustering step into account, we establish Berry-Esseen central limit theorems for the proposed estimators. It is noteworthy that, although the clusters are estimated adaptively from the data, the central limit theorems regarding the entries of the estimated graphs are proved under the same conditions one would use if the clusters were known in advance.
As an illustration of the usage of these newly developed inferential tools, we show that they can be reliably used for recovery of the sparsity pattern of the graphs we study, under FDR control, which is verified via simulation studies and an fMRI data analysis. These experimental results confirm the theoretically established difference between the two graph structures. Furthermore, the data analysis suggests that the latent variable graph, corresponding to the unobserved cluster centers, can help provide more insight into the understanding of the brain connectivity networks relative to the simpler, average-based, graph.

この論文では、非常に多くの特徴に基づいてグラフィカルモデルを構築する最新のアプリケーションに着想を得て、特徴空間の次元を削減するための最初のステップとして変数クラスタリングを適用する、新しいクラスのクラスターベースのグラフィカルモデルを紹介します。モデル支援クラスタリングを採用し、クラスターには同じ観測されない潜在変数に類似する特徴が含まれます。2つの異なるクラスターベースのガウスグラフィカルモデルが考慮されます。1つは観測されない潜在変数に関連付けられたグラフィカルモデルに対応する潜在変数グラフ、もう1つはクラスター全体で平均化された特徴のベクトルに対応するクラスター平均グラフです。この研究では、これまで分析されていなかった潜在グラフの尤度ベースの推論が解析的に扱いにくいことが明らかになりました。我々の主な貢献は、観測不可能な潜在ベクトルZの精度行列に対する代替推定および推論戦略の開発と分析です。データの尤度を、潜在グラフィカルモデルと、より単純だが十分に分析されていないクラスター平均グラフィカルモデルに特化できる適切なクラスの経験的リスク関数に置き換えます。このようにして得られた推定量は、エッジの強度やパターン回復など、グラフ構造の推論に使用できます。推論は、検討中の条件付き独立グラフに関連付けられた精度行列のエントリごとの推定値の漸近限界に基づいています。クラスタリングステップによって生じる不確実性を考慮しながら、提案された推定量に対してベリー-エッセン中心極限定理を確立します。クラスターはデータから適応的に推定されますが、推定されたグラフのエントリに関する中心極限定理は、クラスターが事前にわかっている場合と同じ条件下で証明されていることは注目に値します。新しく開発されたこれらの推論ツールの使用例として、シミュレーション研究とfMRIデータ分析によって検証されたFDR制御下で、研究対象のグラフのスパースパターンの回復にこれらのツールが確実に使用できることを示しています。これらの実験結果は、2つのグラフ構造間の理論的に確立された違いを裏付けています。さらに、データ分析では、観測されていないクラスターセンターに対応する潜在変数グラフが、より単純な平均ベースのグラフと比較して、脳の接続ネットワークの理解にさらに役立つことが示唆されています。

Expected Policy Gradients for Reinforcement Learning
強化学習のための期待ポリシー勾配

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to the matrix exponential of the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.

私たちは、強化学習のために確率的ポリシー勾配(SPG)と決定論的ポリシー勾配(DPG)を統合した期待ポリシー勾配(EPG)を提案します。期待サルサにヒントを得たEPGは、サンプリングされた軌跡内のアクションのみに頼るのではなく、勾配を推定する際にアクション全体を積分(または合計)します。連続アクション空間の場合、まずガウスポリシーと二次批評家の実用的な結果を導き、次にそれを汎用的な解析手法に拡張し、ガウス、指数族、および制限付きサポートを持つポリシーを含む幅広いクラスのアクターと批評家をカバーします。ガウスポリシーの場合、アクションに関して批評家のスケールされたヘッセ行列の指数に比例する共分散を使用する探索手法を導入します。離散アクション空間の場合、ソフトマックスポリシーに基づいてEPGのバリアントを導き出す。また、確率的および決定論的ポリシー勾配定理が特殊なケースである、新しい一般的なポリシー勾配定理も確立します。さらに、EPGは決定論的ポリシーを必要とせず、計算オーバーヘッドもほとんどなく、勾配推定値の分散を削減できることを証明します。最後に、EPGの広範な実験評価を提供し、複数の困難な制御ドメインで既存のアプローチよりも優れていることを示します。
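
To make the contrast between summing over actions and using a single sampled action concrete, here is a small sketch for a softmax policy over a discrete action set in one fixed state. It is an illustration under simplifying assumptions (the critic values `q_values` are given, the policy is parameterized directly by its logits) and is not the paper's algorithm; all names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_gradient(logits, q_values):
    """Sum over all actions of grad pi(a) * Q(a), with respect to the logits.

    For a softmax policy this sum has the closed form
    pi_k * (Q_k - sum_b pi_b * Q_b) for each logit k.
    """
    pi = softmax(logits)
    baseline = sum(p * q for p, q in zip(pi, q_values))
    return [p * (q - baseline) for p, q in zip(pi, q_values)]

def sampled_gradient(logits, q_values, rng):
    """Score-function estimate grad log pi(a) * Q(a) from one sampled action."""
    pi = softmax(logits)
    a = rng.choices(range(len(pi)), weights=pi)[0]
    return [((1.0 if k == a else 0.0) - pi[k]) * q_values[a]
            for k in range(len(pi))]
```

Both functions have the same expectation, but `expected_gradient` is deterministic given the critic, whereas `sampled_gradient` varies with the drawn action. That difference is the variance reduction the abstract refers to.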

Learning Causal Networks via Additive Faithfulness
加法忠実度による因果ネットワークの学習

In this paper we introduce a statistical model, called additively faithful directed acyclic graph (AFDAG), for causal learning from observational data. Our approach is based on additive conditional independence (ACI), a recently proposed three-way statistical relation that shares many similarities with conditional independence but does not resort to multi-dimensional kernels. This distinct feature strikes a balance between a parametric model and a fully nonparametric model, which makes the proposed model attractive for handling large networks. We develop an estimator for AFDAG based on a linear operator that characterizes ACI, and establish the consistency and convergence rates of this estimator, as well as the uniform consistency of the estimated DAG. Moreover, we introduce a modified PC-algorithm to implement the estimating procedure efficiently, so that its complexity is determined by the level of sparseness rather than the dimension of the network. Through simulation studies we show that our method outperforms existing methods when commonly assumed conditions such as Gaussian or Gaussian copula distributions do not hold. Finally, the usefulness of the AFDAG formulation is demonstrated through an application to a proteomics data set.

この論文では、観測データから因果学習を行うための、加法的に忠実な有向非巡回グラフ(AFDAG)と呼ばれる統計モデルを紹介します。私たちのアプローチは、加法的条件付き独立性(ACI)に基づいています。これは、最近提案された3元統計関係で、条件付き独立性と多くの類似点を持ちますが、多次元カーネルには頼りません。この明確な特徴は、パラメトリックモデルと完全なノンパラメトリックモデルのバランスをとっており、提案モデルは大規模ネットワークの処理に適しています。私たちは、ACIを特徴付ける線形演算子に基づいてAFDAGの推定量を開発し、この推定量の一貫性と収束率、および推定されたDAGの均一一貫性を確立しました。さらに、推定手順を効率的に実装するために修正されたPCアルゴリズムを導入し、その複雑さがネットワークの次元ではなくスパース性のレベルによって決定されるようにしました。シミュレーション研究を通じて、ガウス分布やガウスコピュラ分布などの一般的に想定される条件が成立しない場合、私たちの方法が既存の方法よりも優れていることを示しています。最後に、AFDAG定式化の有用性は、プロテオミクスデータセットへの適用を通じて実証されます。

Sparse and low-rank multivariate Hawkes processes
スパースで低ランクの多変量ホークス過程

We consider the problem of unveiling the implicit network structure of node interactions (such as user interactions in a social network), based only on high-frequency timestamps. Our inference is based on the minimization of the least-squares loss associated with a multivariate Hawkes model, penalized by the $\ell_1$ and trace norms of the interaction tensor. We provide a first theoretical analysis for this problem that includes sparsity and low-rank inducing penalizations. This result involves a new data-driven concentration inequality for matrix martingales in continuous time with observable variance, which is a result of independent interest with a broad range of possible applications, since it extends to matrix martingales earlier results that were restricted to the scalar case. A consequence of our analysis is the construction of sharply tuned $\ell_1$ and trace-norm penalizations, which leads to a data-driven scaling of the variability of information available for each user. Numerical experiments illustrate the significant improvements achieved by the use of such data-driven penalizations.

私たちは、高頻度タイムスタンプのみに基づいて、ノード相互作用（ソーシャルネットワークにおけるユーザー相互作用など）の暗黙のネットワーク構造を明らかにする問題を考察します。我々の推論は、相互作用テンソルの$\ell_1$およびトレースノルムによってペナルティが課せられた、多変量Hawkesモデルに関連する最小二乗損失の最小化に基づいています。私たちは、この問題に対する最初の理論的分析を提供し、これにはスパース性と低ランクを誘導するペナルティが含まれます。この結果には、観測可能な分散を伴う連続時間の行列マルチンゲールに対する新しいデータ駆動型集中不等式が含まれ、これは独立した関心の結果であり、スカラーの場合に限定されていた以前の結果を行列マルチンゲールに拡張するため、幅広い適用が可能です。我々の分析の結果は、鋭く調整された$\ell_1$およびトレースノルムペナルティの構築であり、これは各ユーザーが利用できる情報の変動性をデータ駆動型でスケーリングすることにつながる。数値実験では、このようなデータ駆動型のペナルティの使用によって達成される大幅な改善が示されています。

Ensemble Learning for Relational Data
リレーショナルデータのためのアンサンブル学習

We present a theoretical analysis framework for relational ensemble models. We show that ensembles of collective classifiers can improve predictions for graph data by reducing errors due to variance in both learning and inference. In addition, we propose a relational ensemble framework that combines a relational ensemble learning approach with a relational ensemble inference approach for collective classification. The proposed ensemble techniques are applicable for both single and multiple graph settings. Experiments on both synthetic and real-world data demonstrate the effectiveness of the proposed framework. Finally, our experimental results support the theoretical analysis and confirm that ensemble algorithms that explicitly focus on both learning and inference processes and aim at reducing errors associated with both, are the best performers.

私たちは、リレーショナルアンサンブルモデルの理論的分析フレームワークを提示します。集合分類器のアンサンブルが、学習と推論の両方の分散によるエラーを減らすことにより、グラフデータの予測を改善できることを示します。さらに、関係アンサンブル学習アプローチと集団分類のための関係アンサンブル推論アプローチを組み合わせた関係アンサンブルフレームワークを提案します。提案されたアンサンブル手法は、単一グラフ設定と複数グラフ設定の両方に適用できます。合成データと実世界データの両方に対する実験は、提案されたフレームワークの有効性を実証しています。最後に、私たちの実験結果は理論分析をサポートし、学習プロセスと推論プロセスの両方に明示的に焦点を当て、両方に関連するエラーを減らすことを目的としたアンサンブルアルゴリズムが最高のパフォーマンスを発揮することを確認しています。

Skill Rating for Multiplayer Games. Introducing Hypernode Graphs and their Spectral Theory
マルチプレイヤーゲームのスキルレーティング。ハイパーノードグラフとそのスペクトル理論の紹介

We consider the skill rating problem for multiplayer games, that is, how to infer player skills from game outcomes in multiplayer games. We formulate the problem as a minimization problem $\arg \min_{s} s^T \Delta s$ where $\Delta$ is a positive semidefinite matrix and $s$ a real-valued function, of which some entries are the skill values to be inferred and other entries are constrained by the game outcomes. We leverage graph-based semi-supervised learning (SSL) algorithms for this problem. We apply our algorithms on several data sets of multiplayer games and obtain very promising results compared to Elo Duelling (see Elo, 1978) and TrueSkill (see Herbrich et al., 2006). As we leverage graph-based SSL algorithms and because games can be seen as relations between sets of players, we then generalize the approach. For this aim, we introduce a new finite model, called hypernode graph, defined to be a set of weighted binary relations between sets of nodes. We define Laplacians of hypernode graphs. Then, we show that the skill rating problem for multiplayer games can be formulated as $\arg \min_{s} s^T \Delta s$ where $\Delta$ is the Laplacian of a hypernode graph constructed from a set of games. From a fundamental perspective, we show that hypernode graph Laplacians are symmetric positive semidefinite matrices with constant functions in their null space. We show that problems on hypernode graphs cannot be solved with graph constructions and graph kernels. We relate hypernode graphs to signed graphs showing that positive relations between groups can lead to negative relations between individuals.

私たちは、マルチプレイヤーゲームのスキル評価問題、つまりマルチプレイヤーゲームでゲーム結果からプレイヤーのスキルを推測する方法について考えます。この問題を最小化問題$\arg \min_{s} s^T \Delta s$として定式化します。ここで、$\Delta$は半正定値行列、$s$は実数値関数で、一部のエントリは推測されるスキル値であり、その他のエントリはゲーム結果によって制約されます。この問題には、グラフベースの半教師あり学習(SSL)アルゴリズムを活用します。このアルゴリズムをマルチプレイヤーゲームの複数のデータセットに適用し、Elo Duelling (Elo、1978を参照)やTrueSkill (Herbrichら、2006を参照)と比較して非常に有望な結果を得ました。グラフベースのSSLアルゴリズムを活用し、ゲームはプレイヤーセット間の関係として見ることができるため、このアプローチを一般化します。この目的のために、我々はハイパーノードグラフと呼ばれる新しい有限モデルを導入し、これはノードセット間の重み付き二項関係のセットとして定義されます。ハイパーノードグラフのラプラシアンを定義します。次に、マルチプレイヤーゲームのスキル評価問題が$\arg \min_{s} s^T \Delta s$として定式化できることを示します。ここで、$\Delta$はゲームセットから構築されたハイパーノードグラフのラプラシアンです。基本的な観点から、ハイパーノードグラフのラプラシアンは、ヌル空間に定数関数を持つ対称正半定値行列であることを示します。ハイパーノードグラフの問題は、グラフ構築とグラフカーネルでは解決できないことを示します。ハイパーノードグラフを符号付きグラフに関連付け、グループ間の正の関係が個人間の負の関係につながる可能性があることを示します。
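
The quadratic form $s^T \Delta s$ can be illustrated with a toy construction that has the properties stated in the abstract (symmetric, positive semidefinite, constant vectors in the null space). The sketch assumes unit node weights and equal-size teams, so that each game contributes the squared difference of the two teams' summed skills. This is an assumed simplification for illustration, not the paper's general weighted definition, and the function names are hypothetical.

```python
def hypernode_laplacian(num_players, games):
    """Build Delta = sum over games of g g^T.

    Each game is a pair (team_a, team_b) of disjoint, equal-size tuples of
    player indices; g has +1 on team_a, -1 on team_b and 0 elsewhere.
    """
    delta = [[0.0] * num_players for _ in range(num_players)]
    for team_a, team_b in games:
        g = [0.0] * num_players
        for i in team_a:
            g[i] = 1.0
        for j in team_b:
            g[j] = -1.0
        for i in range(num_players):
            for j in range(num_players):
                delta[i][j] += g[i] * g[j]
    return delta

def smoothness(delta, s):
    """Evaluate the quadratic form s^T Delta s."""
    n = len(s)
    return sum(s[i] * delta[i][j] * s[j] for i in range(n) for j in range(n))
```

For a single two-versus-two game, the entry of $\Delta$ between two teammates is positive. In an ordinary graph Laplacian, off-diagonal entries are the negated edge weights, so this reads as a negative edge between individuals on the same team.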

Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement
非復元抽出法における先祖のガンベル・トップkサンプリング

We develop ancestral Gumbel-Top-$k$ sampling: a generic and efficient method for sampling without replacement from discrete-valued Bayesian networks, which includes multivariate discrete distributions, Markov chains and sequence models. The method uses an extension of the Gumbel-Max trick to sample without replacement by finding the top $k$ of perturbed log-probabilities among all possible configurations of a Bayesian network. Despite the exponentially large domain, the algorithm has a complexity linear in the number of variables and sample size $k$. Our algorithm allows setting the number of parallel processors $m$ to trade off the number of iterations against the total cost (iterations times $m$) of running the algorithm. For $m = 1$ the algorithm has minimum total cost, whereas for $m = k$ the number of iterations is minimized, and the resulting algorithm is known as Stochastic Beam Search. We provide extensions of the algorithm and discuss a number of related algorithms. We analyze the properties of ancestral Gumbel-Top-$k$ sampling and compare against alternatives on randomly generated Bayesian networks with different levels of connectivity. In the context of (deep) sequence models, we show its use as a method to generate diverse but high-quality translations and statistical estimates of translation quality and entropy.

私たちは、先祖のGumbel-Top-$k$サンプリングを開発しました。これは、多変量離散分布、マルコフ連鎖、シーケンスモデルを含む離散値ベイジアンネットワークから非復元サンプリングを行う汎用的で効率的な方法です。この方法では、ベイジアンネットワークのすべての可能な構成から摂動対数確率の上位$k$を見つけることにより、非復元サンプリングを行うGumbel-Maxトリックの拡張を使用します。指数関数的に大きなドメインにもかかわらず、アルゴリズムの複雑さは変数の数とサンプルサイズ$k$に比例します。このアルゴリズムでは、並列プロセッサの数$m$を設定して、反復回数とアルゴリズム実行の総コスト(反復回数 ×$m$)をトレードオフできます。$m = 1$の場合、アルゴリズムの総コストは最小になりますが、$m = k$の場合、反復回数は最小化されます。結果として得られるアルゴリズムは、確率的ビームサーチとして知られています。アルゴリズムの拡張を示し、いくつかの関連アルゴリズムについて説明します。祖先Gumbel-Top-$k$サンプリングの特性を分析し、さまざまなレベルの接続性を持つランダムに生成されたベイジアンネットワーク上の代替方法と比較します。(ディープ)シーケンスモデルのコンテキストでは、多様でありながら高品質の翻訳と、翻訳の品質とエントロピーの統計的推定を生成する方法としてその使用法を示します。
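
The basic Gumbel-Top-$k$ trick that the abstract builds on can be sketched for a single categorical distribution: add independent Gumbel noise to each log-probability and keep the $k$ largest. This is only the flat, enumerate-everything version. The ancestral variant from the paper avoids enumerating the exponentially large domain and is not shown here. The function name is an assumption.

```python
import math
import random

def gumbel_top_k(log_probs, k, rng):
    """Return k distinct indices, sampled without replacement.

    The first returned index is distributed according to exp(log_probs);
    the full list is an ordered sample without replacement.
    """
    perturbed = []
    for i, lp in enumerate(log_probs):
        u = max(rng.random(), 1e-300)       # keep u in (0, 1) to avoid log(0)
        gumbel = -math.log(-math.log(u))    # Gumbel(0, 1) via inverse transform
        perturbed.append((lp + gumbel, i))
    perturbed.sort(reverse=True)            # largest perturbed log-probability first
    return [i for _, i in perturbed[:k]]
```

With $k = 1$ this reduces to the Gumbel-Max trick for drawing a single sample.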

pyts: A Python Package for Time Series Classification
pyts: 時系列分類のための Python パッケージ

pyts is an open-source Python package for time series classification. This versatile toolbox provides implementations of many algorithms published in the literature, preprocessing functionalities, and data set loading utilities. pyts relies on the standard scientific Python packages numpy, scipy, scikit-learn, joblib, and numba, and is distributed under the BSD-3-Clause license. Documentation contains installation instructions, a detailed user guide, a full API description, and concrete self-contained examples.

pytsは、時系列分類用のオープンソースのPythonパッケージです。この汎用性の高いツールボックスは、文献で公開されている多くのアルゴリズムの実装、前処理機能、およびデータセットの読み込みユーティリティを提供します。pytsは、標準の科学的なPythonパッケージnumpy, scipy, scikit-learn, joblib, numbaに依存しており、BSD-3-Clauseライセンスの下で配布されています。ドキュメントには、インストール手順、詳細なユーザーガイド、完全なAPIの説明、および具体的な自己完結型の例が含まれています。

A Convex Parametrization of a New Class of Universal Kernel Functions
新しいクラスのユニバーサルカーネル関数の凸パラメータ化

The accuracy and complexity of kernel learning algorithms are determined by the set of kernels over which they are able to optimize. An ideal set of kernels should: admit a linear parameterization (tractability); be dense in the set of all kernels (accuracy); and every member should be universal so that the hypothesis space is infinite-dimensional (scalability). Currently, there is no class of kernel that meets all three criteria – e.g. Gaussians are not tractable or accurate; polynomials are not scalable. We propose a new class that meets all three criteria – the Tessellated Kernel (TK) class. Specifically, the TK class: admits a linear parameterization using positive matrices; is dense in all kernels; and every element in the class is universal. This implies that the use of TK kernels for learning the kernel can obviate the need for selecting candidate kernels in algorithms such as SimpleMKL and parameters such as the bandwidth. Numerical testing on soft margin Support Vector Machine (SVM) problems shows that algorithms using TK kernels outperform other kernel learning algorithms and neural networks. Furthermore, our results show that when the ratio of the number of training data to features is high, the improvement of TK over MKL increases significantly.

カーネル学習アルゴリズムの精度と複雑さは、最適化できるカーネルのセットによって決まります。理想的なカーネルセットは、線形パラメータ化が可能であること(扱いやすさ)、すべてのカーネルセット内で密であること(精度)、仮説空間が無限次元になるようにすべてのメンバーがユニバーサルであること(スケーラビリティ)です。現在、3つの基準をすべて満たすカーネルクラスはありません。たとえば、ガウス分布は扱いにくく正確ではありません。多項式はスケーラブルではありません。私たちは、3つの基準をすべて満たす新しいクラス、Tessellated Kernel (TK)クラスを提案します。具体的には、TKクラスは、正の行列を使用した線形パラメータ化が可能であること、すべてのカーネル内で密であること、クラス内のすべての要素がユニバーサルであること、です。これは、カーネルの学習にTKカーネルを使用すると、SimpleMKLなどのアルゴリズムで候補カーネルを選択したり、帯域幅などのパラメータを選択する必要がなくなることを意味します。ソフトマージンサポートベクターマシン(SVM)の問題に関する数値テストでは、TKカーネルを使用するアルゴリズムが他のカーネル学習アルゴリズムやニューラルネットワークよりも優れていることが示されています。さらに、トレーニングデータ数と特徴数の比率が高い場合、TKのMKLに対する改善が大幅に増加することも示されています。

Dynamical Systems as Temporal Feature Spaces
時間的特徴空間としての動的システム

Parametrised state space models in the form of recurrent networks are often used in machine learning to learn from data streams exhibiting temporal dependencies. To break the black box nature of such models it is important to understand the dynamical features of the input-driving time series that are formed in the state space. We propose a framework for rigorous analysis of such state representations in vanishing memory state space models such as echo state networks (ESN). In particular, we consider the state space a temporal feature space and the readout mapping from the state space a kernel machine operating in that feature space. We show that: (1) The usual ESN strategy of randomly generating the input-to-state and state couplings leads to shallow memory time series representations, corresponding to a cross-correlation operator with fast exponentially decaying coefficients; (2) Imposing symmetry on dynamic coupling yields a constrained dynamic kernel matching the input time series with straightforward exponentially decaying motifs or exponentially decaying motifs of the highest frequency; (3) A simple ring (cycle) high-dimensional reservoir topology specified only through two free parameters can implement deep memory dynamic kernels with a rich variety of matching motifs. We quantify the richness of feature representations imposed by dynamic kernels and demonstrate that for the dynamic kernel associated with the cycle reservoir topology, the kernel richness undergoes a phase transition close to the edge of stability.

再帰ネットワークの形でパラメータ化された状態空間モデルは、時間的依存性を示すデータストリームから学習するために機械学習でよく使用されます。このようなモデルのブラックボックス特性を打破するには、状態空間で形成される入力駆動時系列の動的特徴を理解することが重要です。私たちは、エコー状態ネットワーク(ESN)などのメモリ消失状態空間モデルにおけるこのような状態表現の厳密な分析のためのフレームワークを提案します。特に、状態空間を時間的特徴空間と見なし、状態空間からの読み出しマッピングをその特徴空間で動作するカーネルマシンと見なします。私たちは次のことを示します。(1)入力から状態へのランダム生成の通常のESN戦略と状態結合により、急速に指数関数的に減少する係数を持つ相互相関演算子に対応する浅いメモリ時系列表現が生成されます。(2)動的結合に対称性を課すと、入力時系列を単純な指数関数的に減少するモチーフまたは最も頻度の高い指数関数的に減少するモチーフと一致させる制約付き動的カーネルが生成されます。（３）２つの自由パラメータのみで指定される単純なリング（サイクル）高次元リザーバトポロジーは、マッチングモチーフの多様性に富んだディープメモリ動的カーネルを実装することができます。動的カーネルによって課される特徴表現の豊かさを定量化し、サイクルリザーバトポロジーに関連付けられた動的カーネルの場合、カーネルの豊かさが安定性の限界に近い相転移を起こすことを実証した。
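
Point (3) of the abstract describes a ring reservoir governed by two free parameters. A minimal sketch of such a reservoir with a linear state update is given below. The parameter names `r` (cycle weight) and `v` (input weight magnitude), the linear update and the particular sign pattern of the input weights are all assumptions made for illustration, not details taken from the paper.

```python
def make_cycle_reservoir(n, r, v):
    """Return (step, w_in) for an n-unit ring with cycle weight r and input scale v."""
    # Hypothetical fixed sign pattern for the input weights, chosen arbitrarily.
    signs = [1.0 if (i * i) % 3 == 0 else -1.0 for i in range(n)]
    w_in = [v * s for s in signs]

    def step(state, u):
        # Unit i receives r times the previous state of unit i - 1;
        # index -1 wraps to the last unit, which closes the ring.
        return [r * state[i - 1] + w_in[i] * u for i in range(n)]

    return step, w_in

def run(step, n, inputs):
    """Drive the reservoir from the zero state with a sequence of scalar inputs."""
    state = [0.0] * n
    for u in inputs:
        state = step(state, u)
    return state
```

After an impulse input, the state is the input weight vector rotated around the ring and scaled by `r` at every step, so `r` below 1 gives a fading memory of past inputs.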

Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data
貪欲攻撃とガンベル攻撃: 離散データに対する敵対的例の生成

We present a probabilistic framework for studying adversarial attacks on discrete data. Based on this framework, we derive a perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, that illustrate various tradeoffs in the design of attacks. We demonstrate the effectiveness of these methods using both quantitative metrics and human evaluation on various state-of-the-art models for text classification, including a word-based CNN, a character-based CNN and an LSTM. As an example of our results, we show that the accuracy of character-based convolutional networks drops to the level of random selection by modifying only five characters through Greedy Attack.

私たちは、離散データに対する敵対的攻撃を研究するための確率的フレームワークを提示します。このフレームワークに基づいて、摂動ベースの手法であるGreedy Attackとスケーラブルな学習ベースの手法であるGumbel Attackを導き出し、攻撃の設計におけるさまざまなトレードオフを示しています。単語ベースのCNN、文字ベースのCNN、LSTMなど、テキスト分類のさまざまな最先端のモデルで、定量的な測定方法と人間の評価の両方を使用して、これらの方法の有効性を実証します。結果の例として、Greedy Attackを通じて5文字のみを変更することにより、文字ベースの畳み込みネットワークの精度がランダム選択のレベルに低下することを示しています。

Branch and Bound for Piecewise Linear Neural Network Verification
区分線形ニューラルネットワーク検証のための分岐と限定

The success of Deep Learning and its potential use in many safety-critical applications has motivated research on formal verification of Neural Network (NN) models. In this context, verification involves proving or disproving that an NN model satisfies certain input-output properties. Despite the reputation of learned NN models as black boxes, and the theoretical hardness of proving useful properties about them, researchers have been successful in verifying some classes of models by exploiting their piecewise linear structure and taking insights from formal methods such as Satisfiability Modulo Theory. However, these methods are still far from scaling to realistic neural networks. To facilitate progress on this crucial area, we exploit the Mixed Integer Linear Programming (MIP) formulation of verification to propose a family of algorithms based on Branch-and-Bound (BaB). We show that our family contains previous verification methods as special cases. With the help of the BaB framework, we make three key contributions. Firstly, we identify new methods that combine the strengths of multiple existing approaches, accomplishing significant performance improvements over previous state of the art. Secondly, we introduce an effective branching strategy on ReLU non-linearities. This branching strategy allows us to efficiently and successfully deal with high input dimensional problems with convolutional network architecture, on which previous methods fail frequently. Finally, we propose comprehensive test data sets and benchmarks which include a collection of previously released test cases. We use the data sets to conduct a thorough experimental comparison of existing and new algorithms and to provide an inclusive analysis of the factors impacting the hardness of verification problems.

ディープラーニングの成功と、多くの安全性が重要なアプリケーションでその潜在的利用が、ニューラルネットワーク(NN)モデルの形式検証に関する研究の動機となっています。この文脈では、検証には、NNモデルが特定の入出力プロパティを満たすことを証明または反証することが含まれます。学習済みNNモデルはブラックボックスであるという評判と、それらに関する有用なプロパティを証明する理論的困難さにもかかわらず、研究者は、モデルの区分線形構造を活用し、充足可能性モジュロ理論などの形式手法から洞察を得ることで、一部のクラスのモデルの検証に成功しています。ただし、これらの手法は、現実的なニューラルネットワークに拡張するにはまだほど遠いものです。この重要な分野での進歩を促進するために、検証の混合整数線形計画法(MIP)定式化を利用して、分岐限定法(BaB)に基づくアルゴリズムファミリを提案します。このファミリには、以前の検証手法が特別なケースとして含まれていることを示します。BaBフレームワークの助けを借りて、3つの重要な貢献をします。まず、複数の既存アプローチの長所を組み合わせた新しい方法を特定し、従来の最先端技術に比べて大幅なパフォーマンスの向上を実現します。次に、ReLU非線形性に関する効果的な分岐戦略を紹介します。この分岐戦略により、以前の方法では頻繁に失敗する、畳み込みネットワークアーキテクチャによる高入力次元の問題を効率的かつ確実に処理できます。最後に、以前にリリースされたテストケースのコレクションを含む包括的なテストデータセットとベンチマークを提案します。このデータセットを使用して、既存のアルゴリズムと新しいアルゴリズムの徹底的な実験比較を行い、検証問題の難しさに影響を与える要因の包括的な分析を提供します。
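
The branch-and-bound idea can be sketched on a one-hidden-layer ReLU network: bound the output over an input box with interval arithmetic, prune boxes whose lower bound already proves the property, and split the rest. This toy version branches on the input domain and uses loose interval bounds. It does not implement the MIP formulation or the ReLU branching strategy from the paper, and every name in it is an assumption.

```python
def relu_net(x, W1, b1, w2, b2):
    """Scalar output of a one-hidden-layer ReLU network."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def interval_lower_bound(box, W1, b1, w2, b2):
    """Sound but loose lower bound on the output over a box of (low, high) pairs."""
    total = b2
    for row, b, w_out in zip(W1, b1, w2):
        lo = b + sum(w * (l if w >= 0 else u) for w, (l, u) in zip(row, box))
        hi = b + sum(w * (u if w >= 0 else l) for w, (l, u) in zip(row, box))
        lo, hi = max(0.0, lo), max(0.0, hi)       # ReLU is monotone
        total += w_out * (lo if w_out >= 0 else hi)
    return total

def verify_positive(box, net, max_domains=10000):
    """Check the property 'output > 0 everywhere on the box'.

    Returns (True, None) if proved, (False, x) with a counterexample x,
    or (None, None) if the domain budget runs out first.
    """
    domains = [list(box)]
    processed = 0
    while domains:
        processed += 1
        if processed > max_domains:
            return None, None
        dom = domains.pop()
        if interval_lower_bound(dom, *net) > 0:
            continue                               # bound: property holds here
        center = [(l + u) / 2.0 for l, u in dom]
        if relu_net(center, *net) <= 0:
            return False, center                   # concrete violation found
        # branch: split the widest input dimension in half
        i = max(range(len(dom)), key=lambda j: dom[j][1] - dom[j][0])
        l, u = dom[i]
        mid = (l + u) / 2.0
        domains.append(dom[:i] + [(l, mid)] + dom[i + 1:])
        domains.append(dom[:i] + [(mid, u)] + dom[i + 1:])
    return True, None
```

For example, with `net = ([[1.0], [1.0]], [1.0, 0.0], [1.0, -1.0], 0.1)` and `box = [(-1.0, 1.0)]`, the bound on the full box is negative, but both halves are proved after one split.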

Switching Regression Models and Causal Inference in the Presence of Discrete Latent Variables
離散潜在変数の存在下での回帰モデルの切り替えと因果推論

Given a response $Y$ and a vector $X = (X^1, \dots, X^d)$ of $d$ predictors, we investigate the problem of inferring direct causes of $Y$ among the vector $X$. Models for $Y$ that use all of its causal covariates as predictors enjoy the property of being invariant across different environments or interventional settings. Given data from such environments, this property has been exploited for causal discovery. Here, we extend this inference principle to situations in which some (discrete-valued) direct causes of $Y$ are unobserved. Such cases naturally give rise to switching regression models. We provide sufficient conditions for the existence, consistency and asymptotic normality of the MLE in linear switching regression models with Gaussian noise, and construct a test for the equality of such models. These results allow us to prove that the proposed causal discovery method obtains asymptotic false discovery control under mild conditions. We provide an algorithm, make available code, and test our method on simulated data. It is robust against model violations and outperforms state-of-the-art approaches. We further apply our method to a real data set, where we show that it outputs not only causal predictors but also a process-based clustering of data points, which could be of additional interest to practitioners.

応答$Y$と$d$個の予測子のベクトル$X = (X^1, \dots, X^d)$が与えられた場合、ベクトル$X$の中から$Y$の直接的な原因を推論する問題を調査します。すべての因果共変量を予測子として使用する$Y$のモデルは、異なる環境または介入設定にわたって不変であるという特性があります。このような環境からのデータがあれば、この特性は因果発見に活用されています。ここでは、この推論原理を、$Y$の直接的な原因の一部(離散値)が観測されない状況に拡張します。このようなケースでは、当然スイッチング回帰モデルが生じます。ガウスノイズを含む線形スイッチング回帰モデルにおけるMLEの存在、一貫性、および漸近正規性に関する十分な条件を提供し、このようなモデルの等価性に関するテストを構築します。これらの結果により、提案された因果発見方法が、穏やかな条件下で漸近的な誤発見制御を実現することを証明できます。アルゴリズムを提供し、コードを公開し、シミュレーションデータでこの方法をテストします。この方法はモデル違反に対して堅牢であり、最先端のアプローチよりも優れています。さらに、この方法を実際のデータセットに適用し、因果予測子を出力するだけでなく、データポイントのプロセスベースのクラスタリングも出力できることを示しました。これは、実践者にとってさらに興味深いものになる可能性があります。

Optimal Bipartite Network Clustering
最適な二部ネットワーククラスタリング

We study bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. This is further formalized by deriving a minimax lower bound over a class of biclustering problems. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth, from sparse networks with average degrees growing arbitrarily slowly to fairly dense networks with average degrees of order $\sqrt{n}$. As a special case, we recover the known exact recovery threshold in the $\log n$ regime of sparsity. To obtain the consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable algorithm is derived from a general class of pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations.

私たちは、ネットワークにおける二部コミュニティ検出、またはより一般的にはネットワークバイクラスタリング問題を研究します。私たちは、スペクトル初期化とそれに続く疑似尤度分類器の2回の適用に基づく高速2段階手順を提示します。軽度の正則性条件下では、一般的な二部確率ブロックモデルの下で手順の弱一貫性(すなわち、誤分類率がゼロに収束すること)を確立します。私たちは、この手順が、バイクラスタリングオラクルによって達成可能な最適収束率を、クラス全体にわたって定数まで適応的に達成するという意味で最適であることを示す。これは、バイクラスタリング問題のクラスにわたってミニマックス下限を導出することによってさらに形式化されます。我々が得た最適率は、既存の結果のいくつかを鮮明にし、平均次数が任意にゆっくりと増加する疎なネットワークから、平均次数が$\sqrt{n}$オーダーのかなり密なネットワークまで、平均次数増加の広い範囲に他の結果を一般化します。特別なケースとして、スパース性の$\log n$領域で既知の正確な回復しきい値を回復します。一貫性の結果を得るために、アルゴリズムの証明可能バージョンの一部として、計算的にも魅力的なサブブロック分割スキームを導入し、最適性を犠牲にすることなくアルゴリズムの分散実装を可能にします。証明可能なアルゴリズムは、単純なEMタイプ更新を使用する疑似尤度バイクラスタリングアルゴリズムの一般的なクラスから派生しています。数値シミュレーションによって、この一般的なクラスの有効性を示します。

Learning Linear Non-Gaussian Causal Models in the Presence of Latent Variables
潜在変数の存在下での線形非ガウス因果モデルの学習

We consider the problem of learning causal models from observational data generated by linear non-Gaussian acyclic causal models with latent variables. Without considering the effect of latent variables, the inferred causal relationships among the observed variables are often wrong. Under the faithfulness assumption, we propose a method to check whether there exists a causal path between any two observed variables. From this information, we can obtain the causal order among the observed variables. The next question is whether the causal effects can be uniquely identified as well. We show that causal effects among observed variables cannot be identified uniquely under mere assumptions of faithfulness and non-Gaussianity of exogenous noises. However, we are able to propose an efficient method that identifies the set of all possible causal effects that are compatible with the observational data. We present additional structural conditions on the causal graph under which causal effects among observed variables can be determined uniquely. Furthermore, we provide necessary and sufficient graphical conditions for unique identification of the number of variables in the system. Experiments on synthetic data and real-world data show the effectiveness of our proposed algorithm for learning causal models.

私たちは、潜在変数を持つ線形非ガウス非巡回因果モデルによって生成された観測データから因果モデルを学習する問題について考える。潜在変数の影響を考慮しないと、観測変数間の推定因果関係はしばしば間違っています。忠実性仮定の下で、任意の2つの観測変数間に因果パスが存在するかどうかを確認する方法を提案します。この情報から、観測変数間の因果順序を取得できます。次の質問は、因果効果も一意に識別できるかどうかです。忠実性と外生ノイズの非ガウス性という単なる仮定の下では、観測変数間の因果効果を一意に識別できないことを示します。ただし、観測データと互換性のあるすべての可能な因果効果の集合を識別する効率的な方法を提案できます。観測変数間の因果効果を一意に決定できる因果グラフ上の追加の構造条件を示します。さらに、システム内の変数の数を一意に識別するための必要かつ十分なグラフィカル条件を提供します。合成データと実世界のデータでの実験により、因果モデルを学習するための提案アルゴリズムの有効性が示されました。

Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification
潜在シンプレックス位置モデル:不確実性の定量化による高次元多視点クラスタリング

High dimensional data often contain multiple facets, and several clustering patterns can co-exist under different variable subspaces, also known as the views. While multi-view clustering algorithms were proposed, the uncertainty quantification remains difficult — a particular challenge is in the high complexity of estimating the cluster assignment probability under each view, and sharing information among views. In this article, we propose an approximate Bayes approach — treating the similarity matrices generated over the views as rough first-stage estimates for the co-assignment probabilities; in its Kullback-Leibler neighborhood, we obtain a refined low-rank matrix, formed by the pairwise product of simplex coordinates. Interestingly, each simplex coordinate directly encodes the cluster assignment uncertainty. For multi-view clustering, we let each view draw a parameterization from a few candidates, leading to dimension reduction. With high model flexibility, the estimation can be efficiently carried out as a continuous optimization problem, hence enjoys gradient-based computation. The theory establishes the connection of this model to a random partition distribution under multiple views. Compared to single-view clustering approaches, substantially more interpretable results are obtained when clustering brains from a human traumatic brain injury study, using high-dimensional gene expression data.

高次元データには複数のファセットが含まれることが多く、異なる変数サブスペース(ビューとも呼ばれる)の下に複数のクラスタリングパターンが共存することがあります。マルチビュークラスタリングアルゴリズムが提案されていますが、不確実性の定量化は依然として困難です。特に課題となるのは、各ビューでのクラスタ割り当て確率の推定とビュー間での情報共有の複雑さの高さです。この記事では、近似ベイズアプローチを提案します。ビューで生成された類似性マトリックスを共割り当て確率の大まかな第1段階の推定値として扱います。Kullback-Leibler近傍では、単体座標のペアワイズ積によって形成された、洗練された低ランクマトリックスが得られます。興味深いことに、各単体座標は、クラスタ割り当ての不確実性を直接エンコードします。マルチビュークラスタリングでは、各ビューにいくつかの候補からパラメーター化を描画させ、次元削減を実現します。モデルの柔軟性が高いため、推定は連続最適化問題として効率的に実行でき、勾配ベースの計算を利用できます。理論は、このモデルと複数のビューでのランダムパーティション分布との関連を確立します。高次元遺伝子発現データを使用して、人間の外傷性脳損傷研究から脳をクラスタリングすると、単一ビューのクラスタリング手法と比較して、大幅に解釈しやすい結果が得られます。

Causal Discovery Toolbox: Uncovering causal relationships in Python
因果関係発見ツールボックス: Python で因果関係を明らかにする

This paper presents a new open source Python framework for causal discovery from observational data and domain background knowledge, aimed at causal graph and causal mechanism modeling. The cdt package implements an end-to-end approach, recovering the direct dependencies (the skeleton of the causal graph) and the causal relationships between variables. It includes algorithms from the 'Bnlearn' and 'Pcalg' packages, together with algorithms for pairwise causal discovery such as ANM.

この論文では、因果グラフと因果メカニズムのモデリングを目的とした、観測データとドメインの背景知識からの因果関係の発見のための新しいオープンソースPythonフレームワークを紹介します。cdtパッケージは、直接的な依存関係(因果グラフのスケルトン)と変数間の因果関係を回復する、エンドツーエンドのアプローチを実装しています。これには、「Bnlearn」および「Pcalg」パッケージのアルゴリズムと、ANMなどのペアワイズ因果関係発見のアルゴリズムが含まれています。

Noise Accumulation in High Dimensional Classification and Total Signal Index
高次元分類におけるノイズ蓄積と総信号指数

Great attention has been paid to Big Data in recent years. Such data hold promise for scientific discoveries but also pose challenges to analyses. One potential challenge is noise accumulation. In this paper, we explore noise accumulation in high dimensional two-group classification. First, we revisit a previous assessment of noise accumulation with principal component analyses, which yields a different threshold for discriminative ability than originally identified. Then we extend our scope to its impact on classifiers developed with three common machine learning approaches: random forest, support vector machine, and boosted classification trees. We simulate four scenarios with differing amounts of signal strength to evaluate each method. After determining that noise accumulation may affect the performance of these classifiers, we assess factors that impact it. We conduct simulations by varying sample size, signal strength, signal strength proportional to the number of predictors, and signal magnitude with random forest classifiers. These simulations suggest that noise accumulation affects the discriminative ability of high-dimensional classifiers developed using common machine learning methods, which can be modified by sample size, signal strength, and signal magnitude. We developed the measure total signal index (TSI) to track the trends of total signal and noise accumulation.

近年、ビッグデータに大きな注目が集まっています。このようなデータは科学的発見の可能性を秘めていますが、分析には課題も伴います。潜在的な課題の1つはノイズの蓄積です。この論文では、高次元2グループ分類におけるノイズの蓄積について検討します。まず、主成分分析によるノイズ蓄積の以前の評価を再検討します。この評価では、当初特定されたものとは異なる識別能力の閾値が示されます。次に、3つの一般的な機械学習アプローチ(ランダムフォレスト、サポートベクターマシン、ブースト分類ツリー)を使用して開発された分類器への影響に範囲を広げます。信号強度の異なる4つのシナリオをシミュレートして各手法を評価します。ノイズ蓄積がこれらの分類器のパフォーマンスに影響を与える可能性があると判断した後、ノイズ蓄積に影響を与える要因を評価します。ランダムフォレスト分類器を使用して、サンプルサイズ、信号強度、予測子の数に比例する信号強度、および信号の大きさを変えてシミュレーションを実行します。これらのシミュレーションは、ノイズ蓄積が、サンプルサイズ、信号強度、および信号の大きさによって変更できる一般的な機械学習方法を使用して開発された高次元分類器の識別能力に影響を与えることを示唆しています。総信号とノイズの蓄積の傾向を追跡するために、総信号指数(TSI)の測定基準を開発しました。

Learning with Fenchel-Young losses
Fenchel-Young損失による学習

Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choosing the right loss for the right problem, as well as to creating new losses which combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks, and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and allow one to create useful new ones easily. Finally, we derive efficient predictive and training algorithms, making Fenchel-Young losses appealing both in theory and practice.

過去数十年にわたり、回帰、分類、ランキング、より一般的には構造化予測など、さまざまな教師あり学習タスクに対して多数の損失関数が提案されてきました。これらの損失の根底にある中核原理と理論的特性を理解することは、適切な問題に適切な損失を選択するだけでなく、それらの長所を組み合わせた新しい損失を作成するための鍵となります。この論文では、正則化された予測関数に対する凸損失関数を構築する一般的な方法であるFenchel-Young損失を紹介します。非常に幅広い設定でその特性を詳細に研究し、前述のすべての教師あり学習タスクをカバーし、スパース性、一般化エントロピー、分離マージンの間の新しい関係を明らかにします。Fenchel-Young損失は多くのよく知られている損失関数を統合し、有用な新しい損失関数を簡単に作成できることを示します。最後に、効率的な予測およびトレーニングアルゴリズムを導出し、Fenchel-Young損失を理論と実践の両方で魅力的なものにします。
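
A minimal sketch of the construction described in this abstract, assuming the usual definition of a Fenchel-Young loss, $L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$, and taking $\Omega$ to be the negative Shannon entropy on the probability simplex (an illustrative choice, not the only one the paper covers). Under that assumption $\Omega^*$ is log-sum-exp and the loss should reduce to the familiar cross-entropy. The function names are hypothetical.

```python
import math

def logsumexp(theta):
    """Conjugate of the negative Shannon entropy restricted to the simplex."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def neg_entropy(y):
    """Omega(y) = sum_i y_i log y_i, using the convention 0 log 0 = 0."""
    return sum(v * math.log(v) for v in y if v > 0)

def fenchel_young_loss(theta, y):
    """L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>."""
    inner = sum(t * v for t, v in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - inner

def softmax(theta):
    """Regularized prediction associated with the Shannon negentropy."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

theta = [2.0, 0.5, -1.0]          # scores (illustrative values)
y = [1.0, 0.0, 0.0]               # one-hot target
fy = fenchel_young_loss(theta, y)
cross_entropy = -math.log(softmax(theta)[0])
```

With a one-hot target the two quantities `fy` and `cross_entropy` agree, and the loss vanishes when the target equals `softmax(theta)`. Swapping in a different `Omega` (with its conjugate) would give a different member of the family.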

Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent
分散確率的劣勾配降下法のためのグラフ依存の暗黙的正則化

We propose graph-dependent implicit regularisation strategies for synchronised distributed stochastic subgradient descent (Distributed SGD) for convex problems in multi-agent learning. Under the standard assumptions of convexity, Lipschitz continuity, and smoothness, we establish statistical learning rates that retain, up to logarithmic terms, single-machine serial statistical guarantees through implicit regularisation (step size tuning and early stopping) with appropriate dependence on the graph topology. Our approach avoids the need for explicit regularisation in decentralised learning problems, such as adding constraints to the empirical risk minimisation rule. Particularly for distributed methods, the use of implicit regularisation allows the algorithm to remain simple, without projections or dual methods. To prove our results, we establish graph-independent generalisation bounds for Distributed SGD that match the single-machine serial SGD setting (using algorithmic stability), and we establish graph-dependent optimisation bounds that are of independent interest. We present numerical experiments to show that the qualitative nature of the upper bounds we derive can be representative of real behaviours.

私たちは、マルチエージェント学習の凸問題に対する同期分散確率的劣勾配降下法(Distributed SGD)のためのグラフ依存の暗黙的正則化戦略を提案します。凸性、リプシッツ連続性、滑らかさという標準的な仮定の下で、グラフトポロジーに適切に依存した暗黙的正則化(ステップサイズの調整と早期停止)を通じて、対数項まで単一マシンシリアル統計保証を保持する統計学習率を確立します。我々のアプローチは、経験的リスク最小化ルールに制約を追加するなど、分散学習問題における明示的な正則化の必要性を回避します。特に分散法の場合、暗黙的正則化を使用することで、投影やデュアルメソッドを使用せずにアルゴリズムを単純なままにすることができます。我々の結果を証明するために、単一マシンシリアルSGD設定と一致するDistributed SGDのグラフ非依存の一般化境界を確立し(アルゴリズムの安定性を使用)、独立した関心事であるグラフ依存の最適化境界を確立します。導出した上限の定性的な性質が実際の動作を代表できることを示す数値実験を紹介します。

On the Complexity Analysis of the Primal Solutions for the Accelerated Randomized Dual Coordinate Ascent
加速ランダム化双対座標上昇法における主解の計算量解析について

Dual first-order methods are essential techniques for large-scale constrained convex optimization. However, when recovering the primal solutions, we need $T(\epsilon^{-2})$ iterations to achieve an $\epsilon$-optimal primal solution when we apply an algorithm to the non-strongly convex dual problem with $T(\epsilon^{-1})$ iterations to achieve an $\epsilon$-optimal dual solution, where $T(x)$ can be $x$ or $\sqrt{x}$. In this paper, we prove that the iteration complexity of the primal solutions and dual solutions have the same $O\left(\frac{1}{\sqrt{\epsilon}}\right)$ order of magnitude for the accelerated randomized dual coordinate ascent. When the dual function further satisfies the quadratic functional growth condition, by restarting the algorithm at any period, we establish the linear iteration complexity for both the primal solutions and dual solutions even if the condition number is unknown. When applied to the regularized empirical risk minimization problem, we prove the iteration complexity of $O\left(n\log n+\sqrt{\frac{n}{\epsilon}}\right)$ in both primal space and dual space, where $n$ is the number of samples. Our result takes out the $\left(\log \frac{1}{\epsilon}\right)$ factor compared with the methods based on smoothing/regularization or Catalyst reduction. As far as we know, this is the first time that the optimal $O\left(\sqrt{\frac{n}{\epsilon}}\right)$ iteration complexity in the primal space is established for the dual coordinate ascent based stochastic algorithms. We also establish the accelerated linear complexity for some problems with nonsmooth loss, e.g., the least absolute deviation and SVM.

双対一次法は、大規模な制約付き凸最適化に不可欠な手法です。しかし、非強凸な双対問題に対して$\epsilon$最適な双対解を得るのに$T(\epsilon^{-1})$回の反復を要するアルゴリズムを適用した場合、主解を回復して$\epsilon$最適な主解を得るには$T(\epsilon^{-2})$回の反復が必要になります。ここで、$T(x)$は$x$または$\sqrt{x}$です。この論文では、加速ランダム化双対座標上昇法に対して、主解と双対解の反復計算量がいずれも同じ$O\left(\frac{1}{\sqrt{\epsilon}}\right)$のオーダーであることを証明します。双対関数がさらに二次関数成長条件を満たす場合、任意の周期でアルゴリズムをリスタートすることにより、条件数が未知であっても、主解と双対解の両方に対して線形の反復計算量を確立します。正則化された経験的リスク最小化問題に適用すると、主空間と双対空間の両方で反復計算量が$O\left(n\log n+\sqrt{\frac{n}{\epsilon}}\right)$であることを証明します。ここで、$n$はサンプル数です。私たちの結果は、平滑化/正則化またはCatalystによる帰着に基づく方法と比較して、$\left(\log \frac{1}{\epsilon}\right)$の因子を取り除きます。私たちの知る限り、双対座標上昇に基づく確率的アルゴリズムに対して、主空間で最適な$O\left(\sqrt{\frac{n}{\epsilon}}\right)$の反復計算量が確立されたのはこれが初めてです。また、最小絶対偏差やSVMなど、滑らかでない損失を伴ういくつかの問題に対して、加速された線形計算量を確立します。
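
To make the primal/dual relationship in this abstract concrete, here is a small sketch of plain (non-accelerated, no restarts) randomized dual coordinate ascent for regularized empirical risk minimization. It is not the accelerated method analyzed in the paper. The squared loss, the scalar weight, the data, and the function name are all assumptions chosen so that the recovered primal solution can be compared with a closed form.

```python
import random

def sdca_ridge_1d(xs, ys, lam, epochs=2000, seed=0):
    """Randomized dual coordinate ascent for
        min_w (1/n) * sum_i 0.5 * (w * x_i - y_i)**2 + (lam / 2) * w**2
    with a scalar weight w. Returns the primal iterate and the dual variables."""
    rng = random.Random(seed)
    n = len(xs)
    alpha = [0.0] * n   # one dual variable per sample
    w = 0.0             # primal iterate, kept equal to sum_i alpha_i * x_i / (lam * n)
    for _ in range(epochs * n):
        i = rng.randrange(n)
        # Exact maximization of the dual objective along coordinate i.
        delta = (ys[i] - xs[i] * w - alpha[i]) / (1.0 + xs[i] ** 2 / (lam * n))
        alpha[i] += delta
        w += delta * xs[i] / (lam * n)
    return w, alpha

xs = [1.0, 2.0, -1.0, 0.5, 1.5]   # illustrative data
ys = [1.1, 1.9, -0.8, 0.4, 1.7]
lam = 0.1
w_sdca, alpha = sdca_ridge_1d(xs, ys, lam)

# Closed-form ridge solution for the same scalar problem.
n = len(xs)
w_closed = (sum(x * y for x, y in zip(xs, ys)) / n) / (sum(x * x for x in xs) / n + lam)
```

The primal solution is never optimized directly: it is read off the dual variables at every step, which is the recovery step whose iteration complexity the abstract discusses.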

Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping
滑らかな関数のmod 1サンプルの証明可能なロバスト推定と位相アンラッピングへの応用

Consider an unknown smooth function $f: [0,1]^d \rightarrow \mathbb{R}$, and assume we are given $n$ noisy mod 1 samples of $f$, i.e., $y_i = (f(x_i) + \eta_i) \bmod 1$, for $x_i \in [0,1]^d$, where $\eta_i$ denotes the noise. Given the samples $(x_i,y_i)_{i=1}^{n}$, our goal is to recover smooth, robust estimates of the clean samples $f(x_i) \bmod 1$. We formulate a natural approach for solving this problem, which works with angular embeddings of the noisy mod 1 samples over the unit circle, inspired by the angular synchronization framework. This amounts to solving a smoothness regularized least-squares problem — a quadratically constrained quadratic program (QCQP) — where the variables are constrained to lie on the unit circle. Our proposed approach is based on solving its relaxation, which is a trust-region sub-problem and hence solvable efficiently. We provide theoretical guarantees demonstrating its robustness to noise for adversarial, as well as random Gaussian and Bernoulli noise models. To the best of our knowledge, these are the first such theoretical results for this problem. We demonstrate the robustness and efficiency of our proposed approach via extensive numerical simulations on synthetic data, along with a simple least-squares based solution for the unwrapping stage, that recovers the original samples of $f$ (up to a global shift). It is shown to perform well at high levels of noise, when taking as input the denoised modulo $1$ samples. Finally, we also consider two other approaches for denoising the modulo 1 samples that leverage tools from Riemannian optimization on manifolds, including a Burer-Monteiro approach for a semidefinite programming relaxation of our formulation. For the two-dimensional version of the problem, which has applications in synthetic aperture radar interferometry (InSAR), we are able to solve instances of real-world data with a million sample points in under 10 seconds, on a personal laptop.

未知の滑らかな関数$f: [0,1]^d \rightarrow \mathbb{R}$を考え、$f$の$n$個のノイズのあるmod 1サンプル、つまり$y_i = (f(x_i) + \eta_i) \bmod 1$ ($x_i \in [0,1]^d$)が与えられていると仮定します。ここで、$\eta_i$はノイズを表します。サンプル$(x_i,y_i)_{i=1}^{n}$が与えられた場合、私たちの目標は、クリーンなサンプル$f(x_i) \bmod 1$の滑らかで堅牢な推定値を回復することです。私たちは、角度同期フレームワークに触発された、単位円上のノイズのあるmod 1サンプルの角度埋め込みを使用してこの問題を解決するための自然なアプローチを定式化します。これは、平滑性正規化最小二乗問題、つまり変数が単位円上にあるように制約される二次制約二次計画(QCQP)を解くことに相当します。提案するアプローチは、その緩和を解くことに基づいています。これは信頼領域サブ問題であり、したがって効率的に解くことができます。敵対的ノイズモデル、ランダムガウスノイズモデル、ベルヌーイノイズモデルに対するノイズに対する堅牢性を示す理論的保証を提供します。私たちの知る限り、これはこの問題に対する初めての理論的結果です。合成データに対する広範な数値シミュレーションと、アンラップステージ用の単純な最小二乗ベースのソリューションにより、提案するアプローチの堅牢性と効率性を実証します。このソリューションは、元の$f$サンプル(グローバルシフトまで)を復元します。ノイズ除去されたモジュロ$1$サンプルを入力として受け取ると、ノイズレベルが高い場合でも優れたパフォーマンスを発揮することが示されています。最後に、多様体上のリーマン最適化のツールを活用したモジュロ1サンプルのノイズ除去のための他の2つのアプローチも検討します。これには、定式化の半正定値計画法緩和のためのBurer-Monteiroアプローチが含まれます。合成開口レーダー干渉法(InSAR)に適用される問題の2次元バージョンでは、個人のラップトップで100万のサンプルポイントを持つ実際のデータのインスタンスを10秒未満で解決できます。
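
The angular embedding mentioned in this abstract can be sketched as follows: each mod 1 sample $y_i$ is mapped to the point $e^{2\pi \mathrm{i} y_i}$ on the unit circle, where wrap-around jumps disappear. The sketch then denoises by simple local averaging on the circle, which is only a stand-in for the paper's smoothness-regularized QCQP and its trust-region relaxation. The test function $f(x) = 4x$, the noise level, the window size, and all names are assumptions.

```python
import cmath
import math
import random

def wrap_dist(a, b):
    """Distance between two values modulo 1 (aware of the wrap-around)."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

def denoise_mod1(y, half_window=5):
    """Embed mod 1 samples on the unit circle, average neighbours, read back the angle.
    Local averaging replaces the paper's regularized least-squares step."""
    n = len(y)
    z = [cmath.exp(2j * math.pi * v) for v in y]
    out = []
    for i in range(n):
        k = min(half_window, i, n - 1 - i)   # symmetric window, shrunk at the ends
        s = sum(z[i - k:i + k + 1])
        out.append((cmath.phase(s) / (2 * math.pi)) % 1.0)
    return out

rng = random.Random(0)
n = 200
xs = [i / (n - 1) for i in range(n)]
clean = [(4.0 * x) % 1.0 for x in xs]                           # f(x_i) mod 1
noisy = [(4.0 * x + rng.gauss(0.0, 0.05)) % 1.0 for x in xs]    # y_i
denoised = denoise_mod1(noisy)

err_noisy = sum(wrap_dist(a, b) for a, b in zip(noisy, clean)) / n
err_denoised = sum(wrap_dist(a, b) for a, b in zip(denoised, clean)) / n
```

Averaging the raw values `noisy` directly would fail near the jumps from values close to 1 back to values close to 0; averaging the embedded points avoids this, which is the motivation for working on the circle.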

Generalized Nonbacktracking Bounds on the Influence
影響の一般化非バックトラッキング境界

This paper develops deterministic upper and lower bounds on the influence measure in a network, more precisely, the expected number of nodes that a seed set can influence in the independent cascade model. In particular, our bounds exploit r-nonbacktracking walks and Fortuin-Kasteleyn-Ginibre (FKG) type inequalities, and are computed by message passing algorithms. Further, we provide parameterized versions of the bounds that control the trade-off between efficiency and accuracy. Finally, the tightness of the bounds is illustrated on various network models.

この論文では、ネットワークの影響度、より正確には、シードセットが独立カスケードモデルで影響を与えることができるノードの期待数について、決定論的な上限と下限を開発します。特に、私たちの境界はr-nonbacktracking walksとFortuin-Kasteleyn-Ginibre (FKG)型の不等式を利用し、メッセージパッシングアルゴリズムによって計算されます。さらに、効率と精度のトレードオフを制御する範囲のパラメーター化されたバージョンを提供します。最後に、境界の厳密さがさまざまなネットワークモデルで示されています。
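
The quantity being bounded in this abstract, the expected number of nodes a seed set activates in the independent cascade model, can be computed exactly on very small graphs by enumerating which edges are "live". The sketch below does only that; it does not implement the nonbacktracking or FKG-based bounds of the paper. The graphs, probabilities, and function name are illustrative assumptions.

```python
from itertools import product

def exact_influence(edges, seeds):
    """Expected number of activated nodes in the independent cascade model.
    `edges` maps a directed pair (u, v) to its transmission probability.
    All live-edge outcomes are enumerated, so this only suits tiny graphs."""
    edge_list = list(edges.items())
    total = 0.0
    for outcome in product([False, True], repeat=len(edge_list)):
        prob = 1.0
        live = {}
        for ((u, v), p), is_live in zip(edge_list, outcome):
            prob *= p if is_live else 1.0 - p
            if is_live:
                live.setdefault(u, []).append(v)
        # Nodes reachable from the seeds through live edges are activated.
        reached = set(seeds)
        stack = list(seeds)
        while stack:
            u = stack.pop()
            for v in live.get(u, []):
                if v not in reached:
                    reached.add(v)
                    stack.append(v)
        total += prob * len(reached)
    return total

# Path s -> a -> b with probability 0.5 per edge: 1 + 0.5 + 0.25 = 1.75.
path_influence = exact_influence({("s", "a"): 0.5, ("a", "b"): 0.5}, seeds=["s"])

# Adding a direct edge s -> b gives b a second route: 1 + 0.5 + 0.625 = 2.125.
triangle_influence = exact_influence(
    {("s", "a"): 0.5, ("s", "b"): 0.5, ("a", "b"): 0.5}, seeds=["s"]
)
```

The enumeration costs $2^{|E|}$ evaluations, which is why deterministic bounds computable by message passing are attractive on larger networks.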

Tensor Train Decomposition on TensorFlow (T3F)
TensorFlow でのテンソルトレイン分解 (T3F)

Tensor Train decomposition is used across many branches of machine learning. We present T3F—a library for Tensor Train decomposition based on TensorFlow. T3F supports GPU execution, batch processing, automatic differentiation, and versatile functionality for the Riemannian optimization framework, which takes into account the underlying manifold structure to construct efficient optimization methods. The library makes it easier to implement machine learning papers that rely on the Tensor Train decomposition. T3F includes documentation, examples and 94% test coverage.

テンソルトレイン分解は、機械学習の多くの分野で使用されています。TensorFlowに基づくTensor Train分解用のライブラリであるT3Fを紹介します。T3Fは、GPU実行、バッチ処理、自動微分、およびリーマン最適化フレームワークの多彩な機能をサポートしており、基礎となる多様体構造を考慮して効率的な最適化手法を構築します。このライブラリを使用すると、Tensor Train分解に依存する機械学習の論文を簡単に実装できます。T3Fには、ドキュメント、例、94%のテストカバレッジが含まれています。

The Maximum Separation Subspace in Sufficient Dimension Reduction with Categorical Response
カテゴリカル応答による十分な次元削減における最大分離部分空間

Sufficient dimension reduction (SDR) is a very useful concept for exploratory analysis and data visualization in regression, especially when the number of covariates is large. Many SDR methods have been proposed for regression with a continuous response, where the central subspace (CS) is the target of estimation. Various conditions, such as the linearity condition and the constant covariance condition, are imposed so that these methods can estimate at least a portion of the CS. In this paper we study SDR for regression and discriminant analysis with categorical response. Motivated by the exploratory analysis and data visualization aspects of SDR, we propose a new geometric framework to reformulate the SDR problem in terms of manifold optimization and introduce a new concept called Maximum Separation Subspace (MASES). The MASES naturally preserves the “sufficiency” in SDR without imposing additional conditions on the predictor distribution, and directly inspires a semi-parametric estimator. Numerical studies show MASES exhibits superior performance as compared with competing SDR methods in specific settings.

十分な次元削減(SDR)は、回帰分析における探索的分析とデータ可視化、特に共変量数が多い場合に非常に役立つ概念です。連続応答を伴う回帰に対して、中心部分空間(CS)が推定の対象となる多くのSDR手法が提案されています。これらの手法がCSの少なくとも一部を推定できるように、線形性条件や定数共分散条件などのさまざまな条件が課されます。この論文では、カテゴリ応答を伴う回帰分析と判別分析のためのSDRについて検討します。SDRの探索的分析とデータ可視化の側面に着目し、SDR問題を多様体最適化の観点から再定式化する新しい幾何学的フレームワークを提案し、最大分離部分空間(MASES)と呼ばれる新しい概念を導入します。MASESは、予測変数の分布に追加の条件を課すことなく、SDRの「十分性」を自然に保持し、セミパラメトリック推定量を直接導きます。数値的研究によれば、MASESは特定の設定において競合するSDR手法と比較して優れたパフォーマンスを発揮します。

On the consistency of graph-based Bayesian semi-supervised learning and the scalability of sampling algorithms
グラフベースベイジアン半教師あり学習の一貫性とサンプリングアルゴリズムのスケーラビリティについて

This paper considers a Bayesian approach to graph-based semi-supervised learning. We show that if the graph parameters are suitably scaled, the graph-posteriors converge to a continuum limit as the size of the unlabeled data set grows. This consistency result has profound algorithmic implications: we prove that when consistency holds, carefully designed Markov chain Monte Carlo algorithms have a uniform spectral gap, independent of the number of unlabeled inputs. Numerical experiments illustrate and complement the theory.

この論文では、グラフベースの半教師あり学習に対するベイズアプローチについて考察します。グラフパラメータが適切にスケーリングされている場合、ラベル付けされていないデータセットのサイズが大きくなるにつれて、グラフ事後は連続体の限界に収束することを示します。この一貫性の結果は、アルゴリズムに深い意味を持ちます:一貫性が保持されると、慎重に設計されたマルコフ連鎖モンテカルロアルゴリズムは、ラベル付けされていない入力の数に関係なく、均一なスペクトルギャップを持つことを証明します。数値実験は、理論を説明し、補完します。

A New Class of Time Dependent Latent Factor Models with Applications
新しいクラスの時間依存潜在因子モデルとその応用

In many applications, observed data are influenced by some combination of latent causes. For example, suppose sensors are placed inside a building to record responses such as temperature, humidity, power consumption and noise levels. These random, observed responses are typically affected by many unobserved, latent factors (or features) within the building such as the number of individuals, the turning on and off of electrical devices, power surges, etc. These latent factors are usually present for a contiguous period of time before disappearing; further, multiple factors could be present at a time. This paper develops new probabilistic methodology and inference methods for random object generation influenced by latent features exhibiting temporal persistence. Every datum is associated with subsets of a potentially infinite number of hidden, persistent features that account for temporal dynamics in an observation. The ensuing class of dynamic models constructed by adapting the Indian Buffet Process — a probability measure on the space of random, unbounded binary matrices — finds use in a variety of applications arising in operations, signal processing, biomedicine, marketing, image analysis, etc. Illustrations using synthetic and real data are provided.

多くのアプリケーションでは、観測データは潜在的な原因の組み合わせによって影響を受けます。たとえば、温度、湿度、電力消費、騒音レベルなどの応答を記録するために、建物内にセンサーが設置されているとします。これらのランダムな観測応答は通常、建物内の人の数、電気機器のオン/オフ、電力サージなど、多くの観測されない潜在的な要因(または特徴)の影響を受けます。これらの潜在的な要因は通常、消えるまで連続した期間存在しますが、複数の要因が同時に存在することもあります。この論文では、時間的持続性を示す潜在的な特徴の影響を受けるランダムなオブジェクト生成のための新しい確率的方法論と推論方法を開発します。すべてのデータは、観測における時間的ダイナミクスを説明する潜在的に無限の数の隠された永続的な特徴のサブセットに関連付けられています。ランダムで無制限のバイナリ行列の空間における確率測度であるインド・バフェット過程を適応させることによって構築された一連の動的モデルは、操作、信号処理、生物医学、マーケティング、画像分析などのさまざまなアプリケーションで使用されています。合成データと実際のデータを使用した例が提供されています。

Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes
複数の高次元データクラスからの逆共分散行列のターゲット融合リッジ推定

We consider the problem of jointly estimating multiple inverse covariance matrices from high-dimensional data consisting of distinct classes. An $\ell_2$-penalized maximum likelihood approach is employed. The suggested approach is flexible and generic, incorporating several other $\ell_2$-penalized estimators as special cases. In addition, the approach allows specification of target matrices through which prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the constituent classes are believed to chiefly share the same structure while potentially differing in a number of locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical interpretation of precision matrices with the proposed estimator then serving as a basis for integrative or meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by data sets and subtypes of diseases. The performance of the proposed estimator in the graphical modeling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by the differential network modeling of 12 large-scale gene expression data sets of diffuse large B-cell lymphoma subtypes. The estimator and its related procedures are incorporated into the R-package rags2ridges.

私たちは、異なるクラスからなる高次元データから複数の逆共分散行列を共同で推定する問題について検討します。$\ell_2$ペナルティ付き最尤法を採用します。提案されたアプローチは柔軟かつ汎用的であり、いくつかの他の$\ell_2$ペナルティ付き推定量を特殊なケースとして組み込む。さらに、このアプローチでは、事前知識を組み込むことができ、高次元設定での推定手順を安定化できるターゲット行列の指定が可能になります。その結果、構成クラスの精度行列が主に同じ構造を共有していると考えられるが、いくつかの関心領域で異なる可能性がある場合に使用できる、ターゲット融合リッジ推定量が得られます。これは、(多)因子研究設計で多くの用途があります。私たちは、精度行列のグラフィカルな解釈に焦点を当て、提案された推定量を統合的またはメタ分析的なガウスグラフィカルモデリングの基礎として利用します。クラスがデータセットと疾患のサブタイプによって定義される状況が考慮されます。グラフィカルモデリング設定における提案された推定器のパフォーマンスは、広範なシミュレーション実験によって評価されます。その実用的な有用性は、びまん性大細胞型B細胞リンパ腫サブタイプの12の大規模遺伝子発現データセットの差分ネットワークモデリングによって実証されます。推定器とそれに関連する手順は、Rパッケージrags2ridgesに組み込まれています。

Lower Bounds for Testing Graphical Models: Colorings and Antiferromagnetic Ising Models
グラフィカルモデルの検定の下限: 彩色と反強磁性イジングモデル

We study the identity testing problem in the context of spin systems or undirected graphical models, where it takes the following form: given the parameter specification of the model $M$ and a sampling oracle for the distribution $\mu_{M^*}$ of an unknown model $M^*$, can we efficiently determine if the two models $M$ and $M^*$ are the same? We consider identity testing for both soft-constraint and hard-constraint systems. In particular, we prove hardness results in two prototypical cases, the Ising model and proper colorings, and explore whether identity testing is any easier than structure learning. For the ferromagnetic (attractive) Ising model, Daskalakis et al. (2018) presented a polynomial-time algorithm for identity testing. We prove hardness results in the antiferromagnetic (repulsive) setting in the same regime of parameters where structure learning is known to require a super-polynomial number of samples. Specifically, for $n$-vertex graphs of maximum degree $d$, we prove that if $|\beta| d = \omega(\log{n})$ (where $\beta$ is the inverse temperature parameter), then there is no polynomial running time identity testing algorithm unless $RP=NP$. In the hard-constraint setting, we present hardness results for identity testing for proper colorings. Our results are based on the presumed hardness of #BIS, the problem of (approximately) counting independent sets in bipartite graphs.

私たちは、スピンシステムまたは無向グラフィカルモデルのコンテキストで同一性テスト問題を研究します。この問題は、次のような形式をとります。モデル$M$のパラメータ仕様と、未知のモデル$M^*$の分布$\mu_{M^*}$のサンプリングオラクルがある場合、2つのモデル$M$と$M^*$が同じかどうかを効率的に判断できますか?ソフト制約システムとハード制約システムの両方について、同一性テストを検討します。特に、2つのプロトタイプケースであるIsingモデルと適切な色付けで困難性が証明され、同一性テストが構造学習よりも簡単かどうかが検討されます。強磁性(吸引) Isingモデルの場合、Daskalakisら(2018)は、同一性テストの多項式時間アルゴリズムを提示しました。構造学習には超多項式の数のサンプルが必要であることが知られている同じパラメータ領域で、反強磁性(反発)設定で困難性が証明されます。具体的には、最大次数$d$の$n$頂点グラフについて、$|\beta| d = \omega(\log{n})$ (ここで$\beta$は逆温度パラメータ)の場合、$RP=NP$でない限り、多項式実行時間同一性テストアルゴリズムは存在しないことを証明します。ハード制約設定では、適切な色の同一性テストの困難性の結果を示します。私たちの結果は、2部グラフ内の独立集合を(近似的に)数える問題である#BISの想定される困難性に基づいています。

Distributed Feature Screening via Componentwise Debiasing
コンポーネントごとのバイアス除去による分散特徴スクリーニング

Feature screening is a powerful tool in processing high-dimensional data. When the sample size N and the number of features p are both large, the implementation of classic screening methods can be numerically challenging. In this paper, we propose a distributed screening framework for big data setup. In the spirit of ‘divide-and-conquer’, the proposed framework expresses a correlation measure as a function of several component parameters, each of which can be distributively estimated using a natural U-statistic from data segments. With the component estimates aggregated, we obtain a final correlation estimate that can be readily used for screening features. This framework enables distributed storage and parallel computing and thus is computationally attractive. Due to the unbiased distributive estimation of the component parameters, the final aggregated estimate achieves a high accuracy that is insensitive to the number of data segments m. Under mild conditions, we show that the aggregated correlation estimator is as efficient as the centralized estimator in terms of the probability convergence bound and the mean squared error rate; the corresponding screening procedure enjoys sure screening property for a wide range of correlation measures. The promising performances of the new method are supported by extensive numerical examples.

特徴スクリーニングは、高次元データの処理における強力なツールです。サンプルサイズNと特徴の数pが両方とも大きい場合、従来のスクリーニング方法の実装は数値的に困難になる可能性があります。この論文では、ビッグデータセットアップ用の分散スクリーニングフレームワークを提案します。「分割統治」の精神で、提案されたフレームワークは、相関尺度を複数のコンポーネントパラメータの関数として表現します。各コンポーネントパラメータは、データセグメントから自然なU統計量を使用して分散的に推定できます。コンポーネント推定値を集約すると、特徴のスクリーニングにすぐに使用できる最終的な相関推定値が得られます。このフレームワークは、分散ストレージと並列コンピューティングを可能にするため、計算上魅力的です。コンポーネントパラメータの偏りのない分散推定により、最終的な集約推定値は、データセグメントの数mに影響されない高い精度を実現します。穏やかな条件下では、集約された相関推定値は、確率収束境界と平均二乗誤差率の点で集中型推定値と同じくらい効率的であることを示します。対応するスクリーニング手順は、幅広い相関尺度に対して確実なスクリーニング特性を備えています。新しい方法の有望なパフォーマンスは、広範な数値例によって裏付けられています。
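
The divide-and-conquer idea in this abstract can be sketched with Pearson correlation as the (assumed) correlation measure: it is a function of five component parameters, $E[X]$, $E[Y]$, $E[X^2]$, $E[Y^2]$ and $E[XY]$, each of which is estimated on every data segment by a sample mean and then averaged across segments before plugging in. This is a simplified illustration, not the paper's general U-statistic framework; the data and function names are hypothetical, and equal segment sizes are assumed.

```python
import math

def moments(xs, ys):
    """Component parameters: E[X], E[Y], E[X^2], E[Y^2], E[XY] (sample means)."""
    n = len(xs)
    return (
        sum(xs) / n,
        sum(ys) / n,
        sum(x * x for x in xs) / n,
        sum(y * y for y in ys) / n,
        sum(x * y for x, y in zip(xs, ys)) / n,
    )

def correlation_from_moments(m):
    """Pearson correlation expressed as a function of the five components."""
    ex, ey, exx, eyy, exy = m
    return (exy - ex * ey) / math.sqrt((exx - ex * ex) * (eyy - ey * ey))

def aggregated_correlation(xs, ys, m_segments):
    """Estimate each component per segment, average the components, then plug in.
    Assumes len(xs) is divisible by m_segments."""
    size = len(xs) // m_segments
    parts = [
        moments(xs[k * size:(k + 1) * size], ys[k * size:(k + 1) * size])
        for k in range(m_segments)
    ]
    averaged = tuple(sum(p[j] for p in parts) / m_segments for j in range(5))
    return correlation_from_moments(averaged)

def naive_average(xs, ys, m_segments):
    """Baseline for contrast: average the per-segment correlations directly."""
    size = len(xs) // m_segments
    corrs = [
        correlation_from_moments(
            moments(xs[k * size:(k + 1) * size], ys[k * size:(k + 1) * size])
        )
        for k in range(m_segments)
    ]
    return sum(corrs) / m_segments

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
ys = [2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0, 12.0, 9.0, 10.0, 11.0]

centralized = correlation_from_moments(moments(xs, ys))
aggregated = aggregated_correlation(xs, ys, m_segments=3)
naive = naive_average(xs, ys, m_segments=3)
```

Here `aggregated` matches `centralized` up to rounding regardless of the number of segments, while `naive` does not, which illustrates why debiasing at the component level makes the result insensitive to the number of segments.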

GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing
GluonCVとGluonNLP:コンピュータビジョンと自然言語処理におけるディープラーニング

We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customization. Leveraging the MXNet ecosystem, the deep learning models in GluonCV and GluonNLP can be deployed onto a variety of platforms with different programming languages. The Apache 2.0 license has been adopted by GluonCV and GluonNLP to allow for software distribution, modification, and usage.

私たちは、Apache MXNet(インキュベーション)に基づくコンピュータービジョンと自然言語処理のためのディープラーニングツールキットであるGluonCVとGluonNLPを紹介します。これらのツールキットは、最先端の事前学習済みモデル、学習スクリプト、および学習ログを提供し、迅速なプロトタイピングを容易にし、再現性のある研究を促進します。また、効率的なカスタマイズを可能にする柔軟なビルディングブロックを備えたモジュラーAPIも提供しています。MXNetエコシステムを活用することで、GluonCVとGluonNLPのディープラーニングモデルは、さまざまなプログラミング言語でさまざまなプラットフォームにデプロイできます。Apache 2.0ライセンスは、ソフトウェアの配布、変更、および使用を可能にするために、GluonCVとGluonNLPによって採用されました。

A Unified Framework for Structured Graph Learning via Spectral Constraints
スペクトル制約による構造化グラフ学習のための統一フレームワーク

Graph learning from data is a canonical problem that has received substantial attention in the literature. Learning a structured graph is essential for interpretability and identification of the relationships among data. In general, learning a graph with a specific structure is an NP-hard combinatorial problem and thus designing a general tractable algorithm is challenging. Some useful structured graphs include connected, sparse, multi-component, bipartite, and regular graphs. In this paper, we introduce a unified framework for structured graph learning that combines Gaussian graphical models and spectral graph theory. We propose to convert combinatorial structural constraints into spectral constraints on graph matrices and develop an optimization framework based on block majorization-minimization to solve the structured graph learning problem. The proposed algorithms are provably convergent and practically amenable for a number of graph based applications such as data clustering. Extensive numerical experiments with both synthetic and real data sets illustrate the effectiveness of the proposed algorithms. An open source R package containing the code for all the experiments is available at https://CRAN.R-project.org/package=spectralGraphTopology.

データからのグラフ学習は、文献でかなりの注目を集めている標準的な問題です。構造化グラフの学習は、データ間の関係の解釈と識別に不可欠です。一般に、特定の構造を持つグラフの学習はNP困難な組み合わせ問題であるため、一般的で扱いやすいアルゴリズムの設計は困難です。有用な構造化グラフには、連結グラフ、スパースグラフ、多成分グラフ、二部グラフ、および正則グラフがあります。この論文では、ガウスグラフィカルモデルとスペクトルグラフ理論を組み合わせた構造化グラフ学習の統一フレームワークを紹介します。組み合わせ的な構造制約をグラフ行列のスペクトル制約に変換し、ブロックMM法(majorization-minimization)に基づく最適化フレームワークを開発して、構造化グラフ学習問題を解くことを提案します。提案されたアルゴリズムは収束が証明されており、データクラスタリングなどの多くのグラフベースのアプリケーションに実用上適用できます。合成データセットと実際のデータセットの両方を使用した広範な数値実験により、提案されたアルゴリズムの有効性が実証されています。すべての実験のコードを含むオープンソースのRパッケージは、https://CRAN.R-project.org/package=spectralGraphTopologyから入手できます。

Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems
方策最適化のための微分フリー法:線形二次システムの保証

We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. Our main theoretical result provides an explicit bound on the sample or evaluation complexity: we show that these methods are guaranteed to converge to within any pre-specified tolerance of the optimal policy with a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point reward feedback. Our theory is corroborated by simulations of derivative-free methods in application to these systems. Along the way, we derive convergence rates for stochastic zero-order optimization algorithms when applied to a certain class of non-convex problems.

私たちは、線形ポリシーのクラスに対するポリシー最適化のための導関数を使用しない方法を研究します。私たちは、線形二次システムに適用した場合のこれらの方法の収束率の特徴付けに焦点を当て、駆動ノイズと報酬フィードバックのさまざまな設定を研究します。我々の主な理論的結果は、サンプルまたは評価の複雑さの明示的な境界を提供します。すなわち、これらの方法は、問題の誤差許容度、次元、および曲率特性の明示的な多項式であるゼロ次評価の数で、最適ポリシーの事前に指定された許容範囲内に収束することが保証されることを示す。我々の分析は、加法駆動ノイズとランダム初期化の設定、および1点報酬フィードバックと2点報酬フィードバックの設定の間にいくつかの興味深い違いを明らかにした。我々の理論は、これらのシステムに適用された導関数を使用しない方法のシミュレーションによって裏付けられています。その過程で、私たちは、特定のクラスの非凸問題に適用した場合の確率的ゼロ次最適化アルゴリズムの収束率を導出します。
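
A minimal sketch of a derivative-free method with two-point feedback on the simplest possible linear-quadratic system: scalar state, scalar control, no driving noise, and a fixed initial state. The system constants, step size, smoothing radius, and function names are assumptions for illustration and are not taken from the paper. In one dimension the random direction is just a random sign, so the estimate reduces to a central difference.

```python
import math
import random

def lqr_cost(k, a=1.0, b=1.0, q=1.0, r=1.0, x0=1.0):
    """Infinite-horizon cost sum_t (q x_t^2 + r u_t^2) of the linear policy
    u_t = -k x_t for the noise-free system x_{t+1} = a x_t + b u_t."""
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        return math.inf   # the policy does not stabilize the system
    return (q + r * k * k) * x0 * x0 / (1.0 - closed_loop ** 2)

def two_point_descent(k0, step=0.1, radius=1e-3, iters=200, seed=0):
    """Zero-order descent: only cost evaluations are used, never derivatives."""
    rng = random.Random(seed)
    k = k0
    for _ in range(iters):
        u = rng.choice([-1.0, 1.0])   # random direction on the unit sphere in 1D
        grad_est = (lqr_cost(k + radius * u) - lqr_cost(k - radius * u)) / (2 * radius) * u
        k -= step * grad_est
    return k

k_zero_order = two_point_descent(0.5)   # start from a stabilizing gain

# Reference value from the scalar Riccati equation with a = b = q = r = 1:
# p = 1 + p - p^2 / (1 + p)  =>  p^2 = p + 1, and k* = p / (1 + p).
p = (1.0 + math.sqrt(5.0)) / 2.0
k_star = p / (1.0 + p)
```

With one-point feedback only a single cost evaluation per step would be available, which makes the gradient estimate much noisier; the abstract notes that the two feedback settings behave differently.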

Convergences of Regularized Algorithms and Stochastic Gradient Methods with Random Projections
ランダム射影による正則化アルゴリズムと確率的勾配法の収束

We study the least-squares regression problem over a Hilbert space, covering nonparametric regression over a reproducing kernel Hilbert space as a special case. We first investigate regularized algorithms adapted to a projection operator on a closed subspace of the Hilbert space. We prove convergence results with respect to variants of norms, under a capacity assumption on the hypothesis space and a regularity condition on the target function. As a result, we obtain optimal rates for regularized algorithms with randomized sketches, provided that the sketch dimension is proportional to the effective dimension up to a logarithmic factor. As a byproduct, we obtain similar results for Nyström regularized algorithms. Our results provide optimal, distribution-dependent rates that do not have any saturation effect for sketched/Nyström regularized algorithms, considering both the attainable and non-attainable cases, in the well-conditioned regimes. We then study stochastic gradient methods with projection over the subspace, allowing multi-pass over the data and minibatches, and we derive similar optimal statistical convergence results.

私たちは、ヒルベルト空間上の最小二乗回帰問題を研究し、再生核ヒルベルト空間上のノンパラメトリック回帰を特別なケースとして扱います。まず、ヒルベルト空間の閉部分空間への射影演算子に適応した正則化アルゴリズムを調べます。仮説空間に関する容量の仮定と目標関数に関する正則性条件の下で、さまざまなノルムに関する収束結果を証明します。その結果、スケッチ次元が対数因子を除いて有効次元に比例するという条件で、ランダム化スケッチを用いた正則化アルゴリズムの最適レートが得られます。副産物として、Nyström正則化アルゴリズムについても同様の結果が得られます。我々の結果は、良条件の領域において、達成可能なケースと達成不可能なケースの両方を考慮し、スケッチ/Nyström正則化アルゴリズムに対して飽和効果のない、分布に依存する最適レートを与えます。次に、部分空間への射影を伴う確率的勾配法を、データに対する複数回のパスとミニバッチを許容する設定で研究し、同様の最適な統計的収束結果を導出します。

High-Dimensional Interactions Detection with Sparse Principal Hessian Matrix
スパース主ヘッセ行列による高次元相互作用検出

In the statistical learning framework with regressions, interactions are the contributions to the response variable from the products of the explanatory variables. In high-dimensional problems, detecting interactions is challenging due to combinatorial complexity and limited data information. We consider detecting interactions by exploring their connections with the principal Hessian matrix. Specifically, we propose a one-step synthetic approach for estimating the principal Hessian matrix by a penalized M-estimator. An alternating direction method of multipliers (ADMM) is proposed to efficiently solve the encountered regularized optimization problem. Based on the sparse estimator, we detect the interactions by identifying its nonzero components. Our method directly targets the interactions, and it requires no structural assumption on the hierarchy of the interaction effects. We show that our estimator is theoretically valid, computationally efficient, and practically useful for detecting the interactions in a broad spectrum of scenarios.

回帰による統計学習フレームワークでは、相互作用は説明変数の積から応答変数への寄与です。高次元の問題では、組み合わせの複雑さとデータ情報の制限により、相互作用の検出は困難です。相互作用を主ヘッセ行列との関連を調べることで検出することを検討します。具体的には、ペナルティ付きM推定量によって主ヘッセ行列を推定するワンステップ合成アプローチを提案します。交互方向乗数法(ADMM)は、遭遇する正則化最適化問題を効率的に解決するために提案されます。スパース推定量に基づいて、その非ゼロ成分を識別することで相互作用を検出します。この方法は相互作用を直接対象としており、相互作用効果の階層に関する構造的仮定を必要としません。この推定量は理論的に有効で、計算効率が高く、幅広いシナリオで相互作用を検出するのに実用的であることを示します。

Connecting Spectral Clustering to Maximum Margins and Level Sets
スペクトルクラスタリングの最大マージンとレベルセットへの接続

We study the connections between spectral clustering and the problems of maximum margin clustering, and estimation of the components of level sets of a density function. Specifically, we obtain bounds on the eigenvectors of graph Laplacian matrices in terms of the between cluster separation, and within cluster connectivity. These bounds ensure that the spectral clustering solution converges to the maximum margin clustering solution as the scaling parameter is reduced towards zero. The sensitivity of maximum margin clustering solutions to outlying points is well known, but can be mitigated by first removing such outliers, and applying maximum margin clustering to the remaining points. If outliers are identified using an estimate of the underlying probability density, then the remaining points may be seen as an estimate of a level set of this density function. We show that such an approach can be used to consistently estimate the components of the level sets of a density function under very mild assumptions.

私たちは、スペクトルクラスタリングと最大マージンクラスタリングの問題、および密度関数のレベルセットのコンポーネントの推定との関係を調べます。具体的には、グラフラプラシアンマトリックスの固有ベクトルの境界を、クラスター間の分離とクラスター内の接続の観点から求めます。これらの境界により、スケーリングパラメーターが0に近づくにつれて、スペクトルクラスタリングソリューションが最大マージンクラスタリングソリューションに収束することが保証されます。最大マージンクラスタリングソリューションが外れ値に対して敏感であることはよく知られていますが、まずそのような外れ値を削除し、残りのポイントに最大マージンクラスタリングを適用することで、この影響を軽減できます。外れ値が基礎となる確率密度の推定値を使用して識別された場合、残りのポイントはこの密度関数のレベルセットの推定値と見なすことができます。このようなアプローチを使用して、非常に緩やかな仮定の下で、密度関数のレベルセットのコンポーネントを一貫して推定できることを示します。

Expectation Propagation as a Way of Life: A Framework for Bayesian Inference on Partitioned Data
生活様式としての期待伝播:分割データに対するベイズ推論のフレームワーク

A common divide-and-conquer approach for Bayesian computation with big data is to partition the data, perform local inference for each piece separately, and combine the results to obtain a global posterior approximation. While being conceptually and computationally appealing, this method involves the problematic need to also split the prior for the local inferences; these weakened priors may not provide enough regularization for each separate computation, thus eliminating one of the key advantages of Bayesian methods. To resolve this dilemma while still retaining the generalizability of the underlying local inference method, we apply the idea of expectation propagation (EP) as a framework for distributed Bayesian inference. The central idea is to iteratively update approximations to the local likelihoods given the state of the other approximations and the prior. The present paper has two roles: we review the steps that are needed to keep EP algorithms numerically stable, and we suggest a general approach, inspired by EP, for approaching data partitioning problems in a way that achieves the computational benefits of parallelism while allowing each local update to make use of relevant information from the other sites. In addition, we demonstrate how the method can be applied in a hierarchical context to make use of partitioning of both data and parameters. The paper describes a general algorithmic framework, rather than a specific algorithm, and presents an example implementation for it.

ビッグデータを使ったベイズ計算の一般的な分割統治法は、データを分割し、各部分に対して個別にローカル推論を実行し、その結果を組み合わせてグローバル事後近似値を得るというものです。概念的にも計算的にも魅力的である一方で、この方法ではローカル推論のために事前分布も分割する必要があるという問題があります。これらの弱められた事前分布では、各計算に対して十分な正則化が提供されない可能性があり、ベイズ法の主な利点の1つが失われます。このジレンマを解決し、基礎となるローカル推論法の一般化可能性を維持するため、分散ベイズ推論のフレームワークとして期待伝播(EP)の考え方を適用します。中心となる考え方は、他の近似値と事前分布の状態を前提として、ローカル尤度に対する近似値を反復的に更新することです。本論文の役割は2つあります。EPアルゴリズムを数値的に安定させるために必要な手順を確認することと、EPにヒントを得た、並列処理の計算上の利点を実現しながら、各ローカル更新で他のサイトの関連情報を利用できるような方法でデータ分割問題に取り組むための一般的なアプローチを提案することです。さらに、この方法を階層的なコンテキストに適用して、データとパラメータの両方の分割を活用する方法を示します。この論文では、特定のアルゴリズムではなく、一般的なアルゴリズムフレームワークについて説明し、その実装例を示します。

Practical Locally Private Heavy Hitters
実用的なローカルプライベートヘビーヒッター

We present new practical local differentially private heavy hitters algorithms achieving optimal or near-optimal worst-case error and running time — TreeHist and Bitstogram. In both algorithms, server running time is $\tilde O(n)$ and user running time is $\tilde O(1)$, hence improving on the prior state-of-the-art result of Bassily and Smith [STOC 2015] requiring $O(n^{5/2})$ server time and $O(n^{3/2})$ user time. With a typically large number of participants in local algorithms (in the millions), this reduction in time complexity, in particular at the user side, is crucial for making locally private heavy hitters algorithms usable in practice. We implemented Algorithm TreeHist to verify our theoretical analysis and compared its performance with the performance of Google’s RAPPOR code.

私たちは、最適または最適に近い最悪ケース誤差と実行時間を達成する、新しい実用的な局所差分プライベートのヘビーヒッターアルゴリズム、TreeHistとBitstogramを紹介します。どちらのアルゴリズムでも、サーバーの実行時間は$\tilde O(n)$、ユーザーの実行時間は$\tilde O(1)$であり、$O(n^{5/2})$のサーバー時間と$O(n^{3/2})$のユーザー時間を必要とするBassily and Smith [STOC 2015]の従来の最先端の結果を改善しています。ローカルアルゴリズムでは通常、参加者が非常に多い(数百万人規模)ため、特にユーザー側での時間計算量の削減は、局所プライベートなヘビーヒッターアルゴリズムを実用可能にするうえで重要です。私たちは、理論分析を検証するためにアルゴリズムTreeHistを実装し、その性能をGoogleのRAPPORコードの性能と比較しました。
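
As background for the local model of differential privacy used in this abstract, the sketch below shows binary randomized response, a standard building block in which each user perturbs their own bit before sending it. This is not TreeHist or Bitstogram; the function names and the single-bit setting are assumptions chosen for illustration.

```python
import math
import random


def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep_prob else 1 - bit


def estimate_count(reports, epsilon):
    """Debias the sum of noisy bits into an unbiased estimate of the true count."""
    n = len(reports)
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    # E[sum(reports)] = count * p + (n - count) * (1 - p); solve for count.
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)
```

Each user does constant work per report and the server does one linear pass, which matches the kind of per-user and per-server cost split the abstract emphasizes. The harder part of finding heavy hitters over a large domain is not covered by this sketch.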

Perturbation Bounds for Procrustes, Classical Scaling, and Trilateration, with Applications to Manifold Learning
プロクラステス,古典的スケーリング,三辺測量のための摂動限界と多様体学習への応用

One of the common tasks in unsupervised learning is dimensionality reduction, where the goal is to find meaningful low-dimensional structures hidden in high-dimensional data. Sometimes referred to as manifold learning, this problem is closely related to the problem of localization, which aims at embedding a weighted graph into a low-dimensional Euclidean space. Several methods have been proposed for localization, and also manifold learning. Nonetheless, the robustness property of most of them is little understood. In this paper, we obtain perturbation bounds for classical scaling and trilateration, which are then applied to derive performance bounds for Isomap, Landmark Isomap, and Maximum Variance Unfolding. A new perturbation bound for procrustes analysis plays a key role.

教師なし学習の一般的なタスクの1つは次元削減であり、高次元データに隠された意味のある低次元構造を見つけることを目標としています。多様体学習と呼ばれることもあるこの問題は、重み付きグラフを低次元のユークリッド空間に埋め込むことを目的としたローカライゼーションの問題と密接に関連しています。ローカライゼーションと多様体学習のために、いくつかの方法が提案されています。それにもかかわらず、それらのほとんどについて、ロバスト性はほとんど理解されていません。この論文では、古典的スケーリングと三辺測量の摂動限界を求め、それを適用してIsomap、Landmark Isomap、およびMaximum Variance Unfoldingの性能限界を導出します。プロクラステス解析に対する新しい摂動限界が重要な役割を果たします。

On lp-Support Vector Machines and Multidimensional Kernels
lpサポートベクターマシンと多次元カーネルについて

In this paper, we extend the methodology developed for Support Vector Machines (SVM) using the $\ell_2$-norm ($\ell_2$-SVM) to the more general case of $\ell_p$-norms with $p>1$ ($\ell_p$-SVM). We derive second order cone formulations for the resulting dual and primal problems. The concept of kernel function, widely applied in $\ell_2$-SVM, is extended to the more general case of $\ell_p$-norms with $p>1$ by defining a new operator called multidimensional kernel. This object gives rise to reformulations of dual problems, in a transformed space of the original data, where the dependence on the original data always appears as homogeneous polynomials. We adapt known solution algorithms to efficiently solve the resulting primal and dual problems, and we present computational experiments on real-world datasets that show rather good behavior in terms of the accuracy of $\ell_p$-SVM with $p>1$.

この論文では、$\ell_2$ノルムを用いたサポートベクターマシン(SVM)($\ell_2$-SVM)のために開発された方法論を、$p>1$の$\ell_p$ノルムというより一般的なケース($\ell_p$-SVM)に拡張します。結果として生じる双対問題と主問題に対して、2次錐定式化を導出します。$\ell_2$-SVMで広く適用されているカーネル関数の概念は、多次元カーネルと呼ばれる新しい演算子を定義することにより、$p>1$の$\ell_p$ノルムというより一般的なケースに拡張されます。このオブジェクトにより、元のデータを変換した空間において双対問題の再定式化が得られ、そこでは元のデータへの依存性が常に斉次多項式として現れます。私たちは、得られた主問題と双対問題を効率的に解くために既知の解法アルゴリズムを適応させ、実世界のデータセットでの計算実験をいくつか提示して、$p>1$の$\ell_p$-SVMが精度の点でかなり良い振る舞いを示すことを確認します。

Generalized probabilistic principal component analysis of correlated data
相関データの一般化確率論的主成分分析

Principal component analysis (PCA) is a well-established tool in machine learning and data processing. The principal axes in PCA were shown to be equivalent to the maximum marginal likelihood estimator of the factor loading matrix in a latent factor model for the observed data, assuming that the latent factors are independently distributed as standard normal distributions. However, the independence assumption may be unrealistic for many scenarios such as modeling multiple time series, spatial processes, and functional data, where the outcomes are correlated. In this paper, we introduce the generalized probabilistic principal component analysis (GPPCA) to study the latent factor model for multiple correlated outcomes, where each factor is modeled by a Gaussian process. Our method generalizes the previous probabilistic formulation of PCA (PPCA) by providing the closed-form maximum marginal likelihood estimator of the factor loadings and other parameters. Based on the explicit expression of the precision matrix in the marginal likelihood that we derived, the number of the computational operations is linear to the number of output variables. Furthermore, we also provide the closed-form expression of the marginal likelihood when other covariates are included in the mean structure. We highlight the advantage of GPPCA in terms of the practical relevance, estimation accuracy and computational convenience. Numerical studies of simulated and real data confirm the excellent finite-sample performance of the proposed approach.

主成分分析(PCA)は、機械学習とデータ処理において確立されたツールです。潜在因子が標準正規分布として独立に分布していると仮定すると、PCAの主軸は、観測データの潜在因子モデルにおける因子負荷行列の最大周辺尤度推定量と同等であることが示されました。しかし、独立性の仮定は、結果が相関している複数の時系列、空間プロセス、関数データのモデリングなど、多くのシナリオでは非現実的である可能性があります。この論文では、各因子がガウス過程によってモデル化されている複数の相関結果の潜在因子モデルを研究するために、一般化確率主成分分析(GPPCA)を紹介します。私たちの方法は、因子負荷やその他のパラメータの閉形式の最大周辺尤度推定量を提供することで、PCAの以前の確率的定式化(PPCA)を一般化します。私たちが導出した周辺尤度の精度行列の明示的な表現に基づいて、計算操作の数は出力変数の数に対して線形になります。さらに、他の共変量が平均構造に含まれている場合の周辺尤度の閉形式の表現も提供します。実用的な関連性、推定精度、計算の利便性の点で、GPPCAの利点を強調します。シミュレーションデータと実際のデータの数値研究により、提案されたアプローチの優れた有限サンプルパフォーマンスが確認されます。

Neyman-Pearson classification: parametrics and sample size requirement
ネイマン・ピアソン分類: パラメトリックとサンプルサイズの要件

The Neyman-Pearson (NP) paradigm in binary classification seeks classifiers that achieve a minimal type II error while enforcing the prioritized type I error controlled under some user-specified level $\alpha$. This paradigm serves naturally in applications such as severe disease diagnosis and spam detection, where people have clear priorities among the two error types. Recently, Tong, Feng, and Li (2018) proposed a nonparametric umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression, support vector machines, random forest) to respect the given type I error (i.e., conditional probability of classifying a class $0$ observation as class $1$ under the 0-1 coding) upper bound $\alpha$ with high probability, without specific distributional assumptions on the features and the responses. Universal as the umbrella algorithm is, it demands an explicit minimum sample size requirement on class $0$, which is often the more scarce class, such as in rare disease diagnosis applications. In this work, we employ the parametric linear discriminant analysis (LDA) model and propose a new parametric thresholding algorithm, which does not need the minimum sample size requirements on class $0$ observations and thus is suitable for small sample applications such as rare disease diagnosis. Leveraging both the existing nonparametric and the newly proposed parametric thresholding rules, we propose four LDA-based NP classifiers, for both low- and high-dimensional settings. On the theoretical front, we prove NP oracle inequalities for one proposed classifier, where the rate for excess type II error benefits from the explicit parametric model assumption. Furthermore, as NP classifiers involve a sample splitting step of class $0$ observations, we construct a new adaptive sample splitting scheme that can be applied universally to NP classifiers, and this adaptive strategy reduces the type II error of these classifiers. The proposed NP classifiers are implemented in the R package nproc.

バイナリ分類におけるネイマン-ピアソン(NP)パラダイムは、ユーザー指定のレベル$\alpha$の下で優先順位付けされたタイプIエラーを強制しながら、タイプIIエラーを最小限に抑える分類器を求めます。このパラダイムは、2つのエラータイプ間で明確な優先順位がある重篤な疾患の診断やスパム検出などのアプリケーションで自然に機能します。最近、Tong、Feng、およびLi (2018)は、特徴と応答に関する特定の分布仮定なしに、与えられたタイプIエラー(つまり、0-1コーディングの下でクラス$0$の観測をクラス$1$として分類する条件付き確率)の上限$\alpha$を高い確率で尊重するように、すべてのスコアリング型分類方法(ロジスティック回帰、サポートベクターマシン、ランダムフォレストなど)を適応させるノンパラメトリックアンブレラアルゴリズムを提案しました。アンブレラアルゴリズムは普遍的ですが、希少疾患の診断アプリケーションなどではより希少なクラスであることが多いクラス$0$に対して明示的な最小サンプルサイズ要件を要求します。この研究では、パラメトリック線形判別分析(LDA)モデルを採用し、クラス$0$の観測値に対する最小サンプルサイズ要件を必要としない新しいパラメトリックしきい値アルゴリズムを提案します。そのため、希少疾患の診断などの小規模サンプルのアプリケーションに適しています。既存のノンパラメトリックしきい値ルールと新しく提案されたパラメトリックしきい値ルールの両方を活用して、低次元設定と高次元設定の両方に対して、4つのLDAベースのNP分類器を提案します。理論面では、提案された分類器の1つに対してNPオラクル不等式を証明します。この場合、過剰なタイプIIエラーの率は、明示的なパラメトリックモデル仮定の恩恵を受けます。さらに、NP分類器にはクラス$0$の観測値のサンプル分割ステップが含まれるため、NP分類器に普遍的に適用できる新しい適応型サンプル分割スキームを構築します。この適応型戦略により、これらの分類器のタイプIIエラーが削減されます。提案されたNP分類器は、Rパッケージnprocに実装されています。
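
The minimum sample size requirement of the nonparametric umbrella algorithm can be made concrete with a short sketch. It assumes the order-statistic rule takes the usual binomial-tail form: thresholding at the k-th smallest of n held-out class $0$ scores violates the type I error level $\alpha$ with probability at most $\sum_{j=k}^{n}\binom{n}{j}(1-\alpha)^j\alpha^{n-j}$. That form, the tolerance name `delta`, and the function names are assumptions made here for illustration; this is not the code of the nproc package.

```python
import math


def violation_bound(k, n, alpha):
    """Binomial-tail bound on P(type I error > alpha) when thresholding at the
    k-th smallest of n held-out class-0 scores (k is 1-indexed)."""
    return sum(math.comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
               for j in range(k, n + 1))


def choose_rank(n, alpha, delta):
    """Smallest rank k whose violation bound is at most delta, or None."""
    for k in range(1, n + 1):
        if violation_bound(k, n, alpha) <= delta:
            return k
    return None


def min_class0_size(alpha, delta):
    """Smallest n for which even the largest score (k = n) meets the bound,
    i.e. (1 - alpha) ** n <= delta."""
    return math.ceil(math.log(delta) / math.log(1 - alpha))
```

Under these assumptions, `min_class0_size(0.05, 0.05)` evaluates to 59, so with fewer held-out class $0$ observations `choose_rank` returns `None`. This is the small-sample gap that a parametric thresholding rule is meant to close.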

Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information
サイド情報を持つ異種確率的ブロックモデルのための重み付けメッセージパッシングと最小エネルギーフロー

We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

私たちは、ノイズの多いラベル情報または部分的なラベル情報を持つ一般的な異種確率的ブロックモデル(SBM)におけるコミュニティ検出の誤分類誤差を研究します。誤分類率とSBMの局所近傍における最小エネルギーの概念との間に関連性を確立します。私たちは、最小エネルギーフローと、あるマルコフ遷移行列の固有ベクトルに基づいてSBMのラベルを再構築する、最適に重み付けされたメッセージパッシングアルゴリズムを開発します。この論文で考慮される一般的なSBMは、サイズの異なるコミュニティ、次数の不均一性、およびブロック間の異なる接続確率を許容します。私たちは、誤分類を減らすために、メッセージパッシングに最適な重み付けを行う方法に焦点を当てています。

Online Sufficient Dimension Reduction Through Sliced Inverse Regression
スライス逆回帰によるオンラインの十分な次元削減

Sliced inverse regression is an effective paradigm that achieves the goal of dimension reduction through replacing high dimensional covariates with a small number of linear combinations. It does not impose parametric assumptions on the dependence structure. More importantly, such a reduction of dimension is sufficient in that it does not cause loss of information. In this paper, we adapt the stationary sliced inverse regression to cope with the rapidly changing environments. We propose to implement sliced inverse regression in an online fashion. This online learner consists of two steps. In the first step we construct an online estimate for the kernel matrix; in the second step we propose two online algorithms, one is motivated by the perturbation method and the other is originated from the gradient descent optimization, to perform online singular value decomposition. The theoretical properties of this online learner are established. We demonstrate the numerical performance of this online learner through simulations and real world applications. All numerical studies confirm that this online learner performs as well as the batch learner.

スライス逆回帰は、高次元の共変量を少数の線形結合に置き換えることで次元削減の目標を達成する効果的なパラダイムです。依存構造にパラメトリック仮定を課しません。さらに重要なことは、このような次元削減は、情報の損失を引き起こさないという点で十分であるということです。この論文では、急速に変化する環境に対処するために、定常スライス逆回帰を適応させます。スライス逆回帰をオンラインで実装することを提案します。このオンライン学習者は2つのステップで構成されます。最初のステップでは、カーネルマトリックスのオンライン推定を構築します。2番目のステップでは、オンライン特異値分解を実行するために、摂動法に基づくものと勾配降下最適化に基づくものの2つのオンラインアルゴリズムを提案します。このオンライン学習者の理論的特性が確立されています。シミュレーションと実際のアプリケーションを通じて、このオンライン学習者の数値パフォーマンスを実証します。すべての数値研究により、このオンライン学習者がバッチ学習者と同様に機能することが確認されています。

On Mahalanobis Distance in Functional Settings
関数データの設定におけるマハラノビス距離について

Mahalanobis distance is a classical tool in multivariate analysis. We suggest here an extension of this concept to the case of functional data. More precisely, the proposed definition concerns those statistical problems where the sample data are real functions defined on a compact interval of the real line. The obvious difficulty for such a functional extension is the non-invertibility of the covariance operator in infinite-dimensional cases. Unlike other recent proposals, our definition is suggested and motivated in terms of the Reproducing Kernel Hilbert Space (RKHS) associated with the stochastic process that generates the data. The proposed distance is a true metric; it depends on a unique real smoothing parameter which is fully motivated in RKHS terms. Moreover, it shares some properties of its finite dimensional counterpart: it is invariant under isometries, it can be consistently estimated from the data and its sampling distribution is known under Gaussian models. An empirical study for two statistical applications, outliers detection and binary classification, is included. The results are quite competitive when compared to other recent proposals in the literature.

マハラノビス距離は、多変量解析の古典的なツールです。ここでは、この概念を関数データの場合に拡張することを提案します。より正確には、提案された定義は、サンプルデータが実数直線のコンパクトな区間で定義された実関数である統計的問題に関係します。このような関数拡張の明らかな難しさは、無限次元の場合の共分散演算子の非可逆性です。他の最近の提案とは異なり、私たちの定義は、データを生成する確率過程に関連する再生カーネルヒルベルト空間(RKHS)の観点から提案され、説明されています。提案された距離は真のメトリックです。これは、RKHSの観点から完全に説明されている唯一の実スムージングパラメーターに依存します。さらに、有限次元の対応物といくつかの特性を共有しています。等長変換に対して不変であり、データから一貫して推定でき、そのサンプリング分布はガウスモデルの下で既知です。外れ値検出とバイナリ分類という2つの統計アプリケーションに関する実証研究が含まれています。結果は、文献の他の最近の提案と比較して非常に競争力があります。

DESlib: A Dynamic ensemble selection library in Python
DESlib: Python の動的アンサンブル選択ライブラリ

DESlib is an open-source Python library providing the implementation of several dynamic selection techniques. The library is divided into three modules: (i) dcs, containing the implementation of dynamic classifier selection methods (DCS); (ii) des, containing the implementation of dynamic ensemble selection methods (DES); (iii) static, with the implementation of static ensemble techniques. The library is fully documented (documentation available online on Read the Docs), has a high test coverage (codecov.io) and is part of the scikit-learn-contrib supported projects. Documentation, code and examples can be found on its GitHub page: https://github.com/scikit-learn-contrib/DESlib.

DESlibは、いくつかの動的選択手法の実装を提供するオープンソースのPythonライブラリです。ライブラリは3つのモジュールに分かれています:(i)dcs、動的分類器選択法(DCS)の実装を含む。(ii)des、動的アンサンブル選択法(DES)の実装を含む。(iii)static、静的アンサンブル手法の実装を含む。ライブラリは完全に文書化されており(ドキュメントはRead the Docsでオンラインで入手できます)、高いテストカバレッジ(codecov.io)を持ち、scikit-learn-contribがサポートするプロジェクトの一部です。ドキュメント、コード、および例は、GitHubページ(https://github.com/scikit-learn-contrib/DESlib)にあります。

Target Propagation in Recurrent Neural Networks
リカレントニューラルネットワークにおけるターゲット伝播

Recurrent Neural Networks have been widely used to process sequence data, but have long been criticized for their biological implausibility and training difficulties related to vanishing and exploding gradients. This paper presents a novel algorithm for training recurrent networks, target propagation through time (TPTT), that outperforms standard backpropagation through time (BPTT) on four out of the five problems used for testing. The proposed algorithm is initially tested and compared to BPTT on four synthetic time lag tasks, and its performance is also measured using the sequential MNIST data set. In addition, as TPTT uses target propagation, it allows for discrete nonlinearities and could potentially mitigate the credit assignment problem in more complex recurrent architectures.

リカレントニューラルネットワークは、シーケンスデータの処理に広く使用されていますが、生物学的妥当性の低さや、勾配の消失・爆発に関連する訓練の難しさについて、長い間批判されてきました。この論文では、リカレントネットワークを訓練するための新しいアルゴリズムであるTPTT(target propagation through time)を紹介します。このアルゴリズムは、テストに使用された5つの問題のうち4つで、標準的なBPTT(backpropagation through time)を上回っています。提案されたアルゴリズムは、まず4つの合成タイムラグタスクでテストされてBPTTと比較され、さらにシーケンシャルMNISTデータセットでも性能が測定されます。さらに、TPTTはターゲット伝播を使用するため、離散的な非線形性が許容され、より複雑なリカレントアーキテクチャでの信用割り当て問題を軽減できる可能性があります。

Path-Based Spectral Clustering: Guarantees, Robustness to Outliers, and Fast Algorithms
パスベースのスペクトルクラスタリング:保証、外れ値に対するロバスト性、高速アルゴリズム

We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points.

私たちは、細長く不規則な形状のクラスターに対して有益な、longest-leg path distance(LLPD)メトリックによるクラスタリングの問題を検討します。多数の高次元外れ値が存在する状況で、高次元空間内の複数の本質的に低次元のクラスターからランダムサンプルが抽出された場合について、このメトリックに関するクラスタリングの性能に対する有限サンプル保証を証明します。これらの結果をLLPDに関するスペクトルクラスタリングと組み合わせることで、ラプラシアン固有ギャップ統計量が幅広いクラスのデータセットに対してクラスター数を正しく決定する条件を与え、提案アルゴリズムのラベル付け精度に対する保証を証明します。我々の方法は非常に一般的であり、任意の超距離(ultrametric)によるスペクトルクラスタリングに性能保証を与えます。また、隣接グラフのマルチスケール分析に基づく、LLPDの効率的で実装しやすい近似アルゴリズムを紹介します。これにより、LLPDスペクトルクラスタリングの実行時間がデータポイントの数に対して準線形になります。
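
The longest-leg path distance between two points is the smallest possible value, over all paths through the data set, of the longest single hop on the path. The sketch below computes it exactly for a small point set with a minimax variant of the Floyd-Warshall recursion. It is a brute-force cubic-time illustration under the assumption of Euclidean hop lengths on the complete graph, not the quasilinear approximation algorithm proposed in the paper; the function name is hypothetical.

```python
import math


def llpd_matrix(points):
    """All-pairs longest-leg path distance via a minimax Floyd-Warshall pass."""
    n = len(points)
    # Start from direct Euclidean hop lengths on the complete graph.
    d = [[math.dist(p, q) for q in points] for p in points]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                # Routing through m costs the longer of the two sub-path legs.
                via = max(d[i][m], d[m][j])
                if via < d[i][j]:
                    d[i][j] = via
    return d
```

For the one-dimensional points 0, 1, 2, and 10, the distance between 0 and 2 is 1 (two unit hops) and the distance between 0 and 10 is 8 (the gap that every path must cross). This is why the metric keeps elongated clusters tight while separating well-isolated groups.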

Lower Bounds for Parallel and Randomized Convex Optimization
並列およびランダム化凸最適化の下限

We study the question of whether parallelization in the exploration of the feasible set can be used to speed up convex optimization, in the local oracle model of computation and in the high-dimensional regime. We show that the answer is negative for both deterministic and randomized algorithms applied to essentially any of the interesting geometries and nonsmooth, weakly-smooth, or smooth objective functions. In particular, we show that it is not possible to obtain a polylogarithmic (in the sequential complexity of the problem) number of parallel rounds with a polynomial (in the dimension) number of queries per round. In the majority of these settings and when the dimension of the space is polynomial in the inverse target accuracy, our lower bounds match the oracle complexity of sequential convex optimization, up to at most a logarithmic factor in the dimension, which makes them (nearly) tight. Another conceptual contribution of our work is in providing a general and streamlined framework for proving lower bounds in the setting of parallel convex optimization. Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean ($\ell_2$) and $\ell_\infty$ spaces.

私たちは、実行可能セットの探索における並列化が、計算のローカルオラクルモデルおよび高次元領域において、凸最適化を高速化するために使用できるかどうかという問題を研究します。私たちは、本質的にあらゆる興味深いジオメトリおよび非平滑、弱平滑、または平滑な目的関数に適用された決定論的アルゴリズムとランダム化アルゴリズムの両方について、答えが否定的であることを示す。特に、ラウンドあたりのクエリ数が多項式（次元）である状態で、並列ラウンド数が多対数（問題の逐次複雑度）になることはできないことを示す。これらの設定の大部分において、および空間の次元が逆ターゲット精度の多項式である場合、我々の下限は、最大で次元の対数係数まで、逐次凸最適化のオラクル複雑度と一致し、それによって（ほぼ）タイトになります。我々の研究のもう1つの概念的貢献は、並列凸最適化の設定で下限を証明するための一般的で合理化されたフレームワークを提供することです。私たちの研究以前は、並列凸最適化アルゴリズムの下限値は、主にユークリッド空間($\ell_2$)と$\ell_\infty$に適用され、この論文で検討されている設定のごく一部でのみ知られていました。

Universal Latent Space Model Fitting for Large Networks with Edge Covariates
エッジ共変量を持つ大規模ネットワークのためのユニバーサル潜在空間モデルのフィッティング

Latent space models are effective tools for statistical modeling and visualization of network data. Due to their close connection to generalized linear models, it is also natural to incorporate covariate information in them. The current paper presents two universal fitting algorithms for networks with edge covariates: one based on nuclear norm penalization and the other based on projected gradient descent. Both algorithms are motivated by maximizing the likelihood function for an existing class of inner-product models, and we establish their statistical rates of convergence for these models. In addition, the theory informs us that both methods work simultaneously for a wide range of different latent space models that allow latent positions to affect edge formation in flexible ways, such as distance models. Furthermore, the effectiveness of the methods is demonstrated on a number of real world network data sets for different statistical tasks, including community detection with and without edge covariates, and network assisted learning.

潜在空間モデルは、ネットワークデータの統計的モデリングと視覚化に効果的なツールです。一般化線型モデルと密接な関係があるため、共変量情報を組み込むことも自然です。この論文では、エッジ共変量を持つネットワークの2つのユニバーサルフィッティングアルゴリズムを紹介します。1つは核ノルムペナルティに基づき、もう1つは投影勾配降下に基づきます。両方のアルゴリズムは、既存の内積モデルのクラスの尤度関数を最大化することを目的としたものなので、これらのモデルに対する統計的収束率を確立します。さらに、理論によれば、両方の方法は、距離モデルなど、潜在位置がエッジ形成に柔軟に影響を与えることを可能にするさまざまな潜在空間モデルで同時に機能します。さらに、これらの方法の有効性は、エッジ共変量の有無にかかわらずコミュニティ検出やネットワーク支援学習など、さまざまな統計タスクの実際のネットワークデータセットで実証されています。

A Model of Fake Data in Data-driven Analysis
データ駆動型分析におけるフェイクデータのモデル

Data-driven analysis has been increasingly used in various decision making processes. With more sources, including reviews, news, and pictures, now available for data analysis, the authenticity of data sources is in doubt. While previous literature attempted to detect fake data piece by piece, in the current work, we try to capture the fake data sender’s strategic behavior to detect the fake data source. Specifically, we model the tension between a data receiver who makes data-driven decisions and a fake data sender who benefits from misleading the receiver. We propose a potentially infinite horizon continuous time game-theoretic model with asymmetric information to capture the fact that the receiver does not initially know the existence of fake data and learns about it during the course of the game. We use point processes to model the data traffic, where each piece of data can occur at any discrete moment in a continuous time flow. We fully solve the model and employ numerical examples to illustrate the players’ strategies and payoffs for insights. Specifically, our results show that maintaining some suspicion about the data sources and understanding that the sender can be strategic are very helpful to the data receiver. In addition, based on our model, we propose a methodology of detecting fake data that is complementary to the previous studies on this topic, which suggested various approaches on analyzing the data piece by piece. We show that after analyzing each piece of data, understanding a source by looking at its whole history of pushing data can be helpful.

データ駆動型分析は、さまざまな意思決定プロセスでますます使用されています。レビュー、ニュース、写真など、より多くのソースがデータ分析に使用できるようになったため、データソースの信頼性が疑われています。以前の文献では偽のデータを1つずつ検出しようとしましたが、現在の研究では、偽のデータ送信者の戦略的行動を捉えて偽のデータソースを検出しようとします。具体的には、データ駆動型の決定を行うデータ受信者と、受信者を誤解させることで利益を得る偽のデータ送信者との間の緊張をモデル化します。受信者が最初は偽のデータの存在を知らず、ゲームの過程でそれを知るという事実を捉えるために、非対称情報を持つ、潜在的に無限期間の連続時間ゲーム理論モデルを提案します。点過程を使用してデータトラフィックをモデル化します。ここで、各データは連続時間フローの任意の離散的な瞬間に発生する可能性があります。モデルを完全に解き、数値例を使用してプレーヤーの戦略と利得を示し、洞察を得ます。具体的には、データソースについてある程度の疑いを持ち、送信者が戦略的である可能性があることを理解することが、データ受信者にとって非常に役立つことが、私たちの研究結果から明らかになりました。さらに、私たちのモデルに基づいて、偽データを検出する方法論を提案します。これは、データを1つずつ分析するさまざまなアプローチを提案してきた、このトピックに関する以前の研究を補完するものです。各データを分析した後に、データをプッシュしてきた履歴全体を見てソースを理解することが役立つことを示します。

A Statistical Learning Approach to Modal Regression
モーダル回帰への統計的学習アプローチ

This paper studies the nonparametric modal regression problem systematically from a statistical learning viewpoint. Originally motivated by pursuing a theoretical understanding of the maximum correntropy criterion based regression (MCCR), our study reveals that MCCR with a tending-to-zero scale parameter is essentially modal regression. We show that the nonparametric modal regression problem can be approached via the classical empirical risk minimization. Some efforts are then made to develop a framework for analyzing and implementing modal regression. For instance, the modal regression function is described, the modal regression risk is defined explicitly and its Bayes rule is characterized; for the sake of computational tractability, the surrogate modal regression risk, which is termed as the generalization risk in our study, is introduced. On the theoretical side, the excess modal regression risk, the excess generalization risk, the function estimation error, and the relations among the above three quantities are studied rigorously. It turns out that under mild conditions, function estimation consistency and convergence may be pursued in modal regression as in vanilla regression protocols such as mean regression, median regression, and quantile regression. On the practical side, the implementation issues of modal regression including the computational algorithm and the selection of the tuning parameters are discussed. Numerical validations on modal regression are also conducted to verify our findings.

この論文では、統計的学習の観点からノンパラメトリックなモーダル回帰問題を体系的に研究します。当初の動機は、最大コレントロピー基準に基づく回帰(MCCR)を理論的に理解することでしたが、本研究により、スケールパラメータをゼロに近づけたMCCRは本質的にモーダル回帰であることが明らかになります。ノンパラメトリックなモーダル回帰問題には、古典的な経験的リスク最小化によってアプローチできることを示します。次に、モーダル回帰を分析および実装するためのフレームワークの構築に取り組みます。たとえば、モーダル回帰関数を記述し、モーダル回帰リスクを明示的に定義し、そのベイズ規則を特徴付けます。また、計算上の扱いやすさのために、代理モーダル回帰リスク(本研究では汎化リスクと呼びます)を導入します。理論面では、過剰モーダル回帰リスク、過剰汎化リスク、関数推定誤差、およびこれら3つの量の間の関係を厳密に調べます。その結果、穏やかな条件の下で、平均回帰、中央値回帰、分位点回帰といった標準的な回帰プロトコルと同様に、モーダル回帰でも関数推定の一致性と収束性を追求できることがわかります。実用面では、計算アルゴリズムやチューニングパラメータの選択など、モーダル回帰の実装上の問題について議論します。また、得られた知見を検証するために、モーダル回帰の数値検証も行います。
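
To make the link between MCCR and modal regression concrete, here is a minimal sketch; it is not the paper's implementation. It assumes a linear model, a Gaussian kernel with scale `sigma`, and a fixed-point (iteratively reweighted least squares) update, which is one common way to maximize a correntropy objective. The function names, the synthetic skewed-noise data, and the choice `sigma=0.5` are all illustrative assumptions.

```python
import numpy as np


def fit_least_squares(X, y, weights=None):
    """Weighted least squares for a linear model with design matrix X."""
    if weights is None:
        weights = np.ones(len(y))
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef


def fit_mccr_linear(X, y, sigma, n_iter=100, tol=1e-10):
    """Maximize sum_i exp(-(y_i - x_i^T b)^2 / (2 sigma^2)) over b.

    Setting the gradient to zero gives weighted normal equations whose
    weights depend on the current residuals, so we iterate to a fixed point.
    """
    coef = fit_least_squares(X, y)  # start from the mean-regression fit
    for _ in range(n_iter):
        residuals = y - X @ coef
        weights = np.exp(-residuals**2 / (2.0 * sigma**2))
        new_coef = fit_least_squares(X, y, weights)
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    return coef


def make_skewed_data(n=2000, seed=0):
    """Hypothetical data: conditional mode is 2x + 1, mean is shifted upward."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    is_minority = rng.random(n) < 0.2
    noise = np.where(
        is_minority,
        rng.normal(5.0, 0.5, size=n),  # minority cluster far above the line
        rng.normal(0.0, 0.1, size=n),  # majority cluster around the mode
    )
    y = 2.0 * x + 1.0 + noise
    X = np.column_stack([np.ones(n), x])  # columns: intercept, slope
    return X, y


if __name__ == "__main__":
    X, y = make_skewed_data()
    print("least squares [intercept, slope]:", fit_least_squares(X, y))
    print("MCCR          [intercept, slope]:", fit_mccr_linear(X, y, sigma=0.5))
```

On data like this, where a minority cluster pulls the conditional mean upward, the least-squares intercept drifts toward the mean, whereas the small-scale MCCR fit stays close to the conditional mode `2x + 1`. A larger `sigma` makes the weights nearly uniform and moves the fit back toward least squares.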

A Low Complexity Algorithm with O(√T) Regret and O(1) Constraint Violations for Online Convex Optimization with Long Term Constraints
長期制約付きオンライン凸最適化のための、O(√T)リグレットとO(1)制約違反を達成する低複雑性アルゴリズム

This paper considers online convex optimization over a complicated constraint set, which typically consists of multiple functional constraints and a set constraint. The conventional online projection algorithm (Zinkevich, 2003) can be difficult to implement due to the potentially high computation complexity of the projection operation. In this paper, we relax the functional constraints by allowing them to be violated at each round but still requiring them to be satisfied in the long term. This type of relaxed online convex optimization (with long term constraints) was first considered in Mahdavi et al. (2012). That prior work proposes an algorithm to achieve $O(\sqrt{T})$ regret and $O(T^{3/4})$ constraint violations for general problems and another algorithm to achieve an $O(T^{2/3})$ bound for both regret and constraint violations when the constraint set can be described by a finite number of linear constraints. A recent extension in Jenatton et al. (2016) can achieve $O(T^{\max\{\theta,1-\theta\}})$ regret and $O(T^{1-\theta/2})$ constraint violations where $\theta\in (0,1)$. The current paper proposes a new simple algorithm that yields improved performance in comparison to prior works. The new algorithm achieves an $O(\sqrt{T})$ regret bound with $O(1)$ constraint violations.

この論文では、複雑な制約集合上でのオンライン凸最適化を検討します。この制約集合は通常、複数の関数制約と1つの集合制約で構成されます。従来のオンライン射影アルゴリズム(Zinkevich, 2003)は、射影操作の計算複雑性が高くなり得るため、実装が困難な場合があります。この論文では、関数制約が各ラウンドで破られることを許容しつつ、長期的には満たされることを要求することで、関数制約を緩和します。このタイプの緩和されたオンライン凸最適化(長期制約付き)は、Mahdaviら(2012)で初めて検討されました。その先行研究では、一般的な問題に対して$O(\sqrt{T})$のリグレットと$O(T^{3/4})$の制約違反を達成するアルゴリズムと、制約集合が有限個の線形制約で記述できる場合にリグレットと制約違反の両方に対して$O(T^{2/3})$の上界を達成する別のアルゴリズムが提案されています。Jenattonら(2016)による最近の拡張は、$\theta\in (0,1)$に対して$O(T^{\max\{\theta,1-\theta\}})$のリグレットと$O(T^{1-\theta/2})$の制約違反を達成できます。本論文では、先行研究と比べて性能が改善された、新しい単純なアルゴリズムを提案します。新しいアルゴリズムは、$O(1)$の制約違反で$O(\sqrt{T})$のリグレット上界を達成します。
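
The relaxed setting described above can be sketched with a generic Lagrangian-style primal-dual update. This is not the algorithm proposed in the paper and is not expected to attain its $O(1)$ violation bound; it only illustrates how regret and cumulative constraint violation are tracked when the functional constraint is handled by a dual variable and only the cheap set constraint is projected. The box set, the single linear constraint `g`, the quadratic losses, and the step sizes `1/sqrt(T)` are all assumptions chosen for illustration.

```python
import numpy as np

GRAD_G = np.array([1.0, 1.0])  # gradient of the linear constraint g below


def g(x):
    """Functional constraint g(x) <= 0, here x1 + x2 <= 1 (hypothetical)."""
    return x[0] + x[1] - 1.0


def project_box(x, low=-1.0, high=1.0):
    """Projection onto the simple set constraint, a box (cheap to compute)."""
    return np.clip(x, low, high)


def loss(x, c):
    """Round loss f_t(x) = 0.5 * ||x - c_t||^2 (hypothetical)."""
    return 0.5 * np.sum((x - c) ** 2)


def make_targets(T):
    """Deterministic alternating targets c_t with mean (1, 1) for even T."""
    sign = np.where(np.arange(T) % 2 == 0, 1.0, -1.0)
    return np.column_stack([1.0 + 0.5 * sign, 1.0 - 0.5 * sign])


def run_primal_dual(targets, eta, mu):
    x = np.zeros(2)
    lam = 0.0  # dual variable; acts like a queue of accumulated violation
    total_loss = 0.0
    total_violation = 0.0
    for c in targets:
        # Play x, then observe the loss and the constraint value.
        total_loss += loss(x, c)
        total_violation += g(x)
        # Dual ascent on the observed constraint value, kept non-negative.
        lam = max(0.0, lam + mu * g(x))
        # Primal descent on the Lagrangian, projecting only onto the box.
        grad = (x - c) + lam * GRAD_G
        x = project_box(x - eta * grad)
    return x, lam, total_loss, total_violation


if __name__ == "__main__":
    T = 10000
    targets = make_targets(T)
    step = 1.0 / np.sqrt(T)
    x, lam, total_loss, total_violation = run_primal_dual(targets, step, step)
    # Best fixed feasible point for these losses: projection of the mean
    # target (1, 1) onto {x1 + x2 <= 1}, which is (0.5, 0.5).
    x_star = np.array([0.5, 0.5])
    regret = total_loss - sum(loss(x_star, c) for c in targets)
    print("final x:", x, "dual:", lam)
    print("regret:", regret, "cumulative violation:", total_violation)
```

Because the dual update satisfies `lam_next >= lam + mu * g(x)` in every round, the cumulative violation is always at most `lam / mu` at the end of the run. Bounding the dual variable (or virtual queue) is therefore the standard route to a violation bound, and with `mu = 1/sqrt(T)` a bounded `lam` only gives an $O(\sqrt{T})$ guarantee for this simple scheme.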
