Journal of Artificial Intelligence Resarch: Volume 13の論文一覧

Journal of Artificial Intelligence Resarch Vol. 13 (2000)に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
Asimovian Adaptive Agents
アシモフ的適応エージェント

The goal of this research is to develop agents that are adaptive and predictable and timely. At first blush, these three requirements seem contradictory. For example, adaptation risks introducing undesirable side effects, thereby making agents’ behavior less predictable. Furthermore, although formal verification can assist in ensuring behavioral predictability, it is known to be time-consuming. Our solution to the challenge of satisfying all three requirements is the following. Agents have finite-state automaton plans, which are adapted online via evolutionary learning (perturbation) operators. To ensure that critical behavioral constraints are always satisfied, agents’ plans are first formally verified. They are then reverified after every adaptation. If reverification concludes that constraints are violated, the plans are repaired. The main objective of this paper is to improve the efficiency of reverification after learning, so that agents have a sufficiently rapid response time. We present two solutions: positive results that certain learning operators are a priori guaranteed to preserve useful classes of behavioral assurance constraints (which implies that no reverification is needed for these operators), and efficient incremental reverification algorithms for those learning operators that have negative a priori results.

本研究の目標は、適応性、予測可能性、およびタイムリー性を備えたエージェントを開発することです。一見すると、これら3つの要件は矛盾しているように見えます。例えば、適応は望ましくない副作用をもたらし、エージェントの行動予測可能性を低下させるリスクがあります。さらに、形式的検証は行動予測可能性の確保に役立ちますが、時間がかかることが知られています。これら3つの要件すべてを満たすという課題に対する私たちの解決策は次のとおりです。エージェントは有限状態オートマトン計画を持ち、進化学習（摂動）演算子を介してオンラインで適応されます。重要な行動制約が常に満たされることを保証するために、エージェントの計画はまず形式的に検証されます。そして、適応のたびに再検証されます。再検証によって制約違反が判明した場合、計画は修正されます。本論文の主な目的は、学習後の再検証の効率を改善し、エージェントが十分に高速な応答時間を持つようにすることです。私たちは2つの解決策を提示します。1つは、特定の学習演算子が行動保証制約の有用なクラスを保持することが事前に保証されているという肯定的な結果（つまり、これらの演算子については再検証が不要であることを意味する）であり、もう1つは、事前結果が否定的な学習演算子に対する効率的な増分再検証アルゴリズムです。

Value-Function Approximations for Partially Observable Markov Decision Processes
部分観測マルコフ決定過程の価値関数近似

Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price — exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.

部分観測マルコフ決定過程（POMDP）は、システムの状態が不完全またはノイズの多い観測セットを介して間接的にしか観測できない確率領域における複雑な意思決定および計画問題をモデル化するための、洗練された数学的枠組みを提供します。しかしながら、POMDPのモデリング上の利点には代償が伴います。POMDPを解くための正確な方法は計算コストが非常に高く、そのため実際には非常に単純な問題にしか適用できません。本研究では、計算上の問題を軽減し、精度と速度をトレードオフする効率的な近似（ヒューリスティック）手法に焦点を当てます。本研究では2つの目的があります。まず、様々な近似手法を調査し、それらの特性と関係を分析し、それらの相違点に関する新たな知見を提供します。次に、いくつかの新しい近似手法と既存手法の斬新な改良を提示します。理論的結果は、エージェントナビゲーション領域の問題を用いた実験によって裏付けられています。

Conformant Planning via Symbolic Model Checking
記号モデル検査による適合計画

We tackle the problem of planning in nondeterministic domains, by presenting a new approach to conformant planning. Conformant planning is the problem of finding a sequence of actions that is guaranteed to achieve the goal despite the nondeterminism of the domain. Our approach is based on the representation of the planning domain as a finite state automaton. We use Symbolic Model Checking techniques, in particular Binary Decision Diagrams, to compactly represent and efficiently search the automaton. In this paper we make the following contributions. First, we present a general planning algorithm for conformant planning, which applies to fully nondeterministic domains, with uncertainty in the initial condition and in action effects. The algorithm is based on a breadth-first, backward search, and returns conformant plans of minimal length, if a solution to the planning problem exists, otherwise it terminates concluding that the problem admits no conformant solution. Second, we provide a symbolic representation of the search space based on Binary Decision Diagrams (BDDs), which is the basis for search techniques derived from symbolic model checking. The symbolic representation makes it possible to analyze potentially large sets of states and transitions in a single computation step, thus providing for an efficient implementation. Third, we present CMBP (Conformant Model Based Planner), an efficient implementation of the data structures and algorithm described above, directly based on BDD manipulations, which allows for a compact representation of the search layers and an efficient implementation of the search steps. Finally, we present an experimental comparison of our approach with the state-of-the-art conformant planners CGP, QBFPLAN and GPT. Our analysis includes all the planning problems from the distribution packages of these systems, plus other problems defined to stress a number of specific factors. Our approach appears to be the most effective: CMBP is strictly more expressive than QBFPLAN and CGP and, in all the problems where a comparison is possible, CMBP outperforms its competitors, sometimes by orders of magnitude.

我々は、非決定性領域における計画問題に取り組むため、適合計画への新しいアプローチを提示します。適合計画とは、領域の非決定性にもかかわらず、目標を達成することが保証される一連の行動を見つける問題です。我々のアプローチは、計画領域を有限状態オートマトンとして表現することに基づいています。我々は、記号モデル検査技術、特に二分決定図を用いて、オートマトンをコンパクトに表現し、効率的に探索します。本論文では、以下の貢献を行う。まず、初期条件と動作効果に不確実性を伴う、完全に非決定論的な領域に適用可能な、適合計画のための一般的な計画アルゴリズムを提示します。このアルゴリズムは幅優先の後方探索に基づいており、計画問題に対する解が存在する場合は最小長の適合計画を返し、存在しない場合は問題に適合解が存在しないと結論付けて終了します。次に、二分決定図（BDD）に基づく探索空間の記号表現を提示します。これは、記号モデル検査から派生した探索手法の基礎となります。この記号表現により、潜在的に大規模な状態と遷移の集合を単一の計算ステップで解析することが可能になり、効率的な実装が可能になります。最後に、上述のデータ構造とアルゴリズムをBDD操作に直接基づいて効率的に実装したCMBP（適合モデルベースプランナー）を提示します。これにより、探索層のコンパクトな表現と探索ステップの効率的な実装が可能になります。最後に、最先端の適合計画ツールであるCGP、QBFPLAN、GPTと本手法の実験的比較を示す。我々の分析には、これらのシステムの配電パッケージに含まれるすべての計画問題に加え、特定の要因に着目して定義された他の問題も含まれています。我々のアプローチは最も効果的であると思われます。CMBPはQBFPLANやCGPよりも表現力に優れており、比較可能なすべての問題において、CMBPは競合製品を、時には桁違いに上回る性能を発揮します。

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
MAXQ価値関数分解を用いた階層的強化学習

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics—as a subroutine hierarchy—and a declarative semantics—as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.

本稿では、対象のマルコフ決定過程(MDP)をより小さなMDPの階層に分解し、対象MDPの価値関数をより小さなMDPの価値関数の加法的な組み合わせに分解する、階層的強化学習の新しいアプローチを紹介します。この分解はMAXQ分解と呼ばれ、サブルーチン階層としての手続き型セマンティクスと、階層的ポリシーの価値関数の表現としての宣言型セマンティクスの両方を備えています。MAXQは、Singh、Kaelbling、Dayan、Hintonによる階層的強化学習に関するこれまでの研究を統合し、拡張したものです。MAXQは、プログラマが有用なサブゴールを識別し、それらのサブゴールを達成するサブタスクを定義できるという仮定に基づいています。このようなサブゴールを定義することで、プログラマは強化学習中に考慮する必要があるポリシーのセットを制限します。MAXQ価値関数分解は、与えられた階層と一致する任意のポリシーの価値関数を表すことができます。また、この分解により状態抽象化を活用する機会が生まれ、階層内の個々のMDPは状態空間の大部分を無視できます。これは、この方法の実用化にとって重要です。この論文では、MAXQ階層を定義し、その表現力に関する正式な結果を証明し、状態抽象化を安全に使用するための5つの条件を確立します。この論文では、オンラインモデルフリー学習アルゴリズムMAXQ-Qを提示し、5種類の状態抽象化が存在する場合でも、再帰的最適ポリシーと呼ばれる一種の局所最適ポリシーに確率1で収束することを証明します。この論文では、3つのドメインで一連の実験を通じてMAXQ表現とMAXQ-Qを評価し、状態抽象化を使用したMAXQ-QがフラットQ学習よりもはるかに高速に再帰的最適ポリシーに収束することを実験的に示します。MAXQが価値関数の表現を学習するという事実には重要な利点があります。それは、ポリシー反復におけるポリシー改善ステップと同様の手順で、改善された非階層的ポリシーを計算・実行できるようになることです。本論文では、この非階層的実行の有効性を実験的に実証します。最後に、関連研究との比較と、階層的強化学習における設計上のトレードオフの議論で本論文は締めくくられます。

OBDD-based Universal Planning for Synchronized Agents in Non-Deterministic Domains
非決定性領域における同期エージェントのためのOBDDベースのユニバーサルプランニング

Recently model checking representation and search techniques were shown to be efficiently applicable to planning, in particular to non-deterministic planning. Such planning approaches use Ordered Binary Decision Diagrams (OBDDs) to encode a planning domain as a non-deterministic finite automaton and then apply fast algorithms from model checking to search for a solution. OBDDs can effectively scale and can provide universal plans for complex planning domains. We are particularly interested in addressing the complexities arising in non-deterministic, multi-agent domains. In this article, we present UMOP, a new universal OBDD-based planning framework for non-deterministic, multi-agent domains. We introduce a new planning domain description language, NADL, to specify non-deterministic, multi-agent domains. The language contributes the explicit definition of controllable agents and uncontrollable environment agents. We describe the syntax and semantics of NADL and show how to build an efficient OBDD-based representation of an NADL description. The UMOP planning system uses NADL and different OBDD-based universal planning algorithms. It includes the previously developed strong and strong cyclic planning algorithms. In addition, we introduce our new optimistic planning algorithm that relaxes optimality guarantees and generates plausible universal plans in some domains where no strong nor strong cyclic solution exists. We present empirical results applying UMOP to domains ranging from deterministic and single-agent with no environment actions to non-deterministic and multi-agent with complex environment actions. UMOP is shown to be a rich and efficient planning system.

最近、モデルチェックの表現および検索手法が計画、特に非決定論的計画に効率的に適用できることが示されました。このような計画手法では、順序付き二分決定図(OBDD)を使用して計画ドメインを非決定性有限オートマトンとしてエンコードし、モデル検査の高速アルゴリズムを適用して解を探索します。OBDDは効果的に拡張でき、複雑な計画ドメインに普遍的な計画を提供できます。私たちは特に、非決定性マルチエージェントドメインで発生する複雑性に対処することに興味を持っています。この記事では、非決定性マルチエージェントドメイン用の新しい汎用OBDDベースの計画フレームワークであるUMOPを紹介します。非決定性マルチエージェントドメインを指定するために、新しい計画ドメイン記述言語NADLを紹介します。この言語は、制御可能なエージェントと制御不可能な環境エージェントの明示的な定義に貢献します。NADLの構文とセマンティクスについて説明し、NADL記述の効率的なOBDDベースの表現を構築する方法を示します。UMOP計画システムは、NADLとさまざまなOBDDベースの汎用計画アルゴリズムを使用します。これには、以前に開発された強力な計画アルゴリズムと強力な巡回計画アルゴリズムが含まれます。さらに、最適性保証を緩和し、強い解や強い巡回解が存在しない領域において妥当な普遍計画を生成する、新たな楽観的計画アルゴリズムを紹介します。UMOPを、環境行動を持たない決定論的・単一エージェントから、複雑な環境行動を持つ非決定論的・複数エージェントまで、幅広い領域に適用した実験結果を示します。UMOPは、豊富な機能と効率的な計画システムであることが示されます。

AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks
AIS-BN：大規模ベイジアンネットワークにおける証拠推論のための適応的重要度サンプリングアルゴリズム

Stochastic sampling algorithms, while an attractive alternative to exact algorithms in very large Bayesian network models, have been observed to perform poorly in evidential reasoning with extremely unlikely evidence. To address this problem, we propose an adaptive importance sampling algorithm, AIS-BN, that shows promising convergence rates even under extreme conditions and seems to outperform the existing sampling algorithms consistently. Three sources of this performance improvement are (1) two heuristics for initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm. We tested the performance of the AIS-BN algorithm along with two state of the art general purpose sampling algorithms, likelihood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling (Shachter & Peot, 1989). We used in our tests three large real Bayesian network models available to the scientific community: the CPCS network (Pradhan et al., 1994), the PathFinder network (Heckerman, Horvitz, & Nathwani, 1990), and the ANDES network (Conati, Gertner, VanLehn, & Druzdzel, 1997), with evidence as unlikely as 10^-41. While the AIS-BN algorithm always performed better than the other two algorithms, in the majority of the test cases it achieved orders of magnitude improvement in precision of the results. Improvement in speed given a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms almost never achieved the precision reached even by the first few iterations of the AIS-BN algorithm.

確率的サンプリングアルゴリズムは、非常に大規模なベイジアンネットワークモデルにおいて、厳密なアルゴリズムの魅力的な代替手段となるものの、極めて尤度が低い証拠を用いた証拠推論においては性能が低いことが観察されています。この問題に対処するため、我々は適応型重要度サンプリングアルゴリズムAIS-BNを提案します。このアルゴリズムは、極端な状況下でも有望な収束率を示し、既存のサンプリングアルゴリズムを一貫して上回る性能を示すと考えられます。この性能向上の3つの要因は、(1)有限次元積分における重要度サンプリングの理論的特性とベイジアンネットワークの構造的利点に基づく、重要度関数の初期化のための2つのヒューリスティック、(2)重要度関数の滑らかな学習法、(3)アルゴリズムの異なる段階からのサンプルを結合するための動的重み付け関数です。我々は、AIS-BNアルゴリズムの性能を、2つの最先端の汎用サンプリングアルゴリズム、尤度重み付け（Fung & Chang, 1989; Shachter & Peot, 1989）および自己重要度サンプリング（Shachter & Peot, 1989）と共にテストした。テストでは、科学界で利用可能な3つの大規模な実際のベイジアンネットワークモデルを使用しました。CPCSネットワーク(Pradhan他、1994)、PathFinderネットワーク(Heckerman、Horvitz、& Nathwani、1990)、およびANDESネットワーク(Conati、Gertner、VanLehn、& Druzdzel、1997)で、その証拠は10^-41というありそうもないものでした。AIS-BNアルゴリズムは常に他の2つのアルゴリズムよりも優れたパフォーマンスを発揮しましたが、テストケースの大部分で結果の精度が桁違いに向上しました。望ましい精度を与えられた場合の速度の向上はさらに劇的ですが、他のアルゴリズムではAIS-BNアルゴリズムの最初の数回の反復でさえ到達した精度にほとんど達しなかったため、ここで数値結果を報告することはできません。

Space Efficiency of Propositional Knowledge Representation Formalisms
命題的知識表現形式の空間効率

We investigate the space efficiency of a Propositional Knowledge Representation (PKR) formalism. Intuitively, the space efficiency of a formalism F in representing a certain piece of knowledge A, is the size of the shortest formula of F that represents A. In this paper we assume that knowledge is either a set of propositional interpretations (models) or a set of propositional formulae (theorems). We provide a formal way of talking about the relative ability of PKR formalisms to compactly represent a set of models or a set of theorems. We introduce two new compactness measures, the corresponding classes, and show that the relative space efficiency of a PKR formalism in representing models/theorems is directly related to such classes. In particular, we consider formalisms for nonmonotonic reasoning, such as circumscription and default logic, as well as belief revision operators and the stable model semantics for logic programs with negation. One interesting result is that formalisms with the same time complexity do not necessarily belong to the same space efficiency class.

我々は、命題知識表現（PKR）形式主義の空間効率を調査します。直感的には、ある知識Aを表現する形式主義Fの空間効率は、Aを表現するFの最短の式のサイズです。本稿では、知識は命題解釈（モデル）の集合、または命題式（定理）の集合であると仮定します。我々は、モデル集合または定理集合をコンパクトに表現するPKR形式主義の相対的な能力について、形式的に論じる方法を提供します。2つの新しいコンパクトさの尺度、すなわち対応するクラスを導入し、モデル/定理を表現するPKR形式主義の相対的な空間効率が、これらのクラスに直接関連していることを示す。特に、サーカムスクリプションやデフォルト論理といった非単調推論のための形式主義、そして信念修正演算子、そして否定を含む論理プログラムの安定モデル意味論を考察します。興味深い結果の一つは、同じ時間計算量を持つ形式主義が必ずしも同じ空間効率クラスに属するとは限らないということです。

関連記事