A flexible deep reinforcement learning agent reward / cost function and action control design on the basis of (repulsive and attractive) potential field functions, geodesics or vector field inequalities is a must for adapting the work and task spaces of the various n- degrees of freedom robots to dynamic and kinematics constraints of robots with end-effectors (such as grippers or other tools like spindles), parts to be added, removed or assembled, and objects, human operators and obstacles to interact with or avoid. The particular reward functions to be designed and implemented depend heavily on how realistic the observation-action spaces of related work and task spaces of related agile robot production scenarios are generated by the ACROBA Virtual Gym, a Unity-ROS-based simulator, for the selected pilot use cases. In the following an introduction to the problem of reward function design in deep reinforcement learning is given before modern geometric reward functions are defined allowing to define attractive and repulsive potential fields that can serve as the means to optimize robot motions and dynamic interaction with its environment.

**Introduction reward function design **

A reward function enables a deep reinforcement learning agent to find a policy in robot-environment-product state/observation-action space **(S, A) **that maximizes long-term accumulated rewards and/or minimizes long-term costs in trying to achieve a goal or specific target (sub)task or acquiring a skill. For shaping a proper reward function **R** the following requirements need to be met:

Robot-environment-product models that cover in relation to the task at hand sufficiently 1) degrees of freedom for robot sensing of itself, environment (including humans) and product and motor action with respect to those, 2) related reference frames for describing position and orientation of the robot itself as well as those for the environment and product and for determining their kinetics and dynamics, 3) mass and Coriolis matrices, nonlinear damping matrices and force and torque control matrices of robot, 4) guidance laws and/or constraints for robot-environment/product control movement, interaction and safety.

Robot-environment-product model observation-action space dimensionality reduction â€“ if this space is too large for simulation or during run-time the agent will not be able to find an optimal policy at all.

Assuming that a world-frame F and associated sensing and actuator devices are available to the robot (agent) then the position and pose/attitude/orientation of the robot modules, products and environment next to the translation, rotations and related velocities, and forces and moments can be established as well as adapted. Furthermore, the task execution by the robot is assumed to achieve performance levels after optimizing long-term accumulated reward / costs. One such performance level in general depends on optimizing not only one typical reward, it is often a combination of simultaneously or consecutively to be achieved ones. Often the reward function on observation-action-observation space is defined ad-hoc and involve additional hyper-parameters.

In the sequel systematically the following modern geometric reward functions are designed that will allow the deep reinforcement learning agent to optimize its behavior and skills for accomplishing its tasks:

Pose alignment reward functions:

Position alignment reward functions

Orientation alignment reward functions

Velocity alignment reward functions

Translation velocity alignment reward functions

Angular velocity alignment reward functions

Acceleration alignment reward functions

Translation acceleration alignment reward functions

Angular acceleration alignment reward functions

Dynamic interaction reward functions

Force alignment reward function

Torque alignment reward function

**Pose alignment reward functions**

For alignment we need to consider the inertial reference frames attached to the robots **F_r**, work-cell **F_w,** human operators **F_h** and the objects **F_0**; they consist of the position and rotation frame fields that will be expressed with respect to the base or world reference frame field **F**. These fields decompose in position and rotation frame fields, **F_x** and **F_Ï‰**, respectively.

*Position alignment reward functions*

Position may concern any location **x** on or of joint, link or end-effector (2,3 f hand, spindle) of one or more robot serial manipulators or any target location **x_t** (which we assume to be fixed in the world reference frame) on or of the work cell or target object to be grasped, worked (deburred) or assembled. The goal of the constraint task is to approach rapidly and cost-effectively without colliding with the object or sliding along the surface of the object at that position. This requires that **x** has to approach **x_t** with a not too high speed in order to be capable of in-time deceleration. Furthermore, **x is required **also mostly to approach the object from the proper side of and at the right angle at **x_t** (for the latter see orientation reward function).

Assuming that **N_t **and **T_t **is surface unit normal and projection of vector x on the tangent space at **x_t**, respectively, then two position alignment reward functions can be defined based on besides a signed distance function for how close **x** is to **x_t** (positive definite if **x** in free space and negative if **x **inside objects (forbidden space)**) **also on the temporal change in a signed distance function that quantifies whether and how much **x** approaches **x_t **or not (penalizing or rewarding the underlying position alignment action **a** that led to the behavior):

Normal position alignment reward function (for attractive and repulsive

**x_t)**

**R_np (x(t_0)** , **a(t_1),x(t_2)**, **x_t) = **

**A_n sign(N_1) [1 + |N_1|]^(- sign(N_1)) + **

**A_n sign(N_2) [1 + |N_2|]^(- sign(N_2)) **

**+ **

**Î±_n sign(|N_0| - |N_2|**) ** [1 + ||N_0| - |N_2||]^(- sign(|N_0| - |N_2|**) **) **

with

**N_i = (x(t_i)** â€“ **x_t**) . **N_t**

**Î±_n < A_n** >> 1

Tangential position alignment reward function (for attractive and repulsive

**x_t)**

**R_tp (x(t_0)** , **a(t_1),x(t_2)**, **x_t) = **

**A_t sign(T_1) [1 + |T_1|]^(- sign(T_1)) + **

**A_t sign(T_2) [1 + |T_2|]^(-sign(T_2)) **

**+ **

** Î±_t sign(|T_0| - |T_2|**) ** [1 + ||T_0| - |T_2||]^(- sign(|T_0| - |T_2|)) **

with

** T_i = (x(t_i)** â€“ **x_t**) - ((**x(t_i)** â€“ **x_t**). **N_t** ) **N_t**

** Î±_t < A_t** >> 1

Both types of reward functions neatly reflect the desired evaluation of the action **a(t_1)**, i.e., reward an agent receives when it realizes through its action **a(t_1) **that **x(t_2)** is close to **x_t **in free space as well as that **x** approaches in free space **x_t **in consecutive time-steps** t_0** and **t_2.** In addition, they penalize agent when **x** distances itself in free space from **x_t **irrespective normal or tangential direction. Furthermore, agent is severely punished (-**Î±_n/t, - A_n/t)** if **x is **in forbidden space and even more so if it is moving away from **x_t **deeper and deeper into forbidden space â€“ this all by simply using **sign** functions (see Figure 1). **Î±, A **may capture some physical notion or task-process relevance in case the surface normal and associated area captures and orders, e.g., the faces for grasping performance in relation to the grippers / fingers contact areas. For example, if we have a region of interest (ROI) associated to **x_t **on the surface S of an object (represented as a triangular mesh with vertex **x_t **and unit normal **N_t **and area **Î±, A **with the edges being boundaries of forbidden regions/zones) then we can scale â€“ **Î±(|x_t - Î¾_i|) ,- A(|x_t â€“ Î¾_i|) **of the reward function for the forbidden regions (not necessarily) on S by multiplying with an (linear) increasing function of the distance **|x_t â€“ Î¾_i| **between **x_t **and the vertices **Î¾_i **of the triangular meshes of the forbidden regions / zones). Doing so a natural reward function appears alike electric potential fields for in-homogeneously charged objects; reward function field thus does not need to be constant on the surface of object unless one would like to see the surface itself as the target.

Now these reward functions do not yet make operational the preference of certain surface patches or regions of interest over others. A new task based reward functions on observation-action spaces would be helpful that assigns opposite higher / lower credits to actions **a(t_1)** at** x ** depending on the different perhaps prioritized manufacturing tasks as grasping and avoiding object. The reward functions above were designed with in particular robot-object interaction task in mind; they are attractive and repulsive at the same time. For object avoidance similar reward functions but of a fully repulsive form are needed (see Figure 1). These reward functions are defined as follows:

Normal position alignment reward function (for full repulsive

**x_t)**

**R_np (x(t_0)** , **a(t_1),x(t_2)**, **x_t) = **

**- (A_n [1 + |N_1|]^(- sign(N_1)) + **

**A_n [1 + |N_2|]^(- sign(N_2)))**

**+ **

**- Î±_n [1 + ||N_0| - |N_2||]^(- sign(|N_0| - |N_2|**) **) **

with

**N_i = (x(t_i)** â€“ **x_t**) . **N_t**

**Î±_n < A_n** >> 1

Tangential position alignment reward function (for full repulsive

**x_t)**

**R_tp (x(t_0)** , **a(t_1),x(t_2)**, **x_t) = **

** - (A_t [1 + |T_1|]^(- sign(T_1)) + **

** A_t [1 + |T_2|]^(- sign(T_2)))**

** + **

** - Î±_t [1 + ||T_0| - |T_2||]^(- sign(|T_0| - |T_2|)) **

with

**T_i = (x(t_i)** â€“ **x_t**) - ((**x(t_i)** â€“ **x_t**). **N_t** ) **N_t**

** Î±_t < A_t** >> 1

It seems that these reward functions pretty well capture all aspects hidden in **(x(t_0)**, **a(t_1),x(t_2)**, **x_t)**. However, studying the position alignment more closely the reward functions do not yet satisfactorily handle collisions for the attractive-repulsive cases **R_np >= 0 **and **R_tp >= 0. **The reason for this is that when** x **approaches** x_t **the perpendicular and tangential speed **v_n** and **v_t**:

**v_n(x, **t_2) = **R_np (x(t_0)** , **a(t_1),x(t_2)**, **x_t) N_t /(t_2 - t_0)**

**v_t(x , **t_2) **= R_tp (x(t_0)** , **a(t_1),x(t_2)**, **x_t) N_t /(t_2 - t_0)**

could exceed a certain critical velocity that will make it impossible for the robot (agent) to decelerate in-time perpendicularly and tangentially;** x** will collide with or miss **x_t **at time **t_4**. The duration of time-to-collision/miss **Ï„_t/n** that the robot can exert maximum perpendicular and tangential deceleration |**a_t/n(x)| **on **x** to avoid collision or misses are given by:

**Ï„_n = |(x(t_2)** â€“ **x_t**)] . **N_t** | / |**v_n(x, **t_2)|

**Ï„_t = |T_2| / |(v_t(x , **t_2) .[(**x_t** - **x(t_2))/** |(**x_t** - **x(t_2))|]|**

Note that for **Ï„_t one has to consider the speed v_t(x , **t_2) along ** **[(**x_t** - **x(t_2))/** |(**x_t** - **x(t_2))|]. **As the robot parameters |**a_t/n(x)|** should normally be known in advance, the collision avoidance and miss constraints on the velocity fields herewith read:

|**v_n/t|(x)** <= |**a_t/n(x)| Ï„_t/n**

Using these constraints again a signature function sign (|**a_t/n(x)| Ï„** - |**v_n/t|(x)**) is defined to adapt the second terms of the repulsive-attractive reward functions as follows:

Normal position alignment reward function (for attractive and repulsive

**x_t)**

**R_np (x(t_0)** , **a(t_1),x(t_2)**, **x_t) = **

**A_n sign(N_1) [1 + |N_1|]^(- sign(N_1)) + **

**A_n sign(N_2) [1 + |N_2|]^(- sign(N_2)) **

**+ **

**Î±_n sign(|N_0| - |N_2|**) **[1 + ||N_0| - |N_2||]^(- sign(|N_0| - |N_2|**) **) **

**sign **(|**a_n(x)| Ï„_n** - |**v_n(x)|**)

with

**N_i = (x(t_i)** â€“ **x_t**) . **N_t**

**Î±_n < A_n** >> 1

Tangential position alignment reward function (for attractive and repulsive

**x_t)**

**R_tp (x(t_0)** , **a(t_1),x(t_2)**, **x_t) = **

**A_t sign(T_1) [1 + |T_1|]^(- sign(T_1)) + A_t sign(T_2) [1 + |T_2|]^(- sign(T_2)) **

**+ **

**Î±_t sign(|T_0| - |T_2|**) ** [1 + ||T_0| - |T_2||]^(- sign(|T_0| - |T_2|)) **

**sign** (|**a_t(x)| Ï„_t** - |**v_t|(x)|**)

with

**T_i = (x(t_i)** â€“ **x_t**) - ((**x(t_i)** â€“ **x_t**). **N_t** ) **N_t**

**Î±_t < A_t** >> 1

Clearly, if the robot will be able to mitigate in-time collisions or misses the reward function stays positive and otherwise negative. If not then still the impact of collisions will be minimized. In this way the position alignment reward functions at **x **are not only governed by position vector fields **x **but also by translation velocity vector fields **v_n/t(x)** and intrinsic robot parameters |**a_t/n(x)|. **Looking at the variables** (x(t_0)** , **a(t_1),x(t_2)**, **x_t) **of the reward functions this is actually to be desired when we associate action **a(t_1) **to deceleration constraints/tasks as we formulated above.

We could keep tangential and normal reward functions separate for credit assignments of the various actions **a(t)(x) **exerted by the robot. Alternatively, we could opt for combining the various normal and tangential reward functions into one reward function **R_p** assigning a reward to a complete reconfiguration of joint positions (including end-effector with/without part), etc:

** ** **R_np R_tp if R_np **__â‰¥__** 0 and R_tp â‰¥ 0**

**R_p** =

** ** **- |R_np| |R_tp|** otherwise

Note that here one can still distinguish between attractive-repulsive and full-repulsive reward functions that will be exploited simultaneously by the agent. The nice thing is that both are in a sense cooperative reward functions as the full repulsive reward functions enhances the attractive-repulsive one. One can also think of full attractive targets such as sheets of paper that are of course very thin but have opposite normal **N_t** on either sides.

*Figure **1**: Attractive-repulsive (**gold**) and full repulsive (**blue)** alignment reward as function of signed distance to the vertical red lines demarcating the forbidden spaces on the left and the right, respectively, and the free space in the middle. Note that here we assume **x_t** to be at the intersection of the horizontal and left vertical red lines and **x** living on horizontal red line (allowed of course to enter free as well as forbidden spaces). Furthermore, that the red vertical lines reflect the boundary of the safety zones enveloping the objects. Last but not least, that the right vertical red line envelopes an obstacle that needs to be avoided. Agent computes attractive-repulsive and full repulsive alignment reward at the same time**. **For object avoidance infinite repulsion / negative reward may be put at the objects' surface. There's a small caveat you probably noticed: our agent needs to compensate for the repulsive forces at the target interface - if tapped into this field charecteristic he/she will take care.*

*Orientation alignment reward function*

For deriving the orientation alignment reward function for the rotation frame fields **F_Ï‰(x)** and **F_Ï‰(x_t)** one can follow follow the same approach as that for the position alignment reward function. Thereto, first define the rotation **R** from a base reference / world rotation frame field **F_Ï‰(0)** to **F_Ï‰(x)** and **F_Ï‰(x_t),** respectively.

Using unit quaternions **Q** =q_0 + q_1 â€Š**i** + *q_2 ***j** + q_3 â€Š**k** (with |â€Š**z**â€Š| = 1) for representing such spatial rotations, **R** is given by:

This or other rotation representations allow identification of the individual frame vector fields **e^i (x) **and **e^j (x_t) **of **F_Ï‰(x)** and **F_Ï‰(x_t)**, respectively, in terms of the base reference / world rotation frame field **F_Ï‰(0)** as the columns of the corresponding matrices **R**(**F_Ï‰(0)**, **F_Ï‰(x)**) and **R(F_Ï‰(0),F_Ï‰(x_t))**, respectively. Now a first rotation alignment constraint can be defined as follows:

**R(e^1 (x), e^1 (x_t)) e^1 (x) . e^1 (x_t) = 1**

**which is equivalent to n^1( e^1 (x) , e^1 (x_t)) = e^1 (x) X e^1 (x_t) where X is outer product**

**(n^1( e^1 (x) , e^1 (x_t)) X e^1 (x) ) . e^1 (x_t) = 1**

Using** n^1( e^1 (x) , e^1 (x_t)) X e^1 (x) **to represent the rotation **R(e^1 (x), e^1 (x_t)) e^1 (x) **the above alignment can be defined as the direction **n^1 **and required rotation **Î¸^1** :

**Î¸^1 = arccos(e^1 (x) . e^1 (x_t))/Ï€ = 1**

In order to fully align the other frame fields of course a second rotation alignment constraint is needed. Thereto, first rotate **e^2 (x) **and **e^3 (x) **using** n^1(x, x_t) **(having already aligned** e^1 (x) **and **e^1 (x_t) **with respect to world rotation reference frame). This can easily be done by applying Rodriguesâ€™ rotation formula to those two rotation frame vector fields:

**R(e^1 (x), e^1 (x_t)) e^1 (x) = e^1 (x_t)**

** R(e^1 (x), e^1 (x_t)) e^2 (x) **=

**n^1 (n^1 . e^2 (x)) + cos(Î¸^1) (n^1X e^2 (x)) X n^1** + **sin(Î¸^1) (n^1 X e^2 (x))**

** R(e^1 (x), e^1 (x_t)) e^3 (x) **=

**n^1 (n^1 . e^3 (x)) + cos(Î¸^1) (n^1X e^3 (x)) X n^1** + **sin(Î¸^1) (n^1 X e^3 (x))**

The new rotation matrix **R(e^1 (x), e^1 (x_t)) R(F_Ï‰(0)**, **F_Ï‰(x)**) has a first column the rotation frame vector field **e^1 (x) = e^1 (x_t) **and as second and third column the new rotation frame vector fields **R(e^1 (x), e^1 (x_t)) e^2 (x) **and **R(e^1 (x), e^1 (x_t)) e^3 (x) - **all described with respect to the base reference frame **F_Ï‰(0). **

As the latter frame fields are still not aligned with **e^2 (x_t) **and **e^3 (x_t), **respectively, a simple rotation around **e^1 (x) **is needed and the second rotation alignment constraint for **R(e^1 (x), e^1 (x_t)) e^2 (x) **can readily be formulated as:

**Î¸^2 = arccos(R(e^1 (x), e^1 (x_t)) e^2 (x) . e^2 (x_t))/Ï€ = 1**

To compute readily the outer products of vectors one can make use of the Levi-Civita tensor:

**c = a X b = (Îµ^i_jk a^j b^k)**

One might be inclined to put forward the following three rotation alignment constraints:

**Î¸^i = arccos(e^i (x) . e^i (x_t))/Ï€ with i=1,2,3**

as they for basic alignment tasks will suffice. But for more complex tasks in which one has to operate in the planar spaces spanned by pairs of the frame fields **e^i **they do not. Therefore, stick to rotation measures **Î¸^1 **and **Î¸^2 **to define normal or tangential reward functions, similar to those for position alignment.

It is important to realize that the rotation alignments **Î¸^1 **and **Î¸^2 **depend on the direction going from **x **to **x_t** and that these changes are neither to be expected isotropic (for those readers acquainted with modern differential geometry please refer to Frenet frame fields for spatial curves that coincide with those that are generated by N_t and foliations of our reward function level sets. However, the rotation frame does not need to coincide with the Frenet frame field which rather relates to the Cartesian position vector and translation velocity and acceleration; rotational frame rather captures 3D geometry of a normal fields and foliations of the reward level sets).

Note that for orientation alignment again attractive-repulsive and full repulsive reward functions can be defined and that these reward functions also can incorporate angular velocity-acceleration** (Ï‰^i (x, t_2) , |Î©(x)|) **constraints with **max(|Î©(x)|) **known as robot parameters. Furthermore, that for a particular alignment task for â€˜orientatingâ€™ one of the frame vector fields with the normal of the target surface area it does not need to be so that **e^i (x).N_t(x_t) =1**; a certain negative angle between those vectors will be probably more appropriate (especially if one would like to cut out a part of an object with a spindle). Moreover, that robot self-collision and kinematic constraints using both the position and orientation alignment reward function can also be made operational (these constraints are normally already hard coded in the robots; collision of the robots joints and mid links with other objects or human operators are normally not). Last but not least as the robot and objects, humans and work-cell can be represented by triangular meshes - as we alluded to above - the attractive or repulsive target areas associated to their vertices can be used to represent so to-speak appropriate dynamic charges around **x** and **x_t**.

The objective function Î¦ to be maximized for achieving the goal of reaching **(x_t, e^i(x_t)) **in deep reinforcement learning (using experience replay) can be now defined as shortest / least (distance, duration or eventually work) space-time state-action-state steps of paths **[x(t), e^i(x(t), a(t), x_t, e^i(x_t), t = 0, â€¦..] **maximizing the accumulated discounted **(0 < Î³â‰¤ 1) **reward received by the agent:

**Î¦[x(t),e^i(x(t), a(t), x_t,e^i(x_t] = Î£Î³^t R(x(t),e^i(x(t)), a(t), x_t,e^i(x_t))**

**Velocity and acceleration alignment reward functions**

Partially this is covered already in the approach above that incorporates both translation and rotation pose, velocity and acceleration constraints, but if needed this can be extended by enlarging the number of samples in time-sequences of observation-action spaces to be explored and analyzed by the agent employing the reward functions. The normal and tangential reward functions for velocities and acceleration reward functions can directly reflect again attractive-repulsive as well as full repulsive constraints as for positioning and orientation. To harmonize and bring all these alignment functions under one and the same umbrella it is also very opportune to consider a canonical frame for pose consisting of the ordinary position **x^p** extended with **x^p + e^i(x^p)**. This has the advantage that the approach for pose alignment is mapped to cell **(x^p, x^p+e^i(x^p))** alignment that in turn translates simply in position, translation velocity and acceleration alignment for four points. Therewith, normal and tangential position alignment through attractive-repulsive and full repulsive functions already suffices to build those for velocity and acceleration alignment.

Note that alike tangential and normal reward functions for pose alignment one can combine those for pose, velocity and acceleration alignment into one special directed attractive-repulsive and full repulsive reward function. In case one would like to approach a target object from one side and land on the surface normal and tangential velocity need to go to zero. But again one has to reward landing from the positively and punish severely from below. In order to combine pose and velocity alignment of (**x^p, x^p + e^i(x^p))** and **(x_t^p, x^p + e_t^i(x_t^p)) **one merely needs to consider the four pairs of frame fields (**x^p, x_t^p) and (x^p + e^i(x^p), x^p + e_t^i(x_t^p)) ** and construct for all those vector pairs similar tangential and normal alignment reward as for position. Again the approach of these fields for **x, **normally and tangentially, need to be aligned in direction (opposite or same) with respect to those at x_t**;** possibly **n(x_t) = - n_t(x_t)**. All these signature-directional task requirement will also be reflected in the newly combined attractive-repulsive and full attractive reward function:

** ** __R_np__** **__R_tp__** if **__R_np^i__** **__â‰¥__** 0 and **__R_tp^i__** â‰¥ 0, i = 1,2,3,4 **

** R_p** =

** - |**__R_np__**| |**__R_tp__**|** otherwise

in which __R_np__** and **__R_tp__** are **the products of the normal and tangential position reward functions for the four vector pairs; physical velocity and acceleration constraints are built in and there is no need for orientational alignment. Adding acceleration or force/torque alignment task constraints does not complicate our analysis much further. At any spatio-temporal resolution the most salient markers (corners, centres and edges) of ROIs of human operators, robot, parts and objects can be used to construct on each accurately positioned stable rigid structures of vector pairs (not necessarily mathematical frame fields but rather physically defined entities) for further object avoidance or assembly task completion without one has to turn to cumbersome mathematical manipulations; such a coarse granular approach already suffices for fetching (CAD-model matched) parts from a bin and placing them on a tray.

**Force and torque field alignment reward functions**

Although we can follow a similar approach as for the pose, velocity and acceleration reward functions these functions require access to accurate mass and inertia, Coriolis and centrifugal, friction and other models of the robotics systems as well as systems they have to interact with. For example, unconstrained problem for robot serial manipulator motion (poses) and contact force (forces and torques) contro**l u i**n joint space coordinates can be described by :

u(q) = M(q) d^2q/dt^2 + C(q,dq/dt) dq/dt+ G(q)

with mass and inertia matrix M, Coriolis and centrifugal matrix C and gravity G:

M(q) = J ^T(q) M(x) J(q)

G(q) = J^T(q) G(x)

C(q,dq/dt) = (c_ijk dq^k/dt), c_ijk = Â½ (âˆ‚ M_ij / âˆ‚ q^k + âˆ‚ M_ik / âˆ‚ q^j - âˆ‚ M_jk / âˆ‚ q^i)

where mass-inertia matrix M(x) is mass (m^i) and inertia (I^i_xy) matrix, gravity G(x) due to the center of masses of the links with respect to base frame and Jacobian J(q) = (âˆ‚x^m/âˆ‚q^n) can be retained from the coordinate transformation T^n_0 (a,d,Î±,q) (a, d, Î± are a set of robot link parameters (ranges)):

T^n_0 (a,d,Î±,q) = A^1_0 A^2_1 â€¦. A^n_n-1

with **A**^i_i-1 representing pose frame F_i with respect to previous pose frame F_i-1:

cos(q^i) -cos(Î±^i) sin(q^i) sin(Î±^i) sin(q^i) a^i cos(q^i)

**A**^i_i-1 = sin(q^i) cos(Î±^i) cos(q^i) -sin(Î±^i) cos(q^i) a^i sin(q^i)

0 sin(Î±^i) cos(Î±^i) d^i

0 0 0 1

Pose **x**^i _0 of link i (i = n for end-effector) with respect to F_0 is therewith given by:

**x**^i _0 = **T**^i_0 **(a,d,Î±,q) q**

Now it is feasible to to use our reward functions and for example standard open motion planning library or other standard physics engine library functions to create action pose, velocity and acceleration and force and torque trajectories for the robot and subsequently evaluate in the ACROBA Virtual Gym the generated observation-action-observation to train the deep reinforcement learning neural networks / agents to recommend optimized motion planning strategies for operations.

However, such physics models are difficult to construe in advance as they also depend on the systems history that one has hardly knowledge of nor clue about over time; they may suddenly loose functionality or start behaving differently (robot joint actuator drop rotational ranges, changes in physical properties of objects, changing working procedures followed by human operator). Needless to say that whenever the robot interacts with its environment and manipulates objects their composed dynamics changes. Although creating such models for different types of robots extended with various types of tools and holding, manipulating and maneuvering objects is a must, the shear endless number of task constraints will be very cumbersome to be integrated (using (O Khatib. 1987) ) into robot-environment system and to be updated on the fly.

The only viable way out of this is to let the systems try to gauge and re-normalize themselves upon interactions with itself and its environment. Most conveniently this is done by extending the gravitational field G(x) and by defining the the task-based constrained contro**l action u i**n joint space coordinates to be governed by :

u(q) = M(q) d^2q/dt^2 + C(q,dq/dt) dq/dt+ G(q) + A(q)

with attractive-repulsive fields A(q) = J^T(q) P(x) in joint space determined by reward functions R(x):

**A(x) = - **Î£_R **Grad_(x^p,x^p + e^i(x^p)) R**

These control actions can serve the agile robot production simulations performed by the ACROBA Virtual Gym in the training phase of the deep reinforcement learning agents, i.e. in determining the direction, speed, acceleration and force/torque in and with which to follow the geodesics (integral curve belonging to A(x)) from x towards and at the task target area x_t while exploring observation-action-observation spaces without falling back to any complex forward kinematics or inverse rigid body dynamics model. As alluded to above one can play with the attractive and repulsive charges to induce proper virtual action-observation foliations for guiding the robot actions. Moreover, such reward fields can be made dynamic upon task completion by transforming from e.g. an assembly task into an object avoidance task. Moreover, the priority of tasks during assembly or other processes or scheduling of tasks can be managed by transforming over time the charges of the reward fields depending on initial and subsequent configurations and dynamics of x and x_t's. To enable identification of the different task target areas x_t one can assign these areas' charges and their potential fields with a specific frequency.m Knowing the task schedule order and its distance and pose with respect to the target areas the robot can select the most opportune paths. In addition, contact dynamics between end-effector and objects / humans such friction tensile stresses to be exerted can be represented in proper functional charges for the tasks at hand.

The gauged and re-normalized (averaged over many epochs of training) agent action neural network obtained after deep reinforcement learning can be succinctly captured in terms of some kind of mean-field for the control action in joint space:

<**A(q)> = - Grad_<(x^p,x^p + e^i(x^p)) > Î¦**

The advantage of this approach of credit assignment of observation-action spaces for our motion and dynamic robotic models is that one can prepare such observation-action-reward spaces in advance and during run-time. And it has a perfect integrable and differentiable structure for any discrete representation of the observation-action spaces. This means that (**x, e^i) **are differentiable allowing us to locally set up affine one-connection forms and define curvature and torsion two-forms for describing for instance defects in observation-action spaces. Therewith, it not only allows to define the classical robot-object interaction paradigms in which not really active transformations take place with respect to robot, object and work-cell (diffeo-morphisms of space-time, deformation of objects, ...), but also the robot agent to exert real forces and to perform real work by inflicting so-called dislocation (Burgersâ€™) and disinclination (Frankâ€™s) fields (earthly Einsteinâ€™s torsion and curvature fields) to really mold, sculpture or manufacture objects or to interact physically with human operators in gently manners.

As emphasized the robot control action u(q) should preferably be formulated such that intrinsic changes over time in the robot system such as changes in particular in M(q) are reflected into its governing equation. This requires suitable generalized 'barycentric' coordinates __q__ for the joint poses that make operational such dynamic changes, e.g.:

__q__= M(q,t) q with M(q,t) = J ^T(q) M(x,t) J(q)

The advantage of the latter coordinates is that alterations of the robot, end-effector or part of object / camera system attached to the robot / end-effector are captured by M. Using a corresponding frame field (Î¾(__q)__, Îµ(__q)__) and its dual field (dÎ¾(__q)__,dÎµ(__q)__) that induce a flat metric, a similar **(**Î¾**^p, Î¾^p + Îµ^i(Î¾^p)) = (**__q__**^j)**-cell** **alignment problem can be formulated as for unconstrained systems. Herein** **Î¾ and Îµ play the role of special affine frame fields that provide the right means to recover the dynamics of robot interaction with itself and its environment but scaled for such intrinsic dynamic aspects of M. The derivation of the law for the task-based constrained contro**l action u(**__q__) is straightforward using above (co)-frame fields, flat metric, A(__q__) and G(__q__). A further exposition on this will be quite lengthy and out of scope; it will be a topic in one of our reports on renormalized dynamic scale space paradigms.

Because even after retaining a more proper formulation of the equations of motion and interaction dynamics for a robot there still will be some flaws in the underlying physics model (e.g. neglecting internal friction of robot joints or kinematic constraints to avoid self-collision). Therefore, a more pragmatic approach is proposed in which a deep reinforcement learning neural network agent recommends the robot to perform control action A(q) given q and the robot provides the actual observed state q_obs exerted and recorded actual control A_obs by joint actuators, etc as feedback to the agent. This way the agent can embody, sustain and evolve in a natural manner the real robot control structures for motion and interaction dynamics into its neural network <A_obs(q_obs)>.

**Conclusion**

Khatib and Udwadia-Kalaba framework for deriving the equation of motion of constrained robot systems allows incorporating by design the various robot work and (goal) task space or configuration and dynamic interaction constraints without taking refuge to e.g. the use of Lagrange parameters that merely increases model complexity and ambiguities. But such frameworks become quite cumbersome or even impossible for unpredictable robot-environment motion and interaction dynamics: one cannot know in advance or foretell what the task-based constrained contro**l action would or should be. Nevertheless, it may be worthwhile to derive the proper work and **task-based constrained contro**l action dynamics equations in particular for changes in intrinsic properties of robot, end-effector and attached part of an object to be manipulated or internal friction or contact dynamics (symbolic software packages to generate these equations as above can be used to this end).**

Here instead, the use of modern geometric reward functions is advocated to make operational (sometimes rapidly changing and unpredictable) real-virtual operational work and task spaces for deep reinforcement learning, as dynamic charge overlays of the robot with end-effector and object motions as well as interaction dynamics. They form upon deep reinforcement learning optimized guidance fields for the neural networks / agents to explore and find on/off-line optimized sequences of control actions given the observations at hand for diverse robotic applications.

In the next phase of the ACROBA project the challenge is to specify and implement proper dynamic reward functions and therewith control actions such that the agents can acquire optimized (constraint and task-based) skills for the various uses cases, such as accurately grasping and placing PTHs on a PCB in the right most time-efficient order, rapidly deburring flashes from a container lid and avoiding collisions with human operator while performing simultaneously different tasks.

A direction for future research and development could be the generalization of the concept of repulsive-attractive motion and interaction dynamics for robots defined by the reward function for static positive and negative charges towards robot motion and interaction dynamics defined by reward functions for moving charges. Such charges could create besides electric fields A_1(q_1, t) also magnetic fields B_1(q_1, t) = v_1(q_1, t) X A_1(q_1, t). As known from physics the Lorentz force F_12(q_2,t) that is exerted by a charge N(q_1, t) (inducing fields A_1 and B_1) on another charge N(q_2, t) moving with velocity v(q_2, t) is simply given by:

F_12(q_2,t) = N_2(t) (A_1(q_2, t) + v(q_2, t) X B_1(q_2, t))

This concept together with those for defect dynamics based on dislocation and disinclination fields will open a whole realm of colorful new topics in true active robotics in which for instance non-trivial operations such as stitching and cutting has to be performed in collaboration with a surgeon and operation room assistants.

(Udwadia-Kalaba 1996) Udwadia, F. E. & Kalaba, R. E. 1996 Analytical dynamics: a new approach. Cambridge, UK: Cambridge University Press.

(Khatib 1987) O Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal of Robotics and Automation, 3(1):43â€“53, 1987.

## Comentarios