Python Implementation of HCOPE lower bound evaluation as given in the paper: Thomas, Philip S., Georgios Theocharous, and Mohammad Ghavamzadeh. "High-Confidence Off-Policy Evaluation." AAAI. 2015.
A plot of the distribution of the importance sampling ratio is also created; it illustrates the high variance of the simple IS estimator.
All the values required for off-policy estimation are initialized in the HCOPE class constructor.
Currently the estimator policy is initialized in the function setup_e_policy() by adding Gaussian noise (mean, std_dev) to the behavior policy.
The example in the paper uses policies that differ by a natural gradient step, but this works as well.
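A minimal sketch of the noise-based initialization described above, assuming the behavior policy is represented as a flat list of parameters. The helper name perturb_policy_params and the example values are hypothetical and are not taken from the repository:

```python
import random

def perturb_policy_params(behavior_params, mean=0.0, std_dev=0.1, seed=None):
    """Return a copy of behavior_params with i.i.d. Gaussian noise added.

    Hypothetical helper: the real setup_e_policy() may store and perturb
    its parameters differently.
    """
    rng = random.Random(seed)
    return [w + rng.gauss(mean, std_dev) for w in behavior_params]

# Hypothetical flat parameter vector of the behavior policy.
behavior_params = [0.5, -1.2, 0.3]
eval_params = perturb_policy_params(behavior_params, mean=0.0, std_dev=0.05, seed=0)
```

A small std_dev keeps the two policies close, which keeps the importance sampling ratios near 1; a larger value makes the ratios more heavy-tailed.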
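To estimate c*, I use the BFGS method, which does not require an analytic Hessian: it builds a Hessian approximation from successive gradients, and the gradient itself can be approximated by finite differences, so no derivatives need to be supplied by hand.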
The hcope_estimator() method also implements a sanity check by computing the discriminant of the quadratic in the confidence parameter delta. If the basic constraints are not satisfied, the program reports that the predicted bound has zero confidence.
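The lower bound and the threshold search can be sketched as follows. This is a simplified illustration, not the repository's hcope_estimator(): it uses a single common threshold c, assumes the samples are non-negative, and replaces BFGS with a plain grid search over hypothetical candidate values. It also maximises the bound directly on a small held-out split, whereas the paper optimises a prediction of the bound for the remaining data. The function names and the synthetic log-normal samples are assumptions made for the example.

```python
import math
import random

def hcope_lower_bound(xs, c, delta):
    """1 - delta confidence lower bound on the mean of non-negative samples xs,
    computed from the samples truncated at c (requires len(xs) >= 2)."""
    n = len(xs)
    ys = [min(x, c) for x in xs]
    mean_y = sum(ys) / n
    # sum over all pairs (i, j) of (y_i - y_j)^2 equals 2 * n * sum_i (y_i - mean)^2
    pair_sum = 2.0 * n * sum((y - mean_y) ** 2 for y in ys)
    log_term = math.log(2.0 / delta)
    range_penalty = 7.0 * c * log_term / (3.0 * (n - 1))
    var_penalty = math.sqrt(log_term * pair_sum / (n - 1)) / n
    return mean_y - range_penalty - var_penalty

def pick_threshold(xs_pre, delta, candidates):
    """Grid search stand-in for the BFGS step: pick the candidate c that
    maximises the bound on the held-out samples xs_pre."""
    return max(candidates, key=lambda c: hcope_lower_bound(xs_pre, c, delta))

# Synthetic heavy-tailed samples standing in for importance weighted returns.
rng = random.Random(0)
samples = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
xs_pre, xs_post = samples[:100], samples[100:]

candidates = [0.5 * k for k in range(1, 41)]
c_star = pick_threshold(xs_pre, 0.05, candidates)
bound = hcope_lower_bound(xs_post, c_star, 0.05)
```

Choosing c on xs_pre and evaluating on xs_post keeps the threshold independent of the samples that enter the final bound. A small c reduces both penalty terms but biases the truncated mean downward, which is the trade-off the search resolves.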
The random variables are computed using simple importance sampling. Per-decision importance sampling might lead to tighter bounds and remains to be explored.
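To make the difference between the two estimators concrete, here is a short sketch for a single trajectory. It assumes each step is stored as a tuple (pi_e_prob, pi_b_prob, reward), holding the probability of the taken action under the estimator and behavior policies; this layout and the function names are illustrative only.

```python
def is_return(trajectory, gamma=1.0):
    """Simple importance sampling: the whole discounted return is scaled
    by the product of all per-step probability ratios."""
    weight = 1.0
    ret = 0.0
    for t, (p_e, p_b, reward) in enumerate(trajectory):
        weight *= p_e / p_b
        ret += gamma ** t * reward
    return weight * ret

def pdis_return(trajectory, gamma=1.0):
    """Per-decision importance sampling: each reward is scaled only by
    the ratios of the actions taken up to and including that step."""
    weight = 1.0
    total = 0.0
    for t, (p_e, p_b, reward) in enumerate(trajectory):
        weight *= p_e / p_b
        total += gamma ** t * weight * reward
    return total

# Hypothetical two-step trajectory: (pi_e_prob, pi_b_prob, reward).
trajectory = [(0.5, 0.25, 1.0), (0.2, 0.4, 2.0)]
simple_estimate = is_return(trajectory)
per_decision_estimate = pdis_return(trajectory)
```

In the simple estimator every reward is multiplied by the ratios of all steps, including later ones, which is one source of its high variance on long trajectories.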
A two-layer MLP policy is used for general problems.
Paper: Safe Exploration in Continuous Action Spaces - Dalal et al.
Comparison of different importance sampling estimators:
Image is taken from the PhD thesis of P. Thomas:
The relative reachability measure
Paper: Penalizing side effects using stepwise relative reachability - Krakovna et al.