Dandan Guo received her Ph.D. from Xidian University in 2020. Since then, she has been a postdoctoral researcher at the Institute of Robotics and Intelligent Manufacturing (IRIM) and the School of Data Science, The Chinese University of Hong Kong, Shenzhen, advised by Professor Hongyuan Zha, Executive Dean of the School of Data Science and a renowned machine learning scholar. Her main research interests lie in pattern recognition and machine learning, including probabilistic model construction and statistical inference, meta-learning, algorithmic fairness, and optimal transport theory, with applications to image generation and classification, text analysis, and natural language generation. She currently focuses on real-world problems such as few-shot classification, few-shot generation, and biased training data distributions, approaching them from the perspectives of distribution calibration, distribution fitting, and distribution matching. Her work has been published in top international machine learning conferences and journals, including NeurIPS, ICML, ICLR, IJCV, and TNNLS. She also serves as a program committee member and reviewer for venues such as ICML, NeurIPS, ICLR, JMLR, and TSP.
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated by a different behavior policy, without executing the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns or the infeasibility of exploration. Hence, it is imperative to quantify the uncertainty of the off-policy estimate before deploying the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged trajectories. Leveraging methodologies from distributionally robust optimization, we show that with a proper choice of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees under stochastic or adversarial environments. Our results also generalize to batch reinforcement learning and are supported by empirical analysis.
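To make the construction concrete, below is a minimal sketch of the kind of estimator the abstract describes, assuming per-trajectory importance-sampled returns and a KL-divergence uncertainty set of radius delta, for which the robust estimate admits a one-dimensional dual. The data layout, function names, and the specific choice of KL divergence are our illustrative assumptions, not necessarily the paper's exact formulation.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def is_returns(trajectories, gamma=0.99):
        """Per-trajectory importance-sampled cumulative rewards.

        Each trajectory is a list of (reward, target_prob, behavior_prob)
        tuples logged under the behavior policy (hypothetical layout).
        """
        vals = []
        for traj in trajectories:
            rho, ret = 1.0, 0.0
            for t, (r, p_tgt, p_beh) in enumerate(traj):
                rho *= p_tgt / p_beh          # cumulative importance ratio
                ret += (gamma ** t) * r       # discounted return
            vals.append(rho * ret)
        return np.asarray(vals)

    def dro_lower_bound(x, delta):
        """Robust lower bound on E[x] over a KL ball of radius delta
        around the empirical distribution, via the dual representation
        max_{a > 0}  -a * log mean(exp(-x / a)) - a * delta."""
        def dual(a):
            m = x.min()  # shift for numerical stability
            return m - a * np.log(np.mean(np.exp(-(x - m) / a))) - a * delta
        res = minimize_scalar(lambda a: -dual(a),
                              bounds=(1e-6, 1e3), method="bounded")
        return -res.fun

    # Robust (pessimistic) and optimistic reward estimates from logged data:
    # returns = is_returns(logged_trajectories)
    # lower = dro_lower_bound(returns, delta=0.1)
    # upper = -dro_lower_bound(-returns, delta=0.1)

Choosing delta controls the trade-off the abstract refers to: the paper's contribution is a principled selection of this uncertainty-set size so that the lower and upper estimates become valid confidence bounds.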
We consider the distributed optimization problem in which n agents, each possessing a local cost function, collaboratively minimize the average of the n cost functions over a connected network. Assuming stochastic gradient information is available, we study a distributed stochastic gradient algorithm, called Exact Diffusion with Adaptive Stepsizes (EDAS), adapted from the Exact Diffusion method and NIDS, and perform a non-asymptotic convergence analysis. We not only show that EDAS asymptotically achieves the same network-independent convergence rate as centralized stochastic gradient descent (SGD) for minimizing strongly convex and smooth objective functions, but also characterize the transient time needed for the algorithm to approach this asymptotic rate, which behaves as K_T = O(n / (1 - λ_2)), where 1 - λ_2 denotes the spectral gap of the mixing matrix. To the best of our knowledge, EDAS achieves the shortest transient time when the average of the n cost functions is strongly convex and each cost function is smooth. Numerical simulations further corroborate and strengthen the obtained theoretical results.
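For intuition, here is a minimal numpy sketch of the adapt-correct-combine recursion underlying Exact Diffusion, on which EDAS is built; a generic O(1/k) decaying stepsize stands in for the adaptive schedule analyzed in the paper, and the toy quadratic example and all names are our illustrative assumptions.

    import numpy as np

    def edas(grad, x0, W, alpha0, T, rng):
        """Sketch of Exact Diffusion with decaying stepsizes.

        grad(i, x, rng) returns a stochastic gradient of agent i's cost
        at x.  W is a doubly stochastic mixing matrix; as in Exact
        Diffusion / NIDS, agents mix with (I + W) / 2.
        """
        n, d = x0.shape
        W_bar = (np.eye(n) + W) / 2
        x, psi_prev = x0.copy(), x0.copy()
        for k in range(T):
            alpha = alpha0 / (k + 1)                   # decaying stepsize
            g = np.stack([grad(i, x[i], rng) for i in range(n)])
            psi = x - alpha * g                        # adapt: local SGD step
            phi = psi + x - psi_prev                   # correct: remove bias
            x = W_bar @ phi                            # combine: mix with neighbors
            psi_prev = psi
        return x.mean(axis=0)

    # Example: quadratic costs f_i(x) = ||x - b_i||^2 / 2 with noisy
    # gradients; the agents should agree on approximately mean(b_i).
    rng = np.random.default_rng(0)
    n, d = 8, 3
    b = rng.normal(size=(n, d))
    W = np.full((n, n), 1.0 / n)       # fully connected, for simplicity
    grad = lambda i, x, rng: (x - b[i]) + 0.1 * rng.normal(size=d)
    x_hat = edas(grad, np.zeros((n, d)), W, alpha0=0.5, T=2000, rng=rng)

The correction step is what distinguishes this family from plain diffusion/consensus SGD: it cancels the steady-state bias introduced by mixing, which is why the asymptotic rate can match centralized SGD, with the network topology entering only through the transient time K_T via the spectral gap 1 - λ_2.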