Masatoshi Uehara

Biography (Google Scholar, CV, GitHub, LinkedIn)

I am a Member of the Technical Staff (research) at OpenAI, where my current mission is to develop next-generation LLMs that can accelerate scientific discovery in the life sciences. Previously, I worked at Genentech and EvolutionaryScale (now part of Biohub).

My personal email address is AB136@gmail.com (A=uehara, B=masatoshi). I no longer use the email address I had at Cornell. I am from Japan, and my name in kanji (漢字) is 上原 (= Uehara = Shàng-yuán) 雅俊 (= Masatoshi = Yǎ-jùn). I received my Ph.D. from the computer science department at Cornell University, where I studied from 2020 to 2023.

Services

NeurIPS Area Chair (2024-), ICML Area Chair (2024-), ICLR Area Chair (2026-), ALT Area Chair (2024-)
Organizer: “AI for New Drug Modalities” at NeurIPS 2024

I am giving invited talks at the following events.

Exploration in AI Today Workshop at ICML, Vancouver, July 2025
GenU, Copenhagen, September 2025

Research Interests (outdated: written before I joined OpenAI)

1. Multimodal generative models for scientific discovery, with an emphasis on diffusion and language models for drug design.
2. Test-time reinforcement learning and search methods for controllable diffusion models.
3. RL-based post-training for diffusion and discrete generative models.
4. Reinforcement learning for robotics.

I gave a 30-minute ICML workshop talk summarizing (2) and (3). My Ph.D. research focused on the algorithmic foundations of reinforcement learning.

Reward-Guided Generation in Diffusion Models: Toward Programmable Protein Design

The speaker frames molecular design as sampling from a distribution combining naturalness and functionality, then focuses on reward-guided control in diffusion models. SVDD defines value functions...
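
As a toy illustration of the framing in the talk summary above (sampling from a distribution that combines naturalness and functionality), the following minimal Python sketch draws candidates from a stand-in "natural" prior and resamples one of them with probability proportional to exp(reward / alpha). Everything in it is an assumption made for illustration: the uniform DNA prior, the GC-count reward, the temperature `alpha`, and all function names are hypothetical. It is not the SVDD algorithm from the talk, only a simple importance-resampling sketch of the same idea.

```python
import math
import random


def reward_tilted_resample(candidates, reward, alpha, rng):
    """Pick one candidate with probability proportional to exp(reward / alpha)."""
    scores = [reward(x) / alpha for x in candidates]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def sample_prior(rng, length=8):
    """Hypothetical stand-in for a pretrained model: uniform random DNA."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def gc_reward(seq):
    """Hypothetical reward: number of G or C letters in the sequence."""
    return sum(ch in "GC" for ch in seq)


rng = random.Random(0)
candidates = [sample_prior(rng) for _ in range(64)]
picked = reward_tilted_resample(candidates, gc_reward, alpha=0.5, rng=rng)
```

In this sketch, a smaller `alpha` puts more weight on the reward, while a larger `alpha` keeps the choice closer to the prior samples.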

1. Large-scale multimodal generative (diffusion and language) models for science

2. Inference-Time Alignment through RL and Search in Diffusion Models.

Summarized in

Uehara, M., Zhao, Y., Wang, C., Li, X., Regev, A., Levine, S., & Biancalani, T. (2025). Inference-Time Alignment in Diffusion Models with Reward-Guided Generation: Tutorial and Review. arXiv preprint arXiv:2501.09685.

Representative work

Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Aviv Regev, Sergey Levine, Masatoshi Uehara (*). Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding. NeurIPS, 2025

3. RL-Based Post-Training in Diffusion/Language (Discrete Diffusion) Models

Summarized in

Masatoshi Uehara (*), Yulai Zhao (*), Tommaso Biancalani, Sergey Levine. Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review. arXiv preprint arXiv:2407.13734

Representative works

Masatoshi Uehara (*), Yulai Zhao (*), Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso Biancalani. Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models. NeurIPS, 2024

Chenyu Wang (*), Masatoshi Uehara (*), Yichun He, Amy Wang, Tommaso Biancalani, Avantika Lal, Tommi Jaakkola, Sergey Levine, Hanchen Wang, and Aviv Regev. Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design. ICLR, 2025

Masatoshi Uehara (*), Yulai Zhao (*), Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani (*), Sergey Levine (*). Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control R&R for JMLR

4. Reinforcement Learning (Offline RL, RL + Representation Learning, Imitation Learning)

Jonathan D Chang (*), Masatoshi Uehara (*), Dhruv Sreenivas, Rahul Kidambi, and Wen Sun. Mitigating covariate shift in imitation learning via offline data without great coverage. NeurIPS, 2021. (Code)

Masatoshi Uehara and Wen Sun. Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage. ICLR, 2022. Presented at RL THEORY VIRTUAL SEMINAR 2021. (Talk Slide)

Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in low-rank MDPs. ICLR (Spotlight), 2022. Oral Paper in Ecological Theory of Reinforcement Learning Workshop at NeurIPS. (Talk Slide)

Nathan Kallus (*) and Masatoshi Uehara (*). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research and ICML, 2020. (Code)

Research During Ph.D.

Sample-Efficient Offline Policy Evaluation: Doubly robust and semiparametrically efficient estimators (1,2), incorporating deep neural networks with minimax loss (1,2). Slide is here.
Causal Inference + RL: Accounting for Unmeasured Confounders ([1], [2]), Data combination
Instrumental Variable Regression with Deep NNs (for Causal Inference): Semiparametric IV methods to estimate functionals without identification (+ how to perform inference), Nonparametric IV methods without identification. Slide is here.
Sample-Efficient RL in POMDPs: OPE with general function approximation, Online RL with general function approximation, Computationally and statistically efficient PAC RL methods. Slide is here.

Publication

Red means I am the co-first/corresponding author. Blue means alphabetical order following the convention. The other papers follow the contribution-based ordering.

Conference Proceedings

Xiner Li, Masatoshi Uehara, Xingyu Su, Gabriele Scalia, and Shuiwang Ji. A joint diffusion model with pre-trained priors for RNA sequence–structure co-design. ICLR, 2026.

Xingyu Su (*), Xiner Li (*), Masatoshi Uehara (*), Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji. Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models ICLR, 2026

Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Shuiwang Ji, Aviv Regev, Sergey Levine, Masatoshi Uehara (*) Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding NeurIPS, 2025

Masatoshi Uehara (*), Xingyu Su (*), Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, Tommaso Biancalani. Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design ICML, 2025

Yulai Zhao, Masatoshi Uehara (*), Gabriele Scalia, Sunyuan Kung, Tommaso Biancalani, Sergey Levine, and Ehsan Hajiramezanali. Adding conditional control to diffusion models with reinforcement learning. ICLR, 2025

Chenyu Wang (*), Masatoshi Uehara (*), Yichun He, Amy Wang, Tommaso Biancalani, Avantika Lal, Tommi Jaakkola, Sergey Levine, Hanchen Wang, and Aviv Regev. Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design. ICLR, 2025

Masatoshi Uehara (*), Yulai Zhao (*), Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso Biancalani. Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models. NeurIPS, 2024

Masatoshi Uehara (*), Yulai Zhao (*), Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Sergey Levine (*), Tommaso Biancalani (*) Feedback Efficient Online Fine-Tuning of Diffusion Models ICML, 2024

Jakub Grudzien Kuba, Masatoshi Uehara, Pieter Abbeel, and Sergey Levine. Functional Graphical Models: Structure Enables Offline Data-Driven Optimization. AISTATS, 2024

Wenhao Zhan (*), Masatoshi Uehara (*), Nathan Kallus, Jason D. Lee, Wen Sun. Provable Offline Reinforcement Learning with Human Feedback. ICLR (Spotlight), 2024

Wenhao Zhan, Masatoshi Uehara, Wen Sun, and Jason D. Lee. How to Query Human Feedback Efficiently in RL? ICLR (Spotlight), 2024

Masatoshi Uehara (*), Haruka Kiyohara (*), Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, and Wen Sun. Future-Dependent Value-Based Off-Policy Evaluation in POMDPs. NeurIPS 2023 (Spotlight). (SLIDE)

Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun. Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage. NeurIPS 2023.

Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1154-1163, 2023.

Runzhe Wu, Masatoshi Uehara, and Wen Sun. Distributional offline policy evaluation with predictive error guarantees. ICML, 2023. (Code)

Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun. Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings. ICML 2023 (arXiv preprint arXiv:2206.12081).

Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, and Masatoshi Uehara. Minimax Instrumental Variable Regression and L2 Convergence Guarantees without Identification or Closedness. COLT 2023.

Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, and Masatoshi Uehara. Inference on strongly identified functionals of weakly identified functions. COLT 2023.

Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee. PAC Reinforcement Learning for Predictive State Representations ICLR 2023

Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun. Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems. NeurIPS 2022.

Chengchun Shi, Masatoshi Uehara, Jiawei Huang, and Nan Jiang. A minimax learning approach to off-policy evaluation in partially observable Markov decision processes. ICML (Long presentation), 2022. (Slide Code)

Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Wen Sun, and Alekh Agarwal. Efficient reinforcement learning in block MDPs: A model-free representation learning approach. ICML 2022. Presented at RL THEORY VIRTUAL SEMINAR 2021 by Xuezhou. (Code)

Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in low-rank MDPs. ICLR (Spotlight), 2022. Oral Paper in Ecological Theory of Reinforcement Learning Workshop at NeurIPS. (Talk Slide)

Masatoshi Uehara and Wen Sun. Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage. ICLR, 2022. Presented at RL THEORY VIRTUAL SEMINAR 2021. (Talk Slide)

Jonathan D Chang (*), Masatoshi Uehara (*), Dhruv Sreenivas, Rahul Kidambi, and Wen Sun. Mitigating covariate shift in imitation learning via offline data without great coverage. NeurIPS, 2021. (Code)

Nathan Kallus, Yuta Saito, and Masatoshi Uehara. Optimal off-policy evaluation from multiple logging policies. ICML, 2021. (Code)

Yichun Hu, Nathan Kallus, and Masatoshi Uehara. Fast rates for the regret of offline reinforcement learning. COLT, 2021. Presented at RL THEORY VIRTUAL SEMINAR 2021/11/26 by Yichun. (“Minor Revision” requested from Mathematics of Operations Research)

Masatoshi Uehara (*), Masahiro Kato (*), and Shota Yasui. Off-policy evaluation and learning for external validity under a covariate shift. NeurIPS (Spotlight), 2020. (Talk Code)

Nathan Kallus and Masatoshi Uehara. Doubly robust off-policy value and gradient estimation for deterministic policies. NeurIPS, 2020. (Talk )

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. ICML, 2020. (Code)

Nathan Kallus and Masatoshi Uehara. Statistically efficient off-policy policy gradients. ICML, 2020.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient and robust off-policy evaluation ICML, 2020. (Code )

Masatoshi Uehara, Takeru Matsuda, and Jae Kwang Kim. Imputation estimators for unnormalized models with missing data. AISTATS, 2020.

Masatoshi Uehara, Takafumi Kanamori, Takashi Takenouchi, and Takeru Matsuda. Unified estimation framework for unnormalized models with statistical efficiency. AISTATS, 2020.

Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. NeurIPS, 2019.

Journal Articles

Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, and Masatoshi Uehara. Inference on Strongly Identified Functionals of Weakly Identified Functions. Journal of the Royal Statistical Society Series B: Statistical Methodology (JRSSB), 2025

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. An overview of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355 (Statistical Science, 2025+)

Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Localized debiased machine learning: Efficient estimation of quantile treatment effects, conditional value at risk, and beyond. Journal of Machine Learning Research, 2023

Nathan Kallus and Masatoshi Uehara. Efficient evaluation of natural stochastic policies in offline reinforcement learning. Biometrika, 2023

Masatoshi Uehara, Danhyang Lee, and Jae Kwang Kim. Semiparametric response model with nonignorable nonresponse. Scandinavian Journal of Statistics, 2023

Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. Operations Research, 2021. (The version at INFORMS is here, but it contains several typos.)

Takeru Matsuda, Masatoshi Uehara, and Aapo Hyvarinen. Information criteria for non-normalized models. Journal of Machine Learning Research, 2021.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research and ICML, 2020. (Code)

Unpublished Articles Under Revision

Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021. (Rejection with Resubmission from Annals of Statistics) (SLIDE )

Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Causal inference under unmeasured confounding with negative controls: A minimax learning approach. arXiv preprint arXiv:2103.14029, 2021. (Revision from Journal of Machine Learning Research)

Recent Working Drafts

Masatoshi Uehara (*), Yulai Zhao (*), Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani (*), Sergey Levine (*). Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control arXiv preprint arXiv:2402.15194, 2024

Masatoshi Uehara (*), Yulai Zhao (*), Tommaso Biancalani, Sergey Levine. Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review. arXiv preprint arXiv:2407.13734

Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara. Source Condition Double Robust Inference on Functionals of Inverse Problems arXiv preprint arXiv:2208.08291

Zihao Li (*), Hui Lan (*), Vasilis Syrgkanis, Mengdi Wang, Masatoshi Uehara. Regularized DeepIV with Model Selection arXiv preprint arXiv:2403.04236, 2024

Tutorials/Talks

About Dynamic treatment regime

About the Deep CREST Meeting (深層CRESTミーティング)