Enhancing People Decisions through Learning Based Planning

Deep reinforcement learning supports decisions in business processes to shorten cycle times

February 11, 2026

Jeroen Middelhuis shows how deep reinforcement learning improves staffing decisions in business processes. Shorter cycle times help organizations cope with staff scarcity, demand peaks and queues, and deliver services faster.

Image by AndreyPopov on iStock

Jeroen Middelhuis, PhD researcher with the Information Systems research group, defended his work on February 10. At the Department of Industrial Engineering and Innovation Sciences he shows how to deploy people and resources more intelligently so cycle times drop and services respond faster.

Why now

Many organizations face the same pressures. Staff is scarce, demand fluctuates and processes are complex. Think of online retailers during sales peaks, municipalities with long permit waits or care providers who must align rosters with triage. Middelhuis focuses on choosing the right resource-to-task assignment at each decision point so that the whole process moves forward, not just today's bottleneck.

The method

The research uses deep reinforcement learning. A decision-making agent receives the current process state, chooses an action that pairs a worker with a task and then receives a reward as feedback. By learning from experience, the agent discovers policies that minimize the average cycle time of cases in the process. The result is a data-driven way of planning that adapts to what actually happens on the floor.
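To make the state, action and reward loop concrete, here is a minimal sketch. It is not the method from the thesis: a small lookup table stands in for the deep neural network, each decision is a single step rather than part of a full process, and the task types, worker names and processing times are invented for illustration.

```python
import random

# Hypothetical average processing times in minutes per (task type, worker) pair.
PROCESSING_TIME = {
    ("invoice", "alice"): 5, ("invoice", "bob"): 9,
    ("claim", "alice"): 8, ("claim", "bob"): 4,
}
TASKS = ["invoice", "claim"]
WORKERS = ["alice", "bob"]


def train(episodes=2000, epsilon=0.1, alpha=0.1, seed=0):
    """Learn a value estimate for every (task, worker) pair from feedback."""
    rng = random.Random(seed)
    q = {(t, w): 0.0 for t in TASKS for w in WORKERS}
    for _ in range(episodes):
        task = rng.choice(TASKS)  # state: the task that is waiting
        if rng.random() < epsilon:
            worker = rng.choice(WORKERS)  # explore: try a random worker
        else:
            worker = max(WORKERS, key=lambda w: q[(task, w)])  # exploit
        # Reward: the negative of a noisy processing time, so faster is better.
        duration = PROCESSING_TIME[(task, worker)] + rng.uniform(-1, 1)
        reward = -duration
        # Move the estimate a small step toward the observed reward.
        q[(task, worker)] += alpha * (reward - q[(task, worker)])
    return q


def greedy_policy(q):
    """For each task type, pick the worker with the highest learned value."""
    return {t: max(WORKERS, key=lambda w: q[(t, w)]) for t in TASKS}


q = train()
policy = greedy_policy(q)
print(policy)
```

In this toy setting the agent ends up sending each task type to the worker who handles it fastest, without being told the processing times in advance.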

Findings

An initial study introduced a general framework to test the approach on processes inspired by real practice. The method outperformed common heuristics and existing allocation methods and proved applicable to real processes as well. This shows that reward-based learning works not only in theory but also in operations where every minute matters.

Smarter rewards

Follow-up studies improved four core components of the approach. First came a reward function that removes manual reward tuning and aligns directly with the true process objective. The learning strategy then evaluates the outcome of actions by simulating execution trajectories. On smaller processes the agent learned the optimal policy. On larger processes it matched or exceeded the best-performing benchmarks.
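The article does not spell out the reward function itself. One way a reward can line up with cycle time without hand-tuned weights, shown here purely as an assumption, is to penalize the number of open cases for every unit of time that passes. Summed over a run, that penalty equals the total cycle time exactly. The case start and end times below are made up.

```python
def total_cycle_time(cases):
    """Sum of (end - start) over all cases; each case is a (start, end) pair."""
    return sum(end - start for start, end in cases)


def accumulated_penalty(cases):
    """Sum, over each interval between events, of open cases times interval length."""
    events = sorted({t for case in cases for t in case})
    penalty = 0.0
    for t0, t1 in zip(events, events[1:]):
        # A case is open during [t0, t1] if it started by t0 and ends at or after t1.
        active = sum(1 for start, end in cases if start <= t0 and end >= t1)
        penalty += active * (t1 - t0)
    return penalty


# Hypothetical cases with their start and end times.
cases = [(0.0, 4.0), (1.0, 3.0), (2.0, 7.0)]
print(total_cycle_time(cases), accumulated_penalty(cases))
```

Both numbers come out the same, so an agent that minimizes the accumulated penalty is also minimizing total cycle time and, for a fixed number of cases, the average.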

Faster choices

Next came a redesign of the action space. Rather than learning every decision from scratch, the agent selects from a set of simple proven heuristics. At each decision it picks the rule that fits best now. This shrinks the search space and speeds up learning while the learned mix of rules outperforms any single rule on its own.
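A rough sketch of such an action space follows. The two rules shown, first in first out and shortest processing time, are common textbook heuristics chosen for illustration; the article does not say which rules the thesis uses, and the task fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    case_id: str
    arrival_time: float       # when the task entered the queue
    expected_duration: float  # estimated handling time


def fifo(queue):
    """First in, first out: pick the task that has waited longest."""
    return min(queue, key=lambda t: t.arrival_time)


def shortest_processing_time(queue):
    """Pick the task expected to finish fastest."""
    return min(queue, key=lambda t: t.expected_duration)


# The agent's action is an index into this list, not a specific task.
HEURISTICS = [fifo, shortest_processing_time]


def apply_action(action_index, queue):
    """Translate the agent's choice of rule into a concrete task to start."""
    return HEURISTICS[action_index](queue)


queue = [Task("A", 0.0, 30.0), Task("B", 2.0, 5.0), Task("C", 1.0, 12.0)]
print(apply_action(0, queue).case_id)  # rule 0: first in, first out
print(apply_action(1, queue).case_id)  # rule 1: shortest processing time
```

However long the queue gets, the agent here only ever chooses between two actions, which is what keeps the search space small.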

More context

Finally, the agent takes richer context into account. By using prefix information, meaning the sequence of steps an ongoing case has already gone through, it better predicts what is likely to happen next. In a loan application, for instance, multiple credit checks increase the chance of rejection. With that knowledge, the agent makes choices that improve overall flow.
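As an illustration of how a prefix could be turned into input for the agent, the sketch below counts how often each activity has occurred so far. This count-based encoding and the activity names are assumptions; the thesis may encode prefixes differently.

```python
from collections import Counter

# Hypothetical activities in a loan application process.
ACTIVITIES = ["submit", "credit_check", "assess", "offer"]


def encode_prefix(prefix):
    """Turn the steps taken so far into a fixed-length vector of activity counts."""
    counts = Counter(prefix)
    return [counts[activity] for activity in ACTIVITIES]


# A case that has been submitted and credit-checked twice.
features = encode_prefix(["submit", "credit_check", "credit_check"])
print(features)  # [1, 2, 0, 0]
```

The repeated credit check shows up as a 2 in the second position, a signal the agent can learn to associate with a higher chance of rejection.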

Real gains

For entrepreneurs and public services this means less waiting for customers and patients, fewer fire drills for planners and more predictability in daily operations. In practice this leads to faster customer service reply times, smoother permit handling, better deployment of lab staff or care teams and more calm in teams that switch context every day. Middelhuis shows how learning from feedback produces planning policies that consistently beat standalone rules.

Jeroen Middelhuis defended his thesis on February 10, 2026. Title of the thesis: . Supervisors: Remco Dijkman, Ivo Adan and Zaharah Bukhsh.

Media contact

Marc Rosmalen

m.rosmalen@tue.nl
