
AI Agent Bites #3 – Test Task Submission AI Evaluator for HR

The problem we solve

Evaluating task submissions—especially when they involve multiple components and nuanced criteria—can be slow, inconsistent, and manually exhausting. Without a structured system, grading often becomes subjective and difficult to scale.

To bring structure, speed, and consistency to this process, we built a custom grading engine using Hunch Tools as the foundation. Third-party hiring tools didn’t quite offer the flexibility or depth we needed, so we took matters into our own hands.

How it works

Context for LLM

  • Grading materials – Rubric, benchmark examples, and the task document
  • Submission files – Slideshow and video transcript
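
To make the context bundle concrete, here is a minimal Python sketch of what the agents receive. The class, folder, and file names are hypothetical placeholders for illustration, not the workflow's actual inputs:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GradingContext:
    """The full context handed to every evaluator agent (illustrative)."""
    rubric: str              # grading criteria and score scale
    benchmark_examples: str  # reference submissions with known grades
    task_document: str       # the original task brief
    slideshow_text: str      # extracted text of the submitted slides
    video_transcript: str    # transcript of the submission video

def load_context(folder: str = "submission") -> GradingContext:
    """File names below are hypothetical, not the real workflow inputs."""
    read = lambda name: (Path(folder) / name).read_text(encoding="utf-8")
    return GradingContext(
        rubric=read("rubric.md"),
        benchmark_examples=read("benchmarks.md"),
        task_document=read("task.md"),
        slideshow_text=read("slides.txt"),
        video_transcript=read("transcript.txt"),
    )
```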

Processing

The submission enters a multi-step evaluation pipeline in Hunch Tools, powered by a blend of eight AI agents. Here’s how it’s structured:

  • Multi-Agent Evaluator – Built on Gemini 2.5 Pro, this system uses 5 agents to independently score each criterion. Their scores are then aggregated to deliver a balanced, transparent evaluation
  • Single-Agent Evaluators – 3 standalone LLMs (Gemini 2.5 Pro, GPT-4.1, and GPT-3.5-turbo) assess the full submission holistically. These provide broad, top-level insights to complement the detailed criteria scores (see the sketch below)

All models run within a single Hunch Tools workflow, making multi-agent orchestration seamless and accessible.
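
Hunch Tools handles this orchestration visually on a canvas, so there is no code to lift from the workflow itself. Purely as a sketch of the same logic in plain Python, assuming a hypothetical call_llm(model, prompt) helper and illustrative criterion names:

```python
import json
import statistics

# Illustrative criterion names; the real rubric's criteria differ.
CRITERIA = ["structure", "analysis", "presentation"]
HOLISTIC_MODELS = ["gemini-2.5-pro", "gpt-4.1", "gpt-3.5-turbo"]

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper; in the real setup each call is a node in Hunch Tools."""
    raise NotImplementedError

def multi_agent_scores(submission: str, rubric: str, n_agents: int = 5) -> dict:
    """5 independent agents score every criterion; scores are averaged per criterion."""
    runs = []
    for _ in range(n_agents):
        reply = call_llm(
            "gemini-2.5-pro",
            f"Rubric:\n{rubric}\n\nSubmission:\n{submission}\n\n"
            f"Score each criterion in {CRITERIA} from 1 to 10. "
            "Reply with JSON mapping criterion to score.",
        )
        runs.append(json.loads(reply))
    return {c: statistics.mean(run[c] for run in runs) for c in CRITERIA}

def holistic_reviews(submission: str, rubric: str) -> dict:
    """3 standalone models each assess the whole submission."""
    prompt = f"Rubric:\n{rubric}\n\nAssess this submission holistically:\n{submission}"
    return {model: call_llm(model, prompt) for model in HOLISTIC_MODELS}
```

In the actual workflow, each of these model calls is simply a node on the Hunch canvas, and the aggregation is another node downstream.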

Output

Each submission produces 4 outputs:

  • An aggregated, criterion-by-criterion evaluation from the multi-agent system
  • 3 complete submission assessments from individual LLMs

All results are rendered on the canvas interface via the “output” feature, which lets specific nodes be marked as outputs. This plays into a broader convenience of Hunch: any workflow (canvas) can be converted into a “tool”, effectively a shareable interface for running the whole workflow in the most accessible way.
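
Continuing the sketch from the processing step, the 4 results could be merged into a single report before being surfaced through output nodes; this format is purely illustrative:

```python
def build_report(agg_scores: dict, reviews: dict) -> str:
    """Merge the aggregated per-criterion scores and the three holistic
    assessments into one report (format is illustrative only)."""
    lines = ["Evaluation report", "", "Aggregated criterion scores:"]
    lines += [f"  {criterion}: {score:.1f}/10" for criterion, score in agg_scores.items()]
    for model, review in reviews.items():
        lines += ["", f"Holistic assessment ({model}):", review]
    return "\n".join(lines)
```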

Example walkthrough

[Screenshot: our Hunch Tools workflow]
[Screenshot: the tool interface, shareable via link]
[Screenshot: the outputs]

Impact

The evaluation process now takes minutes, without sacrificing quality or depth. It ensures fairness, reduces workload, and makes grading high-volume submissions scalable.

Built with

  • Hunch Tools – Workflow engine and logic
  • Gemini 2.5 Pro – Criteria-based agent evaluator + holistic scorer
  • GPT-4.1 – Holistic evaluator
  • GPT-3.5-turbo – Holistic evaluator

Complexity

  • Low

Looking to leverage modern AI tools within your company? Get in touch and explore next steps.
