Definition. AI in HR software refers to capabilities driven by machine learning models (predictive ML), large language models (generative AI) or autonomous reasoning systems (agentic AI), embedded within HR applications such as HRIS, talent acquisition, learning, performance or workforce analytics platforms. The category covers everything from a recommendation engine flagging at-risk employees, to a chatbot answering policy questions, to an agent autonomously screening candidates.
TL;DR
- Start with the problem, not the AI.
- Apply standard triage - hygiene, market standard, differentiating.
- Test with data, not demos.
- Govern AI on your side; vendors cannot give you governance.
- Contract every AI commitment before signature, not after.
What you’ll learn
| Start with the problem | If you’re starting from ‘we need AI’, you’re already evaluating wrong. |
| Eight questions to ask vendors | The marketing tells you what the AI does. These questions tell you what’s actually behind it. |
| Triaging AI capabilities | Most AI is market standard, not differentiating. Classify before vendors influence you. |
| Value-driven decision making | Without a value case, AI is unaffordable at scale. |
| Data tests | The single most useful evaluation instrument for AI. Design to neutralise vendor pushback. |
| AI governance | A buyer-side responsibility. Vendors can’t give you governance. |
| Capture commitments contractually | If it isn’t in the contract, it doesn’t exist. |
| Platform lock-in | AI deepens lock-in. Contract for the exit while you have leverage. |
| After go-live | Evaluation doesn’t stop at signature. Measure post-launch. |
On this page
- Start with the problem
- What AI in HR software means
- The vendor sales paradox
- Eight questions to ask vendors
- Triaging AI capabilities
- The six capability areas
- Value-driven decision making
- Data tests
- POCs and pilots
- AI governance
- Employee buy-in and change
- Regulatory snapshot
- Capture commitments contractually
- Platform lock-in
- How this fits the method
- After go-live: measurement
- FAQ
Start with the problem, not the AI
The most common mistake I see buyers make in 2026 is the same mistake they were making in 2018, just with louder marketing: letting vendor capabilities define the solution. AI amplifies the problem because the technology is novel, the demos are compelling and the temptation to anchor a procurement around ‘we need AI’ is strong.
In my book I describe the three root causes of HR technology failure: the technology solves the wrong problem, the wrong vendor is chosen and the implementation is botched. An UNLEASH study found that 42% of HR tech implementations had failed or underperformed two years after installation [1], and PwC reports that 36% of buyers are likely to switch vendors at contract renewal [2]. AI doesn’t change those patterns. If anything, it accelerates them. A vendor with an impressive AI demo and a thin product can take a buyer further off course, faster, than a vendor with no AI at all.
The right starting point is design thinking: empathising with users, defining the real problem, ideating solutions and only then asking whether AI has a useful role to play. Phase A of the selection process (‘Know What You Want’) is the first defence against AI-led procurement.
Once you’ve decided you need AI, that decision sits inside a problem you can measure success against. You no longer evaluate ‘AI capability’ in the abstract. You evaluate whether each vendor’s AI solves your problem better than the alternatives, including doing nothing.
What “AI in HR software” actually means
The term is used loosely. For evaluation purposes, it helps to distinguish three categories:
Predictive machine learning. Models trained on historical data to score, classify, or predict. Examples: attrition risk scoring, candidate match scoring, anomaly detection in payroll. Mature technology, well-understood evaluation methods.
Generative AI. Large language models producing text, summaries, or structured outputs. Examples: chatbot interfaces, automated job description generation, performance review summaries, conversational policy lookup. Rapidly evolving; evaluation methods less standardised.
Agentic AI. Systems that combine generative models with tools and the ability to take actions autonomously. Examples: an agent that screens candidates, schedules interviews and drafts rejection emails. The newest category; vendor offerings range from real to vapourware.
Autonomous action is qualitatively different from generative assistance. Agentic systems introduce challenges that don’t arise with summarisation or recommendation: orchestration complexity, auditability of chained decisions, prompt drift over time, tool-permission and security boundaries, and failure containment when the agent acts on incorrect reasoning. In HR specifically, autonomous action creates exposure to employment law risk, discrimination claims, opaque accountability and reputational damage. I’d treat agentic AI claims with substantially more scrutiny than other AI categories, particularly any agent that makes or executes decisions affecting employment, pay or progression.
Each category has different risk profiles, different evaluation techniques and different regulatory implications. Most ‘AI’ labels in HR vendor marketing today cover predictive ML or generative AI; agentic AI claims warrant the most scrutiny.
You’ll also encounter two architectural patterns: embedded AI (the vendor’s own model, trained on their data, integrated natively into the product) and bolt-on AI (a third-party model, typically from OpenAI, Anthropic or Google, accessed via API). Neither is better in the abstract, but the questions you ask vary by pattern. A bolt-on solution means you’re also evaluating the vendor’s API supplier, indirectly.
The vendor sales paradox in the age of AI
In the preface of my book I quote an industry veteran: ‘If you understand nothing else when selecting software, understand that software vendors are incented to say "yes". Very few will flat-out lie to you; however, if there is any possibility of a "yes", you will get a "yes".’ That observation has aged well. Vendor sales teams are selected for their skill at reading buyers, not for domain expertise. Apparently, Salesforce discovered that their best salespeople come from the car sales world. I’ve sat through hundreds of vendor demos and witnessed many very polished presentations. In the AI era, this dynamic intensifies.
‘AI-washing’ is the practice of presenting non-AI capabilities as AI, or basic AI as proprietary advanced AI. It takes several common forms:
- Rules-based automation rebadged as AI (‘AI-driven workflow’)
- Generic LLM wrappers presented as proprietary models
- Roadmap features demonstrated as if they were shipped product
- ‘AI-ready’ platforms with no AI features actually deployed
- Demos that rely on cherry-picked data the buyer can’t reproduce
The asymmetry is greater than with traditional features. With a workflow engine, you can usually tell from a demo whether the capability exists. With AI, the technology obscures rather than reveals: a confident answer from a chatbot doesn’t tell you whether the underlying model is good, accurate or even consistent.
The defences are the same as for any other capability, but applied more rigorously: insist on shipped product, demonstrated against your data, with commitments written into the contract.
Eight questions to ask an HR tech vendor about their AI
The marketing will tell you what the AI does. These questions tell you what’s actually behind it.
- What kind of AI is it: predictive, generative or agentic? Vague answers (‘it’s AI-powered’) are themselves diagnostic. You’re listening for a specific category, because each carries a different risk profile and needs a different evaluation approach.
- Is it shipped today, or still on the roadmap? You want a release version and the date the feature went live. A feature that is ‘coming in the next major release’ should not count towards your score, and it is not a contractual commitment.
- Is the model their own, or a third-party API? A bolt-on model means you’re indirectly evaluating the API supplier too: OpenAI, Anthropic, Google, AWS. Neither pattern is better in the abstract, but the questions you ask differ.
- What is it trained on? Ask about the source data, whether your data is used for training, opt-out rights and how often the model is retrained. The default position to insist on is ‘not on our data without explicit opt-in’.
- How is bias monitored? A documented process, measured metrics, a named accountable owner and the frequency of audit. ‘We don’t see bias’ is not an answer.
- Can it explain the basis for its recommendations? Human-readable rationales surfaced in the user interface, not just a confidence score. If the AI can’t show its working, the manager using it can’t defend the output.
- What happens when it gets things wrong? An escalation path, an override mechanism, a support process and a realistic resolution time. Worth asking before signing, not after the first incident.
- Can you test it using your own data? You are listening for ‘yes’. Anything else is your test result.
Triaging AI capabilities: hygiene, market standard or differentiating?
Buyers routinely over-score AI as a category. They treat it as inherently differentiating when, in most cases, it isn’t. In my book I argue strongly for triaging requirements into three groups, each treated very differently during evaluation. The same discipline applies to AI capabilities.
‘Hygiene’ requirements are pass/fail. For AI specifically, hygiene includes: data residency and sovereignty, bias governance and audit, model risk classification, GDPR compliance for automated decision-making (Article 22), security of training and inference data and EU AI Act conformity for high-risk HR systems. Fail any of these and the vendor is eliminated regardless of other factors.
‘Market standard’ AI capabilities are what leading vendors routinely offer: intelligent search, summarisation, basic conversational interfaces, predictive scoring on common HR signals. Many first-generation AI features are commoditising at feature level, but execution quality, integration depth and operational usefulness still vary materially between vendors. Two vendors may both claim ‘AI summarisation’ while one is transformative and the other is barely usable. Confirm market-standard features exist, then assess execution quality without treating them as differentiators.
‘Differentiating’ AI capabilities are where scoring matters. These are the AI features that, for your specific problem and context, deliver materially different value across vendors. Examples vary by use case but might include: domain-specific model accuracy on your data, agentic workflow that genuinely removes a process step, integration of AI with your data lake or a unique training approach.
The discipline is in correctly classifying. Most vendor AI marketing positions market-standard features as differentiating. Do the classification yourself, before vendors get to influence you.
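The triage can be sketched as a two-stage evaluation: hygiene acts as a gate, market-standard features are only checked for presence, and only differentiating capabilities are scored. This is a minimal illustration, not a scoring standard. The class, the requirement names, the weights and the 0-5 scale are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VendorAssessment:
    name: str
    hygiene: Dict[str, bool]          # pass/fail per hygiene requirement
    market_standard: Dict[str, bool]  # confirmed present or not; never scored
    differentiating: Dict[str, int] = field(default_factory=dict)  # 0-5 (assumed scale)


def missing_market_standard(vendor: VendorAssessment) -> List[str]:
    """Market-standard features the vendor could not evidence."""
    return sorted(name for name, present in vendor.market_standard.items() if not present)


def differentiator_score(vendor: VendorAssessment, weights: Dict[str, float]) -> Optional[float]:
    """Weighted score over differentiating capabilities, or None if eliminated."""
    # Hygiene is pass/fail: a single failure eliminates the vendor.
    if not all(vendor.hygiene.values()):
        return None
    total_weight = sum(weights.values())
    weighted = sum(w * vendor.differentiating.get(cap, 0) for cap, w in weights.items())
    return weighted / total_weight


# Hypothetical weights agreed before any vendor contact.
weights = {"accuracy_on_our_data": 3.0, "agentic_step_removed": 1.0}
vendor_a = VendorAssessment(
    name="Vendor A",
    hygiene={"data_residency": True, "bias_audit": True},
    market_standard={"summarisation": True, "intelligent_search": False},
    differentiating={"accuracy_on_our_data": 4, "agentic_step_removed": 2},
)
print(differentiator_score(vendor_a, weights), missing_market_standard(vendor_a))
```

Fixing the weights before vendors present is the point of the design: a vendor cannot promote a market-standard feature into the scored set after the fact.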
Evaluating AI across the six capability areas
In my book I describe six capability areas every HR tech evaluation should cover. AI sits inside this structure, not alongside it.
Functional. What problem does the AI solve, and does it solve it in your context? Functional evaluation for AI starts with use case fit. A model that achieves 92% accuracy in the vendor’s benchmark may collapse to 60% on your data. Ask: what is the AI’s job, and how will we know it’s done it?
User experience. How does AI surface in the flow of work? Is it explainable to the user? What is the human override path? An AI feature that doesn’t show users why it made a recommendation won’t be trusted, and untrusted features don’t get used.
Technical. What model is it: proprietary, fine-tuned or third-party? Where does inference happen? What customer data is used for training versus inference? What integration patterns are supported? For bolt-on AI, the technical evaluation extends to the API supplier’s terms.
Service delivery. How often is the model updated? Are you notified before changes? What happens when the AI gets it wrong, and what is the support escalation path? Model regressions are a real risk: the AI that worked for you in month one may behave differently in month nine.
Commercial. AI introduces pricing models that didn’t exist five years ago: token-based, query-based, agent-based, capacity-based. Understand the unit economics. A pilot priced at ‘free for the first 10,000 queries’ becomes a different conversation at scale.
Implementation. What does ‘go-live with AI’ actually mean? Many AI features require customer data to be useful. Plan for the data preparation, governance setup and bias monitoring required before the AI delivers value. Implementation readiness work should begin during Phase C, not after contract signature.
Value-driven decision making
Evaluation only matters if you can value what you’re evaluating. The right vendor is the one that drives the best return, not the one with the highest score on a generic capability matrix and not the one with the most impressive AI. Without a defensible value case, scoring becomes subjective and high-cost AI features get over-rewarded simply because they’re visible.
A value driver tree is the structuring instrument I recommend, mapping strategic objectives, benefits, value drivers, metrics and solution capabilities. Built properly in Phase A, it tells you which AI capabilities matter to your value case, what targets they need to hit and which vendors deliver them.
AI consumption pricing sharpens this discipline considerably. Token-based, query-based or agent-based pricing introduces variable costs that scale with adoption: precisely the scenario your value case has to anticipate. A vendor priced at ‘free for the first 10,000 queries’ looks rather different at 100,000. If your value driver tree doesn’t connect AI usage to business benefit, you can’t reason about whether consumption costs are justified. The flip side is that consumption metrics also make benefit attribution easier: if you can count queries, you can count value per query. Both halves of the ROI calculation become more measurable.
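The arithmetic behind that comparison is simple enough to sketch. The free allowance matches the example above, but the price per query and the monthly benefit figure are hypothetical numbers chosen only to show the shape of the calculation.

```python
def monthly_ai_cost(queries: int, free_queries: int, price_per_query: float) -> float:
    """Cost of a consumption-priced AI feature once the free allowance is used up."""
    billable = max(0, queries - free_queries)
    return billable * price_per_query


def value_per_query(monthly_benefit: float, queries: int) -> float:
    """Benefit attributed to each query, for comparison with the price per query."""
    return monthly_benefit / queries if queries else 0.0


# Hypothetical figures: 10,000 free queries, then 0.25 per query.
for volume in (10_000, 100_000):
    print(volume, monthly_ai_cost(volume, free_queries=10_000, price_per_query=0.25))

# Hypothetical benefit taken from the value driver tree: 50,000 per month.
print(value_per_query(monthly_benefit=50_000, queries=100_000))
```

If value per query falls below price per query at the adoption level you expect, the value case does not hold at scale, however good the pilot looked.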
The decisive technique: data tests
AI based on machine learning is what’s often called a ‘black box’. Unlike the rule-based systems we’re used to, where a response can be traced through a series of if-then rules, ML outputs are difficult to predict. Anthropic’s own research on large language models (Tracing the thoughts of a large language model) acknowledges that the internal mechanisms remain stubbornly opaque even to the people who built them. Generative AI can compound this with hallucinations: incorrect answers delivered with great confidence. For HR, where you might one day need to explain to a rejected candidate or a passed-over employee why the AI said no, that opacity matters. ‘The AI said no. We’re not entirely sure why, but we trust it’ isn’t a conversation that goes well.
Data tests are the most useful way around the black box. You give each vendor the same known dataset, ask them to process it with their AI and compare the outputs across vendors against criteria defined in advance. Better than vendor demos. More revealing than RFP responses. More practical than POCs in early evaluation.
Designing a good data test:
- Use anonymised employee data from your own organisation where possible. Synthetic data is acceptable when real data can’t be shared, but tests less faithfully.
- Use the same dataset across all vendors. Without identical inputs you can’t compare outputs; vendor-supplied data favours the vendor.
- Include edge cases deliberately. Unusual job titles, payroll exceptions, ambiguous policy questions, non-linear careers. Edge cases reveal model brittleness.
- Try to break the system. Throw bad data, unexpected formats and edge-of-policy scenarios at the AI to see how it handles them. The point isn’t only what the AI does on a good day; it’s what it does when conditions are imperfect.
- Probe for bias. Include cases designed to surface bias across protected characteristics, where you can do so lawfully.
- Evaluate accuracy and explanation. Score not only whether the AI got it right, but whether it can tell you why.
- Run identical tests repeatedly. Generative AI can produce different outputs from identical inputs. For candidate scoring or policy interpretation, output instability is itself a governance issue.
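The last point, repeat-run stability, can be measured rather than eyeballed. The sketch below assumes the outputs are categorical (for example ‘shortlist’ or ‘reject’). Free-text outputs would need normalising or human grading first. The function names and case IDs are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def agreement_rate(outputs: List[str]) -> float:
    """Share of repeated runs that returned the most common output for one input."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def unstable_cases(runs: Dict[str, List[str]], threshold: float = 1.0) -> List[str]:
    """Test-case IDs whose agreement rate falls below the threshold."""
    return sorted(case for case, outs in runs.items() if agreement_rate(outs) < threshold)


# Five identical submissions per case (hypothetical results).
runs = {
    "cand-001": ["shortlist"] * 5,
    "cand-002": ["shortlist", "reject", "shortlist", "shortlist", "reject"],
}
print(unstable_cases(runs))
```

A threshold of 1.0 flags any disagreement at all. Whether some lower tolerance is acceptable is a governance decision for the buyer, not something the code can settle.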
A short worked example. For an AI candidate shortlisting tool: anonymise 200 historic applications across the diversity you actually hire for, run them through every shortlisted vendor with the same job spec, and score the outputs on quality of shortlisting, demographic disparity and the explanations the AI gives for each recommendation. The same shape of test works for policy chatbots (real user questions versus your authoritative source), payroll anomaly detection (known-good and known-bad records), or any AI feature with measurable outputs.
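For the demographic disparity part of that score, one common measure is the selection-rate ratio between groups. The sketch below uses it as an assumed stand-in, since the specific disparity metric is a choice for you and your counsel. The group labels, counts and 0.8 flag level (the ‘four-fifths’ rule of thumb from US selection guidance) are illustrative, not a legal test.

```python
from typing import Dict, List, Tuple


def selection_rates(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """records: one (group label, shortlisted?) pair per application."""
    totals: Dict[str, int] = {}
    selected: Dict[str, int] = {}
    for group, shortlisted in records:
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + int(shortlisted)
    return {g: selected[g] / totals[g] for g in totals}


def impact_ratios(rates: Dict[str, float]) -> Dict[str, float]:
    """Each group's selection rate divided by the highest group's rate."""
    top = max(rates.values())
    if top == 0:
        raise ValueError("no group has any selections")
    return {g: r / top for g, r in rates.items()}


def flagged_groups(ratios: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Groups whose impact ratio falls below the (assumed) threshold."""
    return sorted(g for g, r in ratios.items() if r < threshold)


# Hypothetical outcome of one vendor's shortlisting on the 200 applications.
records = ([("group_a", True)] * 40 + [("group_a", False)] * 60
           + [("group_b", True)] * 20 + [("group_b", False)] * 80)
print(flagged_groups(impact_ratios(selection_rates(records))))
```

Running the same calculation on every vendor’s output for the same 200 applications gives a like-for-like disparity comparison, which is the part a demo cannot show.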
Used well, data tests cut weeks out of evaluation and remove the influence of vendor demo polish.
Designing around vendor pushback
Vendors push back on data tests, and not always unreasonably. They’re protecting IP exposure, sales cycle timing and demo control. The trick is to design the test in a way that neutralises their concerns: narrow the scope (two scenarios, four hours of their time), use their own sandbox or pre-prod environment with synthetic or anonymised data, set the test as an RFP entry condition rather than a late ask and pre-clear data handling via NDA. Phase the depth as well: scripted demos for all vendors, data tests for the shortlist, POCs for the preferred vendor only. Each stage costs vendors more, so only the serious survive. Position the test as standard enterprise AI procurement practice, not a bespoke favour.
If a vendor refuses every form of data validation, that’s information about the vendor. Treat it as a material evaluation risk unless they offer a credible alternative: a structured reference visit, a controlled customer pilot, a documented third-party audit. Customer references who ran their own validation are a workable substitute when direct testing fails: less rigorous, but better than the demo alone.
Supporting evidence for validation, audit and testing
Buyer-side AI data tests are rarely public - enterprise procurement is confidential by nature. But the principle is well-supported. Regulators mandate it (NYC’s bias audits), courts are testing it (the EEOC’s iTutorGroup settlement, Mobley v. Workday), standards bodies endorse it (NIST AI RMF, GSA AI Buying Guide) and major employers practise it (Amazon’s well-known decision to scrap a biased recruiting AI).
None of this is a clean precedent for buyer-side testing during selection. It supports the principle, not the specific practice. Worth flagging that an NYC-style bias audit is a narrow statistical exercise on hiring outcomes, not full validation: proving an AI is fair, accurate and suitable for your use case is broader work. The absence of public enterprise case studies is itself a finding: buyers who run rigorous tests don’t publicise them.
POCs and pilots for immature AI features
For genuinely novel AI capabilities, agentic workflows in particular, data tests may not be enough. The next step is a proof of concept (POC) or pilot.
A POC is a cut-down version of the solution, with limited configuration and test data, running outside production. It lets buyers experience the AI hands-on with their own people and processes, without committing to deployment.
A pilot is a cut-down version of the production system, with real users, live data and some integrations. Pilots are typically run with one vendor only and follow vendor selection.
Both are time and resource intensive, and both carry a specific trap: POCs that drift into production without proper due diligence on hygiene requirements. I’ve seen this happen more than once. If you run a POC, run it deliberately, with success criteria, a clear end date and a decision rule that returns you to the formal selection process at the end.
AI governance: a buyer-side responsibility
Responsible AI isn’t something the vendor delivers. It’s something the buyer governs. Organisations using AI in HR should develop governance, policies and guardrails specific to HR applications, ideally before vendor selection.
At minimum, your AI governance should cover:
- Model risk classification. Which AI use cases are high-risk (recruiting decisions, performance management) versus lower-risk (intelligent search, summarisation)? Different risk tiers warrant different controls.
- Human-in-the-loop policy. For which AI outputs is human review required before action? Meaningful oversight means a human actually reviews and can change the AI’s output. Rubber-stamp oversight, sometimes called ‘human on the loop’, is a compliance fiction. Define the role, the training and the conditions under which the human should intervene or pause the AI.
- Bias monitoring. How is bias measured in production, how often and who is accountable when it is found?
- Accessibility and disability. Does the AI work for users with disabilities? Hiring tools that filter on video, voice, written speed or other modality can systematically disadvantage candidates with disabilities. NYC-style demographic audits won’t catch this. Test it separately.
- Escalation paths. When the AI gets it wrong, where does the case go and how is it resolved?
- Employee opt-out and transparency. How are employees informed that AI is being used in decisions that affect them? Where applicable, how do they opt out?
- Model change management. How do you handle vendor-side model updates that may change AI behaviour mid-contract?
- Record keeping and audit logs. What does the AI log when it makes a decision: inputs, outputs, model version, human review status? You need enough audit trail to reconstruct any decision under challenge. Treat audit logging as a selection criterion, not a post-go-live afterthought.
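To make the audit-log item concrete when writing it into requirements, here is a minimal sketch of the record a decision would need to leave behind. The field names follow the list above (inputs, outputs, model version, human review status). The class name, the example values and the JSON-lines format are assumptions, not any vendor’s schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AIDecisionRecord:
    decision_id: str
    timestamp_utc: str
    model_version: str
    inputs: dict
    output: dict
    human_review_status: str       # assumed values: "pending", "accepted", "overridden"
    human_reviewer: Optional[str]  # None until a person has reviewed the decision


def to_log_line(record: AIDecisionRecord) -> str:
    """Serialise one decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AIDecisionRecord(
    decision_id="dec-0001",  # hypothetical identifiers and values throughout
    timestamp_utc=datetime.now(timezone.utc).isoformat(),
    model_version="2026.1",
    inputs={"application_id": "app-123", "job_spec_id": "job-9"},
    output={"recommendation": "shortlist", "rationale": "meets required skills"},
    human_review_status="pending",
    human_reviewer=None,
)
print(to_log_line(record))
```

In a data test or demo, ask the vendor to produce the equivalent record for one decision. If any of these fields cannot be retrieved, the decision cannot be reconstructed under challenge.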
This list assumes an organisation with the capability to design and operate these controls. In practice, most HR functions, procurement teams and legal departments are still building their AI maturity. Acknowledging this honestly is part of buyer-side governance. Many organisations will need additional support, whether through internal AI committees, external counsel or specialist advisors, to establish proportionate controls. The governance challenge is organisational as much as technical.
One adjacent risk worth naming: shadow AI. Even with a fully-governed AI platform in place, managers and HR teams use ChatGPT, Copilot or Gemini outside the platform - to summarise CVs, draft policy responses, score candidates informally. Governance has to extend to this too, or it isn’t governance, it’s wishful thinking.
Governance work belongs in Phase A and Phase E of the SelectionWise method. Define it before procurement, and have it operational before the AI goes live. Vendors can support governance with tooling and evidence. They cannot own governance accountability on your behalf.
Employee buy-in, adoption and change management
AI adoption in HR is as much a change management challenge as a technology one. Many AI failures in HR won’t be technical failures. They’ll be buy-in failures, adoption failures, cultural failures or industrial relations failures. The risk concentrates in the use cases with the highest stakes for individual employees: performance management, internal mobility, workforce planning and recruitment scoring.
Stakeholder management here is broader than the buying team. Ask: can the AI explain its outputs in language an affected employee would understand and accept? What happens when an employee challenges an AI recommendation? Where there are unions or works councils, have they been engaged on the proposed use cases? In some jurisdictions (notably EU markets) consultation can be a legal precondition before deployment, not just good practice. Build the consultation timeline into your selection plan.
Perceived fairness matters as much as measured fairness. An AI tool that’s technically unbiased but feels opaque to employees will erode trust and harm adoption. Build employee transparency into the selection criteria, not as a compliance afterthought.
Regulatory snapshot
AI in HR sits in a regulated and fast-moving space. The most significant frameworks for buyers to be aware of include:
- EU AI Act. Many HR uses of AI - recruitment, performance management, workforce decisions - are classified as high-risk and carry obligations on both the vendors building the AI and the organisations deploying it.
- GDPR / UK GDPR. Restrictions on solely automated decisions that significantly affect individuals continue to apply, subject to defined exceptions and safeguards.
- US bias laws. New York City’s bias-audit requirements, Colorado’s evolving AI law, and several state and federal frameworks are increasingly being applied to AI-driven hiring tools. Mobley v. Workday is testing vendor liability for embedded AI in the courts.
- Sector regulators. Bodies like the UK ICO, EHRC and equivalents elsewhere publish guidance that affects how AI in HR is governed in practice.
The picture changes quickly. Treat this as a prompt to seek specialist counsel for your jurisdiction and use cases, not as a current legal position.
Capture AI commitments contractually
I’ve written elsewhere about how much of what gets demonstrated and promised during a sales process has no contractual force: presales information is normally deemed inadmissible, and vendors typically resist incorporating RFP responses as binding. For AI capabilities specifically, this is a particularly expensive gap.
AI-specific commitments to capture in the contract:
- Shipped capability definition. What AI features are shipped today, scored in evaluation and included in the price.
- Model update controls. Notification before material model changes, the right to test before activation, and rollback rights if a new version regresses on accuracy, bias or behaviour. Without these, the AI you contracted for may quietly become a different system mid-contract.
- Performance commitments. Where the AI was scored on accuracy or bias metrics in evaluation, capture target performance and remedies for material regression.
- Data usage rights. Whether your data may be used for vendor model training (default should be no, with opt-in).
- Opt-out provisions. Right to disable AI features and revert to non-AI processing without penalty.
- AI roadmap commitments. Where roadmap features factored into the selection decision, get them written into the contract with delivery dates and remedies.
Don’t accept ‘we’ll send you a notice’ as a substitute for contractual commitments. AI moves quickly. Contracts last five years.
AI and platform lock-in
Modern HR platforms are no longer single applications. They’ve become data layer, workflow layer, AI layer and orchestration layer combined. AI accelerates the depth of lock-in because adoption embeds the platform into daily operational behaviour in ways that previous SaaS lock-in didn’t.
Switching costs increase as the following accumulate inside a platform: configured prompts and prompt libraries, custom automations, AI workflow chains, embedded copilots, agent permissions, training feedback loops and the muscle memory of users who’ve learned the AI’s quirks. Replacing the platform replaces all of it.
In contract negotiation, push for portability commitments specific to AI:
- Prompt and configuration ownership. Custom prompts, prompt libraries and AI workflow configurations are your IP, exportable at contract end in a usable format.
- Data exportability. Including training feedback data, AI interaction logs and audit trails - not just the underlying HR records.
- Hyperscaler dependency disclosure. Where the vendor’s AI relies on a third-party model (OpenAI, Anthropic, Google, AWS Bedrock), understand the contractual exposure if those relationships change.
- Transition assistance. AI-specific transition assistance at contract end, not generic SaaS exit.
Buyers who treat AI as a product feature rather than an embedded layer will be surprised by switching costs in five years. The discipline now is to contract for the exit while you still have leverage.
How this fits the SelectionWise method
AI evaluation runs across the full SelectionWise lifecycle. The toolkit provides the templates, checklists and AI accelerators to operationalise it at each phase.
A quick note on AI on both sides of the table. AI isn’t just what you’re buying; it can also be a tool that helps you buy well. It can generate value driver trees, draft RFP documents from your requirements, analyse vendor responses and summarise reference calls. AI evaluation and AI-assisted evaluation are two sides of the same selection.
- Phase A - Know What You Want. Define the problem first; decide whether AI is part of the solution second. A clear solution definition and a value driver tree anchor the AI question in business value.
- Phase B - Selection Preparation. Apply triage to AI capabilities. Build the AI-specific evaluation list. A structured requirements triage and an evaluation list make this concrete.
- Phase C - Vendor Selection. Scripted demos, data tests, hygiene assessment, reference checks. Data tests in particular carry the most weight for AI features.
- Phase D - Implementation Partner Selection. Your SI needs AI implementation experience, not just product experience. Ask differently.
- Phase E - Readiness & Contracting. AI governance operational. Contractual commitments captured. Implementation readiness assessed. Contracts signed only when ready to implement.
After go-live: operational measurement
AI evaluation doesn’t end at contract signature. The framework needs operational measurement to confirm that the AI is delivering the value case and behaving as expected in production. Set these up before go-live, not after.
- Adoption rates. Are users actually engaging with the AI features, by user group, location and use case?
- Override and acceptance rates. How often do users follow the AI’s recommendation? Patterns by population segment matter as much as the headline number.
- False positive and false negative tracking. Particularly in scoring and screening use cases. Measure both, not just accuracy in aggregate.
- Output variance. For generative features, sample outputs over time to detect drift or regression following vendor model updates.
- Escalation frequency. How often does the AI escalate to a human, and what is the human response pattern? Spikes indicate model issues; flat lines indicate the AI is being trusted blindly.
- Realised value against the value case. Quarterly review against the value driver tree. AI that does not move the metrics it was selected for is a sunk cost in disguise.
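Two of these measures, override rates and false positive/negative tracking, reduce to simple counts once human-reviewed outcomes are logged. The sketch below uses hypothetical figures to show why aggregate accuracy alone can mislead. The function names and input shapes are assumptions.

```python
from typing import Dict, List, Tuple


def screening_metrics(outcomes: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """outcomes: one (ai_flagged, actually_positive) pair per reviewed case."""
    tp = sum(1 for ai, truth in outcomes if ai and truth)
    fp = sum(1 for ai, truth in outcomes if ai and not truth)
    fn = sum(1 for ai, truth in outcomes if not ai and truth)
    tn = sum(1 for ai, truth in outcomes if not ai and not truth)
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "accuracy": (tp + tn) / len(outcomes) if outcomes else 0.0,
    }


def override_rate(overridden: List[bool]) -> float:
    """overridden: True wherever a human changed the AI's recommendation."""
    return sum(overridden) / len(overridden) if overridden else 0.0


# Hypothetical month: 100 reviewed cases, only 8 of them true positives.
outcomes = ([(False, False)] * 90 + [(False, True)] * 5
            + [(True, True)] * 3 + [(True, False)] * 2)
print(screening_metrics(outcomes))
```

In this made-up sample the AI is right on 93 of 100 cases yet misses five of the eight cases that mattered, which is why both error rates need tracking alongside the headline number. Computing the same metrics per population segment follows the same pattern with the outcomes filtered by group.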
Build the measurement plan during Phase B. The metrics you need post go-live are the metrics you should be using to evaluate vendors during selection.