
Testing inexplicable fuzzy AI systems requires adapting our testing approaches and methods.

What challenges are you facing? Where do you struggle? What strategies have you found effective? Share your challenges, experiences, fears, and insights.

Hello @BenSimo,

 

Thanks so much for this question. Before going ahead, I'd like to say that the session was indeed informative and useful. One thing stuck with me: technology comes and goes, so don't let the magic of the tech distract you from the testing. To the point.

 

I've been exposed to a couple of chat and Q&A based applications (similar to ChatGPT) in our firm.

 

For most of these applications we use Azure OpenAI services rather than OpenAI directly, for privacy reasons. I found that using RAG here made more sense, and I can even say it has reduced hallucinations considerably.

For those who don't know RAG, Ben has explained it very well :)

For chat-based applications, most of our testers come from a functional testing background, as this is a new technology for them. We can't do word-for-word comparison; instead we need to do semantic comparison against metrics and assign a score. We used the RAGAS framework with its standard metrics and added some custom metrics as well, to make sure we are keeping this Responsible (AI). It works well.

Now we are planning to automate this using some of these frameworks (RAGAS, DeepEval, etc.) via Python, alongside the human evaluation that is definitely still needed.
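For anyone curious what the scoring looks like in code, here is a minimal sketch of a semantic comparison check. It is not our actual RAGAS setup, just an illustration using the sentence-transformers package, an arbitrary embedding model, and a made-up threshold:

```python
# Minimal sketch of a semantic-comparison check (illustration, not a production RAGAS setup).
# Assumes the sentence-transformers package; model choice and threshold are arbitrary.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

def semantic_score(llm_answer: str, ground_truth: str) -> float:
    """Cosine similarity between the LLM answer and the expected answer."""
    embeddings = model.encode([llm_answer, ground_truth], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

score = semantic_score(
    "The invoice must be approved by a manager before payment.",
    "Payments require prior managerial approval of the invoice.",
)
print(f"semantic similarity: {score:.2f}")
assert score >= 0.75, "Answer drifted too far from the ground truth"  # threshold is a team decision
```

The actual metrics and thresholds should come from the team and the business goals, not from the tool.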

These are automated checks, but as you say, human evaluation is needed now more than ever.

 

Fear: 

As of now we compare LLM responses against ground truths/expected answers, and it's going well, but we might need to automate this as well. For example, any admin can upload their own documents and users should be able to ask any question. How can we make sure that a document has been ingested successfully? Do we need to update our tests periodically? And one more fear: if the answer is derived from a couple of documents, how exactly can we prepare the ground truths dynamically?
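One idea we are exploring for the ingestion question is a small smoke test: after a document is uploaded, ask a few canary questions that can only be answered from that document and check that the response cites it. This is just a rough sketch; upload_document and ask are placeholders for whatever the real application exposes:

```python
# Hypothetical ingestion smoke test. upload_document() and ask() stand in for the
# real application's API; the canary questions and file names are made up.
CANARIES = {
    "leave-policy.pdf": [
        ("How many days of annual leave do employees get?", "leave-policy.pdf"),
        ("Who approves unpaid leave requests?", "leave-policy.pdf"),
    ],
}

def check_ingestion(upload_document, ask) -> list[str]:
    """Upload each document, ask its canary questions, and flag missing citations."""
    failures = []
    for doc, questions in CANARIES.items():
        upload_document(doc)
        for question, expected_source in questions:
            response = ask(question)  # assumed to return {"answer": str, "sources": [str]}
            if expected_source not in response.get("sources", []):
                failures.append(f"{doc}: '{question}' was not answered from the document")
    return failures
```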

 

Insights:

Previously we didn't use the references concept (the way perplexity.ai shows references for generated content), and at that time human evaluation was difficult. It has become somewhat easier now that we requested this functionality, and it works well.

Don't assume everyone knows AI. Many think it's just another web app and that the process will be the same. It might be a web app, but testing it is definitely different from normal web app testing. You need to stretch your limits and ask different questions. I even used Dev Tools and asked the developers why we were using extreme values for parameters such as temperature, and what would happen if we reduced them and tested. I also asked why we were not getting the 6th file even though semantic search works; that depends on the top K setting.
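To make those parameter questions concrete, here is a rough sketch of how the same question can be re-run across temperature and retrieval top K settings for side-by-side review. The endpoint, deployment name, and the search_chunks retriever are placeholders, not our real setup:

```python
# Sketch of probing two knobs testers should ask about: generation temperature and
# retrieval top K. Endpoint, deployment name, and search_chunks() are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def answer(question: str, context_chunks: list[str], temperature: float) -> str:
    """Ask the deployed chat model to answer only from the retrieved context."""
    system_prompt = "Answer only from the context below.\n\n" + "\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="my-gpt-4o-deployment",      # Azure deployment name (placeholder)
        temperature=temperature,           # lower values give more deterministic answers
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

def probe(question: str, search_chunks, temperatures=(0.0, 0.7), top_ks=(3, 6)):
    """Re-run the same question across settings. search_chunks(query, top_k) is a
    placeholder for the application's retriever (the 'missing 6th file' knob)."""
    for temperature in temperatures:
        for top_k in top_ks:
            chunks = search_chunks(question, top_k=top_k)
            print(temperature, top_k, answer(question, chunks, temperature))
```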

 

One strategy I found effective depends on your use case: you need to decide how many previous turns of context the chat application should include when answering the next query. If it's pure Q&A (one question, one answer) you don't need it; for a chat you do, but use it carefully, because it consumes a lot of tokens and that directly costs us.
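A minimal sketch of what I mean by limiting previous context (the number of turns is an arbitrary assumption; tune it to your use case and token budget):

```python
# Keep only the system prompt plus the last N turns of history so token usage
# (and cost) stays bounded. MAX_TURNS = 3 is an arbitrary assumption.
MAX_TURNS = 3  # one turn = one user message + one assistant message

def trim_history(messages: list[dict]) -> list[dict]:
    """messages uses the usual chat format: [{"role": ..., "content": ...}, ...]"""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    return system + dialogue[-2 * MAX_TURNS:]  # keep only the last N user/assistant pairs
```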

Looking forward to seeing more responses to this question and learning more.

 

Thanks to the entire @ShifSync team - @Mustafa and @Kat.


Thank you☺️, Ben Simo, for this great webinar! It really made me think about AI testing and the challenges that come with it.

 

One of the biggest challenges I face is understanding AI’s decision-making. Unlike traditional software, AI models, especially deep learning ones, don’t always give clear reasons for their outputs. This makes debugging and validation tough. Another major issue is bias in training data. If the data is not diverse enough, the AI can produce unfair results, which can be hard to detect. AI models also change over time, which means they need constant monitoring to avoid performance issues.

 

One of my biggest fears is hidden biases and unintended consequences. A small flaw in training data can lead to real-world problems, sometimes only discovered after deployment. That’s why testing AI requires a different mindset.

 

Some things that help me are using explainability tools like SHAP and LIME to understand model behavior, adversarial testing to see how the AI reacts to tricky inputs, and continuous monitoring to catch unexpected changes early. AI testing is constantly evolving, and we need to adapt to keep up.
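For anyone who has not tried SHAP yet, here is a minimal sketch on a plain scikit-learn model. A regression example keeps the output shapes simple, and the dataset is only for illustration:

```python
# Minimal SHAP sketch on a scikit-learn model (assumes the shap and scikit-learn packages).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.Explainer(model)       # dispatches to a tree explainer for this model
explanation = explainer(X.iloc[:100])   # SHAP values for a sample of rows

# Rank features by mean absolute contribution to the predictions
importance = abs(explanation.values).mean(axis=0)
for name, value in sorted(zip(X.columns, importance), key=lambda pair: -pair[1]):
    print(f"{name}: {value:.3f}")
```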

 

Thanks again for the session, Ben Simo.☺️ I really learned a lot.

 

 


In my role as a QA engineer, I focus on testing AI features and integrations within our product, rather than full AI systems. This means ensuring that AI-driven components work as expected while maintaining trustworthiness—validity, reliability, security, resilience, and fairness. Unlike traditional software, AI doesn’t always produce the same output for the same input, making reproducibility and validation complex. Instead of a simple pass/fail approach, we evaluate efficacy & safety to ensure AI meets business needs and ethical considerations.

 

One of the biggest challenges is explainability. AI decisions can feel like a black box, making it difficult to verify correctness or debug issues. If users can’t trust or understand why AI behaves a certain way, it impacts adoption and compliance. This was also discussed in the webinar—how AI must be interpretable and privacy-enhanced to be considered trustworthy.

 

Another major challenge is generative AI hallucinations. Sometimes, AI confidently generates misleading or incorrect outputs. As mentioned in the webinar, not every unexpected response is a defect, but we still need to assess its impact. We use techniques like benchmarking models to compare performance over time and ensure continuous improvement without regressions.
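As a rough illustration of that benchmarking idea (the file name, metrics, and tolerance below are assumptions, not our real pipeline), a regression gate can be as simple as comparing new scores against a stored baseline:

```python
# Sketch of a regression gate between model or prompt versions: compare new benchmark
# scores against a stored baseline and flag any metric that dropped beyond a tolerance.
import json

TOLERANCE = 0.02  # allow small noise before calling it a regression (assumption)

def check_regression(baseline_path: str, current_scores: dict[str, float]) -> list[str]:
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"faithfulness": 0.91, "relevancy": 0.88}
    regressions = []
    for metric, old in baseline.items():
        new = current_scores.get(metric, 0.0)
        if new < old - TOLERANCE:
            regressions.append(f"{metric}: {old:.2f} -> {new:.2f}")
    return regressions
```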

 

To manage risks effectively, we follow an approach similar to the risk matrix discussed in the webinar, evaluating likelihood, severity, and the level of autonomy AI has in decision-making. Maintaining an AI risk repository helps us track known issues, biases, and performance shifts over time. Since AI behavior changes dynamically, we emphasize continuous monitoring and involve human testers where needed to validate nuanced AI outputs.
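A toy illustration of that risk-matrix idea follows; the scales and thresholds are my own assumptions, not the exact matrix from the webinar:

```python
# Toy risk-matrix scoring: rate each AI feature on likelihood, severity, and level of
# autonomy (1-5 each) and bucket the product. Scales and thresholds are assumptions.
def risk_level(likelihood: int, severity: int, autonomy: int) -> str:
    score = likelihood * severity * autonomy  # ranges from 1 to 125
    if score >= 60:
        return "high - needs human-in-the-loop review before release"
    if score >= 20:
        return "medium - monitor continuously, track in the risk repository"
    return "low - periodic spot checks"

print(risk_level(likelihood=3, severity=4, autonomy=5))  # -> high
```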

 

The key takeaway from both my experience and the webinar is that AI doesn’t have to be flawless to be valuable—it just needs to be explainable, fair, and reliable in real-world use. The focus isn’t just on accuracy but on ensuring AI-driven features perform consistently, minimize risks, and enhance user trust.


Thank you, Ben, for providing detailed knowledge about Generative AI and its related use cases in the world of testing.

Testing Generative AI systems / LLM models is a unique experience in itself. In our product we also use Azure OpenAI Services to suit our project requirements. The challenges we faced were related to data explosion: the LLM created duplicate entries for the same set of entities, leading to inaccurate information, and we discovered this quite late in our “AI Test Strategy” because of ineffective training data. It required a complete rework. The greatest fear here for a QA or a PO is deciding when the AI system is going to meet user expectations; it is quite complicated to set entry and exit criteria efficiently.

In my experience, what worked best (and would work best) is a complete study of positive and negative comparisons against your legacy implementation before you think of a replacement. Do not aim for 100% accuracy and efficiency; decide your own scores and benchmarks as per your business goals. Perform testing on varied sets of data, and repetitive testing on the same set of data, checking positive and negative variance in each case. Get insights on usage of the AI system from internal users who are not involved in developing the AI application. As a QA, do not report your AI test metrics in solid figures only; talk about comparative analysis and the percentage rise in benefits the AI application is going to bring to the table.
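As a rough sketch of the "repetitive testing on the same data" point, the spread across repeated runs can be reported alongside the mean. Here score_run is a placeholder for whatever metric your team uses (semantic similarity, RAGAS scores, human ratings):

```python
# Sketch: run the same input several times, score each run, and report the spread.
# score_run(question) is a placeholder for the team's scoring function.
from statistics import mean, stdev

def variance_report(question: str, score_run, runs: int = 5) -> dict:
    scores = [score_run(question) for _ in range(runs)]
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if runs > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```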

Extremely elated to be part of such a fruitful session and conversation.

Thanks again!

Shivi Malviya



Thank you, Ben, for sharing such in-depth insights on Generative AI and its various applications in the testing domain. Your knowledge truly added great value!

Challenges in Testing AI Systems

  1. Unpredictable Results – AI can give different answers for small changes in input, making it hard to predict outcomes.

  2. Difficult to Understand – Many AI models work like a "black box," meaning it's hard to see how they make decisions.

  3. Always Changing – AI models keep learning and updating, so they need to be tested regularly.

  4. Data Issues – The quality of AI decisions depends on the data, and even small errors in data can cause big problems.

  5. No Fixed Answers – Unlike regular software, AI can give different results for the same input, making testing harder.

  6. Unexpected Failures – AI can fail in rare situations or when given tricky inputs, so testing for edge cases is important.

  7. No Standard Rules – There are no clear testing rules for AI, so teams must create their own methods.

  8. Fairness & Rules – AI must follow ethical guidelines and avoid bias, which adds extra testing challenges.

Ways to Test AI More Effectively

  1. Tricky Input Testing – Give AI difficult inputs on purpose to find weak spots (see the sketch after this list).

  2. Better Data – Use high-quality, well-balanced data to improve AI accuracy.

  3. Test Regularly – Keep testing AI as it changes over time.

  4. AI for Testing AI – Use AI-based tools to help find mistakes and patterns.

  5. Mix Human & AI Testing – Combine human thinking with automation to catch more issues.

  6. Extreme Condition Testing – Check how AI performs under unusual situations.

  7. Backup & Version Control – Keep track of AI updates and have a backup plan if something goes wrong.

  8. Check for Fairness – Regularly review AI to make sure it's not biased.

  9. Real-World Simulations – Test AI in real-like scenarios to see how well it works in practical situations.
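As a small sketch of point 1 above (tricky input testing), one option is to feed paraphrased and noisy variants of the same question and check that the answers stay semantically consistent. Here ask() is a placeholder for the system under test, and the model and threshold are assumptions:

```python
# Sketch of perturbation testing: answers to benign variants of a question should stay
# close to the answer for the original question. ask() is a placeholder for the system
# under test; the embedding model and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_check(ask, question: str, variants: list[str], threshold: float = 0.7) -> list[str]:
    """Flag variants whose answer drifts away from the answer to the original question."""
    reference = embedder.encode(ask(question))
    issues = []
    for variant in variants:
        answer = embedder.encode(ask(variant))
        similarity = util.cos_sim(reference, answer).item()
        if similarity < threshold:
            issues.append(f"{variant!r}: similarity {similarity:.2f}")
    return issues

# Example variants: typos, odd casing, and a paraphrase of the same intent
variants = [
    "how do i RESET my pa ssword??",
    "Forgot my password, what now?",
]
```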

