It is challenging to evaluate the results of GenAI applications because there are no expected results, only results from an acceptable range. For summarization GenAI use cases, what are the most common evaluation metrics?
- ROUGE: Measures overlap between the generated and reference summaries using n-grams and sequences. Widely used but focuses on surface-level matches.
- BLEU: Evaluates precision of n-gram overlap. Good for short texts but struggles with synonyms and paraphrasing.
- METEOR: Considers synonyms and linguistic variations, offering better semantic evaluation than BLEU.
- BERTScore: Uses embeddings to measure semantic similarity. More accurate for meaning but computationally intensive.
- QuestEval: Assesses how well the summary answers questions derived from the source. Great for faithfulness and informativeness.
- Human Evaluation: Judges fluency, relevance, conciseness, and faithfulness. Essential for nuanced insights but time-consuming.
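For example, ROUGE scores can be computed in a few lines of Python. This is a minimal sketch assuming the rouge-score package; the texts are made up.

```python
# pip install rouge-score  (assumed; Google's reference ROUGE implementation)
from rouge_score import rouge_scorer

reference = "The company reported higher quarterly profits driven by cloud revenue."
candidate = "Quarterly profits rose, driven mainly by growth in cloud revenue."

# ROUGE-1 = unigram overlap, ROUGE-L = longest common subsequence
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```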
- Define the expected type of result as a range rather than a single value.
- There is no 100% test coverage; apply risk-based testing, and human judgement needs to compensate for the missing coverage.
- Implement health checks for LLM and data drift.
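For illustration, here is a hedged sketch of one way to implement such a drift-style health check: compare a baseline distribution of summary quality scores with a recent window and alert when they diverge. It assumes scipy is available; the scores and threshold are made up.

```python
# pip install scipy  (assumed)
from scipy.stats import ks_2samp

# Hypothetical quality scores (e.g., ROUGE-L or judge ratings) per summary
baseline_scores = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.67]
current_scores  = [0.41, 0.45, 0.39, 0.48, 0.44, 0.50, 0.42, 0.46]

# Two-sample Kolmogorov-Smirnov test: do the two score distributions differ?
stat, p_value = ks_2samp(baseline_scores, current_scores)

ALERT_P = 0.05  # illustrative threshold, tune for your traffic volume
if p_value < ALERT_P:
    print(f"Possible drift detected (KS statistic={stat:.2f}, p={p_value:.3f})")
else:
    print("No significant drift in summary quality scores")
```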
Evaluating GenAI summarization involves automatic and human metrics. Common automatic metrics include ROUGE (word/phrase overlap), BLEU (n-gram precision), BERTScore (semantic similarity), and SummaC (factual consistency). These methods are fast but may overlook nuances. Human evaluation focuses on informativeness (key points captured), coherence (logical flow), fluency (grammar/readability), relevance (aligned with source), and factual accuracy (truthfulness). Combining both is common: automatic metrics for scale, human reviews for depth. Emerging trends include task-specific metrics and user feedback for iterative improvement, providing a balanced and comprehensive evaluation strategy for GenAI summarization.
Evaluate as a ratio of correct (or nearly correct) answers to the sum of correct and incorrect answers.
Look for true positive, false positive, true negative, and false negative responses, and evaluate the recall and precision of the answers.
This allows a tester, irrespective of AI expertise, to evaluate the LLM.
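As a plain-Python sketch of that ratio idea (the counts are hypothetical):

```python
# Classify each response as TP / FP / TN / FN against a human judgement,
# then compute the usual ratios.

def precision_recall(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of answers flagged correct, how many really were
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of truly correct answers, how many we caught
    accuracy = (tp + tn) / (tp + fp + tn + fn)        # correct (or nearly correct) / all answers
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

# Hypothetical counts from a review session of 50 LLM answers
print(precision_recall(tp=30, fp=5, tn=10, fn=5))
# {'precision': 0.857..., 'recall': 0.857..., 'accuracy': 0.8}
```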
It is difficult to evaluate GenAI responses because there are no pre-written conditions for marking a response as pass or fail, yet it is important to ensure the accuracy and reliability of the responses.
We can perform health checks regularly to ensure the reliability and performance of the LLM and to monitor it; many APIs are available for performing health checks on LLMs.
Also, since there is no fixed or expected output, we have to compare the actual results against the facts.
Human evaluation is the most dependable way to assess GenAI responses: human evaluators can check the responses against factual, relevant data.
The responses should be easy to read and understandable. Automation techniques such as ROUGE, BLEU, and TER can also be applied to assess the responses.
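For example, BLEU and TER can be computed automatically; this is a minimal sketch assuming the sacrebleu package, with illustrative texts.

```python
# pip install sacrebleu  (assumed)
from sacrebleu.metrics import BLEU, TER

hypotheses = ["The board approved the merger after a short meeting."]
references = [["The board approved the merger following a brief meeting."]]

bleu = BLEU()
ter = TER()

# BLEU: n-gram precision with a brevity penalty (higher is better).
# TER: edit operations needed to turn the hypothesis into the reference (lower is better).
print("BLEU:", bleu.corpus_score(hypotheses, references).score)
print("TER: ", ter.corpus_score(hypotheses, references).score)
```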
By using the RAGAS framework (Retrieval Augmented Generation Assessment), we can evaluate AI models that generate answers by retrieving relevant information. It works by comparing the AI's response to an ideal answer (ground truth) based on the context and additional retrieved information.
RAG assessment employs a set of predefined metrics to evaluate both the retrieval and generation components of the model. These metrics help ensure that the responses generated by the GenAI are high quality, relevant, accurate, coherent, contextually appropriate, complete, and safe. The key metrics used are:
1. Answer Relevancy
2. Answer Accuracy
3. Answer Completeness
4. Answer Harmfulness
Just a human responding. Did you use ChatGPT? How do you know the response is correct? :)
Evaluating summarization outputs in Generative AI (GenAI) use cases can be complex due to the absence of fixed "expected results." Instead, the evaluation relies on comparing generated summaries against acceptable benchmarks.
While doing this we can face a lot of challenges, and I think we have to adopt a pragmatic approach to address them. For example:
- Outputs can be diverse; to handle this, we can bring in human reviewers to evaluate the model's flexibility and accuracy.
- Adjust metrics to prioritize different qualities (e.g., accuracy for legal documents, readability for blog summaries).
- Use frameworks like NLGEval or EvalSumm to streamline evaluations across metrics.
By combining weighted scoring (for fluency, accuracy, flexibility), evaluating different dimensions, and benchmarking against multiple references, summarization results can be holistically assessed to ensure they meet user needs and context-specific requirements.
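As a hedged illustration of the weighted-scoring idea (the dimension names and weights are invented for the example):

```python
# Minimal sketch: combine per-dimension ratings (from human reviewers or
# automatic metrics, normalised to 0-1) into one weighted score per summary.

WEIGHTS = {"accuracy": 0.4, "fluency": 0.2, "relevance": 0.25, "conciseness": 0.15}

def weighted_score(ratings: dict[str, float]) -> float:
    """Weighted average of dimension ratings; weights reflect use-case priorities
    (e.g., raise 'accuracy' for legal documents, 'fluency' for blog summaries)."""
    return sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)

legal_summary_ratings = {"accuracy": 0.9, "fluency": 0.7, "relevance": 0.8, "conciseness": 0.6}
print(round(weighted_score(legal_summary_ratings), 3))  # 0.79
```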
Human intervention is of course needed. Using APIs is one way to do it.
I’d like to go a bit deeper on the question, rather than simply providing an answer. A metric is a measurement function, a mathematical operation by which we hang a number onto an observation. One of the important questions is to ask: how do you count to one? In qualitative research, there’s a (clunky) term for this: “operationalization”. That is, what is the operation by which you put a number on “relevancy”, “accuracy”, “completeness”, or “harmfulness”?
A description or an assessment makes sense to me in this context; numbers less so.
Evaluating GenAI summarization results requires a combination of objective metrics and human judgment since the outputs often lack a definitive "correct" answer. Here are the most common evaluation metrics used:
1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- Measures the overlap of n-grams (e.g., unigrams, bigrams) between the generated summary and a reference summary.
- ROUGE-N focuses on exact matches, while ROUGE-L evaluates sequence similarity considering sentence structure.
2. BLEU (Bilingual Evaluation Understudy):
- Originally designed for machine translation, BLEU can also be used to measure the precision of n-grams in the generated summary compared to a reference.
- While useful, it often undervalues summaries with varied but accurate phrasing.
3. BERTScore:
- Uses contextual embeddings from models like BERT to assess semantic similarity between the generated summary and reference summaries.
- It captures meaning better than surface-level metrics like ROUGE and BLEU (a sketch follows at the end of this answer).
4. Length-based Metrics:
- Evaluates whether the summary adheres to a desired length or compression ratio without losing essential information.
5. Factual Consistency Checks:
- Measures the accuracy of factual information retained from the source content using automated tools or manual verification.
- Tools like QuestEval and FactCC can assist with this.
6. Human Evaluation:
- Involves assessing summaries based on criteria such as coherence, relevance, fluency, and informativeness.
- This step ensures the summary aligns with user expectations and captures subjective nuances.
7. Novelty and Redundancy Analysis:
- Evaluates whether the summary introduces information not present in the source (novelty, a possible sign of hallucination) or repeats content excessively (redundancy).
8. Domain-specific Metrics:
- For specialized use cases (e.g., legal or medical summaries), customized metrics are designed to assess adherence to domain-specific requirements.
Using a combination of these metrics provides a balanced evaluation, ensuring both technical and user-centric quality in GenAI summarization outputs.
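Following up on the BERTScore item above, here is a minimal sketch assuming the bert-score package (it downloads a transformer model on first use); the texts are illustrative.

```python
# pip install bert-score  (assumed; downloads a transformer model on first use)
from bert_score import score

candidates = ["Profits grew last quarter thanks to strong cloud sales."]
references = ["The company reported higher quarterly profits driven by cloud revenue."]

# Precision/recall/F1 over contextual token embeddings rather than exact n-grams,
# so paraphrases score higher than they would under ROUGE or BLEU.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```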
There is no single quantitative answer to this, but we can evaluate summaries based on user needs:
- Weighted scoring (for fluency, accuracy, flexibility, non-repetitiveness)
- Evaluating different dimensions
- Benchmarking against multiple references (citations, OCR), so that summarization results can be assessed to ensure they meet user needs and context-specific requirements
In summary, we need to focus on the following components to cover all the use cases: model health checks, chatbot checks, and document-processing platforms.
Using the RAGAS framework (Retrieval Augmented Generation Assessment), we can evaluate AI models that generate summaries by retrieving and using relevant information.
The RAGAS framework compares the AI-generated summary to an ideal answer (ground truth) and assesses both the retrieval and generation components of the model. Key metrics used are:
- Answer Relevancy: Measures whether the summary focuses on the most important and relevant points from the source content.
- Answer Accuracy: Evaluates whether the summary is factually correct and free from misinformation.
- Answer Completeness: Checks whether the summary includes all the key ideas without missing critical details.
- Answer Harmfulness: Ensures the summary avoids biased, harmful, or misleading content.
This framework ensures that the AI model delivers summaries that are not only high-quality and informative but also relevant, accurate, complete, safe, and contextually appropriate. It’s an effective tool for systematically improving and evaluating GenAI summarization use cases.
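As a hedged sketch of what this looks like in practice, the ragas Python package exposes an evaluate() function plus built-in metrics such as faithfulness and answer relevancy. The exact API and column names vary between versions (this roughly follows the 0.1-era interface), the LLM-backed metrics need a judge model (e.g., an OpenAI key) configured, and the sample data is made up.

```python
# pip install ragas datasets  (assumed; LLM-backed metrics need an API key configured)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    "question": ["Summarise the key points of the incident report."],
    "answer": ["A power failure caused a two-hour outage; backups restored service."],
    "contexts": [["The outage began at 09:10 after a substation power failure.",
                  "Service was restored from backups at 11:05."]],
    "ground_truth": ["A substation power failure caused a two-hour outage, resolved via backups."],
}

# Each metric returns a score between 0 and 1 for every row in the dataset.
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)
```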
Common automatic metrics are also used:
- ROUGE: Measures overlap of words and phrases with reference summaries.
- BLEU: Checks how well the summary matches a reference using word sequences.
- BERTScore: Looks at meaning similarity using AI models.
- Human Evaluation: Judges summaries based on relevance, clarity, and informativeness.
- Factual Consistency: Ensures the summary doesn’t change facts from the original.
- BLEU (Bilingual Evaluation Understudy)
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Context/domain relevancy
- Elo rating system
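Since the Elo rating system is listed above, here is a small self-contained sketch of the standard Elo update applied to pairwise "which summary is better?" judgements; the model names and K-factor are illustrative.

```python
# Standard Elo update: expected score E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400)),
# then R_A' = R_A + K * (S_A - E_A), where S_A is 1 for a win, 0.5 for a tie, 0 for a loss.

K = 32  # update step size; a common default

def elo_update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + K * (score_a - expected_a)
    r_b_new = r_b + K * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Hypothetical ratings for two summarisers; a human judged model A's summary better.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], score_a=1.0)
print(ratings)  # {'model_a': 1516.0, 'model_b': 1484.0}
```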
- ROUGE
- BLEU
- BERTScore
- Human Evaluation
Love the RAGAS answer! Have you had success with RAGAS?
Also, look at ROUGE, BLEU, RAGAS for some quantitative answers.
I haven’t used NLGEval and EvalSumm as much. What is your experience with them?
Great answer! I hope you used an LLM to generate it :)
Here’s what I would follow for measuring responses from LLMs:
- Fact-Checking: Ensure factual accuracy by comparing outputs against reliable sources. This is critical for summarizing data or answering fact-based questions.
- Context Validation: Validate outputs for relevance and coherence, ensuring no conflation of unrelated contexts (e.g., distinguishing between "Playwright" as a test automation tool vs. a dramatist).
- Language Quality: Assess translations for fluency, adequacy, and cultural appropriateness, avoiding literal or awkward phrasing.
- Relevance and Coherence: Response should be logical, and cover key points without adding irrelevant details.
- Readability and Usability: Outputs should suit the target audience's reading level and purpose, evaluated through readability tests and user feedback (see the sketch after this list).
- Bias and Fairness: Check for neutrality and appropriate framing, especially in sensitive topics.
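As a hedged illustration of the readability point above, assuming the textstat package (the sample text is made up):

```python
# pip install textstat  (assumed)
import textstat

summary = ("The board approved the merger after reviewing the audit, "
           "citing cost savings and a stronger market position.")

# Flesch Reading Ease: higher is easier (roughly 60-70 reads as plain English).
# Flesch-Kincaid Grade: approximate US school grade needed to follow the text.
print("Reading ease:", textstat.flesch_reading_ease(summary))
print("Grade level: ", textstat.flesch_kincaid_grade(summary))
```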
I'm fortunate enough to be part of the LLM testing team. We have an in-house application developed for this, and here are the approaches I've followed:
This is not like checking whether a button is visible or not, so we have formed a different testing approach:
- Used multiple evaluation metrics such as factual accuracy, coherence, conciseness, relevance to the original text, grammatical correctness, and fullness.
- Created some expected outputs/ground truths and compared them against LLM-generated summaries.
- Based on the metrics, every tester gives a score out of 5, and we take the aggregate of this scoring as a benchmark for the next release, so that we know what we are doing.
- We also tested across different types of source content: whether our LLM can summarise data from a table, from different paragraphs, from an image, and so on.
You absolutely have to think like a human being and test in ways that reveal unexpected things.
Other than that, we have recently started a proof of concept that scores some of the above metrics using an LLM as a judge, and tested this functionality. The results are promising, but for now we treat it as an additional, safety-net layer of testing, and it will take time to trust it.
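For illustration, here is a hedged sketch of such an LLM-as-a-judge check; call_llm is a hypothetical stand-in for whatever chat-completion client is in use, and the criteria simply mirror the metrics listed above.

```python
import json

CRITERIA = ["factual accuracy", "coherence", "conciseness", "relevance", "grammar", "fullness"]

JUDGE_PROMPT = """You are grading a summary against its source text.
Score each criterion from 1 (poor) to 5 (excellent) and return JSON only,
e.g. {{"factual accuracy": 4, "coherence": 5, ...}}.

Criteria: {criteria}

Source:
{source}

Summary:
{summary}
"""

def judge_summary(source: str, summary: str, call_llm) -> dict:
    """call_llm(prompt) -> str is a hypothetical client for the judge model."""
    prompt = JUDGE_PROMPT.format(criteria=", ".join(CRITERIA), source=source, summary=summary)
    raw = call_llm(prompt)
    scores = json.loads(raw)  # in practice, validate and retry on malformed JSON
    scores["overall"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return scores
```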
Some evals/frameworks to try:
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bilingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- BERTScore
- Retrieval Augmented Generation Assessment (RAGAS)
- Custom metrics with G-Eval, where we can define exactly what we want the judge LLM to validate
And so on.
We can definitely test the summarization feature, but we also need to define when to stop testing it.
Hope that helps.
As a tester, I don’t look for correctness. I look for problems.
Something can be correct, yet there might still be a problem. For instance, consider asking a chatbot “What’s 3 + 5?” and the chatbot answering “3 + 5 is 8; what are you, some kind of idiot?”
And something can be incorrect without the incorrectness being a problem. Ask a chatbot “What’s the square root of 2?” and the chatbot answers “1.4142136”. That’s not correct; the square root of 2 is an irrational number, and the digits after the decimal point continue forever. So the answer from the chatbot is incorrect, but it’s consistent with what a human might say in reply. For the vast majority of circumstances, that answer would be sufficiently close that the imprecision is not a problem.
Good stuff, Michael. How would you do some tests for correctness with an automated tool/metric?