Why Amazon needs to change its model performance measurement metric


Evaluation is the first step in measuring the performance of a machine learning model. Many metrics can be used to evaluate an ML model, and selecting the most appropriate ones is essential to refining it.

Amazon has introduced a new metric, QUALS, to evaluate the performance of abstractive text summarization models. An upgrade over its predecessor, the new metric offers improved speed and capacity. It targets abstractive summarization, which sums up a text by automatically extracting content from the source and rephrasing it.

The problem

Since these systems are based on deep learning, they are trained to maximize the overlap between the generated summaries and the sample summaries in the model's training data. In practice, this approach is flawed for abstractive summarization, because maximizing the overlap between the generated summary and the target summary can still produce factually incorrect sentences.
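The weakness of overlap-based scoring can be illustrated with a small sketch. The unigram F1 function below is only a crude stand-in for an overlap metric such as ROUGE-1, and the three sentences are hypothetical examples, not taken from the Amazon post.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (a crude stand-in for ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: one word flips the meaning of the second one.
reference = "the champion won the fight in the fifth round"
wrong = "the champion lost the fight in the fifth round"
paraphrase = "the titleholder was victorious in round five"

score_wrong = unigram_f1(wrong, reference)
score_paraphrase = unigram_f1(paraphrase, reference)
print(f"factually wrong: {score_wrong:.3f}, correct paraphrase: {score_paraphrase:.3f}")
```

In this toy case the factually wrong sentence scores far higher than the correct paraphrase, because only one word differs from the reference.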

Amazon AI provided an example to illustrate this.

[Image: Klitschko example. Credit: Amazon AI blog post]

The solution

Conventional metrics for training abstractive summarization models do not take factual accuracy into account. To address this, Amazon introduced QUALS, a new metric for measuring the performance of abstractive summarization models. “Our metric adopts the same general strategy as the previous QAGS metric, but is 55 times faster to apply, making it more convenient for training models,” according to the Amazon blog post.


Credit: Amazon AI blog post

The image shows the architecture of the new QUALS metric (bottom) compared with the previous QAGS metric (top). QUALS has a simpler architecture, which allows it to run faster.

A comparison of models trained using both techniques showed that Amazon's approach improved on the previous top-performing models by 15% on one dataset and 2% on another.

Q&A scoring: QAGS vs QUALS

QAGS, the previous system, uses a four-step procedure to grade a textual summary: it extracts names and noun phrases from the summary, feeds each of them into a trained question-generation model, feeds the generated questions into a trained question-answering model, and compares the answers obtained from the summary with those obtained from the source text.


QAGS requires the sequential application of three neural models.

QUALS, on the other hand, reduces this sequence to a single model, which makes it 55 times faster than QAGS. QUALS stands for Question Answering with Language Model Score for Summarization.

It uses the Question-Answer Joint Generation (QAGen) model, which takes a text as input and generates question-answer pairs about it.


How QUALS works

QUALS requires a single neural model, the question-answer generation model.

The model produces 60 high-probability question-answer pairs for a given summary. According to the researchers, these pairs are selected by a search that explores varied candidates, so that the generated pairs are diverse. Pairs whose answers do not appear as word sequences in the summary are then eliminated.
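The filtering step can be sketched as follows. This is a minimal illustration that assumes a pair is kept when its answer appears as a contiguous word sequence in the summary, using simple whitespace tokenization; the function name and the sample pairs are hypothetical, and the real QAGen model is not involved.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def filter_pairs(pairs: List[QAPair], summary: str) -> List[QAPair]:
    """Keep only pairs whose answer occurs as a contiguous word sequence in the summary."""
    summary_tokens = summary.lower().split()
    kept = []
    for question, answer in pairs:
        answer_tokens = answer.lower().split()
        n = len(answer_tokens)
        # Slide a window of the answer's length across the summary tokens.
        found = n > 0 and any(
            summary_tokens[i:i + n] == answer_tokens
            for i in range(len(summary_tokens) - n + 1)
        )
        if found:
            kept.append((question, answer))
    return kept


# Hypothetical summary and generated pairs.
summary = "the champion won the fight in the fifth round"
candidates = [
    ("who won the fight", "the champion"),
    ("when did the fight end", "the fifth round"),
    ("who lost the fight", "the challenger"),
]
kept_pairs = filter_pairs(candidates, summary)
print(kept_pairs)
```

Here the third pair is dropped because "the challenger" never appears in the summary.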

How it finds factual inconsistencies

The source text behind the summary is passed to the QAGen model, which calculates the probability of generating the same question-answer pairs that were extracted from the summary. QUALS then compares the probability of generating each pair from the source text with the probability of generating it from the summary. When the probability given the source text is low compared with the probability given the summary, the QUALS score is low. This suggests that the pair is plausible for the summary but not for the source text, which indicates a factual inconsistency.
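The comparison can be sketched numerically. The sketch below assumes the score is the average, over question-answer pairs, of the length-normalized difference between the log-probability of a pair given the source text and its log-probability given the summary. That exact form, the function name, and the numbers are assumptions for illustration; a real implementation would obtain the log-probabilities from the QAGen model.

```python
from typing import List, Tuple

# Each entry: (log p(pair | source), log p(pair | summary), number of tokens in the pair)
ScoredPair = Tuple[float, float, int]


def quals_like_score(pairs: List[ScoredPair]) -> float:
    """Average length-normalized log-probability gap between source and summary."""
    if not pairs:
        raise ValueError("at least one question-answer pair is required")
    total = 0.0
    for logp_source, logp_summary, num_tokens in pairs:
        # A pair that is far less likely under the source pulls the score down.
        total += (logp_source - logp_summary) / num_tokens
    return total / len(pairs)


# Made-up log-probabilities for two hypothetical summaries of the same source.
consistent = [(-4.0, -3.0, 10), (-5.0, -4.5, 10)]
inconsistent = [(-30.0, -3.0, 10), (-5.0, -4.5, 10)]

score_consistent = quals_like_score(consistent)
score_inconsistent = quals_like_score(inconsistent)
print(score_consistent, score_inconsistent)
```

The summary containing a pair that the source text makes very unlikely receives the clearly lower score.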

Training methodology

The researchers proposed a contrastive learning method that uses the QUALS score to train the summarization model. Contrastive learning is a technique in which the model learns the general characteristics of an unlabeled dataset by comparing similar and dissimilar data points.

This involves first training a summarization model through the standard approach, which uses maximum likelihood estimation (MLE) to approximate a sentence overlap score. The trained model then generates new summaries for the source texts in the training dataset. Next, two groups of summaries are created: one containing ground-truth summaries with high QUALS scores and one containing generated summaries with low QUALS scores.

Finally, the team uses a loss function to retrain the summarization model. This encourages the model to generate summaries like those in the first group and discourages summaries like those in the second.
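The grouping and the loss can be sketched roughly as below. The thresholds, the function names, and the specific loss (raise the log-likelihood of the first group, lower that of the second) are assumptions made for illustration; the article does not give the exact formula the team used.

```python
from statistics import mean
from typing import List, Tuple

Scored = Tuple[str, float]  # (summary text, factual-consistency score)


def build_groups(ground_truth: List[Scored], generated: List[Scored],
                 high: float, low: float) -> Tuple[List[str], List[str]]:
    """Split summaries into a positive group and a negative group by score."""
    positives = [text for text, score in ground_truth if score >= high]
    negatives = [text for text, score in generated if score <= low]
    return positives, negatives


def contrastive_loss(logp_positive: List[float], logp_negative: List[float]) -> float:
    """Lower loss means positives are more likely and negatives less likely."""
    return -mean(logp_positive) + mean(logp_negative)


# Hypothetical scored summaries and thresholds.
truth = [("summary A", -0.1), ("summary B", -1.5)]
sampled = [("summary C", -2.0), ("summary D", -0.2)]
positives, negatives = build_groups(truth, sampled, high=-0.5, low=-1.0)

# Made-up model log-likelihoods for the summaries in each group.
loss_before = contrastive_loss([-2.0], [-2.5])
loss_after = contrastive_loss([-1.0], [-6.0])
print(positives, negatives, loss_before, loss_after)
```

In a real training loop the log-likelihoods would come from the summarization model and the loss would be minimized by gradient descent; the numbers above only show that the loss falls as the model favors the first group over the second.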


The team compared its approach against two baseline models: one trained in the standard way and another trained by contrastive learning. Summaries were assessed on five metrics, three of them ROUGE metrics rather than QUALS. On all five metrics, models trained using QUALS outperformed both baselines.

A confirmatory human assessment study compared 100 summaries generated by the QUALS-trained model with 100 generated by the MLE-trained model. Human subjects were asked to compare them for factual consistency, grammar, and informativeness. Summaries based on QUALS were found to be better in terms of accuracy and informativeness. Grammatical correctness, however, was the same for both.


The χ² (chi-square) test is a popular method for hypothesis testing between two or more groups. It allows developers to check whether two variables are independent, analyze categorical data, and run tests of independence on a bivariate (contingency) table.
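As an illustration, the test statistic for a contingency table can be computed by hand. The sketch below uses the standard Pearson formula, the sum of (observed - expected)² / expected over all cells; the table values are made up.

```python
from typing import List, Tuple


def chi_square_statistic(table: List[List[float]]) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up 2x2 table: rows are two groups, columns are two outcomes.
observed = [[10, 20], [20, 10]]
stat, dof = chi_square_statistic(observed)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

With one degree of freedom, a statistic above about 3.84 rejects independence at the 5% significance level, so this made-up table would suggest the two variables are related.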

Confusion matrix

The confusion matrix, or error matrix, is a 2D table describing the performance of a classification model on a test dataset. Each row of the matrix represents instances of a predicted class, while each column represents instances of an actual class. The matrix can also be laid out the other way around.
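A minimal sketch of building such a table, using the row and column convention described above (rows are predicted classes, columns are actual classes). The labels and predictions are made up.

```python
from typing import List, Sequence


def confusion_matrix(actual: Sequence[str], predicted: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Count predictions: rows index the predicted class, columns the actual class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix


# Made-up test labels and model predictions.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(actual, predicted, labels=["spam", "ham"])
print(matrix)
```

Note that some libraries use the opposite convention, with rows for actual classes and columns for predicted classes, so it is worth checking the orientation before reading off the errors.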

Gini coefficient

The Gini coefficient is a popular measure for imbalanced class values because it provides a statistical measure of how the values are distributed. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality. A higher value indicates greater dispersion in the data.
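One common way to compute the coefficient is from the mean absolute difference between all pairs of values, as sketched below with made-up numbers. Other formulations exist, and in machine learning the Gini coefficient is sometimes derived from the ROC curve instead, which this sketch does not cover.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient via the mean absolute difference between all pairs."""
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0 or any(v < 0 for v in values):
        raise ValueError("values must be non-negative with a positive sum")
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    # Equivalent to abs_diff_sum / (2 * n^2 * mean).
    return abs_diff_sum / (2 * n * total)


equal = gini([1, 1, 1, 1])
concentrated = gini([0, 0, 0, 10])
print(equal, concentrated)
```

With a finite sample of n values, this formulation reaches at most (n - 1) / n, so the fully concentrated four-value example gives 0.75 and the value approaches 1 only as n grows.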
