Quick Start
In the last couple of years, research in natural language processing, which underlies tasks such as text summarization, has made significant advances. Yet despite attaining high levels of fluency, neural systems can be prone to hallucination (i.e., generating text that is understandable but not faithful to the source), which can prevent their use in tasks that require high accuracy.
Assessing the faithfulness of generated text to the source content can be difficult, but it is usually easier when the source is well-structured (e.g., in tabular form). Structured data can also test a model's capacity for reasoning and numerical inference. However, existing large-scale structured datasets are often noisy (i.e., the reference sentence cannot be fully inferred from the tabular data), which makes them unreliable for measuring hallucination during model development.
Table-to-Text Generation
Here we introduce a controlled generation task: given a Wikipedia table with a set of selected cells as the source data, the goal is to produce a single-sentence description of those cells in the context of the table. This task poses several challenges, including numerical reasoning, a large open-domain vocabulary, and varied table structure.
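As a concrete illustration, the input to such a task can be thought of as a table plus a set of highlighted cells, which a model might linearize into a string before generation. The sketch below is purely illustrative: the table contents, the cell-index convention, and the `linearize` helper are assumptions for the example, not the dataset's actual format.

```python
def linearize(table, highlighted):
    """Flatten the highlighted cells into '<header>: <value>' pairs
    that a sequence-to-sequence model could consume as input."""
    parts = []
    for r, c in highlighted:
        header = table[0][c]          # first row holds the column headers
        value = table[r][c]
        parts.append(f"{header}: {value}")
    return " | ".join(parts)

# A toy filmography table; (row, column) pairs mark the supporting cells.
table = [
    ["Year", "Title", "Role"],        # header row
    ["1997", "Titanic", "Rose"],
    ["2008", "Revolutionary Road", "April"],
]
highlighted = [(1, 0), (1, 1)]        # cells that support the target sentence

print(linearize(table, highlighted))  # Year: 1997 | Title: Titanic
```

A model would then be trained to map this linearized input to a target sentence such as "Titanic was released in 1997."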
Annotation Process
Designing an annotation process that produces natural yet clean sentences from tabular data is a major challenge. Many existing datasets heuristically pair naturally occurring text with tables, a noisy process that makes it hard to tell whether hallucinations are caused primarily by the model or by noise in the data. Alternatively, annotators can write sentences from scratch that are faithful to the table, but the results often lack diversity in structure and style.
By contrast, our text data collection is constructed using a novel annotation process in which annotators revise existing Wikipedia sentences in stages. This yields target sentences that are clean and natural while retaining interesting and diverse linguistic properties. The process begins by acquiring tables and candidate sentences from Wikipedia.
The annotator then highlights the cells of the table that support the sentence and deletes phrases that are not supported by the table. The annotator also decontextualizes the sentence so that it stands alone (e.g., by resolving pronouns correctly) and fixes its grammar where necessary.
1. Dataset Analysis
We conducted a topic analysis of the dataset across 44 categories. We found that the Sports and Countries topics, each of which consists of several fine-grained subtopics (e.g., football and the Olympics for Sports; population and buildings for Countries), together make up 56.4% of the data. The remaining 43.6% of the dataset covers a much broader range of topics, including Performing Arts, Transportation, and Entertainment.
We also performed a manual analysis of the diverse linguistic phenomena in the dataset over 100 randomly chosen examples. The table below summarizes the fraction of examples that require reference to the page and section titles, along with some linguistic phenomena in the dataset that may pose new challenges to current systems.

| Linguistic phenomenon | Percentage |
| --- | --- |
| Requires reference to the page title | 82% |
| Requires reference to the section title | 19% |
| Requires reference to the table description | 3% |
| Reasoning (logical, numerical, temporal, etc.) | 21% |
| Comparison across rows, columns, or cells | 13% |
| Requires background information | 12% |
2. Baseline Results
We present baseline results for three state-of-the-art models from the literature using two evaluation metrics, BLEU and PARENT. In addition to reporting scores on the overall test set, we evaluate each model on a more challenging subset of out-of-domain examples. The table below shows that the BERT-to-BERT model performs best on both BLEU and PARENT. Moreover, all models score considerably lower on the challenge set, which highlights the difficulty of out-of-domain generalization.
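Both metrics mentioned above are built on n-gram overlap between the generated sentence and references (PARENT additionally conditions on the table). The sketch below implements only the clipped n-gram precision that underlies BLEU, as a minimal self-contained illustration; it is not the full metric, which also combines several n-gram orders and a brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU: the fraction
    of candidate n-grams that also appear in the reference, with each
    n-gram's count clipped to its count in the reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[g]) for g, count in Counter(cand).items())
    return clipped / max(len(cand), 1)

hyp = "the team won the title in 1997".split()
ref = "the club won the league title in 1997".split()
print(round(ngram_precision(hyp, ref, n=1), 2))  # → 0.86
```

In practice, production evaluation would use an established implementation rather than this sketch, since BLEU scores are sensitive to tokenization and smoothing choices.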
While automatic metrics give some indication of performance, they are not currently sufficient for assessing hallucination in text generation systems. To better understand hallucination, we manually evaluate the best-performing baseline to determine how faithful it is to the information in the source table, on the assumption that discrepancies indicate hallucination. To compute an "Expert" performance, for each example in the multi-reference portion of our test set, we held out one reference and asked annotators to compare it against the other references for faithfulness. The results show that even the best baseline hallucinates information roughly 20% of the time.
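The leave-one-out "Expert" estimate described above can be sketched in code: for each multi-reference example, hold out one reference as the candidate and score it against the remaining references. In the sketch below, `word_overlap` is a toy stand-in for the real faithfulness judgment, which was made by human annotators, and the example data is invented.

```python
def word_overlap(candidate, references):
    """Toy faithfulness score: fraction of the candidate's words found
    in the best-matching reference (0..1)."""
    cand = set(candidate.split())
    return max(len(cand & set(r.split())) / len(cand) for r in references)

def expert_estimate(examples):
    """Average score of a held-out reference against the remaining
    references, over all multi-reference examples."""
    scores = []
    for refs in examples:                # each refs: a list of >= 2 references
        held_out, others = refs[0], refs[1:]
        scores.append(word_overlap(held_out, others))
    return sum(scores) / len(scores)

examples = [
    ["a b c", "a b d"],                  # overlap 2/3
    ["x y", "x z"],                      # overlap 1/2
]
print(expert_estimate(examples))         # mean of 2/3 and 1/2
```

Because a human-written reference is faithful by construction, the gap between this estimate and a model's score gives a rough upper bound on achievable faithfulness.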
3. Model Errors and Challenges
We present some of the observed model errors to highlight the more challenging aspects of the dataset. The state-of-the-art models we tested struggle with hallucination, numerical reasoning, and rare topics, even with clean references (errors shown in red). We hope the dataset will also prove useful for related tasks such as table understanding and sentence revision.
Conclusion
At Global Technology Solutions (GTS), we provide AI training datasets, annotation, and data quality management in a variety of languages, including:
- Chinese text dataset services
- Dutch text dataset services
- French text dataset services
- German text dataset services
- Italian text dataset services
- Japanese text dataset services
- Portuguese text dataset services
- Spanish text dataset services
We maintain a vast text data collection with support for over 200 languages, spanning document, receipt, ticket, and business card datasets, among others.