COCOGEN vs DAVINCI: A Human Evaluation of Structured Commonsense Graph Generation

Table of Links
Abstract and 1 Introduction
2 COCOGEN: Representing Commonsense Structures with Code and 2.1 Converting (T, G) into Python code
2.2 Few-shot prompting for generating G
3 Evaluation and 3.1 Experimental setup
3.2 Script generation: PROSCRIPT
3.3 Entity state tracking: PROPARA
3.4 Argument graph generation: EXPLAGRAPHS
4 Analysis
5 Related work
6 Conclusion, Acknowledgments, Limitations, and References
A Few-shot models size estimates
B Dynamic prompt creation
C Human Evaluation
D Dataset Statistics
E Sample Outputs
F Prompts
G Designing Python class for a structured task
H Impact of model size
I Variation in prompts
C Human Evaluation
Of the four tasks used in this work, PROSCRIPT edge prediction and PROPARA have only one possible correct answer. Thus, following prior work, we report standard automated metrics for these tasks. For EXPLAGRAPHS, we use the model-based metrics proposed by Saha et al. (2021), which were shown to correlate strongly with human judgments. For PROSCRIPT graph generation, we conducted an exhaustive automated evaluation that separately scores the correctness of the nodes and the correctness of the edges.
However, automated metrics are limited in their ability to assess model-generated output. Thus, to further probe the quality of the results, we conduct a human evaluation comparing the outputs generated by COCOGEN and DAVINCI. We sampled 20 examples, and three of the authors performed the evaluation. The annotators were shown two graphs (one generated by COCOGEN and one by DAVINCI) and were asked to select the one they considered better with respect to relevance and correctness. The selection for each criterion was made independently: the same graph could have more relevant nodes (higher relevance) while still being incorrect. The identity of the model that generated each graph (COCOGEN or DAVINCI) was shuffled and unknown to the evaluators.
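For concreteness, here is a minimal sketch of how such a blinded pairwise comparison could be prepared; the function and field names are illustrative, not the authors' actual annotation tooling:

```python
import random

def make_blind_pair(example_id, cocogen_graph, davinci_graph, rng=random):
    """Shuffle the two candidate graphs so annotators cannot tell which
    model produced which; the true identities stay in a hidden key."""
    candidates = [("cocogen", cocogen_graph), ("davinci", davinci_graph)]
    rng.shuffle(candidates)
    return {
        "example_id": example_id,
        # Annotators only ever see graph_a and graph_b.
        "graph_a": candidates[0][1],
        "graph_b": candidates[1][1],
        # Revealed only when aggregating relevance/correctness judgments.
        "key": {"graph_a": candidates[0][0], "graph_b": candidates[1][0]},
    }
```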
The results in Table 11 indicate that the human evaluation correlates closely with the automated metrics: for EXPLAGRAPHS, the annotators found the graphs generated by COCOGEN to be both more relevant and more correct. We observe that DAVINCI often fails to recover the semantic relations between nodes in argument graphs. For example, consider a belief (b) urbanization has harmed the natural habitats of the world's animals. We want to generate a graph that counters this belief with the argument (a) urbanization causes an increase in jobs.
For the same prompt, COCOGEN generated (urbanization; causes; increase in jobs); (increase in jobs; has context; good); (good; not capable of; harm), whereas DAVINCI generated (jobs; not harms; natural habitats) → (natural habitats; not part of; animals). Note that DAVINCI managed to recover the relevant entities ("natural habitats," "animals") but arranged them in incorrect relations. For PROSCRIPT, the human evaluation shows that COCOGEN and DAVINCI have complementary strengths, although COCOGEN generally produces more relevant and correct outputs.
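Written out as explicit triples, the two outputs above could be represented in Python roughly as follows; the Edge class is an illustrative stand-in, not the exact schema used in the prompts:

```python
from dataclasses import dataclass

@dataclass
class Edge:
    source: str
    relation: str
    target: str

# COCOGEN's output for the belief/argument pair above.
cocogen_graph = [
    Edge("urbanization", "causes", "increase in jobs"),
    Edge("increase in jobs", "has context", "good"),
    Edge("good", "not capable of", "harm"),
]

# DAVINCI recovers the right entities but links them with incorrect relations.
davinci_graph = [
    Edge("jobs", "not harms", "natural habitats"),
    Edge("natural habitats", "not part of", "animals"),
]
```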
D Dataset Statistics
Dataset statistics are presented in Table 12. The test split for EXPLAGRAPHS is not available, so we evaluate on the validation split. For PROSCRIPT, we obtained the test splits from the authors.
E Sample Outputs
Sample COCOGEN outputs for all tasks are located at https://github.com/madaan/cocogen/tree/main/outputs. Representative examples for each task are shown in Figure 5. Notably, COCOGEN (Codex with a Python prompt) generates syntactically valid Python graphs that match the task graph/table format in nearly 100% of the cases.
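A minimal sketch of how the syntactic validity of such generated Python could be checked, assuming each model output is available as a plain string (the example snippets below are hypothetical):

```python
import ast

def is_valid_python(generated: str) -> bool:
    """Return True if the generated text parses as Python source code."""
    try:
        ast.parse(generated)
        return True
    except SyntaxError:
        return False

outputs = [
    "class Tree:\n    goal = 'bake a cake'",  # well-formed class definition
    "class Tree\n    goal = 'bake a cake'",   # missing colon -> invalid
]
print([is_valid_python(o) for o in outputs])  # [True, False]
```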
F Prompts
The prompts for each task are available at the following URLs:
- PROSCRIPT script generation: https://github.com/madaan/cocogen/tree/main/data/proscript_script_generation/prompt.txt
- PROSCRIPT edge prediction: https://github.com/madaan/cocogen/tree/main/data/proscript_edge_prediction/prompt.txt
- PROPARA: https://github.com/madaan/cocogen/tree/main/data/propara/prompt.txt
- EXPLAGRAPHS: https://github.com/madaan/cocogen/tree/main/data/explagraphs/prompt.txt
These prompts are also included in the attached supplementary material and can be found in the data folder under the sub-directory of the respective task.
Authors:
(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);
(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]).