
COCOGEN vs DAVINCI: A Human Evaluation of Structured Commonsense Graph Generation

Abstract and 1 Introduction

2 COCOGEN: Representing Commonsense Structures with Code and 2.1 Converting (T, G) into Python code

2.2 Few-shot prompting for generating G

3 Evaluation and 3.1 Experimental setup

3.2 Script generation: PROSCRIPT

3.3 Entity state tracking: PROPARA

3.4 Argument graph generation: EXPLAGRAPHS

4 Analysis

5 Related Work

6 Conclusion, Acknowledgments, Limitations, and References

A Size estimates of few-shot models

B Dynamic prompt creation

C Human evaluation

D Dataset statistics

E Sample outputs

F Prompts

G Designing a Python class for a structured task

H Impact of model size

I Prompt variation

C Human evaluation

Of the four tasks used in this work, PROSCRIPT edge prediction and PROPARA have only one possible correct answer. Thus, following prior work, we report standard automated metrics for these tasks. For EXPLAGRAPHS, we use the model-based metrics proposed by Saha et al. (2021), which have been shown to correlate strongly with human judgments. For PROSCRIPT script generation, we conducted an exhaustive automated evaluation that scores the correctness of nodes and the correctness of edges separately.
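
This separate node-and-edge scoring amounts to a set comparison between the predicted and gold graphs. The sketch below illustrates one way to compute it, assuming a graph is given as a set of node labels plus a set of directed edges; it is illustrative, not the paper's exact scorer.

```python
# Illustrative sketch of scoring nodes and edges separately, as in the
# exhaustive automated evaluation described above. NOT the paper's exact
# scorer: a graph is assumed to be a set of node labels plus a set of
# directed (source, target) edges.

def set_f1(pred: set, gold: set) -> float:
    """F1 (harmonic mean of precision and recall) between two sets."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def score_graph(pred_nodes, pred_edges, gold_nodes, gold_edges):
    """Return (node F1, edge F1) for one predicted/gold graph pair."""
    return (set_f1(set(pred_nodes), set(gold_nodes)),
            set_f1(set(pred_edges), set(gold_edges)))
```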

However, automated metrics are limited in their ability to assess model-generated output. Thus, to probe the quality of the results more deeply, we conducted a human evaluation comparing the outputs generated by COCOGEN and DAVINCI. We sampled 20 examples, and three of the authors performed the evaluation. Annotators were shown two graphs (one generated by COCOGEN and one by DAVINCI) and asked to select the one they judged better with respect to relevance and correctness. The selection for each criterion was made independently: the same graph could have more relevant nodes (higher relevance) yet still be incorrect. The identity of the model that generated each graph (COCOGEN or DAVINCI) was shuffled and hidden from the evaluators.
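
For concreteness, the blinding step can be implemented by randomly ordering the two graphs before display and resolving model identity only after the choice is recorded. The snippet below is a hypothetical sketch of that protocol, not the authors' annotation tool.

```python
# Hypothetical sketch of the blinded pairwise comparison: the two graphs
# are shown in a random order, and model identity is resolved only after
# the annotator's choice. Not the authors' actual annotation interface.
import random

def blind_pair(cocogen_graph: str, davinci_graph: str):
    pair = [("cocogen", cocogen_graph), ("davinci", davinci_graph)]
    random.shuffle(pair)
    shown = {"A": pair[0][1], "B": pair[1][1]}  # annotator sees only A/B
    key = {"A": pair[0][0], "B": pair[1][0]}    # kept hidden until scoring
    return shown, key
```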

The results in Table 11 indicate that the human evaluation correlates closely with the automated metrics: for EXPLAGRAPHS, annotators found the graphs generated by COCOGEN to be more relevant and correct. We note that DAVINCI often fails to recover the semantic relations between nodes in argument graphs. For example, consider the belief (b) "urbanization harms the natural habitats of the world's animals." We want to generate a graph that counters this belief with the argument (a) "urbanization leads to an increase in jobs."

Table 8: KST on PROSCRIPT script generation: creating the prompt dynamically leads to marginal improvements.

Table 9: KST on EXPLAGRAPHS: we note that EXPLAGRAPHS contains several training examples that are similar to one another. Thus, creating the prompt dynamically by selecting the examples closest to the test input in fact hurts performance.

Table 10: The closest training-set examples for the test input belief: "Religion causes many fights." and argument: "There would be fewer fights without religious conflicts." As the table shows, the examples overlap, which reduces diversity within the prompt, effectively reducing the number of examples when the prompt is created using nearest neighbors (Section 4).
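
Tables 8-10 concern dynamic prompt creation, in which the few-shot examples are the training instances closest to the test input. A minimal sketch of such nearest-neighbor selection follows, assuming a generic sentence encoder; the embed() placeholder and helper names are illustrative, not the paper's implementation.

```python
# Minimal sketch of dynamic (nearest-neighbor) prompt creation: select the
# k training examples most similar to the test input and concatenate them
# into the few-shot prompt. embed() is a stand-in for any sentence encoder;
# all names here are illustrative, not the paper's implementation.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder: substitute a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def knn_prompt(test_input: str, train_examples: list, k: int = 5) -> str:
    """Build a prompt from the k nearest training examples (cosine similarity)."""
    q = embed(test_input)
    q = q / np.linalg.norm(q)
    sims = []
    for ex in train_examples:
        v = embed(ex)
        sims.append(float(q @ (v / np.linalg.norm(v))))
    closest = [train_examples[i] for i in np.argsort(sims)[-k:]]
    return "\n\n".join(closest)
```

As Table 10 illustrates, near-duplicate neighbors reduce prompt diversity; deduplicating highly similar examples before concatenation is one possible mitigation.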

Table 11: Human evaluation of the graphs generated by COCOGEN and DAVINCI. Evaluators were shown graphs generated by COCOGEN and DAVINCI and asked to select the one that was more relevant to the input and the one that was more correct. Evaluators could also indicate no preference. The table shows the percentage of times each model's graphs were preferred.

For the same input, COCOGEN generated (urbanization; causes; increase in jobs); (increase in jobs; has context; good); (good; not capable of; harm), whereas DAVINCI generated (jobs; not harm; natural habitats) → (natural habitats; not part of; animals). Note that DAVINCI managed to recover the relevant entities ("natural habitats", "animals") but arranged them in incorrect relations. For PROSCRIPT, the human evaluation shows that COCOGEN and DAVINCI have complementary strengths, while COCOGEN generally produces more relevant and correct outputs.
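
COCOGEN's central idea (Section 2.1) is to serialize such (subject; relation; object) triples as Python code. The class below is a hedged sketch of one possible serialization of the urbanization example; the exact class and attribute names used in the paper's prompts may differ.

```python
# Hedged sketch of serializing an argument graph's (subject; relation;
# object) triples as Python code, in the spirit of Section 2.1. The exact
# class and attribute names in the paper's prompts may differ.
class ArgumentGraph:
    def __init__(self, belief: str, argument: str):
        self.belief = belief
        self.argument = argument
        self.edges = []  # list of (subject, relation, object) triples

    def add_edge(self, subject: str, relation: str, obj: str):
        self.edges.append((subject, relation, obj))

graph = ArgumentGraph(
    belief="urbanization harms the natural habitats of the world's animals",
    argument="urbanization leads to an increase in jobs",
)
graph.add_edge("urbanization", "causes", "increase in jobs")
graph.add_edge("increase in jobs", "has context", "good")
graph.add_edge("good", "not capable of", "harm")
```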

D Dataset statistics

Dataset statistics are presented in Table 12. The test split for EXPLAGRAPHS is not available, so we evaluate on the validation split. For PROSCRIPT, we obtained the test splits from the authors.

Table 12: Corpus statistics for the tasks used in this work.

E Sample outputs

Sample COCOGEN outputs for all tasks are located at https://github.com/madaan/cocogen/tree/main/outputs. Representative examples of each task are presented in Figure 5. Surprisingly, COCOGEN (CODEX with a Python prompt) generates syntactically valid Python graphs that match the task graphs/tables in almost 100% of cases.
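
That syntactic-validity figure can be checked mechanically: because the outputs are Python source, the standard-library ast.parse succeeds exactly when an output is syntactically valid. The sketch below computes such a rate; it is not the authors' evaluation script.

```python
# Sketch of measuring the share of generated outputs that are
# syntactically valid Python, using the standard-library ast module.
# Not the authors' evaluation script.
import ast

def syntactic_validity_rate(outputs: list) -> float:
    """Fraction of generated Python snippets that parse without error."""
    valid = 0
    for src in outputs:
        try:
            ast.parse(src)
            valid += 1
        except SyntaxError:
            pass
    return valid / len(outputs) if outputs else 0.0
```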

F Prompts

The prompts for each task are available at the following anonymized URLs:

  1. PROSCRIPT script generation: https://github.com/madaan/cocogen/tree/main/data/proscript_script_generation/prompt.txt

  2. PROSCRIPT edge prediction: https://github.com/madaan/cocogen/tree/main/data/proscript_edge_prediction/prompt.txt

  3. PROPARA: https://github.com/madaan/cocogen/tree/main/data/propara/prompt.txt

  4. EXPLAGRAPHS: https://github.com/madaan/cocogen/tree/main/data/explagraphs/prompt.txt

These prompts are also included in the attached supplementary material and can be found in the data folder under each task's subdirectory.


Authors:

(1) Aman Madaan, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(2) Shuyan Zhou, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(3) Uri Alon, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(4) Yiming Yang, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]);

(5) Graham Neubig, Language Technologies Institute, Carnegie Mellon University, USA ([email protected]).
