Reliability of the Tool

Because LLMs tend to produce plausible but factually incorrect information, we have analyzed the tool’s responses extensively to check that they align with ground truths and human expectations, both accurately and consistently. We use these analyses to continuously refine our prompts and workflows.

We also analyzed the consistency and accuracy of the responses when evaluating 11 well-known machine learning projects on GitHub. In addition, we manually evaluated three repositories to confirm that the tool’s evaluations align with human expectations.
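As a rough illustration of what consistency and accuracy can mean here, the sketch below scores repeated runs and compares them with human labels. It is not the analysis code from the report: the function names, the item keys, and the choice of metrics (agreement with the most common answer, and exact match against human labels) are all assumptions made for this example.

```python
from collections import Counter


def consistency(runs):
    """Share of repeated runs that agree with the most common answer.

    `runs` is a hypothetical list of answers the tool gave for the same
    question on the same repository across repeated runs.
    """
    if not runs:
        raise ValueError("runs must not be empty")
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)


def accuracy(predicted, ground_truth):
    """Share of human-labelled items where the tool's answer matches.

    Both arguments are hypothetical dicts mapping an item name to an answer.
    Items missing from `predicted` count as mismatches.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    # Sentinel so a missing prediction never equals a human label, even None.
    missing = object()
    hits = sum(
        predicted.get(item, missing) == label
        for item, label in ground_truth.items()
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    # Four repeated runs on one item: three agree, one differs.
    print(consistency([1, 1, 0, 1]))
    # Two human-labelled items: the tool matches one of them.
    print(accuracy({"item_a": 1, "item_b": 0}, {"item_a": 1, "item_b": 1}))
```

A consistency value close to 1.0 means the tool gives the same answer across runs, and an accuracy value close to 1.0 means it agrees with the human evaluation. The report may use different definitions.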

The analyses and findings are available in the report/ directory on GitHub.