A Government Test Reveals AI Performs Worse Than Human Workers
A government trial shows generative AI performs significantly worse than human workers in summarizing official documents, raising concerns over AI’s reliability.
Summary of Findings
A recent government trial has revealed that generative AI significantly underperforms human workers at summarizing complex information. Commissioned by the Australian Securities and Investments Commission (ASIC) and conducted by Amazon Web Services (AWS), the test assessed AI's capability to generate useful summaries of government documents.
Experiment Overview
The test used Meta’s open-source Llama2-70B, a large language model with 70 billion parameters. The model was tasked with summarizing documents related to a parliamentary inquiry, emphasizing references to ASIC and including citations. ASIC employees produced their own summaries of the same documents for comparison.
Five independent evaluators reviewed the summaries in a blind test. Each summary was labeled A or B, and the scorers were unaware of which version was AI-generated.
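The article does not publish ASIC's scoring code, but the blind A/B protocol it describes can be sketched in a few lines: each human/AI pair is randomly labeled A or B, evaluators score the labels without knowing the source, and the key is used afterwards to unblind and average the scores. The function and field names below are illustrative assumptions, not ASIC's actual tooling.

```python
import random


def blind_pairs(pairs, rng=random.Random()):
    """Randomly assign 'A'/'B' labels to (human, ai) summary pairs so
    evaluators cannot tell which is which. Returns the labeled pairs
    plus a key for unblinding later."""
    labeled, key = [], []
    for human, ai in pairs:
        if rng.random() < 0.5:
            labeled.append({"A": human, "B": ai})
            key.append({"A": "human", "B": "ai"})
        else:
            labeled.append({"A": ai, "B": human})
            key.append({"A": "ai", "B": "human"})
    return labeled, key


def unblind_scores(scores, key):
    """scores: one {'A': pct, 'B': pct} dict per pair. Maps each label
    back to its source via the key and returns the average score per
    source ('human' and 'ai')."""
    totals = {"human": 0.0, "ai": 0.0}
    for score, mapping in zip(scores, key):
        for label in ("A", "B"):
            totals[mapping[label]] += score[label]
    n = len(scores)
    return {source: total / n for source, total in totals.items()}
```

For example, with scores `[{"A": 80, "B": 40}, {"A": 50, "B": 90}]` and a key mapping A→human, B→ai in the first pair and the reverse in the second, `unblind_scores` averages 80 and 90 for the human summaries (85%) and 40 and 50 for the AI ones (45%).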
Results: AI Falls Short
The AI-generated summaries received a dismal average score of 47%, compared to the 81% score for human-generated content. The poor performance highlighted several critical flaws:
- Lack of accuracy: The AI struggled to generate correct page references.
- Weak comprehension: It often misunderstood instructions or missed key context.
- Poor structure: Summaries were wordy, redundant, and lacked relevance.
Evaluators’ Reactions
Three out of five assessors expressed disbelief that the content could be from AI due to its poor quality. Their reactions underscore broader concerns about overestimating AI’s current capabilities.
Broader Implications
This experiment reinforces a growing narrative: despite rapid advances, generative AI is not yet a reliable replacement for human judgment in tasks requiring nuance and critical thinking. Instead of saving time, its output often requires extensive fact-checking and revision, ultimately adding to the workload.
Recommendations and Conclusion
While the model’s performance might improve with adjustments, the core issues (contextual misunderstanding, factual errors, and verbosity) suggest fundamental limitations. For now, organizations should approach AI integration cautiously, especially for high-stakes or detail-intensive work.
#AIvsHuman, #GenerativeAI, #ASIC, #ArtificialIntelligence, #Llama2, #AWS, #AIUnderperforms, #AITrustIssues, #TechNews, #AIinGovernment