Prompt Engineering Techniques for Language Model Reasoning Lack Replicability

Abstract:

As large language models (LLMs) are integrated into everyday applications, research into prompt engineering techniques to improve these models' behavior has surged. However, clear methodological guidelines for evaluating these techniques are lacking. This raises concerns about the replicability and generalizability of these techniques' reported benefits. We support our concerns with a series of replication experiments focused on zero-shot prompt engineering techniques purported to influence reasoning abilities in LLMs. We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3, Vicuna, and BLOOM on the chain-of-thought, EmotionPrompting, Sandbagging, Re-Reading, Rephrase-and-Respond, and ExpertPrompting prompt engineering techniques. We applied them to manually double-checked subsets of reasoning benchmarks, including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested, highlighting, among other issues, several methodological weaknesses in previous research. To counter these issues, we propose recommendations for establishing sound benchmarks and designing rigorous experimental frameworks to ensure accurate and reliable assessments of model outputs.
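
Since the abstract only names the zero-shot techniques under study, the sketch below illustrates how such techniques are typically applied: each one rewrites or suffixes the raw question before it is sent to a model. This is a hypothetical illustration with assumed function names and paraphrased prompt templates, not the paper's actual evaluation harness or exact wording.

```python
# Minimal, hypothetical sketch of zero-shot prompt engineering techniques.
# Each function wraps the raw question with a technique-specific template
# before it would be sent to a model; templates are paraphrased assumptions.

def chain_of_thought(question: str) -> str:
    # Zero-shot chain-of-thought: append a reasoning trigger.
    return f"{question}\nLet's think step by step."

def re_reading(question: str) -> str:
    # Re-Reading: repeat the question so the model processes it twice.
    return f"{question}\nRead the question again: {question}"

def expert_prompting(question: str) -> str:
    # ExpertPrompting: prepend a persona instructing the model to answer as an expert.
    return (
        "You are an expert with deep knowledge of this domain. "
        f"Answer the following question.\n{question}"
    )

if __name__ == "__main__":
    # Example item in the style of the CRT benchmark mentioned in the abstract.
    q = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
         "than the ball. How much does the ball cost?")
    for technique in (chain_of_thought, re_reading, expert_prompting):
        print(f"--- {technique.__name__} ---")
        print(technique(q))
        print()
```

In a replication setting, each templated prompt would be sent to every model under a fixed decoding configuration, with the baseline condition using the unmodified question for comparison.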