Evaluating the Performance of Large Language Models in Computer Science: The CS-Bench Benchmark
As artificial intelligence continues to evolve, large language models (LLMs) have shown immense potential across many domains. A key challenge remains, however: enabling LLMs to apply computer science knowledge effectively and, in doing so, serve people better. Although existing benchmarks span multiple fields, including computer science, none offers a comprehensive evaluation focused specifically on LLMs' performance in computer science.
Recent research has explored LLMs' potential across various industries and scientific fields. Within computer science specifically, existing studies fall into two main categories: broad evaluation benchmarks in which computer science constitutes only a small fraction, and explorations of specific LLM applications within the field. Neither approach comprehensively evaluates LLMs' foundational knowledge and reasoning abilities in computer science.
“While individual capabilities like mathematics, coding, and logical reasoning have been well-studied, research on their integrated application and interrelationships remains sparse.”
Researchers from Beijing University of Posts and Telecommunications propose CS-Bench, the first benchmark dedicated to evaluating LLMs' performance in computer science. CS-Bench features high-quality test items, diverse task formats, coverage of both knowledge- and reasoning-oriented capabilities, and bilingual evaluation. It comprises approximately 5,000 carefully curated test items spanning 26 subfields across four key computer science domains.
The benchmark includes multiple-choice, assertion, fill-in-the-blank, and open-ended questions to better simulate real-world scenarios and assess LLMs’ robustness to different task formats. CS-Bench evaluates both knowledge-type and reasoning-type questions, supporting bilingual evaluation in Chinese and English.
CS-Bench covers four key domains: Data Structure and Algorithm (DSA), Computer Organization (CO), Computer Network (CN), and Operating System (OS). It includes 26 fine-grained subfields and diverse task forms to enrich assessment dimensions and simulate real-world scenarios.
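To make this structure concrete, the sketch below shows one way the benchmark's items and per-domain scoring could be represented. It is a minimal illustration based only on the structure described above (four domains, 26 subfields, four task formats, knowledge vs. reasoning, Chinese/English); the class names, fields, and grading hook are hypothetical and do not reflect the authors' released code or data schema.

```python
from dataclasses import dataclass
from enum import Enum

class Domain(Enum):
    DSA = "Data Structure and Algorithm"
    CO = "Computer Organization"
    CN = "Computer Network"
    OS = "Operating System"

class TaskFormat(Enum):
    MULTIPLE_CHOICE = "multiple-choice"
    ASSERTION = "assertion"
    FILL_IN_THE_BLANK = "fill-in-the-blank"
    OPEN_ENDED = "open-ended"

@dataclass
class BenchItem:
    domain: Domain
    subfield: str        # one of the 26 fine-grained subfields
    fmt: TaskFormat
    is_reasoning: bool   # reasoning-type vs. knowledge-type question
    language: str        # "en" or "zh"
    question: str
    reference: str       # gold answer or reference solution

def domain_scores(items, predictions, is_correct):
    """Aggregate per-domain accuracy given a correctness function.

    `is_correct(item, prediction)` is a placeholder for format-specific
    grading: exact match works for multiple-choice and assertion items,
    while open-ended answers would need fuzzier or model-based scoring.
    """
    totals, correct = {}, {}
    for item, pred in zip(items, predictions):
        totals[item.domain] = totals.get(item.domain, 0) + 1
        if is_correct(item, pred):
            correct[item.domain] = correct.get(item.domain, 0) + 1
    return {d: correct.get(d, 0) / n for d, n in totals.items()}
```

The same aggregation could be run separately over knowledge-type and reasoning-type items (using the `is_reasoning` flag) to compare the two capability dimensions, which is the kind of breakdown the evaluation results below refer to.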
Evaluation results show overall model scores ranging from 39.86% to 72.29%. GPT-4 and GPT-4o set the highest standard on CS-Bench and are the only models exceeding 70%. Open-source models such as Qwen1.5-110B and Llama3-70B now surpass previously strong closed-source models, and newer models show significant gains over earlier versions. Across the board, models score lower on reasoning than on knowledge, indicating that reasoning poses the greater challenge.
“LLMs generally perform best in Data Structure and Algorithm and worst in Operating Systems.”
This study introduces CS-Bench to provide a systematic view of LLMs' performance in computer science. Even top-performing models like GPT-4o have significant room for improvement. The benchmark also highlights the close interconnections among computer science, mathematics, and coding abilities in LLMs. These findings point to concrete directions for strengthening LLMs in the field and shed light on their cross-domain abilities and applications, paving the way for future advances in AI and computer science.