The current landscape
The introduction of Devin has sparked significant interest in AI Agents for Software Engineering, further fueling the already dynamic Generative AI field. This is despite a rebuttal that surfaced a month later, challenging many of Devin’s claims. The intrigue surrounding Devin is particularly heightened due to its proprietary nature.
In response, open-source AI agents striving to match or surpass Devin’s purported capabilities are rapidly gaining traction. Devin used SWE-Bench to benchmark its abilities. The Princeton team that developed SWE-Bench recently launched the open-source SWE-agent, demonstrating performance nearly on par with Devin. OpenDevin is another prominent open-source project aiming to compete with Devin. Other examples of open-source AI coding agents include Devika, Aider, GPT Pilot, and GPT Engineer.
Multi-Agent frameworks, such as AutoGen, are likely to play an important role in this evolving landscape as well.
Evaluation tools
OpenAI has recently open-sourced simple-evals, a tool designed for evaluating LLMs, with a particular focus on chat-based models in a “zero-shot, chain-of-thought” setting. It covers HumanEval, which is commonly used for assessing LLM coding capabilities. Simple-evals serves as a lightweight version of OpenAI Evals.
For evaluating real-world software engineering problem-solving skills, Princeton NLP’s SWE-Bench and SWE-Bench Lite are more suitable. This article is good further reading on the benchmark.
Run SWE-Agent
Follow installation instructions.
Preparation:
- Install Docker
- Install Conda
- Download the code and build the Docker images used to run SWE-agent. It is safer to run in a Docker environment because coding agents execute AI-generated code on your computer.
git clone https://github.com/princeton-nlp/SWE-agent.git
cd SWE-agent
conda env create -f environment.yml
conda activate swe-agent
./setup.sh # create the docker images to run swe-agent
- Provide configuration information, such as API keys. Create and fill in the keys.cfg file.
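Before launching the agent, a short script can confirm that keys.cfg exists and that the entries you need are filled in. This is a minimal sketch and not part of SWE-agent: it assumes each entry sits on its own line as `NAME: value` or `NAME=value`, and the key names `GITHUB_TOKEN` and `OPENAI_API_KEY` are assumptions to adjust for your own setup.

```python
import re
from pathlib import Path

# Assumed key names; change these to whatever your keys.cfg actually needs.
REQUIRED_KEYS = ("GITHUB_TOKEN", "OPENAI_API_KEY")

# Assumed line format: NAME: value   or   NAME=value
LINE_PATTERN = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*[:=]\s*(.*)$")


def parse_keys(text):
    """Parse key/value lines into a dict, skipping blanks and # comments."""
    keys = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_PATTERN.match(line)
        if match:
            name, value = match.groups()
            # Drop surrounding whitespace and quotes around the value.
            keys[name] = value.strip().strip("'\"")
    return keys


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty."""
    keys = parse_keys(text)
    return [name for name in required if not keys.get(name)]


cfg = Path("keys.cfg")
if cfg.exists():
    print("missing keys:", missing_keys(cfg.read_text()) or "none")
else:
    print("keys.cfg not found in the current directory")
```

Run it from the SWE-agent directory; an empty list means every assumed key has a non-empty value.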
Build
python run.py --model_name gpt4-0125 \
--data_path https://github.com/pvlib/pvlib-python/issues/1603 \
--config_file config/default_from_url.yaml
python run.py --model_name gpt3-0125 \
--data_path https://github.com/pvlib/pvlib-python/issues/1603 \
--config_file config/default_from_url.yaml
- Make sure to check the model_name argument: gpt4-0125 gives the best quality but is much more expensive than gpt3-0125.
- Passing the --open_pr flag will automatically open a PR if an issue is solved.
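When switching between models or toggling --open_pr, it can help to assemble the command in one place. The sketch below only builds the argument list from the flags shown above; the helper name `build_run_command` is hypothetical and not part of SWE-agent.

```python
import shlex

ISSUE_URL = "https://github.com/pvlib/pvlib-python/issues/1603"


def build_run_command(model_name, issue_url,
                      config_file="config/default_from_url.yaml",
                      open_pr=False):
    """Build the run.py argument list using the flags from the commands above."""
    cmd = [
        "python", "run.py",
        "--model_name", model_name,
        "--data_path", issue_url,
        "--config_file", config_file,
    ]
    if open_pr:
        # Only request a pull request when explicitly asked to.
        cmd.append("--open_pr")
    return cmd


# Print the two variants from the text so they can be compared or copied.
for model in ("gpt4-0125", "gpt3-0125"):
    print(shlex.join(build_run_command(model, ISSUE_URL)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the SWE-agent directory with the conda environment active.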
Run OpenDevin
Installation
Check dependencies:
node --version
poetry --version
Note: if poetry is outdated, use brew upgrade poetry to update it (on Mac).
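The same dependency check can be scripted, which is handy when setting up several machines. This is a small sketch using only the standard library; it reports what `node --version` and `poetry --version` print and does not judge whether the versions are recent enough.

```python
import shutil
import subprocess


def tool_version(name):
    """Return the output of `<name> --version`, or None if the tool is missing or cannot be run."""
    path = shutil.which(name)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Some tools print their version to stderr instead of stdout.
    return (result.stdout or result.stderr).strip()


for tool in ("node", "poetry"):
    version = tool_version(tool)
    print(f"{tool}: {version if version is not None else 'NOT FOUND'}")
```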
git clone https://github.com/OpenDevin/OpenDevin.git
cd OpenDevin
make build
There are errors:
Installing Python dependencies...
/bin/sh: chroma-hnswlib: command not found
Installing ...
Creating virtualenv opendevin in /Users/charles/github/OpenDevin/.venv
Collecting chroma-hnswlib
Downloading chroma-hnswlib-0.7.3.tar.gz (31 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [19 lines of output]
... snipped for brevity ...
File "<string>", line 73, in <module>
File "<string>", line 90, in BuildExt
ValueError: list.remove(x): x not in list
[end of output]
So we need to look into the make build process in the Makefile.
The install-python-dependencies target in the Makefile creates a virtual environment at .venv managed by Poetry:
install-python-dependencies:
	@echo "$(GREEN)Installing Python dependencies...$(RESET)"
	@if [ "$(shell uname)" = "Darwin" ]; then \
		echo "$(BLUE)Installing `chroma-hnswlib`...$(RESET)"; \
		export HNSWLIB_NO_NATIVE=1; \
		poetry run pip install chroma-hnswlib; \
	fi
	@poetry install --without evaluation
	@echo "$(GREEN)Python dependencies installed successfully.$(RESET)"
In this target, poetry install --without evaluation is the command that installs the Python dependencies and creates the virtual environment if it doesn’t exist. The --without evaluation option tells Poetry to skip installing packages from the “evaluation” group; dependency groups are a Poetry feature for managing different sets of dependencies. According to pyproject.toml, the evaluation group contains only torch.
[tool.poetry.group.evaluation.dependencies]
torch = "*"
Now let’s make some adjustments. We saw the system is using Python 3.12, but in our overall experience Python 3.11 currently results in fewer compatibility issues. Let’s update pyproject.toml:
[tool.poetry.dependencies]
python = ">=3.11,<3.12"
Run poetry lock to update the lock file.
Run make build again.
It indeed succeeds!
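To see at a glance whether a given interpreter falls inside the `>=3.11,<3.12` constraint, the comparison can be written out directly. This is a minimal sketch of the range check only; it is not how Poetry itself resolves constraints, and the function name is hypothetical.

```python
import sys


def satisfies_constraint(version, lower=(3, 11), upper=(3, 12)):
    """True if lower <= version < upper, comparing (major, minor) only."""
    return lower <= tuple(version[:2]) < upper


print("current interpreter satisfies >=3.11,<3.12:",
      satisfies_constraint(sys.version_info))
```

If Poetry keeps selecting a 3.12 interpreter after the change, `poetry env use python3.11` picks the interpreter explicitly, assuming a `python3.11` executable is on your PATH.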
Configure model usage
make setup-config will generate the config.toml file.
Run make run and simply access: http://localhost:3001/
We ran into connection issues with the Brave browser, but switching to Chrome worked.
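When a browser refuses to connect, it helps to first rule out the server itself. The sketch below checks whether anything is listening on the frontend port; the host and port come from the URL above, and the helper name is hypothetical.

```python
import socket


def is_port_open(host="localhost", port=3001, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print("frontend reachable on localhost:3001:", is_port_open())
```

If this reports that the port is reachable while the page still fails to load, the problem is more likely on the browser side (for example, its privacy or blocking settings) than in OpenDevin.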
Parting thoughts
At present, SWE-agent has hinted at the development of a user interface. In contrast, OpenDevin already has both a frontend and a backend, although its backend agent remains rudimentary. However, OpenDevin’s flexible architecture allows the backend to be replaced with different agents. There are also plans to use SWE-agent as a backend agent for OpenDevin. While these developments have not yet been released, the potential integration of these two projects still represents a valuable step forward for generative AI coding agents.
Updates (July 2024)
OpenDevin has made solid progress on agents. In OpenDevin’s agent framework, the basic control loop for an agent operates as follows:
while True:
    prompt = agent.generate_prompt(state)
    response = llm.completion(prompt)
    action = agent.parse_response(response)
    observation = runtime.run(action)
    state = state.update(action, observation)
Based on the current state, an agent determines the next step, processes it through the LLM to decide the appropriate action, executes the action, and then updates the state with the result. This loop then repeats.
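The control loop described above can be exercised end to end with stand-ins for each component. This is a toy sketch, not OpenDevin code: `ToyAgent`, `ScriptedLLM`, `EchoRuntime`, and `State` are hypothetical classes, the "LLM" simply replays canned responses, and a step limit plus a "finish" action are added so the loop terminates.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical state: the task plus a history of (action, observation) pairs."""
    task: str
    history: list = field(default_factory=list)
    done: bool = False

    def update(self, action, observation):
        self.history.append((action, observation))
        self.done = action == "finish"
        return self


class ToyAgent:
    def generate_prompt(self, state):
        return f"Task: {state.task}\nSteps so far: {len(state.history)}"

    def parse_response(self, response):
        # Normalize the raw model text into an action name.
        return response.strip().lower()


class ScriptedLLM:
    """Stand-in for a real model: replays canned responses in order."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def completion(self, prompt):
        return next(self._responses)


class EchoRuntime:
    """Stand-in for a sandbox: reports the action instead of executing it."""

    def run(self, action):
        return f"ran {action}"


def run_agent(agent, llm, runtime, state, max_steps=10):
    for _ in range(max_steps):
        prompt = agent.generate_prompt(state)
        response = llm.completion(prompt)
        action = agent.parse_response(response)
        observation = runtime.run(action)
        state = state.update(action, observation)
        if state.done:
            break
    return state


final = run_agent(ToyAgent(), ScriptedLLM(["  LS ", "edit", "FINISH"]),
                  EchoRuntime(), State("fix bug"))
print(final.history)
```

Swapping `ScriptedLLM` for a real model client and `EchoRuntime` for a sandboxed executor leaves `run_agent` unchanged, which is the appeal of keeping the four roles separate.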
The CodeAct Agent is currently the most advanced agent. Additionally, the SWE Agent has been adapted into the CodeAct SWE Agent.
There is also a suite of microagents, each specialized in specific tasks, such as coding, mathematical operations, and database management (e.g., postgres_agent).
A manager or delegator agent is responsible for assigning tasks to the appropriate specialized agents.
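The delegation idea can be illustrated with a small dispatcher. This sketch is an assumption-heavy simplification: routing is done by keyword matching rather than by an LLM, and apart from `postgres_agent` the agent names (`math_agent`, `coder_agent`) and the keyword table are hypothetical.

```python
def make_agent(name):
    """Create a stand-in microagent that just tags the task with its own name."""
    def agent(task):
        return f"[{name}] handled: {task}"
    return agent


# Hypothetical registry of specialized agents.
MICROAGENTS = {
    name: make_agent(name)
    for name in ("coder_agent", "math_agent", "postgres_agent")
}

# Keyword routing is an assumption made for this sketch only.
ROUTES = [
    ("sql", "postgres_agent"),
    ("database", "postgres_agent"),
    ("integral", "math_agent"),
    ("solve", "math_agent"),
]


def delegate(task, default="coder_agent"):
    """Send the task to the first agent whose keyword appears in it."""
    lowered = task.lower()
    for keyword, agent_name in ROUTES:
        if keyword in lowered:
            return MICROAGENTS[agent_name](task)
    return MICROAGENTS[default](task)


print(delegate("Write a SQL query for the users table"))
print(delegate("Solve 2x + 3 = 7"))
print(delegate("Refactor the parser module"))
```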
OpenDevin also provides evaluation capabilities for coding agents by incorporating SWE-bench and other evaluation tools. The architecture supports the integration of any agent that adheres to a minimal common interface, which makes it easier to compare coding agent performance.
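To make the value of a minimal common interface concrete, here is a toy comparison harness. Everything in it is hypothetical: the `CodingAgent` protocol is not OpenDevin's actual interface, the two agents are trivial stand-ins, and scoring by string equality is a placeholder for what SWE-bench really does, which is to run a repository's tests against the generated patch.

```python
from typing import Protocol


class CodingAgent(Protocol):
    """Hypothetical minimal interface: a name and a solve() method."""
    name: str

    def solve(self, issue: str) -> str: ...


class AlwaysPatchAgent:
    name = "always-patch"

    def solve(self, issue):
        return "patch"


class KeywordAgent:
    name = "keyword"

    def solve(self, issue):
        return "patch" if "bug" in issue else "skip"


def resolve_rate(agent, tasks):
    """Fraction of tasks where the agent's output equals the expected output."""
    solved = sum(agent.solve(issue) == expected for issue, expected in tasks)
    return solved / len(tasks)


# Toy tasks: (issue text, expected output).
TASKS = [
    ("bug: crash on empty input", "patch"),
    ("question: how to install", "skip"),
    ("bug: wrong units", "patch"),
    ("docs typo", "skip"),
]

for agent in (AlwaysPatchAgent(), KeywordAgent()):
    print(agent.name, resolve_rate(agent, TASKS))
```

Because the harness only calls `solve`, any agent exposing that method can be dropped into the loop and scored on the same task list.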
The team also plans to integrate the agents more closely with developer workflows, including VSCode plugins, GitHub issues, PR comments, and CI/CD processes.