Generative AI Coding Agents – Dr. Charles Shen

The current landscape

The introduction of Devin has sparked significant interest in AI Agents for Software Engineering, further fueling the already dynamic Generative AI field. This is despite a rebuttal that surfaced a month later, challenging many of Devin’s claims. The intrigue surrounding Devin is particularly heightened due to its proprietary nature.

In response, open-source AI Agents striving to match or surpass Devin’s purported capabilities are rapidly gaining traction. Devin used SWE-Bench to benchmark its abilities. The Princeton team that developed SWE-Bench recently launched the open-source SWE-agent, demonstrating performance nearly on par with Devin. OpenDevin is another prominent open-source project aiming to compete with Devin. Other open-source AI coding agents include Devika, Aider, GPT Pilot, and GPT Engineer.

Multi-Agent frameworks, such as AutoGen, are likely to play an important role in this evolving landscape as well.

Evaluation tools

OpenAI has recently open-sourced simple-evals, a tool designed for evaluating LLMs, with a particular focus on chat-based models in a “zero-shot, chain-of-thought” setting. It covers HumanEval, which is commonly used for assessing LLM coding capabilities. simple-evals serves as a lightweight version of OpenAI Evals.

For evaluating real-world software engineering problem-solving skills, Princeton NLP’s SWE-Bench and SWE-Bench Lite are more suitable. This article is good further reading on the benchmark.

Run SWE-Agent

Follow the installation instructions.

Preparation:

Install Docker
Install Conda
Download the code and build the Docker images used to run SWE-agent. Running inside Docker is safer because coding agents execute AI-generated code on our computer.

git clone https://github.com/princeton-nlp/SWE-agent.git
cd SWE-agent
conda env create -f environment.yml
conda activate swe-agent
./setup.sh  # create the docker images to run swe-agent

Provide configuration information, such as API keys. Create and fill in the keys.cfg file.

Build

python run.py --model_name gpt4-0125 \
  --data_path https://github.com/pvlib/pvlib-python/issues/1603 \
  --config_file config/default_from_url.yaml

python run.py --model_name gpt3-0125 \
  --data_path https://github.com/pvlib/pvlib-python/issues/1603 \
  --config_file config/default_from_url.yaml

Make sure to check the model_name argument: gpt4-0125 gives the best quality but is much more expensive than gpt3-0125.
Passing the --open_pr flag will automatically open a PR if the issue is solved.
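
To try the same issue with several models, it can help to build the command line programmatically. The following is a minimal sketch that only assembles the arguments shown above (--model_name, --data_path, --config_file, --open_pr); the helper name build_swe_agent_command is hypothetical and not part of SWE-agent.

```python
import shlex


def build_swe_agent_command(issue_url, model_name="gpt4-0125",
                            config_file="config/default_from_url.yaml",
                            open_pr=False):
    """Return the argv list for one run.py invocation (hypothetical helper)."""
    cmd = [
        "python", "run.py",
        "--model_name", model_name,
        "--data_path", issue_url,
        "--config_file", config_file,
    ]
    if open_pr:
        # Only ask for a pull request when explicitly requested.
        cmd.append("--open_pr")
    return cmd


if __name__ == "__main__":
    issue = "https://github.com/pvlib/pvlib-python/issues/1603"
    # Print the commands instead of running them, so nothing is billed by accident.
    for model in ("gpt3-0125", "gpt4-0125"):
        print(shlex.join(build_swe_agent_command(issue, model)))
```

The printed lines can be pasted into the shell from the SWE-agent directory, or the list can be handed to subprocess.run once the cost of the chosen model is acceptable.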

Run OpenDevin

Installation

Check dependencies:

node --version
poetry --version

Note: if poetry is outdated, use brew upgrade poetry to update it (on Mac).
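
The two version checks can also be scripted. This is a small sketch, assuming only that both tools accept --version as shown above; the helper names are made up for illustration, and no minimum versions are enforced because the article does not state any.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Extract the first dotted version number from a tool's --version output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def tool_version(name):
    """Return the parsed version of a command-line tool, or None if it is missing."""
    if shutil.which(name) is None:
        return None
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout or result.stderr)


if __name__ == "__main__":
    for tool in ("node", "poetry"):
        version = tool_version(tool)
        label = "not found" if version is None else ".".join(map(str, version))
        print(f"{tool}: {label}")
```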

git clone https://github.com/OpenDevin/OpenDevin.git
cd OpenDevin
make build

There are errors:

Installing Python dependencies...
/bin/sh: chroma-hnswlib: command not found
Installing ...
Creating virtualenv opendevin in /Users/charles/github/OpenDevin/.venv
Collecting chroma-hnswlib
  Downloading chroma-hnswlib-0.7.3.tar.gz (31 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]

  ... snipped for brevity ...

      File "<string>", line 73, in <module>
      File "<string>", line 90, in BuildExt
    ValueError: list.remove(x): x not in list
    [end of output]

So we need to look into the make build process in the Makefile.

The relevant part is the install-python-dependencies target, which creates a virtual environment at .venv managed by Poetry:

install-python-dependencies:
    @echo "$(GREEN)Installing Python dependencies...$(RESET)"
    @if [ "$(shell uname)" = "Darwin" ]; then \
        echo "$(BLUE)Installing `chroma-hnswlib`...$(RESET)"; \
        export HNSWLIB_NO_NATIVE=1; \
        poetry run pip install chroma-hnswlib; \
    fi
    @poetry install --without evaluation
    @echo "$(GREEN)Python dependencies installed successfully.$(RESET)"

In this target, poetry install --without evaluation is the command that installs the Python dependencies and creates the virtual environment if it doesn’t exist. The --without evaluation option tells Poetry to skip the packages in the “evaluation” dependency group; groups are Poetry’s mechanism for managing separate sets of dependencies. According to pyproject.toml, the evaluation group contains only torch:

[tool.poetry.group.evaluation.dependencies]
torch = "*"

Now let’s make some adjustments. The system was using Python 3.12, but in our experience Python 3.11 currently causes fewer compatibility issues. Let’s pin it in pyproject.toml:

[tool.poetry.dependencies]
python = ">=3.11,<3.12"

Run poetry lock to update the lock file.

Run make build again.

It indeed succeeds!
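
If it is unclear which interpreter is active, the range ">=3.11,<3.12" can be checked directly. The sketch below compares only the major and minor version numbers; it illustrates the constraint and does not reproduce how Poetry itself resolves it.

```python
import sys


def satisfies_constraint(version, lower=(3, 11), upper=(3, 12)):
    """True when lower <= version < upper, comparing (major, minor) only."""
    major_minor = tuple(version[:2])
    return lower <= major_minor < upper


if __name__ == "__main__":
    ok = satisfies_constraint(sys.version_info)
    verdict = "matches" if ok else "does not match"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} {verdict} >=3.11,<3.12")
```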

Configure model usage

Running make setup-config generates the config.toml file.

Run

Execute make run, then open http://localhost:3001/ in a browser.

We ran into connection issues with “Brave” browser, but switching to “Chrome” worked.

Parting thoughts

At present, SWE-agent has hinted at the development of a user interface. In contrast, OpenDevin already has both a frontend and a backend, although its backend agent remains rudimentary. However, OpenDevin’s flexible architecture allows the backend to be replaced with different agents, and there are plans to use SWE-agent as one such backend. While these developments have not yet been released, the potential integration of the two projects still represents a valuable step forward for generative AI coding agents.

Updates (July 2024)

OpenDevin has made solid progress on agents. In OpenDevin’s agent framework, the basic control loop for an agent operates as follows:

while True:
  prompt = agent.generate_prompt(state)
  response = llm.completion(prompt)
  action = agent.parse_response(response)
  observation = runtime.run(action)
  state = state.update(action, observation)

Based on the current state, an agent determines the next step, processes it through the LLM to decide the appropriate action, executes the action, and then updates the state with the result. This loop then repeats.
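
The loop above can be exercised end to end with stand-in components. The sketch below is not OpenDevin code: State, PassThroughAgent, ScriptedLLM, and FakeRuntime are hypothetical classes, the model is replaced by canned responses, and a "finish" action plus a step limit are added as assumptions so the loop terminates.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    task: str
    history: list = field(default_factory=list)  # (action, observation) pairs

    def update(self, action, observation):
        # Return a new state rather than mutating the old one.
        return State(self.task, self.history + [(action, observation)])


class PassThroughAgent:
    """Hypothetical agent: forwards the model's reply as the next action."""

    def generate_prompt(self, state):
        return f"Task: {state.task}\nSteps so far: {len(state.history)}"

    def parse_response(self, response):
        return response.strip()


class ScriptedLLM:
    """Stand-in for a real model: replays canned responses, then says finish."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def completion(self, prompt):
        return next(self._responses, "finish")


class FakeRuntime:
    """Stand-in for a sandbox: reports the action instead of executing it."""

    def run(self, action):
        return f"ran: {action}"


def run_agent(agent, llm, runtime, state, max_steps=10):
    for _ in range(max_steps):
        prompt = agent.generate_prompt(state)
        response = llm.completion(prompt)
        action = agent.parse_response(response)
        if action == "finish":  # assumed stop signal, not from the original loop
            break
        observation = runtime.run(action)
        state = state.update(action, observation)
    return state


if __name__ == "__main__":
    final = run_agent(PassThroughAgent(), ScriptedLLM(["ls", "cat README.md"]),
                      FakeRuntime(), State("inspect the repository"))
    for action, observation in final.history:
        print(action, "->", observation)
```

Swapping ScriptedLLM for a real client and FakeRuntime for a sandboxed executor leaves run_agent unchanged, which is the point of keeping the four roles separate.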

The CodeAct Agent is currently the most advanced agent. Additionally, the SWE Agent has been adapted into the CodeAct SWE Agent.

There is also a suite of microagents, each specialized in specific tasks, such as coding, mathematical operations, and database management (e.g., postgres_agent).

A manager or delegator agent is responsible for assigning tasks to the appropriate specialized agents.
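
As a rough illustration of delegation, the sketch below routes a task to a specialized handler by keyword. This is an assumption made for brevity: the routing rule, the Delegator class, and the handler functions are hypothetical, and simple keyword matching stands in for whatever selection logic the real manager agent uses.

```python
def coder_agent(task):
    return f"coder handled: {task}"


def math_agent(task):
    return f"math handled: {task}"


def postgres_agent(task):
    return f"postgres handled: {task}"


class Delegator:
    """Hypothetical manager: picks the first agent whose keyword appears in the task."""

    def __init__(self, routes, fallback):
        self.routes = routes      # list of (keyword, agent) pairs, checked in order
        self.fallback = fallback  # used when no keyword matches

    def delegate(self, task):
        lowered = task.lower()
        for keyword, agent in self.routes:
            if keyword in lowered:
                return agent(task)
        return self.fallback(task)


if __name__ == "__main__":
    manager = Delegator([("sql", postgres_agent), ("sum", math_agent)],
                        fallback=coder_agent)
    for task in ("Write SQL to list users", "Compute the sum of 1..10", "Refactor the parser"):
        print(manager.delegate(task))
```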

OpenDevin provides evaluation capabilities for coding agents by incorporating SWE-bench and other evaluation tools. The architecture supports the integration of any agent that adheres to a minimal common interface, which makes it easier to compare coding agent performance.
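
One way to picture a minimal common interface is a single method that every agent implements, with a shared scoring function on top. The sketch below is hypothetical throughout: CodingAgent, solve, and resolve_rate are illustrative names rather than OpenDevin’s actual interface, and the check is a toy stand-in for a benchmark such as SWE-bench.

```python
from typing import Callable, Iterable, Protocol


class CodingAgent(Protocol):
    """Hypothetical minimal interface: a name plus one method that attempts a task."""
    name: str

    def solve(self, task: str) -> str: ...


def resolve_rate(agent: CodingAgent, tasks: Iterable[str],
                 is_resolved: Callable[[str, str], bool]) -> float:
    """Fraction of tasks whose output passes the supplied check."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    passed = sum(is_resolved(task, agent.solve(task)) for task in tasks)
    return passed / len(tasks)


class UpperAgent:
    name = "upper"

    def solve(self, task):
        return task.upper()


class NoopAgent:
    name = "noop"

    def solve(self, task):
        return task


if __name__ == "__main__":
    tasks = ["fix bug", "add test"]

    def check(task, output):
        # Toy check: the "correct" output is the upper-cased task text.
        return output == task.upper()

    for agent in (UpperAgent(), NoopAgent()):
        print(agent.name, resolve_rate(agent, tasks, check))
```

Because both agents are scored by the same function on the same tasks, their results are directly comparable, which is the benefit the common interface is meant to provide.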

The team also plans to integrate the agents more closely with developer workflows, including VSCode plugins, GitHub issues, PR comments, and CI/CD processes.