ICLR25 | LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

Caigao Jiang* · Xiang Shu* · Hong Qian · Xingyu Lu
Jun Zhou · Aimin Zhou · Yang Yu

*Equal Contribution, Corresponding Authors.

East China Normal University | Ant Group | Nanjing University

🤖Model Release

We release the LLMOPT-Qwen2.5-14B model on Hugging Face and conduct comprehensive performance evaluations. The updated evaluation results are shown in the following table; the original results correspond to Table 1 and Table 2 in the paper. The differences stem from two causes. First, we exclude all Mamo EasyLP and ComplexLP data from the training process, reserving them exclusively for testing. Second, unlike the version described in our paper, which used Qwen1.5-14B, this release is fine-tuned from the latest Qwen2.5-14B-Instruct model. The performance of LLMOPT-Qwen2.5-14B is as follows:

| Dataset | NL4Opt | Mamo Easy | Mamo Complex | NLP4LP | ComplexOR | IndustryOR | ICML Competition | OptiBench | OptMath | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| #Questions | 230 | 652 | 211 | 242 | 18 | 100 | 410 | 605 | 166 | - |
| ER with self-correction | 100.00% | 100.00% | 99.05% | 100.00% | 100.00% | 94.00% | 99.66% | 82.31% | 75.30% | 94.48% |
| SA with self-correction | 97.31% | 95.31% | 85.78% | 86.49% | 76.47% | 44.00% | 95.76% | 66.44% | 40.00% | 76.40% |
| AST with self-correction | 1.38 | 1.13 | 2.13 | 1.50 | 3.46 | 2.14 | 1.47 | 1.54 | 4.06 | 2.09 |
| ER w/o self-correction | 97.42% | 98.29% | 77.73% | 97.93% | 88.89% | 61.00% | 93.90% | 73.22% | 31.93% | 80.03% |
| SA w/o self-correction | 80.28% | 89.53% | 44.08% | 73.42% | 35.29% | 29.00% | 75.35% | 53.83% | 12.50% | 54.81% |

In the experiment, we use three metrics to comprehensively evaluate the optimization generalization of the algorithm: Execution Rate (ER), Solving Accuracy (SA), and Average Solving Times (AST). Specifically, ER is the proportion of solutions whose code runs without errors and produces output. SA is the proportion of solutions that correctly solve the optimization problem, i.e., find the optimal solution. AST is the average number of times the self-correction process is performed during testing.
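
To make these definitions concrete, below is a minimal sketch (not the official evaluation script) of how ER, SA, and AST could be computed from per-problem records; the record fields used here are assumptions for illustration only.

# Minimal sketch: compute ER, SA, and AST from per-problem evaluation records.
# Each record is assumed to have: "executed" (code ran without errors and produced
# output), "correct" (optimal solution found), and "solving_times" (number of
# self-correction rounds performed for that problem).
def compute_metrics(records):
    n = len(records)
    er = sum(r["executed"] for r in records) / n        # Execution Rate
    sa = sum(r["correct"] for r in records) / n         # Solving Accuracy
    ast = sum(r["solving_times"] for r in records) / n  # Average Solving Times
    return er, sa, ast

# Example with two dummy records:
records = [
    {"executed": True, "correct": True, "solving_times": 1},
    {"executed": True, "correct": False, "solving_times": 3},
]
print(compute_metrics(records))  # (1.0, 0.5, 2.0)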

📊Dataset Release

Data Structure

To facilitate the evaluation, we process all datasets into a unified data structure. Each dataset is organized as a jsonl file, and each line is an independent data entry. Each entry has four attributes: question, answer, ori, and index. The question field is a complete string description of the optimization problem, including all data needed to solve it. The answer field is a float value giving the objective function value of the optimal solution, i.e., the ground truth. The ori field indicates the source of the problem, that is, the name of the dataset. To facilitate aggregating results, the index field numbers the entries within each dataset.

The data are available under ./data/testset/.

An example (the first entry of the NL4Opt dataset):

{
    "question": "There has been an oil spill in the ocean and ducks need to be taken to shore to be cleaned either by boat or by canoe. A boat can take 10 ducks per trip while a canoe can take 8 ducks per trip. Since the boats are motor powered, they take 20 minutes per trip while the canoes take 40 minutes per trip. In order to avoid further environmental damage, there can be at most 12 boat trips and at least 60% of the trips should be by canoe. If at least 300 ducks need to be taken to shore, how many of each transportation method should be used to minimize the total amount of time needed to transport the ducks?", 
    "answer": 1160, 
    "ori": "5_nl4opt_test", 
    "index": 1
}
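
As a usage illustration, here is a minimal sketch (an assumption, not part of the released code) for reading one of these testset files in the unified jsonl format; the path follows the dataset descriptions below.

import json

# Minimal sketch: load a testset file and iterate over its entries.
path = "./data/testset/nl4opt_test.jsonl"
with open(path, "r", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f if line.strip()]

for entry in entries:
    # entry["question"] holds the full problem description,
    # entry["answer"] the ground-truth optimal objective value.
    print(entry["index"], entry["ori"], entry["answer"])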

Dataset Source

Here we explain the sources of all datasets and the detailed data processing procedure. Ground truth values with more than two decimal places are rounded to two decimal places. If you find any omissions or errors in the manual labeling, please feel free to correct them.
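
For reference, below is a minimal sketch of how a predicted objective value could be checked against the stored ground truth under this two-decimal convention; the comparison rule and tolerance are assumptions for illustration, not the official evaluation procedure.

# Hypothetical answer check under the two-decimal rounding convention above.
def is_correct(predicted: float, ground_truth: float, tol: float = 1e-2) -> bool:
    return abs(round(predicted, 2) - round(ground_truth, 2)) <= tol

print(is_correct(1160.004, 1160.0))  # True
print(is_correct(1158.5, 1160.0))    # False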

1. NL4Opt

The data for this testset comes from the NL4Opt competition; we only used the test split. The original dataset contains 245 problems, of which 15 were found to be unsolvable after manual inspection and removed; we manually labeled the remaining 230 optimization problems. The sorted data can be found in ./data/testset/nl4opt_test.jsonl.

2. Mamo Easy

This testset comes from the paper Mamo: a Mathematical Modeling Benchmark with Solvers. We obtained the original dataset of 652 problems from huggingface. Since we found some wrong ground truth values in the open-source data, we manually checked and re-labeled all the data. The manually checked data is stored in ./data/testset/mamo_easy_test.jsonl.

3. Mamo Complex

This testset comes from the paper Mamo: a Mathematical Modeling Benchmark with Solvers. We sorted out 211 original problems from the complex_lp split of the huggingface dataset and stored the original data in a unified format in ./data/testset/mamo_complex_test.jsonl.

4. NLP4LP

This testset comes from the paper OptiMUS: Optimization Modeling Using MIP Solvers and large language models. We sorted out these 242 feasible original problems from huggingface and stored the original data in a unified format in ./data/testset/nlp4lp.jsonl.

5. ComplexOR

This testset comes from the paper Chain-of-Experts: When LLMs Meet Complex Operations Research Problems. We sorted out these 18 feasible original problems from the GitHub repo and stored the original data in a unified format in ./data/testset/complexor.jsonl.

6. IndustryOR

This testset comes from the paper ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling. We sorted out these 100 original problems from huggingface and stored the original data in a unified format in ./data/testset/industryor.jsonl.

7. ICML Competition

The data for this testset comes from the ICML 2024 Challenges on Automated Math Reasoning competition, Track 3: Automated Optimization Problem-Solving with Code; we only used the test split. Since the competition organizer did not open-source the ground truth of the testset, we manually labeled these problems. The original dataset contains 421 problems, of which 11 were found to be unsolvable after manual inspection and removed, leaving 410 labeled problems. The sorted data can be found in ./data/testset/task3_test.jsonl.

8. OptiBench

This testset comes from the paper OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling. We sorted out these 605 original problems from the repository and stored the original data in a unified format in ./data/testset/optibench.jsonl.

9. OptMath

This testset comes from the paper OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling. We sorted out these 165 original problems from the repository and stored the original data in a unified format in ./data/testset/optmath.jsonl.

⚙️Inference

The following is example code for model inference, as used to obtain the experiment data:

from transformers import AutoModelForCausalLM, AutoTokenizer

path = "ant-opt/LLMOPT-Qwen2.5-14B"  # model and tokenizer checkpoint
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(path)

prompt = "Give me a short introduction to large language model."  # replace with an optimization problem description
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

⌛️Future Work

With the remarkable progress and rapid development of reasoning models (such as DeepSeek-R1 and OpenAI o1/o3) in solving complex mathematical problems, we have also developed an LLMOPT reasoning model. We will soon release this LLMOPT Reasoning version along with a new benchmarking effort.

📄Citation

If you have any questions about our work, please do not hesitate to submit an issue. If you find our resources helpful, please cite our paper.

@inproceedings{JiangShu2025llmopt,
  title     = {LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch},
  author    = {Caigao Jiang and Xiang Shu and Hong Qian and Xingyu Lu and Jun Zhou and Aimin Zhou and Yang Yu},
  booktitle = {Proceedings of the Thirteenth International Conference on Learning Representations (ICLR)},
  year      = {2025},
  address   = {Singapore, Singapore},
  url       = {https://openreview.net/pdf?id=9OMvtboTJg}
}