Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation
Researchers are developing new methods to improve the evaluation and training of large language models (LLMs). One approach, SCOPE, calibrates LLM judges to ensure reliable pairwise evaluations with controlled error rates. Another technique, D3, uses dynamic influence graphs to optimize data scheduling during LLM training by considering sample interactions. Additionally, OBCache offers a principled framework for pruning key-value caches to reduce memory overhead during long-context inference, improving accuracy.
AI IMPACT: New research introduces methods for more reliable LLM evaluation, efficient training data scheduling, and optimized inference, potentially improving LLM performance and resource utilization.
- LLMs
- PagedAttention
- FlashAttention
- Nested WAIT
- A100 GPU
- Llama-2-7B
- Asteria
- FasterTransformer
- Orca
- vLLM
- SCICONVBENCH
- Sarathi-Serve
- KVDrive
- LLMEval-Logic
- LLaDA2.0-mini
- DeepSeek-R1-Distill-7B
- POPE benchmark
- LLaDA2.0-flash
- V* benchmark
- TIDE
- FT-Dojo
- Frontier
- arXiv
- llama.cpp
- WebGPU
- PALS
- Charon
- LlamaWeb
- FT-Agent
- rePIRL
- LLaMA
- LoRA
- Qwen
- GPT-5
- OBCache
- FEM-Bench
- Gemini 3 Pro
- AxBench
- Item Response Theory
- SCOPE
- Hermes
- Lean