Modern embedded systems are powered by increasingly capable hardware and are increasingly reliant on Artificial Intelligence (AI) technologies for advanced capabilities. Large Language Models (LLMs) are now widely used to enable the next generation of human-computer interaction. While LLMs have shown impressive task orchestration capabilities, their computational complexity has largely confined them to the cloud, which introduces internet dependency and additional latency. While smaller LLMs (\(< 5B\) parameters) can run on modern embedded systems such as smartwatches and phones, their performance on UI interaction and task orchestration remains poor. In this paper we introduce LUCI: Lightweight UI Command Interface. LUCI follows a separation-of-tasks structure, using a combination of LLM agents and algorithmic procedures to accomplish sub-tasks while a high-level LLM agent with rule-based checks orchestrates the pipeline. LUCI addresses the limitations of previous In-Context Learning approaches by incorporating a novel semantic information extraction mechanism that compresses the front-end code into a structured intermediate Information-Action-Field (IAF) representation. These IAF representations are then consumed by an Action Selection LLM. This compression gives LUCI a much larger effective context window, along with better grounding from the context information retained in the IAF.
Pairing our multi-agent pipeline with our IAF representations allows LUCI to achieve task success rates similar to GPT-4V on the Mind2Web benchmark, while using the 2.7B-parameter, text-only PHI-2 model. When run with GPT-3.5, LUCI shows a \(20\%\) improvement in task success rate over the state of the art (SOTA) on the same benchmarks.
In today’s connected technology landscape, embedded systems play a critical role: most Internet of Things (IoT) devices are powered by increasingly capable embedded systems. The prevalence of these smaller embedded devices, such as smartwatches and the newly burgeoning field of “smart glasses”, requires a rethinking of what human-computer interaction and next-generation operating systems for embedded systems may look like. A key challenge is efficiently bridging the gap between interfaces designed for larger devices, which rely on touch or keyboard/mouse input, and smaller devices that rely on voice or minimal gesture and button inputs.
Natural Language Processing (NLP) has been a key building block in these voice-first user interfaces, such as those found in smart speakers and smartwatches, with digital assistants like Google Assistant, Siri, and Alexa serving as the first interfaces for user interaction on voice-only devices. The advent of large language models (LLMs) has ushered in a new era of voice command-focused devices, such as the Rabbit R1. We define ‘voice command-focused devices’ as devices whose interface primarily relies on parsing and acting on voice commands from a user, along with any additional context information such as images. However, while these devices have proved extremely powerful within their limited scope, they lack the ability to truly interact with the Web and cannot serve as a full replacement for touch and keyboard/mouse interfaces.
A key challenge in this domain is the ability of an LLM to convert user prompts into actions. Most tools and websites are designed for humans, which makes them challenging for LLMs to navigate. The first approaches attempted to extend the coding capabilities of LLMs to work with application APIs. They adopted a Planner, Actor, and Reporter structure to provide the necessary context to the LLMs, i.e., grounding, allowing them to interact with the environment (the computer system). However, API-based methods require natural language instructions to be written for the API of each application, which limits their generalization across applications. In-Context Learning (ICL) was used to remedy this by providing the necessary context in the prompt rather than through pre-training, but it is limited by the context length of the LLMs.
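To make the Planner-Actor-Reporter pattern concrete, the following minimal Python sketch shows one way such a loop could be wired up. It is an illustration only: the prompts, the `llm` and `api` callables, and the stopping convention are our own assumptions, not the interface of any specific prior system.

```python
# A minimal sketch of the Planner / Actor / Reporter loop used by API-based
# approaches. All names and prompts are illustrative assumptions; `llm` and
# `api` are caller-supplied callables (e.g., a local model and an application
# API wrapper).

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Report:
    step: str          # the step the planner proposed
    observation: str   # summarized outcome of executing that step


def run_task(task: str,
             llm: Callable[[str], str],
             api: Callable[[str], str],
             max_steps: int = 10) -> List[Report]:
    reports: List[Report] = []
    for _ in range(max_steps):
        # Planner: ask the LLM for the next API-level step, grounded by prior reports.
        history = "\n".join(f"{r.step} -> {r.observation}" for r in reports)
        step = llm(f"Task: {task}\nHistory:\n{history}\nNext step (or DONE):")
        if step.strip().upper().startswith("DONE"):
            break
        # Actor: execute the planned step against the application API.
        raw = api(step)
        # Reporter: compress the raw output so it fits the planner's context window.
        summary = llm(f"Summarize this result for the planner:\n{raw[:2000]}")
        reports.append(Report(step=step, observation=summary))
    return reports
```

Note that every report is folded back into the planner prompt, which is precisely why the context length of the LLM becomes the bottleneck for longer tasks.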
Simultaneously, another avenue of research explored LLM frameworks for manipulating the graphical user interface (GUI). These frameworks teach LLMs to interact with the GUI front end of applications, forgoing the need for application-specific APIs and leading to better generalization. Early Reinforcement Learning (RL) approaches that operated directly on Hypertext Markup Language (HTML) struggled to select the relevant UI elements and required supervised training. In this work, our aim is to remedy these challenges and create a lightweight LLM GUI control framework.
To remedy the challenge of identifying UI elements from HTML, multimodal models that operate on both text and image inputs, such as GPT-4V, were introduced. These models achieve state-of-the-art performance on the Mind2Web benchmark. However, they are limited by their large size and by the need for fine-tuning on a large multimodal dataset. This size makes it impossible to run them locally on current and near-term embedded systems, creating an internet dependency whose direct consequences are continuous server costs and additional latency.
We introduce a novel framework, the Lightweight UI Command Interface (LUCI), which enables multi-application GUI-based orchestration via an ensemble of lightweight LLMs. The LUCI framework is designed to be modular, hierarchical, and OS-agnostic, enabling it to work seamlessly across both native and web interfaces. This is achieved through a separation-of-responsibilities model that decomposes tasks into sub-tasks and solves them recursively with a mix of LLM and rule-based components. Our approach builds on insights from multi-agent LLM pipelines and Mixture-of-Experts models. LUCI uses a set of fine-tuned models with specialized prompts to handle different aspects of the system, such as high-level planning, tool selection, and action selection; the information flow between these models is shown in the LUCI architecture figure. LUCI adopts an application-centric approach to planning: the system first identifies the application most relevant to the given task prompt, which both greatly simplifies planning and serves as effective grounding for downstream steps. A ‘conversational model’ then generates sub-tasks for the prompt based on the selected applications, and these sub-tasks are processed iteratively. Additionally, to efficiently parse the front-end code of applications, we developed a rule-based semantic parser, the ‘UI Extractor’, which compresses the code into an intermediate representation we call the Information-Action-Field (IAF) format (sketched below). This greatly increases the effective attention window of the lightweight LLMs. Overall, the key contributions of LUCI can be summarized as follows: (1) a modular, hierarchical, OS-agnostic multi-agent framework that orchestrates GUI tasks across applications using lightweight LLMs; (2) a rule-based UI Extractor that compresses front-end code into the intermediate IAF representation, enlarging the effective context window and improving grounding; and (3) an evaluation on the MiniWoB++ and Mind2Web benchmarks showing that LUCI paired with small, text-only models matches or exceeds much larger multimodal models such as GPT-4V.
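To make the IAF representation concrete, the following sketch shows one plausible shape for the records the UI Extractor could emit, together with a toy rule-based extractor over HTML. The exact IAF grammar and extraction rules are not spelled out in this excerpt, so the field layout, the tag-to-action mapping, and the class names below are assumptions made purely for illustration.

```python
# A toy rule-based 'UI Extractor' sketch. The IAF record shape
# (information / action / field) and the tag-to-action rules are assumptions;
# the real parser described in the paper may differ.

from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class IAF:
    information: str           # surrounding text/label that gives context
    action: str                # e.g. "click", "type", "select"
    field: Optional[str]       # target identifier (id or name attribute), if any


class UIExtractor(HTMLParser):
    """Compress front-end HTML into a flat list of IAF records."""

    ACTIONS = {"a": "click", "button": "click", "input": "type", "select": "select"}
    VOID_TAGS = {"input"}      # elements with no closing tag / inner text

    def __init__(self) -> None:
        super().__init__()
        self.records: List[IAF] = []
        self._open: Optional[IAF] = None   # record still collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag in self.ACTIONS:
            a = dict(attrs)
            rec = IAF(
                information=a.get("placeholder") or a.get("aria-label") or "",
                action=self.ACTIONS[tag],
                field=a.get("id") or a.get("name"),
            )
            self.records.append(rec)
            self._open = None if tag in self.VOID_TAGS else rec

    def handle_data(self, data):
        if self._open and data.strip():
            self._open.information = (self._open.information + " " + data.strip()).strip()

    def handle_endtag(self, tag):
        if tag in self.ACTIONS:
            self._open = None


if __name__ == "__main__":
    extractor = UIExtractor()
    extractor.feed('<input id="q" placeholder="Search flights"><button id="go">Search</button>')
    for record in extractor.records:
        print(record)
    # IAF(information='Search flights', action='type', field='q')
    # IAF(information='Search', action='click', field='go')
```

Even this crude compression discards markup, styling, and script content while keeping the label, the afforded action, and the target identifier, which is the property that lets a small Action Selection LLM work within a modest context window.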
Our experiments show that LUCI achieves a success rate of over \(99\%\) on the MiniWoB++ benchmark. Additionally, LUCI achieves up to a \(31\%\) improvement in Step SR and a \(24\%\) improvement in OP F1 score over GPT-4V on the Mind2Web benchmark while using GPT-3.5. When using the \(2.7\)B-parameter PHI-2 model, LUCI demonstrates a \(10\%\) increase in Step SR and a \(7.9\%\) increase in OP F1 score over GPT-4V while using roughly \(10^3\) times fewer parameters (\(1.8\) trillion for GPT-4V vs. \(2.7\) billion for PHI-2).
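For scale, the parameter ratio behind the ‘roughly \(10^3\) times fewer parameters’ claim, using the figures quoted above, works out to

\[
\frac{1.8 \times 10^{12}}{2.7 \times 10^{9}} \approx 6.7 \times 10^{2},
\]

i.e., close to three orders of magnitude.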
@inproceedings{10.1145/3735452.3735536,
author = {Lagudu, Guna and Sharma, Vinayak and Shrivastava, Aviral},
title = {LUCI: Lightweight UI Command Interface},
year = {2025},
isbn = {9798400719219},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3735452.3735536},
doi = {10.1145/3735452.3735536},
booktitle = {Proceedings of the 26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems},
pages = {182–191},
numpages = {10},
keywords = {Embedded Systems, LLM, System Automation},
location = {Seoul, Republic of Korea},
series = {LCTES '25}
}