🍊 PageGuide: Browser Extension to Assist Users in Navigating a Webpage
and Locating Information
A browser extension that grounds LLM answers directly in the page:
🖍️ highlights evidence, 🗺️ navigates step-by-step, 🙈 hides distractions, 📄 reads PDFs, and 👁️ answers visual questions (see Method and Examples).
Users browsing the web daily struggle to locate relevant information
on cluttered pages, complete unfamiliar multi-step tasks, and stay
focused amid distracting content. State-of-the-art AI assistants
and browser agents return answers without showing where information
comes from, forcing users to manually verify results and blindly
trust every automated step.
We present 🍊 PageGuide, a browser extension that
grounds LLM answers directly in the HTML DOM via visual overlays,
addressing three core user needs:
Find — locating and highlighting relevant evidence in-situ so users can instantly verify answers on the page;
Guide — showing step-by-step instructions one at a time so users can follow and perform actions by themselves;
Hide — hiding distracting content with a per-element justification and a reviewable checklist.
In a within-subject controlled user study (N = 94), PageGuide outperforms unaided browsing across all modes:
Hide accuracy improves by 26 percentage points and task time drops by 70%;
Guide completion rate increases by 30 percentage points;
and Find reduces Ctrl+F usage by 80% and task time by 19%.
Motivation
Figure 1:
(a) Existing web agents can fail in three ways: providing answers that cannot be verified from the page, failing
to highlight relevant page elements, or failing to hide content requested by the user. On the Amazon River Wikipedia page,
ChatGPT Atlas returns the correct answer, but it does not highlight the supporting evidence on the page, making the response
difficult for the user to verify. On TradingView, ChatGPT Atlas explains how to move the price scale but does not highlight
the relevant Settings icon on the chart. On Threads, Gemini Agent cannot directly hide negative comments about Ronaldo
on the page. (b) In contrast, given a query in the agent side panel on the right (powered by Gemini-3-Flash), 🍊 PageGuide
displays the full answer with clickable references that navigate to the corresponding highlighted evidence on the page
(e.g., the page jumps to the highlighted mention of
Nevado Mismi
when the user clicks
Nevado Mismi [1]).
ChatGPT Atlas. Returns the correct answer but does not highlight the supporting evidence on the page, making the response difficult for the user to verify.
ChatGPT Atlas (YouTube). Explains how to perform a task but does not highlight the relevant UI element on the page, leaving the user to locate it manually.
Browser Use. Autonomously executes actions without showing where each step is performed on the page, so users cannot verify or intervene at each stage.
Gemini Agent. Cannot directly hide specific content requested by the user; the page remains unchanged after the request.
Molmo Web. Points to a location on the page but does not ground the answer in the actual HTML DOM element, preventing in-situ verification.
Video Demonstrations
Method
PageGuide offers three interaction modes, each targeting a distinct user need.
A lightweight intent router classifies each query and dispatches it to the appropriate handler,
which reads the live HTML DOM and performs a grounded action directly on the page.
Figure 4: Given a user query, the Router assigns it to one of
three handlers, after which the agent reads the HTML DOM
and produces the corresponding final state. (a) Find: for
factual lookup queries (e.g., "How many episodes are in Stranger Things?"),
the agent locates supporting evidence spans; the final state highlights the relevant elements directly
on the page.
(b) Guide: for navigation queries
(e.g., "How do I change my password?"), the agent iteratively
generates actions and incorporates feedback; the final state
is the target page reached after completing the steps (e.g., the
Change Password form). (c) Hide: for content-hiding
queries (e.g., "Hide political posts"), the agent scores HTML
DOM elements based on the user's intent; the final state hides
the matched elements on the page.
System: Given a user query and a brief page context, classify the query into one of three handlers: find (factual lookup), guide (step-by-step task), or hide (content hiding). Return a JSON object with the handler, confidence score, and a one-sentence justification.
"What is the price of this product?" → {"handler": "find", "confidence": 0.9, "reason": "Question about page content"}
"How do I report this video?" → {"handler": "guide", "confidence": 0.9, "reason": "How-to question needing step-by-step guidance"}
"Hide the ads on this page" → {"handler": "hide", "confidence": 0.95, "reason": "Request to hide ads"}
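The router's JSON output can be consumed by a small dispatch shim. A minimal sketch (the handler names come from the prompt above; the parsing and fallback behavior are our assumptions, not the authors' code):

```javascript
// Dispatch shim for the intent router's JSON output.
// Falls back to the read-only "find" mode when the output is malformed.
const HANDLERS = ["find", "guide", "hide"];

function routeQuery(llmOutput) {
  let parsed;
  try {
    parsed = JSON.parse(llmOutput);
  } catch {
    parsed = null;
  }
  if (!parsed || !HANDLERS.includes(parsed.handler)) {
    // Safest default: a factual lookup that changes nothing on the page.
    return { handler: "find", confidence: 0.0, reason: "fallback: unparseable router output" };
  }
  return parsed;
}
```

Defaulting to Find on a parse failure is one reasonable choice, since it is the only mode that does not modify or navigate the page.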
System: Given a user query and a structured HTML DOM element index, answer the query in natural language. For every factual claim, insert an inline citation in the format [N:"exact phrase"], where N is the element index and exact phrase is the verbatim text span supporting the claim.
User: Query: "{query}" HTML DOM index: {element_id, text, tag, bbox}
Example
Q: "Who directed this movie?"
A: The movie was directed by Christopher Nolan [45:"Christopher Nolan"].
Q: "Who are the main actors?"
A: The main actors are Leonardo DiCaprio [23:"Leonardo DiCaprio"], Tom Hardy [27:"Tom Hardy"], and Ellen Page [31:"Ellen Page"].
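The inline citation format above is regular enough to parse mechanically. A sketch of extracting `[N:"exact phrase"]` markers from an answer string (the format is from the Find prompt; the function itself is illustrative, not the authors' implementation):

```javascript
// Extract inline citations of the form [N:"exact phrase"] from an answer.
// Returns the display text (markers stripped) plus a list of
// {index, phrase} pairs mapping each claim to its DOM element.
function parseCitations(answer) {
  const cites = [];
  for (const m of answer.matchAll(/\[(\d+):"([^"]+)"\]/g)) {
    cites.push({ index: Number(m[1]), phrase: m[2] });
  }
  // Strip the markers (and any space before them) for display.
  const text = answer.replace(/\s*\[\d+:"[^"]+"\]/g, "");
  return { text, cites };
}
```

Each `index` can then be resolved against the element index to scroll to and highlight the verbatim `phrase` on the page.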
System: Given a user task and a structured HTML DOM element index, produce one step at a time as a JSON action object — instruction text, target SoM index, action type (click, input, scroll), and a next-step hint. Guide the user ONE step at a time.
User: Task: "{query}" Step: {step_number} HTML DOM index: {element_id, text, tag, bbox}
Example
Q: "How do I report this video?" (Step 1)
→ {"step": 1, "instruction": "Click the three-dot menu (⋮) to see more options", "highlight": {"index": 5, "text": "⋮"}, "waitFor": "click", "isLastStep": false, "nextStepHint": "The menu will open with a Report option"}
(Step 2, after menu opened — PAGE INDEX now shows [20] Report)
→ {"step": 2, "instruction": "Click 'Report' to report this video", "highlight": {"index": 20, "text": "Report"}, "waitFor": "click", "isLastStep": true, "nextStepHint": "You'll see reporting options"}
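Each step object can be resolved against the element index before anything is shown to the user. A sketch (field names mirror the step JSON above and `element_id` follows the HTML DOM index schema; the function itself is our assumption):

```javascript
// Resolve one Guide step: look up the highlight target in the element
// index and package what the side panel needs to render.
function applyStep(step, domIndex) {
  const el = domIndex.find((e) => e.element_id === step.highlight.index);
  return {
    instruction: step.instruction,   // shown in the side panel
    target: el ? el.text : null,     // element to highlight on the page
    awaitUser: step.waitFor,         // wait for the user's action (e.g. "click")
    finished: step.isLastStep,       // session ends only on the last step
  };
}
```

Because `awaitUser` gates each step on an explicit user action, the model never advances on its own, which is what keeps the user in control of the pace.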
System: Given a hiding request and a structured HTML DOM element index, identify all elements that match the user's intent. For each matched element, return: (1) the element index, (2) a one-sentence justification, and (3) a short content snippet. Return at most 15 items; pick the most prominent if more match.
User: Request: "{query}" HTML DOM index: {element_id, text, tag, bbox}
Example
Request: "Hide the ads on this page"
→ {"found": [{"index": 12, "reason": "Sponsored post marked as 'Ad'", "snippet": "Limited-time offer: Get 30% off today."}, {"index": 47, "reason": "Promoted banner advertisement", "snippet": "Shop now — exclusive deal"}], "message": "Found 2 advertisement elements on the page"}
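The Hide response is designed to be reviewed before anything changes on the page. A sketch of applying only the user-confirmed items (the `data-pg-index` attribute and function name are our assumptions; the 15-item cap comes from the prompt above):

```javascript
// Hide only the elements the user confirmed in the review checklist.
// `confirmed` is a Set of element indexes the user left checked.
function confirmAndHide(response, confirmed, doc) {
  const hidden = [];
  for (const item of response.found.slice(0, 15)) {   // prompt caps at 15 items
    if (!confirmed.has(item.index)) continue;         // user unchecked this one
    const el = doc.querySelector(`[data-pg-index="${item.index}"]`);
    if (el) {
      el.style.display = "none";                      // collapse the element
      hidden.push(item.index);
    }
  }
  return hidden;
}
```

Keeping the model's role limited to *proposing* elements, with the actual style change applied only after confirmation, is what makes the process reviewable.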
Show the exact evidence behind every answer
Figure 1:
Given a query in the agent side panel on the right (powered by Gemini-3-Flash), PageGuide
displays the full answer with clickable references that navigate to the corresponding highlighted evidence on the page
(e.g., the page jumps to the highlighted mention of
Nevado Mismi
when the user clicks
Nevado Mismi [1]).
Step-by-step help while the user stays in control
Figure 2: Given the query "How to add ABC to this GitHub project?",
PageGuide (powered by Gemini-3-Flash) generates a step-by-step plan and delivers it one step at a time.
The target UI element is highlighted directly on the page
(e.g., Settings,
Collaborators),
while the sidebar panel shows the current instruction, the outcome hint, and
Next / Stop controls.
The user always drives the pace: each step only advances when Next is explicitly clicked,
keeping the user in full control, especially when verification is required
(e.g., entering a password or confirming the collaborator's account).
Navigate through FAQ sections step by step
Guide Example 2: Given the query "How to find the time frames to find a lost item?",
PageGuide navigates the user through an FAQ page step by step.
Each step highlights the target element on the page
(e.g., form,
entered into our system for processing)
while the sidebar panel delivers the current instruction, a next-step hint, and
Next / Stop controls.
Remove distractions with a transparent, reviewable process
Figure 3: On social platforms such as X.com, users often encounter repetitive or distracting content.
Given the query "Can you hide advertisements?",
PageGuide identifies matching HTML DOM elements and surfaces a summary pop-up on the right
listing the detected items. The user can review and confirm the selection before the action is applied: each confirmed element
is hidden via CSS display:none, letting the surrounding content reflow cleanly.
Ask questions about any PDF — directly in the browser
PDF Reading. When the user opens a PDF in the browser,
PageGuide automatically detects the document and enables document-level Q&A.
The agent reads the full PDF content and answers questions with inline citations,
so users can verify every claim without leaving the browser tab.
Answer questions about images and visual content on the page
Visual Q&A. For pages containing charts, diagrams, or images,
PageGuide captures the relevant visual element and routes the query to a vision-capable model.
Users can ask natural-language questions about any image on the page and receive grounded,
evidence-backed answers without needing to open a separate tool.
Turn off the page to reclaim your focus
Page Off. With a single click, PageGuide dims the entire page
and blocks all interactive elements, helping users avoid distractions and stay focused.
The overlay can be dismissed at any time, restoring the page to its original state
without any permanent changes to the DOM.
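Such an overlay can be implemented as a single fixed, full-viewport element that sits above the page and captures clicks; removing it restores the page exactly. A sketch (the element id, z-index, and opacity here are our assumptions):

```javascript
// "Page Off": dim the page and block interaction with one overlay element.
// The overlay is the only DOM change, so removing it fully restores the page.
function pageOff(doc) {
  const overlay = doc.createElement("div");
  overlay.id = "pageguide-page-off";
  Object.assign(overlay.style, {
    position: "fixed",
    inset: "0",                      // cover the whole viewport
    zIndex: "2147483647",            // above everything else on the page
    background: "rgba(0, 0, 0, 0.85)",
  });
  doc.body.appendChild(overlay);
  return () => overlay.remove();     // call to turn the page back on
}
```

Because the overlay intercepts pointer events by default, no per-element disabling is needed, and dismissal is a single `remove()`.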
User Study
We conducted a within-subject controlled study (N = 94) on real websites,
with counterbalanced ordering to mitigate learning effects.
Each participant completed 6 tasks (2 per mode) under both a control condition
(standard browser tools) and an extension condition.
We measured task accuracy, completion time, and behavioral signals (Ctrl+F, clicks, scrolls).
Figure 5: Task performance comparing control and extension conditions across all three features.
Find and Guide are evaluated by accuracy (proportion of correctly completed tasks);
Hide is evaluated by accuracy (proportion of target elements correctly identified).
PageGuide improves performance in all features: Find (0.81 → 0.86),
Guide (0.23 → 0.53), and Hide accuracy (0.30 → 0.56),
with the largest gains in Guide and Hide.
Figure 6: Task completion time (seconds) for the control and extension conditions,
restricted to correctly completed tasks. Each box shows the median and interquartile range across participants.
PageGuide reduces completion time across all three features:
Find (65.2s → 52.8s), Hide (104s → 31.7s), and Guide (95.8s → 66.7s),
with the largest gain observed in Hide.
Figure 7: Behavioral signals (mean ± SE) comparing the control and extension conditions.
Each bar shows the average count or distance per task for five metrics: Ctrl+F presses, text selections,
mouse clicks, scroll count, and mouse movement distance. All five metrics decrease substantially with
PageGuide, indicating that users rely less on manual search and perform fewer interactions
to complete the same tasks.
Figure 8: Behavioral metrics (mean ± SE) broken down by task type
(Find, Guide, Hide) and condition. While Find and Hide
show consistent reductions across all signals, Guide shows a different pattern:
page visits and mouse movement distance increase with PageGuide, reflecting that
the extension actively guides users to navigate to new pages as part of the task.
Figure 9: Post-study Likert ratings (1 = Strongly Disagree, 7 = Strongly Agree)
for each interaction mode. Each mode includes three questions: whether PageGuide is accurate
or gives correct guidance (F1/G1), whether it makes the task easier (F2/G2/H2), and whether the task
would be difficult to complete without it (F3/G3/H3). Bars extending to the right indicate agreement.
Find and Hide show the most concentrated positive distributions
(89–91% agreement on ease of use), while Guide shows slightly more variance,
reflecting the added complexity of multi-step procedural tasks.
BibTeX
@misc{pageguide2026,
title = {PageGuide: Browser Extension to Assist Users in Navigating
a Webpage and Locating Information},
author = {Tin Nguyen and Thang T. Truong and Runtao Zhou and Trung Bui and Chirag Agarwal and Anh Totti Nguyen},
year = {2026},
note = {Under review}
}