Feedback: Add prompting suggestions + strong descriptions so users can have success with the server · Issue #37 · mobile-next/mobile-mcp · GitHub

Feedback: Add prompting suggestions + strong descriptions so users can have success with the server #37


Closed
govindrai opened this issue Apr 18, 2025 · 4 comments

Comments

govindrai commented Apr 18, 2025

Device/Package Details

Latest Mobile-MCP commit as of 4/18/2025 (8083235) + Cursor IDE + macOS 15.1.1.

Thanks for writing this MCP server. It takes a lot of hard work. I've been messing around with your project, and I wanted to share my experience. I know the project is in very early stages, and I am just trying to understand how to use the software as efficiently as possible.

Feedback

I have been having a hard time getting the AI models to call the right tools with the right arguments.

I think something that can help is improving tool descriptions (maybe even adding one-shot or multi-shot examples in the tool description) + providing some tried-and-true prompting examples for each tool that users can send to their LLM.
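To make the idea concrete, here is a minimal sketch of a tool description that carries its own one-shot examples. The `ToolDefinition` and `ToolExample` shapes and the `renderDescription` helper are hypothetical, made up for illustration, and are not mobile-mcp's actual types. The `type_keys` name and its `text`/`submit` parameters are taken from the conversation further down.

```typescript
// Hypothetical shapes for illustration only; not mobile-mcp's real types.
interface ToolExample {
  userRequest: string;              // what the user might say in natural language
  args: Record<string, unknown>;    // the arguments the model should send
}

interface ToolDefinition {
  name: string;
  summary: string;
  examples: ToolExample[];
}

// Builds the description string the LLM would see, appending the examples.
function renderDescription(tool: ToolDefinition): string {
  const lines: string[] = [tool.summary];
  if (tool.examples.length > 0) {
    lines.push("Examples:");
    for (const ex of tool.examples) {
      lines.push(
        `- User: "${ex.userRequest}" -> call ${tool.name} with ${JSON.stringify(ex.args)}`
      );
    }
  }
  return lines.join("\n");
}

const typeKeys: ToolDefinition = {
  name: "type_keys",
  summary:
    "Type text into the focused element. Prefer this over tapping individual on-screen keys.",
  examples: [
    { userRequest: "enter the passcode 1111", args: { text: "1111", submit: false } },
  ],
};

console.log(renderDescription(typeKeys));
```

The point is only that the examples ship with the tool, so the user does not have to spell out tool names and arguments in every prompt.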

For example, say I want to take screenshots of each tab in a mobile app. Some of those tabs are scrollable. The LLM will always provide direction "up" instead of "down". However, to scroll up, the direction needs to be "down". As a result, the server never scrolls down successfully until you explicitly tell the LLM to send the "down" scroll direction.
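One way a description could remove that ambiguity is to state the direction in terms of what the user wants to see. The sketch below assumes, based only on the behavior described above, that sending "down" is what reveals content further down the page. The `Goal` type, `directionFor` helper, and description text are hypothetical and not taken from the server.

```typescript
type ScrollDirection = "up" | "down";

// Hypothetical: the intent expressed as what the user wants to see next.
type Goal = "see_content_below" | "see_content_above";

// Assumption from the report above: "down" reveals content further down the page.
function directionFor(goal: Goal): ScrollDirection {
  return goal === "see_content_below" ? "down" : "up";
}

// Hypothetical description text that spells out the semantics for the LLM.
const scrollDescription: string = [
  'direction: "down" reveals content further down the page;',
  '"up" returns toward the top.',
  'Example: to see the rest of a long tab, send {"direction":"down"}.',
].join(" ");

console.log(scrollDescription);
console.log(directionFor("see_content_below"));
```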

Currently I feel that users will need to specify tool names and provide exact arguments (ideally not since natural language is preferred here), but maybe that is necessary if the model can't call the right tools for whatever reason.

What I wanted to do: Get Cursor to use mobile-mcp to get past a passcode screen (shown below). It kept trying to tap the passcode in on the keyboard when it could have easily typed it (but it chose to tap). But then, for whatever reason, the tapping never worked. Not sure if this is a bug or a tool-calling issue.

Image

My LLM conversation

Affirm App Screenshot Instructions

User:

open the affirm app on the iphone 16 simulator and take a screenshot of the home tab. if you hit the passcode screen type in 1111 as the passcode.

Cursor:

I'll help you open the Affirm app on the iPhone 16 simulator and take a screenshot of the home tab.

Cursor:

I see that we need to enter the passcode 1111. I'll tap the number 1 four times.

Cursor:

It looks like the passcode wasn't accepted. Let me try entering 1111 again with a slight delay between each tap to ensure they register properly.

User:

please explain why you are clicking on coordinates instead of typing?

Cursor:

I'm clicking on coordinates instead of typing because the passcode screen shows a numeric keypad rather than a text input field. The mobile interaction tools available to me work differently based on the type of input needed:

  1. For standard text fields with a keyboard, I would use the type_keys function
  2. For PIN/passcode screens with numeric keypads, I need to physically tap the number buttons by clicking at their coordinates

The passcode screen presents a specialized keypad interface where each button needs to be tapped individually. I identified the coordinates of the "1" button (x:100, y:884) and am tapping it four times to enter "1111".

User:

how do you know it's a specialized keyboard? seems like a normal ios numberpad keyboard to me

Cursor:

You're right - it is a standard iOS numeric keypad. I should have been more precise in my explanation.

I'm clicking on coordinates because this passcode screen requires tapping the numeric buttons directly rather than using the type_keys function. Even though it's a standard iOS keypad, passcode entry screens typically require direct interaction with the on-screen buttons rather than accepting text input via typing functions.

User:

what does the type_keys tool do?

Cursor:

The type_keys tool types text into a focused input element. It has two parameters:

  • text: The string of text you want to type
  • submit: Boolean that determines whether to submit the text (like pressing Enter)

It's designed for typing into text fields that accept keyboard input. For the passcode screen, I could have tried using type_keys with "1111", but I chose to click the number buttons directly since the passcode screen shows a numeric keypad interface.

User:

just type it instead of tapping

Cursor:

Perfect! We've successfully accessed the Affirm app and we're looking at the home tab. The screenshot shows:

  • Affirm logo at the top
  • Search bar to find stores that accept Affirm payments
  • Featured merchants including Neiman Marcus, Harley-Davidson, Jomashop, and others
  • Promotional banner for 0% APR on first Affirm purchase with Apple Pay
  • Bottom navigation with Home (currently selected), Deals, Card, Manage, and Money tabs

This is the home tab of the Affirm app as requested.

@govindrai govindrai changed the title from "Recommend adding prompting suggestions so users can have success" to "Feedback: Add prompting suggestions + strong descriptions so users can have success with the server" Apr 18, 2025
@krismuhi
Member

great feedback! thanks for taking the time for this awesome writeup.
we are reviewing it and gonna share updates soon

@gmegidish
Contributor

@govindrai that's amazing feedback, that helps a lot! It would be interesting to see if we can put some context into the tools. For example, when you "use device", it could return a long response to the LLM, telling it that if it needs to write some text, it should try the keyboard first and only use "tap" if that doesn't work.

Right now, it seems that you can engineer the perfect prompt, such as "use keyboard to enter the password 1234". We added "focused" for elements (but only for Android atm), so maybe the LLM will be smart enough to see that you're now focused on a text view and type using keys. Or maybe we need to expose "is keyboard open", which would hint that you can use keys instead of tapping.
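A rough sketch of what such a hint could look like, assuming the server can report whether the keyboard is open and which element (if any) is focused. The `ScreenState` fields and the `inputHint` function are hypothetical names used for illustration; nothing here reflects how the server is implemented today.

```typescript
// Hypothetical screen state; field names are assumptions, not the server's API.
interface ScreenState {
  keyboardOpen: boolean;
  focusedElementType?: string; // e.g. "TextField"; undefined when nothing is focused
}

// Returns a short hint that could be appended to a tool response for the LLM.
function inputHint(state: ScreenState): string {
  if (state.keyboardOpen || state.focusedElementType !== undefined) {
    return "A text input appears to be active: try type_keys first, and only tap individual keys if typing does not work.";
  }
  return "No text input appears to be active: tap an input field before typing.";
}

console.log(inputHint({ keyboardOpen: true }));
console.log(inputHint({ keyboardOpen: false }));
```

Returning the hint as part of the tool response keeps the guidance next to the state it depends on, rather than relying on the user to prompt for it.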

Thank you so much for the report. May I close the ticket?

@govindrai
Author
govindrai commented May 4, 2025 via email

@gmegidish
Contributor

Thank you so much @govindrai !!! 💯
