-
Notifications
You must be signed in to change notification settings - Fork 77
Feedback: Add prompting suggestions + strong descriptions so users can have success with the server #37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
great feedback! thanks for taking the time for this awesome writeup. |
@govindrai that's amazing feedback, that helps a lot!. it's interesting to see if we can put some context into the tools, like, when you "use device", it will return a long response to the LLM, telling it that if it needs to write some text, better try with the keyboard first and only use "tap" if it doesn't work. Right now, it seems that you can engineer the perfect prompt for "use keyboard to enter the password 1234". We added "focused" (but only for android atm) for elements, maybe LLM will be smart enough to see that you're now focused on a text view and maybe type using keys. Or maybe we need to expose "is keyboard open" which will hint that you can use keys instead of tapping. Thank you so much for the report. May I close the ticket? |
Appreciate you guys taking the time to reply and positively accepting the
feedback!! Yes, please close!
…On Sun, May 4, 2025 at 6:30 AM Gil Megidish ***@***.***> wrote:
*gmegidish* left a comment (mobile-next/mobile-mcp#37)
<#37 (comment)>
@govindrai <https://github.com/govindrai> that's amazing feedback, that
helps a lot!. it's interesting to see if we can put some context into the
tools, like, when you "use device", it will return a long response to the
LLM, telling it that if it needs to write some text, better try with the
keyboard first and only use "tap" if it doesn't work.
Right now, it seems that you can engineer the perfect prompt for "use
keyboard to enter the password 1234". We added "focused" (but only for
android atm) for elements, maybe LLM will be smart enough to see that
you're now focused on a text view and maybe type using keys. Or maybe we
need to expose "is keyboard open" which will hint that you can use keys
instead of tapping.
Thank you so much for the report. May I close the ticket?
—
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADJXTMK3VTZQTLBR47WIWDD24YJAJAVCNFSM6AAAAAB3NLVRR6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQNBZGIZDCMZZGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thank you so much @govindrai !!! 💯 |
Uh oh!
There was an error while loading. Please reload this page.
Device/Package Details
Latest Mobile-MCP commit as of 4/18/2025 8083235 + Cursor IDE + MacOSX 15.1.1.
Thanks for writing up this mcp server. It takes a lot of hard work. I've been messing around with your project, and I wanted to share my experience. I know the project is in very early stages, and I am just trying to understand how to best/efficiently use the software.
Feedback
I have been having a hard time getting the AI models to call the right tools with the right arguments.
I think something that can help is improving tool descriptions (maybe even adding one shot or multi shot examples in the tool description) + providing some tried and true prompting examples for each tool for that users can send to their LLM.
For example, say I want to take screenshots of each tab in a mobile app. Some of those tabs are scrollable. LLM will always provide direction "up" instead of "down". However to scroll up, the direction needs to be "down". This makes it so the server never scrolls down successfully until you explicitly tell the llm to send the down scroll direction.
Currently I feel that users will need to specify tool names and provide exact arguments (ideally not since natural language is preferred here), but maybe that is necessary if the model can't call the right tools for whatever reason.
What I wanted to do: Get cursor to get mobile-mcp to get past a passcode screen (shown below). It kept trying to tap the keyboard in when it could have easily typed it (but it chose to tap). But then for whatever reason the tapping never worked. Not sure if this is a bug or tool calling issue.
My LLM conversation
Affirm App Screenshot Instructions
User:
open the affirm app on the iphone 16 simulator and take a screenshot of the home tab. if you hit the passcode screen type in 1111 as the passcode.
Cursor:
I'll help you open the Affirm app on the iPhone 16 simulator and take a screenshot of the home tab.
Cursor:
I see that we need to enter the passcode 1111. I'll tap the number 1 four times.
Cursor:
It looks like the passcode wasn't accepted. Let me try entering 1111 again with a slight delay between each tap to ensure they register properly.
User:
please explain why you are clicking on coordinates instead of typing?
Cursor:
I'm clicking on coordinates instead of typing because the passcode screen shows a numeric keypad rather than a text input field. The mobile interaction tools available to me work differently based on the type of input needed:
type_keys
functionThe passcode screen presents a specialized keypad interface where each button needs to be tapped individually. I identified the coordinates of the "1" button (x:100, y:884) and am tapping it four times to enter "1111".
User:
how do you know it's a specialized keyboard? seems like a normal ios numberpad keyboard tome
Cursor:
You're right - it is a standard iOS numeric keypad. I should have been more precise in my explanation.
I'm clicking on coordinates because this passcode screen requires tapping the numeric buttons directly rather than using the
type_keys
function. Even though it's a standard iOS keypad, passcode entry screens typically require direct interaction with the on-screen buttons rather than accepting text input via typing functions.User:
what does the type_keys tool do?
Cursor:
The
type_keys
tool types text into a focused input element. It has two parameters:text
: The string of text you want to typesubmit
: Boolean that determines whether to submit the text (like pressing Enter)It's designed for typing into text fields that accept keyboard input. For the passcode screen, I could have tried using
type_keys
with "1111", but I chose to click the number buttons directly since the passcode screen shows a numeric keypad interface.User:
just type it instead of tapping
Cursor:
Perfect! We've successfully accessed the Affirm app and we're looking at the home tab. The screenshot shows:
This is the home tab of the Affirm app as requested.
The text was updated successfully, but these errors were encountered: