🚀 Feature: caching queries · Issue #6305 · appwrite/appwrite · GitHub
🚀 Feature: caching queries #6305


Open
2 tasks done
AidenY69 opened this issue Sep 22, 2023 · 12 comments
Assignees
Labels
enhancement New feature or request product / databases Fixes and upgrades for the Appwrite Database.

Comments

@AidenY69

🔖 Feature description

I'm putting this as a feature, and not a bug, but feel free to re-tag it.

I have a platform with 40+ million items in its database, growing exponentially. Whenever a page is loaded, the entire database is queried for the results; this becomes a very slow process, especially as the database grows. In my use case, most of the pages are rankings, so it would be things like "most viewed today/week/month/etc." or "most recent".

My suggestion is that instead of querying on every single hit, there is some optimization that allows the developer to cache the query. That way most users get an instant load, and every now and then, when the cache is cleared, that one hit will be slow; 99%+ of users get a fast load and only one person gets the slow request. Right now, every user gets a slow hit, and as we know, load speed affects the user's preference for the platform, which could result in them going to a faster competitor, rage quitting, etc.

This will not only decrease load times for the end user, but will also put less stress on the server, especially in cloud situations that are charged per compute unit.

🎤 Pitch

For example, when a user clicks on "most recent", instead of the entire database being queried, a cache of the last result could be stored every 15 minutes (developer configurable). Sure, the list will be out of date by X minutes/hours/days, but for lists that don't change often this would be very fast. For example, if the page is "most viewed this week", that cache could be set to update daily.

There are more examples too that don't really affect me but could apply to others, for example social-media-type platforms that want to sort their feeds based on activity period, or messaging platforms, etc.
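Purely to illustrate the pitch, a per-query cache setting might look something like the sketch below (this is hypothetical; nothing like `cacheTtlSeconds` exists in Appwrite today, and all the names are made up):

```ts
// Hypothetical developer-configurable cache intervals, one per ranking query.
const queryCacheConfig = {
  "most-recent":           { cacheTtlSeconds: 15 * 60 },      // refresh every 15 minutes
  "most-viewed-today":     { cacheTtlSeconds: 60 * 60 },      // refresh hourly
  "most-viewed-this-week": { cacheTtlSeconds: 24 * 60 * 60 }, // refresh daily
};
```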

👀 Have you spent some time to check if this issue has been raised before?

  • I checked and didn't find similar issue

🏢 Have you read the Code of Conduct?

  • I have read the Code of Conduct

@stnguyen90 stnguyen90 added the product / databases Fixes and upgrades for the Appwrite Database. label Sep 22, 2023
@stnguyen90 stnguyen90 self-assigned this Sep 22, 2023
@stnguyen90
Contributor

@AidenY69, thanks for raising this issue! 🙏🏼

I have a platform with 40+ million items in its database, growing exponentially. Whenever pages are loaded, the entire database is queried for the results, however, this becomes a very slow process, especially as the database grows. In my use case, most of the pages are rankings, so it would be like "most viewed today/week/month/etc" or "most recent" etc.

Some questions for you:

  1. How long do these queries take?
  2. What is the query you're making?
  3. Do you have indexes for the related queries?

@AidenY69
Author

@AidenY69, thanks for raising this issue! 🙏🏼

I have a platform with 40+ million items in its database, growing exponentially. Whenever pages are loaded, the entire database is queried for the results, however, this becomes a very slow process, especially as the database grows. In my use case, most of the pages are rankings, so it would be like "most viewed today/week/month/etc" or "most recent" etc.

Some questions for you:

  1. How long do these queries take?
  2. What is the query you're making?
  3. Do you have indexes for the related queries?

  1. ~10,000 ms to 11,000 ms
  2. In this case it was a collection sorted by date and showing the most recent 20
  3. Yes
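For reference, a "most recent 20" query of that shape would look roughly like the sketch below with the Appwrite Node SDK (the Web SDK call has the same shape); the endpoint, IDs, and the createdAt attribute are placeholders:

```ts
import { Client, Databases, Query } from "node-appwrite";

const client = new Client()
  .setEndpoint("https://<HOSTNAME>/v1") // placeholder endpoint
  .setProject("<PROJECT_ID>")
  .setKey("<API_KEY>");

const databases = new Databases(client);

// Sort by a datetime attribute and return only the newest 20 documents.
const page = await databases.listDocuments("<DATABASE_ID>", "<COLLECTION_ID>", [
  Query.orderDesc("createdAt"), // placeholder attribute
  Query.limit(20),
]);
console.log(page.total, page.documents.length);
```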

@stnguyen90
Contributor
stnguyen90 commented Sep 22, 2023

Interesting...I'm still surprised an indexed query would take 10-11s to respond with only 20 results 🧐 I would try to investigate why the query is so slow.

As for caching, Appwrite doesn't cache listDocuments() because of the variability of queries and permissions. Appwrite does, however, cache getDocument(). So, you can put the data in a single document and fetch that instead. When data changes, you can update that 1 document.

Some other members from the community have also made use of Cloudflare to cache specific routes and then purge the cache on update. For more info, see Discord.

@stnguyen90 stnguyen90 removed their assignment Sep 22, 2023
@AidenY69
Author
AidenY69 commented Sep 22, 2023

Interesting...I'm still surprised an indexed query would take 10-11s to respond with only 20 results 🧐 I would try to investigate why the query is so slow.

Oh sorry, maybe I misunderstood your question. The collection has 40 million items and the result is 20, so it has to process the entire collection (40m), sort by date, then present the 20.

As for caching, Appwrite doesn't cache listDocuments() because of the variability of queries and permissions. Appwrite does, however, cache getDocument(). So, you can put the data in a single document and fetch that instead. When data changes, you can update that 1 document.

I'm not sure I understand what you're saying here. The actual items don't change, but the amount does, so today there could be 40m, but next week it might be 45m, etc. The number of items only goes up as users use the platform, but the items themselves are fixed data at the time of creation. So I'm not sure what you mean by "put the data in a single document"?

Some other members from the community have also made use of Cloudflare to cache specific routes and then purge the cache on update. For more info, see [Discord](https://discord.com/channels/564160730845151244/564160731327758347/1153038642332307617).

I'll give this a check. I already use Cloudflare and have the site optimized to cache via NGINX, but since the actual database query is the bottleneck, I think it would make sense for Appwrite to look into some built-in caching system for larger collections.

@eldadfux
Member

Query caching is an interesting feature! Just to get more context, @AidenY69, what permission model are you using for this collection? Is this collection-level or document-level permissions?

Can you share more info to understand why the query is taking 10-11 seconds? Is this expected from your side due to potential query complexity?

@AidenY69
Author

Query caching is an interesting feature! Just to get more context, @AidenY69 what permission model are you using for this collection? is this collections-level or document-level permissions?

Can you share more info to understand why the query is taking 10-11 seconds? Is this expected from your side due to potential query complexity?

It's collection-level perms.

The speed is directly correlated to the size of the collection. When I had a few million items, the query was nearly instant; as it hit 10m+ it slowed down a bit but was still just a second or two; at 40m it's 10-11s. I expect to hit 50-60m in the next few weeks, so I would imagine the speed will likely slow down by another couple of seconds.

Nothing in the server config has changed, no hardware changes, no backend changes, etc. I haven't even touched the codebase in well over a month. The query has remained the same too. It's 100% correlated to the size of the database, which makes sense, but that's why I think having a cache of some kind would help.

@eldadfux
Member

Curious to learn more about how you would imagine this caching mechanism working.

  • Would you need something that would be predefined per collection, specific query or per user?
  • Would you expect it to be client side cache or server, maybe both?
  • If this caching is done server side, would you expect to be able to clear it instantly if needed?

Will you be open to chat more with our team about this feature? This is an interesting idea for us to explore.

@AidenY69
Author
AidenY69 commented Sep 22, 2023
  • Would you need something that would be predefined per collection, specific query or per user?

In my use case, per query would be ideal. If you think back to the example of 'most recent', that query (and its results and permissions) doesn't change on a user level, but different queries happen on the same collection: one query is just 'most recent', while others on the same collection sort by day/week/month time periods. I'd want 'most recent' to be cached on a shorter interval so users can see that the website is active, but for something like the daily or weekly chart, it makes sense to update on a longer interval.

  • Would you expect it to be client side cache or server, maybe both?

I had assumed server-side entirely. I use TypeScript with server-side rendering, so the user performs the action, the server does most of the processing, then sends back the completed request. I guess client-side makes more sense for client-side rendering platforms, but I would prefer the server just keep a cache.

  • If this caching is done server side, would you expect to be able to clear it instantly if needed?

Yes, but I would assume it would be command-line. When I say cache, I was more so thinking of a specialized database (MariaDB, MySQL, or MongoDB) specifically for caching the results of queries, not a folder/file stored on the server. In that case the admin can just flush that database. The process would essentially be: whenever a query is made, it checks that 'query database', and if a cache doesn't exist, it goes to the main collection and stores the result in the query database for future use. Then that query database can be cleared every X interval per query, or I guess per collection, depending on what you choose.
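As a rough illustration of that read-through idea (a sketch only, not an existing Appwrite feature; the cachedListDocuments helper, key scheme, and in-memory store are made up, and Redis or a dedicated database could stand in for the store):

```ts
import { Databases, Query } from "node-appwrite";

// In-memory stand-in for the "query database"; a dedicated store would work the same way.
const cache = new Map<string, { expiresAt: number; result: unknown }>();

async function cachedListDocuments(
  databases: Databases,
  databaseId: string,
  collectionId: string,
  queries: string[],
  ttlMs: number, // per-query TTL, e.g. 15 min for "most recent", 24 h for a weekly chart
) {
  const key = `${databaseId}:${collectionId}:${JSON.stringify(queries)}`;
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.result; // fast path: serve the cache

  // Cache miss: run the real query once, then keep the result for the next ttlMs.
  const result = await databases.listDocuments(databaseId, collectionId, queries);
  cache.set(key, { expiresAt: Date.now() + ttlMs, result });
  return result;
}

// Usage: cachedListDocuments(databases, "<DATABASE_ID>", "<COLLECTION_ID>",
//   [Query.orderDesc("createdAt"), Query.limit(20)], 15 * 60 * 1000);
```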

Will you be open to chat more with our team about this feature? This is an interesting idea for us to explore.

Sure! I would also suggest creating a working environment that mimics a larger setup. I doubt many people have collections in the double-digit millions, but I think it might be worth creating an instance with fake data in the 50m+ range just to see how things get optimized, for when that day happens; platforms grow like duckweed these days, so it's only a matter of time. I know I said 10-11s on my setup, but that's on a very powerful server with minimal users since I went back to a closed beta; I would think this would actually be a lot worse on platforms that have a lot more traffic.

@stnguyen90
Contributor

oh sorry, maybe I misunderstood your question. the collection has 40 million items, the result is 20, so it has to process the entire collection (40m), sort by date, then presents the 20.

If there's an index that matches the sorting and queries, the database shouldn't have to process the entire collection 🧐
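For example, an index that matches that sort could be created from the console or with the server SDK, roughly like the sketch below (IDs and the createdAt attribute are placeholders):

```ts
import { Client, Databases } from "node-appwrite";

const databases = new Databases(
  new Client()
    .setEndpoint("https://<HOSTNAME>/v1")
    .setProject("<PROJECT_ID>")
    .setKey("<API_KEY>"),
);

// A "key" index on the datetime attribute used for ordering, so that
// "order by createdAt DESC, limit 20" can be answered from the index
// instead of scanning all 40M documents.
await databases.createIndex(
  "<DATABASE_ID>",
  "<COLLECTION_ID>",
  "idx_created_at", // index key
  "key",            // index type
  ["createdAt"],    // attributes (placeholder)
  ["DESC"],         // orders
);
```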

I'm not sure I understand what you're saying here, but the actual items don't change, but the amount does, so today there could be 40m, but next week it might be 45m etc. the amount of items only goes up as users use the platform, but the items themselves are fixed data at the time of creation. So I'm not sure what you mean by "put the data in a single document"?

For example, you can have a cache collection with a value attribute that is a big string. Then, when your 20 most recent changes, stringify the new list and put it in the document with id most-recent. Client-side, you would do databases.getDocument('default', 'cache', 'most-recent') and then parse the value.
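A minimal sketch of that workaround, assuming a cache collection whose documents have a single string value attribute (endpoint, IDs, and names are placeholders):

```ts
import { Client, Databases } from "node-appwrite";

const databases = new Databases(
  new Client()
    .setEndpoint("https://<HOSTNAME>/v1")
    .setProject("<PROJECT_ID>")
    .setKey("<API_KEY>"),
);

// Writer (server side): whenever the "most recent 20" list changes,
// stringify it and overwrite the one cache document.
async function refreshMostRecent(mostRecent: object[]) {
  await databases.updateDocument("default", "cache", "most-recent", {
    value: JSON.stringify(mostRecent),
  });
}

// Reader: a single getDocument call (the Web SDK call has the same shape)
// instead of a listDocuments over the whole collection.
async function readMostRecent(): Promise<object[]> {
  const doc = await databases.getDocument("default", "cache", "most-recent");
  return JSON.parse(doc.value);
}
```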

@fliitor
fliitor commented Sep 25, 2023

@eldadfux Since Appwrite already uses Redis, I can only advise switching to Redis Stack and using its search & JSON features.
Read requests would go through Redis, writes through MariaDB.
On the other hand, you'll need to maintain indexing templates based on collections and attributes. That would be most of the work, I think.

It could also improve the architecture for the cloud: MariaDB would receive fewer requests, and the load would shift more to Redis (read requests are generally more frequent than write requests).
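As a rough illustration of that read path (a sketch only, assuming Redis Stack's JSON module via node-redis; this is not how Appwrite is wired internally, and the key name and loader are made up):

```ts
import { createClient } from "redis";

const redis = createClient({ url: "redis://localhost:6379" });
await redis.connect();

type Row = Record<string, string | number>;

// Read-through: try RedisJSON first, fall back to the primary database on a miss.
async function getMostRecent(loadFromPrimary: () => Promise<Row[]>): Promise<Row[]> {
  const cached = await redis.json.get("query:most-recent");
  if (cached) return cached as Row[];

  const rows = await loadFromPrimary(); // e.g. the MariaDB/Appwrite query
  await redis.json.set("query:most-recent", "$", rows);
  await redis.expire("query:most-recent", 15 * 60); // keep it for at most 15 minutes
  return rows;
}
```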

@greendesertsnow

I have always thought query caching was called indexing...

@AidenY69
Author

I have always thought query caching was called indexing...

Query Caching:

  • Purpose: The primary aim of query caching is to store the results of previously executed queries. When the same or a similar query is executed again, the system can fetch the result from the cache instead of executing the query anew. This can significantly reduce the time and resources needed for frequent or repeated queries.
  • Mechanism: Query caching involves storing both the query and its result. When a new query is received, the system checks if an identical or sufficiently similar query has been executed before. If so, the cached result is returned immediately.
  • Limitations: The effectiveness of query caching decreases in environments with high data volatility, where frequent updates can render cached data obsolete quickly. Also, it's less effective for unique or rarely repeated queries.

Indexing:

  • Purpose: Indexing is aimed at improving the speed of data retrieval operations within a database. By creating indexes, the database can find rows much faster than it would by scanning the entire table (known as a full table scan).
  • Mechanism: An index is typically created on columns that are used frequently in query conditions (like WHERE clauses). It works somewhat like an index in a book, allowing the database to quickly locate the starting point of the data it needs without scanning the entire table.
  • Limitations: While indexes greatly improve query speed, they also require additional storage space and can slow down write operations (like INSERT, UPDATE, DELETE) because the index needs to be updated whenever data changes.

Comparison:

  • Use Cases: Query caching is most effective for read-heavy applications with frequent identical or similar queries. Indexing is crucial for databases where quick data retrieval is needed, irrespective of the uniqueness of the queries.
  • Performance Impact: Caching can drastically reduce load times for repeated queries, but its benefits are limited to scenarios where data doesn't change often. Indexing consistently improves read performance but can have a trade-off with write performance.
  • Maintenance: Caches may need to be cleared or updated frequently in rapidly changing data environments, while indexes require ongoing maintenance to ensure they remain efficient as data grows and evolves.

@stnguyen90 stnguyen90 added enhancement New feature or request and removed feature labels Mar 20, 2024
Projects
None yet
Development

No branches or pull requests

5 participants