Fix recorder purging by batch processing purges #37140
Conversation
- Verified that a purge before data was available did nothing, as expected
- No repack happening with MariaDB (the repack not happening is unrelated and fixed in #37142)
- Working OK with SQLite
LGTM
Nice job with the rescheduling after the current work; that way we don't build up large queues! Do we want to hotfix this so some people can start cleaning up their DB before the migration?
Thanks! I think releasing this as a hotfix might help, yes. However, I'm not sure about the "size" of the issue I experienced. I know at least one other person that experienced the same.
events_purge_before = purge_before
if events:
    events_purge_before = min(
        purge_before, events[0].time_fired + timedelta(hours=1)
    )
I am not 100% sure, but I think the states_purge_before timestamp should be reused when deleting events.
With the timestamps unaligned, you might try to delete an event that is still referenced by the states table. That should fail due to the foreign key constraint.
There is a relation defined, but it is not hard enforced. I've managed to get my states out of whack with the events table during testing (like 60+ days difference) and it did not cause any issues.
If this were a concern to begin with: the purge before this change already worked by date & time, which didn't guarantee any referential integrity either.
There is a relation defined, but it is not hard enforced.
Hmm, interesting. It seems that referential integrity must be enabled explicitly for SQLite and that we do not do that; I was not aware of this.
However, referential integrity is enabled by default for MySQL (InnoDB), and it also seems that we do not disable it, so I think my concern is still valid.
If this was a concern, to begin with, the purge as before this change already worked by date & time, which didn't guarantee any referential integrity either.
Integrity was implicitly guaranteed before by the invariant that state_changed events never fire before the state has changed.
Say we delete states and then events up to time T. Any stray state would have a time > T. However, because of that invariant, its matching event would also have time > T and it would thus not have been deleted. So the State was not stray after all.
With the new implementation we delete states up to Ts and events up to Te. They will be close, but I cannot see that Ts >= Te.
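The unaligned-window hazard can be illustrated with a tiny pure-Python model (all timestamps and structures here are hypothetical, not the recorder's actual schema): each state references the event that fired at the same moment, per the invariant above, yet separate cutoffs with Ts < Te leave a surviving state pointing at an event slated for deletion.

```python
from datetime import datetime, timedelta

# Toy model: one event per hour; each state references the event that
# fired at the same moment (the state_changed invariant from above).
events = [datetime(2020, 6, 1) + timedelta(hours=h) for h in range(5)]
states = [(e, e) for e in events]  # (last_updated, referenced event time)

ts = datetime(2020, 6, 1, 2)  # states deleted where last_updated < Ts
te = datetime(2020, 6, 1, 3)  # events deleted where time_fired < Te (Te > Ts)

remaining_states = [s for s in states if s[0] >= ts]
# Any remaining state whose referenced event fired in [Ts, Te) is
# orphaned by the events delete; with an enforced foreign key,
# that events delete would fail.
orphaned = [s for s in remaining_states if s[1] < te]
print(len(orphaned))  # 1
```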
How is the foreign key defined in MySQL, Cascade delete?
The default in MySQL is NO ACTION (which is restricting): fail if referential integrity would be broken.
But @amelchio is right, or at least, it would reduce the risk. I guess the easiest way to attack this is by getting the oldest records from both the states and events tables, taking the max() of those, and working from there.
So I will adjust that.
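The adjustment described above — derive a single cutoff from the oldest rows of both tables and use it for both deletes — can be sketched as a pure function. The helper name, the one-hour window, and the sample timestamps are illustrative, not the PR's actual code:

```python
from datetime import datetime, timedelta

def shared_purge_before(purge_before, oldest_event_fired, oldest_state_updated):
    # Hypothetical helper: derive ONE cutoff used for both the states
    # and the events delete. Start the batch window one hour past the
    # newer of the two oldest rows (the max() of the mins), but never
    # go beyond the requested purge_before.
    batch_end = max(oldest_event_fired, oldest_state_updated) + timedelta(hours=1)
    return min(purge_before, batch_end)

cutoff = shared_purge_before(
    datetime(2020, 6, 29),
    datetime(2020, 4, 1),   # oldest event
    datetime(2020, 4, 20),  # oldest state; the events table has older rows
)
print(cutoff)  # 2020-04-20 01:00:00 -- the same value feeds both deletes
```

Because both deletes use the same timestamp, no surviving state can reference a deleted event, which addresses the foreign key concern raised above.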
Reading up a bit on SQLAlchemy cascades actually got me confused so I moved my dev setup to MySQL to test this a bit. After removing the state deletion from the code (to simulate the situation I am talking about) I got this error:
2020-06-29 09:21:28 WARNING (Recorder) [homeassistant.components.recorder.purge] Error purging history: (MySQLdb._exceptions.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`hadev`.`states`, CONSTRAINT `states_ibfk_1` FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`))')
[SQL: DELETE FROM events WHERE events.time_fired < %s]
[parameters: (datetime.datetime(2020, 6, 29, 7, 21, 28, 294027, tzinfo=<UTC>),)]
(Background on this error at: http://sqlalche.me/e/13/gkpj).
So it indeed seems like an issue if one is unfortunate enough to hit the small window.
@frenck If you want the easiest attack, just pick one of the timestamps and use that. Any timestamp should work, as long as the same one is used with both deletes.
Proposed change
During the beta of 0.112, I've noticed the migration took a long time. Problem: My database is huge!
My huge database wasn't a personal choice, but due to the way we handle purges. In my case, I run MariaDB as the database server for the Home Assistant recorder.
The delete queries produced by SQLAlchemy are highly inefficient for the Home Assistant database on MariaDB and can run for a long time. In my case, the purge never finishes. Result: the database just keeps growing until I run out of disk space.
Besides that, it causes database locking issues, which I've also experienced but always ignored.
This PR implements batch processing when purging. It finds the oldest record and starts purging with an offset of 1 hour from that record, until keep_days is reached.
Each batch is enqueued, which prevents issues with locking as well. During the purge process, the queue is processed as normal.
Downside: purging takes longer, but it isn't intrusive, blocking, or problematic.
PS: I've tested this with batches of 10,000 records; however, SQLAlchemy doesn't support LIMIT on DELETE statements. I tried using a LIMIT on a subquery, but that isn't supported by MariaDB. The solution implemented in this PR might be slightly less elegant, but it is fully compatible with all databases we support.
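A minimal sketch of the windowed batch purge described above, using the stdlib sqlite3 module with a simplified single-column events table (the real PR works through SQLAlchemy sessions and re-enqueues each batch on the recorder queue rather than looping inline):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (time_fired TEXT)")
start = datetime(2020, 1, 1)
# 240 rows, one per minute, stored as fixed-width ISO strings so that
# lexicographic comparison in SQL matches chronological order.
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [((start + timedelta(minutes=m)).isoformat(" "),) for m in range(240)],
)

keep_cutoff = start + timedelta(hours=3)  # stand-in for the keep_days cutoff
batches = 0
while True:
    row = conn.execute(
        "SELECT time_fired FROM events ORDER BY time_fired LIMIT 1"
    ).fetchone()
    if row is None:
        break  # table is empty
    oldest = datetime.fromisoformat(row[0])
    if oldest >= keep_cutoff:
        break  # nothing older than the retention window remains
    # Purge one window: oldest record + 1 hour, capped at the cutoff.
    # A time-windowed DELETE needs no LIMIT clause, sidestepping the
    # SQLAlchemy/MariaDB limitation mentioned above.
    window_end = min(oldest + timedelta(hours=1), keep_cutoff)
    conn.execute(
        "DELETE FROM events WHERE time_fired < ?",
        (window_end.isoformat(" "),),
    )
    batches += 1  # the PR re-enqueues here instead of looping

remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(batches, remaining)  # 3 windows purged, 60 rows kept
```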
Type of change
Additional information
Checklist
- The code has been formatted using Black (black --fast homeassistant tests)

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:
- Updated and included derived files by running: python3 -m script.hassfest.
- New dependencies have been added to requirements_all.txt. Updated by running python3 -m script.gen_requirements_all.
- Untested files have been added to .coveragerc.

The integration reached or maintains the following Integration Quality Scale: