Fix recorder purging by batch processing purges #37140
Conversation
- Verified that a purge before data was available did nothing, as expected
- No repack happening with MariaDB (the repack not happening is unrelated and fixed in #37142)
- Working OK with SQLite
LGTM
Nice job with the rescheduling after the current work; that way we don't build up large queues! Do we want to hotfix this so some people can start cleaning up their DB before the migration?
Thanks! I think releasing this as a hotfix might help, yes. However, I'm not sure about the "size" of the issue I experienced. I know at least one other person that experienced the same.
events_purge_before = purge_before
if events:
    events_purge_before = min(
        purge_before, events[0].time_fired + timedelta(hours=1)
    )
I am not 100% sure, but I think the states_purge_before timestamp should be reused when deleting events.
With the timestamps unaligned, you might try to delete an event that is still referenced by the states table. That should fail due to the foreign key constraint.
There is a relation defined, but it is not hard enforced. I've managed to get my states out of whack with the events table during testing (like 60+ days difference) and it did not cause any issues.
If this were a concern to begin with: the purge before this change already worked by date & time, which didn't guarantee any referential integrity either.
There is a relation defined, but it is not hard enforced.
Hmm, interesting. It seems that referential integrity must be enabled explicitly for SQLite and that we do not do that; I was not aware of this.
However, referential integrity is enabled by default for MySQL (InnoDB), and it also seems that we do not disable it, so I think my concern is still valid.
If this was a concern, to begin with, the purge as before this change already worked by date & time, which didn't guarantee any referential integrity either.
Integrity was implicitly guaranteed before by the invariant that state_changed events never fire before the state has changed.
Say we delete states and then events up to time T. Any stray state would have a time > T. However, because of that invariant, its matching event would also have time > T and it would thus not have been deleted. So the State was not stray after all.
With the new implementation we delete states up to Ts and events up to Te. They will be close, but I cannot see that Ts >= Te.
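The unaligned-window hazard can be illustrated with a tiny pure-Python model (all timestamps and structures here are hypothetical, not the recorder's actual schema): each state references the event that fired at the same moment, per the invariant above, yet separate cutoffs with Ts < Te leave a surviving state pointing at an event slated for deletion.

```python
from datetime import datetime, timedelta

# Toy model: one event per hour; each state references the event that
# fired at the same moment (the state_changed invariant from above).
events = [datetime(2020, 6, 1) + timedelta(hours=h) for h in range(5)]
states = [(e, e) for e in events]  # (last_updated, referenced event time)

ts = datetime(2020, 6, 1, 2)  # states deleted where last_updated < Ts
te = datetime(2020, 6, 1, 3)  # events deleted where time_fired < Te (Te > Ts)

remaining_states = [s for s in states if s[0] >= ts]
# Any remaining state whose referenced event fired in [Ts, Te) is
# orphaned by the events delete; with an enforced foreign key,
# that events delete would fail.
orphaned = [s for s in remaining_states if s[1] < te]
print(len(orphaned))  # 1
```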
How is the foreign key defined in MySQL, Cascade delete?
The default in MySQL is NO ACTION (which is restricting): fail if referential integrity would be broken.
But @amelchio is right, or at least, it would reduce the risk. I guess the easiest way to attack this is by getting the oldest records from both the states and events tables, taking the max() of those, and working from there.
So I will adjust that.
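The adjustment described above — derive a single cutoff from the oldest rows of both tables and use it for both deletes — can be sketched as a pure function. The helper name, the one-hour window, and the sample timestamps are illustrative, not the PR's actual code:

```python
from datetime import datetime, timedelta

def shared_purge_before(purge_before, oldest_event_fired, oldest_state_updated):
    # Hypothetical helper: derive ONE cutoff used for both the states
    # and the events delete. Start the batch window one hour past the
    # newer of the two oldest rows (the max() of the mins), but never
    # go beyond the requested purge_before.
    batch_end = max(oldest_event_fired, oldest_state_updated) + timedelta(hours=1)
    return min(purge_before, batch_end)

cutoff = shared_purge_before(
    datetime(2020, 6, 29),
    datetime(2020, 4, 1),   # oldest event
    datetime(2020, 4, 20),  # oldest state; the events table has older rows
)
print(cutoff)  # 2020-04-20 01:00:00 -- the same value feeds both deletes
```

Because both deletes use the same timestamp, no surviving state can reference a deleted event, which addresses the foreign key concern raised above.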
Reading up a bit on SQLAlchemy cascades actually got me confused so I moved my dev setup to MySQL to test this a bit. After removing the state deletion from the code (to simulate the situation I am talking about) I got this error:
2020-06-29 09:21:28 WARNING (Recorder) [homeassistant.components.recorder.purge] Error purging history: (MySQLdb._exceptions.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`hadev`.`states`, CONSTRAINT `states_ibfk_1` FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`))')
[SQL: DELETE FROM events WHERE events.time_fired < %s]
[parameters: (datetime.datetime(2020, 6, 29, 7, 21, 28, 294027, tzinfo=<UTC>),)]
(Background on this error at: http://sqlalche.me/e/13/gkpj).
So it indeed seems like an issue if one is unfortunate enough to hit the small window.
@frenck If you want the easiest attack, just pick one of the timestamps and use that. Any timestamp should work, as long as the same one is used with both deletes.
Proposed change
During the beta of 0.112, I've noticed the migration took a long time. Problem: My database is huge!
My huge database wasn't a personal choice, but due to the way we handle purges. In my case, I run MariaDB as the database server for the Home Assistant recorder.
The delete queries produced by SQLAlchemy are highly inefficient for the Home Assistant database on MariaDB and can run for a long time. In my case, the purge never finishes. Result: the database just keeps growing until I run out of disk space.
Besides that, it causes database locking issues, which I've also experienced but always ignored.
This PR implements batch processing when purging. It finds the oldest record and starts purging with an offset of 1 hour from that record, until keep_days is reached.
Each batch is enqueued, which prevents issues with locking as well. During the purge process, the queue is processed as normal.
Downside: purging takes longer, but it isn't intrusive, blocking, or problematic.
PS: I've tested this with batches of 10,000 records; however, SQLAlchemy doesn't support LIMIT on DELETE statements. I tried using a LIMIT on a subquery, but that isn't supported by MariaDB. The solution implemented in this PR might be slightly less elegant, but it is fully compatible with all databases we support.
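A minimal sketch of the windowed batch purge described above, using the stdlib sqlite3 module with a simplified single-column events table (the real PR works through SQLAlchemy sessions and re-enqueues each batch on the recorder queue rather than looping inline):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (time_fired TEXT)")
start = datetime(2020, 1, 1)
# 240 rows, one per minute, stored as fixed-width ISO strings so that
# lexicographic comparison in SQL matches chronological order.
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [((start + timedelta(minutes=m)).isoformat(" "),) for m in range(240)],
)

keep_cutoff = start + timedelta(hours=3)  # stand-in for the keep_days cutoff
batches = 0
while True:
    row = conn.execute(
        "SELECT time_fired FROM events ORDER BY time_fired LIMIT 1"
    ).fetchone()
    if row is None:
        break  # table is empty
    oldest = datetime.fromisoformat(row[0])
    if oldest >= keep_cutoff:
        break  # nothing older than the retention window remains
    # Purge one window: oldest record + 1 hour, capped at the cutoff.
    # A time-windowed DELETE needs no LIMIT clause, sidestepping the
    # SQLAlchemy/MariaDB limitation mentioned above.
    window_end = min(oldest + timedelta(hours=1), keep_cutoff)
    conn.execute(
        "DELETE FROM events WHERE time_fired < ?",
        (window_end.isoformat(" "),),
    )
    batches += 1  # the PR re-enqueues here instead of looping

remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(batches, remaining)  # 3 windows purged, 60 rows kept
```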
Type of change
Additional information
Checklist
- The code has been formatted using Black (black --fast homeassistant tests)

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:
- Updated and included derived files by running: python3 -m script.hassfest.
- New dependencies have been added to requirements_all.txt. Updated by running python3 -m script.gen_requirements_all.
- Untested files have been added to .coveragerc.

The integration reached or maintains the following Integration Quality Scale: