8000 Fix recorder purging by batch processing purges by frenck · Pull Request #37140 · home-assistant/core · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix recorder purging by batch processing purges #37140

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 26, 2020
Merged

Fix recorder purging by batch processing purges #37140

merged 2 commits into from
Jun 26, 2020

Conversation

frenck
Copy link
Member
@frenck frenck commented Jun 26, 2020

Proposed change

During the beta of 0.112, I've noticed the migration took a long time. Problem: My database is huge!

My huge database wasn't personal choice, but due to the way, we handle purges. In my case, I run MariaDB as the database server for the Home Assistant recorder.

The delete queries produced by SQLAlchemy are highly inefficient for the Home Assistant database on MariaDB and can run a long time. In my case, the purge never finishes. Result: The database just keeps growing until I run out of disk space.

Besides that, it causes database locking issues. This I've also experienced, but actually always ignored.

This PR implements batch processing when purging. It finds the oldest record, and starts purging with an offset of 1 hour from that record until the keep_days is reached.

Each batch is enqueued, which prevents issues with locking as well. During the purge process, the queue will be processed as normal.

Downsides: Purging takes longer, but isn't intrusive, blocking or problematic.

PS: I've tested this with batches of 10.000 records, however, SQLAlchemy doesn't support limits on deletes statements. I tried it using a limit on a subquery, but this isn't supported by MariaDB. This solution implemented in this PR might be slightly less elegant but is fully compatible with all databases we support.

image

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Example entry for configuration.yaml:

# Example configuration.yaml

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • Untested files have been added to .coveragerc.

The integration reached or maintains the following Integration Quality Scale:

  • No score or internal
  • 🥈 Silver
  • 🥇 Gold
  • 🏆 Platinum

@bdraco
Copy link
Member
bdraco commented Jun 26, 2020
2020-06-26 17:17:21 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-12 15:17:21.359948+00:00
2020-06-26 17:17:21 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.006246s
2020-06-26 17:17:21 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 0 states
2020-06-26 17:17:21 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.005331s
2020-06-26 17:17:21 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 0 events
2020-06-26 17:17:21 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 0 recorder_runs

Verified purge before data was available did nothing as expected

@bdraco
Copy link
Member
bdraco commented Jun 26, 2020
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-24 15:19:27.315384+00:00
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.265241s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 65 states
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002952s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 75 events
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging hasn't fully completed yet.
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-24 15:19:27.642492+00:00
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.003690s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 71 states
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.003201s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 83 events
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging hasn't fully completed yet.
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-24 15:19:27.696661+00:00
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.003138s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 100 states
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002760s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 112 events
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging hasn't fully completed yet.
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-24 15:19:27.773599+00:00
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002986s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 130 states
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002526s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 145 events
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging hasn't fully completed yet.
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-24 15:19:27.890534+00:00
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002831s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 120 states
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002806s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 132 events
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging hasn't fully completed yet.
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-24 15:19:27.959757+00:00
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.003016s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 258 states
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002652s
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 583 events
2020-06-26 17:19:27 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 1 recorder_runs

No repack happening with mariadb

@bdraco
Copy link
Member
bdraco commented Jun 26, 2020

repack not happening is unrelated and fixed here #37142

@bdraco
Copy link
Member
bdraco commented Jun 26, 2020

Working ok with SQLite

2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging events before 2020-06-01 15:32:37.097612+00:00
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002602s
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 0 states
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.util] converting 1 rows to native objects took 0.002537s
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 0 events
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted 0 recorder_runs
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.purge] Using driver: pysqlite
2020-06-26 15:32:37 DEBUG (Recorder) [homeassistant.components.recorder.purge] Vacuuming SQL DB to free space

Copy link
Member
@bdraco bdraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@balloob balloob merged commit 7d74b74 into dev Jun 26, 2020
@balloob balloob deleted the frenck-2020-0658 branch June 26, 2020 17:27
@balloob
Copy link
Member
balloob commented Jun 26, 2020

Nice job with the rescheduling after the current work, that way we don't build up large queues !

Do we want to hot fix this so some people can start cleaning up their DB before the migration?

@balloob balloob added this to the 0.112.0 milestone Jun 26, 2020
@frenck
Copy link
Member Author
frenck commented Jun 26, 2020

Thanks!

I think releasing this as a hotfix might help; yes.

However, I'm not sure about the "size" of the issue I experienced. I know at least 1 other person that experienced the same.

10000

Comment on lines +44 to +48
events_purge_before = purge_before
if events:
events_purge_before = min(
purge_before, events[0].time_fired + timedelta(hours=1)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure but I think the states_purge_before timestamp should be reused when deleting events.

With the timestamps unaligned, you might try to delete an event that is still referenced by the states table. That should fail due to the foreign key constraint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a relation defined, but it is not hard enforced. I've managed to get my states out of wack with the events tables during testing (like 60+ days difference) and it did not cause any issues.

If this was a concern, to begin with, the purge as before this change already worked by date & time, which didn't guarantee any referential integrity either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a relation defined, but it is not hard enforced.

Hmm interesting. It seems like referential integrity must be enabled explicitly for SQLite and that we do not do that. I was not aware of this.

However, referential integrity is enabled by default for MySQL(InnoDB) and it also seems like we do not disable that so I think my concern is still valid.

If this was a concern, to begin with, the purge as before this change already worked by date & time, which didn't guarantee any referential integrity either.

Integrity was implicitly guaranteed before by the invariant that state_changed events never fire before the state has changed.

Say we delete states and then events up to time T. Any stray state would have a time > T. However, because of that invariant, its matching event would also have time > T and it would thus not have been deleted. So the State was not stray after all.

With the new implementation we delete states up to Ts and events up to Te. They will be close but I cannot see that Ts >= Te.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the foreign key defined in MySQL, Cascade delete?

Copy link
Member Author
@frenck frenck Jun 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default in MySQL is no action (which is restricting), fail if referential integrity is broken.

But @amelchio is right, or at least, it would reduce the risk. I guess the easiest attack to this, is by getting the oldest records from both states and events table, and get the max() from those and work from there.

So I will adjust that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading up a bit on SQLAlchemy cascades actually got me confused so I moved my dev setup to MySQL to test this a bit. After removing the state deletion from the code (to simulate the situation I am talking about) I got this error:

2020-06-29 09:21:28 WARNING (Recorder) [homeassistant.components.recorder.purge] Error purging history: (MySQLdb._exceptions.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`hadev`.`states`, CONSTRAINT `states_ibfk_1` FOREIGN KEY (`event_id`) REFERENCES `events` (`event_id`))')
[SQL: DELETE FROM events WHERE events.time_fired < %s]
[parameters: (datetime.datetime(2020, 6, 29, 7, 21, 28, 294027, tzinfo=<UTC>),)]
(Background on this error at: http://sqlalche.me/e/13/gkpj).

So it indeed seems like an issue if one is unfortunate enough to hit the small window.

@frenck If you want the easiest attack, just pick one of the timestamps and use that. Any timestamp should work, as long as the same one is used with both deletes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0