8000 Re-merge ambiguous commit error PR with locking fixes by spencerkimball · Pull Request #10207 · cockroachdb/cockroach · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Re-merge ambiguous commit error PR with locking fixes #10207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 26, 2016

Conversation

spencerkimball
Copy link
Member
@spencerkimball spencerkimball commented Oct 25, 2016

See #10154
Fixes #10204
Fixes #10184


This change is Reviewable

@spencerkimball
Copy link
Member Author

I'm still working on the split queue issue.

@tamird
Copy link
Contributor
tamird commented Oct 25, 2016

Note that you need a separate "Fixes #..." line for each issue you wish to close. #7604 and #10023 were not closed by your original PR.

@petermattis
Copy link
Collaborator

Review status: 0 of 15 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/kv/transport.go, line 166 at r1 (raw file):

  client.pending = true
  gt.clientIndex++
  gt.Unlock()

How could the previous code get away without locking while this code can't? Seems to me that you only need to protect client.pending here as gt.clientIndex is always accessed from a single goroutine. Also note that IsExhausted accesses gt.clientIndex without grabbing the lock. If this is true, I think you should rename grpcTransport.Mutex to grpcTransport.clientPendingMu.


pkg/kv/transport.go, line 235 at r1 (raw file):

          // Swap the client representing this replica to the front.
          gt.orderedClients[i], gt.orderedClients[gt.clientIndex] =
              gt.orderedClients[gt.clientIndex], gt.orderedClients[i]

I think this has a bad interaction with SendNext. SendNext grabs a pointer to the i-th client, but you're swapping the underlying elements (not pointers) here which means when SendNext finishes it could clear client.pending for the wrong batchClient.


Comments from Reviewable

@tamird
Copy link
Contributor
tamird commented Oct 25, 2016

Reviewed 15 of 15 files at r1.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/kv/dist_sender.go, line 1055 at r1 (raw file):
From the original PR:

where did you read that this is the error code to check for? aren't there other codes which correspond to the definitely-not-sent condition?

I looked at the GRPC source code. This is the fail fast condition and it's an error code which isn't used for any other purpose.

Can you add this to the comment, preferably with a link to the relevant part of the grpc source? Remember to replace blob/master with blob/SHA where SHA is the specific commit you looked at.


pkg/kv/dist_sender.go, line 1075 at r1 (raw file):
From the original PR:

That would lose the very-purposeful verbosity level specified here. I'm still not sure what appropriate usage of the recent explosion of new log/tracing/event APIs is. I'm wrapping log.ErrEvent in a log.V(2) as a solution, but it doesn't necessarily feel right. Did you have something else in mind?

@RaduBerinde: care to weigh in?


pkg/kv/transport.go, line 151 at r1 (raw file):

  clientIndex    int
  orderedClients []batchClient
  syncutil.Mutex

not everything in this structure is protected by this mutex. Can you please introduce an anonymous struct that includes only the mutex and fields protected by it?


pkg/kv/transport_test.go, line 41 at r1 (raw file):
From the original PR:

As discussed in many recent PRs, this is bad practice that results in this line number being reported in failures, which makes it impossible to determine which invocation of this function resulted in an unexpected result. Helpers such as this one should return errors for this reason.

It would nearly triple the size of this method. And FYI, this is why I made the error a t.Fatalf, so that the stack trace shows the line number in the test that resulted in the failure.

t.Fatal does not produce a stack trace, so I still think this needs to change.


pkg/sql/ambiguous_commit_test.go, line 45 at r1 (raw file):

// key is specified in advance, it can result in violated uniqueness
// constraints, or duplicate key violations. See #6053, #7604, and #10023.
func TestAmbiguousCommit(t *testing.T) {

Please be sure to address #10184 before merging this.


pkg/sql/ambiguous_commit_test.go, line 86 at r1 (raw file):
From the original PR:

please check these errors.

It's a SQLRunner, which already asserts there are no errors.

OK, but that's not clear to me as a reader of this code, so I would prefer that you check the errors to minimize head-scratching.


pkg/sql/ambiguous_commit_test.go, line 147 at r1 (raw file):
From the original PR:

unchecked error? how did this pass lint?

Because the SQLRunner asserts no error.

Huh? The linter doesn't know that. For the reasons mentioned above, I would prefer that you check this error.


Comments from Reviewable

@bdarnell
Copy link
Contributor

:lgtm:


Reviewed 15 of 15 files at r1.
Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


pkg/kv/dist_sender.go, line 1022 at r1 (raw file):

                  // should return it instead of trying other replicas. However,
                  // if there are still other RPCs outstanding or an ambiguous
                  // RPC error was received, we must continue.

The meaning of "continue" here is unclear. How about "we must return an ambiguous status instead of the returned error"?


pkg/kv/transport.go, line 235 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this has a bad interaction with SendNext. SendNext grabs a pointer to the i-th client, but you're swapping the underlying elements (not pointers) here which means when SendNext finishes it could clear client.pending for the wrong batchClient.

Yeah, this needs to change to pointers.

pkg/sql/ambiguous_commit_test.go, line 53 at r1 (raw file):

  committed := make(chan struct{})
  wait := make(chan struct{})
  tableStartKey := keys.MakeTablePrefix(51 /* initial table ID */)

Define this in terms of some constant (keys.MaxReservedTableID?). Or could we query for the table ID after creating it? (I don't think it's in information_schema yet but we should add it)


pkg/sql/ambiguous_commit_test.go, line 121 at r1 (raw file):

  go func() {
      // Use a connection other than through the node which is the current
      // leaseholder to ensure that we use GRPC instead of the local server.

Why does it matter that we use GRPC here? Isn't the ambiguous commit logic at a high enough level that we'd get an ambiguous commit error if the local node was hanging?


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


pkg/sql/ambiguous_commit_test.go, line 53 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Define this in terms of some constant (keys.MaxReservedTableID?). Or could we query for the table ID after creating it? (I don't think it's in information_schema yet but we should add it)

I have such a query in `pkg/cmd/zerosum/main.go`. I borrowed this query in #10232.

pkg/sql/ambiguous_commit_test.go, line 86 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

From the original PR:

please check these errors.

It's a SQLRunner, which already asserts there are no errors.

OK, but that's not clear to me as a reader of this code, so I would prefer that you check the errors to minimize head-scratching.

`SQLRunner.Exec` does not return an error.

pkg/sql/ambiguous_commit_test.go, line 147 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

From the original PR:

unchecked error? how did this pass lint?

Because the SQLRunner asserts no error.

Huh? The linter doesn't know that. For the reasons mentioned above, I would prefer that you check this error.

`sqlutils.Row.Scan` does not return an error.

Comments from Reviewable

@tamird
Copy link
Contributor
tamird commented Oct 26, 2016

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


pkg/sql/ambiguous_commit_test.go, line 86 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

SQLRunner.Exec does not return an error.

Ah, I thought this thing was a `database/sql.DB`.

pkg/sql/ambiguous_commit_test.go, line 147 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

sqlutils.Row.Scan does not return an error.

Oh, that structure generally produces sucky error message. Sigh.

Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/ambiguous-commit branch from fd578f5 to 4e88f38 Compare October 26, 2016 16:30
@spencerkimball
Copy link
Member Author

Review status: all files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


pkg/kv/dist_sender.go, line 1022 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The meaning of "continue" here is unclear. How about "we must return an ambiguous status instead of the returned error"?

Done.

pkg/kv/dist_sender.go, line 1055 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

From the original PR:

where did you read that this is the error code to check for? aren't there other codes which correspond to the definitely-not-sent condition?

I looked at the GRPC source code. This is the fail fast condition and it's an error code which isn't used for any other purpose.

Can you add this to the comment, preferably with a link to the relevant part of the grpc source? Remember to replace blob/master with blob/SHA where SHA is the specific commit you looked at.

Done

pkg/kv/dist_sender.go, line 1075 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

From the original PR:

That would lose the very-purposeful verbosity level specified here. I'm still not sure what appropriate usage of the recent explosion of new log/tracing/event APIs is. I'm wrapping log.ErrEvent in a log.V(2) as a solution, but it doesn't necessarily feel right. Did you have something else in mind?

@RaduBerinde: care to weigh in?

Done.

pkg/kv/transport.go, line 151 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not everything in this structure is protected by this mutex. Can you please introduce an anonymous struct that includes only the mutex and fields protected by it?

Reorganized for more clarity.

pkg/kv/transport.go, line 166 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

How could the previous code get away without locking while this code can't? Seems to me that you only need to protect client.pending here as gt.clientIndex is always accessed from a single goroutine. Also note that IsExhausted accesses gt.clientIndex without grabbing the lock. If this is true, I think you should rename grpcTransport.Mutex to grpcTransport.clientPendingMu.

Done.

pkg/kv/transport.go, line 235 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this has a bad interaction with SendNext. SendNext grabs a pointer to the i-th client, but you're swapping the underlying elements (not pointers) here which means when SendNext finishes it could clear client.pending for the wrong batchClient.

Good catch. Fixed.

pkg/kv/transport_test.go, line 41 at r1 (raw file):

Previously, tamird (Tamir Duberstein) 8000 wrote…

From the original PR:

As discussed in many recent PRs, this is bad practice that results in this line number being reported in failures, which makes it impossible to determine which invocation of this function resulted in an unexpected result. Helpers such as this one should return errors for this reason.

It would nearly triple the size of this method. And FYI, this is why I made the error a t.Fatalf, so that the stack trace shows the line number in the test that resulted in the failure.

t.Fatal does not produce a stack trace, so I still think this needs to change.

I added caller's line number to the verify function instead.

pkg/sql/ambiguous_commit_test.go, line 45 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Please be sure to address #10184 before merging this.

#10184 has been reassigned to pete, who's submitted a fix.

pkg/sql/ambiguous_commit_test.go, line 53 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I have such a query in pkg/cmd/zerosum/main.go. I borrowed this query in #10232.

I moved it into `sqlutils`.

pkg/sql/ambiguous_commit_test.go, line 86 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

From the original PR:

please check these errors.

It's a SQLRunner, which already asserts there are no errors.

OK, but that's not clear to me as a reader of this code, so I would prefer that you check the errors to minimize head-scratching.

There are no errors returned from `SQLRunner`. What's being ignored is the returned `sql.Result` object.

pkg/sql/ambiguous_commit_test.go, line 121 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why does it matter that we use GRPC here? Isn't the ambiguous commit logic at a high enough level that we'd get an ambiguous commit error if the local node was hanging?

Because the local server doesn't run a goroutine – it executes synchronously, so the dist sender thread of execution does not time out. I expanded the comment.

pkg/sql/ambiguous_commit_test.go, line 147 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

From the original PR:

unchecked error? how did this pass lint?

Because the SQLRunner asserts no error.

Huh? The linter doesn't know that. For the reasons mentioned above, I would prefer that you check this error.

Because there's no error returned. This returns a single `sql.Row` result, which this code calls `Scan()` on.

Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/ambiguous-commit branch from 4e88f38 to a477ff1 Compare October 26, 2016 16:53
@spencerkimball spencerkimball merged commit 5bc7bf1 into master Oct 26, 2016
@spencerkimball spencerkimball deleted the spencerkimball/ambiguous-commit branch October 26, 2016 17:27
@tamird
Copy link
Contributor
tamird commented Oct 26, 2016

I'll address my own comments in a follow-up.


Reviewed 6 of 6 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/kv/transport.go, line 241 at r2 (raw file):

}

func (gt *grpcTransport) setPending(replica roachpb.ReplicaDescriptor, pending bool) {

i don't understand the purpose of this function. all the callers already have a reference to the client, why can't they take clientPendingMu and set pending directly? Some of the callers currently hold a copy of the client, but there's no reason they couldn't take a reference.


pkg/kv/transport_test.go, line 41 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I added caller's line number to the verify function instead.

Please use `caller.Lookup` instead of `runtime.Caller` as we do everywhere else.

pkg/sql/ambiguous_commit_test.go, line 86 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

There are no errors returned from SQLRunner. What's being ignored is the returned sql.Result object.

It would be preferable not to assign it to `_`, since it will cause exactly the confusion that lead me to leave that comment.

pkg/sql/ambiguous_commit_test.go, line 103 at r2 (raw file):

          t.Fatal(err)
      }
      if !desc.StartKey.Equal(tableStartKey.Load().([]byte)) {

please extract a local variable for tableStartKey.Load().([]byte). This looks inherently flaky to me, as written.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/kv/transport.go, line 241 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i don't understand the purpose of this function. all the callers already have a reference to the client, why can't they take clientPendingMu and set pending directly? Some of the callers currently hold a copy of the client, but there's no reason they couldn't take a reference.

The previous code (which was doing as you suggest) was incorrect. See my comment.

Comments from Reviewable

@tamird
Copy link
Contributor
tamird commented Oct 26, 2016

Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/kv/transport.go, line 241 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The previous code (which was doing as you suggest) was incorrect. See my comment.

Ah, I see; of course. I'll add a comment to this.

Comments from Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0