godoc.org: various degrees of service degradation and unavailability on Sunday, January 19, 2020 #36642
Sorry, I've never touched godoc.org. Not even sure where it lives. |
(There has also been an issue created in golang/gddo#670.) |
cc @broady |
I can confirm, there is an excessively high frequency of 500 Internal Server Error responses on godoc.org at this time. I'll look into whether there's something I can do now. Until this issue is resolved, consider using pkg.go.dev, which also displays documentation for Go packages (see #33654), as a workaround. The pkg.go.dev server was designed to scale to a large number of packages and users better than godoc.org, and should be able to handle higher workloads. /cc @julieqiu |
I've taken some measures to address the problem, and the 500s are no longer happening at very high frequency as of 4:45 PM EST (12 minutes ago). I'll remove the Soon label since this is mitigated. There's more to be done, but that can be done later. Please let us know if you're still completely unable to reach godoc.org. |
I’m still getting “Error: Server Error” responses on every page. Tested in clean Chrome and Safari. Edit: just worked |
Thanks guys for the quick fix / workaround. I'd be interested in what the underlying cause was, once it's identified. |
@dmitshur's suggestion worked for me. For example, changing: https://godoc.org/golang.org/x/text/cases To: https://pkg.go.dev/golang.org/x/text/cases but to confirm, no, it's still not working:
|
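The URL change described above can be scripted for anyone with many godoc.org links to update. This is only a minimal sketch: the helper name `toPkgGoDev` is hypothetical, and it assumes a plain package URL whose import path carries over unchanged, as in the example from this thread. Query strings and special godoc.org pages are not handled.

```go
package main

import (
	"fmt"
	"net/url"
)

// toPkgGoDev (hypothetical helper) rewrites a godoc.org documentation URL
// to the equivalent pkg.go.dev URL by swapping the host and keeping the
// import path. URLs on any other host are returned unchanged.
func toPkgGoDev(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Host != "godoc.org" {
		return raw, nil
	}
	u.Scheme = "https"
	u.Host = "pkg.go.dev"
	return u.String(), nil
}

func main() {
	out, err := toPkgGoDev("https://godoc.org/golang.org/x/text/cases")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```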
The measures I applied in #36642 (comment) worked for a few hours, but by now the redis backend that godoc.org uses has become completely unresponsive, and we'll need to do more work to get it operational again. We'll update this issue when the godoc.org service is operational again. Until then, |
Another update. We've restored the redis backend, and so godoc.org should be operating okay now. I'll watch it some more to make sure it continues to be stable. Edit: The godoc.org server has been serving successfully without any 500s since this comment was originally posted, so I'll update this issue to be resolved. |
@dmitshur when can we expect a post-incident analysis to be completed with a publicly available write-up? Edit: I should have clarified, I am assuming the issue has been addressed with the site behaving now. I could be wrong. |
@theckman isn't that a little presumptuous? I would like one as well, but it's not owed to you or anyone. What makes you think that it is? |
@cup I think it's worth noting that other organizations that support these sorts of systems often have write-ups for their failures; npmjs.org might be a good example. (I can't believe I'm looking to NPM as a good example...) That said, I think "well, others do it" isn't a great answer, so let me try to explain why I think it's important. Ultimately we are putting trust in this organization, as a community, to host and run all of the core infrastructure for our language. I feel there is a need for transparency into their handling of incidents so that we can be confident they are doing right by the community, but also so that we can give them feedback where we feel we have misaligned expectations. As a Site Reliability Engineer, I find it unacceptable when issues are discovered because customers reported them. There is an obligation to: A. Have automatic discovery of issues. From the outside it looks like neither is in place, which would be surprising when you consider the SRE book came from Google. Such incident reports, detailing what contributed to the failure and how long it took to remediate, would shine light on those issues. And if they aren't paging folks to bring Go infrastructure back up when it's failing, are they doing right by the community? |
@theckman yeah, I get that, you think it's important, I do too. But why do you feel entitled to demand one? For example, compare what you said:
with another option:
one is a demand, one is a (polite) ask.
no, there's not. Are you paying Google or Alphabet? This is open source software, friend. The Golang team and the larger Go community don't have any obligation to you in this regard. |
@cup I'm not demanding one. I asked when we can expect one to be available. A valid answer to that question is "We aren't planning on writing one." In response to that, the only thing I can do is provide context on why I feel it's important to write one. |
@cup Forgot to add: As a Site Reliability Engineer, those are the obligations of my role in that job. I'm not saying those are obligations of Google as an entire company, but they are the expectations I have, given that they have a strong SRE organization. |
To my knowledge, nobody has ever committed to an availability goal for godoc.org. It is operated as a best-effort service, and it is slated to be replaced by pkg.go.dev. @dmitshur took time out of his weekend to fix it, which as far as I'm concerned was beyond the call of duty. I think it would have been nice to at least thank him. I don't think it's fair to assume that a post mortem is owed. This is quite different from other services, notably proxy/index/sum.golang.org, which are critical dependencies for many Go users. We don't have a published policy for post mortems, but I think if those experienced an outage anything like this scale it would be appropriate to publish one. If anyone would like to discuss the level of support that these services receive, I think a thread on golang-dev or golang-nuts would be a better forum. Or, if there's a concrete proposal, (e.g. publicize a post mortem for any significant outage) perhaps file an issue. But I don't think this discussion is shaping up to be productive. |
@cup I don't think your comment was appropriate. I've not attacked you, or anyone else here. I am asking questions for the benefit of the community and I'm not going to tolerate being treated like that. Edit: For posterity, since you edited the message: |
@theckman I am sorry for my comment. If you notice carefully I deleted it. If you desire to post screenshots, then that reflects poorly on you as it's off topic for this discussion.
are you though? I hesitate to say it, but it seems from your presumptive wording that you're posting for your own benefit, or perhaps Netflix's benefit. |
@cup I raised the issue [1] because people were in the Gophers Slack trying to figure out what was going on with getting their package documentation. People were confused, and godoc.org is currently the entrypoint we point a lot of newbies to for things. It's pretty critical from the user support perspective. So being 100% honest with you, I am asking on behalf of the Go community and nobody else. If these issues can be acknowledged and made public sooner, it makes providing user support in the community much easier (and it looks better on us too). Hah, I really wish it was for the benefit of Netflix. It would mean we're writing a lot more Go than Java or NodeJS. I'd selfishly love that, but until then I'll keep dreaming. 😄 I noticed your edit after I sent the message, and so I was in a bit of a tough spot. I'm morally against editing my own messages to change my stance on things, especially in cases where I feel I've put my own foot in my mouth. In this case my message no longer had the relevant context and I felt I needed to add it. Since we've not had a chance to collaborate before, it may be good to share my personal context. I have this stance because I feel my failures should be public/transparent to others, so they can form their own opinions and have an opportunity to not make mistakes I've made. The side effect here is that I try not to edit away things I have second thoughts about. I may add an edit indicating I was wrong, but my original message stands. To summarize my own bar for these things: edits for clarifications or typo fixes 👍, but edits to change my message completely / rewrite history 👎. [1] golang/gddo#670 Edit: Providing it as an edit instead of a separate comment to avoid followers getting hit with another Issue Update Email.
Since my motives are being attacked, here is the Slack conversation where I called out raising the issue (right after people were experiencing problems): https://gophers.slack.com/archives/C029RQSEE/p1579455636310300 |
fair enough, but those people are looking for Go package documentation, not a post mortem. You are looking for that. So even if you were acting in good faith to help those people, getting a post mortem is only going to help you and others looking for that, not for people who just want the documentation. |
@cup Not looking for a post mortem since I'm pretty confident this didn't kill anybody. 😉 My desire for such a retrospective is around this one line:
If we do a retrospective and discover it's not clear who to communicate these issues to and how, that's an extremely valuable learning. If we're able to make changes from that, it would help those of us who are fielding user questions, and pointing them to different Go-related resources. I think a good analogy might be businesses with Customer Support organizations. If there is something going on, most try to provide that sort of context to their support agents so they can better communicate with the people contacting them.
Being able to do this would be super nice. |
It's a little disappointing to me that a simple request for a post-mortem for an outage on a system many people rely on is being met with this level of aggression. There's really no need to accuse people of being entitled or having ulterior motives for asking for one. Moreover, the ball on this request is pretty much in the go team's court, and speculating on their answer and talking down to community members for requesting one is not helpful. It's more constructive to see what the go team says. Isn't this why we have a CoC? |
Let's please everybody take a break from this discussion and see what @dmitshur has to say. Please try to be polite and respectful at all times on this issue tracker. Thanks. |
The godoc.org service continues to be stable and is not serving unexpected 500 errors anymore, so I'll close this.
The main cause of the 500 errors that were being served on Sunday was that the backing redis server that stores Go package documentation started to misbehave. Many of the redis operations were failing with an "ERR max number of clients reached" error. Almost every page on godoc.org needs to talk to the redis server, and when that operation wasn't successful, a 500 Internal Server Error was rendered.
In #36642 (comment), I made changes to scale down the number of frontend instances of gddo-server in order to reduce the total number of connections to the redis server. This bought some time, as the "ERR max number of clients reached" errors were greatly reduced, but after a few hours (see #36642 (comment)), they started to reoccur continuously. At that point, we needed to restart the redis server for it to become responsive again.
The root cause was a redis connection leak that caused the number of redis client connections to grow gradually over the last 6 weeks to an unsustainably high number. There were no measures in place to limit or detect a large number of redis client connections, so we did not notice it sooner. Now that it is under control, I plan to add an alert for that metric so we can detect and resolve any recurrence of this problem before it can cause a service disruption.
Edit: The redis connection leak was resolved and an alert was added.
We want to provide a good experience for users viewing Go package documentation. However, I want to reiterate what @heschik said above in #36642 (comment). Even with additional alerting, godoc.org support continues to be provided on a best-effort basis, because it is slated to be eventually replaced by pkg.go.dev. The pkg.go.dev website can show documentation for specific versions of a Go package, and it has been designed to better handle the number of Go packages that exist today. It is currently and will continue to be staffed by an active on-call rotation. |
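For readers who want to see what "limit or detect a large number of client connections" could look like in a Go frontend, here is a minimal sketch. It is not the change that was made to gddo-server. The type `connLimiter`, its methods, and the 75% alert threshold are all hypothetical, and it uses only the standard library rather than a real redis client. The idea is that each process refuses to hold more than a fixed number of connections, and reports when usage nears that cap so a leak shows up long before the server's own client limit is hit.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errLimitReached is returned instead of opening yet another connection.
var errLimitReached = errors.New("connection limit reached")

// connLimiter (hypothetical) caps how many backend connections one process
// may hold open and exposes the current count for monitoring.
type connLimiter struct {
	mu    sync.Mutex
	max   int
	inUse int
}

func newConnLimiter(max int) *connLimiter {
	return &connLimiter{max: max}
}

// Acquire reserves a connection slot, or fails fast when the cap is reached.
func (l *connLimiter) Acquire() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inUse >= l.max {
		return errLimitReached
	}
	l.inUse++
	return nil
}

// Release returns a slot. A leak is any code path that forgets to call this.
func (l *connLimiter) Release() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inUse > 0 {
		l.inUse--
	}
}

// ShouldAlert reports whether usage has reached the given fraction of the cap.
func (l *connLimiter) ShouldAlert(fraction float64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return float64(l.inUse) >= fraction*float64(l.max)
}

func main() {
	lim := newConnLimiter(4)
	// Simulate a leak: slots are acquired but never released.
	for i := 0; i < 6; i++ {
		if err := lim.Acquire(); err != nil {
			fmt.Printf("request %d: %v\n", i, err)
			continue
		}
		fmt.Printf("request %d: ok, alert=%v\n", i, lim.ShouldAlert(0.75))
	}
}
```

In a real service the count would be exported as a metric and the alert evaluated by the monitoring system. The in-process check here only illustrates the threshold idea.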
All pages are returning "Internal server error".
cc @dmitshur @andybons and maybe @bradfitz ?