RESOLVED: Students unable to submit files, see previews (Nov 6)

The following information was provided by Instructure.

Updated on November 9, 2018 at 3:26 pm

Users were unable to submit files to assignments in Canvas and may also have seen delayed rendering of file previews, uploads, and downloads between 11:15 AM MST and 1:00 PM MST on 6 November, 2018.

OVERVIEW
Users at your institution who tried to submit a file to an assignment between 11:15 AM MST and 1:00 PM MST on 6 November, 2018 received an error message. Some also saw delayed rendering when trying to preview a file or noted delays when uploading or downloading files. A capacity issue caused this incident. We manually increased capacity to solve the short-term problem; we changed a default configuration value and updated our auto-scaling settings to prevent future, similar incidents.

DETAILS
We’ve recently applied the concept of service-oriented architecture to Canvas development. This means we’ve delegated several key functions to separate, stand-alone services that are independent of the core Canvas application. This approach makes core Canvas more stable, scalable, and maintainable and allows for quicker innovation in the delegated functions.

We’ve been gradually deploying a new Files service over the last six months. As the name suggests, its purpose is to handle files wherever they appear in Canvas: submitting to assignments, uploading or downloading, rendering previews, etc. We’ve deployed the service at a cautious, deliberate pace because files play such a critical role in teaching and learning.

Load on the Files service neared the limit of what its supporting resources could handle at about 9:45 AM US Mountain Time (MT) on 6 November. Canvas routinely manages load spikes by adding resources automatically, as needed. But auto-scaling did not happen in this case because one component of the Files service was already using the maximum number of servers defined in its configuration. We set this upper limit during the testing and limited-release phases of development, and we had not updated it as we added more load to the service.
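
The effect of a stale upper limit can be sketched in a few lines of Python. This is an illustrative model only, not Instructure's auto-scaling code: the function name `desired_servers`, the per-server capacity, and every number below are assumptions chosen for the example.

```python
import math

def desired_servers(load, per_server_capacity, min_servers, max_servers):
    """Return the server count a simple auto-scaler would request.

    All parameters are hypothetical; real auto-scalers use richer signals.
    """
    needed = math.ceil(load / per_server_capacity)
    # The configured bounds always win, even when the load needs more.
    return max(min_servers, min(needed, max_servers))

# A limit left over from testing (hypothetical value) caps the fleet at 4.
capped = desired_servers(load=900, per_server_capacity=100,
                         min_servers=2, max_servers=4)

# The same load with a limit sized for production traffic.
raised = desired_servers(load=900, per_server_capacity=100,
                         min_servers=2, max_servers=20)
```

In this sketch the scaler would like nine servers in both cases, but the first call can never return more than the configured maximum, which mirrors how a component already at its limit cannot grow no matter how much load arrives.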

With the service operating near capacity, requests from users began to queue, waiting for resources. At about 11:15 AM MT, users began to see error messages as some requests timed out while waiting.
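
A toy queue model shows why errors can lag well behind the moment capacity runs out. It is a sketch under stated assumptions, not a model of the Files service: `simulate_queue`, the tick-based timing, and the rates used are all hypothetical.

```python
from collections import deque

def simulate_queue(arrivals_per_tick, served_per_tick, timeout_ticks, ticks):
    """Count requests served and timed out in a simple FIFO queue model."""
    queue = deque()  # holds the arrival tick of each waiting request
    served = 0
    timed_out = 0
    for now in range(ticks):
        queue.extend([now] * arrivals_per_tick)
        # Drop requests that have waited longer than the timeout.
        while queue and now - queue[0] > timeout_ticks:
            queue.popleft()
            timed_out += 1
        # Serve as many waiting requests as capacity allows.
        for _ in range(min(served_per_tick, len(queue))):
            queue.popleft()
            served += 1
    return served, timed_out

# Capacity matches demand: nothing waits long enough to time out.
balanced = simulate_queue(arrivals_per_tick=5, served_per_tick=5,
                          timeout_ticks=3, ticks=20)

# Demand slightly exceeds capacity: the backlog grows until requests time out.
overloaded = simulate_queue(arrivals_per_tick=6, served_per_tick=5,
                            timeout_ticks=3, ticks=100)
```

Even a small, sustained shortfall lets the backlog build quietly for a while before the oldest requests start to exceed the timeout, which is consistent with the gap between load nearing the limit and users first seeing errors.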

MITIGATION
The timeout errors alerted us to the problem, and engineers began troubleshooting the issue immediately. Most components of the service looked healthy upon first examination. It took time to identify the resource-constrained component and add more resources manually, and it took a bit longer for the new resources to work through the queued requests and return the service to a healthy state. The Files service was operating normally again by 1:00 PM MT.

Manually adding resources solved the short-term problem on 6 November. Engineering has updated the Files service configuration to increase the maximum resource levels for the component that caused this incident. Engineering has also reviewed the maximums for the other components of the service to ensure they are appropriate for current potential load.

Ideally, monitoring would have alerted us when we’d reached our limit on resources, before users saw meaningfully slow performance or error messages. Even with a new, more-appropriate maximum capacity setting in place, our auto-scaling routine can only function effectively when monitoring alerts it to a need. Engineering will review both monitoring and scaling mechanisms for the Files service and make sure they are up to the task.
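
One common way to get that kind of early warning is to alert on how close a component is to its configured maximum rather than waiting for user-facing errors. The check below is a minimal sketch of the idea, not a description of Instructure's monitoring; the function name `capacity_alert` and the 80% warning threshold are assumptions.

```python
def capacity_alert(current_servers, max_servers, warn_ratio=0.8):
    """Return an alert level based on how close the fleet is to its limit.

    The names and the default threshold are hypothetical.
    """
    if max_servers <= 0:
        raise ValueError("max_servers must be positive")
    usage = current_servers / max_servers
    if usage >= 1.0:
        return "critical"  # no headroom left; auto-scaling cannot add more
    if usage >= warn_ratio:
        return "warning"   # approaching the limit; review the maximum
    return "ok"

at_limit = capacity_alert(current_servers=4, max_servers=4)
near_limit = capacity_alert(current_servers=8, max_servers=10)
healthy = capacity_alert(current_servers=3, max_servers=10)
```

A warning level below 100% gives engineers time to raise the limit before the component is fully saturated.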

CONCLUSION
We apologize for the trouble this incident caused for you and your users on 6 November. It’s a privilege to work with you, and we will apply lessons from this incident to grow and improve.

Resolved November 6, 2018 at 2:17 pm

Users in the US Region are no longer experiencing the page errors.

WHAT IS HAPPENING?
Our engineers have implemented a fix for the page errors.

WHAT ARE WE DOING ABOUT IT?
No further action is needed at this time.

WHAT HAPPENS NEXT?
We are gathering details about the cause of this situation and will provide an update once they are available. For now, though, users can expect normal performance levels within Canvas.

Originally reported November 6, 2018 at 1:29 pm

Some Canvas users in the US region are seeing page errors while trying to submit assignments.

WHAT IS HAPPENING?
Users are continuing to receive page errors when trying to submit or upload files.

WHAT ARE WE DOING ABOUT IT?
Our engineers are investigating the root cause. We may have found a potential cause, but we’re still testing to ensure it’s going to provide a resolution.

WHAT HAPPENS NEXT?
We will provide an update when one becomes available. We will also post updates to status.instructure.com. We greatly appreciate your patience with this situation.