US Backend outage

Incident Report for JourneyApps

Resolved

We can confirm that the incident is now resolved.

Posted Dec 22, 2021 - 10:15 MST

Monitoring

All systems have recovered as a result of the workarounds that we applied, in combination with the recovery of AWS.

We are monitoring and will post updates here should any issues persist.

We would like to apologize sincerely for the outage. We have already identified ways to make JourneyApps more resilient to upstream AWS outages such as these and will conduct the necessary root cause investigations to identify further improvements.

Posted Dec 22, 2021 - 07:37 MST

Update

We have applied our first workarounds to circumvent the affected AWS Availability Zones (where practically possible).

That said, the performance of the US backend region is still severely affected. We are working on scaling up performance now.

In addition, we are still not seeing any marked improvement from AWS despite their status updates.

To summarize the current impact:
- Some API requests to US region deployments may be successful.
- OXIDE seems to be working for code edits, though deploying to US region backends doesn't seem to work as expected.
- CloudCode tasks in the US region seem to run, though due to the degraded performance these tasks may still time out frequently and/or run into other errors. Deploying changes to CloudCode tasks will also not work currently.

We are downgrading the impact to "Partial outage" for now, and will update you as soon as performance of the backend is back to normal.

Posted Dec 22, 2021 - 07:21 MST

Update

AWS has communicated that power is being restored to the affected services, though we're not yet seeing the positive impact on our side. The outage continues.

We are in the process of applying workarounds that should restore service to core backend functions for the US region.

Posted Dec 22, 2021 - 06:47 MST

Identified

This is the latest update from AWS:

"5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center."

We have been attempting to fail away from the affected Availability Zone as recommended, but have not been successful due to errors from AWS. From our experience, it seems that the outage at AWS is more extensive than is being reported and is potentially affecting other Availability Zones too.

We are currently working on possible workarounds to restore partial functionality and will keep you updated.

Posted Dec 22, 2021 - 06:13 MST

Update

From our initial investigations, it seems that our infrastructure provider, Amazon Web Services, is having an outage that is affecting all our services hosted in the US region.

This is their most recent update:
"4:35 AM PST We are investigating increased EC2 launched failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue."

As mentioned previously, the impact on our services includes:

- US-region deployments, which are currently down. This includes the Backend Data Browser.
- API clients (e.g. integration brokers, CloudCode tasks and apps) will see errors when making API calls.
- [New]: OXIDE is working intermittently, but some customers might see errors when trying to load apps in OXIDE. Deploying apps to US-based deployments may not work at present.
- It is not yet clear whether CloudCode tasks in the US region are running. We are investigating.

Posted Dec 22, 2021 - 05:48 MST

Investigating

We are currently investigating an outage in the US backend region. Other regions are unaffected.

The outage will affect API calls to this region, whether that is from the app (syncing data or OnlineDB calls), CloudCode (DB calls), or other API clients.

All US-region deployments are affected, including Testing, Staging, and Production deployments.

The Backend Data Browser is also affected and currently does not load.

This outage started at 12:10 PM UTC (i.e. roughly 20 minutes ago).

Posted Dec 22, 2021 - 05:31 MST

This incident affected: CloudCode (CloudCode US Region), Build Your App (OXIDE), and Regional Cloud Backends (Backend - US).