Node Down Response
3 minute read
This process is intended to help customer support personnel quickly identify the scope of a node down problem and get services back online as quickly as possible.
Node Down Triage
Triage Checklist
Determine Production Status of the Trustgrid Node
Confirm High Availability or Disaster Recover is Functioning
If clustered, is the partner active and working
If there is a second site, confirm if it is active
Determine the type of outage (Control Plane, Data Plane or Both)
Determine Production Status of the Trustgrid Node
Use the Production Status Tags to determine if the node is in use and expected to be online.
Confirm High Availability or Disaster Recover is Functioning
Before troubleshooting why a node is down we should determine if the services it provided can be provided in the interim by a cluster member or devices at a secondary/disaster-recovery site.
Is the Node a Cluster Member
Check if partner member is online:
If Yes:
- Confirm partner member is now active and operating normally. - The issue is limited to this specific Node. This could be its power input, the hardware or operating system, configuration, or its internet connection (if different than partner member).
If No:
- Possible site-level issue including power or internet provider - Proceed to Secondary / DR Site
Is there a Secondary / DR Site for this Node/Cluster?
Are the nodes deployed at the secondary site online in the Trustgrid?
If No:
- Confirm nodes for other end-user sites are not also offline which might indicate a wider spread issue. Escalate to Trustgrid Support if you suspect a major issue with the Trustgrid system. - If limited to a single customer it is recommended to contact that customer immediately. They may be experiencing an outage or performing maintenance.
If Yes:
- If the customer is configured for Automatic Failover between sites, verify that traffic is flowing through the active member at the secondary site. - If the customer is configured for Manual Failover you will need to adjust the route destination to point to the secondary site.
Determine Connectivity Status of the Trustgrid Node
Because Trustgrid provides independent control and data planes, there are a few ways it can manifest as “down”:
Field Name | Description |
---|---|
Control Plane Down | The node appears offline from within the Trustgrid portal. This indicates that the node is disconnected from the control plane either because it has shutdown or has been unable to send a heartbeat notification to Trustgrid within the past 10 minutes. |
Data Plane Down | The node appears online from within the portal but reports it is unable to connect to one or more gateway nodes. This can be indicated by the Data Plane Status indicator when viewing the node in the portal, or by receiving a “Gateway Connectivity Health Check” failure event notification. Both of these only work if the Control Plane is currently working. |
Both Control and Data Plane Down | In this situation the node appears down in the Trustgrid Portal and users/applications are unable to reach services across the data plane between the Gateway and Edge sites. This is the most common scenario. While the Data Plane is most critical for the services provided across the device, first priority should be restoring the Control Plane connection so that additional troubleshooting tools are available. Often this process also uncovers the reason the data plane is down. |
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.