App Down

(set:$time to ($time+0))(replace:?time)[$time] You bump the version to 2.0.0 in the pom.xml file. Choose your next step: [[Commit and deploy]] [[Test]]

(set:$time to ($time+60))(replace:?time)[$time] Despite your best efforts to channel your inner Professor X, your mind reading tells you only that the app is down because you need a snack and coffee, and you highly doubt that is accurate. [[Request more information]]

(set:$time to ($time+2))(replace:?time)[$time] (set: $checkedNewRelic to true) You see that the app stopped sending metrics at the time the restart was attempted, but nothing else. (if: $checkedLogs is false)\ [[Check the logs]] [[Investigate the container runtimes.->Investigate the pods further in Kubernetes]]

(set:$time to ($time+2))(replace:?time)[$time] (set: $checkedLogs to true) You see that the logs stopped when they attempted to restart the application. Nothing suspicious or interesting to dig into. (if: $checkedNewRelic is false)\ [[Check Application Performance Monitoring ]] [[Investigate the containers/pods further in Kubernetes->Investigate the pods further in Kubernetes]]

(if: $testedApp is false)[ (set:$time to ($time+60))(replace:?time)[$time] Oh no!!! You made a typo in the pom file! Luckily, your PR builds and tests the app, and it caught this. However, now everyone on the team knows you don't test your changes before you commit them, and you are added to the wall of shame. You correct your changes, test locally, and resubmit. <img src="https://raw.githubusercontent.com/javaplus/Problem_Solving/refs/heads/master/images/cve-scenario/shame.webp"/> ] (if: $testedApp is true)[ (set:$time to ($time+5))(replace:?time)[$time] You commit your code. It passes all checks in the PR, allowing it to be merged into main. You kick off the deploy to get your code into an environment. ] [[Validate deployment]]

(set:$time to ($time+1))(replace:?time)[$time] You describe the pod and get the following: <img width=100% src="https://raw.githubusercontent.com/javaplus/Problem_Solving/refs/heads/master/images/cve-scenario/describePods.png"/> (zoom for more details of the command) Oh no! Blocked by a CVE present in your image! An unfortunate but very common occurrence in our world. On this platform, any time a Critical CVE is present in a container image, the container is prevented from starting up. Restarts in Kubernetes can happen for various reasons, and CVEs are constantly being discovered. What do you want to do now? [[Raise an exception]] [[Update the Dependency]] (text-style:"italic","expand")[Note: CVE stands for Common Vulnerabilities and Exposures, a project that classifies and tracks publicly known information security vulnerabilities.]

(set:$time to ($time+2))(replace:?time)[$time] You decide to examine the running containers more closely. Since your containers are running in Kubernetes pods, you know your best route is to use the //kubectl// CLI. So you pull up your handy dandy CLI and start exploring the containers with the //kubectl get pods// command. Let's see what we have: <img width=100% src="https://raw.githubusercontent.com/javaplus/Problem_Solving/refs/heads/master/images/cve-scenario/getPods.png"/> Yup, there is the problem. There are 0 ready containers and a //RunContainerError//. Now what? [[Describe the pod]] [[Try increasing the replicas for the pod]] [[Try restarting the pod]]

(set:$time to ($time+120))(replace:?time)[$time] You fill out an exception request to allow your pods to start back up, only to have it ''denied'' because the CVE is critical and a fix is readily available. [[Update the Dependency]]

(set:$time to ($time+10))(replace:?time)[$time] You have learned the basics of problem solving and can now effectively and politely ask for the information you need to dig into this problem. You send the following email:
//Hello Fellow Teammate! I politely request the following information to further diagnose your issue:
* In which environment are you testing?
* What is the URL of the endpoint you are using?
* Is the entire application down or just an endpoint not working?
* How can I recreate this problem?
* When did this problem start?
* What have you done to try to troubleshoot this issue? //
[[Send email]]

(set:$time to ($time+10))(replace:?time)[$time] Your teammate was a bit confused at first as to why you needed information to work with and why you could not simply read their mind, but eventually dug into the problem to provide you with some basic information:
* In which environment are you testing? //In the test environment//
* What is the URL of the endpoint you are using? // http://superapp-test.acme.net/customer/v2//
* Is the entire application down or just an endpoint not working? //Whole thing is toast//
* How can I recreate this problem? //Hit any endpoint //
* When did this problem start? //After we restarted the containers in our test environment//
* What have you done to try to troubleshoot this issue? // Ran ''kubectl get pods'' to see the running containers and there was an error that said ''RunContainerError''//
The plot thickens, and it appears we have a real doozy of a problem at hand. [[Check the logs]] [[Check Application Performance Monitoring ]] [[Check the application containers->Investigate the pods further in Kubernetes]]

(set: $checkedLogs to false) (set: $checkedNewRelic to false) (set: $vOneTwoOneEight to false) (set: $testedApp to false) (set: $increasedPods to false) (set: $restartedPod to false) You are a developer on an application that exposes several REST API endpoints and is deployed to a Kubernetes environment. Your REST APIs also sit behind a proxy gateway that passes requests through to your app in Kubernetes. The proxy handles routine concerns like logging and OAuth validation. You've recently deployed a new V2 version of your REST API, which adds a few new required fields to some requests and updates some responses compared to the V1 services, which are still available. You have received the coveted problem solving request as outlined below: //App is down, please help.// And so your journey begins :) [[Use Postman to test your application]] [[Request more information]] [[Attempt to read their mind]]

(set:$time to ($time+10))(replace:?time)[$time] (set: $testedApp to true) You go to start your app up locally, only to realize it won't even build! What is going on? The build logs say: ''Cannot find the artifact log4j:2.0.0.0'' Where did that extra ''.0'' come from? You accidentally typed it into the pom.xml. You correct your mistake. [[Commit and deploy]]

(align:"==>")[(border:"solid")[(color:yellow)[Time spent: [(str:$time)]<time| Minutes]]]

(set:$time to ($time+2))(replace:?time)[$time] (set: $increasedPods to true) You increase the replicas on your Kubernetes deployment, then run ''kubectl get pods'' again. Now you have multiple pods, all reporting the same error as the original: //RunContainerError// What do you want to try now? (if: $restartedPod is false)\ [[Try restarting the pods->Try restarting the pod]] [[Describe the pods->Describe the pod]]

(set:$time to ($time+2))(replace:?time)[$time] (set: $restartedPod to true) You issue the ``kubectl delete pod`` command to delete the pod(s). Your Kubernetes deployment kicks in and tries to start the pod back up in order to meet the number of replicas declared in the deployment. You see the new pod spinning up, but then you see the familiar error: ''RunContainerError'' [[Describe the pod]] (if: $increasedPods is false)\ [[Try increasing the replicas for the pod]]

(set:$time to ($time+20))(replace:?time)[$time] (if: $vOneTwoOneEight is true)[ You bump the version, build, and attempt a deploy, and the same error occurs! We need to bump the version to at least 2.0 to avoid this CVE. ] You search the artifact repository (Maven in this case) to see which versions are available and find the following:
* 1.2.18
* 2.0.0
Which version do you want to update to? (if: $vOneTwoOneEight is false)\ [[1.2.18->Update the Dependency]] [[2.0.0]] (set: $vOneTwoOneEight to true)

(set:$time to ($time+15))(replace:?time)[$time] In a reasonable panic, you first check production, fearing an outage. You send a simple request to the app in production and find that it is working fine there. You go through several different requests using methods such as GET, POST, PUT, and DELETE, but still find no issue. You feel you've wasted 15 minutes... [[Request more information]] [[Attempt to read their mind]]

(set:$time to ($time+10))(replace:?time)[$time] You check that the pod is up and running. You test that the application is running in the environment, and it all checks out. Your app logs look clean and Application Monitoring shows a healthy running app. Do a little dance and celebrate: you did it! <img src="https://raw.githubusercontent.com/javaplus/Problem_Solving/refs/heads/master/images/cve-scenario/hansSalute.webp"/> (link: "Try Again to Improve Your Time")[(set: $time to 0)(goto: "Start")]