Tell me about a situation where things did not go as planned. What did you learn from it?
S: I launched a payment verification portal for finance users, but the team was still using the old system and adoption was low.
T: I needed to understand the root cause and find ways to increase adoption.
A: I spoke to the finance team and learned that they were not confident in the new system because training had not been done properly. The system had also been built in a highly restrictive way that ignored the exceptions happening on the ground, so payments got stuck and were not processed, hurting the end-user experience. My learning was to spend more time understanding exception flows, start with a small rollout without edge-case nuances, and keep incorporating feedback while scaling up gradually.
R: I reverted the rollout and relaunched at a client level, starting with clients that had no special exceptions, while adding the exceptions as client-level configuration, which made future rollouts easier. As a result, system adoption rose above 90% within the next month.
I love this question because when it goes well, it goes really well, and when it goes poorly, it's memorable for all the wrong reasons. The good news is that an answer going poorly is really a sign that you are unprepared for the interview, and all of you are preparing, so this should not happen to you!
General framework for failure questions:
- Give context. What is the product / business you were working on?
- Talk about what failed. If you are an engineer, talk about how much money the outage cost the company; it's OK if it was pretty catastrophic as long as you convey how complex the system is. If you are a PM, pick a more curated failure, not the time you bet all your engineers on a failing product. Ideally, PMs should pick an execution failure, to avoid giving the interviewer any reason to doubt your judgement or strategic thinking.
- Explain what you did when you learned you were failing. Ideally this was calm and structured and had a good ending.
- Tell the interviewer what you learned.
- Share what you'd do differently next time.
An example from a PM lens:
- [context] I was the PM for user onboarding in a clinical research platform. The platform contained different clinical research studies that users could sign up to participate in. When onboarding to a study, users could read about the study in their native language and decide if they wanted to consent to participate in the research. From a user perspective, correct and available translations are critical to deciding if you want to participate. Correct and available translations were critical to the business from a regulatory perspective, since participants must willingly provide their consent in a language they are fluent in.
- [what failed] Engineering pushed a change that knocked out all non-English content translations for over 24 hours, meaning that users who were not fluent in English could only read study information in English during that window. This outage invalidated the consents of users who signed up for a clinical study during that period, since we could not ensure that their consent was truly informed. The failure happened for three reasons:
- new feature code broke the existing translations service
- there was no testing in place to stop engineering from pushing that bad code
- there was no process in place to ensure test planning and test coverage
- [what I did during the failure] I worked with the engineer who discovered the outage and gathered the cross-functional team who needed to understand the outage and prepare additional countermeasures (i.e., reaching out to the clinical research sites to let them know that there had been an outage of translations). I stayed in close contact with the operational team throughout the outage and helped to create content that we could send to recover those users once we were sure the solution was in place and robust. After we found the root cause, I helped the testing team write the system-level UI tests that would catch this issue in future testing, if there was a regression. I also worked with the engineering manager to improve the team's processes around test planning, and added a missing step to our launch planning process to ensure that all new features have adequate integration and system test coverage. In the end, we recovered quickly and were able to get over 80% of the lost users re-consented in their native languages within one week.
- [learning] I learned three key things from this experience.
- as a PM, it's my responsibility to ensure that my features are being adequately tested by engineering to protect and ensure the user experience.
- I can help ensure that my engineering teams have process and hygiene around test planning and test coverage.
- put users first, always. We had many users thank us for reaching out to get them re-consented in their native language, and it was an opportunity to build our relationship with them, since for many of them it was their first time interacting with our platform.
- [what I'd do differently] In general, I now take a more process-driven approach to ensuring adequate testing is in place. Since I am the one responsible for signing off on feature launches, I have implemented better checklists and processes that ensure that we don't miss key testing, and I'm happy to say that we have not had a test-coverage related outage in the platform since that one.
The same example from the engineering lens:
- [context] I worked on language translations for a clinical research platform. The platform contained different clinical research studies that users could participate in, and users could read about the studies in their native language to decide if they wanted to participate in the research. Correct translations were critical to the business from a regulatory perspective, since participants must willingly provide their consent with all of the information available to them, and they were critical to users because they need to fully understand the clinical research they are electing to participate in.
- [what failed] I pushed a change that knocked out all non-English content translations on the platform, meaning that non-native English speakers/readers were only able to read information in English. The outage lasted around 24 hours before I caught it and we resolved the issue. This outage invalidated the consents of everyone who signed up for a clinical study during that period, since we could not ensure that their consent was truly informed. The failure happened for two reasons: (1) I pushed code that broke the translations service, and (2) there was no testing in place to stop me from pushing that bad code.
- [what I did during the failure] I regularly tested my own code, end to end, in Spanish since it's my second language. While working on a new feature, I did some testing and noticed that, despite provisioning my environment to show me Spanish content, all the content was in English. To verify that the translations were broken, I double-checked my environment setup first, then I immediately rolled back production to a safe build where I had previously verified that Spanish translations were working. Then I started a post-mortem doc and began the technical investigation into what went wrong. I also informed our PM so they could gather the cross-functional team who needed to understand the outage and prepare additional countermeasures (i.e., reaching out to the clinical research sites to let them know that there had been an outage of translations). Once we found the root cause, I worked with the engineer whose translations service had broken to fix the issue. We tested the change together and verified that it was working before we rolled out the change (and all other changes that had been rolled back). I worked with that engineer to write additional integration tests to cover the failure path, and I worked with the manual test engineer to write additional manual tests which would catch any language outage.
- [learning] I learned three key things from this experience.
- as an engineer, it's my responsibility to ensure that my features are being adequately tested, through unit, integration and manual system level tests.
- I can help to ensure that my team has good process and hygiene around writing and maintaining automated integration tests that touch multiple services.
- bringing in key stakeholders like the clinical operations staff allowed us to solve the operational symptoms in parallel to fixing the technical issue, minimizing overall user confusion and loss for the outage.
- [what I'd do differently] Primarily, I would ensure that as I was launching new features, I would work more closely with the dedicated testers to ensure that they are covering key E2E test cases before we launched my feature. My team didn't have strong process around this, and so I missed it. The entire issue could have been caught either by better integration or even unit testing for the service that I broke, or by our manual testing efforts if tests had covered this use case. Today, when I design and plan to launch new features, I reserve significantly more time for test plan development, including unit and integration tests that I write myself, and also system-level manual tests that our testing team owns and maintains. In general, I take a more process-driven approach to ensuring adequate testing is in place.
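The kind of regression test described in both tellings of this story can be sketched in a few lines. This is a hypothetical illustration only: the names (`get_study_description`, the in-memory `TRANSLATIONS` table) are invented for the sketch, not taken from the actual platform. The key idea is to assert that supported non-English locales never silently fall back to English.

```python
# Hypothetical sketch of a regression test for the translation outage
# described above. All names are illustrative, not the real platform API.

TRANSLATIONS = {
    ("study-101", "en"): "This study evaluates a new treatment.",
    ("study-101", "es"): "Este estudio evalúa un nuevo tratamiento.",
}

SUPPORTED_LOCALES = ("en", "es")


def get_study_description(study_id: str, locale: str) -> str:
    """Return study content in the requested locale, falling back to
    English only when a translation is genuinely missing."""
    return TRANSLATIONS.get((study_id, locale), TRANSLATIONS[(study_id, "en")])


def test_non_english_locales_are_not_served_english():
    """Catch the outage mode: every locale silently serving English."""
    english = get_study_description("study-101", "en")
    for locale in SUPPORTED_LOCALES:
        if locale == "en":
            continue
        translated = get_study_description("study-101", locale)
        assert translated != english, f"{locale} content fell back to English"
```

A check like this, run in CI against a staging environment, would have flagged the bad push before it reached production, regardless of which service change caused it.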
In my previous company, one of the key aspects of the product was integrating with external third-party payment providers for auto-renewal subscriptions. In one of these integrations, a few months after we scaled up, we suddenly started getting complaints from users that they were being charged multiple times. From both a user and a product perspective, this was a big problem.
I worked with engineering to find the root cause and realized that, because of the increase in scale, the partner system was not sending charging notifications properly; some were delayed and some never arrived, even though the user had been charged. As a result, our system kept sending charging requests, and users got charged multiple times.
As a first step, I communicated this to the partner to make sure they understood that the root cause was on their side. Second, since users were being impacted, I decided to stop sending any charging requests to that partner. In the meantime, we quickly put in a stop-gap: a charging request would not be re-sent within the length of the charging period. These two steps solved the immediate issue. Then I went back to the drawing board and rewrote the requirements, which engineering used to solve the issue permanently.
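The stop-gap described above, suppressing repeat charge requests for the length of the billing period, can be sketched as a simple guard in front of the charging call. This is an assumed illustration: `ChargeScheduler`, `CHARGING_PERIOD`, and the in-memory store are invented for the sketch, not the actual system.

```python
# Hypothetical sketch of the stop-gap: never re-send a charge request for
# a subscription within one billing period, even if the partner's charge
# notification is delayed or lost. Names are illustrative.
from datetime import datetime, timedelta

CHARGING_PERIOD = timedelta(days=30)  # assumed monthly billing cycle


class ChargeScheduler:
    def __init__(self):
        # subscription_id -> timestamp of the last charge request we sent
        self._last_attempt: dict[str, datetime] = {}

    def should_send_charge(self, subscription_id: str, now: datetime) -> bool:
        last = self._last_attempt.get(subscription_id)
        if last is not None and now - last < CHARGING_PERIOD:
            # A request already went out this billing period; a missing
            # notification must not trigger a duplicate charge.
            return False
        self._last_attempt[subscription_id] = now
        return True
```

For example, a second request five days after the first is suppressed, while a request after the 30-day period goes through. A production version would persist the timestamps and, ideally, also pass an idempotency key to the payment partner so duplicates are rejected on their side too.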
Here is what I learned from this:
- Before any launch, figure out the scale you expect and make sure all internal and external systems can handle it. From that point on, I incorporated what I called PSR as part of the requirements.
- In the end, as a PM you are responsible for the consumer, so always plan for failure and worst-case scenarios in external systems as well. From then on, I always wrote requirements and use cases covering failures in third-party systems.
- I had our contractual terms modified so that we would not be legally liable for issues in the partner systems; the liability was made more explicit.
- Last but not least, we put a process in place to scale gradually rather than increase traffic massively, and added relevant alerts for such scenarios.
Top Microsoft interview questions
- How would you improve Outlook for the use case when people get overwhelmed by the number of emails received after returning from a vacation?
- Evaluate the upsides and downsides of building a super app — an app having all major B2C features including entertainment, e-commerce, food ordering, hotel booking, cab booking, chat, holiday planning, gaming, med ordering, service booking, etc.
- Design a product for job seekers to create resumes and find the best matching jobs easily and quickly.
- See Microsoft PM Interview Questions
Top Behavioral interview questions
- In layman's terms, describe your day-to-day activities as a Product Manager.
- How would you keep developers working on a product motivated and turning out quality work?
- There are 3 different items on top priority for a release, and the client is insisting on getting all 3 delivered in the same release. As a PM, you know there is not enough engineering capacity. What will you do?
- See Behavioral PM Interview Questions