Notes from my Google Cloud Professional DevOps Engineer Certification Exam

7 min read · Nov 2, 2020

Subscribe to my YouTube channel that teaches you to apply Google Cloud to your projects and also prepare for the certifications: youtube.com/AwesomeGCP. Check out the playlists I currently have for Associate Cloud Engineer, Professional Architect, Professional Data Engineer, Professional Cloud Developer, Professional Cloud DevOps Engineer, Professional Cloud Network Engineer, and Professional Cloud Security Engineer.

Preparation

Taking this exam showed me that it becomes easier the more time you spend learning. I was originally supposed to take this exam around Feb/March 2020 and had started preparing then. But then the pandemic hit and I couldn’t go to the test center. When news came that the exam would be available online, I prepared a little again, but I couldn’t take it because I got extremely busy with work. Finally, I was able to take it a few days ago. I didn’t take much time to prepare this time: I just skimmed parts of a Coursera Google Cloud course (see git repo here), did a Qwiklabs lab or two, and read through a handful of docs. And I passed! I won’t say it is an easy exam, but I was able to get through without much immediate effort since I had already put in effort towards it multiple times.

The Online Exam

I refactored this section into a separate post here because it is applicable to all exams: https://medium.com/@sathishvj/taking-the-google-cloud-certification-online-exam-a8d5a8d18550

Preparation

As I mentioned, I had prepared for this a couple of times. The first time around, I created videos of the practice questions on my YouTube Channel (AwesomeGCP). I reviewed all of them again, and that really helped. I also skimmed over a more recent course on Coursera that I hadn’t fully gone through before. I read through a few posts written by others to see what they’d faced. I’ve collected all those links in this git repo: https://github.com/sathishvj/awesome-gcp-certifications

And now, based on the exam areas that I encountered, here are some of the topics that I recommend you study, following the sections of the study guide.

1. Applying site reliability engineering principles to a service

Do the related Coursera and Pluralsight Courses. (listed here)
The SRE books, the bibles of the field, are available for free (available here). But they are very dry to read through. I wouldn’t necessarily recommend that you get through them in their entirety. There are alternative resources that explain the concepts in more easily understandable ways.
Go through Seth and Liz’s videos (link). They give a simple, clear explanation. These are also repeated in the Coursera and Pluralsight courses.
Where do SLIs and SLOs sit on a spectrum?
How do you identify what are the correct SLIs to choose? What characteristics should they have?
How do you set the SLOs based on the SLIs?
How aggressive or loose do you make SLOs?
What are the deciding factors for choosing SLOs?
What is an error budget?
What is the purpose and application of error budgets?
How does the team arrive at an error budget? Who are the stakeholders who need to be bought into that discussion?
Know the formula for error budget, and also know how to apply it. It is not as much about the mathematics of it as it is about understanding it when discussed in plain English.
Based on error budget, how do you plan feature velocity vs reliability stability? When can you go faster on features and when should you slow down and consider reliability?
What does it mean to say an SLI is below an SLO? I was weirdly confused by this phrasing during the exam though I’d understood it well when preparing. It is straightforward, but in the moment I had to “air draw” to figure it out. Note that SLOs sit somewhere on the SLI spectrum. And SLOs have to be close to but less restrictive than achievable SLIs.
What needs to be done in a postmortem?
What are the typical SLIs used for different types of services? Learn “The Four Golden Signals” for user facing systems. (link)
Who all are involved in the decision about SLOs?
What is toil? How do you reduce toil? (link)
Given a scenario, how do you set appropriate SLOs? You are expected to understand the business requirement, the available SLIs, customer expectations, and then set SLOs accordingly.
What is the correct way to do a postmortem? (link)
What are the output artefacts after a postmortem?
What are the requirements and tasks at different stages of an application w.r.t. DevOps? E.g., planning, testing, deployment architecture, capacity planning.
Softer aspects of working with teams. E.g. task allocation, don’t blame the person but fix the technical issue, etc.
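
To make the error budget questions above a little more concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (good requests divided by total requests) and a hypothetical 99.9% SLO over a 30-day window. The function names and the traffic numbers are made up for illustration; they are not from any Google library or from the exam.

```python
def error_budget(slo: float) -> float:
    """Fraction of events that are allowed to fail: 1 - SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based view of the same budget over a rolling window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_bad = error_budget(slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


slo = 0.999                        # hypothetical SLO target
good, total = 999_400, 1_000_000   # made-up request counts for the window

sli = good / total                 # measured SLI for the window
remaining = budget_remaining(slo, good, total)

print(f"SLI {sli:.4%} vs SLO {slo:.1%}")
print(f"Downtime allowed in 30 days: {allowed_downtime_minutes(slo):.1f} min")
print(f"Error budget remaining: {remaining:.0%}")

# Simplified policy: spend remaining budget on feature velocity,
# switch to reliability work once the budget is exhausted.
if remaining > 0:
    print("Budget left: keep shipping features")
else:
    print("Budget exhausted: slow down and work on reliability")
```

With these numbers the SLI is above the SLO, a 99.9% target leaves about 43.2 minutes of downtime per 30 days, and roughly 40% of the budget is still unspent. "SLI is below SLO" would be the case where `sli < slo`, which is also when the remaining budget goes negative.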

2. Building and implementing CI/CD pipelines for a service

Everything about Cloud Build. (link)
How to create steps? A decent understanding of the available builders. (link)
Cloud Build service account. (link)
What permissions does the CB service account have by default? What permissions need to be given for other tasks? (link)
How to include security checks in the CD pipeline? (E.g. vulnerability scanning)
GitOps-style CD (link)
How to gate deployment with approvals? (link)
Do at least one hands-on of Spinnaker. (link)
Skim through and understand the Spinnaker ebook. (link)
Understand the different deployment methodologies, especially w.r.t. k8s. (e.g. rolling update, canary, A/B)
Ways in which to deploy to App Engine and migrate traffic. (link)
How do you get Secrets into the Build pipeline? (link)
Purpose and usage of KMS. (link)
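
As a way to get a feel for how Cloud Build steps fit together, here is a rough sketch that assembles a build config as a Python dictionary and prints it as JSON (Cloud Build accepts build config files in YAML or JSON). The service name, deployment name, cluster, and zone are hypothetical placeholders. The builder images and the `$PROJECT_ID` and `$SHORT_SHA` substitutions are standard ones, but treat this as a study sketch and check it against the current docs rather than as a ready-to-use pipeline.

```python
import json

# Hypothetical image name. $PROJECT_ID and $SHORT_SHA are default
# substitutions; $SHORT_SHA is only populated for triggered builds.
IMAGE = "gcr.io/$PROJECT_ID/my-service:$SHORT_SHA"

build_config = {
    "steps": [
        {   # Step 1: build the container image with the docker builder.
            "id": "build",
            "name": "gcr.io/cloud-builders/docker",
            "args": ["build", "-t", IMAGE, "."],
        },
        {   # Step 2: push the image so the cluster can pull it.
            "id": "push",
            "name": "gcr.io/cloud-builders/docker",
            "args": ["push", IMAGE],
        },
        {   # Step 3: roll the new image out to a GKE deployment.
            "id": "deploy",
            "name": "gcr.io/cloud-builders/kubectl",
            "args": ["set", "image", "deployment/my-service",
                     f"my-service={IMAGE}"],
            # The kubectl builder reads the target cluster from these
            # variables (placeholder values here).
            "env": [
                "CLOUDSDK_COMPUTE_ZONE=us-central1-a",
                "CLOUDSDK_CONTAINER_CLUSTER=my-cluster",
            ],
        },
    ],
}


def render(config: dict) -> str:
    """Serialize the config so it could be saved as cloudbuild.json."""
    return json.dumps(config, indent=2)


print(render(build_config))
```

Steps run in order by default. The deploy step is the kind of task where the Cloud Build service account needs an extra role beyond its defaults, which is exactly the permissions question listed above.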

3. Implementing service monitoring strategies

Learn monitoring thoroughly.
Learn logging thoroughly.
Good to know some typical metrics and their names within Stackdriver/Operations. Don’t memorize them all, just get a sense of the different names. (link)
What are the types of graphs you can create? (e.g. bar chart, line graph, etc.) Which are useful to understand which kind of data?
What are the different value types and metric kinds? (link)
How do you create custom metrics? (link)
How do you create and use a dashboard? (link)
What are the parameters given to create a chart? (link)
Which metrics are supported by default? (link)
What are the logging and monitoring agents? How do you install them? When would you need to install them?
What permissions do they need?
If there is an issue receiving logs in Operations Logging, what could be the likely issues?
When do you need to integrate custom logging and monitoring in your application?
How to monitor multiple projects? (link)
Permissions to be set for different access levels and user groups.
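
On metric kinds, it helped me to think about how the same counter looks as a CUMULATIVE metric versus a DELTA metric (GAUGE being a plain point-in-time reading). The sketch below is only a conceptual illustration in plain Python, not the Monitoring API: the `Sample` type, the function name, and the sample data are made up, and the counter-reset handling is a simplifying assumption of mine.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (timestamp in seconds, value)


def cumulative_to_delta(samples: List[Sample]) -> List[Sample]:
    """Turn CUMULATIVE-style samples into DELTA-style per-interval changes.

    A drop in value is treated as a counter reset (for example a process
    restart), so the new value itself is used as that interval's change.
    """
    deltas = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        change = cur - prev if cur >= prev else cur
        deltas.append((ts, change))
    return deltas


# Made-up request counter, sampled once a minute, with a restart at t=180.
request_count = [(0, 100), (60, 160), (120, 250), (180, 20)]

for ts, change in cumulative_to_delta(request_count):
    print(f"t={ts}s: {change} requests in the last interval")
```

The cumulative series only ever grows between resets, while the delta series tells you what happened in each interval, which is usually what you want to chart or alert on.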

4. Optimizing service performance

Everything about Stackdriver (now called Operations) — debugger, tracer, profiler, logs. (link)
How do you identify network issues? E.g. flow logs, packet mirroring, etc.
When do you use which one? Is there any impact on network speed?
How to balance cost vs reliability?
What are the different network tiers? (link)
What are the potential issues with microservices? How do you debug performance? How do you identify issues?
Under what circumstances would you use alternate tracing/profiling tools?
Choosing an appropriate Load Balancer. (link)
Setting Load Balancers w.r.t. Kubernetes. (link)
Setting internal and external k8s Load Balancers.
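
For the cost vs reliability question, one way to reason about it is with a bit of arithmetic on redundancy. The sketch below assumes replicas fail independently and that the service is up as long as at least one replica is up, which is a strong simplification. The function names, the 99% per-replica availability, and the cost figure are hypothetical numbers for illustration.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability when at least one of N independent replicas must be up."""
    return 1.0 - (1.0 - single) ** replicas


def replicas_needed(single: float, target: float, max_replicas: int = 10) -> int:
    """Smallest replica count whose combined availability meets the target."""
    for n in range(1, max_replicas + 1):
        if combined_availability(single, n) >= target:
            return n
    raise ValueError("target not reachable within max_replicas")


SINGLE = 0.99            # hypothetical availability of one replica
COST_PER_REPLICA = 100   # hypothetical monthly cost, arbitrary units

for target in (0.99, 0.999, 0.99999):
    n = replicas_needed(SINGLE, target)
    print(f"target {target}: {n} replica(s), cost {n * COST_PER_REPLICA}")
```

Each extra replica costs the same but adds less and less availability, so the SLO you actually need, not the highest number you can reach, should drive how much redundancy you pay for.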