Notes from my Google Cloud Professional Data Engineer Exam
Subscribe to my YouTube channel that teaches you to apply Google Cloud to your projects and also prepare for the certifications: youtube.com/AwesomeGCP. Check out the playlists I currently have for Associate Cloud Engineer, Professional Architect, Professional Data Engineer, Professional Cloud Developer, Professional Cloud DevOps Engineer, Professional Cloud Network Engineer, and Professional Cloud Security Engineer.
Immediately after the exam I do a memory dump as notes. Hence it is also quite unordered. This is a sanitized list that gives general topics and questions I encountered. The intention is not to give you the questions, but to give you topics that you can be prepared for. I was often stumped by some questions; hopefully you can be more prepared based on my experience. Wish you the very best!
Tough exam. I assumed this one would be easier because I spent more time preparing and I had the experience of the previous certifications. After the exam I went over the questions again to remind myself later what areas were covered — the answer is, everything. Zero direct questions. Every question was embedded in a situation/use case.
- BigQuery Data Transfer Service. I knew of Storage Transfer Service and the BigQuery connectors, but I went ‘huh?’ when I saw this.
https://cloud.google.com/bigquery/transfer/ (Edit: at the time that I wrote the exam, this was new for me. Since then, its capabilities have also expanded.)
- IAM + Dataflow. The Dataflow Developer role.
https://cloud.google.com/dataflow/docs/concepts/access-control
- (Outdated; see the 2020/06 edit that follows.) IAM + BigQuery. Access level via tables/datasets. Remember that you cannot restrict access at table level. It is only at dataset level. Also look up what Authorized Views are.
https://cloud.google.com/bigquery/docs/access-control
- Edit 2020/06: IAM + BigQuery. Access level. As Nikhil pointed out in the comments, you can now restrict access at table level.
https://cloud.google.com/blog/products/data-analytics/introducing-table-level-access-controls-in-bigquery
- BigQuery: partitioning tables. What are they partitioned on: ingestion time, or a timestamp or date column? How are they named? How are they then accessed in queries? Ingestion-time partitioned tables are filtered using the _PARTITIONTIME pseudo-column.
https://cloud.google.com/bigquery/docs/partitioned-tables
- BigQuery. Syntax for wildcards in BigQuery table names. And in legacy SQL?
https://cloud.google.com/bigquery/docs/querying-wildcard-tables
- BigQuery: table date range for bq. Accessing date-sharded and partitioned tables with the legacy SQL functions TABLE_DATE_RANGE and TABLE_QUERY, and with the standard SQL _TABLE_SUFFIX pseudo-column.
https://stackoverflow.com/questions/22641894/bigquery-wildcard-using-table-date-range
- Cloud Spanner: secondary indexes. How indexes are created for you and how you can create secondary indexes.
https://cloud.google.com/spanner/docs/secondary-indexes
- Datastore: multiple indexes. Default indexes. Syntax for creating custom, composite indexes.
https://cloud.google.com/datastore/docs/concepts/indexes
- Bigtable: row key scheme. What are the recommended ways for creating the row key? How do you avoid hotspotting? Should you use a timestamp, and where in the key?
https://cloud.google.com/bigtable/docs/schema-design
- Bigtable: ways to optimize.
https://cloud.google.com/bigtable/docs/performance
- Pub/Sub, Dataflow, Dataproc: properties and uses of these products. The courses from Coursera, Linux Academy, and Cloud Academy cover these well.
- Dataproc: using GCS instead of the cluster’s own file system. It is a best practice to store data in Google Cloud Storage instead of HDFS. Because the data lives outside the cluster, you can delete the compute nodes after data crunching and save cost on them.
- BigQuery + Data Studio: caching/pre-fetch cache. Learn how you connect Data Studio to storage solutions. Learn the difference between default caching (which cannot be disabled) and pre-fetch caching (which can be disabled). What is the difference between doing that with Viewer credentials and Owner credentials?
https://support.google.com/datastudio/answer/7020039?hl=en
- Dataprep: jobs. How are Dataprep jobs created and run? What permissions do you need? A term I saw was that this is a more ‘casual’ way of data cleaning. Dataproc/Dataflow would be more programmatic and therefore ‘intense’, I suppose.
https://cloud.google.com/dataprep/docs/html/Jobs-Page_57344842
- Data Studio: visualisation. What are the causes of stale data? And how do you get the latest? What caching options do you need to set?
- Machine Learning: feature crosses. Learn what these are and what issues they solve.
https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture
- Machine Learning. Go through the Coursera course on machine learning.
https://www.coursera.org/learn/serverless-machine-learning-gcp/home/welcome
- Machine Learning: dealing with overfitting.
https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting
- Machine Learning: regularization. What does it mean to increase or decrease regularization?
https://www.coursera.org/lecture/deep-neural-network/why-regularization-reduces-overfitting-T6OJj
- Dataproc: how to control scaling? Configure autoscaling?
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
- Avro file format. This is a compact binary format that BigQuery and Dataflow can work with directly.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
- Know a bit about other technologies outside of just GCP also. Remember that as a Professional on GCP, you are also expected to know technologies in the general ecosystem. You might have to decide between GCP solutions and alternatives in the market. Just memorizing GCP won’t cut it.
- Key Management Service. Using KMS with non-GCP products. Note that there is default encryption where Google manages all the keys, then there are customer-managed encryption keys (CMEK), and also customer-supplied encryption keys (CSEK).
https://cloud.google.com/kms/docs/
- BigQuery query plan. BigQuery allows you to see the query plan and execution profile for queries that you run. Know the phases, the difference between average and max time, why there can be skew in the plan, and how to optimize for it.
https://cloud.google.com/bigquery/query-plan-explanation
- BigQuery + GCS. Know how to link tables between GCS and BigQuery as permanent tables and temporary tables.
https://cloud.google.com/bigquery/external-data-cloud-storage
- You don’t have to memorize the case studies, but study them well. Work through solutions for them by yourself. The Linux Academy course has a module that goes over the case studies. (I believe the updated exam has no more case studies.)
- BigQuery. Know what a federated table is. While you are at it, learn also about clustered tables.
https://cloud.google.com/bigquery/external-data-sources
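To make the BigQuery partitioning and wildcard bullets above a bit more concrete, here is a minimal sketch in Python that only builds the query text. The table names (my_project.my_dataset.events and the events_ prefix) and the helper function names are made up for illustration, and nothing here actually calls BigQuery.

```python
# Hypothetical helpers: they only build SQL text, nothing is sent to BigQuery.


def partition_query(table: str, day: str) -> str:
    """Standard SQL filter on one day of an ingestion-time partitioned table."""
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONTIME = TIMESTAMP('{day}')"
    )


def wildcard_query(prefix: str, start: str, end: str) -> str:
    """Standard SQL wildcard over date-sharded tables such as events_20200101."""
    return (
        f"SELECT * FROM `{prefix}*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'"
    )


def legacy_date_range_query(dataset_prefix: str, start: str, end: str) -> str:
    """Legacy SQL version of the same idea, using TABLE_DATE_RANGE."""
    return (
        f"SELECT * FROM TABLE_DATE_RANGE([{dataset_prefix}], "
        f"TIMESTAMP('{start}'), TIMESTAMP('{end}'))"
    )


# Assumed example names, not from any real project.
print(partition_query("my_project.my_dataset.events", "2020-01-01"))
print(wildcard_query("my_project.my_dataset.events_", "20200101", "20200107"))
print(legacy_date_range_query("my_dataset.events_", "2020-01-01", "2020-01-07"))
```

String formatting is used here only to show the shape of the queries. For values that come from users, query parameters are the safer choice.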
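For the Bigtable row key bullet above, this is a rough sketch of one commonly recommended pattern: put an identifier first so writes spread across rows, and keep the timestamp at the end of the key rather than the start. Reversing the timestamp makes the newest row for a device sort first. The make_row_key helper, the field names, and the MAX_MILLIS bound are my own assumptions for illustration; they are not part of any Bigtable client library.

```python
# Hypothetical helper for illustration; not part of any Bigtable client library.
MAX_MILLIS = 10**13  # assumed upper bound, used only to reverse the timestamp


def make_row_key(device_id: str, metric: str, epoch_millis: int) -> str:
    """Build a row key with the identifier first and a reversed timestamp last."""
    reversed_ts = MAX_MILLIS - epoch_millis
    # Zero-pad so that string (lexicographic) order matches numeric order.
    return f"{device_id}#{metric}#{reversed_ts:013d}"


older = make_row_key("device-42", "temp", 1_600_000_000_000)
newer = make_row_key("device-42", "temp", 1_600_000_060_000)

# The newer reading sorts before the older one for the same device and metric.
print(sorted([older, newer]) == [newer, older])
```

A key that starts with the timestamp would send all current writes to the same part of the key space, which is the hotspotting the bullet asks about.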
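On the feature crosses bullet above: a feature cross is a synthetic feature made by multiplying two or more features together, which lets a linear model pick up a pattern it could not learn from the raw features alone. Here is a toy sketch with made-up XOR-style data, where no straight line on x1 and x2 separates the classes but a rule on the single crossed feature x1 * x2 does. The data and the fixed weight are assumptions for illustration, not a trained model.

```python
# Toy XOR-style data: the label is 1 when x1 and x2 have the same sign.
points = [(1.0, 1.0, 1), (-1.0, -1.0, 1), (1.0, -1.0, 0), (-1.0, 1.0, 0)]


def predict_with_cross(x1: float, x2: float) -> int:
    """Linear rule on the single crossed feature x1 * x2 (weight 1, bias 0)."""
    crossed = x1 * x2
    return 1 if crossed > 0 else 0


predictions = [predict_with_cross(x1, x2) for x1, x2, _ in points]
labels = [label for _, _, label in points]

# True when the crossed feature classifies every toy point correctly.
print(predictions == labels)
```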
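On the regularization bullet above: increasing regularization puts a bigger penalty on large weights, so the model gets simpler and overfits less (at the risk of underfitting), and decreasing it does the opposite. A small sketch with L2 regularization on a one-weight model y = w * x shows the weight shrinking as the penalty grows. The data and the lambda values are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Weight w minimising sum((y - w*x)**2) + lam * w**2 for the model y = w*x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    return sxy / (sxx + lam)


# Made-up data that fits y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# A larger lam (more regularization) pulls the weight towards zero.
for lam in (0.0, 14.0, 126.0):
    print(lam, ridge_weight(xs, ys, lam))
```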
The Data Engineer exam was refreshed on March 29th. These are some key points and links extracted from what others have posted.
Notes
- Cloud Composer: added as a new topic.
- No case studies in new exam.
- BigQuery: streaming data, quotas, and limits, ETL data verification, BigQuery ML, User Defined Functions.
- Datastore: backup and migrate.
- ML: I heard there is a little more ML. Scaling TensorFlow.
- Pub/Sub: migrating from Kafka, debugging via Stackdriver.
Posts
- Dmitri Lerko: https://deploy.live/blog/google-cloud-certified-professional-data-engineer/ — Good post with lots of links.
My Certification
Google Cloud Certified — Professional Data Engineer
Notes from each of my exams
For those appearing for the various certification exams, here is a list of sanitized notes (no direct questions, only general topics) about the exam.
Overall notes across all GCP certification exams
Notes from the Professional Cloud Architect exam
Notes from the beta Professional Cloud Developer exam
Notes from the Professional Data Engineer exam
Notes from the Associate Cloud Engineer exam
Notes from the beta Professional Cloud Network Engineer Exam
Notes from the beta Professional Cloud Security Engineer Exam
Notes from the Professional Collaboration Engineer Exam
Notes from the Professional DevOps Engineer Exam
Notes from the Professional Machine Learning Engineer Exam
Official Links
Main Link — https://cloud.google.com/certification/data-engineer
Topics Outline — https://cloud.google.com/certification/guides/data-engineer/
Practice Exam — https://cloud.google.com/certification/practice-exam/data-engineer
Github Repo: awesome-gcp-certifications
A collection of posts, videos, courses, qwiklabs, and other exam details for all exams: https://github.com/sathishvj/awesome-gcp-certifications
Free Qwiklabs Codes to Practice
I’ve collected here a bunch of free Qwiklabs codes which are awesome to get lots of hands-on practice. Use them well.
More Questions?
Check the FAQs here: https://medium.com/@sathishvj/frequently-asked-follow-up-questions-on-google-cloud-gcp-certifications-438e1addb91d.
Wish you the very best with your GCP certifications. You can reach me at LinkedIn and Twitter. If you can support my work creating videos on my YouTube channel AwesomeGCP, you can do so on Patreon or BuyMeACoffee.