Google Certified Professional Data Engineer Set 1
Author: CloudVikas | Published: 18 June 2021

Welcome to Google Certified Professional Data Engineer Set 1.

1. What method could you use to help compute averages when dealing with unbounded/streaming data?
- Change the input data to be batched/bounded to make it easier to compute averages
- Session windows
- Sliding windows

2. Bigtable is compatible with which open source project's client library for Java?
- Apache HBase
- Apache Kafka
- Apache Cassandra

3. You have multiple systems that all need to be notified of orders being processed. How should you configure Pub/Sub?
- Create a new Topic for each individual order. Create multiple Subscriptions for each Topic, one for every system that needs to be notified.
- Create a Topic for orders. Create multiple Subscriptions for this Topic, one for every system that needs to be notified.
- Create a new Topic for each individual order. Create a Subscription for each Topic that can be shared by every system that needs to be notified.

4. Which GCP product implements the Apache Beam SDK and is sometimes recommended as an alternative to Dataproc, particularly for streaming data?
- Cloud Composer
- Cloud Dataflow
- Cloud Datalab

5. Cloud Memorystore is essentially a managed service based on which open-source project?
- MongoDB
- Redis
- Memcached

6. A push Subscription requires what as its endpoint?
- An HTTPS URL with a valid SSL certificate that accepts PUT requests.
- An HTTPS URL with a valid SSL certificate that accepts POST requests.
- An HTTP URL that accepts GET requests.

7. What steps do you need to take to set up BigQuery before use?
- You must create processing nodes in Compute Engine.
- BigQuery is a serverless product and all compute and storage resources are managed for you.
- You must create storage buckets for BigQuery to use.

8. When you run a query in BigQuery, what happens to the results?
- Query results exist only within the BigQuery UI and are not stored in a table.
- Query results are either written to a destination table specified by the user, or to a temporary cached results table.
- Query results exist only in BigQuery memory and are not stored in a table.

9. Your company stores data in BigQuery for long-term and short-term analytics queries. Most of the jobs only need to study the last 7 days of data. Over time, the cost of queries keeps going up. How can you redesign the database to lower the cost of the most frequent queries?
- Create a new table every 7 days. Maintain a separate table that duplicates the weekly tables to contain all data.
- Create a new table every 7 days. Use JOIN statements to conduct long-term analytics queries.
- Use DATE-partitioned tables.

10. High Availability for Cloud SQL PostgreSQL instances works because:
- Instances share a regional replicated persistent disk
- Instances replicate data directly between themselves for automatic failover
- Google SREs are on call and can quickly bring up a new instance in the event of a failure

11. After a period of maintenance for your application, you want to start it up again, but there is a large backlog of messages waiting in its Subscription to a Pub/Sub topic which you do not want to process. What is the best way to deal with this?
- Delete the Subscription and re-create it.
- Seek to a point in the future on the Subscription to effectively discard all the messages.
- Delete the Topic and re-create it.
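Question 9 above concerns DATE-partitioned tables in BigQuery. The snippet below is a minimal sketch, using the google-cloud-bigquery Python client, of creating a date-partitioned table and querying only the last 7 days of partitions; the project, dataset, table, and column names are placeholders invented for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table "my-project.analytics.events", partitioned on a DATE column.
table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # partition on the event_date column
)
client.create_table(table)

# Queries that filter on the partitioning column scan only the matching
# partitions, which is what keeps the cost of the frequent 7-day jobs down.
query = """
    SELECT COUNT(*) AS events_last_week
    FROM `my-project.analytics.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
"""
for row in client.query(query).result():
    print(row.events_last_week)
```

Because the WHERE clause restricts the partitioning column, BigQuery prunes older partitions instead of scanning the whole table, so the cost of the frequent short-term queries no longer grows with the table's total size.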
12. If you have 2 replicating clusters in your Bigtable instance, how can you ensure that your application will be guaranteed strong consistency for its transactions?
- Use an application profile that specifies single-cluster routing.
- Refactor your application so that strong consistency is no longer required.
- Use an application profile that specifies multi-cluster routing, but place both clusters within the same region.

13. What action should you take if you observe that Bigtable performance is poor on a subset of queries?
- Use the Key Visualizer tool to identify hot spots and consider changing how the row key distributes rows
- Add debugging steps to your application to identify the problematic queries
- Add an additional cluster to the instance to increase the read and write throughput

14. What would be the most secure way to grant access from Dataflow in Project A to a Cloud Storage bucket in Project B?
- Grant storage viewer access for the bucket in Project B to the default compute service account in Project A.
- Create a custom service account to use as the Dataflow controller service account in Project A. Grant storage viewer access for the bucket in Project B to the custom service account in Project A.
- Copy data from the bucket in Project B to a new bucket in Project A.

15. What is the maximum number of clusters per Bigtable instance?
- 4
- 3
- There is no limit, provided you are within the Compute Engine quotas of your GCP project

16. You are required to share a subset of data from a BigQuery data set with a 3rd-party analytics team. The data may change over time, and you should not grant unnecessary project permissions to this team if you can avoid it. How should you proceed?
- Create an authorized view based on a specific query for the subset of data, and provide access for the team only to that view.
- Create an export of the data subset to a Cloud Storage bucket. Provide a signed URL for the team to download the data from the bucket.
- Add the team to your GCP project and assign them the BigQuery Data Viewer IAM role for the data set.

17. What is the name given to a dataset that can be acted upon within a Cloud Dataflow pipeline?
- Bucket
- PCollection
- Aggregation

18. What is a sensible way to test a Cloud Dataflow pipeline before deploying it to production?
- Remove DataflowRunner from PipelineOptions to allow the pipeline to run locally.
- Tune PipelineOptions to use the smallest amount of compute resources possible.
- Stop the Dataflow job halfway through to minimize costs.

19. Which transformation can be used to process collections of key/value pairs, in a similar fashion to the shuffle phase of a map/shuffle/reduce-style algorithm?
- GroupByKey
- Partition
- CoGroupByKey

20. What is a federated data source?
- A BigQuery data set that belongs to a different GCP project, used in an SQL statement.
- An external data source that can be queried directly even though the data is not stored in BigQuery.
- A BigQuery table that belongs to a different data set, used in an SQL statement.

21. Which primary Apache services does Dataproc run? (Choose 2)
- Dataflow
- Cassandra
- Spark
- Hadoop

22. What is a pipeline in the context of Cloud Dataflow?
- A pipeline represents the entire series of steps involved in ingesting data, transforming that data and writing output.
- A pipeline could be any of these.
- A pipeline represents a collection of data being prepared for transformation.
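Questions 17 to 22 revolve around Apache Beam concepts: pipelines, PCollections, GroupByKey, and local testing. Here is a minimal, hypothetical Beam sketch in Python that builds a PCollection of key/value pairs and groups them by key; since no DataflowRunner is specified, it falls back to the local DirectRunner, which is a common way to try out a pipeline before deploying it to Dataflow. The step names and sample data are invented for illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# No runner is specified, so Beam uses the local DirectRunner,
# which is convenient for testing before deploying to Cloud Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # A PCollection of key/value pairs created from in-memory data.
        | "CreateOrders" >> beam.Create(
            [("alice", 20), ("bob", 15), ("alice", 30)]
        )
        # GroupByKey behaves like the shuffle phase of a map/shuffle/reduce
        # algorithm: all values for the same key are collected together.
        | "GroupByCustomer" >> beam.GroupByKey()
        # Sum each customer's values and print the result.
        | "SumPerCustomer" >> beam.MapTuple(lambda key, values: (key, sum(values)))
        | "Print" >> beam.Map(print)
    )
```

The whole series of steps is the pipeline; each intermediate dataset flowing between steps is a PCollection.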
23. Using Cloud IAM, what is the most granular level for which you can configure access control for Pub/Sub?
- Across individual topics and subscriptions
- Across individual topics, but not subscriptions
- Across the entire project, including all topics and subscriptions

24. Which big data programming model is implemented with Cloud Dataflow?
- Apache NiFi
- Apache Beam
- Apache Spark

25. You want to use Pub/Sub to distribute jobs to a group of Compute Engine VMs, which should each take the next job from the queue. How should you configure Pub/Sub?
- Create a new Topic for each individual job. Create multiple Subscriptions for each Topic, one for every Compute Engine VM in the group.
- Create a new Topic for each individual job. Create a Subscription for each Topic that can be shared by every Compute Engine VM in the group.
- Create a Topic for jobs. Create a Subscription for this Topic that can be shared by every Compute Engine VM in the group.

26. Your application requires access to a Cloud Storage bucket. What is the best way to achieve this?
- Create a custom service account with only the required permissions for the application
- Make the application prompt a user for their Google credentials to authenticate with Cloud Storage
- Include a user's Google credentials in the application code so it can authenticate with Cloud Storage

27. What is the maximum total size permitted for a Publish request, including metadata and payload?
- 1MB
- 10MB
- 50MB

28. How does Bigtable manage the storage of tablets?
- Tablets are stored on cluster nodes, but storage can dynamically grow as part of the managed service
- Tablets are stored on cluster nodes, which must be sized accordingly for their storage needs
- Tablets are stored in Google Colossus, but a cluster node has a limit on how much storage it can process

29. How can you ensure that table modifications in BigQuery are ACID compliant?
- Add a nominal wait time to any application queries following an update to allow for eventual consistency.
- Maintain a separate table that records transactions themselves so changes can be re-applied if any are lost.
- No special accommodations are required; BigQuery table modifications are ACID compliant by design.

30. A customer has 90TB of archive data to move to Google Cloud Storage, but has restrictions on connecting their private network to the public Internet. How could you facilitate this?
- Create bastion hosts that connect to the private network and Google Cloud, and start multiple uploads to GCS using gsutil
- Use the 100TB Transfer Appliance
- Create bastion hosts that connect to the private network and Google Cloud, and start a single upload to GCS using gsutil
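Question 25 describes the classic work-queue pattern: a single topic for jobs and a single subscription shared by every worker VM, so each message is delivered to only one worker. The sketch below uses the google-cloud-pubsub Python client; the project, topic, and subscription names are placeholders, and each worker VM would run the same streaming pull against the shared subscription.

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder project ID

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

# One topic for all jobs, and one subscription shared by every worker VM.
topic_path = publisher.topic_path(project_id, "jobs")
subscription_path = subscriber.subscription_path(project_id, "jobs-workers")
publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)

def process_job(message: pubsub_v1.subscriber.message.Message) -> None:
    # Each message is delivered to only one of the workers pulling from the
    # shared subscription; acknowledging it removes it from the queue.
    print("processing job:", message.data)
    message.ack()

# Every worker VM runs this same streaming pull against the shared subscription.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=process_job)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)  # pull for a short while in this sketch
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```

Contrast this with question 3, where every downstream system needs its own subscription to the same topic so that each one receives a copy of every message.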