Lohit Arrabothu
27.09.2025

A Practical Guide to Data Engineering with Google Cloud

If you’re new to Google Cloud or you’ve only used one or two services, it can feel like there are a lot of moving parts. The good news: Google Cloud’s data stack is opinionated, integrated and battle-tested. This post maps the landscape so you can pick the right tools for ingestion, storage, processing, orchestration, governance, and analytics without getting lost in product sprawl.

The mental model

Think in stages:
Ingest → Land → Process → Orchestrate → Govern → Serve
Each stage has one to three defaults that cover 90% of use cases. I’ll name those defaults, explain when to reach for them, and call out the gotchas.

Ingest (streaming and batch)

Event streams → Start with Pub/Sub for durable, low-latency fan-out between producers and consumers. You get regional exactly-once delivery (when enabled) and dead-lettering out of the box. Use it to buffer telemetry, clickstream, and microservice events, then sink to BigQuery or Cloud Storage.

Heads-up: Pub/Sub Lite will be retired on March 18, 2026. Plan migrations to standard Pub/Sub, which scales automatically without pre-provisioned capacity and is the recommended replacement.

Database CDC → Use Datastream for serverless change data capture (CDC) from MySQL, PostgreSQL, and Oracle into BigQuery or Cloud Storage. It’s great for near-real-time replicas and operational analytics.

SaaS & scheduled loads → BigQuery Data Transfer Service (DTS) automates recurring loads from Google Ads, GA4, S3, Redshift, etc., and handles backfills. For big file moves into Cloud Storage, use Storage Transfer Service.

Direct to BigQuery (low-latency) → If you’re streaming rows straight into the warehouse, the BigQuery Storage Write API is the high-throughput, low-latency path favored by vendors and CDC tools.

When you need an immediate response path (e.g., app or device round-trip), expose an HTTP endpoint on Cloud Run/Functions (Gen2) and write to BigQuery via the Storage Write API; use Pub/Sub for downstream fan-out.

Use case — Real-time product analytics (event streams → BigQuery)

Business situation
You want “active users right now,” funnel steps, and quick alerts within minutes of an event happening.

Approach

  • Stream events to a partitioned BigQuery landing table (Pub/Sub → Dataflow in prod; simple INSERTs for the demo).
  • Include a dedupe id (insert_id) so downstream transforms are idempotent (see the dedupe sketch after the queries below).
  • Keep everything in the same region to minimize latency.

-- Create a streaming sink (landing table)
CREATE SCHEMA IF NOT EXISTS my_de;
CREATE TABLE IF NOT EXISTS my_de.events_raw (
  user_id INT64,
  event_ts TIMESTAMP,
  action STRING,
  insert_id STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id;

-- Simulate two events
INSERT INTO my_de.events_raw VALUES
(101, CURRENT_TIMESTAMP(), 'view', GENERATE_UUID()),
(102, CURRENT_TIMESTAMP(), 'add_to_cart', GENERATE_UUID());

-- Quick check: events in the last 5 minutes
SELECT action, COUNT(*) AS events
FROM my_de.events_raw
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
GROUP BY action
ORDER BY events DESC;
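
Because every row carries an insert_id, downstream transforms can collapse re-delivered events before aggregating. A minimal dedupe sketch over the landing table (the WHERE TRUE is only there because QUALIFY must be paired with a WHERE, GROUP BY, or HAVING clause):

-- Deduplicate re-delivered events on insert_id (keep the earliest copy)
SELECT *
FROM my_de.events_raw
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY insert_id ORDER BY event_ts) = 1;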

Land (data lake and warehouse storage)

Object store → Cloud Storage is your durable landing zone for raw and curated files (CSV/Parquet/Avro), and the target for large transfers. Pair it with lifecycle rules and bucket-level controls.
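
When you do want a copy in BigQuery (rather than the zero-copy route described below), a single LOAD DATA statement pulls landed files straight out of the bucket; the bucket path and table name here are placeholders to adapt:

-- Load landed Parquet files from Cloud Storage into a native table
-- (creates my_de.orders_landed if it doesn't exist; schema inferred from the files)
LOAD DATA INTO my_de.orders_landed
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://MY_BUCKET/landing/orders/*.parquet']
);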

Warehouse tables → BigQuery remains the analytics centerpiece. Choose editions (Standard, Enterprise, Enterprise Plus) if you prefer capacity-based pricing with autoscaling. For example, you might reserve 500 baseline slots that automatically scale up to 1500 during query spikes. Use on-demand pricing if workloads are small, unpredictable, or spiky.
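
If you go the capacity route, the reservation itself is a couple of DDL statements run in your reservation admin project. A sketch of that shape (project, region, and values are placeholders; confirm option names against the current reservations DDL reference):

-- Capacity-based pricing: 500 baseline slots, plus up to 1,000 autoscaled slots
CREATE RESERVATION `admin-project.region-us.prod-reservation`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 500,          -- baseline slots you always pay for
  autoscale_max_slots = 1000    -- extra slots the autoscaler may add during spikes
);

-- Point a project's query workloads at the reservation
CREATE ASSIGNMENT `admin-project.region-us.prod-reservation.analytics`
OPTIONS (
  assignee = 'projects/analytics-project',
  job_type = 'QUERY'
);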

Open table formats (lakehouse) → For open formats, you have two routes: BigLake external tables for read-optimized access, or BigLake managed tables for Apache Iceberg in BigQuery. (Note: Iceberg support is still rolling out, so check availability in your region.)

Multimodal data in BigQuery (images, docs, audio) → BigQuery now treats unstructured objects in Cloud Storage as first-class citizens via Object Tables and the ObjectRef data type. That means you can run SQL that references PDFs, images, and audio without shuffling data between systems — keeping governance and auditing in one place. Pair this with policy tags/row-level security to keep controls consistent across structured and unstructured data.
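
As a rough sketch, an object table is just an external table with object metadata enabled; the bucket, connection name, and caching options below are placeholders:

-- Object table: SQL-addressable metadata over unstructured files in GCS
CREATE OR REPLACE EXTERNAL TABLE my_de.docs_obj
WITH CONNECTION `US.my-conn`
OPTIONS (
  object_metadata = 'SIMPLE',          -- marks this as an object table
  uris = ['gs://MY_BUCKET/docs/*'],
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 1 HOUR
);

-- Browse the objects like any other table
SELECT uri, content_type, size, updated
FROM my_de.docs_obj
ORDER BY updated DESC
LIMIT 10;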

Use case — Zero-copy lakehouse reads (GCS → BigQuery via BigLake)

When to use: You’ve landed Parquet/CSV files in Cloud Storage and want to query them in BigQuery without loading/copying.

Approach: Create a BigLake external table over a Hive-partitioned GCS prefix; optionally materialize “hot” partitions later.

-- External table over Parquet in GCS (zero-copy)
CREATE OR REPLACE EXTERNAL TABLE my_de.orders_ext
WITH CONNECTION `US.my-conn`          -- your BigLake connection
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://MY_BUCKET/orders/*'],   -- one wildcard only; partitions resolved via the prefix below
  hive_partitioning_mode = 'AUTO',
  hive_partitioning_source_uri_prefix = 'gs://MY_BUCKET/orders/'
);
-- Quick peek (no load)
SELECT dt, COUNT(*) AS row_count
FROM my_de.orders_ext
GROUP BY dt
ORDER BY dt DESC;

-- (Optional) Materialize a hot partition into native BigQuery
CREATE TABLE IF NOT EXISTS my_de.orders_2025_09_24 AS
SELECT * FROM my_de.orders_ext WHERE dt = '2025-09-24';

Process (ETL/ELT at scale)

Batch & streaming pipelines → Dataflow (managed Apache Beam) is the default for unified batch/stream processing with autoscaling and exactly-once sinks to BigQuery. Keep business logic in Beam, let Dataflow handle scaling, checkpoints, and runner details.

Note: Dataflow SQL has been deprecated. For streaming transformations, migrate to Apache Beam pipelines or consider BigQuery’s streaming ingestion with transformations handled in Dataform.

Spark workloads → When you want Spark (libraries, UDFs, notebooks), use Serverless for Apache Spark: submit jobs without managing clusters, and autoscale only while jobs run. It integrates with BigQuery (connectors) and BigQuery Studio for interactive PySpark. Use Dataproc clusters only if you truly need pinned clusters or specific Hadoop ecosystem components.

Spark from your notebook, no clusters to babysit. Colab Enterprise and Vertex AI Workbench can launch serverless Spark straight from the notebook. Use PySpark for heavy transforms and keep results in BigQuery — no extra copies to manage.

SQL-first transformation & modeling → Dataform (in BigQuery Studio) gives you dbt-style, dependency-aware SQL pipelines with CI/CD, code reviews, and scheduling. It’s the native way to manage ELT in BigQuery.
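
A Dataform action is a .sqlx file: a small config block plus SQL, with ref() wiring the dependency graph. A minimal incremental model (for a file like definitions/daily_sales.sqlx; dataset and column names are illustrative) might look like:

config {
  type: "incremental",
  schema: "my_de",
  uniqueKey: ["d"],                 -- Dataform merges on this key for idempotent reruns
  bigquery: { partitionBy: "d" }
}

SELECT
  DATE(created_at) AS d,
  COUNT(DISTINCT order_id) AS orders,
  SUM(qty * price) AS revenue
FROM ${ref("orders_curated")}
${when(incremental(), `WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`)}
GROUP BY d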

BigFrames (pandas-like on BigQuery). If your team prefers Pythonic data wrangling, BigQuery DataFrames (BigFrames) lets you write pandas-style code that compiles down to efficient BigQuery SQL under the hood. It’s an easy on-ramp for analysts moving from notebooks to warehouse-native ELT.

Use case — Curate → Gold with an incremental MERGE (batch ELT)

When to use: You’ve got files exposed via orders_ext (BigLake) or rows in a staging table and you want a query-friendly table plus a daily fact that refreshes quickly with late data.

Approach:

  1. Curate to a partitioned + clustered native table.
  2. Recompute a small N-day window and MERGE into the gold table (idempotent).

-- Curate: external/staging -> native table
CREATE OR REPLACE TABLE my_de.orders_curated
PARTITION BY DATE(created_at) CLUSTER BY customer_id AS
SELECT
  order_id,
  customer_id,
  TIMESTAMP(order_ts) AS created_at,
  CAST(item_count AS INT64) AS qty,
  CAST(price AS NUMERIC) AS price,
  status
FROM my_de.orders_ext
WHERE status IN ('Complete','Shipped');

-- Incremental MERGE of last 3 days into a daily fact
DECLARE n_days INT64 DEFAULT 3;

CREATE OR REPLACE TEMP TABLE recent AS
SELECT
  DATE(created_at) AS d,
  COUNT(DISTINCT order_id) AS orders,
  SUM(qty) AS units,
  SUM(qty * price) AS revenue
FROM my_de.orders_curated
WHERE DATE(created_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL n_days DAY)
GROUP BY d;

CREATE TABLE IF NOT EXISTS my_de.daily_sales (
  d DATE, orders INT64, units INT64, revenue NUMERIC
) PARTITION BY d;

MERGE my_de.daily_sales T
USING recent S
ON T.d = S.d
WHEN MATCHED THEN
  UPDATE SET orders=S.orders, units=S.units, revenue=S.revenue
WHEN NOT MATCHED THEN
  INSERT (d, orders, units, revenue)
  VALUES (S.d, S.orders, S.units, S.revenue);

Orchestrate (scheduling and workflows)

Complex DAGs with operators → Cloud Composer (managed Airflow) remains the workhorse for multi-service DAGs, cross-project orchestration, and extensive operator ecosystems.

Service-to-service steps & APIs → Workflows is a lightweight, serverless step orchestrator for chaining Google Cloud services and HTTP endpoints with YAML-defined state machines - great for glue around DTS, Datastream, and Cloud Run jobs.

When pipelines originate in notebooks, you can promote them to Vertex AI Pipelines or Composer as they mature, keeping experiment → production handoffs smooth.

Use case — Serverless step chaining with Workflows (no Airflow)

When to use: You just need to run a SQL transform (or stored procedure) on a schedule or after an event - without managing Composer/Airflow.

Approach: A Workflow calls BigQuery’s jobs API to run your SQL. Keep the Workflow and BigQuery in the same location.

# Minimal: Workflows → BigQuery query
# Replace PROJECT, save this as workflows.yaml, then deploy:
# gcloud workflows deploy de-orchestrator --source=workflows.yaml --location=us

main:
  steps:
  - run_bq_query:
      call: http.post
      args:
        url: https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT/jobs  # jobs.insert; the job location goes in jobReference below
        auth: { type: OAuth2 }
        body:
          jobReference: { location: "US" }
          configuration:
            query:
              query: "CALL my_de.refresh_curated();"
              useLegacySql: false
  - done:
      return: "OK"

Govern (catalog, quality, lineage, security)

Catalog, quality, lineage → Dataplex Data Catalog centralizes metadata, automated data quality checks, and lineage across BigQuery, BigLake, and integrated systems. It also integrates with OpenLineage when you need to import external lineage.

Fine-grained access controls → In BigQuery, combine policy tags for column-level security with row-level security policies to enforce least privilege down to individual columns and rows. Both controls carry through to Looker and Looker Studio when those tools connect with appropriately scoped credentials.

Enterprise controls → In capacity-based BigQuery editions, you also get features like slot autoscaling, managed disaster recovery (Enterprise Plus), and broader workload controls — useful for regulated environments. Pair with VPC Service Controls as needed.

AI-assisted data prep & agents → BigQuery’s Gemini-powered editor and the Data Engineering Agent can generate transformations, build ingestion pipelines (dedupe, standardize dates, unit tests), and scaffold Dataform DAGs from plain-English instructions, which is useful for standardizing quality rules early.

Use case — Lock PII and limit visibility by region

When to use: You need to share tables broadly but protect PII (e.g., email) and restrict rows by a user’s region/team.

Approach:

  1. Tag sensitive columns with a policy tag (taxonomy created in Dataplex/Data Catalog).
  2. Add a row access policy to filter rows.

-- Example table
CREATE SCHEMA IF NOT EXISTS my_de;
CREATE TABLE IF NOT EXISTS my_de.customers (
  customer_id INT64,
  email STRING,
  region STRING,
  signup_ts TIMESTAMP
);
-- Protect PII: attach a policy tag to the email column.
-- (Illustrative shorthand: policy tags are identified by their full resource name,
--  projects/.../taxonomies/.../policyTags/..., and are attached to the column's schema
--  via the console, a bq schema update, or Terraform rather than a single DDL statement.)
ALTER TABLE my_de.customers
  ALTER COLUMN email SET POLICY TAGS ('pii.email');

-- Limit data by region for a specific group
CREATE OR REPLACE ROW ACCESS POLICY r_region
ON my_de.customers
GRANT TO ("group:data-analysts@yourco.com")
FILTER USING (region IN ('US','CA'));  -- demo filter; replace with your rule

Serve (BI, notebooks, ML-adjacent)

BI & dashboards

  • Looker for governed, semantic-model-driven analytics apps and embedded BI.
  • Looker Studio for lightweight dashboards and ad-hoc exploration on top of BigQuery.

Notebooks & analyst workbench → BigQuery Studio notebooks live next to your datasets, are versioned alongside your other code assets, and can spin up Serverless Spark sessions for PySpark alongside SQL, keeping ELT, Spark, and exploration in one environment.

AI-first, multilingual notebooks → Colab Enterprise (inside BigQuery Studio) blends SQL → Python → Spark in one notebook and ships with an AI Data Science Agent for planning code, fixing errors, and visualizing outputs - reducing context switches during exploration.

ML where it belongs: BigQuery now supports vector search (vector indexes and VECTOR_SEARCH()), so you can build semantic/RAG-style workloads without moving data. Separately, BigQuery ML lets you train and predict with SQL (and generate embeddings you can index).

Foundation models from SQL. Use AI.GENERATE / AI.GENERATE_TABLE to call Gemini endpoints directly from SQL - great for extracting structured fields from PDFs, summarizing contracts, or scoring review sentiment by feature.

TimesFM in BQML. For fast time-series forecasting, AI.FORECAST taps TimesFM for zero-/few-shot forecasts without bespoke model training.
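
As a hedged sketch against the daily_sales table from earlier (the argument names are from memory, so confirm them against the AI.FORECAST reference before relying on this):

-- Zero-shot 14-day revenue forecast with TimesFM via AI.FORECAST
SELECT *
FROM AI.FORECAST(
  TABLE my_de.daily_sales,
  data_col => 'revenue',
  timestamp_col => 'd',
  horizon => 14
);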

Vector search. Generate embeddings with ML.GENERATE_EMBEDDING, build a VECTOR INDEX, and query with VECTOR_SEARCH for semantic/RAG and “find-similar” use cases - no separate vector DB to run.
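
Assuming an embeddings table already exists (say my_de.doc_embeddings with an ARRAY<FLOAT64> column named embedding, produced by ML.GENERATE_EMBEDDING), the index and lookup are only a couple of statements; names here are illustrative:

-- Build an ANN index over the embedding column (small tables can skip this;
-- VECTOR_SEARCH falls back to brute force without an index)
CREATE VECTOR INDEX doc_embeddings_idx
ON my_de.doc_embeddings(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

-- Find the 5 nearest documents to a query embedding
SELECT base.doc_id, distance
FROM VECTOR_SEARCH(
  TABLE my_de.doc_embeddings, 'embedding',
  (SELECT [0.12, 0.03, 0.98] AS embedding),   -- replace with a real query embedding
  top_k => 5,
  distance_type => 'COSINE'
);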

Use case — Predictions directly from the warehouse (BQML)

When to use: You want quick predictions in dashboards without moving data or standing up separate ML infra.

Approach: Train and evaluate a BQML model in SQL; surface predictions via a view or table for BI.

-- Train a simple classifier with a holdout split
CREATE OR REPLACE MODEL my_de.income_model
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['income_bracket'],
  data_split_method = 'RANDOM', data_split_eval_fraction = 0.20
) AS
SELECT * EXCEPT(income_bracket), income_bracket
FROM `bigquery-public-data.ml_datasets.census_adult_income`;

-- Evaluate in one line
SELECT * FROM ML.EVALUATE(MODEL my_de.income_model);

-- Expose predictions (use in BI)
CREATE OR REPLACE VIEW my_de.income_predictions AS
SELECT
  * EXCEPT (predicted_income_bracket_probs),
  predicted_income_bracket AS predicted_label,
  -- probability of the predicted class
  (SELECT p.prob
   FROM UNNEST(predicted_income_bracket_probs) AS p
   WHERE p.label = predicted_income_bracket) AS predicted_prob
FROM ML.PREDICT(
  MODEL my_de.income_model,
  (SELECT * FROM `bigquery-public-data.ml_datasets.census_adult_income`)
);

Operational stores (for pipelines that need them)

While BigQuery is your analytics store, many pipelines read from or write into operational databases (a federated-query sketch follows the list):

  • Bigtable for massive, wide-column workloads keyed by a single row key (time series, IoT) at petabyte scale with low latency.
  • Spanner for globally consistent relational workloads with high availability SLAs.
  • AlloyDB for PostgreSQL-compatible workloads that need extra performance, and Cloud SQL (managed MySQL, PostgreSQL, SQL Server) for app-centric OLTP.
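
When a pipeline only needs to read a slice of an operational table, a BigQuery federated query (EXTERNAL_QUERY) can reach into Cloud SQL or AlloyDB without an export step. A minimal sketch, assuming a connection named us.ops_pg pointing at a Postgres instance with an orders table:

-- Federated read: the inner query runs in the operational database
SELECT *
FROM EXTERNAL_QUERY(
  'us.ops_pg',
  "SELECT id, status, updated_at FROM orders WHERE updated_at > now() - interval '1 day'"
);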

Opinionated “defaults”

If you just want a solid starting lineup:

  • Streaming: Pub/Sub → Dataflow (Beam) → BigQuery (Storage Write API).
  • Batch ELT: Storage Transfer/DTS → Cloud Storage → BigQuery + Dataform.
  • Spark needs: Serverless for Apache Spark with BigQuery connector; compose with Composer when DAGs grow.
  • Governance: Dataplex for catalog/lineage/quality; BigQuery policy tags + row-level security.

Cost and reliability notes

  • BigQuery editions & autoscaling. Reservations let you set baseline and max slots; autoscaling expands capacity within that band. Start conservative, then raise the max where queues form.
  • Spark without clusters. Serverless Spark bills only while jobs run. Great for spiky and exploratory workloads; move to Composer-scheduled jobs as they stabilize.
  • Deprecations matter. Replace Pub/Sub Lite; avoid net-new dependency on Dataflow SQL.
  • Notebook → job parity. What runs interactively in Colab Enterprise should be promoted to scheduled jobs (Composer/Workflows/Vertex Pipelines) to avoid surprise costs from ad-hoc sessions.
  • GenAI quotas & cost controls. Centralize Vertex endpoints and set per-project quotas/budgets before rolling out AI.GENERATE to analysts.

A sample reference flow

  1. Events hit Pub/Sub →
  2. Dataflow enriches/validates →
  3. Write to BigQuery (curated) and Cloud Storage (immutable raw) →
  4. Transform with Dataform (SQL DAG) →
  5. Dataplex runs data quality and publishes lineage →
  6. Analysts consume with Looker/Looker Studio; power users explore in BigQuery Studio notebooks.

Final takeaway

You don’t need a dozen tools to build a reliable, governed data platform on Google Cloud. Pick the defaults above, layer governance early, and use editions/autoscaling to keep costs predictable.

If you want a version of this post tailored to your team’s architecture or a reference diagram you can share internally, tell me your stack and I’ll adapt it.

Thank you for reading!