Hey, it’s been a while... This is me writing, not another AI-dumped bit of content. Equally, this is for me more than anyone else: it’s a line in the sand, a commitment to starting a new learning journey and putting it out there for everyone to see. (I did, however, use AI for the images.)

For those that know me, I have been a constant and consistent learner. I started my IT career with hardware—building servers for Cambridge University, getting server operating systems up and running, and handling systems administration and operations. That was before the virtualization era. From there, I moved to the cloud, transitioned my learning into DevOps and Kubernetes, and that brings me to the here and now.

We are surrounded by AI, but “AI” means so many different things to different people, teams, and companies.

Lately, I have been intrigued by a specific challenge: How can we ensure the validity of our AI experience when it interacts with our own data? How can we ensure the data we hover above an LLM is actually going to provide us with the correct information?

From that perspective, I have found myself drawn into the world of Data Engineering and DataOps.


So, what’s the plan?

The first area of focus is watching foundational content designed to establish core concepts before diving directly into specific platforms. I’m currently working my way through this crash course: Data Engineering Foundations Crash Course!

I am specifically looking at recent content that discusses data engineering core principles and how they land with AI and machine learning in mind. This video is proving to be a great starting point because it details how data flows through real-world systems—exploring how data is stored, processed, orchestrated, and served. It bridges the gap between raw data and consumption, illustrating how foundational steps like proper data collection and pipeline stability directly impact the success of downstream machine learning models and AI dashboards.

Before I step into heavy-hitting enterprise tooling like Databricks, Snowflake, and Microsoft Fabric, this stage is entirely about cementing those basic Data Engineering Foundations through a structured, multi-phase approach:

Phase 1: The Storage Revolution (From Blocks to Formats)

Typically, my learning has gone straight to the platform layer, but this new world demands a focus on the data layer first. In Data Engineering, the bucket alone is just a graveyard; the magic is in the open table formats sitting inside it. To master this, you must understand Delta Lake, Apache Iceberg, and Apache Hudi. These formats add ACID transactions and a metadata layer on top of data files such as Parquet, letting object storage behave much more like a database table.
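To make the "metadata layer" idea concrete for myself, here is a minimal sketch of a toy table format: data files are never edited, and an ordered commit log decides which files are live. The ToyTable class, the _log directory, and the file names are all my own assumptions for illustration; this is not how Delta Lake, Iceberg, or Hudi actually lay out their metadata.

```python
import json
import os
import tempfile


class ToyTable:
    """A toy 'table format': immutable data files plus an ordered commit log.

    Hypothetical illustration only; not the Delta Lake, Iceberg, or Hudi protocol.
    """

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def commit(self, add=(), remove=()):
        """Publish a new table version listing added and removed data files."""
        versions = self._versions()
        next_version = versions[-1] + 1 if versions else 0
        path = os.path.join(self.log_dir, f"{next_version:08d}.json")
        # O_EXCL makes creation fail if another writer already claimed this
        # version number: optimistic concurrency in miniature.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        with os.fdopen(fd, "w") as handle:
            json.dump({"add": list(add), "remove": list(remove)}, handle)
        return next_version

    def snapshot(self, version=None):
        """Replay the log to get the set of live data files at a version."""
        live = set()
        for v in self._versions():
            if version is not None and v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as handle:
                entry = json.load(handle)
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return live


with tempfile.TemporaryDirectory() as root:
    table = ToyTable(root)
    table.commit(add=["part-0.parquet", "part-1.parquet"])
    table.commit(add=["part-2.parquet"], remove=["part-0.parquet"])
    latest = table.snapshot()          # files visible to a reader right now
    first = table.snapshot(version=0)  # "time travel" to the first commit
```

The point of the sketch is that readers only ever see whole commits, and older versions stay queryable because nothing is overwritten in place.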

Phase 2: The Data Pipeline Lifecycle (CI/CD for the DevOps Crew)

Data Engineering has its own version of a DevOps pipeline, and you need to map ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) to concepts you already know. Instead of application deployment, the focus here shifts to data ingestion, code-driven data transformation (using tools like dbt), and modern workflow orchestration (like Airflow or Dagster).
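As a rough sketch of the ELT ordering (land the raw data first, then transform it with SQL inside the database), here is a tiny example using only Python's built-in sqlite3 module. The table names, the sample rows, and the extract/load/transform function names are assumptions I made up for illustration; a real pipeline would use a warehouse, a tool like dbt for the SQL step, and an orchestrator to schedule it.

```python
import sqlite3

# Hypothetical source records; in practice these would come from an API or file.
SOURCE_ROWS = [
    ("2024-01-01", "widget", "19.99"),
    ("2024-01-01", "gadget", "5.00"),
    ("2024-01-02", "widget", "19.99"),
]


def extract():
    """Extract: pull raw records from the source system."""
    return list(SOURCE_ROWS)


def load(conn, rows):
    """Load: land the data untouched in a raw table (everything as text)."""
    conn.execute("CREATE TABLE raw_sales (sale_date TEXT, product TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform: build a typed, aggregated table with SQL inside the database."""
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT sale_date, ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_sales
        GROUP BY sale_date
        """
    )


conn = sqlite3.connect(":memory:")
load(conn, extract())
transform(conn)
daily_revenue = conn.execute(
    "SELECT sale_date, revenue FROM daily_revenue ORDER BY sale_date"
).fetchall()
```

Keeping the raw table around is the design choice that matters here: if the transformation logic changes, it can be re-run without going back to the source.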

Phase 3: Data Preparation for LLMs (The RAG & Semantic Layer)

Before an LLM can interact with your data safely and accurately, that data has to be translated into a mathematical language the model understands. This is where traditional data engineering pipelines pivot into AI-native architectures. The focus here shifts from flat relational tables to managing unstructured data (PDFs, documentation, customer tickets).

I need to understand chunking strategies (how you slice up long text), how embedding models generate vector embeddings, and how Approximate Nearest Neighbor (ANN) indexing algorithms like HNSW (Hierarchical Navigable Small World) are tuned for low-latency similarity search.
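Here is a minimal sketch of that chunk, embed, search flow, written with the standard library only. Two loud assumptions: toy_embed is a hashed bag-of-words stand-in I invented (a real embedding model captures meaning, this only captures word overlap), and search does exact brute-force scoring rather than an ANN index like HNSW. The function names and default sizes are hypothetical.

```python
import hashlib
import math


def chunk_text(text, size=8, overlap=2):
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window. Assumes overlap < size."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def toy_embed(text, dims=64):
    """Hash each word into a fixed-size vector, then L2-normalise it.

    A stand-in for a real embedding model, not a substitute for one.
    """
    vector = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.strip(".,").encode()).digest()
        vector[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def cosine(a, b):
    # Both vectors are already normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def search(query, chunks, top_k=1):
    """Exact nearest-neighbour search: score every chunk against the query."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, reverse=True)[:top_k]
```

Brute force scores every chunk on every query, which is exactly the cost that ANN indexes trade a little accuracy to avoid once the collection gets large.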

I’ll be exploring dedicated vector systems like Milvus and Qdrant (the latter written in Rust), seeing how standard databases adapt via extensions like pgvector for PostgreSQL, and learning how orchestration frameworks like LangChain and LlamaIndex stitch pipelines and LLMs together.

Phase 4: Data Privacy, Masking & Governance for AI

When you “hover data above an LLM,” your security perimeter changes completely. Traditional Role-Based Access Control (RBAC) doesn’t gracefully translate when an LLM can access an entire data lakehouse and accidentally leak sensitive executive info or PII (Personally Identifiable Information) through a single user prompt.

This phase is about moving toward modern Data Security Posture Management (DSPM) and context-aware, adaptive policies. I need to understand and learn programmatic Dynamic Data Masking (redacting data at query time based on user role) and Row-Level Access Policies within the pipeline before the data ever reaches the LLM’s context window.
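To picture what masking and row-level policies look like before anything reaches a context window, here is a small sketch in plain Python. The roles ("hr", "analyst"), the sample records, and the function names are all hypothetical; real platforms enforce these policies inside the query engine rather than in application code like this.

```python
from dataclasses import dataclass

# Hypothetical records and roles, purely for illustration.
RECORDS = [
    {"name": "Ada", "email": "ada@example.com", "region": "EU", "salary": 120000},
    {"name": "Bob", "email": "bob@example.com", "region": "US", "salary": 95000},
]


@dataclass(frozen=True)
class User:
    role: str    # e.g. "hr" or "analyst"
    region: str  # non-HR users only see rows from their own region


def mask_email(value):
    """Keep the first character and the domain, redact the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def apply_policies(records, user):
    """Return only the rows and field values this user is allowed to see."""
    visible = []
    for row in records:
        # Row-level policy: non-HR users only see rows for their own region.
        if user.role != "hr" and row["region"] != user.region:
            continue
        safe = dict(row)
        # Dynamic masking: redact sensitive columns at read time, by role.
        if user.role != "hr":
            safe["email"] = mask_email(row["email"])
            safe["salary"] = None
        visible.append(safe)
    return visible


def build_context(records, user):
    """Render the policy-filtered rows as text destined for an LLM prompt."""
    return "\n".join(str(row) for row in apply_policies(records, user))
```

The design choice worth noticing is that build_context never sees unfiltered rows, so a clever prompt cannot talk the model into revealing data it was never given.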

Phase 5: DataOps & Data Observability

Think of this as Prometheus and Grafana, but for data quality. When infrastructure fails, it tends to fail loudly: a pod crashes. When a data pipeline fails, it can fail silently: a null value enters a database, poisoning downstream AI models and corporate decision-making without anyone noticing until it’s too late.
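A minimal sketch of what a data quality gate could look like, so a bad batch fails loudly instead of silently. The thresholds, column names, and the DataQualityError class are assumptions of mine; dedicated data quality tools offer far richer checks than this.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_batch(rows, required_columns, max_null_rate=0.0, min_rows=1):
    """Return a list of readable failures; an empty list means the batch passes."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for column in required_columns:
        rate = null_rate(rows, column)
        if rate > max_null_rate:
            failures.append(
                f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
            )
    return failures


class DataQualityError(Exception):
    """Raised when a batch breaks its quality thresholds (hypothetical name)."""


def gate(rows, required_columns, **thresholds):
    """Stop the pipeline on a bad batch rather than passing it downstream."""
    failures = check_batch(rows, required_columns, **thresholds)
    if failures:
        raise DataQualityError("; ".join(failures))
    return rows
```

Calling gate(batch, ["customer_id"]) between pipeline stages turns a silent null into the data equivalent of a crashing pod: something an alert can hang off.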

Phase 6: Platform Distinctions & Architectures

Once the fundamentals are locked in, it’s time to get into the weeds of the platform layer. At the moment, I’m framing this exploration around three core pillars:

  1. How data enters
  2. How it is secured
  3. How it is consumed (including by AI)

At this stage, I don’t know what I don’t know, so this list may evolve, but the dominant enterprise platforms and ecosystems I keep hearing about are Databricks, Snowflake, and Microsoft Fabric.


TL;DR

  • Phase 1: Storage Formats (Iceberg, Delta Lake) — Where it lives.
  • Phase 2: Pipelines & Orchestration (dbt, Airflow) — How it moves and transforms.
  • Phase 3: Data Preparation for LLMs (Vectors & Embeddings) — How the AI reads it.
  • Phase 4: Governance & Privacy (DSPM & Masking) — How we keep it secure.
  • Phase 5: DataOps & Observability (Data Quality) — How we ensure it isn’t poison.
  • Phase 6: Enterprise Platforms (Databricks, Snowflake, Fabric) — The big suites that tie it all together.

My other “Big Rocks”

As well as navigating the brave new world of data engineering, I also need to continue diving deeper into a few of my existing technology areas:

  • DevOps Tooling & Source Code Repositories: Deepening expertise across GitHub, GitLab, and Azure DevOps.
  • DevOps, Product Management & Development Services: Focusing on the Atlassian stack (Jira & Confluence).
  • Platform Focus: Gaining an even stronger foothold in Red Hat OpenShift and KubeVirt.
  • Databases: Expanding operational knowledge of modern database architectures.
