Yesterday we wrote down a plan.

Today we started the plan.

We started the course linked above, which had a bit of everything I was looking for in terms of foundations. We are around 2 hours in, and I decided that might be a good time to look back over the notes and at the additional areas already mentioned that we need to get into. I am not going to reinvent the wheel here. If you are interested, then get into this course first. I would say you can take the first hour and learn something.

The next hour was more talk about the technologies that would be needed, with some overview. In that hour you will hear many tooling options as well, so I will start there. My plan here is to try and find some useful, focused content that I can watch through to learn more of the fundamentals of these tools.

PSA: The Apache Software Foundation (ASF)

Many of the tools listed below are prefixed with “Apache”. I had obviously heard of many of these Apache-titled bits of software, but I was maybe a little naive about the background. The Apache Software Foundation (ASF) actually manages over 350 open-source projects spanning almost every corner of enterprise technology, including web servers, operating systems, development tools, and database management systems.

It All Started with a Web Server

I am pretty confident that anyone who has read this far has used the Apache Web Server before.

The Broader Apache Universe

To give you an idea of how diverse the community is, here are some major non-data engineering projects you might run into:

  • Web & Application Servers: Along with the original HTTP Server, they manage Apache Tomcat, which is used globally to run Java-based web applications.
  • Databases & Storage: They host highly popular NoSQL databases like Apache Cassandra (originally built by Facebook to power their inbox search) and Apache CouchDB.
  • Build & Dev Tools: Tools that developers use every day to build their code, like Apache Maven and Apache Ant, are standard parts of the software development lifecycle.
  • Cloud Infrastructure: They even house Apache CloudStack, which is open-source software designed to deploy and manage large networks of virtual machines as a cloud computing platform.

All of the above is out of scope for this learning journey, but it did remind me of the Cloud Native Computing Foundation (CNCF) in a way.

Let's get into the Apache Data Engineering Stack and surrounding technologies.

Ingestion/Streaming (Apache Kafka)

The highway that captures live data and moves it into the architecture.

image 4
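To make the "highway" idea a little more concrete for myself, here is a minimal sketch in plain Python. This is not the Kafka API; `TopicLog` and `Consumer` are hypothetical names I made up for illustration. The only idea it borrows is that records are appended to an ordered log and each consumer keeps track of its own position (offset) in that log.

```python
class TopicLog:
    """Hypothetical append-only log, loosely modelled on the idea of a topic."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Store the record and return its position (offset) in the log.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Return up to max_records starting at the given offset.
        return self._records[offset:offset + max_records]


class Consumer:
    """Hypothetical reader that remembers how far it has read."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        # Fetch the next batch and advance this consumer's own offset.
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = TopicLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

dashboard = Consumer(log)
archiver = Consumer(log)

print(dashboard.poll(max_records=2))  # first two events
print(archiver.poll())                # all three events, read independently
print(dashboard.poll())               # the remaining event
```

The point of the sketch is only that reading does not remove anything from the log, so several consumers can move through the same data at their own pace.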

Storage (Apache Iceberg, Delta Lake inside Cloud Storage (S3))

The smart organising layer that formats the data safely on disk.

image 6

Compute/Processing (Apache Spark, Apache Flink)

The heavy-duty engines that reach into storage to transform and clean the data.

image 7

Architectural Platforms (Snowflake, Databricks, Microsoft Fabric)

The overarching cloud environments where all of this compute and storage lives.

image 8

Orchestration (Apache Airflow)

The conductor that schedules when the ingestion, storage, and compute actions trigger.

image 9
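The "conductor" idea boils down to running tasks in an order that respects their dependencies. Here is a rough sketch of that idea using only the Python standard library (`graphlib`, available from Python 3.9). This is not Airflow code; the task names and the `run_order` helper are my own assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "store": {"ingest"},
    "transform": {"store"},
}


def run_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

A real orchestrator adds schedules, retries, and monitoring on top, but the dependency ordering is the core of it.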

Languages/Interfaces (Python, SQL, Jupyter)

The tools you use to write the logic and talk to every single layer above.

image 10
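As a small taste of how Python and SQL work together, here is a sketch using the `sqlite3` module from the standard library with an in-memory database. The `readings` table, the `average_by_sensor` helper, and the sample rows are all made up for illustration; a real pipeline would point at a warehouse or lakehouse rather than SQLite.

```python
import sqlite3


def average_by_sensor(rows):
    """Load (sensor, value) rows into SQLite and average them per sensor."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT sensor, AVG(value) FROM readings "
        "GROUP BY sensor ORDER BY sensor"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
sample = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(average_by_sensor(sample))
```

Python handles the plumbing (connecting, loading, returning results) while SQL expresses the actual question being asked of the data.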

Other Tools (Containers, Virtualisation, Cloud)

Docker was mentioned, but I have covered this in the #90DaysOfDevOps series, so I am going to drop that link below.

https://github.com/MichaelCade/90DaysOfDevOps/blob/main/2022/Days/day42.md

image 3

My goal now, instead of using Gemini to craft these somewhat interesting images above, is to find where people much smarter than me have already digested a lot of these tools and areas, so I can be smarter myself when I go back to the crash course.
Also, side note: that initial crash course is not the full course free on YouTube; it's an almost 4-hour teaser for a 17-hour paid bootcamp. The jury is out on whether that's the way to go.
