Moaz Mansour

What is Data Engineering?

Recently, small pockets called Data Engineering teams have been emerging in small and medium-sized companies. That raises the question: what is Data Engineering? Throughout this blog post, I try to clarify what data engineering is and the set of skills you need to become a Data Engineer.

Summary (TLDR):

Data Engineers are Software Engineers who specialize in working with data. Building Extract, Transform, and Load (ETL) pipelines is their primary task. To become one, you need to
- Be proficient in Python
- Understand & practice data structures and algorithms
- Comprehend design patterns
- Become skilled in SQL
- Learn NoSQL databases like Elasticsearch
- Familiarize yourself with a cloud platform like AWS

Excited to read? Me too, let’s dive in. 

Data Science VS Data Engineering

As a starter, we should clarify the difference between a Data Scientist and a Data Engineer.

“A Scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him.” - Gordon Lindsay Glegg.

A data scientist is, as the role name might suggest, a data expert. They work closely with other stakeholders to try to answer a specific business or research question using data. Data science is a multidisciplinary terrain. It emerged over the past decade with the surge in internet usage that resulted in what we now know as Big Data.

Data scientists aren’t necessarily software engineers or computer scientists; while some come from that background, they come from many different walks of life. They build models and algorithms to address complex data problems using statistics, mathematics, data analysis, and informatics.

Data Engineers, on the other hand, are essentially Software Engineers. As a Data Engineer, you are focused on writing clean production code at scale. You have extensive knowledge of the programming language you use. You understand operational performance, systems integration, working with databases, testing frameworks, code efficiency, and writing in Object-Oriented Programming (OOP) and Functional Programming styles. Data Engineering is a specialization within Software Engineering.

The term “Data Engineer” is relatively new to the industry. A few of the companies I worked with/for are still keeping the role under the blanket of Software Development or Engineering. You will see job postings like “Full Stack Developer, Back-end Developer, and Software Engineer,” which technically are data engineering roles under the hood. Some other companies started seeing the importance of this specialization and are now building their data engineering teams.

What is unique about Data Engineering, then?

Data Engineers are software engineers who focus on dealing, well, with data. They work on data manipulation, wrangling, extraction, processing, and storage. Accordingly, their main task is to provide a reliable infrastructure for data. Extract, Transform, and Load, or ETL for short, is the primary framework Data Engineers work with.

What is the ETL pipeline?

Extract is the process of obtaining the raw target data you need for your data application. Data extraction can take different forms and strategies depending on the nature of your application and the source you are collecting the data from. Sources can take different forms, like calling an API, scraping an HTML page from the web, downloading a Comma Separated Values (CSV) file, receiving an email, and more. Extraction algorithms should address the limitations of the source they are handling, such as a maximum number of API calls.

The frequency of running this algorithm depends on the nature of the source and the feature delivery requirements. Extraction algorithms can be event-triggered, like on receipt of an email, or batch-triggered, meaning they run at a fixed interval. They can also capture a live stream of data.
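As a minimal sketch of the extract step, here is a paged extraction loop in Python that respects a call budget and a naive rate limit. The `fake_api` function is a hypothetical stand-in for a real HTTP client, used so the example stays self-contained:

```python
import time

def extract_pages(fetch_page, max_calls=5, delay=0.0):
    """Pull pages from a source until it is exhausted or the
    call budget is spent, pausing between calls as a naive rate limit."""
    records = []
    for call in range(max_calls):
        page = fetch_page(call)          # one "API call"
        if not page:                     # empty page: source exhausted
            break
        records.extend(page)
        time.sleep(delay)
    return records

# Hypothetical stand-in for a real API: three pages of two records each.
def fake_api(page):
    return [{"id": page * 2 + i} for i in range(2)] if page < 3 else []

raw = extract_pages(fake_api, max_calls=10)
```

In a real pipeline, `fetch_page` would wrap an HTTP request, and the delay and budget would match the source's documented limits.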

Transform is the phase when you make sense of your data. It is where you crunch numbers, tag text, integrate sources, and/or aggregate values to create measurable metrics and human-readable insights. 

In this phase, Data Engineers build the logic to alter and reconstruct the raw extracted data. The transformation process might require integrating machine learning modules for tagging purposes, like text sentiment analysis.
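A small sketch of a transform step: aggregating raw records into a human-readable metric. The field names (`customer`, `amount`) are illustrative assumptions, not from any particular dataset:

```python
from collections import defaultdict

def transform(records):
    """Aggregate raw order records into per-customer revenue metrics."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    # Emit rounded, human-readable rows, highest revenue first.
    return sorted(
        ({"customer": c, "revenue": round(v, 2)} for c, v in totals.items()),
        key=lambda row: row["revenue"],
        reverse=True,
    )

raw = [
    {"customer": "acme", "amount": 120.0},
    {"customer": "globex", "amount": 80.5},
    {"customer": "acme", "amount": 30.0},
]
metrics = transform(raw)
```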

Load is the process of storing your transformed data securely in a way that makes it consumable. In this phase, Data Engineers should consider the end-product data delivery requirements. They decide on the suitable architecture, data model, and structure to deliver their data depending on the nature of the data and the end-product needs.

It can be as simple as delivering a dataset in a CSV file, building an API on top of an integrated relational database, or enabling instant delivery through NoSQL databases.
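The load step can be sketched with Python's built-in `sqlite3` module, writing transformed rows into a relational table so downstream consumers can query them. The table name and schema are illustrative:

```python
import sqlite3

def load(rows, conn):
    """Store transformed rows in a relational table so they are queryable."""
    conn.execute("CREATE TABLE IF NOT EXISTS revenue (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO revenue VALUES (:customer, :revenue)", rows)
    conn.commit()

rows = [{"customer": "acme", "revenue": 150.0},
        {"customer": "globex", "revenue": 80.5}]

conn = sqlite3.connect(":memory:")   # in-memory DB; a real pipeline uses a file or server
load(rows, conn)
total = conn.execute("SELECT SUM(total) FROM revenue").fetchone()[0]
```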

What does it take to be a Data Engineer?

Pursuing a data engineering career requires learning an essential set of Software Engineering skills and a passion for working with data. It is a three-fold journey that takes you from a developer who consumes frameworks to someone who creates them. It unfolds into Language Proficiency, Data Structures and Algorithms, and Design Patterns.

Language proficiency is about mastering the programming language you use. It starts with learning things like defining a function, iterating over a loop, and creating an object, then understanding how simple data structures like arrays, lists, and dictionaries work under the hood: their limitations, time complexity, and processing cost. Language proficiency is crucial when it comes to software engineering. It is the foundational piece you use to build up all the fantastic things you will achieve as a Data Engineer.

I am a big advocate for Python, specifically when it comes to data engineering. Python is an object-oriented, high-level language with dynamic semantics. It is open source and widely supported, and great for a wide range of projects on the data and web application side of things. It also dominates the Data Science world, and since as a data engineer you will work closely with that realm, Python will be of great help.
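The basics above, defining a function, looping, creating an object, and knowing how a dictionary behaves, can all be touched in a few lines of Python. This is a toy example, not from the original post:

```python
class WordCounter:
    """A tiny object wrapping a dictionary-based word count."""

    def __init__(self):
        self.counts = {}                  # dict: average O(1) insert and lookup

    def add(self, text):
        for word in text.split():         # loop over the words in the text
            self.counts[word] = self.counts.get(word, 0) + 1

counter = WordCounter()                   # creating an object
counter.add("data engineers love data")
```

Knowing that the dictionary lookup here is constant time on average, while scanning a list for each word would be linear, is exactly the kind of under-the-hood understanding this section is about.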

Data Structures and Algorithms are at the heart of any software solution. A Data Structure is simply a data representation method with its own implementation and characteristics. Some examples of notable data structures are arrays, linked lists, trees, heaps, stacks, and more. Learning about those different data structures should help you pick an efficient method to represent your program data. 
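As one concrete example, a singly linked list can be implemented in a few lines. This sketch shows its defining trade-off: O(1) insertion at the head, O(n) traversal:

```python
class Node:
    """A single element holding a value and a pointer to the next node."""
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

class LinkedList:
    """Minimal singly linked list: O(1) push to the front, O(n) traversal."""
    def __init__(self):
        self.head = None

    def push(self, value):
        self.head = Node(value, self.head)   # new node points at the old head

    def to_list(self):
        out, node = [], self.head
        while node:                          # walk the chain node by node
            out.append(node.value)
            node = node.next
        return out

ll = LinkedList()
for v in (1, 2, 3):
    ll.push(v)
```

Pushing 1, 2, 3 yields them in reverse order when traversed, since each push prepends to the head.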

To navigate a data structure of choice and perform specific tasks like sorting and searching, you will need to develop an algorithm. I know that the word algorithm can be obscure. It does have a reputation of being some sort of black magic that requires extensive knowledge to understand; however, this is not true. 

An Algorithm is a step-by-step procedure to solve a problem. Any set of programming steps that solves a particular issue is considered a computer algorithm. In this specific case, the term “Algorithms” refers to a particular collection of procedures. What makes them unique is that they are well-researched, tried many times, and proven to be the most efficient way to do a specific task.

Efficiency in computer programs is defined in terms of time and space complexities. Time complexity is how long an algorithm takes to run to solve a particular task, while space complexity is the memory space this algorithm consumes to run. 
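Binary search is a classic example of one of these well-researched algorithms, and of time complexity in action: it finds a target in a sorted list in O(log n) steps by halving the candidate range, where a naive linear scan takes O(n):

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent.
    O(log n) time: the candidate range halves on every iteration."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1          # target is in the upper half
        else:
            hi = mid - 1          # target is in the lower half
    return -1

data = list(range(0, 100, 2))     # 0, 2, 4, ..., 98 (already sorted)
idx = binary_search(data, 42)
```

Space complexity here is O(1): only the two bounds are kept, regardless of input size.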

Other Software Engineering career advice content tends to overlook the language proficiency piece and dive right into data structures and algorithms. That’s because data structures and algorithms are at the forefront of big tech interviews. I go against this narrative. 

Language proficiency, in my opinion, should be appreciated more as the first step towards your software engineering career. Understanding data structures requires a lot of practice. Getting into the habit of solving real-world problems helps shape an understanding of when to entertain a particular data structure over another. I don’t think someone can turn programming into a habit unless they have a good command of their coding language.

Design patterns represent best practices or typical solutions to common problems faced in software development. Design patterns are not code, and learning them helps you avoid reinventing the wheel. They are quite often created for and used by OOP languages like Java and Python.
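To make this concrete, here is one well-known pattern, the Strategy pattern, sketched in Python: interchangeable algorithms sit behind one interface, so the calling code never changes when the behavior does. The class and function names are illustrative:

```python
import gzip
from abc import ABC, abstractmethod

class Compressor(ABC):
    """The shared interface every concrete strategy implements."""
    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...

class GzipCompressor(Compressor):
    def compress(self, data):
        return gzip.compress(data)

class NoopCompressor(Compressor):
    def compress(self, data):
        return data               # pass-through, e.g. for already-compressed files

def archive(data: bytes, strategy: Compressor) -> bytes:
    # The caller picks the behavior; archive() itself never changes.
    return strategy.compress(data)

plain = archive(b"hello", NoopCompressor())
```

Swapping in a new compression scheme later means adding one class, not rewriting `archive` — which is the wheel the pattern saves you from reinventing.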

What else do you need to know to become a Data Engineer?

In addition to the skills mentioned above, Data Engineers often deal with and manage databases. Learning how to put together a relational model and how to query it in SQL is essential. You should know things like Entity-Relationship (ER) diagrams, indexing, and database triggers and constraints.
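A small sketch of these ideas using Python's built-in `sqlite3`: a two-table relational model with a constraint, an index, and a join query. The schema is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE                      -- a constraint
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);  -- speeds up joins
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'acme')")
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(1, 10.0), (1, 15.5)])
row = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchone()
```

The same SQL skills carry over directly to production databases like PostgreSQL or MySQL.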

It also helps to know Object-Relational Mapping (ORM) libraries, like peewee and SQLAlchemy in Python, for querying and manipulating data in a database. Learning how to structure and query data in NoSQL databases like Elasticsearch, Redis, and DynamoDB also comes in handy.

In addition, the majority of the data solutions I have built were cloud-hosted in a serverless architecture. Getting familiar with one of the sizable cloud platforms like Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure goes a long way.

What resources can help you start?

I highly recommend the tracks provided by Treehouse for learning Python. They are concise, well-prepared, and not at all expensive.

Udacity provides a free course and a paid Nanodegree for data structures and algorithms in Python. I believe they are helpful, given the variety of problem sets you work through in the Nanodegree.

One of the resources I found really useful was the coding-interview-university GitHub repository by John Washam. It is meant for new software engineers and is quite resourceful.