In our Data Demystified series, we talked about the benefits of a modern data stack and how to build a stack that will help your organization answer important questions and drive the business forward.
Since then, we’ve been explaining each layer of the stack in greater detail:
- In our data warehouses explainer, we reviewed the benefits of having a centralized storage location for data, how to get data into a warehouse, and general costs.
- In ETL vs. ELT: Understanding the Differences & Benefits, we broke down the differences, similarities, and ideal use cases for these two popular methods of data ingestion.
Now, we’re going to zoom out to assess data ingestion as a whole.
You’ll learn what it is, why it’s important, common challenges to look out for, and how to choose the right data ingestion tool (including a comparison chart of five common tools that we often recommend to startups and other early-stage companies).
What is data ingestion?
There are a lot of sources that generate data, including the company’s website, SaaS tools, spreadsheets, and production databases.
Data ingestion is the process of collecting, processing, and preparing data from those various sources into one central data storage location (like a data warehouse, data lake, or database) where the entire organization can access, analyze, and use it.
Why is data ingestion important?
Data ingestion is a vital process that ensures clean data will flow between your data sources, data storage system, and downstream reporting and analytics tools.
This is critical – end-users rely on these tools to access and analyze data. They need clean, reliable data that they can trust to guide their biggest decisions.
Types of data ingestion
Two methods of data ingestion are commonly used: real-time and batch-based.
Real-Time Data Ingestion
Real-time data ingestion, sometimes called streaming or stream processing, focuses on sourcing, manipulating, and loading data as soon as it is generated to create a continuous output. This method is a good fit for organizations that analyze data from sources that include web, mobile, and server events.
Companies that use data to guide immediate, reactive actions are good candidates for real-time data ingestion. They need a high-performance data analytics program that helps them make vital decisions on time-sensitive issues. Banks, for example, would use this method to monitor for fraudulent activity. It’s also popular amongst logistics and finance functions.
Real-time data ingestion is also useful for monitoring, specifically of IT systems, manufacturing equipment, and Internet of Things (IoT) devices.
Here’s another example of a prime use case for real-time data ingestion: A SaaS business that provides a tracking app for on-demand delivery. For this startup, real-time data ingestion is critical.
In terms of tooling, some of the top streaming analytics technologies include Apache Kafka, Amazon Kinesis, and Confluent.
Batch-Based Data Ingestion
The most widely used form of data ingestion, batch processing, is the most effective option when real-time data is not required. It allows organizations to collect data in large quantities at specific intervals or during a scheduled event. Being the simpler and more affordable option, batch processing is generally the preferred data ingestion method for businesses.
Batch-based data ingestion is the method of choice for most data analytics work, and most workflows rely on batch-loaded data. For example, business intelligence, data science, and machine learning all typically rely on batch processing.
However, while batch-based processing may be the more popular choice between these two data ingestion methods, it’s not uncommon for organizations to opt for a combination of both real-time and batch-based data ingestion.
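To make the contrast concrete, here is a minimal sketch of the two styles in Python. The function names and the `load` callback are hypothetical, and a fixed batch size stands in for the schedule or interval that a real batch job would run on.

```python
from typing import Callable, Iterable

Event = dict


def ingest_streaming(events: Iterable[Event], load: Callable[[list], None]) -> None:
    """Real-time style: hand each event to the loader as soon as it arrives."""
    for event in events:
        load([event])


def ingest_in_batches(
    events: Iterable[Event], load: Callable[[list], None], batch_size: int
) -> None:
    """Batch style: buffer events and load them in groups."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            load(batch)
            batch = []  # start a fresh buffer for the next group
    if batch:  # flush whatever is left over
        load(batch)


if __name__ == "__main__":
    events = [{"id": i} for i in range(5)]
    streaming_loads, batch_loads = [], []
    ingest_streaming(events, streaming_loads.append)
    ingest_in_batches(events, batch_loads.append, batch_size=2)
    print(len(streaming_loads))  # 5 loads, one per event
    print(len(batch_loads))      # 3 loads: two full batches and one partial
```

The same five events cause five writes in the streaming version and three in the batch version, which hints at why batch processing tends to be the cheaper option to run.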
What does “real-time” mean to your organization?
When deciding which type of data ingestion to use, you’ll also want to establish your standard of what “real-time” means. Is it every ten seconds or every five minutes?
This is an important point – most companies that say they want real-time ingestion actually want “near real-time” ingestion, which is a batch-based process.
Here’s a good rule of thumb to use: For anything over five minutes, batch processing should yield the same results. Remember that in terms of infrastructure cost, there’s a much larger difference between 5 minutes and 5 seconds than there is between 5 hours and 5 minutes.
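If it helps to write the rule of thumb down, it could look something like the sketch below. The function name and the hard five-minute cutoff are our own illustrative assumptions, so treat the threshold as a starting point rather than a fixed rule.

```python
from datetime import timedelta

# Threshold taken from the rule of thumb above; tune it for your own workloads.
BATCH_THRESHOLD = timedelta(minutes=5)


def suggest_ingestion_mode(max_acceptable_delay: timedelta) -> str:
    """Suggest "batch" when data may be more than five minutes old, else "real-time"."""
    if max_acceptable_delay > BATCH_THRESHOLD:
        return "batch"
    return "real-time"


if __name__ == "__main__":
    print(suggest_ingestion_mode(timedelta(hours=1)))     # batch
    print(suggest_ingestion_mode(timedelta(seconds=10)))  # real-time
```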
However, if your organization decides to make use of both options, try to use real-time data ingestion sparingly. It’s the costlier and more complex option.
The two most common data ingestion processes
Data ingestion is essentially all about moving data from one point to another. To fully understand it, you need to know how it fits into the broader process of data integration.
While data ingestion focuses solely on the movement of data, data integration focuses on its extraction, loading, and transformation.
There are two common processes you can use to achieve this: ELT or ETL.
With both processes, three things need to happen. Data needs to be:
- Extracted raw from a source like a SQL database or SaaS tool
- Transformed from its raw state by cleaning it up, processing it, and converting it to your preferred file format
- Loaded into a data warehouse
Both approaches turn raw data into clean, structured data that lives in a data warehouse. How that raw data is processed, however, is a bit different for each process.
The simplified difference between ETL and ELT is the order in which the data gets loaded into the warehouse.
When it comes to ELT, the raw data is loaded into a warehouse immediately following extraction. Then it’s transformed.
With ETL, the data is first cleaned, processed, converted, and then loaded into a warehouse.
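The difference in ordering can be sketched in a few lines of Python. This is only an illustration: an in-memory SQLite database stands in for the warehouse, and the sample rows, table names, and `clean` function are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might come out of a source system.
RAW_ROWS = [("  Alice ", "42"), ("BOB", "17")]


def clean(name: str, amount: str) -> tuple:
    """Example transformation: trim and title-case the name, cast the amount."""
    return name.strip().title(), int(amount)


def run_etl(warehouse: sqlite3.Connection) -> None:
    """ETL: transform in the pipeline, then load only the clean rows."""
    warehouse.execute("CREATE TABLE orders (name TEXT, amount INTEGER)")
    cleaned = [clean(name, amount) for name, amount in RAW_ROWS]
    warehouse.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)


def run_elt(warehouse: sqlite3.Connection) -> None:
    """ELT: load the raw rows first, then transform with SQL inside the warehouse."""
    warehouse.execute("CREATE TABLE raw_orders (name TEXT, amount TEXT)")
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)", RAW_ROWS)
    warehouse.execute(
        """
        CREATE TABLE orders AS
        SELECT upper(substr(trim(name), 1, 1)) || lower(substr(trim(name), 2)) AS name,
               CAST(amount AS INTEGER) AS amount
        FROM raw_orders
        """
    )


if __name__ == "__main__":
    for run in (run_etl, run_elt):
        warehouse = sqlite3.connect(":memory:")
        run(warehouse)
        print(warehouse.execute("SELECT name, amount FROM orders ORDER BY name").fetchall())
```

Both functions end with the same clean `orders` table. The ELT version also keeps `raw_orders` in the warehouse, which is where its extra flexibility and extra storage cost both come from.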
We’ll go over the major pros and cons of each process below, but check out this blog for a more comprehensive explainer: ETL vs. ELT: Understanding the Differences & Benefits.
Extract, transform, and load (ETL) begins by collecting data from a source system once it is generated. The data is then transformed into a clean, consistent form that is compatible with your storage system. Once the data is in its final format and structure, it can be loaded into a centralized analytics system.
Pros of ETL:
- Lower storage costs by only storing data you need
- Compliance with specific security protocols and regulations by removing sensitive data before it even enters your warehouse
- A universally understood process amongst engineers everywhere
Cons of ETL:
- Higher start-up and maintenance costs
- Limited flexibility – transformations for format changes and potential edge cases need to be configured ahead of time
To summarize, while ETL is more expensive to start up and maintain, you’ll lower your data storage costs and eliminate clutter from your database.
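As a small illustration of the compliance point, an ETL transform step might strip sensitive fields before anything is loaded. The field names and the helper below are hypothetical. A real pipeline would take its list of sensitive fields from your own compliance requirements.

```python
SENSITIVE_FIELDS = {"email", "ssn"}  # hypothetical field names


def drop_sensitive_fields(record: dict) -> dict:
    """Return a copy of the record without any sensitive fields."""
    return {key: value for key, value in record.items() if key not in SENSITIVE_FIELDS}


if __name__ == "__main__":
    raw = {"id": 1, "email": "a@example.com", "plan": "pro"}
    print(drop_sensitive_fields(raw))  # {'id': 1, 'plan': 'pro'}
```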
Let’s take a look at the alternative.
ELT is a newer process that first became popular in the 2000s with the emergence of cheap and scalable cloud computing that could be decoupled from data storage. This shift made ELT more accessible.
Another reason for ELT’s growing popularity is the rise of cloud vendors like Snowflake, Amazon Redshift, and Google BigQuery. These providers have made deploying and maintaining a cloud data warehouse cheap and simple, without the need for a full-time database administrator.
On the process side, the primary difference between ELT and ETL is that with ELT, raw data is loaded onto the data storage system and then transformed directly inside of the storage system. Because of this slight difference in the order of the data ingestion process, ELT is a faster, easier, and more cost-effective method of data integration.
Pros of ELT:
- Faster, because it loads data directly into your storage system for transformation
- More flexibility – transform your data as needed, even if those needs change
- Cost-effective with lower startup and ongoing maintenance costs
- More efficient, because it saves time that would otherwise be spent on transforming data before loading
Cons of ELT:
- Potential security and compliance issues from loading raw data directly into your storage system
- Storage costs can quickly balloon because both messy and clean data are stored in your data storage system
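The flexibility that ELT offers comes from keeping the raw data around. In the sketch below, an in-memory SQLite database again stands in for the warehouse, and the table and view names are made up for illustration. A second transformation is added later without going back to the source system.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load step: raw data lands in the warehouse untouched.
warehouse.execute("CREATE TABLE raw_signups (email TEXT, signed_up_at TEXT)")
warehouse.executemany(
    "INSERT INTO raw_signups VALUES (?, ?)",
    [("a@example.com", "2024-01-03"), ("b@example.com", "2024-02-10")],
)

# First requirement: signups per month.
warehouse.execute(
    "CREATE VIEW signups_by_month AS "
    "SELECT substr(signed_up_at, 1, 7) AS month, COUNT(*) AS signups "
    "FROM raw_signups GROUP BY month"
)

# A later requirement: signups per email domain.
# The raw table is still in the warehouse, so nothing has to be re-extracted.
warehouse.execute(
    "CREATE VIEW signups_by_domain AS "
    "SELECT substr(email, instr(email, '@') + 1) AS domain, COUNT(*) AS signups "
    "FROM raw_signups GROUP BY domain"
)

print(warehouse.execute("SELECT * FROM signups_by_domain").fetchall())
```

The trade-off is the one listed above: `raw_signups` and everything derived from it sit in the same storage system.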
What are some of the challenges with data ingestion?
The challenges that you might experience with data ingestion largely depend on whether you use in-house data ingestion management or opt for a dedicated data ingestion tool.
In-house data ingestion can be tricky because, as your data volume increases, so does the processing load you have to keep up with. Eventually, maintaining in-house data ingestion can become impractical. As your number of distinct data sources grows, you would be better off choosing a data ingestion tool that can process all of that information effectively and efficiently.
Another issue with in-house data ingestion is that each pipeline you build has to integrate with a third-party API that you don’t control. When those APIs change, your pipelines can break, which makes for an unstable data ecosystem and makes your own systems harder to future-proof.
Security also becomes a major issue with in-house data ingestion. You have a responsibility to protect all of the sensitive information in your database, not to mention data privacy and protection regulations to comply with. Every effort must be made to ward off attacks on your system.
This is why many organizations choose to forgo in-house data ingestion management in favor of data ingestion tools.
What are the best data ingestion tools?
As you already know, in-house data ingestion can be a complex and costly endeavor. Luckily, there are tools to make things easier.
There are tons of data ingestion tools available (and we aren’t exaggerating about that). While we’d love to share an in-depth look at every single option, the following four tools are all popular choices amongst different organizations for various reasons. If you’re starting to compare data ingestion tools, look at these first.
What about domain-specific data ingestion and integration tools?
It’s also worth noting that there are several domain-specific data ingestion tools, like Adverity, which focuses on marketing, sales, and ecommerce teams. The value of domain-specific ingestion tools is that they generally include some proprietary data transformation or enrichment features.
These can be good options for companies that work with a specific type of data, like a marketing agency.
How about open-source data ingestion tools?
Open-source tools, like Airbyte, are a good option for companies with technical backgrounds and/or engineering resources.
Once you have a server set up to host the application, Airbyte’s platform makes it easy to get everything else up and running. There’s also a dbt integration – after Airbyte has loaded your data, you can trigger data transformation within the same tool.
What happens after data ingestion?
Once the data ingestion process is complete, critical business data will be safely stored inside a data warehouse, data lake, or database.
Eventually, it will flow into your data analytics and business intelligence environment, and ultimately, into the hands of business end-users.
We’re not there quite yet, though. Before data can be used as fuel to guide business decisions, it needs to be modeled into something usable.