Building the Future with Data Engineering: Uma Uppin on Crafting Scalable Infrastructure, Tracking Business Metrics, and Embracing an AI-Driven World
In this exclusive TechBullion interview, Uma Uppin delves into the evolving field of data engineering, exploring how it forms the backbone of data-driven organizations today. From defining the critical components of data infrastructure and data quality to discussing the importance of tracking user metrics like acquisition, retention, and churn, Uma provides a detailed look at the strategies that enable businesses to turn raw data into actionable insights. She highlights the importance of establishing a dedicated data engineering function early on, avoiding common pitfalls, and building scalable data infrastructures that drive growth. Looking toward the future, Uma shares insights into how businesses can leverage data engineering to stay ahead in an increasingly AI-powered landscape.
Can you define data engineering for our audience?
Data engineering is a multidisciplinary function. It covers everything from building infrastructure to collecting raw data and transforming it so that downstream users can easily consume it for insights or build AI applications.
Some of the functions of data engineering are:
- Data Infrastructure — This function lays down the infrastructure and data systems needed to collect data. These can be message queues that capture user behavior or system-generated API events, providing insight into metrics like adoption and engagement. Events can be processed in batch or in real time. This team is also responsible for setting up alerting and monitoring so that the systems stay up and running.
- Data Platform — This function builds a data lake and ingests all events and product data required for machine learning and analytics. The team also builds tooling to schedule ingestion at a regular cadence, so that it runs automatically. They build tools that help users grasp data complexity, along with data catalogs and lineage to maintain and inform data users.
- Data Quality — This function is responsible for building pipelines to clean and transform raw data into meaningful data elements, which can then be consumed by downstream applications and used in machine learning and generative AI.
- Data Tools — This function builds the tools needed to make sense of and query the data to get insights. Think of them as the front-end wrapper for the backend data team. They build tools like UIs for easy querying, understanding data lineage and metadata, and experimentation.
From your experience, what are the key metrics organizations should track to truly understand their business growth? How does data engineering make this possible?
Businesses should measure growth based on new user acquisition and how sticky the product is. To gauge that, companies should collect four data elements: new users, retained users (users who have been active since joining), churned users (users who have stopped using the product), and resurrected users (users who return after churning). Of course, the state of users fluctuates over time, so measuring these metrics across a time window on a time series is essential. Net growth is the sum of new, retained, and resurrected users minus churned users. By creating and monitoring these metrics, businesses can see where the drop-off is, which allows further investigation to target problem areas.
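The user-state classification and net-growth formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the `joined` and `active_days` inputs are hypothetical structures assumed for the example.

```python
# Minimal sketch of the user-state model described above.
# Assumptions (not from the interview): "joined" maps user -> signup date,
# "active_days" maps user -> set of dates the user was active.

def classify_users(joined, active_days, window, prev_window):
    """Bucket users into new / retained / resurrected / churned for one window."""
    new, retained, resurrected, churned = set(), set(), set(), set()
    for user, join_date in joined.items():
        days = active_days.get(user, ())
        active_now = any(window[0] <= d <= window[1] for d in days)
        active_prev = any(prev_window[0] <= d <= prev_window[1] for d in days)
        if window[0] <= join_date <= window[1]:
            new.add(user)                      # joined during this window
        elif active_now and active_prev:
            retained.add(user)                 # active in both windows
        elif active_now:
            resurrected.add(user)              # back after being inactive
        elif active_prev:
            churned.add(user)                  # was active, now gone
    return new, retained, resurrected, churned

def net_growth(new, retained, resurrected, churned):
    # Net growth = new + retained + resurrected - churned, per the definition above.
    return len(new) + len(retained) + len(resurrected) - len(churned)
```

Running this per window over a time series yields exactly the trend lines the interview describes.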
Data engineering gathers events when a new user onboards a product and tracks their activity on the product. The team then builds pipelines to compute the metrics across various time windows. For example, what were these metrics for a user over the last 7 or 30 days as of a particular day? From there, plotting them on a trend chart to visualize how the business is growing is straightforward.
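A trailing-window computation like the one described can be sketched with pandas. The event log below is invented for illustration, and the rolling sum counts user-days of activity rather than distinct weekly users, which is a simplification.

```python
import pandas as pd

# Hypothetical event log: one row per (user_id, event_date).
events = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2", "u1"],
    "event_date": pd.to_datetime(
        ["2024-03-01", "2024-03-01", "2024-03-03",
         "2024-03-04", "2024-03-08", "2024-03-09"]),
})

# Daily active users: distinct users per day, on a continuous date range.
dau = (events.groupby("event_date")["user_id"].nunique()
             .reindex(pd.date_range("2024-03-01", "2024-03-09"), fill_value=0))

# Trailing 7-day window: total user-days of activity in the last 7 days
# as of each day (a simple proxy; a true WAU would count distinct users).
rolling_7d = dau.rolling(window=7, min_periods=1).sum()
```

Plotting `rolling_7d` over time gives the trend chart the interview refers to.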
You’ve worked extensively with data quality and analytics. How do you recommend companies structure their data engineering to effectively measure user engagement, retention, and churn?
A lot of these decisions depend on the size of the company. If a company wants to grow exponentially, it is crucial to understand the different stages of the user acquisition and retention funnel and where the drop-off happens. I would advocate starting with a data infrastructure team and an analytics team to enable this. While the infrastructure team handles the ingestion of raw data and events into the data lake, the analytics team can swiftly build pipelines to analyze different stages of the funnel. Once these two functions are established and the company is growing, it becomes essential to establish a data engineering function that builds foundational core dimensions and metrics. These become the single source of truth for any downstream consumption. I’ve seen companies where, without this function, calculating revenue becomes challenging, and they report different numbers depending on which systems or pipelines are used. In addition to data engineering, a separate data tools team can help build tools like data lineage, metadata organization, and Retrieval-Augmented Generation (RAG)-based generative AI applications.
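The funnel drop-off analysis described above can be expressed as a simple stage-to-stage conversion report. The stage names and counts here are invented for illustration.

```python
# Hypothetical funnel-stage counts (illustrative, not real data).
funnel = [
    ("visited", 10000),
    ("signed_up", 2500),
    ("activated", 1500),
    ("subscribed", 300),
]

def drop_off_report(stages):
    """Conversion rate between consecutive funnel stages.

    The stage pair with the lowest rate is the biggest drop-off
    and the first place to investigate, as described above.
    """
    return [
        (f"{prev_name} -> {name}", count / prev_count)
        for (prev_name, prev_count), (name, count) in zip(stages, stages[1:])
    ]
```

Here the visited-to-signed-up step converts at 25%, activated-to-subscribed at 20%, so the subscription step would be the first target for investigation.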
Many companies struggle to turn their data into actionable insights. What common pitfalls do you see in data engineering implementations, and how can organizations avoid them?
Many companies do not invest in a data engineering function because they believe the ROI is not directly tied to their product. Instead, these companies rely on self-service tools and let the operations and analytics teams build critical data systems. While this is a reasonable strategy when the company is small, as it scales these pipelines will eventually fail because they weren’t built to scale. Therefore, it is essential to invest in at least a small data engineering team as early as possible and build a scalable infrastructure for the future.
Could you explain your approach to building alerting and monitoring systems for data quality? How does this impact business decision-making?
Observability is a vital element in building effective monitoring systems, allowing teams to maintain oversight and respond quickly to issues. There are two main types of alerting and monitoring systems.
The first type focuses on ensuring system uptime, keeping any downtime within the established SLA for the service. Monitoring is set up across all systems, so if any downtime occurs, an on-call person is immediately alerted to investigate. This approach enables a proactive team response, minimizing the time needed to restore service operations and reducing potential business impact.
The second type of alerting focuses on identifying unusual patterns or shifts in key metrics, which may indicate changes in business performance. For example, a month-over-month decline in user engagement metrics could suggest that customers are migrating to a competitor. These alerts help teams identify early warning signs, allowing them to take preventive action to safeguard business interests.
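The second type of alerting can be reduced to a threshold check on period-over-period change. This is a minimal sketch assuming a simple month-over-month comparison; the 10% threshold is an illustrative choice, not a recommendation from the interview.

```python
def mom_change(current, previous):
    """Fractional month-over-month change (e.g. -0.15 means a 15% decline)."""
    return (current - previous) / previous

def should_alert(current, previous, drop_threshold=-0.10):
    """Fire an alert when the metric declines by more than the threshold.

    drop_threshold is an assumed example value: -0.10 alerts on any
    month-over-month decline of 10% or more.
    """
    return mom_change(current, previous) <= drop_threshold
```

In practice, a pipeline would run this check nightly over each key engagement metric and page the owning team when it fires.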
When it comes to product development, how do you recommend organizations use data engineering to track product quality and user experience? What metrics matter most?
Organizations can build several key metrics to help them understand a product’s success. The key is to figure out what success means for the organization. For assessing initial product adoption, metrics such as app downloads, installations, registrations, subscriptions, and brand awareness are valuable indicators to determine if there is a product-market fit.
Next, tracking growth metrics like retention, churn, stickiness, and social shares helps us assess whether users find value in the product and continue to engage with it.
Finally, metrics such as average revenue per user, customer lifetime value, customer acquisition cost, and return on investment provide insights into the financial health of the business.
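These financial metrics have widely used textbook definitions, sketched below. The LTV formula here is the common simplification (ARPU times gross margin divided by churn rate); the input figures in the usage note are invented.

```python
def arpu(total_revenue, active_users):
    """Average revenue per user over a period."""
    return total_revenue / active_users

def customer_lifetime_value(monthly_arpu, gross_margin, monthly_churn_rate):
    """A common simplification: LTV = ARPU * margin / churn rate."""
    return monthly_arpu * gross_margin / monthly_churn_rate

def cac(acquisition_spend, new_customers):
    """Customer acquisition cost: spend divided by customers acquired."""
    return acquisition_spend / new_customers

def ltv_to_cac_ratio(ltv, cac_value):
    """LTV/CAC; a ratio above roughly 3 is often cited as healthy."""
    return ltv / cac_value
```

For example, with a $50 monthly ARPU, 80% gross margin, and 5% monthly churn, LTV is $800; against a $200 CAC that gives an LTV/CAC ratio of 4.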
Looking ahead, how do you see the role of data engineering evolving as businesses become increasingly data-driven? What should organizations be preparing for?
With AI rapidly advancing, a company’s success will depend on its ability to build and integrate generative AI applications. The fundamental input to these applications is not just the data a company generates but how clean that data is. We can now extract data from unstructured formats like PDFs and images, which can further automate many functions for organizations. All of this requires companies to build a fully functional data organization.