Emerging DataOps Tools: Data Lineage and Data Provenance
Part 1 of a multi-part series on future technology trends
The Inaugural Edition! - I hope to be writing frequently on here to provide some themes on enterprise software, web3, and much more.
For those that don’t know me, I am a technology enthusiast and former engineer. Early in my career, I was a systems engineer at Lockheed Martin where I worked on autonomous vehicles. I’m most recently a business school grad at UVA-Darden and a VC investor at Dcode Capital, where I focused on investing in enterprise software companies with use cases in government. I’m also a movie fanatic so you should see a couple movie analogies in these. 😎
Why should you read? If you like technology trends in enterprise software, this is definitely going to be a fun/easy read for you to learn about something new. I also hope I’m able to simplify the industry speak that often makes these products tough to understand. Make it easy for the average person to understand you know? At the very least, I promise some 🔥 movie references.
Today, I wanted to focus on a category of data operations tools called data lineage and data provenance. I’m providing a summary here of reasons why I’m bullish on the next generation of tools that provide this capability.
The Problem
I first came across this problem researching another topic - I was looking into Master Data Management (MDM) software. MDM enables your organization to consolidate duplicative data across a bunch of systems and create a single golden record.
Through my research, I got to talk to a ton of people smarter than me in the financial services industry as well as in data architecture. Just talking about what the future might hold, I learned about a problem set that I think will start to grow rapidly in 2022: Data lineage and data provenance.
What does that mean though? I find real world examples are the most helpful.
If you’re Sam the VP, you’re sitting way downstream of some of the customer-facing systems your teams are running and you live in Tableau or PowerBI looking at dashboards. You probably don’t have a sound grasp of where any of the data you’re looking at came from. If you wanted to figure it out, you’d ping a team member and it may lead to a 1-week deep dive bringing in several other teams to track it back to the source.
Alternatively, you might be Rob, the new associate on the customer success team. You’re trying to move around some columns in a particular document you have in your data lake. Unbeknownst to you, that document isn’t just used by you. It’s also used by a BD team in Australia that relies on the data to target customers for upsell. If you start moving stuff around, it kicks off a chain of dominoes that wrecks downstream systems for others.
Bonus points if you remember this movie.
Simply put, data lineage and data provenance tools allow you to map data through your organization. The idea here is to provide a self-service tool where a user can understand where the data they’re looking at came from, where it was augmented, and where it might be headed next in the organization. The diagram below helps show what a data lineage tool may show you about how data flows through an organization.
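To make the idea concrete, here is a minimal sketch of a lineage map as a directed graph. It is not how any particular product works; the class, method names, and system names (crm_signup_form, data_lake.customers, and so on) are all made up for illustration. Walking the graph upstream answers Sam's question ("where did my dashboard data come from?"), and walking it downstream answers Rob's ("what breaks if I change this?").

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage map: nodes are datasets or systems, edges point downstream."""

    def __init__(self):
        self.downstream = defaultdict(set)  # source -> direct consumers
        self.upstream = defaultdict(set)    # target -> direct sources

    def add_flow(self, source, target):
        """Record that data flows from `source` into `target`."""
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    def _walk(self, start, edges):
        """Breadth-first walk collecting every node reachable from `start`."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def sources_of(self, node):
        """Everything upstream of `node` (where did this data come from?)."""
        return self._walk(node, self.upstream)

    def impacted_by(self, node):
        """Everything downstream of `node` (what is affected by a change?)."""
        return self._walk(node, self.downstream)


# Hypothetical flows, loosely based on the Sam and Rob examples above.
graph = LineageGraph()
graph.add_flow("crm_signup_form", "data_lake.customers")
graph.add_flow("billing_system", "data_lake.customers")
graph.add_flow("data_lake.customers", "upsell_targeting_australia")
graph.add_flow("data_lake.customers", "tableau.exec_dashboard")

sam_view = graph.sources_of("tableau.exec_dashboard")
rob_view = graph.impacted_by("data_lake.customers")
print(sorted(sam_view))
print(sorted(rob_view))
```

Real tools build this graph automatically (for example from query logs or pipeline metadata) rather than by hand, but the two questions they answer are the same two traversals.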
Just wrapping it up in a bow, these are some of the problems data lineage tools solve:
Use Cases
Identify downstream systems and teams affected by upstream changes - The team can get together and hash out any changes before something breaks.
Self-service data tracking - Instead of a VP sending an email to his team asking where the data in his dashboard comes from, a data lineage tool provides an instantaneous view of how the data traveled through each system.
Regulatory - Compliance teams can use data lineage to track how data changes from the point of customer input all the way up to analytics or AI platforms.
This is a major use case for GDPR, which I discuss below.
Macro Trends and Market Size
When thinking about whether any technology can be big, I try to understand the global tailwinds that would help it along the way and the potential size of the market.
Macro Trends
Thinking through what would drive enterprises to adopt data lineage tools, I think the main drivers are regulation, the push toward transparent AI, the accelerating pace of M&A following the 2020 pandemic year, and the shift to the cloud.
GDPR and New Regulations - For those that don’t know about GDPR, it’s a set of regulations imposed in the European Union to protect citizens’ data privacy. Of particular interest, Article 30 of the GDPR requires organizations to maintain a record of processing activities. This also includes a description of recipients of client-facing data and a list of any data that may be transferred to another country [1]. As you can imagine, tools mapping where data is flowing through an organization become a necessity.
Movement towards Transparent AI - Transparent AI enables you to understand why an AI platform made a decision based on its inputs. This came up prominently in the news with the leak of internal Facebook documents (Oops! - I mean Meta). There has been a great deal of inquiry about Facebook’s newsfeed algorithm and why it pushes certain content rather than others [2]. As AI becomes more commonplace in the enterprise, AI transparency will become table stakes when the technology directly impacts customers. As a result, transparency of the origins of data being fed into these models will become a necessity as well.
Bonus Reading: Roy Walker wrote a great article about Transparent AI if you want to learn more [3].
Accelerating M&A activity - PitchBook projects over 27,000 acquisitions through the first three quarters of 2021 [4]. This matches the total number of deals made in all of 2019. With increased M&A, acquirers will need to integrate more and more software systems into their current data infrastructure. Tools that allow teams to clearly define where new systems are feeding their data will ease the integration process.
Pandemic shift to the Cloud - With the pandemic and remote work, we saw an acceleration of businesses both large and small moving to the cloud. With a brand new data infrastructure for organizations, data lineage tools will provide the clarity of where data is flowing through the organization in this new environment.
Market Size
After thinking through macro trends, I want to talk through whether this is a big opportunity or a small opportunity.
To level set for everyone, I wanted to show the size of the broad market for big data and analytics tools. This includes anything from data infrastructure tools like Snowflake to analytics tools like Tableau. IDC releases an annual spending guide on the sector and they projected the market to be a whopping $215B in 2021, growing at a 12.8% CAGR through 2025 [5]. They also do some great breakdowns of spend by sector, so I recommend taking a look.
Now, I want to drill down a little bit deeper to see what the market size might be for data lineage specifically. It’s a pretty niche, specialized tool set, so I looked for comparable categories that data lineage might fall into.
A market that’s highly related to data lineage is metadata management. Metadata management tools allow users to search for metadata attributes, understand how fields are calculated, and adjust what metadata attributes are tracked. These tools also allow you to build catalogs so that business users can easily find where data resides across the enterprise. In fact, several metadata tools on the market today build data lineage functionality into their product.
MarketsandMarkets, a research tool I use for more niche technologies, assesses the market size for metadata management tools at $6.3B, growing at a rapid 19% CAGR [6]. While data lineage is likely only a fraction of that market, this analysis gave me confidence that we’re looking at a potential billion-dollar opportunity.
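For anyone who wants to play with the numbers, here is a quick sketch of what a CAGR implies. The function name is my own, and the five-year horizon is an assumption picked purely for illustration; only the $6.3B starting point and the 19% growth rate come from the report cited above.

```python
def project_market_size(base_billions, cagr, years):
    """Compound a starting market size forward at a constant annual growth rate."""
    return base_billions * (1 + cagr) ** years


# $6.3B and 19% are from the metadata management estimate above;
# the 5-year horizon is a hypothetical chosen for this example.
metadata_after_5y = project_market_size(6.3, 0.19, 5)
print(f"Metadata management after 5 years: ${metadata_after_5y:.1f}B")
```

Swap in your own guess for the share of that market that goes to data lineage and you can sanity check the opportunity size yourself.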
Companies to Watch 👀
Thus far, we’ve talked about the problem that enterprises face in tackling data lineage, the global trends that should accelerate growth for these tools, and the size of the opportunity.
Wrapping up, let’s talk about some of the companies that are out there today and how they solve data lineage and data provenance issues. The majority are very early (seed stage) companies along with a couple other legacy players that offer data lineage as an add-on feature.
Metadata-enabled Tools
These tools use metadata as the core infrastructure to track and log data through the enterprise.
Alvin.ai - Alvin.ai is led by Martin Sahlen and Dan Mashiter. Martin comes from his previous startup feature.fm, a music marketing platform. Dan most recently comes from people.io, a UK-based startup focused on data privacy, where he was head of growth.
Stage: Seed
How it works: check out this blog post
Manta - Manta is led by Tomas Kratky. Tomas spent 14 years at Profinit, a data science platform for financial orgs, as their head of R&D before starting Manta.
Stage: Series A1; Total Funding Raised: $17M
How it works: Youtube Demo
Blockchain-enabled Tools
These tools use blockchain and tokenization to be able to track the provenance of documents over their lifetime.
Blocky - Blocky currently allows users to sign data files onto a blockchain as a security feature to verify authenticity of documents. The technology allows for organizations to understand the data lineage of any file through its file history on the blockchain. The company is co-founded by Taylor Heinecke, David Millman, and Mike Wittie. The team brings experience from academia at Montana State University’s computer science department.
Stage: Seed
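To give a feel for the general idea behind this category, here is a minimal sketch of a hash-chained file history. To be clear, this is not Blocky's implementation and not a real blockchain (there are no digital signatures and no distributed ledger); the function names and record fields are my own assumptions. It only shows why chaining hashes makes a file's history tamper-evident.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(record):
    """Stable SHA-256 hash of a record's contents."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()


def record_version(chain, file_bytes, author):
    """Append a new file version that points back at the previous entry."""
    entry = {
        "file_hash": hashlib.sha256(file_bytes).hexdigest(),
        "author": author,
        "prev_hash": chain[-1]["entry_hash"] if chain else GENESIS,
    }
    entry["entry_hash"] = _digest(entry)  # hash covers the three fields above
    chain.append(entry)
    return entry


def verify(chain):
    """Return True only if no entry was altered and every link is intact."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _digest(body):
            return False
        prev_hash = entry["entry_hash"]
    return True


# Hypothetical history: the same document before and after a column swap.
history = []
record_version(history, b"customer_id,region\n1,AU\n", "rob")
record_version(history, b"region,customer_id\nAU,1\n", "rob")
intact = verify(history)

history[0]["author"] = "someone_else"  # rewrite history after the fact
still_valid = verify(history)
print(intact, still_valid)
```

Because each entry's hash depends on the one before it, quietly editing an old version breaks verification for the whole history, which is what lets you trust the lineage of a file.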
Other DataOps Platforms
These platforms offer data lineage as an additional feature beyond their core functionality.
Talend - The Talend data fabric platform offers features for data integration to ingest data from a variety of platforms, data pipelines through their Stitch platform, data quality to identify bad data, out-of-the-box APIs for devops teams, and a data lineage visualization tool.
Stage: PE-Backed
How it works: Youtube Demo
Atlan - Their all-in-one DataOps platform provides data cataloging to allow for data exploration and search, data quality controls to help detect bad data, and data lineage to map how data flows through the enterprise.
Stage: Series A
How it works: Youtube Demo
NOTE: If you know of a company not mentioned above, please reach out to me!
Hope you enjoyed! Please drop a comment, subscribe, throw me a follow on Twitter @NicoSands or message me on LinkedIn!
[1] https://www.talend.com/resources/gdpr-stitch-data-lineage/
[2] https://www.washingtonpost.com/technology/2021/10/26/facebook-angry-emoji-algorithm/
[3] https://medium.com/all-things-venture-capital/ai-explainability-why-we-need-it-how-it-works-and-whos-winning-b4ca3c26b2a6
[4] https://pitchbook.com/news/reports/q3-2021-global-ma-report
[5] https://www.idc.com/getdoc.jsp?containerId=prUS48165721
[6] https://www.marketsandmarkets.com/Market-Reports/metadata-management-tools-market-120201191.html#:~:text=The%20global%20metadata%20management%20tools,19.0%25%20during%20the%20forecast%20period.