Innovative Approaches to Data Discovery by Leading Tech Giants
Many of the major technology firms rely on data to make critical decisions aimed at improving customer service. As these organizations expand, the complexity of their data environments increases, making them harder to navigate.
Data discovery refers to the process of identifying relevant, high-quality data within these intricate landscapes.
This article examines how Facebook, Airbnb, and Uber each frame the data discovery problem in the context of their business, and describes the in-house platforms they built to solve it.
Each of these companies has shared detailed articles on their respective platforms. If a particular strategy piques your interest, I recommend reading the full articles for a deeper understanding of their implementations and methodologies.
Facebook's Approach to Data Discovery
Facebook caters to billions of users who rely on its high-quality services. To deliver a meaningful experience, teams must efficiently locate relevant and precise information. However, several discovery challenges arise:
- Relevant tables might have obscure or non-descriptive names.
- Different teams may have overlapping datasets.
Teams are generally aware of what data they need, yet locating the correct data swiftly can be a challenge. Data discovery platforms assist these teams in finding the needed information with greater speed.
Facebook's Solution: Nemo
To address these challenges effectively, Facebook created a platform named Nemo.
> "Relying on data specialists to find necessary data for each decision was unsustainable. Therefore, we developed Nemo, an internal data discovery engine that allows engineers to quickly access the information they need, with high confidence in its accuracy."
Nemo has simplified data search for Facebook's engineers. It covers more than 12 types of data artifacts and has improved the success rate of data searches by over 50%. It has also kept pace as the variety of data types tripled and query volume doubled.
The platform utilizes Unicorn, Facebook's efficient social graph search system, to enhance scalability. Nemo also supports refined searches based on table usage, privacy limitations, and data recency. Moreover, the engine can parse queries, presenting relevant data tables to users, while ranking them based on quality, recency, usage, and lineage.
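To make the ranking idea concrete, here is a minimal sketch of how a search might combine quality, recency, usage, and lineage into a single score. It is not Nemo's implementation: the `TableInfo` fields, the weights, and the normalization constants are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TableInfo:
    # All fields are hypothetical signals, not Nemo's actual schema.
    name: str
    description: str
    quality: float          # 0..1, e.g. from data-quality checks
    last_updated: date      # recency signal
    queries_last_30d: int   # usage signal
    downstream_count: int   # lineage signal: tables built on this one


def score(table, today, weights=(0.4, 0.2, 0.2, 0.2)):
    """Blend the four signals into one number (weights are assumed)."""
    age_days = (today - table.last_updated).days
    recency = 1.0 / (1.0 + age_days / 30.0)         # decays with age
    usage = min(table.queries_last_30d / 1000.0, 1.0)
    lineage = min(table.downstream_count / 50.0, 1.0)
    w_quality, w_recency, w_usage, w_lineage = weights
    return (w_quality * table.quality + w_recency * recency
            + w_usage * usage + w_lineage * lineage)


def search(tables, query, today):
    """Keep tables whose name or description matches, best score first."""
    terms = query.lower().split()
    matches = [
        t for t in tables
        if any(term in f"{t.name} {t.description}".lower() for term in terms)
    ]
    return sorted(matches, key=lambda t: score(t, today), reverse=True)


catalog = [
    TableInfo("tmp_users_v2", "Scratch copy of users", 0.3,
              date(2023, 6, 1), 5, 0),
    TableInfo("daily_active_users", "Curated count of daily active users",
              0.9, date(2024, 1, 30), 800, 20),
    TableInfo("ad_clicks", "Raw click events", 0.7,
              date(2024, 1, 29), 400, 10),
]
results = search(catalog, "users", today=date(2024, 1, 31))
print([t.name for t in results])
```

Even with an obscure name like `tmp_users_v2` in the catalog, the curated and heavily used table rises to the top, which is the behavior the ranking signals are meant to encourage.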
Airbnb's Perspective on Data Discovery
Similarly, Airbnb has experienced significant growth in both the volume and diversity of its data over recent years.
> "We've witnessed explosive growth in the volume of data and the number of internal resources: data tables, dashboards, reports, and metric definitions."
While this growth signifies their commitment to data-informed decisions, it has also led to new challenges. Data sources often vary in quality, complexity, relevance, and trustworthiness, making it difficult to find optimal data.
Airbnb identified two primary themes causing major issues in their data landscape:
- Navigating the data environment is complicated and often requires users to seek guidance on resource locations.
- Trusting the data is challenging due to insufficient context and metadata.
These issues have led users to avoid existing resources and create their own, thereby complicating the data landscape further. Additionally, data localization within specific teams resulted in a narrow understanding of data without broader context, leading to subpar visualizations. Permission rules further exacerbated data sharing and comprehension issues.
> "Comprehending the entire data ecosystem, from event log creation to visualization consumption, provides more value than the sum of its individual parts."
Airbnb's Solution: Dataportal
> "The primary aim of Dataportal is to democratize data and empower Airbnb employees to make data-informed decisions through exploration, discovery, and trust."
Dataportal is designed to offer a framework that enables all employees to easily find data and feel confident in its relevance and reliability.
The platform comprises four components:
- Search
- Context and Metadata
- Employee-centric data
- Team-centric data
The search feature allows users to explore logging schemas, data tables, dashboards, and employee/team information. Utilizing all available metadata, the platform builds context and trust. It employs PageRank with a network representation of the data ecosystem to deliver pertinent search results.
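For readers unfamiliar with PageRank, the sketch below runs a plain power-iteration version over a tiny graph of employees, dashboards, and tables. The node names, the edge meaning ("consumes" or "reads from"), and the damping factor are assumptions for illustration; Airbnb's actual graph and tuning are not described here.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of targets."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the baseline "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical ecosystem: an edge points from a consumer to what it uses.
ecosystem = {
    "analyst_1": ["dashboard_a", "bookings_table"],
    "dashboard_a": ["bookings_table"],
    "dashboard_b": ["bookings_table"],
    "bookings_table": [],
}
ranks = pagerank(ecosystem)
for node, value in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {value:.3f}")
```

The table that many dashboards and people depend on ends up with the highest rank, so it would surface first in search results.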
Recognizing the importance of context, Dataportal allows users to view resource creators, usage history, and creation/update timestamps. It also includes user profiles, enabling employees to search for tables interacted with by others. Team pages mirror this functionality, showcasing their data interactions and creations.
This innovative approach treats the data discovery platform like a social network, significantly clarifying Airbnb's data landscape as they scaled.
Uber's Approach to Data Discovery
Like Airbnb, Uber places great importance on data context for informed decision-making.
> "Big data alone isn't sufficient for extracting insights; it requires context to facilitate effective business decisions."
As Uber has expanded to handle 15 million trips daily, with 75 million active monthly users and 18,000 employees globally, their data complexity has also escalated. Uber initiated the development of a data discovery platform in 2015, starting with static HTML tables that were manually updated, which proved unsustainable.
> "At this scale and pace of growth, a solid system for discovering datasets and their metadata is essential for making data useful at Uber."
Uber's Solution: Databook
Databook automatically generates metadata about tables to provide context regarding data quality and significance. The platform focuses on four key elements:
- Extensibility: Simplified addition of new metadata, storage, and entities to tables.
- Accessibility: Programmatic access to all metadata.
- Scalability: Support for numerous concurrent read requests.
- Power: Capacity for read and write requests across multiple data centers.
Databook ingests data from various sources (Cassandra, Hive, Vertica, etc.), stores metadata, and presents it through RESTful APIs accessible via the Databook UI.
Because Databook does not require metadata to be visible in real time, its architecture can favor fast read throughput, which is a primary goal of the platform. The modular design separates the request-serving layer from the data collection layer, so the two can run independently of each other.
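The separation between collecting metadata and serving it can be sketched in a few lines. This is a simplified illustration, not Databook's code: `MetadataStore`, `collect`, `serve`, and the `hive_source` stub are all hypothetical names, and a real system would run collection on a schedule against actual data sources.

```python
import copy
import threading


class MetadataStore:
    """In-memory stand-in for the metadata storage layer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}

    def put(self, name, metadata):
        with self._lock:
            self._tables[name] = copy.deepcopy(metadata)

    def get(self, name):
        with self._lock:
            return copy.deepcopy(self._tables.get(name))


def collect(sources, store):
    """Collection layer: crawl every source and refresh the store."""
    for source in sources:
        for name, metadata in source().items():
            store.put(name, metadata)


def serve(store, name):
    """Serving layer: reads hit the store only, never the sources."""
    return store.get(name)


# Hypothetical source; a real crawler would query Hive, Vertica, etc.
hive_tables = {"trips": {"owner": "marketplace",
                         "columns": ["trip_id", "city_id"]}}


def hive_source():
    return hive_tables


store = MetadataStore()
collect([hive_source], store)

hive_tables["trips"]["columns"].append("fare")   # source changes
before = serve(store, "trips")                   # read is fast but stale
collect([hive_source], store)                    # next crawl catches up
after = serve(store, "trips")
print(before["columns"], after["columns"])
```

Reads never wait on a source system. The cost is that a change only becomes visible after the next crawl, which is acceptable when real-time visibility is not a requirement.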
Users can access metadata through a RESTful API or a visual interface, with search functionality powered by Elasticsearch, enabling searches across multiple dimensions such as name, owner, and column.
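As a rough illustration of searching across several dimensions, the snippet below builds an Elasticsearch `multi_match` query body that looks at name, owner, and column fields at once. The field names and the boost on `name` are assumptions; Databook's actual index mapping is not described here.

```python
def build_search_body(text, size=10):
    """Build a query body that searches several metadata fields at once.

    The field names are hypothetical; "name^3" boosts matches on the
    table name so they outrank matches on owner or column names.
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["name^3", "owner", "columns"],
            }
        },
    }


body = build_search_body("trips")
print(body)
```

A body like this would be sent to the `_search` endpoint of the index that holds the table metadata, either through an Elasticsearch client library or a plain HTTP request.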
Conclusion
As leading tech companies increasingly prioritize data-driven decision-making while scaling, the need for effective data discovery platforms grows. Manually managing information is impractical at such scales.
While Facebook, Airbnb, and Uber adopt distinct strategies to tackle the issue, they share common priorities:
- Data context is crucial throughout the process.
- Data discovery platforms are vital for making informed decisions on a large scale.
- Trustworthiness, relevance, and data recency are essential factors in data search rankings.
It is fascinating to observe how top-performing businesses address data challenges at such scale. If you found this information insightful, there are around 13 additional articles detailing how major tech companies are addressing data discovery challenges, along with other topics in machine learning and data science.
Thank you for engaging with this article.