Wikimedia is making its data more accessible by developing an AI-friendly database that makes searching easier for both users and AI developers.

Understanding Wikidata

Wikidata serves as a vital resource for information, housing a vast array of structured data that supports Wikipedia and other Wikimedia projects. Launched in 2012, Wikidata aims to provide a centralized repository for data that can be used across various languages and platforms. It contains more than 100 million entries, encompassing everything from historical figures to scientific concepts. Each entry is linked to various attributes, such as images, text, and keywords, making it a rich source of information.

For example, the late English writer Douglas Adams, best known for his 1979 book The Hitchhiker’s Guide to the Galaxy, has a detailed entry in Wikidata. Beyond basic facts like his birth sign and the cataloging number used by libraries worldwide, users can find a wealth of information about his works, influences, and contributions to literature. This information is stored in formats that are accessible to both humans and machines, including JSON, which is particularly useful for developers and AI systems.
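
To give a sense of what that machine-readable form looks like, here is a minimal sketch in Python. The sample is hand-written and heavily abbreviated; it only follows the general shape of Wikidata’s entity JSON (Q42 is the identifier of Adams’s entry, P31 means “instance of,” Q5 is “human,” and P569 is “date of birth”). The helper name `summarize_entity` is hypothetical and not part of any Wikidata library.

```python
import json

# Hand-written, abbreviated sample in the shape of Wikidata's entity JSON.
# A real entry carries many more labels, languages, and statements.
SAMPLE = """
{
  "entities": {
    "Q42": {
      "id": "Q42",
      "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
      "claims": {
        "P31": [
          {"mainsnak": {"snaktype": "value", "property": "P31",
                        "datavalue": {"type": "wikibase-entityid",
                                      "value": {"entity-type": "item", "id": "Q5"}}}}
        ],
        "P569": [
          {"mainsnak": {"snaktype": "value", "property": "P569",
                        "datavalue": {"type": "time",
                                      "value": {"time": "+1952-03-11T00:00:00Z"}}}}
        ]
      }
    }
  }
}
"""


def summarize_entity(raw, entity_id):
    """Reduce one entity to its English label and a property -> values map."""
    entity = json.loads(raw)["entities"][entity_id]
    claims = {}
    for prop, statements in entity["claims"].items():
        values = []
        for statement in statements:
            snak = statement["mainsnak"]
            if snak["snaktype"] != "value":
                continue  # skip "unknown value" / "no value" statements
            data = snak["datavalue"]
            if data["type"] == "wikibase-entityid":
                values.append(data["value"]["id"])    # link to another entry
            elif data["type"] == "time":
                values.append(data["value"]["time"])  # timestamp string
            else:
                values.append(data["value"])          # anything else, unmodified
        claims[prop] = values
    return {"id": entity["id"],
            "label": entity["labels"]["en"]["value"],
            "claims": claims}


summary = summarize_entity(SAMPLE, "Q42")
print(summary)
```

The point of the sketch is that every fact is a small, typed statement that points either at a plain value or at another entry, which is what makes the data easy for software to follow.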

The New AI-Friendly Database

Recently, Wikimedia announced the development of a new AI-friendly database designed to improve how large language models (LLMs) can access and utilize the data stored in Wikidata. This initiative stems from the Wikidata Embedding Project, led by Wikimedia Deutschland, which oversees the management and development of Wikidata. The Berlin-based team has spent the past year working on transforming the structured data of Wikidata into vectorized formats that capture the context and meaning of each entry.

Vectorization is a process that converts data into numerical representations, allowing AI systems to understand and process the information more effectively. In this new format, data can be visualized as a graph with interconnected dots, where each dot represents an entry, and the lines signify relationships between them. For instance, Douglas Adams would be linked to terms like “human,” “author,” and the titles of his books, creating a rich tapestry of interconnected information.
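
As a rough illustration of how numerical representations let software judge relatedness, the Python sketch below compares a few entries by cosine similarity, a common way to measure how closely two vectors point in the same direction. The three-dimensional vectors are made up by hand for this example; real embeddings have hundreds of dimensions and come from a trained model, and nothing here reflects the project’s actual data or code.

```python
import math

# Toy, hand-picked vectors for illustration only (not real embeddings).
TOY_VECTORS = {
    "Douglas Adams": [0.9, 0.8, 0.1],
    "author": [0.8, 0.9, 0.0],
    "human": [0.7, 0.5, 0.3],
    "volcano": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(name, vectors):
    """Rank every other entry by similarity to the named entry, best first."""
    query = vectors[name]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != name]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = nearest("Douglas Adams", TOY_VECTORS)
for other, score in ranking:
    print(f"{other}: {score:.3f}")
```

With these toy numbers, “author” and “human” rank well above “volcano,” mirroring the idea that related entries sit close together.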

Implications for AI Development

The primary goal of this project is to democratize access to high-quality data for AI developers, particularly those outside the realm of large tech companies. Lydia Pintscher, the Wikidata portfolio lead, emphasized that companies like OpenAI and Anthropic possess the resources to vectorize data effectively. However, smaller organizations often lack the same level of access and funding. “Really, for me, it’s about giving them that edge up and to at least give them a chance, right?” Pintscher stated.

This initiative aims to level the playing field, allowing smaller developers to create innovative applications that leverage the vast data stored in Wikidata. One example of a project that has successfully utilized Wikidata is Govdirectory, a platform that helps users find social media handles and email addresses for public officials worldwide. By harnessing the curated data from Wikidata, Govdirectory has made it easier for citizens to connect with their representatives.

Enhancing AI Systems

Most AI chatbots currently prioritize popular topics and keywords, often leading to a skewed representation of information. By providing easier access to Wikidata, the Wikimedia team hopes to foster the development of AI systems that can better reflect niche topics that might not be widely represented on the internet. Pintscher remarked that this could be a more effective way to integrate new information into AI systems like ChatGPT, rather than relying on the traditional method of generating content and waiting for the next training cycle.

Philippe Saadé, the Wikidata AI project manager, explained that the new vectorized format will allow AI systems to access not just the raw information but also the context surrounding it. This is crucial for developing more nuanced and accurate AI responses. For example, when querying about Douglas Adams, an AI system could provide richer, more contextualized information rather than just a list of facts.

Technical Aspects of the Project

The team utilized a model from Jina AI, a company specializing in AI and machine learning, to convert Wikidata’s structured data into vectors. This collaboration highlights the importance of partnerships in advancing AI capabilities. Additionally, the infrastructure for storing the vector database is provided by DataStax, a subsidiary of IBM, which is currently offering its services for free to support this initiative.

As the project progresses, the team is actively seeking feedback from developers who utilize the database. This feedback will be instrumental in refining the database and ensuring it meets the needs of its users. However, it is important to note that the current version of the database does not include entirely new information added to Wikidata over the past year. Saadé reassured users that minor edits or tweaks to existing entries would not significantly impact the database’s overall usefulness. “At the end of the day, the vector that we’re computing is like a general idea of an item, so if some small edit has been made on Wikidata, it’s not going to be super relevant,” he said.
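
Saadé’s point about small edits can be sketched in Python. The example below uses a deliberately crude stand-in embedding (a hashed bag of words), which is an assumption made purely for illustration and is not the Jina AI model the team used; the item descriptions are also invented for the example. It shows that appending one word to a description barely moves the vector, while an unrelated description lands far away.

```python
import math
import re
import zlib

DIM = 256  # arbitrary vector size for this toy embedding


def toy_embed(text):
    """Stand-in embedding: count words into hashed buckets (NOT a real model)."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Invented descriptions, used only to exercise the functions above.
original = ("Douglas Adams, human, English writer, author of "
            "The Hitchhiker's Guide to the Galaxy, published 1979")
minor_edit = original + ", humorist"  # a small tweak to the same item
unrelated = "stratovolcano on an island, erupts frequently"

sim_edit = cosine_similarity(toy_embed(original), toy_embed(minor_edit))
sim_unrelated = cosine_similarity(toy_embed(original), toy_embed(unrelated))
print(f"original vs. minor edit: {sim_edit:.3f}")
print(f"original vs. unrelated:  {sim_unrelated:.3f}")
```

A trained embedding model behaves differently in detail, but the same intuition applies: the vector summarizes the item as a whole, so one small change shifts it only slightly.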

Future Prospects

The introduction of an AI-friendly database marks a significant step forward for Wikidata and its mission to provide accessible knowledge to everyone. As AI technology continues to evolve, the need for high-quality, well-structured data becomes increasingly critical. By making it easier for developers to access and utilize Wikidata’s resources, Wikimedia is not only fostering innovation but also encouraging the development of AI systems that are more inclusive and representative of diverse topics.

Looking ahead, the Wikimedia team plans to continue refining the database and expanding its capabilities. They aim to incorporate user feedback and make necessary updates to ensure the database remains relevant and useful. As AI developers begin to leverage this new resource, it will be interesting to observe how it influences the landscape of AI applications and the types of information that become more readily available to users.

Conclusion

The development of an AI-friendly database by Wikimedia is a promising initiative that aims to enhance the accessibility of data for AI developers, particularly those from smaller organizations. By democratizing access to high-quality information, Wikimedia is paving the way for more innovative and inclusive AI applications. As the project unfolds, it will be crucial to monitor its impact on the AI landscape and the broader implications for how information is accessed and utilized in the digital age.