Stefano Maffulli – Open Source Initiative

OSI at the United Nations OSPOs for Good

Stefano Maffulli — Wed, 24 Jul 2024 22:57:06 +0000

Earlier this month the Open Source Initiative participated in the “OSPOs for Good” event promoted by the United Nations in NYC. Stefano Maffulli, the Executive Director of the OSI, participated in a panel moderated by Mehdi Snene about Open Source AI alongside distinguished speakers Ashley Kramer, Craig Ramlal, Sasha Luccioni, and Sergio Gago. Please find below a transcript of Stefano’s presentation.

Mehdi Snene

What is Open Source in AI? What does it mean? What are the foundational pieces? How far along is the data? There is mention of weights, and data skills. How can we truly understand what Open Source in AI is? Today, joining us, we’ll have someone who can help us understand what Open Source in AI means and where we are heading. Stefano, can you offer your insights?

Stefano Maffulli

Thanks. We have some thoughts on this. We’ve been pondering these questions since they first emerged when GPT started to appear. We asked ourselves: How do we transfer the principles of permissionless innovation and the immense value created by the Open Source ecosystem into the AI space?

After a little over two years of research and global conversations with multiple stakeholders, we identified three key elements. Firstly, permissionless innovation needs to be ported to AI, but this is complex and must be broken down into smaller components.

We realized that, as developers, users, and deployers of AI systems, we need to understand how these systems are built. This involves studying all components carefully, being able to run them for any purpose without asking for permission (a basic tenet of Open Source), and modifying them to change outputs based on the same inputs. These basic principles include being able to share these modifications with others.

To achieve this, you need data, the code used for training and cleaning the data (e.g., removing duplicates), the parameters, the weights, and a way to run inference on those weights. It’s fairly straightforward. However, the challenge lies in the legal framework.

Now, the complicated piece: Open Source software has had a wonderful run because the legal framework that governs it is fairly simple and globally accepted. It’s built on copyright, a system that works in both directions. It gives exclusive rights to content creators, but the same mechanism can also be used to grant rights to anyone who receives the creation.

With data, we don’t have that mechanism. That is a very simple and dramatic realization. When we talk about data, we should pay attention to what kind of data we’re discussing. There is data as created content, and there is data as facts, like fires, speed limits, or traces of a road. Those are facts, and they are treated differently. There is also private data, personal information, and various other kinds of data, each with different rules and regulations around the world.

Governments’ major role in the future will be to facilitate permissionless innovation in data by harmonizing these rules. This will level the playing field, where currently larger corporations have significantly more power than Open Source developers or those wishing to create large language models. Governments should help create datasets, remove barriers, and facilitate access for academia, smaller developers, and the global south.

Mehdi Snene

We already have open data and Open Source. Now, we need to create open AI and open models. Are we bringing these two domains together and keeping them separate, or are we creating something new from scratch when we talk about open AI?

Stefano Maffulli

This is a very interesting and powerful question. I believe that open data as a movement has been around for quite a while. However, it’s only recently that data scientists have truly realized the value they hold in their hands. Data is fungible and can be used to build new things that are completely different from their original domains.

We need to talk more about this and establish platforms for better interaction. One striking example is a popular dataset of images used for training many image generation AI tools, which contained child sexual abuse images for many years. A research paper highlighted this huge problem, but no one filed a bug report, and there was no easy way for the maintainers of this dataset to notice and remove those images.

There are things that the software world understands very well, and things that data scientists understand very well. We are starting to see the need for more space for interactions and learning from each other.

The conversation is extremely complicated. Alex and I have had long discussions about this. I don’t want to focus entirely on this, but I do want to say that Open Source has never been about pleasing companies or specific stakeholders. We need to think of it as an ecosystem where the balances of power are maintained.

While Open Source software and Open Source AI are still evolving, the necessary ingredients—data, code, and other components—are there. However, the data piece still needs to be debated and finalized. Pushing for radical openness with data has clear drawbacks and issues. It’s going to be a balance of intentions, aiming for the best outcome for the general public and the whole ecosystem.

Mehdi Snene

Thank you so much. My next question is about the future. What are your thoughts on the next big technology?

Stefano Maffulli

From the perspective of open innovation, it’s about what’s going to give society control over technology. The focus of Open Source has always been to enable developers and end-users to have sovereignty over the technology they use. Whether it’s quantum computers, AI, or future technologies, maintaining that control is crucial.

Governments need to play a role in enabling innovation and ensuring that no single power becomes too dominant. The balance between the private sector, public sector, nonprofit sector, and the often-overlooked fourth sector—which includes developers and creators who work for the public good rather than for profit—must be maintained. This balance is essential for fostering an ecosystem where all stakeholders have equal interests and influence.

If you would like to listen to the panel discussion in its entirety, you can do so here (the Open Source AI panel starts at 1:00:00 approximately).

Explaining the concept of Data information

Stefano Maffulli — Fri, 14 Jun 2024 13:53:28 +0000

There seems to be some confusion caused by the concept of Data information included in the draft v0.0.8 of the Open Source AI Definition. Some readers may have seen the original dataset included in the list of optional components and quickly jumped to the wrong conclusions. This post clarifies how the draft arrived at its current state, the design principles behind the Data information concept and the constraints (legal and technical) it operates under.

The objective of the Open Source AI Definition

The objective of the Open Source AI Definition is to replicate in the context of artificial intelligence (AI) the principles of autonomy, transparency, frictionless reuse, and collaborative improvement for end users and developers of AI systems. These are described in the preamble.

Following the preamble is the definition of Open Source AI, an adaptation of the definition of Free Software (also known as “the four freedoms”) to AI nomenclature. The preamble and the four freedoms have been co-designed over several meetings and public discussions, online and in-person, and have not recently received significant comments.

The Free Software definition specifies that a precondition to the freedom to study and modify a program is to have access to the source code. Source code is defined as “the preferred form of the program for making changes in.” Draft v0.0.8 contains a description of what’s necessary to enjoy the freedoms to study and modify an AI system. This new section titled Preferred form to make modifications to machine-learning systems has generated a heated debate.

What is the preferred form to make modifications

The concept of “preferred form to make modifications” focuses on machine learning systems because these systems require data and training to produce a working system. Other AI systems are more easily classifiable as software and don’t require a special definition.

The system analysis phase of the co-design process revealed that studying and modifying machine learning systems requires data, code for training and inference, and model parameters. For the parameters, there’s no ambiguity: an Open Source AI must make them available under terms that respect the Open Source principles (no field-of-use restrictions, no discrimination against people, etc.). For the data and code requirements, the text in the “preferred form to make modifications” section is longer and harder to parse, which has generated some confusion.

The intent of the code and data requirements is to ensure that end users, deployers and developers of an Open Source AI system have all the tools and instructions to recreate that AI system from scratch, to satisfy the freedoms to study and modify the system. At a high level, it makes sense to suggest that training datasets must be released under permissive licenses for a system to be Open Source AI.

However, on closer examination, it became clear that sharing the original datasets is full of traps. It actually puts Open Source at a disadvantage compared to opaque and proprietary AI systems.

The issue with data

Data is not software: The legal landscape for data is much wider than copyright. Aggregating large datasets and distributing them internationally is an endless nightmare that includes privacy laws, copyright, sui-generis rights, patents, secrets and more. Without diving deeper into legal issues, let’s focus on practical examples to clarify why the distribution of the training dataset is not spelled out as a requirement in the concept of Data information.

The Pile, the open dataset used to train the very open Pythia models, was taken down after an alleged copyright infringement, currently being litigated in the United States. However, the Pile appears to be legal to share in Japan. It’s also unclear whether it can be legally shared in the European Union.
DOLMA, the open dataset used to train the very open OLMo models, was initially released with a restrictive license. It later switched to a permissive one. On further inspection, DOLMA appears to suffer from the same legal uncertainties as the Pile; however, the Allen Institute has not been sued yet.
Training techniques that preserve privacy like federated learning don’t create datasets.

All these cases show that requiring the original datasets creates vagueness and uncertainty in applying the Open Source AI Definition:

If a dataset is only legal in Japan, is that AI Open Source only in Japan?
If a dataset is initially legally available but later retracted, does the AI go from being Open Source to not?
- If so, what happens to the applications that use such AI?
If no dataset is created, then will any AI trained with such techniques ever be Open Source?

Additionally, there are reasons to believe that proprietary systems from OpenAI, Anthropic and others have been trained on the same questionable data found in the Pile and DOLMA. Proving that’s the case is a lot harder and more expensive, though. This is clearly a disincentive to being open and transparent about data sources, and it adds a burden on the organizations that try to do the right thing.

As a solution to these questions, draft v0.0.8 contains the concept of Data information, coupled with code requirements, to obtain the expected result: end users, developers and deployers of AI systems being able to reproduce an Open Source AI.

Understanding the concept of Data Information

Data information, in the draft Open Source AI Definition, is defined as:

Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.

Read that from the end: The intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data. That means that an Open Source AI must disclose all the ingredients, where they’ve been bought and all the instructions to prepare the dish.

This is a solution that came out of the co-design process, where reviewers didn’t rank the training datasets as high as they ranked the training code and data transparency requirements.

Data information and the code requirements also address all of the questions around the legality of distributing data and datasets, or their absence.

If a dataset is only legal in Japan or becomes illegal later, one should still be able to recreate a dataset suitable to train an equivalent system replacing the illegal or unavailable pieces with similar ones.

AI systems trained with federated learning (where a dataset isn’t created) can still be Open Source AI if all instructions and code are released so that a new training with different data can generate an equivalent system.

The Data information concept also solves an example (raised on the forum) of an AI system trained on data licensed directly from Reddit. In this case, if the original developers released enough information to allow another AI developer to recreate a substantially equivalent system with Reddit data taken from an existing dataset, like CommonCrawl, it would be considered Open Source AI.

The proposed alternatives

While generally well received, draft v0.0.8 has been criticized by a few people on the forum for putting the training dataset in the “optional requirements”. Some suggestions and pushback we’ve received:

Require the use of synthetic data when the training dataset cannot be legally shared: This technique may work in some corner cases, if the technology evolves to be reliable enough. It’s expensive and untested at scale.
Classify as Open Source AI systems where all their components are “open source”: This approach is not rooted in the longstanding practice of the GNU project to accept system library exceptions and other compromises in exchange for more Open Source tools.
Datasets built by crawling the internet are the equivalent of theft, they shouldn’t be allowed at all, let alone allowed in Open Source AI: This pushback ignores the reality that large data aggregators have already legally acquired the rights to accumulate that same data (through scraping and terms of use) and are trading it, exclusively capturing the economic value of what should be in the commons. Read Towards a Books Data Commons for AI Training for more details. There is no general agreement that text and data mining is equivalent to theft.

These demands and suggestions are hard to accept. We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones. We need a Definition that contains positive examples from the start so we can practically demonstrate positive qualities to policymakers.

The discussion about data, how to generate incentives to create datasets that can be distributed internationally, safely, preserving privacy, is extremely complex. It can be addressed separately from the Open Source AI Definition. In collaboration with Open Future Foundation and others, OSI is designing a series of conferences to tackle the data governance issue. We’ll make an announcement soon.

Have your say now

The concept of Data information and code requirements is hard to grasp at first. But the preliminary results of the validation phase confirm that draft v0.0.8 works as expected: Pythia and OLMo would both be Open Source AI, while Falcon, Grok, Llama and Mistral would not (even if they used OSD-compatible licenses) because they don’t share Data information. BLOOM and StarCoder would fail because of field-of-use restrictions in their models.
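The validation outcome described above can be read as a checklist applied to each system. The following is only a minimal sketch of that reading, not the tooling used in the validation phase: the `SystemReview` fields, the function names and the three example entries are hypothetical, and only the criteria named in this post (Data information, training code, parameters under Open Source terms, no field-of-use restrictions) are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemReview:
    """Hypothetical record of what a reviewer found for one AI system."""
    name: str
    shares_data_information: bool        # enough detail to recreate an equivalent system
    shares_training_code: bool           # code used for training and data processing
    parameters_osd_compatible: bool      # weights released under Open Source terms
    has_field_of_use_restrictions: bool  # e.g. usage limits attached to the model


def failed_criteria(review: SystemReview) -> list[str]:
    """Return the criteria a system fails; an empty list means it passes."""
    failures = []
    if not review.shares_data_information:
        failures.append("Data information not shared")
    if not review.shares_training_code:
        failures.append("training code not shared")
    if not review.parameters_osd_compatible:
        failures.append("parameters not under Open Source terms")
    if review.has_field_of_use_restrictions:
        failures.append("field-of-use restrictions")
    return failures


def is_open_source_ai(review: SystemReview) -> bool:
    """Binary verdict: every criterion must be met."""
    return not failed_criteria(review)


# Made-up entries for illustration; they do not describe any real system.
examples = [
    SystemReview("example-open", True, True, True, False),
    SystemReview("example-weights-only", False, False, True, False),
    SystemReview("example-restricted", True, True, True, True),
]

for r in examples:
    print(r.name, is_open_source_ai(r), failed_criteria(r))
```

In this sketch a permissive license on the weights alone is not enough: the second entry fails for the same kind of reason given above for systems that don’t share Data information.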

Data information can be improved but it’s better than other solutions proposed so far. As we get closer to the release of the stable version of the Open Source AI Definition, we need to hear from you: If you support this concept, please comment on the forum today. If you don’t support it, please try to propose an alternative that at least covers the practical examples of the Pile, DOLMA and federated learning above. Help the community move the conversation forward.

Contributions of Open Source to AI: a panel discussion at CPDP-ai conference

Stefano Maffulli — Tue, 04 Jun 2024 09:00:00 +0000

I participated as a panelist at the CPDP-ai 2024 conference in Brussels last week where we discussed the significant contributions of Open Source to AI and highlighted the specific properties that differentiate Open Source AI from proprietary solutions. Representing the Open Source Initiative (OSI), the globally recognized non-profit that defines the term Open Source, I emphasized the longstanding principle of granting users full agency and control over technology, which has been proven to deliver extensive social benefits.

Below is a glimpse at the questions and answers posed to me and my fellow panelists:

Question: Stefano, please explain what the contribution to AI from Open Source is, and if there are specific properties of Open Source AI that make a difference for the users and for the people who are confronted with its results.

Response: The Definition of Open Source Software has existed for over 25 years, but it doesn’t apply to AI. The Open Source Definition for software provides a stable north star for all participants in the digital ecosystem, from small and large companies to citizens and governments.

The basic principle of the Open Source Definition is to grant to the users of any technology full agency and control over the technology itself. This means that users of Open Source technologies have self-sovereignty of the technical solutions.

The Open Source Definition has demonstrated that massive social benefits accrue when you remove the barriers to learning, using, sharing and improving software systems. There is ample evidence that giving users agency, control and self-sovereignty of their technical choices produces a viable ecosystem based on permissionless innovation. Multiple studies by the EU Commission and Harvard researchers have assigned significant economic value to Open Source Software, all based on that single, clear, understood and approved Definition from 26 years ago.

For AI, and especially the most recent machine learning solutions, it’s less clear how society can maintain self-sovereignty of the technology and how to achieve permissionless innovation. Despite the fact that many people talk about Open Source AI, including the AI Act, there is no shared understanding of what that means, yet!

The Open Source Initiative is concluding a global, multi-stakeholder co-design process to find an unequivocal definition of Open Source AI, and we’re approaching the end of this process with a vastly increased knowledge of the machine learning space. The current draft of the Open Source AI Definition recognizes that in order to study, use, share and modify AI, one needs to refer to an AI system, not a single component. The global process has identified the components required for society to maintain control of the technology, and these are:

Detailed information about the dataset used to train the system and the code so that a skilled person can train a system with similar capabilities
All the libraries and tools used to run training and inference
The model architecture and the parameters, like weights and biases

Having unrestricted access to all these elements is what makes an AI an Open Source AI.

We’re in the final stretch of the process, starting to gather support for the current draft of the definition.

The most controversial part of the discussion is the role of data in the training. To answer your question about the power of big foreign tech companies, putting aside the hardware requirements, the data is where the fight is. There seem to be two views of the world on data when it comes to AI: One thinks that text and data mining is basically strip mining humanity and all accumulation of data without consent of the rights holders must be made illegal. Another view of the world is that text and data mining for the purpose of training Open Source AI is probably the only antidote to the superpowers of large corporations. These camps haven’t found a common position yet. Japan seems to have made up its mind already, legalizing unrestricted text and data mining. We’ll see where the lawsuits in the US will go, if they ever get to a decision in court or, as I suspect, they will be settled out of court.

In any case, data, competence and to some extent hardware, are the levers to control the development of AI.

Open Source has been leveling the playing field of technologies. We know from past experience with Open Source software that giving people unrestricted access to the means of digital production enables tremendous economic value. This worked in Europe as well as in China. We think that Open Source AI can have the same effect of generating value while leaving control of the technology in the hands of society.

Question: Big tech companies are important for the development of AI. Apart from the purely technological impacts, there is also economic importance. The European Commission has been very concerned about the Digital Single Market recently, and has initiated legislation such as DSA and DMA to improve competition and market access. Will these instruments be sufficient in view of AI roll-out, thinking also of the recently adopted AI Act? Or will additional attention need to be paid?

Response: Open is the best antidote to the concentration of power. That said, I see these laws as the sticks, and they are very necessary. I’d love us to also think about carrots. We don’t want to repeat the mistakes made in the early years of the internet. Open Source software was equally available in the US and Europe but, despite that, the few European champions of Open Source haven’t grown big enough to have a global impact. And some of the biggest EU companies aren’t exactly friendly with Open Source either.

Chinese companies have taken a different approach. But in Europe we have talent, and we have an attractive quality of life, so we can attract even more. Finding money is never an issue. We need to remove the disincentives to growing our companies bigger, widen access to the internal EU market and support their international expansion, too.

For example, we need to review European Regulation 1025 on standardization to accommodate Open Source. Regulation 1025 was written at a time when Open Source was considered a “business model” and information and communication technology standards were about voltages in a wire. Today, Open Source is between 80% and 90% of all software and “digital elements” comprise some part of every modern product. Even hardware solutions are dominated by “digital elements.” As such, the approach taken by Regulation 1025 is out of date and most likely needs a root-and-branch rethink to properly apply to the world today and the world we anticipate tomorrow.

We need to make sure that the standardization rules required by the Cyber Resilience Act are written together with Open Source champions so the rules don’t exclusively favor the cartel of European patent holders who seek rent instead of innovating. Europe has all the means to be at the center of AI innovation: it embodies the right values of diversity and collaboration.

Closing remarks: We think that Open Source is the best antidote to fight market concentration in AI. Data is where the concentration of power is happening now and it’s in the hands of massive corporations: not only Google, Meta, Amazon, Reddit but also Sony, Warner, Netflix, Getty Images, Adobe … All these companies have already gained access to massive amounts of data, legally. These companies basically own our data, legally: Our pictures, the graph of our circles of friends, all the books and movies…

There is a risk that, if we don’t write policies that allow text and data mining in exchange for a real Open Source AI (one that society can fully control), we will leave the most powerful AI systems in the hands of the oligopoly that can afford to trade money for access to data.

Exploring openness in AI: Insights from the Columbia Convening

Stefano Maffulli — Thu, 23 May 2024 12:00:00 +0000

Over the past year, a robust debate has emerged regarding the benefits and risks of open sourcing foundation models in AI. This discussion has often been characterized by high-level generalities or narrow focuses on specific technical attributes. One of the key challenges—one that the OSI community is addressing head on—is defining Open Source within the context of foundation models.

A new framework is proposed to help inform practical and nuanced decisions about the openness of AI systems, including foundation models. The recent proceedings from the Columbia Convening on Openness in Artificial Intelligence, made available for the first time this week, are a welcome addition to the process.

The Columbia Convening brought together experts and stakeholders to discuss the complexities and nuances of openness in AI. The goal was not to define Open Source AI but to illuminate the multifaceted nature of the issue. The proceedings reflect the February conversations and are based on the backgrounder text developed collaboratively with the working group.

One of the significant contributions of these proceedings is the framework for understanding openness across the AI stack. The framework summarizes previous work on the topic, analyzes the various reasons for pursuing openness, and outlines how openness varies in different parts of the AI stack, both at the model and system levels. This approach provides a common descriptive framework to deepen a more nuanced and rigorous understanding of openness in AI. It also aims to enable further work around definitions of openness and safety in AI.

The proceedings emphasize the importance of recognizing safety safeguards, licenses, and documents as attributes rather than components of the AI stack. This evolution from a model stack to a system stack underscores the dynamic nature of the AI field and the need for adaptable frameworks.

These proceedings are set to be released in time for the upcoming AI Safety Summit in South Korea. This timely release will help maintain momentum ahead of further discussions on openness at the French summit in 2025.

We’re happy to see collaboration of like-minded individuals in discussing and solving the varied problems associated with openness in AI.

Why datasets built on public domain might not be enough for AI

Stefano Maffulli — Tue, 07 May 2024 10:00:00 +0000

There is tension between copyright laws and large datasets suitable to train large language models. Common Corpus is a dataset that only uses text from copyright-expired sources to bypass the legal issues. It’s a useful achievement, paving the path to research without immediate risk of lawsuits. I also fear that this approach may lead to bad policies, reinforcing the power of copyright holders: not the small creators but large corporations.

A dataset built on public domain sources

In March 2024 Common Corpus was released as an open access dataset for training large language models (LLMs). Announcing the release, the lead developer Pierre-Carl Langlais said “Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.” The dataset contains 500 billion words in multiple European languages and from different cultural heritages. It is a project coordinated by the French startup Pleias, supported by organizations committed to open science such as Occiglot, EleutherAI and Nomic AI, and partly funded by the French government. The stated intention of Common Corpus is to democratize access to large quality datasets. It has many other positive characteristics, highlighted also by Open Future’s summary of a talk given by Langlais.

The commons needs more data

The debates sparked by the Deep Dive: AI process on the role of training data highlighted that AI practitioners encounter many obstacles assembling datasets. At the same time, we discovered that tech giants have an incredible advantage over researchers and startups. They’ve been slurping data for decades, have the financial means to go to court and can enter into bilateral agreements to license data. These strategies are inaccessible to small competitors and academics. Accepting that the only path to creating open large datasets suitable to train Open Source AI systems is to use sources in the public domain risks cementing the dominant positions of existing large corporations.

The open landscape already faces issues with big tech and its ability to influence legislation. The big corporations have lobbied to extend the duration of copyright, pushed for the DMCA, are opposing the right to repair, and have the resources to continue lobbying and to sue any new entrant they deem to be getting too close. There are plenty of examples showing an unequal advantage in protecting what they think is theirs. The non-profit Fairly Trained certifies companies “willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain,” respecting copyright law. Who’s going to benefit from this approach?

Unsuitable for public policies

Initiatives like Common Corpus and The Stack (used to train StarCoder2) are important achievements as they allow researchers to develop new AI systems while mitigating the risk of being sued. They also push the technical boundaries of what can be achieved with smaller datasets that don’t require a nuclear power plant to train new models. But I think they mask the underlying issue: AI needs data and limiting open datasets to only public domain sources will never give them a chance to match the size of the proprietary ones. The lobby for copyright maximalists is always looking for ways to expand scope and extend terms for copyright laws, and when they succeed it is a one-way ratchet. It would be a tragedy for society if legislators listened to their sophistry and made new laws doing this based on the apparent consensus that creators need protection from AI.

The role of data for training machine learning systems is a divisive topic and a complex one. Having datasets like Common Corpus is a very useful way for the science of AI to progress with better sources. For policies, we’d be better off pushing for something like the proposal advanced by Open Future and Creative Commons in their paper Towards a Books Data Commons for AI Training.

OSI participates in Columbia Convening on openness and AI; first readouts available

Stefano Maffulli — Thu, 04 Apr 2024 13:47:00 +0000

I was invited to join Mozilla and the Columbia Institute of Global Politics in an effort that explores what “open” should mean in the AI era. A cohort of 40 leading scholars and practitioners from Open Source AI startups and companies, non-profit AI labs, and civil society organizations came together on February 29 at the Columbia Convening to collaborate on ways to strengthen and leverage openness for the good of all. We believe openness can and must play a key role in the future of AI. The Columbia Convening took an important step toward developing a framework for openness in AI with the hope that open approaches can have a significant impact on AI, just as Open Source software did in the early days of the internet and World Wide Web.

This effort is aligned with, and contributes valuable knowledge to, the ongoing process to find the Open Source AI Definition.

As a result of this first meeting of the Columbia Convening, two readouts have been published: a technical memorandum for technical leaders and practitioners who are shaping the future of AI, and a policy memorandum for policymakers with a focus on openness in AI.

Technical readout

The Columbia Convening on Openness and AI Technical Readout was edited by Nik Marda with review contributions from myself, Deval Pandya, Irene Solaiman, and Victor Storchan.

The technical readout highlighted the challenges of understanding openness in AI. Approaches to openness fall into three categories: gradient/spectrum, criteria scoring, and binary. The OSI is championing a binary approach to openness, where AI systems are either “open” or “closed” based on whether they meet a certain set of criteria.

The technical readout also provided a diagram that shows how the AI stack may be described by the different dimensions (AI artifacts, documentation, and distribution) of its various components and subcomponents.

Policy readout

The Columbia Convening on Openness and AI Policy Readout was edited by Udbhav Tiwari with review contributions from Kevin Klyman, Madhulika Srikumar, and myself.

The policy readout highlighted the benefits of openness, including:

Enhancing reproducible research and promoting innovation
Creating an open ecosystem of developers and makers
Promoting inclusion through open development culture and models
Facilitating accountability and supporting bias research
Fostering security through widespread scrutiny
Reducing costs and avoiding vendor lock-in
Equipping supervisory authorities with necessary tools
Making training and inference more resource-efficient, reducing environmental harm
Ensuring competition and dynamism
Providing recourse in decision-making

The policy readout also showcased a table with the potential benefits and drawbacks of each component of the AI stack, including the code, datasets, model weights, documentation, distribution, and guardrails.

Finally, the policy readout provided some policy recommendations:

Include standardized definitions of openness as part of AI standards
Promote agency, transparency and accountability
Facilitate innovation and mitigate monopolistic practices
Expand access to computational resources
Mandate risk assessment and management for certain AI applications
Hold independent audits and red teaming
Update privacy legislation to specifically address AI challenges
Update the legal framework to distinguish the responsibilities of different actors
Nurture AI research and development grounded in openness
Invest in education and specialized training programs
Adapt IP laws to support open licensing models
Engage the general public and stakeholders

You can follow along with the work of Columbia Convening at mozilla.org/research/cc and the work from the Open Source Initiative on the definition of Open Source AI at opensource.org/deepdive.

Letter to U.S. Commerce Secretary Raimondo urging protection of openness and transparency in AI

Stefano Maffulli — Mon, 25 Mar 2024 18:18:21 +0000

The Open Source Initiative (OSI) contributed, along with other members of civil society and academia, to a letter drafted by Mozilla and the Center for Democracy & Technology (CDT) asking the White House and Congress to exercise great caution when considering whether and how to regulate the publication of open models.

The letter demonstrates how openness allows collaborative efforts to build, shape and test AI for the benefit of all, and speaks of the need for policy, technology and advocacy in creating a better future through trustworthiness and accountability in AI innovation. The letter highlighted three broad points of consensus about openness and transparency in AI:

Open models can provide significant benefits to society, and policy should sustain and expand these benefits.
Policy should be based on clear evidence of marginal risks that open models pose compared to closed models.
Policy should consider a wide range of solutions to address well-defined marginal risks in a tailored fashion.

The letter was sent today, March 25, 2024, in advance of the Department of Commerce’s comment deadline on AI models which closes March 27. You can read the letter below and at CDT’s website.

Civil-Society-Letter-on-Openness-for-NTIA-Process-March-25-2024 Download

Results of 2024 elections of OSI board of directors

Stefano Maffulli — Tue, 19 Mar 2024 19:34:23 +0000

The polls just closed and the results are in. Congratulations to the returning directors Thierry Carrez and Josh Berkus, and the newly elected director Chris Aniszczyk.

Thierry Carrez has been confirmed and joins as a director elected by the Affiliate organizations. Chris Aniszczyk and Josh Berkus collected the votes of the Individual members.

The OSI thanks all of those who participated in the 2024 board elections by casting a ballot and asking questions of the candidates. We also want to extend our sincerest gratitude to all of those who stood for election. We were once again honored with an incredible slate of candidates who stepped forward from across the Open Source software community to support the OSI’s work and advance the OSI’s mission. The 2024 nominees were, again, remarkable: experts from a variety of fields and technologies with diverse skills and experience gained from working across the Open Source community. We hope the entire Open Source software community will join us in thanking them for their service and their leadership. We’re better off because of their contributions and commitment, and we thank them.

Next steps

The board of directors has formalized the election results in an ad-hoc meeting and invited the newly elected director to the onboarding meeting.

The complete election results

OSI Affiliate directors elections 2024

There were 6 candidates competing for 1 seat. The number of voters was 38 and there were 38 valid votes and 0 empty ballots.

Counting votes using Scottish STV.

Winner is Thierry Carrez.

Details from affiliates elections.

OSI Individual directors elections 2024

There were 11 candidates competing for 2 seats. The number of voters was 158 and there were 158 valid votes and 0 empty ballots.

Counting votes using Scottish STV.

Winners are Chris Aniszczyk and Josh Berkus.

Details from individuals elections.
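
For readers unfamiliar with STV counting, a candidate is elected once they reach a quota of votes. The snippet below is a minimal sketch, assuming the Droop quota commonly used under Scottish STV rules (valid votes divided by seats plus one, fractions dropped, plus one). The function name is hypothetical and the quota figures are computed here for illustration; they are not taken from the published results.

```python
def droop_quota(valid_votes: int, seats: int) -> int:
    """Votes a candidate needs to be elected, assuming the Droop quota."""
    # Integer division drops the fraction before adding one.
    return valid_votes // (seats + 1) + 1


# Affiliate election: 38 valid votes, 1 seat.
print(droop_quota(38, 1))
# Individual election: 158 valid votes, 2 seats.
print(droop_quota(158, 2))
```

Under this assumption, a single seat requires a simple majority of valid votes, while each of two seats requires just over a third.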

A candid conversation on The Changelog Podcast about defining Open Source AI, and what is really at stake

Stefano Maffulli — Tue, 05 Mar 2024 06:00:00 +0000

I was recently invited to join hosts Adam Stacoviak and Jerod Santo on The Changelog podcast. The Changelog features deep technical reviews and conversations about the most recent news in the world of software, and this was the first time anyone from the OSI has appeared on the show.

After introducing the Open Source Initiative, we discussed the challenges of not only defending the Definition itself, but the idea that we need a Definition at all. And I was able to explain the complicated nature of being a global nonprofit organization defending the Open Source Definition for over 25 years.

I outlined the three programs that comprise the work of the OSI (legal and licensing, policy and standards, and advocacy and outreach), at which point we dove right into the project that falls under the last of these programs: the Open Source AI Definition.

Open Source AI is not the same as Open Source software. This reality led to the Deep Dive: AI project, now in year 3, in which OSI is collaborating with some of the largest corporations, researchers, creators, foundations and others.

The Changelog hosts asked a lot of great questions and we had a candid and productive conversation. I hope you’ll follow the link to listen to the full episode: Changelog Interviews: What exactly is Open Source AI?

As I shared with Adam and Jerod, I’m hosting bi-weekly discussions on the status of the project and we’ve put together a forum for public input, so if you are interested in learning more about this or contributing, you are welcome to join us at discuss.opensource.org.

New risk assessment framework offers clarity for open AI models

Stefano Maffulli — Tue, 27 Feb 2024 17:45:30 +0000

There is a debate within the AI community around the risks of widely releasing foundation models with their weights and the societal impact of that decision. Some are arguing that the wide availability of Llama2 or Stable Diffusion XL is a net negative for society. A position paper released today shows that there is insufficient evidence to effectively characterize the marginal risk of these models relative to other technologies.

The paper was authored by Sayash Kapoor of Princeton University, Rishi Bommasani of Stanford University, myself, and others, and is directed at AI developers, researchers investigating the risks of AI, competition regulators, and policymakers who are challenged with how to govern open foundation models.

This paper introduces a risk assessment framework to be used with open models. This resource helps explain why the marginal risk is low in some cases where we already have evidence from past waves of digital technology. It reveals that past work has focused on different subsets of the framework with different assumptions, serving to clarify disagreements about misuse risks. By outlining the necessary components of a complete analysis of the misuse risk of open foundation models, it lays out a path to a more constructive debate moving forward.

I hope this work will support a constructive debate where risks of AI are grounded in science and today’s reality, rather than hypothetical, future scenarios. This paper offers a position that balances the case against open foundation models with substantiated analysis and a useful framework on which to build. Please read the paper and leave your comments on Mastodon or LinkedIn.