Mia Lykou Lund – Open Source Initiative

Open Source AI Definition – Weekly update July 15

Mia Lykou Lund — Mon, 15 Jul 2024 19:26:16 +0000

It has been quiet over the 4th of July weekend on the forums and OSI has been speaking at different events:

@stefano spoke on a panel at the UN event OSPOs for Good. Access the recording here.
@mer is speaking at Open Source Community Africa.
OSI was present at the Linux Foundation hosted AI_dev: Open Source GenAI & ML Summit Europe 2024. Read about the takeaways here.

Why and how to certify Open Source AI

@jberkus expresses concern about the extensive resources required to certify AI systems, estimating that it would take weeks of work per system. This scale makes it impractical for a volunteer committee like License Review.
@shujisado reflects on past controversies over license conformity, noting that Open Source AI has the potential for a greater economic impact than early Open Source. He acknowledges the need for a more robust certification process given this increased significance. He suggests that cooperation from the machine learning community or consortia might be necessary to address technical issues and monitor the certification process neutrally. He offers to help spread the word about OSAID within the Japanese ML/LLM development community.

@jberkus clarifies that the OSI would need full-time paid staff to handle the certifications, as the work cannot be managed by volunteers alone.

Open Source AI Definition – Weekly update July 1

Mia Lykou Lund — Mon, 01 Jul 2024 15:48:07 +0000

An open call to test OpenVLA

Last week @quaid suggested conducting a controlled experiment to determine if data information alone is sufficient to recreate an AI model with fidelity to the original. He shared insights from the OpenVLA project, noting its possible compliance with the requirements of draft v0.0.8 and suggesting a test suite to compare models created with full datasets versus data information.
- To this, @Stefano noted that there are also some master's students at CMU who are conducting similar experiments to “kick the tires” of the draft definition.
- @quaid proposed more precise criteria for evaluating model similarity, such as “functionally similar” or “practically similar”, and further suggested detailing the values sought from open datasets to improve the experiment’s framework.

Interesting research paper: “Rethinking open source generative AI: open-washing and the EU AI Act”

@hook has shared a research paper they found interesting and relevant, titled Rethinking open source generative AI: open-washing and the EU AI Act.
- This paper has been shared before by its author @mark and discussed in the context of whether the OSAID should include a partially open license. @mark argued that doing so would limit open washing and highlight the “degrees of openness”, stating that “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.”
- The authors also present their findings in a visualization of the degrees of openness of different systems.
  - We have discussed this point before and note that the OSAID will not be a partially open license but a binary one. See week 22 summary for the context of this discussion.

Open Source AI Definition Town Hall – June 28, 2024

We held our 12th town hall meeting last week. You can access the recording and slides here if you missed it. The town hall presented some ideas for the next draft of the Definition, making it clear that there is no agreement yet on the data information concept and that part is still subject to change.
A new town hall meeting is scheduled for Friday, July 12.

Open Source AI Definition – Weekly update June 24

Mia Lykou Lund — Mon, 24 Jun 2024 19:36:07 +0000

Explaining the concept of Data information

Following @stefano’s publication regarding why the OSI considers training data to be “optional” under the checklist in Open Source AI Definition, the debate has continued. Here are the main points:

Preferred Form of Modification

@hartmans states that finding agreement on the meaning of “preferred form of modification” depends on the user’s objectives. The disagreement may stem from different priorities in ranking the freedoms associated with open source AI, though he emphasizes prioritizing model weights for practical modifications. He suggested that data information could be more beneficial than raw data for understanding models and urged flexibility in AI definitions.
@shujisado highlighted that training data for machine learning models is a preferred form of modification but questioned whether it is the most preferred. He further emphasized the need for a flexible definition of preferred forms of modification in AI.
@quaid supported the idea of conducting controlled experiments to determine if data information alone is sufficient to recreate AI models accurately. He suggested practical steps for testing the effectiveness of data information and encouraged community participation in such experiments.
- @stefano added that some students at CMU will run this kind of experiment (testing whether the full training dataset is needed or whether data information is enough to recreate a model that can be tested for fidelity to the original) to test the definition.
@jberkus raised concerns about the practical assessment of data information and its ability to facilitate the recreation of AI systems. He questioned how to evaluate data information without recreating the AI system.
Practical Applications and Community Insights
- @hartmans proposed practical scenarios where data information could suffice for modifying AI models and suggested that the community’s flexibility in defining the preferred form of modification has been valuable for Debian.
- @quaid shared insights from his research on the OpenVLA project, noting its compliance with OSAID requirements. He further proposed conducting controlled experiments to verify if data information is enough to recreate models with fidelity.
General observations

@shujisado emphasized the need for flexible definitions in AI, drawing from open-source community experiences. He agreed on the complexity of training data issues and supported OSI's flexible approach to defining the preferred form of modification.
@quaid suggested practical approaches for evaluating data information and its adequacy for recreating AI models, and proposed further experiments and community involvement to refine the understanding and application of data information in open-source AI.

Are we evaluating Licenses or Systems?

@jberkus asked whether OSAID will apply to licenses or systems, noting that current drafts focus on systems. He questioned if a certification program for reviewing systems as open source or proprietary is the intended direction.
@shujisado confirmed that discussions are moving towards certifying AI systems and pointed at an existing thread. He emphasized the need for evaluating individual components of AI systems and expressed concern about OSI’s capacity to establish a certification mechanism, highlighting that it would significantly expand OSI’s role.

Open Source AI Definition – Weekly update June 17

Mia Lykou Lund — Mon, 17 Jun 2024 16:52:03 +0000

Explaining the concept of Data information

After much debate regarding training data, @stefano published a summary of the positions expressed and some clarifications about the terminology included in draft v.0.0.8. You can read the rationale about it and share your thoughts on the forum.
Initial thoughts:
- @Senficon (Felix Reda) adds that while the discussion has highlighted the case for data information, it’s crucial to understand the implications of copyright law on AI, particularly concerning access to training data. Open Source software relies on a legal element (copyright licenses) and an access element (availability of source code). However, this framework does not seamlessly apply to AI, as different copyright regimes allow text and data mining (TDM) for AI training but not the redistribution of datasets. This discrepancy means that requiring the publication of training datasets would make Open Source AI models illegal, despite TDM exceptions that facilitate AI development. Also, public domain status is not consistent internationally, complicating the creation of legally publishable datasets. Consequently, a definition of Open Source AI that requires releasing datasets would impede collaborative improvements and limit its practical significance. Emphasizing data information can help maintain Open Source principles without legal pitfalls.

Concerns and feedback on anchoring on the Model Openness Framework

@amcasari expresses concern about the usability and neutrality of the “Model Openness Framework” (MOF) for identifying AI systems, suggesting it doesn’t align well with current industry practices and isn’t ready for practical application without further feedback and iteration.
@shujisado points out that the MOF’s classification of components doesn’t depend on the specific IP laws applied, but rather on a general legal framework, and highlights that Japan’s IP law system differs from the US and EU, yet finds discussions based on the OSD consistent.
@stefano emphasizes the importance of having well-thought-out, timeless principles in the Open Source AI Definition document, while viewing the Checklist as a more frequently updated working document. He also supports the call to see practical examples of the framework in use and proposes separating the Checklist from the main document to reduce confusion.

Initial Report on Definition Validation

Reviews of eleven different AI systems have been published. We do these reviews to check existing systems' compatibility with our current definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
- @mer has set up a review sheet for the Viking model upon request from @merlijn-sebrechts.
- @anatta8538 asks if MLOps is considered within the topic of the Model Openness Framework and whether CLIP, an LMM, would be consistent with the OSAID.
- @nick clarifies that the evaluation focuses on components as described in the Model Openness Framework, which includes development and deployment aspects but does not cover MLOps as a whole.

Why and how to certify Open Source AI

@Alek_Tarkowski agrees that certification of open-source AI will be crucial under the AI Act and highlights the importance of defining what constitutes an Open Source license. He points out the confusion surrounding terms like “free and open source license” and suggests that the issue of responsible AI licensing as a form of Open Source licensing needs resolution. He notes that some restrictive licenses are gaining traction and may need consideration for exemption from regulation, and urges the community to reach a consensus.

Open Source AI Definition Town Hall – June 14, 2024

Slides and the recording of our previous town hall meeting can be found here.

Open Source AI Definition – Weekly update June 10

Mia Lykou Lund — Tue, 11 Jun 2024 21:40:15 +0000

Open Source AI needs to require data to be viable

With many different discussions happening at once, here are the main points:
- On the issue of training data
  - @mark is concerned that openness of AI is not meaningful without a focus on the training data: “Model weights are the most inscrutable component of current generative AI, and providers that release only [the weights] should not get a free ‘openness’ pass.”
  - @stefano agrees with all of that but questions the criteria used to assign green marks in Mark’s paper, pointing out inconsistencies. He uses the example of Pythia-Chat-Base-7, which relies on a dataset from OpenDataHub with potential issues like non-versioned data and stale links, failing to meet the stringent requirements proposed by @juliaferraioli. Similar concerns are raised for other models like OLMo 7B Instruct, which lack specific data versioning details. Maffulli also highlights the case of Pythia-7B, which may once have been compliant but is now problematic due to the unavailability of its foundational dataset, the Pile. This illustrates the complexity of maintaining an “open source” status over time if the stringent proposal suggested by @juliaferraioli and the AWS team is adopted.
  - @shujisado adds that while he sympathizes with @juliaferraioli’s request for datasets, @stefano’s arguments in support of the concept of “Data information” are aligned with the OSI principles and are reasonable.
  - @spotaws stresses that “data information” alone is insufficient if the data itself is too vague.
  - @juliaferraioli adds that while replicating AI systems like OLMo or Pythia may seem impractical due to costs and statistical nature, the capability is crucial for broader adoption and consistency. She finds the current definition to be unclear and subjective.
  - @zack recommends reviewing StarCoder2, recognizing that it would be in the same category as BLOOM: a system with lots of transparency and a dataset made available but released with a restrictive license.
  - @Ezequiel_Lanza joined the conversation in support of the concept of Data information, claiming, with technical arguments, that “sharing the dataset is not necessarily required and may not justify the potential risks associated with making it mandatory.”
- Partially open / restrictive licenses
  - Continuing @mark’s points regarding restrictive licenses (like the ethical licenses), @stefano has added a link to an article highlighting some reasons why OSI is staying away from these licenses.
  - @pchestek further adds that a partially open license would create even more opportunities for open washing, as “open source AI” could have many meanings.
  - @mark clarified that rather than proposing a variety of meanings, they are seeking to highlight the dimensions of openness in their paper, exploring the broader landscape.
  - @stefano adds that in its 26 years, OSI has contended with numerous organizations claiming varying degrees of openness as “open source.” This issue is now mirrored in AI, as companies seek the market value of being labeled Open Source. Open Source is binary: either users have full rights or they don’t, and any system that falls short is not Open Source AI, regardless of how “almost” open it is.
- Field of use/restriction
  - @juliaferraioli believes that OSAID should include prohibitions against field-of-use restrictions.
  - @shujisado adds that OSAID specifies four freedoms as requirements for being considered open source and that this should be understood as the same thing, since “freedom” is the same as “non-restricted”. The 10 clauses of the OSD have been replaced by the checklist in draft v0.0.8.
  - @juliaferraioli adds that individual components may be covered by their individual licenses, but the overall system may be subject to additional terms, which is why we need this to be explicit.

Initial Report on Definition Validation

@Mer has added an update on how far our analysis of systems against the current draft definition has progressed. Some points that remain incomplete have been highlighted.
Mistral (Mixtral 8x7B) is considered not in alignment with the OSAID because its data pre-processing code is not released under an OSI-approved license.

Can a derivative of non-open-source AI be considered Open Source AI?

@tarek_ziade shares his experience fine-tuning a “small” model (200M parameters) for a Firefox feature to describe images, using a base model for image encoding and text decoding. Despite not having 100% traceability of upstream data, Tarek argues that intentional fine-tuning and transparency make the new fine-tuned model open source. Any issues arising from downstream data can be addressed by the project maintainers, maintaining the model’s open source status.

Town hall recording out

We held our 10th town hall meeting a week and a half ago. You can access the recording here if you missed it.
A new town hall meeting is scheduled for this Friday, June 14.

Open Source AI Definition – Weekly update June 3

Mia Lykou Lund — Mon, 03 Jun 2024 18:27:11 +0000

Initial report on definition validation

A first draft of the report of the validation phase has been published. The validation phase is designed to review the compatibility of existing systems with the current draft definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phi-2, Pythia, and T5.
Problems and initial findings:
- Elusive documents: Not having system creators involved meant reviewers had to independently search for legal documents, resulting in many blanks in the document list and subsequent analysis.
- One component, many artifacts, and documents: Some components were linked to multiple artifacts and documents, complicating the review process as source code and documentation could be spread across several repositories and reports.
- Compounded components: Components in the checklist often combined multiple artifacts, such as training and validation code, making it difficult to track down specific legal documents.
- Compliant? Conformant? Six out of eleven required components need a legal framework that is “compliant” or “conformant” with the Open Source Definition, prompting a need for clearer guidance on reviewing non-software components.
- Reverting to the license: Reviewers suggested simplifying the process by relying on whether a legal document is OSI-approved, conformant, or compliant to guarantee the right to use, study, modify, and share the component, eliminating the need for independent assessment.
Next steps:
- As we are looking to fill in the gaps described above, we call on both system creators and independent volunteers to complete various system reviews.
- If a system you are familiar with is not on the list, contact Mer on the forum.
Initial questions and queries:
- @jasonbrooks asks if the validation process should check if there’s “sufficiently detailed information about the data used to train the system so a skilled person can recreate a substantially equivalent system.” It’s unclear if this has been confirmed, and examples of skilled individuals achieving this would be helpful.
  - @stefano replies that the Preferred form lists enduring principles, while the Checklist details required components. Validation ensures components like training methodologies and data provenance are available, enabling system recreation. Mer’s report highlights the difficulty in finding these components, suggesting a need for a better method. One idea is a detailed survey for AI developers, though companies like Meta might misuse the “Open Source” label. Public pressure may eventually deter such abuses.
- @amcasari adds insights into the process of reviewing licenses.

Open Source AI needs to require data to be viable

This week, the conversation shifted heavily toward the possibilities of creating a gradient approach to open licensing.
@Mark has shared that he is publishing a paper regarding open washing, the AI Act, and a case for a gradient notion of openness.
- In line with previous points mostly raised by @danish_contractor, Mark highlights the RAIL licenses and argues that they should count towards openness too, stating that “I think providers and users of LLMs should not be free to create oil spills in our information landscape and I think RAIL provides useful guardrails for that.”
- They also present their visualization of the degrees of openness of different systems.
@stefano has reiterated that the open-source AI definition will remain binary, just like the Open Source Definition is binary. Responding to @Mark and @danish_contractor, he linked to Kate Downing's legal analysis of the RAIL licensing framework.

Can a derivative of non-open-source AI be considered Open Source AI?

Answering @stefano’s earlier questions, @mark adds that it’s challenging to fine-tune a model without knowing the initial training data and techniques. Examples like Meta and Mistral fine-tunes show success despite the lack of transparency in the original training data. Intel’s Neural 7B and AllenAI’s Tulu 70B demonstrate effective fine-tuning with detailed disclosure of fine-tuning steps and data. However, these efforts can’t qualify as truly open AI systems due to the closed nature of the base models and potential legal liabilities.
@stefano closed the topic, stating that, based on feedback, “Derivatives of non-Open Source AI cannot be Open Source AI.”

Why and how to certify Open Source AI

@amscott added that AI developers will likely self-certify compliance with the OSAID, with objective certification needed for arbitration in nuanced cases. Like the OSD, the OSAID will mature through community practice. A simple self-certification tool could promote transparency and document good practices.
@mark added that the EU AI Act emphasizes “Open Source” systems, offering exemptions attractive to companies like Meta and Mistral. The AI Act requires disclosure templates overseen by an AI Office, potentially leading to intense lobbying efforts. If Open Source organizations influence regulation and certification, transparency may strengthen the Open Source ecosystem.

Question regarding the 0.0.8 definition

@Jennifer Ding asks why “information” is a focus for the data category but not for the code and model categories.
@Matt White adds that OSD-Conformant (in the checklist) should be defined somewhere.
- He further adds (to Data Information, under checklist) that many “open” models withhold various forms of data, making it unreasonable to expect model producers to release all the information necessary for full replication of the data pipeline if data is not a required component of the definition.
@Michael Dolan adds that “the use of OSD-compliant and OSD-conformant without any definitions of either term is difficult to parse the meaning of” and suggests some solutions.

OSAID at PyCon US

Missing a recap of how we got to where we are now? OSI was present at PyCon in Pittsburgh, where we held a workshop on our current definition and spoke with many knowledgeable stakeholders. You can read about it here.

Open Source AI Definition – Weekly update May 27

Mia Lykou Lund — Tue, 28 May 2024 08:41:43 +0000

Open Source AI needs to require data to be viable

@juliaferraioli and the AWS team have reopened the debate regarding access to training data. This comes in a new forum thread which mirrors concerns raised in a previous one. They argue that to achieve modifiability, an AI system must ship the original training dataset used to train it. Full transparency and reproducibility require the release of all datasets used to train, validate, test, and benchmark. For Ferraioli, data is equivalent to source code for AI systems, so its inclusion should not be optional. In a message signed by the AWS Open Source team, she proposed that original training datasets, or synthetic data with justification for non-release, be required to meet the Open Source AI standard.
@stefano added some reminders as we reopen this debate. These are the points to keep in mind:
- Abandon the mental map that makes you look for the source of AI (or ML) as that map has been driving us in circles. Instead, we’re looking for the “preferred form to make modifications to the system”
- The law in most jurisdictions around the world makes it illegal to distribute data, because of copyright, privacy and other laws. It’s also not clear how the law treats datasets, and it’s constantly changing.
- The text of draft 0.0.8 is vague on purpose regarding “Data information”. This is so that it can stand the test of time and technology changes.
- When criticizing the draft, please provide specific examples in your question, and avoid arguing in the abstract.
@danish_contractor argues that the current draft is likely to disincentivize openness: the community would view models such as BLOOM or StarCoder, which include usage restrictions to prevent harms, less favorably, even though they are more transparent, reproducible, and thus more “open” than models like Mistral.
@Pam Chestek clarified that Open Source has two angles: the rights to use, study, modify and share, coupled with those rights being unrestricted. Both are equally important.
This debate echoes earlier ones on recognizing open components of an AI system.

The FAQ page has been updated

The FAQ page is starting to take shape and we would appreciate more feedback. So far, we have preliminary answers to these questions:
- Why is the original training dataset not required?
- Why is the grant of freedoms to its users?
- What are the model parameters?
- Are model parameters copyrightable?
- What does “Available under OSD-compliant license” mean?
- What does “Available under OSD-conformant terms” mean?
- Why does the Open Source AI Definition include a list of components while the Open Source Definition for software doesn’t say anything about documentation, roadmap and other useful things?
- Why is there no mention of safety and risk limitations in the Open Source AI Definition?

Draft v0.0.8 Review from LLM360

@vamiller has submitted, on behalf of the LLM360 team, a review of their models. In his view, draft v0.0.8 reflects the principles of Open Source applied to AI. He asks about the ODC-By license, arguing that it is compatible with OSI’s principles but is a data-only license.

Join the next town hall meeting

The next town hall meeting will take place on May 31st at 3:00 pm – 4:00 pm UTC. We encourage all who can participate to attend. This week, we will delve deeper into the issues regarding access (or not) to training data.

Open Source AI Definition – Weekly update May 20

Mia Lykou Lund — Mon, 20 May 2024 14:43:57 +0000

A week loaded with important questions.

Overarching concerns with Draft v.0.0.8 and suggested modifications

A post signed by the AWS Open Source team raised important questions, illustrating a disagreement on the concept of “Data information.”

A detailed post signed by the AWS Open Source team raises concerns about the draft concept of Data information in v0.0.8 and other important topics. I suggest reading their post. The major points discussed this week are:
- The discussion on training data is not settled. The AWS Open Source team argues that for an Open Source AI Definition to be effective, the data used to train the AI system must be included, similar to the requirement for source code in Open Source software. They say the current definition marks the inclusion of datasets as optional, undermining transparency and reproducibility.
- Their suggestion: Use synthetic data where the inclusion of actual datasets poses legal or privacy risks.
  - Valentino Giudice takes issue with the phrase “For AI systems, data is the equivalent of source code,” and states that “equivalent” is used too liberally here. For trained models, the dataset isn’t necessary to understand the model’s operations, which are determined by architecture and frameworks.
    - Ferraioli disagrees, stating that “A trained model cannot be considered open source without the data, processing code, and training code. Comparing a trained model to a software binary, we don’t call binaries open source without the source code being available and licensed as open source.”
  - Zacchiroli adds that they support the suggestion to use “high quality equivalent synthetic datasets” when the original data cannot be released. Although “equivalent” remains undefined and could create loopholes, this issue doesn’t worsen OSAID.
- Other proposed modifications include:
  - Require Release of Dependent Datasets
    - Mandate the release of training, testing, validation, and benchmarking datasets under an open data license or high-quality synthetic data if legal restrictions apply.
    - Update the “Data Information” section to make dataset release a requirement.
  - Prevent Restrictions on Outputs
    - Prohibit restrictions on the use, modification, or distribution of outputs generated by Open Source AI systems.
  - Eliminate Optional Components
    - Remove optional components from the OSAID to maintain a high standard of openness and transparency.
  - Address Combinatorial Ambiguity
    - Ensure any license applied to the distribution of multiple components in an Open Source AI system is OSD-approved.

Why and how to certify Open Source AI

The post from the AWS team contained a comment about a certification process for Open Source AI that deserves a separate thread. There are pending questions to be answered:
- who exactly needs a certification that an AI system is Open Source AI?
- who is going to use such certification? Are any of the groups deploying open foundation models today thinking that they could use one? For what purpose?
- who is going to consume the information carried by the certification, why and how?
Zacchiroli adds that the need for certifying AI systems as OSAID compliant arises from inherent ambiguities in the definitions, such as terms like “sufficiently” and “high quality equivalent synthetic dataset.” Disagreements on compliance will require a judging authority, akin to OSI for the OSD. While managing judgments for OSAID might be more complex due to the potential volume, the community is likely to turn to OSI for such decisions.

Can a derivative of non-open-source AI be considered Open Source AI?

This question was asked on the draft document and moved to the forum for higher visibility. Is it technically possible to fine-tune a model without knowing the details of its initial training? Are there examples of successfully fine-tuned AI/ML systems where the initial training data and techniques were unknown but the fine-tuning data and methods were fully disclosed?
- Shuji Sado added that fine-tuning typically involves updating the weights of newly added layers and some layers of the pre-trained model, but not all layers, to maintain the benefits of pre-training.
- Valentino Giudice raised concerns over this point, as multiple strategies for fine-tuning exist, allowing for flexibility in updating weights in any number of existing layers without necessarily adding new ones. Even updating the entire network can be beneficial, as it leverages the pre-trained model’s information and can be more efficient than training a new model from scratch. Fine-tuning can slightly adjust the model’s performance or behaviour, integrating new data effectively.
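
To make the two positions above concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from the forum thread: the layer names, weights, gradients, and the `fine_tune_step` helper are all hypothetical, and a real fine-tune would use a machine learning framework. It shows one update step where only some layers are trainable (the partial strategy) next to one where every layer is updated (the full strategy).

```python
def fine_tune_step(weights, grads, trainable, lr=0.1):
    """Apply one gradient-descent step, updating only the layers in `trainable`.

    `weights` and `grads` map a (hypothetical) layer name to a list of floats.
    Layers not listed in `trainable` are treated as frozen and copied unchanged.
    """
    updated = {}
    for layer, values in weights.items():
        if layer in trainable:
            updated[layer] = [w - lr * g for w, g in zip(values, grads[layer])]
        else:
            updated[layer] = list(values)  # frozen: keeps its pre-trained values
    return updated


# Hypothetical pre-trained weights plus a newly added "head" layer.
pretrained = {"layer1": [0.5, -0.2], "layer2": [1.0, 0.3], "head": [0.0, 0.0]}
# Hypothetical gradients computed on the fine-tuning data.
grads = {"layer1": [0.1, 0.1], "layer2": [0.2, -0.1], "head": [1.0, -1.0]}

# Partial fine-tuning: the earliest layer stays frozen.
partial = fine_tune_step(pretrained, grads, trainable={"layer2", "head"})

# Full fine-tuning: every layer is updated, starting from the pre-trained values.
full = fine_tune_step(pretrained, grads, trainable=set(pretrained))
```

In both cases the starting point is the pre-trained weights, which is why the question of what is known about the original training matters for either strategy.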

If you are knowledgeable in this field, we would especially love to hear your thoughts!

Open Source AI Definition – Weekly update May 13

Mia Lykou Lund — Tue, 14 May 2024 15:08:25 +0000

Early thoughts on “Apple sample code license”?

Apple has released a license to distribute its new model, OpenELM. The license looks BSD/MIT-like with the exclusion of patents. In your view, does it seem OSD-compliant?
- Initial thoughts:
  - @pchestek added that the license appears to be similar to open source but raises concerns about potential limitations on rights, particularly regarding patents. She highlights Apple’s approach of granting only a copyright license, which might not be sufficient for ensuring all necessary freedoms, especially in the context of AI models.
  - @shujisado agreed, saying that the terms related to trademarks and patents need to be scrutinized.

Question regarding the 0.0.8 version

@Aspie96 asks clarifying questions regarding the list of open components and points out that, unlike “traditional” software, which can be released as open source software as easily as proprietary software, this definition seems to require many more components to be open.
- Stefano points out that “The “classic” Open Source Definition is applied to licenses, not to the software” and “if a program is shipped with a license approved by the OSI then the software is considered Open Source.”
- He further states that “Through the co-design process of the Open Source AI Definition we learned that to use, study, share and modify an ML system one needs a complex combo of multiple components each following diverse legal regimes (not just the usual copyright+patents.) Therefore we must describe in more details what is required to grant users the agency and control expected.”

The FAQ page is being developed

The frequently asked questions page is starting to take shape.
We are adding relevant questions that have arisen on the forums so far. If you have any contributions in mind, please leave a comment!

Open Source Initiative at PyCon!

This week, Stefano, Mer and the OSI team are visiting Pittsburgh, PA, hosting the first workshop of our Open Source AI Definition Roadshow! We are starting to get more in-person feedback on our draft definition.

If you are at PyCon, come visit us on the 17th, from 11 am to 1 pm in the Open Space area!

Open Source AI Definition – Weekly update May 6

Mia Lykou Lund — Mon, 06 May 2024 16:02:27 +0000

Definition validation: Seeking volunteers

The process has entered a new phase: We are now seeking volunteers to validate the Open Source AI Definition, using it to review existing AI systems. The objective of the phase is to confirm that the Definition works as intended and understand where it fails.

Volunteers are given a spreadsheet in which to locate and link to the license, research paper, or other document that grants rights or provides information for each required component.
Systems include, but are not limited to:
- Arctic
- BLOOM
- Falcon
- Grok
- Llama 2
- Mistral
- OLMo
- OpenCV
- Phi-2
- Pythia
- T5
To volunteer, please contact Mer on the forum by May 20th.

Summary of comments received on the Definition draft

Grammatical and wording corrections
- Some minor grammatical suggestions were made. These slightly change the wording and the order of the layout, though the overall message remains the same.
- One user suggested explaining what Open Source is under the “preamble” and “Why we need open source AI”. Instead of speaking about why Open Source is important, the section should rather be an introduction to what it is and why it matters for AI.
- Under “Preferred form to make modifications to machine-learning systems” and “data information”, clarification is needed regarding “the training data set used”. It is not clear whether this means that all training data must be open source for the whole model to qualify.
  - Stefano Maffulli added here that the intention is to know what dataset was used, not necessarily to have it made available, and that this indeed seems to need clarification.
Technical points
- Under “Preferred form to make modifications to machine-learning systems” the release of checkpoints is mentioned as an example of required components, under “model parameters”. An objection was raised, arguing that this poses an unnecessary burden: It’d be like requiring that for software to be Open Source, it should include past versions of the program.
  - Maffulli reiterated that this was merely an example, but noted that the point might need to be addressed on the FAQ page.
- Under “Preferred form to make modifications to machine-learning systems” and “data information”, a “skilled person” is mentioned in the context of requiring sufficient information about the training data used to create a model. A question was raised about what skill has to do with acquiring data.
  - Clarification was given by Maffulli, pointing out that this is in the context of getting information about the data so that a “skilled person” can use, study, share and modify the AI system.
  - A user suggested that this confusion can be resolved by changing the wording around “a skilled person can recreate” from “using the same or similar data” to “if able to gain access to the same or similar data”.
  - A user pointed out that “skilled person”, as a legal term used in patent law, might not be appropriate, as it has different legal connotations and precedents in different countries.
Discussion on why specifically we focus on machine learning (ML) as an AI system
- A question was raised regarding why we explicitly mention ML systems under “preferred form to make modification to an ML system” and subsequently in the “checklist”, pointing out that not all AI systems are ML.
  - Maffulli replied that ML systems are addressed specifically because they need special and urgent attention, whereas rule-based AI systems can fit under the existing Open Source Definition. This needs to be addressed in the FAQ.

Town hall announcement

The 9th town hall meeting was held on the 3rd of May. Access the recording here if you missed it!