Major topics in open source and recommendations

In this chapter, we dive into the core aspects of the open source ecosystem, which is not only about software but also involves a rich tapestry of legal, operational, and cultural dimensions. We explore six main topics:

1. Licence considerations

The Open Source Initiative (https://opensource.org/licenses/) provides detailed information about all the officially approved open source licences. Even so, choosing among them can be tricky, because it means balancing multiple competing interests and considerations such as:

  1. Long-term implications: Your choice is often permanent and affects how your project can be used, modified, and distributed.
  2. Legal complexity: Different licences have subtle but important differences in terms of rights, obligations, and protections they provide.
  3. Trade-offs: You need to balance:
    • Project adoption vs. control.
    • Commercial potential vs. open source principles.
    • Protection of your rights vs. user freedom.
  4. Ecosystem considerations: Your choice needs to be compatible with:
    • Dependencies you are using (the components and libraries your project builds on) — see the sketch after this list.
    • Platforms you are targeting (operating systems, clouds, CRAN, GitHub, …).
    • Community expectations.
    • Local/regional considerations (such as the EUPL for European countries).
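
To make the dependency dimension concrete, here is a minimal Python sketch that lists the licences declared by the packages installed in an environment, a first step in a compatibility review. It assumes packages declare a License field or licence classifiers in their metadata; dedicated tools such as pip-licenses go further:

    # List the declared licences of installed Python dependencies as a
    # first step in a licence-compatibility review.
    from importlib.metadata import distributions

    for dist in sorted(distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
        name = dist.metadata["Name"]
        licence = dist.metadata["License"] or "not declared"  # absent fields come back as None
        classifiers = [c for c in (dist.metadata.get_all("Classifier") or [])
                       if c.startswith("License ::")]
        print(f"{name}: {licence} {classifiers}")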

One important grouping is between “viral” and “permissive” licences. Viral/copyleft licences such as the GPL and EUPL require any derivative work to be released under the same licence, so the licence requirements cascade through the entire project and its derivatives. Permissive licences such as MIT and Apache, by contrast, allow the code to be used in any way, including in closed-source projects, so downstream users can choose their own licence. The business impact differs accordingly: copyleft licences guarantee ongoing openness but can make commercial adoption more challenging, while permissive licences are more business-friendly but allow others to create closed-source versions if desired.

The choice of open source licence varies within the official statistics community, as highlighted by the licence usage analysis [1] of tools in the Awesome List (see Case Studies: Awesome List). The most popular are GPL (approx. 50%), MIT (approx. 20%) and EUPL (approx. 15%); an individual organisation’s choice is influenced by multiple factors.

For example, Istat mainly uses the EUPL 1.1 licence (although some R tools developed by Istat are available under the GPL), for the following reasons:

  1. The EUPL is the only licence with an official Italian translation.
  2. The EUPL is the only licence “approved” and sponsored by European regulation, and it is therefore legally valid in Italy.

SIS-CC releases the .Stat Suite (see Case Studies: SIS-CC) under the MIT licence. They chose a permissive licence for the following reasons:

  1. International collaboration: The MIT licence facilitates easy sharing and collaboration between statistical offices and organisations across different countries.
  2. Public sector alignment: Many statistical organisations are public sector entities, and the MIT licence aligns well with government open source policies.
  3. Integration flexibility: Statistical systems often need to integrate with various existing tools and systems - the MIT licence makes this legally straightforward.
  4. Community building: The permissive nature of the MIT licence encourages participation from a wide range of contributors.

SORS is currently defining its open source policy (see Case Studies: SORS) and plans to use a permissive licence such as Apache. As SORS wants to build a community of NSOs working on its software tools, its reasons for choosing a permissive licence are similar to those of SIS-CC.

2. Standards

Open source software (OSS) has strong connections with the adoption of open standards in different fields. The presence of established standards facilitates the adoption of OSS, while the use of OSS, in turn, promotes the broader adoption and development of those standards. Here are some of the standards that play important roles:

Programming languages:

  • Open source programming languages, in particular R, have been instrumental in the adoption of open source policies within national statistical offices (NSOs) by providing a trusted, versatile platform specifically tailored for statistical analysis and data processing. The ecosystem of R packages [2] supports a wide range of statistical and data science tasks, making it possible for NSOs to replace or complement proprietary software with reliable, community-validated tools. Additionally, the R community’s emphasis on reproducible research fosters a standardised approach to analysis and reporting, which enhances collaboration and long-term sustainability in NSOs’ statistical workflows.
  • Consistency and Compatibility: Standardising on widely-used programming languages such as R and Python [3] ensures compatibility across various statistical and data science tools. These languages have extensive libraries and frameworks that simplify the integration of different tools and applications, making it easier for organisations to share and collaborate on code.
  • Community and Ecosystem Support: Popular open source languages have large, active communities that continually expand and improve their ecosystems. This community support ensures regular updates, a wealth of resources, and broad compatibility, which is particularly valuable for organisations relying on open source tools.
  • Training and Knowledge Sharing: Adopting standardised programming languages streamlines training for new staff and facilitates knowledge sharing within and between organisations. Staff trained using Python or R can more easily adapt to similar environments, promoting a more unified skill set across NSOs, academia and other public agencies.

Data formats

  • Interoperability Across Systems: Standard data formats like CSV [4], JSON [5], and XML [6] enable data to be shared and processed by multiple systems without extensive reformatting. Using open, well-documented data formats ensures that datasets remain accessible and usable across different platforms and applications, fostering smooth data exchange.
  • Long-Term Accessibility and Data Preservation: Open data formats reduce the risk of data obsolescence and vendor lock-in. CSV and JSON, for instance, can be easily accessed and parsed by any software, which is crucial for long-term data usability. Open formats also ensure that future tools can read and interpret historical datasets without conversion issues.
  • Data Quality and Validation: Standardised data formats often come with tools or built-in capabilities for data validation, helping organisations maintain high data quality and consistency. For example, the JSON and XML formats allow schema definitions that can be used to enforce data structure and integrity, as sketched below.
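
As an illustration of schema-based validation, the following Python sketch checks a record against a JSON Schema before ingestion. The schema and record are hypothetical, and the third-party jsonschema package is assumed to be installed:

    # Validate a record against a JSON Schema before ingestion
    # (pip install jsonschema).
    from jsonschema import validate, ValidationError

    # Hypothetical schema for a simple statistical observation.
    schema = {
        "type": "object",
        "properties": {
            "region": {"type": "string"},
            "year": {"type": "integer", "minimum": 1900},
            "value": {"type": "number"},
        },
        "required": ["region", "year", "value"],
    }

    record = {"region": "EU27", "year": 2023, "value": 42.5}

    try:
        validate(instance=record, schema=schema)
        print("record conforms to the schema")
    except ValidationError as err:
        print(f"invalid record: {err.message}")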

Exchange protocols

  • Global Compatibility and Integration: Exchange protocols such as SDMX are specifically designed for statistical data, ensuring interoperability across statistical organisations globally. By adhering to these protocols, statistical offices can integrate their data with international platforms and share information in a form compatible with other countries and organisations, facilitating global data collaboration and analysis. The SDMX website provides a rich list of software tools [7] to support SDMX [8] implementers and developers. Standard authentication and authorisation protocols such as OAuth [9] can also support the integration of systems.
  • Efficient Data Sharing and Real-Time Access: Standard data exchange protocols, such as SDMX, RESTful APIs [10], and OData (the Open Data Protocol [11]), allow data to be accessed, shared, and updated in real time. These protocols make it easy for systems to communicate and exchange data quickly, which is essential for large-scale data dissemination and integration across different statistical platforms (see the sketch after this list).
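
As an illustration, the following Python sketch retrieves a series over an SDMX 2.1 REST API using the requests package. The ECB endpoint and the EXR series key are used purely as an example, and the exact media type accepted can vary by provider:

    # Query a public SDMX 2.1 REST endpoint for a data series.
    import requests

    BASE = "https://data-api.ecb.europa.eu/service"  # example provider
    url = f"{BASE}/data/EXR/D.USD.EUR.SP00.A"        # daily USD/EUR reference rate
    resp = requests.get(
        url,
        headers={"Accept": "application/vnd.sdmx.data+json"},  # SDMX-JSON; support varies by provider
        params={"lastNObservations": 5},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.text[:500])  # inspect the start of the SDMX-JSON response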

ModernStats standards

  • By adhering to UNECE ModernStats standards (e.g., GSBPM, GSIM), NSOs can build modular, compatible open source solutions that support efficient, standardised workflows and foster collaboration within the global statistical community.
  • Open source tools designed to align with GSBPM can fit seamlessly into existing workflows, as they adhere to recognised standards for data processing, validation, and analysis. A common process model also helps different statistical packages adopt compatible levels of granularity.
  • Using GSIM as a common reference for metadata platforms increases compatibility and eases the integration of software packages used in the different phases of the statistical process.

3. Culture

The transition to open source software requires cultural shifts in organisations, particularly where closed-source, proprietary offerings have long been the norm. In addition, the adoption of open source principles demands a certain ethos and culture of work within the organisation.

Several common misconceptions about OSS contribute to hesitancy and resistance within organisations:

  • OSS is of lower quality because it is free (on the rationale that things that are paid for must be of higher quality or value).
  • OSS is not seen to have the same level of support as that provided by vendors of proprietary software, who are responsible for ensuring their products’ quality and providing support if problems arise.
  • OSS can be changed by anyone, so the tools lack reliability and consistency.
  • OSS is vulnerable to security threats, whereas proprietary software is safer given the commercial concerns of vendors (even though many organisations already rely on tools and software built on OSS, such as cloud computing solutions and security software).

These misconceptions, combined with existing organisational practices, contribute to cultural barriers and concerns that arise during the transition to OSS, such as:

  • Risk aversion: Where tools and software already exist, change is perceived as risky, especially by long-standing staff and operational management, who may consider the production of outputs to be at risk.
  • Resource concerns: NSOs may face challenges in allocating budgets for implementation, integration, and ongoing maintenance, underestimating the advantages that can come from shared development with other developers. Change generally requires resourcing and an active decision to pursue it; it is easier to let passive indecision perpetuate the status quo.
  • Inexperience with the wider OSS community, and with the large-scale adoption of OSS across technical spheres.
  • Trepidation about the capability required to maintain an internal code base, manage versions of tooling adopted from OSS, or maintain a tool itself.
  • Preference for stability and control: Dependency on proprietary tools may create a sense of comfort. Open source tools often focus on evolving features and flexibility, which can seem less stable or controlled to NSOs.

Addressing these concerns is critical for NSOs to build a sustainable OSS culture across the organisation. In the following sections, we discuss some of the key challenges and strategies to address them through knowledge building (Section 4), establishing governance (Section 5) and adopting security practices (Section 6).

4. Knowledge building

Below are general recommendations for national statistical offices (NSOs) about knowledge sharing, based on the use cases provided and focusing on capacity building, documentation, and training:

Capacity building through participating in collaborative networks:

  • Encourage open source familiarity: Promote the adoption of widely-used open source tools like R and Python by integrating them into the organisation’s workflows. Use structured training programs to make the staff familiar with these tools and their applications in statistical work.
  • Establish centres of excellence: Create dedicated teams or units within the NSO to specialise in open source tools and methodologies. These teams can act as internal consultants, providing technical support and sharing expertise across departments.
  • Leverage community collaboration: Actively participate in international and regional communities such as SIS-CC or open source forums to build institutional knowledge by learning from other NSOs’ experiences. Collaborative knowledge sharing accelerates capacity building and reduces redundant efforts.
  • Leverage partnerships with academic institutions and international organisations to enhance knowledge transfer, ensuring access to the latest methodologies and technologies.

Example: SIS-CC facilitates capacity building by aligning members around shared tools and standards like SDMX, promoting collective development and mutual learning.

Comprehensive and accessible documentation:

  • Develop and maintain comprehensive and consistent documentation for all statistical processes, methodologies, and software implementations. Use frameworks like GSIM to ensure structured documentation, enhancing clarity and usability for internal and external stakeholders.
  • Where possible, share documentation publicly to foster transparency and collaboration. This includes sharing metadata standards, data schemas, and technical manuals for open source tools used or developed by the NSO.
  • Regularly update technical manuals and process documentation to ensure continuity, especially when staff turnover occurs. NSOs should prioritise institutionalising knowledge over reliance on individual expertise.
  • Use open metadata standards such as DDI and SDMX to document statistical processes and data workflows comprehensively, enabling consistent understanding and reuse across organisations.

Example: The IST metadata-driven system developed by SORS supports standardised documentation, enhancing internal and regional knowledge sharing for effective statistical production.

Ongoing training and skill development:

  • Offer regular training programs to build staff capacity in open source tools and modern statistical methodologies, ensuring teams remain adept at using and maintaining evolving technologies and methodologies.
  • Design training to address varied skill levels, from foundational workshops for beginners to advanced sessions for specialists, emphasising practical applications in statistical workflows. Modular training ensures staff across departments can develop relevant skills at their own pace.
  • For NSOs transitioning from proprietary to open source systems, implement mandatory training to ensure staff can effectively use and adapt to new tools. Istat’s systematic training for R users is an effective example.
  • Collaborate with international organisations, universities, or private sector experts to access specialised training and knowledge-sharing opportunities.

Promote a culture of peer-to-peer knowledge exchange:

  • Encourage knowledge sharing between employees through internal forums, mentorship programmes, and cross-departmental collaboration, and organise workshops, hackathons, or collaborative projects where staff work together on open source solutions, sharing insights and expertise. Such initiatives help disseminate expertise and bridge skill gaps across the organisation.
  • Create opportunities for staff to present their innovations and solutions internally and externally, fostering an environment where shared contributions are valued and rewarded.

Example: The Data Cleaning ecosystem at Statistics Netherlands thrives on community contributions, enabling the exchange of modular, reusable tools that expand collective expertise.

Develop centralised knowledge repositories:

  • Implement centralised knowledge repositories to store and share documentation, training materials, FAQs, case studies and user feedback, making these resources readily accessible to all staff and external collaborators. This ensures institutional knowledge is preserved.
  • Use platforms like wikis, shared drives, or open source platforms to manage and distribute this information, ensuring continuity even when staff turnover occurs.

Example: The Awesome List of Official Statistics Software acts as a shared resource for NSOs, providing a curated repository of open source tools and their applications.

Support open knowledge sharing practices externally:

  • Actively contribute to open source communities by sharing code, documentation, and lessons learned, and by participating in relevant international, regional and national expert meetings (e.g., the Use of R in Official Statistics conferences [12]). This practice not only helps the broader community but also positions the NSO as a leader in statistical innovation.

Example: SIS-CC’s .Stat Academy shares knowledge through online training focused on data modelling and SDMX. The availability of free training resources covering the statistical lifecycle acts as a powerful stimulus for the diffusion of open standards and software tools in the official statistics community.

5. Governance

The two governance models

In the case studies chapter we saw different governance models used by the projects; we can analyse them through the Cathedral and Bazaar models defined by Eric S. Raymond in his influential essay, The Cathedral and the Bazaar.

The Cathedral model is a structured, centralised approach in which a core team of developers maintains control over the project, releasing updates after extensive internal testing to ensure stability and polish. The process is more closed, with limited involvement from the broader community until a version is ready for public release. This model emphasises planning, longer development cycles, and top-down decision-making.

In contrast, the Bazaar model is a decentralised, open approach that thrives on community participation and rapid iteration. Code is developed transparently, with contributors working simultaneously, often releasing small updates frequently. Decisions are made collaboratively, encouraging contributions from anyone, which allows the project to evolve quickly. This model embraces an agile, bottom-up development style that prioritises community feedback and adaptability over centralised control.

Turning to our use cases, on one side we have the SIS-CC community (see Case Studies: SIS-CC), organised around a Strategic Level Group, a Management Level Group and an Architecture Task Force, where NSOs can choose among three different levels of participation and development is supported by a defined vision and a set of training courses available on the web.

At the other extreme we have communities like the Awesome List or the Data Cleaning community (see Case Studies: Data Cleaning and Case Studies: The Awesome List), where users propose their own packages, there is no predefined strategy or development plan, and each user is free to test the products and possibly participate in their development. The community is held together only by a shared interest in software that can be used in the processes of official statistics organisations.

At the level of individual NSOs we can also see two different governance models. The Serbian one (see Case Studies: SORS) is more “structured”: the software released is controlled by the “official” structure, and developments are managed by the IT department in collaboration with users. At Istat (see Case Studies: Istat), released packages are managed by a group of methodologists who publish them on public platforms and follow their development on a largely voluntary basis.

Recommendations on governance of open source software

Below we list some recommendations on governance that are valid for all models:

Define clear objectives for governance:

  • Establish a governance framework tailored to the NSO’s specific needs and objectives, ensuring alignment with statistical priorities such as data quality, security, and compliance.
  • Define which aspects of software development require centralised control (Cathedral) and which can benefit from community-driven contributions (Bazaar).

Encourage stakeholder engagement:

  • Develop mechanisms to engage statisticians, IT experts, and external contributors in governance discussions.
  • Use forums, workshops, and collaborative platforms to gather input on governance policies and software priorities.
  • Establish a supportive environment for external contributors by providing clear guidelines, documentation, and mentorship opportunities.

Apply governance models differently as needed:

  • Apply the Cathedral model to critical components such as data security, metadata management, and compliance tools while using the Bazaar model for auxiliary tools and add-ons.
  • Distributed autonomy: delegate governance of less critical projects or extensions to trusted community members while retaining oversight over core functions.

Establish quality assurance processes:

  • Introduce robust quality assurance mechanisms to ensure all software contributions meet high standards for statistical accuracy, usability, documentation, and security.
  • Require peer reviews for contributions to Bazaar-style projects and maintain a centralised team for final validation (see the sketch below).
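
A minimal sketch of what such an automated check might look like in Python with pytest; the weighted_mean function stands in for any contributed statistical routine and is purely illustrative:

    # A contributed routine and the tests a reviewer might require
    # before merging; run with pytest (pip install pytest).
    import math
    import pytest

    def weighted_mean(values, weights):
        """Weighted arithmetic mean; inputs must be non-empty and of equal length."""
        if not values or len(values) != len(weights):
            raise ValueError("values and weights must be non-empty and of equal length")
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)

    def test_equal_weights_reduce_to_plain_mean():
        assert math.isclose(weighted_mean([1.0, 2.0, 3.0], [1, 1, 1]), 2.0)

    def test_mismatched_inputs_are_rejected():
        with pytest.raises(ValueError):
            weighted_mean([1.0, 2.0], [1.0])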

Risk management:

  • Develop a comprehensive risk management strategy to address potential issues such as untested contributions, security vulnerabilities, and dependency risks.
  • Regularly audit community-driven code for vulnerabilities and integrate automated testing tools into the development pipeline.

Knowledge building and sharing:

  • Provide training for all NSO staff on OSS governance, development practices, and community engagement.
  • Facilitate the exchange of experiences, challenges, and solutions among NSOs to improve OSS governance.
  • Create shared resources, such as guidelines for adopting and managing OSS, tailored to the needs of statistical organisations.

Transparency in governance:

  • Publish governance policies, decision-making processes, and project roadmaps on the web to ensure transparency and trust.
  • Clearly communicate the criteria for adopting, rejecting, or modifying community contributions.

Leverage international standards and guidelines:

  • Ensure that OSS governance aligns with international statistical standards, such as GSBPM, GSIM, and SDMX, to maintain consistency and interoperability.
  • Collaborate with international organisations to standardise governance approaches and share collective expertise.

6. Security

Security by obscurity (sometimes called security by obfuscation) refers to the practice of relying on the secrecy of system details, such as source code or algorithms, to protect software or systems from threats. This practice provides a false sense of security: it does not address underlying vulnerabilities and can be circumvented by skilled attackers who reverse-engineer the code.

Open source software takes a different approach to security. The source code is publicly available, allowing anyone to inspect, modify, and enhance it. This transparency can lead to more robust security for several reasons:

  • Many eyes principle: The idea that “given enough eyeballs, all bugs are shallow” (called Linus’s Law) suggests that the more people who review the code, the more likely vulnerabilities will be discovered and fixed. This collaborative effort can lead to more secure software.
  • Community contributions: Open source projects benefit from contributions from a global community of developers. This diverse input can lead to more innovative security solutions and faster identification of vulnerabilities.
  • Transparency: Open source software promotes transparency, which builds trust among users. Users can verify the security of the software themselves or rely on community reviews and audits.
  • Rapid response: Open source communities can quickly respond to security issues. Once a vulnerability is identified, patches and updates can be developed and distributed rapidly.
  • Peer review: Open source projects often undergo rigorous peer review processes. Code changes are scrutinised by multiple developers, reducing the likelihood of introducing new vulnerabilities.

Open source is no more a security risk than proprietary software, and because the code is open to inspection there are grounds for arguing that it is in fact more secure. What it does share with proprietary software is a potential vulnerability to supply chain attacks, in which attackers target the packages (or services) a product imports or uses rather than the product itself; those dependencies may themselves be proprietary or open source. Defending against such attacks is too large a topic for this document, but using OSS in place of proprietary software requires no security measures beyond those an organisation should be following anyway.

Below are general recommendations for NSOs regarding security issues in adopting and developing open source software, with some references to the use cases presented. While these recommendations are interconnected, we present them separately to provide clarity and focus on specific aspects of security. Further discussion of security related issues can be found in Annex 2: SWOT analysis of OS adoption in NSOs.

Recommendations on secure adoption of open source software

  • Security assessments:
    • Perform regular security audits of open source software before adoption. Assess for vulnerabilities in dependencies, outdated libraries, and compliance with data protection laws (e.g., GDPR) or standards (e.g., ISO/IEC 27001). Where possible, automate package vulnerability scans in a CI/CD pipeline or internal package repository.
    • Tools: In the R-based ecosystem, managing third-party dependencies is critical; ensure packages are well-maintained and secure. Tools like SonarQube for static analysis or Snyk for dependency scanning can help identify vulnerabilities early.
  • Establish secure deployment practices:
    • Use containerisation tools like Docker to isolate open source applications and ensure consistent and secure environments.
    • Provide guidance on best practices for securely configuring and maintaining systems, as highlighted by the Awesome List use case.
    • Consider adopting Configuration as Code (CaC) tools such as Ansible or Terraform to automate and enforce secure configurations.
  • Ensure data confidentiality:
    • Implement anonymisation and pseudonymisation techniques for sensitive data before processing it with open source tools. This is particularly important for platforms handling confidential statistical data (see the pseudonymisation sketch after this list).
    • Tools: Use tools like ARX for data anonymisation or OpenDP for implementing differential privacy to ensure data confidentiality.
  • Leverage open source trustworthiness:
    • Use software with permissive and well-documented licences (e.g., Apache 2.0 or MIT), as chosen by SORS and SIS-CC in the related case studies, ensuring no hidden restrictions or licensing issues compromise data security.
  • International cooperation among NSOs:
    • Enhance open source security by enabling shared expertise in code reviews, vulnerability identification, and the implementation of best practices. The collective effort ensures thorough scrutiny of software, reduces security risks, and promotes alignment with international standards.
    • By pooling resources and collaborating, NSOs build more robust and resilient open source solutions that benefit the global statistical community.
    • Tools: Shared vulnerability databases such as the Common Vulnerabilities and Exposures (CVE) catalogue can be leveraged to track and mitigate security threats across borders.
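
As referenced under data confidentiality above, here is a minimal Python sketch of keyed pseudonymisation: HMAC-SHA-256 with a secret key yields stable pseudonyms that cannot be reversed, or recomputed by anyone without the key. The record layout and key handling are purely illustrative:

    # Pseudonymise a direct identifier before data leaves the secure zone.
    import hmac
    import hashlib

    SECRET_KEY = b"replace-with-a-key-from-a-secrets-manager"  # illustrative; never hard-code in production

    def pseudonymise(identifier: str) -> str:
        digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]  # truncated for readability

    records = [{"person_id": "19850312-1234", "income": 38200}]  # hypothetical record
    safe = [{**r, "person_id": pseudonymise(r["person_id"])} for r in records]
    print(safe)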

Recommendations on secure development of open source software

  • Embed security in the development lifecycle:
    • Adopt practices such as DevSecOps to integrate security at every stage of the software development lifecycle. This ensures that vulnerabilities are addressed proactively rather than reactively, as emphasised by SIS-CC’s centralised review and merge process.
    • Utilise tools like Jenkins with security plugins, and integrate automated security scanning in CI/CD pipelines.
  • Implement contributor governance:
    • Use Contributor Licence Agreements (CLAs) or similar governance tools to ensure all contributions to open source projects are secure, vetted, and legally compliant. This strategy is effective for ensuring that external contributions to tools like IST remain secure.
  • Limit dependency risks and conduct supply chain security:
    • While open source software presents numerous security benefits, it is susceptible to supply chain attacks. This involves attackers targeting external dependencies rather than the software itself.
    • Minimise reliance on external libraries, and manage the dependencies you do take by selecting only well-maintained and frequently updated packages (Istat’s focus on dependency stability for R packages is an example of such preemptive security in development).
    • Mitigation strategies include (see the vulnerability-query sketch after this list):
      • Regularly assess dependencies for security patches and updates.
      • Regularly scan for vulnerabilities in dependencies using tools like OWASP Dependency-Check or GitHub Dependabot.
      • Use trusted repositories and verify cryptographic signatures for all third-party packages.
      • Employ strategies such as dependency pinning and careful management of version upgrades.
  • Monitor threats and leverage community vetting:
    • Continuously monitor for emerging threats within the open source ecosystem. Leverage tools like Threat Intelligence Platforms (TIPs) or community-driven security alerts to stay informed.
    • Encourage open community feedback and peer review to identify and fix security vulnerabilities. The collaborative nature of the Data Cleaning ecosystem and the Awesome List ensures wider scrutiny and faster resolution of potential security issues.
  • Establish a clear incident response plan for addressing security vulnerabilities in open source software:
    • This should include guidelines for quickly patching vulnerabilities, communicating risks, and coordinating with affected stakeholders. Consider automated patch management systems to minimise response time.
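
As referenced under dependency risks above, a minimal Python sketch that queries the public OSV (Open Source Vulnerabilities) API for known advisories against a pinned package. It assumes the requests package is installed; dedicated scanners such as pip-audit or Dependabot automate the same check:

    # Query the OSV database for known vulnerabilities of a pinned dependency.
    import requests

    def known_vulnerabilities(name, version, ecosystem="PyPI"):
        resp = requests.post(
            "https://api.osv.dev/v1/query",
            json={"package": {"name": name, "ecosystem": ecosystem}, "version": version},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("vulns", [])

    # requests 2.19.0 is an old release with published advisories.
    for vuln in known_vulnerabilities("requests", "2.19.0"):
        print(vuln["id"], vuln.get("summary", ""))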

Footnotes

  1. https://observablehq.com/@olavtenbosch/visualizing-awesomeofficialstatistics-org

  2. The best example is the software repository CRAN at https://cran.r-project.org/

  3. A good third-party software repository is the Python Package Index (PyPI) at https://pypi.org/

  4. An awesome list: https://github.com/secretGeek/AwesomeCSV

  5. JSON resources can be found at https://github.com/burningtree/awesome-json, https://ajv.js.org/ and https://www.json.org/json-en.html

  6. An awesome list: https://github.com/StanimirIglev/awesome-xml

  7. Tools for SDMX implementers and developers: https://sdmx.org/?page_id=4500

  8. The Bank for International Settlements publishes another website with SDMX resources at https://www.sdmx.io/

  9. https://oauth.net/2/

  10. An awesome list is available at https://github.com/marmelab/awesome-rest

  11. Tools and resources can be found at https://www.odata.org/

  12. http://r-project.ro/conferences.html