2 min read

PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software

This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models.
PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software

I'm pleased to announce our dataset paper has been accepted at Mining Software Repositories 2024 (MSR 2024).

In the rapidly evolving field of deep learning, the use of pre-trained models (PTMs) has emerged as a crucial technology for software engineers aiming to integrate advanced functionalities into their applications without the high cost and complexity of building and training models from scratch. Despite their growing adoption, the intricacies of the PTM supply chain have largely remained a mystery, hindering a comprehensive understanding of their impact on downstream applications.

To bridge this gap, our latest research introduces the PeaTMOSS dataset, a robust collection designed to illuminate the landscape of PTM usage and its implications. The current version of this dataset encompasses metadata for 281,638 PTMs, with detailed snapshots of 14,296 models that achieve over 50 monthly downloads. Additionally, it includes 28,575 open-source software repositories from GitHub that utilize these models, creating a rich resource for analysis.

One of the standout features of PeaTMOSS is its inclusion of 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they employ. This mapping provides critical insights into how PTMs are being adopted and reused across various projects. To further enhance the dataset’s utility, we developed prompts for a large language model to automatically extract vital metadata, such as training datasets, parameters, and evaluation metrics.

Our analysis of PeaTMOSS offers the first set of summary statistics for the PTM supply chain, revealing trends in PTM development and common issues in PTM package documentation. Notably, our findings highlight inconsistencies in software licenses between PTMs and their dependent projects, an area ripe for further investigation.

PeaTMOSS sets the stage for future research, providing a foundational resource that opens up numerous opportunities to explore PTM dynamics. Whether you are interested in the development trends of PTMs, their downstream usage, or crosscutting questions about their impact, PeaTMOSS offers a treasure trove of data to support your needs.

This dataset not only sheds light on the current state of PTM adoption but also paves the way for more structured and comprehensive studies in the future. We invite the research community to delve into PeaTMOSS and uncover the many facets of the PTM supply chain, driving forward our collective understanding of this pivotal aspect of modern software development.

See https://ecommons.luc.edu/cs_facpubs/362/ for our publication.

See https://github.com/PurdueDualityLab/PeaTMOSS-Artifact for the actual dataset/artifact.