For package collections such as the Python Package Index (PyPI), it's crucial to provide complete metadata on published packages in an easily accessible and easily processable format. For instance, Repology needs such metadata to report outdated versions of Python modules packaged in distributions' native repositories.
Unfortunately, PyPI does not currently provide such data in a usable way. According to the FAQ and the PyPI API Reference, there are several ways to access package metadata:
- An endpoint to get information on an individual project - not suitable, as it requires thousands of HTTP requests to fetch data on all packages.
- The XML-RPC API, which has the same problem.
- The Simple Project API, which has the same problem and does not provide any information apart from the list of downloadable files.
- BigQuery datasets, which require a Google account for access.
None of these meet basic usability requirements. A simple single-file package metadata dump would be sufficient, but for some reason PyPI developers do not care to provide such a file (related upstream issues: pypa/warehouse#347, pypa/warehouse#7403, pypa/warehouse#8802). This service was therefore set up to provide at least something - that is, a metadata dump for recently changed packages only.
The following file is zstandard-compressed JSON containing an array of outputs of the PyPI JSON API project endpoint. To reduce the size of the dump, each package entry is additionally processed to remove the description field and any releases older than the latest release (as specified by the version field).
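The trimming step can be illustrated with a minimal sketch. It is not the service's actual code: the function name `trim_project` is hypothetical, and it assumes the usual JSON API layout with an `info` object and a `releases` mapping keyed by version string. As a simplification, it keeps only the release whose key equals `info.version` instead of comparing version numbers.

```python
import copy


def trim_project(doc):
    """Return a trimmed copy of one PyPI JSON API project document."""
    trimmed = copy.deepcopy(doc)  # leave the caller's document untouched

    # The long description is usually the bulkiest field in the entry.
    trimmed["info"].pop("description", None)

    # Simplification (assumption): keep only the release matching info.version
    # rather than performing a real "older than latest" version comparison.
    if "releases" in trimmed:
        latest = trimmed["info"]["version"]
        releases = trimmed["releases"]
        trimmed["releases"] = {latest: releases[latest]} if latest in releases else {}

    return trimmed
```

A real implementation that follows the wording above literally would need a proper version comparison to decide which releases count as older.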
Format: JSON compressed with zstd
Size: 108.13 MiB
Generated at 2024-10-08 15:45 UTC
Contains 159586 packages
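A consumer could read the dump roughly as follows. This is a sketch under assumptions: the local file path is whatever you saved the download as, the helper names are hypothetical, and decompression relies on the third-party `zstandard` package. Each array element is expected to carry the `info.name` and `info.version` fields of the JSON API.

```python
import json


def latest_versions(projects):
    """Map project name -> latest version for a parsed dump (a list of entries)."""
    return {entry["info"]["name"]: entry["info"]["version"] for entry in projects}


def load_dump(path):
    """Decompress and parse a dump saved at `path` (hypothetical local file)."""
    import zstandard  # third-party: pip install zstandard

    # zstandard.open() decompresses transparently and yields a text stream.
    with zstandard.open(path, "rt", encoding="utf-8") as stream:
        return json.load(stream)  # loads the whole array into memory
```

Calling `latest_versions(load_dump(path))` gives a name-to-version mapping. Since the whole array is parsed at once, expect memory use well above the compressed size.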
Details of operation
This service works by polling the XML-RPC changelog method to discover all package changes since the previous iteration, and then retrieving fresh metadata for each changed package from the JSON API. This information is stored in the database and periodically dumped into a single JSON file.
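One polling iteration might look like the sketch below. It is an illustration, not the service's source: it assumes the `changelog_since_serial` XML-RPC method and the `https://pypi.org/pypi/<name>/json` endpoint, uses a plain dict in place of the database, and all function names are hypothetical. Changelog entries are assumed to be `(name, version, timestamp, action, serial)` records.

```python
import json
import urllib.request
import xmlrpc.client

PYPI = "https://pypi.org/pypi"


def changed_names(entries):
    """Collapse changelog entries into unique names and the highest serial seen."""
    names, last_serial = {}, None
    for name, _version, _timestamp, _action, serial in entries:
        names[name] = None  # dict keys keep first-seen order and deduplicate
        last_serial = serial if last_serial is None else max(last_serial, serial)
    return list(names), last_serial


def fetch_changelog(since_serial):
    """Ask PyPI for all changes after `since_serial` (assumed XML-RPC method)."""
    return xmlrpc.client.ServerProxy(PYPI).changelog_since_serial(since_serial)


def fetch_metadata(name):
    """Fetch fresh metadata for one project from the JSON API."""
    with urllib.request.urlopen(f"{PYPI}/{name}/json") as response:
        return json.load(response)


def poll_once(since_serial, store, changelog=fetch_changelog, metadata=fetch_metadata):
    """Refresh `store` for every changed project; return the serial to resume from."""
    names, last_serial = changed_names(changelog(since_serial))
    for name in names:
        store[name] = metadata(name)
    return since_serial if last_serial is None else last_serial
```

The fetchers are passed in as parameters so the loop can be exercised without network access. A production version would also need to handle removed projects (the JSON API answers with an HTTP error for them) and rate limiting.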
Source code is located on GitHub.
Warranty
Note that this service provides incomplete data by design. No consistency guarantee is provided, and you use this data at your own risk. Additionally, note that the PyPI XML-RPC API is deprecated. Its suggested replacement, the Latest Updates RSS feed, provides only the 40 latest changes and has no mechanism to request a longer history of updates, so it cannot be used in a way that guarantees no updates are lost. This service will therefore be discontinued as soon as the XML-RPC API is disabled.