To ban or not to ban, that is the pickle
While Hugging Face supports machine learning (ML) models in various formats, Pickle is among the most prevalent thanks to the popularity of PyTorch, an ML library written in Python that relies on Pickle to serialize and deserialize models. Pickle is Python's standard library module for object serialization, which means turning an object into a byte stream; the reverse process is known as deserialization. In Python terminology, the two are called pickling and unpickling.
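For readers unfamiliar with the module, a minimal round trip looks like this; the dictionary below is just a stand-in for the kind of object a model file might contain:

```python
import pickle

# Serialize ("pickle") a Python object into a byte stream.
model_config = {"layers": 12, "hidden_size": 768, "activation": "gelu"}
blob = pickle.dumps(model_config)

# Deserialize ("unpickle") the byte stream back into an equivalent object.
restored = pickle.loads(blob)
assert restored == model_config
```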
Deserialization, especially of input from untrusted sources, has been the cause of many remote code execution vulnerabilities across a variety of programming languages. Pickle is no exception: the Python documentation for Pickle carries a big red warning: “It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.”
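To make that warning concrete, here is the classic proof of concept: an object whose __reduce__ hook tells the unpickler to call an arbitrary function. The command below is a harmless echo standing in for real attacker code:

```python
import os
import pickle

class Malicious:
    # __reduce__ tells pickle how to reconstruct this object; instead of
    # rebuilding its state, it instructs the unpickler to call os.system(...).
    def __reduce__(self):
        return (os.system, ("echo attacker code would run here",))

payload = pickle.dumps(Malicious())

# The victim only has to *load* the bytes: the command executes during
# unpickling, before any "model" object is even handed back.
pickle.loads(payload)
```

Swap the echo for a reverse shell or a credential stealer and loading such a “model” becomes a full compromise of whoever opens it.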
That poses a problem for an open platform like Hugging Face, where users freely share models and routinely unpickle data uploaded by strangers. On one hand, this opens the door to abuse by ill-intentioned individuals who upload poisoned models; on the other, banning the format outright would be too restrictive given PyTorch’s popularity. So Hugging Face chose the middle road: attempting to scan uploaded Pickle files and flag the malicious ones.
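Hugging Face's scanner is more elaborate than this, but the core idea can be sketched with the standard pickletools module: walk the pickle's opcode stream statically, without ever executing it, and flag imports of callables that have no business appearing in a model file. The shortlist of suspicious callables below is hypothetical and far smaller than what a production scanner would use:

```python
import pickletools

# Hypothetical shortlist for illustration; a real scanner applies a much
# broader (and regularly updated) policy.
SUSPICIOUS = {
    ("os", "system"), ("posix", "system"), ("nt", "system"),
    ("subprocess", "Popen"), ("builtins", "eval"), ("builtins", "exec"),
}

def scan_pickle(data: bytes) -> list[str]:
    """Report suspicious imports found in a pickle without unpickling it."""
    findings = []
    recent_strings = []  # string constants seen so far, used for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols encode the import as "module name" in one opcode.
            module, _, name = arg.partition(" ")
            if (module, name) in SUSPICIOUS:
                findings.append(f"{module}.{name}")
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Newer protocols push module and name as strings first; a real
            # scanner would emulate the stack, this just checks the last two.
            module, name = recent_strings[-2], recent_strings[-1]
            if (module, name) in SUSPICIOUS:
                findings.append(f"{module}.{name}")
        if isinstance(arg, str):
            recent_strings.append(arg)
    return findings
```

Run against the payload from the previous snippet, this reports the system-call import without executing anything; the cat-and-mouse part is that attackers can reach equally dangerous calls through modules that sit outside any fixed blocklist.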